Deriving Knowledge from Data at Scale

Barga Data Science lecture 10

Models in Production

Deriving Knowledge from Data at Scale

Putting an ML Model into Production

• A/B Testing

Deriving Knowledge from Data at Scale

Controlled Experiments in One Slide

Concept is Trivial

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
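
As one possible illustration of the "statistical tests" bullet above, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts. The function name and the example counts are hypothetical and not from the lecture; a real experimentation platform would likely use a richer analysis.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 100/1000 conversions in control, 150/1000 in treatment.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; it does not by itself say the difference is large enough to matter.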

Deriving Knowledge from Data at Scale

Best Practice: A/A Test

Run A/A tests

before

Deriving Knowledge from Data at Scale

Best Practice: Ramp-up

Ramp-up

Deriving Knowledge from Data at Scale

Best Practice: Run Experiments at 50/50

Deriving Knowledge from Data at Scale

Cost based learning

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution & Error Costs: WEKA cost-sensitive learning

weighting method

false negatives (FN): try to avoid false negatives

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution: WEKA cost-sensitive learning

Preprocess, Classify

meta.CostSensitiveClassifier

set the FN to 100, FP to 10

a learner that tries to optimize accuracy or error can be made cost-sensitive (decision trees, rule learner)
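
To make the cost idea concrete, here is a minimal sketch, independent of WEKA, of choosing the prediction with the lowest expected cost. The cost values mirror the FN/FP numbers on the slide as written; the function name and the probabilities are hypothetical.

```python
def min_expected_cost_class(probs, cost):
    """Pick the class whose expected misclassification cost is lowest.

    probs[i] is the estimated probability that the true class is i.
    cost[i][j] is the cost of predicting class j when the truth is class i.
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    best = min(range(n_classes), key=expected.__getitem__)
    return best, expected

# Classes: 0 = negative, 1 = positive.
COST = [
    [0, 10],   # truth negative: predicting positive is a false positive
    [100, 0],  # truth positive: predicting negative is a false negative
]

# A model that is only 20% sure of "positive" still predicts positive,
# because a missed positive is ten times as expensive as a false alarm.
label, expected = min_expected_cost_class([0.8, 0.2], COST)
print(label, expected)
```

With equal costs the same probabilities would yield the negative class, which is why an accuracy-optimizing learner tends to ignore a rare but expensive class.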

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution: WEKA cost-sensitive learning

Deriving Knowledge from Data at Scale

curated, completely specify a problem, measure progress

paired with a metric, target SLAs, scoreboard

Deriving Knowledge from Data at Scale

This isn't easy…

• Building high quality gold sets is a challenge

• It is time consuming

• It requires making difficult and long lasting choices, and the rewards are delayed…

Deriving Knowledge from Data at Scale

enforce a few principles

1. Distribution parity

2. Testing blindness

3. Production parity

4. Single metric

5. Reproducibility

6. Experimentation velocity

7. Data is gold

Deriving Knowledge from Data at Scale

• Test set blindness

• Reproducibility and Data is gold

• Experimentation velocity

Deriving Knowledge from Data at Scale

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
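
A minimal sketch of item 2 above, putting the error weighting in the metric rather than in the sampling. The weight table, labels, and function name are hypothetical; the point is only that the same metric code can be run unchanged on production, train, and test sets.

```python
def weighted_error(y_true, y_pred, error_weight):
    """Average weighted error: each mistake counts by error_weight[(truth, pred)]."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth != pred:
            total += error_weight[(truth, pred)]
    return total / len(y_true)

# Assumed weights: missing a positive (1 -> 0) is five times worse
# than a false alarm (0 -> 1).
WEIGHTS = {(1, 0): 5.0, (0, 1): 1.0}

score = weighted_error([1, 1, 0, 0], [0, 1, 1, 0], WEIGHTS)
print(score)
```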

Deriving Knowledge from Data at Scale

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building blocks" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
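
The "pass over the data" in item 7 could look like the following minimal sketch, which keeps the most frequent values and maps everything else to a shared bucket. The names, the `<OTHER>` token, and the tiny example are hypothetical; per the checklist, the values would come from a data set disjoint from the training and test sets.

```python
from collections import Counter

def top_k_values(values, k):
    """Return the set of the k most frequent values (no labels needed)."""
    counts = Counter(values)
    return {value for value, _ in counts.most_common(k)}

def encode(value, vocabulary, other="<OTHER>"):
    """Map rare values to one bucket so only frequent ones get parameters."""
    return value if value in vocabulary else other

# Hypothetical query log drawn from the separate feature-optimization set.
queries = ["shoes", "tv", "shoes", "lamp", "shoes", "tv"]
vocabulary = top_k_values(queries, k=2)
print(vocabulary, encode("lamp", vocabulary))
```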

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
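
Option (c) above, recasting the data as (inputs from t-T to t-1, target at t) patterns, can be sketched as a sliding window. The function name and the toy series are hypothetical.

```python
def make_windows(series, T):
    """Turn a series into (inputs from t-T..t-1, target at t) patterns."""
    patterns = []
    for t in range(T, len(series)):
        patterns.append((series[t - T:t], series[t]))
    return patterns

# With T = 2, each pattern uses the two previous values to predict the next.
patterns = make_windows([1, 2, 3, 4, 5], T=2)
print(patterns)
```

Each pattern is T times larger than a single observation, which is the "patterns are much larger" trade-off mentioned in the checklist.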

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have this type of constraint? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
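
A minimal sketch of the grouping idea in item 10: split by entity (here, writer) so that no entity contributes examples to both sides. The function name, the fixed seed, and the toy data are hypothetical.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting examples: (writer id, word).
examples = [
    ("w1", "cat"), ("w1", "dog"), ("w2", "cat"),
    ("w3", "bird"), ("w4", "dog"), ("w4", "cat"),
]
train, test = grouped_split(examples, group_of=lambda e: e[0])
print(train, test)
```

Words may repeat across the two sides, but writers never do, so test accuracy reflects generalization to unseen handwriting.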

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments – get feedback from user to serve as labels

• Mechanical Turk – pay people to label data to build training set

• Ask Users to Label Data – report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target: LikesBook; Label: + or -)

Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +

Test Data

Age Income LikesBook
22 67K
39 41K

Prediction (Model applied to Test Data)

Age Income LikesBook
22 67K +
39 41K -

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
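
The "cluster, then perform majority vote" recipe above can be sketched as follows. This is a toy version under stated assumptions: 1-D points, a hand-rolled k-means with caller-supplied starting centers, and hypothetical function names and data.

```python
from collections import Counter

def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda c: abs(x - centers[c]))

def kmeans_1d(points, centers, iterations=20):
    """Plain k-means on 1-D points, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in points:
            clusters[nearest(x, centers)].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def cluster_then_vote(points, labels, centers):
    """labels[i] is a class label or None; returns a label for every point."""
    centers = kmeans_1d(points, centers)
    assign = [nearest(x, centers) for x in points]
    votes = [Counter() for _ in centers]
    for cluster, label in zip(assign, labels):
        if label is not None:
            votes[cluster][label] += 1
    # Each cluster takes the majority label of its labeled members.
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[c] for c in assign]

points = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
labels = ["+", None, None, None, "-", None, None, None]
predicted = cluster_then_vote(points, labels, centers=[min(points), max(points)])
print(predicted)
```

Only two of the eight points carry labels, yet every point receives one, which works here only because the two classes really do form separate clusters.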

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum
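
A minimal sketch of the two-step procedure above for a two-component, 1-D mixture of Gaussians. The initialisation, iteration count, and synthetic data are assumptions chosen for illustration, not the lecture's implementation.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(data), max(data)]  # crude initialisation (an assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: soft cluster-assignment probabilities for each instance.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the weight, location, and shape of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)
            weights[k] = nk / len(data)
    return means, variances, weights

# Synthetic data: two well-separated clusters around 0 and 10.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
data += [rng.gauss(10.0, 1.0) for _ in range(200)]
means, variances, weights = em_two_gaussians(data)
print(means, weights)
```

With poorly separated clusters or an unlucky start, the same loop can settle in a local optimum, as the slide warns.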

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
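
The four self-training steps above can be sketched with any base learner. This toy version assumes a 1-D nearest-centroid classifier; the function names and data are hypothetical.

```python
def centroid_model(xs, ys):
    """Nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    # Learn a model on the labeled instances only.
    model = centroid_model(labeled_x, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # Apply the model to the unlabeled instances.
        new_pseudo = [predict(model, x) for x in unlabeled_x]
        # Converged once the pseudo-labels stop changing.
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # Learn a new model on all instances (true labels plus pseudo-labels).
        model = centroid_model(labeled_x + unlabeled_x, labeled_y + pseudo)
    return model, pseudo

model, pseudo = self_train(
    labeled_x=[1.0, 9.0],
    labeled_y=["-", "+"],
    unlabeled_x=[0.5, 1.5, 2.0, 8.0, 8.5, 9.5],
)
print(model, pseudo)
```

Self-training reinforces whatever the first model believes, so early mistakes on unlabeled points can be locked in rather than corrected.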

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin (distance from the decision boundary) to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
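
A single-machine sketch of the stochastic gradient descent bullets above, not the Hadoop implementation. It assumes a hinge loss on labeled points and a "hat" loss on unlabeled points (the current prediction is used as the label), warm-starts on the labeled data, and omits the class-balance constraint. All names, hyperparameters, and data are hypothetical.

```python
import random

def sgd_step(w, b, x, y, eta, lam, scale):
    """One subgradient step; y=None marks an unlabeled instance."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y is None:
        # Unlabeled: treat the current prediction as the label.
        y = 1 if score >= 0 else -1
    w = [wi * (1 - eta * lam) for wi in w]  # L2 shrinkage
    if y * score < 1:  # misclassified or inside the margin
        w = [wi + eta * scale * y * xi for wi, xi in zip(w, x)]
        b += eta * scale * y
    return w, b

def s3vm_sgd(labeled, unlabeled, dim, lam=0.01, unlabeled_weight=0.5, seed=0):
    rng = random.Random(seed)
    w, b, t = [0.0] * dim, 0.0, 0
    # Warm start on labeled data so pseudo-labels are not arbitrary (assumption).
    for x, y in labeled:
        t += 1
        w, b = sgd_step(w, b, x, y, 1.0 / (1.0 + lam * t), lam, 1.0)
    data = [(x, y, 1.0) for x, y in labeled]
    data += [(x, None, unlabeled_weight) for x in unlabeled]
    rng.shuffle(data)  # one run over the data in random order
    for x, y, scale in data:
        t += 1
        eta = 1.0 / (1.0 + lam * t)  # steps get smaller over time
        w, b = sgd_step(w, b, x, y, eta, lam, scale)
    return w, b

labeled = [([2.0, 2.0], 1), ([-2.0, -2.0], -1)]
unlabeled = [[1.5, 2.5], [2.5, 1.5], [3.0, 3.0],
             [-1.5, -2.5], [-2.5, -1.5], [-3.0, -3.0]]
w, b = s3vm_sgd(labeled, unlabeled, dim=2)

def decide(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

print(w, b, [decide(x) for x in unlabeled])
```

Because the objective is non-convex, different random orders can end in different solutions, which is why the slide suggests many random runs and keeping the best one.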

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create network where the nearest neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
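
A minimal sketch of the main idea above on a nearest-neighbor similarity graph. It assumes 1-D points, two classes encoded as +1 and -1, labeled nodes clamped to their labels, and a fixed number of iterations in place of a convergence check. Names and data are hypothetical.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph over 1-D points (adjacency sets)."""
    n = len(points)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(points[i] - points[j]))
        for j in order[:k]:
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours

def propagate(neighbours, seeds, iterations=100):
    """seeds maps node index -> +1.0 or -1.0; returns a score per node."""
    scores = [seeds.get(i, 0.0) for i in range(len(neighbours))]
    for _ in range(iterations):
        new_scores = []
        for i, nbrs in enumerate(neighbours):
            if i in seeds:
                new_scores.append(seeds[i])  # labeled nodes stay clamped
            else:
                # Unlabeled nodes take the average score of their neighbours.
                new_scores.append(sum(scores[j] for j in nbrs) / len(nbrs))
        scores = new_scores
    return scores

points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
graph = knn_graph(points, k=2)
scores = propagate(graph, seeds={0: 1.0, 7: -1.0})
print([round(s, 3) for s in scores])
```

The sign of each final score gives the predicted class. The fixed point of this averaging can also be obtained by solving a linear system, which is the matrix inversion view noted on the slide.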

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA


Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"

B has big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight:

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

• Chlorine and lime were effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

"If you don't know where you are going, any road will take you there."

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML heavy step.

Deriving Knowledge from Data at Scale

Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.
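
Step 5's advice to report metrics with confidence intervals via cross-validation can be sketched as below. The fold construction, the rough normal-approximation interval, and the toy threshold classifier are all assumptions for illustration; they stand in for whatever learner and metric a project actually uses.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and cut it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, train_fn, predict_fn, k=5):
    """Return (mean accuracy, half-width of a rough 95% interval)."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)
    return mean, 1.96 * math.sqrt(var / k)

def train_threshold(xs, ys):
    """Toy learner: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
mean_acc, half_width = cross_validate(xs, ys, train_threshold, predict_threshold)
print(f"accuracy = {mean_acc:.2f} +/- {half_width:.2f}")
```

A perfect score on such an easy toy set is expected; on real data, a near-perfect score is the cue to look for a target leak, as step 5 warns.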

Deriving Knowledge from Data at Scale

Access Data, Pre-processing, Feature construction, Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Page 2: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Models in Production

Deriving Knowledge from Data at Scale

Putting an ML Model into Production

bull AB Testing

Deriving Knowledge from Data at Scale

Controlled Experiments in One Slide

Concept is Trivial

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Best Practice AA Test

Run AA tests

before

Deriving Knowledge from Data at Scale

Best Practice Ramp-up

Ramp-up

Deriving Knowledge from Data at Scale

Best Practice Run Experiments at 5050

Deriving Knowledge from Data at Scale

Cost based learning

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning

weighting method

false negatives FNtry to avoid

false negatives

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Preprocess Classify

metaCostSensitiveClassifier

set the FN to 100 FP to 10

tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner


curated; completely specify a problem; measure progress

paired with a metric, target SLAs, scoreboard

This isn't easy...

• Building high-quality gold sets is a challenge

• It is time-consuming

• It requires making difficult and long-lasting choices, and the rewards are delayed...

Enforce a few principles:

1. Distribution parity

2. Testing blindness

3. Production parity

4. Single metric

5. Reproducibility

6. Experimentation velocity

7. Data is gold

• Test set blindness

• Reproducibility and Data is gold

• Experimentation velocity

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
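Item 2 above (put the weighting in the metric, not in the sampling) might look like the following Python sketch. The function name, the weights, and the labels are hypothetical; the point is only that the weighting lives in one documented, reproducible function.

```python
def weighted_error(y_true, y_pred, weights):
    """Fraction of total example weight that falls on misclassified examples."""
    total = sum(weights)
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / total

# Hypothetical gold set: the third example is five times as important as the others.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
weights = [1, 1, 5, 1]

err_weighted = weighted_error(y_true, y_pred, weights)
err_plain = weighted_error(y_true, y_pred, [1, 1, 1, 1])
print(err_weighted, err_plain)
```

Because the data itself is not resampled, the same function can be applied unchanged to production, training, and test sets.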

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
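The "pass over the data" in item 7 can be sketched as a frequency count in Python. The helper names, the `<OTHER>` placeholder, and the toy values are hypothetical, and the example keeps the top 2 values rather than 10M so that it fits on a slide.

```python
from collections import Counter

def top_n_values(values, n):
    """One pass over unlabeled data: keep the n most frequent feature values."""
    return {value for value, _ in Counter(values).most_common(n)}

def encode(value, vocabulary, other="<OTHER>"):
    """Map rare values to a shared bucket so only frequent ones get parameters."""
    return value if value in vocabulary else other

# Hypothetical feature-optimization set, disjoint from training and test data.
ip_prefixes = ["a", "b", "a", "c", "a", "b", "d"]
vocabulary = top_n_values(ip_prefixes, n=2)
encoded = [encode(v, vocabulary) for v in ["a", "c", "b", "z"]]
print(sorted(vocabulary), encoded)
```

No labels are needed for this step, which is why it can run on a separate unlabeled set.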

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
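The grouping constraint in item 10 can be illustrated with a split that assigns whole groups (writers) to either training or test. This is a minimal Python sketch; the function name, the 25% test fraction, and the toy writer/word data are assumptions.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting data: (writer, word) pairs, 4 writers x 3 words.
examples = [(f"writer{w}", word) for w in range(4) for word in ("cat", "dog", "sun")]
train, test = grouped_split(examples, group_of=lambda e: e[0])

train_writers = {w for w, _ in train}
test_writers = {w for w, _ in test}
print(len(train), len(test), train_writers & test_writers)
```

Shuffling individual words instead would put the same writer on both sides of the split and overstate how well the model generalizes to new writers.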

12. Unlabeled set: If the number of labeled examples is small, a large set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender   age   smoker   eye color   lung cancer
male     19    yes      green       no
female   44    yes      gray        yes
male     49    yes      blue        yes
male     12    no       brown       no
female   37    no       brown       no
female   60    no       brown       yes
male     44    no       blue        no
female   27    yes      brown       no
female   51    yes      green       yes
female   81    yes      gray        no
male     22    yes      brown       no
male     29    no       blue        no

Train ML Model

gender   age   smoker   eye color   lung cancer
male     77    yes      gray        yes
male     19    yes      green       no
female   44    no       gray        no

The greatest challenge in Machine Learning: Lack of Labelled Training Data...

What to Do?

• Controlled Experiments: get feedback from users to serve as labels

• Mechanical Turk: pay people to label data to build a training set

• Ask Users to Label Data: report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only a few customers rate a book?

Training Data (attributes: Age, Income; target: LikesBook)

Age   Income   LikesBook
24    60K      +
65    80K      -
60    95K      -
35    52K      +
20    45K      +
43    75K      +
26    51K      +
52    47K      -
47    38K      -
25    22K      -
33    47K      +

Test Data

Age   Income   LikesBook
22    67K
39    41K

Model: Prediction

Age   Income   LikesBook
22    67K      +
39    41K      -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
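The "cluster, then perform majority vote" recipe can be sketched in plain Python. This is a toy illustration, assuming one-dimensional data, exactly two clusters, and a hand-rolled k-means with a deterministic start; the function names and the age values are hypothetical.

```python
from collections import Counter

def two_means_1d(points, iterations=20):
    """Tiny 1-D k-means with k=2; centroids start at the min and max point."""
    centroids = [min(points), max(points)]
    assignment = []
    for _ in range(iterations):
        assignment = [
            0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
            for x in points
        ]
        for k in (0, 1):
            members = [x for x, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return assignment

def cluster_then_label(points, labels):
    """labels[i] is a class label, or None when the instance is unlabeled."""
    assignment = two_means_1d(points)
    votes = {0: Counter(), 1: Counter()}
    for cluster, label in zip(assignment, labels):
        if label is not None:
            votes[cluster][label] += 1
    # Each cluster takes the majority label among its few labeled members.
    cluster_label = {
        k: (c.most_common(1)[0][0] if c else None) for k, c in votes.items()
    }
    return [cluster_label[a] for a in assignment]

# Hypothetical customer ages; only two customers rated the book.
ages = [20, 22, 24, 26, 60, 62, 65, 67]
labels = ["+", None, None, None, None, "-", None, None]
predicted = cluster_then_label(ages, labels)
print(predicted)
```

The result is only as good as the assumption: if the classes do not form distinct clusters, the majority vote spreads wrong labels.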


Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum
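The two-step procedure can be made concrete with a minimal Python sketch of EM for a one-dimensional mixture of two Gaussians. The deterministic initialization, the variance floor, the fixed iteration count, and the toy data are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    # Deterministic start: means at the extremes, shared variance, equal weights.
    means = [min(data), max(data)]
    overall = sum(data) / len(data)
    var0 = sum((x - overall) ** 2 for x in data) / len(data)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability that each instance belongs to each cluster.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed cluster
            weights[k] = nk / len(data)
    return means, variances, weights, resp

# Hypothetical data drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = em_two_gaussians(data)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

With a different starting point the same code can settle on a worse solution, which is the local-optimum caveat from the slide.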


Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
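The four self-training steps above can be sketched in Python with a deliberately simple base model. The nearest-centroid classifier, the function names, and the one-dimensional toy data are assumptions; any supervised learner could take its place.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean of x for each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)  # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]  # apply to unlabeled
        if new_pseudo == pseudo:  # predictions stopped changing: converged
            break
        pseudo = new_pseudo
        # Learn a new model on all instances, using the predicted labels.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return model, pseudo

# Hypothetical data: two labeled points and six unlabeled ones.
model, pseudo = self_train(
    labeled_x=[1.0, 9.0],
    labeled_y=["-", "+"],
    unlabeled_x=[2.0, 3.0, 4.0, 6.5, 7.0, 8.0],
)
print(model, pseudo)
```

Early mistakes are fed back as training labels, so in practice one often adds only the most confident predictions in each round; that refinement is not shown here.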

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances


The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem


Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one


The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected


Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
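A minimal Python sketch of the propagation loop follows. It assumes one-dimensional points, a fully connected similarity graph with Gaussian weights (rather than a nearest-neighbor graph), labels encoded as +1/-1, and a fixed number of iterations; all names and values are hypothetical.

```python
import math

def label_propagation(points, labels, sigma=1.0, iterations=100):
    """labels[i] is +1, -1, or None (unlabeled). Returns one score per point."""
    n = len(points)
    # Similarity graph: Gaussian weight between every pair of distinct instances.
    w = [
        [
            math.exp(-(points[i] - points[j]) ** 2 / (2 * sigma ** 2)) if i != j else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    scores = [float(y) if y is not None else 0.0 for y in labels]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            if labels[i] is not None:
                new_scores.append(float(labels[i]))  # known labels stay clamped
            else:
                # Unlabeled instance: similarity-weighted average of its neighbors.
                total = sum(w[i])
                new_scores.append(sum(w[i][j] * scores[j] for j in range(n)) / total)
        scores = new_scores
    return scores

# Hypothetical data: two groups, one labeled instance in each.
points = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
labels = [1, None, None, None, None, -1]
scores = label_propagation(points, labels)
print([round(s, 3) for s in scores])
```

The loop converges to the solution of a linear system over the unlabeled instances, which is the sense in which the slide calls the method equivalent to matrix inversion; the sign of each score gives the predicted class.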


Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

10 Minute Break...

Controlled Experiments

• A

• B

OEC: Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: GET THE DATA

Lesson 3: Prepare to be humbled (Left Elevator, Right Elevator)

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)

A | B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they're about the same

• A was 85 better

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches". B has a big search button.

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw!

Examples

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you're the decision maker

"It is difficult to get a man to understand something when his salary depends upon his not understanding it." -- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly when he was away. Could it be related to him?

• Insight:

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

• Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

causal

"If you don't know where you are going, any road will take you there" -- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions...

Out of Class Reading

Eight (8) page conference paper

40 page journal version...

Course Project: Due Oct 25th

Open Discussion on Course Project...

Gallery of Experiments: contributed by the community

Azure Machine Learning Studio

Sample Experiments: to help you get started

Experiment: tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

[Screenshot callouts: Getting Data for the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model]

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target Construction]

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Workflow diagram: Feature Selection; Model Training; Model Scoring; Evaluation; Train/Test Split]

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.

[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target Construction; Feature Selection; Model Training; Model Scoring; Evaluation; Train/Test Split]

6. Iterate steps (2)-(5) until the test metrics are satisfactory.

[Workflow diagram: Access Data; Pre-processing; Feature Construction; Model Scoring]

Book Recommendation

That's all for our course...

Page 3: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Putting an ML Model into Production

bull AB Testing

Deriving Knowledge from Data at Scale

Controlled Experiments in One Slide

Concept is Trivial

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Best Practice AA Test

Run AA tests

before

Deriving Knowledge from Data at Scale

Best Practice Ramp-up

Ramp-up

Deriving Knowledge from Data at Scale

Best Practice Run Experiments at 5050

Deriving Knowledge from Data at Scale

Cost based learning

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning

weighting method

false negatives FNtry to avoid

false negatives

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Preprocess Classify

metaCostSensitiveClassifier

set the FN to 100 FP to 10

tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

curatedcompletely specify a problem measure progress

paired with a metric target SLAs scoreboard

Deriving Knowledge from Data at Scale

This isnrsquot easyhellip

bull Building high quality gold sets is a challenge

bull It is time consuming

bull It requires making difficult and long lasting

choices and the rewards are delayedhellip

Deriving Knowledge from Data at Scale

enforce a few principles

1 Distribution parity

2 Testing blindness

3 Production parity

4 Single metric

5 Reproducibility

6 Experimentation velocity

7 Data is gold

Deriving Knowledge from Data at Scale

bull Test set blindness

bull Reproducibility and Data is gold

bull Experimentation velocity

Deriving Knowledge from Data at Scale

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
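As a rough illustration of item 2 (putting the weighting in the metric rather than in the sampling), the sketch below computes a weighted error rate. The function names and the 10:1 weights are assumptions made for the example, not part of the checklist.

```python
def weighted_error(y_true, y_pred, weight_fn):
    """Weighted error rate: each mistake counts by the weight of its true label."""
    total = 0.0
    wrong = 0.0
    for truth, pred in zip(y_true, y_pred):
        w = weight_fn(truth)
        total += w
        if truth != pred:
            wrong += w
    return wrong / total if total else 0.0


def example_weight(label):
    """Hypothetical weighting: errors on the positive class cost ten times more."""
    return 10.0 if label == 1 else 1.0


y_true = [1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0]
score = weighted_error(y_true, y_pred, example_weight)  # 11 / 23
```

Because the weights live in the metric function, the same unmodified train, test, and production samples can all be scored with it and compared directly.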

Deriving Knowledge from Data at Scale

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
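The frequency pass described in item 7 might look like the sketch below. It assumes the values come from a separate, unlabeled optimization set; the helper names and the `<OTHER>` bucket are illustrative choices, not something the checklist prescribes.

```python
from collections import Counter


def top_k_values(values, k):
    """One pass over the optimization set: keep the k most frequent values."""
    counts = Counter(values)
    return {value for value, _ in counts.most_common(k)}


def encode(value, vocabulary):
    """Values outside the vocabulary share a single catch-all bucket."""
    return value if value in vocabulary else "<OTHER>"


# Toy optimization set, disjoint from the training and test sets by assumption.
optimization_set = ["q1", "q2", "q1", "q3", "q1", "q2"]
vocab = top_k_values(optimization_set, k=2)
```

No labels are used here, which is why this step can run on a data set that is kept apart from training and testing.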

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
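Option 8c, recasting the data as (window, next value) patterns, can be sketched as follows. The function name and the toy series are assumptions made for illustration.

```python
def make_windows(series, T):
    """Turn a series into (inputs from t-T to t-1, target at t) patterns."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]


patterns = make_windows([3, 5, 8, 13, 21], T=2)
# patterns == [([3, 5], 8), ([5, 8], 13), ([8, 13], 21)]
```

Each pattern is larger than a single observation, but consecutive patterns carry their own recent history, which is what brings the problem closer to IID.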

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
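The grouping concern in item 10 can be sketched as a split that assigns whole groups (for example, writers) to either training or testing. The function, its parameters, and the toy writer names are hypothetical.

```python
import random


def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test


# Toy handwriting-style data: (writer, word).
examples = [
    ("writer_a", "hello"), ("writer_a", "world"),
    ("writer_b", "hello"), ("writer_c", "data"), ("writer_d", "gold"),
]
train, test = group_split(examples, group_of=lambda e: e[0])
```

Shuffling individual examples instead of groups would let the same writer appear on both sides, which is the leak the checklist warns about.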

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender   age   smoker   eye color   lung cancer
male     19    yes      green       no
female   44    yes      gray        yes
male     49    yes      blue        yes
male     12    no       brown       no
female   37    no       brown       no
female   60    no       brown       yes
male     44    no       blue        no
female   27    yes      brown       no
female   51    yes      green       yes
female   81    yes      gray        no
male     22    yes      brown       no
male     29    no       blue        no

gender   age   smoker   eye color   lung cancer
male     77    yes      gray        yes
male     19    yes      green       no
female   44    no       gray        no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments – get feedback from user to serve as labels

• Mechanical Turk – pay people to label data to build training set

• Ask Users to Label Data – report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target: LikesBook)

Age   Income   LikesBook
24    60K      +
65    80K      -
60    95K      -
35    52K      +
20    45K      +
43    75K      +
26    51K      +
52    47K      -
47    38K      -
25    22K      -
33    47K      +

Test Data

Age   Income   LikesBook
22    67K
39    41K

Prediction (Model applied to Test Data; Label: LikesBook)

Age   Income   LikesBook
22    67K      +
39    41K      -

© 2013 Datameer, Inc. All rights reserved.


Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
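A minimal sketch of "cluster, then perform majority vote" is shown below. It uses a naive k-means with a fixed initialization (the first k points), which is an assumption made to keep the example deterministic; the toy points and labels are invented.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns the cluster index assigned to each point."""
    centroids = list(points[:k])  # naive initialization: the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment


def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class, or None when the instance is unlabeled."""
    assignment = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for a, y in zip(assignment, labels):
        if y is not None:
            votes[a][y] += 1
    # Every instance gets the majority label of its cluster (None if no votes).
    return [votes[a].most_common(1)[0][0] if votes[a] else None for a in assignment]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_vote(points, labels)
```

Only one labeled instance per cluster is needed here because the two classes really do form separate clusters; when that assumption fails, the vote spreads wrong labels just as confidently.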


Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum


Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
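The four self-training steps above can be sketched with a deliberately simple base learner. The nearest-centroid model on one numeric feature is an assumption chosen for brevity; any classifier could take its place.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Assign the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=10):
    model = fit_centroids(labeled_x, labeled_y)            # learn on labeled only
    previous = None
    pseudo = []
    for _ in range(max_rounds):
        pseudo = [predict(model, x) for x in unlabeled_x]  # apply to unlabeled
        if pseudo == previous:                             # converged
            break
        model = fit_centroids(labeled_x + unlabeled_x,     # learn on all instances
                              labeled_y + pseudo)
        previous = pseudo
    return model, pseudo


model, pseudo = self_train([1.0, 9.0], ["-", "+"], [2.0, 3.0, 4.0, 7.0, 8.0])
```

In this toy run the pseudo-labels stop changing after the second pass, so the loop ends early; in practice a confidence threshold on the pseudo-labels is a common refinement.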

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances


Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem


Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
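The stochastic gradient descent bullets above can be sketched in a single process as follows. This is a simplified illustration, not the Hadoop implementation: it omits regularization and the class-balance constraint, and the step-size schedule, the `unlabeled_weight` parameter, and the toy data are all assumptions.

```python
import random


def sgd_s3vm(labeled, unlabeled, eta0=0.1, unlabeled_weight=0.1, seed=0):
    """One pass of SGD for a linear classifier using labeled and unlabeled data.

    labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x.
    """
    dim = len(labeled[0][0])
    w, b = [0.0] * dim, 0.0
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)            # one run in random order
    for step, (x, y) in enumerate(data, start=1):
        eta = eta0 / (1.0 + 0.01 * step)         # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:
            if abs(score) >= 1.0:
                continue                         # already outside the margin
            # Unlabeled: push it away from the boundary, on its current side.
            y, scale = (1 if score >= 0 else -1), eta * unlabeled_weight
        elif y * score < 1.0:
            scale = eta                          # labeled and inside the margin
        else:
            continue
        w = [wi + scale * y * xi for wi, xi in zip(w, x)]
        b += scale * y
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


labeled = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
unlabeled = [[1.5, 0.1], [-1.5, -0.1]]
w, b = sgd_s3vm(labeled, unlabeled)
```

Because the objective is non-convex, different random orders can end at different classifiers, which is why the slide suggests many random runs and keeping the best one.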


Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create network where the nearest neighbors are connected


Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
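A small sketch of the iteration described above: labeled nodes keep their scores, and every other node repeatedly takes the average of its neighbors. The graph representation, the function name, and the five-node chain are assumptions made for the example.

```python
def propagate_labels(neighbors, seeds, iterations=100, tol=1e-9):
    """neighbors: node -> list of neighbor nodes; seeds: node -> +1.0 or -1.0."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = seeds[node]   # labeled nodes stay clamped
            elif nbrs:
                updated[node] = sum(scores[m] for m in nbrs) / len(nbrs)
            else:
                updated[node] = 0.0
        change = max(abs(updated[n] - scores[n]) for n in neighbors)
        scores = updated
        if change < tol:                      # repeat until convergence
            break
    return scores


# Chain a - b - c - d - e with one labeled instance at each end.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = propagate_labels(graph, seeds={"a": 1.0, "e": -1.0})
```

The sign of each score gives the predicted class. The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.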


Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
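One common way to check whether an A/B difference could be due to chance is a two-proportion z-test on conversion rates. The sketch below is an illustration under that assumption (the slides do not say which test is used), and the counts are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical experiment: control A vs. treatment B, 10,000 users each.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
```

A small p-value says the observed gap would be unlikely if A and B truly converted at the same rate; it is the random assignment of users, not the test itself, that justifies reading the gap as causal.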

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"

B has a big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing," find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?

• Insight:

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

  • Chlorine and lime was effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris

Measure and Control

Accept Results (avoid Semmelweis Reflex)

Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
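Step 5 mentions cross-validation and confidence intervals; a rough sketch of both follows. The interleaved fold assignment and the normal-approximation interval are simplifying assumptions, and the fold scores are made-up numbers standing in for a real model's results.

```python
import math
import statistics


def kfold_indices(n, k):
    """Yield (train_indices, test_indices) for k interleaved folds over n rows."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def mean_with_interval(fold_scores, z=1.96):
    """Mean metric with an approximate 95% interval across folds."""
    mean = statistics.mean(fold_scores)
    half_width = z * statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, mean - half_width, mean + half_width


splits = list(kfold_indices(10, 5))
# Hypothetical per-fold accuracies from training and testing on each split.
mean, low, high = mean_with_interval([0.80, 0.82, 0.78, 0.80, 0.80])
```

Reporting the interval alongside the mean makes it harder to mistake fold-to-fold noise for a real improvement, and a score far above the interval of earlier runs is a cue to look for a target leak.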

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data

Pre-processing

Feature construction

Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Page 4: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Controlled Experiments in One Slide

Concept is Trivial

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Best Practice AA Test

Run AA tests

before

Deriving Knowledge from Data at Scale

Best Practice Ramp-up

Ramp-up

Deriving Knowledge from Data at Scale

Best Practice Run Experiments at 5050

Deriving Knowledge from Data at Scale

Cost based learning

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning

weighting method

false negatives FNtry to avoid

false negatives

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Preprocess Classify

metaCostSensitiveClassifier

set the FN to 100 FP to 10

tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

curatedcompletely specify a problem measure progress

paired with a metric target SLAs scoreboard

Deriving Knowledge from Data at Scale

This isnrsquot easyhellip

bull Building high quality gold sets is a challenge

bull It is time consuming

bull It requires making difficult and long lasting

choices and the rewards are delayedhellip

Deriving Knowledge from Data at Scale

enforce a few principles

1 Distribution parity

2 Testing blindness

3 Production parity

4 Single metric

5 Reproducibility

6 Experimentation velocity

7 Data is gold

Deriving Knowledge from Data at Scale

bull Test set blindness

bull Reproducibility and Data is gold

bull Experimentation velocity

Deriving Knowledge from Data at Scale

Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip

1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)

2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)

3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen

Deriving Knowledge from Data at Scale

4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation

5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have two, and possibly three, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data and investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing ID may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
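The "top 10M instances" pass in item 7 can be sketched as a frequency count over a separate, unlabeled optimization set. The helper names `build_vocabulary` and `encode` are hypothetical, and `top_k` is tiny here only to keep the example readable.

```python
from collections import Counter


def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to integer ids (no labels needed)."""
    counts = Counter(values)
    return {value: index
            for index, (value, _) in enumerate(counts.most_common(top_k))}


def encode(value, vocabulary):
    """Frequent values get their own id; everything else shares one id."""
    return vocabulary.get(value, len(vocabulary))


# The optimization set is assumed to be disjoint from train and test data.
optimization_set = ["q1", "q2", "q1", "q3", "q1", "q2"]
vocabulary = build_vocabulary(optimization_set, top_k=2)
```

Only the values that made the cut receive their own trainable parameter; rare values share a single catch-all id.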

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing-distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model into “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
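Option (a), monitoring a KPI while the algorithm is kept constant, might look like the sketch below. The choice of accuracy as the KPI, the `tolerance` value, and the function names are all assumptions made for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)


def stale_windows(windows, tolerance=0.05):
    """Flag time windows where a fixed model degraded beyond tolerance.

    windows: list of (name, labels, predictions) in time order, all scored
    by the same frozen model. The first window is the baseline.
    """
    baseline = accuracy(windows[0][1], windows[0][2])
    return [name for name, labels, predictions in windows[1:]
            if baseline - accuracy(labels, predictions) > tolerance]


windows = [
    ("week1", [1, 0, 1, 0], [1, 0, 1, 0]),
    ("week2", [1, 0, 1, 0], [1, 0, 1, 0]),
    ("week3", [1, 0, 1, 0], [0, 1, 1, 0]),
]
flagged = stale_windows(windows)
```

The first flagged window gives a rough estimate of how long a gold set stays current before it should be re-computed.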

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
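The grouping question in item 10 comes down to splitting by entity rather than by example. A minimal sketch, assuming each example is a dict with a group field such as a writer id; the hashing scheme and names are illustrative choices, not the lecture's method.

```python
import hashlib


def assign_split(group_id, test_fraction=0.2):
    """Deterministically send a whole group to either train or test."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_fraction * 100 else "train"


def grouped_split(examples, group_key, test_fraction=0.2):
    """Split examples so that no group appears on both sides."""
    train, test = [], []
    for example in examples:
        split = assign_split(str(example[group_key]), test_fraction)
        (test if split == "test" else train).append(example)
    return train, test


# Thirty word samples written by six hypothetical writers.
examples = [{"writer": "w%d" % (i % 6), "word": i} for i in range(30)]
train, test = grouped_split(examples, "writer")
```

Hashing the group id keeps the assignment stable when new data arrives, so a writer never migrates between train and test.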

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

Training data (used to train the ML model):

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

New instances and the label shown for each:

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask users to label data – report as spam, ‘hot or not’, review a product; observe their click behavior (ad retargeting, search results, etc.)

What if you can’t get labeled Training Data?

Traditional Supervised Learning

• Promotion on a bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book?

Training data (attributes: Age, Income; target: LikesBook):

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test data (label unknown):

Age  Income  LikesBook
22   67K
39   41K

Prediction (output of the model on the test data):

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.


Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

… but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
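The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is a toy version under stated assumptions: a single numeric feature, exactly two clusters, and a bare-bones k-means written here for illustration (the function names are hypothetical).

```python
from collections import Counter


def kmeans_1d(values, iterations=20):
    """Two-cluster k-means on scalars; centers start at the min and max."""
    centers = [min(values), max(values)]
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        for c in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment


def cluster_then_vote(values, labels):
    """labels[i] is a class label, or None when the instance is unlabeled."""
    assignment = kmeans_1d(values)
    votes = {0: Counter(), 1: Counter()}
    for cluster, label in zip(assignment, labels):
        if label is not None:
            votes[cluster][label] += 1
    majority = {c: (v.most_common(1)[0][0] if v else None)
                for c, v in votes.items()}
    return [majority[cluster] for cluster in assignment]


# Customer ages with only two rated customers.
ages = [20, 24, 26, 22, 60, 65, 58, 62]
known = ["+", None, None, None, "-", None, None, None]
predicted = cluster_then_vote(ages, known)
```

Every instance inherits the majority label of its cluster, so two labeled customers are enough to label all eight, provided the clustering assumption actually holds.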


Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data

Expectation-Maximization

• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
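To make the two steps concrete, here is a minimal EM sketch for a mixture of two one-dimensional Gaussians. The initialization (means at the min and max, shared starting variance) and the variance floor are assumptions made for this example, not choices stated in the slides.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    overall_mean = sum(data) / len(data)
    start_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    means = [min(data), max(data)]
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # floor avoids a collapsed cluster
            weights[k] = nk / len(data)
    return means, variances, weights


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data)
```

A different starting point can end in a different solution, which is the local-optimum caveat from the slide.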


Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification

Self-Training

• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
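The four self-training steps map directly onto a loop. The sketch below swaps in a nearest-centroid classifier on one numeric feature purely to keep the example short; any base learner could take its place, and the function names are hypothetical.

```python
def fit_centroids(values, labels):
    """'Model' = mean feature value per class."""
    sums, counts = {}, {}
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, value):
    return min(centroids, key=lambda y: abs(value - centroids[y]))


def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    # Step 1: learn a model on the labeled instances only.
    centroids = fit_centroids(labeled_x, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # Step 2: apply the model to the unlabeled instances.
        new_pseudo = [predict(centroids, v) for v in unlabeled_x]
        if new_pseudo == pseudo:      # Step 4: stop once labels are stable.
            break
        pseudo = new_pseudo
        # Step 3: learn a new model on all instances.
        centroids = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return centroids, pseudo


centroids, pseudo = self_train([20.0, 60.0], ["+", "-"],
                               [22.0, 25.0, 30.0, 55.0, 58.0, 65.0])
```

Convergence here simply means the pseudo-labels stopped changing between rounds.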

The Low Density Assumption

Assumption

• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear
• Maximizes the margin to the closest instances


The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin (distance from the decision boundary) to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct

Implementation

• Simple extension of SVM
• But a non-convex optimization problem


Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
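A single SGD run of this kind might look like the sketch below. It is a simplified, single-machine illustration rather than the Hadoop implementation: the hinge-style updates, the step schedule `lr0 / step`, the `unlabeled_weight` parameter, and the omission of a bias term and of the class-balance constraint are all assumptions.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def s3vm_sgd_pass(labeled, unlabeled, lr0=0.1, reg=0.01,
                  unlabeled_weight=0.3, seed=0):
    """One pass of SGD for a linear semi-supervised SVM (labels are +1/-1)."""
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(stream)          # one run in random order
    w = [0.0] * len(stream[0][0])
    for step, (x, y) in enumerate(stream, start=1):
        lr = lr0 / step                          # steps get smaller over time
        score = dot(w, x)
        w = [wi * (1.0 - lr * reg) for wi in w]  # regularization shrink
        if y is not None:
            if y * score < 1.0:                  # inside margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        elif score != 0.0 and abs(score) < 1.0:
            # Unlabeled point inside the margin: push it away from the
            # boundary on the side the current classifier already prefers.
            guess = 1 if score > 0 else -1
            w = [wi + lr * unlabeled_weight * guess * xi
                 for wi, xi in zip(w, x)]
    return w


labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
w = s3vm_sgd_pass(labeled, unlabeled)
```

Since the objective is non-convex, the "many random runs" advice amounts to calling this with several seeds and keeping the run with the best objective or validation score.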


The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected


Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
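Combining the similarity graph with the propagation step gives a short iterative procedure. The sketch below assumes scalar instances, a symmetric k-nearest-neighbor graph, and two classes encoded as +1 and -1; these simplifications and the function names are illustrative assumptions.

```python
def knn_graph(points, k):
    """Connect each point to its k nearest neighbors (edges are symmetric)."""
    neighbors = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: abs(points[j] - p))
        for j in others[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def propagate_labels(points, seeds, k=2, tol=1e-6, max_iter=1000):
    """seeds maps instance index -> +1.0 or -1.0; other instances start at 0."""
    neighbors = knn_graph(points, k)
    scores = [seeds.get(i, 0.0) for i in range(len(points))]
    for _ in range(max_iter):
        updated = [seeds[i] if i in seeds                      # clamp known labels
                   else sum(scores[j] for j in neighbors[i]) / len(neighbors[i])
                   for i in range(len(points))]
        change = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:                                       # converged
            break
    return [1 if s > 0 else -1 for s in scores]


points = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5]
labels = propagate_labels(points, seeds={0: 1.0, 7: -1.0})
```

The fixed point of this averaging loop is the solution of a linear system, which is where the "equivalent to matrix inversion" remark comes from.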


Conclusion

Semi-Supervised Learning

• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC: Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled (Left Elevator, Right Elevator)

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden’s blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
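The slides do not name a specific statistical test. One common choice for comparing conversion rates between control (A) and treatment (B) is a two-proportion z-test, sketched here with made-up counts.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical experiment: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=260, n_b=10000)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same, which is what "not due to chance" means in practice.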

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same

A B

• A was 85 better

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing,” find the flaw!

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you’re the decision maker

“It is difficult to get a man to understand something when his salary depends upon his not understanding it.”

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Stages: Hubris, Measure and Control, Accept Results (avoid Semmelweis Reflex), Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

causal

“If you don’t know where you are going, any road will take you there”

-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40-page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

Experiment walkthrough (figure callouts):

• Getting data for the experiment
• Splitting into training and testing datasets
• Using classification algorithms
• Evaluating the model

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Workflow stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Workflow stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
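As a rough illustration of reporting a metric with a confidence interval, the sketch below runs k-fold cross-validation and summarizes the fold scores. The interval uses a normal approximation (1.96 standard errors), which is crude for a handful of folds; `train_fn` and `score_fn` are hypothetical callables supplied by the caller.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, labels, train_fn, score_fn, k=5, seed=0):
    """Return the mean fold score and an approximate 95% interval."""
    scores = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(examples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train_fn(train_x, train_y)           # fit on k-1 folds
        test_x = [examples[i] for i in held_out]
        test_y = [labels[i] for i in held_out]
        scores.append(score_fn(model, test_x, test_y))  # score on held-out fold
    mean = sum(scores) / k
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    half_width = 1.96 * std / math.sqrt(k)
    return mean, (mean - half_width, mean + half_width)
```

A suspiciously narrow interval around a near-perfect score is one of the symptoms of a target leak that the step above warns about.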

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Scoring stages: Access Data, Pre-processing, Feature construction, Model scoring

Book Recommendation

That’s all for our course…


2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)

3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen

Deriving Knowledge from Data at Scale

4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation

5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set

Deriving Knowledge from Data at Scale

6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets

a One set should have the deployed features (computed from the raw data) This provides the production yardstick

b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns

c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set

7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing-distribution problem:

a. Assume the distribution is IID. Regularly recompute training sets and gold sets. Determine the frequency of recomputation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
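One simple way to quantify "how much a distribution changes over a fixed period of time" is to compare the empirical distribution of a feature in two time windows. The sketch below uses total variation distance, which is one choice among many and is not named in the slides; the feature values and window names are made up for illustration.

```python
from collections import Counter

def total_variation_distance(sample_a, sample_b):
    """Distance in [0, 1] between the empirical distributions of two samples."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    categories = set(freq_a) | set(freq_b)
    return 0.5 * sum(
        abs(freq_a[c] / len(sample_a) - freq_b[c] / len(sample_b))
        for c in categories
    )

# Hypothetical snapshots of one categorical feature taken at two points in time.
january = ["search"] * 70 + ["ads"] * 30
june = ["search"] * 50 + ["ads"] * 40 + ["video"] * 10

drift = total_variation_distance(january, june)
print(f"drift = {drift:.2f}")
```

Tracking such a number per period gives a rough stale rate and a trigger for recomputing the gold sets.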

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
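The grouping question in item 10 comes down to splitting by entity rather than by example. A minimal sketch, assuming each example can be mapped to its group by a caller-supplied function (the `group_split` helper and the writer names are hypothetical):

```python
import random

def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting samples: (writer, word).
samples = [(w, word) for w in ("ann", "bob", "cy", "dee") for word in ("cat", "dog", "sun")]
train, test = group_split(samples, group_of=lambda s: s[0])

train_writers = {w for w, _ in train}
test_writers = {w for w, _ in test}
print(train_writers.isdisjoint(test_writers))
```

Shuffling the individual words instead would put the same writer on both sides, which is exactly the leak the checklist warns about.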

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

Training data (labeled)

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

New instances

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target: LikesBook, with Label + or -)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (Model applied to Test Data)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
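The "cluster, then perform majority vote" recipe can be sketched as follows. This is a minimal illustration, not the slides' implementation: the k-means routine uses a deterministic farthest-point initialization chosen for reproducibility, and the points and labels are invented.

```python
import math
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

def cluster_then_label(labeled, unlabeled, k=2):
    """Cluster all instances, then give each cluster the majority label of its labeled members."""
    centroids = kmeans([x for x, _ in labeled] + list(unlabeled), k)
    votes = [Counter() for _ in range(k)]
    for x, y in labeled:
        votes[nearest(x, centroids)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[nearest(x, centroids)] for x in unlabeled]

labeled = [((1.0, 1.0), "+"), ((9.0, 9.0), "-")]
unlabeled = [(1.2, 0.8), (0.9, 1.1), (8.8, 9.2), (9.1, 8.9)]
predictions = cluster_then_label(labeled, unlabeled)
print(predictions)
```

With only one labeled instance per class, the unlabeled points still shape the clusters, which is where the extra signal comes from.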

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data

Expectation-Maximization

• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
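To make the two EM steps concrete, here is a small one-dimensional sketch for a two-component Gaussian mixture in which labeled instances keep their class and unlabeled instances get soft assignments. The function names, the variance floor, and the toy ages are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def semi_supervised_em(labeled, unlabeled, iterations=50):
    """labeled: [(x, cls)] with cls in {0, 1}; unlabeled: [x]."""
    xs = [x for x, _ in labeled] + list(unlabeled)
    # Initialize each component from the labeled instances of its class.
    means = [sum(x for x, c in labeled if c == k) / sum(1 for _, c in labeled if c == k)
             for k in (0, 1)]
    overall = sum(xs) / len(xs)
    variances = [sum((x - overall) ** 2 for x in xs) / len(xs)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: labeled instances keep their class, unlabeled ones get probabilities.
        resp = [[1.0 if c == k else 0.0 for k in (0, 1)] for _, c in labeled]
        for x in unlabeled:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            resp.append([d / sum(dens) for d in dens])
        # M-step: re-estimate location, shape, and size of each cluster.
        for k in (0, 1):
            n_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / n_k
            variances[k] = max(sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / n_k, 1e-6)
            weights[k] = n_k / len(xs)
    return means, variances, weights

def classify(x, means, variances, weights):
    return max((0, 1), key=lambda k: weights[k] * normal_pdf(x, means[k], variances[k]))

# Hypothetical ages: one labeled customer per class, eight unlabeled customers.
labeled = [(20, 0), (60, 1)]
unlabeled = [22, 25, 19, 24, 58, 62, 65, 61]
means, variances, weights = semi_supervised_em(labeled, unlabeled)
print([round(m, 1) for m in means])
```

On data this well separated the procedure settles quickly; on overlapping clusters it may stop in a local optimum, as the slide notes.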

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
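The four self-training steps above can be sketched with any base learner. The example below uses a nearest-centroid classifier purely because it fits in a few lines; that choice, the helper names, and the data are assumptions, not part of the slides.

```python
import math

def fit_centroids(examples):
    """Base learner: one centroid per class."""
    sums = {}
    for x, y in examples:
        total, count = sums.get(y, ([0.0] * len(x), 0))
        sums[y] = ([t + xi for t, xi in zip(total, x)], count + 1)
    return {y: tuple(t / n for t in total) for y, (total, n) in sums.items()}

def predict(centroids, x):
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

def self_train(labeled, unlabeled, max_rounds=10):
    model = fit_centroids(labeled)                          # learn on labeled instances only
    previous = None
    for _ in range(max_rounds):
        pseudo = [predict(model, x) for x in unlabeled]     # apply model to unlabeled instances
        if pseudo == previous:                              # pseudo-labels stopped changing
            break
        model = fit_centroids(labeled + list(zip(unlabeled, pseudo)))  # learn on all instances
        previous = pseudo
    return model

labeled = [((1.0, 1.0), "+"), ((5.0, 5.0), "-")]
unlabeled = [(2.0, 2.0), (2.5, 2.0), (6.0, 6.0), (7.0, 6.5), (8.0, 7.0)]
model = self_train(labeled, unlabeled)
print(model)
```

Here convergence is taken to mean that the pseudo-labels no longer change between rounds; other stopping rules are possible.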

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear
• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin (distance) to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM
• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
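A single-machine sketch of the SGD pass described above: labeled instances use the usual hinge loss, and unlabeled instances are pushed away from the boundary toward whichever side they currently fall on. This is a simplified illustration under stated assumptions: it omits regularization and the class-balance constraint, it is not the Hadoop implementation, and the function names, the `unlabeled_weight` parameter, and the data are invented.

```python
import random

def s3vm_sgd(labeled, unlabeled, unlabeled_weight=0.5, seed=0):
    """One SGD pass. labeled: [(x, y)] with y in {-1, +1}; unlabeled: [x]."""
    data = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)            # one run over the data in random order
    w, b = [0.0] * len(data[0][0]), 0.0
    for step, (x, y) in enumerate(data, start=1):
        rate = 1.0 / step                        # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        scale = 1.0
        if y is None:
            if score == 0.0:
                continue                         # no side to push toward yet
            y = 1 if score > 0 else -1           # treat the predicted side as the label
            scale = unlabeled_weight             # tunes the influence of unlabeled data
        if y * score < 1:                        # misclassified or inside the margin
            w = [wi + rate * scale * y * xi for wi, xi in zip(w, x)]
            b += rate * scale * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
positives = [(2.5, 1.5), (1.5, 2.5), (3.0, 2.0)]
negatives = [(-2.5, -1.5), (-1.5, -2.5), (-3.0, -2.0)]
w, b = s3vm_sgd(labeled, positives + negatives)
print([predict(w, b, x) for x in positives + negatives])
```

Because the objective is non-convex, results on harder data depend on the random order, which is why the slide suggests many random runs and keeping the best one.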

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
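A similarity graph of the kind described above can be built by connecting each instance to its k nearest neighbors and weighting each edge by a similarity score. The Gaussian similarity, the `bandwidth` parameter, and the sample points below are assumptions for illustration; the slides do not specify a similarity function.

```python
import math

def knn_similarity_graph(points, k=2, bandwidth=1.0):
    """Return {node: {neighbor: similarity}} connecting each point to its k nearest neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        for j in neighbors:
            similarity = math.exp(-math.dist(p, points[j]) ** 2 / (2 * bandwidth ** 2))
            graph[i][j] = similarity
            graph[j][i] = similarity             # keep the graph symmetric
    return graph

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
graph = knn_similarity_graph(points, k=1)
print({node: sorted(edges) for node, edges in graph.items()})
```

The graph only needs distances between instances, so unlabeled data contributes to its structure as much as labeled data.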

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
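The propagation loop can be sketched on a small weighted graph. Labeled nodes are clamped to +1 or -1 and every other node repeatedly takes the weighted average of its neighbors; the fixed point is the solution of a linear system, which is the sense in which the method is equivalent to a matrix inversion. The graph, the seed labels, and the fixed iteration count are assumptions for illustration.

```python
def propagate_labels(graph, seeds, iterations=100):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}."""
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                updated[node] = seeds[node]      # labeled nodes keep their label
            else:
                total = sum(neighbors.values())
                updated[node] = (
                    sum(weight * scores[n] for n, weight in neighbors.items()) / total
                    if total else 0.0
                )
        scores = updated
    return scores

# Hypothetical chain 0 - 1 - 2 - 3 - 4 with unit weights; only the two ends are labeled.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = propagate_labels(chain, seeds={0: 1.0, 4: -1.0})
print({node: round(score, 3) for node, score in scores.items()})
```

The sign of each score gives the propagated class; a score near zero marks an instance that sits between the two labeled regions.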

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A
• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

(Figure: Left Elevator, Right Elevator)

Deriving Knowledge from Data at Scale

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
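As one example of the statistical test mentioned above, a two-proportion z-test compares the conversion rates of control (A) and treatment (B). This is a sketch of one common choice, not necessarily the test used in the lecture, and the traffic numbers are invented.

```python
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the standard normal
    return z, p_value

# Hypothetical experiment: 10,000 users per variant.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed difference is unlikely under pure chance; it is the random assignment of users to A and B that supports the causal reading.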

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches". B has a big search button.

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data, prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly when he was away. Could it be related to him?
• Insight:
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime were effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

(Diagram stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding)

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936; 44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

(Figure callouts: Getting Data for the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model)

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

(Diagram stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction)

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

(Diagram stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split)

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
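To illustrate reporting a metric with a confidence interval from cross-validation, here is a small k-fold sketch. The threshold model, the data, and the normal-approximation interval (mean plus or minus 1.96 standard errors) are assumptions chosen for brevity; with only five folds a t-quantile would give a wider, more honest interval.

```python
import math
import statistics

def cross_validate(examples, train_fn, score_fn, k=5):
    """Return (mean score, approximate 95% interval) over k held-out folds."""
    folds = [list(range(i, len(examples), k)) for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [e for i, e in enumerate(examples) if i not in held]
        test = [examples[i] for i in held_out]
        scores.append(score_fn(train_fn(train), test))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)  # normal approximation
    return mean, (mean - half_width, mean + half_width)

def train_threshold(train):
    """Toy model: threshold halfway between the two class means."""
    positives = [x for x, y in train if y]
    negatives = [x for x, y in train if not y]
    return (statistics.mean(positives) + statistics.mean(negatives)) / 2

def accuracy(threshold, test):
    return sum((x >= threshold) == y for x, y in test) / len(test)

examples = [(x, x >= 5) for x in range(10)]
mean, (low, high) = cross_validate(examples, train_threshold, accuracy)
print(f"accuracy = {mean:.2f}, interval = ({low:.3f}, {high:.3f})")
```

The interval here exceeds 1.0 at the top, a reminder that the normal approximation is crude for bounded metrics and few folds.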

Deriving Knowledge from Data at Scale

(Diagram stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split)

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

(Diagram stages: Access Data; Pre-processing; Feature construction; Model scoring)

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Page 6: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Best Practice Ramp-up

Ramp-up

Deriving Knowledge from Data at Scale

Best Practice Run Experiments at 5050

Deriving Knowledge from Data at Scale

Cost based learning

Deriving Knowledge from Data at Scale

Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning

weighting method

false negatives FNtry to avoid

false negatives

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Preprocess Classify

metaCostSensitiveClassifier

set the FN to 100 FP to 10

tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner

Deriving Knowledge from Data at Scale

Imbalanced Class DistributionWEKA cost sensitive learning

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

curatedcompletely specify a problem measure progress

paired with a metric target SLAs scoreboard

Deriving Knowledge from Data at Scale

This isnrsquot easyhellip

bull Building high quality gold sets is a challenge

bull It is time consuming

bull It requires making difficult and long lasting

choices and the rewards are delayedhellip

Deriving Knowledge from Data at Scale

enforce a few principles

1 Distribution parity

2 Testing blindness

3 Production parity

4 Single metric

5 Reproducibility

6 Experimentation velocity

7 Data is gold

Deriving Knowledge from Data at Scale

bull Test set blindness

bull Reproducibility and Data is gold

bull Experimentation velocity

Deriving Knowledge from Data at Scale

Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip

1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)

2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)

3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen

Deriving Knowledge from Data at Scale

4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation

5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set

Deriving Knowledge from Data at Scale

6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets

a One set should have the deployed features (computed from the raw data) This provides the production yardstick

b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns

c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set

7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided

Deriving Knowledge from Data at Scale

8 Stale rate optimization monitoring How long does the set stay current In manycases we hide the fact that the problem is a time series even though the goal is topredict the future and we know that the distribution is changing We must quantifyhow much a distribution changes over a fixed period of time There are several waysto mitigate the changing distribution problem

a Assume the distribution is IID Regularly re-compute training sets and Gold sets Determine thefrequency of re-computation or set in place a system to monitor distribution drifts (monitor KPIchanges while the algorithm is kept constant)

b Decompose the model along ldquodistribution (fast) tracking parametersrdquo and slow tracking parametersThe fast tracking model may be a simple calibration with very few parameters

c Recast the problem as a time series problem patterns are (input data from t-T to t-1 prediction attime t) In this space the patterns are much larger but the problem is closer to being IID

9 The gold sets should have information that reveal the stale rate and allows algorithmsto differentiate themselves based on how they degrade with time

Deriving Knowledge from Data at Scale

10 Grouping Should the patterns be grouped For example in handwriting examples aregrouped per writer A set built by shuffling the words is misleading because trainingand testing would have word examples for the same writer which makesgeneralization much easier If the words are grouped per writers then a writer isunlikely to appear in both training and test set which requires the system to generalizeto never seen before handwriting (as opposed to never seen before words) Do wehave these type of constraints Should we group per advertisers campaign users togeneralize across new instances of these entities (as opposed to generalizing to newqueries) ML requires training and testing to be drawn from the same distributionDrawing duplicates is not a problem Problems arise when one partially drawexamples from the same entity on both training and testing on a small set of entitiesThis breaks the IID assumption and makes the generalization on the test set mucheasier than it actually is

11 Sampling production data What strategy is used for sampling Uniform Are any ofthe following filtered out fraud bad configurations duplicates non-billable adultoverwrites etc Guidance use the production sameness principle

Deriving Knowledge from Data at Scale

11 Unlabeled set If the number of labeled examples is small a large data set ofunlabeled data with the same distribution should be collected and be made a goldset This enables the discovery of new features using intermediate classifiers andactive labeling

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do?

• Controlled Experiments – get feedback from users to serve as labels

• Mechanical Turk – pay people to label data to build a training set

• Ask Users to Label Data – report as spam, ‘hot or not’, review a product; observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can’t get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller’s web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (Model applied to Test Data; the Label is filled in)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

… but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
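The "cluster, then perform majority vote" recipe above can be sketched as follows. This is an illustrative toy under stated assumptions: a plain k-means on a single numeric feature, hand-picked initial centers, and made-up data. None of the names come from the slides.

```python
from collections import Counter

def kmeans_1d(xs, centers, iters=20):
    """Plain k-means on scalars, started from the given centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers

def cluster_then_label(labeled, unlabeled, init_centers):
    """Cluster all points, then give each cluster the majority label of its labeled members."""
    xs = [x for x, _ in labeled] + unlabeled
    centers = kmeans_1d(xs, init_centers)

    def assign(x):
        return min(range(len(centers)), key=lambda k: abs(x - centers[k]))

    votes = [Counter() for _ in centers]
    for x, y in labeled:
        votes[assign(x)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[assign(x)] for x in unlabeled]

labeled = [(1.0, "-"), (9.0, "+")]              # only two labeled points
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]      # hypothetical unlabeled points
predictions = cluster_then_label(labeled, unlabeled, init_centers=[0.0, 10.0])
```

A cluster that contains no labeled point gets the label `None` here; a real system would need a policy for that case.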

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find the most probable location and shape of clusters given the data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to a local optimum
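To make the two EM steps concrete, here is a minimal sketch for a mixture of two one-dimensional Gaussians. The data, the initial means and the function names are assumptions chosen for illustration; a production implementation would work in log space and handle more components and dimensions.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, init_means, iters=50):
    means = list(init_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: probability that each component generated each point
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, location and shape of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # guard against a collapsing component
    return weights, means, variances

xs = [0.8, 1.0, 1.2, 1.1, 0.9, 4.9, 5.0, 5.2, 5.1, 4.8]   # hypothetical data
weights, means, variances = em_two_gaussians(xs, init_means=[0.0, 6.0])
```

Different initial means can lead to different answers, which is the "local optimum" caveat on the slide.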

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as the mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
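The four self-training steps above can be sketched as a loop. The base learner here is a hypothetical nearest-centroid classifier on one numeric feature, picked only because it fits in a few lines; the slides do not prescribe a particular model, and the data is made up.

```python
def fit_centroids(points):
    """Nearest-centroid 'model': the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def self_train(labeled, unlabeled, max_rounds=20):
    model = fit_centroids(labeled)                 # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled]   # apply to unlabeled
        if new_pseudo == pseudo:                   # pseudo-labels stopped changing
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled + list(zip(unlabeled, pseudo)))  # learn on all
    return model, pseudo

labeled = [(2.0, "-"), (8.0, "+")]
unlabeled = [1.0, 3.0, 4.0, 5.5, 7.0, 9.0, 10.0]
model, pseudo = self_train(labeled, unlabeled)
```

Convergence is taken to mean "the pseudo-labels no longer change", which is one common stopping rule among several.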

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected
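A similarity graph of the kind described above can be built with a brute-force nearest-neighbor search. This sketch assumes Euclidean distance as the (inverse) similarity score and a symmetrized k-nearest-neighbor rule; both are illustrative choices, and the points are made up.

```python
import math

def knn_graph(points, k):
    """Return node index -> set of neighbor indices (symmetrized k-NN graph)."""
    neighbors = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)   # keep the graph undirected
    return neighbors

# Two hypothetical, well-separated groups of points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
graph = knn_graph(points, k=2)
```

The brute-force search is quadratic in the number of instances; at scale an approximate nearest-neighbor index would normally replace it.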

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
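The propagation loop can be sketched on a tiny graph. This is one simple variant, assumed for illustration: labeled nodes are clamped to +1 or -1, and every other node repeatedly takes the average score of its neighbors. The graph and the node names are hypothetical.

```python
def label_propagation(neighbors, seeds, iters=100):
    """neighbors: node -> list of adjacent nodes; seeds: node -> +1.0 or -1.0."""
    score = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iters):
        new = {}
        for n, adj in neighbors.items():
            if n in seeds:
                new[n] = seeds[n]                               # clamp labeled nodes
            else:
                new[n] = sum(score[m] for m in adj) / len(adj)  # average of neighbors
        score = new
    return score

# A chain a - b - c - d - e - f with one label at each end
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d", "f"], "f": ["e"]}
seeds = {"a": 1.0, "f": -1.0}
score = label_propagation(neighbors, seeds)
predicted = {n: ("+" if s > 0 else "-") for n, s in score.items()}
```

The fixed point of this iteration solves a linear system over the unlabeled nodes, which is the sense in which the slide calls the method equivalent to matrix inversion.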

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
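As one example of such a statistical test, a two-proportion z-test compares the conversion rates of control (A) and treatment (B). The sketch below is an assumption about which test to use, not something the slides specify, and the traffic and conversion counts are invented.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return z, p_value

# Hypothetical experiment: 10,000 users per variant
z, p_value = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=260, n_b=10000)
significant = p_value < 0.05
```

With these made-up numbers the difference would be declared significant at the 5% level; running the same function on an A/A split is a quick sanity check of the experiment setup.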

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if you think they’re about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”

B has a big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw!

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you’re the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

• Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you’re the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936; 44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don’t know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
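Step 5's advice to report metrics with confidence intervals can be sketched with k-fold cross-validation. Everything here is an assumption made for illustration: the stand-in model (class means of a single feature), the synthetic data, and the rough normal-approximation interval over fold scores.

```python
import random
import statistics

def k_fold_scores(data, k, fit, predict, seed=0):
    """Accuracy on each of k held-out folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        scores.append(sum(predict(model, x) == y for x, y in test) / len(test))
    return scores

def fit(train):
    # Hypothetical stand-in model: the mean feature value of each class
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x)
    return {y: statistics.mean(xs) for y, xs in by_class.items()}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

# Synthetic, well-separated data: 10 examples per class
data = ([(float(v), "-") for v in range(0, 10)]
        + [(float(v), "+") for v in range(20, 30)])
scores = k_fold_scores(data, k=5, fit=fit, predict=predict)
mean_accuracy = statistics.mean(scores)
half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
```

A test metric that looks this perfect on real data is exactly the situation in which the step recommends checking for a target leak.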

Deriving Knowledge from Data at Scale

Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data; Pre-processing; Feature construction; Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That’s all for our course…

Deriving Knowledge from Data at Scale

10 Grouping Should the patterns be grouped For example in handwriting examples aregrouped per writer A set built by shuffling the words is misleading because trainingand testing would have word examples for the same writer which makesgeneralization much easier If the words are grouped per writers then a writer isunlikely to appear in both training and test set which requires the system to generalizeto never seen before handwriting (as opposed to never seen before words) Do wehave these type of constraints Should we group per advertisers campaign users togeneralize across new instances of these entities (as opposed to generalizing to newqueries) ML requires training and testing to be drawn from the same distributionDrawing duplicates is not a problem Problems arise when one partially drawexamples from the same entity on both training and testing on a small set of entitiesThis breaks the IID assumption and makes the generalization on the test set mucheasier than it actually is

11 Sampling production data What strategy is used for sampling Uniform Are any ofthe following filtered out fraud bad configurations duplicates non-billable adultoverwrites etc Guidance use the production sameness principle

Deriving Knowledge from Data at Scale

11 Unlabeled set If the number of labeled examples is small a large data set ofunlabeled data with the same distribution should be collected and be made a goldset This enables the discovery of new features using intermediate classifiers andactive labeling

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender age smoker eye color

male 19 yes green

female 44 yes gray

male 49 yes blue

male 12 no brown

female 37 no brown

female 60 no brown

male 44 no blue

female 27 yes brown

female 51 yes green

female 81 yes gray

male 22 yes brown

male 29 no blue

lung cancer

no

yes

yes

no

no

yes

no

no

yes

no

no

no

male 77 yes gray

male 19 yes green

female 44 no gray

yes

no

no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine LearningLack of Labelled Training Datahellip

What to Do

bull Controlled Experiments ndash get feedback from user to serve as labels

bull Mechanical Turk ndash pay people to label data to build training set

bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)

Deriving Knowledge from Data at Scale

What if you cant get labeled Training Data

Traditional Supervised Learning

bull Promotion on bookseller rsquos web page

bull Customers can rate books

bull Will a new customer like this book

bull Training set observations on previous customers

bull Test set new customers

Whathappensif only few customers rate a book

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Model

Test Data

Prediction

Training Data

Attributes

Target

Label

copy 2013 Datameer Inc All rights reserved

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory no

but we can make assumptions

PopularAssumptions

bull Clustering assumption

bull Low density assumption

bull Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
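The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is only an illustration under stated assumptions: the toy points, the '+'/'-' labels, the hand-picked initial centroids and the helper names (`kmeans_assign`, `cluster_then_label`) are hypothetical and not from the lecture.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans_assign(points, centroids, max_iterations=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        assignments = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return [nearest(p, centroids) for p in points]


def cluster_then_label(points, labels, initial_centroids):
    """labels[i] is a class label or None (unlabeled); returns one label per point."""
    assignments = kmeans_assign(points, initial_centroids)
    votes = {}
    for cluster, label in zip(assignments, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    # Every point in a cluster gets that cluster's majority label (None if no votes).
    return [votes[c].most_common(1)[0][0] if c in votes else None for c in assignments]


# Hypothetical toy data: two well-separated groups, one labeled point in each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_label(points, labels, [points[0], points[3]])
```

The approach only works as well as the clustering assumption holds: if a cluster mixes both classes, the majority vote mislabels the minority.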

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
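To make the two EM steps concrete, here is a minimal sketch for a one-dimensional mixture of two Gaussians. It is an illustration, not the lecture's implementation: the data, the starting parameters and the function names are assumptions, and the variance floor of 1e-6 is an arbitrary safeguard against collapsed clusters.

```python
import math


def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def em_gaussian_mixture(xs, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns the fitted parameters and responsibilities."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    responsibilities = []
    for _ in range(iterations):
        # E-step: probability that each instance belongs to each cluster.
        responsibilities = []
        for x in xs:
            densities = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(densities)
            responsibilities.append([d / total for d in densities])
        # M-step: re-estimate location, shape and weight of each cluster.
        for j in range(k):
            mass = sum(r[j] for r in responsibilities)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, xs)) / mass
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(responsibilities, xs)) / mass
            variances[j] = max(spread, 1e-6)
            weights[j] = mass / len(xs)
    return means, variances, weights, responsibilities


# Hypothetical data: two groups around 1.0 and 5.0, with rough starting guesses.
xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
fitted_means, fitted_variances, fitted_weights, resp = em_gaussian_mixture(
    xs, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
hard_assignments = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Different starting guesses can end in different solutions, which is the "local optimum" caveat from the slide; running EM from several initializations is a common mitigation.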

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
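The four self-training steps can be sketched as a loop around any base learner. The sketch below uses a nearest-centroid classifier purely because it fits in a few lines; that choice, the toy data and the function names are assumptions for illustration, not part of the lecture.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(model, point):
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(point, model[label])))


def self_train(labeled, unlabeled, max_rounds=10):
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    model = fit_centroids(points, labels)                      # 1. learn on labeled only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, p) for p in unlabeled]   # 2. apply to unlabeled
        if new_guesses == guesses:                             # 4. stop at convergence
            break
        guesses = new_guesses
        model = fit_centroids(points + list(unlabeled), labels + guesses)  # 3. relearn on all
    return model, guesses


# Hypothetical toy data: one labeled example per class, four unlabeled ones.
labeled = [((1.0, 1.0), "+"), ((5.0, 5.0), "-")]
unlabeled = [(1.5, 1.2), (4.5, 4.8), (2.0, 2.5), (4.0, 3.6)]
model, pseudo_labels = self_train(labeled, unlabeled)
```

Note that early mistakes on the unlabeled instances are fed back into training, so self-training can reinforce its own errors; many variants only keep high-confidence predictions in each round.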

The Low Density Assumption

Assumption

• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM
• But non-convex optimization problem

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected

Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
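A minimal sketch of the propagation loop on a similarity graph follows. It assumes an unweighted neighbor graph, binary labels encoded as +1 / -1, and labeled nodes that stay clamped to their seed value; the graph, the node names and the function name are hypothetical.

```python
def propagate_labels(neighbors, seeds, max_iterations=100, tolerance=1e-9):
    """neighbors: node -> list of adjacent nodes; seeds: node -> +1.0 or -1.0.

    Each unlabeled node repeatedly takes the average score of its neighbors.
    Assumes every unlabeled node has at least one neighbor.
    """
    scores = {node: float(seeds.get(node, 0.0)) for node in neighbors}
    for _ in range(max_iterations):
        updated = {}
        largest_change = 0.0
        for node, adjacent in neighbors.items():
            if node in seeds:
                updated[node] = scores[node]  # labeled nodes are clamped
            else:
                updated[node] = sum(scores[n] for n in adjacent) / len(adjacent)
            largest_change = max(largest_change, abs(updated[node] - scores[node]))
        scores = updated
        if largest_change < tolerance:
            break
    return scores


# Hypothetical chain graph a - b - c - d - e with the two ends labeled.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
scores = propagate_labels(graph, seeds={"a": 1.0, "e": -1.0})
predicted = {node: ("+" if s > 0 else "-" if s < 0 else "?") for node, s in scores.items()}
```

The fixed point of this iteration satisfies a linear system over the unlabeled nodes, which is why the slide describes the method as equivalent to a matrix inversion; iterating avoids forming that inverse explicitly.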

Conclusion

Semi-Supervised Learning

• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC: Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled (Left Elevator, Right Elevator)

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e. the changes in metrics are caused by changes introduced in the treatment(s)
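As one example of such a statistical test, the sketch below compares the conversion rates of control (A) and treatment (B) with a two-sided two-proportion z-test. The choice of test, the function name and the visitor counts are assumptions for illustration; the lecture does not prescribe a specific test.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rate B - A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10.0% vs 15.0% conversion on 1,000 visitors each.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
# Same rate in both variants: no evidence of a difference.
z_same, p_same = two_proportion_z_test(100, 1000, 100, 1000)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it says nothing about whether the difference is large enough to matter, which is what the OEC is for.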

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A  B

• A was 8.5% better

A

B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"

B has big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A  B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw!

Examples

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

"If you don't know where you are going, any road will take you there"
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments: contributed by the community

Azure Machine Learning Studio

Sample Experiments: to help you get started

Experiment: tools that you can use in your experiment, for feature selection, plus a large set of machine learning algorithms

Experiment walkthrough (diagram callouts): getting data for the experiment; splitting into training and testing datasets; using classification algorithms; evaluating the model

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.

6. Iterate steps (2) – (5) until the test metrics are satisfactory.
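Step 5 asks for metrics with confidence intervals. One simple way to get a rough interval, sketched below, is to summarize the per-fold scores of a cross-validation run. The fold scores and the function name are hypothetical, and the normal-approximation multiplier z = 1.96 is an assumption: with only a handful of folds a Student-t multiplier is wider and more appropriate, and fold scores are not fully independent, so treat the result as approximate.

```python
import math
import statistics


def mean_with_confidence_interval(fold_scores, z=1.96):
    """Return (mean, lower, upper) for a list of per-fold metric values."""
    mean = statistics.mean(fold_scores)
    standard_error = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, mean - z * standard_error, mean + z * standard_error


# Hypothetical accuracies from a 5-fold cross-validation.
fold_scores = [0.80, 0.82, 0.78, 0.81, 0.79]
mean, lower, upper = mean_with_confidence_interval(fold_scores)
```

Reporting the interval alongside the mean makes it easier to tell whether an improvement between two iterations of steps (2) to (5) is larger than the noise between folds.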

Stages: Access Data; Pre-processing; Feature construction; Model scoring

Book Recommendation

That's all for our course…

Cost based learning

Imbalanced Class Distribution & Error Costs: WEKA cost sensitive learning

weighting method

false negatives (FN): try to avoid false negatives

Imbalanced Class Distribution: WEKA cost sensitive learning

Preprocess, Classify

meta.CostSensitiveClassifier

set the FN to 100, FP to 10

tries to optimize accuracy or error; can be cost-sensitive: decision trees, rule learner
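The effect of unequal error costs can be illustrated with the minimum-expected-cost decision rule. This is a sketch of the general idea, not WEKA's implementation: the function name is hypothetical, and the costs used (false negative ten times as expensive as a false positive) are illustrative values chosen to match the ratio on the slide.

```python
def min_expected_cost_label(p_positive, cost_fn, cost_fp):
    """Pick the label with the lower expected cost.

    p_positive: the model's estimated probability that the instance is positive.
    cost_fn: cost of predicting negative when the truth is positive.
    cost_fp: cost of predicting positive when the truth is negative.
    """
    expected_cost_if_negative = p_positive * cost_fn
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    if expected_cost_if_positive < expected_cost_if_negative:
        return "positive"
    return "negative"


# With FN ten times as costly as FP, a 20% chance of positive is already flagged.
flagged = min_expected_cost_label(0.20, cost_fn=10.0, cost_fp=1.0)
# A 5% chance is still below the implied threshold cost_fp / (cost_fp + cost_fn).
not_flagged = min_expected_cost_label(0.05, cost_fn=10.0, cost_fp=1.0)
# With equal costs the usual 0.5 threshold is recovered.
equal_costs = min_expected_cost_label(0.20, cost_fn=1.0, cost_fp=1.0)
```

Raising the false-negative cost lowers the probability threshold for predicting the positive class, which trades more false positives for fewer false negatives.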

curated; completely specify a problem; measure progress

paired with a metric; target SLAs; scoreboard

This isn't easy…

• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long lasting choices, and the rewards are delayed…

enforce a few principles

1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

   a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

   b. One set should be raw (e.g. contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

   c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
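The "pass over the data" in item 7 can be sketched as a frequency count that keeps the top N values and maps everything else to a shared bucket. The function names, the tiny sample and top_n = 2 are hypothetical (the checklist talks about the top 10M); per the checklist, the values counted here should come from a data set disjoint from the training and test sets.

```python
from collections import Counter


def build_vocabulary(values, top_n):
    """Map the top_n most frequent values to indices 0..top_n-1. No labels needed."""
    counts = Counter(values)
    return {value: index for index, (value, _) in enumerate(counts.most_common(top_n))}


def encode(value, vocabulary):
    """Index of a frequent value, or one shared 'other' index for everything else."""
    return vocabulary.get(value, len(vocabulary))


# Hypothetical stream of query strings from the feature-optimization set.
queries = ["shoes", "boots", "shoes", "hats", "shoes", "boots"]
vocabulary = build_vocabulary(queries, top_n=2)
encoded = [encode(q, vocabulary) for q in ["shoes", "boots", "hats", "never seen"]]
```

Only the frequent values get their own trainable parameter; rare and unseen values share one, which keeps the model size bounded.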

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

   a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

   b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

   c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both training and test set, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and be made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
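The grouping concern in item 10 can be enforced mechanically by assigning whole groups, rather than individual examples, to the training or test side. The sketch below does this with a stable hash of the group key; the function name, the writer/word records and the 30% test fraction are assumptions for illustration.

```python
import hashlib


def grouped_split(examples, group_of, test_fraction=0.2):
    """Split so that all examples of a group land on the same side.

    A stable hash of the group key decides the side, so the split is
    reproducible across runs and machines.
    """
    train, test = [], []
    for example in examples:
        digest = hashlib.sha256(str(group_of(example)).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
        (test if bucket < test_fraction else train).append(example)
    return train, test


# Hypothetical handwriting records: (writer id, word), ten words per writer.
examples = [("writer_%d" % (i % 7), "word_%d" % i) for i in range(70)]
train, test = grouped_split(examples, group_of=lambda ex: ex[0], test_fraction=0.3)
train_writers = {writer for writer, _ in train}
test_writers = {writer for writer, _ in test}
```

Because the side is chosen per group, the realized test fraction only approximates `test_fraction` when there are few groups; the guarantee is that no writer appears on both sides.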

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Deriving Knowledge from Data at Scale

The greatest challenge in Machine LearningLack of Labelled Training Datahellip

What to Do

bull Controlled Experiments ndash get feedback from user to serve as labels

bull Mechanical Turk ndash pay people to label data to build training set

bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)

Deriving Knowledge from Data at Scale

What if you cant get labeled Training Data

Traditional Supervised Learning

bull Promotion on bookseller rsquos web page

bull Customers can rate books

bull Will a new customer like this book

bull Training set observations on previous customers

bull Test set new customers

Whathappensif only few customers rate a book

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Model

Test Data

Prediction

Training Data

Attributes

Target

Label

copy 2013 Datameer Inc All rights reserved

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory no

but we can make assumptions

PopularAssumptions

bull Clustering assumption

bull Low density assumption

bull Manifold assumption

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumption

bull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
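The four self-training steps map directly onto a loop. In this sketch a nearest-centroid classifier stands in for "the model"; that choice, the function names, and the toy data are assumptions made purely for illustration.

```python
def fit_centroids(points, labels):
    """A deliberately simple 'model': the mean point of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(model, p):
    """Label of the closest class centroid."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))

def self_train(labeled, labels, unlabeled, max_rounds=20):
    model = fit_centroids(labeled, labels)                    # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]   # apply model to unlabeled instances
        if new_pseudo == pseudo:                              # pseudo-labels stable: converged
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled + unlabeled, labels + pseudo)  # learn new model on all
    return model, pseudo

# Hypothetical 1-D example: two labeled points and four unlabeled ones.
model, pseudo = self_train(
    labeled=[(0.0,), (10.0,)],
    labels=["a", "b"],
    unlabeled=[(1.0,), (2.0,), (8.0,), (9.0,)],
)
```

Self-training can reinforce its own early mistakes, so a common refinement is to add only the pseudo-labels the model is most confident about in each round.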

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin (distance from the decision boundary) to both labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
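A single-machine sketch of the procedure above, under several stated assumptions: the classifier is linear with no bias term, labels are +1 or -1, unlabeled points are penalised with a symmetric hinge loss (pushed away from the boundary on whichever side they currently fall), and the step size decays as 1/(1+t). The function names, parameter values and toy data are hypothetical; the Hadoop mapper/reducer split is not reproduced, only the "shuffle, one pass, keep the best of many runs" logic.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def objective(w, labeled, unlabeled, c_unlabeled, lam):
    """Regulariser + hinge loss on labeled points + symmetric hinge on unlabeled points."""
    loss = 0.5 * lam * dot(w, w)
    loss += sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in labeled)
    loss += c_unlabeled * sum(max(0.0, 1.0 - abs(dot(w, x))) for x in unlabeled)
    return loss

def sgd_pass(labeled, unlabeled, rng, c_unlabeled=0.5, lam=0.01, lr0=0.5):
    """One pass over all instances in random order; returns the weight vector."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1.0 + t)                      # steps get smaller over time
        score = dot(w, x)
        w = [wi * (1.0 - lr * lam) for wi in w]   # shrink weights (regularisation)
        if y is not None:
            if y * score < 1.0:                   # misclassified or inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        elif 0.0 < abs(score) < 1.0:              # unlabeled point inside the margin
            side = 1.0 if score > 0 else -1.0     # push it further onto its current side
            w = [wi + lr * c_unlabeled * side * xi for wi, xi in zip(w, x)]
    return w

def best_of_runs(labeled, unlabeled, runs=10, seed=0, c_unlabeled=0.5, lam=0.01):
    """The problem is non-convex, so try several random orders and keep the best."""
    rng = random.Random(seed)
    candidates = [sgd_pass(labeled, unlabeled, rng, c_unlabeled, lam) for _ in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled, c_unlabeled, lam))

# Hypothetical toy data: one labeled point per class plus four unlabeled points.
labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(3.0, 0.5), (-3.0, -0.5), (2.5, -0.5), (-2.5, 0.5)]
w = best_of_runs(labeled, unlabeled)
```

The parameter `c_unlabeled` plays the role of the knob that fine-tunes the influence of unlabeled instances; the class-balance constraint mentioned earlier is left out of this sketch.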

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected
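A similarity graph of this kind can be built with a brute-force nearest-neighbor search. The sketch assumes a Gaussian (RBF) similarity and a symmetrised k-nearest-neighbor rule; both are common choices but are not specified on the slide, and the names `gaussian_similarity` and `knn_graph` are made up for the example.

```python
import math

def gaussian_similarity(a, b, sigma=1.0):
    """Similarity in (0, 1]: 1 for identical points, decaying with distance."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: weight}} linking each point to its k most similar points.

    Edges are made symmetric: if j is among i's neighbors, i is also linked to j.
    """
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        sims = [(gaussian_similarity(p, q, sigma), j) for j, q in enumerate(points) if j != i]
        for s, j in sorted(sims, reverse=True)[:k]:
            graph[i][j] = s
            graph[j][i] = s
    return graph

# Hypothetical 1-D points forming two groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
graph = knn_graph(points, k=1)
```

The all-pairs comparison is quadratic in the number of instances, so at scale one would swap in an approximate nearest-neighbor index; the graph structure it produces is the same.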

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
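The iteration can be sketched on a weighted graph stored as nested dictionaries. This version assumes two classes encoded as +1 and -1, clamps the labeled nodes to their known values on every round, and replaces each unlabeled node's score by the weighted average of its neighbors' scores; these are illustrative choices, as is the name `label_propagation`.

```python
def label_propagation(graph, seeds, iters=100, tol=1e-6):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}.

    Returns a score per node; the sign of the score is the predicted class.
    """
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iters):
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:
                new[n] = seeds[n]            # labeled nodes keep their label
            elif nbrs:
                total = sum(nbrs.values())
                new[n] = sum(w * scores[m] for m, w in nbrs.items()) / total
            else:
                new[n] = 0.0                 # isolated node: no information arrives
        change = max(abs(new[n] - scores[n]) for n in graph)
        scores = new
        if change < tol:                     # repeat until convergence
            break
    return scores

# Hypothetical chain 0 - 1 - 2 - 3 - 4 with the two end nodes labeled.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = label_propagation(chain, seeds={0: 1.0, 4: -1.0})
```

At the fixed point every unlabeled score equals the average of its neighbors, which is a linear system in the unlabeled scores; solving that system directly is the "matrix inversion" view of the same computation.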

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

[Figure labels: Left Elevator, Right Elevator]

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
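One common way to check that a difference between control and treatment is not due to chance is a two-proportion z-test on conversion counts. The sketch below assumes a binary metric (converted or not) and large samples; the function name and the counts in the usage lines are hypothetical, not results from the lecture.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    conv_x: number of converting users, n_x: number of users in the variant.
    Returns (z, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)        # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 100 of 1000 users convert in A, 150 of 1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
significant = p_value < 0.05
```

A small p-value says the observed gap would be unlikely if A and B truly performed the same; it says nothing about whether the metric being compared is the right one, which is the job of the OEC.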

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if you think they’re about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”

B has a big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if they are about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you’re the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

• Chlorine and lime was effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you’re the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

[Figure: sample experiment, annotated with: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]

Deriving Knowledge from Data at Scale

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric - optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem - classification, regression, ranking, clustering, forecasting, outlier detection, etc. - some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) - this is the ML-heavy step.
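Reporting a metric with an interval, rather than as a single number, can be sketched with k-fold cross-validation. The helper names below are hypothetical, and the interval is a rough normal approximation over fold scores (fold scores are not independent, so treat it as indicative only). The caller supplies `train_and_score`, a function that fits a model on the training rows and returns a score on the held-out rows.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, train_and_score, k=5, seed=0):
    """Return (mean score, half-width of an approximate 95% interval)."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(train_and_score(train, test))   # test rows never seen in training
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)
    half_width = 1.96 * math.sqrt(var / k)
    return mean, half_width

def majority_baseline(train, test):
    """Toy scorer: predict the most frequent training label, return test accuracy."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == guess) / len(test)

# Hypothetical rows of (feature, label); two thirds carry label 1.
rows = [(i, 0 if i % 3 == 0 else 1) for i in range(30)]
mean_acc, half_width = cross_validate(rows, majority_baseline, k=5)
```

A test score far above what such a trivial baseline achieves, with a suspiciously narrow interval, is a useful prompt to go looking for a target leak before celebrating.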

Deriving Knowledge from Data at Scale

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2) - (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data

Pre-processing

Feature construction

Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That’s all for our course…

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumption

bull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes the margin to the closest instances

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin to both labeled and unlabeled instances

• Parameter to fine-tune the influence of unlabeled instances

• Additional constraint: keep the class balance correct

Implementation

• Simple extension of SVM

• But a non-convex optimization problem

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves the classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to the reducer in random order

• Reducer: update the linear classifier for unlabeled or misclassified instances

• Many random runs to find the best one
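
A single-machine sketch of the stochastic gradient descent idea above, for a linear classifier. Several details are assumptions rather than anything stated on the slides: the short warm start on the labeled instances, the learning-rate schedule, the regularisation constant, and the omission of the class-balance constraint. The function name `sgd_s3vm` is hypothetical.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm(labeled, unlabeled, c_unlabeled=0.5, lr0=0.1, lam=0.01, warm_passes=5, seed=0):
    """labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x."""
    w, b, t = [0.0] * len(labeled[0][0]), 0.0, 0

    def step(x, y, weight):
        nonlocal w, b, t
        lr = lr0 / (1.0 + 0.1 * t)                   # steps get smaller over time
        t += 1
        w = [wi * (1.0 - lr * lam) for wi in w]      # L2 regularisation shrink
        f = dot(w, x) + b
        if y is None:                                # unlabeled: use the current prediction
            y = 1 if f >= 0 else -1
        if y * f < 1:                                # misclassified or inside the margin
            w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]
            b += lr * weight * y

    for _ in range(warm_passes):                     # warm start on labeled data (assumption)
        for x, y in labeled:
            step(x, y, 1.0)
    stream = [(x, y, 1.0) for x, y in labeled] + [(x, None, c_unlabeled) for x in unlabeled]
    random.Random(seed).shuffle(stream)              # one run over the data in random order
    for x, y, weight in stream:
        step(x, y, weight)
    return w, b

labeled = [((-2.0, 0.0), -1), ((2.0, 0.0), 1)]
unlabeled = [(-3.0, 0.5), (-2.5, -0.5), (-3.5, 0.2), (3.0, -0.5), (2.5, 0.5), (3.5, -0.2)]
w, b = sgd_s3vm(labeled, unlabeled)
```

Because the problem is non-convex, the result depends on the random order; calling the function with several `seed` values and keeping the best classifier mirrors the "many random runs" step.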

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids the curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected
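
A similarity graph of this kind can be sketched as a k-nearest-neighbour graph. The Gaussian similarity score, the choice of `k`, and the function name `knn_graph` are assumptions made for illustration.

```python
import math

def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: similarity}} linking each point to its k nearest neighbours."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            sim = math.exp(-d * d / (2 * sigma ** 2))   # Gaussian similarity score
            graph[i][j] = sim
            graph[j][i] = sim                            # keep the graph symmetric
    return graph

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(points, k=2)
```

In this toy data the graph splits into two disconnected groups, which is the structure a graph-based semi-supervised method would then exploit.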

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
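
The iteration above can be sketched for a two-class problem on a weighted graph. This is one simple variant (labeled nodes are clamped, every other node takes the weighted average of its neighbours); the function name and the chain graph are hypothetical.

```python
def propagate_labels(graph, seeds, n_iter=100, tol=1e-6):
    """graph: {node: {neighbour: weight}}; seeds: {node: +1.0 or -1.0} for labeled nodes."""
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(n_iter):
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:                       # labeled nodes keep their label
                new[n] = seeds[n]
            else:                                # weighted average of the neighbours' scores
                total = sum(nbrs.values())
                new[n] = sum(w * scores[m] for m, w in nbrs.items()) / total if total else 0.0
        change = max(abs(new[n] - scores[n]) for n in graph)
        scores = new
        if change < tol:                         # repeat until convergence
            break
    return scores

# Chain 0 - 1 - 2 - 3 - 4 with the two end nodes labeled.
graph = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = propagate_labels(graph, seeds={0: 1.0, 4: -1.0})
```

The sign of each score gives the predicted class. The fixed point of this iteration is the solution of a linear system, which is the sense in which the method is equivalent to a matrix inversion.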

Conclusion

Semi-Supervised Learning

• Only a few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A

• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

Left Elevator | Right Elevator

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e. the changes in metrics are caused by changes introduced in the treatment(s)
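
One common way to run such a statistical test on conversion counts is a two-proportion z-test. The sketch below is illustrative only: the slides do not name a specific test, the function name is hypothetical, and the counts are made-up numbers.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)              # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                                  # assumes se > 0
    p_value = math.erfc(abs(z) / math.sqrt(2))            # two-sided normal tail probability
    return z, p_value

# Made-up example: control A converts 200 of 10,000 users, treatment B converts 260 of 10,000.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

A small p-value suggests that the observed difference is unlikely to be due to chance alone. Running the same test on an A/A split is a useful sanity check of the experimentation pipeline.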

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon and "popular searches"

B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?

• Insight

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

• Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there

-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

[Figure callouts: getting data for the experiment; splitting into training and testing datasets; using classification algorithms; evaluating the model]

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Diagram: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
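
As a rough illustration of reporting a metric with a confidence interval via cross-validation, the sketch below runs k-fold cross-validation and summarises the fold scores. The toy threshold classifier, the data, and the normal-approximation interval (z = 1.96) are assumptions; with only a handful of folds a t-based interval would usually be more appropriate.

```python
import math
import random

def k_fold_scores(examples, train_fn, score_fn, k=5, seed=0):
    """Shuffle once, split into k folds, and score a model trained on the other k-1 folds."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        scores.append(score_fn(model, folds[i]))
    return scores

def mean_and_ci(scores, z=1.96):
    """Mean of the fold scores with an approximate 95% confidence interval."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

# Toy model (hypothetical): a threshold halfway between the two class means.
def train_threshold(train):
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, test):
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)

examples = [(float(x), int(x > 0)) for x in range(-10, 11) if x != 0]
scores = k_fold_scores(examples, train_threshold, accuracy)
mean, (low, high) = mean_and_ci(scores)
```

A test metric that looks unbelievably good across every fold is a cue to check the features for a target leak before trusting the interval.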

[Diagram: Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

6. Iterate steps (2) - (5) until the test metrics are satisfactory.

[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That's all for our course…


curated; completely specify a problem; measure progress

paired with a metric; target SLAs; scoreboard

This isn't easy…

• Building high quality gold sets is a challenge

• It is time consuming

• It requires making difficult and long lasting choices, and the rewards are delayed…

enforce a few principles

1. Distribution parity

2. Testing blindness

3. Production parity

4. Single metric

5. Reproducibility

6. Experimentation velocity

7. Data is gold

• Test set blindness

• Reproducibility and Data is gold

• Experimentation velocity

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2 and possibly 3 types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g. contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building blocks" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
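
The grouping question in the checklist (for example, keeping each writer entirely in either the training or the test set) can be sketched as a group-wise split. The function name, the `(writer, word)` record layout, and the test fraction are assumptions for illustration.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group (e.g. writer) lands entirely in train or entirely in test."""
    groups = sorted({group_of(ex) for ex in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [ex for ex in examples if group_of(ex) not in test_groups]
    test = [ex for ex in examples if group_of(ex) in test_groups]
    return train, test

# Hypothetical handwriting records: (writer, word).
examples = [(f"writer{w}", f"word{i}") for w in range(5) for i in range(4)]
train, test = grouped_split(examples, group_of=lambda ex: ex[0])
```

Shuffling individual words instead would place the same writer on both sides of the split and make the test metric look better than it would be on unseen writers.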

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments - get feedback from users to serve as labels

• Mechanical Turk - pay people to label data to build a training set

• Ask Users to Label Data - report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on a bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (from the Model)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
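
The "cluster, then perform majority vote" recipe can be sketched with a small k-means implementation. The farthest-first initialisation (used only to keep the sketch deterministic), the function names, and the toy points are assumptions.

```python
import math
from collections import Counter

def kmeans(points, k, n_iter=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = [points[0]]
    while len(centroids) < k:   # farthest-first initialisation (assumption)
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]

def cluster_then_label(points, labels, k=2):
    """labels[i] is a class label, or None for an unlabeled instance."""
    assign = kmeans(points, k)
    votes = {j: Counter() for j in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1                      # labeled instances vote in their cluster
    cluster_label = {j: (v.most_common(1)[0][0] if v else None) for j, v in votes.items()}
    return [cluster_label[a] for a in assign]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_label(points, labels)
```

A cluster that contains no labeled instance receives `None` here; how to handle that case is a design choice the slides leave open.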


Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
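
A similarity graph of this kind can be sketched in a few lines. The version below assumes Euclidean distance as the (inverse) similarity score and connects each instance to its `k` nearest neighbors. The function name and toy points are hypothetical.

```python
import math


def knn_graph(points, k):
    """Return undirected edges (i, j), i < j, linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p (smaller = more similar).
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
edges = knn_graph(points, k=1)
```

On these five points the graph splits into two components, {0, 1, 2} and {3, 4}, which is the structure label propagation later exploits.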

Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
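
The iteration can be sketched as below, assuming a weighted neighbor graph given as a dictionary and binary labels encoded as scores of 1.0 and 0.0. Labeled nodes are clamped. Every unlabeled node is repeatedly set to the weighted average of its neighbors. The names and the chain graph are illustrative assumptions, and every unlabeled node is assumed to have at least one neighbor.

```python
def propagate_labels(neighbors, seeds, tol=1e-9, max_iter=10_000):
    """neighbors: {node: {neighbor: weight}}; seeds: {node: score in [0, 1]}."""
    scores = {n: seeds.get(n, 0.5) for n in neighbors}
    for _ in range(max_iter):
        change = 0.0
        for n, nbrs in neighbors.items():
            if n in seeds:
                continue  # labeled instances keep their label (clamping)
            total = sum(nbrs.values())
            new = sum(w * scores[m] for m, w in nbrs.items()) / total
            change = max(change, abs(new - scores[n]))
            scores[n] = new
        if change < tol:
            break
    return scores


# A chain 0 - 1 - 2 - 3 - 4 with only the two end points labeled.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0},
         3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = propagate_labels(chain, seeds={0: 1.0, 4: 0.0})
```

The fixed point of this update is the solution of a linear system in the unlabeled scores, which is why it can also be obtained by a matrix inversion instead of iterating.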

Conclusion

Semi-Supervised Learning

• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

Left Elevator / Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
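
As one example of such a statistical test, here is a minimal sketch of a two-sided two-proportion z-test for comparing the conversion rates of A and B. The slides do not say which test to use, so this choice, the function name, and the counts are all assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - cdf)
    return z, p_value


# Hypothetical counts: 100 of 1000 users convert on A, 150 of 1000 on B.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
# Identical counts, as in an A/A test, give no evidence of a difference.
z_same, p_same = two_proportion_z_test(100, 1000, 100, 1000)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same. It is the random assignment of users to A and B that supports reading the difference as causal.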

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same

A B

• A was 85 better

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, “popular searches”

B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing,” find the flaw!

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
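
Step 5 can be illustrated with a small k-fold cross-validation sketch that reports accuracy together with a rough interval. Everything named here is an assumption for illustration, in particular the normal-approximation interval across folds (fold scores are not fully independent, so treat it as a rough guide) and the majority-class stand-in model.

```python
import math
import random
import statistics


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 once and cut it into k disjoint test folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validate(xs, ys, train_fn, k=5, seed=0):
    """Return per-fold accuracy; train_fn(train_xs, train_ys) returns a predict function."""
    scores = []
    for test_idx in kfold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        predict = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(predict(xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


def mean_with_ci(scores, z=1.96):
    """Mean and a rough normal-approximation 95% interval across folds."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half, mean + half


def majority_baseline(train_xs, train_ys):
    """Hypothetical stand-in model: always predict the most common training label."""
    label = statistics.mode(train_ys)
    return lambda x: label


xs = list(range(20))
ys = [1] * 15 + [0] * 5
scores = cross_validate(xs, ys, majority_baseline, k=5)
mean, lo, hi = mean_with_ci(scores)
```

A trivial baseline like this is also a cheap guard against target leaks: a real model that beats it by an implausible margin deserves a second look.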

Define Objective | Access and Understand the data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory

Access Data | Pre-processing | Feature construction | Model scoring

Book Recommendation

That’s all for our course…

Imbalanced Class Distribution
WEKA cost sensitive learning

curated, completely specify a problem, measure progress

paired with a metric, target SLAs, scoreboard

This isn’t easy…

• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long lasting choices, and the rewards are delayed…

enforce a few principles

1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can’t optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick’s result, it becomes easier to improve on the latest “production” yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
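
To illustrate item 2, here is a minimal sketch of putting the weighting into the metric itself, leaving the sample untouched. The function name and the per-example weights are hypothetical.

```python
def weighted_error(y_true, y_pred, weights):
    """Share of total example weight that was misclassified."""
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / sum(weights)


y_true = [1, 1, 0, 0]
y_pred = [0, 1, 0, 0]
weights = [10, 10, 1, 1]  # hypothetical costs: errors on positives count 10x

err_weighted = weighted_error(y_true, y_pred, weights)
err_plain = weighted_error(y_true, y_pred, [1, 1, 1, 1])
```

The same function can be run unchanged on production, training, and test sets, which keeps their results directly comparable.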

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (less than 1 GB) to a desktop for high-velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets:

   a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

   b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

   c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features. But only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
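
The frequency pass in item 7 might look like the sketch below. It keeps a dedicated id for the most frequent values and folds everything else into a shared "other" bucket. The names, the tiny `top_k`, and the sample IP addresses are illustrative assumptions.

```python
from collections import Counter


def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to ids 1..top_k; id 0 is the shared 'other' bucket."""
    counts = Counter(values)
    return {v: i + 1 for i, (v, _) in enumerate(counts.most_common(top_k))}


def encode(value, vocab):
    return vocab.get(value, 0)


# Hypothetical held-out slice used only for feature optimization (no labels needed).
optimization_set = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
vocab = build_vocabulary(optimization_set, top_k=2)
```

Building the vocabulary from a slice disjoint from the training and test sets keeps the choice of "frequent" values from leaking information into either.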

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

   a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

   b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

   c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
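
Option (c) above, recasting the data as time-series patterns, can be sketched as a sliding window over the series. The function name and the toy series are assumptions.

```python
def make_windows(series, T):
    """Turn a time series into (inputs from t-T..t-1, target at t) patterns."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]


patterns = make_windows([1, 2, 3, 4, 5], T=2)
```

Each pattern now carries its own recent history, so a model sees the drift as part of its input and does not have to assume a fixed distribution.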

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing, on a small set of entities. This breaks the IID assumption and makes the generalization on the test set much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
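
The grouping concern in item 10 can be handled by assigning whole groups, not individual rows, to the training or test side. The sketch below hashes a group id (for example a writer id) so the assignment is deterministic. The helper names, the 20% test fraction, and the synthetic rows are assumptions.

```python
import hashlib


def assign_split(group_id, test_fraction=0.2):
    """Deterministically send a whole group (e.g. a writer) to 'train' or 'test'."""
    digest = hashlib.md5(str(group_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000 / 1000.0
    return "test" if bucket < test_fraction else "train"


def grouped_split(rows, group_of):
    """Split rows so that no group appears on both sides."""
    train, test = [], []
    for row in rows:
        (test if assign_split(group_of(row)) == "test" else train).append(row)
    return train, test


# Hypothetical handwriting rows: (writer id, word).
rows = [(f"writer{w}", f"word{i}") for w in range(50) for i in range(4)]
train, test = grouped_split(rows, group_of=lambda r: r[0])
```

Because the split depends only on the group id, re-running it on a refreshed data set keeps every writer on the same side.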

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do?

• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, ‘hot or not?’, review a product, observe their click behavior (ad retargeting, search results, etc.)

What if you can’t get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

Training Data (attributes: Age, Income; target: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (from the model)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
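
The "cluster, then perform majority vote" recipe can be sketched with a plain k-means. To keep the example deterministic, the caller supplies the starting centers. The sketch assumes every cluster ends up with at least one labeled instance, and all names and data are illustrative.

```python
import math
from collections import Counter


def kmeans(points, centers, iters=20):
    """Plain k-means from caller-supplied centers; returns a cluster index per point."""
    assign = []
    for _ in range(iters):
        assign = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign


def cluster_then_label(points, labels, centers):
    """labels[i] is a class label or None; unlabeled points get their cluster's majority label."""
    assign = kmeans(points, list(centers))
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    # Assumes each cluster contains at least one labeled point.
    return [y if y is not None else votes[a].most_common(1)[0][0]
            for a, y in zip(assign, labels)]


points = [(0.0,), (0.2,), (0.1,), (5.0,), (5.1,), (4.9,)]
labels = ["+", None, None, "-", None, None]
result = cluster_then_label(points, labels, centers=[(0.0,), (5.0,)])
```

With one labeled instance per cluster, the four unlabeled instances inherit the label of the cluster they fall into.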

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
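
The two EM steps can be sketched for a one-dimensional mixture of Gaussians as follows. The starting means, the variance floor `min_var`, and the fixed iteration count are assumptions made to keep the example short and deterministic.

```python
import math


def em_gmm_1d(xs, means, iters=50, min_var=1e-6):
    """EM for a 1-D mixture of Gaussians, starting from caller-supplied means."""
    k = len(means)
    means, variances, weights = list(means), [1.0] * k, [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft cluster-assignment probabilities for every instance.
        resp = []
        for x in xs:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each Gaussian from its weighted instances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)
            weights[j] = nj / len(xs)
    return means, variances, weights


xs = [-0.1, 0.0, 0.1, 4.9, 5.0, 5.1]
means, variances, weights = em_gmm_1d(xs, means=[1.0, 4.0])
```

Different starting means can lead to different solutions, which is the local-optimum caveat from the slide. In practice one would try several initializations.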

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
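
The self-training loop above can be sketched with any base learner. A one-dimensional nearest-centroid classifier stands in here purely for illustration. The names and data are hypothetical.

```python
def nearest_centroid(train):
    """Fit a 1-D nearest-centroid classifier on (value, label) pairs; returns a predict function."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    centroids = {y: sum(v) / len(v) for y, v in by_label.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def self_train(labeled, unlabeled, max_rounds=20):
    model = nearest_centroid(labeled)        # learn on labeled instances only
    guesses = [model(x) for x in unlabeled]  # apply to unlabeled instances
    for _ in range(max_rounds):
        # Learn a new model on all instances, using the current guesses as labels.
        model = nearest_centroid(labeled + list(zip(unlabeled, guesses)))
        new_guesses = [model(x) for x in unlabeled]
        if new_guesses == guesses:           # stop once nothing changes
            break
        guesses = new_guesses
    return model, guesses


labeled = [(0.0, "+"), (5.0, "-")]
unlabeled = [0.5, 1.0, 4.0, 2.4]
model, guesses = self_train(labeled, unlabeled)
```

The unlabeled instances shift the centroids (here the "+" centroid moves from 0.0 to 0.975), so the final decision boundary reflects where the unlabeled data actually lies.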

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
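To make the two "Theory" bullets concrete, here is a sketch of one common variant of label propagation (the normalised "local and global consistency" update), shown both as an iteration and as the equivalent linear solve. The slide does not say which variant it has in mind, so the update rule, the damping factor `alpha`, and the function names are assumptions. Every node is assumed to have at least one edge.

```python
import numpy as np

def normalise(W):
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))     # S = D^(-1/2) W D^(-1/2)

def propagate_labels(W, Y, alpha=0.9, n_iter=500):
    """Iterate F <- alpha * S F + (1 - alpha) * Y."""
    S = normalise(W)
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F

def propagate_closed_form(W, Y, alpha=0.9):
    """Fixed point of the iteration: F = (1 - alpha) (I - alpha S)^(-1) Y."""
    S = normalise(W)
    return (1 - alpha) * np.linalg.solve(np.eye(len(W)) - alpha * S, Y)

# Chain of six nodes: two tightly linked triples joined by one weak edge.
W = np.array([
    [0, 1, 0,   0,   0, 0],
    [1, 0, 1,   0,   0, 0],
    [0, 1, 0,   0.1, 0, 0],
    [0, 0, 0.1, 0,   1, 0],
    [0, 0, 0,   1,   0, 1],
    [0, 0, 0,   0,   1, 0],
], dtype=float)
Y = np.zeros((6, 2))
Y[0, 0] = 1.0      # node 0 is labeled class 0
Y[5, 1] = 1.0      # node 5 is labeled class 1

F_iter = propagate_labels(W, Y)
F_exact = propagate_closed_form(W, Y)
labels = F_iter.argmax(axis=1)
```

The iteration is the form that scales to large sparse graphs, which is where the PageRank analogy comes from; the linear solve is only practical for small graphs.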

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
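As an illustration of the first bullet, the sketch below runs a two-sided two-proportion z-test on conversion counts from a control (A) and a treatment (B). The slides do not prescribe a particular test, so the choice of test, the function name, and the counts are assumptions for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return z, p_value

# Hypothetical experiment: 10.0% vs 11.0% conversion with 10,000 users per arm.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

With these made-up counts the z statistic is a little above 2.3 and the p-value is roughly 0.02, so the difference would be called significant at the usual 5% level.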

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same

A B

• A was 85 better

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, and “popular searches”. B has a big search button.

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

[Screenshot callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks, which typically lead to unbelievably good test metrics. This is the ML-heavy step.
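Step 5 can be illustrated with a small k-fold cross-validation that reports a mean accuracy together with a rough interval. This is a sketch, not the course's tooling: the nearest-centroid classifier, the synthetic data, and the normal-approximation interval over fold scores are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Train a nearest-centroid classifier on one split and score it on the other."""
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    dist = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dist.argmin(axis=1) == y_te).mean())

def cross_validate(X, y, k=5, seed=0):
    """k-fold cross-validation: every instance is used for testing exactly once."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(nearest_centroid_accuracy(X[train], y[train], X[test], y[test]))
    scores = np.array(scores)
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(k)   # rough 95% interval
    return scores.mean(), half_width, scores

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
mean_acc, half_width, scores = cross_validate(X, y)
```

With only five folds the interval is crude; the point is that a single test-set number without any spread is hard to interpret, and a score that looks too good is a prompt to check for target leaks.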

[Pipeline stages: Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

6. Iterate steps (2) – (5) until the test metrics are satisfactory

[Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That’s all for our course…


curated; completely specify a problem; measure progress; paired with a metric; target SLAs; scoreboard

This isn’t easy…
• Building high-quality gold sets is a challenge
• It is time-consuming
• It requires making difficult and long-lasting choices, and the rewards are delayed…

Enforce a few principles
1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can’t optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick’s result, it becomes easier to improve on the latest “production” yardstick. Ideally yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
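Item 2 above argues for putting error weights in the metric rather than in the sampling. A minimal sketch of such a metric follows; the function name and the particular costs (a false negative counted ten times as heavily as a false positive) are illustrative assumptions, not values from the checklist.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Cost-weighted error rate in [0, 1]; 0 means no errors, 1 means every instance is wrong."""
    incurred = 0.0
    worst_case = 0.0
    for truth, pred in zip(y_true, y_pred):
        worst_case += fn_cost if truth == 1 else fp_cost
        if truth == 1 and pred == 0:
            incurred += fn_cost          # missed positive
        elif truth == 0 and pred == 1:
            incurred += fp_cost          # false alarm
    return incurred / worst_case

# One false negative and one false positive on five instances.
score = weighted_error([1, 1, 0, 0, 0], [1, 0, 0, 1, 0])   # 11 / 23
```

Because the weights live in one documented function, the same code can score training, test and production samples, which is the comparability the checklist asks for.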

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2 and possibly 3 types of gold sets.

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
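The "pass over the data" in item 7 can be sketched with a frequency count. The helper names and the shared "other" bucket for rare values are assumptions for illustration; in practice the count would run over the separate, disjoint data set the item calls for, and `k` would be on the order of millions rather than 2.

```python
from collections import Counter

def top_k_vocabulary(values, k):
    """Map the k most frequent values to indices 0..k-1 (no labels needed)."""
    return {value: index for index, (value, _) in enumerate(Counter(values).most_common(k))}

def encode(value, vocabulary):
    """Frequent values get their own index; everything else shares one 'other' index."""
    return vocabulary.get(value, len(vocabulary))

# Hypothetical query log used only to pick the frequent values.
queries = ["shoes", "hats", "shoes", "socks", "shoes", "hats"]
vocabulary = top_k_vocabulary(queries, k=2)
encoded = [encode(q, vocabulary) for q in ["shoes", "hats", "socks", "ties"]]
```

Only the values in `vocabulary` would receive their own trainable parameters; rare values fall back to the shared bucket.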

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
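Option 8c, recasting the data as (inputs from t-T to t-1, target at time t) patterns, amounts to a sliding window over the series. A minimal sketch, with a hypothetical function name and toy numbers:

```python
def make_windows(series, T):
    """Build (inputs from t-T to t-1, target at t) patterns from an ordered series."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

# Toy series with a window of T = 2 past observations.
patterns = make_windows([1, 2, 3, 4, 5], T=2)
```

Each pattern is larger than a single observation, as the item notes, and the train/test split should then be made along time so that the test targets lie after the training targets.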

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes the generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
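The grouping concern in item 10 can be handled by splitting on the entity (writer, advertiser, user) rather than on individual examples. The sketch below is one simple way to do that; the function name, the 20% test fraction and the writer ids are assumptions for illustration.

```python
import random

def group_split(group_ids, test_fraction=0.2, seed=0):
    """Split example indices so that no group appears in both train and test."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx

# Hypothetical handwriting samples, one writer id per example.
writers = ["w1", "w1", "w2", "w3", "w3", "w4", "w5", "w5"]
train_idx, test_idx = group_split(writers)
train_writers = {writers[i] for i in train_idx}
test_writers = {writers[i] for i in test_idx}
```

Every example from a held-out writer lands in the test set, so the test score measures generalization to unseen writers rather than to unseen words from familiar writers.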

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and be made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do?
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, ‘hot or not’, review a product, observe their click behavior (ad retargeting, search results, etc.)

What if you can’t get labeled Training Data?

Traditional Supervised Learning
• Promotion on a bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Model Prediction

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning

Can we make use of the unlabeled data?
In theory: no
… but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
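The "cluster, then perform majority vote" recipe above can be sketched as follows. The small k-means routine with farthest-point initialisation, the use of -1 to mark unlabeled instances, and the toy data are assumptions made to keep the example self-contained.

```python
import numpy as np
from collections import Counter

def kmeans(X, k, n_iter=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def cluster_then_label(X, y, k=2):
    """y holds class ids for labeled instances and -1 for unlabeled ones."""
    assign = kmeans(X, k)
    pred = np.full(len(X), -1)
    for j in range(k):
        votes = y[(assign == j) & (y != -1)]
        if len(votes):                                   # majority vote inside the cluster
            pred[assign == j] = Counter(votes.tolist()).most_common(1)[0][0]
    return pred

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.full(40, -1)
y[[0, 1, 2]] = [0, 0, 1]      # first cluster: one deliberately noisy label
y[[20, 21]] = [1, 1]          # second cluster
pred = cluster_then_label(X, y)
```

The vote overrides the noisy label in the first cluster, which shows both the strength of the approach and its risk: if the clusters do not line up with the classes, every instance in a cluster inherits the wrong label.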

Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
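A sketch of Expectation-Maximization for a two-component, one-dimensional Gaussian mixture, adapted to the semi-supervised setting by keeping the assignments of labeled instances fixed. The clamping of labeled points, the initialisation from the labeled class means, and the toy data are assumptions for the example; at least one labeled instance per class is required.

```python
import numpy as np

def semi_supervised_gmm(x, y, n_iter=50):
    """1-D two-component mixture; y is 0/1 for labeled points and -1 for unlabeled ones."""
    labeled = y != -1
    resp = np.full((len(x), 2), 0.5)              # cluster assignment probabilities
    resp[labeled] = np.eye(2)[y[labeled]]         # labeled points keep hard assignments
    mu = np.array([x[y == 0].mean(), x[y == 1].mean()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: update assignment probabilities of the unlabeled instances.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        new_resp = dens / dens.sum(axis=1, keepdims=True)
        resp[~labeled] = new_resp[~labeled]
        # M-step: re-estimate location, shape and weight of each cluster.
        weight = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / weight
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / weight
        pi = weight / len(x)
    return mu, var, pi, resp

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
y = np.full(200, -1)
y[[0, 1]] = 0            # two labeled instances per class
y[[100, 101]] = 1
mu, var, pi, resp = semi_supervised_gmm(x, y)
```

Initialising from the labeled data also ties each mixture component to a class, so the fitted components can be used directly as a classifier for new instances.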

Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
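The four self-training steps above map directly onto a short loop. The base learner here is a nearest-centroid classifier, chosen only to keep the sketch self-contained; the function names, the two-class setup and the toy data are assumptions.

```python
import numpy as np

def fit_centroids(X, y):
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, X):
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def self_train(X_lab, y_lab, X_unl, max_rounds=20):
    centroids = fit_centroids(X_lab, y_lab)           # 1. learn on labeled instances only
    pseudo = predict(centroids, X_unl)                # 2. apply to unlabeled instances
    for _ in range(max_rounds):
        X_all = np.vstack([X_lab, X_unl])
        y_all = np.concatenate([y_lab, pseudo])
        centroids = fit_centroids(X_all, y_all)       # 3. learn a new model on all instances
        new_pseudo = predict(centroids, X_unl)
        if np.array_equal(new_pseudo, pseudo):        # 4. stop once the labels no longer change
            break
        pseudo = new_pseudo
    return centroids, pseudo

rng = np.random.default_rng(0)
X_unl = np.vstack([rng.normal(-3, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
X_lab = np.array([[-3.0, -3.0], [3.0, 3.0]])
y_lab = np.array([0, 1])
centroids, pseudo = self_train(X_lab, y_lab, X_unl)
```

A common refinement is to add only the most confident pseudo-labels in each round, which limits the damage when early predictions are wrong.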

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption

Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But a non-convex optimization problem
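One standard way to write the semi-supervised SVM described above is the objective below. The slide gives no formula, so the notation is an assumption: L and U are the labeled and unlabeled index sets, C is the usual SVM penalty, and C* is the parameter that tunes the influence of unlabeled instances.

```latex
\min_{w,\,b}\;\; \tfrac{1}{2}\lVert w\rVert^{2}
  + C \sum_{i \in L} \max\bigl(0,\; 1 - y_i\,(w^{\top}x_i + b)\bigr)
  + C^{*} \sum_{j \in U} \max\bigl(0,\; 1 - \lvert w^{\top}x_j + b\rvert\bigr)
\quad\text{s.t.}\quad
\frac{1}{|U|}\sum_{j \in U}\operatorname{sign}\bigl(w^{\top}x_j + b\bigr)
  \;\approx\; \frac{1}{|L|}\sum_{i \in L} y_i
```

The first two terms are the ordinary linear SVM. The third term penalises unlabeled instances that fall inside the margin on either side, which pushes the boundary into low-density regions; the absolute value in that term is what makes the problem non-convex. The constraint is the class-balance condition from the slide.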

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
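One common way to run such a test on a conversion-rate metric is a two-proportion z-test. The sketch below assumes one binary outcome per user and large samples; the function name and the counts are made up for illustration, and other metrics may call for a different test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1000 users per variant, 100 vs. 150 conversions.
z_stat, p_val = two_proportion_z_test(100, 1000, 150, 1000)
z_same, p_same = two_proportion_z_test(100, 1000, 100, 1000)
```

A small p-value says the observed difference would be unlikely if the two variants really performed the same; it does not by itself say the difference is large enough to matter.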

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"

B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing," find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there
-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project
Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data for the Experiment

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Workflow stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Workflow stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
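As a rough illustration of reporting a metric with a confidence interval, the sketch below runs 5-fold cross-validation on a toy one-dimensional threshold classifier. The data, the model, and all function names are assumptions made for the example. The interval uses the t critical value for 4 degrees of freedom (2.776 at 95%) and treats fold scores as independent, which is only approximately true.

```python
import random
import statistics


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def fit_threshold(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def cross_validate(xs, ys, k=5):
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        # Fit on the training folds only, so the test fold stays unseen.
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in test)
        scores.append(correct / len(test))
    mean = statistics.mean(scores)
    # 2.776 is the 95% t critical value for k - 1 = 4 degrees of freedom.
    half_width = 2.776 * statistics.stdev(scores) / k ** 0.5
    return mean, half_width, scores


# Hypothetical data: 100 points, class 1 when x >= 5.
xs = [i / 10 for i in range(100)]
ys = [1 if x >= 5.0 else 0 for x in xs]
mean_acc, half_width, fold_scores = cross_validate(xs, ys)
```

The result would be reported as `mean_acc` plus or minus `half_width`. A score that looks too good on real data is a cue to check for a target leak before celebrating.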

Deriving Knowledge from Data at Scale

Workflow stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) to (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Workflow stages: Access Data, Pre-processing, Feature construction, Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Deriving Knowledge from Data at Scale

curated, completely specify a problem, measure progress

paired with a metric, target SLAs, scoreboard

Deriving Knowledge from Data at Scale

This isn't easy…

• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long lasting choices, and the rewards are delayed…

Deriving Knowledge from Data at Scale

enforce a few principles

1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold

Deriving Knowledge from Data at Scale

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Deriving Knowledge from Data at Scale

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
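To make checklist item 2 concrete, here is a small sketch of a metric that carries the error weighting itself, so the data can stay unsampled. The cost values (10 for a false negative, 1 for a false positive) and the function name are illustrative assumptions, not values from the lecture.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example; labels are 1 (positive) or 0 (negative)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost  # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost  # false alarm
    return total / len(y_true)


# One false negative and one false positive out of four examples.
example_cost = weighted_error([1, 0, 1, 0], [0, 1, 1, 0])
perfect_cost = weighted_error([1, 0, 1, 0], [1, 0, 1, 0])
```

Because the weights live in the metric code, the same function can score production, training, and test sets without any re-sampling.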

Deriving Knowledge from Data at Scale

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test set) must be provided.
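The single pass described in item 7 can be sketched as a frequency count that keeps the top N values and sends everything else to a shared bucket. The names `build_vocabulary` and `encode` and the tiny query list are hypothetical; in practice the count would run over the separate feature optimization set, not the training or test data.

```python
from collections import Counter


def build_vocabulary(values, top_n):
    """Map the top_n most frequent values to indices 0..top_n-1 (no labels needed)."""
    counts = Counter(values)
    return {value: idx for idx, (value, _) in enumerate(counts.most_common(top_n))}


def encode(value, vocabulary):
    """Frequent values get their own index; all others share one 'other' index."""
    return vocabulary.get(value, len(vocabulary))


# Hypothetical feature optimization set of query strings.
queries = ["a", "b", "a", "c", "a", "b"]
vocab = build_vocabulary(queries, top_n=2)
```

With `top_n=2`, "a" and "b" receive their own trainable slots and "c" falls into the shared bucket.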

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
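Option (c) under item 8 amounts to a sliding-window transformation. The sketch below shows one way to build such patterns from a plain list; the function name and the sample series are illustrative assumptions.

```python
def make_windows(series, window):
    """Build (inputs from t-T .. t-1, target at t) patterns with T = window."""
    return [
        (series[t - window:t], series[t])
        for t in range(window, len(series))
    ]


# Hypothetical daily metric; each pattern uses the two previous values.
patterns = make_windows([1, 2, 3, 4, 5], window=2)
```

Each pattern is larger than a single observation, which matches the trade-off noted in the checklist: bigger patterns in exchange for a problem that is closer to IID.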

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test set, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
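The grouping concern in item 10 can be handled by splitting on the entity rather than on the individual example. This is a minimal sketch with hypothetical writer ids; the function name and the 20% test fraction are assumptions for illustration.

```python
import random


def group_train_test_split(groups, test_fraction=0.2, seed=0):
    """Split example indices so that no group appears on both sides."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical data: two handwriting samples from each of five writers.
writers = ["w1", "w1", "w2", "w2", "w3", "w3", "w4", "w4", "w5", "w5"]
train_idx, test_idx = group_train_test_split(writers)
```

Every test example then comes from a writer the model never saw during training, which is the generalization the checklist asks the gold set to measure.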

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do?

• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book?

Training Data (Attributes: Age, Income; Target Label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (from the Model)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
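A minimal sketch of "cluster, then perform majority vote", assuming one-dimensional data and exactly two clusters. The tiny 2-means routine, the ages, and the handful of labels are all made up for illustration; a real pipeline would use a library clustering algorithm.

```python
from collections import Counter


def two_means_1d(points, n_iter=100):
    """Very small k-means for k = 2 on 1-D data, initialized at min and max."""
    centers = [min(points), max(points)]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        new_assignment = [
            0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1 for p in points
        ]
        for c in (0, 1):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
    return assignment


def cluster_then_vote(points, labels):
    """Give every point the majority label of the labeled points in its cluster."""
    assignment = two_means_1d(points)
    votes = {0: Counter(), 1: Counter()}
    for cluster, label in zip(assignment, labels):
        if label is not None:
            votes[cluster][label] += 1
    cluster_label = {
        c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()
    }
    return [cluster_label[c] for c in assignment]


# Hypothetical customer ages; only three customers rated the book.
ages = [20, 22, 24, 25, 26, 60, 62, 65, 66]
ratings = ["+", None, None, "+", None, "-", None, None, None]
predicted = cluster_then_vote(ages, ratings)
```

The unlabeled customers inherit the label of their cluster, which only works when the clustering assumption actually holds for the data.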

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization
• Two step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
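The two-step procedure can be sketched for the simplest case, a mixture of two one-dimensional Gaussians. Initialization at the minimum and maximum, the fixed iteration count, and the sample data are assumptions made for the example; the variance floor is a common safeguard rather than part of the method on the slide.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture."""
    means = [min(data), max(data)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed cluster
    return weights, means, variances, resp


# Hypothetical data drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mix_weights, mix_means, mix_vars, responsibilities = em_two_gaussians(data)
```

A different initialization can end in a different solution, which is the local optimum caveat from the slide; running from several starting points is a common workaround.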

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
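The four self-training steps can be sketched with a nearest-centroid classifier standing in for the model. The classifier choice, the function names, and the one-dimensional data are assumptions for illustration; any model that can be refit on pseudo-labeled data would fit the same loop.

```python
def fit_centroids(xs, ys):
    """Toy model: one centroid (mean) per class."""
    centroids = {}
    for label in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    centroids = fit_centroids(labeled_x, labeled_y)  # learn on labeled only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(centroids, x) for x in unlabeled_x]  # apply model
        if new_pseudo == pseudo:
            break  # pseudo-labels stopped changing: converged
        pseudo = new_pseudo
        # Learn a new model on labeled plus pseudo-labeled instances.
        centroids = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return centroids, pseudo


# Hypothetical data: two labeled points and six unlabeled ones.
final_centroids, pseudo_labels = self_train(
    [1.0, 9.0], ["-", "+"], [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
)
```

Early mistakes can reinforce themselves in this loop, so in practice it is often restricted to confident predictions; that refinement is not shown here.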

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
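A single-machine sketch of the stochastic gradient descent pass, under several simplifying assumptions that are not from the lecture: the classifier is sign(w . x) through the origin (data assumed centered), unlabeled points use the symmetric hinge loss max(0, 1 - |w . x|), and the class balance constraint is left out. The function names, parameter defaults, and data are illustrative.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def s3vm_sgd(points, labels, unlabeled_weight=0.5, reg=0.01, lr0=0.1, seed=0):
    """One SGD pass; labels[i] is +1, -1, or None for unlabeled."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # one run over the data in random order
    w = [0.0] * len(points[0])
    for step, i in enumerate(order, start=1):
        lr = lr0 / step ** 0.5  # steps get smaller over time
        x, y = points[i], labels[i]
        score = dot(w, x)
        w = [wi * (1 - lr * reg) for wi in w]  # regularization shrinks w
        if y is not None:
            if y * score < 1:  # misclassified or inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        elif 0 < abs(score) < 1:  # unlabeled point inside the margin
            side = 1 if score > 0 else -1
            w = [wi + lr * unlabeled_weight * side * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical data: one labeled point per class, the rest unlabeled.
positives = [(2, 1), (1, 2), (3, 2), (2, 3)]
negatives = [(-2, -1), (-1, -2), (-3, -2), (-2, -3)]
points = positives + negatives
labels = [1, None, None, None, -1, None, None, None]
w = s3vm_sgd(points, labels)
```

Because the objective is non-convex, different random orders can end in different classifiers. Calling `s3vm_sgd` with several `seed` values and keeping the best run mirrors the "many random runs" step, and each run is independent, which is what makes it easy to spread across reducers.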

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
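One way to build such a similarity graph is a k-nearest-neighbor graph with Gaussian edge weights. The sketch below assumes Euclidean distance and a hand-picked bandwidth; the function name, the parameter defaults, and the sample points are illustrative choices, not from the lecture.

```python
import math


def knn_graph(points, k=2, bandwidth=1.0):
    """Symmetric weight matrix connecting each point to its k nearest neighbors."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            sim = math.exp(-(d ** 2) / (2 * bandwidth ** 2))  # similarity score
            weights[i][j] = sim
            weights[j][i] = sim  # keep the graph symmetric
    return weights


# Hypothetical points on a line; the last one is far from the others.
graph = knn_graph([(0, 0), (1, 0), (2, 0), (10, 0)], k=1)
```

A weight matrix of this shape is the kind of input that graph-based methods such as label propagation operate on.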

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight:

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions...

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40-page journal version...

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project...

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments: to help you get started

Deriving Knowledge from Data at Scale

Experiment tools that you can use in your experiment: feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

[Figure: sample experiment, with panels "Getting Data for the Experiment", "Splitting to Training and Testing Datasets", "Using classification algorithms", "Evaluating the model"]

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
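A minimal sketch of step 5, assuming a toy one-dimensional threshold classifier and synthetic data (both are illustrative assumptions, not part of the lecture). It shows k-fold cross-validation reporting a mean accuracy together with a rough normal-approximation confidence interval; with only five folds a t-based interval would be more careful.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return mean accuracy and an approximate 95% confidence interval."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, (mean - half_width, mean + half_width)

# Hypothetical model: threshold halfway between the two class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
mean_acc, (lo, hi) = cross_validate(xs, ys, fit_threshold, predict_threshold)
print(f"accuracy {mean_acc:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

An accuracy far above what the problem plausibly allows is a hint to look for a target leak before celebrating.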

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2) to (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data | Pre-processing | Feature construction | Model scoring

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course...

Page 14: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

This isn't easy...

• Building high quality gold sets is a challenge

• It is time consuming

• It requires making difficult and long-lasting choices, and the rewards are delayed...

Deriving Knowledge from Data at Scale

Enforce a few principles:

1. Distribution parity

2. Testing blindness

3. Production parity

4. Single metric

5. Reproducibility

6. Experimentation velocity

7. Data is gold

Deriving Knowledge from Data at Scale

• Test set blindness

• Reproducibility and Data is gold

• Experimentation velocity

Deriving Knowledge from Data at Scale

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
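To illustrate item 2, here is a minimal sketch of a metric that carries the error weighting itself, so the data does not have to be re-sampled. The function name and the cost values (a false negative costing 10 times a false positive) are assumptions chosen for illustration only.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example; the weights live in the metric, not the sample."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:      # false negative
            total += fn_cost
        elif truth == 0 and pred == 1:    # false positive
            total += fp_cost
    return total / len(y_true)

# One false negative and one false positive out of four examples.
cost = weighted_error([1, 0, 1, 0], [0, 1, 1, 0])
print(cost)
```

Because the same function can be applied unchanged to train, test, and production samples, the resulting numbers stay directly comparable.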

Deriving Knowledge from Data at Scale

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
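One possible way to produce gold sets with 5X size ratios (item 4) is sketched below. Making each smaller set a prefix of the next larger one, and the starting size of 100, are assumptions of this sketch rather than something the checklist prescribes.

```python
import random

def nested_gold_sets(examples, smallest=100, ratio=5, seed=0):
    """Shuffle once, then cut prefixes of size smallest, smallest*ratio, ..."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the sets reproducible
    sets, size = [], smallest
    while size <= len(shuffled):
        sets.append(shuffled[:size])
        size *= ratio
    return sets

gold_sets = nested_gold_sets(range(3000))
print([len(s) for s in gold_sets])
```

The smallest set supports fast iteration on a desktop, while the larger ones trade velocity for representativeness.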

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data and to investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
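The pass described in item 7 can be sketched as a frequency count that keeps the top-k values and maps everything else to a shared bucket. The function names and the shared "other" bucket are illustrative assumptions; the values would come from the separate feature optimization set, not from the training or test set.

```python
from collections import Counter

def top_k_vocabulary(values, k):
    """Map the k most frequent values to indices 0..k-1 (no labels needed)."""
    counts = Counter(values)
    return {value: index for index, (value, _) in enumerate(counts.most_common(k))}

def encode(value, vocabulary):
    """Frequent values get their own index; the rest share one extra index."""
    return vocabulary.get(value, len(vocabulary))

# Toy stand-in for a stream of query strings or listing ids.
vocab = top_k_vocabulary(["a", "b", "a", "c", "a", "b"], k=2)
print(vocab, encode("c", vocab))
```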

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases, we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
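Option (c) of item 8 can be sketched as a sliding-window transformation: each pattern becomes the previous T observations paired with the value at time t. The function name and the parameter `history` (standing for T) are assumptions for illustration.

```python
def sliding_windows(series, history):
    """Turn a series into (inputs from t-T .. t-1, target at t) patterns."""
    return [(series[t - history:t], series[t]) for t in range(history, len(series))]

patterns = sliding_windows([1, 2, 3, 4, 5], history=2)
print(patterns)
```

Each pattern is larger than a single observation, which is the price paid for a problem that is closer to IID.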

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
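The grouping question in item 10 can be made concrete with a split that assigns whole groups (writers, advertisers, users) to either training or test. This is a minimal sketch; the function name, the `group_of` callback, and the toy data are assumptions for illustration.

```python
import random

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that no group contributes examples to both train and test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Toy data: (writer id, word index), five writers with three words each.
examples = [(writer, word) for writer in "abcde" for word in range(3)]
train, test = group_split(examples, group_of=lambda e: e[0])
print(len(train), len(test))
```

Shuffling individual words instead would leak each writer's style into the test set and overstate generalization.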

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data...

What to Do

• Controlled Experiments: get feedback from users to serve as labels

• Mechanical Turk: pay people to label data to build a training set

• Ask Users to Label Data: report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target Label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Model Prediction

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
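The "cluster, then perform majority vote" idea can be sketched in a few lines. This is an illustrative sketch only: it uses a small hand-written k-means with a deterministic farthest-point initialisation (an assumption made here for reproducibility) and toy two-dimensional points.

```python
import math
from collections import Counter

def k_means(points, k, iterations=20):
    """Plain k-means; returns the cluster index of every point."""
    centroids = [points[0]]
    while len(centroids) < k:   # farthest-point initialisation
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label, or None when the instance is unlabeled."""
    assignment = k_means(points, k)
    votes = {j: Counter() for j in range(k)}
    for a, label in zip(assignment, labels):
        if label is not None:
            votes[a][label] += 1
    winner = {j: (v.most_common(1)[0][0] if v else None) for j, v in votes.items()}
    return [winner[a] for a in assignment]

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (5, 5), (5.1, 4.9), (4.8, 5.2)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_vote(points, labels)
print(predicted)
```

A cluster that contains no labeled instance gets `None`, which is a reminder that the approach only works when the clustering assumption actually holds.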

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum
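A minimal sketch of the two-step procedure for a one-dimensional mixture of two Gaussians follows. As an assumption of this sketch, labeled instances keep a fixed assignment to their class while unlabeled instances receive soft assignment probabilities; the data is a toy example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def semi_supervised_em(labeled, unlabeled, iterations=50):
    """labeled: list of (x, cls) with cls in {0, 1}; unlabeled: list of x."""
    means, variances, weights = [], [1.0, 1.0], [0.5, 0.5]
    for cls in (0, 1):   # initialise each component from its labeled points
        xs = [x for x, c in labeled if c == cls]
        means.append(sum(xs) / len(xs))
    xs_all = [x for x, _ in labeled] + list(unlabeled)
    for _ in range(iterations):
        # E-step: assignment probabilities (fixed for labeled instances)
        resp = [[1.0 if c == k else 0.0 for k in (0, 1)] for _, c in labeled]
        for x in unlabeled:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate location, shape, and weight of each component
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs_all)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs_all)) / nk
            variances[k] = max(1e-6, spread)
            weights[k] = nk / len(xs_all)
    return means, variances, weights

def classify(x, means, variances, weights):
    scores = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in (0, 1)]
    return scores.index(max(scores))

labeled = [(0.0, 0), (10.0, 1)]
unlabeled = [-1.0, 0.5, 1.0, -0.5, 9.0, 10.5, 11.0, 9.5]
means, variances, weights = semi_supervised_em(labeled, unlabeled)
print(means, classify(0.3, means, variances, weights))
```

The result depends on the initialisation, which is how the "local optimum" caveat shows up in practice.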

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
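The four self-training steps can be sketched as a short loop. The base learner here is a one-dimensional nearest-centroid classifier, chosen only to keep the example self-contained; any classifier could take its place, and convergence is taken to mean that the pseudo-labels stop changing (an assumption of this sketch).

```python
def fit_centroids(xs, ys):
    """Nearest-centroid model: one mean per class."""
    centroids = {}
    for cls in set(ys):
        members = [x for x, y in zip(xs, ys) if y == cls]
        centroids[cls] = sum(members) / len(members)
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda cls: abs(x - centroids[cls]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)              # learn on labeled only
    previous = None
    for _ in range(max_rounds):
        pseudo = [predict(model, x) for x in unlabeled_x]    # apply to unlabeled
        if pseudo == previous:                               # converged
            break
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)  # learn on all
        previous = pseudo
    return model

model = self_train([1.0, 9.0], ["-", "+"], [0.0, 2.0, 3.0, 7.0, 8.0, 10.0])
print(model, predict(model, 4.0))
```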

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
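A single-machine sketch of the stochastic gradient descent idea is given below; it is not the Hadoop implementation from the slide. Labeled instances use the hinge loss, and unlabeled instances that fall inside the margin are pushed out using the current prediction as a tentative label. The warm-up pass over labeled data, the step-size schedule, and the parameter names are assumptions of this sketch.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm(labeled, unlabeled, unlabeled_weight=0.1, reg=0.01, seed=0):
    """labeled: list of (features, y) with y in {-1, +1}; unlabeled: list of features."""
    dim = len(labeled[0][0])
    w, b, step = [0.0] * dim, 0.0, 0

    def update(x, y, strength):
        nonlocal w, b, step
        step += 1
        rate = 0.5 / math.sqrt(step)                  # steps get smaller over time
        margin = y * (dot(w, x) + b)
        w = [wi * (1.0 - rate * reg) for wi in w]     # L2 shrinkage
        if margin < 1.0:                              # misclassified or inside the margin
            w = [wi + rate * strength * y * xi for wi, xi in zip(w, x)]
            b += rate * strength * y

    for x, y in labeled:                              # warm-up on labeled data only
        update(x, y, 1.0)
    mixed = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(mixed)                # one run in random order
    for x, y in mixed:
        if y is None:
            tentative = 1 if dot(w, x) + b >= 0 else -1
            update(x, tentative, unlabeled_weight)
        else:
            update(x, y, 1.0)
    return w, b

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(3.0, 3.0), (2.5, 2.5), (-3.0, -3.0), (-2.5, -2.5)]
w, b = sgd_s3vm(labeled, unlabeled)
print(w, b)
```

Because the objective is non-convex, different random orders can end in different classifiers, which is why the slide suggests many random runs and keeping the best one.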

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create network where the nearest neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
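The propagation loop can be sketched on a small nearest-neighbor graph. This is an illustrative iterative version for two classes; the graph construction (k nearest neighbors, symmetrised), the clamping of known labels, and the fixed iteration count are assumptions of this sketch. The closed-form solution mentioned on the slide would replace the loop with a linear solve.

```python
import math

def label_propagation(points, labels, k=2, iterations=100):
    """labels[i] is 0, 1, or None for unlabeled. Returns a class per point."""
    n = len(points)
    # Similarity graph: connect each point to its k nearest neighbors.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            neighbours[i].add(j)
            neighbours[j].add(i)
    # score[i] is the current belief that point i belongs to class 1.
    score = [0.5 if label is None else float(label) for label in labels]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(float(labels[i]))      # known labels stay clamped
            else:
                updated.append(sum(score[j] for j in neighbours[i]) / len(neighbours[i]))
        score = updated
    return [1 if s > 0.5 else 0 for s in score]

points = [(0,), (1,), (2,), (10,), (11,), (12,)]
labels = [0, None, None, None, None, 1]
propagated = label_propagation(points, labels)
print(propagated)
```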

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break...

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

• Lesson 3: Prepare to be humbled

Left Elevator | Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
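One common statistical test for comparing conversion rates between control (A) and treatment (B) is the two-proportion z-test, sketched below. The lecture does not say which test to use, so this choice, the function name, and the example counts are assumptions for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return z, p_value

# Hypothetical counts: 100 of 1000 users convert in A, 150 of 1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; running the same test on an A/A split is a quick sanity check of the experimentation system.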

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data, prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake.

If something is "amazing", find the flaw.

Examples:

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01.

• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut.

• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why).

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight:

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions...

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40-page journal version...

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project...

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
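Step 5 can be sketched as k-fold cross-validation that reports a mean score with a rough interval. This is a minimal sketch under assumptions: the threshold model, the toy data, and the normal-approximation interval (1.96 standard errors over the fold scores) are illustrative choices, not the course's method.

```python
import statistics

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def fit_threshold(xs, ys):
    """Hypothetical toy model: threshold halfway between the two class means.
    Assumes both classes occur in the training fold."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == (y == 1) for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k=5):
    """Return (mean score, approximate 95% half-width, per-fold scores)."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(model, [xs[i] for i in test], [ys[i] for i in test]))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width, scores

# Toy, perfectly separable data; real data should be shuffled (or grouped) first.
xs = [float(i) for i in range(20)]
ys = [1 if x >= 10 else 0 for x in xs]
mean, half_width, scores = cross_validate(xs, ys, k=5)
print(f"accuracy = {mean:.2f} +/- {half_width:.2f}")
```

A score this clean on real data would be a reason to look for a target leak rather than to celebrate.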

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data | Pre-processing | Feature construction | Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That’s all for our course…

Deriving Knowledge from Data at Scale

enforce a few principles

1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold

Deriving Knowledge from Data at Scale

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Deriving Knowledge from Data at Scale

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can’t optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick’s result, it becomes easier to improve on the latest “production” yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
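Item 2 (put the weighting in the metric, not in the sampling) can be sketched as a cost-weighted error. The function name `weighted_error` and the cost values are assumptions for illustration; the point is only that the costs live in one documented place and the same function scores production, train, and test sets.

```python
def weighted_error(y_true, y_pred, cost):
    """Average cost per example; cost maps (true, predicted) -> penalty.
    Correct predictions cost nothing."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(cost[(t, p)] for t, p in zip(y_true, y_pred) if t != p)
    return total / len(y_true)

# Hypothetical costs: a false negative is 10x worse than a false positive.
COST = {(1, 0): 10.0, (0, 1): 1.0}

score = weighted_error([1, 1, 0, 0], [0, 1, 1, 0], COST)
print(score)  # 2.75: one false negative (10) plus one false positive (1), over 4 examples
```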

Deriving Knowledge from Data at Scale

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
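The "top 10M instances" pass in item 7 might look like the sketch below. The helper names and the tiny `top_k` are assumptions for illustration; in practice the counting pass would run over the separate, disjoint data set the item calls for.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """One pass over unlabeled values: keep only the top_k most frequent ones."""
    counts = Counter(values)
    return {value: index for index, (value, _) in enumerate(counts.most_common(top_k))}

def encode(value, vocabulary):
    """Frequent values get their own index; everything else shares one bucket."""
    return vocabulary.get(value, len(vocabulary))

# Hypothetical query log, standing in for IP addresses, queries, or listing ids.
queries = ["shoes", "hats", "shoes", "socks", "shoes", "hats"]
vocabulary = build_vocabulary(queries, top_k=2)
print(vocabulary)                   # {'shoes': 0, 'hats': 1}
print(encode("socks", vocabulary))  # 2 (the shared "rare" bucket)
```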

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
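Option 8c (recasting as a time series problem) can be sketched as a sliding window. The function name `make_windows` is an assumption; each pattern pairs the inputs from t-T to t-1 with the value to predict at time t.

```python
def make_windows(series, T):
    """Return (inputs, target) pairs: inputs are series[t-T:t], target is series[t]."""
    if T < 1:
        raise ValueError("T must be at least 1")
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

daily_kpi = [1, 2, 3, 4, 5]  # hypothetical KPI readings
patterns = make_windows(daily_kpi, T=2)
print(patterns)  # [([1, 2], 3), ([2, 3], 4), ([3, 4], 5)]
```

Each pattern is larger than a single observation, which is the trade-off the item describes.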

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
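The grouping question in item 10 can be sketched as a split that assigns whole entities, not individual examples, to train or test. The helper names and the hashing scheme are assumptions for illustration; hashing the group id keeps the assignment stable across runs.

```python
import hashlib

def assign_split(group_id, test_fraction=0.2):
    """Map a group id (e.g. a writer) to 'train' or 'test' deterministically."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

def grouped_split(examples, group_key, test_fraction=0.2):
    """Split examples so that every group lands entirely on one side."""
    train, test = [], []
    for example in examples:
        side = assign_split(str(example[group_key]), test_fraction)
        (test if side == "test" else train).append(example)
    return train, test

# Hypothetical handwriting set: 10 writers, 10 words each.
examples = [{"writer": f"writer-{i % 10}", "word_id": i} for i in range(100)]
train, test = grouped_split(examples, "writer", test_fraction=0.3)
shared = {e["writer"] for e in train} & {e["writer"] for e in test}
print(len(train), len(test), shared)
```

With few groups the realized test share can differ noticeably from `test_fraction`; what the split guarantees is that no writer appears on both sides.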

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender   age   smoker   eye color   lung cancer
male     19    yes      green       no
female   44    yes      gray        yes
male     49    yes      blue        yes
male     12    no       brown       no
female   37    no       brown       no
female   60    no       brown       yes
male     44    no       blue        no
female   27    yes      brown       no
female   51    yes      green       yes
female   81    yes      gray        no
male     22    yes      brown       no
male     29    no       blue        no

gender   age   smoker   eye color   lung cancer
male     77    yes      gray        yes
male     19    yes      green       no
female   44    no       gray        no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments – get feedback from user to serve as labels
• Mechanical Turk – pay people to label data to build training set
• Ask Users to Label Data – report as spam, ‘hot or not’, review a product, observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can’t get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target/Label: LikesBook)

Age   Income   LikesBook
24    60K      +
65    80K      -
60    95K      -
35    52K      +
20    45K      +
43    75K      +
26    51K      +
52    47K      -
47    38K      -
25    22K      -
33    47K      +

Test Data

Age   Income   LikesBook
22    67K
39    41K

Model → Prediction

Age   Income   LikesBook
22    67K      +
39    41K      -

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
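"Cluster, then perform majority vote" can be sketched with a small k-means. This is a minimal sketch under assumptions: the initial centers are passed in by hand, the points are made up, and the helper names are not from the slides.

```python
from collections import Counter

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])))

def kmeans(points, centers, iterations=20):
    """Plain k-means from the given initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def cluster_then_label(labeled, unlabeled, initial_centers):
    """Cluster all points, then give each cluster its majority label."""
    points = [x for x, _ in labeled] + list(unlabeled)
    centers = kmeans(points, initial_centers)
    votes = [Counter() for _ in centers]
    for x, y in labeled:
        votes[nearest(x, centers)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[nearest(x, centers)] for x in unlabeled]

labeled = [((1.0, 1.0), "+"), ((9.0, 9.0), "-")]
unlabeled = [(1.5, 0.5), (0.5, 1.2), (8.5, 9.5), (9.2, 8.8)]
predicted = cluster_then_label(labeled, unlabeled, [(0.0, 0.0), (10.0, 10.0)])
print(predicted)  # ['+', '+', '-', '-']
```

A cluster that contains no labeled point gets `None`, which is where the clustering assumption stops helping.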

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
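The two EM steps can be sketched for a one-dimensional mixture of Gaussians. The data, starting values, and the variance floor are assumptions for illustration; a different start can end in a different local optimum, as the slide warns.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns the fitted parameters and the
    cluster assignment probabilities (responsibilities) for each instance."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: probability that each instance belongs to each cluster.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, location, and shape of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)  # floor keeps the density well defined
    return means, variances, weights, resp

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in means])
```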

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
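The four self-training steps can be sketched with a nearest-centroid classifier on one feature. The classifier choice, the data, and the stopping rule (pseudo-labels stop changing) are assumptions for illustration.

```python
def fit_centroids(xs, ys):
    """Learn one centroid (mean feature value) per label."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(centroids, x):
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=10):
    centroids = fit_centroids(labeled_x, labeled_y)        # 1. labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(centroids, x) for x in unlabeled_x]   # 2. apply model
        if new_pseudo == pseudo:                           # 4. stop once labels are stable
            break
        pseudo = new_pseudo
        centroids = fit_centroids(labeled_x + unlabeled_x,          # 3. learn on all
                                  labeled_y + pseudo)
    return centroids, pseudo

centroids, pseudo = self_train([0.0, 10.0], ["-", "+"], [1.0, 2.0, 3.0, 6.0, 8.0, 9.0])
print(centroids, pseudo)
```

Here the centroids move from 0 and 10 to 1.5 and 8.25 after one round, and the pseudo-labels no longer change.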

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear
• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Maximize margin (distance of the boundary) to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM
• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
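A single-machine sketch of one such SGD pass is below. It is not the Hadoop implementation from the slides: the loss choices (hinge loss for labeled instances, a "hat" loss max(0, 1 - |score|) for unlabeled ones), the missing bias term (data assumed centered), the omitted class-balance constraint, and all parameter values are assumptions for illustration.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd_pass(labeled, unlabeled, dim, lr0=0.1, lam=0.01, c_unlabeled=0.5, seed=0):
    """One run over the data in random order; returns the weight vector w."""
    rng = random.Random(seed)
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w = [0.0] * dim
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1.0 + t)                        # steps get smaller over time
        w = [wi * (1.0 - lr * lam) for wi in w]     # L2 shrinkage (margin term)
        score = dot(w, x)
        if y is None:
            # Unlabeled: push the point away from the boundary, on the side it is on.
            if score != 0.0 and abs(score) < 1.0:
                side = 1.0 if score > 0 else -1.0
                w = [wi + lr * c_unlabeled * side * xi for wi, xi in zip(w, x)]
        elif y * score < 1.0:
            # Labeled: hinge-loss update for misclassified or low-margin points.
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
w = s3vm_sgd_pass(labeled, unlabeled, dim=2)
print([1 if dot(w, x) > 0 else -1 for x in unlabeled])
```

Because the problem is non-convex, the result depends on the visiting order; trying several seeds and keeping the best run mirrors the "many random runs" bullet.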

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
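The propagation loop can be sketched on a small similarity graph. This is a minimal sketch under assumptions: an unweighted graph given as adjacency lists, labels encoded as +1 and -1, labeled nodes clamped, and a fixed iteration count instead of a convergence check.

```python
def propagate_labels(neighbors, seeds, iterations=200):
    """neighbors: node -> list of adjacent nodes; seeds: node -> +1.0 or -1.0.
    Unlabeled nodes repeatedly take the average score of their neighbors."""
    scores = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = seeds[node]          # labeled nodes stay fixed
            elif nbrs:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
            else:
                updated[node] = scores[node]         # isolated node: nothing to propagate
        scores = updated
    return scores

# Hypothetical chain 0 - 1 - 2 - 3 - 4 with the two ends labeled.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = propagate_labels(chain, {0: 1.0, 4: -1.0})
print(scores)
```

The fixed point solves a linear system in the unlabeled scores, which is the sense in which the iteration is equivalent to a matrix inversion.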

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A
• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator | Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
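One common way to "confirm differences are not due to chance" is a two-proportion z-test on conversion counts. The slides do not name a specific test, so the choice of test, the function name, and the counts below are assumptions for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.
    Returns (z, p_value) using the pooled-variance normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical experiment: 1000 users per arm, 100 vs 150 conversions.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the difference is unlikely under "no real difference"; the causal reading still rests on users having been assigned to A and B at random.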

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same

A | B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A | B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”. B has a big search button.

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A | B

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw

Examples

• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you’re the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Deriving Knowledge from Data at Scale

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…

1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).

2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train and test set results remain directly comparable (automatic testing).

3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
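To make item 2 concrete, here is a minimal sketch of putting the error weighting inside the metric instead of re-sampling the data. The cost values and the two example models are hypothetical.

```python
def weighted_error(y_true, y_pred, cost):
    """Average cost per example; a correct prediction costs nothing."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth != pred:
            total += cost[(truth, pred)]
    return total / len(y_true)

# Hypothetical costs: a false negative is 10x worse than a false positive.
COSTS = {(1, 0): 10.0, (0, 1): 1.0}

y_true  = [1, 0, 0, 1, 0, 0, 0, 1]
model_a = [1, 0, 0, 0, 0, 0, 0, 1]  # one false negative
model_b = [1, 1, 1, 1, 0, 0, 0, 1]  # two false positives

print(weighted_error(y_true, model_a, COSTS))  # 1.25
print(weighted_error(y_true, model_b, COSTS))  # 0.25
```

Model A makes fewer mistakes but scores worse, and the reason is written down in the metric itself, so the same function can score production, train and test sets.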

Deriving Knowledge from Data at Scale

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.

5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
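One possible way to get gold sets at 5X size ratios from the same distribution is to take prefixes of a single shuffle. This is a sketch under that assumption; the nesting and the sizes are illustrative choices, not something the checklist prescribes.

```python
import random

def nested_gold_sets(examples, smallest, ratio=5, seed=0):
    """Return samples of size smallest, smallest*ratio, ... up to len(examples)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    sets, size = [], smallest
    while size <= len(shuffled):
        sets.append(shuffled[:size])  # each set is a prefix of the same shuffle
        size *= ratio
    return sets

gold = nested_gold_sets(range(10_000), smallest=80)
print([len(s) for s in gold])  # [80, 400, 2000, 10000]
```

The smallest set is the one to keep downloadable for fast iteration; the larger ones trade velocity for representativeness.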

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2 and possibly 3 types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g. contains all information, possibly through tables). This allows contributors to create features from the raw data and to investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
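The "top instances" pass in item 7 can be sketched as a frequency count over a separate, unlabeled feature optimization set. The names and the tiny top_k below are hypothetical; the deck's example uses 10M.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to ids 1..top_k; id 0 is a shared bucket."""
    counts = Counter(values)
    return {v: i + 1 for i, (v, _) in enumerate(counts.most_common(top_k))}

def encode(value, vocabulary):
    """Rare or unseen values fall back to the shared bucket 0."""
    return vocabulary.get(value, 0)

# Hypothetical feature optimization set, disjoint from train and test.
feature_opt_set = ["q1", "q2", "q1", "q3", "q1", "q2", "q4"]
vocab = build_vocabulary(feature_opt_set, top_k=2)
print(vocab)                                           # {'q1': 1, 'q2': 2}
print([encode(q, vocab) for q in ["q2", "q9", "q1"]])  # [2, 0, 1]
```

No labels are touched here, which is why this step can use its own data set without leaking test information.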

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
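Option (c) above can be sketched as a sliding window over the series. The function name and the click counts are made up for illustration.

```python
def make_windows(series, T):
    """Turn a series into (inputs from t-T to t-1, target at t) patterns."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

daily_clicks = [10, 12, 13, 15, 14, 18, 21]  # hypothetical measurements
patterns = make_windows(daily_clicks, T=3)
for inputs, target in patterns:
    print(inputs, "->", target)
```

Each pattern is larger than a single observation, but consecutive patterns now carry their own recent history, which is what brings the problem closer to IID.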

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
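The grouping constraint in item 10 can be sketched as a split over entities instead of over examples. The writer/word data and the helper name below are hypothetical.

```python
import random

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group (e.g. writer) lands entirely in train or in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting set: 10 writers, 5 words each.
words = [(f"writer{w}", f"word{i}") for w in range(10) for i in range(5)]
train, test = group_split(words, group_of=lambda e: e[0])

train_writers = {w for w, _ in train}
test_writers = {w for w, _ in test}
print(len(train), len(test), train_writers & test_writers)  # 40 10 set()
```

The empty intersection is the point: the test score now measures generalization to unseen writers.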

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

Training data:

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

New data and model output:

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: lack of labelled training data…

What to Do

• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, "hot or not", review a product; or observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

Training Data:

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data:

Age  Income  LikesBook
22   67K
39   41K

Prediction:

Age  Income  LikesBook
22   67K     +
39   41K     -

[Diagram labels: Training Data, Model, Test Data, Prediction, Attributes, Target, Label]

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

… but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
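"Cluster, then perform majority vote" can be sketched as follows. This is a toy one-dimensional k-means with hypothetical ages and labels, not the implementation used in the lecture.

```python
import random
from collections import Counter

def kmeans_1d(points, k, iters=20, seed=0):
    """Plain k-means on scalars; returns the cluster index of every point."""
    centers = random.Random(seed).sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

def cluster_then_label(points, labels, k=2):
    """labels[i] is a class or None; unlabeled points take their cluster's majority class."""
    assign = kmeans_1d(points, k)
    votes = {c: Counter() for c in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    majority = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [y if y is not None else majority[a] for a, y in zip(assign, labels)]

# Hypothetical customers: only four of ten have rated the book.
ages   = [20, 22, 24, 25, 26, 60, 62, 65, 67, 70]
labels = ["+", None, None, "+", None, "-", None, None, None, "-"]
completed = cluster_then_label(ages, labels)
print(completed)
```

This only works when the clustering assumption holds, i.e. the classes really do form separate clusters.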

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
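The two EM steps can be sketched for a two-component, one-dimensional mixture. The initialization, the variance floor and the synthetic data are assumptions made for this example.

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, stds)."""
    means = [min(xs), max(xs)]               # crude initialization (an assumption)
    spread = (max(xs) - min(xs)) / 4 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate location and shape of each cluster
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return weights, means, stds

# Synthetic data: two clusters centered at 0 and 6.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(6, 1) for _ in range(200)]
weights, means, stds = em_two_gaussians(xs)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

With a poor initialization the same loop can settle in a local optimum, which is why several restarts are common in practice.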

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
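The four self-training steps can be sketched with any base learner. The nearest-centroid model and the ages below are hypothetical stand-ins.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid model: the mean of each class."""
    model = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        model[label] = sum(members) / len(members)
    return model

def predict(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)             # learn on labeled only
    previous = None
    for _ in range(max_rounds):
        guessed = [predict(model, x) for x in unlabeled_x]  # apply to unlabeled
        if guessed == previous:                             # converged
            break
        model = fit_centroids(labeled_x + unlabeled_x,      # learn on all instances
                              labeled_y + guessed)
        previous = guessed
    return model

# Hypothetical data: two rated customers, seven unrated ones (ages).
model = self_train([24, 65], ["+", "-"], [20, 26, 35, 33, 60, 52, 47])
print(model, predict(model, 39))
```

Self-training can reinforce its own early mistakes, so it is worth checking the result against a held-out labeled set.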

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear
• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM
• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
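A single-machine sketch of the idea, without Hadoop: one shuffled pass, hinge-style updates, a decaying step size, and several random runs. The update rule for unlabeled points (trust the current side of the boundary, with a smaller weight) and all constants are assumptions, and the class balance constraint is left out.

```python
import random

def sgd_pass(labeled, unlabeled, seed, lr0=0.5, unlabeled_weight=0.3):
    """One run over the data in random order; returns [w1, w2, bias]."""
    rng = random.Random(seed)
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w = [0.0, 0.0, 0.0]
    for t, (x, y) in enumerate(data):
        score = w[0] * x[0] + w[1] * x[1] + w[2]
        step = lr0 / (1 + 0.1 * t)                  # steps get smaller over time
        if y is None:                               # unlabeled: use the current side
            y, step = (1 if score >= 0 else -1), step * unlabeled_weight
        if y * score < 1:                           # misclassified or inside the margin
            w = [w[0] + step * y * x[0], w[1] + step * y * x[1], w[2] + step * y]
    return w

def labeled_errors(w, labeled):
    return sum(1 for x, y in labeled if y * (w[0] * x[0] + w[1] * x[1] + w[2]) <= 0)

def best_of_runs(labeled, unlabeled, runs=20):
    """Many random runs; keep the one with the fewest errors on labeled data."""
    candidates = [sgd_pass(labeled, unlabeled, seed) for seed in range(runs)]
    return min(candidates, key=lambda w: labeled_errors(w, labeled))

# Hypothetical 2-D data: labels are +1 / -1.
labeled = [((2, 2), 1), ((3, 1), 1), ((-2, -2), -1), ((-1, -3), -1)]
unlabeled = [(2.5, 1.5), (1.5, 2.5), (-2.5, -1.5), (-1.5, -2.5)]
w_best = best_of_runs(labeled, unlabeled)
print(w_best, labeled_errors(w_best, labeled))
```

In a MapReduce setting, the shuffle would correspond to the mapper and the update loop to the reducer.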

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
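A similarity graph of this kind can be sketched as a k-nearest-neighbour graph. Euclidean distance as the similarity score and the six sample points are assumptions for the example.

```python
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours; returns symmetric adjacency sets."""
    neighbours = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            neighbours[i].add(j)
            neighbours[j].add(i)   # keep the edge in both directions
    return neighbours

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
print(graph)
```

Here the graph falls into two disconnected groups, so label information would never cross from one group to the other.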

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
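A minimal sketch of label propagation on a small graph: labeled nodes are clamped to +1 or -1 and every other node repeatedly takes the average of its neighbours. The chain graph is a made-up example, and the code assumes every node has at least one neighbour.

```python
def label_propagation(neighbours, seeds, iters=100, tol=1e-6):
    """neighbours: node -> set of nodes; seeds: node -> +1.0 or -1.0.
    Returns a score per node; the sign is the predicted class."""
    scores = {n: seeds.get(n, 0.0) for n in neighbours}
    for _ in range(iters):
        new = {}
        for n, nbrs in neighbours.items():
            if n in seeds:
                new[n] = seeds[n]                               # clamp known labels
            else:
                new[n] = sum(scores[m] for m in nbrs) / len(nbrs)
        change = max(abs(new[n] - scores[n]) for n in neighbours)
        scores = new
        if change < tol:                                        # converged
            break
    return scores

# Chain 0-1-2-3-4: node 0 is labeled "+", node 4 is labeled "-".
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
scores = label_propagation(chain, {0: 1.0, 4: -1.0})
print(scores)
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.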

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A
• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

[Image labels: Left Elevator, Right Elevator]

Deriving Knowledge from Data at Scale

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e. the changes in metrics are caused by changes introduced in the treatment(s)
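One common statistical test for an A/B comparison of conversion rates is the two-proportion z-test. The sketch below assumes large samples, and the counts are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Hypothetical experiment: control A vs treatment B, 10,000 users each.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the difference is unlikely to be chance alone; it says nothing about whether the metric was the right one to measure.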

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same

[Images: A, B]

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

[Images: A, B]

Differences: A has a taller search box (overall size is the same), a magnifying glass icon and "popular searches". B has a big search button.

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

[Images: A, B]

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw!

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

[Diagram: Hubris, Measure and Control, Accept Results (avoid Semmelweis Reflex), Fundamental Understanding]

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

[Screenshot callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Page 17: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip

1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)

2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)

3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen

Deriving Knowledge from Data at Scale

4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation

5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set

Deriving Knowledge from Data at Scale

6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets

a One set should have the deployed features (computed from the raw data) This provides the production yardstick

b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns

c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set

7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided

8. Stale rate, optimization, monitoring. How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing-distribution problem:

a. Assume the distribution is IID. Regularly recompute training sets and gold sets. Determine the frequency of recomputation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model into "distribution (fast) tracking parameters" and slow tracking parameters. The fast-tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
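
Option (c) above, recasting the data as windowed patterns, might look like the sketch below. The function name `make_windows` and the KPI values are hypothetical; the only thing taken from the slide is the pattern shape (inputs from t-T to t-1, target at time t).

```python
def make_windows(series, T):
    """Build (inputs from t-T..t-1, target at t) patterns from a sequence."""
    patterns = []
    for t in range(T, len(series)):
        patterns.append((series[t - T:t], series[t]))
    return patterns

daily_kpi = [10, 12, 11, 15, 14, 18]  # hypothetical daily metric
windows = make_windows(daily_kpi, T=3)
```

Each pattern is T times larger than a single observation, which is the cost the slide mentions in exchange for a problem that is closer to IID.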

10. Grouping. Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
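
A group-wise split along the lines of item 10 can be sketched as below. The function `group_split`, the writer names, and the words are illustrative assumptions; the point is only that whole groups (writers) go to one side of the split.

```python
import random

def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical (writer, word) handwriting examples.
examples = [("alice", "cat"), ("alice", "dog"), ("bob", "cat"),
            ("bob", "sun"), ("carol", "dog"), ("dave", "sun")]
train, test = group_split(examples, group_of=lambda e: e[0])
```

The same word may appear on both sides, which is fine; what must not happen is the same writer appearing on both sides.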

11. Sampling production data. What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.

12. Unlabeled set. If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

The greatest challenge in Machine Learning: Lack of Labelled Training Data...

What to Do

• Controlled Experiments - get feedback from users to serve as labels
• Mechanical Turk - pay people to label data to build a training set
• Ask Users to Label Data - report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (Model output; LikesBook is the predicted Label)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.


Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
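
The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is an illustrative toy, assuming plain k-means with the first k points as initial centers and made-up 2-D points; none of these choices come from the slides.

```python
from collections import Counter

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; initial centers are the first k points."""
    centers = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: squared_distance(p, centers[c]))
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label or None; every point gets its cluster's majority label."""
    assignment = kmeans(points, k)
    votes = {}
    for c in range(k):
        known = [l for l, a in zip(labels, assignment) if a == c and l is not None]
        votes[c] = Counter(known).most_common(1)[0][0] if known else None
    return [votes[a] for a in assignment]

# Hypothetical data: only the first two points carry labels.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.9, 1.1), (8.8, 9.2), (9.1, 8.9)]
labels = ["+", "-", None, None, None, None]
predicted = cluster_then_vote(points, labels)
```

A cluster with no labeled member gets `None`, which makes visible the cases where the clustering assumption gives no answer.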


Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
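
The two EM steps can be made concrete with a one-dimensional, two-component mixture. This is a minimal sketch under stated assumptions: the min/max initialization, unit starting variances, the variance floor, and the sample values are all choices made for the example.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(xs), max(xs)]      # crude initialization (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: cluster assignment probabilities for each instance
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate location, shape, and weight of each cluster
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # floor avoids a collapsing component
            weights[k] = nk / len(xs)
    return means, variances, weights, resp

xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]   # hypothetical sample
means, variances, weights, resp = em_two_gaussians(xs)
```

A different initialization can end in a different solution, which is the local-optimum caveat on the slide; in practice one would restart from several initial values.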


Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
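
The four self-training steps can be sketched with any base learner. The example below assumes a nearest-centroid classifier purely for brevity (the slides do not name a base model), and the points are made up.

```python
def fit_centroids(points, labels):
    """Nearest-centroid model: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def predict(centroids, point):
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

def self_train(labeled, labels, unlabeled, max_rounds=20):
    model = fit_centroids(labeled, labels)                  # 1. labeled instances only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, p) for p in unlabeled]    # 2. apply to unlabeled
        if new_guesses == guesses:                          # 4. stop when nothing changes
            break
        guesses = new_guesses
        model = fit_centroids(labeled + unlabeled, labels + guesses)  # 3. all instances
    return model, guesses

# Hypothetical data: two labeled points, four unlabeled ones.
labeled = [(1.0, 1.0), (9.0, 9.0)]
labels = ["+", "-"]
unlabeled = [(2.0, 1.5), (3.0, 3.5), (8.0, 8.5), (7.0, 6.5)]
model, guesses = self_train(labeled, labels, unlabeled)
```

Convergence here means the predicted labels stop changing between rounds; other stopping rules are possible.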

The Low Density Assumption

Assumption

• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear
• Maximizes margin to closest instances


The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM
• But non-convex optimization problem


Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
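
A single-machine sketch of the SGD procedure above is given below. It is not the Hadoop implementation from the slides: the hinge loss on labeled instances, the "hat" loss max(0, 1 - |w.x|) on unlabeled instances, the 1/(1+t) step schedule, and the toy data are assumptions, and the class-balance constraint and bias term are left out to keep the example short.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd_pass(labeled, unlabeled, lam=0.5, eta0=1.0, seed=0):
    """One SGD pass in random order; returns the weights of a linear classifier."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        eta = eta0 / (1 + t)                       # steps get smaller over time
        score = dot(w, x)
        if y is not None:
            if y * score < 1:                      # misclassified or inside the margin
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        elif 0 < abs(score) < 1:                   # unlabeled instance inside the margin
            s = 1 if score > 0 else -1             # push it away from the boundary
            w = [wi + eta * lam * s * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, lam=0.5):
    loss = sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled)
    loss += lam * sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled)
    return loss

# Hypothetical data centered at the origin, so no bias term is needed.
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
runs = [s3vm_sgd_pass(labeled, unlabeled, seed=s) for s in range(5)]
best = min(runs, key=lambda w: objective(w, labeled, unlabeled))
```

Running several seeds and keeping the lowest objective mirrors the "many random runs" bullet; it is a simple guard against the non-convexity noted earlier.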


The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected


Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
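
The iteration can be sketched on a small similarity graph. This is one simple variant, assumed for illustration: labels are encoded as +1.0 / -1.0 scores, labeled nodes are clamped, each unlabeled node takes the average of its neighbors, and every node is assumed to have at least one neighbor. The chain graph is hypothetical.

```python
def propagate_labels(neighbors, seeds, iterations=100, tol=1e-9):
    """neighbors: node -> list of adjacent nodes; seeds: node -> +1.0 or -1.0."""
    scores = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbors.items():
            if node in seeds:
                updated[node] = seeds[node]        # clamp labeled nodes
            else:
                updated[node] = sum(scores[n] for n in adjacent) / len(adjacent)
        change = max(abs(updated[n] - scores[n]) for n in neighbors)
        scores = updated
        if change < tol:                           # repeat until convergence
            break
    return scores

# Hypothetical chain a - b - c - d - e with labels only at the two ends.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
seeds = {"a": 1.0, "e": -1.0}
scores = propagate_labels(neighbors, seeds)
```

The sign of each score is the propagated label; a score of exactly zero (node c here) means the node is equally close to both classes.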


Conclusion

Semi-Supervised Learning

• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break...

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: GET THE DATA

Lesson 3: Prepare to be humbled (Left Elevator, Right Elevator)

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
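
One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion rates. The slides do not say which test was used, so this is an assumed choice, and the counts below are invented for the example.

```python
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return z, p_value

# Hypothetical experiment: 10,000 users per variant.
z, p_value = two_proportion_z_test(200, 10000, 260, 10000)
```

A small p-value says the observed gap would be unlikely if A and B truly converted at the same rate; the causal reading still depends on users having been randomly assigned to the two variants.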

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A

B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"

B has big search button

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

causal

"If you don't know where you are going, any road will take you there."
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions...

Out of Class Reading

Eight (8) page conference paper

40 page journal version...

Course Project: Due Oct 25th

Open Discussion on Course Project...

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms.

Figure labels: Getting Data For the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric - optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem - classification, regression, ranking, clustering, forecasting, outlier detection, etc. - some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection; Model training; Model scoring; Evaluation; Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) - this is the ML-heavy step.
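
Reporting a metric with an interval, as step 5 asks, can be sketched with k-fold cross-validation. Everything here is illustrative: the helper names, the majority-class baseline, the toy labels, and especially the 1.96 normal-approximation interval, which is only a rough heuristic because fold scores are not independent.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n, k, evaluate):
    """evaluate(train_idx, test_idx) -> metric; returns (mean, approx. 95% half-width)."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)
    return mean, 1.96 * math.sqrt(var / k)

labels = [1] * 30 + [0] * 10   # hypothetical binary targets

def majority_baseline(train_idx, test_idx):
    """Predict the majority class of the training fold; return test accuracy."""
    train = [labels[i] for i in train_idx]
    majority = 1 if sum(train) * 2 >= len(train) else 0
    return sum(1 for i in test_idx if labels[i] == majority) / len(test_idx)

mean_acc, half_width = cross_validate(len(labels), k=5, evaluate=majority_baseline)
```

A trivial baseline like this one is also a useful sanity check: a model whose test metric is far above it for no clear reason is a candidate for a target leak.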

Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2) - (5) until the test metrics are satisfactory.

Access Data; Pre-processing; Feature construction; Model scoring

Book Recommendation

That's all for our course...

Page 18: Barga Data Science lecture 10

4. Sizes and access. What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1GB) to a desktop for high-velocity innovation.

5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set

Deriving Knowledge from Data at Scale

6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets

a One set should have the deployed features (computed from the raw data) This provides the production yardstick

b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns

c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set

7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided

Deriving Knowledge from Data at Scale

8 Stale rate optimization monitoring How long does the set stay current In manycases we hide the fact that the problem is a time series even though the goal is topredict the future and we know that the distribution is changing We must quantifyhow much a distribution changes over a fixed period of time There are several waysto mitigate the changing distribution problem

a Assume the distribution is IID Regularly re-compute training sets and Gold sets Determine thefrequency of re-computation or set in place a system to monitor distribution drifts (monitor KPIchanges while the algorithm is kept constant)

b Decompose the model along ldquodistribution (fast) tracking parametersrdquo and slow tracking parametersThe fast tracking model may be a simple calibration with very few parameters

c Recast the problem as a time series problem patterns are (input data from t-T to t-1 prediction attime t) In this space the patterns are much larger but the problem is closer to being IID

9 The gold sets should have information that reveal the stale rate and allows algorithmsto differentiate themselves based on how they degrade with time

Deriving Knowledge from Data at Scale

10 Grouping Should the patterns be grouped For example in handwriting examples aregrouped per writer A set built by shuffling the words is misleading because trainingand testing would have word examples for the same writer which makesgeneralization much easier If the words are grouped per writers then a writer isunlikely to appear in both training and test set which requires the system to generalizeto never seen before handwriting (as opposed to never seen before words) Do wehave these type of constraints Should we group per advertisers campaign users togeneralize across new instances of these entities (as opposed to generalizing to newqueries) ML requires training and testing to be drawn from the same distributionDrawing duplicates is not a problem Problems arise when one partially drawexamples from the same entity on both training and testing on a small set of entitiesThis breaks the IID assumption and makes the generalization on the test set mucheasier than it actually is

11 Sampling production data What strategy is used for sampling Uniform Are any ofthe following filtered out fraud bad configurations duplicates non-billable adultoverwrites etc Guidance use the production sameness principle

Deriving Knowledge from Data at Scale

11 Unlabeled set If the number of labeled examples is small a large data set ofunlabeled data with the same distribution should be collected and be made a goldset This enables the discovery of new features using intermediate classifiers andactive labeling

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender age smoker eye color

male 19 yes green

female 44 yes gray

male 49 yes blue

male 12 no brown

female 37 no brown

female 60 no brown

male 44 no blue

female 27 yes brown

female 51 yes green

female 81 yes gray

male 22 yes brown

male 29 no blue

lung cancer

no

yes

yes

no

no

yes

no

no

yes

no

no

no

male 77 yes gray

male 19 yes green

female 44 no gray

yes

no

no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine LearningLack of Labelled Training Datahellip

What to Do

bull Controlled Experiments ndash get feedback from user to serve as labels

bull Mechanical Turk ndash pay people to label data to build training set

bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)

Deriving Knowledge from Data at Scale

What if you cant get labeled Training Data

Traditional Supervised Learning

bull Promotion on bookseller rsquos web page

bull Customers can rate books

bull Will a new customer like this book

bull Training set observations on previous customers

bull Test set new customers

Whathappensif only few customers rate a book

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Model

Test Data

Prediction

Training Data

Attributes

Target

Label

copy 2013 Datameer Inc All rights reserved

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory no

but we can make assumptions

PopularAssumptions

bull Clustering assumption

bull Low density assumption

bull Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
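The "cluster, then perform majority vote" recipe can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function names (`kmeans`, `cluster_then_label`), the farthest-first initialisation, and the toy points are all hypothetical choices made to keep the example deterministic.

```python
from collections import Counter

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster index per point."""
    # Farthest-first initialisation (an assumption) keeps this sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def cluster_then_label(points, labels, k=2):
    """labels[i] is a class or None; unlabeled points take their cluster's majority label."""
    assign = kmeans(points, k)
    filled = list(labels)
    for c in range(k):
        votes = Counter(l for l, a in zip(labels, assign) if a == c and l is not None)
        if not votes:
            continue  # no labeled point in this cluster: leave its members unlabeled
        majority = votes.most_common(1)[0][0]
        for i, a in enumerate(assign):
            if a == c and filled[i] is None:
                filled[i] = majority
    return filled

# Hypothetical toy data: two labeled points, four unlabeled ones.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["+", None, None, "-", None, None]
filled = cluster_then_label(points, labels)
print(filled)
```

The result is only as good as the clustering assumption itself: if a class spans several clusters, or a cluster mixes classes, the majority vote spreads wrong labels.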

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum
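To make the two EM steps concrete, here is a small sketch for a one-dimensional mixture of two Gaussians. It is an illustration only: the initialisation at the data extremes, the variance floor, and the toy data are assumptions, and a real implementation would work in log space and handle more components.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    # Crude initialisation (an assumption): start the two means at the data extremes.
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities (responsibilities) per instance.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, location and shape of each cluster.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances, resp

# Hypothetical data drawn around 1.0 and 5.0.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = em_two_gaussians(xs)
print(means, weights)
```

Because EM only climbs to a local optimum, different starting means can give different answers on less well-separated data; restarting from several initialisations is a common remedy.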

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
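The four self-training steps above map directly onto a loop. The sketch below uses a nearest-centroid classifier as the base model purely for brevity; that choice, the function names, and the toy data are assumptions, and any classifier with fit and predict steps could take its place.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        members = [p for p, l in zip(points, labels) if l == cls]
        centroids[cls] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

def predict(centroids, p):
    """Return the class whose centroid is closest to p."""
    return min(centroids, key=lambda cls: sum((a - b) ** 2 for a, b in zip(p, centroids[cls])))

def self_train(labeled, unlabeled, max_rounds=10):
    points = [p for p, _ in labeled]
    labels = [l for _, l in labeled]
    model = fit_centroids(points, labels)            # 1. learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]  # 2. apply to unlabeled
        if new_pseudo == pseudo:                     # 4. stop once labels no longer change
            break
        pseudo = new_pseudo
        model = fit_centroids(points + unlabeled, labels + pseudo)  # 3. relearn on all
    return model, pseudo

# Hypothetical toy data.
labeled = [((1.0, 1.0), "+"), ((5.0, 5.0), "-")]
unlabeled = [(1.5, 0.5), (2.0, 2.0), (4.0, 4.5), (5.5, 5.0)]
model, pseudo = self_train(labeled, unlabeled)
print(pseudo)
```

A common refinement, not shown here, is to add only the most confident pseudo-labels in each round so that early mistakes are less likely to reinforce themselves.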

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin to both labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem
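For reference, one commonly cited way to write the semi-supervised SVM objective is given below. This is a sketch of the textbook formulation, not necessarily the exact form used in the lecture; the symbols are assumptions: $L$ and $U$ are the labeled and unlabeled index sets, $C$ and $C^{*}$ weight the two loss terms, and $f$ is the linear decision function.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w \rVert^2
  + C \sum_{i \in L} \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  + C^{*} \sum_{j \in U} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr),
\qquad f(x) = w^\top x + b,

\text{subject to}\quad \frac{1}{|U|} \sum_{j \in U} f(x_j) \;=\; \frac{1}{|L|} \sum_{i \in L} y_i .
```

The second sum penalises unlabeled points that fall inside the margin on either side, $C^{*}$ is the parameter that tunes their influence, and the constraint keeps the predicted class balance close to the labeled one. The absolute value in the unlabeled term is what makes the problem non-convex.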

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
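A single-machine sketch of the stochastic gradient descent pass described above might look as follows. It is an illustration under assumptions, not the Hadoop implementation: the function name `s3vm_sgd`, the step-size schedule, the default weights, and the one-feature toy data are hypothetical, and the class-balance constraint and the "many random runs" selection step are left out for brevity.

```python
import random

def s3vm_sgd(labeled, unlabeled, c_unlabeled=0.1, lam=0.01, seed=0):
    """One SGD pass over labeled (x, y in {-1, +1}) and unlabeled x, in random order."""
    rng = random.Random(seed)
    data = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w = [0.0] * len(data[0][0])
    b = 0.0
    for t, (x, y) in enumerate(data, start=1):
        eta = 1.0 / (1.0 + lam * t)  # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:
            # Unlabeled: use the current prediction as a provisional label,
            # with a smaller weight, so the point is pushed out of the margin.
            y_eff, weight = (1.0 if score >= 0 else -1.0), c_unlabeled
        else:
            y_eff, weight = float(y), 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]  # L2 regularisation shrink
        if y_eff * score < 1.0:  # inside the margin or misclassified
            w = [wi + eta * weight * y_eff * xi for wi, xi in zip(w, x)]
            b += eta * weight * y_eff
    return w, b

# Hypothetical one-feature data: two labeled points, four unlabeled ones.
labeled = [((2.0,), 1), ((-2.0,), -1)]
unlabeled = [(0.6,), (1.0,), (-0.7,), (-1.0,)]
w, b = s3vm_sgd(labeled, unlabeled)
print(w, b)
```

Because the objective is non-convex, the outcome depends on the visiting order. Running the pass with several seeds and keeping the classifier with the lowest objective value mirrors the "many random runs" bullet.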

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
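The propagation loop can be sketched on a small nearest-neighbor similarity graph. This is a minimal illustration, assuming binary labels encoded as +1 and -1, an unweighted symmetric k-nearest-neighbor graph, and labeled nodes clamped to their known value; the function names and toy points are hypothetical.

```python
def knn_graph(points, k):
    """Connect every point to its k nearest neighbors, then symmetrise."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def label_propagation(points, labels, k=2, iters=100, tol=1e-9):
    """labels[i] is +1, -1 or None; returns predicted signs and the raw scores."""
    adj = knn_graph(points, k)
    f = [float(l) if l is not None else 0.0 for l in labels]
    for _ in range(iters):
        # Each unlabeled node takes the average score of its neighbors;
        # labeled nodes stay clamped to their known label.
        new_f = [
            float(l) if l is not None else sum(f[j] for j in adj[i]) / len(adj[i])
            for i, l in enumerate(labels)
        ]
        converged = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if converged:
            break
    return [1 if v >= 0 else -1 for v in f], f

# Hypothetical data: two well-separated groups with one labeled point each.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
labels = [1, None, None, None, None, None, None, -1]
preds, scores = label_propagation(points, labels)
print(preds)
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion; iterating avoids forming that inverse explicitly.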

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

[Figure labels: Left Elevator, Right Elevator]

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
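As one example of the statistical test mentioned above, a two-proportion z-test compares the conversion rates of control (A) and treatment (B). The sketch below is an assumption about which test to use, since the slides do not name one, and the visitor and conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=250, n_b=10000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same. Randomised assignment of users to A and B is what allows that difference to be read causally.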

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"

B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data, prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?

• Insight

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

  • Chlorine and lime was effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

[Experiment diagram labels: Getting Data for the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model]

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Pipeline stages: Feature selection; Train/Test split; Model training; Model scoring; Evaluation

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
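Step 5 asks for metrics reported with confidence intervals, and cross-validation is one way to get them. The sketch below is illustrative only: the helper names, the toy threshold classifier, and the normal-approximation interval (rough for only five folds) are assumptions rather than the course's procedure.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, score, k=5):
    """Return mean score, half-width of a rough 95% interval, and per-fold scores."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    mean = sum(scores) / k
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    # Normal approximation over the k fold scores (an assumption; crude for small k).
    return mean, 1.96 * sd / math.sqrt(k), scores

def fit_threshold(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == (y == 1) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical, trivially separable data.
xs = [float(v) for v in range(10)] + [float(v) for v in range(100, 110)]
ys = [0] * 10 + [1] * 10
mean, half_width, scores = cross_validate(xs, ys, fit_threshold, accuracy)
print(f"accuracy = {mean:.2f} +/- {half_width:.2f}")
```

A perfect score on real data, as this toy set produces, is exactly the kind of result that should trigger a check for target leaks.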

Deriving Knowledge from Data at Scale

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Train/Test split; Model training; Model scoring; Evaluation

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Deriving Knowledge from Data at Scale

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:

a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.

b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.

c. One set should have an extended number of features. The additional features may be "building blocks", features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.

7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
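The "pass over the data" that keeps only the most frequent instances in item 7 can be sketched with a counter. The names `build_vocabulary` and `encode` and the shared "other" bucket are hypothetical choices for illustration; at the scale of 10M entries a real system would count in a distributed job.

```python
from collections import Counter

def build_vocabulary(values, top_n):
    """Map the top_n most frequent values to indices 0..top_n-1 (no labels needed)."""
    counts = Counter(values)
    return {value: i for i, (value, _) in enumerate(counts.most_common(top_n))}

def encode(value, vocab):
    """Frequent values get their own index; everything else shares one 'other' bucket."""
    return vocab.get(value, len(vocab))

# Hypothetical query log, taken from a data set disjoint from training and test.
queries = ["q1", "q2", "q1", "q3", "q1", "q2"]
vocab = build_vocabulary(queries, top_n=2)
print(vocab, encode("q3", vocab))
```

Building the vocabulary on a separate data set, as the item requires, keeps frequency information from the test set out of the features.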

Deriving Knowledge from Data at Scale

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
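One simple way to quantify how much a distribution has changed over a period is the population stability index (PSI). PSI is swapped in here as an example drift statistic; the source does not name a specific measure, and the bin edges, floor value, and sample data below are assumptions.

```python
import math

def psi(reference, current, edges):
    """Population stability index between two samples, using shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin containing v
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]
    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical feature values from the gold set and from a later period.
reference = [1, 2, 3, 4, 5, 6, 7, 8]
later = [5, 6, 7, 8, 9, 10, 11, 12]
print(psi(reference, reference, edges=[3, 6]))  # identical samples give 0.0
print(psi(reference, later, edges=[3, 6]))
```

Tracking such a statistic at a fixed cadence gives an empirical stale rate, which in turn suggests how often training and gold sets need to be recomputed.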

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
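The grouping concern in item 10 comes down to splitting by entity rather than by example. A minimal sketch follows; the function name `group_split`, the test fraction, and the writer/word toy data are hypothetical.

```python
import random

def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group (e.g. writer) lands entirely in train or in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting data: (writer, word) pairs, three words per writer.
examples = [(writer, word)
            for writer in ["ann", "bob", "cho", "dee"]
            for word in ["cat", "dog", "sun"]]
train, test = group_split(examples, group_of=lambda e: e[0])
print(len(train), len(test))
```

With this split, a test score measures generalization to unseen writers. Shuffling individual words instead would let the same writer appear on both sides and inflate the score.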

Deriving Knowledge from Data at Scale

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Deriving Knowledge from Data at Scale

The greatest challenge in Machine LearningLack of Labelled Training Datahellip

What to Do

bull Controlled Experiments ndash get feedback from user to serve as labels

bull Mechanical Turk ndash pay people to label data to build training set

bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)

Deriving Knowledge from Data at Scale

What if you cant get labeled Training Data

Traditional Supervised Learning

bull Promotion on bookseller rsquos web page

bull Customers can rate books

bull Will a new customer like this book

bull Training set observations on previous customers

bull Test set new customers

Whathappensif only few customers rate a book

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Model

Test Data

Prediction

Training Data

Attributes

Target

Label

copy 2013 Datameer Inc All rights reserved

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory no

but we can make assumptions

PopularAssumptions

bull Clustering assumption

bull Low density assumption

bull Manifold assumption

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumption

bull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send the data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
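
The procedure above can be mimicked in a single process: a shuffle plays the mapper's role and a sequential loop plays the reducer's. This is a sketch under several assumptions that are not in the slides: the step size `eta0 / sqrt(t)`, pseudo-labelling an unlabeled instance with the classifier's current prediction, skipping unlabeled instances until the classifier has an opinion, and picking the "best" run by errors on the labeled data. A bias term can be emulated by appending a constant feature.

```python
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_s3vm_pass(labeled, unlabeled, dim, lam=0.01, c_unlabeled=0.5, eta0=0.5, seed=0):
    """One pass over the data in random order (illustrative sketch)."""
    rng = random.Random(seed)
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(stream)                       # "mapper": random order
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):   # "reducer": sequential updates
        eta = eta0 / math.sqrt(t)             # steps get smaller over time
        score = dot(w, x)
        weight = 1.0
        if y is None:
            if score == 0.0:
                continue                      # classifier has no opinion yet
            y = 1 if score > 0 else -1        # pseudo-label from current prediction
            weight = c_unlabeled
        if y * score < 1.0:                   # inside the margin or misclassified
            w = [(1.0 - eta * lam) * wi + eta * weight * y * xi for wi, xi in zip(w, x)]
    return w


def best_of_runs(labeled, unlabeled, dim, n_runs=10):
    """Many random runs; keep the one with the fewest labeled errors (assumed criterion)."""
    def errors(w):
        return sum(1 for x, y in labeled if y * dot(w, x) <= 0)
    runs = [sgd_s3vm_pass(labeled, unlabeled, dim, seed=s) for s in range(n_runs)]
    return min(runs, key=errors)
```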

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
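
A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The Gaussian similarity, the `sigma` bandwidth, and the choice to make edges symmetric are assumptions made for the example; other similarity scores work the same way.

```python
import math


def similarity(a, b, sigma=1.0):
    """Gaussian similarity: close to 1 for nearby points, near 0 for distant ones."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def knn_graph(points, k=2, sigma=1.0):
    """Connect every instance to its k most similar instances (symmetric edges)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        scored = sorted(
            ((similarity(points[i], points[j], sigma), j) for j in range(n) if j != i),
            reverse=True,
        )
        for s, j in scored[:k]:
            graph[i][j] = s
            graph[j][i] = s
    return graph
```

The result maps each node index to a dictionary of `neighbor -> similarity`. This brute-force construction compares all pairs, so it is only meant for small illustrative data sets.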

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
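
The iteration above can be sketched on a weighted graph given as a dictionary of `node -> {neighbor: weight}`. Clamping the labeled nodes to +1/-1 and using a simple change threshold as the stopping rule are assumptions of this example, and the names are hypothetical.

```python
def propagate_labels(graph, seeds, n_iter=100, tol=1e-6):
    """Iteratively replace each unlabeled score by the weighted mean of its neighbors.

    graph: node -> {neighbor: edge weight}
    seeds: labeled nodes -> +1.0 or -1.0 (kept fixed during propagation)
    """
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(n_iter):
        max_change = 0.0
        new_scores = {}
        for node, nbrs in graph.items():
            if node in seeds:
                new_scores[node] = seeds[node]      # labeled nodes stay clamped
                continue
            total = sum(nbrs.values())
            value = sum(w * scores[j] for j, w in nbrs.items()) / total if total else 0.0
            max_change = max(max_change, abs(value - scores[node]))
            new_scores[node] = value
        scores = new_scores
        if max_change < tol:                        # converged
            break
    return scores
```

The sign of the final score gives the predicted class of each unlabeled node. The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.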

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator    Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
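
One standard statistical test for an A/B experiment with a conversion-style metric is the two-proportion z-test. The sketch below assumes that test and a normal approximation; the slides do not say which test was used, and the counts in the usage note are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a hypothetical 10% control rate against a 15% treatment rate. A small p-value says the difference is unlikely to be due to chance, and because users were randomly assigned, the difference can be attributed to the treatment.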

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same

A B

• A was 85 better

A

B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, “popular searches”

B has big search button

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

A B

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

Get the data, prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

• Using classification algorithms
• Evaluating the model
• Splitting to Training and Testing Datasets
• Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
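
Step 5 can be sketched as k-fold cross-validation that reports a mean accuracy together with a rough confidence interval. Everything named here is hypothetical: the threshold classifier is a toy stand-in for a real model, and the 1.96 multiplier is a normal approximation that is optimistic for a small number of folds.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, train_fn, predict_fn, k=5, seed=0):
    """Return mean accuracy and an approximate 95% interval over k folds (k >= 2)."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train_fn(train_x, train_y)           # never sees the held-out fold
        correct = sum(predict_fn(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    mean = sum(scores) / k
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    half_width = 1.96 * sd / math.sqrt(k)            # rough normal approximation
    return mean, (mean - half_width, mean + half_width)


def train_threshold(xs, ys):
    """Toy model (assumption): midpoint between the two class means."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0
```

A suspiciously perfect score on real data is a cue to look for a target leak, for example a feature that is computed from the label itself.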

Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Access Data, Pre-processing, Feature construction, Model scoring

Book Recommendation

That’s all for our course…

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:

a. Assume the distribution is IID. Regularly re-compute training sets and Gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).

b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.

c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.

9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
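
Two of the mitigations in item 8 are easy to illustrate. The sketch below shows the windowing in (c), where each pattern is the inputs from t-T to t-1 paired with the value at time t, and a very simple drift signal for (a). The function names and the relative-change measure are assumptions; a production monitor would likely use a more robust statistic.

```python
def make_windows(series, window):
    """Recast a series as (inputs from t-T..t-1, target at t) patterns, T = window."""
    return [(series[t - window:t], series[t]) for t in range(window, len(series))]


def kpi_drift(reference, recent):
    """Relative change of the mean KPI between a reference and a recent period.

    Intended to be computed while the algorithm is kept constant, so that a
    change reflects the data distribution rather than the model.
    """
    ref_mean = sum(reference) / len(reference)
    rec_mean = sum(recent) / len(recent)
    return abs(rec_mean - ref_mean) / abs(ref_mean)
```

For instance, `make_windows([1, 2, 3, 4, 5], 2)` yields three patterns, the first being `([1, 2], 3)`. A drift value above some agreed threshold would trigger re-computation of the training and gold sets.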

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test set, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing on a small set of entities. This breaks the IID assumption and makes the generalization on the test set much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
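
The grouping rule in item 10 can be sketched as a split that assigns whole groups (writers, advertisers, users) to either training or testing, never both. The function name, the `group_of` callback, and the default test fraction are hypothetical choices for this example.

```python
import random


def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test
```

With handwriting data stored as `(writer, word)` pairs, calling `group_split(examples, group_of=lambda e: e[0])` keeps each writer on one side only, so the test set measures generalization to handwriting that was never seen in training.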

11. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, ‘hot or not’, review a product, observe their click behavior (ad retargeting, search results, etc.)

What if you can’t get labeled Training Data

Traditional Supervised Learning

• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book

Training Data (Attributes: Age, Income; Target Label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Model Prediction

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory: no

but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote
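
The "cluster, then perform a majority vote" recipe can be sketched with a small k-means. The deterministic farthest-point initialisation and all function names are assumptions made to keep the example short and repeatable.

```python
from collections import Counter


def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = [points[0]]
    while len(centroids) < k:      # assumed init: repeatedly add the farthest point
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    assign = []
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign


def cluster_then_label(points, known, k=2):
    """Cluster all points, then give each cluster the majority label of its labeled members.

    known: index -> label for the few labeled instances.
    Clusters without any labeled member are left as None.
    """
    assign = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for i, label in known.items():
        votes[assign[i]][label] += 1
    cluster_label = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [cluster_label[a] for a in assign]
```

This works only as well as the clustering assumption holds: if the classes do not form distinct clusters, the majority vote will spread wrong labels across a whole cluster.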

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A    B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A    B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Experiment callouts: Getting Data For the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
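As a rough illustration of reporting a metric with an interval via cross-validation, the sketch below runs k-fold evaluation and returns the mean accuracy with a normal-approximation half-width. The toy threshold classifier, the function names, and the data are hypothetical; the folds are contiguous, so the data is assumed to be shuffled (or interleaved) beforehand.

```python
import math
import statistics

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, fit, predict, k=5):
    """Return (mean accuracy, rough 95% half-width) across k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)  # the model never sees the held-out fold
        correct = sum(predict(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, half_width

# Toy model (an assumption for illustration): threshold halfway between class means.
def fit_threshold(xs, ys):
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

xs = [1, 9, 2, 8, 3, 7, 4, 6, 0, 10]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
mean_acc, half_width = cross_validate(xs, ys, fit_threshold, predict_threshold, k=5)
print(mean_acc, half_width)
```

On a toy set this cleanly separable, every fold scores the same, which is itself a reminder of the warning above: suspiciously perfect test metrics on real data deserve a check for target leaks.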

Deriving Knowledge from Data at Scale

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Deriving Knowledge from Data at Scale

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing, on a small set of entities. This breaks the IID assumption and makes the generalization on the test set much easier than it actually is.

11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
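The grouping guidance in item 10 can be sketched as a group-aware split: whole groups (writers, advertisers, users) go to either training or test, never both. The helper name `grouped_split` and the toy handwriting records are assumptions for illustration only.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split examples so that no group appears in both train and test."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting samples: (writer, word)
examples = [
    ("writer1", "cat"), ("writer1", "dog"),
    ("writer2", "cat"), ("writer2", "tree"),
    ("writer3", "dog"), ("writer3", "tree"),
    ("writer4", "cat"), ("writer4", "dog"),
]
train, test = grouped_split(examples, group_of=lambda e: e[0])
train_writers = {w for w, _ in train}
test_writers = {w for w, _ in test}
print(sorted(train_writers), sorted(test_writers))
```

Shuffling individual words instead would place the same writer on both sides, which is exactly the leak the item warns about.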

Deriving Knowledge from Data at Scale

11. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and be made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book?

Training Data (Attributes: Age, Income; Target / Label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (Model applied to Test Data)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

… but we can make assumptions

Popular Assumptions

• Clustering assumption
• Low density assumption
• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
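A minimal sketch of "cluster, then perform majority vote", assuming a tiny 2-D data set with one labeled instance per class. The k-means here uses a deterministic farthest-point initialisation to keep the example reproducible; that choice, the function names, and the points are illustrative assumptions rather than anything from the slides.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    # Farthest-point initialisation keeps the sketch deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label or None; unlabeled points inherit the
    majority label of their cluster (None if the cluster has no labels)."""
    assign = kmeans(points, k)
    majority = {}
    for j in range(k):
        votes = Counter(y for y, a in zip(labels, assign) if a == j and y is not None)
        majority[j] = votes.most_common(1)[0][0] if votes else None
    return [majority[a] for a in assign]

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]
labels = ["+", None, None, None, "-", None, None, None]
predicted = cluster_then_vote(points, labels)
print(predicted)
```

The approach only works as well as the clustering assumption holds: if a class spans several clusters, or a cluster mixes classes, the vote propagates the wrong label.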

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
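The two EM steps can be sketched for a one-dimensional mixture of two Gaussians. To make it semi-supervised, this version clamps the assignment probabilities of labeled instances to their known class, which is one common variant and an assumption here. The function names and data are illustrative, and each class is assumed to have at least one labeled instance.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def semi_supervised_em(xs, labels, iters=50):
    """labels[i] is 0, 1, or None (unlabeled).
    Returns (weights, means, variances, responsibilities)."""
    k = 2
    # Start each component at the mean of its labeled instances.
    means = [sum(x for x, y in zip(xs, labels) if y == j)
             / sum(1 for y in labels if y == j) for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x, y in zip(xs, labels):
            if y is not None:  # labeled: assignment is known
                resp.append([1.0 if j == y else 0.0 for j in range(k)])
                continue
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, location and shape of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances, resp

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
labels = [0, None, None, None, None, 1, None, None, None, None]
weights, means, variances, resp = semi_supervised_em(xs, labels)
print(means, weights)
```

With well-separated clusters the result is insensitive to the starting point; on harder data, different initialisations can end in different local optima, as the slide notes.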

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
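The four self-training steps can be sketched with any base learner. Here a nearest-centroid classifier stands in as a hypothetical, deliberately simple model; the function names and the data points are assumptions for illustration.

```python
import math

def fit_centroids(points, labels):
    """Nearest-centroid 'model': one mean vector per class."""
    by_class = {}
    for p, y in zip(points, labels):
        by_class.setdefault(y, []).append(p)
    return {y: tuple(sum(c) / len(ps) for c in zip(*ps))
            for y, ps in by_class.items()}

def predict(model, p):
    return min(model, key=lambda y: math.dist(p, model[y]))

def self_train(labeled, unlabeled, max_rounds=10):
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    model = fit_centroids(points, labels)          # 1. labeled instances only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, p) for p in unlabeled]  # 2. apply model
        if new_guesses == guesses:                 # 4. stop once nothing changes
            break
        guesses = new_guesses
        # 3. learn a new model on all instances, using the guessed labels
        model = fit_centroids(points + unlabeled, labels + guesses)
    return model, guesses

labeled = [((1.0, 1.0), "+"), ((5.0, 5.0), "-")]
unlabeled = [(1.5, 1.2), (2.0, 2.2), (4.0, 4.5), (4.6, 3.9), (2.8, 2.6)]
model, guesses = self_train(labeled, unlabeled)
print(guesses)
```

A practical caveat: early mistakes feed back into later rounds, so implementations often add only the most confident predictions in each round rather than all of them.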

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
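A single-machine sketch of the stochastic gradient descent pass described above, with several simplifying assumptions: no regularisation term, no class-balance constraint, and no Hadoop. Labeled instances use the hinge loss; unlabeled instances are pushed away from the boundary using the sign of the current score as a pseudo-label. The names `s3vm_sgd`, `lam`, `lr0` and the data are hypothetical.

```python
import random

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def s3vm_sgd(labeled, unlabeled, lam=0.1, lr0=0.5, seed=0):
    """One pass over the data in random order; returns (w, b)."""
    rng = random.Random(seed)
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w, b = [0.0] * len(data[0][0]), 0.0
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                 # steps get smaller over time
        s = score(w, b, x)
        if y is None:
            if s == 0:
                continue                   # classifier undecided: no pseudo-label yet
            y, step = (1 if s > 0 else -1), lr * lam  # lam tunes unlabeled influence
        else:
            step = lr
        if y * s < 1:                      # misclassified or inside the margin
            w = [wi + step * y * xi for wi, xi in zip(w, x)]
            b += step * y
    return w, b

def objective(w, b, labeled, unlabeled, lam=0.1):
    hinge = sum(max(0.0, 1 - y * score(w, b, x)) for x, y in labeled)
    hat = sum(max(0.0, 1 - abs(score(w, b, x))) for x in unlabeled)
    return hinge + lam * hat

labeled = [((3.0, 3.0), 1), ((-3.0, -3.0), -1)]
unlabeled = [(2.0, 4.0), (4.0, 2.5), (2.5, 2.0),
             (-2.0, -3.5), (-4.0, -2.0), (-2.5, -2.5)]

# "Many random runs to find the best one": keep the run with the lowest objective.
runs = [s3vm_sgd(labeled, unlabeled, seed=s) for s in range(5)]
w, b = min(runs, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled))
print(w, b)
```

Because the objective is non-convex, different random orders can end in different solutions, which is why the slides suggest running several passes and keeping the best. In the Hadoop version each reducer would play the role of one such run.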

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
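A small sketch of label propagation on a similarity graph, assuming the graph is given as a dictionary of weighted neighbors and labels are encoded as +1 / -1 scores. Labeled nodes stay clamped; every other node repeatedly takes the weighted average of its neighbors until the scores stop changing. The function name and the six-node chain are illustrative assumptions.

```python
def label_propagation(neighbors, seeds, tol=1e-9, max_iters=10_000):
    """neighbors: {node: {neighbor: weight}}, seeds: {node: +1.0 or -1.0}.
    Returns a score per node; the sign is the predicted class."""
    scores = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(max_iters):
        new_scores = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                new_scores[node] = seeds[node]   # clamp labeled instances
            else:
                total = sum(nbrs.values())
                new_scores[node] = sum(
                    w * scores[n] for n, w in nbrs.items()) / total
        change = max(abs(new_scores[n] - scores[n]) for n in neighbors)
        scores = new_scores
        if change < tol:                         # repeat until convergence
            break
    return scores

# Chain 0 - 1 - 2 - 3 - 4 - 5 with unit weights; only the two ends are labeled.
chain = {i: {} for i in range(6)}
for i in range(5):
    chain[i][i + 1] = 1.0
    chain[i + 1][i] = 1.0
scores = label_propagation(chain, seeds={0: 1.0, 5: -1.0})
print({node: round(s, 3) for node, s in scores.items()})
```

At the fixed point each unlabeled score equals the average of its neighbors, which is a linear system in the unlabeled scores. Solving that system directly is the "matrix inversion" view mentioned above; the iteration simply avoids forming the inverse.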

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1 Define the objective and quantify it with a metric ndash optionally with constraints

if any This typically requires domain knowledge

2 Collect and understand the data deal with the vagaries and biases in the data

acquisition (missing data outliers due to errors in the data collection process

more sophisticated biases due to the data collection procedure etc

3 Frame the problem in terms of a machine learning problem ndash classification

regression ranking clustering forecasting outlier detection etc ndash some

combination of domain knowledge and ML knowledge is useful

4 Transform the raw data into a ldquomodeling datasetrdquo with features weights

targets etc which can be used for modeling Feature construction can often

be improved with domain knowledge Target must be identical (or a very

good proxy) of the quantitative metric identified step 1

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train Test split

5 Train test and evaluate taking care to control

biasvariance and ensure the metrics are

reported with the right confidence intervals

(cross-validation helps here) be vigilant

against target leaks (which typically leads to

unbelievably good test metrics) ndash this is the

ML heavy step

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Train/Test split; Feature selection; Model training; Model scoring; Evaluation

6. Iterate steps (2) to (5) until the test metrics are satisfactory.

Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring

Book Recommendation

That's all for our course…

Page 22: Barga Data Science lecture 10

11. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender   age   smoker   eye color   lung cancer
male     19    yes      green       no
female   44    yes      gray        yes
male     49    yes      blue        yes
male     12    no       brown       no
female   37    no       brown       no
female   60    no       brown       yes
male     44    no       blue        no
female   27    yes      brown       no
female   51    yes      green       yes
female   81    yes      gray        no
male     22    yes      brown       no
male     29    no       blue        no

Train ML Model

gender   age   smoker   eye color   lung cancer
male     77    yes      gray        yes
male     19    yes      green       no
female   44    no       gray        no

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do

• Controlled Experiments: get feedback from users to serve as labels

• Mechanical Turk: pay people to label data to build a training set

• Ask Users to Label Data: report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only a few customers rate a book?

Training Data (Attributes: Age, Income; Target: LikesBook)

Age   Income   LikesBook
24    60K      +
65    80K      -
60    95K      -
35    52K      +
20    45K      +
43    75K      +
26    51K      +
52    47K      -
47    38K      -
25    22K      -
33    47K      +

Test Data

Age   Income   LikesBook
22    67K
39    41K

Prediction (Model applied to Test Data)

Age   Income   LikesBook
22    67K      +
39    41K      -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

...but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
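A minimal sketch of "cluster, then perform majority vote", assuming plain k-means (Lloyd's algorithm) as the clustering step. The points, the handful of labels and the choice of initial centroids are made up for illustration.

```python
from collections import Counter

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, initial_centroids, iterations=20):
    """Plain Lloyd's algorithm; returns the cluster index of every point."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

def cluster_then_vote(points, labels, initial_centroids):
    """labels[i] is a class label or None; each point gets its cluster's majority label."""
    assignment = k_means(points, initial_centroids)
    votes = {}
    for cluster, label in zip(assignment, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    return [
        votes[cluster].most_common(1)[0][0] if cluster in votes else None
        for cluster in assignment
    ]

# Ten made-up points in two groups; only three of them carry a label.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.3), (5.1, 5.1), (4.7, 4.9)]
labels = ["+", None, None, "+", None, "-", None, None, None, None]

predicted = cluster_then_vote(points, labels, initial_centroids=[points[0], points[5]])
print(predicted)
```

A cluster that contains no labeled instance gets `None`, which is one visible way the approach fails when the clustering assumption does not hold.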


Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find the most probable location and shape of clusters given the data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to a local optimum
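The two steps can be sketched for the simplest case, a mixture of two one-dimensional Gaussians. This is an illustrative sketch, not the lecture's implementation: the synthetic data, the min/max initialisation and the fixed iteration count are all assumptions.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(xs), max(xs)]      # crude but deterministic initialisation
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: cluster assignment probabilities for each instance
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weight, location and shape of each cluster
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6
            )
    return weights, means, variances

# Synthetic data: two well-separated clusters centred near 0 and 6.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(6, 1) for _ in range(200)]

weights, means, variances = em_two_gaussians(xs)
print(weights, means, variances)
```

With well-separated clusters this initialisation is harmless. With overlapping clusters a different start can end in a different local optimum, which is the caveat in the last bullet.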


Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as the mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
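The self-training loop above can be sketched as follows. The base learner is a hypothetical one-feature nearest-centroid model chosen only to keep the example short, and the data is made up. "Convergence" is taken to mean that the pseudo-labels stop changing.

```python
import statistics

def fit_centroids(xs, ys):
    """Nearest-centroid model: one mean per class."""
    return {label: statistics.mean([x for x, y in zip(xs, ys) if y == label])
            for label in set(ys)}

def predict(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)        # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]  # apply to unlabeled
        if new_pseudo == pseudo:                       # nothing changed: converged
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,  # learn new model on all
                              labeled_y + pseudo)
    return model, pseudo

# Two labeled instances and nine unlabeled ones (made-up values).
labeled_x, labeled_y = [1.0, 9.0], ["a", "b"]
unlabeled_x = [0.0, 0.5, 1.5, 2.0, 4.8, 8.0, 8.5, 9.5, 10.0]

model, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
print(pseudo)
```

A common refinement, not shown here, is to add only the most confident pseudo-labels in each round so that early mistakes are less likely to reinforce themselves.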

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances


The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin (distance) to both labeled and unlabeled instances

• Parameter to fine-tune the influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem


Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves the classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find the best one
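A single-machine sketch of the stochastic gradient descent idea, under several assumptions that are not in the slides: the classifier is linear through the origin (no bias term), labeled instances use the hinge loss, unlabeled instances use the "hat" loss max(0, 1 - |w·x|) with weight `c_unlabeled`, the class-balance constraint is omitted, and the step-size schedule and data are made up. It does not reproduce the Hadoop mapper/reducer split.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_pass(labeled, unlabeled, c_unlabeled=0.5, lam=0.01, eta0=0.5, seed=0):
    """One pass over the data in random order.

    labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x.
    """
    w = [0.0] * len(labeled[0][0])
    stream = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(stream)
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / math.sqrt(t)              # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0 or abs(score) >= 1.0:
                continue                       # no signal yet, or already outside the margin
            y, weight = (1 if score > 0 else -1), c_unlabeled
        elif y * score >= 1.0:
            continue                           # labeled and correctly outside the margin
        else:
            weight = 1.0
        # shrink (regularisation), then move the classifier a bit towards y * x
        w = [(1 - eta * lam) * wi + eta * weight * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, c_unlabeled=0.5, lam=0.01):
    hinge = sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled)
    hat = sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled)
    return hinge + c_unlabeled * hat + 0.5 * lam * dot(w, w)

def best_of_runs(labeled, unlabeled, runs=10):
    """Many random runs; keep the classifier with the lowest objective."""
    candidates = [sgd_pass(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled))

# Made-up data: one labeled instance per class, the rest unlabeled.
rng = random.Random(1)
positives = [(2 + rng.uniform(-0.5, 0.5), 2 + rng.uniform(-0.5, 0.5)) for _ in range(20)]
negatives = [(-2 + rng.uniform(-0.5, 0.5), -2 + rng.uniform(-0.5, 0.5)) for _ in range(20)]
labeled = [(positives[0], 1), (negatives[0], -1)]
unlabeled = positives[1:] + negatives[1:]

w = best_of_runs(labeled, unlabeled)
print(w)
```

Because the objective is non-convex, different random orders can end in different solutions. Picking the best of several runs is the simple remedy the slide mentions.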


The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected
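One way to build such a similarity graph is sketched below. The Gaussian similarity score, the `bandwidth` parameter, the choice of k and the sample points are assumptions for illustration. The slides do not say which similarity measure was used.

```python
import math

def knn_graph(points, k=2, bandwidth=1.0):
    """Return {i: {j: similarity}} linking each point to its k nearest neighbours."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in others[:k]:
            s = math.exp(-(d / bandwidth) ** 2)   # Gaussian similarity score
            graph[i][j] = s
            graph[j][i] = s                        # keep the graph symmetric
    return graph

# Two made-up groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
for node, neighbours in graph.items():
    print(node, sorted(neighbours))
```

With these points every node links only to members of its own group, so labels spread along such a graph stay inside each group.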


Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
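A minimal sketch of the iterative form, assuming a weighted graph given as nested dictionaries, two classes encoded as +1 and -1, and labeled nodes whose scores are clamped. The six-node graph is made up. The fixed point of this iteration is the solution of a linear system, which is the sense in which the method is equivalent to a matrix inversion.

```python
def propagate_labels(graph, seeds, iterations=100, tol=1e-9):
    """graph: {node: {neighbour: weight}}; seeds: {node: +1.0 or -1.0}.

    Returns a score per node; the sign of the score is the predicted class.
    """
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node, neighbours in graph.items():
            if node in seeds:
                new_scores[node] = seeds[node]      # clamp the known labels
            else:
                total = sum(neighbours.values())
                new_scores[node] = sum(w * scores[n] for n, w in neighbours.items()) / total
        change = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:                            # repeat until convergence
            break
    return scores

# Two tightly connected triples joined by one weak edge (c-d).
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"e": 1.0},
}
seeds = {"a": 1.0, "f": -1.0}

scores = propagate_labels(graph, seeds)
print({node: round(s, 3) for node, s in scores.items()})
```

The weak c-d edge plays the role of a low-density region: each unlabeled node ends up with the sign of the seed on its own side.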


Conclusion

Semi-Supervised Learning

• Only a few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A

• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

• HiPPO (Highest Paid Person's Opinion): stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
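One common statistical test for an A/B comparison of conversion rates is the two-proportion z-test, sketched below. The choice of test and the visitor and conversion counts are assumptions for illustration. The slides do not say which test was used.

```python
import math

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the standard normal
    return z, p_value

# Hypothetical counts: 10,000 users per variant, 200 vs 260 conversions.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value says a difference this large would be unlikely if A and B performed the same. It is the random assignment of users to variants, not the test itself, that supports the causal reading.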

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

• A was 8.5% better

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"; B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw!

Examples

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
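A quick way to spot placeholder entries such as the birth dates above is to look for values that account for an implausibly large share of the records. The sketch below does that. The threshold and the sample data are made up for illustration.

```python
from collections import Counter

def suspicious_values(values, share_threshold=0.1):
    """Return {value: share} for values that make up at least share_threshold of the records."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= share_threshold}

# Hypothetical birth-date column: two placeholder dates plus 36 ordinary entries.
birth_dates = (["11/11/11"] * 8 + ["01/01/01"] * 6
               + [f"{m:02d}/15/{70 + m}" for m in range(1, 13)] * 3)

flagged = suspicious_values(birth_dates)
print(flagged)
```

A flagged value is only a prompt to look closer. Whether it is a real spike or a data-entry artifact still has to be checked against how the field was collected.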

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you're the decision maker

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

"If you don't know where you are going, any road will take you there."

-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments: contributed by the community

Azure Machine Learning Studio

Sample Experiments: to help you get started

Experiment: tools that you can use in your experiment. For feature selection, a large set of machine learning algorithms


Page 23: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Greatest Challenge in Machine Learning

Deriving Knowledge from Data at Scale

gender age smoker eye color

male 19 yes green

female 44 yes gray

male 49 yes blue

male 12 no brown

female 37 no brown

female 60 no brown

male 44 no blue

female 27 yes brown

female 51 yes green

female 81 yes gray

male 22 yes brown

male 29 no blue

lung cancer

no

yes

yes

no

no

yes

no

no

yes

no

no

no

male 77 yes gray

male 19 yes green

female 44 no gray

yes

no

no

Train ML Model

Deriving Knowledge from Data at Scale

The greatest challenge in Machine LearningLack of Labelled Training Datahellip

What to Do

bull Controlled Experiments ndash get feedback from user to serve as labels

bull Mechanical Turk ndash pay people to label data to build training set

bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)

Deriving Knowledge from Data at Scale

What if you cant get labeled Training Data

Traditional Supervised Learning

bull Promotion on bookseller rsquos web page

bull Customers can rate books

bull Will a new customer like this book

bull Training set observations on previous customers

bull Test set new customers

Whathappensif only few customers rate a book

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Model

Test Data

Prediction

Training Data

Attributes

Target

Label

copy 2013 Datameer Inc All rights reserved

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory no

but we can make assumptions

PopularAssumptions

bull Clustering assumption

bull Low density assumption

bull Manifold assumption

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumption

bull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
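The single-pass procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lecture's Hadoop implementation: the function names, the hinge-style penalty on unlabeled points, the warm start on the labeled instances, and the hyperparameters (lr0, lam, reg) are all choices made for this sketch, and the class-balance constraint is left out for brevity.

```python
import random

def predict_score(w, b, x):
    """Signed score of a linear classifier; the sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sgd_pass(w, b, data, lr0, lam, reg, rng):
    """One run over the data in random order. y is +1/-1, or None if unlabeled."""
    w = list(w)
    order = list(data)
    rng.shuffle(order)
    for t, (x, y) in enumerate(order):
        lr = lr0 / (1.0 + 0.01 * t)              # steps get smaller over time
        w = [wi * (1.0 - lr * reg) for wi in w]  # L2 regularization shrink
        score = predict_score(w, b, x)
        if y is None:
            if abs(score) >= 1.0:
                continue                          # already outside the margin
            # Unlabeled and inside the margin: push it away from the boundary.
            target, weight = (1.0 if score >= 0 else -1.0), lam
        elif y * score < 1.0:
            target, weight = float(y), 1.0        # misclassified or inside margin
        else:
            continue
        w = [wi + lr * weight * target * xi for wi, xi in zip(w, x)]
        b += lr * weight * target
    return w, b

def objective(w, b, data, lam, reg):
    """Regularized hinge loss; unlabeled points are penalized for small |score|."""
    total = 0.5 * reg * sum(wi * wi for wi in w)
    for x, y in data:
        score = predict_score(w, b, x)
        if y is None:
            total += lam * max(0.0, 1.0 - abs(score))
        else:
            total += max(0.0, 1.0 - y * score)
    return total

def train_s3vm(labeled, unlabeled, n_runs=5, lr0=0.1, lam=0.5, reg=0.01, seed=0):
    rng = random.Random(seed)
    # Assumption of this sketch: warm-start on the labeled instances only.
    w, b = [0.0] * len(labeled[0][0]), 0.0
    for _ in range(20):
        w, b = sgd_pass(w, b, labeled, lr0, lam, reg, rng)
    data = labeled + [(x, None) for x in unlabeled]
    best = None
    for _ in range(n_runs):                       # many random runs, keep the best
        cand = sgd_pass(w, b, data, lr0, lam, reg, rng)
        value = objective(cand[0], cand[1], data, lam, reg)
        if best is None or value < best[0]:
            best = (value, cand)
    return best[1]

# Synthetic demo data (hypothetical): two blobs, only four labeled instances.
_rng = random.Random(42)
neg = [(_rng.gauss(-2.0, 0.5), _rng.gauss(0.0, 0.5)) for _ in range(100)]
pos = [(_rng.gauss(2.0, 0.5), _rng.gauss(0.0, 0.5)) for _ in range(100)]
labeled = [((-2.0, 0.5), -1), ((-1.5, -0.5), -1), ((2.0, -0.5), 1), ((1.5, 0.5), 1)]
w, b = train_s3vm(labeled, neg + pos)
correct = sum(predict_score(w, b, x) < 0 for x in neg) + sum(predict_score(w, b, x) > 0 for x in pos)
accuracy = correct / (len(neg) + len(pos))
```

Because the objective is non-convex, different random orders can end in different solutions, which is why the sketch keeps the candidate with the lowest objective value.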


The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected


Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
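A small sketch of label propagation on a nearest-neighbor similarity graph may make the idea concrete. It is only an illustration under assumptions: the helper names, the unweighted symmetric k-nearest-neighbor graph, and the toy points are invented for this example, and labeled instances are simply clamped to their known label.

```python
import math

def knn_graph(points, k):
    """Connect every instance to its k nearest neighbors (made symmetric)."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for _, j in dists[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def label_propagation(points, labels, k=2, max_iter=1000, tol=1e-6):
    """labels[i] is 0 or 1 if known, None otherwise. Returns a score in [0, 1]
    per instance that can be read as the propagated belief in class 1."""
    neighbors = knn_graph(points, k)
    score = [0.5 if y is None else float(y) for y in labels]
    for _ in range(max_iter):
        new = list(score)
        for i, y in enumerate(labels):
            if y is None:  # labeled instances stay clamped
                new[i] = sum(score[j] for j in neighbors[i]) / len(neighbors[i])
        change = max(abs(a - b) for a, b in zip(new, score))
        score = new
        if change < tol:   # repeat until convergence
            break
    return score

# Toy data (hypothetical): two groups on a line, one labeled instance in each.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (13, 0)]
labels = [0, None, None, None, None, None, None, 1]
scores = label_propagation(points, labels)
predicted = [1 if s > 0.5 else 0 for s in scores]
```

The fixed point of this iteration is the solution of a linear system over the unlabeled instances, which is the sense in which the method is equivalent to a matrix inversion; the loop is the iterative alternative that scales to large graphs.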


Conclusion

Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC: Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
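One common way to check that a difference between control (A) and treatment (B) is not due to chance is a two-proportion z-test on conversion counts. The sketch below is an illustration only: the function name and the visitor and conversion counts are made up, and the slides do not say which test was used.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))         # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical experiment: 20,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
significant = p_value < 0.05
```

With these made-up counts the lift from 5.0% to 5.5% gives a p-value of roughly 0.025, so it would be declared significant at the 5% level; with a tenth of the traffic the same rates would not be.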

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same

A | B

• A was 85 better

A
B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A | B

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data, prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw

Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes a click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
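As a rough illustration of reporting a metric with a confidence interval, the following sketch runs k-fold cross-validation with a deliberately simple nearest-centroid classifier. Everything in it is an assumption made for the example: the classifier, the synthetic data, and the normal-approximation interval (with only five folds, a t-based interval would be somewhat wider).

```python
import math
import random
import statistics

def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    groups = {}
    for x, y in rows:
        groups.setdefault(y, []).append(x)
    return {y: tuple(sum(col) / len(xs) for col in zip(*xs)) for y, xs in groups.items()}

def predict(centroids, x):
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

def cross_validate(rows, k=5, seed=0):
    """Return mean accuracy, an approximate 95% half-width, and per-fold scores."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_centroids(train)          # fit on training folds only
        scores.append(sum(predict(model, x) == y for x, y in test) / len(test))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, half_width, scores

# Synthetic two-class data (hypothetical), overlapping on purpose.
_rng = random.Random(1)
rows = [((_rng.gauss(0, 1), _rng.gauss(0, 1)), "no") for _ in range(100)]
rows += [((_rng.gauss(2, 1), _rng.gauss(2, 1)), "yes") for _ in range(100)]
mean_acc, half_width, fold_scores = cross_validate(rows, k=5)
```

The model is refit inside every fold using training rows only; computing anything from the held-out fold before fitting is one of the ways a target leak creeps in.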

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2)-(5) until the test metrics are satisfactory

Access Data | Pre-processing | Feature construction | Model scoring

Book Recommendation

That's all for our course…

Page 24: Barga Data Science lecture 10

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

The greatest challenge in Machine Learning: Lack of Labelled Training Data…

What to Do
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?

Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data

Age  Income  LikesBook
22   67K
39   41K

Prediction (output of the Model)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.


Semi-Supervised Learning

Can we make use of the unlabeled data?
In theory: no
… but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
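The "cluster, then perform majority vote" recipe can be sketched as follows. This is an illustrative example under assumptions: a tiny k-means with a deterministic farthest-point initialization stands in for whatever clustering algorithm is used, and the points and labels are invented.

```python
import math
from collections import Counter

def kmeans(points, k, n_iter=20):
    """Tiny k-means; returns the cluster index assigned to each point."""
    # Deterministic start (an assumption of this sketch): the first point,
    # then repeatedly the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    assign = [0] * len(points)
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label or None; every instance in a cluster receives
    the majority label of the labeled instances that fell into that cluster."""
    assign = kmeans(points, k)
    votes = [Counter() for _ in range(k)]
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[a] for a in assign]

# Toy data (hypothetical): two groups, one labeled instance in each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.3), (5.1, 5.1), (4.8, 4.9)]
labels = ["-", None, None, None, None, "+", None, None, None, None]
predicted = cluster_then_vote(points, labels)
```

A cluster that contains no labeled instance is left as None here; the approach only works as well as the clustering assumption holds for the data.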


Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
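To make the two steps concrete, here is a minimal EM sketch for a mixture of two one-dimensional Gaussians. It is an illustration under assumptions: the function names, the initialization at the smallest and largest observation, the fixed iteration count, and the synthetic data are all choices made for this example.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, n_iter=50):
    """Fit a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    overall_mean = sum(xs) / len(xs)
    var0 = sum((x - overall_mean) ** 2 for x in xs) / len(xs)
    means = [min(xs), max(xs)]            # simple initialization (assumption)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate location and shape of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
    return weights, means, variances

# Synthetic data (hypothetical): two well-separated normal clusters.
_rng = random.Random(0)
xs = [_rng.gauss(0.0, 1.0) for _ in range(200)] + [_rng.gauss(6.0, 1.0) for _ in range(200)]
weights, means, variances = em_two_gaussians(xs)
```

With clusters this far apart the result does not depend much on the starting point; with overlapping clusters, different initializations can end in different local optima, so several restarts are usually worthwhile.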


Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
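The self-training loop above can be sketched with any base learner. The example below uses a nearest-centroid classifier purely as a stand-in; the function names and the toy data are assumptions of this sketch, and "convergence" is taken to mean that the predicted labels of the unlabeled instances stop changing.

```python
import math

def fit_centroids(X, y):
    """Base learner for the sketch: one mean vector per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in groups.items()}

def predict(centroids, x):
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

def self_train(X_lab, y_lab, X_unlab, max_rounds=20):
    model = fit_centroids(X_lab, y_lab)                 # 1. labeled instances only
    pseudo = [predict(model, x) for x in X_unlab]       # 2. apply to unlabeled
    for _ in range(max_rounds):
        model = fit_centroids(X_lab + X_unlab, y_lab + pseudo)  # 3. learn on all
        new_pseudo = [predict(model, x) for x in X_unlab]
        if new_pseudo == pseudo:                        # 4. stop when stable
            break
        pseudo = new_pseudo
    return model, pseudo

# Toy data (hypothetical): two labeled and six unlabeled instances.
X_lab = [(1.0, 1.0), (5.0, 5.0)]
y_lab = ["-", "+"]
X_unlab = [(1.5, 0.5), (0.5, 1.5), (2.0, 2.0), (4.0, 4.5), (5.5, 4.5), (4.5, 5.5)]
model, pseudo = self_train(X_lab, y_lab, X_unlab)
```

Self-training reinforces its own early predictions, so a poor initial model can lock in its mistakes; using only high-confidence predictions as pseudo-labels is a common refinement that this sketch leaves out.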

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

• Controlled experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence".

Ornithologische Monatsberichte 1936; 44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

"If you don't know where you are going, any road will take you there."

-- Lewis Carroll

Deriving Knowledge from Data at Scale


• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions...

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40-page journal version...

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project...

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

[Screenshot callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment]

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

[Workflow diagram, continued: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
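Step 5 can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the one-feature threshold model, the synthetic data, and the helper names (`k_fold_indices`, `cross_validate`, `mean_with_ci`) are all assumptions made for the example, and the interval is a rough normal approximation across folds.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Return per-fold accuracies; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

def mean_with_ci(scores, z=1.96):
    """Mean accuracy and an approximate 95% half-width across folds."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half

# Hypothetical model: threshold halfway between the two class means.
def fit_threshold(X, y):
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Synthetic data (made up): two overlapping Gaussian classes.
rng = random.Random(1)
X = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
y = [0] * 100 + [1] * 100

scores = cross_validate(X, y, fit_threshold, predict_threshold, k=5)
mean, half = mean_with_ci(scores)
print(f"accuracy = {mean:.3f} +/- {half:.3f}")
```

Because the model is refit inside every fold, nothing computed from a held-out fold can leak into training, which is the property the target-leak warning is about.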

Deriving Knowledge from Data at Scale

[Workflow diagram, complete: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

6. Iterate steps (2) - (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course...


Deriving Knowledge from Data at Scale

The greatest challenge in Machine Learning: Lack of Labelled Training Data...

What to Do

• Controlled Experiments - get feedback from users to serve as labels

• Mechanical Turk - pay people to label data to build a training set

• Ask Users to Label Data - report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

Deriving Knowledge from Data at Scale

What if you can't get labeled Training Data?

Traditional Supervised Learning

• Promotion on bookseller's web page

• Customers can rate books

• Will a new customer like this book?

• Training set: observations on previous customers

• Test set: new customers

What happens if only few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)

Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data (label not yet known)

Age  Income  LikesBook
22   67K
39   41K

Prediction (output of the Model applied to the Test Data)

Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.


Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data?

In theory: no

... but we can make assumptions

Popular Assumptions

• Clustering assumption

• Low density assumption

• Manifold assumption

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
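The "cluster, then perform majority vote" recipe can be sketched as follows. This is an illustrative sketch under stated assumptions: plain k-means with a deterministic farthest-first seeding (chosen here only so the toy run is reproducible), made-up 2-D points, and hypothetical function names.

```python
from collections import Counter

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k):
    """Deterministic seeding: start at points[0], then keep adding the
    point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers

def assign_clusters(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: dist2(p, centers[c])) for p in points]

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = farthest_first_centers(points, k)
    for _ in range(iters):
        assign = assign_clusters(points, centers)
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign_clusters(points, centers)

def cluster_then_vote(points, labels, k=2):
    """labels maps point index -> class for the few labeled points."""
    assign = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for i, label in labels.items():
        votes[assign[i]][label] += 1
    # A cluster without any labeled member stays unlabeled (None).
    cluster_label = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [cluster_label[a] for a in assign]

# Toy data (made up): two well-separated blobs, only three labels in total.
blob_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
blob_b = [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(5) for j in range(5)]
points = blob_a + blob_b
labels = {0: "+", 3: "+", 25: "-"}

predicted = cluster_then_vote(points, labels)
print(predicted[10], predicted[40])
```

The vote only helps when the clustering assumption actually holds; if a class is spread over several clusters or two classes share one, the majority label is wrong for part of the cluster.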


Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum
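The two EM steps can be made concrete with a one-dimensional, two-component mixture. This is a minimal sketch on synthetic data; the initialisation at the data extremes, the fixed iteration count, and the name `em_two_gaussians` are assumptions for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=50):
    # Crude initialisation (an assumption): means at the data extremes.
    mu = [min(xs), max(xs)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: probability that each component generated each point.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate location, shape and weight of each cluster
        # from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
            weight[k] = nk / len(xs)
    return mu, sigma, weight, resp

# Synthetic data (made up): two clusters centred at 0 and 6.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(6, 1) for _ in range(200)]

mu, sigma, weight, resp = em_two_gaussians(xs)
print([round(m, 2) for m in mu], [round(w, 2) for w in weight])
```

In a semi-supervised setting the labeled instances would have their responsibilities fixed to their known class, while the unlabeled ones keep the soft estimates computed in the E-step.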


Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
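The four self-training steps map directly onto a short loop. The sketch below assumes a nearest-centroid classifier as the base model and a handful of made-up 2-D points; any classifier with fit and predict steps could take its place, and all names are hypothetical.

```python
def fit_centroids(X, y):
    """Nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, x):
    """Class whose centroid is closest to x."""
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])))

def self_train(X_lab, y_lab, X_unlab, max_rounds=20):
    model = fit_centroids(X_lab, y_lab)                    # 1. labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in X_unlab]  # 2. apply to unlabeled
        if new_pseudo == pseudo:                           # 4. stop once labels are stable
            break
        pseudo = new_pseudo
        model = fit_centroids(X_lab + X_unlab, y_lab + pseudo)  # 3. retrain on all
    return model, pseudo

# Toy data (made up): one labeled point per class, six unlabeled points.
X_lab = [(0.0, 0.0), (4.0, 4.0)]
y_lab = ["+", "-"]
X_unlab = [(0.5, 0.2), (1.0, 0.8), (0.2, 1.1), (3.5, 3.9), (3.1, 3.0), (4.2, 3.3)]

model, pseudo = self_train(X_lab, y_lab, X_unlab)
print(pseudo)
```

Self-training reinforces its own early mistakes, so in practice it is common to add only the most confident pseudo-labels in each round; that refinement is left out here.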

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances


Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem


Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
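A single-machine sketch of the procedure above, without the Hadoop mapper/reducer split. The specifics are assumptions rather than the slide's implementation: labeled points use a hinge loss, unlabeled points use a symmetric hinge that pushes them out of the margin on whichever side they currently fall, there is no regularisation term or class-balance constraint, and an unlabeled point with a score of exactly zero is skipped because it has no side yet. All names and data are hypothetical.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm(labeled, unlabeled, unlabeled_weight=0.5, eta0=0.5, seed=0):
    """One pass over labeled (x, y) pairs, y in {-1, +1}, and unlabeled x."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)        # one run in random order
    w, b = [0.0] * len(data[0][0]), 0.0
    for t, (x, y) in enumerate(data):
        eta = eta0 / (1 + t)                 # steps get smaller over time
        score = dot(w, x) + b
        if y is None:
            # Unlabeled: only move if the point sits inside the margin.
            if score == 0 or abs(score) >= 1:
                continue
            y, scale = (1 if score > 0 else -1), unlabeled_weight
        else:
            # Labeled: only move on a margin violation.
            if y * score >= 1:
                continue
            scale = 1.0
        w = [wi + eta * scale * y * xi for wi, xi in zip(w, x)]
        b += eta * scale * y
    return w, b

def objective(w, b, labeled, unlabeled, unlabeled_weight=0.5):
    """Hinge loss on labeled data plus weighted symmetric hinge on unlabeled data."""
    loss = sum(max(0.0, 1 - y * (dot(w, x) + b)) for x, y in labeled)
    loss += unlabeled_weight * sum(max(0.0, 1 - abs(dot(w, x) + b)) for x in unlabeled)
    return loss

def best_of_runs(labeled, unlabeled, runs=20):
    """The problem is non-convex, so try several random orders and keep the best."""
    candidates = [sgd_s3vm(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled))

# Toy 1-D data (made up): two labeled and six unlabeled points.
labeled = [((2.0,), 1), ((-2.0,), -1)]
unlabeled = [(1.5,), (2.5,), (1.8,), (-1.5,), (-2.5,), (-2.2,)]

w, b = best_of_runs(labeled, unlabeled)
print(w, b)
```

The `unlabeled_weight` argument plays the role of the parameter that tunes the influence of unlabeled instances; setting it to zero reduces the loop to ordinary one-pass SGD on the labeled points.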


Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected


Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
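The idea can be sketched on a small nearest-neighbour graph. This is an illustrative sketch, not the lecture's implementation: the graph is an unweighted, symmetrised k-nearest-neighbour graph, labeled nodes are clamped to +1 or -1, and each unlabeled node repeatedly takes the average score of its neighbours. The function names and the toy points are assumptions.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as a list of adjacency sets."""
    n = len(points)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        for j in order[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def propagate(nbrs, labels, tol=1e-6, max_iters=1000):
    """labels maps node index -> +1 or -1 for the few labeled nodes."""
    score = [float(labels.get(i, 0.0)) for i in range(len(nbrs))]
    for _ in range(max_iters):
        new = [
            float(labels[i]) if i in labels                   # labeled nodes stay clamped
            else sum(score[j] for j in nbrs[i]) / len(nbrs[i])
            for i in range(len(nbrs))
        ]
        delta = max(abs(a - b) for a, b in zip(new, score))
        score = new
        if delta < tol:                                       # repeat until convergence
            break
    # Sign of the converged score; 0 means no label information reached the node.
    return [1 if s > 0 else -1 if s < 0 else 0 for s in score]

# Toy data (made up): two groups on a line, one labeled point in each.
points = [(0.1 * i,) for i in range(5)] + [(5 + 0.1 * i,) for i in range(5)]
labels = {0: 1, 9: -1}

result = propagate(knn_graph(points, k=2), labels)
print(result)
```

The fixed point of this iteration satisfies a linear system in the unlabeled scores, which is why the procedure can equivalently be solved by a matrix inversion instead of iterating.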


Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break...

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

[Image labels: Left Elevator, Right Elevator]

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
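One common way to check that a difference is not due to chance is a two-proportion z-test on the conversion rates of control (A) and treatment (B). The sketch below is one possible test, not necessarily the one used in the lecture; the visitor and conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided tail probability
    return z, p_value

# Hypothetical experiment: 20,000 users per variant, 5.0% vs 5.5% conversion.
z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value says the observed difference would be unlikely if A and B truly converted at the same rate; running the same function on an A/A split is a quick sanity check that the pipeline does not report differences where none exist.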

Deriving Knowledge from Data at Scale

[Screenshots: A and B]

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

Deriving Knowledge from Data at Scale

• A was 8.5% better

Deriving Knowledge from Data at Scale

[Screenshots: A and B]

Differences:

• A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

• B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

[Screenshots: A and B]

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples:

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion


Deriving Knowledge from Data at Scale

What if you cant get labeled Training Data

Traditional Supervised Learning

bull Promotion on bookseller rsquos web page

bull Customers can rate books

bull Will a new customer like this book

bull Training set observations on previous customers

bull Test set new customers

Whathappensif only few customers rate a book

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Model

Test Data

Prediction

Training Data

Attributes

Target

Label

copy 2013 Datameer Inc All rights reserved

Age Income LikesBook

24 60K +

65 80K -

60 95K -

35 52K +

20 45K +

43 75K +

26 51K +

52 47K -

47 38K -

25 22K -

33 47K +

Age Income LikesBook

22 67K

39 41K

Age Income LikesBook

22 67K +

39 41K -

Deriving Knowledge from Data at Scale

Semi-Supervised Learning

Can we make use of the unlabeled data

In theory no

but we can make assumptions

PopularAssumptions

bull Clustering assumption

bull Low density assumption

bull Manifold assumption

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumption

bull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
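The self-training loop can be sketched as follows. The base learner here is a nearest-centroid classifier on 2-D points, chosen only to keep the example short; the slide does not prescribe a learner, and all names and data below are hypothetical.

```python
def fit_centroids(points, labels):
    """Nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def predict(centroids, point):
    """Assign the class whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda c: (point[0] - centroids[c][0]) ** 2 + (point[1] - centroids[c][1]) ** 2,
    )

def self_train(labeled, labels, unlabeled, max_rounds=20):
    # Step 1: learn a model on the labeled instances only.
    model = fit_centroids(labeled, labels)
    previous = None
    for _ in range(max_rounds):
        # Step 2: apply the model to the unlabeled instances.
        pseudo = [predict(model, p) for p in unlabeled]
        # Step 4: stop once the pseudo-labels no longer change.
        if pseudo == previous:
            break
        previous = pseudo
        # Step 3: learn a new model on all instances.
        model = fit_centroids(labeled + unlabeled, labels + pseudo)
    return model, previous

labeled = [(0.0, 0.0), (4.0, 4.0)]
labels = ["a", "b"]
unlabeled = [(0.5, 0.2), (1.0, 0.8), (3.5, 4.2), (3.0, 3.1), (0.2, 1.1), (4.4, 3.6)]
model, pseudo_labels = self_train(labeled, labels, unlabeled)
```

Early mistakes in the pseudo-labels feed back into later rounds, so in practice one might only accept confident predictions; that refinement is left out here.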

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
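To make "margin to closest instances" concrete, the sketch below computes the geometric margin of a given linear boundary, i.e. the smallest signed distance from any instance to the hyperplane. It only evaluates candidate boundaries and does not train an SVM; the function name and the toy points are assumptions.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any instance to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, 1, 1]

# Two candidate boundaries that both separate the classes.
margin_centered = geometric_margin((1.0, 0.0), -2.0, points, labels)  # boundary at x = 2
margin_shifted = geometric_margin((1.0, 0.0), -1.5, points, labels)   # boundary at x = 1.5
```

Both boundaries classify every point correctly, but the centered one has the larger margin, which is the one a maximum-margin classifier prefers.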

The Low Density Assumption

Semi-Supervised SVM
• Maximize the margin (distance from the decision boundary) to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem
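One common way to write the semi-supervised SVM objective is given below. It is a sketch, not necessarily the exact formulation used in the lecture. The notation is assumed: l labeled pairs (x_i, y_i) with y_i in {-1, +1}, u unlabeled instances x_j, a linear scorer f, a trade-off constant C for labeled data, and C* as the parameter that tunes the influence of unlabeled instances.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w \rVert^2
  + C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i f(x_i)\bigr)
  + C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr)
\qquad \text{subject to} \qquad
\frac{1}{u} \sum_{j=l+1}^{l+u} f(x_j) = \frac{1}{l} \sum_{i=1}^{l} y_i ,
\qquad f(x) = w^{\top} x + b .
```

The third term penalizes unlabeled instances that fall inside the margin on either side, and its absolute value is what makes the problem non-convex. The constraint is one way to express "keep class balance correct".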

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
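A single-process sketch of the procedure above, assuming a linear classifier without a bias term, hinge loss for labeled instances, and a loss of max(0, 1 - |score|) for unlabeled ones. It does not use Hadoop and omits the class-balance constraint; the step schedule, parameter values, and data are assumptions chosen for illustration.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_pass(labeled, unlabeled, unlabeled_weight=0.5, lam=0.01, seed=0):
    """One pass over the shuffled data; returns linear weights (no bias term)."""
    rng = random.Random(seed)
    stream = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(stream)                       # random order
    w = [0.0] * len(stream[0][0])
    for t, (x, y) in enumerate(stream, start=1):
        step = 1.0 / (1.0 + 0.1 * t)          # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0:
                continue                       # no side to push toward yet
            # Unlabeled: push away from the boundary on the current side.
            y, weight = (1 if score > 0 else -1), unlabeled_weight
        else:
            weight = 1.0
        w = [wi * (1.0 - step * lam) for wi in w]    # regularization shrink
        if y * score < 1.0:                   # inside the margin or misclassified
            w = [wi + step * weight * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, unlabeled_weight=0.5, lam=0.01):
    """Regularizer + hinge loss on labeled + margin loss on unlabeled instances."""
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in labeled)
    hat = sum(max(0.0, 1.0 - abs(dot(w, x))) for x in unlabeled)
    return 0.5 * lam * dot(w, w) + hinge + unlabeled_weight * hat

def best_of_runs(labeled, unlabeled, runs=10):
    """Many random runs; keep the classifier with the lowest objective."""
    candidates = [sgd_pass(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled))

labeled = [((2.0, 1.5), 1), ((-1.5, -2.0), -1)]
unlabeled = [(1.8, 2.2), (2.5, 1.0), (1.2, 1.9), (-2.1, -1.4), (-1.0, -2.3), (-2.4, -2.0)]
w = best_of_runs(labeled, unlabeled)
```

In a MapReduce setting the shuffle would be done by the mapper and the update loop by the reducer, with each random run executed independently.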

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
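A small sketch of building such a similarity graph: each instance is linked to its k nearest neighbors, and each edge carries a Gaussian (RBF) similarity score. The choice of Euclidean distance, the RBF score, the bandwidth sigma, and the symmetric edges are assumptions; the slide does not fix any of them.

```python
import math

def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: similarity}} linking each instance to its k nearest neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        # Sort the other instances by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            # Gaussian similarity score; sigma is an assumed bandwidth.
            sim = math.exp(-math.dist(p, points[j]) ** 2 / (2 * sigma ** 2))
            graph[i][j] = sim
            graph[j][i] = sim      # keep the graph symmetric
    return graph

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
graph = knn_graph(points, k=2)
```

The nested-dictionary layout is used only because it is easy to read; for large data a sparse matrix and an approximate neighbor search would be the more usual choice.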

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
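A minimal sketch of iterative label propagation on a weighted graph, assuming the variant in which labeled nodes are clamped to their known label and every other node repeatedly takes the weighted average of its neighbors' scores. The graph format and all names are hypothetical, and every node is assumed to have at least one neighbor.

```python
def propagate_labels(graph, seeds, n_classes=2, iterations=100, tol=1e-9):
    """graph: {node: {neighbor: weight}}; seeds: {node: class index}."""
    # One-hot scores for labeled nodes, uniform scores for the rest.
    scores = {
        n: [1.0 if seeds.get(n) == c else 0.0 for c in range(n_classes)]
        if n in seeds else [1.0 / n_classes] * n_classes
        for n in graph
    }
    for _ in range(iterations):
        change = 0.0
        updated = {}
        for n, nbrs in graph.items():
            if n in seeds:                 # labeled nodes are clamped
                updated[n] = scores[n]
                continue
            total = sum(nbrs.values())
            # Weighted average of the neighbors' current label scores.
            new = [
                sum(w * scores[m][c] for m, w in nbrs.items()) / total
                for c in range(n_classes)
            ]
            change = max(change, max(abs(a - b) for a, b in zip(new, scores[n])))
            updated[n] = new
        scores = updated
        if change < tol:                   # repeat until convergence
            break
    return {n: max(range(n_classes), key=lambda c: s[c]) for n, s in scores.items()}

# A chain 0 - 1 - 2 - 3 - 4 - 5 with unit weights; only the end points are labeled.
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {4: 1.0},
}
predicted = propagate_labels(graph, seeds={0: 0, 5: 1})
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.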

Conclusion

Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
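As one example of such a statistical test, the sketch below runs a two-proportion z-test on conversion counts from a control (A) and a treatment (B). The choice of this test, the function name, and the counts are assumptions for illustration; the counts are made up and are not results from the lecture.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, made up for illustration only.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
z_same, p_same = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=500, n_b=10_000)
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; the threshold is a decision made before the experiment, not something the test supplies.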

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A
B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"; B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing," find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection
Model training
Model scoring
Evaluation
Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.
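Step 5 mentions cross-validation and confidence intervals. A minimal sketch of k-fold cross-validation with a rough normal-approximation interval across folds follows. The helper names, the toy threshold "model", and the synthetic data are hypothetical, and the interval is only a crude indication because fold scores are not independent.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, train_fn, predict_fn, k=5):
    """Return per-fold accuracies; each fold is held out exactly once."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = train_fn([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict_fn(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return scores

def mean_and_interval(scores, z=1.96):
    """Mean and a rough 95% normal-approximation interval across folds."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    half = z * math.sqrt(var / len(scores))
    return mean, (mean - half, mean + half)

# Toy model: a threshold halfway between the two class means (1-D input).
def train_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Synthetic data: two overlapping Gaussian classes (made up for this example).
rng = random.Random(1)
xs = [rng.gauss(-2, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50
scores = cross_validate(xs, ys, train_threshold, predict_threshold)
mean, (low, high) = mean_and_interval(scores)
```

The model is retrained inside every fold, so nothing from the held-out instances reaches training; doing feature construction on the full dataset before splitting is a typical source of the target leaks mentioned in step 5.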

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course…

Page 27: Barga Data Science lecture 10

Semi-Supervised Learning

Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
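The "cluster, then perform majority vote" recipe can be sketched with a plain k-means in pure Python. The function names, the initial centers, and the toy data are assumptions; starting the centers at two points from different groups is a convenience for the example, not part of the method.

```python
import math
from collections import Counter

def k_means(points, centers, iterations=20):
    """Plain k-means; returns the cluster index of every point."""
    assignment = []
    for _ in range(iterations):
        # Assign every instance to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

def cluster_then_vote(points, known_labels, initial_centers):
    """known_labels: {point index: label} for the few labeled instances."""
    assignment = k_means(points, list(initial_centers))
    votes = {}
    for i, label in known_labels.items():
        votes.setdefault(assignment[i], Counter())[label] += 1
    # Every instance gets the majority label of its cluster (None if no votes).
    return [
        votes[a].most_common(1)[0][0] if a in votes else None
        for a in assignment
    ]

points = [(0.1, 0.2), (0.4, 0.0), (0.0, 0.5), (0.3, 0.3),
          (5.0, 5.1), (5.2, 4.8), (4.7, 5.3), (5.1, 5.0)]
known_labels = {0: "neg", 1: "neg", 4: "pos"}   # only three instances are labeled
predicted = cluster_then_vote(points, known_labels, initial_centers=[points[0], points[4]])
```

The approach only works when the classes really do form distinct clusters; a cluster that contains no labeled instance receives no label at all.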

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform a majority vote
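The "cluster, then perform a majority vote" recipe can be sketched in a few lines. This is only an illustration under assumptions not in the slides: a plain k-means written from scratch, toy 2-D points, and hypothetical helper names (`kmeans`, `cluster_then_vote`). Unlabeled instances are marked with `None`.

```python
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label, or None when instance i is unlabeled."""
    assign = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    # Every instance inherits the majority label of its cluster.
    return [votes[a].most_common(1)[0][0] if votes[a] else None for a in assign]

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["neg", None, None, None, "pos", None]
predicted = cluster_then_vote(points, labels)
```

With one labeled instance per cluster, every unlabeled instance simply takes the label of the cluster it falls in; the approach only works when the clustering assumption actually holds.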

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find the most probable location and shape of clusters given the data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to a local optimum
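A minimal sketch of the two EM steps for a mixture of two one-dimensional Gaussians. The initialization (means at the data extremes, unit variances), the variance floor, and the toy data are assumptions made for illustration, not part of the lecture.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    # Crude initialization (an assumption): start the means at the data extremes.
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weight, location and shape from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density well defined
    return weights, means, variances, resp

xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
weights, means, variances, resp = em_two_gaussians(xs)
```

On well-separated data like this the result does not depend much on the starting point; on overlapping clusters a different initialization can end in a different local optimum, which is why several restarts are common.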

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as the mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
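The self-training loop above can be sketched as follows. The base learner here is a nearest-centroid classifier on one-dimensional inputs, chosen only to keep the example short; any classifier could take its place. The function names are hypothetical.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid model: one mean per class (a stand-in for any base learner)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)           # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]  # apply to unlabeled instances
        if new_pseudo == pseudo:                           # converged: labels stopped changing
            break
        pseudo = new_pseudo
        # Learn a new model on all instances, using the predicted labels.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return model, pseudo

model, pseudo = self_train([1.0, 9.0], ["a", "b"], [2.0, 3.0, 4.0, 6.0, 7.0, 8.0])
```

A common refinement, not shown here, is to add only the most confidently predicted instances in each round so that early mistakes are less likely to reinforce themselves.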

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin to both labeled and unlabeled instances

• Parameter to fine-tune the influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But a non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves the classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find the best one
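A single-machine sketch of the stochastic gradient descent idea, leaving out Hadoop entirely. Several details are assumptions made to keep it short: a linear classifier through the origin (no bias term), a 1/(1+t) step-size schedule, L2 shrinkage, an unlabeled instance being pushed away from the boundary on whichever side it currently falls, and the hypothetical parameter `u` weighting unlabeled instances. The class-balance constraint is not implemented.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def hinge(margin):
    return max(0.0, 1.0 - margin)

def sgd_run(labeled, unlabeled, seed, lr0=0.1, lam=0.01, u=0.5):
    """One pass over all instances in random order; returns the weight vector."""
    rng = random.Random(seed)
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                       # steps get smaller over time
        f = dot(w, x)
        if y is None:
            if f == 0.0:
                continue                          # no side to push towards yet
            y, scale = (1 if f > 0 else -1), u    # guess the label from the current side
        else:
            scale = 1.0
        w = [wi * (1 - lr * lam) for wi in w]    # L2 shrinkage
        if y * f < 1:                             # misclassified or inside the margin
            w = [wi + lr * scale * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, lam=0.01, u=0.5):
    loss = 0.5 * lam * dot(w, w)
    loss += sum(hinge(y * dot(w, x)) for x, y in labeled)
    loss += u * sum(hinge(abs(dot(w, x))) for x in unlabeled)
    return loss

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (3.0, 3.0), (-1.5, -2.5), (-2.5, -1.5), (-3.0, -3.0)]
# Many random runs; keep the one with the lowest objective.
best_w = min((sgd_run(labeled, unlabeled, seed) for seed in range(10)),
             key=lambda w: objective(w, labeled, unlabeled))
predictions = [1 if dot(best_w, x) > 0 else -1 for x in unlabeled]
```

Because the unlabeled term makes the objective non-convex, different random orders can end in different solutions, which is the reason for comparing several runs by objective value.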

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids the curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
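A small sketch of label propagation on a similarity graph, for a two-class problem where each node carries a score between 0 and 1. The graph (a six-node chain), the unit similarities, and the choice to clamp labeled nodes to their labels are illustrative assumptions.

```python
def label_propagation(neighbors, seeds, iters=1000, tol=1e-9):
    """neighbors: {node: {neighbor: similarity}}; seeds: {node: score in [0, 1]}."""
    scores = {n: seeds.get(n, 0.5) for n in neighbors}
    for _ in range(iters):
        biggest_change = 0.0
        for n, nbrs in neighbors.items():
            if n in seeds:
                continue  # labeled instances stay clamped to their label
            total = sum(nbrs.values())
            # New score: similarity-weighted average of the neighbors' scores.
            new = sum(w * scores[m] for m, w in nbrs.items()) / total
            biggest_change = max(biggest_change, abs(new - scores[n]))
            scores[n] = new
        if biggest_change < tol:
            break
    return scores

# A chain 0 - 1 - 2 - 3 - 4 - 5 with unit similarities; only the ends are labeled.
chain = {i: {j: 1.0 for j in (i - 1, i + 1) if 0 <= j <= 5} for i in range(6)}
scores = label_propagation(chain, seeds={0: 1.0, 5: 0.0})
labels = {n: ("pos" if s > 0.5 else "neg") for n, s in scores.items()}
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion; on the chain the scores fall off linearly from one labeled end to the other.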

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only a few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
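One common way to check whether a difference in conversion rate between control (A) and treatment (B) could be due to chance is a two-proportion z-test. The sketch below is an illustration only; the lecture does not say which test it has in mind, and the visitor and conversion counts are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Returns (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 2.0% vs 2.6% conversion with 10,000 users in each variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=260, n_b=10000)
```

A small p-value says the observed difference would be unlikely if both variants truly converted at the same rate; it is the random assignment of users to variants, not the test itself, that supports the causal reading.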

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data, prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing," find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

• Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris

Measure and Control

Accept Results (avoid Semmelweis Reflex)

Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project

Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
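A minimal sketch of k-fold cross-validation that reports accuracy together with an approximate confidence interval. The toy threshold learner, the data, and the normal-approximation interval (z = 1.96 over fold scores) are assumptions for illustration; with only a handful of folds, and folds that share training data, the interval is rough.

```python
import math
import random
import statistics

def k_fold_scores(xs, ys, fit, predict, k=5, seed=0):
    """Accuracy on each of k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        test = set(held_out)
        train = [i for i in idx if i not in test]
        # Fit only on the training part, so nothing from the held-out fold leaks in.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

def mean_with_interval(scores, z=1.96):
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)

# Toy learner (an assumption): a threshold halfway between the two class means.
def fit(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

xs = [float(i) for i in range(20)]
ys = [0] * 10 + [1] * 10
scores = k_fold_scores(xs, ys, fit, predict)
mean, interval = mean_with_interval(scores)
```

Any preprocessing that looks at the target (feature selection, target encoding, scaling fitted on all rows) belongs inside `fit` for the same reason: done outside the loop, it is one of the usual sources of target leaks.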

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory

Deriving Knowledge from Data at Scale

Access Data

Pre-processing

Feature construction

Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Page 29: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

The ClusteringAssumption

Clustering

bull Partition instances into groups (clusters) of similar

instances

bull Many different algorithms k-Means EM etc

Clustering Assumptionbull The two classification targets are distinct clusters

bull Simple semi-supervised learning cluster then

perform majority vote

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one


The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
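A similarity graph of this kind can be sketched in a few lines. The Gaussian similarity score, the `sigma` parameter, and the function name are assumptions chosen for illustration; the slide does not say which similarity measure was used.

```python
import math

def knn_similarity_graph(points, k, sigma=1.0):
    """Connect each point to its k nearest neighbors.

    Returns {i: {j: similarity}} with symmetric edges. The similarity is a
    Gaussian kernel of the Euclidean distance (assumed choice).
    """
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            sim = math.exp(-d * d / (2.0 * sigma ** 2))
            graph[i][j] = sim
            graph[j][i] = sim          # keep the graph undirected
    return graph

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_similarity_graph(points, k=1)
```

With `k=1` the four points fall into two disconnected pairs, which is the structure that graph-based learners exploit.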


Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
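The iteration can be sketched as follows, assuming a weighted graph given as nested dictionaries and a handful of seed nodes whose labels stay clamped. The function name, the clamping rule, and the stopping tolerance are illustrative assumptions.

```python
def propagate_labels(graph, seeds, n_classes, max_iter=1000, tol=1e-9):
    """Iterative label propagation (illustrative sketch).

    graph: {node: {neighbor: weight}}
    seeds: {node: class index} for the labeled nodes (kept fixed)
    Returns {node: predicted class index}.
    """
    scores = {v: [0.0] * n_classes for v in graph}
    for v, c in seeds.items():
        scores[v][c] = 1.0
    for _ in range(max_iter):
        new_scores, delta = {}, 0.0
        for v, nbrs in graph.items():
            total = sum(nbrs.values())
            if v in seeds or total == 0:
                new_scores[v] = scores[v]          # clamp seeds, skip isolated nodes
                continue
            # Weighted average of the neighbors' current class scores.
            new_scores[v] = [
                sum(w * scores[u][c] for u, w in nbrs.items()) / total
                for c in range(n_classes)
            ]
            delta = max(delta, max(abs(a - b)
                                   for a, b in zip(new_scores[v], scores[v])))
        scores = new_scores
        if delta < tol:                            # repeat until convergence
            break
    return {v: max(range(n_classes), key=lambda c: scores[v][c]) for v in graph}

# A chain 0 - 1 - 2 - 3 with one labeled node at each end.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
labels = propagate_labels(chain, seeds={0: 0, 3: 1}, n_classes=2)
```

The fixed point of this averaging is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is "equivalent to matrix inversion".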


Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

[Figure: Left Elevator, Right Elevator]

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO (Highest Paid Person's Opinion): stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
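As one example of such a statistical test, a two-proportion z-test compares the conversion rates of control (A) and treatment (B). This is a minimal sketch with assumed names and made-up counts; the lecture does not specify which test was used.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a / n_a: conversions and users in control (A)
    conv_b / n_b: conversions and users in treatment (B)
    Returns (z, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail probability
    return z, p_value

# Hypothetical counts, for illustration only.
z_same, p_same = two_proportion_z_test(100, 1000, 100, 1000)
z_diff, p_diff = two_proportion_z_test(100, 1000, 200, 1000)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; the causal reading rests on the random assignment of users to A and B.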

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something, when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
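To make step 5 concrete, here is a minimal k-fold cross-validation sketch that reports mean accuracy with an approximate 95% interval. The helper names and the toy threshold model are hypothetical, and the normal-approximation interval over fold scores is only a rough guide because the folds share training data.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return (mean accuracy, (low, high)) over k held-out folds."""
    accs = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        accs.append(correct / len(held_out))
    mean = statistics.mean(accs)
    half = 1.96 * statistics.stdev(accs) / math.sqrt(k)   # rough 95% interval
    return mean, (mean - half, mean + half)

# Toy model (hypothetical): a threshold halfway between the two classes.
def fit_threshold(xs, ys):
    top_neg = max(x for x, y in zip(xs, ys) if y == 0)
    low_pos = min(x for x, y in zip(xs, ys) if y == 1)
    return (top_neg + low_pos) / 2.0

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

xs = list(range(10)) + list(range(100, 110))
ys = [0] * 10 + [1] * 10
mean_acc, (ci_low, ci_high) = cross_validate(xs, ys, fit_threshold, predict_threshold)
```

The model is refit inside every fold, so nothing learned from a held-out fold can leak into its own evaluation. Perfect scores on real data, unlike on this deliberately separable toy set, are a reason to look for a target leak.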

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2)-(5) until the test metrics are satisfactory.

Access Data | Pre-processing | Feature construction | Model scoring

Book Recommendation

That's all for our course…


The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
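The "cluster, then majority vote" recipe can be sketched with a small k-means. The function names and the explicit initial centers (passed in to keep the example deterministic) are assumptions made for illustration.

```python
import math
from collections import Counter

def assign_clusters(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, init_centers, n_iter=20):
    """Plain k-means; returns the final cluster index of every point."""
    centers = list(init_centers)
    for _ in range(n_iter):
        assign = assign_clusters(points, centers)
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:                       # leave empty clusters where they are
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign_clusters(points, centers)

def cluster_then_label(points, labels, init_centers):
    """labels[i] is a class label, or None for an unlabeled instance."""
    assign = kmeans(points, init_centers)
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    # Every instance receives the majority label of its cluster.
    return [votes[a].most_common(1)[0][0] if a in votes else None for a in assign]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["neg", None, None, "pos", None, None]
predicted = cluster_then_label(points, labels, init_centers=[(0, 0), (5, 5)])
```

Only two of the six points carry labels here; the rest inherit a label from their cluster, which works only as long as the clustering assumption actually holds.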


Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of clusters given the data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
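The two steps can be sketched for a one-dimensional mixture of Gaussians. The function names, the starting values, and the small floor on the standard deviation are assumptions for illustration; this version is fully unsupervised and does not use any labels.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, location and shape of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsed cluster
    return mus, sigmas, weights

xs = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
mus, sigmas, weights = em_gmm_1d(xs, mus=[1.0, 8.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
```

Different starting values can lead to different answers, which is the local-optimum caveat on the slide. Poorly chosen starts can also leave a component with no mass, a case this sketch does not guard against.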


Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification

Self-Training
• Learn a model on labeled instances only
• Apply the model to unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
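The self-training loop can be sketched with any base learner. A nearest-centroid classifier is used here purely as a stand-in, and all names are hypothetical.

```python
import math

def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean point of every class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in sums[y]) for y in sums}

def predict_centroid(model, x):
    return min(model, key=lambda y: math.dist(x, model[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)          # labeled instances only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict_centroid(model, x) for x in unlabeled_x]
        if new_guesses == guesses:                       # converged: guesses stopped changing
            break
        guesses = new_guesses
        # Learn a new model on all instances, using the guesses as labels.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + guesses)
    return model, guesses

labeled_x = [(0.0, 0.0), (4.0, 4.0)]
labeled_y = ["a", "b"]
unlabeled_x = [(1.0, 1.0), (0.5, 0.0), (3.0, 3.5), (5.0, 5.0)]
model, guesses = self_train(labeled_x, labeled_y, unlabeled_x)
```

Early mistakes get fed back in as if they were true labels, so practical variants often add only the most confident predictions in each round.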

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances


Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection → Model training → Model scoring → Evaluation (Train/Test split)

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
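As a rough illustration of step 5, the sketch below runs k-fold cross-validation and reports the mean accuracy with an approximate interval. Everything in it is an assumption made for illustration and not taken from the slides: the toy one-dimensional data, the threshold "model", the helper names, and the normal-approximation interval over fold scores (a common but crude choice).

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, train_fn, k=5, seed=0):
    """Return one held-out accuracy per fold; train_fn returns a predict function."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

def mean_with_interval(scores, z=1.96):
    """Mean of the fold scores and a rough normal-approximation interval."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

def train_threshold(train_x, train_y):
    """Toy model (hypothetical): cut halfway between the two class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    cut = (statistics.mean(pos) + statistics.mean(neg)) / 2
    return lambda x: 1 if x > cut else 0

# Synthetic data: two overlapping classes on a line.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100

scores = cross_validate(xs, ys, train_threshold, k=5)
mean_acc, low, high = mean_with_interval(scores)
print(f"accuracy = {mean_acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

A test accuracy far above what the interval from earlier runs would suggest is one practical hint that a feature may be leaking the target.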

Deriving Knowledge from Data at Scale

Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction → Feature selection → Model training → Model scoring → Evaluation (Train/Test split)

6. Iterate steps (2)–(5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data → Pre-processing → Feature construction → Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Deriving Knowledge from Data at Scale

The Clustering Assumption

Clustering

• Partition instances into groups (clusters) of similar instances

• Many different algorithms: k-Means, EM, etc.

Clustering Assumption

• The two classification targets are distinct clusters

• Simple semi-supervised learning: cluster, then perform majority vote
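The "cluster, then perform majority vote" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: it uses plain k-means with a deterministic farthest-first initialisation, and the function names and toy points are hypothetical.

```python
from collections import Counter

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means on tuples of floats; returns one cluster index per point."""
    # Farthest-first initialisation (an assumption, chosen to be deterministic).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign

def cluster_then_label(points, labels, k=2):
    """labels[i] is a class or None; every point gets its cluster's majority label."""
    assign = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    cluster_label = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [cluster_label[a] for a in assign]

points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2)]
labels = ["neg", None, None, None, "pos", None, None, None]  # one label per cluster
predicted = cluster_then_label(points, labels)
print(predicted)
```

A cluster with no labeled member receives `None`, which makes visible where the clustering assumption gives no information.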

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussians

• Assumption: the data in each cluster is generated by a normal distribution

• Find most probable location and shape of clusters given data

Expectation-Maximization

• Two-step optimization procedure

• Keeps estimates of cluster assignment probabilities for each instance

• Might converge to local optimum
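To make the two steps concrete, here is a minimal sketch of EM for a mixture of two one-dimensional Gaussians. The initialisation (means at the data extremes, unit standard deviations), the fixed iteration count, and the synthetic data are assumptions for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture; returns (means, sigmas, weights, responsibilities)."""
    mu = [min(xs), max(xs)]      # crude initialisation (assumption)
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # guard against collapse
            weight[k] = nk / len(xs)
    return mu, sigma, weight, resp

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
mu, sigma, weight, resp = em_two_gaussians(xs)
print("means:", [round(m, 2) for m in mu], "weights:", [round(w, 2) for w in weight])
```

Because the result depends on the starting point, a different initialisation can end in a different (local) optimum, which is the caveat in the last bullet.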

Deriving Knowledge from Data at Scale

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as mixture model for text classification

Self-Training

• Learn model on labeled instances only

• Apply model to unlabeled instances

• Learn new model on all instances

• Repeat until convergence
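The self-training loop above can be sketched in a few lines. The base learner here is a nearest-centroid classifier on one-dimensional data, chosen only to keep the example short; the slides do not prescribe a particular model, and all names and data are hypothetical.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in sorted(set(ys))}

def predict(model, x):
    return min(model, key=lambda c: abs(x - model[c]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)                  # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]    # apply model to unlabeled instances
        if new_pseudo == pseudo:                                 # converged: labels stopped changing
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,           # learn new model on all instances
                              labeled_y + pseudo)
    return model, pseudo

labeled_x, labeled_y = [0.0, 10.0], ["a", "b"]
unlabeled_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 4.5]
model, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
print(pseudo, model)
```

Real implementations often add only the most confident pseudo-labels in each round; this sketch adds all of them, which is the simplest reading of the four bullets.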

Deriving Knowledge from Data at Scale

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM

• Minimize distance to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem
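One common way to write such an objective, given here as an assumed textbook-style formulation and not necessarily the exact one behind these slides, uses l labeled instances (x_i, y_i) with y_i in {-1, +1}, u unlabeled instances x_j, and a linear scoring function f:

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^{2}
  \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  \;+\; C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j)\rvert\bigr),
\qquad f(x) = w^{\top}x + b,
\quad\text{subject to}\quad
\frac{1}{u}\sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l}\sum_{i=1}^{l} y_i .
```

In this form, C* plays the role of the parameter that tunes the influence of the unlabeled instances, the equality is one way to state the class-balance constraint, and the absolute value in the last sum is what makes the problem non-convex.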

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
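A single-machine sketch of the stochastic gradient descent pass might look like the following. It is an illustration under assumptions, not the Hadoop implementation: the step-size schedule, the weight `c_unlabeled`, skipping unlabeled instances that lie exactly on the boundary, and the omission of a regularisation term are all choices made here for brevity.

```python
import math
import random

def sgd_s3vm(labeled, unlabeled, c_unlabeled=0.1, eta0=0.1, seed=0):
    """labeled: [(x, y)] with x a tuple and y in {-1, +1}; unlabeled: [x]. Returns (w, b)."""
    rng = random.Random(seed)
    data = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(data)                          # one run over the data in random order
    w = [0.0] * len(data[0][0])
    b = 0.0
    for t, (x, y) in enumerate(data, start=1):
        eta = eta0 / math.sqrt(t)              # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:
            if score == 0.0:
                continue                       # exactly on the boundary: no direction to push
            # Unlabeled: use the current prediction as the label, down-weighted.
            y, scale = (1 if score > 0 else -1), c_unlabeled
        else:
            scale = 1.0
        if y * score < 1:                      # misclassified or inside the margin
            w = [wi + eta * scale * y * xi for wi, xi in zip(w, x)]
            b += eta * scale * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (3.0, 2.0), (-1.5, -2.5), (-2.5, -1.5), (-3.0, -2.0)]
w, b = sgd_s3vm(labeled, unlabeled)
print(w, b, [predict(w, b, x) for x in unlabeled])
```

Since the objective is non-convex, different shuffles (different `seed` values) can end in different classifiers; running several and keeping the best one corresponds to the last bullet.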

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected
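A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The Gaussian similarity, the value of k, and the symmetrisation of edges are assumptions picked for illustration; the slides do not specify them.

```python
import math

def gaussian_similarity(a, b, sigma=1.0):
    """Similarity score that decays with squared Euclidean distance."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: weight}}, linking each point to its k most similar points."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        sims = sorted(((gaussian_similarity(points[i], points[j], sigma), j)
                       for j in range(n) if j != i), reverse=True)
        for s, j in sims[:k]:
            graph[i][j] = s
            graph[j][i] = s          # keep the graph undirected
    return graph

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
graph = knn_graph(points, k=2)
print({i: sorted(nbrs) for i, nbrs in graph.items()})
```

With these toy points the graph falls into two disconnected groups, which is the structure that graph-based semi-supervised methods exploit.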

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
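The iterative version of the idea can be sketched as follows, assuming a weighted undirected graph given as nested dictionaries and a few seed nodes with known labels. Clamping the seeds at every step and the stopping tolerance are assumptions of this sketch, and the names and toy graph are hypothetical.

```python
def label_propagation(graph, seeds, classes, iters=100, tol=1e-6):
    """graph: {node: {neighbor: weight}}; seeds: {node: class}. Returns {node: class}."""
    # Seeds start with a one-hot score; all other nodes start uniform.
    scores = {
        n: {c: (1.0 if seeds.get(n) == c else 0.0) if n in seeds else 1.0 / len(classes)
            for c in classes}
        for n in graph
    }
    for _ in range(iters):
        new_scores, change = {}, 0.0
        for n, nbrs in graph.items():
            if n in seeds or not nbrs:
                new_scores[n] = scores[n]          # labeled nodes stay clamped
                continue
            total = sum(nbrs.values())
            # Each node takes the weighted average of its neighbors' scores.
            new_scores[n] = {c: sum(wgt * scores[m][c] for m, wgt in nbrs.items()) / total
                             for c in classes}
            change = max(change, max(abs(new_scores[n][c] - scores[n][c]) for c in classes))
        scores = new_scores
        if change < tol:                           # repeat until convergence
            break
    return {n: max(classes, key=lambda c: scores[n][c]) for n in graph}

# A chain 0-1-2-3-4-5 with a weak link between nodes 2 and 3.
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 0.1},
    3: {2: 0.1, 4: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {4: 1.0},
}
result = label_propagation(graph, seeds={0: "A", 5: "B"}, classes=["A", "B"])
print(result)
```

The fixed point of this averaging can also be obtained directly by solving a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion.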

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10-Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
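One standard statistical test for an A/B comparison of conversion rates is the two-proportion z-test, sketched below. The slides do not name a specific test, so this choice is an assumption, and the visitor and conversion counts are made-up numbers for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)             # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))           # two-sided tail of the standard normal
    return z, p_value

# Hypothetical experiment: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

In this made-up case B converts 12% more often in relative terms, yet the p-value sits just above the conventional 0.05 threshold, so the difference could still plausibly be due to chance.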

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"; B has a big search button.

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing," find the flaw!

Examples

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
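A quick check for the first two examples is to look for values that occur far more often than they plausibly should. The sketch below is only an illustration: the 5% threshold, the function name, and the sample birth dates are assumptions.

```python
from collections import Counter

def suspicious_values(values, threshold=0.05):
    """Return {value: share} for values that make up at least `threshold` of the column."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items() if c / n >= threshold}

# Hypothetical birth-date column with two placeholder-looking entries.
birth_dates = (["01/01/01"] * 8 + ["11/11/11"] * 6
               + [f"03/{day:02d}/85" for day in range(1, 29)])
flagged = suspicious_values(birth_dates)
print(flagged)
```

Values flagged this way are candidates for defaults or throwaway entries and deserve a look before any statistic built on that field is trusted.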

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Page 32: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same

A    B

• A was 85 better

A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A    B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

[Figure callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Diagram: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
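As a rough illustration of reporting a metric with an interval, the sketch below runs k-fold cross-validation and prints mean accuracy with an approximate 95% interval. Everything here is an assumption made for the example: the synthetic 1-D data, the toy threshold classifier (which assumes class 1 has the larger mean), and the normal-approximation interval, which is only approximate because fold scores are not independent.

```python
import math
import random
import statistics

def fit_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    """Fraction of instances where 'x above threshold' agrees with label 1."""
    hits = sum((x > threshold) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def cross_validate(xs, ys, k=5, seed=0):
    """Return (mean accuracy, approximate 95% half-width) over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit on the training folds only, score on the held-out fold.
        model = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, half_width

# Synthetic data: class 0 centered at 0, class 1 centered at 3.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
mean_acc, half_width = cross_validate(xs, ys)
print(f"accuracy = {mean_acc:.3f} +/- {half_width:.3f}")
```

The model is always fit without the held-out fold, which is the discipline that guards against the target leaks mentioned in step 5.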

[Diagram: Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

6. Iterate steps (2) – (5) until the test metrics are satisfactory

[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That's all for our course…


Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
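The two steps can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function names, the synthetic data, and the initial means are all hypothetical, and the unlabeled-only setting is used for brevity.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_means, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape of each cluster.
        for j in range(k):
            r_j = sum(r[j] for r in resp)
            weights[j] = r_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / r_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / r_j
            variances[j] = max(var, 1e-6)  # guard against a collapsing cluster
    return weights, means, variances

# Synthetic data drawn from two clusters centered at 0 and 5.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
weights, means, variances = em_gmm_1d(data, init_means=[1.0, 4.0])
print(weights, means, variances)
```

Because EM only finds a local optimum, the result depends on the initial means; a poor starting point can yield a worse fit, which is why several restarts are common.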


Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
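The self-training loop above can be sketched with any base learner. The example below assumes a nearest-centroid classifier purely for brevity; the function names and the toy points are hypothetical.

```python
def centroids(points, labels):
    """Nearest-centroid 'model': one mean vector per class."""
    model = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        model[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return model

def predict(model, point):
    """Return the class whose centroid is closest to the point."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], point)))

def self_train(labeled, labels, unlabeled, max_rounds=20):
    model = centroids(labeled, labels)            # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]  # apply to unlabeled
        if new_pseudo == pseudo:                  # converged: labels stopped changing
            break
        pseudo = new_pseudo
        # Learn a new model on all instances, using the predicted labels.
        model = centroids(labeled + unlabeled, labels + pseudo)
    return model, pseudo

labeled = [(0.0, 0.0), (4.0, 4.0)]
labels = ["a", "b"]
unlabeled = [(0.5, 0.2), (1.0, 0.8), (3.5, 4.2), (3.0, 3.1), (1.8, 1.7)]
model, pseudo = self_train(labeled, labels, unlabeled)
print(pseudo)
```

Early mistakes can reinforce themselves in this scheme, so practical variants often add only the most confident predictions in each round.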

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances


The Low Density Assumption

Semi-Supervised SVM
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem
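One common way to write this down, given here as an assumed formulation rather than the one used in the lecture, extends the linear SVM objective with a "hat" loss on the unlabeled instances. The symbols are introduced for illustration: l labeled pairs (x_i, y_i) with y_i in {-1, +1}, u unlabeled instances x_j, a labeled-loss weight C, and a weight C* that tunes the influence of the unlabeled instances.

```latex
\[
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\lVert w\rVert^{2}
  + C\sum_{i=1}^{l}\max\bigl(0,\ 1 - y_i f(x_i)\bigr)
  + C^{*}\sum_{j=l+1}^{l+u}\max\bigl(0,\ 1 - \lvert f(x_j)\rvert\bigr) \\
\text{subject to}\quad & \frac{1}{u}\sum_{j=l+1}^{l+u} f(x_j) = \frac{1}{l}\sum_{i=1}^{l} y_i,
  \qquad f(x) = w^{\top}x + b
\end{aligned}
\]
```

The third term penalizes unlabeled instances that fall inside the margin on either side, which pushes the boundary into low-density regions; because it depends on |f(x_j)|, the objective is no longer convex. The constraint is a relaxed form of the class-balance requirement.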


Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
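A single-process sketch of this procedure follows; it does not reproduce the Hadoop mapper/reducer split. Several choices are assumptions made for illustration: unlabeled instances are given the label the current classifier predicts, the step size decays as eta0 / sqrt(t), and the best of several random runs is chosen by the objective value. All names and data points are hypothetical.

```python
import math
import random

def s3vm_sgd_pass(labeled, unlabeled, dim, lam=0.01, unlabeled_weight=0.1,
                  eta0=0.5, seed=0):
    """One SGD pass; labeled is a list of (x, y) with y in {-1, +1}."""
    rng = random.Random(seed)
    stream = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(stream)                       # one run over the data in random order
    w, b = [0.0] * dim, 0.0
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / math.sqrt(t)             # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:
            # Unlabeled: use the current prediction as the label, down-weighted.
            y, weight = (1 if score >= 0 else -1), unlabeled_weight
        else:
            weight = 1.0
        w = [wi * (1 - eta * lam) for wi in w]  # L2 regularization shrinkage
        if y * score < 1:                     # misclassified or inside the margin
            w = [wi + eta * weight * y * xi for wi, xi in zip(w, x)]
            b += eta * weight * y
    return w, b

def objective(w, b, labeled, unlabeled, lam=0.01, unlabeled_weight=0.1):
    """Regularized average of hinge loss (labeled) and hat loss (unlabeled)."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    n = len(labeled) + len(unlabeled)
    labeled_loss = sum(max(0.0, 1 - y * f(x)) for x, y in labeled)
    unlabeled_loss = sum(max(0.0, 1 - abs(f(x))) for x in unlabeled)
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    return reg + (labeled_loss + unlabeled_weight * unlabeled_loss) / n

labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(1.8, 0.5), (2.2, -0.4), (-1.9, 0.3), (-2.1, -0.6)]
# Many random runs; keep the one with the lowest objective value.
runs = [s3vm_sgd_pass(labeled, unlabeled, dim=2, seed=s) for s in range(10)]
w, b = min(runs, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled))
print(w, b)
```

Since the problem is non-convex, different random orders can end in different solutions, which is the motivation for comparing several runs.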


The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
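A small sketch of building such a similarity graph, assuming Euclidean distance, a Gaussian similarity score, and a symmetric k-nearest-neighbor rule. These choices, the function name, and the toy points are illustrative assumptions; the slides do not specify them.

```python
import math

def knn_graph(points, k, sigma=1.0):
    """Symmetric k-nearest-neighbor graph: {node: {neighbor: similarity}}."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Distances from instance i to every other instance, closest first.
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for d, j in dists[:k]:
            weight = math.exp(-d * d / (2 * sigma ** 2))  # similarity score
            graph[i][j] = weight
            graph[j][i] = weight                          # connect both directions
    return graph

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, sorted(neighbors))
```

In this toy example the two groups of points end up as separate connected components, so labels placed in one group cannot leak into the other.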


Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
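The iteration can be sketched on a weighted graph stored as a dictionary of neighbor weights. This is a minimal illustration assuming the variant that keeps the known labels fixed in every round; the function name and the four-node chain are hypothetical.

```python
def label_propagation(graph, seeds, n_classes, max_iter=1000, tol=1e-9):
    """graph: {node: {neighbor: weight}}; seeds: {node: class index}."""
    scores = {v: [0.0] * n_classes for v in graph}
    for v, c in seeds.items():
        scores[v][c] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        new_scores = {}
        for v, nbrs in graph.items():
            if v in seeds or not nbrs:
                new_scores[v] = scores[v]     # keep known labels fixed
                continue
            total = sum(nbrs.values())
            # Weighted average of the neighbors' current class scores.
            new = [sum(w * scores[u][c] for u, w in nbrs.items()) / total
                   for c in range(n_classes)]
            delta = max(delta, max(abs(a - b) for a, b in zip(new, scores[v])))
            new_scores[v] = new
        scores = new_scores
        if delta < tol:                       # repeat until convergence
            break
    return {v: max(range(n_classes), key=lambda c: s[c]) for v, s in scores.items()}

# Chain 0 - 1 - 2 - 3 with a labeled node at each end.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
result = label_propagation(chain, seeds={0: 0, 3: 1}, n_classes=2)
print(result)
```

The fixed point of this iteration satisfies a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion; iterating avoids forming that inverse explicitly on large graphs.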



Page 34: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one


The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
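A similarity graph of this kind can be sketched in a few lines. The example assumes Euclidean distance as the (inverse) similarity score and connects each point to its k nearest neighbors, making the edges symmetric. The function name `knn_graph` and the choice of k are illustrative assumptions.

```python
import math

def knn_graph(points, k=2):
    """Return {index: set of neighbor indices} for a symmetrized k-NN graph."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Sort all other points by distance to p and keep the k closest.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            graph[i].add(j)
            graph[j].add(i)   # make the edge undirected
    return graph
```

This brute-force version compares every pair of points, so it is only meant for small data sets. At scale, an approximate nearest-neighbor index would usually replace the inner loop.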


Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
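The iteration described above can be sketched as follows, assuming a two-class problem with labels encoded as +1.0 and -1.0, an unweighted graph, and at least one neighbor per node. The function name `propagate_labels` and its parameters are hypothetical. Labeled nodes are clamped, and every unlabeled node repeatedly takes the average score of its neighbors.

```python
def propagate_labels(graph, seeds, tol=1e-9, max_iter=10000):
    """graph: {node: iterable of neighbors}; seeds: {node: +1.0 or -1.0}."""
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(max_iter):
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:
                new[n] = seeds[n]                    # clamp known labels
            else:
                nbrs = list(nbrs)
                new[n] = sum(scores[m] for m in nbrs) / len(nbrs)
        change = max(abs(new[n] - scores[n]) for n in graph)
        scores = new
        if change < tol:                             # converged
            break
    return scores
```

The sign of each final score gives the predicted class. The fixed point of this loop solves a linear system in the unlabeled scores, which is the sense in which the method is equivalent to a matrix inversion; the loop simply avoids forming the inverse explicitly.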


Conclusion

Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break...

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: GET THE DATA

Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
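As one illustration of such a statistical test, the sketch below applies a two-proportion z-test to conversion counts from variants A and B. The choice of test, the function name and the example counts are assumptions made here; the slides do not prescribe a specific test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% rate against a 15% rate with 1,000 users in each arm. A small p-value suggests the difference is unlikely to be due to chance alone, provided users were randomly assigned to the two variants.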

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A B

• A was 85 better

A
B
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"
B has a big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris
Measure and Control
Accept Results (avoid Semmelweis Reflex)
Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions...

Out of Class Reading
Eight (8) page conference paper
40 page journal version...

Course Project: Due Oct 25th

Open Discussion on Course Project...

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction

1. Define the objective and quantify it with a metric - optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem - classification, regression, ranking, clustering, forecasting, outlier detection, etc. - some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection
Model training
Model scoring
Evaluation
Train / Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) - this is the ML-heavy step.
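A minimal sketch of reporting a cross-validated metric with an interval is shown below. Everything in it is an illustrative assumption: the helper names, the toy one-feature "midpoint" classifier, and the normal-approximation interval. Fold scores are not independent, so the interval should be read as a rough indication rather than an exact confidence level.

```python
import random
import statistics

def k_fold_scores(xs, ys, fit, predict, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        test = set(held_out)
        train = [i for i in idx if i not in test]
        # Fit only on the training part so nothing leaks from the held-out fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores

def mean_with_interval(scores, z=1.96):
    """Mean score and half-width of an approximate 95% interval."""
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return statistics.mean(scores), half

# Toy model: threshold halfway between the two class means (labels 0 and 1).
def fit_midpoint(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0
```

Any preprocessing that looks at the target (feature selection, target encoding, scaling fitted on all rows) belongs inside `fit`, so that it is recomputed per fold; doing it once on the full data is a common source of the target leaks mentioned above.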

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train / Test split

6. Iterate steps (2) - (5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course...


Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
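The two steps can be sketched for the simplest case: one-dimensional data and exactly two Gaussian components. The function names, the initialization at the minimum and maximum of the data, and the variance floor `min_var` are assumptions for illustration. The sketch has no safeguard against a component losing all of its mass.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture; returns (means, variances, weights)."""
    mu = [min(xs), max(xs)]      # crude initialization (an assumption)
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate location, shape and size of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(min_var, spread)
            weight[k] = nk / len(xs)
    return mu, var, weight
```

Because the result depends on the starting point, a different initialization can end in a different local optimum. In the semi-supervised setting, the responsibilities of labeled instances would be fixed to their known class instead of being estimated.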


Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification

Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
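The self-training loop above can be sketched with any base learner. The example below assumes a toy nearest-centroid classifier on a single numeric feature; the function names are hypothetical, and convergence is taken to mean that the predicted labels for the unlabeled instances stop changing.

```python
import statistics

def fit_centroids(xs, ys):
    """Toy base learner: one centroid (mean) per class."""
    return {c: statistics.mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(centroids, x):
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)        # labeled instances only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, x) for x in unlabeled_x]   # apply to unlabeled
        if new_guesses == guesses:                     # nothing changed: converged
            break
        guesses = new_guesses
        # Learn a new model on all instances, using the guesses as labels.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + guesses)
    return model, guesses
```

A frequent variation is to add only the most confident predictions in each round, which limits how far early mistakes can spread.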

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances



Page 36: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption

Semi-Supervised SVM
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem
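For reference, one common way to write such an objective is shown below. The slides do not give a formula, so the notation is an assumption: $l$ labeled pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, $u$ unlabeled points $x_j$, a linear scorer $f$, and trade-off parameters $C$ and $C^{*}$, where $C^{*}$ plays the role of the parameter that tunes the influence of the unlabeled instances.

```latex
\min_{w,\,b}\;
  \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i f(x_i)\bigr)
  + C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr),
\qquad f(x) = w^{\top} x + b

\text{subject to}\quad
  \frac{1}{u} \sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l} \sum_{i=1}^{l} y_i
```

The unlabeled term penalizes points that fall inside the margin on either side, which pushes the boundary into low-density regions. The absolute value makes that term non-convex, which is why the problem is harder than a standard SVM. The constraint is one way to keep the class balance correct.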

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
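A single-machine sketch of this procedure is given below. It is not the Hadoop implementation from the lecture: the hinge-style updates, the step size `1/t`, the parameters `c_unlabeled` and `lam`, and the toy data (with a constant second feature acting as a bias term) are all assumptions for illustration.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def s3vm_sgd_pass(labeled, unlabeled, c_unlabeled=0.5, lam=0.01, seed=0):
    """One pass of SGD over labeled and unlabeled instances in random order."""
    rng = random.Random(seed)
    stream = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(stream)
    w = [0.0] * len(stream[0][0])
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / t  # steps get smaller over time
        score = dot(w, x)
        if y is None:
            # Unlabeled instance inside the margin: push it toward its predicted side
            if score == 0.0 or abs(score) >= 1.0:
                target, scale = 0, 0.0
            else:
                target, scale = (1 if score > 0 else -1), c_unlabeled
        else:
            # Labeled instance that is misclassified or inside the margin
            target, scale = (y, 1.0) if y * score < 1.0 else (0, 0.0)
        w = [(1.0 - eta * lam) * wi + eta * scale * target * xi for wi, xi in zip(w, x)]
    return w


def objective(w, labeled, unlabeled, c_unlabeled=0.5, lam=0.01):
    """Regularized hinge loss on labeled plus symmetric hinge on unlabeled instances."""
    reg = 0.5 * lam * dot(w, w)
    lab = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in labeled)
    unl = sum(max(0.0, 1.0 - abs(dot(w, x))) for x in unlabeled)
    return reg + lab + c_unlabeled * unl


labeled = [([2.0, 1.0], 1), ([-2.0, 1.0], -1)]
unlabeled = [[1.5, 1.0], [2.5, 1.0], [-1.5, 1.0], [-2.5, 1.0]]

# Many random runs, keep the one with the lowest objective
runs = [s3vm_sgd_pass(labeled, unlabeled, seed=s) for s in range(10)]
best_w = min(runs, key=lambda w: objective(w, labeled, unlabeled))
print(best_w)
```

Because the objective is non-convex, different random orders can end in different solutions, which is the reason for running several passes and keeping the best.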

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
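A nearest-neighbor similarity graph can be sketched in a few lines. This example assumes Euclidean distance as the (inverse) similarity score and a brute-force neighbor search; the function name `knn_graph` and the toy points are hypothetical, and a large-scale version would need an approximate or distributed neighbor search instead.

```python
import math


def knn_graph(points, k=2):
    """Return {index: set of neighbour indices} for a symmetrised k-NN graph."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Rank all other points by distance to p (smaller distance = more similar)
        by_distance = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in by_distance[:k]:
            graph[i].add(j)
            graph[j].add(i)  # keep the graph undirected
    return graph


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
graph = knn_graph(points, k=2)
print(graph)
```

In this toy data the two groups of points end up as separate connected components, which is the structure a graph-based learner then exploits.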

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
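The iteration can be sketched on a small graph. This is a minimal version under stated assumptions: unweighted edges, labeled nodes clamped to their known score, and every unlabeled node repeatedly set to the average of its neighbors. The names `propagate_labels`, `chain` and `seeds` are made up for the example.

```python
def propagate_labels(graph, seeds, n_iter=200):
    """Iteratively average neighbour scores; seed (labeled) nodes stay fixed."""
    scores = {node: seeds.get(node, 0.5) for node in graph}
    for _ in range(n_iter):
        updated = {}
        for node, neighbours in graph.items():
            if node in seeds:
                updated[node] = seeds[node]  # clamp known labels
            else:
                updated[node] = sum(scores[n] for n in neighbours) / len(neighbours)
        scores = updated
    return scores


# Chain 0 - 1 - 2 - 3 - 4 with the two end nodes labeled
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = propagate_labels(chain, seeds={0: 1.0, 4: 0.0})
print(scores)
```

The fixed point of this iteration also has a closed form, which is what "equivalent to matrix inversion" refers to. With a row-normalized transition matrix $P$ split into labeled ($L$) and unlabeled ($U$) blocks, the unlabeled scores satisfy $f_U = (I - P_{UU})^{-1} P_{UL} f_L$. At scale the iterative form is usually preferred, since it only needs sparse neighbor sums.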

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: GET THE DATA

Lesson 3: Prepare to be humbled
[Figure: Left Elevator, Right Elevator]

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e. the changes in metrics are caused by changes introduced in the treatment(s)
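As one example of such a statistical test, the sketch below compares conversion rates between control (A) and treatment (B) with a two-sided two-proportion z-test. The choice of test, the function name, and the counts are assumptions for illustration; the slides do not say which test to run.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical experiment: 10,000 users per variant
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1200, n_b=10000)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A small p-value says that a difference this large would be unlikely if the two variants truly performed the same. The causal reading additionally relies on users having been assigned to A and B at random.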

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same

[Figure: variants A and B]

• A was 85 better

[Figure: variants A and B]

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"; B has big search button

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

[Figure: variants A and B]

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data, prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 111111 or 010101

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results, avoid Semmelweis Reflex; Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

[Screenshot labels: Getting Data For the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model]

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Workflow diagram stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.)

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Workflow diagram stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
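Step 5's advice to report metrics with confidence intervals can be sketched with k-fold cross-validation. Everything here is a hypothetical illustration: the nearest-centroid classifier, the synthetic one-dimensional data, and the normal-approximation interval over fold scores, which is a rough convention rather than an exact interval.

```python
import math
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroid_classifier(train):
    """Fit class means on (x, y) pairs and return a predict function."""
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x)
    centroids = {y: sum(xs) / len(xs) for y, xs in by_class.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def cross_validate(data, k=5, seed=0):
    """Return mean accuracy and an approximate 95% interval across folds."""
    scores = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [d for i, d in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        model = centroid_classifier(train)  # fitted on training folds only
        scores.append(sum(model(x) == y for x, y in test) / len(test))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, (mean - half_width, mean + half_width)


rng = random.Random(1)
data = [(rng.gauss(0, 1), 0) for _ in range(100)] + [(rng.gauss(3, 1), 1) for _ in range(100)]
mean_acc, interval = cross_validate(data)
print(mean_acc, interval)
```

The model is refitted inside every fold using only that fold's training data. Any preprocessing or feature construction that looks at the target should be handled the same way, otherwise information leaks from the held-out fold and the metrics look better than they are.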

Workflow diagram stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory

Diagram stages: Access Data; Pre-processing; Feature construction; Model scoring

Book Recommendation

That's all for our course…

Page 37: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Generative Models

Mixture of Gaussiansbull Assumption the data in each cluster is generated

by a normal distribution

bull Find most probable location and shape of clusters

given data

Expectation-Maximization

bull Two step optimization procedure

bull Keeps estimates of cluster assignment probabilities

for each instance

bull Might converge to local optimum

Deriving Knowledge from Data at Scale

BeyondMixtures of Gaussians

Expectation-Maximization

bull Can be adjusted to all kinds of mixture models

bull Eg use Naive Bayes as mixture model for text classification

Self-Training

bull Learn model on labeled instances only

bull Apply model to unlabeled instances

bull Learn new model on all instances

bull Repeat until convergence

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
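As a rough illustration of step 5, the sketch below runs k-fold cross-validation by hand and reports the mean accuracy with an approximate 95% interval. The synthetic data, the nearest-centroid classifier, and names such as `k_fold_accuracy` are assumptions made up for this example; they are not from the lecture. Fold scores are not fully independent, so the interval is only a rough guide.

```python
import math
import random
import statistics


def fit_centroids(rows):
    """Return {label: centroid} for rows of (features, label)."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: math.dist(model[y], x))


def k_fold_accuracy(rows, k=5, seed=0):
    """Shuffle once, split into k folds, and score each held-out fold."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_centroids(train)
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean and a normal-approximation interval over the fold scores."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half, mean + half


# Synthetic two-class data (hypothetical, for illustration only).
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(100)]
data += [([rng.gauss(3, 1), rng.gauss(3, 1)], 1) for _ in range(100)]

scores = k_fold_accuracy(data, k=5)
mean, low, high = mean_with_interval(scores)
print(f"accuracy {mean:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

A test metric far above what the cross-validated interval suggests is a hint to look for a target leak before trusting the result.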

6. Iterate steps (2)-(5) until the test metrics are satisfactory.

Access Data

Pre-processing

Feature construction

Model scoring

Book Recommendation

That's all for our course…

Beyond Mixtures of Gaussians

Expectation-Maximization

• Can be adjusted to all kinds of mixture models

• E.g., use Naive Bayes as the mixture model for text classification

Self-Training

• Learn a model on labeled instances only

• Apply the model to unlabeled instances

• Learn a new model on all instances

• Repeat until convergence
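The self-training loop above can be sketched in a few lines. This is a minimal illustration, assuming a nearest-centroid classifier as the base model and made-up two-cluster data; the function names (`fit_centroids`, `self_train`) are hypothetical and any classifier could take the base model's place.

```python
import math
import random


def fit_centroids(points, labels):
    """Base model: one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(model, x):
    return min(model, key=lambda y: math.dist(model[y], x))


def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    # 1. Learn a model on labeled instances only.
    model = fit_centroids(labeled_x, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # 2. Apply the model to unlabeled instances.
        new_pseudo = [predict(model, x) for x in unlabeled_x]
        # 4. Stop once the pseudo-labels no longer change.
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # 3. Learn a new model on all instances.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return model, pseudo


# Hypothetical data: two labeled points, one hundred unlabeled ones.
rng = random.Random(3)
labeled_x = [[0.5, -0.5], [3.5, 4.5]]
labeled_y = [0, 1]
unlabeled_x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
unlabeled_x += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
truth = [0] * 50 + [1] * 50

model, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
accuracy = sum(p == t for p, t in zip(pseudo, truth)) / len(truth)
print(f"pseudo-label accuracy: {accuracy:.2f}")
```

Self-training reinforces its own early mistakes, so in practice one often adds only the most confident pseudo-labels in each round; that refinement is left out here to keep the sketch short.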

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances

The Low Density Assumption

Semi-Supervised SVM

• Maximize margin to labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem

Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves the classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find the best one
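A single-machine sketch of this procedure might look as follows. It is an illustration under stated assumptions, not the lecture's Hadoop implementation: labeled instances use the hinge loss, unlabeled instances use the symmetric "hat" loss max(0, 1 - |f(x)|) with the current prediction as a guessed label, the class-balance constraint is omitted, and all names, parameters, and data are made up.

```python
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(w, b, labeled, unlabeled, lam, lam_u):
    """Regularizer + hinge loss (labeled) + hat loss (unlabeled)."""
    reg = 0.5 * lam * dot(w, w)
    hinge = sum(max(0.0, 1 - y * (dot(w, x) + b)) for x, y in labeled)
    hat = sum(max(0.0, 1 - abs(dot(w, x) + b)) for x in unlabeled)
    return reg + hinge + lam_u * hat


def sgd_pass(labeled, unlabeled, seed, lam=0.01, lam_u=0.5, lr0=0.5):
    """One run over the data in random order."""
    stream = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(stream)
    w, b = [0.0] * len(stream[0][0]), 0.0
    for t, (x, y) in enumerate(stream):
        lr = lr0 / math.sqrt(1 + t)        # steps get smaller over time
        score = dot(w, x) + b
        weight = 1.0
        if y is None:
            if score == 0:
                continue                   # no basis for a guess yet
            y, weight = (1 if score > 0 else -1), lam_u
        w = [wi * (1 - lr * lam) for wi in w]   # shrink (regularization)
        if y * score < 1:                  # misclassified or inside the margin
            w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]
            b += lr * weight * y
    return w, b


def best_of_runs(labeled, unlabeled, runs=10, lam=0.01, lam_u=0.5):
    """The problem is non-convex, so try several orders and keep the best."""
    results = [sgd_pass(labeled, unlabeled, s, lam, lam_u) for s in range(runs)]
    return min(results, key=lambda wb: objective(wb[0], wb[1], labeled,
                                                 unlabeled, lam, lam_u))


# Hypothetical data: 3 labeled and 50 unlabeled points per class.
rng = random.Random(7)
neg = [[rng.gauss(-3, 1), rng.gauss(-3, 1)] for _ in range(53)]
pos = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(53)]
labeled = [(x, -1) for x in neg[:3]] + [(x, 1) for x in pos[:3]]
unlabeled = neg[3:] + pos[3:]
truth = [-1] * 50 + [1] * 50

w, b = best_of_runs(labeled, unlabeled)
predictions = [1 if dot(w, x) + b > 0 else -1 for x in unlabeled]
accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(f"accuracy on unlabeled points: {accuracy:.2f}")
```

In a MapReduce setting, the shuffle would correspond to the mapper emitting instances under random keys and the loop body to the reducer's update; each random run is independent, so the runs can execute in parallel.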

The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low-dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create a network where the nearest neighbors are connected
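A similarity graph of this kind can be built with a brute-force nearest-neighbor search. The sketch below assumes Euclidean distance, a Gaussian kernel as the similarity score, and a symmetrized k-nearest-neighbor rule; these choices, the function name `knn_graph`, and the sample points are illustrative assumptions.

```python
import math


def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: similarity}} connecting each point to its k nearest
    neighbors; edges are made symmetric."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            sim = math.exp(-d * d / (2 * sigma * sigma))
            graph[i][j] = sim
            graph[j][i] = sim
    return graph


# Hypothetical points forming two tight groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, sorted(neighbors))
```

The all-pairs search is quadratic in the number of instances; at scale one would normally swap in an approximate nearest-neighbor index.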

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
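The iteration can be sketched on a small weighted graph. In this minimal version, labeled nodes stay clamped to their labels and every other node repeatedly takes the weighted average of its neighbors' class distributions. The graph, the seed labels, and the function name `propagate_labels` are made up for illustration, and every node is assumed to have at least one neighbor.

```python
def propagate_labels(graph, seeds, classes, tol=1e-6, max_iter=1000):
    """graph: {node: {neighbor: weight}}, seeds: {node: class}."""
    uniform = 1.0 / len(classes)
    dist = {
        n: {c: (1.0 if seeds.get(n) == c else 0.0) if n in seeds else uniform
            for c in classes}
        for n in graph
    }
    for _ in range(max_iter):
        change = 0.0
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:                 # labeled nodes stay clamped
                new[n] = dist[n]
                continue
            total = sum(nbrs.values())
            new[n] = {
                c: sum(w * dist[m][c] for m, w in nbrs.items()) / total
                for c in classes
            }
            change = max(change,
                         max(abs(new[n][c] - dist[n][c]) for c in classes))
        dist = new
        if change < tol:                   # repeat until convergence
            break
    return {n: max(d, key=d.get) for n, d in dist.items()}


# Hypothetical chain a-b-c-d-e-f with a weak link between c and d.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"e": 1.0},
}
seeds = {"a": "red", "f": "blue"}
labels = propagate_labels(graph, seeds, classes=["red", "blue"])
print(labels)
```

The fixed point of this loop solves a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion; iterating avoids forming the inverse explicitly.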

Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A

• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
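One common way to check whether a difference between control and treatment could be due to chance is a two-proportion z-test on conversion counts. The sketch below is one possible test, not necessarily the one used in the lecture; the counts are invented and the function name is hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: control A vs. treatment B, 10,000 users each.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000,
                                   conv_b=560, n_b=10_000)
significant = p_value < 0.05
print(f"z = {z:.2f}, p = {p_value:.3f}, significant at 5%: {significant}")
```

With these made-up counts the treatment converts 12% better in relative terms, yet the test does not reach the conventional 5% level, which is the kind of result that intuition alone tends to misjudge.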

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
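A quick sanity check for such placeholder values is to look for any single value that takes up an implausibly large share of a field. The sketch below assumes a 5% share threshold and uses made-up birth dates; both are illustrative choices rather than recommendations from the lecture.

```python
from collections import Counter


def suspicious_values(values, share_threshold=0.05):
    """Return {value: share} for values that dominate a field."""
    counts = Counter(values)
    total = len(values)
    return {
        value: count / total
        for value, count in counts.items()
        if count / total >= share_threshold
    }


# Hypothetical field: 200 distinct dates plus 30 copies of a lazy default.
birth_dates = [
    f"{1 + i % 12:02d}/{1 + i % 28:02d}/{50 + i % 40:02d}" for i in range(200)
]
birth_dates += ["11/11/11"] * 30

flagged = suspicious_values(birth_dates)
print(flagged)
```

The same check applied to a drop-down field would surface a default such as the first alphabetical entry.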

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you're the decision maker

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1840s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?

• Insight

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris

Measure and Control

Accept Results (avoid Semmelweis Reflex)

Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)
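To see how a strong correlation can arise without any causal link, the sketch below generates two series that both simply grow over time and compares their correlation before and after the shared trend is removed. The numbers are synthetic and are not the Oldenburg data; the function names are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def detrend(ts, ys):
    """Residuals of a least-squares straight-line fit of ys on ts."""
    mt, my = statistics.mean(ts), statistics.mean(ys)
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return [y - (my + slope * (t - mt)) for t, y in zip(ts, ys)]


# Synthetic series: both grow with time, neither influences the other.
rng = random.Random(1)
periods = list(range(200))
storks = [100 + 5 * t + rng.gauss(0, 20) for t in periods]
people = [50_000 + 2_000 * t + rng.gauss(0, 10_000) for t in periods]

raw_r = pearson(storks, people)
detrended_r = pearson(detrend(periods, storks), detrend(periods, people))
print(f"raw correlation: {raw_r:.2f}")
print(f"correlation after removing the time trend: {detrended_r:.2f}")
```

Detrending only works when the confounder is known and measured; a controlled experiment removes the need to guess it, because random assignment balances unknown confounders as well.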

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there

-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40-page journal version…



Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
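Step 5 asks for metrics reported with confidence intervals and names cross-validation as a help. A minimal sketch of that idea in plain Python follows. The synthetic one-feature dataset, the toy threshold classifier, and the names `fit_threshold`, `k_fold_accuracy` and `mean_with_interval` are all assumptions made for illustration, not part of the lecture. The interval is a rough normal approximation over fold scores, which are not fully independent.

```python
import random
import statistics

def fit_threshold(train):
    """Toy classifier: threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def k_fold_accuracy(data, k=5, seed=0):
    """Return the per-fold accuracies from k-fold cross-validation."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        t = fit_threshold(train)
        correct = sum((x > t) == (y == 1) for x, y in test)
        scores.append(correct / len(test))
    return scores

def mean_with_interval(scores, z=1.96):
    """Mean and an approximate 95% interval (normal approximation)."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return m, m - half, m + half

# Synthetic data (an assumption for illustration): two overlapping classes.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)] + \
       [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]

scores = k_fold_accuracy(data, k=5)
mean_acc, lo, hi = mean_with_interval(scores)
print(f"accuracy = {mean_acc:.3f} (approx. 95% interval {lo:.3f} to {hi:.3f})")
```

A test accuracy far above what such an interval suggests is one signal to go looking for a target leak.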

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2) - (5) until the test metrics are satisfactory

Access Data

Pre-processing

Feature construction

Model scoring

Book Recommendation

That's all for our course...

The Low Density Assumption

Assumption

• The area between the two classes has low density

• Does not assume any specific form of cluster

Support Vector Machine

• Decision boundary is linear

• Maximizes margin to closest instances
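To make "maximizes margin to closest instances" concrete, the sketch below compares two candidate linear boundaries on a toy dataset and reports which leaves the wider margin. The data, the candidate boundaries and the function name `geometric_margin` are assumptions for illustration. A real SVM solver searches over all boundaries rather than comparing a fixed pair.

```python
import math

def geometric_margin(w, b, points):
    """Smallest signed distance from labeled points to the line w.x + b = 0.

    points: list of ((x1, x2), label) with label in {-1, +1}.
    A positive result means every point lies on its correct side.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in points)

# Toy, linearly separable data (an assumption for illustration).
points = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 0.0), +1), ((4.0, 0.0), +1)]

# Two candidate boundaries that both separate the classes.
candidates = {
    "x1 = 1.5": ((1.0, 0.0), -1.5),
    "x1 = 2.0": ((1.0, 0.0), -2.0),
}
margins = {name: geometric_margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, "-> widest margin:", best)
```

Both boundaries classify every point correctly; the one midway between the closest points of each class is preferred because its margin is larger.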


The Low Density Assumption

Semi-Supervised SVM

• Maximize the margin to both labeled and unlabeled instances

• Parameter to fine-tune influence of unlabeled instances

• Additional constraint: keep class balance correct

Implementation

• Simple extension of SVM

• But non-convex optimization problem


Semi-Supervised SVM

Stochastic Gradient Descent

• One run over the data in random order

• Each misclassified or unlabeled instance moves classifier a bit

• Steps get smaller over time

Implementation on Hadoop

• Mapper: send data to reducer in random order

• Reducer: update linear classifier for unlabeled or misclassified instances

• Many random runs to find best one
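The bullets above can be sketched on a single machine as follows. This is an illustrative reading of the slide, not the lecture's implementation. The objective (hinge loss on labeled points plus a weighted "hat" loss on unlabeled points), the step-size schedule, the toy data and the names `s3vm_sgd` and `objective` are assumptions, and the class-balance constraint and the Hadoop mapper/reducer split are left out.

```python
import random

def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd(labeled, unlabeled, c_unlabeled=0.5, lam=0.01, eta0=0.5, seed=0):
    """One randomized SGD pass for a linear semi-supervised SVM.

    labeled:   list of (x, y), y in {-1, +1}; x ends with a constant 1.0 bias feature
    unlabeled: list of x
    """
    rng = random.Random(seed)
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(stream)                       # one run over the data in random order
    w = [0.0] * len(stream[0][0])
    for t, (x, y) in enumerate(stream):
        eta = eta0 / (1.0 + t)                # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0:
                continue                      # no side yet: wait for a labeled point
            # Unlabeled: push the point away from the boundary on its current side.
            y, weight = (1 if score > 0 else -1), c_unlabeled
        else:
            weight = 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]           # L2 shrinkage
        if y * score < 1.0:                                # misclassified or inside margin
            w = [wi + eta * weight * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, c_unlabeled=0.5, lam=0.01):
    """Regularizer + hinge loss (labeled) + weighted hat loss (unlabeled)."""
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in labeled)
    hat = sum(max(0.0, 1.0 - abs(dot(w, x))) for x in unlabeled)
    return 0.5 * lam * dot(w, w) + hinge + c_unlabeled * hat

# Toy data (assumed): x = (feature, bias); two labeled points, six unlabeled.
labeled = [((-2.0, 1.0), -1), ((2.0, 1.0), +1)]
unlabeled = [(-2.5, 1.0), (-1.8, 1.0), (-1.5, 1.0), (1.6, 1.0), (2.2, 1.0), (2.7, 1.0)]

# Many random runs; keep the one with the lowest objective.
runs = [s3vm_sgd(labeled, unlabeled, seed=s) for s in range(20)]
best_w = min(runs, key=lambda w: objective(w, labeled, unlabeled))
predictions = [1 if dot(best_w, x) >= 0 else -1 for x in unlabeled]
print(best_w, predictions)
```

Because the objective is non-convex, different shuffles can end in different solutions, which is why the sketch keeps the best of several runs. `c_unlabeled` plays the role of the parameter that tunes the influence of unlabeled instances.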


The Manifold Assumption

The Assumption

• Training data is (roughly) contained in a low dimensional manifold

• One can perform learning in a more meaningful low-dimensional space

• Avoids curse of dimensionality

Similarity Graphs

• Idea: compute similarity scores between instances

• Create network where the nearest neighbors are connected
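A similarity graph of this kind can be sketched in a few lines. The Gaussian similarity `exp(-d^2)`, the choice `k=2`, the toy points and the function name `knn_graph` are assumptions for illustration; the slides do not specify a similarity measure.

```python
import math

def knn_graph(points, k=2):
    """Connect each point to its k nearest neighbors (symmetrized).

    Edge weights use a Gaussian similarity exp(-d^2); this kernel choice
    is an assumption, not something the slides specify.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            w = math.exp(-d * d)
            graph[i][j] = w
            graph[j][i] = w          # keep the graph undirected
    return graph

# Two small groups of 2-D points (toy data, assumed for illustration).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
for node, nbrs in graph.items():
    print(node, sorted(nbrs))
```

This brute-force version compares every pair of points, so it only suits small datasets; at scale an approximate nearest-neighbor index would usually replace the inner loop.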


Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
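The iteration can be sketched as repeated neighbor averaging with the labeled nodes held fixed. The chain graph, the unit weights, the +1/-1 label encoding and the name `propagate_labels` are assumptions for illustration. Updates are applied in place (Gauss-Seidel style), which is one of several common variants.

```python
def propagate_labels(graph, seeds, tol=1e-9, max_iter=10_000):
    """Iteratively average neighbor scores; seed nodes stay clamped.

    graph: {node: {neighbor: weight}}, seeds: {node: +1.0 or -1.0}
    """
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(max_iter):
        biggest_change = 0.0
        for node, nbrs in graph.items():
            if node in seeds:
                continue                      # labeled nodes keep their label
            total = sum(nbrs.values())
            new = sum(w * scores[j] for j, w in nbrs.items()) / total
            biggest_change = max(biggest_change, abs(new - scores[node]))
            scores[node] = new
        if biggest_change < tol:              # repeat until convergence
            break
    return scores

# Toy chain 0 - 1 - 2 - 3 - 4 with unit weights (assumed for illustration).
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
seeds = {0: +1.0, 4: -1.0}
scores = propagate_labels(graph, seeds)
print({node: round(s, 3) for node, s in scores.items()})
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion. On the chain, the scores interpolate linearly between the two labeled ends.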


Conclusion

Semi-Supervised Learning

• Only few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

10 Minute Break...

Controlled Experiments

• A

• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled

Left Elevator Right Elevator

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

• HiPPO stop the project

From Greg Linden's Blog http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
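One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion counts. The sketch below is illustrative only. The counts are made up, the function name `two_proportion_z_test` is hypothetical, and the lecture does not say which test it has in mind.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts for illustration only (not results from the lecture).
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value says the observed gap would be unlikely if A and B truly converted at the same rate. It is random assignment of users to A and B, not the test itself, that supports the causal reading.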

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A

B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"

B has big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

get the data, prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
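A quick way to spot placeholder entries like the birth-date example is to look for values that account for an implausibly large share of a column. The sketch below uses a hypothetical column and a hypothetical 5% threshold. `suspicious_values` is an illustrative name, and the right threshold depends on how spread out the field should be.

```python
from collections import Counter

def suspicious_values(values, share_threshold=0.05):
    """Return values whose share of all records exceeds share_threshold."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > share_threshold}

# Hypothetical birth-date column: mostly spread out, plus placeholder spikes.
birth_dates = [f"{m:02d}/{d:02d}/{y:02d}"
               for y in range(60, 90) for m in (3, 7) for d in (5, 19)]
birth_dates += ["01/01/01"] * 40 + ["11/11/11"] * 25

flagged = suspicious_values(birth_dates)
print(flagged)
```

A flagged value is only a prompt to investigate; a genuine spike and a lazy default look the same in a frequency count.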

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?

• Insight

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

• Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris

Measure and Control

Accept Results: avoid Semmelweis Reflex

Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)


Page 41: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Assumption

bull The area between the two classes has low density

bull Does not assume any specific form of cluster

Support Vector Machine

bull Decision boundary is linear

bull Maximizes margin to closest instances

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
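Step 5 can be illustrated with a small sketch of k-fold cross-validation that reports a metric together with a rough interval. This is not the course's own code: the threshold "model", the synthetic data, and all function names are hypothetical, and the interval is a simple normal approximation across folds.

```python
import math
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the accuracy measured on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        # Fit only on the training folds so no test information leaks in.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean score and a rough normal-approximation interval across folds."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half, mean + half


# Hypothetical toy model: a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic 1-D data: two Gaussian classes centered at 0 and 3.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100

scores = cross_validate(xs, ys, fit_threshold, predict_threshold, k=5)
mean, low, high = mean_with_interval(scores)
print(f"accuracy {mean:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

If the reported accuracy were suspiciously close to 1.0 on real data, that would be a cue to look for a target leak rather than to celebrate.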

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2)-(5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data

Pre-processing

Feature construction

Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…


Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem
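One common way to write a semi-supervised SVM objective of this kind is sketched below. The notation is an assumption, not taken from the slides: $l$ labeled points $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, $u$ unlabeled points $x_j$, and weights $C$ and $C^{*}$, where $C^{*}$ plays the role of the parameter that tunes the influence of the unlabeled instances.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^{2}
  \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\,(w^{\top}x_i + b)\bigr)
  \;+\; C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert w^{\top}x_j + b\rvert\bigr)
\quad\text{subject to}\quad
\frac{1}{u}\sum_{j=l+1}^{l+u} \bigl(w^{\top}x_j + b\bigr) \;=\; \frac{1}{l}\sum_{i=1}^{l} y_i
```

The absolute value in the third term penalizes unlabeled points that fall inside the margin on either side, which is what makes the problem non-convex; the constraint is one way to express "keep class balance correct".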


Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
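The single-pass update and the "many random runs" idea can be sketched in plain Python as follows. This is a simplified, single-machine illustration rather than the Hadoop implementation from the slides. Everything in it is an assumption made for the sketch: the function names, the step-size schedule, the weight `c_unlabeled`, the absence of a bias term and of regularization, and the use of a margin violation (rather than strict misclassification) as the trigger for a labeled update.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_pass(stream, dim, eta0=0.5, decay=0.1, c_unlabeled=0.1):
    """One pass over (features, label) pairs; label is +1, -1, or None."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / (1.0 + decay * t)  # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if abs(score) >= 1.0:
                continue  # unlabeled point already outside the margin
            # Unlabeled: use the current prediction as a label, at reduced weight.
            y = 1 if score >= 0 else -1
            eta *= c_unlabeled
        elif y * score >= 1.0:
            continue  # labeled point classified correctly with margin
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def labeled_accuracy(w, labeled):
    hits = sum((1 if dot(w, x) >= 0 else -1) == y for x, y in labeled)
    return hits / len(labeled)


def best_of_runs(labeled, unlabeled, runs=5, seed=0):
    """Repeat the pass over several random orders and keep the best classifier."""
    rng = random.Random(seed)
    data = list(labeled) + [(x, None) for x in unlabeled]
    dim = len(data[0][0])
    best_w, best_acc = None, -1.0
    for _ in range(runs):
        rng.shuffle(data)  # one random order per run
        w = sgd_pass(data, dim)
        acc = labeled_accuracy(w, labeled)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Hypothetical 1-D toy data: two labeled points and four unlabeled ones.
labeled = [((2.0,), 1), ((-2.0,), -1)]
unlabeled = [(1.5,), (-1.5,), (2.5,), (-2.5,)]
w, acc = best_of_runs(labeled, unlabeled)
print(w, acc)
```

Because the objective is non-convex, different random orders can end in different classifiers, which is why several runs are compared here on the labeled data.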


Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
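A similarity graph of this kind can be built with a brute-force nearest-neighbor search, as in the sketch below. The Gaussian similarity, the `bandwidth` parameter, the function name, and the sample points are all assumptions chosen for illustration; the slides do not specify a similarity function.

```python
import math


def knn_graph(points, k=2, bandwidth=1.0):
    """Return {i: {j: weight}} linking each point to its k nearest neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        # Brute-force distances to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            weight = math.exp(-(d * d) / (2 * bandwidth ** 2))  # Gaussian similarity
            graph[i][j] = weight
            graph[j][i] = weight  # keep the graph symmetric
    return graph


# Hypothetical points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
graph = knn_graph(points, k=2)
print(sorted(graph[0]))
```

The all-pairs distance computation is quadratic in the number of points, so at scale an approximate nearest-neighbor method would typically replace it.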


Deriving Knowledge from Data at Scale

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
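The iteration can be sketched on a tiny weighted graph as below. This is a minimal illustration under stated assumptions, not a reference implementation: labels are encoded as scores in [-1, 1], labeled nodes are clamped to their seed value, and the function name, tolerance, and example graph are hypothetical.

```python
def propagate_labels(graph, seeds, max_iter=1000, tol=1e-9):
    """graph: {node: {neighbor: weight}}; seeds: {node: score in [-1, 1]}."""
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(max_iter):
        delta = 0.0
        new_scores = {}
        for node, nbrs in graph.items():
            if node in seeds:
                new_scores[node] = seeds[node]  # labeled nodes stay clamped
                continue
            total = sum(nbrs.values())
            # Each unlabeled node takes the weighted average of its neighbors.
            new = sum(w * scores[j] for j, w in nbrs.items()) / total if total else 0.0
            delta = max(delta, abs(new - scores[node]))
            new_scores[node] = new
        scores = new_scores
        if delta < tol:
            break  # converged
    return scores


# Hypothetical chain 0 - 1 - 2 - 3 - 4 with the two ends labeled.
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
seeds = {0: 1.0, 4: -1.0}
scores = propagate_labels(graph, seeds)
print(scores)
```

At the fixed point every unlabeled score equals the weighted average of its neighbors, which is a linear system in the unlabeled scores; solving that system directly is the matrix-inversion view mentioned on the slide.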


Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator | Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
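As one example of such a statistical test, the sketch below applies a two-sided two-proportion z-test to conversion counts from a control (A) and a treatment (B). The counts are made up for illustration, and the function name is hypothetical; other tests may be more appropriate for a given OEC.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 200 of 10,000 users convert in A, 260 of 10,000 in B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; the causal reading still rests on users having been randomly assigned to the two variants.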

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”

B has a big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris

Measure and Control

Accept Results (avoid Semmelweis Reflex)

Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands


Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

The Low DensityAssumption

Semi-Supervised SVMbull Minimize distance to labeled and

unlabeled instancesbull Parameter to fine-tune influence of

unlabeled instancesbull Additional constraint keep class balance correct

Implementationbull Simple extension of SVM

bull But non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1 Define the objective and quantify it with a metric ndash optionally with constraints

if any This typically requires domain knowledge

2 Collect and understand the data deal with the vagaries and biases in the data

acquisition (missing data outliers due to errors in the data collection process

more sophisticated biases due to the data collection procedure etc

3 Frame the problem in terms of a machine learning problem ndash classification

regression ranking clustering forecasting outlier detection etc ndash some

combination of domain knowledge and ML knowledge is useful

4 Transform the raw data into a ldquomodeling datasetrdquo with features weights

targets etc which can be used for modeling Feature construction can often

be improved with domain knowledge Target must be identical (or a very

good proxy) of the quantitative metric identified step 1

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train Test split

5 Train test and evaluate taking care to control

biasvariance and ensure the metrics are

reported with the right confidence intervals

(cross-validation helps here) be vigilant

against target leaks (which typically leads to

unbelievably good test metrics) ndash this is the

ML heavy step

Deriving Knowledge from Data at Scale

Workflow diagram: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target Construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) through (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Workflow diagram: Access Data, Pre-processing, Feature construction, Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Deriving Knowledge from Data at Scale

The Low Density Assumption

Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But a non-convex optimization problem

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
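The stochastic gradient descent bullets above can be sketched on a single machine as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the hinge loss for labeled points, the "hat" loss for unlabeled points, the parameter names (`lr0`, `decay`, `lam`, `c_u`) and the synthetic data are all choices made for this example. The class-balance constraint is left out, and an unlabeled point whose score is exactly zero is skipped because it carries no side information yet.

```python
import random

def s3vm_sgd(labeled, unlabeled, lr0=0.1, decay=0.01, lam=0.01, c_u=0.5, seed=0):
    """One SGD pass for a linear semi-supervised SVM (illustrative sketch).

    labeled:   list of (x, y) with y in {-1, +1}
    unlabeled: list of x
    c_u:       weight controlling the influence of unlabeled instances
    """
    dim = len(labeled[0][0])
    w, b = [0.0] * dim, 0.0
    # None marks an unlabeled instance.
    stream = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(stream)          # one run, random order
    for t, (x, y) in enumerate(stream):
        lr = lr0 / (1.0 + decay * t)             # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:
            if score == 0.0 or abs(score) >= 1.0:
                continue                         # no side yet, or outside the margin
            # Push the point away from the boundary on its current side.
            y, weight = (1 if score > 0 else -1), c_u
        else:
            if y * score >= 1.0:
                continue                         # correct with enough margin
            weight = 1.0
        # Shrink (L2 regularization), then move the classifier a bit.
        w = [wi * (1.0 - lr * lam) + lr * weight * y * xi for wi, xi in zip(w, x)]
        b += lr * weight * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Synthetic data (assumption): two well-separated clusters, few labels.
rng = random.Random(1)

def cluster(cx, cy, n):
    return [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for _ in range(n)]

labeled = [(x, 1) for x in cluster(2, 2, 2)] + [(x, -1) for x in cluster(-2, -2, 2)]
unlabeled = cluster(2, 2, 100) + cluster(-2, -2, 100)
w, b = s3vm_sgd(labeled, unlabeled)
print(predict(w, b, (2.0, 2.0)), predict(w, b, (-2.0, -2.0)))
```

Because the objective is non-convex, the result depends on the visiting order; running with several values of `seed` and keeping the best classifier mirrors the "many random runs" bullet.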

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
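A similarity graph of this kind can be built with a brute-force nearest-neighbor search. The sketch below is only an illustration: the Gaussian similarity with unit bandwidth, the choice of `k`, and the sample points are assumptions for the example, and the quadratic-time search would need an index structure for large datasets.

```python
import math

def knn_graph(points, k=2):
    """Connect each point to its k nearest neighbors (symmetric graph).

    Returns {i: {j: similarity}} with a Gaussian similarity score.
    """
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            sim = math.exp(-d * d)       # similarity decays with distance
            graph[i][j] = sim
            graph[j][i] = sim            # keep the graph undirected
    return graph

# Two small groups of points (made up for the example).
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
graph = knn_graph(points, k=2)
print(sorted(graph[0]))
```

With these points the two groups end up as separate connected components, which is the structure that graph-based methods such as label propagation exploit.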

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
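The iteration described above can be sketched as repeated weighted averaging over a graph. This is a minimal illustration under assumptions: labels are encoded as scores of +1 and -1, labeled nodes are clamped to their labels, and the five-node chain is made up for the example.

```python
def propagate_labels(graph, seeds, iterations=100, tol=1e-6):
    """Iterative label propagation on a weighted graph.

    graph: {node: {neighbor: weight}}
    seeds: {node: label} with label in {-1.0, +1.0}; seed scores stay clamped
    """
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iterations):
        change = 0.0
        new_scores = {}
        for n, nbrs in graph.items():
            if n in seeds:
                new_scores[n] = seeds[n]            # labeled nodes are clamped
                continue
            total = sum(nbrs.values())
            # Each unlabeled node takes the weighted average of its neighbors.
            avg = sum(w * scores[m] for m, w in nbrs.items()) / total if total else 0.0
            change = max(change, abs(avg - scores[n]))
            new_scores[n] = avg
        scores = new_scores
        if change < tol:                            # repeat until convergence
            break
    return scores

# Chain 0 - 1 - 2 - 3 - 4 with the two ends labeled.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0},
         3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = propagate_labels(chain, {0: 1.0, 4: -1.0})
print(scores)
```

The fixed point of this averaging is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is "equivalent to matrix inversion"; the sign of each score gives the predicted class.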

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A
• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
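One common way to check whether a difference between A and B could be due to chance is a two-proportion z-test on conversion rates. The sketch below is an illustration under assumptions: the lecture does not name a specific test, and the visitor and conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts, not data from the lecture.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the observed difference would be unlikely if the two variants truly performed the same; the causal reading still relies on users having been assigned to A and B at random.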

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”. B has a big search button.

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake.

If something is “amazing”, find the flaw!

Examples

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
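The birth-date and drop-down examples suggest a simple sanity check: look for values that occur far more often than the rest. The sketch below is an illustration only; the 5% threshold (`share`) and the synthetic dates are assumptions for the example.

```python
from collections import Counter

def suspicious_values(values, share=0.05):
    """Return values that account for more than `share` of all entries."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items() if c / n > share}

# Hypothetical birth dates: real entries are spread out, fake ones pile up.
dates = [f"{m:02d}/{d:02d}/{70 + (m * d) % 30}"
         for m in range(1, 13) for d in range(1, 28)]
dates += ["01/01/01"] * 40 + ["11/11/11"] * 30
flagged = suspicious_values(dates)
print(sorted(flagged))
```

Values flagged this way are candidates for placeholder or default entries and are worth inspecting before they feed a model or a metric.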

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Cultural stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”.

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average.

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there.
-- Lewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Experiment diagram labels: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment

Page 46: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory

• Known to converge under weak conditions
• Equivalent to matrix inversion
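The iteration can be sketched as below. This is one common variant (labeled nodes are clamped, every other node takes the weighted average of its neighbors' scores), chosen here as an assumption; the graph layout, the function name, and the tolerance are hypothetical. The fixed point of this loop is the solution of a linear system, which is the sense in which the method is equivalent to a matrix inversion.

```python
def label_propagation(graph, seeds, n_iter=100, tol=1e-6):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}.

    Returns {node: score}; the sign of a score is the predicted label.
    """
    scores = {v: seeds.get(v, 0.0) for v in graph}
    for _ in range(n_iter):
        new = {}
        for v, nbrs in graph.items():
            if v in seeds:
                new[v] = seeds[v]          # labeled nodes keep their label
            else:
                total = sum(nbrs.values())
                new[v] = (sum(w * scores[u] for u, w in nbrs.items()) / total
                          if total else 0.0)
        delta = max(abs(new[v] - scores[v]) for v in graph)
        scores = new
        if delta < tol:                    # repeat until convergence
            break
    return scores

# A chain 0-1-2-3-4-5 with a weak link between nodes 2 and 3.
chain = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 0.1},
    3: {2: 0.1, 4: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {4: 1.0},
}
scores = label_propagation(chain, seeds={0: 1.0, 5: -1.0})
```

On this toy chain, nodes 1 and 2 end up with positive scores and nodes 3 and 4 with negative ones, because the weak link carries little label information across the two halves.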

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A
• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator    Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
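As one example of such a statistical test, the sketch below compares the conversion rates of two variants with a two-sided two-proportion z-test. The choice of test, the function name, and the traffic numbers are illustrative assumptions, not figures from the lecture.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and users in the control (A);
    conv_b / n_b: conversions and users in the treatment (B).
    Returns (z, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 2.0% vs 2.6% conversion with 10,000 users per variant.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
significant = p_value < 0.05
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it is the random assignment of users to variants that supports the causal reading.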

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"

B has a big search button

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing," find the flaw!

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes a click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
   • 15% in his ward, staffed by doctors and students
   • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
   • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
   • Doctors were performing autopsies each morning on cadavers
   • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
   • Chlorine and lime were effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40-page journal version…

Deriving Knowledge from Data at Scale

Course Project

Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project…

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment. For feature selection, a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
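Reporting a metric with a confidence interval via cross-validation might look like the sketch below. The nearest-centroid classifier on one-dimensional data is only a stand-in model, the function names are hypothetical, and the interval uses a normal approximation across folds (a t-based interval would be more careful for a small number of folds).

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in model: the mean feature value of each class."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {label: statistics.mean(vals) for label, vals in groups.items()}

def predict_centroid(model, xi):
    """Predict the class whose centroid is closest to xi."""
    return min(model, key=lambda label: abs(xi - model[label]))

def cross_validate(X, y, k=5, seed=0):
    """Return (mean accuracy, half-width of an approximate 95% interval)."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Fit on the training folds only, so the held-out fold stays unseen.
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict_centroid(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(k)
    return mean, half_width

# Toy data: two well-separated classes on a line.
X = [i * 0.1 for i in range(10)] + [5 + i * 0.1 for i in range(10)]
y = [0] * 10 + [1] * 10
mean_acc, half_width = cross_validate(X, y, k=5)
```

Fitting any preprocessing or feature selection inside the loop, on the training folds only, is one way to guard against the target leaks mentioned in step 5.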

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data | Pre-processing | Feature construction | Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Page 48: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descentbull One run over the data in random order

bull Each misclassified or unlabeled instance moves

classifier a bit

bull Steps get smaller over time

Implementation on Hadoopbull Mapper send data to reducer in random order

bull Reducer update linear classifier for unlabeled

or misclassified instances

bull Many random runs to find best one

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions...

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40-page journal version...

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project...

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms

Deriving Knowledge from Data at Scale

Using classification algorithms

Evaluating the model

Splitting to Training and Testing Datasets

Getting Data For the Experiment

Deriving Knowledge from Data at Scale

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at Scale

Customer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints that can be consumed by applications and for batch processing

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
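As a rough illustration of step 5, the sketch below runs k-fold cross-validation and reports the mean accuracy with an interval around it. Everything in it is an assumption made for the example: the helper names, the toy threshold classifier, the synthetic one-dimensional data, and the normal-approximation interval (fold scores are not independent, so treat the interval as approximate).

```python
import random
import statistics


def k_fold_scores(xs, ys, fit, predict, k=5, seed=0):
    """Return one held-out accuracy per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean score with a rough normal-approximation interval across folds."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half, mean + half


def fit_threshold(xs, ys):
    """Toy model: cut halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic data: two overlapping classes on a single feature.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50
scores = k_fold_scores(xs, ys, fit_threshold, predict_threshold)
mean, low, high = mean_with_interval(scores)
print(f"accuracy = {mean:.3f}, rough 95% interval = [{low:.3f}, {high:.3f}]")
```

If every fold comes back near-perfect on a problem that should be hard, that is the cue to look for a target leak rather than to celebrate.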

Deriving Knowledge from Data at Scale

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2)-(5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data | Pre-processing | Feature construction | Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course...

Deriving Knowledge from Data at Scale

Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
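The bullets above can be sketched on a single machine as follows. This is a minimal illustration, not the Hadoop implementation from the slide: the pseudo-labeling rule (an unlabeled point is pushed further onto whichever side of the margin it currently falls), the missing bias term, the step-size schedule `lr0 / (1 + t)`, the regularization constant `lam`, and the toy data are all assumptions.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def s3vm_sgd(labeled, unlabeled, lr0=0.1, lam=0.01, seed=0):
    """One SGD pass over the shuffled data for a linear classifier (no bias term)."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)        # one run in random order
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                   # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0:
                continue                     # no side to push towards yet
            y = 1 if score > 0 else -1       # pseudo-label from the current classifier
        if y * score < 1:                    # misclassified or inside the margin
            w = [(1 - lr * lam) * wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1


# Toy data: two labeled points plus four unlabeled points near them.
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
w = s3vm_sgd(labeled, unlabeled)
print(w, predict(w, (3.0, 3.0)), predict(w, (-3.0, -3.0)))
```

To mirror "many random runs to find the best one", the function could be called with several values of `seed` and the resulting classifiers compared on held-out labeled data.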

Deriving Knowledge from Data at Scale

The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
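A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The example below assumes Euclidean distance as the (inverse) similarity score and makes the graph symmetric; the function name `knn_graph`, the choice `k=2`, and the sample points are illustrative only.

```python
import math


def knn_graph(points, k=2):
    """Connect each point to its k nearest neighbors; edges are made symmetric."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Two well-separated groups of points; no edge should cross between them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(points, k=2)
print(graph)
```

The brute-force distance computation is quadratic in the number of instances, so at scale an approximate nearest-neighbor index would normally replace the inner loop.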

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
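The propagation idea can be sketched as repeated neighbor averaging with the labeled instances held fixed. This is one common variant, assumed here for illustration: the function name `propagate_labels`, the unweighted graph, the +1/-1 seed scores, and the stopping tolerance are not taken from the slides. The fixed point of the loop is the solution of a linear system, which is the sense in which the method is equivalent to a matrix inversion.

```python
def propagate_labels(graph, seeds, tol=1e-9, max_iter=1000):
    """Average neighbor scores until they stop changing; seed scores stay clamped."""
    scores = {n: float(seeds.get(n, 0.0)) for n in graph}
    for _ in range(max_iter):
        delta = 0.0
        for n, nbrs in graph.items():
            if n in seeds or not nbrs:
                continue
            new = sum(scores[m] for m in nbrs) / len(nbrs)
            delta = max(delta, abs(new - scores[n]))
            scores[n] = new
        if delta < tol:
            break
    return scores


# Chain of five instances; only the two ends are labeled (+1 and -1).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = propagate_labels(chain, seeds={0: 1.0, 4: -1.0})
print(scores)
```

On the chain, the unlabeled scores settle on a straight line between the two seeds, so the sign of each score gives the propagated label and its magnitude gives a rough confidence.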

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break...

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A
• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator | Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
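For a conversion-style metric, one standard check that a difference is not due to chance is a two-proportion z-test. The sketch below is an illustration under assumptions: the function name, the pooled-variance form of the test, and the visitor and conversion counts are made up for the example and do not come from the lecture.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the normal
    return z, p_value


# Hypothetical experiment: 1000 users per variant, 100 vs. 150 conversions.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed gap would be unlikely if A and B truly converted at the same rate; it is the random assignment of users to A and B that lets the gap be read as causal.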

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 8.5% better

Page 50: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

The ManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifoldbull One can perform learning in a more meaningful

low-dimensional spacebull Avoids curse of dimensionality

Similarity Graphs

bull Idea compute similarity scores between instances

bull Create network where the nearest

neighbors are connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1 Define the objective and quantify it with a metric ndash optionally with constraints

if any This typically requires domain knowledge

2 Collect and understand the data deal with the vagaries and biases in the data

acquisition (missing data outliers due to errors in the data collection process

more sophisticated biases due to the data collection procedure etc

3 Frame the problem in terms of a machine learning problem ndash classification

regression ranking clustering forecasting outlier detection etc ndash some

combination of domain knowledge and ML knowledge is useful

4 Transform the raw data into a ldquomodeling datasetrdquo with features weights

targets etc which can be used for modeling Feature construction can often

be improved with domain knowledge Target must be identical (or a very

good proxy) of the quantitative metric identified step 1

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train Test split

5 Train test and evaluate taking care to control

biasvariance and ensure the metrics are

reported with the right confidence intervals

(cross-validation helps here) be vigilant

against target leaks (which typically leads to

unbelievably good test metrics) ndash this is the

ML heavy step

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand

the data

Pre-processing

Feature andor

Target

construction

Feature selection

Model training

Model scoring

Evaluation

Train Test split

6 Iterate steps (2) ndash (5) until the test metrics are satisfactory

Deriving Knowledge from Data at Scale

Access Data

Pre-processing

Feature

construction

Model scoring

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Book

Recommendation

Deriving Knowledge from Data at Scale

Thatrsquos all for our coursehellip

Page 51: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more meaningful

low-dimensional space

bull Avoids curse of dimensionality

Similarity Graphsbull Idea compute similarity scores between instances

bull Create a network where the nearest neighbors are

connected

Deriving Knowledge from Data at Scale

TheManifoldAssumption

The Assumption

bull Training data is (roughly) contained in a low

dimensional manifold

bull One can perform learning in a more

meaningful low-dimensional space

bull Avoids curse of dimensionality

SimilarityGraphs

bull Idea compute similarity scores between instances

bullCreate network where the nearest neighbors

are connected

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
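A minimal sketch of the propagate-and-repeat loop, assuming one simple variant (unlabeled nodes take the average of their neighbors' scores while labeled nodes stay clamped). The function name `propagate_labels` and the toy chain graph are hypothetical, and the lecture's exact update rule may differ:

```python
def propagate_labels(neighbors, seeds, iterations=100, tol=1e-9):
    """Iteratively average neighbor scores, keeping labeled seeds fixed."""
    # Scores lie in [-1, +1]; unlabeled nodes start at 0 (no information).
    scores = {node: float(seeds.get(node, 0.0)) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = float(seeds[node])  # clamp known labels
            else:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        change = max(abs(updated[n] - scores[n]) for n in neighbors)
        scores = updated
        if change < tol:  # converged: another pass changes nothing
            break
    return scores

# Chain graph a - b - c - d with one labeled instance at each end.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
seeds = {"a": +1, "d": -1}
scores = propagate_labels(neighbors, seeds)
predicted = {n: (1 if s > 0 else -1) for n, s in scores.items()}
```

For this chain the loop settles at b = 1/3 and c = -1/3, the same values obtained by solving the two linear equations b = (1 + c)/2 and c = (b - 1)/2 directly, which illustrates the matrix-inversion view.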


Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC

Overall Evaluation Criterion

Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
[Figure labels: Left Elevator, Right Elevator]

• Lesson 1
• Lesson 2
• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden’s blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
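One common way to check that a difference is not due to chance is a two-proportion z-test on conversion counts. The sketch below assumes that test (the slides do not name a specific one), and the function name and the counts are hypothetical:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 2.0% conversion in control (A), 2.5% in treatment (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
significant = p_value < 0.05
```

Running an A/A test through the same function is a useful check of the setup: identical variants should rarely come out significant.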

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same

A B

• A was 8.5% better

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A B

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing,” find the flaw

Examples

• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
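A small sketch of how such placeholder entries can be surfaced: count how often each value occurs and flag any value that takes an implausibly large share of the records. The helper name `suspicious_values`, the 5% threshold, and the sample data are all hypothetical:

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Flag values that account for an implausibly large share of records."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}

# Hypothetical field: mostly distinct dates plus two over-used placeholders.
real_dates = [
    f"{m:02d}/{d:02d}/{y:02d}"
    for y in range(60, 100)
    for m in (3, 7)
    for d in (5, 19, 23)
]
birth_dates = ["11/11/11"] * 30 + ["01/01/01"] * 20 + real_dates
flagged = suspicious_values(birth_dates)
```

A flagged value is only a prompt to look closer; the right threshold depends on the field.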

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Cultural stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936;44(2)
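To illustrate how two unrelated quantities can correlate strongly, the sketch below computes a Pearson correlation on made-up series that both simply trend upward over the same years. The numbers are hypothetical and are not the Oldenburg data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical series: each grows with time plus a small wobble.
years = list(range(10))
storks = [100 + 10 * t + (3 if t % 2 else -3) for t in years]
people = [50_000 + 2_000 * t + (500 if t % 3 == 0 else -250) for t in years]
r = pearson(storks, people)
```

The coefficient comes out close to 1 even though neither series was built from the other; the shared time trend alone produces it, which is why a controlled experiment is needed to claim causality.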

Women have smaller palms and live 6 years longer on average

But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Workflow stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target Construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Workflow stages: Feature Selection; Model Training; Model Scoring; Evaluation; Train/Test Split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step
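A minimal sketch of step 5, assuming a toy 1-D nearest-centroid classifier and synthetic data (both hypothetical, chosen only to keep the example self-contained). It shows k-fold cross-validation with the model fitted on the training folds only, and a rough interval around the mean accuracy:

```python
import random
import statistics

def k_fold_accuracy(xs, ys, k=5, seed=0):
    """Return per-fold accuracy of a 1-D nearest-centroid classifier."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Fit on the training folds only, so no test information leaks in.
        centroids = {
            label: statistics.mean(xs[i] for i in train_idx if ys[i] == label)
            for label in set(ys[i] for i in train_idx)
        }
        correct = sum(
            min(centroids, key=lambda c: abs(xs[i] - centroids[c])) == ys[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return accuracies

# Synthetic two-class data: class 0 centered at 0, class 1 centered at 3.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
accs = k_fold_accuracy(xs, ys)
mean_acc = statistics.mean(accs)
# Rough 95% interval from the spread across folds (normal approximation).
half_width = 1.96 * statistics.stdev(accs) / len(accs) ** 0.5
```

With only five folds the normal approximation is optimistic, so the interval should be read as a rough guide. A test accuracy near 100% on a problem like this would be a cue to look for a target leak.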

Workflow stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target Construction; Feature Selection; Model Training; Model Scoring; Evaluation; Train/Test Split

6. Iterate steps (2)–(5) until the test metrics are satisfactory

Stages: Access Data; Pre-processing; Feature Construction; Model Scoring

Book Recommendation

That’s all for our course…


Page 53: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks, which typically lead to unbelievably good test metrics. This is the ML-heavy step.
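The train/test/evaluate step can be illustrated with a small, standard-library-only sketch of k-fold cross-validation that reports a metric together with an approximate interval. Everything named here (`k_fold_indices`, `cross_validate`, `mean_with_interval`, the toy threshold classifier and the synthetic data) is an assumption made for illustration and is not taken from the lecture. The interval is a rough normal approximation over fold scores, which are not fully independent, so treat it as indicative only.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the held-out accuracy of each fold for fit/predict callables."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # The model only ever sees the training rows of this fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean score and an approximate 95% interval from fold-to-fold spread."""
    m = sum(scores) / len(scores)
    var = sum((s - m) ** 2 for s in scores) / (len(scores) - 1)
    half = z * math.sqrt(var / len(scores))
    return m, m - half, m + half


# Hypothetical toy model: one feature, threshold midway between class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic data: class 1 is centred at 2.0, class 0 at 0.0.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 if y else 0.0, 1.0) for y in ys]

scores = cross_validate(xs, ys, fit_threshold, predict_threshold)
mean_acc, low, high = mean_with_interval(scores)
print(f"accuracy {mean_acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

If a sketch like this ever reported near-perfect accuracy on real data, the slide's advice applies: suspect a target leak before celebrating.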

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2) to (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data

Pre-processing

Feature construction

Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…

Page 54: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Label Propagation

Main Idea

• Propagate label information to neighboring instances

• Then repeat until convergence

• Similar to PageRank

Theory

• Known to converge under weak conditions

• Equivalent to matrix inversion
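The propagate-and-repeat idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: the function name `propagate_labels`, the dictionary-based graph format, the 0.5 starting score and the tolerance are all assumptions. Labeled nodes are clamped to their known value, and each unlabeled node repeatedly takes the weighted average of its neighbors' scores. The sketch assumes every unlabeled node has at least one neighbor and that every connected component contains a labeled node.

```python
def propagate_labels(weights, labels, tol=1e-6, max_iter=1000):
    """Iterative label propagation on a weighted, undirected graph.

    weights: dict mapping node -> {neighbor: edge weight}
    labels:  dict mapping the few labeled nodes -> 0.0 or 1.0
    Returns a dict mapping every node to a score in [0, 1].
    """
    # Unlabeled nodes start at an uninformative 0.5.
    scores = {node: labels.get(node, 0.5) for node in weights}
    for _ in range(max_iter):
        largest_change = 0.0
        updated = {}
        for node, nbrs in weights.items():
            if node in labels:
                updated[node] = labels[node]  # clamp known labels
                continue
            total = sum(nbrs.values())
            new = sum(w * scores[n] for n, w in nbrs.items()) / total
            largest_change = max(largest_change, abs(new - scores[node]))
            updated[node] = new
        scores = updated
        if largest_change < tol:  # converged
            break
    return scores


# Hypothetical chain graph a - b - c - d - e with only the two ends labeled.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "e": 1.0},
    "e": {"d": 1.0},
}
scores = propagate_labels(graph, {"a": 1.0, "e": 0.0})
for node in sorted(scores):
    print(node, round(scores[node], 3))
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion; the loop simply avoids forming that inverse explicitly.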

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only a few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation

Deriving Knowledge from Data at Scale

10 Minute Break…

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

• Lesson 2: Get the data

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
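One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion rates. The sketch below is an illustration under stated assumptions: the function name `two_proportion_z_test` and the visitor and conversion counts are made up, and the slides do not say which test the lecturer had in mind. It uses only `math.erfc` from the standard library to obtain a two-sided p-value.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of converting users in control (A) and treatment (B)
    n_a, n_b:       number of users exposed to each variant
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts only: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

A small p-value says the observed gap would be unlikely if the variants truly performed the same; the causal reading still depends on users having been assigned to A and B at random.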

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if you think they're about the same

A    B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"

B has a big search button

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A    B

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data, prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
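The birth-date and drop-down examples suggest a simple sanity check: look for single values that account for a suspiciously large share of a field. The sketch below is one hypothetical way to do that; the function name `suspicious_defaults`, the 5% threshold and the synthetic records are assumptions for illustration, not part of the lecture.

```python
from collections import Counter


def suspicious_defaults(values, min_share=0.05):
    """Return {value: share} for values that make up at least min_share of entries.

    A single birth date or job title covering several percent of all records
    is usually a placeholder or a form default rather than a real signal.
    """
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}


# Synthetic records: 90 distinct dates plus 10 copies of a placeholder date.
birth_dates = [
    f"{(i % 28) + 1:02d}/{(i % 12) + 1:02d}/{60 + i % 40}" for i in range(90)
]
birth_dates += ["01/01/01"] * 10

flagged = suspicious_defaults(birth_dates)
print(flagged)
```

The right threshold depends on the field: a share that is alarming for birth dates may be perfectly normal for a country or language column.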

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?

• Insight

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris

Measure and Control

Accept Results (avoid Semmelweis Reflex)

Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Deriving Knowledge from Data at Scale

Page 56: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Label Propagation

Main Ideabull Propagate label information to neighboring instances

bull Then repeat until convergence

bull Similar to PageRank

Theorybull Known to converge under weak conditions

bull Equivalent to matrix inversion

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learningbull Only few training instances have labels

bull Unlabeled instances can still provide valuable signal

Different assumptions lead to different approachesbull Cluster assumption generative models

bull Low density assumption semi-supervised support vector machines

bull Manifold assumption label propagation

Deriving Knowledge from Data at Scale

10 Minute Breakhellip

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1 Define the objective and quantify it with a metric ndash optionally with constraints

if any This typically requires domain knowledge

2 Collect and understand the data deal with the vagaries and biases in the data

acquisition (missing data outliers due to errors in the data collection process

more sophisticated biases due to the data collection procedure etc

3 Frame the problem in terms of a machine learning problem ndash classification

regression ranking clustering forecasting outlier detection etc ndash some

combination of domain knowledge and ML knowledge is useful

4 Transform the raw data into a ldquomodeling datasetrdquo with features weights

targets etc which can be used for modeling Feature construction can often

be improved with domain knowledge Target must be identical (or a very

good proxy) of the quantitative metric identified step 1

Deriving Knowledge from Data at Scale

Feature selection; Model training; Model scoring; Evaluation; Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
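As a rough illustration of step 5, the sketch below runs k-fold cross-validation and reports the mean accuracy with an approximate interval. It is a minimal sketch, not the course's method: the function names, the toy threshold model, and the synthetic data are all hypothetical, and the normal-approximation interval across folds is only a crude guide because the folds are not independent.

```python
import math
import random
import statistics


def kfold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and deal them into k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_fn, predict_fn, k=5, seed=0):
    """Return one held-out accuracy per fold."""
    scores = []
    for held_out in kfold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean score with a rough normal-approximation interval across folds."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Hypothetical toy model: a threshold halfway between the two class means.
def train_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic one-feature data: class 0 centred at 0, class 1 centred at 2.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(2, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100

scores = cross_validate(xs, ys, train_threshold, predict_threshold, k=5)
mean, low, high = mean_with_interval(scores)
print(f"accuracy {mean:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

A test accuracy that looks too good across every fold is a cue to check for a target leak before trusting the number.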

Deriving Knowledge from Data at Scale

Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2) through (5) until the test metrics are satisfactory.

Deriving Knowledge from Data at Scale

Access Data; Pre-processing; Feature construction; Model scoring

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course...

Deriving Knowledge from Data at Scale

Conclusion

Semi-Supervised Learning

• Only a few training instances have labels

• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches

• Cluster assumption: generative models

• Low density assumption: semi-supervised support vector machines

• Manifold assumption: label propagation
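To make the last bullet concrete, here is a minimal sketch of label propagation on a small graph. It is an illustrative simplification rather than any particular library's algorithm: the function name, the toy graph, and the choice of simple neighbour averaging with clamped seed labels are all assumptions made for the example.

```python
def propagate_labels(edges, seed_labels, n_nodes, n_iter=50):
    """Iteratively average neighbour scores, clamping the labelled nodes.

    edges: list of (i, j) undirected pairs.
    seed_labels: {node: 0 or 1} for the few labelled instances.
    Returns one score in [0, 1] per node (higher means closer to class 1).
    """
    neighbours = [[] for _ in range(n_nodes)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    scores = [0.5] * n_nodes
    for node, label in seed_labels.items():
        scores[node] = float(label)

    for _ in range(n_iter):
        updated = []
        for node in range(n_nodes):
            if node in seed_labels:
                # Labelled nodes keep their known label.
                updated.append(float(seed_labels[node]))
            elif neighbours[node]:
                # Unlabelled nodes take the mean score of their neighbours.
                total = sum(scores[m] for m in neighbours[node])
                updated.append(total / len(neighbours[node]))
            else:
                updated.append(scores[node])
        scores = updated
    return scores


# Two tight clusters (nodes 0-2 and nodes 3-5) joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
scores = propagate_labels(edges, seed_labels={0: 0, 5: 1}, n_nodes=6)
predicted = [1 if s > 0.5 else 0 for s in scores]
print(predicted)  # prints [0, 0, 0, 1, 1, 1]
```

Only two of the six nodes carry labels, yet the graph structure is enough to assign the rest, which is the sense in which unlabeled instances provide signal.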

Deriving Knowledge from Data at Scale

10 Minute Break...

Deriving Knowledge from Data at Scale

Controlled Experiments

Deriving Knowledge from Data at Scale

• A

• B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

• Lesson 2: GET THE DATA

Deriving Knowledge from Data at Scale

Lesson 3: Prepare to be humbled

Left Elevator, Right Elevator

Deriving Knowledge from Data at Scale

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
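One common way to check whether a difference between control and treatment could be due to chance is a two-proportion z-test on conversion counts. The sketch below is a minimal illustration under stated assumptions: the function name and the visitor and conversion counts are made up for the example, and the deck itself does not say which test to use.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of converting users in each variant.
    n_a, n_b: number of users exposed to each variant.
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these made-up counts the treatment looks better on the surface, but the p-value is what tells you whether the gap is distinguishable from noise at your chosen threshold.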

Deriving Knowledge from Data at Scale

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they're about the same

A B

Deriving Knowledge from Data at Scale

• A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"

B has a big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Deriving Knowledge from Data at Scale

Get the data; prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
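A quick way to catch placeholder entries like the birth-date and drop-down examples is to look for values that take up an implausibly large share of a field. The sketch below assumes a hypothetical helper name, an arbitrary 5% threshold, and made-up sample data; a real check would tune the threshold per field.

```python
from collections import Counter


def suspicious_values(values, min_share=0.05):
    """Return {value: share} for values that make up at least min_share of a field."""
    if not values:
        return {}
    counts = Counter(values)
    total = len(values)
    return {
        value: count / total
        for value, count in counts.items()
        if count / total >= min_share
    }


# Hypothetical birth-date column: 30 placeholder entries among 100 rows.
birth_dates = ["01/01/01"] * 30 + [f"date-{i}" for i in range(70)]
flagged = suspicious_values(birth_dates)
print(flagged)  # prints {'01/01/01': 0.3}
```

A flagged value is not proof of bad data, only a prompt to look for the flaw before building a statistic on that field.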

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

• OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

• 15% in his ward, staffed by doctors and students

• 2% in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

• Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?

• Insight

• Doctors were performing autopsies each morning on cadavers

• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

• Chlorine and lime were effective: the death rate fell from 18% to 1%

Deriving Knowledge from Data at Scale

Semmelweis Reflex

• Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

Deriving Knowledge from Data at Scale

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

Deriving Knowledge from Data at Scale

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer on average

But... don't try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you don't know where you are going, any road will take you there

-- Lewis Carroll

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions...

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal version...

Deriving Knowledge from Data at Scale

Course Project: Due Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Project...

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample Experiments

To help you get started

Deriving Knowledge from Data at Scale

Page 60: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

bull A

bull B

Deriving Knowledge from Data at Scale

OEC

Overall Evaluation Criterion

Picking a good OEC is key

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing”, find the flaw
Examples:
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
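A quick frequency check can surface placeholder values like the ones in the examples above. This is a minimal sketch, not a method from the lecture: the function name, the 5% threshold, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def suspicious_values(values, min_share=0.05):
    """Return (value, share) pairs for values that make up at least
    `min_share` of all entries, most frequent first.

    The 5% default is an arbitrary illustrative threshold; a sensible
    cut-off depends on how many distinct values the field should have.
    """
    total = len(values)
    if total == 0:
        return []
    counts = Counter(values)
    return [
        (value, count / total)
        for value, count in counts.most_common()
        if count / total >= min_share
    ]


# Hypothetical sample: 80 distinct dates plus two over-used placeholders.
real_dates = [f"{i % 12 + 1:02d}/{i % 28 + 1:02d}/{60 + i % 40}" for i in range(80)]
birth_dates = real_dates + ["01/01/01"] * 12 + ["11/11/11"] * 8

for value, share in suspicious_values(birth_dates):
    print(f"{value}: {share:.0%} of entries")
```

The same check applied to a drop-down field would flag a default entry such as the first alphabetical option.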

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
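The stork chart can be mimicked with a small worked example: two series that both grow over time correlate strongly, yet the correlation vanishes once the shared trend is removed. All numbers below are made up for illustration (they are not the Oldenburg data), and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, ts):
    """Residuals of ys after a least-squares straight-line fit on ts."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / sum(
        (t - mean_t) ** 2 for t in ts
    )
    return [y - (mean_y + slope * (t - mean_t)) for t, y in zip(ts, ys)]


# Made-up data: both series share an upward trend over eight years,
# plus small year-to-year wiggles that are unrelated to each other.
years = list(range(8))
wiggle_a = [1, -1, -1, 1, 1, -1, -1, 1]
wiggle_b = [1, 1, -1, -1, -1, -1, 1, 1]
storks = [100 + 10 * t + 5 * a for t, a in zip(years, wiggle_a)]
humans = [50_000 + 2_000 * t + 500 * b for t, b in zip(years, wiggle_b)]

raw_corr = pearson(storks, humans)
detrended_corr = pearson(residuals(storks, years), residuals(humans, years))
print(f"raw correlation:       {raw_corr:.2f}")
print(f"after removing trend:  {detrended_corr:.2f}")
```

Here time acts as the common cause. A controlled experiment avoids the problem by randomizing who gets the treatment, which a purely observational chart cannot do.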

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

• Using classification algorithms
• Evaluating the model
• Splitting into training and testing datasets
• Getting data for the experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection | Model training | Model scoring | Evaluation | Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
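Step 5 can be sketched as k-fold cross-validation that reports a metric together with an interval rather than a single number. This is a minimal illustration under stated assumptions: the helper names, the majority-class baseline standing in for a real model, and the labels are hypothetical, and the normal-approximation interval is rough for a handful of folds (fold scores are not independent, and a t-based interval would be wider).

```python
import math
import statistics


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def mean_with_interval(scores, z=1.96):
    """Mean of per-fold scores with an approximate 95% normal interval."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def majority_class_accuracy(train_labels, test_labels):
    """Baseline 'model': always predict the most common training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


# Hypothetical binary labels; a real run would shuffle (or stratify) first.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
fold_scores = [
    majority_class_accuracy([labels[i] for i in train], [labels[i] for i in test])
    for train, test in k_fold_indices(len(labels), k=4)
]
mean, low, high = mean_with_interval(fold_scores)
print(f"accuracy = {mean:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

A baseline like this is also a cheap guard against target leaks: if a real model scores far above it with almost no spread across folds, the features deserve a second look before the result is trusted.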

Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Access Data | Pre-processing | Feature construction | Model scoring

Book Recommendation

That’s all for our course…

Page 63: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

bull Lesson 2 GET THE DATA

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Lesson 2 Get the data

Deriving Knowledge from Data at Scale

Lesson 3 Prepare to be humbledLeft Elevator Right Elevator

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment: for feature selection, large set of machine learning algorithms

[Figure: Azure ML experiment with callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment]

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
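Step 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's method: the synthetic two-class data, the nearest-centroid classifier and the helper names (`make_data`, `fit_centroids`, `predict`, `cross_validate`) are all assumptions made for the example. The interval uses a rough normal approximation over the fold scores; with only a handful of folds a t-quantile would give a wider interval.

```python
import random
import statistics

def make_data(n=400, seed=0):
    """Synthetic 2-D points from two overlapping Gaussian classes (illustrative)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        x = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        data.append((x, label))
    return data

def fit_centroids(train):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    centroids = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def cross_validate(data, k=5, seed=0):
    """Return the accuracy on each of k held-out folds."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit_centroids(train)  # fit on the training folds only
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return scores

scores = cross_validate(make_data())
mean = statistics.mean(scores)
half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {mean:.3f} +/- {half_width:.3f} (rough 95% interval)")
```

Fitting the model only on the training folds is the part that guards against the simplest kind of leak: anything computed from the held-out fold (including the target) never reaches the model before scoring.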

Stages: Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory

Stages: Access Data, Pre-processing, Feature construction, Model scoring

Book Recommendation

That's all for our course…


• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator / Right Elevator

• Lesson 1

• Lesson 2

• Lesson 3

15 Bing

• HiPPO: stop the project

From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance

• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
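One common way to check that a difference between control and treatment is unlikely to be due to chance is a two-proportion z-test on conversion rates. The sketch below is an illustration only: the function name `two_proportion_z_test` and the visitor and conversion counts are made up for the example, and the slides do not say which test the lecture had in mind.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail probability
    return z, p_value

# Hypothetical counts: 10,000 users per variant.
z, p = two_proportion_z_test(conv_a=500, n_a=10000, conv_b=570, n_b=10000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A small p-value says the observed gap would be unusual if A and B truly performed the same; because users are assigned to variants at random, a real gap can then be attributed to the treatment rather than to a confounder.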

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if you think they're about the same

A B

• A was 85 better

A B

Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"

B has big search button

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

• If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
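The birth-date and drop-down examples suggest a simple sanity check: look for single values that take up an implausibly large share of a column. A minimal sketch follows, assuming a hypothetical helper `suspicious_values`, an arbitrary threshold and invented sample data.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return {value: share} for values making up at least min_share of the column."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}

# Invented sample of a mandatory birth-date field.
birth_dates = ["03/14/82", "11/11/11", "07/02/90", "11/11/11", "01/01/01",
               "12/25/75", "11/11/11", "01/01/01", "09/30/88", "05/05/69"]

flagged = suspicious_values(birth_dates, min_share=0.2)
print(flagged)
```

The threshold would need tuning per field; the idea is only to surface placeholder entries and form defaults before they are mistaken for an interesting finding.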

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?

• Insight:

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States




Page 67: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

bull Lesson 1

bull Lesson 2

bull Lesson 3

15 Bing

Deriving Knowledge from Data at Scale

bull HiPPO stop the project

From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml

Deriving Knowledge from Data at Scale

TED talk

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake.

If something is "amazing," find the flaw.

Examples:
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01.
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut.
• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why).
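The birth-date and default-dropdown examples above can be screened for automatically by looking for values that occupy a suspiciously large share of a column. The following is a minimal sketch under assumed inputs: the function name, the 5% threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def dominant_values(values, threshold=0.05):
    """Return each value whose share of all records exceeds `threshold`."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > threshold}


# Hypothetical birth-date column: two placeholder dates plus 50 distinct real ones.
others = [f"03/{d:02d}/{y}" for d in range(1, 26) for y in (70, 80)]
birth_dates = ["11/11/11"] * 30 + ["01/01/01"] * 20 + others

flagged = dominant_values(birth_dates)
for value, share in sorted(flagged.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {share:.0%} of records")
```

A flagged value is not necessarily wrong; it is a prompt to check how the field was collected before trusting any statistic built on it.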

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram labels: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence."

Ornithologische Monatsberichte 1936; 44(2)
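The stork example can be mimicked with simulated data in which a third factor drives both series. This sketch does not use the Oldenburg data; the confounder (`land`), the coefficients, and the noise levels are invented purely to show how a strong correlation can vanish once the common cause is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, zs):
    """Residuals of a least-squares straight-line fit of ys on zs."""
    n = len(ys)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum(
        (z - mz) ** 2 for z in zs
    )
    return [y - (my + slope * (z - mz)) for y, z in zip(ys, zs)]


rng = random.Random(0)
# Hypothetical confounder, e.g. the size of the settled area.
land = [rng.uniform(0, 10) for _ in range(500)]
storks = [2.0 * z + rng.gauss(0, 1) for z in land]
people = [3.0 * z + rng.gauss(0, 1) for z in land]

raw = pearson(storks, people)
adjusted = pearson(residuals(storks, land), residuals(people, land))
print(f"raw correlation: {raw:.2f}")
print(f"correlation after removing the confounder: {adjusted:.2f}")
```

In practice the confounder is often unknown or unmeasured, which is why a controlled experiment with random assignment is the more reliable way to establish causality.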

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

"If you don't know where you are going, any road will take you there."
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

[Screenshot callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Workflow diagram labels: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset," with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

[Workflow diagram labels: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
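Step 5 can be sketched as k-fold cross-validation that reports a mean score with an interval rather than a single number. Everything below is illustrative: the toy one-feature threshold classifier, the synthetic data, and the rough normal-approximation interval over fold scores are assumptions, not the course's tooling. Fold scores are not fully independent, so the interval should be read as approximate.

```python
import random
import statistics


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy 1-D classifier: midpoint between the class means.

    Assumes class 1 tends to have larger feature values than class 0.
    """
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, xs, ys):
    """Fraction of examples where (x > threshold) matches the label."""
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k=5, seed=0):
    """Return mean accuracy over k folds and an approximate 95% interval."""
    folds = k_fold_indices(len(xs), k, random.Random(seed))
    scores = []
    for i, test_idx in enumerate(folds):
        # Train only on the other folds so the test fold stays unseen.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        thr = fit_threshold([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(accuracy(thr, [xs[j] for j in test_idx], [ys[j] for j in test_idx]))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half_width, mean + half_width)


# Synthetic data: class 1 is shifted up by two standard deviations.
rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

mean_acc, (low, high) = cross_validate(xs, ys)
print(f"accuracy: {mean_acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

If a score like this comes out near perfect on a hard problem, that is a cue to look for a target leak, such as a feature that is computed from the label or recorded after the outcome.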

[Workflow diagram labels: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]

6. Iterate steps (2)–(5) until the test metrics are satisfactory.

[Workflow diagram labels: Access Data; Pre-processing; Feature construction; Model scoring]

Page 70: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

bull Must run statistical tests to confirm differences are not due to chance

bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if you think theyrsquore about the same

A B

Deriving Knowledge from Data at Scale

bull A was 85 better

Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
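Steps 2 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the field names, the median imputation, and the `MAX_PLAUSIBLE_CHARGE` cap are all hypothetical and are not taken from the lecture.

```python
from statistics import median

# Hypothetical raw customer records; None marks a missing value.
raw_records = [
    {"tenure_months": 12, "monthly_charge": 70.0, "status": "active"},
    {"tenure_months": None, "monthly_charge": 55.0, "status": "cancelled"},
    {"tenure_months": 48, "monthly_charge": 9999.0, "status": "active"},  # likely entry error
    {"tenure_months": 3, "monthly_charge": 80.0, "status": "cancelled"},
]

MAX_PLAUSIBLE_CHARGE = 500.0  # assumed domain limit, chosen for illustration


def build_modeling_dataset(records):
    """Return (features, targets) with missing values imputed and outliers capped."""
    known_tenures = [r["tenure_months"] for r in records if r["tenure_months"] is not None]
    tenure_fill = median(known_tenures)

    features, targets = [], []
    for r in records:
        tenure = r["tenure_months"] if r["tenure_months"] is not None else tenure_fill
        charge = min(r["monthly_charge"], MAX_PLAUSIBLE_CHARGE)
        features.append([float(tenure), charge])
        # The target encodes the step 1 objective (churn) as 0/1.
        targets.append(1 if r["status"] == "cancelled" else 0)
    return features, targets


features, targets = build_modeling_dataset(raw_records)
print(features)
print(targets)
```

Whether to impute, cap, or drop a suspicious record is a domain decision; the point of the sketch is only that these choices are made explicitly before modeling.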

Pipeline stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
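As a rough sketch of reporting a metric with an interval, the following runs 5-fold cross-validation on synthetic data with a toy nearest-centroid classifier. Everything here is an assumption made for illustration: the data, the classifier, the helper names, and the use of a Student-t interval over fold scores (which is only approximate, since folds share training data).

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(xs, ys):
    """Toy model: the mean feature value of each class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def cross_validate(xs, ys, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    folds = k_fold_indices(len(xs), k, random.Random(seed))
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        model = fit_centroids(train_x, train_y)  # fit on the training folds only
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Synthetic one-feature data: class 1 is shifted by two standard deviations.
data_rng = random.Random(42)
ys = [i % 2 for i in range(100)]
xs = [data_rng.gauss(2.0 * y, 1.0) for y in ys]

scores = cross_validate(xs, ys, k=5)
avg = mean(scores)
T_CRIT_4DF = 2.776  # two-sided 95% Student-t critical value for k - 1 = 4 degrees of freedom
half_width = T_CRIT_4DF * stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {avg:.3f} +/- {half_width:.3f}")
```

Fitting inside the loop, on training folds only, is also the simplest guard against one kind of leak: any statistic computed from the held-out rows never reaches the model.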

Pipeline stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) through (5) until the test metrics are satisfactory.

Pipeline stages: Access Data, Pre-processing, Feature construction, Model scoring

Book Recommendation

That’s all for our course…

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don’t raise your hand if you think they’re about the same

[Images: variant A, variant B]

• A was 85 better
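Whether an observed difference between A and B is more than chance depends on the sample sizes. A minimal sketch of a two-proportion z-test follows; the conversion counts are hypothetical (they are not the data behind this slide), and the function name is made up for illustration.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift of A over B, z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return (p_a - p_b) / p_b, z, p_value


# Hypothetical counts: 20,000 users per variant.
lift, z, p_value = two_proportion_z_test(conv_a=1_100, n_a=20_000, conv_b=1_000, n_b=20_000)
print(f"lift = {lift:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```

With the same conversion rates but a tenth of the traffic, the same lift would no longer be distinguishable from noise, which is why the test, not the raw lift, decides the winner.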

[Images: variant A, variant B]

Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button.

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don’t raise your hand if they are about the same

[Images: variant A, variant B]

• Raise your right hand if you think A wins

• Raise your left hand if you think B wins

• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled.

Any statistic that appears interesting is almost certainly a mistake.

If something is “amazing”, find the flaw.

Examples:

• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01.

• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut.

• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why).
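Placeholder values of this kind can often be spotted by looking for values that take an outsized share of a field. The sketch below is one simple way to do that; the sample dates, the function name, and the 20% threshold are assumptions chosen for illustration.

```python
from collections import Counter


def suspicious_values(values, share_threshold=0.2):
    """Return {value: share} for values that make up an outsized share of a field."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= share_threshold}


# Hypothetical birth dates entered in a mandatory form field.
birth_dates = [
    "01/01/01", "03/14/82", "01/01/01", "11/11/11", "07/30/75",
    "01/01/01", "11/11/11", "09/02/90", "12/25/68", "01/01/01",
]

flags = suspicious_values(birth_dates)
print(flags)
```

A sensible threshold depends on the field: a real birth date should rarely account for more than a fraction of a percent of users, while a country field legitimately has dominant values.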

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight:

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experimented with cleansing agents

  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex

• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Cultural stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you’re the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”.

Ornithologische Monatsberichte 1936;44(2)
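To see how a strong correlation can arise without causation, consider two series that both follow a shared trend. The sketch below computes the Pearson correlation on synthetic numbers; these are invented for illustration and are not the Oldenburg data from the slide.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic series: both grow over the years (the shared driver),
# with small unrelated fluctuations added to each.
years = list(range(8))
storks = [130 + 15 * t + d for t, d in zip(years, [4, -6, 3, -2, 5, -4, 2, -3])]
humans = [55_000 + 3_000 * t + d
          for t, d in zip(years, [-400, 600, 200, -500, 300, -200, 500, -300])]

r = pearson_r(storks, humans)
print(f"r = {r:.3f}")  # high, even though neither series causes the other
```

Only a controlled experiment, where the treatment is assigned at random, separates this kind of shared-driver correlation from a causal effect.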

Women have smaller palms and live 6 years longer on average.

But… don’t try to bandage your hands.

causal

If you don’t know where you are going, any road will take you there.

-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

• Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…




Deriving Knowledge from Data at Scale

A

B

Differences A has taller search box (overall size is the same) has magnifying glass icon

ldquopopular searchesrdquo

B has big search button

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

A B

bull Raise your right hand if you think A Wins

bull Raise your left hand if you think B Wins

bull Donrsquot raise your hand if they are the about the same

Deriving Knowledge from Data at Scale

get the data prepare to be humbled

Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is ldquoamazingrdquo find the flaw

Examples

If you have a mandatory birth date field and people think itrsquos

unnecessary yoursquoll find lots of 111111 or 010101

If you have an optional drop down do not default to the first

alphabetical entry or yoursquoll have lots jobs = Astronaut

The previous Office example assumes click maps to revenue

Seemed reasonable but when the results look so extreme find

the flaw (conversion rate is not the same see why)

Deriving Knowledge from Data at Scale

Data Trumps Intuition

Deriving Knowledge from Data at Scale

Sir Ken Robinson

Deriving Knowledge from Data at Scale

bull OEC = Overall Evaluation Criterion

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
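Steps 2 and 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the field names (`age`, `monthly_spend`, `churned`), the valid ranges and the choice of median imputation are all hypothetical and not taken from the course data.

```python
import statistics

# Hypothetical raw records; None marks a missing value and 999 a data-entry error.
raw = [
    {"age": 34, "monthly_spend": 42.0, "churned": "no"},
    {"age": None, "monthly_spend": 18.5, "churned": "yes"},
    {"age": 999, "monthly_spend": 55.0, "churned": "no"},
    {"age": 51, "monthly_spend": None, "churned": "yes"},
]

def clean_column(values, low, high):
    """Treat out-of-range values as missing, then impute with the median."""
    def is_valid(v):
        return v is not None and low <= v <= high
    median = statistics.median([v for v in values if is_valid(v)])
    return [v if is_valid(v) else median for v in values]

# Step 2: deal with missing data and outliers caused by collection errors.
ages = clean_column([r["age"] for r in raw], low=0, high=120)
spend = clean_column([r["monthly_spend"] for r in raw], low=0.0, high=10_000.0)

# Step 4: the modeling dataset is one feature vector and one numeric target per record.
features = list(zip(ages, spend))
targets = [1 if r["churned"] == "yes" else 0 for r in raw]

for x, y in zip(features, targets):
    print(x, y)
```

Median imputation is just one simple option; which treatment is appropriate depends on why the values are missing, which is where domain knowledge comes in.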

Feature selection

Model training

Model scoring

Evaluation

Train / Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
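As a rough illustration of step 5, the sketch below runs k-fold cross-validation and reports accuracy with an approximate confidence interval. The synthetic data, the nearest-centroid "model" and the choice of k = 10 are assumptions made for this example only; they stand in for whatever dataset and learner the project uses.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic two-class data: class 1 is shifted by +1.5 on both features."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.5 * label
        data.append(((rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)), label))
    return data

def fit_centroids(train):
    """Nearest-centroid model: the per-class mean of each feature."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = tuple(statistics.fmean(col) for col in zip(*rows))
    return centroids

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_validate(data, k, rng):
    """Return one held-out accuracy per fold."""
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit_centroids(train)
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return scores

rng = random.Random(42)
scores = cross_validate(make_data(1000, rng), k=10, rng=rng)

mean_acc = statistics.fmean(scores)
# Approximate 95% interval across folds; 2.262 is the t quantile for 9 degrees
# of freedom. Fold scores are not fully independent, so this is a rough guide.
half_width = 2.262 * statistics.stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {mean_acc:.3f} +/- {half_width:.3f}")
```

A held-out accuracy close to 1.0 on a noisy problem like this one would be a cue to look for a target leak before celebrating.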

Define Objective

Access and Understand the data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train / Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Access Data

Pre-processing

Feature construction

Model scoring

Book Recommendation

That’s all for our course…

A B

• Raise your right hand if you think A Wins

• Raise your left hand if you think B Wins

• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake

If something is “amazing”, find the flaw

Examples

If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut

The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
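A quick frequency check is one way to catch the placeholder values described above. This is a sketch under assumptions: the sample birth dates are invented, and both the helper name `suspicious_values` and the 5% threshold are hypothetical choices for the illustration.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return values whose share of all entries is at least min_share."""
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in counts.items() if n / total >= min_share}

# Hypothetical birth dates from a sign-up form (MM/DD/YY): a spread of
# plausible dates plus two placeholder values entered by impatient users.
birth_dates = (
    ["01/01/01"] * 12
    + ["11/11/11"] * 9
    + [f"{m:02d}/{d:02d}/{y:02d}"
       for m in range(1, 13) for d in (3, 14, 27) for y in (72, 85, 90)]
)

flagged = suspicious_values(birth_dates)
for value, share in sorted(flagged.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {share:.1%} of entries")
```

Real birth dates are spread thinly over many thousands of possible values, so any single date holding a visible share of the entries deserves a second look.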

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide

• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it

-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s

• In 19th-century Europe, childbed fever killed more than a million women

• Measurement: the mortality rate for women giving birth was

  • 15% in his ward, staffed by doctors and students

  • 2% in the ward at the hospital attended by midwives
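Whether a gap like this could be due to chance is what a statistical test checks. Below is a minimal sketch of a two-sided, two-proportion z-test. The slide gives only the rates, so the ward sizes of 1,000 births each are purely hypothetical, chosen to match the 15% and 2% figures.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 150 deaths in 1,000 births vs 20 deaths in 1,000 births.
z, p_value = two_proportion_z_test(150, 1000, 20, 1000)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

With these assumed counts the difference is far too large to attribute to chance. The test says nothing about the cause, though, since the two wards were not assigned at random.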

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences

  • Birthing positions, ventilation, diet, even the way laundry was done

• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?

• Insight:

  • Doctors were performing autopsies each morning on cadavers

  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

• He experiments with cleansing agents

  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris

Measure and Control

Accept Results, avoid Semmelweis Reflex

Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you’re the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real Data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”

Ornithologische Monatsberichte 1936; 44(2)





Deriving Knowledge from Data at Scale

Any statistic that appears interesting is almost certainly a mistake

If something is "amazing", find the flaw

Examples

If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01

If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut

The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why?)
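One way to act on the examples above is to check a field for values that are suspiciously common. The sketch below is only an illustration: the function name `suspicious_values`, the 25% threshold, and the sample birth dates are assumptions, not part of the lecture.

```python
from collections import Counter


def suspicious_values(values, min_share=0.05):
    """Return (value, share) pairs whose share of all rows is at least min_share."""
    counts = Counter(values)
    total = len(values)
    return [(v, c / total) for v, c in counts.most_common() if c / total >= min_share]


# Hypothetical entries from a mandatory birth date field.
birth_dates = ["03/12/85", "11/11/11", "07/04/90", "11/11/11",
               "01/01/01", "09/30/78", "11/11/11", "01/01/01"]

flags = suspicious_values(birth_dates, min_share=0.25)
# flags == [("11/11/11", 0.375), ("01/01/01", 0.25)]
```

A value that dominates a field is not proof of bad data, but it is a cheap signal for where to start looking for the flaw.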

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
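Whether a gap like 15% versus 2% could be due to chance depends on how many births were observed, which the slide does not give. Here is a minimal sketch of a two-proportion z-test, assuming a hypothetical 1,000 births per ward; the counts and the function name are illustrative, not historical data.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    # Pooled rate under the null hypothesis that both wards share one rate.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 150 deaths in 1,000 births vs. 20 deaths in 1,000 births.
z, p = two_proportion_z_test(150, 1000, 20, 1000)
```

With counts of this size the p-value is far below any usual threshold. With only a few dozen births per ward, the same percentages would be much less conclusive.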

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex

2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"

Ornithologische Monatsberichte 1936;44(2)
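The stork example can be reproduced with any two series that grow over the same period. The sketch below computes a Pearson correlation on made-up yearly counts. These are not the Oldenburg figures; the numbers and the helper name `pearson` are assumptions for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical yearly counts: both series simply grow over time.
storks = [130, 148, 155, 172, 190, 204, 221]
people = [55_000, 58_500, 60_000, 64_000, 67_500, 70_000, 74_000]

r = pearson(storks, people)  # close to 1, yet storks do not deliver babies
```

A shared driver (here, the passage of time) is enough to produce a near-perfect correlation, which is why a controlled experiment is needed before making a causal claim.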

Women have smaller palms and live 6 years longer on average

But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment, including feature selection and a large set of machine learning algorithms

[Figure: experiment annotated with the callouts "Using classification algorithms", "Evaluating the model", "Splitting to Training and Testing Datasets", "Getting Data For the Experiment"]

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
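Steps 2 and 4 can be sketched for a churn problem as follows. This is a minimal illustration under stated assumptions: the record fields, the median imputation, and the function name `build_modeling_dataset` are hypothetical and not taken from the lecture.

```python
from statistics import median

# Hypothetical raw customer records; None marks a missing value.
RAW = [
    {"id": 1, "monthly_spend": 40.0, "support_calls": 1, "cancelled": False},
    {"id": 2, "monthly_spend": None, "support_calls": 5, "cancelled": True},
    {"id": 3, "monthly_spend": 75.0, "support_calls": 0, "cancelled": False},
    {"id": 4, "monthly_spend": 20.0, "support_calls": 7, "cancelled": True},
]


def build_modeling_dataset(records):
    """Turn raw records into (features, target, weight) rows."""
    # Step 2: handle missing data, here by imputing the median observed spend.
    observed = [r["monthly_spend"] for r in records if r["monthly_spend"] is not None]
    fill = median(observed)
    rows = []
    for r in records:
        missing = r["monthly_spend"] is None
        features = {
            "monthly_spend": fill if missing else r["monthly_spend"],
            "spend_was_missing": int(missing),  # keep the missingness as a signal
            "support_calls": r["support_calls"],
        }
        # Step 4: the target mirrors the metric chosen in step 1 (did the customer churn?).
        target = int(r["cancelled"])
        rows.append((features, target, 1.0))  # equal weights in this sketch
    return rows


dataset = build_modeling_dataset(RAW)
```

Keeping a separate "was missing" flag is one design choice among several. It lets a model learn from the missingness itself instead of hiding it behind the imputed value.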

Feature selection
Model training
Model scoring
Evaluation
Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks, which typically lead to unbelievably good test metrics. This is the ML-heavy step.
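As a rough sketch of reporting a metric with an interval, the code below runs k-fold cross-validation for a deliberately simple one-feature threshold classifier. The classifier, the toy data, and the normal-approximation interval are all assumptions for illustration. Fold scores are not independent, so the interval is only indicative.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def accuracy(cut, xs, ys):
    """Accuracy of the rule 'predict 1 if x > cut'."""
    return sum(int(x > cut) == y for x, y in zip(xs, ys)) / len(xs)


def fit_threshold(xs, ys):
    """Pick the cut that maximises accuracy on the training data only."""
    return max(sorted(set(xs)), key=lambda cut: accuracy(cut, xs, ys))


def cross_validate(xs, ys, k=5, seed=0):
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds, score on the held-out fold.
        cut = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(cut, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    m = mean(scores)
    half_width = 1.96 * stdev(scores) / len(scores) ** 0.5  # rough 95% interval
    return m, (m - half_width, m + half_width), scores


# Toy data: the label is 1 exactly when the feature is at least 10.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]
mean_acc, interval, fold_scores = cross_validate(xs, ys, k=5)
```

If a feature were derived from the target itself (a target leak), every fold would score near 100%. A perfect score on real data should trigger the "find the flaw" reflex rather than celebration.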

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split

6. Iterate steps (2) to (5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course…




Page 83: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

Deriving Knowledge from Data at Scale

It is difficult to get a man to understand something when his

salary depends upon his not understanding it

-- Upton Sinclair

Deriving Knowledge from Data at Scale

Hubris

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s

bull In 19th-century Europe childbed fever killed more than a million women

bull Measurement the mortality rate for women giving birth was

bull 15 in his ward staffed by doctors and students

bull 2 in the ward at the hospital attended by midwives

Deriving Knowledge from Data at Scale

Cultural Stage 2Insight through Measurement and Control

bull He tried to control all differences

bull Birthing positions ventilation diet even the way laundry was done

bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him

bull Insight

bull Doctors were performing autopsies each morning on cadavers

bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians

bull He experiments with cleansing agents

bull Chlorine and lime was effective death rate fell from 18 to 1

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments

Contributed by the community

Azure Machine Learning Studio

Sample Experiments

To help you get started

Experiment

Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms.

• Using classification algorithms
• Evaluating the model
• Splitting to Training and Testing Datasets
• Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Pipeline stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
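As a rough illustration of step 4, the sketch below turns raw records into a "modeling dataset" of features, a target, and a weight. The churn setting, the field names (`tenure_months`, `support_calls`, `plan`, `churned`), and the weight of 5.0 for the positive class are all hypothetical choices made for the example, not something taken from the slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    features: List[float]
    target: int
    weight: float

def build_example(raw: dict) -> Example:
    """Turn one raw record (hypothetical schema) into a modeling example."""
    months = max(raw["tenure_months"], 1)  # guard against division by zero
    features = [
        float(raw["tenure_months"]),
        raw["support_calls"] / months,             # constructed feature: calls per month
        1.0 if raw["plan"] == "premium" else 0.0,  # categorical field encoded as 0/1
    ]
    target = 1 if raw["churned"] else 0   # the quantity that the step-1 metric scores
    weight = 5.0 if target == 1 else 1.0  # assumed: missing a churner costs more
    return Example(features, target, weight)

raw_records = [
    {"tenure_months": 24, "support_calls": 6, "plan": "premium", "churned": False},
    {"tenure_months": 0, "support_calls": 3, "plan": "basic", "churned": True},
]
dataset = [build_example(r) for r in raw_records]
for ex in dataset:
    print(ex)
```

The calls-per-month feature is the kind of construction where domain knowledge helps: the raw call count alone mixes up long-standing and brand-new customers.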

Pipeline stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
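A minimal sketch of step 5, assuming synthetic one-dimensional data and a deliberately trivial threshold "model" (both are stand-ins, not the course's method). It runs k-fold cross-validation, fits only on the training folds so that nothing from the held-out fold leaks into the model, and reports the mean accuracy with a rough interval. The interval uses a normal approximation over fold scores, which is a common heuristic rather than an exact confidence interval, since the folds share training data.

```python
import random
import statistics

def make_data(n, rng):
    """Synthetic 1-D, two-class data (an assumption for illustration)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 if label else 0.0, 1.0)
        data.append((x, label))
    return data

def train(rows):
    """The 'model' is a threshold halfway between the two class means."""
    m0 = statistics.mean(x for x, y in rows if y == 0)
    m1 = statistics.mean(x for x, y in rows if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

def cross_validate(data, k, rng):
    rows = data[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        # Fit on the other k-1 folds only; the held-out fold is never seen in training.
        train_rows = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(accuracy(train(train_rows), held_out))
    return scores

rng = random.Random(42)
scores = cross_validate(make_data(500, rng), k=10, rng=rng)
mean = statistics.mean(scores)
# Rough 95% interval from the fold-to-fold spread (normal approximation).
half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
```

If a reported metric looks far better than the noise in the problem should allow, that is the cue to look for a target leak before celebrating.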

Pipeline stages: Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Pipeline stages: Access Data, Pre-processing, Feature construction, Model scoring

Book Recommendation

That's all for our course…


It is difficult to get a man to understand something, when his salary depends upon his not understanding it.

-- Upton Sinclair


Hubris

Cultural Stage 2: Insight through Measurement and Control

• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1840s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
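To check that a gap like the one between the two wards (15% versus 2%) is unlikely to be chance, one common choice is a two-proportion z-test. The sketch below is illustrative only: the slides give the two rates but no ward sizes, so the counts of 1000 births per ward are hypothetical.

```python
import math

def two_proportion_z_test(deaths_a, n_a, deaths_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = deaths_a / n_a, deaths_b / n_b
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical ward sizes (NOT from the slides); only the 15% and 2% rates are.
z, p = two_proportion_z_test(deaths_a=150, n_a=1000, deaths_b=20, n_b=1000)
print(f"z = {z:.2f}, p-value = {p:.2g}")
```

With samples of this size the difference is far outside what chance would produce. A test like this shows that the wards differ, not why; Semmelweis still had to control the other differences between them to find the cause.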

Cultural Stage 2: Insight through Measurement and Control

• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%





Page 88: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Semmelweis Reflex

bull Semmelweis Reflex

2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States

Deriving Knowledge from Data at Scale

Fundamental Understanding

Deriving Knowledge from Data at Scale

HubrisMeasure and

Control

Accept Results

avoid

Semmelweis

Reflex

Fundamental

Understanding

Deriving Knowledge from Data at Scale

bull Controlled Experiments in one slide

bull Examples yoursquore the decision maker

bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Real Data for the city of Oldenburg

Germany

bull X-axis stork population

bull Y-axis human population

What your mother told you about babies and

storks when you were three is still not right

despite the strong correlational ldquoevidencerdquo

Ornitholigische Monatsberichte 193644(2)

Deriving Knowledge from Data at Scale

Women have smaller palms and live 6 years longer

on average

Buthellipdonrsquot try to bandage your hands

Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
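Step 4 can be sketched in a few lines. This is only a minimal illustration, not the course's own pipeline: the churn records, field names, and the `build_modeling_dataset` helper are all hypothetical, and median imputation is just one simple way to handle the missing data mentioned in step 2.

```python
from statistics import median

# HYPOTHETICAL raw records; every field name and value is invented for illustration.
raw = [
    {"monthly_minutes": 320.0, "support_calls": 1, "months_active": 24, "cancelled": False},
    {"monthly_minutes": None,  "support_calls": 6, "months_active": 3,  "cancelled": True},
    {"monthly_minutes": 150.0, "support_calls": 2, "months_active": 12, "cancelled": True},
    {"monthly_minutes": 410.0, "support_calls": 0, "months_active": 36, "cancelled": False},
]

def build_modeling_dataset(records):
    """Turn raw records into parallel lists of features, targets, and weights."""
    known = [r["monthly_minutes"] for r in records if r["monthly_minutes"] is not None]
    fill = median(known)  # simple imputation for missing usage values
    features, targets, weights = [], [], []
    for r in records:
        minutes = r["monthly_minutes"] if r["monthly_minutes"] is not None else fill
        # Constructed feature: support calls per month of tenure (a domain-knowledge guess).
        calls_per_month = r["support_calls"] / r["months_active"]
        features.append([minutes, calls_per_month])
        targets.append(1 if r["cancelled"] else 0)  # target mirrors the step 1 metric (churn)
        weights.append(1.0)  # uniform weights as a placeholder
    return features, targets, weights

features, targets, weights = build_modeling_dataset(raw)
```

The target here is built directly from the outcome being measured, which is the point of the last sentence of step 4; if only a proxy were available, that choice would need to be justified against the metric from step 1.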

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
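To make "reported with the right confidence intervals" concrete, here is a small sketch of k-fold cross-validation with an approximate interval around the mean accuracy. The data is synthetic, the one-dimensional threshold classifier is a stand-in for a real model, and the Student-t interval is only approximate because fold scores are not truly independent.

```python
from math import sqrt
from statistics import mean, stdev

# SYNTHETIC 1-D data invented for illustration: two overlapping classes.
rows = []
for i in range(20):
    rows.append((0.1 * i, 0))        # class 0: values 0.0 .. 1.9
    rows.append((1.5 + 0.1 * i, 1))  # class 1: values 1.5 .. 3.4

def fit_threshold(train):
    """'Train' a nearest-centroid classifier: midpoint of the two class means."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, test):
    """Fraction of rows where the prediction (x > threshold) matches the label."""
    return sum(int(x > threshold) == y for x, y in test) / len(test)

def k_fold_scores(data, k):
    """Hold out each fold once, train on the rest, return the k test accuracies."""
    scores = []
    for fold in range(k):
        test = [r for i, r in enumerate(data) if i % k == fold]
        train = [r for i, r in enumerate(data) if i % k != fold]
        scores.append(accuracy(fit_threshold(train), test))
    return scores

K = 5
T_CRIT = 2.776  # two-sided 95% Student-t critical value for K - 1 = 4 degrees of freedom

scores = k_fold_scores(rows, K)
mean_acc = mean(scores)
half_width = T_CRIT * stdev(scores) / sqrt(K)
ci_low, ci_high = mean_acc - half_width, mean_acc + half_width
print(f"accuracy = {mean_acc:.3f}, approx. 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

A suspiciously perfect score from a loop like this is a reason to check for a target leak before celebrating.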

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2) through (5) until the test metrics are satisfactory.

Access Data

Pre-processing

Feature construction

Model scoring

Book Recommendation

That's all for our course…


Fundamental Understanding

Hubris

Measure and Control

Accept Results (avoid Semmelweis Reflex)

Fundamental Understanding

• Controlled Experiments in one slide

• Examples: you're the decision maker

• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany

• X-axis: stork population

• Y-axis: human population

What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence".

Ornithologische Monatsberichte 1936;44(2)
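One way to see how such a correlation can arise without any causal link is to let both series follow a shared trend. The sketch below uses synthetic numbers invented for illustration (they are not the Oldenburg data): both series grow over time, so they correlate strongly, yet once the common trend is removed the correlation vanishes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def residuals(ys, ts):
    """Residuals of ys after a least-squares straight-line fit on ts."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return [y - (my + slope * (t - mt)) for t, y in zip(ts, ys)]

# SYNTHETIC numbers invented for illustration; NOT the Oldenburg data.
years = list(range(8))
wiggle_a = [1, -1, -1, 1, 1, -1, -1, 1]    # year-to-year noise in the stork series
wiggle_b = [1, 1, -1, -1, -1, -1, 1, 1]    # unrelated noise in the human series
storks = [130 + 10 * t + 5 * w for t, w in zip(years, wiggle_a)]
people = [55000 + 1000 * t + 500 * w for t, w in zip(years, wiggle_b)]

raw_r = pearson(storks, people)
detrended_r = pearson(residuals(storks, years), residuals(people, years))
print(f"raw correlation:      {raw_r:.3f}")
print(f"after removing trend: {detrended_r:.3f}")
```

Detrending only helps when the confounder is known; a controlled experiment removes the need to guess it.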

Women have smaller palms and live 6 years longer on average.

But… don't try to bandage your hands.

causal

If you don't know where you are going, any road will take you there.
- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)

• OEC (Overall Evaluation Criterion)

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40 page journal version…







Deriving Knowledge from Data at Scale

causal

Deriving Knowledge from Data at Scale

If you dont know where you are going any road will take you there

mdashLewis Carroll

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

before

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

bull Hippos kill more humans than any other (non-human) mammal (really)

bull OEC

Get the data

bull Prepare to be humbled

The less data the stronger the opinionshellip

Deriving Knowledge from Data at Scale

Out of Class Reading

Eight (8) page conference paper

40 page journal versionhellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1 Define the objective and quantify it with a metric ndash optionally with constraints

if any This typically requires domain knowledge

2 Collect and understand the data deal with the vagaries and biases in the data

acquisition (missing data outliers due to errors in the data collection process

more sophisticated biases due to the data collection procedure etc

3 Frame the problem in terms of a machine learning problem ndash classification

regression ranking clustering forecasting outlier detection etc ndash some

combination of domain knowledge and ML knowledge is useful

4 Transform the raw data into a ldquomodeling datasetrdquo with features weights

targets etc which can be used for modeling Feature construction can often

be improved with domain knowledge Target must be identical (or a very

good proxy) of the quantitative metric identified step 1

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train Test split

5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
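A rough sketch of step 5 using only the Python standard library: k-fold cross-validation with a mean accuracy and an approximate interval. The data is synthetic, the threshold "model" is a toy stand-in, and the normal-approximation interval across folds is an assumption; fold scores are not independent, so read it as indicative rather than exact.

```python
import math
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and deal them into k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_fn, predict_fn, k=5, seed=0):
    """Return per-fold accuracy; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, rows[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean and a rough normal-approximation 95% interval across folds."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Toy stand-in model: a threshold halfway between the two class means.
def train_threshold(rows, labels):
    positives = [r for r, y in zip(rows, labels) if y == 1]
    negatives = [r for r, y in zip(rows, labels) if y == 0]
    return (statistics.mean(positives) + statistics.mean(negatives)) / 2


def predict_threshold(threshold, row):
    return 1 if row > threshold else 0


# Synthetic one-feature data: class 1 is centred at 2.0, class 0 at 0.0.
rng = random.Random(42)
labels = [i % 2 for i in range(200)]
rows = [rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]

scores = cross_validate(rows, labels, train_threshold, predict_threshold)
mean_acc, lower, upper = mean_with_interval(scores)
print(f"accuracy {mean_acc:.3f} (approx. 95% interval {lower:.3f} to {upper:.3f})")
```

If a real model scores near 100% on held-out folds, suspect a target leak before celebrating: check whether any feature is derived from, or recorded after, the target.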

Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split

6. Iterate steps (2)–(5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring
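The four stages on this slide (access data, pre-processing, feature construction, model scoring) can be sketched as one scoring path. The field names, weights and bias are hypothetical, and the logistic scorer stands in for whatever model was actually trained.

```python
import math


def preprocess(record):
    """Fill a missing tenure with 0 and normalise the plan name."""
    return {
        "tenure_months": record.get("tenure_months") or 0,
        "plan": (record.get("plan") or "unknown").strip().lower(),
    }


def build_features(clean):
    """Turn a cleaned record into the numeric vector the model expects."""
    return [
        float(clean["tenure_months"]),
        1.0 if clean["plan"] == "prepaid" else 0.0,
    ]


def score(weights, bias, features):
    """Logistic score in (0, 1); weights and bias come from training."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(records, weights, bias):
    """Access -> pre-process -> construct features -> score, per record."""
    return [score(weights, bias, build_features(preprocess(r))) for r in records]


# Hypothetical trained parameters and incoming records.
WEIGHTS, BIAS = [-0.1, 0.8], 0.5
incoming = [
    {"tenure_months": 2, "plan": " Prepaid "},
    {"tenure_months": 48, "plan": "contract"},
    {"tenure_months": None, "plan": None},
]
churn_scores = score_batch(incoming, WEIGHTS, BIAS)
```

The design point is that `preprocess` and `build_features` should be the same functions used to build the training set; reimplementing them separately for production is a common source of train/serve mismatch.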

Book Recommendation

That's all for our course…


"If you don't know where you are going, any road will take you there."

– Lewis Carroll


• Hippos kill more humans than any other (non-human) mammal (really)

• OEC

Get the data

• Prepare to be humbled

The less data, the stronger the opinions…

Out of Class Reading

Eight (8) page conference paper

40-page journal version…

Course Project

Due Oct 25th

Open Discussion on Course Project…


Gallery of Experiments

Contributed by the community


Azure Machine Learning Studio

Sample Experiments

To help you get started









Page 104: Barga Data Science lecture 10

Deriving Knowledge from Data at Scale

Course ProjectDue Oct 25th

Deriving Knowledge from Data at Scale

Open Discussion on Course Projecthellip

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Gallery of Experiments

Contributed by the community

Deriving Knowledge from Data at Scale

Azure Machine Learning Studio

Deriving Knowledge from Data at Scale

Sample

Experiments

To help you get started

Deriving Knowledge from Data at Scale

Experiment

Tools that you can use in your

experiment For feature

selection large set of machine

learning algorithms

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Using

classificatio

n

algorithms

Evaluating

the model

Splitting to

Training

and Testing

Datasets

Getting

Data

For the

Experiment

Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).

3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.

4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
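Steps 2 and 4 can be made concrete with a small sketch. The records, field names, the IQR outlier rule, and the drop-the-row policy below are all assumptions for illustration; the slides do not prescribe any of them, and a real project would choose its own handling of missing values and outliers.

```python
import statistics

# Hypothetical raw records; the field names are illustrative only.
raw = [
    {"tenure_months": 12, "monthly_charge": 30.0, "churned": "no"},
    {"tenure_months": 3, "monthly_charge": 45.0, "churned": "yes"},
    {"tenure_months": None, "monthly_charge": 28.0, "churned": "no"},
    {"tenure_months": 24, "monthly_charge": 9999.0, "churned": "no"},
    {"tenure_months": 8, "monthly_charge": 33.0, "churned": "yes"},
    {"tenure_months": 36, "monthly_charge": 31.0, "churned": "no"},
]

def missing_counts(rows):
    """Step 2: count missing (None) values per field."""
    return {f: sum(r[f] is None for r in rows) for f in rows[0]}

def iqr_outliers(values, k=1.5):
    """Step 2: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def build_modeling_dataset(rows, outlier_values):
    """Step 4: turn raw rows into features, target, and weight."""
    dataset = []
    for r in rows:
        if any(v is None for v in r.values()):
            continue  # simplest policy (an assumption): drop incomplete rows
        if r["monthly_charge"] in outlier_values:
            continue  # drop rows flagged as likely collection errors
        features = [float(r["tenure_months"]), r["monthly_charge"]]
        target = 1 if r["churned"] == "yes" else 0
        dataset.append({"features": features, "target": target, "weight": 1.0})
    return dataset

missing = missing_counts(raw)
outliers = iqr_outliers([r["monthly_charge"] for r in raw])
modeling_dataset = build_modeling_dataset(raw, outliers)
print(missing, outliers, len(modeling_dataset))
```

Here the target is derived directly from the churn flag, so it matches the quantity the objective in step 1 would measure.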

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
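One way to report a metric with an interval, as step 5 asks, is to collect the per-fold scores from k-fold cross-validation and summarize them. The sketch below is an assumption-laden illustration: the toy threshold model, the synthetic data, and the t-based interval are stand-ins, and because fold scores share training data the interval is only a rough approximation.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold(train):
    """Toy model: a cut-off halfway between the two class means."""
    mean0 = statistics.fmean([x for x, y in train if y == 0])
    mean1 = statistics.fmean([x for x, y in train if y == 1])
    return (mean0 + mean1) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

def cross_validate(rows, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, for each fold."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(accuracy(fit_threshold(train), test))
    return scores

def mean_with_ci(scores, t_crit=2.776):
    """Mean and approximate 95% interval; 2.776 is t(0.975) for 4 d.o.f.,
    so the default only suits 5 folds."""
    mean = statistics.fmean(scores)
    half = t_crit * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half, mean + half

# Synthetic one-feature data: class 1 is shifted by 1.5 (arbitrary choice).
rng = random.Random(1)
rows = [(rng.gauss(1.5 * y, 1.0), y) for y in [0, 1] * 100]

scores = cross_validate(rows, k=5)
mean, low, high = mean_with_ci(scores)
print(f"accuracy = {mean:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

A score far above what the problem plausibly allows would be the cue to look for a target leak before trusting the number.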

Define Objective

Access and Understand the Data

Pre-processing

Feature and/or Target construction

Feature selection

Model training

Model scoring

Evaluation

Train/Test split

6. Iterate steps (2) to (5) until the test metrics are satisfactory.

Access Data

Pre-processing

Feature construction

Model scoring
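The four stages on this slide can be read as the scoring path of a deployed model. The sketch below assumes, as is common practice but not stated on the slide, that pre-processing parameters are learned once at training time and reused unchanged when scoring. The weights, the feature construction, and all names are hypothetical stand-ins.

```python
import math
import statistics

def fit_preprocessor(values):
    """Learn scaling parameters once, at training time."""
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return {"mean": statistics.fmean(values), "std": std}

def preprocess(params, value):
    """Pre-processing: standardize with the stored parameters."""
    return (value - params["mean"]) / params["std"]

def construct_features(scaled):
    """Feature construction (illustrative): the value and its square."""
    return [scaled, scaled * scaled]

def score(weights, bias, features):
    """Model scoring: logistic output in (0, 1) from a linear model."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def scoring_pipeline(params, weights, bias, raw_value):
    """Access data -> pre-process -> construct features -> score."""
    scaled = preprocess(params, raw_value)
    return score(weights, bias, construct_features(scaled))

training_values = [10.0, 12.0, 14.0, 16.0, 18.0]
params = fit_preprocessor(training_values)
weights, bias = [1.5, -0.2], 0.0  # stand-ins for trained parameters

p_low = scoring_pipeline(params, weights, bias, 11.0)
p_mid = scoring_pipeline(params, weights, bias, 14.0)
p_high = scoring_pipeline(params, weights, bias, 17.0)
print(f"{p_low:.3f} {p_mid:.3f} {p_high:.3f}")
```

Keeping the same `preprocess` and `construct_features` functions on both the training and scoring paths is one way to avoid the two drifting apart.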

Book Recommendation

That's all for our course…

Page 115: Barga Data Science lecture 10

Deriving Knowledge from Data at ScaleCustomer Churn Model

Deriving Knowledge from Data at Scale

Deployed web service endpoints

that can be consumed by applications

and for batch processing

Deriving Knowledge from Data at Scale

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand the

Data

Pre-processing

Feature andor

Target

construction

1 Define the objective and quantify it with a metric ndash optionally with constraints

if any This typically requires domain knowledge

2 Collect and understand the data deal with the vagaries and biases in the data

acquisition (missing data outliers due to errors in the data collection process

more sophisticated biases due to the data collection procedure etc

3 Frame the problem in terms of a machine learning problem ndash classification

regression ranking clustering forecasting outlier detection etc ndash some

combination of domain knowledge and ML knowledge is useful

4 Transform the raw data into a ldquomodeling datasetrdquo with features weights

targets etc which can be used for modeling Feature construction can often

be improved with domain knowledge Target must be identical (or a very

good proxy) of the quantitative metric identified step 1

Deriving Knowledge from Data at Scale

Feature selection

Model training

Model scoring

Evaluation

Train Test split

5 Train test and evaluate taking care to control

biasvariance and ensure the metrics are

reported with the right confidence intervals

(cross-validation helps here) be vigilant

against target leaks (which typically leads to

unbelievably good test metrics) ndash this is the

ML heavy step

Deriving Knowledge from Data at Scale

Define

Objective

Access and

Understand

the data

Pre-processing

Feature andor

Target

construction

Feature selection

Model training

Model scoring

Evaluation

Train Test split

6 Iterate steps (2) ndash (5) until the test metrics are satisfactory

Deriving Knowledge from Data at Scale

[Diagram, scoring path: Access Data; Pre-processing; Feature construction; Model scoring]
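The scoring path above keeps only the stages needed once a model is trained: access data, pre-process, construct features, score. A minimal sketch of that idea follows, assuming a linear model whose weights were already fitted and a pre-processing step that fills missing values with training-time means. `FittedPipeline`, the interaction feature, and all numbers are hypothetical; the point is only that single-record scoring (as an endpoint would do) and batch scoring can share the same fitted transformations.

```python
from dataclasses import dataclass


@dataclass
class FittedPipeline:
    """Everything learned at training time that scoring must reuse."""
    means: list    # per-field training means, used to fill missing values
    weights: list  # weights of a (hypothetical) already-trained linear model
    bias: float


def preprocess(raw_row, means):
    """Fill missing fields (None) with the means computed on training data."""
    return [means[j] if v is None else v for j, v in enumerate(raw_row)]


def construct_features(row):
    """Illustrative feature construction: add one interaction term."""
    return row + [row[0] * row[1]]


def score(pipeline, raw_row):
    """Score one record, as a request/response endpoint might."""
    clean = preprocess(raw_row, pipeline.means)
    feats = construct_features(clean)
    return pipeline.bias + sum(w * f for w, f in zip(pipeline.weights, feats))


def score_batch(pipeline, raw_rows):
    """Score many records with exactly the same steps, for batch jobs."""
    return [score(pipeline, r) for r in raw_rows]


pipeline = FittedPipeline(means=[2.0, 3.0], weights=[0.5, -1.0, 0.25], bias=1.0)
single = score(pipeline, [4.0, None])            # second field is missing
batch = score_batch(pipeline, [[4.0, None], [2.0, 3.0]])
print(single, batch)
```

Keeping the fitted values in one object is a design choice that makes it harder for the production path to drift from the transformations used during training.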

Deriving Knowledge from Data at Scale

Book Recommendation

Deriving Knowledge from Data at Scale

That's all for our course…
