
Fall 2017

CptS 483:04 Introduction to Data Science

Machine (Statistical) Learning Overview

Assefaw Gebremedhin

A lecture based on Chapter 1 of “Machine Learning: A probabilistic perspective”, Kevin Murphy, MIT Press, 2012

Assefaw Gebremedhin: Introduction to Data Science, http://scads.eecs.wsu.edu

Machine Learning

• ML is a set of methods that can automatically detect patterns in data, and then use the uncovered patterns
  • to predict future data, or
  • to perform other kinds of decision making under uncertainty

• In ML, uncertainty comes in many forms:
  • What is the best prediction about the future given some past data?
  • What is the best model to explain some data?
  • What measurement should I perform next?

• A probabilistic approach is often taken to deal with these questions (the view adopted in Murphy’s book).
• This is similar to statistics (statistical learning), but slightly different in emphasis and terminology:
  • ML: greater emphasis on large-scale applications and prediction accuracy
  • SL: greater emphasis on models and their interpretability, precision, and uncertainty


Types of machine learning

• Predictive (supervised learning)
  • Learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs.
• Descriptive (unsupervised learning)
  • Given data, find “interesting patterns” in the data.
• Reinforcement learning
  • Useful for learning how to act or behave when given occasional reward or punishment signals.

We will cover some supervised learning and unsupervised learning in this course, but no reinforcement learning.


Types of machine learning

• Predictive (supervised learning)
  • Learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D.
  • The set D is called the training set.
  • In the simplest setting, each training input xi is a d-dimensional vector of numbers (e.g. the height and weight of a person). These are called features or attributes.
  • In general, xi could be a complex structured object (e.g. an image, an email message, a time series, a molecular shape, a graph, etc.).
  • The output (response) variable yi can in principle be anything, but commonly it is either categorical (nominal), drawn from some finite set, or a real-valued scalar.
    • yi categorical: the problem is called classification (or pattern recognition)
    • yi real-valued: the problem is called regression


Types of machine learning

• Descriptive (unsupervised learning)
  • Given data, find “interesting patterns” in the data
  • Sometimes called knowledge discovery
  • A much less well-defined problem, since we are not told what kinds of patterns to look for, and there is no obvious error metric


Examples of supervised learning: 1. Classification

Training data represented as an N by D design matrix
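A toy illustration of the design-matrix layout may help: rows are examples, columns are features, and a parallel label vector holds one output per row. The numbers and labels below are made up for illustration.

```python
# A toy N-by-D design matrix: N = 4 people, D = 2 features
# (height in cm, weight in kg). Rows are examples, columns are features.
X = [
    [160.0, 55.0],
    [172.0, 70.0],
    [181.0, 85.0],
    [168.0, 62.0],
]
# One categorical label per row (0 = "short", 1 = "tall" --
# hypothetical classes, just for illustration).
y = [0, 1, 1, 0]

N = len(X)      # number of training examples (rows)
D = len(X[0])   # number of features per example (columns)
```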

Formalization of the classification problem:
  Assume y = f(x), for some unknown function f
  Make predictions using an estimate: ŷ = f̂(x)

Ambiguous cases motivate probabilistic predictions p(y | x, D):
  Find the “best guess” using ŷ = argmax_c p(y = c | x, D)
  (the mode of the distribution p(y | x, D), aka the MAP estimate)
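The “best guess” rule above can be sketched in a couple of lines: given the posterior probability of each class (the numbers below are made up for illustration), the MAP estimate is simply the class with the largest probability.

```python
# MAP prediction: pick the class with the highest posterior probability
# p(y = c | x, D). The probabilities are hypothetical.
probs = {"setosa": 0.1, "versicolor": 0.7, "virginica": 0.2}

y_hat = max(probs, key=probs.get)  # mode of p(y | x, D)
# y_hat is "versicolor"
```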


Classification cont’d: “Best guess”

• Consider the case where p(ŷ | x, D) is far from 1.0
  • In such a case, it might be better to say “I don’t know” instead of returning an answer we don’t really trust
  • This is particularly important in fields such as medicine or finance, where we may be risk averse
• Another application where it is important to assess risk is playing TV game shows, such as Jeopardy!
  • The IBM Watson story
• Similarly, Google has a system known as the Smart Ad Selection System
  • Predict the probability that you will click on an ad, based on your search history and other user- and ad-specific features (the click-through rate)
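The “I don’t know” option above is often called the reject option. A minimal sketch, with a hypothetical confidence threshold of 0.8:

```python
def predict_or_reject(probs, threshold=0.8):
    """Return the MAP class, or "I don't know" when the model is not
    confident enough (max posterior below the threshold).
    The threshold value is an assumption chosen for illustration."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "I don't know"

print(predict_or_reject({"spam": 0.95, "ham": 0.05}))  # confident -> "spam"
print(predict_or_reject({"spam": 0.55, "ham": 0.45}))  # ambiguous -> "I don't know"
```

In a risk-averse setting, the rejected cases would be passed to a human for review rather than acted on automatically.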


Real-world applications of classification

• Document classification
  • Classify a document into one of C classes
  • Email spam filtering is one special case
• Image classification
  • Handwriting recognition (MNIST dataset)
  • Object detection
    • Face detection
    • Face recognition
• Classifying flowers
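To make the spam-filtering special case concrete, here is a toy rule-based filter. The keyword list and threshold are made up for illustration; a real classifier would learn such weights from labeled training data rather than hard-coding them.

```python
# Hypothetical "spammy" keywords -- real filters learn these from data.
SPAM_WORDS = {"free", "winner", "prize", "click", "offer"}

def classify_email(text, threshold=2):
    """Label a message "spam" if it contains at least `threshold`
    spammy keywords, else "ham". A toy stand-in for a learned classifier."""
    words = text.lower().split()
    hits = sum(1 for w in words if w in SPAM_WORDS)
    return "spam" if hits >= threshold else "ham"

print(classify_email("Click now free prize inside"))   # -> spam
print(classify_email("Meeting at noon tomorrow"))      # -> ham
```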


Classifying iris flowers (Ronald Fisher)

[Photos of the three species: setosa, versicolor, virginica]

Four useful features: sepal length and width, and petal length and width


Classifying iris flowers: scatter plot (EDA)

[Scatter plot: red points are the setosas, which are distinguishable by petal length alone]
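Because setosa is separable from the other two species by petal length alone (setosa petals are under about 2 cm, the others over about 3 cm), a one-feature threshold rule already classifies it perfectly. A sketch, with the threshold chosen by eye from the scatter plot:

```python
def is_setosa(petal_length_cm, threshold=2.5):
    """Classify an iris as setosa from petal length alone.
    The 2.5 cm threshold is an assumption read off the scatter plot."""
    return petal_length_cm < threshold

# Typical measurements from Fisher's data, for illustration:
print(is_setosa(1.4))   # a typical setosa -> True
print(is_setosa(4.7))   # a typical versicolor -> False
```

Separating versicolor from virginica is harder and needs more than one feature, which is where learned classifiers earn their keep.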


Examples of supervised learning: 2. Regression

• Same as classification, except the response variable is continuous
• Example (figure): a linear fit vs. a polynomial (degree 2) fit to the same data
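The linear case can be sketched with the closed-form least-squares solution for a 1-D fit y = w0 + w1·x. The toy data below lie exactly on y = 1 + 2x, so the fit recovers the true coefficients.

```python
# Ordinary least squares for a 1-D linear fit y = w0 + w1 * x.
# Toy data generated from y = 1 + 2x (no noise), for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form slope and intercept.
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
w1 = num / den          # slope -> 2.0
w0 = y_bar - w1 * x_bar  # intercept -> 1.0
```

A degree-2 polynomial fit works the same way, just with x and x² both treated as input features.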


Examples of real-world regression problems

• Predict tomorrow’s stock market price given current market conditions and other possible side information
• Predict the age of a viewer watching a given video on YouTube
• Predict the location in 3D space of a robot arm end effector, given the control signals (torques) sent to its various motors
• Predict the amount of prostate-specific antigen (PSA) in the body as a function of a number of different clinical measurements
• Predict the temperature at any location inside a building using weather data, time, door sensors, etc.


Unsupervised learning

• Given only data, with no labeled outputs, the goal is to “discover interesting structure” (knowledge discovery)
• Unlike supervised learning, we are not told what the desired output is for each input
• The task can be formalized as density estimation; i.e., we want to build models of the form p(xi | θ)
• Two differences from the supervised case:
  • Unconditional density estimation: we write p(xi | θ), instead of p(yi | xi, θ) as in the supervised case, which is conditional density estimation
  • xi is a vector of features, so we need to create multivariate probability models. In contrast, in supervised learning yi is just a single variable we are trying to predict, which means that for most supervised learning problems we can use univariate probability models (with input-dependent parameters), simplifying the problem
• Unsupervised learning is more typical of human and animal learning
• It is also more widely applicable than supervised learning, since it does not require a human expert to manually label data
  • A quote from Geoff Hinton, ML professor at U. Toronto


Examples of unsupervised learning

• Discovering clusters
  • Model-based (brings up model selection)
  • Ad hoc algorithms (data mining)
• Discovering latent factors
  • Dimensionality reduction
    • Principal component analysis
• Discovering graph structure
• Matrix completion
  • Image inpainting
  • Collaborative filtering
  • Market basket analysis
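The clustering item can be made concrete with a minimal k-means sketch for 1-D data (k-means is one of the ad hoc algorithms alluded to above; the data and starting centers are made up for illustration):

```python
def kmeans_1d(points, centers, iters=20):
    """A minimal k-means sketch for 1-D data: repeatedly assign each
    point to its nearest center, then move each center to the mean
    of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups, around 1 and around 10.
print(kmeans_1d([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], centers=[0.0, 5.0]))
# -> [1.0, 10.0]
```

Note what makes this unsupervised: nothing tells the algorithm which group a point “should” belong to, and choosing the number of clusters is exactly the model-selection problem mentioned above.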


Examples of application areas where PCA is very useful

• In biology, it is common to use PCA to interpret gene microarray data, to account for the fact that each measurement is usually the result of many genes which are correlated in their behavior because they belong to different biological pathways
• In natural language processing, it is common to use a variant of PCA called latent semantic analysis for document retrieval
• In signal processing (e.g. acoustic or neural signals), it is common to use ICA (a variant of PCA) to separate signals into their different sources
• In computer graphics, it is common to project motion-capture data to a low-dimensional space, and use it to create animations
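The core computation behind PCA is finding the direction of greatest variance, i.e. the dominant eigenvector of the data covariance matrix. A sketch using power iteration on 2-D toy data spread mostly along the x = y direction (helper name and data are made up for illustration):

```python
def first_pc(X, iters=100):
    """Return the first principal component (unit vector) of the rows
    of X, via power iteration on the covariance matrix."""
    n, d = len(X), len(X[0])
    # Center the data.
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Covariance matrix (dividing by n, the MLE convention).
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
          for b in range(d)] for a in range(d)]
    # Power iteration: repeatedly apply C and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy points lying roughly along x = y: the first PC should point
# near the diagonal (components of similar magnitude and sign).
v = first_pc([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
```

Projecting the data onto the top few such components is what compresses microarray, text, or motion-capture data down to a low-dimensional space.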