54
CSCE 587 Final Review

CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Embed Size (px)

Citation preview

Page 1: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

CSCE 587 Final Review

Page 2: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Rules of Engagement

• Questions can be of the form• 1) Explain/show property X• 2) Which properties are required for Y• 3) Apply method Z• 4) How do methods A and B compare• 5) I have problem C what methods are

applicable?• 6) Multiple guess? (Probably not )

Page 3: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Support Vector Machines

Page 4: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

SVMs

• Type of task svm is appropriate for?• How does svm differ from other analytical

approaches?• How does the svm classify in higher

dimensions that are present in the input data?• What is a support vector

Page 5: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Time Series Analysis

Page 6: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Modeling a Time Series• Let's model the time series as

Yt =Tt +St +Rt, t=1,...,n.

• Tt: Trend term– Sales of iPads steadily increased over the last few years: trending upward.

• St: The seasonal term (short term periodicity)– Retail sales fluctuate in a regular pattern over the course of a year.

• Typically, sales increase from September through December and decline in January and February.

• Rt: Random fluctuation– Noise, or regular high frequency patterns in fluctuation

6Module 4: Analytics Theory/Methods

Page 7: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Stationary Sequences

Many time series analyses (Basic Box-Jenkins in particular) assume stationary sequences:

– Mean, variance and autocorrelation structure do not change over time

– In practice, this often means you must de-trend and seasonally adjust the data

– ARIMA in principle can make the data (more) stationary with differencing

7

Page 8: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Module 4: Analytics Theory/Methods 8

De-trending• In this example, we see a

linear trend, so we fit a linear model– T*t = mYt + b

• The de-trended series is then – Y1

t = Yt – T*t

• In some cases, may have to fit a non-linear model– Quadratic– Exponential

Page 9: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Module 4: Analytics Theory/Methods 9

Seasonal Adjustment• Often, we know the "season"

– For both retail sales and CO2 concentration, we can model the period as being a year, with variation at the month level

• Simple ad-hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y1

t

Y2t = Y1

t – S*t

Page 10: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

ARMA Model

• The simplest Box-Jenkins Model• Combination of two process models

– Autoregressive: Yt is a linear combination of its last p values

– Moving average: Yt is a constant value plus the effects of a dampened white noise process over the last q time steps

10Module 4: Analytics Theory/Methods

Page 11: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

ARIMA Model

• ARIMA adds a differencing term, d, to make the series more stationary – rule of thumb:

• linear trend can be removed by d=1• quadratic trend by d=2, and so on…

A combination of AR and MA modelsThe general non-seasonal model is known as ARIMA (p, d, q):

p is the number of autoregressive termsd is the number of differencesq is the number of moving average terms

11Module 4: Analytics Theory/Methods

Page 12: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-)Minimal data collection

Only have to collect the series itselfDo not need to input drivers

No meaningful drivers: prediction based only on past performance

No explanatory valueCan't do "what-if" scenariosCan't stress test

Designed to handle the inherent autocorrelation of lagged time series

Compared to simple linear regressionOnce you've seasonally/trend adjusted

It's an "art form" to select appropriate parameters

Suitable for short term predictions only

Time Series Analysis - Reasons to Choose (+) & Cautions (-)

Module 4: Analytics Theory/Methods12

Page 13: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Hadoop

Page 14: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

• MapReduce: Key-Value particulars?• HDFS vs NFS?• Limitations of MapReduce?

Page 15: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

K-means Clustering

Page 16: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

K-Means Clustering - What is it?• Used for clustering numerical data, usually a set of

measurements about objects of interest.

• Input: numerical. There must be a distance metric defined over the variable space.– Euclidian distance

• Output: The centers of each discovered cluster, and the assignment of each input datum to a cluster.– Centroid

16Module 4: Analytics Theory/Methods

Page 17: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Picking KHeuristic: find the "elbow" of the within-sum-of-squares (wss) plot as a function of K.

K: # of clustersni: # points in ith clusterci: centroid of ith clusterxij: jth point of ith cluster

"Elbows" at k=2,4,6

17Module 4: Analytics Theory/Methods

Page 18: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-) Easy to implement Doesn't handle categorical variablesEasy to assign new data to existing clusters Which is the nearest cluster center?

Sensitive to initialization (first guess)

Concise output Coordinates the K cluster centers

Variables should all be measured on similar or compatible scales Not scale-invariant!K (the number of clusters) must be known or decided a priori Wrong guess: possibly poor resultsTends to produce "round" equi-sized clusters. Not always desirable

K-Means Clustering - Reasons to Choose (+) and Cautions (-)

Module 4: Analytics Theory/Methods18

Page 19: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Hypothesis Testing

CSCE 587

Page 20: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Background

Assumptions:– Samples reflect an underlying distribution– If two distributions are different, the samples

should reflect this.– A difference should be testable

Page 21: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Intuition: Difference of Means

Module 3: Basic Data Analytic Methods Using R 21

If m1 = m2, this area is large

Page 22: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Test Selection

• One sample test:– compare sample to population– Actually: sample mean vs population mean– Does the sample match the population

• Two sample test: – compare two samples– Sample distribution of the difference of the means

Page 23: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

One Sample t-Test

Test: One Sample t-test• Used only for tests of the population mean• Null hypothesis: the means are the same• Compare the mean of the sample with the mean of the

population• What do we know about the population?

nsobservatioSDt

sample

populationsample

#/

Page 24: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Confidence Intervals

Module 3: Basic Data Analytic Methods Using R 24

If x is your estimate of some unknown value μ, the P% confidence interval

is the interval around x that μ will fall in, with probability P.

Example: • Gaussian data N(μ, σ) • x is the estimate of μ

• based on n samples

μ falls in the interval

x ± 2σ/√n

with approx. 95% probability("95% confidence")

Page 25: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Association Rules

CSCE 587

Page 26: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Apriori Algorithm - What is it?Support

• Earliest of the association rule algorithms• Frequent itemset: a set of items L that appears

together "often enough“:– Formally: meets a minimum support criterion– Support: the % of transactions that contain L

• Apriori Property: Any subset of a frequent itemset is also frequent– It has at least the support of its superset

26Module 4: Analytics Theory/Methods

Page 27: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Apriori Algorithm (Continued)Confidence

• Iteratively grow the frequent itemsets from size 1 to size K (or until we run out of support).– Apriori property tells us how to prune the search

space• Frequent itemsets are used to find rules X->Y with

a minimum confidence:– Confidence: The % of transactions that contain X,

which also contain Y• Output: The set of all rules X -> Y with minimum

support and confidence27Module 4: Analytics Theory/Methods

Page 28: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Lift and Leverage

28Module 4: Analytics Theory/Methods

Page 29: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-) Easy to implement Requires many database scansUses a clever observation to prune the search space

• Apriori property

Exponential time complexity

Easy to parallelize Can mistakenly find spurious (or coincidental) relationships

• Addressed with Lift and Leverage measures

Apriori - Reasons to Choose (+) and Cautions (-)

Module 4: Analytics Theory/Methods29

Page 30: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Linear Regression

CSCE 587

Page 31: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Linear Regression -What is it?

• Used to estimate a continuous value as a linear (additive) function of other variables– Income as a function of years of education, age, gender– House price as function of median home price in neighborhood,

square footage, number of bedrooms/bathrooms– Neighborhood house sales in the past year based on unemployment,

stock price etc.

• Input variables can be continuous or discrete.• Output:

– A set of coefficients that indicate the relative impact of each driver. – A linear expression for predicting outcome as a function of drivers.

31Module 4: Analytics Theory/Methods

Page 32: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Technical Description

• Solve for the bi

– Ordinary Least Squares• storage quadratic in number of variables• must invert a matrix

• Categorical variables are expanded to a set of indicator variables, one for each possible value.

32Module 4: Analytics Theory/Methods

Page 33: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-)Concise representation (the coefficients) Does not handle missing values well

Robust to redundant variables, correlated variables

Lose some explanatory value

Assumes that each variable affects the outcome linearly and additively

Variable transformations and modeling variable interactions can alleviate thisA good idea to take the log of monetary amounts or any variable with a wide dynamic range

Explanatory valueRelative impact of each variable on the outcome

Can't handle variables that affect the outcome in a discontinuous way

Step functionsEasy to score data Doesn't work well with discrete drivers that

have a lot of distinct valuesFor example, ZIP code

Linear Regression - Reasons to Choose (+) and Cautions (-)

Module 4: Analytics Theory/Methods33

Page 34: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Logistic Regression

CSCE 587

Page 35: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

( ) log(1 )

pLOGIT p z

p

exp( )

1 exp( )

zp

z

exp( ) ln

(1 ) 1 exp

zpLOGIT p z p

p z

The Logistic Curve

z (log odds)

p (p

roba

bilit

y)

Page 36: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

The Logistic Regression Model

0 1 1 2 2 K K

P Yln

1-P YX X X

predictor variables

YP1

YPln is the log(odds) of the outcome.

dichotomous outcome

Page 37: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Relationship between Odds & Probability

Probability eventOdds event =

1-Probability event

Odds eventProbability event

1+Odds event

Page 38: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-)Explanatory value:

Relative impact of each variable on the outcome in a more complicated way than linear regression

Does not handle missing values well

Robust with redundant variables, correlated variablesLose some explanatory value

Assumes that each variable affects the log-odds of the outcome linearly and additively

Variable transformations and modeling variable interactions can alleviate thisA good idea to take the log of monetary amounts or any variable with a wide dynamic range

Concise representation with the the coefficients

Cannot handle variables that affect the outcome in a discontinuous way.

Step functions

Easy to score data Doesn't work well with discrete drivers that have a lot of distinct values

For example, ZIP code

Returns good probability estimates of an event

Preserves the summary statistics of the training data"The probabilities equal the counts"

Logistic Regression - Reasons to Choose (+) and Cautions (-)

Module 4: Analytics Theory/Methods38

Page 39: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Naïve Bayes

CSCE 587

Page 40: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Bayesian Classification

• Problem statement:– Given features X1,X2,…,Xn

– Predict a label Y

Page 41: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

The Bayes Classifier• Use Bayes Rule!

• Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label

Normalization Constant

Likelihood Prior

Page 42: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Model Parameters

• The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way too many parameters:– We’ll run out of space– We’ll run out of time– And we’ll need tons of training data (which is usually not

available)

Page 43: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

The Naïve Bayes Model• The Naïve Bayes Assumption: Assume that all features are

independent given the class label Y• Equationally speaking:

• (We will discuss the validity of this assumption later)

Page 44: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Why is this useful?

• # of parameters for modeling P(X1,…,Xn|Y):

– 2(2n-1)

• # of parameters for modeling P(X1|Y),…,P(Xn|Y)

– 2n

Page 45: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-)Handles missing values quite well Numeric variables have to be discrete

(categorized) IntervalsRobust to irrelevant variables Sensitive to correlated variables

"Double-counting"Easy to implement Not good for estimating probabilities

Stick to class label or yes/noEasy to score data

Resistant to over-fitting

Computationally efficientHandles very high dimensional problemsHandles categorical variables with a lot of levels

Naïve Bayesian Classifier - Reasons to Choose (+) and Cautions (-)

Module 4: Analytics Theory/Methods45

Page 46: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Decision Trees

CSCE 587

Page 47: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Decision Tree – Example of Visual Structure

Gender

Income Age

Yes NoYes No

MaleFemale

<=40 >40>45,000<=45,000

Internal Node – decision on variable

Leaf Node – class label

Branch – outcome of test

Income Age

Female Male

47Module 4: Analytics Theory/Methods

Page 48: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

General Algorithm

• To construct tree T from training set S– If all examples in S belong to some class in C, or S is

sufficiently "pure", then make a leaf labeled C.– Otherwise:

• select the “most informative” attribute A • partition S according to A’s values• recursively construct sub-trees T1, T2, ..., for the subsets of S

• The details vary according to the specific algorithm – CART, ID3, C4.5 – but the general idea is the same

48Module 4: Analytics Theory/Methods

Page 49: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Step 1: Pick the Most “Informative" Attribute

• Entropy-based methods are one common way

• H = 0 if p(c) = 0 or 1 for any class– So for binary classification, H=0 is a "pure" node

• H is maximum when all classes are equally probable– For binary classification, H=1 when classes are 50/50

49Module 4: Analytics Theory/Methods

Page 50: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Step 1: Pick the Most “Informative" Attribute Information Gain

• The information that you gain, by knowing the value of an attribute

• So the "most informative" attribute is the attribute with the highest InfoGain

50Module 4: Analytics Theory/Methods

Page 51: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Reasons to Choose (+) Cautions (-)Takes any input type (numeric, categorical)

In principle, can handle categorical variables with many distinct values (ZIP code)

Decision surfaces can only be axis-aligned

Robust with redundant variables, correlated variables Tree structure is sensitive to small changes in the training data

Naturally handles variable interaction A "deep" tree is probably over-fit Because each split reduces the training data for subsequent splits

Handles variables that have non-linear effect on outcome

Not good for outcomes that are dependent on many variables

Related to over-fit problem, above

Computationally efficient to build Doesn't naturally handle missing values;However most implementations include a method for dealing with this

Easy to score data In practice, decision rules can be fairly complex

Many algorithms can return a measure of variable importance

In principle, decision rules are easy to understand

Decision Tree Classifier - Reasons to Choose (+) & Cautions (-)

Module 4: Analytics Theory/Methods51

Page 52: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Conclusion

CSCE 587

Page 53: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Typical Questions Recommended Method

Do I want class probabilities, rather than just class labels?

Logistic regressionDecision Tree

Do I want insight into how the variables affect the model?

Logistic regressionDecision Tree

Is the problem high-dimensional? Naïve Bayes

Do I suspect some of the inputs are correlated? Decision TreeLogistic Regression

Do I suspect some of the inputs are irrelevant? Decision TreeNaïve Bayes

Are there categorical variables with a large number of levels?

Naïve BayesDecision Tree

Are there mixed variable types? Decision TreeLogistic Regression

Is there non-linear data or discontinuities in the inputs that will affect the outputs?

Decision Tree

Which Classifier Should I Try?

Module 4: Analytics Theory/Methods53

Page 54: CSCE 587 Final Review. Rules of Engagement Questions can be of the form 1) Explain/show property X 2) Which properties are required for Y 3) Apply method

Diagnostics

• Hold-out data– How well does the model classify new instances?

• Cross-validation• ROC curve/AUC

54Module 4: Analytics Theory/Methods