33
1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Embed Size (px)

Citation preview

Page 1: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

1

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 4a, February 11, 2014, SAGE 3101

Introduction to Analytic Methods, Types of Data Mining for Analytics

Page 2: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Contents• Patterns/ Relations via “Data mining”

– A new dataset coming on Friday!• Interpreting results• Saving the models• Proceeding with applying the models

2

Page 3: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Data Mining• Classification (Supervised Learning)

– Classifiers are created using labeled training samples– Training samples created by ground truth / experts– Classifier later used to classify unknown samples

• Clustering (Unsupervised Learning)– Grouping objects into classes so that similar objects are in the

same class and dissimilar objects are in different classes– Discover overall distribution patterns and relationships between

attributes• Association Rule Mining

– Initially developed for market basket analysis– Goal is to discover relationships between attributes– Uses include decision support, classification and clustering

• Other Types of Mining– Outlier Analysis– Concept / Class Description– Time Series Analysis

Page 4: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Models/ types• Trade-off between Accuracy and

Understandability• Models range from “easy to understand” to

incomprehensible– Decision trees– Rule induction– Regression models– Neural Networks

4

Harder

Page 5: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Patterns and Relationships• Linear and multi-variate – ‘global methods’

– Fits.. – assumed linearity

algorithmic-based analysis – the start of ~ non-parametric analysis ~ ‘local methods’

• Nearest Neighbor– Training.. (supervised)

• K-means– Clustering.. (un-supervised)

5

Page 6: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

The Dataset(s)• Simple multivariate.csv (

http://escience.rpi.edu/data/DA )

• Some new ones; nyt and sales and the fb100 (.mat files) – but more of that on Friday

6

Page 7: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Linear and least-squares> multivariate <- read.csv("~/Documents/teaching/DataAnalytics/data/multivariate.csv")

> attach(multivariate)

> mm<-lm(Homeowners~Immigrants)

> mmCall:

lm(formula = Homeowners ~ Immigrants)

Coefficients:

(Intercept) Immigrants

107495 -6657 7

Page 8: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Linear fit?> plot(Homeowners ~ Immigrants)

> abline(cm[1],cm[2])

8

Page 9: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Suitable?> summary(mm)

Call:

lm(formula = Homeowners ~ Immigrants)

Residuals:

1 2 3 4 5 6 7

-24718 25776 53282 -33014 14161 -17378 -18109

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 107495 114434 0.939 0.391

Immigrants -6657 9714 -0.685 0.524

Residual standard error: 34740 on 5 degrees of freedom

Multiple R-squared: 0.08586, Adjusted R-squared: -0.09696

F-statistic: 0.4696 on 1 and 5 DF, p-value: 0.52369

Page 10: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Analysis – i.e. Science question

• We want to see if there is a correlation immigrant population and the mean income, the overall population, the percentage of people who own their own homes, and the population density.

• To do so we solve the set of 7 linear equations of the form:

• %_immigrant = a x Income + b x Population + c x Homeowners/Population + d x Population/area + e

10

Page 11: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Multi-variate> HP<- Homeowners/Population

> PD<-Population/area

> mm<-lm(Immigrants~Income+Population+HP+PD)

> summary(mm)

Call:

lm(formula = Immigrants ~ Income + Population + HP + PD)

Residuals:

1 2 3 4 5 6 7

0.02681 0.29635 -0.22196 -0.71588 -0.13043 -0.09438 0.83948

11

Page 12: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Multi-variateCoefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.455e+01 6.964e+00 3.525 0.0719 .

Income -1.130e-04 5.520e-05 -2.047 0.1772

Population 5.444e-05 1.884e-05 2.890 0.1018

hp -6.534e-02 1.751e-02 -3.731 0.0649 .

pd -1.774e-01 1.364e-01 -1.301 0.3231

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8309 on 2 degrees of freedom

Multiple R-squared: 0.892, Adjusted R-squared: 0.6761

F-statistic: 4.131 on 4 and 2 DF, p-value: 0.204312

Page 13: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Multi-variate> cm<-coef(mm)

> cm (Intercept) Income Population hp pd

2.454544e+01 -1.130049e-04 5.443904e-05 -6.533818e-02 -1.773908e-01

These linear model coefficients can be used with the predict.lm function to make predictions for new input variables. E.g. for the likely immigrant % given an income, population, %homeownership and population density

Oh, and you would probably try less variables?

13

Page 14: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

When it gets complex…• Let the data help you!

14

Page 15: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

K-nearest neighbors (knn)• Can be used in both regression and

classification– Is supervised, i.e. training set and test set

• KNN is a method for classifying objects based on closest training examples in the feature space.

• An object is classified by a majority vote of its neighbors. K is always a positive integer. The neighbors are taken from a set of objects for which the correct classification is known.

• It is usual to use the Euclidean distance, though other distance measures such as the Manhattan distance could in principle be used instead. 15

Page 16: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Algorithm• The algorithm on how to compute the K-nearest

neighbors is as follows:– Determine the parameter K = number of nearest

neighbors beforehand. This value is all up to you.– Calculate the distance between the query-instance and all

the training samples. You can use any distance algorithm.

– Sort the distances for all the training samples and determine the nearest neighbor based on the K-th minimum distance.

– Since this is supervised learning, get all the categories of your training data for the sorted value which fall under K.

– Use the majority of nearest neighbors as the prediction value.

16

Page 17: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Distance metrics• Euclidean distance is the most common use of

distance. When people talk about distance, this is what they are referring to. Euclidean distance, or simply 'distance', examines the root of square differences between the coordinates of a pair of objects. This is most generally known as the Pythagorean theorem.

• The taxicab metric is also known as rectilinear distance, L1 distance or L1 norm, city block distance, Manhattan distance, or Manhattan length, with the corresponding variations in the name of the geometry. It represents the distance between points in a city road grid. It examines the absolute differences between the coordinates of a pair of objects.

17

Page 18: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

More generally• The general metric for distance is the Minkowski

distance. When lambda is equal to 1, it becomes the city block distance, and when lambda is equal to 2, it becomes the Euclidean distance. The special case is when lambda is equal to infinity (taking a limit), where it is considered as the Chebyshev distance.

• Chebyshev distance is also called the Maximum value distance, defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. In other words, it examines the absolute magnitude of the differences between the coordinates of a pair of objects. 18

Page 19: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Choice of k?• Don’t you hate it when the instructions read:

the choice of ‘k’ is all up to you ??

• Loop over different k, evaluate results…

19

Page 20: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Summing up ‘knn’• Advantages

– Robust to noisy training data (especially if we use inverse square of weighted distance as the “distance”)

– Effective if the training data is large

• Disadvantages– Need to determine value of parameter K (number of

nearest neighbors)– Distance based learning is not clear which type of

distance to use and which attribute to use to produce the best results. Shall we use all attributes or certain attributes only?

– Computation cost is quite high because we need to compute distance of each query instance to all training samples. Some indexing (e.g. K-D tree) may reduce this computational cost.

20

Page 21: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

K-means• Unsupervised classification, i.e. no classes

known beforehand

• Types:– Hierarchichal: Successively determine new

clusters from previously determined clusters (parent/child clusters).

– Partitional: Establish all clusters at once, at the same level.

21

Page 22: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Distance Measure• Clustering is about finding “similarity”.• To find how similar two objects are, one

needs distance measure.• Similar objects (same cluster) should be

close to one another (short distance).

Page 23: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Distance Measure• Many ways to define distance measure. • Some elements may be close according to

one distance measure and further away according to another.

• Select a good distance measure is an important step in clustering.

Page 24: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Some Distance Functions

• Euclidean distance (2-norm): the most commonly used, also called “crow distance”.

• Manhattan distance (1-norm): also called “taxicab distance”.

• In general: Minkowski Metric (p-norm):

Page 25: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

K-Means Clustering• Separate the objects (data points) into K clusters. • Cluster center (centroid) = the average of all the

data points in the cluster.• Assigns each data point to the cluster whose

centroid is nearest (using distance function.)

Page 26: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

K-Means Algorithm

1. Place K points into the space of the objects being clustered. They represent the initial group centroids.

2. Assign each object to the group that has the closest centroid.

3. Recalculate the positions of the K centroids.

4. Repeat Steps 2 & 3 until the group centroids no longer move.

Page 27: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

K-Means Algorithm:Example Output

Page 28: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Describe v. Predict

28

Page 29: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

K-means"Age","Gender","Impressions","Clicks","Signed_In"

36,0,3,0,1

73,1,3,0,1

30,0,3,0,1

49,1,3,0,1

47,1,11,0,1

47,0,11,1,1

(nyt datasets)

Model e.g.: If Age<45 and Impressions >5 then Gender=female (0)

Age ranges? 41-45, 46-50, etc?29

Page 30: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Decision tree classifier

30

Page 31: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

We’ll do more with evaluating • Next week

• This Friday – lab and Assignment 3 available

• NOTE: NO CLASS TUESDAY *follows Monday schedule*

• Thus next Friday will be a hybrid class – part lecture part lab 31

Page 32: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Tentative assignments

• Assignment 3: Preliminary and Statistical Analysis. Due ~ week 5. 15% (15% written and 0% oral; individual);

• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 6. 15% (10% written and 5% oral; individual);

• Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);

• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);

• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

32

Page 33: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4a, February 11, 2014, SAGE 3101 Introduction to Analytic Methods, Types of Data Mining for Analytics

Admin info (keep/ print this slide)• Class: ITWS-4963/ITWS 6965• Hours: 12:00pm-1:50pm Tuesday/ Friday• Location: SAGE 3101• Instructor: Peter Fox• Instructor contact: [email protected], 518.276.4862 (do not

leave a msg)• Contact hours: Monday** 3:00-4:00pm (or by email appt)• Contact location: Winslow 2120 (sometimes Lally 207A

announced by email)• TA: Lakshmi Chenicheri [email protected] • Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014

– Schedule, lectures, syllabus, reading, assignments, etc.

33