Author(s): Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu, 2012 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution – Noncommercial – Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/ We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact [email protected] with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use .



Attribution Key

For more information see: http://open.umich.edu/wiki/AttributionPolicy

Use + Share + Adapt

Make Your Own Assessment

Creative Commons – Attribution License

Creative Commons – Attribution Share Alike License

Creative Commons – Attribution Noncommercial License

Creative Commons – Attribution Noncommercial Share Alike License

GNU – Free Documentation License

Creative Commons – Zero Waiver

Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ

Public Domain – Expired: Works that are no longer protected due to an expired copyright term.

Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)

Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.

Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ

Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair.

To use this content you should do your own independent analysis to determine whether or not your use will be Fair.

{ Content the copyright holder, author, or law permits you to use, share and adapt. }

{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }

{ Content Open.Michigan has used under a Fair Use determination. }


Descriptive Statistics: quantitatively describe the main features of a collection of data.

[Cartoon with speech bubbles; figures labeled "Staff. Jones", "HR manager", and "employee":]

"My salary is $45,000. It's a middle salary in my company."

"Benefits are highly related to working age."

"What should I make of all this???!!!"

"How do salaries vary across the company?"


Descriptive Statistics in R

Mean: > mean(x); > mean(x, trim=a)
Median: > median(x)
Mode: > sort(table(x))
Standard deviation: > sd(x)
Variance: > var(x)
Median absolute deviation: > mad(c(x))
Interquartile range: > IQR(x)
Range: > range(x)
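As a minimal sketch of the functions listed on this slide, the following applies them to a small made-up vector. The vector `x`, its values, and the helper name `mode_x` are illustrative assumptions, not data from the slides.

```r
# Hypothetical sample: salaries in thousands of dollars (made-up values)
x <- c(30, 35, 45, 45, 50, 60, 120)

mean(x)               # arithmetic mean
mean(x, trim = 0.2)   # trimmed mean: drops 20% of the values at each end first
median(x)             # middle value
sort(table(x), decreasing = TRUE)  # frequency table; the first entry is the mode
sd(x)                 # sample standard deviation
var(x)                # sample variance (sd squared)
mad(x)                # median absolute deviation (scaled by 1.4826 by default)
IQR(x)                # interquartile range
range(x)              # minimum and maximum

# Extract the most frequent value as a number
mode_x <- as.numeric(names(sort(table(x), decreasing = TRUE))[1])
```

The single large value pulls the mean well above the median, which is why the trimmed mean and the median are often reported alongside it.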


Data Dimensions

> length(x)
[1] 1000

Matrix X:

> nrow(X)
[1] 2030
> ncol(X)
[1] 100000
> dim(X)
[1] 2034 100000
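The same calls can be tried on small objects. This is a sketch with made-up sizes; `x` and `X` here are hypothetical and much smaller than the objects on the slide.

```r
x <- 1:1000                          # a vector with 1000 elements
length(x)                            # number of elements

X <- matrix(0, nrow = 20, ncol = 5)  # hypothetical 20 x 5 matrix of zeros
nrow(X)                              # number of rows
ncol(X)                              # number of columns
dim(X)                               # both at once: c(nrow, ncol)
length(X)                            # for a matrix: total number of cells
```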


Vectorization in R

Matrix X

> apply(X, MARGIN=1, FUN=mean)   # mean of each row

> apply(X, MARGIN=2, FUN=mean)   # mean of each column
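A small worked example may make the two margins concrete. The matrix below is made up for illustration; `rowMeans()` and `colMeans()` are the built-in shortcuts for these two particular calls.

```r
X <- matrix(1:6, nrow = 2)       # hypothetical 2 x 3 matrix, filled column by column

row_means <- apply(X, MARGIN = 1, FUN = mean)  # one mean per row
col_means <- apply(X, MARGIN = 2, FUN = mean)  # one mean per column

# rowMeans() and colMeans() give the same results
all.equal(row_means, rowMeans(X))
all.equal(col_means, colMeans(X))
```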


boxplot(X)

• Good for small data sets

• Easy to compare groups side by side

• Points more than 1.5*IQR beyond the quartiles are drawn as outliers
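The outlier rule can be checked without drawing anything. The sketch below uses `boxplot.stats()`, which computes the numbers behind a box plot; the data vector is made up, and note that R uses the hinges (approximately the quartiles) for this rule.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # hypothetical data with one extreme value

# coef = 1.5 is the default multiplier
b <- boxplot.stats(x, coef = 1.5)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points flagged as outliers

# The same rule by hand, using the hinges
hinges <- b$stats[c(2, 4)]
spread <- diff(hinges)
outliers <- x[x < hinges[1] - 1.5 * spread | x > hinges[2] + 1.5 * spread]
```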


The Big Six

Minimum, 1st Q, Median, Mean, 3rd Q, Maximum

> summary(X)
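For a numeric vector, `summary()` returns exactly these six numbers. A minimal sketch with a made-up vector:

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # hypothetical numeric vector
s <- summary(x)
s
names(s)                       # labels of the six statistics
```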


R tries to understand you

> summary(X)
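One reading of this slide, offered here as an assumption, is that `summary()` is a generic function: what it prints depends on the class of its argument. The objects below are made up to illustrate that behavior.

```r
v   <- c(1.5, 2.0, 3.5)                  # numeric vector: six-number summary
fac <- factor(c("a", "b", "a", "c"))     # factor: counts per level
df  <- data.frame(num = c(1, 2, 3, 4), fac = fac)  # data frame: column by column

summary(v)
summary(fac)
summary(df)

# A fitted model gets yet another kind of summary
fit <- lm(num ~ fac, data = df)
class(summary(fit))
```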


Histograms: > hist(X)
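To see what `hist()` computes before it draws, the bins can be inspected with `plot = FALSE`. This is a sketch on simulated data; the sample and its parameters are assumptions made for illustration.

```r
set.seed(1)                           # make the made-up sample reproducible
x <- rnorm(200, mean = 50, sd = 10)   # hypothetical data

h <- hist(x, plot = FALSE)            # compute the bins without drawing
h$breaks                              # bin boundaries chosen by R
h$counts                              # number of observations in each bin
```

Calling `hist(x)` draws these same bins; the `breaks` argument can be used to suggest a different number of bins.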


Correlation

> cor(wt,mpg)
[1] -0.8676594
> plot(x=wt,y=mpg)
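The variable names `wt` and `mpg` match columns of R's built-in `mtcars` data set. Assuming that is the data behind this slide, the correlation can be reproduced as follows.

```r
data(mtcars)

# Pearson correlation (the default) between car weight and miles per gallon
r <- with(mtcars, cor(wt, mpg))
r

# Rank-based alternative, less sensitive to extreme values
cor(mtcars$wt, mtcars$mpg, method = "spearman")
```

The strongly negative value says that heavier cars tend to have lower fuel economy; it measures linear association only.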


Scatterplot Matrix

• Iris dataset
• 150 flowers
• 5 variables

(Photo: Goingslo, flickr)


Scatterplot Matrix

> pairs(data)
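A minimal sketch of the scatterplot matrix, assuming `data` on the slide refers to the built-in `iris` data frame mentioned earlier. Coloring by species is an illustrative choice, not something shown in the slides.

```r
data(iris)
dim(iris)                          # 150 rows, 5 columns

# Scatterplot matrix of the four numeric columns, colored by species
pairs(iris[, 1:4], col = as.integer(iris$Species))

# Numeric companion: pairwise correlations of the same columns
round(cor(iris[, 1:4]), 2)
```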


> coplot(lat ~ long | depth)
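The names `lat`, `long`, and `depth` match columns of R's built-in `quakes` data set. Assuming that is the data intended, a conditioning plot can be sketched like this.

```r
data(quakes)
str(quakes[, c("lat", "long", "depth")])

# Latitude against longitude, one panel per overlapping range of depth
coplot(lat ~ long | depth, data = quakes)

# 'number' sets how many depth ranges (panels) are used
coplot(lat ~ long | depth, data = quakes, number = 4)
```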


Linear Regression

Why? What?
• Prediction of future or unknown observations
• Assessment of relationship between variables
• General description of data structure


Variable Selection

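Why?
• Simplification
• Elimination of multicollinearity and noise
• Time and money saving

How?
• Testing-based variable selection methods: backward, forward, stepwise
• Criterion-based procedures

What?
AIC = n ln(RSS/n) + 2p, where n is the number of observations, RSS is the residual sum of squares, and p is the number of estimated coefficients (including the intercept)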
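The AIC formula can be checked by hand against R. The sketch below uses a made-up model on the built-in `mtcars` data (an assumption chosen only for illustration) and compares the hand computation with `extractAIC()`, the function `step()` relies on.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # hypothetical example model

n   <- nrow(mtcars)            # number of observations
rss <- sum(residuals(fit)^2)   # residual sum of squares
p   <- length(coef(fit))       # number of coefficients, intercept included

aic_by_hand <- n * log(rss / n) + 2 * p
aic_by_hand
extractAIC(fit)[2]             # should agree with the hand computation

# AIC(fit) uses the full log-likelihood, so it differs by a constant
# that does not depend on which predictors are in the model
AIC(fit)
```

Because the two versions differ only by a constant for a fixed data set, they rank candidate models identically.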


Example: U.S. State Facts and Figures

Response: Life Expectancy
Predictors: Population, Income, Illiteracy, Murder, HS Grad, Frost, Area

Selected R code

Linear Regression:
> g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
> summary(g)
> anova(g)

AIC:
> step(g)

Output of summary(g):

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
Population   5.180e-05  2.919e-05   1.775   0.0832 .
Income      -2.180e-05  2.444e-04  -0.089   0.9293
Illiteracy   3.382e-02  3.663e-01   0.092   0.9269
Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *
Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
Area        -7.383e-08  1.668e-06  -0.044   0.9649

Output of anova(g):

Analysis of Variance Table
Response: Life.Exp
           Df  Sum Sq Mean Sq F value    Pr(>F)
Population  1  0.4089  0.4089  0.7372  0.395434
Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
HS.Grad     1  4.0989  4.0989  7.3895  0.009494 **
Frost       1  2.0488  2.0488  3.6935  0.061426 .
Area        1  0.0011  0.0011  0.0020  0.964908
Residuals  42 23.2971  0.5547

AIC = n ln(RSS/n) + 2p
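The slides do not show how `statedata` is created. A plausible reconstruction, stated here as an assumption, builds it from R's built-in `state.x77` matrix, whose column names become `Life.Exp` and `HS.Grad` once converted to a data frame.

```r
# Assumption: statedata comes from the built-in state.x77 matrix;
# data.frame() turns "Life Exp" and "HS Grad" into Life.Exp and HS.Grad
statedata <- data.frame(state.x77, row.names = state.abb)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
          HS.Grad + Frost + Area, data = statedata)

summary(g)    # coefficient table with t tests
anova(g)      # sequential sums of squares, in the order of the formula
deviance(g)   # residual sum of squares, the RSS used in the AIC formula
```

The `anova()` table adds predictors one at a time in formula order, so its p-values can differ from the t tests in `summary()`, which assess each predictor given all the others.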


Continued: U.S. State Facts and Figures

Start:  AIC=-22.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area

             Df Sum of Sq    RSS     AIC
- Area        1    0.0011 23.298 -24.182
- Income      1    0.0044 23.302 -24.175
- Illiteracy  1    0.0047 23.302 -24.174
<none>                    23.297 -22.185
- Population  1    1.7472 25.044 -20.569
- Frost       1    1.8466 25.144 -20.371
- HS.Grad     1    2.4413 25.738 -19.202
- Murder      1   23.1411 46.438  10.305

Step:  AIC=-24.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
- Illiteracy  1    0.0038 23.302 -26.174
- Income      1    0.0059 23.304 -26.170
<none>                    23.298 -24.182
- Population  1    1.7599 25.058 -22.541
- Frost       1    2.0488 25.347 -21.968
- HS.Grad     1    2.9804 26.279 -20.163
- Murder      1   26.2721 49.570  11.569


Continued: U.S. State Facts and Figures

Step:  AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
<none>                    23.308 -28.161
- Population  1     2.064 25.372 -25.920
- Frost       1     3.122 26.430 -23.877
- HS.Grad     1     5.112 28.420 -20.246
- Murder      1    34.816 58.124  15.528

Coefficients:
(Intercept)   Population       Murder      HS.Grad        Frost
  7.103e+01    5.014e-05   -3.001e-01    4.658e-02   -5.943e-03

Each coefficient is the effect on the response variable of a one-unit change in the corresponding predictor variable.
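The backward search shown in these tables can be rerun in a few lines. As before, building `statedata` from the built-in `state.x77` matrix is an assumption; the slides do not show that step.

```r
statedata <- data.frame(state.x77, row.names = state.abb)  # assumed source of statedata
g <- lm(Life.Exp ~ ., data = statedata)   # all seven predictors

# With no scope given, step() searches backward;
# trace = 0 suppresses the step-by-step tables
g_final <- step(g, trace = 0)
formula(g_final)
coef(g_final)

# Reading a coefficient: change in predicted life expectancy for a one-unit
# increase in that predictor, with the other predictors held fixed
coef(g_final)["Murder"]
```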


What is Principal Component Analysis (PCA)?

Two general approaches to reducing the number of variables: feature selection and feature extraction

Feature selection: "Akaike Information Criterion" (AIC), BIC, or backward elimination

Feature extraction: "Principal Component Analysis" (PCA) is the most widely used

• Creates several artificial variables
• Built-in functions in R = convenient!


Actual Pima Data

  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
6        5     116        74       0       0 25.6    0.201  30    0
....

(Imagine a data set with many more (~1000) columns.)

(Imagine a linear regression: which variables affect diabetes, and in what ways?)


PCA Example: Pima Indians

The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.

9 Variables (8 continuous, 1 categorical)
• pregnant: Number of times pregnant
• glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• diastolic: Diastolic blood pressure (mm Hg)
• triceps: Triceps skin fold thickness (mm)
• insulin: 2-Hour serum insulin (mu U/ml)
• bmi: Body mass index (weight in kg/(height in metres squared))
• diabetes: Diabetes pedigree function
• age: Age (years)
• test: diabetes (coded 0 if negative, 1 if positive)

Next Slide: PCA Implementation


What principal components might look like:

PC1: 1*Insulin + 0.01*Glucose + ...
PC2: 1*Glucose + 0.12*Age + 0.12*DiastolicBP + ...
PC3: 0.92*DiastolicBP + 0.31*Triceps

Principal components: what are they composed of? (less important)

Difference with Linear Regression
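The weighted sums sketched on this slide can be reproduced numerically: each principal component score is a weighted sum of the centered variables, with the weights (loadings) stored in `rotation`. The sketch below uses the built-in `iris` measurements as stand-in data, which is an assumption, since the slides do not show how the Pima data are loaded.

```r
X <- as.matrix(iris[, 1:4])      # stand-in data (assumption): four numeric columns
pca <- prcomp(X)                 # centers the columns by default

pca$rotation[, 1:2]              # loadings: weight of each variable in PC1 and PC2

# Each score is a weighted sum of the centered variables
Xc <- scale(X, center = TRUE, scale = FALSE)
scores_by_hand <- Xc %*% pca$rotation
all.equal(unname(scores_by_hand), unname(pca$x))
```

Unlike a regression, no response variable is involved: the weights are chosen to capture as much of the variance of the predictors as possible.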


- Goal: obtain a summary of the data in fewer dimensions

-- How many dimensions?

- R code on the next slide

[Figure: biplot of the first two principal components. Horizontal axis: PC1; vertical axis: PC2. Observations are plotted as "+"; variable labels: pregnant, glucose, diastolic, triceps, insulin, bmi, age.]


Brief: R Code

> data.pca <- prcomp(data[,-9]); summary(data.pca);
Importance of components:
                           PC1     PC2     PC3     PC4      PC5     PC6     PC7
Standard deviation     116.002 30.5411 19.7630 14.0777 10.6155 6.76973 2.78575
Proportion of Variance   0.889  0.0616  0.0258  0.0131 0.00744 0.00303 0.00051
Cumulative Proportion    0.889   0.950   0.976  0.9890   0.996   0.999 1.00000

> data.pca
Rotation:
             PC1   PC2   PC3   PC4    PC5    PC6    PC7
pregnant   0.002 -0.02  0.02  0.05  2e-01 -0.005 -1e+00
glucose   -0.098 -0.97 -0.14 -0.12 -9e-02  0.051 -9e-04
Diastolic -0.016 -0.14  0.92  0.26 -2e-01  0.076  1e-03
triceps   -0.061  0.06  0.31 -0.88  3e-01  0.221  4e-04
insulin   -0.993  0.09 -0.02  0.07 -2e-04 -0.006 -1e-03
bmi       -0.014 -0.05  0.13 -0.19  2e-02 -0.971  3e-03
age        0.004 -0.14  0.13  0.30  9e-01 -0.015  2e-01

> barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")

> biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0);
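The same workflow can be tried on a data set that ships with R. The sketch below uses `USArrests` as stand-in data (an assumption). The name `prop_var` is hypothetical and is only one plausible way to obtain the kind of vector the slide passes to `barplot()` as `totalrep`, which the slides do not define.

```r
X <- USArrests                         # stand-in data (assumption), 50 x 4 numeric

pca <- prcomp(X)                       # unscaled, like the call on the slide
summary(pca)                           # sdev, proportion and cumulative proportion

# Proportion of total variance carried by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)
round(pca$rotation, 3)                 # loadings

# Unscaled PCA is dominated by the variable with the largest variance;
# scale. = TRUE standardizes each column first
pca_scaled <- prcomp(X, scale. = TRUE)
round(pca_scaled$sdev^2 / sum(pca_scaled$sdev^2), 3)
```

In the slide's output, insulin has a loading of -0.993 on PC1, which carries 88.9% of the variance; this is the pattern one would expect when variables measured on very different scales are left unscaled.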


[Figure: bar plot titled "Representation of Principal Components". Horizontal axis: Principal Component; vertical axis: % of Total Variance.]