What factors are most responsible for height? Outcome = (Model) + Error

What factors are most responsible for height?

Outcome = (Model) + Error
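The decomposition above can be checked directly in R. The sketch below is only an illustration: the data are simulated (they are not Galton's measurements), and the names `father`, `child`, and `fit` are assumptions made for the example. It fits a straight-line model and confirms that the fitted values (the model) plus the residuals (the error) reproduce the outcome.

```r
# Simulated, hypothetical data -- NOT Galton's measurements
set.seed(1)
father <- rnorm(50, mean = 69, sd = 2.5)
child  <- 40 + 0.4 * father + rnorm(50, sd = 2)

# Model: a straight line predicting child from father
fit <- lm(child ~ father)

# Outcome = (Model) + Error: fitted values plus residuals give back the outcome
recovered <- fitted(fit) + resid(fit)
all.equal(unname(recovered), child)
```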

Analytics & History: 1st Regression Line

The first “Regression Line”

Galton’s Notebook on Families & Height

X1 X2 X3 Y

Galton’s Family Height Dataset

> getwd()
[1] "C:/Users/johnp_000/Documents"

> setwd()

Dataset Input

Function FilenameObject

h <- read.csv("GaltonFamilies.csv")

str() summary()

Data Types: Numbers and Factors/Categorical
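A minimal sketch of the input-and-inspect step, assuming the CSV has columns named `father`, `mother`, `gender`, and `childHeight`. The four rows below are made-up stand-ins so the example runs without the file; `h_demo` and `csv_text` are hypothetical names.

```r
# Hypothetical rows standing in for GaltonFamilies.csv (values are made up)
csv_text <- "father,mother,gender,childHeight
70,65,male,71
70,65,female,64
68,63,male,69
68,63,female,63"

# stringsAsFactors = TRUE turns the text column into a factor (categorical)
h_demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(h_demo)      # one line per column: its type and first values
summary(h_demo)  # quartiles and mean for numbers, counts for factors
```

With the real file, the same two calls would be applied to `h` after `read.csv("GaltonFamilies.csv")`.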

Outline

• One Variable: Univariate
  • Dependent / Outcome Variable

• Two Variables: Bivariate
  • Outcome and each Predictor

• All Four Variables: Multivariate

Steps

Variable   Name                          Type          Plot
Y          Child’s Height                Continuous    Histogram
X1, X2     Dad’s Height, Mom’s Height    Continuous    Scatter
X3         Gender                        Categorical   Boxplot

Model: Linear Regression

Frequency Distribution, Histogram

hist(h$childHeight)

Area = 1

Density Plot

plot(density(h$childHeight))

hist(h$childHeight, freq = FALSE, breaks = 25, ylim = c(0, 0.14))
curve(dnorm(x, mean = mean(h$childHeight), sd = sd(h$childHeight)), col = "red", add = TRUE)
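The "Area = 1" property of a density-scaled histogram can be verified numerically. This is a sketch on simulated heights (the vector `x` is an assumption, not the Galton data): bar heights times bar widths should sum to 1, and the area under the kernel density curve, approximated with the trapezoid rule, should be close to 1.

```r
set.seed(42)
x <- rnorm(500, mean = 67, sd = 3.5)   # simulated heights, not Galton's

# Histogram on the density scale: bar area = height * width
hh <- hist(x, breaks = 25, plot = FALSE)
bar_area <- sum(hh$density * diff(hh$breaks))
bar_area

# Kernel density estimate: approximate the area with the trapezoid rule
d <- density(x)
curve_area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
curve_area
```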

Mode, Bimodal

Industry                 Pct.
Research                 24%
Higher Education          7%
Information Technology    9%
Computer Software         7%
Financial Services        6%
Banking                   2%
Pharmaceuticals           4%
Biotechnology             4%
Market Research           3%
Management Consulting     3%
Total                    69%

Hadley Wickham

Asst. Professor of Statistics at Rice University

ggplot2, plyr, reshape, rggobi, profr

Industries / Organizations Creating and Using R

http://ggplot2.org/

ggplot2
library(ggplot2)
h.gg <- ggplot(h, aes(childHeight))
h.gg + geom_histogram(binwidth = 1) + labs(x = "Height", y = "Frequency")
h.gg + geom_density()

ggplot2
h.gg <- ggplot(h, aes(childHeight)) + theme(legend.position = "right")
h.gg + geom_density() + labs(x = "Height", y = "Density")
h.gg + geom_density(aes(fill = factor(gender)), size = 2)

Correlation and Regression

1. Calculate the difference between the mean and each person’s score for the first variable (x).

2. Calculate the difference between the mean and their value for the second variable (y).

3. Multiply these “error” values.

4. Add these values to get the cross-product deviations.

5. The covariance is the average of the cross-product deviations.

Covariance

$$
\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}
$$
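The steps can be followed one at a time in R. This is a minimal sketch using two short illustrative vectors (`x` and `y` are assumptions for the example, not the Galton data); the hand computation is compared against R's built-in `cov()`, which also divides by N - 1.

```r
# Illustrative values for five people (not Galton's data)
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

dx <- x - mean(x)          # step 1: deviations of x from its mean
dy <- y - mean(y)          # step 2: deviations of y from its mean
cross <- dx * dy           # step 3: multiply the paired deviations
total <- sum(cross)        # step 4: sum of cross-product deviations
cov_manual <- total / (length(x) - 1)   # step 5: average over N - 1

cov_manual
cov(x, y)                  # built-in function, same result
```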

Covariance

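[Figure: plot of Y against X]

Persons 2, 3, and 5 look to have similar magnitudes from their means

$$
\begin{aligned}
\operatorname{cov}(x, y) &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \\
&= \frac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4} \\
&= \frac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} \\
&= \frac{17}{4} = 4.25
\end{aligned}
$$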

Covariance

• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).

• Calculate the error [deviation] between the mean and their score for the second variable (y).

• Multiply these error values.

• Add these values and you get the cross-product deviations.

• The covariance is the average of the cross-product deviations.

• Covariance depends upon the units of measurement

• Normalize the data

• Divide by the standard deviations of both variables.

• The standardized version of covariance is known as the correlation coefficient
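A short sketch of why standardizing matters, again on illustrative vectors rather than the Galton data (`x`, `y`, and `x_cm` are names assumed for the example). Dividing the covariance by both standard deviations gives the correlation coefficient, and changing the units of one variable changes the covariance but leaves the correlation untouched.

```r
# Illustrative values (not Galton's data)
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

# Correlation = covariance divided by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual
cor(x, y)                 # built-in function, same result

# Change the units of x (e.g. inches to centimetres)
x_cm <- x * 2.54
cov(x_cm, y)              # covariance is rescaled by 2.54
cor(x_cm, y)              # correlation is unchanged
```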

Standardizing the Covariance

Correlation

?cor

cor(h$father, h$childHeight)

0.2660385

Scatterplot Matrix: pairs()

Correlations Matrix
library(car)
scatterplotMatrix(heights)

ggplot2

Box Plot

Children’s Height vs. Gender
boxplot(childHeight ~ gender, data = h, col = c("pink", "lightblue"), main = "Children's Height by Gender", xlab = "Gender", ylab = "")

Descriptive Stats: Box Plot

69.23 - 64.10 = 5.13

Subset Males
men <- subset(h, gender == 'male')

Subset Females
women <- subset(h, gender == 'female')
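A minimal sketch of subsetting by gender and comparing group means. The data frame `h_demo` and its five rows are made up for illustration, and the column names `gender` and `childHeight` are assumed to match the CSV; with the real data the same calls would be applied to `h`.

```r
# Hypothetical stand-in for the dataset (values are made up)
h_demo <- data.frame(
  gender      = c("male", "female", "male", "female", "male"),
  childHeight = c(71, 64, 69, 63, 70)
)

# One subset per gender
men_demo   <- subset(h_demo, gender == "male")
women_demo <- subset(h_demo, gender == "female")

# Difference between the two group means
gap <- mean(men_demo$childHeight) - mean(women_demo$childHeight)
gap

# The same group means in one call, without creating subsets
tapply(h_demo$childHeight, h_demo$gender, mean)
```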

Children’s Height: Males

qqnorm(men$childHeight)
qqline(men$childHeight)

hist(men$childHeight)

Children’s Height: Females

qqnorm(women$childHeight)
qqline(women$childHeight)

hist(women$childHeight)

ggplot2
library(ggplot2)
h.bb <- ggplot(h, aes(factor(gender), childHeight))
h.bb + geom_boxplot()
h.bb + geom_boxplot(aes(fill = factor(gender)))
