Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Bivariate data (two quantitative variables)Hypothetical scatter plots
1 2 3 4 5
2.5
3.5
4.5
5.5
Corr. coeff=0
1 2 3 4 5
2.5
3.5
4.5
Corr. coeff=0.4
1 2 3 4 5
2.5
3.5
4.5
5.5
Corr. coeff=-0.2
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=-0.5
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=0.6
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=0.8
1 2 3 4 5
3.03.54.04.55.0
Corr. coeff=-0.7
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=-0.9
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=0.9
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=0.95
1 2 3 4 5
3.0
4.0
5.0
Corr. coeff=-0.95
1 2 3 4 53.0
4.0
5.0
Corr. coeff=-0.99
Correlation coefficient
Aim: establish and estimate association between two variables.
Formula: rYZ =
∑ni=1(yi − y)(zi − z)
(n − 1)√
S2Y S2
Z
. (Pearson’s)
Property: −1 ≤ rYZ ≤ 1.
Confidence intervals and test of the hypothesis ρ = 0 useassumption (Y ,Z ) bivariate normal with correlation coefficient ρ.
If variables not normal, other coefficients used:Kendall’s correlation coefficient τ use rank of observations, insteadof values.Spearman’s correlation coefficient rS is also computed from ranks.
An example
60 80 100 120 140 160 180 200
34
56
78
Calories and price of some brands of beer
Calories
Price
An example
60 80 100 120 140 160 180 200
34
56
78
Calories and price of some brands of beer
Calories
Price
rYZ : 0.419937 (Pearson)95% confidence interval:0.2271418–0.5810573
An example
60 80 100 120 140 160 180 200
34
56
78
Calories and price of some brands of beer
Calories
Price
rYZ : 0.419937 (Pearson)95% confidence interval:0.2271418–0.5810573P(> |rYZ ||ρ = 0) = 6.31·10−5
τ : 0.3488677 (Kendall)P(> |τ |) = 3.39 · 10−6
rS : 0.5008197 (Spearman)P(> |rS |) = 1.053 · 10−6
An example
60 80 100 120 140 160 180 200
34
56
78
Calories and price of some brands of beer
Calories
Price
rYZ : 0.419937 (Pearson)95% confidence interval:0.2271418–0.5810573P(> |rYZ ||ρ = 0) = 6.31·10−5
τ : 0.3488677 (Kendall)P(> |τ |) = 3.39 · 10−6
rS : 0.5008197 (Spearman)P(> |rS |) = 1.053 · 10−6
In this case results are similar with all coefficients.
Linear regressionWith bivariate data, we can choose to predict Y on the basis of X :
Y = α + βX + ε (ε error).
For each value xi of X , there are:
yi (observed value) and yi = α + βxi (predicted value).
α and β are chosen to minimize∑n
i=1(yi − yi )2.
20 30 40 50 60 70 80
100
120
140
160
180
200
220
A regression
eta
pressione
obs. - pred.
Linear regressionWith bivariate data, we can choose to predict Y on the basis of X :
Y = α + βX + ε (ε error).
For each value xi of X , there are:
yi (observed value) and yi = α + βxi (predicted value).
α and β are chosen to minimize∑n
i=1(yi − yi )2.
20 30 40 50 60 70 80
100
120
140
160
180
200
220
A regression
eta
pressione
obs. - pred.β =
∑ni=1(yi − y)(xi − x)∑n
i=1(xi − x)2
α = y − βx .
Linear regressionWith bivariate data, we can choose to predict Y on the basis of X :
Y = α + βX + ε (ε error).
For each value xi of X , there are:
yi (observed value) and yi = α + βxi (predicted value).
α and β are chosen to minimize∑n
i=1(yi − yi )2.
20 30 40 50 60 70 80
100
120
140
160
180
200
220
A regression
eta
pressione
obs. - pred.β =
∑ni=1(yi − y)(xi − x)∑n
i=1(xi − x)2
α = y − βx .
Formulae similar to correlation,but interpretation very different.