Bivariate data (two quantitative variables)anal1/matstat_biotech/1011/diario/lez_regr.pdfWith bivariate data, we can choose to predict Y on the basis of X: Y = + X + " ("error): For

Bivariate data (two quantitative variables)Hypothetical scatter plots

1 2 3 4 5

2.5

3.5

4.5

5.5

Corr. coeff=0

1 2 3 4 5

2.5

3.5

4.5

Corr. coeff=0.4

1 2 3 4 5

2.5

3.5

4.5

5.5

Corr. coeff=-0.2

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=-0.5

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=0.6

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=0.8

1 2 3 4 5

3.03.54.04.55.0

Corr. coeff=-0.7

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=-0.9

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=0.9

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=0.95

1 2 3 4 5

3.0

4.0

5.0

Corr. coeff=-0.95

1 2 3 4 53.0

4.0

5.0

Corr. coeff=-0.99

Correlation coefficient

Aim: establish and estimate association between two variables.

Formula: rYZ =

∑ni=1(yi − y)(zi − z)

(n − 1)√

S2Y S2

Z

. (Pearson’s)

Property: −1 ≤ rYZ ≤ 1.

Confidence intervals and test of the hypothesis ρ = 0 useassumption (Y ,Z ) bivariate normal with correlation coefficient ρ.

If variables not normal, other coefficients used:Kendall’s correlation coefficient τ use rank of observations, insteadof values.Spearman’s correlation coefficient rS is also computed from ranks.

An example

60 80 100 120 140 160 180 200

34

56

78

Calories and price of some brands of beer

Calories

Price

An example

60 80 100 120 140 160 180 200

34

56

78


Calories

Price

rYZ : 0.419937 (Pearson)95% confidence interval:0.2271418–0.5810573

An example

60 80 100 120 140 160 180 200

34

56

78


Calories

Price

rYZ : 0.419937 (Pearson)95% confidence interval:0.2271418–0.5810573P(> |rYZ ||ρ = 0) = 6.31·10−5

τ : 0.3488677 (Kendall)P(> |τ |) = 3.39 · 10−6

rS : 0.5008197 (Spearman)P(> |rS |) = 1.053 · 10−6

An example

60 80 100 120 140 160 180 200

34

56

78


Calories

Price

rYZ : 0.419937 (Pearson)95% confidence interval:0.2271418–0.5810573P(> |rYZ ||ρ = 0) = 6.31·10−5

τ : 0.3488677 (Kendall)P(> |τ |) = 3.39 · 10−6

rS : 0.5008197 (Spearman)P(> |rS |) = 1.053 · 10−6

In this case results are similar with all coefficients.

Linear regressionWith bivariate data, we can choose to predict Y on the basis of X :

Y = α + βX + ε (ε error).

For each value xi of X , there are:

yi (observed value) and yi = α + βxi (predicted value).

α and β are chosen to minimize∑n

i=1(yi − yi )2.

20 30 40 50 60 70 80

100

120

140

160

180

200

220

A regression

eta

pressione

obs. - pred.






i=1(yi − yi )2.

20 30 40 50 60 70 80

100

120

140

160

180

200

220

A regression

eta

pressione

obs. - pred.β =

∑ni=1(yi − y)(xi − x)∑n

i=1(xi − x)2

α = y − βx .






i=1(yi − yi )2.

20 30 40 50 60 70 80

100

120

140

160

180

200

220

A regression

eta

pressione

obs. - pred.β =

∑ni=1(yi − y)(xi − x)∑n

i=1(xi − x)2

α = y − βx .

Formulae similar to correlation,but interpretation very different.

Documents

Bivariate data (two quantitative variables)anal1/matstat_biotech/1011/diario/lez_regr.pdfWith bivariate data, we can choose to predict Y on the basis of X: Y = + X + " ("error): For