Multivariate Data Summary

Linear Regression and Correlation

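Pearson’s correlation coefficient r:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$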
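As a minimal sketch, r can be computed directly from the three sums S_xy, S_xx and S_yy. The function name `pearson_r` and the small data set are illustrative assumptions, not part of the lecture.

```python
import math

def pearson_r(x, y):
    """Return Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
# Here S_xy = 6, S_xx = 10 and S_yy = 6, so r = 6 / sqrt(60)
print(f"r = {r:.4f}")
```

A perfectly linear increasing data set, such as y = 2x, gives r = +1 with this function.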

Slope and Intercept of the Least Squares line

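Slope:

$$ b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

Intercept:

$$ a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x} $$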
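The slope and intercept formulas translate into a few lines of code. This is a sketch only: `least_squares_line` is a hypothetical helper name and the data are made up for illustration.

```python
def least_squares_line(x, y):
    """Return (a, b), the intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx           # slope
    a = y_bar - b * x_bar     # intercept
    return a, b

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares_line(x, y)
# S_xy = 6 and S_xx = 10, so b = 0.6 and a = 4 - 0.6 * 3 = 2.2
print(f"fitted line: y = {a:.2f} + {b:.2f} x")
```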

Scatter Plot Patterns

Circular

• No relationship between X and Y

• Unable to predict Y from X

Ellipsoidal

• Positive relationship between X and Y

• Increases in X correspond to increases in Y (but not always)

• Major axis of the ellipse has positive slope

[Figure: scatter plots with r = 0.0, r = +0.7, r = +0.9 and r = +1.0]

Ellipsoidal

• Negative relationship between X and Y

• Increases in X correspond to decreases in Y (but not always)

• Major axis of the ellipse has negative slope

[Figure: scatter plots with r = -0.7, r = -0.9 and r = -1.0]

Non-Linear Patterns

If the pattern is non-linear, r can take on arbitrary values between -1 and +1, depending on how well you can fit a straight line to the pattern.
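To illustrate, the sketch below computes r for the same curve, y = x², over two different ranges of x. The data and the helper `pearson_r` are assumptions made for this example. Over a range symmetric about zero no straight line fits and r is 0; over a one-sided range a straight line fits fairly well and r is close to +1.

```python
import math

def pearson_r(x, y):
    """Return Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Same non-linear relationship y = x^2, two different x ranges (hypothetical data)
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sym = [xi ** 2 for xi in x_sym]
r_sym = pearson_r(x_sym, y_sym)   # symmetric range: S_xy = 0, so r = 0

x_pos = [0, 1, 2, 3, 4, 5, 6]
y_pos = [xi ** 2 for xi in x_pos]
r_pos = pearson_r(x_pos, y_pos)   # one-sided range: r = 168 / sqrt(28 * 1092)

print(f"symmetric range: r = {r_sym:.2f}")
print(f"one-sided range: r = {r_pos:.2f}")
```

In both cases Y is completely determined by X, so a small r signals a poor linear fit, not the absence of a relationship.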

The Coefficient of Determination

$$ r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

An important Identity in Statistics

(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)

$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

$$ SS_{Total} = SS_{Explained} + SS_{Unexplained} $$

It can also be shown:

$$ r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

= proportion of variability in Y explained by X

= the coefficient of determination
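The identity and the interpretation of r² can be checked numerically. The sketch below fits the least squares line to a small made-up data set and computes the three sums of squares. The function name `sums_of_squares` and the data are assumptions for illustration.

```python
def sums_of_squares(x, y):
    """Return (SS_total, SS_explained, SS_unexplained) for the least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]                      # fitted values
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return ss_total, ss_explained, ss_unexplained

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ss_total, ss_explained, ss_unexplained = sums_of_squares(x, y)
r_squared = ss_explained / ss_total

# For this data: SS_total = 6, SS_explained = 3.6, SS_unexplained = 2.4
print(f"{ss_total:.1f} = {ss_explained:.1f} + {ss_unexplained:.1f}")
print(f"r^2 = {r_squared:.2f}")
```

For this data set 60% of the variability in Y is explained by X.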

Categorical Data

Techniques for summarizing, displaying and graphing

The frequency table

The bar graph

Suppose we have collected data on a categorical variable X having k categories: 1, 2, … , k.

To construct the frequency table we simply count, for each category i of X, the number of cases falling in that category ($f_i$).

To plot the bar graph we simply draw a bar of height $f_i$ above each category i of X.
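The counting step and the bar heights can be sketched as follows. The helper names `frequency_table` and `text_bar_graph` and the eight observations are hypothetical. The bar graph is drawn with text characters so that the example needs no plotting library.

```python
from collections import Counter

def frequency_table(values):
    """Return rows (category, f_i, percent, cumulative percent), sorted by category."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        f = counts[category]
        percent = 100.0 * f / n
        cumulative += percent
        rows.append((category, f, percent, cumulative))
    return rows

def text_bar_graph(rows, width=20):
    """Return one text line per category, with bar length proportional to f_i."""
    tallest = max(f for _, f, _, _ in rows)
    return [f"{cat:>6} | {'#' * round(width * f / tallest)} {f}" for cat, f, _, _ in rows]

# Hypothetical observations of a categorical variable (age group)
values = ["30-39", "20-29", "30-39", "40-49", "20-29", "30-39", "40-49", "30-39"]
table = frequency_table(values)
bars = text_bar_graph(table)

for category, f, percent, cumulative in table:
    print(f"{category:>6} {f:>4} {percent:>6.1f} {cumulative:>6.1f}")
print("\n".join(bars))
```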

Example

In this example data has been collected for n = 34,118 subjects.

• The purpose of the study was to determine the relationship between the use of Antidepressants, Mood medication, Anxiety medication, Stimulants and Sleeping pills.

• In addition, the study was interested in examining the effects of the independent variables (gender, age, income, education and role) on both individual use of the medications and multiple use of the medications.

The variables were:

1. Antidepressant use
2. Mood medication use
3. Anxiety medication use
4. Stimulant use
5. Sleeping pills use
6. Gender
7. Age
8. Income
9. Education
10. Role:
    i. Parent, worker, partner
    ii. Parent, partner
    iii. Parent, worker
    iv. Worker, partner
    v. Worker only
    vi. Parent only
    vii. Partner only
    viii. No roles

Frequency Table for Age

Age - (G)   Frequency   Percent   Valid Percent   Cumulative Percent
20-29            5349      15.7            15.7                 15.7
30-39            6758      19.8            19.8                 35.5
40-49            6420      18.8            18.8                 54.3
50-59            5528      16.2            16.2                 70.5
60-69            4400      12.9            12.9                 83.4
70+              5663      16.6            16.6                100.0
Total           34118     100.0           100.0

[Figure: Bar Graph for Age. Count (0 to 7,000) by age group: 20-29, 30-39, 40-49, 50-59, 60-69, 70+]

Frequency Table for Role

role                                Frequency   Percent   Valid Percent   Cumulative Percent
Valid     parent, partner, worker        6614      19.4            24.5                 24.5
          parent, partner                1068       3.1             4.0                 28.5
          parent, worker                 1351       4.0             5.0                 33.5
          partner, worker                5427      15.9            20.1                 53.6
          worker only                    5711      16.7            21.2                 74.7
          parent only                     456       1.3             1.7                 76.4
          partner only                   3262       9.6            12.1                 88.5
          no roles                       3097       9.1            11.5                100.0
          Total                         26986      79.1           100.0
Missing   System                         7132      20.9
Total                                   34118     100.0

[Figure: Bar Graph for Role. Count (0 to 7,000) by role: parent, partner, worker; parent, partner; parent, worker; partner, worker; worker only; parent only; partner only; no roles]

The pie chart

• An alternative to the bar chart

• Draw a circle (a pie)

• Divide the circle into segments, with the area of each segment proportional to $f_i$ or $p_i = f_i/n$

Example

• In this study the population is individuals who received a head injury (n = 22540).

• The variable is the mechanism that caused the head injury (InjMech), with categories:

– MVA (Motor vehicle accident)

– Falls

– Violence

– Other VA (Other vehicle accidents)

– Accidents (industrial accident)

– Other (all other mechanisms for head injury)

Graphical and Tabular Display of Categorical Data.

• The frequency table

• The bar graph

• The pie chart

The frequency table

InjMech     Frequency   Percent   Valid Percent   Cumulative Percent
Accidents         565       2.5             2.5                  2.5
Falls            4875      21.6            21.6                 24.1
MVA             13565      60.2            60.2                 84.3
other             765       3.4             3.4                 87.7
other VA         2338      10.4            10.4                 98.1
Violence          432       1.9             1.9                100.0
Total           22540     100.0           100.0

The bar graph

[Figure: bar graph of InjMech. Frequency f (0 to 14,000) for MVA, Falls, Violence, other VA, Accidents, other]

The pie chart

[Figure: pie chart of InjMech with segments MVA, Falls, Violence, other VA, Accidents, other]
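The size of each pie segment follows from the frequency table: the proportion is p_i = f_i / n and, for a circle, the segment angle is 360 p_i degrees. The sketch below uses the InjMech frequencies from the table above. The function name `pie_segments` is a hypothetical helper.

```python
def pie_segments(frequencies):
    """Return {category: (p_i, angle in degrees)} with p_i = f_i / n."""
    total = sum(frequencies.values())
    return {cat: (f / total, 360.0 * f / total) for cat, f in frequencies.items()}

# Frequencies f_i from the InjMech frequency table
frequencies = {
    "MVA": 13565,
    "Falls": 4875,
    "Violence": 432,
    "other VA": 2338,
    "Accidents": 565,
    "other": 765,
}
n = sum(frequencies.values())
segments = pie_segments(frequencies)

for category, (p, angle) in segments.items():
    print(f"{category:>10}: p = {p:.3f}, angle = {angle:.1f} degrees")
```

Because segment area is proportional to the angle, MVA, with about 60% of the cases, takes up about 60% of the circle.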

Multivariate Categorical Data

The two way frequency table

The $\chi^2$ statistic

Techniques for examining dependence amongst two categorical variables

Situation

• We have two categorical variables R and C.

• The number of categories of R is r.

• The number of categories of C is c.

• We observe n subjects from the population and count $x_{ij}$ = the number of subjects for which R = i and C = j.

• R = rows, C = columns
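The counting that produces the two-way table can be sketched as below. The function name `two_way_table` and the eight (R, C) observations are assumptions for illustration, with categories coded 1, 2, … as in the notation above.

```python
from collections import Counter

def two_way_table(pairs, r, c):
    """Return an r-by-c table x, where x[i-1][j-1] counts subjects with R = i and C = j."""
    counts = Counter(pairs)
    return [[counts[(i, j)] for j in range(1, c + 1)] for i in range(1, r + 1)]

# Hypothetical (R, C) observations for n = 8 subjects
observations = [(1, 1), (1, 2), (2, 2), (2, 2), (1, 1), (2, 3), (1, 3), (2, 1)]
x = two_way_table(observations, r=2, c=3)

row_totals = [sum(row) for row in x]
col_totals = [sum(col) for col in zip(*x)]
n = sum(row_totals)

for row, total in zip(x, row_totals):
    print(row, total)
print(col_totals, n)
```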

Example

Both Systolic Blood pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.

The categories for Blood Pressure are:

<127 127-146 147-166 167+

The categories for Cholesterol are:

<200 200-219 220-259 260+

Table: two-way frequency

Serum          Systolic Blood pressure
Cholesterol    <127   127-146   147-166   167+   Total
<200            117       121        47     22     307
200-219          85        98        43     20     246
220-259         119       209        68     43     439
260+             67        99        46     33     245
Total           388       527       204    118    1237

Example

This comes from the drug use data.

The two variables are:

1. Age (C) and

2. Antidepressant Use (R)

measured for a sample of n = 33,957 subjects.

Two-way Frequency Table

Took anti-depressants - 12 mo * Age - (G) Crosstabulation

Count

Took anti-depressants   Age - (G)
- 12 mo                 20-29   30-39   40-49   50-59   60-69    70+   Total
YES                       322     523     570     522     265    249    2451
NO                       5007    6201    5822    4982    4114   5380   31506
Total                    5329    6724    6392    5504    4379   5629   33957

Percentage antidepressant use vs Age

Age - (G)   20-29   30-39   40-49   50-59   60-69    70+
% use       6.04%   7.78%   8.92%   9.48%   6.05%   4.42%

[Figure: bar chart "Antidepressant Use vs Age". Percentage (0.0% to 10.0%) by age group: 20-29, 30-39, 40-49, 50-59, 60-69, 70+]
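The percentages above are the YES count in each age group divided by that group's column total. A minimal sketch using the counts from the crosstabulation (the variable names are assumptions):

```python
# Counts from the "Took anti-depressants * Age" crosstabulation
age_groups = ["20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
yes = [322, 523, 570, 522, 265, 249]
totals = [5329, 6724, 6392, 5504, 4379, 5629]

# Percentage of each age group that took antidepressants
percent_yes = [round(100.0 * y / t, 2) for y, t in zip(yes, totals)]

for group, pct in zip(age_groups, percent_yes):
    print(f"{group:>6}: {pct:.2f}%")
```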

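The $\chi^2$ statistic for measuring dependence amongst two categorical variables

Define

$$ R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total} $$

$$ C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total} $$

$$ E_{ij} = \frac{R_i C_j}{n} $$

= Expected frequency in the (i,j)th cell in the case of independence.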

Observed frequencies, with row totals $R_i$ and column totals $C_j$:

         Columns
Rows     1     2     3     4     5     Total
1        x11   x12   x13   x14   x15   R1
2        x21   x22   x23   x24   x25   R2
3        x31   x32   x33   x34   x35   R3
4        x41   x42   x43   x44   x45   R4
Total    C1    C2    C3    C4    C5    n

Expected frequencies, $E_{ij} = \frac{R_i C_j}{n}$:

         Columns
Rows     1     2     3     4     5     Total
1        E11   E12   E13   E14   E15   R1
2        E21   E22   E23   E24   E25   R2
3        E31   E32   E33   E34   E35   R3
4        E41   E42   E43   E44   E45   R4
Total    C1    C2    C3    C4    C5    n

Justification

If $E_{ij} = \frac{R_i C_j}{n}$ then

$$ \frac{E_{ij}}{R_i} = \frac{C_j}{n} $$

that is, the proportion in column j for row i equals the overall proportion in column j,

and

$$ \frac{E_{ij}}{C_j} = \frac{R_i}{n} $$

that is, the proportion in row i for column j equals the overall proportion in row i.
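A short sketch of the expected-frequency calculation, together with a numerical check of the justification. The 2 × 2 table and the function name `expected_frequencies` are hypothetical.

```python
def expected_frequencies(x):
    """Return the table E with E[i][j] = R_i * C_j / n for an observed table x."""
    row_totals = [sum(row) for row in x]
    col_totals = [sum(col) for col in zip(*x)]
    n = sum(row_totals)
    return [[r_i * c_j / n for c_j in col_totals] for r_i in row_totals]

# Hypothetical observed table: R = [30, 70], C = [40, 60], n = 100
x = [[10, 20],
     [30, 40]]
e = expected_frequencies(x)

row_totals = [sum(row) for row in x]
col_totals = [sum(col) for col in zip(*x)]
n = sum(row_totals)

# Under independence every row has the same column proportions C_j / n
rows_match = all(
    abs(e[i][j] / row_totals[i] - col_totals[j] / n) < 1e-12
    for i in range(len(x))
    for j in range(len(x[0]))
)
print(e)           # expected frequencies
print(rows_match)  # True when E_ij / R_i equals C_j / n in every cell
```

The expected table keeps the observed row and column totals but spreads each row across the columns in the overall proportions.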

The $\chi^2$ statistic

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} $$

$E_{ij}$ = Expected frequency in the (i,j)th cell in the case of independence.

$x_{ij}$ = observed frequency in the (i,j)th cell

Example: studying the relationship between Systolic Blood pressure and Serum Cholesterol

In this example we are interested in whether Systolic Blood pressure and Serum Cholesterol are related or whether they are independent.

Both were measured for a sample of n = 1237 cases

Observed frequencies

Serum          Systolic Blood pressure
Cholesterol    <127   127-146   147-166   167+   Total
<200            117       121        47     22     307
200-219          85        98        43     20     246
220-259         119       209        68     43     439
260+             67        99        46     33     245
Total           388       527       204    118    1237

Expected frequencies

Serum          Systolic Blood pressure
Cholesterol    <127     127-146   147-166   167+    Total
<200            96.29    130.79     50.63   29.29     307
200-219         77.16    104.80     40.57   23.47     246
220-259        137.70    187.03     72.40   41.88     439
260+            76.85    104.38     40.40   23.37     245
Total          388       527       204     118       1237

In the case of independence the distribution across a row is the same for each row.

The distribution down a column is the same for each column.

Table: Expected frequencies, (Observed frequencies), Standardized residuals

Serum          Systolic Blood pressure
Cholesterol    <127     127-146   147-166   167+     Total
<200            96.29    130.79     50.63    29.29     307
               (117)    (121)      (47)     (22)
                 2.11     -0.86     -0.51    -1.35
200-219         77.16    104.80     40.57    23.47     246
                (85)     (98)      (43)     (20)
                 0.89     -0.66      0.38    -0.72
220-259        137.70    187.03     72.40    41.88     439
               (119)    (209)      (68)     (43)
                -1.59      1.61     -0.52     0.17
260+            76.85    104.38     40.40    23.37     245
                (67)     (99)      (46)     (33)
                -1.12     -0.53      0.88     1.99
Total          388       527       204      118       1237

Standardized residuals

$$ r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}} $$

The $\chi^2$ statistic

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 20.85 $$
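The calculation for the blood pressure and cholesterol table can be reproduced with a short sketch. The observed counts are those shown in parentheses in the table of expected frequencies, observed frequencies and standardized residuals. The function name `chi_square` is a hypothetical helper.

```python
import math

def chi_square(x):
    """Return (expected, standardized residuals, chi-square) for an observed table x."""
    row_totals = [sum(row) for row in x]
    col_totals = [sum(col) for col in zip(*x)]
    n = sum(row_totals)
    expected = [[r_i * c_j / n for c_j in col_totals] for r_i in row_totals]
    residuals = [
        [(x_ij - e_ij) / math.sqrt(e_ij) for x_ij, e_ij in zip(x_row, e_row)]
        for x_row, e_row in zip(x, expected)
    ]
    chi2 = sum(r_ij ** 2 for row in residuals for r_ij in row)
    return expected, residuals, chi2

# Rows: serum cholesterol <200, 200-219, 220-259, 260+
# Columns: systolic blood pressure <127, 127-146, 147-166, 167+
observed = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]
expected, residuals, chi2 = chi_square(observed)

for row in residuals:
    print("  ".join(f"{r_ij:6.2f}" for r_ij in row))
print(f"chi-square = {chi2:.2f}")
```

The largest residuals sit in the corner cells (low cholesterol with low blood pressure, high cholesterol with high blood pressure), which is where the observed counts depart most from independence.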

Example

This comes from the drug use data.

The two variables are:

1. Role (C) and

2. Antidepressant Use (R)

measured for n = 26,968 subjects (the total of the two-way table for this example).

Two-way Frequency Table

Took anti-depressants - 12 mo * role Crosstabulation

Count

Took anti-depressants   role
- 12 mo                 parent,    parent,   parent,   partner,   worker   parent   partner   no
                        partner,   partner   worker    worker     only     only     only      roles   Total
                        worker
YES                          344       101       201        275      455       63       224     414    2077
NO                          6268       967      1150       5150     5249      392      3036    2679   24891
Total                       6612      1068      1351       5425     5704      455      3260    3093   26968

Percentage antidepressant use vs Role

Role                       % use
parent, partner, worker    5.20%
parent, partner            9.46%
parent, worker            14.88%
partner, worker            5.07%
worker only                7.98%
parent only               13.85%
partner only               6.87%
no roles                  13.39%

[Figure: bar chart "Antidepressant Use vs Role". Percentage (0.0% to 20.0%) by role]

$\chi^2 = 381.961$

Calculation of $\chi^2$

The Raw data

         1      2      3      4      5      6     7      8      Total
YES      344    101    201    275    455    63    224    414    2077
NO       6268   967    1150   5150   5249   392   3036   2679   24891
Total    6612   1068   1351   5425   5704   455   3260   3093   26968

Expected frequencies

$$ E_{ij} = \frac{R_i C_j}{n} $$

               1         2        3         4         5         6        7         8         Total (R_i)
YES            509.24    82.25    104.05    417.82    439.31    35.04    251.08    238.21    2077
NO             6102.76   985.75   1246.95   5007.18   5264.69   419.96   3008.92   2854.79   24891
Total (C_j)    6612      1068     1351      5425      5704      455      3260      3093      26968

The Residuals

$$ r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}} $$

         1       2       3       4       5       6       7       8
YES      -7.32   2.07    9.50    -6.99   0.75    4.72    -1.71   11.39
NO       2.12    -0.60   -2.75   2.02    -0.22   -1.36   0.49    -3.29

The calculation of $\chi^2$

$$ \chi^2 = \sum_{i} \sum_{j} r_{ij}^2 = \sum_{i} \sum_{j} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = 381.961 $$
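As a check on this calculation, the sketch below recomputes the standardized residuals and $\chi^2$ from the raw data and reports which cell departs most from independence. The variable names are assumptions. The counts are those of the raw data table.

```python
import math

roles = [
    "parent, partner, worker", "parent, partner", "parent, worker", "partner, worker",
    "worker only", "parent only", "partner only", "no roles",
]
labels = ["YES", "NO"]
observed = [
    [344, 101, 201, 275, 455, 63, 224, 414],           # YES
    [6268, 967, 1150, 5150, 5249, 392, 3036, 2679],    # NO
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Standardized residuals r_ij = (x_ij - E_ij) / sqrt(E_ij), with E_ij = R_i * C_j / n
residuals = []
for i, row in enumerate(observed):
    residual_row = []
    for j, x_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n
        residual_row.append((x_ij - e_ij) / math.sqrt(e_ij))
    residuals.append(residual_row)

chi2 = sum(r_ij ** 2 for row in residuals for r_ij in row)

# Cell with the largest absolute standardized residual
largest = max(
    (abs(residuals[i][j]), labels[i], roles[j])
    for i in range(len(labels))
    for j in range(len(roles))
)
print(f"chi-square = {chi2:.2f}")
print(f"largest |residual| = {largest[0]:.2f} for {largest[1]} / {largest[2]}")
```

The YES row carries most of the statistic, since its expected frequencies are much smaller than those in the NO row.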

Example

• In this example there are n = 57407 individuals who had been victimized twice by crimes

• Rows = crime of first victimization

• Cols = crime of second victimization

Table 1: Frequencies

First victimization   Second Victimization in Pair
in pair               Ra      A     Ro   PP/PS      PL      B      HL     MV   Total
Ra                    26     50     11       6      82     39      48     11     273
A                     65   2997    238      85    2553   1083    1349    216    8586
Ro                    12    279    197      36     459    197     221     47    1448
PP/PS                  3    102     40      61     243    115     101     38     703
PL                    75   2628    413     229   12137   2658    3689    687   22516
B                     52   1117    191     102    2649   3210    1973    301    9595
HL                    42   1251    206     117    3757   1962    4646    391   12372
MV                     3    221     51      24     678    301     367    269    1914
Total                278   8645   1347     660   22558   9565   12394   1960   57407

Table 2: Standardized residuals

First victimization   Second Victimization in Pair
in pair               Ra      A       Ro     PP/PS   PL      B       HL      MV
Ra                    21.5    1.4     1.8    1.6     -2.4    -1.0    -1.9    0.6
A                     3.6     47.4    2.6    -1.4    -14.1   -9.2    -11.7   -4.5
Ro                    1.9     4.1     28.0   4.7     -4.6    -2.8    -5.2    -0.3
PP/PS                 -0.2    -0.4    5.8    18.6    -2.0    -0.2    -4.1    2.9
PL                    -3.3    -13.1   -5.0   -1.9    35.0    -17.9   -16.8   -2.9
B                     0.8     -8.6    -2.3   -0.8    -18.3   40.3    -2.2    -1.5
HL                    -2.3    -14.2   -4.9   -2.1    -15.8   -2.2    38.2    -1.5
MV                    -2.1    -4.0    0.9    0.4     -2.7    -1.0    -2.3    25.2

$\chi^2$ = 11,430 (highly significant)

Next Topic:

Brief introduction to Statistical Packages
