Multivariate Data Summary
Linear Regression and Correlation
Pearson’s correlation coefficient r.
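\[
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
\]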
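As a minimal sketch of the definition of r, the following Python function computes S_xy, S_xx and S_yy directly and combines them. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the original slides.

```python
import math

def pearson_r(x, y):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy) for two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y is an exact increasing linear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
r_linear = pearson_r(x, y)
print(r_linear)
```

Negating every y value flips the sign of S_xy while leaving S_xx and S_yy unchanged, so the same data would give r = -1.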
Slope and Intercept of the Least Squares line
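Slope:
\[
b = \frac{S_{xy}}{S_{xx}}
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
\]
Intercept:
\[
a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}
\]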
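The slope and intercept can be sketched in a few lines of Python. The function name `least_squares_line` and the five data points are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
def least_squares_line(x, y):
    """Return (a, b): intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data: x_bar = 3, y_bar = 4, S_xy = 6, S_xx = 10.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = least_squares_line(x, y)
print(a, b)
```

For these points the slope is 6/10 = 0.6 and the intercept is 4 - 0.6 × 3 = 2.2, up to floating-point rounding.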
Scatter Plot Patterns
[Figure: example scatter plots]
Circular
• No relationship between X and Y
• Unable to predict Y from X
Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y (but not always)
• Major axis of the ellipse has positive slope
[Figure: scatter plots with r = 0.0, r = +0.7, r = +0.9 and r = +1.0]
Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y (but not always)
• Major axis of the ellipse has negative slope
[Figure: scatter plots with r = -0.7, r = -0.9 and r = -1.0]
Non-Linear Patterns
[Figure: two scatter plots showing non-linear patterns]
If the pattern is non-linear, r can take on any value between -1 and +1, depending on how well a straight line can be fitted to the pattern.
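A small sketch illustrates the point, using the hypothetical curve y = x² (an assumption chosen for illustration, not the data behind the plots). Over a range of x symmetric about zero no straight line fits and r is 0, even though Y is completely determined by X; over non-negative x a straight line fits fairly well and r is close to, but below, +1.

```python
import math

def pearson_r(x, y):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical non-linear pattern y = x^2 on two different x ranges.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
r_sym = pearson_r(x_sym, [v ** 2 for v in x_sym])   # symmetric range

x_pos = [0, 1, 2, 3, 4, 5, 6]
r_pos = pearson_r(x_pos, [v ** 2 for v in x_pos])   # non-negative range

print(r_sym, r_pos)
```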
The Coefficient of Determination
\[
r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}
\]
An important Identity in Statistics
(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)
\[
\sum_{i=1}^{n}(y_i-\bar{y})^2
  = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{n}(y_i-\hat{y}_i)^2
\]
\[
SS_{Total} = SS_{Explained} + SS_{Unexplained}
\]
It can also be shown:
\[
r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}
\]
= proportion of variability in Y explained by X
= the coefficient of determination
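The decomposition of the total variability and its link to r² can be checked numerically. The sketch below uses five hypothetical data points (an assumption for illustration) and fits the least squares line before computing the three sums of squares.

```python
# Hypothetical data, chosen so the sums are easy to verify by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line and fitted values y-hat.
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = s_xy ** 2 / (s_xx * s_yy)
print(ss_total, ss_explained, ss_unexplained, r_squared)
```

Here the total of 6 splits into 3.6 explained and 2.4 unexplained, and 3.6/6 = 0.6 agrees with r² computed from S_xy, S_xx and S_yy.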
Categorical Data
Techniques for summarizing, displaying and graphing
The frequency table
The bar graph
Suppose we have collected data on a categorical variable X having k categories – 1, 2, … , k.
To construct the frequency table, we simply count, for each category i of X, the number of cases falling in that category (f_i).
To plot the bar graph, we simply draw a bar of height f_i above each category i of X.
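The counting step can be sketched as follows. The ten observations and the choice k = 3 are hypothetical; the text bar graph simply prints one bar of length f_i per category.

```python
from collections import Counter

# Hypothetical observations of a categorical variable X with k = 3 categories.
observations = [1, 3, 2, 1, 1, 3, 2, 1, 3, 1]
k = 3

counts = Counter(observations)
# f_i for every category i = 1..k (0 if a category was never observed).
frequency_table = {i: counts.get(i, 0) for i in range(1, k + 1)}

# Text bar graph: one bar of length f_i for each category i.
for category, f in frequency_table.items():
    print(f"{category}: {'#' * f} ({f})")
```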
Example
In this example, data have been collected for n = 34,118 subjects.
• The purpose of the study was to determine the relationship between the use of Antidepressants, Mood medication, Anxiety medication, Stimulants and Sleeping pills.
• In addition, the study was interested in examining the effects of the independent variables (gender, age, income, education and role) on both individual use of the medications and multiple use of the medications.
The variables were:
1. Antidepressant use
2. Mood medication use
3. Anxiety medication use
4. Stimulant use
5. Sleeping pills use
6. Gender
7. Age
8. Income
9. Education
10. Role:
   i. Parent, worker, partner
   ii. Parent, partner
   iii. Parent, worker
   iv. Worker, partner
   v. Worker only
   vi. Parent only
   vii. Partner only
   viii. No roles
Frequency Table for Age
Age - (G)
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   20-29         5349      15.7            15.7                 15.7
        30-39         6758      19.8            19.8                 35.5
        40-49         6420      18.8            18.8                 54.3
        50-59         5528      16.2            16.2                 70.5
        60-69         4400      12.9            12.9                 83.4
        70+           5663      16.6            16.6                100.0
        Total        34118     100.0           100.0
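The Percent and Cumulative Percent columns follow directly from the frequencies. A minimal sketch, using the six age-group counts from the table (the variable names are illustrative):

```python
age_groups = ["20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
frequencies = [5349, 6758, 6420, 5528, 4400, 5663]

n = sum(frequencies)
percents = [100 * f / n for f in frequencies]

# Cumulative percent from running counts, to avoid accumulating rounding error.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(100 * running / n)

for group, f, p, c in zip(age_groups, frequencies, percents, cumulative):
    print(f"{group:>6} {f:>6} {p:6.1f} {c:6.1f}")
```

Accumulating the counts first and converting to a percentage afterwards is a deliberate choice here: summing already-rounded percentages can drift by a tenth of a percent.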
[Figure: Bar Graph for Age; horizontal axis: Age - (G), from 20-29 to 70+; vertical axis: Count, 0 to 7,000]
Frequency Table for Role
role
                                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     parent, partner, worker        6614      19.4            24.5                 24.5
          parent, partner                1068       3.1             4.0                 28.5
          parent, worker                 1351       4.0             5.0                 33.5
          partner, worker                5427      15.9            20.1                 53.6
          worker only                    5711      16.7            21.2                 74.7
          parent only                     456       1.3             1.7                 76.4
          partner only                   3262       9.6            12.1                 88.5
          no roles                       3097       9.1            11.5                100.0
          Total                         26986      79.1           100.0
Missing   System                         7132      20.9
Total                                   34118     100.0
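The Role table has missing values, so Percent (out of all 34118 cases) and Valid Percent (out of the 26986 non-missing cases) differ. A short sketch of the two calculations, using the counts from the table (variable names are illustrative):

```python
role_counts = {
    "parent, partner, worker": 6614,
    "parent, partner": 1068,
    "parent, worker": 1351,
    "partner, worker": 5427,
    "worker only": 5711,
    "parent only": 456,
    "partner only": 3262,
    "no roles": 3097,
}
missing = 7132  # cases with no recorded role ("System" missing)

n_valid = sum(role_counts.values())
n_total = n_valid + missing

# Percent uses all cases; Valid Percent uses only cases with a recorded role.
percent = {role: 100 * f / n_total for role, f in role_counts.items()}
valid_percent = {role: 100 * f / n_valid for role, f in role_counts.items()}

for role in role_counts:
    print(f"{role:<24} {percent[role]:5.1f} {valid_percent[role]:5.1f}")
```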
[Figure: Bar Graph for Role; horizontal axis: role (parent, partner, worker; parent, partner; parent, worker; partner, worker; worker only; parent only; partner only; no roles); vertical axis: Count, 0 to 7,000]
The pie chart
• An alternative to the bar chart
• Draw a circle (a pie)
• Divide the circle into segments with area of each segment proportional to f_i or p_i = f_i / n
Example
• In this study the population consists of individuals who received a head injury (n = 22540).
• The variable is the mechanism that caused the head injury (InjMech), with categories:
– MVA (Motor vehicle accident)
– Falls
– Violence
– Other VA (Other vehicle accidents)
– Accidents (industrial accident)
– Other (all other mechanisms for head injury)
Graphical and Tabular Display of Categorical Data.
• The frequency table
• The bar graph
• The pie chart
The frequency table
InjMech
                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Accidents          565       2.5             2.5                  2.5
        Falls             4875      21.6            21.6                 24.1
        MVA              13565      60.2            60.2                 84.3
        other              765       3.4             3.4                 87.7
        other VA          2338      10.4            10.4                 98.1
        Violence           432       1.9             1.9                100.0
        Total            22540     100.0           100.0
The bar graph
[Figure: bar graph of InjMech; categories MVA, Falls, Violence, other VA, Accidents, other; vertical axis: value of f, 0 to 14,000; cases weighted by f]
The pie chart
[Figure: pie chart of InjMech with segments MVA, Falls, Violence, other VA, Accidents, other; cases weighted by f]
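Because the area of a circular segment is proportional to its central angle, a pie chart can be laid out by giving category i the angle 360° × p_i, where p_i = f_i / n. The sketch below applies this to the InjMech frequencies; the variable names are illustrative.

```python
inj_mech = {
    "MVA": 13565,
    "Falls": 4875,
    "Violence": 432,
    "other VA": 2338,
    "Accidents": 565,
    "other": 765,
}

n = sum(inj_mech.values())
proportions = {k: f / n for k, f in inj_mech.items()}      # p_i = f_i / n
angles = {k: 360 * p for k, p in proportions.items()}      # degrees per segment

for k in inj_mech:
    print(f"{k:<10} p = {proportions[k]:.3f}  angle = {angles[k]:6.1f} degrees")
```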
Multivariate Categorical Data
The two way frequency table
The χ² statistic
Techniques for examining dependence amongst two categorical variables
Situation
• We have two categorical variables R and C.
• The number of categories of R is r.
• The number of categories of C is c.
• We observe n subjects from the population and count x_ij = the number of subjects for which R = i and C = j.
• R = rows, C = columns
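Building the table of counts x_ij from raw (R, C) observations can be sketched as follows. The eight observed pairs and the sizes r = 2, c = 3 are hypothetical.

```python
from collections import Counter

# Hypothetical (R, C) category pairs for n = 8 subjects, with r = 2 and c = 3.
pairs = [(1, 1), (1, 2), (2, 3), (1, 1), (2, 2), (2, 3), (1, 3), (2, 1)]
r, c = 2, 3

counts = Counter(pairs)
# x[i-1][j-1] = number of subjects with R = i and C = j.
x = [[counts.get((i, j), 0) for j in range(1, c + 1)] for i in range(1, r + 1)]

for row in x:
    print(row)
```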
Example
Both Systolic Blood Pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.
The categories for Blood Pressure are:
<127 127-146 147-166 167+
The categories for Cholesterol are:
<200 200-219 220-259 260+
Table: two-way frequency (rows: Serum Cholesterol; columns: Systolic Blood Pressure)

Serum                Systolic Blood Pressure
Cholesterol     <127   127-146   147-166   167+   Total
<200             117       121        47     22     307
200-219           85        98        43     20     246
220-259          119       209        68     43     439
260+              67        99        46     33     245
Total            388       527       204    118    1237
Example
This comes from the drug use data.
The two variables are:
1. Age (C) and
2. Antidepressant Use (R)
measured for a sample of n = 33,957 subjects.
Two-way Frequency Table
Took anti-depressants - 12 mo * Age - (G) Crosstabulation
Count
                                              Age - (G)
Took anti-depressants - 12 mo   20-29   30-39   40-49   50-59   60-69    70+   Total
YES                               322     523     570     522     265    249    2451
NO                               5007    6201    5822    4982    4114   5380   31506
Total                            5329    6724    6392    5504    4379   5629   33957
Percentage antidepressant use vs Age

Age - (G)   20-29   30-39   40-49   50-59   60-69     70+
            6.04%   7.78%   8.92%   9.48%   6.05%   4.42%

[Figure: bar chart "Antidepressant Use vs Age"; vertical axis 0.0% to 10.0%, one bar per age group]
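The percentages are column percentages: within each age group, the YES count divided by the column total. A brief sketch using the counts from the crosstabulation (variable names are illustrative):

```python
age_groups = ["20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
yes = [322, 523, 570, 522, 265, 249]
no = [5007, 6201, 5822, 4982, 4114, 5380]

# Percentage of each age group that took antidepressants in the past 12 months.
percent_yes = [100 * y / (y + n) for y, n in zip(yes, no)]

for group, p in zip(age_groups, percent_yes):
    print(f"{group:>6}: {p:.2f}%")

# Age group with the highest rate of use.
peak_group = age_groups[percent_yes.index(max(percent_yes))]
```

Use rises with age up to the 50-59 group and falls afterwards, which is the kind of dependence between the two variables that the χ² statistic is designed to measure.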
The χ² statistic for measuring dependence amongst two categorical variables
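Define
\[
R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total}, \qquad
C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total}
\]
\[
E_{ij} = \frac{R_i C_j}{n}
\]
= Expected frequency in the (i,j)th cell in the case of independence.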
Observed frequencies x_ij, with row totals R_i and column totals C_j:

                     Columns
Rows      1     2     3     4     5     Total
1         x11   x12   x13   x14   x15   R1
2         x21   x22   x23   x24   x25   R2
3         x31   x32   x33   x34   x35   R3
4         x41   x42   x43   x44   x45   R4
Total     C1    C2    C3    C4    C5    n
Expected frequencies E_ij:

                     Columns
Rows      1     2     3     4     5     Total
1         E11   E12   E13   E14   E15   R1
2         E21   E22   E23   E24   E25   R2
3         E31   E32   E33   E34   E35   R3
4         E41   E42   E43   E44   E45   R4
Total     C1    C2    C3    C4    C5    n

\[
E_{ij} = \frac{R_i C_j}{n}
\]
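A minimal sketch of the expected-frequency calculation E_ij = R_i C_j / n for a table stored as a list of rows. The function name `expected_frequencies` and the 2 × 2 table are hypothetical.

```python
def expected_frequencies(x):
    """Expected cell counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in x]          # R_i
    col_totals = [sum(col) for col in zip(*x)]    # C_j
    n = sum(row_totals)
    return [[ri * cj / n for cj in col_totals] for ri in row_totals]

# Hypothetical 2 x 2 table: R = (30, 70), C = (40, 60), n = 100.
x = [[10, 20],
     [30, 40]]
e = expected_frequencies(x)
print(e)
```

The expected table keeps the same row and column totals as the observed table; only the way the counts are spread over the cells changes.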
Justification: if
\[
E_{ij} = \frac{R_i C_j}{n} \quad \text{then} \quad \frac{E_{ij}}{R_i} = \frac{C_j}{n}
\]
that is, (proportion in column j for row i) = (overall proportion in column j).
and if
\[
E_{ij} = \frac{R_i C_j}{n} \quad \text{then} \quad \frac{E_{ij}}{C_j} = \frac{R_i}{n}
\]
that is, (proportion in row i for column j) = (overall proportion in row i).
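This justification can be checked numerically. The sketch below takes the row and column totals of the blood pressure and cholesterol table (R = 307, 246, 439, 245 and C = 388, 527, 204, 118) and confirms that, in the expected table, every row has the same distribution across columns as the overall column proportions. Variable names are illustrative.

```python
import math

row_totals = [307, 246, 439, 245]   # R_i: serum cholesterol categories
col_totals = [388, 527, 204, 118]   # C_j: systolic blood pressure categories
n = sum(row_totals)

expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

# Proportion in column j within row i of the expected table: E_ij / R_i.
row_profiles = [[e / ri for e in row] for row, ri in zip(expected, row_totals)]
# Overall proportion in column j: C_j / n.
overall_col = [cj / n for cj in col_totals]

same_in_every_row = all(
    math.isclose(p, q)
    for profile in row_profiles
    for p, q in zip(profile, overall_col)
)
print(same_in_every_row)
```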
The χ² statistic
\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(x_{ij}-E_{ij})^2}{E_{ij}}
\]
E_ij = expected frequency in the (i,j)th cell in the case of independence.
x_ij = observed frequency in the (i,j)th cell.
Example: studying the relationship between Systolic Blood pressure and Serum Cholesterol
In this example we are interested in whether Systolic Blood pressure and Serum Cholesterol are related or whether they are independent.
Both were measured for a sample of n = 1237 cases
Observed frequencies

Serum                Systolic Blood Pressure
Cholesterol     <127   127-146   147-166   167+   Total
<200             117       121        47     22     307
200-219           85        98        43     20     246
220-259          119       209        68     43     439
260+              67        99        46     33     245
Total            388       527       204    118    1237
Expected frequencies

Serum                Systolic Blood Pressure
Cholesterol      <127   127-146   147-166    167+   Total
<200            96.29    130.79     50.63   29.29     307
200-219         77.16    104.80     40.57   23.47     246
220-259        137.70    187.03     72.40   41.88     439
260+            76.85    104.38     40.40   23.37     245
Total             388       527       204     118    1237
In the case of independence, the distribution across a row is the same for each row, and the distribution down a column is the same for each column.
Table: Expected frequencies, (Observed frequencies), Standardized residuals
Each cell shows the expected frequency, the observed frequency in parentheses, and the standardized residual.

Serum                Systolic Blood Pressure
Cholesterol      <127   127-146   147-166    167+   Total
<200            96.29    130.79     50.63   29.29     307
                (117)     (121)      (47)    (22)
                 2.11     -0.86     -0.51   -1.35
200-219         77.16    104.80     40.57   23.47     246
                 (85)      (98)      (43)    (20)
                 0.89     -0.66      0.38   -0.72
220-259        137.70    187.03     72.40   41.88     439
                (119)     (209)      (68)    (43)
                -1.59      1.61     -0.52    0.17
260+            76.85    104.38     40.40   23.37     245
                 (67)      (99)      (46)    (33)
                -1.12     -0.53      0.88    1.99
Total             388       527       204     118    1237
Standardized residuals
\[
r_{ij} = \frac{x_{ij}-E_{ij}}{\sqrt{E_{ij}}}
\]
The χ² statistic
\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(x_{ij}-E_{ij})^2}{E_{ij}}
       = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = 20.85
\]
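The whole calculation for the blood pressure and cholesterol example can be sketched in Python. One assumption is flagged in the code: the count in the (220-259, <127) cell is taken as 119, the value shown in parentheses in the combined table and the only one consistent with the row total 439 and column total 388. Variable names are illustrative.

```python
import math

# Observed counts x_ij: rows = serum cholesterol, columns = systolic blood pressure.
# Assumption: row 3, column 1 is 119 (consistent with the marginal totals 439 and 388).
observed = [
    [117, 121, 47, 22],   # <200
    [85, 98, 43, 20],     # 200-219
    [119, 209, 68, 43],   # 220-259
    [67, 99, 46, 33],     # 260+
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E_ij = R_i * C_j / n and r_ij = (x_ij - E_ij) / sqrt(E_ij).
expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]
residuals = [
    [(x - e) / math.sqrt(e) for x, e in zip(x_row, e_row)]
    for x_row, e_row in zip(observed, expected)
]

# Chi-square is the sum of the squared standardized residuals.
chi_square = sum(r ** 2 for row in residuals for r in row)
print(round(chi_square, 2))
```

The largest residuals sit in the corner cells (low cholesterol with low blood pressure, high cholesterol with high blood pressure), so those cells contribute most to χ².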
Example
This comes from the drug use data.
The two variables are:
1. Role (C) and
2. Antidepressant Use (R)
measured for a sample of n = 33,957 subjects.
Two-way Frequency Table
Took anti-depressants - 12 mo * role Crosstabulation
Count
                                                     role
          parent,    parent,    parent,    partner,   worker   parent   partner   no
          partner,   partner    worker     worker     only     only     only      roles    Total
          worker
YES           344        101        201        275      455       63       224      414     2077
NO           6268        967       1150       5150     5249      392      3036     2679    24891
Total        6612       1068       1351       5425     5704      455      3260     3093    26968

Percentage antidepressant use vs Role

          parent,    parent,    parent,    partner,   worker   parent   partner   no
          partner,   partner    worker     worker     only     only     only      roles
          worker
            5.20%      9.46%     14.88%      5.07%    7.98%   13.85%     6.87%   13.39%

[Figure: bar chart "Antidepressant Use vs Role"; vertical axis 0.0% to 20.0%, one bar per role]
χ² = 381.961
Calculation of χ²

The raw data (roles numbered 1 to 8 in the order listed above)

            1       2       3       4       5       6       7       8    Total
YES       344     101     201     275     455      63     224     414     2077
NO       6268     967    1150    5150    5249     392    3036    2679    24891
Total    6612    1068    1351    5425    5704     455    3260    3093    26968

Expected frequencies

                 1        2         3         4         5        6         7         8   Total (R_i)
YES         509.24    82.25    104.05    417.82    439.31    35.04    251.08    238.21          2077
NO         6102.76   985.75   1246.95   5007.18   5264.69   419.96   3008.92   2854.79         24891
Total (C_j)   6612     1068      1351      5425      5704      455      3260      3093         26968
\[
E_{ij} = \frac{R_i C_j}{n}, \qquad r_{ij} = \frac{x_{ij}-E_{ij}}{\sqrt{E_{ij}}}
\]

The Residuals

           1       2       3       4       5       6       7       8
YES    -7.32    2.07    9.50   -6.99    0.75    4.72   -1.71   11.39
NO      2.12   -0.60   -2.75    2.02   -0.22   -1.36    0.49   -3.29

The calculation of χ²
\[
\chi^2 = \sum_{i}\sum_{j}\frac{(x_{ij}-E_{ij})^2}{E_{ij}} = \sum_{i}\sum_{j} r_{ij}^2 = 381.961
\]
Example
• In this example there are n = 57407 individuals who had been victimized twice by crimes
• Rows = crime of first victimization
• Cols = crime of second victimization
Table 1: Frequencies

First                          Second Victimization in Pair
Victimization
in Pair       Ra      A     Ro   PP/PS      PL      B      HL     MV   Total
Ra            26     50     11       6      82     39      48     11     273
A             65   2997    238      85    2553   1083    1349    216    8586
Ro            12    279    197      36     459    197     221     47    1448
PP/PS          3    102     40      61     243    115     101     38     703
PL            75   2628    413     229   12137   2658    3689    687   22516
B             52   1117    191     102    2649   3210    1973    301    9595
HL            42   1251    206     117    3757   1962    4646    391   12372
MV             3    221     51      24     678    301     367    269    1914
Total        278   8645   1347     660   22558   9565   12394   1960   57407
Table 2: Standardized residuals

First                          Second Victimization in Pair
Victimization
in Pair       Ra      A     Ro   PP/PS      PL      B      HL     MV
Ra          21.5    1.4    1.8     1.6    -2.4   -1.0    -1.9    0.6
A            3.6   47.4    2.6    -1.4   -14.1   -9.2   -11.7   -4.5
Ro           1.9    4.1   28.0     4.7    -4.6   -2.8    -5.2   -0.3
PP/PS       -0.2   -0.4    5.8    18.6    -2.0   -0.2    -4.1    2.9
PL          -3.3  -13.1   -5.0    -1.9    35.0  -17.9   -16.8   -2.9
B            0.8   -8.6   -2.3    -0.8   -18.3   40.3    -2.2   -1.5
HL          -2.3  -14.2   -4.9    -2.1   -15.8   -2.2    38.2   -1.5
MV          -2.1   -4.0    0.9     0.4    -2.7   -1.0    -2.3   25.2
χ² = 11,430 (highly significant)
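To see where one entry of Table 2 comes from, the sketch below works through the (Ra, Ra) cell only, using the Ra row total 273, the Ra column total 278 and n = 57407 from Table 1. Variable names are illustrative.

```python
import math

n = 57407            # total number of victimization pairs
row_total_ra = 273   # first victimization = Ra
col_total_ra = 278   # second victimization = Ra
observed = 26        # pairs with Ra followed by Ra

# Expected count under independence and the standardized residual for this cell.
expected = row_total_ra * col_total_ra / n
residual = (observed - expected) / math.sqrt(expected)

print(round(expected, 2), round(residual, 1))
```

Only about 1.3 such pairs would be expected under independence, against 26 observed, which gives a residual of roughly 21.5. The other diagonal cells of Table 2 show the same pattern: a second victimization tends to be the same type of crime as the first.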
Next Topic:
Brief introduction to Statistical Packages