Statistics in Science
Categorical Data Analysis
PGRM 14
What is categorical data?
The measurement scale for the response consists of a number of categories
Variable        Measurement Scale
Farm system     Dairy, Beef, Tillage etc.
Mortality       Dead, alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3
Data Analysis considered:
• Response variable(s) is categorical
• Explanatory variable(s) may be categorical or continuous
Example: Does Post-operative survival (categorical response) depend on the explanatory variables?
Sex (categorical)
Age (continuous)
Example: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?
Farm system (categorical)
Attitude to EU (categorical/ordinal)?
(Two response variables - no explanatory variables)
Could one of these be regarded as explanatory?
Measurement scales for categorical data
Nominal - no underlying order
Variable             Measurement Scale
Farm system          Dairy, Beef, Tillage etc.
Weed species         Stellaria media, Poa annua, etc.
Ordinal - underlying order in the scale
Variable             Measurement Scale
Food texture         Very soft, Soft, Hard, Very hard
Disease diagnosis    Very likely, Likely, Unlikely
Education            Primary, Secondary, Tertiary
Interval - underlying numerical distance between scale points
Variable             Measurement Scale
Litter size          0, 1, 2, 3 and >3
Age class            <1, 1-2, 2-3.5, 3.5-5, >5
Education            years in education
Tables reporting categorical data
1-, 2- & 3-way
Tables reporting count data: single level
Example: A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.
Wild Type Mutant Total
80 10 90
Evidence that a wild type is dominant, giving on average 8:1 offspring phenotype in its favour?
Tables for count data: two-way
Example: A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.
                Outcome
            Dead   Alive   Total   % dead
antiserum   19     65      84      23
control     18     22      40      45
Total       37     87      124
Association between mortality and treatment?
Tables for count data: two-way
Example (Snedecor & Cochran):
The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.
• Has the higher concentration given a significantly different percentage kill?
• Is there a relationship between concentration and mortality?
          Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8
Is this the relationship?
Note: categorical response; interval categorical explanatory variable
Tables for count data: two-way
Example (Cornfield 1962)
Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period.
BP: interval categorical variable in 8 classes
CHD: CHD or No-CHD
BP           CHD    No CHD   Total   % CHD
<117         3      153      156     1.9
117 - 126    17     235      252     6.7
127 - 136    12     272      284     4.2
137 - 146    16     255      271     5.9
147 - 156    12     127      139     8.6
157 - 166    8      77       85      9.4
167 - 186    16     83       99      16.2
>186         8      35       43      18.6
Total        92     1237     1329
1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability of CHD and the level of BP?
CHD v BP relationship
3-way table
Example: Grouped binomial (response has 2 categories) data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60)
Sex   Age Group   Psych. case   On drugs   Total
M     1           No            9          531
M     2           No            16         500
M     3           No            38         644
M     4           No            26         275
M     5           No            9          90
M     1           Yes           12         171
M     2           Yes           16         125
M     3           Yes           31         121
M     4           Yes           16         56
M     5           Yes           10         26
F     1           No            12         588
F     2           No            42         596
F     3           No            96         765
F     4           No            52         327
F     5           No            30         179
F     1           Yes           33         210
F     2           Yes           47         189
F     3           Yes           71         242
F     4           Yes           45         98
F     5           Yes           21         60
Non-tabulated data
Example: Individual Legousia plants were monitored in an experiment to see whether they survived after 3 months.
Survived - yes is scored 1
Survived - no is scored 0
Also recorded were:
CO2 treatment – 2 levels, low and high
Density of Legousia
Density of companion species
Height of the plant (mm) two weeks after planting
Most individuals will have a unique profile in these three additional variables, so tabulation of the data by them is not feasible. The individual data are presented instead.
Non-tabulated data
1. Is survival related to the explanatory variables: CO2, Height, density-self, density-companions?
2. Can the probability of survival be predicted from the subject’s profile?
                                 Density
Subject   Surv   CO2   Ht    Leg.   Comp
1         0      L     35    20     30
2         1      L     68    22     27
3         1      H     43    16     33
4         0      L     27    4      16
…         …      …     …     …      …
…         …      …     …     …      …
Response: Surv
Fixed and non-fixed margins
• One margin fixed: Samples of fixed size are selected for one or more categories and individuals are classified by the other category (or categories).
• No margin fixed: Individuals in a single sample are simultaneously classified by several categorical variables.
The difference between these depends on the experimental design and on how it specified that the data should be collected.
The method of analysis is the same.
Asking the right question
• Data summarized by counts
• Questions usually relate to %s (equivalently proportions)
Hypotheses for Categorical Data
• Categorical data is summarised by counting individuals falling into the various combinations of categories
• Hypotheses relate to: the probability of an individual being in a particular category
• These probabilities are estimated by the observed proportions in the data
• Using a sample proportion, p, from a sample of size n, to estimate a population proportion the standard error is
SE = √(p(1 – p)/n)
eg with p = 0.5, n = 1100, 2×SE = 0.03: the often mentioned 3% margin of error
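The margin-of-error arithmetic above can be checked with a minimal Python sketch. The function name `standard_error` is an illustrative assumption, not part of the course material; the values p = 0.5 and n = 1100 are the ones quoted on the slide.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

# Values from the slide: p = 0.5, n = 1100
se = standard_error(0.5, 1100)
margin = 2 * se  # the "2 x SE" margin of error
print(f"SE = {se:.4f}, 2 x SE = {margin:.3f}")
```

Because p(1 – p) is largest at p = 0.5, this is the widest margin a sample of 1100 can give.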
Example
                Outcome
            Dead   Alive   Total   % dead
antiserum   19     65      84      23
control     18     22      40      45
Total       37     87      124
Does % dead depend on antiserum?
Equivalently:
1. Is there an association between mortality and antiserum?
2. Is mortality independent of antiserum?
Example
• As usual we set up a null hypothesis and measure the extent to which the data conflicts with this
• Here H0:
prob of death for antiserum = prob of death for control
• equivalently H0:
– no association between mortality and antiserum
– Mortality and antiserum are independent
Example
Expected counts when H0 is true:
The overall % dead (37/124) would apply to antiserum & control
For the 84 antiserum this would give (84×37)/124 dead and (84×87)/124 alive
For the 40 control this would give (40×37)/124 dead and (40×87)/124 alive
E = (row total)(column total)/(table total)
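As a rough illustration of the formula E = (row total)(column total)/(table total), the sketch below computes the expected counts for the mice table in Python. The function name `expected_counts` and the dictionary layout are assumptions made for this example only.

```python
# Observed counts from the mice table: rows = group, columns = (dead, alive)
observed = {"antiserum": (19, 65), "control": (18, 22)}

def expected_counts(table):
    """Return E = (row total)(column total)/(table total) for every cell."""
    row_totals = {r: sum(cells) for r, cells in table.items()}
    col_totals = [sum(col) for col in zip(*table.values())]
    grand = sum(row_totals.values())
    return {r: tuple(row_totals[r] * c / grand for c in col_totals) for r in table}

expected = expected_counts(observed)
for group, cells in expected.items():
    print(group, [round(e, 1) for e in cells])
```

The expected counts keep the same row and column totals as the observed table; only the split within each row changes.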
Observed and expected counts
Observed
                Outcome
            Dead   Alive   Total   % dead
antiserum   19     65      84      23
control     18     22      40      45
Total       37     87      124
Expected
                Outcome
            Dead   Alive   Total   % dead
antiserum   25.1   58.9    84      29.9
control     11.9   28.1    40      29.8
Total       37     87      124
Note: some rounding error
Chi-squared statistic: X²
• X² measures the difference between observed counts, O, and expected (when H0 holds) counts, E
• If X² is LARGE this provides evidence against H0, ie evidence for an association (dependence) of mortality on antiserum
• X² = ∑(O – E)²/E
• Here SAS/FREQ gives: X² = 6.48
p = Prob(X² > 6.48 when H0 is true) = 0.0109
• Conclusion: there is evidence (p < 0.05) that mortality depends on antiserum
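The SAS/FREQ figures above can be cross-checked outside SAS. The following is a minimal Python sketch, not the course's own method: the function names are illustrative assumptions, and the p-value uses the closed form erfc(√(X²/2)), which is valid only for 1 degree of freedom.

```python
import math

observed = [[19, 65], [18, 22]]  # rows: antiserum, control; columns: dead, alive

def chi_squared(table):
    """Pearson X^2 = sum((O - E)^2 / E) with E from the row and column totals."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n
            x2 += (o - e) ** 2 / e
    return x2

def p_value_1df(x2):
    """Upper-tail chi-squared probability; this closed form holds for 1 df only."""
    return math.erfc(math.sqrt(x2 / 2))

x2 = chi_squared(observed)
p = p_value_1df(x2)
print(f"X2 = {x2:.2f}, p = {p:.4f}")
```

For tables with more than 1 degree of freedom the `p_value_1df` shortcut does not apply and a general chi-squared tail function would be needed.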
Practical Exercise
Use Excel to calculate X2 and p
Lab Session 5 exercise 5.1 (a)
SAS/FREQ OUTPUT
Table of antiserum by dead
Description of cell contents: Frequency / Expected / Row Pct
antiserum    dead = 0    dead = 1    Total
antiserum    65          19          84
             58.935      25.065
             77.38       22.62
control      22          18          40
             28.065      11.935
             55.00       45.00
Total        87          37          124
X² = ∑(O – E)²/E, where O = Frequency and E = Expected
Row Percents make most sense here (% alive/dead in each antiserum group)
Statistic DF Value Prob
Chi-Square 1 6.4833 0.0109
Likelihood Ratio Chi-Square 1 6.2846 0.0122
Continuity Adj. Chi-Square 1 5.4583 0.0195
Mantel-Haenszel Chi-Square 1 6.4310 0.0112
Phi Coefficient 0.2287
Contingency Coefficient 0.2229
Cramer's V 0.2287
SAS/FREQ OUTPUT
X² = ∑(O – E)²/E
DF = (r – 1) × (c – 1)
Ignore!
P = 0.01 with X² = 6.48
[Figure: chi-squared distribution with X² = 6.48 marked; tail areas 0.05 and 0.01 indicated; 68% of values are < 1 (not shown)]
Aphid example (SAS/FREQ OUTPUT)
Table of status by conc
status (Outcome), conc (Sodium oleate concentration (%))
Cell contents: Frequency / Expected / Cell Chi-Square / Col Pct
status    0.65      1.1       1.6       2.1       Total
Alive     22        13        12        5         52
          11.742    11.437    17.079    11.742
          8.9617    0.2136    1.5105    3.8711
          28.57     17.33     10.71     6.49
Dead      55        62        100       72        289
          65.258    63.563    94.921    65.258
          1.6125    0.0384    0.2718    0.6965
          71.43     82.67     89.29     93.51
Total     77        75        112       77        341
X² = 17.18, p = 0.0007 (3 df)
Note: the largest contributions (O – E)²/E to X² (8.96 & 3.87) are in the top corners
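To see where the cell contributions come from, here is a hedged Python sketch that recomputes (O – E)²/E for every cell of the aphid table and sums them to X². The variable names are illustrative assumptions; the p-value helper uses a closed form that is valid only for 3 degrees of freedom.

```python
import math

conc = [0.65, 1.1, 1.6, 2.1]
alive = [22, 13, 12, 5]
dead = [55, 62, 100, 72]

table = [alive, dead]
row_totals = [sum(r) for r in table]
col_totals = [a + d for a, d in zip(alive, dead)]
n = sum(row_totals)

# Contribution (O - E)^2 / E of every cell, laid out like the table
contrib = [[(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
           for row, r in zip(table, row_totals)]
x2 = sum(sum(row) for row in contrib)

def p_value_3df(x):
    """Upper-tail chi-squared probability; this closed form holds for 3 df only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = p_value_3df(x2)
print([round(c, 2) for c in contrib[0]])  # Alive row
print(f"X2 = {x2:.2f}, p = {p:.5f}")
```

The first and last entries of the Alive row dominate the total, which is the pattern the note about the top corners describes.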
Locating the concentration effect
Table of Outcome by Sodium oleate (%): 0.65 v 1.1
Cell contents: Frequency / Col Pct
Outcome   0.65     1.1      Total
Alive     22       13       35
          28.57    17.33
Dead      55       62       117
          71.43    82.67
Total     77       75       152
X² = 2.71, p = 0.10
Table of Outcome by Sodium oleate (%): 1.6 v 2.1
Cell contents: Frequency / Col Pct
Outcome   1.6      2.1      Total
Alive     12       5        17
          10.71    6.49
Dead      100      72       172
          89.29    93.51
Total     112      77       189
X² = 0.99, p = 0.32
Locating the concentration effect
Table of Outcome by Sodium oleate (%): <1.5% v >1.5%
Cell contents: Frequency / Col Pct
Outcome   <1.5%    >1.5%    Total
Alive     35       17       52
          23.03    8.99
Dead      117      172      289
          76.97    91.01
Total     152      189      341
X² = 12.83, p = 0.0003
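The three sub-table statistics can be reproduced with a short Python sketch. It uses the standard shortcut formula for a 2×2 table, X² = n(ad – bc)²/(row and column totals multiplied together); the function name `chi_squared_2x2` is an assumption for this example.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson X^2 for the 2x2 table [[a, b], [c, d]] (shortcut formula)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Alive and dead counts per concentration, from the aphid table
alive = {0.65: 22, 1.1: 13, 1.6: 12, 2.1: 5}
dead = {0.65: 55, 1.1: 62, 1.6: 100, 2.1: 72}

low = chi_squared_2x2(alive[0.65], alive[1.1], dead[0.65], dead[1.1])
high = chi_squared_2x2(alive[1.6], alive[2.1], dead[1.6], dead[2.1])
pooled = chi_squared_2x2(alive[0.65] + alive[1.1], alive[1.6] + alive[2.1],
                         dead[0.65] + dead[1.1], dead[1.6] + dead[2.1])
print(f"within low: {low:.2f}, within high: {high:.2f}, low v high: {pooled:.2f}")
```

The three values add to roughly 16.5, close to but not exactly the overall X² of 17.18, so this split is an approximate rather than exact partition of the overall statistic.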
SAS – data format for FREQ procedure
          Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8
Conc status number
0.65 d 55
0.65 a 22
1.10 d 62
1.10 a 13
1.60 d 100
1.60 a 12
2.10 d 72
2.10 a 5
2 cols identify the cell
Final column is the ‘response’
– the frequency count for the cell
Validity of chi-squared (χ²) test
• Test is based on an approximation leading to use of the χ² distribution to calculate p-values
• With several DF and E ≥ 5 the approximation is ok
• If E < 1 in any cell the approximation may be bad
• With a number of cells in the table perhaps a third or quarter can have E between 1 & 5 without serious departures from χ²-based p-values. (PGRM pg 14-11)
• In cases where good approximation is in doubt use Fisher’s exact test (SAS/FREQ tables option exact)
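For a 2×2 table, Fisher's exact test can be sketched directly from the hypergeometric distribution. The Python below is an illustrative assumption about how such a test is usually computed (two-sided, summing all tables no more probable than the observed one); it is not taken from the course notes, and the mice table is used only as convenient input.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one.
    """
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):  # probability that the top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Mice table: antiserum (19 dead, 65 alive), control (18 dead, 22 alive)
p_exact = fisher_exact_two_sided(19, 65, 18, 22)
print(f"Fisher exact two-sided p = {p_exact:.4f}")
```

With expected counts as large as those in the mice table the exact and χ²-based p-values should be of similar size; the exact test matters most when some E fall below the thresholds listed above.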
Code: SAS/FREQ
proc freq data = conc;
  weight number;
  tables status*conc
    / chisq cellchi2 expected
      norow nopercent nocum;
quit;
Option             To Do
chisq              Test statistics (chi-squared etc)
cellchi2           Contribution to X² from each cell
expected           Expected values for each cell
norow nopercent    Omit row/overall percentages
nocum              Omit cumulative frequencies
Practical Exercise
SAS/FREQ procedure
Lab Session 5 exercise 5.1 (b) – (d)
Logistic Regression
Is this the relationship?
Note: categorical response; interval categorical explanatory variable
Why logistic and not just χ²?
• For sparse data (eg where individuals will have unique profiles)
• With many categorical explanatory variables
• With quantitative explanatory variables
In the case of a continuous response we have looked to see if the mean, μ, can be expressed as
μ = a + bx
With categorical data we want an expression for p (the probability of the response in one of the 2 response categories) but
p = a + bx
may give values outside the range 0 to 1!
eg p = 0.1 + 0.2x gives p = 1.1 for x = 5
A solution: TRANSFORM
• Use the transformation:
p = exp(a + bx)/(1 + exp(a + bx))
• i.e. log(p/(1 – p)) = a + bx
log(Odds) = a + bx
where Odds = p/(1 – p)
Note: exp(x) = e^x
LOGIT: logit(p) = log(p/(1 – p))
[Figure: plot of p against x for a = 0, b = 1]
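The transformation and its inverse can be written as two small Python functions. This is a sketch for illustration: the names `logit` and `inv_logit` are assumptions, and reusing a = 0.1, b = 0.2 from the straight-line counter-example is only to show that the transformed model stays inside 0 to 1 where the straight line does not.

```python
import math

def logit(p: float) -> float:
    """log(Odds) = log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(eta: float) -> float:
    """p = exp(eta) / (1 + exp(eta)); always strictly between 0 and 1."""
    return math.exp(eta) / (1 + math.exp(eta))

a, b = 0.1, 0.2  # coefficients borrowed from the straight-line counter-example
for x in (0, 5, 20):
    linear = a + b * x            # can leave the range 0 to 1
    logistic = inv_logit(linear)  # cannot
    print(x, round(linear, 2), round(logistic, 3))
```

Applying `logit` to the output of `inv_logit` returns the original linear predictor, which is why the model is linear on the log(Odds) scale.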
SAS/GPLOT
logit(p) = −0.119 + 1.25 conc
[Figure: Logistic Estimate of Death Probability; p (0.6 to 1.0) against Sodium oleate (%) (0.6 to 2.2)]
LD50 – lethal dose for 50%
p = 0.5
p/(1 – p) = 1
logit(p) = 0 (since log(1) = 0, WNF!)
0 = −0.119 + 1.25 conc
conc = 0.119/1.25 = 0.095
Odds Ratio (OR)
Increasing conc by 1% increases logit(p) by 1.25
log(Odds2) – log(Odds1) = 1.25
log(OR) = 1.25
OR = exp(1.25) = 3.49
Note: log(a) – log(b) = log(a/b)
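Both calculations follow directly from the fitted coefficients, as this short Python sketch shows. The variable names are illustrative; the coefficients are the ones quoted on the slide.

```python
import math

a, b = -0.119, 1.25  # fitted intercept and slope quoted on the slide

ld50 = -a / b              # logit(p) = 0  =>  conc = -a / b
odds_ratio = math.exp(b)   # odds multiplier per 1% increase in concentration
print(f"LD50 = {ld50:.3f}%, OR = {odds_ratio:.2f}")
```

Note that the LD50 of about 0.095% lies well below the lowest concentration tested (0.65%), so it is an extrapolation from the fitted line rather than an observed dose.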
SAS/GENMOD
conc    dead    total
0.65    53      77
1.10    57      75
1.60    95      112
2.10    73      77
proc genmod data = log;
  model dead/total = conc / pred
        link = logit
        dist = binomial;
  output out = p
         predicted = p;
run;
Term               Function
dead/total         the proportion to be estimated
conc               the explanatory variable
pred               include predicted p’s in OUTPUT
link = logit       for modelling log(p/(1-p)), the log(ODDS)
dist = binomial    the data consists of counts out of a total
out = p            output will also go to a data set work.p
predicted = p      in work.p a column named p will contain predicted values
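For readers without SAS, the same kind of fit can be sketched in plain Python. The code below is an assumption about one standard way to obtain the maximum-likelihood estimates (Newton-Raphson on the binomial log-likelihood); it is not a description of GENMOD's internal algorithm, and the function name `fit_logistic` is hypothetical.

```python
import math

# Grouped binomial data from the slide: (conc, dead, total)
data = [(0.65, 53, 77), (1.10, 57, 75), (1.60, 95, 112), (2.10, 73, 77)]

def fit_logistic(rows, iterations=25):
    """Maximum-likelihood fit of logit(p) = a + b*x by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iterations):
        ua = ub = s0 = s1 = s2 = 0.0
        for x, y, n in rows:
            p = 1 / (1 + math.exp(-(a + b * x)))
            w = n * p * (1 - p)      # binomial variance weight
            ua += y - n * p          # score for the intercept
            ub += x * (y - n * p)    # score for the slope
            s0 += w
            s1 += w * x
            s2 += w * x * x
        det = s0 * s2 - s1 * s1
        a += (s2 * ua - s1 * ub) / det   # solve the 2x2 Newton step
        b += (s0 * ub - s1 * ua) / det
    return a, b

a, b = fit_logistic(data)
print(f"logit(p) = {a:.3f} + {b:.2f} conc")
```

On these four rows the estimates should land close to the fitted line logit(p) = −0.119 + 1.25 conc quoted with the SAS/GPLOT figure.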
Practical Exercise
SAS/GENMOD of Logistic Regression
Lab Session 5 exercise 5.2 (a) – (g)
Modelling needs biological insight!
Stability analysis (Ex 2 pg 14-15)
Explanatory terms
Model 1: h, d, h², d², hd
hopefully high order terms will not be needed!
Model 2: h/d²
biologist suggests this!
Heights, diameters and whether they fell over were recorded for 545 plants.
Aim: model the probability of stability (not falling over) as a function of height and diameter.
diameter   height   stable   n
.0016      0.057    1        1
.0018      0.084    0        1
.0018      0.221    0        1
.0018      0.038    1        1
.0019      0.058    1        1
.0019      0.067    1        1
…          …        …        …
Model 1: h, d, h², d², hd
Analysis Of Parameter Estimates
Parameter   DF   Estimate    Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept   1    -5.3801     0.9402           -7.2228      -3.5374         32.75        <.0001
height      1    -39.1639    4.1510           -47.2998     -31.0280        89.01        <.0001
diameter    1    4958.358    654.0395         3676.464     6240.252        57.47        <.0001
h2          1    10.0396     5.0747           0.0934       19.9859         3.91         0.0479
d2          1    -560913     120280.4         -796659      -325168         21.75        <.0001
hd          1    4206.787    1502.453         1262.033     7151.540        7.84         0.0051
Scale       0    1.0000      0.0000           1.0000       1.0000
How can I describe this!
Model 2: h/d²
Analysis Of Parameter Estimates
Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept   1    3.3235     0.3212           2.6940       3.9529          107.09       <.0001
h_d2        1    -1.7884    0.1583           -2.0987      -1.4780         127.56       <.0001
Scale       0    1.0000     0.0000           1.0000       1.0000
Can understand & even plot this!
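The Chi-Square and confidence-limit columns in this output are Wald quantities, which can be approximately reproduced from the Estimate and Standard Error columns alone. The Python sketch below assumes the usual definitions, chi-square = (estimate/SE)² and limits = estimate ± 1.96 × SE; the function name `wald` is illustrative.

```python
# (estimate, standard error) pairs copied from the Model 2 output
params = {"Intercept": (3.3235, 0.3212), "h_d2": (-1.7884, 0.1583)}

def wald(estimate, se, z=1.96):
    """Return the Wald chi-square and an approximate 95% confidence interval."""
    chi_square = (estimate / se) ** 2
    return chi_square, (estimate - z * se, estimate + z * se)

for name, (est, se) in params.items():
    chi_square, (lower, upper) = wald(est, se)
    print(f"{name}: chi-square = {chi_square:.1f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Small differences from the printed SAS values are expected, because the table shows estimates and standard errors rounded to four decimals.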
SAS/GRAPH
But!
Linear v Quadratic in x = h/d²
?
Finally! Modelling counts
Poisson Regression
For count data, where eg we count all – not a subset out of a total
To estimate the mean, μ, and its relationship with an explanatory variable x use a log link (usually):
log(μ) = a + bx
ie μ = exp(a + bx) = e^a × e^(bx) (which will be > 0)
SAS/GENMOD:
model count = x / link = log dist = poisson;
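A small Python sketch shows the two properties of the log link stated above: the mean is always positive, and each unit increase in x multiplies the mean by e^b. The coefficients a = −1.0 and b = 0.5 are hypothetical values chosen only for illustration, not estimates from any data in these notes.

```python
import math

def poisson_mean(a: float, b: float, x: float) -> float:
    """Mean under a log link: mu = exp(a + b*x) = e^a * e^(b*x)."""
    return math.exp(a + b * x)

a, b = -1.0, 0.5  # hypothetical coefficients, chosen only for illustration
means = [poisson_mean(a, b, x) for x in (0, 1, 2, 3)]
ratios = [m2 / m1 for m1, m2 in zip(means, means[1:])]
print([round(m, 3) for m in means])
print([round(r, 3) for r in ratios])  # every ratio equals exp(b)
```

With an identity link (μ = a + bx) the effect of x would be additive instead, and the fitted mean could become negative for some x.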
Example: Horseshoe crabs & satellites
Each female crab had an attached male (in her nest) & other males (satellites) residing nearby.
• Data recorded
– No. of satellites (response)
– Color (light medium, medium, dark medium, dark)
– Spine condition (both good, one worn/broken, both worn/broken)
– Carapace width (cm)
– Weight (kg)
• Poisson Models:
– Log link: log(μ) = a + bx
– Identity link: μ = a + bx
Effect of width and colour
Grouping weight & number values
Variation in no. satellites
Practical exercise
SAS/GENMOD for Poisson Regression
Lab Session 5 Exercise 5.3 (a) – (e)