Categorical Data Analysis PGRM 14

Statistics in Science


What is categorical data?

The measurement scale for the response consists of a number of categories.

Variable        Measurement scale
Farm system     Dairy, Beef, Tillage etc.
Mortality       Dead, Alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3


Data Analysis considered:

• Response variable(s) is categorical

• Explanatory variable(s) may be categorical or continuous

Example: Does Post-operative survival (categorical response) depend on the explanatory variables?

Sex (categorical)

Age (continuous)

Example: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?

Farm system (categorical)

Attitude to EU (categorical/ordinal)?

(Two response variables - no explanatory variables)

Could one of these be regarded as explanatory?


Measurement scales for categorical data

Nominal - no underlying order

Variable            Measurement scale
Farm system         Dairy, Beef, Tillage etc.
Weed species        Stellaria media, Poa annua, etc.

Ordinal - underlying order in the scale

Variable            Measurement scale
Food texture        Very soft, Soft, Hard, Very hard
Disease diagnosis   Very likely, Likely, Unlikely
Education           Primary, Secondary, Tertiary

Interval - underlying numerical distance between scale points

Variable            Measurement scale
Litter size         0, 1, 2, 3 and >3
Age class           <1, 1-2, 2-3.5, 3.5-5, >5
Education           Years in education


Tables reporting categorical data

1-, 2- & 3-way


Tables reporting count data: single level

Example: A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.

Wild type   Mutant   Total
80          10       90

Is there evidence that the wild type is dominant, giving on average an 8:1 offspring phenotype ratio in its favour?


Tables for count data: two-way

Example: A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

            Outcome
            Dead   Alive   Total   % dead
antiserum   19     65      84      23
control     18     22      40      45
Total       37     87      124

Association between mortality and treatment?


Tables for count data: two-way

Example (Snedecor & Cochran):

The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.

• Has the higher concentration given a significantly different percentage kill?

• Is there a relationship between concentration and mortality?

          Concentration of sodium oleate (%)
          0.65   1.10   1.6    2.1    Total
Dead      55     62     100    72     289
Alive     22     13     12     5      52
Total     77     75     112    77     341
% Dead    71.4   82.7   89.3   93.5   84.8


Is this the relationship?

[Plot not reproduced]

Note: categorical response; interval categorical explanatory variable


Tables for count data: two-way

Example (Cornfield 1962)

Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period.

BP: interval categorical variable in 8 classes

CHD: CHD or No-CHD

BP          CHD   No CHD   Total   % CHD
<117        3     153      156     1.9
117 - 126   17    235      252     6.7
127 - 136   12    272      284     4.2
137 - 146   16    255      271     5.9
147 - 156   12    127      139     8.6
157 - 166   8     77       85      9.4
167 - 186   16    83       99      16.2
>186        8     35       43      18.6
Total       92    1237     1329

1. Is the incidence of CHD independent of BP?

2. Is there a simple relationship between the probability of CHD and the level of BP?


CHD v BP relationship


3-way table

Example: Grouped binomial (response has 2 categories) data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60)

Sex   Age group   Psych. case   On drugs   Total
M     1           No            9          531
M     2           No            16         500
M     3           No            38         644
M     4           No            26         275
M     5           No            9          90
M     1           Yes           12         171
M     2           Yes           16         125
M     3           Yes           31         121
M     4           Yes           16         56
M     5           Yes           10         26
F     1           No            12         588
F     2           No            42         596
F     3           No            96         765
F     4           No            52         327
F     5           No            30         179
F     1           Yes           33         210
F     2           Yes           47         189
F     3           Yes           71         242
F     4           Yes           45         98
F     5           Yes           21         60


Non-tabulated data

Example: Individual Legousia plants were monitored in an experiment to see whether they survived after 3 months.

Survived - yes is scored 1
Survived - no is scored 0

Also recorded were:

CO2 treatment – 2 levels, low and high

Density of Legousia

Density of companion species

Height of the plant (mm) two weeks after planting

Most individuals will have a unique profile in these three additional variables and so tabulation of the data by them is not feasible. The individual data are presented instead.


Non-tabulated data

1. Is survival related to the explanatory variables: CO2, Height, density-self, density-companions?

2. Can the probability of survival be predicted from the subject’s profile?

                              Density
Subject   Surv   CO2   Ht   Leg.   Comp
1         0      L     35   20     30
2         1      L     68   22     27
3         1      H     43   16     33
4         0      L     27   4      16
…         …      …     …    …      …
…         …      …     …    …      …

Surv is the response.


Fixed and non-fixed margins

• One margin fixed: Samples of fixed size are selected for one or more categories and individuals are classified by the other category(s).

• No margin fixed: Individuals in a single sample are simultaneously classified by several categorical variables.

The difference between these depends on the experimental design and on how it specified that the data should be collected.

Method of analysis is the same.


Asking the right question

• Data summarized by counts

• Questions usually relate to percentages (equivalently, proportions)


Hypotheses for Categorical Data

• Categorical data is summarised by counting individuals falling into the various combinations of categories

• Hypotheses relate to: the probability of an individual being in a particular category

• These probabilities are estimated by the observed proportions in the data

• Using a sample proportion, p, from a sample of size n, to estimate a population proportion the standard error is

SE = √(p(1 – p)/n)

eg with p = 0.5, n = 1100, 2×SE = 0.03: the often mentioned 3% margin of error
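The margin-of-error arithmetic on this slide can be checked in a few lines. This is a minimal sketch using only the Python standard library; the function name `se_proportion` is an illustrative choice, not something from the course material.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion p from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

se = se_proportion(0.5, 1100)
margin = 2 * se  # roughly a 95% margin of error
print(f"SE = {se:.4f}, 2 x SE = {margin:.3f}")
```

Because p(1 – p) is largest at p = 0.5, the 3% figure is the widest margin a sample of 1100 can give.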


Example

            Outcome
            Dead   Alive   Total   % dead
antiserum   19     65      84      23
control     18     22      40      45
Total       37     87      124

Does % dead depend on antiserum?

Equivalently:

1. Is there an association between mortality and antiserum?

2. Is mortality independent of antiserum?


Example

• As usual we set up a null hypothesis and measure the extent to which the data conflict with it

• Here H0: prob of death for antiserum = prob of death for control

• Equivalently H0:
– no association between mortality and antiserum
– mortality and antiserum are independent



Example

Expected counts when H0 is true:

The overall % dead (37/124) would apply to antiserum & control

For the 84 antiserum this would give (84×37)/124 dead and (84×87)/124 alive

For the 40 control this would give (40×37)/124 dead and (40×87)/124 alive

E = (row total)(column total)/(table total)
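The rule E = (row total)(column total)/(table total) can be applied to the mouse data directly. The sketch below uses plain Python dictionaries; the variable names are illustrative only.

```python
observed = {
    "antiserum": {"dead": 19, "alive": 65},
    "control": {"dead": 18, "alive": 22},
}

# Margins of the table.
row_totals = {group: sum(cells.values()) for group, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for outcome, count in cells.items():
        col_totals[outcome] = col_totals.get(outcome, 0) + count
table_total = sum(row_totals.values())

# E = (row total)(column total)/(table total) for every cell.
expected = {
    group: {
        outcome: row_totals[group] * col_totals[outcome] / table_total
        for outcome in col_totals
    }
    for group in observed
}

for group, cells in expected.items():
    print(group, {outcome: round(e, 3) for outcome, e in cells.items()})
```

The expected counts have the same row and column totals as the observed counts, which is a useful check on the arithmetic.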


Observed and expected counts

Observed
            Outcome
            Dead   Alive   Total   % dead
antiserum   19     65      84      23
control     18     22      40      45
Total       37     87      124

Expected
            Outcome
            Dead   Alive   Total   % dead
antiserum   25.1   58.9    84      29.9
control     11.9   28.1    40      29.8
Total       37     87      124

Note: some rounding error


Chi-squared statistic: X²

• X² measures the difference between observed counts, O, and expected (when H0 holds) counts, E

• If X² is LARGE this provides evidence against H0, ie evidence for an association (dependence of mortality on antiserum)

• X² = ∑(O – E)²/E

• Here SAS/FREQ gives: X² = 6.48

p = Prob(X² > 6.48 when H0 is true) = 0.0109

• Conclusion: there is evidence (p < 0.05) that mortality depends on antiserum
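As a check on the SAS/FREQ figures, X² and its p-value for the mouse table can be computed by hand. The sketch below uses only the Python standard library; the function names are illustrative, and the p-value formula is specific to 1 DF, where X² behaves like the square of a standard normal variable.

```python
import math

def chi_squared_2x2(table):
    """Pearson X2 = sum of (O - E)^2 / E for a 2x2 table [[a, b], [c, d]]."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n  # expected count under H0
            x2 += (table[i][j] - e) ** 2 / e
    return x2

def p_value_1df(x2):
    """Upper-tail probability of a chi-squared variable with 1 DF."""
    # P(X2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x2 / 2))

mice = [[19, 65], [18, 22]]  # rows: antiserum, control; columns: dead, alive
x2 = chi_squared_2x2(mice)
p = p_value_1df(x2)
print(f"X2 = {x2:.4f}, p = {p:.4f}")
```

For tables with more than 1 DF a different tail formula is needed, so `p_value_1df` should not be reused there.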


Practical Exercise

Use Excel to calculate X² and p

Lab Session 5 exercise 5.1 (a)


SAS/FREQ OUTPUT

Table of antiserum by dead

antiserum    dead
Frequency
Expected
Row Pct      0        1        Total
antiserum    65       19       84
             58.935   25.065
             77.38    22.62
control      22       18       40
             28.065   11.935
             55.00    45.00
Total        87       37       124

The top-left corner (Frequency, Expected, Row Pct) is the description of cell contents.

X² = ∑(O – E)²/E, where O = Frequency and E = Expected

Row percents make most sense here (% alive/dead in each antiserum group)


SAS/FREQ OUTPUT

Statistic                     DF   Value    Prob
Chi-Square                    1    6.4833   0.0109
Likelihood Ratio Chi-Square   1    6.2846   0.0122
Continuity Adj. Chi-Square    1    5.4583   0.0195
Mantel-Haenszel Chi-Square    1    6.4310   0.0112
Phi Coefficient                    0.2287
Contingency Coefficient            0.2229
Cramer's V                         0.2287

X² = ∑(O – E)²/E

DF = (r–1)×(c–1)

Ignore!


[Plot: χ² distribution with 1 DF. X² = 6.48 cuts off an upper-tail area of 0.01 (P = 0.01); the upper-tail area of 0.05 is also marked. 68% of values are < 1 (not shown).]


Aphid example (SAS/FREQ OUTPUT)

Table of status by conc

status (Outcome) by conc (Sodium oleate concentration (%))

Frequency
Expected
Cell Chi-Square
Col Pct           0.65     1.1      1.6      2.1      Total
Alive             22       13       12       5        52
                  11.742   11.437   17.079   11.742
                  8.9617   0.2136   1.5105   3.8711
                  28.57    17.33    10.71    6.49
Dead              55       62       100      72       289
                  65.258   63.563   94.921   65.258
                  1.6125   0.0384   0.2718   0.6965
                  71.43    82.67    89.29    93.51
Total             77       75       112      77       341

X² = 17.18, p = 0.0007 (3 DF)

Note the largest contributions (O – E)²/E to X² (8.96 & 3.87) are in the top corners
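The cell contributions in the SAS output can be reproduced cell by cell, which makes it easy to see where X² comes from. The following is a sketch in standard-library Python; the names are illustrative, and `p_value_3df` uses a closed form that holds only for 3 DF.

```python
import math

conc = [0.65, 1.1, 1.6, 2.1]
observed = {"Alive": [22, 13, 12, 5], "Dead": [55, 62, 100, 72]}

col_totals = [sum(col) for col in zip(*observed.values())]
row_totals = {status: sum(counts) for status, counts in observed.items()}
n = sum(col_totals)

# Contribution (O - E)^2 / E of every cell to X2.
contributions = {}
for status, counts in observed.items():
    contributions[status] = []
    for o, col_total in zip(counts, col_totals):
        e = row_totals[status] * col_total / n
        contributions[status].append((o - e) ** 2 / e)

x2 = sum(sum(cells) for cells in contributions.values())
df = (len(observed) - 1) * (len(conc) - 1)

def p_value_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 DF."""
    z = x / 2
    return math.erfc(math.sqrt(z)) + 2 * math.sqrt(z / math.pi) * math.exp(-z)

p = p_value_3df(x2)
for status, cells in contributions.items():
    print(status, [round(c, 4) for c in cells])
print(f"X2 = {x2:.2f} on {df} DF, p = {p:.5f}")
```

The two largest contributions belong to the Alive row at the lowest and highest concentrations, as noted on the slide.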


Locating the concentration effect

Table of Outcome by Sodium oleate (%): 0.65 v 1.1

Frequency
Col Pct     0.65    1.1     Total
Alive       22      13      35
            28.57   17.33
Dead        55      62      117
            71.43   82.67
Total       77      75      152

X² = 2.71, p = 0.10

Table of Outcome by Sodium oleate (%): 1.6 v 2.1

Frequency
Col Pct     1.6     2.1     Total
Alive       12      5       17
            10.71   6.49
Dead        100     72      172
            89.29   93.51
Total       112     77      189

X² = 0.99, p = 0.32


Locating the concentration effect

Table of Outcome by Sodium oleate (%): <1.5% v >1.5%

Frequency
Col Pct     <1.5%   >1.5%   Total
Alive       35      17      52
            23.03   8.99
Dead        117     172     289
            76.97   91.01
Total       152     189     341

X² = 12.83, p = 0.0003
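The three 2×2 comparisons used to locate the concentration effect (0.65 v 1.1, 1.6 v 2.1, and the pooled low v high groups) can be recomputed from the original aphid counts. This is a sketch in standard-library Python; the helper names `chi2_test_2x2` and `sub_table` are illustrative.

```python
import math

def chi2_test_2x2(table):
    """Return (X2, p) for a 2x2 table of counts, using the 1-DF chi-squared tail."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    x2 = sum(
        (table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
        for i in range(2)
        for j in range(2)
    )
    return x2, math.erfc(math.sqrt(x2 / 2))

alive = {0.65: 22, 1.1: 13, 1.6: 12, 2.1: 5}
dead = {0.65: 55, 1.1: 62, 1.6: 100, 2.1: 72}

def sub_table(concs):
    """2x2 table (rows Alive, Dead) for a pair of concentrations."""
    return [[alive[c] for c in concs], [dead[c] for c in concs]]

low = sub_table([0.65, 1.1])
high = sub_table([1.6, 2.1])
# Pool the two low and the two high concentrations into one 2x2 table.
pooled = [[sum(low[0]), sum(high[0])], [sum(low[1]), sum(high[1])]]

results = {
    "0.65 v 1.1": chi2_test_2x2(low),
    "1.6 v 2.1": chi2_test_2x2(high),
    "<1.5% v >1.5%": chi2_test_2x2(pooled),
}
for name, (x2, p) in results.items():
    print(f"{name}: X2 = {x2:.2f}, p = {p:.4f}")
```

Only the pooled low v high comparison gives a large X², which is the sense in which the effect is located between 1.1% and 1.6%.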


SAS – data format for FREQ procedure

          Concentration of sodium oleate (%)
          0.65   1.10   1.6    2.1    Total
Dead      55     62     100    72     289
Alive     22     13     12     5      52
Total     77     75     112    77     341
% Dead    71.4   82.7   89.3   93.5   84.8

Conc status number

0.65 d 55

0.65 a 22

1.10 d 62

1.10 a 13

1.60 d 100

1.60 a 12

2.10 d 72

2.10 a 5

2 cols identify the cell

Final column is the ‘response’

– the frequency count for the cell
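Going from the two-way table to this one-row-per-cell layout is mechanical. A small Python sketch of the reshaping is given below; it only illustrates the layout and is not part of the SAS workflow.

```python
concentrations = [0.65, 1.10, 1.60, 2.10]
table = {"d": [55, 62, 100, 72], "a": [22, 13, 12, 5]}

# One row per cell: two identifying columns, then the frequency count.
rows = [
    (conc, status, table[status][i])
    for i, conc in enumerate(concentrations)
    for status in ("d", "a")
]

print("Conc status number")
for conc, status, number in rows:
    print(f"{conc:.2f} {status} {number}")
```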


Validity of chi-squared (χ²) test

• Test is based on an approximation leading to use of the χ² distribution to calculate p-values

• With several DF and E ≥ 5 the approximation is ok

• If E < 1 in any cell the approximation may be bad

• With a number of cells in the table perhaps a third or quarter can have E between 1 & 5 without serious departures from χ²-based p-values. (PGRM pg 14-11)

• In cases where good approximation is in doubt use Fisher’s exact test (SAS/FREQ tables option exact)
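These rules of thumb are about the expected counts, so they can be screened before trusting a χ² p-value. The sketch below summarises the expected counts for the aphid table; the function names and the shape of the summary are illustrative choices, and the thresholds are the ones quoted on the slide.

```python
def expected_counts(table):
    """Expected counts E = (row total)(column total)/(table total)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return [[r * c / n for c in col] for r in row]

def summarise_expected(table):
    """Summaries needed for the rules of thumb on the chi-squared approximation."""
    cells = [e for row in expected_counts(table) for e in row]
    return {
        "min_expected": min(cells),
        "any_below_1": any(e < 1 for e in cells),
        "fraction_between_1_and_5": sum(1 <= e < 5 for e in cells) / len(cells),
    }

aphids = [[22, 13, 12, 5], [55, 62, 100, 72]]  # rows: Alive, Dead
summary = summarise_expected(aphids)
print(summary)
```

For the aphid data every expected count is well above 5, so the approximation is not in doubt.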


Code: SAS/FREQ

    proc freq data = conc;
      weight number;          /* number holds the count for each cell */
      tables status*conc
        / chisq cellchi2 expected
          norow nopercent nocum;
    run;

Option            To do
chisq             Test statistics (chi-squared etc)
cellchi2          Contribution to X² from each cell
expected          Expected values for each cell
norow nopercent   Omit row/overall percentages
nocum             Omit cumulative frequencies


Practical Exercise

SAS/FREQ procedure

Lab Session 5 exercise 5.1 (b) – (d)


Logistic Regression


Is this the relationship?

[Plot not reproduced]

Note: categorical response; interval categorical explanatory variable


Why logistic and not just χ²?

• For sparse data (eg where individuals will have unique profiles)

• With many categorical explanatory variables

• With quantitative explanatory variables

In the case of a continuous response we have looked to see if the mean, μ, can be expressed as

μ = a + bx

With categorical data we want an expression for p (the probability of the response in one of the 2 response categories) but

p = a + bx may give values outside the range 0 to 1!

eg p = 0.1 + 0.2x gives p = 1.1 for x = 5


A solution: TRANSFORM

• Use the transformation:

p = exp(a + bx)/(1 + exp(a + bx))

• i.e. log(p/(1 – p)) = a + bx, or log(Odds) = a + bx

where Odds = p/(1 – p)

Note: exp(x) = e^x

Plot is for: a = 0, b = 1

LOGIT: logit(p) = log(p/(1 – p))
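A short numerical check shows why the transformation helps: the straight line p = 0.1 + 0.2x leaves the 0 to 1 range, while the logistic form never does. This is a sketch in standard-library Python with illustrative function names; `inv_logit` is written as 1/(1 + exp(–η)), which is algebraically the same as exp(η)/(1 + exp(η)).

```python
import math

def linear_p(a, b, x):
    """Straight-line model p = a + bx (not restricted to 0..1)."""
    return a + b * x

def inv_logit(eta):
    """p = exp(eta)/(1 + exp(eta)), always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-eta))

def logit(p):
    """log(Odds) = log(p/(1 - p))."""
    return math.log(p / (1 - p))

print("linear, x = 5:", linear_p(0.1, 0.2, 5))

# Logistic curve for a = 0, b = 1, the case plotted on the slide.
for x in (-4, -2, 0, 2, 4):
    p = inv_logit(0 + 1 * x)
    print(x, round(p, 3), round(logit(p), 3))
```

Applying `logit` to the fitted probabilities returns the straight line a + bx, which is what is meant by modelling on the logit scale.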


SAS/GPLOT

logit(p) = −0.119 + 1.25 conc

[Plot: Logistic Estimate of Death Probability. Vertical axis: p (0.6 to 1.0); horizontal axis: Sodium oleate (%) (0.6 to 2.2)]


LD50 – lethal dose for 50%

p = 0.5

p/(1 – p) = 1

logit(p) = 0 (since log(1) = 0, WNF!)

0 = −0.119 + 1.25 conc

conc = 0.119/1.25 = 0.095

Odds Ratio (OR)

Increasing conc by 1% increases logit(p) by 1.25

log(Odds2) – log(Odds1) = 1.25

log(OR) = 1.25

OR = exp(1.25) = 3.49

Note: log(a) – log(b) = log(a/b)
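Both calculations follow directly from the fitted line logit(p) = −0.119 + 1.25 conc. A minimal Python sketch, with the coefficients taken from the slide and illustrative function names:

```python
import math

a, b = -0.119, 1.25  # fitted logit(p) = a + b * conc

ld50 = -a / b             # concentration at which logit(p) = 0, ie p = 0.5
odds_ratio = math.exp(b)  # multiplicative change in odds per unit of conc

def predicted_p(conc):
    """Predicted probability of death at a given concentration."""
    return 1 / (1 + math.exp(-(a + b * conc)))

def odds(conc):
    p = predicted_p(conc)
    return p / (1 - p)

print(f"LD50 = {ld50:.3f}, OR = {odds_ratio:.2f}")
print(f"odds(1.6)/odds(0.6) = {odds(1.6) / odds(0.6):.2f}")
```

The ratio of odds at any two concentrations one unit apart equals exp(b), whatever the starting concentration. Note that the LD50 of about 0.095% lies well below the lowest concentration tested (0.65%), so it is an extrapolation.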


SAS/GENMOD

conc   dead   total
0.65   53     77
1.10   57     75
1.60   95     112
2.10   73     77

    proc genmod data = log;
      model dead/total = conc / pred
                                link = logit
                                dist = binomial;
      output out = p
             predicted = p;
    run;

Term              Function
dead/total        the proportion to be estimated
conc              the explanatory variable
pred              include predicted p’s in OUTPUT
link = logit      for modelling log(p/(1 – p)), the log(ODDS)
dist = binomial   the data consists of counts out of a total
out = p           output will also go to a data set work.p
predicted = p     in work.p a column named p will contain predicted values
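To see what such a fit involves, the sketch below estimates a and b in logit(p) = a + b·conc by maximum likelihood for the grouped data listed on this slide, using Newton-Raphson in standard-library Python. It is an illustration of the idea only and makes no claim about how GENMOD is implemented; the function name `fit_logistic` and the fixed iteration count are assumptions.

```python
import math

conc = [0.65, 1.10, 1.60, 2.10]
dead = [53, 57, 95, 73]
total = [77, 75, 112, 77]

def fit_logistic(x, y, n, iterations=25):
    """Fit logit(p) = a + b*x to y successes out of n by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        # Score vector: gradient of the binomial log-likelihood.
        g0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        g1 = sum(xi * (yi - ni * pi) for xi, yi, ni, pi in zip(x, y, n, p))
        # Information matrix, built from the weights n*p*(1-p).
        w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = score.
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    return a, b

a, b = fit_logistic(conc, dead, total)
predicted = [1 / (1 + math.exp(-(a + b * c))) for c in conc]
print(f"logit(p) = {a:.3f} + {b:.3f} conc")
print([round(p, 3) for p in predicted])
```

At the maximum-likelihood estimates the fitted numbers of deaths add up to the observed total, which gives a quick sanity check on any logistic fit with an intercept.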


Practical Exercise

SAS/GENMOD of Logistic Regression

Lab Session 5 exercise 5.2 (a) – (g)


Modelling needs biological insight!


Stability analysis (Ex 2 pg 14-15)

Heights, diameters and whether they fell over were recorded for 545 plants.

Aim: model the probability of stability (not falling over) as a function of height and diameter.

diameter   height   stable   n
.0016      0.057    1        1
.0018      0.084    0        1
.0018      0.221    0        1
.0018      0.038    1        1
.0019      0.058    1        1
.0019      0.067    1        1
…          …        …        …

Explanatory terms

Model 1: h, d, h², d², hd
hopefully high order terms will not be needed!

Model 2: h/d²
biologist suggests this!


Model 1: h, d, h², d², hd

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept   1    -5.3801    0.9402           -7.2228     -3.5374          32.75        <.0001
height      1    -39.1639   4.1510           -47.2998    -31.0280         89.01        <.0001
diameter    1    4958.358   654.0395         3676.464    6240.252         57.47        <.0001
h2          1    10.0396    5.0747           0.0934      19.9859          3.91         0.0479
d2          1    -560913    120280.4         -796659     -325168          21.75        <.0001
hd          1    4206.787   1502.453         1262.033    7151.540         7.84         0.0051
Scale       0    1.0000     0.0000           1.0000      1.0000

How can I describe this!


Model 2: h/d²

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept   1    3.3235     0.3212           2.6940      3.9529           107.09       <.0001
h_d2        1    -1.7884    0.1583           -2.0987     -1.4780          127.56       <.0001
Scale       0    1.0000     0.0000           1.0000      1.0000

Can understand & even plot this!
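With a single explanatory term the fitted model is easy to evaluate. The sketch below turns the Model 2 estimates into a predicted probability of stability. The scaling of the `h_d2` covariate used in the fit is not stated on the slide, so `x` here is assumed to be on whatever scale that covariate had; the function name `p_stable` is illustrative.

```python
import math

intercept, slope = 3.3235, -1.7884  # Model 2 estimates for Intercept and h_d2

def p_stable(x):
    """Predicted probability of stability at h_d2 = x (scale as used in the fit)."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))

x_half = -intercept / slope  # value of h_d2 at which p = 0.5

for x in (0.0, 1.0, x_half, 3.0):
    print(f"h_d2 = {x:.3f}: p(stable) = {p_stable(x):.3f}")
```

Because the slope is negative, the predicted stability falls steadily as h/d² increases, passing 50% at `x_half`.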


SAS/GRAPH

But!


Linear v Quadratic in x = h/d²?


Finally! Modelling counts


Poisson Regression

For count data - where eg we count all, not a subset out of a total

To estimate the mean, μ, and its relationship with an explanatory variable x use a log link (usually):

log(μ) = a + bx

ie μ = exp(a + bx) = e^a e^(bx) (which will be > 0)

SAS/GENMOD

    model count = x / link = log dist = poisson;
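The log link has two practical consequences: the fitted mean is always positive, and a one-unit increase in x multiplies the mean by exp(b). The sketch below illustrates both; the coefficients `a` and `b` are hypothetical values chosen only for illustration and are not estimates from any data in these slides.

```python
import math

# Hypothetical coefficients, for illustration only.
a, b = 0.5, 0.2

def poisson_mean(x):
    """Mean under the log link: mu = exp(a + b*x), always > 0."""
    return math.exp(a + b * x)

def identity_mean(x):
    """Mean under the identity link: mu = a + b*x, which can go negative."""
    return a + b * x

ratio = poisson_mean(3.0) / poisson_mean(2.0)
print(f"mu(2) = {poisson_mean(2.0):.3f}, mu(3) = {poisson_mean(3.0):.3f}")
print(f"ratio = {ratio:.4f}, exp(b) = {math.exp(b):.4f}")
print(f"identity link at x = -5: {identity_mean(-5.0):.2f}")
```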


Example: Horseshoe crabs & satellites

Each female crab had an attached male (in her nest) & other males (satellites) residing nearby.

• Data recorded
– No. of satellites (response)
– Color (light medium, medium, dark medium, dark)
– Spine condition (both good, one worn/broken, both worn/broken)
– Carapace width (cm)
– Weight (kg)

• Poisson Models:
– Log link: log(μ) = a + bx
– Identity link: μ = a + bx


Effect of width and colour


Grouping weight & number values


Variation in no. satellites


Practical exercise

SAS/GENMOD for Poisson Regression

Lab Session 5 Exercise 5.3 (a) – (e)
