Chapter 7 Analyzing categorical data 1

Chapter 7 Analyzing categorical datastazjt/teaching/ST2137/lecture/lec 7.pdf · title \Using a Fromat to Group a Numeric Varible"; tables age; format age agegp.; run; 15 ... SPSS:

Chapter 7

Analyzing categorical data

What is a categorical variable?

Examples:

• Gender (“Male”,“Female”)

• Sick or well

• Success or failure

• Age group (“Below 20”, “20 to below 40”, “40 to below 60”,“60 and above”)
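A minimal R sketch of how such variables might be stored, assuming made-up values for three respondents (the vectors below are illustrative, not course data). A factor records the allowed categories; an ordered factor also records their order, which suits the age groups.

```r
# Nominal variable: the categories have no natural order.
gender <- factor(c("Male", "Female", "Male"), levels = c("Male", "Female"))

# Ordinal variable: the age groups have a natural order.
age_levels <- c("Below 20", "20 to below 40", "40 to below 60", "60 and above")
agegroup <- factor(c("20 to below 40", "60 and above", "Below 20"),
                   levels = age_levels, ordered = TRUE)

levels(gender)                      # the set of allowed categories
below_40_to_60 <- agegroup < "40 to below 60"  # comparison uses the level order
print(below_40_to_60)
```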

Common techniques used to analyze categorical data

• Frequency tables

• Contingency tables

• Charts

• Test of proportion

• Chi-square test

Questionnaire design and analysis

• A questionnaire is the most common way to collect certain types of data.

• If the data are not collected by computer or online, they can be entered into the computer manually.

SAS: proc freq

data ex7_1;
  input @1 id $3. @4 age 2.0 @6 gender $1. @7 race $1.
        @8 marital $1. @9 education $1. @10 subsi 1.0;
  * Adding labels to the variables;
  label marital   = "Marital Status"
        education = "Education Level"
        subsi     = "Baby Subsidy";
datalines;
0012911113
0024522222
0033513244
0042711112
0056821323
0066512432
;

proc freq data=ex7_1;
  title "Frequency Counts for Categorical Variables";
  tables gender race marital education subsi;
  /* Alternatively, we can use the following command:
     tables gender--subsi; */
run;
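For comparison, a one-way frequency table with counts, percentages and cumulative totals can be sketched in R. This is only an illustration, assuming the six gender codes from the data lines above; the column names are hypothetical and the layout is not SAS's actual output.

```r
# Gender codes of the six respondents in the data lines above.
gender <- c(1, 2, 1, 1, 2, 1)

counts <- table(gender)             # frequency of each code
pct    <- 100 * prop.table(counts)  # percentage of each code

# Assemble frequency, percent and their cumulative versions in one table.
freq_table <- data.frame(
  gender      = names(counts),
  frequency   = as.vector(counts),
  percent     = round(as.vector(pct), 2),
  cum_freq    = cumsum(as.vector(counts)),
  cum_percent = round(cumsum(as.vector(pct)), 2)
)
print(freq_table)
```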

SAS output: proc freq

SAS: Adding “Value Labels” (Format)

proc format;
  value $sexfmt "1" = "Male"
                "2" = "Female"
                other = "Miscoded";
  value $race "1" = "Chinese"
              "2" = "Malay"
              "3" = "Indian"
              "4" = "Others";
  value $mari "1" = "Single"
              "2" = "Married"
              "3" = "Widowed"
              "4" = "Divorced";
  value $educ "1" = "O-level or Less"
              "2" = "A-Level or Poly"
              "3" = "Bachelor degree"
              "4" = "Postgraduate degree";
  value agree 1 = "Strongly Disagree"
              2 = "Disagree"
              3 = "No Opinion"
              4 = "Agree"
              5 = "Strongly Agree";
run;

data ex7_1label;
  input @1 id $3. @4 age 2.0 @6 gender $1. @7 race $1.
        @8 marital $1. @9 education $1. @10 subsi 1.0;
  label marital   = "Marital Status"
        education = "Education Level"
        subsi     = "Baby Subsidy";
  format gender $sexfmt. race $race. marital $mari.
         education $educ. subsi agree.;
datalines;
0012911113
0024522222
0033513244
0042711112
0056821323
0066512432
;

proc freq data=ex7_1label;
  title "Frequency Counts for Categorical Variables";
  tables gender race marital education subsi;
run;

SAS output: proc freq

SAS: Using a format to recode a variable

proc format;
  value agegp low-20  = "0-20"
              21-40   = "21-40"
              41-60   = "41-60"
              60-high = "Greater than 60"
              .       = "Did not Answer"
              other   = "Out of Range";
run;

proc freq data=ex7_1label;
  title "Using a Format to Group a Numeric Variable";
  tables age;
  format age agegp.;
run;

SAS output: Using a format to recode a variable

R: Adding value labels

> ex7.1 = read.fwf("D:/ST2137/ex7_1.txt", header=F, widths=c(3,2,1,1,1,1,1))
> names(ex7.1) = c("id","age","gender","race","marital","education","subsi")
> attach(ex7.1)
> gendername = c("Male","Female")
> gendergp = gendername[gender]
> gender
[1] 1 2 1 1 2 1
> gendergp
[1] "Male"   "Female" "Male"   "Male"   "Female" "Male"
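An alternative worth knowing is factor(), which attaches labels to codes much as a SAS format does. The sketch below is not the slide's method; it assumes the gender and subsi codes from the six data lines and reuses the labels from the SAS formats.

```r
# Codes of the six respondents, typed in directly instead of read from a file.
gender <- c(1, 2, 1, 1, 2, 1)
subsi  <- c(3, 2, 4, 2, 3, 2)

# levels = the codes, labels = the text shown for each code.
gendergp <- factor(gender, levels = 1:2, labels = c("Male", "Female"))
subsigp  <- factor(subsi, levels = 1:5,
                   labels = c("Strongly Disagree", "Disagree", "No Opinion",
                              "Agree", "Strongly Agree"))

print(table(gendergp))  # categories appear in code order, not alphabetically
print(table(subsigp))   # unused categories are kept with a count of zero
```

Codes outside the declared levels become NA, so miscoded values stay visible instead of being silently relabelled.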

R: Recode a variable

> agegpname = c("low-20","21-40","41-60","61-80","over 80")
> agegp = agegpname[ceiling(age/20)]
> age
[1] 29 45 35 27 68 65
> agegp
[1] "21-40" "41-60" "21-40" "21-40" "61-80" "61-80"
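The indexing trick above relies on the groups having equal width. As a hedged alternative, cut() accepts arbitrary break points; the sketch below assumes the same six ages and the same group names, with intervals closed on the right (so 20 falls in "low-20" and 40 in "21-40").

```r
age <- c(29, 45, 35, 27, 68, 65)

# Breaks define the intervals (-Inf,20], (20,40], (40,60], (60,80], (80,Inf).
agegp <- cut(age, breaks = c(-Inf, 20, 40, 60, 80, Inf),
             labels = c("low-20", "21-40", "41-60", "61-80", "over 80"))

print(as.character(agegp))
print(table(agegp))   # empty groups are listed with a count of zero
```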

R: Table

> gendername = c("Male","Female")
> gendergp = gendername[gender]
> table(gendergp)
gendergp
Female   Male
     2      4

> agegpname = c("low-20","21-40","41-60","61-80","over 80")
> agegp = agegpname[ceiling(age/20)]
> table(agegp)
agegp
21-40 41-60 61-80
    3     1     2

> racegpname = c("Chinese","Malay","Indian","Others")
> racegp = racegpname[race]
> table(racegp)
racegp
Chinese  Indian   Malay
      3       1       2

> marigpname = c("Single","Married","Widowed","Divorced")
> marigp = marigpname[marital]
> table(marigp)
marigp
Divorced  Married   Single  Widowed
       1        2        2        1

> educgpname = c("(1)High Sch or Less","(2)A-Level or Poly",
+ "(3)Bachelor degree","(4)Postgraduate degree")
> educgp = educgpname[education]
> table(educgp)
educgp
   (1)High Sch or Less     (2)A-Level or Poly
                     2                      2
    (3)Bachelor degree (4)Postgraduate degree
                     1                      1

> likegpname = c("(1)Strongly Disagree","(2)Disagree",
+ "(3)No Opinion","(4)Agree","(5)Strongly Agree")
> subsigp = likegpname[subsi]
> table(subsigp)
subsigp
  (2)Disagree (3)No Opinion      (4)Agree
            3             2             1

SPSS: Frequency tables

• Suppose the data set used in the SAS proc freq example has been imported into SPSS.

• “Analyze” → “Descriptive Statistics” → “Frequencies...”

• Move the variables to the “Variable(s)” panel → “OK”

SPSS output: Frequency tables

Two-way frequency tables

Count the occurrences of one variable at each level of another variable.

For example, we would like to know:

1. How many males and females were there in the sample?

2. How many respondents were for Candidate A and how many were for Candidate B?

3. How many males and females were for Candidate A and for Candidate B, respectively?
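The three questions map onto the row margins, the column margins and the cells of a two-way table. A small R sketch, assuming the counts that the R output for the candidate data reports elsewhere in this chapter (70, 30, 40, 40), typed in directly:

```r
# Two-way table of gender by candidate, entered as counts.
tab <- as.table(matrix(c(70, 30,
                         40, 40), nrow = 2, byrow = TRUE,
                       dimnames = list(gender = c("F", "M"),
                                       candid = c("A", "B"))))

by_gender <- margin.table(tab, 1)  # question 1: totals for each gender
by_candid <- margin.table(tab, 2)  # question 2: totals for each candidate
print(by_gender)
print(by_candid)
print(addmargins(tab))             # question 3: the cells, with totals added
```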

Two-way frequency tables: SAS

proc format;
  value $genfmt "M" = "Male"
                "F" = "Female"
                other = "Miscoded";
  value $candfmt "A" = "Candidate A"
                 "B" = "Candidate B";
run;

data ex7_2;
  infile "D:\ST2137\ex7_2.txt";
  input gender $ candid $;
  label gender = "Gender"
        candid = "Candidate";
  format gender $genfmt.
         candid $candfmt.;
run;

proc freq data=ex7_2;
  tables gender*candid / chisq;
run;
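To see what the chi-square statistic measures, it can be computed by hand: each expected count is (row total × column total) / n, and the statistic sums (observed − expected)² / expected over the cells. The R sketch below assumes the gender-by-candidate counts reported in the R output of this chapter and applies no continuity correction.

```r
observed <- matrix(c(70, 30,
                     40, 40), nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("F", "M"), candid = c("A", "B")))

# Expected counts under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic, degrees of freedom and p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(chi_sq, 4))   # 7.4805
print(df)                 # 1
```

The built-in chisq.test(observed, correct = FALSE) should give the same statistic.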

Two-way frequency tables: SAS output

Computing Chi-square from frequency counts: SAS

/* Computing Chi-square from frequency counts */
data ex7_2c;
  input group $ outcome $ count;
datalines;
drug alive 90
drug dead 10
placebo alive 80
placebo dead 20
;

proc freq data=ex7_2c;
  tables group*outcome / chisq;
  weight count;
run;
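A rough R counterpart of the weight statement is xtabs(), which builds the table from a count column. The sketch below assumes the same four data lines; the data frame name d is hypothetical.

```r
# One row per cell of the table, with its frequency count.
d <- data.frame(group   = c("drug", "drug", "placebo", "placebo"),
                outcome = c("alive", "dead", "alive", "dead"),
                count   = c(90, 10, 80, 20))

# Sum the counts within each group/outcome combination.
tab <- xtabs(count ~ group + outcome, data = d)
print(tab)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default.
result <- chisq.test(tab)
print(round(unname(result$statistic), 4))   # 3.1765
```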

Two-way frequency tables: SAS output

Two-way frequency tables: R

> ex7.2 = read.table("D:/ST2137/ex7_2.txt", header=F)
> names(ex7.2) = c("gender","candid")
> table(ex7.2)
      candid
gender  A  B
     F 70 30
     M 40 40

> chisq.test(table(ex7.2))

Pearson's Chi-squared test with Yates' continuity correction

data: table(ex7.2)
X-squared = 6.6626, df = 1, p-value = 0.009846

Computing chi-square from the frequency counts: R

> v = matrix(c(90,10,80,20), ncol=2, byrow=T)
> v = data.frame(v)
> names(v) = c("Alive","Dead")
> row.names(v) = c("Drug","Control")
> chisq.test(v)

Pearson's Chi-squared test with Yates' continuity correction

data: v
X-squared = 3.1765, df = 1, p-value = 0.0747
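The continuity correction subtracts 0.5 from each |observed − expected| before squaring, which lowers the statistic slightly. The sketch below reproduces both versions by hand for the drug/control counts, assuming the table is laid out with rows Drug and Control and columns Alive and Dead.

```r
v <- matrix(c(90, 10,
              80, 20), nrow = 2, byrow = TRUE,
            dimnames = list(c("Drug", "Control"), c("Alive", "Dead")))

# Expected counts under independence: 85 and 15 in each row.
expected <- outer(rowSums(v), colSums(v)) / sum(v)

uncorrected <- sum((v - expected)^2 / expected)
corrected   <- sum((abs(v - expected) - 0.5)^2 / expected)

print(round(uncorrected, 4))  # 3.9216, as from chisq.test(v, correct = FALSE)
print(round(corrected, 4))    # 3.1765, as from the default chisq.test(v)
```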

Two-way frequency tables: SPSS

• “Analyze” → “Descriptive Statistics” → “Crosstabs...”

• Move one of the variables to the “Row” window and the second variable to the “Column(s)” window.

• Click on “Statistics...”

• Choose “Chi-square” or some other statistics → “Continue” → “OK”

Computing Chi-square from frequency tables: SPSS

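• Set up the data file with one row per cell of the table, including a “Count” variable (the slide shows the data file as a screenshot).

• “Data” → “Weight Cases...”

• Move the variable “Count” to the “Frequency Variable” panel under the “Weight cases by” option.

• Proceed as for the two-way frequency table in SPSS (“Crosstabs...” with the “Chi-square” statistic).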

Paired Data

• Paired data arise when the subjects respond to a question under two different conditions (e.g. before and after treatment).

• Paired designs are also used when a specific person is matched on some criteria, such as age and gender, to another person for the purpose of analysis.
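With paired data the unit of analysis is the pair, so the two responses of each subject are cross-tabulated against each other. A small R sketch with made-up responses for five subjects (hypothetical values, not the course data):

```r
# Hypothetical opinions of five subjects: "p" = positive, "n" = negative.
before <- c("n", "n", "p", "p", "n")
after  <- c("n", "p", "p", "n", "p")

# Each subject contributes one count to the before-by-after table.
paired <- table(Before = before, After = after)
print(paired)

# Discordant pairs: subjects whose opinion changed between the two conditions.
n_to_p <- paired["n", "p"]
p_to_n <- paired["p", "n"]
print(c(n_to_p = n_to_p, p_to_n = p_to_n))
```

Only the discordant pairs carry information about a change, which is why tests for paired categorical data are built on them.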

McNemar’s test for paired data: SAS

proc format;
  value $opin "p" = "Positive"
              "n" = "Negative";
run;

data ex7_3;
  length before after $1.;
  infile "D:\ST2137\ex7_3.txt";
  input subject before $ after $;
  format before after $opin.;
run;

proc freq data=ex7_3;
  title "McNemar's Test for Paired Samples";
  tables before*after / agree;
run;

McNemar’s test for paired data: SAS output

McNemar’s test for frequency counts: SAS

proc format;
  value $opin "p" = "Positive"
              "n" = "Negative";
run;

data ex7_3c;
  length before after $1.;
  input after $ before $ count;
  format before after $opin.;
datalines;
n n 32
n p 30
p n 15
p p 23
;

proc freq data=ex7_3c;
  title "McNemar's Test for Paired Samples";
  tables before*after / agree;
  weight count;
run;
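When comparing software output, note that the McNemar statistic comes in two versions, with and without a continuity correction. The R sketch below computes both for the counts in the data lines above. It is our assumption, worth checking against the actual SAS listing, that the agree option reports the uncorrected version.

```r
# Rows = before, columns = after ("n" = negative, "p" = positive).
counts <- matrix(c(32, 15,
                   30, 23), nrow = 2, byrow = TRUE,
                 dimnames = list(Before = c("n", "p"), After = c("n", "p")))

uncorrected <- mcnemar.test(counts, correct = FALSE)
corrected   <- mcnemar.test(counts)   # continuity correction is R's default

print(unname(uncorrected$statistic))          # 5
print(round(unname(corrected$statistic), 4))  # 4.3556
```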

McNemar’s test: R

# Example 7.3
> ex7.3 = read.table("D:/ST2137/ex7_3.txt", header=F)
> names(ex7.3) = c("ID","Before","After")
> attach(ex7.3)
> mcnemar.test(table(ex7.3[,2:3]))

McNemar's Chi-squared test with continuity correction

data: table(ex7.3[, 2:3])
McNemar's chi-squared = 4.3556, df = 1, p-value = 0.03689

McNemar’s test for Frequency Counts: R

# Example 7.3c: Handling frequency counts
> ex7.3c = matrix(c(32,15,30,23), nrow=2, byrow=T,
+ dimnames=list("Before"=c("No","Yes"), "After"=c("No","Yes")))
> ex7.3c
      After
Before No Yes
   No  32  15
   Yes 30  23
> mcnemar.test(ex7.3c)

McNemar's Chi-squared test with continuity correction

data: ex7.3c
McNemar's chi-squared = 4.3556, df = 1, p-value = 0.03689
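The statistic above depends only on the two discordant cells (15 and 30). The sketch below recomputes it by hand as (|b − c| − 1)² / (b + c) and, as an optional extra not covered in the slides, runs an exact binomial test on the discordant pairs. The variable names are illustrative.

```r
n_no_yes <- 15   # Before = No,  After = Yes
n_yes_no <- 30   # Before = Yes, After = No

# McNemar's chi-squared statistic with continuity correction, df = 1.
stat    <- (abs(n_no_yes - n_yes_no) - 1)^2 / (n_no_yes + n_yes_no)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
print(round(stat, 4))      # 4.3556
print(round(p_value, 5))   # 0.03689

# Exact alternative: under H0 each discordant pair is equally likely
# to fall in either off-diagonal cell.
exact <- binom.test(n_no_yes, n_no_yes + n_yes_no, p = 0.5)
print(exact$p.value)
```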

McNemar’s Test: SPSS

• “Analyze”→ “Descriptive Statistics” →“Crosstabs...”

• Move “Before” to the “Row” window and “After” to the “Column(s)” window.

• Click on “Statistics...” and choose “McNemar”

• “Continue”→“OK”

If frequency counts are available instead of the raw data, we can weight the data as follows: “Data” → “Weight Cases...”
