ch05.catassoc


5 Association between Categorical Variables

CONTINGENCY TABLES
MARGINAL AND CONDITIONAL DISTRIBUTIONS
SEGMENTED BAR CHARTS AND MOSAIC PLOTS
LURKING VARIABLES AND SIMPSON’S PARADOX
CHI-SQUARE: A MEASURE OF ASSOCIATION
CRAMER’S V: INTERPRETING THE STRENGTH OF ASSOCIATION
CHECKLIST: CHI-SQUARE AND CRAMER’S V
SUMMARY



2/7/2008 5 Categorical Association


Amazon spends millions of dollars on advertising. That sounds like a lot until you think about the size of the Internet. Busy portals like Google and Yahoo charge plenty for the privilege of advertising on their pages. Which locations deliver buyers?

The answer comes from understanding the variation in the categorical variable that indicates whether a visitor makes a purchase. Some buy, others only browse. Suppose that everyone from Yahoo is a buyer, but no one from Deal Time. Knowing the link that attracted the shopper explains variation in behavior and reveals the better location for ads.

Let’s focus on the choices faced by an advertising manager at Amazon. She has a budget to allocate among three busy hosts: msn.com, recipesource.com, and yahoo.com. Together, these three delivered 17,619 visits to Amazon during the fall of 2002.

Host                  Visits
msn.com                7,258
recipesource.com       4,283
yahoo.com              6,078
Total shoppers        17,619

Table 5-1. Frequency table of the categorical variable that identifies shoppers from three hosts.

MSN generates the most visits, but more visits do not automatically mean more sales. More visits do translate into higher costs, however. Amazon pays a fee for every visit, whether the shopper buys anything or not. Hosts that generate many visits but few sales are costly to Amazon.

Should Amazon pay some hosts more than others for each shopper sent to amazon.com?





Contingency Tables

To discover whether some hosts are better than others, we have to consider a second categorical variable, one that identifies the visits that result in a sale. Here are the bar charts.

Figure 5-1. Bar charts of the hosts and purchase actions (panels: Host, Purchase).

The categorical variable Host identifies the originating site, the categorical variable summarized in Table 5-1. Purchase indicates whether the session produced a sale. There’s precious little variation in Purchase. Only 516 visits, less than 3%, result in a purchase. If every sale came from one host, Amazon would know where to place its ads.

The bar charts in Figure 5-1 summarize each categorical variable separately, but we need to consider them simultaneously. For instance, we need to separate visitors from MSN into those who made a purchase and those who did not.

The most common arrangement of such counts organizes them in a table. The rows of the table identify the levels of one variable, and the columns of the table identify the levels of the other. Such a table is called a contingency table. This contingency table shows the variable Purchase (along the rows) and the variable Host (along the columns).

                            Host
Purchase    msn.com    recipesource.com    yahoo.com    Total
No             6973                4282         5848    17103
Yes             285                   1          230      516
Total          7258                4283         6078    17619

Table 5-2. Contingency table of web shopping.

The cells of this contingency table count the visits for every combination of Host and Purchase. The cells of the contingency table are mutually exclusive; each case appears in exactly one cell. For example, the column labeled msn.com shows that 285 of the 7,258 visits from MSN generated a purchase. Of those from recipesource.com, only 1 of the 4,283 visits led to a purchase.

contingency table

A table that shows counts of the cases of one categorical variable contingent on the value of another.

mutually exclusive

The conditions that define the cells in a contingency table allow a case to appear in only one cell. There’s no double-counting of cases.





Marginal and Conditional Distributions

The margins of Table 5-2 (the Total row and the Total column) give the total counts in each row and column. Because the cells of the table are mutually exclusive, the sum of the counts in the cells of the first column equals the total number of visits from msn.com. The sum for each column appears in the bottom margin of the contingency table; these sums match the frequency distribution of Host shown in Table 5-1. The right margin shows the frequency table of Purchase. Because these counts are typically placed along the margins of a contingency table, the frequency distributions of the variables in the table are also called marginal distributions. The bar charts in Figure 5-1 show these distributions.
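The margins are nothing more than row and column sums of the cells. As a minimal sketch in Python (the dictionary layout and the function name `margins` are illustrative choices, not something the chapter prescribes), the marginal distributions of Table 5-2 can be recovered from its six cells:

```python
# Cell counts from Table 5-2, keyed by (Purchase, Host).
counts = {
    ("No", "msn.com"): 6973,
    ("No", "recipesource.com"): 4282,
    ("No", "yahoo.com"): 5848,
    ("Yes", "msn.com"): 285,
    ("Yes", "recipesource.com"): 1,
    ("Yes", "yahoo.com"): 230,
}


def margins(cell_counts):
    """Return (row_totals, col_totals, grand_total) for a dict of cell counts."""
    row_totals, col_totals = {}, {}
    for (row, col), count in cell_counts.items():
        row_totals[row] = row_totals.get(row, 0) + count
        col_totals[col] = col_totals.get(col, 0) + count
    return row_totals, col_totals, sum(cell_counts.values())


row_totals, col_totals, n = margins(counts)
print("Marginal distribution of Purchase:", row_totals)
print("Marginal distribution of Host:", col_totals)
print("Total visits:", n)
```

Because every visit falls in exactly one cell, the row totals and the column totals each add up to the same grand total.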

Percentages help us interpret a contingency table, but we’ve got to make a choice of which percentage to show. For example, 285 shoppers from MSN made a purchase. To show this count as a percentage, we have three choices:

285 is

1.62% of all 17,619 visits

3.93% of the 7,258 visits from MSN

55.23% of the 516 visits that made a purchase
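The three percentages differ only in the denominator. A short sketch (variable names are illustrative) makes the choice explicit:

```python
cell = 285          # visits from MSN that made a purchase
grand_total = 17619  # all visits
col_total = 7258     # all visits from MSN
row_total = 516      # all visits that made a purchase

total_pct = 100 * cell / grand_total  # share of all visits
col_pct = 100 * cell / col_total      # share of MSN visits
row_pct = 100 * cell / row_total      # share of purchases

print(f"Total %: {total_pct:.2f}")
print(f"Col %:   {col_pct:.2f}")
print(f"Row %:   {row_pct:.2f}")
```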


All are potentially interesting. Some statistics packages embellish a contingency table with every percentage, like this:

                                  Host
Count
Total %
Col %
Row %            msn.com    recipesource.com    yahoo.com     Total

Purchase No         6973                4282         5848     17103
                   39.58               24.30        33.19     97.07
                   96.07               99.98        96.22
                   40.77               25.04        34.19

Purchase Yes         285                   1          230       516
                    1.62                0.01         1.31      2.93
                    3.93                0.02         3.78
                   55.23                0.19        44.57

Total               7258                4283         6078     17619
                   41.19               24.31        34.50

Table 5-3. Too many percentages clutter this contingency table.

Tables like this one give percentages a bad reputation. The table shows too many percentages. Each cell lists the count along with percentages of the total, the column, and the row. While it’s fine to consider all of these, it’s better to choose the percentage that answers the relevant question.

Because the account manager at Amazon is interested in which host produces the highest proportion of purchasers, a better table shows only the counts and column percentages. Let’s start with msn.com. For the moment, we’re interested in only the 7,258 visits from msn.com in the first column of Table 5-2. The distribution of a variable that is restricted to cases satisfying a condition is called a conditional distribution. In a table, a conditional distribution refers to counts within a row or column.

             msn.com
No              6973    96.07%
Yes              285     3.93%
Total           7258

marginal distribution

The frequency distribution of a variable in a contingency table given by counts of the total number of cases in rows or columns.





By limiting our attention to visits from msn.com, we see the conditional distribution of Purchase conditional on the host being MSN.

The following contingency table shows the counts and the column percentages. The percentages within each column show the conditional distribution of Purchase for each host.

                                  Host
Count
Col %            msn.com    recipesource.com    yahoo.com     Total
Purchase No         6973                4282         5848     17103
                  96.07%              99.98%       96.22%    97.07%
Purchase Yes         285                   1          230       516
                   3.93%               0.02%        3.78%     2.93%
Total               7258                4283         6078     17619

Table 5-4. Contingency table with relevant percentages.

Compare this table to Table 5-3. Without the distraction of extraneous percentages, we can quickly see that visitors from MSN and Yahoo yield similar shares of purchases (3.93% and 3.78%, respectively). In comparison, only one visitor from RecipeSource bought anything (0.02%).
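The column percentages can be reproduced by dividing each cell by its column total. The following is a small sketch, with `conditional_distribution` as a hypothetical helper name:

```python
# Counts from Table 5-2, organized by column (Host).
by_host = {
    "msn.com": {"No": 6973, "Yes": 285},
    "recipesource.com": {"No": 4282, "Yes": 1},
    "yahoo.com": {"No": 5848, "Yes": 230},
}


def conditional_distribution(cell_counts):
    """Convert the counts in one row or column into percentages of its total."""
    total = sum(cell_counts.values())
    return {level: 100 * count / total for level, count in cell_counts.items()}


col_pct = {host: conditional_distribution(cells) for host, cells in by_host.items()}
for host, dist in col_pct.items():
    print(f"{host:18s} No {dist['No']:6.2f}%   Yes {dist['Yes']:5.2f}%")
```

Each host gets its own conditional distribution of Purchase, and each distribution sums to 100%.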

We’ve just discovered that Host and Purchase are associated. Categorical variables are associated if the column percentages vary from column to column (or if row percentages vary from row to row). In this case, the proportion of visits that produce a purchase differs among hosts. The association between Host and Purchase means that knowing the host changes your impression of the chance of a purchase.

Variables can be associated to different degrees. The least association occurs when the column percentages are identical. Overall, 516/17619 = 2.93% of the visits made a purchase. If 2.93% of visits from every host made a purchase, then the chance of a purchase would not depend on the host. Each conditional distribution of Purchase given Host would match the marginal distribution of Purchase.

That’s not what happens: Host and Purchase are associated. Visitors from some hosts are more likely to make a purchase. Because Host and Purchase are associated, the account manager at Amazon might be willing to pay more for visits from MSN or Yahoo and less for those from RecipeSource. The value of the visit depends on the host.

Segmented Bar Charts and Mosaic Plots

Bar charts of marginal distributions like those in Figure 5-1 don’t reveal association, but other charts do. For example, Amazon locates warehouses near large concentrations of shoppers to reduce shipping costs. Being close makes it cheaper to offer free shipping. This contingency table shows the counts of Purchase by Location over a wider range of hosts.

conditional distribution

The distribution of a variable restricted to cases that satisfy a condition, such as those in a row or column of a contingency table.

association

Two categorical variables are associated if the conditional distribution of one variable depends on the value of the other.





                                   Location
Count
Row %            North Central    North East    South     West    Total
Purchase No               5640          4450     8321     4645    23056
                         24.46         19.30    36.09    20.15
Purchase Yes               161           146      177      169      653
                         24.66         22.36    27.11    25.88
Total                     5801          4596     8498     4814    23709

Table 5-5. Contingency table of purchases organized by region.

Because we’re interested in discovering where those who make a purchase live, this table shows row percentages. With four percentages in each conditional distribution, it becomes helpful to have a plot. A segmented bar chart divides bars in a bar chart proportionally into segments corresponding to the percentage in each group. If the bars look identical, then the variables are not associated.

Although the South sends the largest number of visitors, visitors from the South are more likely to browse rather than buy. About 36% of the browsers come from the South, but only 27% of the buyers. If Purchase and Location were not associated, then these percentages should be about the same. Because they differ, Location is associated with making a purchase. You can see the differences in this segmented bar chart. The yellow segment identifying visits from the South makes up a larger share among those who don’t purchase (on the top) than among those who do make a purchase.
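The row percentages behind this comparison are the numbers a segmented bar chart would draw. A brief sketch, using the counts from Table 5-5 (the names `shares` and `gap` are illustrative):

```python
locations = ["North Central", "North East", "South", "West"]
no_purchase = [5640, 4450, 8321, 4645]  # row "No" of Table 5-5
purchase = [161, 146, 177, 169]         # row "Yes" of Table 5-5


def shares(row_counts):
    """Percentage of a row's total that falls in each column."""
    total = sum(row_counts)
    return [100 * count / total for count in row_counts]


browser_share = dict(zip(locations, shares(no_purchase)))
buyer_share = dict(zip(locations, shares(purchase)))

# Positive gap: the location is over-represented among buyers.
gap = {loc: buyer_share[loc] - browser_share[loc] for loc in locations}
for loc in locations:
    print(f"{loc:14s} browsers {browser_share[loc]:5.2f}%  "
          f"buyers {buyer_share[loc]:5.2f}%  gap {gap[loc]:+5.2f}")
```

If the two variables were not associated, every gap would be close to zero; here the South stands out with the most negative gap.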

Figure 5-2. Segmented bar chart.

Be careful interpreting a segmented bar chart. This chart compares relative frequencies of two conditional distributions. Because these are relative frequencies rather than counts, the bars do not represent the same number of cases. The bar on the top summarizes 23,056 cases whereas the bar on the bottom summarizes 653 purchases. The chart obeys the area principle, but the area is proportional to the percentages within each row of the table.

Segmented bar charts frequently appear in news items such as this one from the New York Times.1

1 “Economists debate the quickest cure,” The New York Times, January 19, 2008.

segmented bar chart

A bar chart that divides

the bars into shares

based on a second

categorical variable.





At the time, the government was debating the use of tax cuts to stimulate consumer spending and avoid a recession. The chart shows that the conditional distribution within the bars is changing, meaning there’s association. In this case, we can see that households with smaller incomes are more likely to use tax rebates to pay down debt than households with higher incomes.

A mosaic plot is an alternative to the segmented bar chart. A mosaic plot shows tiles, colored rectangular regions, that represent the counts in each cell of a contingency table. The layout of the tiles matches the layout of the cells in a contingency table, and the sizes of the tiles are proportional to the counts in each cell. The tiles within a column have the same width, but possibly different heights. The widths of the columns are proportional to the marginal distribution of the variable positioned on the bottom of the table. For example, this figure shows the mosaic plot of the data in Table 5-5. The tiny heights of the red tiles in the second row show the counts of purchases; their tiny sizes remind you how rare it is to find a purchase among the visits.

Figure 5-3. Mosaic plot of the purchases by region.

Overall, the South contributes the most. These are the widest tiles in the plot. Because purchases are so rare, however, it’s hard to see in the mosaic plot that the share of purchases is smaller for visitors from the South.

Mosaic plots are much more useful for seeing dependence in data for which the relative frequencies do not get so small. As an example, the following table shows counts of sales of shirts at a men’s clothing retailer. Do Size and Style appear associated? If the two are not associated, managers should order the same proportion of sizes in every style. If the two are associated, the distribution of sizes varies from style to style.

mosaic plot

A tiled plot in which the

size of each tile is

proportional to the count in

a cell of a contingency table.





                          Style
Size        Button Down    Polo    Beach Print    Total
Small                18      27             36       81
Medium               65      82             28      175
Large               103      65             22      190
Total               186     174             86      446

Table 5-6. Sales of shirts at a men’s clothing retailer.

It’s hard to see the association quickly in this table of counts, but a mosaic plot makes the association very clear.

Figure 5-4. Mosaic plot of the shirt sales shows association between Size and Style.

The tiles would line up in the absence of association. In this example, the proportions of sizes vary across the styles, causing the tiles to vary in height. The irregular heights indicate that these variables are associated.

Small sizes are much more prevalent among beach prints than the button-down shirts. Because the mosaic plot respects the area principle, we can also see that the button-down style is the biggest seller overall (these tiles are wider than the others) and the beach-print style is the smallest seller.

4M: Where’s my car?

2002 Dodge Intrepid 2002 Toyota Camry

Auto theft costs owners and insurance companies billions of dollars. The FBI estimates that 1.2 million cars worth $8.4 billion were stolen in 2002.

Should insurance companies charge the same premium for theft insurance or should they vary the premium? Obviously, a policy that insures a $90,000 Porsche costs more than one for a $15,000 Hyundai. But should the premium be a fixed percentage of the car’s value, or should the percentage vary from model to model? To answer this question, we need to know whether some cars are more likely to be stolen than others. It comes down to whether there is an association between car theft and car model.

For this example, it’s up to you to decide whether an insurance company should charge a fixed percentage of the price to insure against theft. The data come from the National Highway Traffic Safety Administration (NHTSA). We picked seven popular models.

Motivation. What questions would you like to answer?

My company is deciding whether to base premiums for theft insurance on the chance that the car is stolen. We can either charge a fixed percentage of the replacement cost, or charge a variable percentage for cars that are stolen more often. Are there large differences in the rates of theft?

Method. Identify the variables and report the W’s. Be certain that the data are counts and that the categories do not overlap so that no individual is counted twice. Also indicate what you intend to do with these data.

My data from NHTSA give the number of cars stolen for seven 2002 models. If thefts are associated with the model, then we should vary the rate. I’ll judge the association by seeing whether the percentage of cars stolen varies by model.

Mechanics. Make an appropriate display or table to see whether there is a difference in the relative proportions. Notice that we did not add the missing column for “not stolen”. If you do that, you’ll see that the number made is the marginal (row) total. For example,

Model         Stolen    Not stolen     Total
Intrepid        1657        109834    111491

This table shows the data along with the percentage of each model that is stolen.

Model                  Stolen      Made    Pct Stolen
Chevrolet Cavalier       1017    259230         0.392
Dodge Intrepid           1657    111491         1.486
Dodge Neon                959    119253         0.804
Ford Explorer            1419    610268         0.233
Ford Taurus               842    321556         0.262
Honda Accord              702    419398         0.167
Toyota Camry             1027    472030         0.218

Among these models, the Dodge Intrepid has the highest percentage stolen (1.486%), followed by the Dodge Neon (0.804%). The Honda Accord has the least (0.167%).
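The Pct Stolen column is the count stolen divided by the count made. A short sketch of that arithmetic, with illustrative variable names:

```python
# (stolen, made) for each 2002 model, from the table above.
thefts = {
    "Chevrolet Cavalier": (1017, 259230),
    "Dodge Intrepid": (1657, 111491),
    "Dodge Neon": (959, 119253),
    "Ford Explorer": (1419, 610268),
    "Ford Taurus": (842, 321556),
    "Honda Accord": (702, 419398),
    "Toyota Camry": (1027, 472030),
}

pct_stolen = {model: 100 * stolen / made for model, (stolen, made) in thefts.items()}

# Rank models from most to least likely to be stolen.
ranked = sorted(pct_stolen, key=pct_stolen.get, reverse=True)
for model in ranked:
    print(f"{model:20s} {pct_stolen[model]:.3f}%")

# How many times as likely is an Intrepid to be stolen as a Camry?
ratio = pct_stolen["Dodge Intrepid"] / pct_stolen["Toyota Camry"]
print(f"Intrepid vs. Camry: {ratio:.1f} times")
```

Ranking by rate rather than by raw count is the point: the Explorer has the second-largest number of thefts but one of the lower theft rates.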






Message. Discuss the patterns in the table and displays. A segmented bar chart is less useful because the percentages stolen are so small.

Some models (e.g., Dodge Intrepid) are more likely to be stolen than others (e.g., Honda Accord). About 1.5% of 2002 Intrepids were stolen, compared to less than 0.17% of 2002 Accords. A lot of Accords get stolen, but that’s explained by the sheer number of Accords sold each year.

If you can, discuss possible real-world consequences.

We should charge higher premiums for theft insurance for models that are most likely to be stolen. A Dodge Intrepid is about 7 times as likely to be stolen as a Toyota Camry. Customers who buy an Intrepid (which costs about the same as a Camry) should pay a higher premium for theft insurance.

Are You There?

An on-line questionnaire asked visitors to a retail web site if they would like to join a mailing list. This contingency table summarizes the counts of those who join as well as those who made a purchase.

                Mailing List
Purchase      Join    Decline
Yes             52         12
No             343       3720

The columns indicate whether the visitor signed up (Mailing List = Join or Decline), and the rows indicate whether the visitor made a purchase (Purchase = Yes or No). For example, 52 visitors joined the mailing list and made a purchase.

(a) Find the marginal distribution of Purchase.2

(b) Find the conditional distribution of Purchase given whether the customer signed up or not. Do the conditional distributions differ?3

(c) Does a segmented bar chart provide a helpful plot for these data?4

(d) Is the variable Purchase associated with the variable Mailing List?5

2 The row totals determine the marginal distribution, 64 “Yes” who made a purchase and 4,063 “No”.

3 Among the 395 who join the list, 52 make a purchase (13%). Among those who decline, 12 out of 3,732 make a purchase (0.32%). Customers who join are more likely to make a purchase. These are rather different.

4 Not really, because one percentage is so small. You could show a figure like that in the prior 4M.

5 The two are associated (dependent) because the conditional distributions differ. The chance that a customer makes a purchase depends on whether they sign up.






Lurking Variables and Simpson’s Paradox

Association gets confused with causation. This mistake can lead to serious errors in judgment. Consider the following contingency table.

                           Service
Count
Column %          Orange Arrow    Brown Box     Total
Status Damaged              45           66       111
                           15%          33%     22.2%
Status OK                  255          134       389
                           85%          67%     77.8%
Total                      300          200       500

Table 5-7. Counts of damaged packages.

This contingency table shows the number of cartons that were damaged when shipped by two delivery services. The percentages in each cell are column percentages. Overall, 22.2% of the 500 cartons arrived with visible damage. Conditionally, 15% of cartons shipped via Orange Arrow arrived damaged compared to 33% for Brown Box. There’s definitely association; neither conditional distribution matches the marginal distribution of Status.

Table 5-7 suggests that Orange Arrow is the better shipper, and we might be tempted to believe that cartons are more likely to arrive undamaged because they are shipped on Orange Arrow. If we believe that, we might decide to ship everything on Orange Arrow. Before we do that, however, we had better make sure that this table offers a fair comparison. Maybe there’s another explanation for why packages shipped on Brown Box are damaged more often.

To think of an alternative explanation, we have to know more about these packages. In this instance, the cartons hold car parts. Some cartons hold engine parts whereas others hold plastic molding. Guess which cartons are heavier and more prone to damage? The next two tables separate the counts in Table 5-7 into those for heavy cartons (left table) and those for light cartons (right table).

                          Heavy                                    Light
Count
Column %     Orange Arrow   Brown Box    Total      Orange Arrow   Brown Box    Total
Damaged                20          60       80                25           6       31
                      67%         40%    44.4%                9%         12%     9.7%
OK                     10          90      100               245          44      289
                      33%         60%    55.6%               91%         88%    90.3%
Total                  30         150      180               270          50      320

Table 5-8. Separate tables for heavy and light packages.

Orange Arrow is no longer clearly the better shipper. Among heavy packages, 67% of those shipped on Orange Arrow arrived damaged compared to 40% for Brown Box. For light cartons, 9% of those shipped by Orange Arrow arrived damaged compared to 12% for Brown Box.

lurking variable

A concealed variable that

affects the apparent

relationship between two

other variables.





The initial comparison favors Orange Arrow because it handles a higher share of light packages. Brown Box seems more likely to damage packages because it handles a greater proportion of heavy cartons. Heavy cartons more often arrive with some damage (44.4% versus 9.7%). Table 5-7 presents a misleading comparison; it compares how well Orange Arrow ships light cartons to how well Brown Box ships heavy cartons. The weight of the cartons is a hidden, lurking variable. Table 5-8 adjusts for the lurking variable by separating the data into heavy cartons and light cartons.
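The effect of the lurking variable can be checked by computing the damage rate twice: once pooled over weights, as in Table 5-7, and once within each weight class, as in Table 5-8. The sketch below uses the damaged and shipped counts from those tables; the function name `damage_rate` is an illustrative choice.

```python
# (damaged, shipped) counts from Table 5-8, keyed by (weight, service).
shipments = {
    ("Heavy", "Orange Arrow"): (20, 30),
    ("Heavy", "Brown Box"): (60, 150),
    ("Light", "Orange Arrow"): (25, 270),
    ("Light", "Brown Box"): (6, 50),
}


def damage_rate(service, weight=None):
    """Percent damaged for a service, pooled over weights unless one is given."""
    cells = [counts for (w, s), counts in shipments.items()
             if s == service and (weight is None or w == weight)]
    damaged = sum(d for d, _ in cells)
    shipped = sum(n for _, n in cells)
    return 100 * damaged / shipped


for service in ("Orange Arrow", "Brown Box"):
    print(f"{service:12s} pooled {damage_rate(service):5.1f}%  "
          f"heavy {damage_rate(service, 'Heavy'):5.1f}%  "
          f"light {damage_rate(service, 'Light'):5.1f}%")
```

Pooled, Orange Arrow looks far better; within heavy cartons it looks worse, and within light cartons the two services are close. The pooled rates mostly reflect how many heavy cartons each service carried.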

Such reversals often go by the name Simpson’s paradox. It can seem surprising, downright paradoxical, that one service looks better overall, but the other looks better when we restrict the comparison. The explanation lies in recognizing the presence of a lurking variable. Before you act on association (like sending all the business to Orange Arrow), be sure to identify the effects of lurking variables.

One of the best-known examples of Simpson’s paradox occurred when U.C. Berkeley was sued for bias against women applying to graduate school. When data from all of the graduate programs at Berkeley were pooled, the admission rate for men was much higher than that for women. However, it was discovered that the rate was very similar in every department. In fact, most departments had a small bias in favor of women. The explanation for the apparent overall bias was that women tended to apply to departments that had many applicants and therefore low rates of admission. Men, on the other hand, tended to apply to departments such as mathematics that had fewer applicants and higher admission rates. (P. J. Bickel, E. A. Hammel, and J. W. O'Connell (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187:4175, pp. 398-404.)

4M Picking an Airline

Which airline is more likely to get you to a meeting on time, US Airways or Delta? The following table summarizes 10,906 arrivals at four airports served by both carriers: Boston, Orlando, Philadelphia, and San Diego.

                       Airline
Count
Column %           Delta    US Airways    Total
Arrival On Time     2596          5966     8562
                     80%           78%
Arrival Delayed      659          1685     2344
                     20%           22%
Total               3255          7651    10906

Simpson’s paradox occurs when the direction of an association between two variables is reversed when a third variable is controlled. The effect was named after E. H. Simpson, who compiled entertaining examples in 1951, but it was carefully described much earlier by G. U. Yule (1871-1951) in 1903.





Table 5-9. Airline arrivals.

This table suggests that the two airlines perform comparably, with a slight edge to Delta with 80% on time compared to 78% for US Airways. Before you book a flight on Delta, however, you should think about whether there’s a lurking variable.

Motivation. List the questions that you would like to answer, and state the implications.

My business regularly takes me to these four destinations. Does it matter which airline I choose, Delta or US Airways? A late arrival might cause me to miss a meeting with a client.

Method. Identify the variables and your data. Describe your plan for the analysis.

Both variables are categorical: the airline and the arrival status. Data are from the Bureau of Transportation Statistics. Contingency tables answer my questions. I’d better think about lurking variables. A possible lurking variable behind Table 5-9 is the destination of the flight. This table combines the status for all four destinations.

Mechanics.

I’ll form a contingency table that isolates flights into one destination: Orlando. Here’s the table.

Count
Col %          Delta    US Airways    Total
Delayed          228           150      378
               19.5%         15.5%
On Time          940           820     1760
               80.5%         84.5%
Total           1168           970     2138

For flights to Orlando, US Airways is the better choice. In fact, no matter which destination, US Airways has a higher percentage of on-time arrivals.

On Time %        Delta    US Airways
Boston           80.1%         81.7%
Orlando          80.5%         84.5%
Philadelphia     70.5%         74.3%
San Diego        84.2%         85.4%

Message. Discuss the patterns in the table and displays.

I’ll book a flight on US Airways. No matter which destination, US Airways is more likely to arrive on time.

It’s worthwhile to review why Delta appears better overall, even though US Airways arrives on time more often for each destination. The initial table, Table 5-9, masks a lurking variable: destination. The destination matters: delays are more common at Philadelphia.

                                 Destination
Count
Col %              Boston    Orlando    Philadelphia    San Diego    Total
Arrival Delayed       615        378            1230          121     2344
                      19%        18%             26%          15%
Arrival On Time      2620       1760            3505          677     8562
                      81%        82%             74%          85%
Total                3235       2138            4735          798    10906

Table 5-10. Delayed arrivals by destinations.






In addition, most of these flights on US Airways go to Philadelphia (58%), whereas the largest share of those on Delta go to Boston (43%).

                                    Destination
Count
Row %                 Boston    Orlando    Philadelphia    San Diego    Total
Airline Delta           1409       1168             312          366     3255
                         43%        36%             10%          11%
Airline US Airways      1826        970            4423          432     7651
                         24%        13%             58%           6%
Total                   3235       2138            4735          798    10906

Table 5-11. Airlines by destinations.

The initial table (Table 5-9) answers a strange question: “Am I more likely to arrive on time flying to Boston on Delta or arrive on time in Philadelphia on US Airways?” The answer: take Delta to Boston. There’s nothing wrong with that answer; it’s just an odd question. By focusing the analysis on flights into a specific destination, we control for this lurking variable and answer the right question.

Once you identify a lurking factor, you can remove its effects as we did in this example. But here’s the hard part: How do you know whether there is a lurking factor? It’s easy to imagine other lurking factors, too. Maybe it’s the type of airplane, the day of the week, or the time of day. Make no mistake about it. You need to understand the context of your data to find a lurking factor.

Chi-Square: A Measure of Association

In the first example, we concluded that Purchase and Host are associated because the proportion of visitors who make purchases differs from host to host. How different are they?

Rather than leave it to subjective judgment, it’s useful to have a statistic that quantifies the amount of association. Instead of saying “There’s some association” or “There’s a lot of association”, the statistic called chi-square (pronounce “chi” as “ki”) measures the degree of association. The larger chi-square becomes, the larger the amount of association. This statistic also offers a preview of an approach frequently taken in statistics. To quantify the degree of association, we compare the data we observe to artificial data that have none. Chi-square measures association in a contingency table by comparing the observed contingency table to an artificial table that has no association. If the tables are similar, then there’s not much association. The larger the difference between the tables, the larger the association.

We’ll illustrate the use and calculation of chi-square with an example. A recent poll asked 200 people at a university about their attitudes toward sharing copyrighted music. Half of the respondents were students and the other half were staff at the university (administrators or faculty). This table summarizes the counts.


chi-square

A statistic that measures

association in a contingency

table; larger values of chi-

square indicate more

association.





              Attitude to Sharing
Group         OK    Not OK    Total
Staff         30        70      100
Student       50        50      100
Total         80       120      200

Table 5-12. Attitudes toward sharing copyright materials.

Overall, 40% (80 of 200) of those questioned thought it was OK to share copyrighted music. That’s the marginal percentage. Each row determines a conditional distribution of the attitude, one for staff and one for students. Only 30% of the staff thought it was OK to share, compared to 50% of students. Because the row percentages differ, Group and Attitude are associated.

To quantify the amount of association, we need a benchmark for comparison, a point of reference. For that, consider what Table 5-12 would look like if there were no association. To figure this out, pretend that we know the marginal totals, but not the counts within the table:

              Attitude to Sharing
Group         OK    Not OK    Totals
Staff          ?         ?       100
Student        ?         ?       100
Totals        80       120       200

Table 5-13. What goes in these cells if the variables are not associated?

Overall, ½ of the respondents are staff and ½ are students. Were Group and Attitude not associated, then ½ of the cases in each column would be staff and ½ would be students. We would expect the table to look like this:

                  Attitude to Sharing
Group         OK to Share    Not OK    Totals
Staff                  40        60       100
Students               40        60       100
Totals                 80       120       200

Table 5-14. Artificial table with cells that we would expect were Group and Attitude not associated.
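The artificial table can be built from the margins alone: each expected count is the row total times the column total divided by n, which is the same as applying the row's share of cases to each column total. A minimal sketch, with `expected_counts` as a hypothetical helper name:

```python
observed = [
    [30, 70],  # Staff:   OK, Not OK
    [50, 50],  # Student: OK, Not OK
]


def expected_counts(table):
    """Counts expected in each cell if the row and column variables were not associated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for label, row in zip(("Staff", "Student"), expected):
    print(label, row)
```

The expected table has the same margins as the observed table, but every row has the same conditional distribution.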

Chi-square measures the distance between the cells in the real table and those in the artificial table. We first subtract the values in the cells; use only the cells, not the margins. The differences in the counts are:

Real Data          Artificial          Difference
 30    70      -    40    60      =    -10     10
 50    50           40    60            10    -10

Table 5-15. Deviations from the original counts.

Next, we combine the differences. If we add them, we get zero because the negative and positive values cancel. We had this problem with cancellation when we defined the variance s² in Chapter 4. We’ll solve the problem as we did then: square the differences before we add them.





When the squared deviations are added, chi-square assigns some of them larger weight. Look at the differences in the first row. Both are 10, but the difference in the first column is larger relative to what we expected than the difference in the second column (10 out of 40 compared to 10 out of 60). Rather than treat these the same, chi-square assigns more weight to the first. After all, saying 40 and finding 30 is a larger proportional error than saying 60 and finding 70. To give more weight to larger proportional deviations, we divide the squared deviations by the expected values in the artificial table.

The chi-square statistic is the sum of these weighted, squared differences. For this table, chi-square, denoted in formulas as χ², is

$$
\begin{aligned}
\chi^2 &= \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60} \\
       &= \frac{(-10)^2}{40} + \frac{(10)^2}{60} + \frac{(10)^2}{40} + \frac{(-10)^2}{60} \\
       &= 2.5 + 1.67 + 2.5 + 1.67 \\
       &= 8.33
\end{aligned}
$$
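The same calculation can be written as a short function. This is a sketch of the steps just described (expected counts from the margins, squared deviations divided by the expected counts, then a sum); the function name `chi_square` is an illustrative choice.

```python
def chi_square(table):
    """Sum of (observed - expected)^2 / expected over the cells of a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were not associated.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


music = [
    [30, 70],  # Staff:   OK, Not OK
    [50, 50],  # Student: OK, Not OK
]
chi2 = chi_square(music)
print(round(chi2, 2))  # 8.33
```

Applied to the artificial table of expected counts itself, the function returns 0, the value that corresponds to no association.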

Chi-square has another similarity to s²: it’s hard to interpret. The value of chi-square depends on n, the total number of cases, and the size of the table. The larger the table, the larger chi-square becomes.

Are You There?Here’s the contingency table from the prior AYT, including the marginaltotals.

Mailing List

Join Decline Total Yes 52 12 64

PurchaseNo 343 3720 4063

Total 395 3732 4127

(a) Chi-square requires the artificial table of counts. What count would be expected in the highlighted cell for those who join the mailing list and make a purchase if Purchase and Mailing List are not associated?⁶

(b) What is the contribution to chi-square from the cell for those who join the mailing list and make a purchase?⁷

(c) The value of chi-square for this table is χ² ≈ 385.9. Does your answer to “b” reveal which cell produces the largest contribution to chi-square?⁸

⁶ If the two variables are not associated, then the percentage who make a purchase among those who join ought to be the same as the percentage in the margin of the table, which is 64/4127 or about 1.55%. The expected count in the first cell is then 395 × 64/4127 ≈ 6.126.

⁷ Subtract the expected count from (a) from the observed count to get the deviation. Then square the deviation and divide by the expected count. The contribution is (52 − 6.126)²/6.126 ≈ 343.5.

⁸ Yes. Every summand that goes into χ² is non-negative, and this one cell contributes about 343.5 of the total 385.9, so most of χ² comes from the first cell. The big deviation from the artificial table is the large count in the first cell.
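The answers to this exercise can be checked with a short Python sketch. It is our own illustration rather than part of the exercise: the helper names `expected_counts` and `contributions` are hypothetical, and the expected counts are built from the margins in the way footnote 6 describes. Because the code does not round the expected count before squaring, the contribution of the first cell can differ from the footnote's figure in the last digit.

```python
def expected_counts(table):
    """Counts expected in each cell if rows and columns were not associated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def contributions(table):
    """Per-cell pieces of chi-square: (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, expected)]

# Rows: Purchase Yes / No.  Columns: Mailing List Join / Decline.
ayt = [[52, 12], [343, 3720]]

expected = expected_counts(ayt)
pieces = contributions(ayt)
chi2 = sum(sum(row) for row in pieces)

print(expected[0][0])  # expected count for those who join and purchase
print(pieces[0][0])    # that cell's contribution to chi-square
print(chi2)            # chi-square for the whole table
```

Printing all of `pieces` shows at a glance which cell dominates the total.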





Cramer’s V: Interpreting the Strength of Association

The value of chi-square for the example of music sharing is 8.33 whereas chi-square for the AYT exercise is 385.9. Is there much more association in the second example, or is chi-square larger because n = 4,127 in the second table compared to n = 200 in the first? A more interpretable statistic allows comparisons of the amount of association across tables. To remove the effects of n and the size of the table, Cramer’s V adjusts chi-square so that the resulting measure of association lies between 0 and 1. If V = 0, the variables are not associated. If V = 1, they are perfectly associated. If V < 0.25, we will say that the association is weak. If V > 0.75, we will say that it’s strong. In between, we will say there is moderate association.

To find Cramer’s V, divide χ² by the product of the number of cases and the smaller of the number of rows minus 1 and the number of columns minus 1, then take the square root. The formula for Cramer’s V is simpler than the words. As usual, n stands for the total number of cases; let r be the number of rows and c the number of columns. The formula for Cramer’s V is

\[
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}}
\]

If V = 0, the two categorical variables are not associated. If V = 1, the two variables are perfectly associated. If variables are perfectly associated, you can guess one once you know the value of the other. For the survey of file sharing, χ² = 8.33 and both r and c are 2 and n = 200. Hence,

\[
V = \sqrt{\frac{\chi^2}{200 \, \min(2-1,\; 2-1)}} = \sqrt{\frac{8.33}{200}} \approx 0.20
\]

There’s association, but it’s weak. Staff and students have different attitudes toward file sharing, but the differences are not very large. For the AYT example, χ² = 385.9, n = 4127, and r = c = 2. In this case,

\[
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}} = \sqrt{\frac{385.9}{4127}} \approx 0.31
\]

There is indeed more association in this example than in the example of file sharing, but not that much more. The huge difference between the values of chi-square is a consequence of the difference in sample sizes, not the degree of dependence.
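The two calculations can be reproduced with a small Python sketch. The function names `cramers_v` and `strength` are ours, chosen for illustration, and the cutoffs in `strength` are simply the 0.25 and 0.75 conventions stated above.

```python
import math

def cramers_v(chi2, n, rows, cols):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))

def strength(v):
    """Label V with the chapter's cutoffs: below 0.25 weak, above 0.75 strong."""
    if v < 0.25:
        return "weak"
    if v > 0.75:
        return "strong"
    return "moderate"

v_sharing = cramers_v(8.33, 200, 2, 2)     # file-sharing survey
v_mailing = cramers_v(385.9, 4127, 2, 2)   # AYT mailing-list table

print(round(v_sharing, 2), strength(v_sharing))
print(round(v_mailing, 2), strength(v_mailing))
```

Chi-square differs by a factor of more than 40 between the two tables, yet V moves only from about 0.20 to about 0.31 once n is divided out.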

What does a table look like when there is strong association? Strong association implies very large differences among row or column percentages of a table. Suppose the survey results had turned out as shown in this table:

Cramer’s V is named after the influential Swedish mathematician and statistician, Carl Harald Cramér (1893-1985). Cramér is best known for his work in probability and risk. He found many real-life applications for his work, especially in the insurance …

Cramer’s V: A statistic derived from chi-square that measures the association in a contingency table on a scale from 0 to 1.





            OK to Share   Not OK   Totals
Staff                 0      100      100
Students             80       20      100
Totals               80      120      200

Table 5-16. A table with strong association.

No staff thought it was OK, compared to 80% of the students. You’d expect arguments between staff and students about sharing materials on this campus. Let’s find χ² and Cramer’s V for this table. The margins of Table 5-16 are the same as those in the original contingency table, so the calculation of χ² is similar. We just need to replace the original counts by those in Table 5-16.

\[
\begin{aligned}
\chi^2 &= \frac{(0-40)^2}{40} + \frac{(100-60)^2}{60} + \frac{(80-40)^2}{40} + \frac{(20-60)^2}{60} \\
       &= 40 + 26.67 + 40 + 26.67 \\
       &= 133.33
\end{aligned}
\]

Cramer’s V indicates strong association between the variables:

\[
V = \sqrt{\frac{133.33}{200}} \approx 0.816
\]

The size of Cramer’s V indicates that you can almost predict exactly what a respondent will say if you know whether the respondent is on the staff or is a student. If you know a person is a staff member, then you know their attitude toward sharing files. Every member of the staff says that file sharing is not OK. Among students, 80% say that it’s OK to share.
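For comparison, here is what perfect association would look like. The table is hypothetical and is not from the survey: suppose all 100 staff had said Not OK and all 100 students had said OK. Every margin is then 100, so each cell of the artificial table holds 100 × 100/200 = 50, and

\[
\chi^2 = \frac{(0-50)^2}{50} + \frac{(100-50)^2}{50} + \frac{(100-50)^2}{50} + \frac{(0-50)^2}{50} = 200,
\qquad
V = \sqrt{\frac{200}{200 \, \min(2-1,\; 2-1)}} = 1 .
\]

In that table, knowing whether a respondent is staff or student tells you the answer with certainty, which is what V = 1 means.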

Checklist: Chi-square and Cramer’s V

Chi-square and Cramer’s V measure association between two categorical variables that define a contingency table. Before you use these, verify that your data meet these prerequisites.

✓ Categorical variables. If a variable is numerical, there are better ways to measure association.

✓ No obvious lurking variables. A lurking variable means that the association you’ve found is the result of some other variable that’s not shown.

4M Real Estate

A developer needs to pick heating systems and appliances for newly built single-family homes. If the house has electric heat, it’s cheaper to install electric appliances in the kitchen. If the home has gas heat, gas appliances make sense in the kitchen. If he’s limited to gas or electric heating, how many of each should he offer? Does everyone who heats with gas prefer to cook with gas as well?





The builder checked the preferences of 447 homes in the area. For each, his data give the type of fuel used for cooking and the type used for heating. It’s your job to use these data to answer the two questions.

Motivation: State the questions that you would like to answer and mention the implications.

The builder wants to configure homes that match the demand for gas or electric heat. He also has to decide the types of appliances customers want in kitchens. If there’s little association, then the developer needs a wider mix of configurations.

Method: Identify the variables and data. Describe your plan for the analysis.

The data are two categorical variables with 447 rows. The rows are homes that heat with gas or electric. The variables are the type of fuel used for heating and the type used for cooking. I’ll generate a contingency table and compare the conditional distributions to the marginal distribution. There’s association if these are different.

Mechanics:

About 2/3 heat with natural gas (298/447) and 1/3 with electricity. This contingency table shows column percentages. These give the conditional distributions of cooking fuel given the type of fuel used for heating.

Cooking Fuel By Fuel Heat Home

Count (Col %)    Electric Heat    Gas Heat       Total
Electricity      136 (91.28)      136 (45.64)      272
Natural Gas       10  (6.71)      162 (54.36)      172
Other              3  (2.01)        0  (0.00)        3
Total            149              298              447

There’s association. Among homes with electric heat, 91% cook with electricity. Only 46% of those who heat with gas use electricity to cook. To quantify the strength of the association, χ² = 98.62 and V = sqrt(98.62/(447 × 1)) ≈ 0.47. That’s moderate association.

Message: Summarize your key results and displays. If you can answer the questions, state your answer directly. Be honest. If you have some reservations, mention them here.

Homeowners prefer natural gas to electric heat by 2 to 1. Of those with electric heat, 91% cook with electricity. Of those with gas heat, 46% cook with electricity.

These findings suggest building 2/3 of the homes with gas heat and the rest with electric heat. Of those with electric heat, keep it simple and install an electric kitchen. For those with gas, put an electric kitchen in half and gas in the rest.

There’s a big caveat, however. I’ve assumed that new buyers are looking for the same things in a home that these residents have, and that’s a big if.
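The Mechanics step of this example can be retraced with a short Python sketch. It is our own illustration, not output from the software used in the chapter; the variable names are hypothetical, and the counts are the six cells of the cooking-fuel table.

```python
import math

# Rows: cooking fuel (Electricity, Natural Gas, Other).
# Columns: heating fuel (Electric Heat, Gas Heat).
counts = [[136, 136], [10, 162], [3, 0]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Column percentages: conditional distribution of cooking fuel given heating fuel.
col_pct = [[100 * cell / total for cell, total in zip(row, col_totals)]
           for row in counts]

# Expected counts under no association, then chi-square and Cramer's V.
chi2 = 0.0
for row, r_tot in zip(counts, row_totals):
    for cell, c_tot in zip(row, col_totals):
        expected = r_tot * c_tot / n
        chi2 += (cell - expected) ** 2 / expected
v = math.sqrt(chi2 / (n * min(len(counts) - 1, len(counts[0]) - 1)))

for label, row in zip(["Electricity", "Natural Gas", "Other"], col_pct):
    print(label, [round(p, 2) for p in row])
print(round(chi2, 2), round(v, 2))  # 98.62 0.47
```

Because the table has 3 rows and 2 columns, min(r − 1, c − 1) is 1, so V here is just the square root of χ²/n.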





Summary

A contingency table displays counts and may include selected percentages. The totals for rows and columns of the table give the marginal distributions of the two variables. Individual rows and columns of the table show the conditional distribution of one variable given a label of the other. If the conditional distribution of a variable differs from its marginal distribution, the two variables are associated. Segmented bar charts and mosaic plots are useful for seeing association in a contingency table. A lurking variable offers another explanation for the association found in a table. A lurking variable can produce Simpson’s paradox; the association in the table might be the result of a lurking variable rather than the two that define the rows and columns. Chi-square and Cramer’s V are statistics that quantify the degree of association.

Key Terms

association
chi-square
contingency table
    cell
    margin
Cramer’s V
distribution
    conditional
    marginal
lurking variable
mosaic plot
mutually exclusive
segmented bar chart
Simpson’s paradox

Formulas

Chi-square

The key step in computing chi-square is to obtain the table of artificial counts that are expected were there no association. The marginal counts of the artificial table match those of the data. The conditional distributions within the rows and columns of the artificial table must be consistent with these, but not show any association. A formula shows how to compute the cells of the artificial table. Let row_i denote the marginal frequency of the ith row (the number of observations in this row), and let col_j denote the marginal frequency of the jth column (the number in this column). If there is no association between the two variables, then we expect to find

\[
\text{expected}_{i,j} = \frac{\text{row}_i \times \text{col}_j}{n}
\]

cases in the jth cell of the ith row. A spreadsheet is helpful to organize the calculations for larger tables. To find χ², sum the weighted, squared deviations between expected_{i,j} and the observed counts observed_{i,j}. Using the summation notation introduced in Chapter 4, the formula for chi-square is compactly written like this:

\[
\chi^2 = \sum_{i,j} \frac{\left(\text{observed}_{i,j} - \text{expected}_{i,j}\right)^2}{\text{expected}_{i,j}}
\]

where the sum extends over all of the cells of the table.
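As a quick check of the formula for expected counts, take the file-sharing table, whose two row totals are both 100 and whose column totals are 80 and 120, with n = 200. This worked instance is our own illustration; the subscripts follow the notation above.

\[
\text{expected}_{1,1} = \frac{100 \times 80}{200} = 40,
\qquad
\text{expected}_{1,2} = \frac{100 \times 120}{200} = 60
\]

The second row has the same total as the first, so its expected counts are also 40 and 60. These are the cells of the artificial table used in the chi-square calculation for that survey.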

Cramer’s V

\[
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}}
\]

for a table with r rows and c columns that summarizes n cases.

Best Practices

• Use contingency tables to find association between categorical variables. You cannot see the association in the separate bar charts. It only becomes evident when you look at the table and compare the conditional distributions to the marginal distributions.

• Consider the possibility of lurking variables. Before you interpret the association you find between two variables, think about whether there is some other variable that offers a different explanation for your table. Are the data in the columns or rows of your table really comparable, or might some other factor that’s not evident explain the association that you see?

• Recognize association. Association means that some of the variation in a variable can be described, or explained, by knowing another. By identifying this type of association, you can make choices that fit together.

• Use plots to show association. Segmented bar charts and mosaic plots are useful for comparing relative frequencies in larger tables. Adjacent pie charts are another choice, but these can make it hard to compare percentages unless the differences are large.

• Exploit the absence of association. If the two categorical variables are not associated, the variation is self-contained in each of the variables, and you do not need the complexity of a table.

Pitfalls

• Don’t interpret association as causation. You might have found association, but that hardly means that you know why values fall in one category rather than another. Think about the possibility of lurking variables.

• Don’t put too many numbers in a table. Computers make it easy to decorate a table with too many percentages. Choose just the ones that you need, those that help you answer the question at hand.

About the Data

The Amazon data in this chapter (and Chapter 3) come from ComScore, a firm that monitors the web-browsing habits of a sample of consumers around the country. The data on airline arrivals in the 4M example of Simpson’s paradox are from the web site of the Bureau of Transportation Statistics. (From the main page, follow the links to data that summarize information about various types of travel in the US.) We used arrival data for January 2006. The data for kitchen preferences are a subset of RECS, the Residential Energy Consumption Survey performed by the Department of Energy. The example of attitudes toward file sharing is from a story in The Daily Pennsylvanian, the student newspaper at the University of Pennsylvania.

Software Tips

Excel

Excel has a powerful feature for producing contingency tables, but you need to master its concept of pivot tables to get them. If you want to stay with Excel for all of your computing, then it’s probably worth the effort. Start by reading the help files produced by searching for pivot tables from the help menu.

Once you have the contingency table, it’s not too hard to compute the value of chi-square and Cramer’s V using typical formula manipulations. We find it easiest to build a table of expected counts (under independence), then subtract this table from the observed table and square each cell. Adding up the squared deviations divided by the expected counts gives chi-square.

Minitab

To obtain the contingency table, follow the menu items

Stat > Tables > Cross-Tabulation and chi-square

and fill in the dialog with the names of two categorical variables. Pick one variable to identify the rows of the table and the other for the columns. (Layers allow you to produce tables such as Table 5-8 that show a separate table for each value of a third variable.) Options also produce intermediate steps in the calculation of chi-square, such as the contribution from each cell to the total. It’s an easy calculation to convert chi-square to Cramer’s V.

JMP

Follow the menu commands

Analyze > Fit Y by X

and pick one categorical variable for the Y variable and one for X. The variable chosen for Y identifies the columns of the contingency table and the variable identified as X identifies the rows of the contingency table. By default, the output from JMP shows the mosaic plot. The pop-up menu produced by clicking on the red triangle beside the header “Contingency Table” in the output window allows you to modify the table by removing, for instance, some of the shown percentages.

The value of the chi-square statistic appears below the table in the section of the output labeled “Tests.” The value of chi-square is labeled Pearson in the output. (There are variations on how to compute the chi-square statistic.)

Source       DF     -LogLike    RSquare (U)
Model         5      8.28511    0.0144
Error      1221    567.79052
C. Total   1226    576.07562
N          1231

Test                ChiSquare   Prob>ChiSq
Likelihood Ratio       16.570       0.0054
Pearson                16.056       0.0067

Once you have chi-square, use the formula given in the text to obtain Cramer’s V.
