Upload
amisha2562585
View
14
Download
0
Embed Size (px)
DESCRIPTION
bn yu
Citation preview
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 1/24
5 Association between
Categorical VariablesCONTINGENCY TABLES 5-3 MARGINAL AND CONDITIONAL DISTRIBUTIONS 5-4 SEGMENTED BAR CHARTS AND MOSAIC PLOTS 5-5
LURKING VARIABLES AND SIMPSON’S PARADOX 5-11 CHI-SQUARE: A MEASURE OF ASSOCIATION 5-14 CRAMER ’S V: I NTERPRETING THE STRENGTH OF ASSOCIATION 5-17 CHECKLIST: CHI-SQUARE AND CRAMER ’S V 5-18 SUMMARY 5-21
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 2/24
2/7/2008 5 Categorical Association
5-2
Amazon spends millions of dollars on advertising.That sounds like a lot until you think about the sizeof the Internet. Busy portals like Google and Yahoocharge plenty for the privilege of advertising on theirpages. Which locations deliver buyers?
The answer comes from understanding the variationin the categorical variable that indicates whether avisitor makes a purchase. Some buy, others onlybrowse. Suppose that everyone from Yahoo is abuyer, but no one from Deal Time. Knowing the linkthat attracted the shopper explains variation inbehavior and reveals the better location for ads.
Let’s focus on the choices faced by an advertisingmanager at Amazon. She has a budget to allocateamong three busy hosts: msn.com, recipesource.com,
and yahoo.com. Together, these three delivered 17,619 visits to Amazonduring the fall of 2002.
Host Visits
msn.com 7,258recipesource.com 4,283yahoo.com 6,078
Total shoppers 17,619
Table 5-1. Frequency table of the categorical variable that identifies shoppers from three hosts.
MSN generates the most visits, but more visits do not automaticallymean more sales. More visits do translate into higher costs, however.Amazon pays a fee for every visit, whether the shopper buys anythingor not. Hosts that generate many visits but few sales are costly toAmazon.
Should Amazon pay some hosts more than others for each shopper sentto amazon.com?
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 3/24
2/7/2008 5 Categorical Association
5-3
Contingency Tables
To discover whether some hosts are better than others, we have toconsider a second categorical variable, one that identifies the visits thatresult in a sale. Here are the bar charts.
Host Purchase
Figure 5-1. Bar charts of the hosts and purchase actions. The categorical variable Host identifies the originating site, thecategorical variable summarized in Table 5-1. Purchase indicates whetherthe session produced a sale. There’s precious little variation in Purchase.Only 516 visits, less than 3% result in a purchase. If every sale comesfrom one host, Amazon would know where to place its ads.
The bar charts in Figure 5-1 summarize each categorical variableseparately, but we need to consider them simultaneously. For instance,we need to separate visitors from MSN into those who made a purchaseand those who did not.
The most common arrangement of such counts organizes them in a
table. The rows of the table identify the levels of the one variable, andthe columns of the table identify the levels of the other. Such a table iscalled a contingency table. This contingency table shows the variablePurchase (along the rows) and the variable Host (along the columns).
Host
msn.comrecipesource.
comyahoo.
com Total
No 6973 4282 5848 17103Purchase
Yes 285 1 230 516
Total 7258 4283 6078 17619
Table 5-2. Contingency table of web shopping.
The cells of this contingency table count the visits for every combinationof Host and Purchase. The cells of the contingency table are mutuallyexclusive; each case appears in exactly one cell. For example, thecolumn labeled msn.com shows that 285 of the 7,258 visits from MSNgenerated a purchase. Of those from recipesource.com, only 1 of the4,283 visits led to a purchase.
contingency table
table that shows counts of
e cases of one categorical
riable contingent on the value
another.
mutually exclusive
e conditions that define
e cells in a contingency
ble allow a case to appear
only one cell. There’s no
uble-counting of cases.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 4/24
2/7/2008 5 Categorical Association
5-4
Marginal and Conditional DistributionsThe margins of Table 5-2 (shown in gray) give the total counts in eachrow and column. Because the cells of the table are mutually exclusive,the sum of the counts in the cells of the first column equals the totalnumber of visits from msn.com. The sum for each column appears in the
bottom margin of the contingency table; these sums match the frequencydistribution of Host shown in Table 5-1. The right margin shows thefrequency table of Purchase. Because these counts are typically placedalong the margins of a contingency table, the frequency distributions ofthe variables in the table are also called marginal distributions. The barcharts in Figure 5-1 show these distributions.
Percentages help us interpret a contingency table, but we’ve got to makea choice of which percentage to show. For example, 285 shoppers fromMSN made a purchase. To show this count as a percentage, we havethree choices:
285 is
1.62% of all 17,619 visits
3.93% of the 7,258 visits from MSN
55.23% of the 516 visits that made a purchase
" # $ % $
All are potentially interesting. Some statistics packages embellish acontingency table with every percentage, like this:
Host
CountTotal %Col %Row %
msn.com
Recipesource.com
Yahoo.com Total
No
6973
39.58
96.07
40.77
4282
24.30
99.98
25.04
5848
33.19
96.22
34.19
17103
97.07
Purchase
Yes
285
1.62
3.93
55.23
1
0.01
0.02
0.19
230
1.31
3.78
44.57
516
2.93
7258
41.19
4283
24.31
6078
34.50
17619
Table 5-3. Too many percentages clutter this contingency table.
Tables like this one give percentages a bad reputation. The table showstoo many percentages. Each cell lists the count along with percentages ofthe total, the column, and the row. While it’s fine to consider all of these,it’s better to choose the percentage that answers the relevant question.
Because the account manager at Amazon is interested in which hostproduces the highest proportion of purchasers, a better table shows onlythe counts and column percentages. Let’s start with msn.com. For themoment, we’re interested in only the 7,258 visits from msn.com in the 1st column of Table 5-2. The distribution of a variable that is restricted tocases satisfying a condition is called a conditional distribution. In atable, a conditional distribution refers to counts within a row or column.
msn.com6973
96.07%
285
3.93%
7258
marginal distributione frequency distribution of
variable in a contingency
ble given by counts of the
tal number of cases in rows
r columns .
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 5/24
2/7/2008 5 Categorical Association
5-5
By limiting our attention to visits from msn.com, we see the conditionaldistribution of Purchase conditional on the host being MSN.
The following contingency table shows the counts and the columnpercentages. The percentages within each column show the conditionaldistribution of Purchase for each host.
Host
CountCol %
msn.com
recipesource.com
yahoo.com Total
No 6973
96.07%
4282
99.98%
5848
96.22%
17103
97.07%Purchase Yes 285
3.93%
1
0.02%
230
3.78%
516
2.93%
7258 4283 6078 17619
Table 5-4. Contingency table with relevant percentages.
Compare this table to Table 5-3. Without the distraction of extraneouspercentages, we can quickly see that visitors from MSN and Yahoo yield
similar shares of purchases (3.93% and 3.78%, respectively). Incomparison, only one visitor from RecipeSource bought anything(0.02%).
We’ve just discovered that Host and Purchase are associated. Categoricalvariables are associated if the column percentages vary from column tocolumn (or if row percentages vary from row to row). In this case, theproportion of visits that produce a purchase differs among hosts. Theassociation between Host and Purchase means that knowing the hostchanges your impression of the chance of a purchase.
Variables can be associated to different degrees. The least association
occurs when the column percentages are identical. Overall, 516/17619 =2.93% of the visits made a purchase. If 2.93% of visits from every hostmade a purchase, then the chance of a purchase would not depend onthe host. Each conditional distribution of Purchase given Host wouldmatch the marginal distribution of Purchase.
That’s not what happens: Host and Purchase are associated. Visitors fromsome hosts are more likely to make a purchase. Because Host andPurchase are associated, the account manager at Amazon might bewilling to pay more for visits from MSN or Yahoo and less for thosefrom Recipe Source. The value of the visit depends on the host.
Segmented Bar Charts and Mosaic PlotsBar charts of marginal distributions like those in Figure 5-1 don’t revealassociation, but other charts do. For example, Amazon locateswarehouses near large concentrations of shoppers to reduce shippingcosts. Being close makes it cheaper to offer free shipping. Thiscontingency table shows the counts of Purchase by Location over a widerrange of hosts.
conditional distribution
The distribution of a
variable restricted to cases
that satisfy a condition,
such as those in a row of
column of a contingency
table.
association
wo categorical variables
e associated if the
nditional distribution ofe variable depends on the
ue of the other.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 6/24
2/7/2008 5 Categorical Association
5-6
Location
NorthCentral North East South West Total
No 5640
24.46
4450
19.30
8321
36.09
4645
20.15
23056
Purchase
Yes 161
24.66
146
22.36
177
27.11
169
25.88
653
Total 5801 4596 8498 4814 23709
Table 5-5. Contingency table of purchases organized by region.
Because we’re interested in discovering where those who make apurchase live, this table shows row percentages. With four percentagesin each conditional distribution, it becomes helpful to have a plot. Asegmented bar chart divides bars in a bar chart proportionally intosegments corresponding to the percentage in each group. If the bars lookidentical, then the variables are not associated.
Although the South sends the largest number of visitors, visitors fromthe South are more likely to browse rather than buy. About 36% of thebrowsers come from the South, but only 27% of the buyers. If Purchase
and Region were not associated, then these percentages should be aboutthe same. Because they differ, Region is associated with making apurchase. You can see the differences in this segmented bar chart. Theyellow segment identifying visits from the South makes up a largershare among those who don’t purchase (on the top) than among thosewho do make a purchase.
Figure 5-2. Segmented bar chart.
Be careful interpreting a segmented bar chart. This chart comparesrelative frequencies of two conditional distributions. Because these arerelative frequencies rather than counts, the bars do not represent thesame number of cases. The bar on the top summarizes 23,056 cases
whereas the bar on the bottom summarizes 653 purchases. The chartobeys the area principle, but the area is proportional to the percentages
within each row of the table.
Segmented bar charts frequentlyappear in news items such as thisone from the New York Times.1
1 “Economists debate the quickest cure,” The New York Times, January 19, 2008.
segmented bar chart
A bar chart that divides
the bars into shares
based on a second
categorical variable.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 7/24
2/7/2008 5 Categorical Association
5-7
At the time, the government was debating the use of tax cuts to stimulateconsumer spending and avoid a recession. The chart shows that theconditional distribution within the bars is changing, meaning there’sassociation. In this case, we can see that households with smaller incomesare more likely to use tax rebates to pay down debt than households withhigher incomes.
A mosaic plot is an alternative to the segmented bar chart. A mosaic plotshows tiles, colored rectangular regions, that represent the counts in eachcell of a contingency table. The layout of the tiles matches the layout of thecells in a contingency table, and the sizes of the tiles are proportional to thecounts in each cell. The tiles within a column have the same width, butpossibly different heights. The widths of the columns are proportional tothe marginal distribution of the variable positioned on the bottom of thetable. For example, this figure shows the mosaic plot of the data in Table5-5. The tiny height of the red tiles in the second row show the counts ofpurchases; their tiny sizes remind you how rare it is to find a purchase
among the visits.
Figure 5-3. Mosaic p lot of the purchases by region.
Overall, the South contributes the most. These are the widest tiles in theplot. Because purchases are so rare, however, it’s hard to see in the mosaicplot that the share of purchases is smaller for visitors from the South.
Mosaic plots are much more useful for seeing dependence in data forwhich the relative frequencies do not get so small. As an example, thefollowing table shows counts of sales of shirts at a men’s clothing retailer.Do Size and Style appear associated? If the two are not associated,managers should order the same proportion of sizes in every style. If thetwo are associated, the distribution of sizes varies from style to style.
mosaic plot
A tiled plot in which the
size of each tile is
proportional to the count in
a cell of a contingency table.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 8/24
2/7/2008 5 Categorical Association
5-8
Style
ButtonDown
Polo Small Print
Small 18 27 36 81 Medium 65 82 28 175 Size
Large 103 65 22 190 186 174 86 446
Table 5-6. Sales of shirts at a men’s clothing retailer.
It’s hard to see the association quickly in this table of counts, but amosaic plot makes the association very clear.
Figure 5-4. Mosaic p lot of the shirt sales shows association between Size and Style.
The tiles would line up in the absence of association. In this example, theproportions of sizes vary across the styles, causing the tiles to vary inheight. The irregular heights indicate that these variables are associated.
Small sizes are much more prevalent among beach prints than thebutton-down shirts. Because the mosaic plot respects the area principle,we can also see that the button-down style is the biggest seller overall(these tiles are wider than the others) and the beach-print style is thesmallest seller.
4M: Where’s my car?
2002 Dodge Intrepid 2002 Toyota Camry
Auto theft costs owners and insurance companies billions of dollars. TheFBI estimates that 1.2 million cars worth $8.4 billion were stolen in 2002.
Should insurance companies charge the same premium for theftinsurance or should they vary the premium? Obviously, a policy thatinsures a $90,000 Porsche costs more than one for a $15,000 Hyundai.But should the premium be a fixed percentage of the car’s value, or
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 9/24
2/7/2008 5 Categorical Association
5-9
should the percentage vary from model to model? To answer thisquestion, we need to know whether some cars are more likely to bestolen than others. It comes down to whether there is an associationbetween car theft and car model.
For this example, it’s up to you to decide whether an insurance company
should charge a fixed percentage of the price to insure against theft. Thedata come from the National Highway Traffic Safety Administration(NHTSA). We picked seven popular models.
Motivation What questions would you
like to answer?My company deciding whether to base premiums
for theft insurance on the chance that the car is
stolen. We can either charge a fixed percentage of
the replacement cost, or charge a variable
percentage for cars that are stolen more often.
Are there large differences in the rates of theft?
Method Identify the variables and report
the W’s. Be certain that the data are
counts and that the categories do notoverlap so that no individual is countedtwice. Also indicate what you intend to dowith these data.
My data from NHTSA give the number of cars
stolen for seven 2002 models. If thefts are
associated with the model, then we should vary therate.
I’ll judge the association by seeing whether the
percentage of cars stolen varies by model.
Mechanics Make an appropriate displayor table to see whether there is adifference in the relative proportions.
Notice that we did not add the missingcolumn for “not stolen”. If you do that,you’ll see that the number made is themarginal (row) total. For example,
Model Stolen Not stolen Total
Intrepid 1657 109834 111491
This table shows the data along with the
percentage of each model that is stolen.
Model Stolen Made Pct Stolen
Chevrolet Cavalier 1017 259230 0.392
Dodge Intrepid 1657 111491 1.486
Dodge Neon 959 119253 0.804
Ford Explorer 1419 610268 0.233Ford Taurus 842 321556 0.262
Honda Accord 702 419398 0.167
Toyota Camry 1027 472030 0.218
Among these models, the Dodge Intrepid has the
highest percentage stolen (1.486%), followed by
the Dodge Neon (0.804%). The Honda Accord has
the least (0.167%).
Method
Mechanics
Motivation
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 10/24
2/7/2008 5 Categorical Association
5-10
Message Discuss the patterns in the
table and displays.
A segmented bar chart is less useful
because the percentages stolen are sosmall.
Some models (e.g., Dodge Intrepid) are more likely
to be stolen than others (e.g., Honda Accord).
About 1.5% of 2002 Intrepids were stolen,
compared to less than 0.17% of 2002 Accords.
A lot of Accords get stolen, but that’s explained
by the sheer number of Accords sold each year.
If you can, discuss possible real-worldconsequences.
We should charge higher premiums for theft
insurance for models that are most likely to be
stolen. A Dodge Intrepid is 7 times as likely to be
stolen as a Toyota Camry.
Customers who buy an Intrepid (which costs about
the same as a Camry) should pay a higher premium
for theft insurance.
Are You There?An on-line questionnaire asked visitors to a retail web site if they wouldlike to join a mailing list. This contingency table summarizes the countsof those who join as well as those who made a purchase.
Mailing List Join Decline
Yes 52 12Purchase
No 343 3720
The columns indicate whether the visitor signed up ( Mailing list = Join orDecline), and the rows indicate whether the visitor made a purchase(Purchase = Yes or No). For example, 52 visitors joined the mailing listand made a purchase.
(a) Find the marginal distribution of Purchase.2
(b) Find the conditional distribution of Purchase given whether thecustomer signed up or not. Do the conditional distributions differ?3
(c) Does a segmented bar chart provide a helpful plot for these data?4
(d) Is the variable Purchase associated with the variable Mailing List?5
2 The row totals determine the marginal distribution, 64 “Yes” who made a purchase and 4,063 “No”.3 Among the 395 who join the list, 52 make a purchase (13%). Among those who decline, 12 out of 3,732 make a
purchase (0.32%). Customers who join are more likely to make a purchase . These are rather different.4 Not really, because one percentage is so small. You could show a figure like that in the prior 4M.5 The two are associated (dependent) because the conditional distributions differ. The chance that a customer makes a
purchase depends on whether they sign up.
Message
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 11/24
2/7/2008 5 Categorical Association
5-11
Lurking Variables and Simpson’s ParadoxAssociation gets confused with causation. This mistake can lead toserious errors in judgment. Consider the following contingency table.
ServiceCountColumn % Orange
Arrow
Brown
Box TotalDamaged
4515%
6633%
11122.2%Status
OK 255
85%
134
67%
389
87.8%
300 200 500
Table 5-7. Counts of damaged packages.
This contingency table shows the number of cartons that were damagedwhen shipped by two delivery services. The percentages in each cell arecolumn percentages. Overall, 22.2% of the 500 cartons arrived withvisible damage. Conditionally, 15% of cartons shipped via OrangeArrow arrived damaged compared to 33% for Brown Box. There’sdefinitely association; neither conditional distribution matches themarginal distribution of Status.
Table 5-7 suggests that Orange Arrow is the better shipper, and wemight be tempted to believe that cartons are more likely to arriveundamaged because they are shipped on Orange Arrow. If we believethat, we might decide to ship everything on Orange Arrow. Before wedo that, however, we better make sure that this table offers a faircomparison. Maybe there’s another explanation for why packagesshipped on Brown Box are damaged more often.
To think of an alternative explanation, we have to know more aboutthese packages. In this instance, the cartons hold car parts. Some cartonshold engine parts whereas others hold plastic molding. Guess whichcartons are heavier and more prone to damage? The next two tablesseparate the counts in Table 5-7 into those for heavy cartons (left table)and those for light cartons (right table).
Heavy Light
Service ServiceCountColumn % Orange
ArrowBrown
Box TotalOrange
ArrowBrown
Box Total
Damaged 20
67%60
40%80
44.4%25
9%6
12%31
9.7%Status
OK 1033%
9060%
10055.6%
24591%
4488%
38990.3%
30 150 180 270 50 320
Table 5-8. Separate tables for heavy and light packages.
Orange Arrow is no longer clearly the better shipper. Among heavypackages, 67% of those shipped on Orange Arrow arrived damagedcompared to 40% for Brown Box. For light cartons, 9% of those shippedby Orange Arrow arrive damaged compared to 12% for Brown Box.
lurking variable
A concealed variable that
affects the apparent
relationship between two
other variables.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 12/24
2/7/2008 5 Categorical Association
5-12
The initial comparison favors Orange Arrow because it handles a highershare of light packages. Brown Box seems more likely to damagepackages because it handles a greater proportion of heavy cartons.Heavy cartons more often arrive with some damage (44.4% versus9.7%). Table 5-7 presents a misleading comparison; it compares howwell Orange Arrow ships light cartons to how well Brown Box ships
heavy cartons. The weight of the cartons is a hidden, lurking variable.Table 5-8 adjusts for the lurking variable by separating the data intoheavy cartons and light cartons.
Such reversals often go by the name Simpson’s paradox. It can seemsurprising – downright paradoxical – that one service looks better overall,but the other looks better when we restrict the comparison. Theexplanation lies in recognizing the presence of a lurking variable. Beforeyou act on association (like sending all the business to Orange Arrow), besure to identify the effects of lurking variables.
One of the best-known examples of Simpson’s paradox occurred when U.C. Berkeley was sued for bias against women applying to graduate schoolWhen data from all of the graduate programs at Berkeley were pooled, theadmission rate for men was much higher than that for women. However,it was discovered that the rate was very similar in every department. Infact, most departments had a small bias in favor of women. The explanationfor the apparent overall bias was that women tended to apply todepartments that had many applicants and therefore low rates ofadmission. Men, on the other hand, tended to apply to departments suchas mathematics that had fewer applicants and higher admission rates. (P. J. Bickel, E. A. Hammel, and J. W. O'Connell (1975). Sex Bias in Graduate
Admissions: Data from Berkeley. Science 187:4175, pp. 398 – 404.)
4M Picking an Airline
Which airline is more likely to get you to a meeting on time, US Airways
or Delta? The following table summarizes 10,906 arrivals at four airportsserved by both carriers: Boston, Orlando, Philadelphia, and San Diego.
AirlineCountColumn % Delta US
Airways Total
On Time 2596
80%
5966
78%
8562
Arrival
Delayed 659
20%168522%
9667
Total 3255 7651 10906
Simpson’s paradox occurs
when the direction of an
association between two
variables is reversed when a
third variable is controlled.
The effect was named after
E. H. Simpson in who
compiled entertaining
examples in 1951, but it wascarefully described much
earlier by G. U. Yule (1871-
1951 in 1903.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 13/24
2/7/2008 5 Categorical Association
5-13
Table 5-9. Airline arrivals.
This table suggests that the two airlines perform comparably, with aslight edge to Delta with 80% on time compared to 78% for US Airways.Before you book a flight on Delta, however, you should think aboutwhether there’s a lurking variable.
Motivation List the questions thatyou would like to answer, and statethe implications.
My business regularly takes me to these four destinations.Does it matter which airline I choose, Delta or US Airways?
A late arrival might cause me to miss a meeting with a client.
Method Identify the variables and
your data. Describe your plan for the analysis.
Both variables are categorical: the airline and the arrival
status. Data are from the Bureau of Transportation
Statistics.
Contingency tables answer my questions. I’d better think
about lurking variables. A possible lurking variable behind
Table 5-1 is the destination of the flight. This table
combines the status for all four destinations.
Mechanics I’ll form a contingency table that isolates flights into one
destination: Orlando. Here’s the table.
CountCol % Delta
USAirways Total
Delayed 228
19.5%150
15.5%378
OnTime
94080.5%
82084.5%
1760
1168 970 1468
For flights to Orlando, US Airways is the better choice. In
fact, no matter which destination, US Airways has a higher
percentage of on-time arrivals.
On Time% Delta
US
Airways
Boston 80.1% 81.7%
Orlando 80.5% 84.5%
Philadelphia 70.5% 74.3%
San Diego 84.2% 85.4%
Message Discuss the patterns inthe table and displays.
I’ll book a flight on US Airways. No matter which destination,
US Airways is more likely to arrive on time.
It’s worthwhile to review why Delta appears better overall, even thoughUS Airways arrives on time more often for each destination. The initialtable, Table 5-9, masks a lurking variable: destination. The destinationmatters: delays are more common at Philadelphia.
DestinationCountCol % Boston Orlando Philadelphia San Diego Total
Delayed 61519%
37818%
123026%
12115%
2344
Arrival
On Time 262081%
176082%
350574%
67785%
8562
Total 3235 2138 4735 798 10906
Table 5-10. Delayed arrivals by destinations.
Method
Mechanics
Message
Motivation
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 14/24
2/7/2008 5 Categorical Association
5-14
In addition, most of these flights on US Airways go to Philadelphia,whereas most on Delta go to Boston.
DestinationCountRow % Boston Orlando Philadelphia San Diego Total
Delta 140943%
116836%
31210%
36611%
3255
Airline
US Airways 1826
24%
970
13%
4423
58%
432
6%
7651
Total 3235 2138 4735 798 10906
Table 5-11. Airlines by destinations.
The initial table (Table 5-9) answers a strange question: “Am I morelikely to arrive on time flying to Boston on Delta or arrive on time inPhiladelphia on US Airways?” The answer: take Delta to Boston. There’snothing wrong with that answer – it’s just an odd question. By focusingthe analysis on flights into a specific destination, we control for thislurking variable and answer the right question.
Once you identify a lurking factor, you can remove its effects as we didin this example. But here’s the hard part: How do you know whetherthere is lurking factor? It’s easy to imagine other lurking factors, too.Maybe it’s the type of airplane, the day of the week, or the time of day.Make no mistake about it. You need to understand the context of yourdata to find a lurking factor.
Chi-Square: A Measure of AssociationIn the first example, we concluded that Purchase and Host are associatedbecause the proportion of visitors who make purchases differs from hostto host. How different are they?
Rather than leave it to subjective judgment, it’s useful to have a statisticthat quantifies the amount of association. Instead of saying “There’ssome association” or “There’s a lot of association”, the statistic calledchi-square (pronounce “chi” as ki ) measures of the degree ofassociation. The larger chi-square becomes, the larger the amount ofassociation. This statistic also offers a preview of an approach frequentlytaken in statistics. To quantify the degree of association, we compare thedata we observe to artificial data that have none. Chi-square measuresassociation in a contingency table by comparing the observedcontingency table to an artificial table that has no association. If the
tables are similar, then there’s not much association. The larger thedifference between the tables, the larger the association.
We’ll illustrate the use and calculation of chi-square with an example. Arecent poll asked 200 people at a university about their attitudes towardsharing copyrighted music. Half of the respondents were students andthe other half were staff at the university (administrators or faculty).This table summarizes the counts.
Attitude to Sharing
chi-square
A statistic that measures
association in a contingency
table; larger values of chi-
square indicate more
association.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 15/24
2/7/2008 5 Categorical Association
5-15
Attitude to Sharing
Staff 30 70 100 Group
Student 50 50 100
Total 80 120 200
Table 5-12. Attitudes toward sharing copyright materials.
Overall, 40% (80 of 200) of those questioned thought it was OK to sharecopyrighted music. That’s the marginal percentage. Each rowdetermines a conditional distribution of the attitude, one for staff andone for students. Only 30% of the staff thought it was OK to share,compared to 50% of students. Because the row percentages differ, Group and Attitude are associated.
To quantify the amount of association, we need a benchmark forcomparison, a point of reference. For that, consider what Table 5-12would look like if there were no association. To figure this out, pretendthat we know the marginal totals, but not the counts within the table:
Attitude to SharingOK Not OK Totals Staff ? ? 100 Group
Student ? ? 100
Totals 80 120 200
Table 5-13. What goes in these cells if the variables are not associated?
Overall, ½ of the respondents are staff and ½ are students. Were Group and Attitude not associated, then ½ of the cases in each column would bestaff and ½ would be students. We would expect the table to look likethis:
Attitude to Sharing
OK to Share Not OK Totals Staff 40 60 100 Group
Students 40 60 100
Totals 80 120 200
Table 5-14. Artificial table with cells that we would expect were Group and Attitude not associated.
Chi-square measures the distance between the cells in the real table andthose in the artificial table. We first subtract the values in the cells; useonly the cells, not the margins. The differences in the counts are:
Real Data
30 70
50 50
– Artificial
40 60
40 60
= Difference
-10 10
10 -10
Table 5-15. Deviations from the original counts.
Next, we combine the differences. If we add them, we get zero becausethe negative and positive values cancel. We had this problem withcancellation when we defined the variance s2 in Chapter 4. We’ll solvethe problem as we did then: square the differences before we add them.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 16/24
2/7/2008 5 Categorical Association
5-16
When the squared deviations are added, chi-square assigns some ofthem larger weight. Look at the differences in the first row. Both are 10,but the difference in the first column is larger relative to what weexpected than the difference in the second column (10 out of 40compared to 10 out of 60). Rather than treat these the same, chi-squareassigns more weight to the first. After all, saying 40 and finding 30 is a
larger proportional error than saying 60 and finding 70. To give moreweight to larger proportional deviations, we divide the squareddeviations by the expected values in the artificial table.
The chi-square statistic is the sum of these weighted, squareddifferences. For this table, chi-square, denoted in formulas as χ
2, is
" 2=
30 # 40( )2
40+
70 # 60( )2
60+
50 # 40( )2
40+
50 # 60( )2
60
=
#10( )2
40+
10( )2
60+
10( )2
40+
#10( )2
60
= 2.5 +1.67 + 2.5 +1.67
= 8.33
Chi-square has another similarity to s2: it’s hard to interpret. The valueof chi-square depends on n, the total number of cases, and the size of thetable. The larger the table, the larger chi-square becomes.
Are You There?Here’s the contingency table from the prior AYT, including the marginaltotals.
Mailing List
Join Decline Total Yes 52 12 64
PurchaseNo 343 3720 4063
Total 395 3732 4127
(a) Chi-square requires the artificial table of counts. What count wouldbe expected in the highlighted cell for those who join the mailing list andmake a purchase if Purchase and Mailing List are not associated?6
(b) What is the contribution to chi-square from the cell for those who join the mailing list and make a purchase?7
(c) The value of chi-square for this table is χ 2
≈ 385.9. Does your answerto “b” reveal which cell produces the largest contribution to chi-square?8
6 If the two variables are not associated, then the percentage who make a purchase among those who join ought to be the
same as the percentage in the margin of the table, which is 64/4127 or about 1.55%. The expected count in the first cell
is then 395 × 64/4127 ≈ 6.126.7 Subtract the expected count from (a) from the observed count to get the deviation. Then square the deviation and
divide by the expected count. The contribution is (52-6.126)2/6.126 ≈ 343.5.8 Each summand that goes into χ
2is positive, so most of it is coming from the first cell. The big deviation from the
artificial table is the large count in the first cell.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 17/24
2/7/2008 5 Categorical Association
5-17
Cramer’s V: Interpreting the Strength of AssociationThe value of chi-square for the example of music sharing is 8.33 whereaschi-square for the AYT exercise is 385.9. Is there much more associationin the second example, or is chi-square larger because n = 4,127 in thesecond table compared to n = 200 in the first? A more interpretable
statistic allows comparisons of the amount of association across tables.To remove the effects of n and the size of the table, Cramer’s V adjustschi-square so that the resulting measure of association lies between 0and 1. If V = 0, the variables are not associated. If V = 1, they are perfectlyassociated. If V < 0.25, we will say that the association is weak. If V >0.75, we will say that it’s strong. In between, we will say there ismoderate association.
To find Cramer’s V , divide χ 2 by the product of the number of cases
times the smaller of the number of rows minus 1 or the number ofcolumns minus 1 and take the square root. The formula for Cramer’s V
is simpler than words. As usual, n stands for the total number of cases,and let r be the number of rows and c the number of columns. Theformula for Cramer’s V is
V = "
2
nmin(r #1,c #1)
If V = 0, the two categorical variables are not associated. If V = 1, the twovariables are perfectly associated. If variables are perfectly associated,you can guess one once you know the value of the other. For the surveyof file sharing, χ
2 = 8.33 and both r and c are 2 and n = 200. Hence,
V = " 2
200 min(2 #1,2 #1)=
8.33
200$ 0.20
There’s association, but it’s weak. Staff and students have differentattitudes toward file sharing, but the differences are not very large. Forthe AYT example, χ
2 = 385.9, n = 4127, and r = c = 2. In this case,
V = "
2
nmin(r #1,c #1)=
385.9
4127$ 0.31
There is indeed more association in this example than in the example offile sharing, but not that much more. The huge difference between the
values of chi-square is a consequence of the difference in sample sizes,not the degree of dependence.
What does a table look like when there is strong association? Strongassociation implies very large differences among row or columnpercentages of a table. Suppose the survey results had turned out asshown in this table:
Cramer’s V is named
after the influential
Swedish mathematician
and statistician, Carl
Harald Cramér (1893-
1985). Cramér is best
known for his work inprobability and risk. He
found many real-life
applications for his work,
especially in the insurance
tip
Cramer’s V
A statistic derived from
chi-square that measures
the association in a
contingency table on a
scale from 0 to 1.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 18/24
2/7/2008 5 Categorical Association
5-18
OK to Share Not OK Totals Staff 0 100 100 Students 80 20 100
Totals 80 120 200
Table 5-16. A table with strong association.
No staff thought it was OK, compared to 80% of the students. You’dexpect arguments between staff and students about sharing materials onthis campus. Let’s find χ
2 and Cramer’s V for this table. The margins ofTable 5-16 are the same as those in the original contingency table, so thecalculation of χ
2 is similar. We just need to replace the original counts bythose in Table 5-16.
" 2=
0# 40( )2
40+
100#60( )2
60+
80# 40( )2
40+
20#60( )2
60
= 40+ 26.67 + 40+ 26.67
= 133.33
Cramer’s V indicates strong association between the variables:
V =133.33
200= 0.816
The size of Cramer’s V indicates that you can almost predict exactlywhat a respondent will say if you know whether the respondent is onthe staff or is a student. If you know a person is a staff member, then youknow their attitude toward sharing files. Every member of the staff saythat file sharing is not OK. Among students, 80% say that it’s OK toshare.
Checklist: Chi-square and Cramer’s VChi-square and Cramer’s V measures association between two categorical variables that define a contingency table. Before you use these, verifythat your data meet these prerequisites.
✓ Categorical variables. If a variable is numerical, there are betterways to measure association.
✓ No obvious lurking variables. A lurking variable means that theassociation you’ve found is the result of some other variable inthat’s not shown.
4M Real EstateA developer needs to pick heating systems and appliances for newlybuilt single-family homes. If the house has electric heat, it’s cheaper toinstall electric appliances in the kitchen. If the home has gas heat, gasappliances make sense in the kitchen. If he’s limited to gas or electricheating, how many of each should he offer? Does everyone who heatswith gas prefer to cook with gas as well?
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 19/24
2/7/2008 5 Categorical Association
5-19
The builder checked the preferences of 447 homes in the area. For each,his data give the type of fuel used for cooking and the type used forheating. It’s your job to use these data to answer the two questions.
Motivation State the
questions that you would like to
answer and mention the
implications.
The builder wants to configure homes that match
the demand for gas or electric heat. He also has
to decide the types of appliances customers wantin kitchens. If there’s little association, then the
developer needs a wider mix of configurations.
Method Identify the variables
and data. Describe your planfor the analysis.
The data are two categorical variables with 447
rows. The rows are homes that heat with gas or
electric. The variables are the type of fuel used
for heating and the type used for cooking.
I’ll generate a contingency table and compare the
conditional distributions to the marginal
distribution. There’s association if these are
different. Mechanics About 2/3 heat with natural gas (298/447) and
1/3 with electricity. This contingency table shows
column percentages. These give the conditional
distributions of cooking fuel given the type of
fuel used for heating.
Cooking Fuel By Fuel Heat Home
CountCol %
ElectricHeat
GasHeat Total
Electricity 13691.28
13645.64
272
Natural
Gas
10
6.71
162
54.36
172
Other 30.20
00.00
3
149 298 447
There’s association. Among homes with electric
heat, 91% cook with electricity. Only 46% of
those who heat with gas use electricity to cook.
To quantify the strength of the association,
χ 2=98.62 and V = sqrt(98.62/(447 × 1)) ≈ 0.47.
That’s moderate association.
Message Summarize your key results and displays.
If you can answer the
questions, state your answer directly.
Be honest. If you have some
Homeowners prefer natural gas to electric heat
by 2 to 1. Of those with electric heat, 90% cook
with electricity. Of those with gas heat, 46%
cook with electricity.
These findings suggest building 2/3 of the homes
with gas heat and the rest with electric heat. Of
those with electric heat, keep it simple and
install an electric kitchen. For those with gas,
put an electric kitchen in half and gas in the rest.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 20/24
2/7/2008 5 Categorical Association
5-20
reservations, mention themhere. There’s a big caveat, however. I’ve assumed that
new buyers are looking for the same things in a
home that these residents have – a big if.
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 21/24
2/7/2008 5 Categorical Association
5-21
Summary
A contingency table displays counts and may include selectedpercentages. The totals for rows and columns of the table give themarginal distributions of the two variables. Individual rows and
columns of the table show the conditional distribution of one variablegiven a label of the other. If the conditional distribution of a variablediffers from its marginal distribution, the two variables are associated.Segmented bar charts and mosaic plots are useful for seeing associationin a contingency table. A lurking variable offers another explanationfor the association found in a table. A lurking variable can produceSimpson’s paradox; the association in the table might be the result of alurking variable rather than the two that define the rows and columns.Chi-square and Cramer’s V are statistics that quantify the degree ofassociation.
Key Terms association, 5-5
chi-square, 5-14, 5-16
contingency table, 5-3 cell, 5-3
margin, 5-4
Cramer’s V , 5-17
distribution
conditional, 5-4
marginal, 5-4
lurking variable, 5-12 mosaic plot, 5-7
mutually exclusive, 5-3
segmented bar chart, 5-6
Simpson’s paradox, 5-12
FormulasChi-square
The key step in computing chi-square is to obtain the table of artificial countsthat are expected were there no association. The marginal counts of theartificial table match those of the data. The conditional distributions within therows and columns of the artificial table must be consistent with these, but notshow any association. A formula shows how to compute the cells of theartificial table. Let rowi denote the marginal frequency of the ith row (thenumber of observations in this row), and let col j denote the marginal frequencyof the jth column (the number in this column). If there is no association betweenthe two variables, then we expect to find
expectedi , j =rowi " col j
n
cases in the jth cell of the ith row. A spreadsheet is helpful to organize thecalculations for larger tables. To find χ
2, sum the weighted, squared deviationsbetween expectedi,j and the observed counts observedi,j. Using the summationnotation introduced in Chapter 4, the formula for chi-square is compactlywritten like this:
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 22/24
2/7/2008 5 Categorical Association
5-22
" 2=
observedi , j # expectedi , j( )2
expectedi , ji , j
$
where the sum extends over all of the cells of the table.
Cramer’s V
V = " 2
nmin(r # 1, c# 1)
for a table with r rows and c columns that summarizes n cases.
Best Practices• Use contingency tables to find association between categorical varibles. You
cannot see the association in the separate bar charts. It only becomes evident whenyou look at the table and compare the conditional distributions to the marginaldistributions.
• Consider the possibility of lurking variables. Before you interpret the associationyou find between two variables, think about whether there is some other variablethat offers a different explanation for your table. Are the data in the columns orrows of your table really comparable, or might some other factor that’s not evidentexplain the association that you see.
• Recognize association. Assciation means that some of the variation in a variablecan be described, or explained, by knowing another. By identifying this type ofassociation, you can make choices that fit together.
• Use plots to show association. Segmented bar charts and mosaic plots are usefulfor comparing relative frequencies in larger tables. Adjacent pie charts are another
choice, but these can make it hard to compare percentages unless the differencesare large.
• Exploit the absence of association. If the two categorical variables are notassociated, the variation is self-contained in each of the variables, and you do notneed the complexity of a table.
Pitfalls• Don’t interpret association as causation. You might have found association, but
that hardly means that you know why values fall in one category rather thananother. Think about the possibility of lurking variables.
• Don’t put too many numbers in a table. Computers make it easy to decorate atable with too many percentages. Choose just the ones that you need, those thathelp you answer the question at hand.
About the DataThe Amazon data in this chapter (and Chapter 3) come from ComScore,a firm that monitors the web-browsing habits of a sample of consumersaround the country. The data on airline arrivals in the 4M example of
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 23/24
2/7/2008 5 Categorical Association
5-23
Simpson’s paradox is from the web site of the Bureau of TransportationStatistics. (From the main page, follow the links to data that summarizeinformation about various types of travel in the US.) We used arrivaldata for January 2006. The data for kitchen preferences is a subset ofRECS, the Residential Energy Consumption Survey performed by theDepartment of Energy. The example of attitudes toward file sharing is
from a story in The Daily Pennsylvanian, the student newspaper at theUniversity of Pennsylvania.
Software TipsExcel
Excel has a powerful feature for producing contingency tables, but youneed to master its concept of pivot tables to get them. If you want to staywith Excel for all of your computing, then it’s probably worth the effort.Start by reading the help files produced by searching for pivot tablesfrom the help menu.
Once you have the contingency table, it’s not too hard to compute thevalue of chi-square and Cramer’s V using typical formulamanipulations. We find it easiest to build a table of expected counts(under independence) then subtract this table from the observed tableand square each cell. Adding up the squared deviations divided by theexpected counts gets chi-square.
MinitabTo obtain the contingency table, follow the menu items
Stat > Tables > Cross-Tabulation and chi-squareand fill in dialog with the names of two categorical variables. Pick one
variable to identify the rows of the table and the other for the columns.(Layers allow you to produce tables such as Table 5-8 that show aseparate table for each value of a third variable.) Options also produceintermediate steps in the calculation of chi-square, such as thecontribution from each cell to the total. It’s an easy calculation toconvert chi-square to Cramer’s V .
JMPFollow the menu commands
Analyze > Fit Y by Xand pick one categorical variable for the Y variable and one for X. The
variable chosen for Y identifies the columns of the contingency table andthe variable identified as X identifies the rows of the contingency table.By default, the output from JMP shows the mosaic plot. The pop-upmenu produced by clicking on the red triangle beside the header“Contingency Table” in the output window allows you to modify thetable by removing, for instance, some of the shown percentages.
The value of the chi-square statistic appears below the table in thesection of the output labeled “Tests.” The value of chi-square is labeled
7/15/2019 ch05.catassoc
http://slidepdf.com/reader/full/ch05catassoc 24/24
2/7/2008 5 Categorical Association
5 24
in the output Pearson. (There are variations on how to compute the chi-square statistic.)
Source DF -LogLike RSquare (U)Model 5 8.28511 0.0144
Error 1221 567.79052
C. Total 1226 576.07562
N 1231
Test ChiSquare Prob>ChiSqLikelihood Ratio 16.570 0.0054
Pearson 16.056 0.0067
Once you have chi-square, use the formula given in the text to obtainCramer’s V .