Stat 31, Section 1, Last Time

• Inference for Proportions

– Hypothesis Tests

• 2 Sample Proportions Inference

– Skipped

• 2-way Tables

– Sliced populations in 2 different ways

– Look for independence of factors

– Chi Square Hypothesis test

Reading In Textbook

Approximate Reading for Today’s Material:

Pages 582-611, 634-667

Approximate Reading for Next Class:

Pages 634-667

Midterm I - Results

Preliminary comments:

• Circled numbers are points taken off

• Total for each problem in brackets

• Points evenly divided among parts

• Page total in lower right corner

• Check those sum to total on front

• Overall score out of 100 points

Midterm I - Results

Interpretation of Scores:

• Too early for letter grades

• These will change a lot:

– Some with good grades will relax

– Some with bad grades will wake up

• Don’t believe “A & C” average to “B”

Midterm I - Results

Interpretation of Scores:

• Recall large variation over 2 midterms

– No exception this semester

Midterm I - Results

Compare Midterm Scores

[Scatterplot: Midterm I score on the horizontal axis, Midterm II score on the vertical axis, both from 40 to 100]

Midterm I - Results

Compare Midterm Scores

[Scatterplot: Midterm I score on the horizontal axis, Midterm II score on the vertical axis, both from 40 to 100, with the Line of Equal Scores added]

Midterm I - Results

Compare Midterm Scores

[Scatterplot: Midterm I score on the horizontal axis, Midterm II score on the vertical axis, both from 40 to 100]

Some have dramatically improved

Others have been distracted by other things

Midterm I - Results

Interpretation of Scores:

• Recall large variation over 2 midterms

– No exception this semester

• Get better info from 2 test Total

– So will report answers in those terms

Midterm I - Results

Histogram of Results:

[Histogram: Midterm I + II, Total Score; horizontal axis Total Score, vertical axis Frequency (0 to 14)]

Midterm I - Results

Interpretation of Scores (2 Test total):

170 – 200 A

155 – 168 B

131 – 154 C

120 – 129 D

up to 119 F

Midterm I - Results

Where do we go from here?

• I see 2 rather different groups…

• Which are you in?

• What can you do?

• Most important:

It is still early days……

Chapter 9: Two-Way Tables

Main idea:

Divide up populations in two ways

– E.g. 1: Age & Sex

– E.g. 2: Education & Income

• Typical Major Question:

How do divisions relate?

• Are the divisions independent?

– Similar idea to independence in probability theory

– Statistical Inference?

Two-Way Tables

Big Question: Is there a relationship?

Note: tallest bars

– French Wine & French Music

– Italian Wine & Italian Music

– Other Wine & No Music

Suggests there is a relationship

[Bar chart: Class Example 31 - Counts; # Bottles purchased (0 to 45) by wine type (French, Italian, Other) and Music (None, French, Italian)]

Two-Way Tables

General Directions:

• Can we make this precise?

• Could it happen just by chance?

– Really: how likely to be a chance effect?

• Or is it statistically significant?

– I.e. music and wine purchase are related?

Two-Way Tables

An alternate view:

Replace counts by proportions (or %-ages)

Class Example 31 (Wine & Music), Part 2

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls

Advantage:

May be more interpretable

Drawback:

No real difference (just rescaled)

Two-Way Tables

Testing for independence:

What is it?

From probability theory:

P{A | B} = P{A}

i.e. Chances of A, when B is known, are same as when B is unknown

Table version of this idea?

Independence in 2-Way Tables

Counts analog of P{A|B}???

Equivalent condition for independence is:

P{A & B} = P{A} P{B}

So for counts, look for:

Table Prop’n = Row Marg’l Prop’n x Col’n Marg’l Prop’n

i.e. Entry = Product of Marginals
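The "Entry = Product of Marginals" idea can be sketched in a few lines of Python. The counts below are an assumption: they are not listed on these slides, and the Class Example 31 spreadsheet remains the authoritative source. The function name `expected_counts` is hypothetical.

```python
# ASSUMED counts (rows = wine type, columns = music type); not taken from
# the slides. Check them against the Class Example 31 spreadsheet.
observed = {
    "French wine":  {"None": 30, "French": 39, "Italian": 30},
    "Italian wine": {"None": 11, "French": 1,  "Italian": 19},
    "Other wine":   {"None": 43, "French": 35, "Italian": 35},
}

def expected_counts(table):
    """Counts the table would show if rows and columns were independent."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    n = sum(table[r][c] for r in rows for c in cols)
    # Marginal proportions of each row and each column.
    row_prop = {r: sum(table[r].values()) / n for r in rows}
    col_prop = {c: sum(table[r][c] for r in rows) / n for c in cols}
    # Under independence: table proportion = row marginal x column marginal.
    return {r: {c: n * row_prop[r] * col_prop[c] for c in cols} for r in rows}

expected = expected_counts(observed)
for wine, row in expected.items():
    print(wine, {music: round(count, 2) for music, count in row.items()})
```

The expected table keeps the same row and column totals as the observed one, so any difference between the two tables reflects how the factors are matched up, not how common each category is.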

Independence in 2-Way Tables

Visualize Product of Marginals for:

Class Example 31 (Wine & Music), Part 4

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls

Shows same structure as marginals

But no match between music & wine

Good null hypothesis

[Bar chart: Class Example 31 - Independent Model; vertical axis labeled # Bottles purchased (scale 0 to 0.18) by wine type (French, Italian, Other) and Music (None, French, Italian)]

Independence in 2-Way Tables

Approach:

• Measure “distance between tables”

– Use Chi Square Statistic

– Has known probability distribution when table is independent

• Assess significance using P-value

– Set up as: H0: Indep. HA: Dependent

– P-value = P{what was seen or more conclusive (m.c.) | Indep.}

Independence in 2-Way Tables

Chi-square statistic: Based on:

• Observed Counts (raw data), $Obs_i$

• Expected Counts (under indep.), $Exp_i$

$$X^2 = \sum_{\text{cells } i} \frac{(Obs_i - Exp_i)^2}{Exp_i}$$

Notes:

– Small for only random variation

– Large for significant departure from indep.

Independence in 2-Way Tables

Chi-square statistic calculation:

$$X^2 = \sum_{\text{cells } i} \frac{(Obs_i - Exp_i)^2}{Exp_i}$$

Class example 31, Part 5:

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls

– Calculate term by term

– Then sum

– Is X2 = 18.3 “big” or “small”?
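The "term by term, then sum" calculation can be sketched as follows. The counts are again an assumption (they are not listed on the slides); with these assumed counts the statistic comes out close to the 18.3 quoted above. The function name `chi_square_statistic` is hypothetical.

```python
# ASSUMED counts: rows = French, Italian, Other wine;
# columns = no music, French music, Italian music.
observed = [
    [30, 39, 30],
    [11, 1, 19],
    [43, 35, 35],
]

def chi_square_statistic(obs):
    """Sum over cells of (Obs - Exp)^2 / Exp, with Exp from the marginals."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under indep.
            x2 += (o - e) ** 2 / e                 # one term per cell
    return x2

x2 = chi_square_statistic(observed)
print(round(x2, 1))
```

A table whose entries are exactly the product of its marginals gives a statistic of 0, which is the "small for only random variation" end of the scale.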

Independence in 2-Way Tables

H0 distribution of the X2 statistic:

“Chi Squared” (another Greek letter, $\chi^2$)

Parameter: “degrees of freedom”

(similar to T distribution)

Excel Computation:

– CHIDIST (given cutoff, find area = prob.)

– CHIINV (given prob = area, find cutoff)

Independence in 2-Way Tables

For test of independence, use:

degrees of freedom =

= (#rows – 1) x (#cols – 1)

E.g. Wine and Music:

d.f. = (3 – 1) x (3 – 1) = 4

Independence in 2-Way Tables

E.g. Wine and Music:

P-value = P{Observed X2 or m.c. | Indep.} =

= P{X2 = 18.3 or m.c. | Indep.} =

= P{X2 >= 18.3 | d.f. = 4} =

= 0.0011

Also see Class Example 31, Part 5

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls
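As a cross-check on the spreadsheet, the upper tail area can be sketched in plain Python. This uses a closed form that holds only for an even number of degrees of freedom (which covers d.f. = 4 here); it is not a general replacement for CHIDIST, and the function name `chi_square_upper_tail` is hypothetical.

```python
import math

def chi_square_upper_tail(x, df):
    """P{X2 >= x} for a chi-square distribution; even d.f. only.

    For df = 2k the tail area is exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even d.f.")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j   # builds (x/2)^j / j! step by step
        total += term
    return math.exp(-half) * total

rows, cols = 3, 3
df = (rows - 1) * (cols - 1)          # (#rows - 1) x (#cols - 1)
p_value = chi_square_upper_tail(18.3, df)
print(df, round(p_value, 4))
```

For odd degrees of freedom a statistics library or the Excel functions named on the earlier slide would be needed.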

Independence in 2-Way Tables

E.g. Wine and Music:

P-value = 0.001

Yes-No: Very strong evidence against

independence, conclude music has a

statistically significant effect

Gray-Level: Also very strong

evidence

Independence in 2-Way Tables

Excel shortcut:

CHITEST

• Avoids the (obs-exp)^2 / exp calculation

• Automatically computes d.f.

• Returns P-value

Independence in 2-Way Tables

HW:

9.27

9.29

And Now for Something Completely Different

A statistics joke, from:

GARY C. RAMSEYER'S INTERNET GALLERY

OF STATISTICS JOKES

http://www.ilstu.edu/~gcramsey/Gallery.html

And Now for Something Completely Different

A somewhat advanced society has figured

how to package basic knowledge in pill

form.

A student, needing some learning, goes to

the pharmacy and asks what kind of

knowledge pills are available.

And Now for Something Completely Different

The pharmacist says "Here's a pill for

English literature."

The student takes the pill and swallows it

and has new knowledge about English

literature!

And Now for Something Completely Different

"What else do you have?" asks the student.

"Well, I have pills for art history, biology,

and world history," replies the

pharmacist.

The student asks for these, and swallows

them and has new knowledge about

those subjects!

And Now for Something Completely Different

Then the student asks, "Do you have a pill for statistics?"

The pharmacist says "Wait just a moment", and goes back into the storeroom and brings back a whopper of a pill that is about twice the size of a jawbreaker and plunks it on the counter.

"I have to take that huge pill for statistics?" inquires the student.

And Now for Something Completely Different

The pharmacist understandingly nods his

head and replies:

"Well, you know statistics always was a little

hard to swallow."

Caution about 2-Way Tables

Simpson’s Paradox:

Aggregation into tables can be dangerous

E.g. from:

http://www.math.sfu.ca/~cschwarz/Stat-301/Handouts/node50.html

Study Admission rates to professional programs, look for sex bias….

Simpson’s Paradox

Admissions to Business School:

         Admit  Deny
Male       480   120
Female     180    20

% Males admitted = 480 / (480 + 120) * 100% = 80%

% Females admitted = 180 / (180 + 20) * 100% = 90%

Better for females???

Simpson’s Paradox

Admissions to Law School:

         Admit  Deny
Male        10    90
Female     100   200

% Males admitted = 10 / (10 + 90) * 100% = 10%

% Females admitted = 100 / (100 + 200) * 100% = 33.3%

Better for females???

Simpson’s Paradox

Combined Admissions:

         Admit  Deny
Male       490   210
Female     280   220

% Males admitted = 490 / (490 + 210) * 100% = 70%

% Females admitted = 280 / (280 + 220) * 100% = 56%

Better for males???

Simpson’s Paradox

How can the rate be higher for females in each school, yet higher for males overall?

Reason: depends on relative proportions

Notes:

• In Business (male applicants dominant), easier to get in (660 / 800)

• In Law (female applicants dominant), much harder to get in (110 / 400)
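The reversal can be checked directly from the two school tables. The sketch below uses only the counts given on the slides; the names `admissions`, `admit_rate` and `combined` are hypothetical.

```python
# (admit, deny) counts per school and sex, as given in the tables.
admissions = {
    "Business": {"Male": (480, 120), "Female": (180, 20)},
    "Law":      {"Male": (10, 90),   "Female": (100, 200)},
}

def admit_rate(admit, deny):
    """Percentage admitted."""
    return 100.0 * admit / (admit + deny)

# Rates within each school: females do better in both.
for school, by_sex in admissions.items():
    for sex, (admit, deny) in by_sex.items():
        print(f"{school:8s} {sex:6s} {admit_rate(admit, deny):5.1f}%")

# Aggregate over schools: add up admits and denies for each sex.
combined = {}
for by_sex in admissions.values():
    for sex, (admit, deny) in by_sex.items():
        a, d = combined.get(sex, (0, 0))
        combined[sex] = (a + admit, d + deny)

# Rates after aggregation: males now appear to do better.
for sex, (admit, deny) in combined.items():
    print(f"Combined {sex:6s} {admit_rate(admit, deny):5.1f}%")
```

Most male applicants went to the school that admits 660 of 800, while most female applicants went to the one that admits 110 of 400, and that imbalance drives the combined rates.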

Simpson’s Paradox

Lesson:

Must be very careful about aggregation

Worse: may not be aware that aggregation has been done….

Recall terminology: Lurking Variable

Can hide in aggregation…

Could be used for cheating…

Simpson’s Paradox

HW:

9.15

9.17

Inference for Regression

Chapter 10

Recall:

• Scatterplots

• Fitting Lines to Data

Now study statistical inference associated with fit lines

E.g. When is slope statistically significant?

Recall Scatterplot

For data (x,y)

View by plot:

(1,2)

(3,1)

(-1,0)

(2,-1)

[Figure: Toy Scatterplot, Separate Points; x axis from -2 to 4, y axis from -1.5 to 2.5]

Recall Linear Regression

Idea:

Fit a line to data in a scatterplot

• To learn about “basic structure”

• To “model data”

• To provide “prediction of new values”

Recall Linear Regression

Recall some basic geometry:

A line is described by an equation:

y = mx + b

m = slope

b = y intercept

Varying m & b gives a “family of lines”,

indexed by “parameters” m & b (or a & b)

Recall Linear Regression

Approach:

Given a scatterplot of data:

$(x_1, y_1), \ldots, (x_n, y_n)$

Find a & b (i.e. choose a line)

to “best fit the data”

Recall Linear Regression

Given a line, $y = ax + b$, “indexed” by a & b

Define “residuals” = “data Y” – “Y on line”

= $y_i - (a x_i + b)$

Now choose a & b to make these “small”

[Figure: line with data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$]

Recall Linear Regression

Excellent Demo, by Charles Stanton, CSUSB

http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html

More JAVA Demos, by David Lane at Rice U.

http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html

http://www.ruf.rice.edu/~lane/stat_sim/comp_r/index.html

Recall Linear Regression

Make Residuals > 0, by squaring

Least Squares: adjust a & b to

Minimize the “Sum of Squared Errors”

$$SSE = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2$$
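A minimal sketch of the least squares fit, using the four toy points from the scatterplot slide. The closed-form solution used here is the standard one and is not spelled out on the slides; the function names `least_squares` and `sse` are hypothetical. Excel's SLOPE and INTERCEPT should agree with the result.

```python
def least_squares(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sse(xs, ys, slope, intercept):
    """Sum of squared residuals: data y minus y on the line, squared."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Toy data from the scatterplot slide: (1,2), (3,1), (-1,0), (2,-1)
xs = [1, 3, -1, 2]
ys = [2, 1, 0, -1]

slope, intercept = least_squares(xs, ys)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"SSE at the fitted line = {sse(xs, ys, slope, intercept):.4f}")
print(f"SSE with slope nudged  = {sse(xs, ys, slope + 0.1, intercept):.4f}")
```

Nudging either parameter away from the fitted values can only increase the SSE, which is what "best fit" means here.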

Least Squares in Excel

Computation:

1. INTERCEPT (computes y-intercept b)

2. SLOPE (computes slope a)

Revisit Class Example 14

http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls

HW: 10.17a

Inference for Regression

Goal: develop

• Hypothesis Tests and Confidence Intervals

• For slope & intercept parameters, a & b

• Also study prediction

Inference for Regression

Idea: do statistical inference on:

– Slope a

– Intercept b

Model: $Y_i = a X_i + b + e_i$

Assume: $e_i$ are random, independent

and $e_i \sim N(0, \sigma_e)$

Inference for Regression

Viewpoint: Data generated as:

y = ax + b

$Y_i$ chosen from $N(a X_i + b, \sigma_e)$,

the normal distribution centered on the line at $X_i$

Note: a and b are “parameters”

Inference for Regression

Parameters $a$ and $b$ determine the

underlying model (distribution)

Estimate with the Least Squares Estimates:

$\hat{a}$ and $\hat{b}$

(Using SLOPE and INTERCEPT in Excel,

based on data)

Inference for Regression

Distributions of $\hat{a}$ and $\hat{b}$?

Under the above assumptions, the sampling

distributions are:

$\hat{a} \sim N(a, \sigma_a)$

$\hat{b} \sim N(b, \sigma_b)$

• Centerpoints are right (unbiased)

• Spreads are more complicated
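The claim that the fitted slope is centered on the true slope can be illustrated by simulation. This is a sketch under stated assumptions: the true slope, intercept, error SD and the x values below are made-up illustration values, and the helper names `fit_line` and `simulate_slopes` are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def simulate_slopes(true_slope, true_intercept, sigma_e, xs, reps, seed=0):
    """Generate data from the model many times and refit the slope each time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    slopes = []
    for _ in range(reps):
        # Model: Y_i = a * X_i + b + e_i, with e_i ~ N(0, sigma_e)
        ys = [true_slope * x + true_intercept + rng.gauss(0.0, sigma_e) for x in xs]
        slopes.append(fit_line(xs, ys)[0])
    return slopes

xs = list(range(10))  # hypothetical x values 0, 1, ..., 9
slopes = simulate_slopes(true_slope=2.0, true_intercept=1.0, sigma_e=1.0,
                         xs=xs, reps=2000)
mean_slope = statistics.mean(slopes)
sd_slope = statistics.stdev(slopes)
print(f"mean of fitted slopes: {mean_slope:.3f}")
print(f"SD of fitted slopes:   {sd_slope:.3f}")
```

The mean of the fitted slopes should land close to the true slope of 2, while their SD shows how much a single fitted slope varies from sample to sample.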

Inference for Regression

Formula for SD of $\hat{a}$:

$$\sigma_a = SD(\hat{a}) = \frac{\sigma_e}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• Big (small) for $\sigma_e$ big (small, resp.)

– Accurate data ⇒ Accurate est. of slope

• Small for x’s more spread out

– Data more spread ⇒ More accurate

• Small for more data

– More data ⇒ More accuracy

Inference for Regression

Formula for SD of $\hat{b}$:

$$\sigma_b = SD(\hat{b}) = \sigma_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• Big (small) for $\sigma_e$ big (small, resp.)

– Accurate data ⇒ Accurate est. of intercept

• Smaller for $\bar{x} \approx 0$

– Centered data ⇒ More accurate intercept

• Smaller for more data

– More data ⇒ More accuracy

Inference for Regression

One more detail:

Need to estimate $\sigma_e$ using data

For this use:

$$s_e = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} \left( y_i - \hat{a} x_i - \hat{b} \right)^2}$$

• Similar to earlier sd estimate, $s$

• Except variation is about fit line

• $n - 2$ is similar to $n - 1$ from before

Inference for Regression

Now for Probability Distributions,

Since we are estimating $\sigma_e$ by $s_e$

Use TDIST and TINV

With degrees of freedom = $n - 2$
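The pieces above can be put together in a short sketch: fit the line, estimate the error SD about the fit line with n - 2 in the denominator, and plug that into the SD formulas for slope and intercept. The t statistic shown (estimate divided by its estimated SD, for testing slope = 0) is the usual form and is an assumption here, since the slide only names TDIST and TINV. The function name `regression_inference` is hypothetical, and the data are the toy scatterplot points.

```python
import math

def regression_inference(xs, ys):
    """Least squares fit plus estimated SDs of slope and intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Estimated error SD: variation about the fit line, with n - 2.
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    # SD formulas with the unknown error SD replaced by its estimate s_e.
    se_slope = s_e / math.sqrt(sxx)
    se_intercept = s_e * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "s_e": s_e,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "t_slope": slope / se_slope,  # ASSUMED test statistic for slope = 0
        "df": n - 2,
    }

result = regression_inference([1, 3, -1, 2], [2, 1, 0, -1])
for name, value in result.items():
    print(f"{name:12s} {value:.4f}")
```

The t statistic and its degrees of freedom are what would be passed to TDIST in Excel to get a P-value; with only four toy points the slope is nowhere near significant.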

Inference for Regression

Convenient Packaged Analysis in Excel:

Tools → Data Analysis → Regression

Illustrate application using:

Class Example 27,

Old Text Problem 8.6 (now 10.12)
