The two sample t-test

Two-sample tests

Binary or categorical outcomes (proportions)

Outcome variable: binary or categorical (e.g., fracture, yes/no)

Are the observations correlated?

Independent observations:

Chi-square test: compares proportions between two or more groups

Relative risks: odds ratios or risk ratios

Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:

McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)

Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)

GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:

Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5).

McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5).

Recall: The odds ratio (two samples = cases and controls)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

With the cells labeled a=15, b=35, c=8, d=42:

$$OR = \frac{ad}{bc} = \frac{15 \times 42}{35 \times 8} = 2.25$$

Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.

Inferences about the odds ratio…

Does the sampling distribution follow a normal distribution? What is the standard error?

Simulation…

1. In SAS, assume an infinite population of cases and controls with equal proportion of smokers (exposure), p=.23 (UNDER THE NULL!)

2. Use the random binomial function to randomly select n=50 cases and n=50 controls, each with p=.23 chance of being a smoker.

3. Calculate the observed odds ratio for the resulting 2x2 table.

4. Repeat this 1000 times (or some large number of times).

5. Observe the distribution of odds ratios under the null hypothesis.
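The five steps above can be sketched in a few lines. The slides run this simulation in SAS; the version below is a Python stand-in using only the standard library, and the function name, seed, and the choice to skip tables with an empty cell are assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_null_odds_ratios(n_cases=50, n_controls=50, p=0.23, reps=1000, seed=1):
    """Sample odds ratios when cases and controls share the same exposure probability p."""
    rng = random.Random(seed)
    odds_ratios = []
    for _ in range(reps):
        a = sum(rng.random() < p for _ in range(n_cases))     # exposed cases
        c = sum(rng.random() < p for _ in range(n_controls))  # exposed controls
        b, d = n_cases - a, n_controls - c                    # unexposed cases, unexposed controls
        if 0 in (a, b, c, d):
            continue  # OR is undefined with an empty cell (assumption: drop such tables)
        odds_ratios.append((a * d) / (b * c))
    return odds_ratios

ors = simulate_null_odds_ratios()
log_ors = [math.log(x) for x in ors]
print("median OR:", round(statistics.median(ors), 2))
print("mean OR:", round(statistics.mean(ors), 2))            # pulled upward by the right skew
print("SD of ln(OR):", round(statistics.stdev(log_ors), 2))  # empirical standard error of ln(OR)
```

Comparing the mean and median of the simulated ORs is a quick way to see the right skew, while the standard deviation of the logged values gives the empirical standard error of the lnOR.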

Properties of the OR (simulation) (50 cases/50 controls/23% exposed)

Under the null, this is the expected variability of the sample OR. Note the right skew.

Properties of the lnOR

Normal!

From the simulation, we can get the empirical standard error (~0.5) and p-value (~.10).

Or, in general, standard error =

$$\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$$

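Inferences about the ln(OR)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$OR = 2.25; \quad \ln(OR) = 0.81$$

$$Z = \frac{\ln(2.25) - 0}{\sqrt{\frac{1}{8}+\frac{1}{15}+\frac{1}{35}+\frac{1}{42}}} = \frac{0.81}{0.494} = 1.64; \quad p = .10$$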

Confidence interval…

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$95\%\ \text{CI for } \ln OR = 0.81 \pm 1.96 \times 0.494 = (-0.16,\ 1.78)$$

$$95\%\ \text{CI for } OR = (e^{-0.16},\ e^{1.78}) = (0.85,\ 5.92)$$

Final answer: 2.25 (0.85, 5.92)
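As a check on the arithmetic for the stroke/smoking table, the Z test and confidence interval for the lnOR can be computed directly. This is a minimal Python sketch (the slides themselves use SAS); the function name and return layout are illustrative assumptions.

```python
import math
from statistics import NormalDist

def odds_ratio_inference(a, b, c, d, level=0.95):
    """Z test and confidence interval for the odds ratio of a 2x2 table with cells a, b, c, d."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)        # standard error of ln(OR)
    z = (log_or - 0) / se                                # null value of ln(OR) is 0
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
    ci = (math.exp(log_or - z_crit * se), math.exp(log_or + z_crit * se))
    return odds_ratio, se, z, p_two_sided, ci

odds_ratio, se, z, p, ci = odds_ratio_inference(15, 35, 8, 42)
print(f"OR = {odds_ratio:.2f}, SE(lnOR) = {se:.3f}, Z = {z:.2f}, p = {p:.2f}")
print(f"95% CI for OR: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around 2.25.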

Practice problem: Suppose the following data were collected in a case-control study of brain tumor and cell phone usage:

                          Brain tumor   No brain tumor
Own a cell phone               20             60
Don’t own a cell phone         10             40

Is there sufficient evidence for an association between cell phones and brain tumor?

Answer

1. What is your null hypothesis?
Null hypothesis: OR = 1.0; lnOR = 0
Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0

2. What is your null distribution?
lnOR ~ N(0, σ), where

$$\sigma = SD(\ln OR) = \sqrt{\frac{1}{20}+\frac{1}{60}+\frac{1}{10}+\frac{1}{40}} = .44$$

3. Empirical evidence: OR = (20 × 40)/(60 × 10) = 800/600 = 1.33; lnOR = .288

4. Z = (.288 - 0)/.44 = .65
p-value = P(Z > .65 or Z < -.65) = .26 × 2 = .52

TWO-SIDED TEST: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null mean.

5. Not enough evidence to reject the null hypothesis of no association.

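Key measures of relative risk: 95% CIs for the OR and RR

For an odds ratio, 95% confidence limits:

$$\left( OR \times \exp\left[-1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right],\ \ OR \times \exp\left[+1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right] \right)$$

For a risk ratio, 95% confidence limits:

$$\left( RR \times \exp\left[-1.96\sqrt{\frac{1-a/(a+b)}{a}+\frac{1-c/(c+d)}{c}}\right],\ \ RR \times \exp\left[+1.96\sqrt{\frac{1-a/(a+b)}{a}+\frac{1-c/(c+d)}{c}}\right] \right)$$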
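The two sets of confidence limits can be written as small functions. This Python sketch assumes the rows of the 2x2 table are the two comparison groups, with a and c counting events, so that a/(a+b) and c/(c+d) are the two risks. The cohort counts passed to the risk-ratio function are hypothetical numbers chosen only to exercise the formula; the odds-ratio call reuses the cell phone practice problem.

```python
import math

def odds_ratio_ci(a, b, c, d, z_crit=1.96):
    """Odds ratio with its 95% confidence limits."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, (odds_ratio * math.exp(-z_crit * se_log_or),
                        odds_ratio * math.exp(z_crit * se_log_or))

def risk_ratio_ci(a, b, c, d, z_crit=1.96):
    """Risk ratio with its 95% confidence limits; a and c are event counts in each row."""
    risk_1 = a / (a + b)
    risk_2 = c / (c + d)
    risk_ratio = risk_1 / risk_2
    se_log_rr = math.sqrt((1 - risk_1) / a + (1 - risk_2) / c)
    return risk_ratio, (risk_ratio * math.exp(-z_crit * se_log_rr),
                        risk_ratio * math.exp(z_crit * se_log_rr))

# Cell phone practice problem (case-control data, so the OR is the relevant measure).
or_estimate, or_limits = odds_ratio_ci(20, 60, 10, 40)
# Hypothetical cohort: 30 events among 100 exposed, 15 events among 100 unexposed.
rr_estimate, rr_limits = risk_ratio_ci(30, 70, 15, 85)
print(round(or_estimate, 2), [round(x, 2) for x in or_limits])
print(round(rr_estimate, 2), [round(x, 2) for x in rr_limits])
```

An interval that contains 1.0, as the cell phone interval does, agrees with the practice problem's conclusion that the null of no association is not rejected.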

Continuous outcome (means)

Outcome variable: continuous (e.g., pain scale, cognitive function)

Are the observations independent or correlated?

Independent observations:

Ttest: compares means between two independent groups

ANOVA: compares means between more than two independent groups

Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables

Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:

Paired ttest: compares means between two related groups (e.g., the same subjects before and after)

Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)

Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics

Wilcoxon signed-rank test: non-parametric alternative to the paired ttest

Wilcoxon rank-sum test (=Mann-Whitney U test): non-parametric alternative to the ttest

Kruskal-Wallis test: non-parametric alternative to ANOVA

Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

The two-sample t-test

Is the difference in means that we observe between two groups more than we’d expect to see based on chance alone?

The standard error of the difference of two means

$$\sigma_{\bar{x}-\bar{y}} = \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}$$

**First add the variances and then take the square root of the sum to get the standard error.

Recall, Var(A-B) = Var(A) + Var(B) if A and B are independent!

Shown by simulation:

One sample of 30 (with SD=5): $SE = 5/\sqrt{30} = .91$

One sample of 30 (with SD=5): $SE = 5/\sqrt{30} = .91$

Difference of the two samples: $SE(\text{diff}) = \sqrt{\frac{25}{30}+\frac{25}{30}} = 1.29$
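The simulation behind these numbers can be approximated as follows. This is a Python sketch under stated assumptions: both populations are taken to be normal with mean 0 and SD 5, and the number of repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics

def simulate_difference_of_means(n=30, m=30, sd=5.0, reps=5000, seed=2):
    """Empirical standard errors of two sample means and of their difference."""
    rng = random.Random(seed)
    x_means, y_means, diffs = [], [], []
    for _ in range(reps):
        x_bar = sum(rng.gauss(0, sd) for _ in range(n)) / n
        y_bar = sum(rng.gauss(0, sd) for _ in range(m)) / m
        x_means.append(x_bar)
        y_means.append(y_bar)
        diffs.append(x_bar - y_bar)
    return statistics.stdev(x_means), statistics.stdev(y_means), statistics.stdev(diffs)

se_x, se_y, se_diff = simulate_difference_of_means()
print("SE of each mean (theory):", round(5 / math.sqrt(30), 2))
print("SE of each mean (simulated):", round(se_x, 2), round(se_y, 2))
print("SE of difference (theory):", round(math.sqrt(25 / 30 + 25 / 30), 2))
print("SE of difference (simulated):", round(se_diff, 2))
```

The simulated standard error of the difference should land near the square root of the summed variances, not near the sum of the two standard errors.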

Distribution of differences

If X and Y are the averages of n and m subjects, respectively:

$$\bar{X}_n - \bar{Y}_m \sim N\left(\mu_x - \mu_y,\ \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}\right)$$

But…

As before, you usually have to use the sample SD, since you won’t know the true SD ahead of time…

So, again, this becomes a T-distribution...

Estimated standard error of the difference…

$$\hat{\sigma}_{\bar{x}-\bar{y}} = \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$$

Just plug in the sample standard deviations for each group.

Case 1: un-pooled variance

Question: What are your degrees of freedom here?
Answer: Not obvious!

Case 1: ttest, unpooled variances

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}} \sim t$$

It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean (or SAS will tell you!):

$$df \approx \frac{2}{\frac{1}{n}+\frac{1}{m}}$$

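Case 2: pooled variance

If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

$$s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1} \quad\text{and}\quad (n-1)s_x^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$$

$$s_y^2 = \frac{\sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{m-1} \quad\text{and}\quad (m-1)s_y^2 = \sum_{i=1}^{m}(y_i - \bar{y}_m)^2$$

Pooling variances:

$$s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$

The denominator, n+m-2, is the degrees of freedom!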

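Estimated standard error (using pooled variance estimate)

$$\hat{\sigma}_{\bar{x}-\bar{y}} = \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}$$

where:

$$s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

The degrees of freedom are n+m-2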

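Case 2: ttest, pooled variances

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}} \sim t_{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$

Alternate calculation formula: ttest, pooled variance

$$T = \frac{\bar{X}_n - \bar{Y}_m}{s_p\sqrt{\frac{n+m}{nm}}} \sim t_{n+m-2}$$

because

$$\frac{s_p^2}{n} + \frac{s_p^2}{m} = s_p^2\left(\frac{1}{n}+\frac{1}{m}\right) = s_p^2\left(\frac{m}{mn}+\frac{n}{mn}\right) = s_p^2\left(\frac{m+n}{mn}\right)$$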

Pooled vs. unpooled variance

Rule of Thumb: Use pooled unless you have a reason not to.

Pooled gives you more degrees of freedom.

Pooled has an extra assumption: variances are equal between the two groups.

SAS automatically tests this assumption for you (“Equality of Variances” test). If p<.05, this suggests unequal variances, and it is better to use the unpooled ttest.

Example: two-sample t-test

In 1980, some researchers reported that “men have more mathematical ability than women” as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436±77 and 30 random female adolescents scored lower: 416±81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors’ conclusions?

Data Summary

                   n    Sample Mean   Sample Standard Deviation
Group 1: women    30        416                  81
Group 2: men      30        436                  77

Two-sample t-test

1. Define your hypotheses (null, alternative)
H0: ♂-♀ math SAT = 0
Ha: ♂-♀ math SAT ≠ 0 [two-sided]

2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a “pooled” estimate of variance.

$$s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245$$

$$\bar{M}_{30} - \bar{F}_{30} \sim T_{58}\left(0,\ \sqrt{\frac{6245}{30}+\frac{6245}{30}} = 20.4\right)$$

3. Observed difference in our experiment = 20 points

4. Calculate the p-value of what you observed

$$T_{58} = \frac{20 - 0}{20.4} = .98$$

data _null_;
  pval=(1-probt(.98, 58))*2;
  put pval;
run;

0.3311563454

5. Do not reject null! No evidence that men are better in math ;)
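The same calculation can be reproduced outside SAS. The Python sketch below computes the pooled variance, standard error, and T statistic from the summary data. Because the standard library has no t-distribution function, the two-sided p-value is obtained by numerically integrating the Student t density with Simpson's rule; that helper is an illustrative assumption, not the method SAS uses internally.

```python
import math

def pooled_t_test(mean_x, sd_x, n, mean_y, sd_y, m):
    """Pooled-variance two-sample t statistic from summary statistics."""
    pooled_var = ((n - 1) * sd_x ** 2 + (m - 1) * sd_y ** 2) / (n + m - 2)
    se = math.sqrt(pooled_var / n + pooled_var / m)
    t = (mean_x - mean_y) / se
    return pooled_var, se, t, n + m - 2

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the Student t density (steps must be even)."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area_zero_to_t = total * h / 3       # area under the density between 0 and |t|
    return 1 - 2 * area_zero_to_t        # what remains in the two tails

pooled_var, se, t, df = pooled_t_test(436, 77, 30, 416, 81, 30)
p_value = t_two_sided_p(t, df)
print(f"pooled variance = {pooled_var:.0f}, SE = {se:.1f}, T = {t:.2f}, df = {df}")
print(f"two-sided p = {p_value:.2f}")
```

The result should agree with the SAS output to about two decimal places; any small difference comes from SAS being given the rounded statistic .98.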

Example 2: Difference in means

Example: Rosenthal, R. and Jacobson, L. (1966) Teachers’ expectancies: Determinants of pupils’ I.Q. gains. Psychological Reports, 19, 115-118.

The Experiment (note: exact numbers have been altered)

Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).

Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as “academic bloomers” (n=18).

BUT: the children on the teachers’ lists had actually been randomly assigned to the list.

At the end of the year, the same I.Q. test was re-administered.

Example 2

Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?

What will we actually compare? One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.

Results:

                        “Academic bloomers” (n=18)   Controls (n=72)
Change in IQ score:            12.2 (2.0)               8.2 (2.0)

12.2 points vs. 8.2 points: Difference = 4 points

The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…

What does a 4-point difference mean?

Before we perform any formal statistical analysis on these data, we already have a lot of information.

Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

Is the association statistically significant?

This 4-point difference could reflect a true effect or it could be a fluke.

The question: is a 4-point difference bigger or smaller than the expected sampling variability?

Hypothesis testing

Step 1: Assume the null hypothesis.

Null hypothesis: There is no difference between “academic bloomers” and normal students (= the difference is 0 points)

Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true

These predictions can be made by mathematical theory or by computer simulation.

Math theory:

$$s_p^2 = 4.0$$

$$\bar{X}_{\text{gifted}} - \bar{X}_{\text{control}} \sim T_{88}\left(0,\ \sqrt{\frac{4}{18}+\frac{4}{72}} \approx 0.52\right)$$

Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true, by computer simulation:

In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.

I used computer simulation to take 1000 samples of 18 treated and 72 controls.

Computer Simulation Results

Standard error is about 0.52
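A simulation along these lines can be sketched in Python. The assumptions are labeled in the code: change scores in both groups are drawn from one normal population with SD 2.0 (the null hypothesis), and the common mean of 8.2 and the seed are arbitrary illustrative choices.

```python
import random
import statistics

def simulate_null_differences(n_treated=18, n_control=72, mean=8.2, sd=2.0,
                              reps=1000, seed=3):
    """Differences in mean change score when both groups come from the same population."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        treated = [rng.gauss(mean, sd) for _ in range(n_treated)]
        control = [rng.gauss(mean, sd) for _ in range(n_control)]
        diffs.append(statistics.mean(treated) - statistics.mean(control))
    return diffs

diffs = simulate_null_differences()
standard_error = statistics.stdev(diffs)
# How many simulated studies show a difference at least as extreme as the observed 4 points?
as_extreme = sum(abs(x) >= 4.0 for x in diffs)
print("empirical standard error:", round(standard_error, 2))
print("results as extreme as 4 points:", as_extreme, "of", len(diffs))
```

The spread of the simulated differences plays the role of the standard error, and counting results at least as extreme as the observed difference gives an empirical p-value.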

3. Empirical data

Observed difference in our experiment = 12.2-8.2 = 4.0

4. P-value

A t-curve with 88 df has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).

$$t_{88} = \frac{12.2 - 8.2}{.52} = \frac{4.0}{.52} \approx 7.7$$

p-value <.0001

If we ran this study 1000 times, we wouldn’t expect to get even 1 result as big as a difference of 4 (under the null hypothesis).

5. Reject null!

Conclusion: I.Q. scores can bias expectancies in the teachers’ minds and cause them to unintentionally treat “bright” students differently from those seen as less bright.

Confidence interval (more information!!)

95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)

What if our standard deviation had been higher?

The standard deviations for change scores in treatment and control were each 2.0. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?

Std. dev. in change scores = 2.0: standard error is 0.52

Std. dev. in change scores = 10.0: standard error is 2.58

With a std. dev. of 10.0… LESS STATISTICAL POWER!

If we ran this study 1000 times, we would expect to get a difference of +4.0 or more, or -4.0 or less, 12% of the time.

P-value=.12

Don’t forget: The paired T-test

Did the control group in the previous experiment improve at all during the year?

Do not apply a two-sample ttest to answer this question!

After-Before yields a single sample of differences…

This is a “within-group” rather than a “between-group” comparison…


Did the control group in the previous experiment improve at all during the year?

Data Summary

                    n    Sample Mean   Sample Standard Deviation
Group 1: Change    72       +8.2                 2.0

$$t_{71} = \frac{8.2 - 0}{2/\sqrt{72}} = \frac{8.2}{.236} \approx 34.8$$

p-value <.0001

Normality assumption of ttest

If the distribution of the trait is normal, it is fine to use a t-test.

But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem has not yet kicked in, and you cannot use the ttest. Rule of thumb for a large-enough sample: n>30 per group if not too skewed; n>100 if the distribution is really skewed.

Note: with large enough samples, the ttest is very robust against violations of the normality assumption!

Alternative tests when normality is violated: Non-parametric tests


Non-parametric tests

t-tests require your outcome variable to be normally distributed (or close enough), for small samples.

Non-parametric tests are based on RANKS instead of means and standard deviations (=“population parameters”).

Example: non-parametric tests

10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig

Hypothetical RESULTS:

Atkins group loses an average of 34.5 lbs.

J. Craig group loses an average of 18.5 lbs.

Conclusion: Atkins is better?

Example: non-parametric tests

BUT, take a closer look at the individual data…

Atkins, change in weight (lbs): +4, +3, 0, -3, -4, -5, -11, -14, -15, -300

J. Craig, change in weight (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30

[Figure: two histograms of weight change in lbs (x-axis) against percent of dieters (y-axis), one panel for Jenny Craig and one for Atkins]

t-test inappropriate…

Comparing the mean weight loss of the two groups is not appropriate here.

The distributions do not appear to be normally distributed.

Moreover, there is an extreme outlier (this outlier influences the mean a great deal).

Wilcoxon rank-sum test

RANK the values, 1 being the least weight loss and 20 being the most weight loss.

Atkins: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
Ranks: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20

J. Craig: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
Ranks: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19

Wilcoxon rank-sum test

Sum of Atkins ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73

Sum of Jenny Craig’s ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137

Jenny Craig clearly ranked higher! P-value* (from computer) = .018

*For details of the statistical test, see appendix of these slides…
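The ranking step is easy to reproduce. The Python sketch below ranks the pooled weight changes (rank 1 = least weight loss), sums the ranks per group, and then applies the large-sample normal approximation described in the appendix of these slides. The helper name is an assumption, the code assumes there are no tied values, and the approximate p-value it prints need not match the computer-generated .018 exactly.

```python
import math
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

def rank_sums(group_1, group_2):
    """Sum of ranks per group; rank 1 is the largest value (least weight loss). Assumes no ties."""
    pooled = sorted(group_1 + group_2, reverse=True)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    return sum(rank_of[v] for v in group_1), sum(rank_of[v] for v in group_2)

t_atkins, t_craig = rank_sums(atkins, jenny_craig)
n1, n2 = len(atkins), len(jenny_craig)

# U statistics and normal approximation, following the appendix formulas.
u_atkins = n1 * n2 + n1 * (n1 + 1) // 2 - t_atkins
u_craig = n1 * n2 + n2 * (n2 + 1) // 2 - t_craig
u0 = min(u_atkins, u_craig)
z = (u0 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print("rank sums:", t_atkins, t_craig)
print("U0 =", u0, " Z =", round(z, 2), " approximate two-sided p =", round(p_two_sided, 3))
```

Because only the ranks enter the calculation, the -300 outlier counts as rank 20 and nothing more, which is why it no longer dominates the comparison.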

Binary or categorical outcomes (proportions)


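Difference in proportions (special case of chi-square test)

Standard error of a proportion =

$$\sqrt{\frac{p(1-p)}{n}}$$

Standard error of the difference of two proportions =

$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad\text{or}\quad \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}, \quad\text{where } \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2}$$

The variance of a difference is the sum of variances (as with difference in means).

The second form, with the pooled proportion, is analogous to pooled variance in the ttest.

Null distribution of a difference in proportions

The standard error of a single proportion can be estimated by $\sqrt{\hat{p}(1-\hat{p})/n}$ (still normally distributed).

$$\text{Difference of proportions} \sim N\left(p_1 - p_2,\ \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}\right)$$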

Difference in proportions test

Null hypothesis: The difference in proportions is 0.

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}}$$

$$\bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2} \quad \text{(just average proportion)}$$

$\hat{p}_1$ = proportion in group 1; $\hat{p}_2$ = proportion in group 2

$n_1$ = number in group 1; $n_2$ = number in group 2

Recall, variance of a proportion is p(1-p)/n

Use the average (or pooled) proportion in the standard error formula, because under the null hypothesis, groups have equal proportions.

Z follows a normal distribution because the binomial can be approximated with a normal.

Recall case-control example:

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

Absolute risk: Difference in proportions exposed

$$P(E/D) - P(E/{\sim}D) = 15/50 - 8/50 = 30\% - 16\% = 14\%$$

Difference in proportions exposed

$$Z = \frac{14\% - 0\%}{\sqrt{\frac{.23 \times .77}{50} + \frac{.23 \times .77}{50}}} = \frac{.14}{.084} = 1.67$$

$$95\%\ CI:\ 0.14 \pm 1.96 \times .084 = -0.03 \text{ to } .31$$
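The difference-in-proportions test for the stroke/smoking table can be checked numerically. This is a minimal Python sketch using the pooled proportion in the standard error, as in the formulas above; the function name and argument order are illustrative assumptions.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Z test for a difference in proportions, using the pooled proportion under the null."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1 - p2, se, z, p_value

# 15 of 50 stroke cases smoked; 8 of 50 controls smoked.
diff, se, z, p_value = two_proportion_z(15, 50, 8, 50)
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"difference = {diff:.2f}, SE = {se:.3f}, Z = {z:.2f}, p = {p_value:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
```

The two-sided p-value lands close to the .10 obtained from the odds ratio for the same table, as expected since both tests address the same null hypothesis.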

Example 2: Difference in proportions

Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents?

Example modified from: “Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults”; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.

Example 2: Difference in Proportions

Design: Case-control study

Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).

Statistical question: Is a history of use of antidepressants more common among cases than controls?

What will we actually compare? Proportion of cases who used antidepressants in the past vs. proportion of controls who did

Results

                                  No (%) of cases (n=263)   No (%) of controls (n=1241)
Any antidepressant drug ever            120 (46%)                   448 (36%)

46% vs. 36%: Difference = 10%

Is the association statistically significant?

This 10% difference could reflect a true association or it could be a fluke in this particular sample.

The question: is 10% bigger or smaller than the expected sampling variability?

Hypothesis testing

Step 1: Assume the null hypothesis.

Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (= the difference is 0%)

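Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true

$$\hat{p}_{cases} - \hat{p}_{controls} \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{1241}} = .033\right)$$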

Also: Computer Simulation Results

Standard error is about 3.3%
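A simulation of this kind might look like the following Python sketch. The assumption, stated in the code, is that under the null every subject, case or control, has the same overall probability of prior antidepressant use (568/1504); the seed and function name are arbitrary.

```python
import random
import statistics

def simulate_null_proportion_differences(n_cases=263, n_controls=1241,
                                         p=568 / 1504, reps=1000, seed=4):
    """Differences in sample proportions when cases and controls share exposure probability p."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        exposed_cases = sum(rng.random() < p for _ in range(n_cases))
        exposed_controls = sum(rng.random() < p for _ in range(n_controls))
        diffs.append(exposed_cases / n_cases - exposed_controls / n_controls)
    return diffs

diffs = simulate_null_proportion_differences()
standard_error = statistics.stdev(diffs)
# Empirical two-sided p-value: share of simulated studies at least as extreme as 10%.
empirical_p = sum(abs(x) >= 0.10 for x in diffs) / len(diffs)
print("empirical standard error:", round(standard_error, 3))
print("empirical two-sided p-value:", empirical_p)
```

With only 1000 repetitions the empirical p-value is itself noisy, so it should be read as a rough check on the theoretical value, not a replacement for it.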

Hypothesis Testing

Step 3: Do an experiment

We observed a difference of 10% between cases and controls.

Hypothesis Testing

Step 4: Calculate a p-value

$$Z = \frac{.10}{.033} = 3.0; \quad p = .003$$

P-value from our simulation…

When we ran this study 1000 times, we got 1 result as big or bigger than 10%.

We also got 3 results as small or smaller than -10%.

From our simulation, we estimate the p-value to be: 4/1000 or .004

Hypothesis Testing

Step 5: Reject or do not reject the null hypothesis.

Here we reject the null.

Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.

What would a lack of statistical significance mean?

If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation…

263 cases and 1241 controls: standard error is about 3.3%

50 cases and 50 controls: standard error is about 10%

With only 50 cases and 50 controls…

If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).

Two-tailed p-value = 17% x 2 = 34%

Practice problem…

An August 2003 research article in Developmental and Behavioral Pediatrics reported the following about a sample of UK kids: when given a choice of a non-branded chocolate cereal vs. CoCo Pops, 97% (36) of 37 girls and 71% (27) of 38 boys preferred the CoCo Pops. Is this evidence that girls are more likely to choose brand-named products?

Answer

1. Hypotheses:

H0: p♂-p♀= 0

Ha: p♂-p♀≠ 0 [two-sided]

2. Null distribution of difference of two proportions:

Null says p’s are equal, so estimate the standard error using the overall observed p.

$$\hat{p}_f - \hat{p}_m \sim N\left(0,\ \sqrt{\frac{\frac{63}{75}\left(1-\frac{63}{75}\right)}{37} + \frac{\frac{63}{75}\left(1-\frac{63}{75}\right)}{38}} = \sqrt{\frac{.84(.16)}{37} + \frac{.84(.16)}{38}} = .085\right)$$

3. Observed difference in our experiment = .97-.71 = .26

4. Calculate the p-value of what you observed:

$$Z = \frac{.26 - 0}{.085} = 3.06$$

data _null_;
  pval=(1-probnorm(3.06))*2;
  put pval;
run;

0.0022133699

5. p-value is sufficiently low for us to reject the null; there does appear to be a difference in gender preferences here.

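Key two-sample Hypothesis Tests…

Test for Ho: μx- μy = 0 (σ2 unknown, but roughly equal):

$$t_{n-2} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}; \quad s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n-2}, \quad n = n_x + n_y$$

Test for Ho: p1- p2= 0:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{(\bar{p})(1-\bar{p})}{n_1} + \frac{(\bar{p})(1-\bar{p})}{n_2}}}; \quad \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1+n_2}$$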

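Corresponding confidence intervals…

For a difference in means, 2 independent samples (σ2’s unknown but roughly equal):

$$(\bar{x} - \bar{y}) \pm t_{n-2,\,\alpha/2} \times \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$$

For a difference in proportions, 2 independent samples:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2} \times \sqrt{\frac{(\bar{p})(1-\bar{p})}{n_1} + \frac{(\bar{p})(1-\bar{p})}{n_2}}$$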

Appendix: details of rank-sum test…

Wilcoxon Rank-sum test

Rank all of the observations in order from 1 to n.

$T_1$ is the sum of the ranks from the smaller population ($n_1$)

$T_2$ is the sum of the ranks from the larger population ($n_2$)

$$U_1 = n_1n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

$$U_2 = n_1n_2 + \frac{n_2(n_2+1)}{2} - T_2$$

$$U_0 = \min(U_1, U_2)$$

$$\text{for } n_1 \ge 10,\ n_2 \ge 10: \quad Z = \frac{U_0 - \frac{n_1n_2}{2}}{\sqrt{\frac{n_1n_2(n_1+n_2+1)}{12}}}$$

Find P(U ≤ U0) in Mann-Whitney U tables, with n2 = the bigger of the 2 populations

Example

For example, if team 1 and team 2 (two gymnastic teams) are competing, and the judges rank all the individuals in the competition, how can you tell if team 1 has done significantly better than team 2 or vice versa?

Answer

Intuition: under the null hypothesis of no difference between the two groups…

If n1=n2, the sums of T1 and T2 should be equal.

But if n1 ≠ n2, then T2 (n2=bigger group) should automatically be bigger. But how much bigger under the null?

For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team1 (X) and team2 don’t differ in talent, the ranks ought to be spread evenly among the two groups, e.g.…

1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team1 ranks 3rd, 7th, and 11th)

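$T_1$ = sum of ranks of group 1 (smaller)

$T_2$ = sum of ranks of group 2 (larger)

$$T_1 + T_2 = \sum_{i=1}^{n_1+n_2} i = \frac{(n_1+n_2)(n_1+n_2+1)}{2} = \frac{n_1^2 + 2n_1n_2 + n_2^2 + n_1 + n_2}{2} = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1n_2$$

Remember this? $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$

Sum of within-group ranks for the smaller group:

$$\sum_{i=1}^{n_1} i = \frac{n_1(n_1+1)}{2}$$

Sum of within-group ranks for the larger group:

$$\sum_{i=1}^{n_2} i = \frac{n_2(n_2+1)}{2}$$

$$\text{e.g., here: } T_1 + T_2 = \sum_{i=1}^{13} i = \frac{(13)(14)}{2} = 91 = 55 + 6 + 30$$

Take-home point:

$$T_1 + T_2 = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1n_2$$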

$$\sum_{i=1}^{10} i = \frac{10(11)}{2} = 55; \quad \sum_{i=1}^{3} i = \frac{3(4)}{2} = 6; \quad 55 - 6 = 49$$

The difference between the sum of the ranks within each individual group is 49.

T1 = 3 + 7 + 11 = 21
T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70

70 - 21 = 49 Magic!

The difference between the sum of the ranks of the two groups is also equal to 49 if ranks are evenly interspersed (null is true).

It turns out that, if the null hypothesis is true, the difference between T2 and T1 (the sums of ranks of the larger and smaller groups) is expected to equal the difference between the larger group’s and the smaller group’s within-group sums of ranks.

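Under the null,

$$T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2} \quad \text{(from slide 24)}$$

$$T_2 + T_1 = \frac{n_2(n_2+1)}{2} + \frac{n_1(n_1+1)}{2} + n_1n_2 \quad \text{(from slide 23)}$$

Adding and subtracting these two equations gives:

$$T_2 = \frac{n_2(n_2+1)}{2} + \frac{n_1n_2}{2}; \quad T_1 = \frac{n_1(n_1+1)}{2} + \frac{n_1n_2}{2}$$

Rearranging, and defining new statistics:

$$\frac{n_1n_2}{2} = n_1n_2 + \frac{n_2(n_2+1)}{2} - T_2 \quad \Rightarrow \quad \text{define } U_2 = n_1n_2 + \frac{n_2(n_2+1)}{2} - T_2$$

$$\frac{n_1n_2}{2} = n_1n_2 + \frac{n_1(n_1+1)}{2} - T_1 \quad \Rightarrow \quad \text{define } U_1 = n_1n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

Their sum should equal $n_1n_2$.

Here, under null:
U2 = 55 + 30 - 70 = 15
U1 = 6 + 30 - 21 = 15
U2 + U1 = 30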

Under the null hypothesis, U1 should equal U2:

$$E(U_2 - U_1) = E\left[\left(\frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}\right) - (T_2 - T_1)\right] = 0$$

The U’s should be equal to each other and will equal n1n2/2:

U1 + U2 = n1n2

Under null hypothesis, U1 = U2 = U0

E(U1 + U2) = 2E(U0) = n1n2

E(U1) = E(U2) = E(U0) = n1n2/2

So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups.

It’s the smaller observed U value: U0

For small n’s, take U0, and get the p-value directly from a U table.

For large enough n’s (>10 per group)…

$$Z = \frac{U_0 - E(U_0)}{\sqrt{Var(U_0)}} = \frac{U_0 - \frac{n_1n_2}{2}}{\sqrt{Var(U_0)}}$$

$$E(U_0) = \frac{n_1n_2}{2}$$

$$Var(U_0) = \frac{n_1n_2(n_1+n_2+1)}{12}$$

Add observed data to the example…

Example: If the girls on the two gymnastics teams were ranked as follows:

Team 1: 1, 5, 7 (Observed T1 = 13)
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 (Observed T2 = 78)

Are the teams significantly different?

Total sum of ranks = 13*14/2 = 91

n1n2 = 3*10 = 30

Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null) and U0 = 15

U1 = 30 + 6 - 13 = 23
U2 = 30 + 55 - 78 = 7

U0 = 7

Not quite statistically significant in U table… p=.1084 (see attached); x2 for two-tailed test
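For groups this small, the U-table value can be reproduced by brute force. The Python sketch below computes U1, U2, and U0 from the observed ranks, then enumerates every way the ranks 1 to 13 could be split between the two teams to get the exact one-tailed probability P(U ≤ U0) under the null. The function names are illustrative assumptions, and the enumeration is only practical for small samples.

```python
from itertools import combinations

def mann_whitney_u(ranks_1, ranks_2):
    """U statistics from the ranks of the smaller group (1) and the larger group (2)."""
    n1, n2 = len(ranks_1), len(ranks_2)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(ranks_1)
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - sum(ranks_2)
    return u1, u2, min(u1, u2)

def exact_one_tailed_p(u0, n1, n2):
    """P(U <= u0) under the null, where every assignment of n1 ranks to group 1 is equally likely."""
    count = total = 0
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        total += 1
        u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(ranks)
        if u1 <= u0:
            count += 1
    return count / total

team_1 = [1, 5, 7]
team_2 = [2, 3, 4, 6, 8, 9, 10, 11, 12, 13]
u1, u2, u0 = mann_whitney_u(team_1, team_2)
p_one_tailed = exact_one_tailed_p(u0, len(team_1), len(team_2))
print("U1 =", u1, " U2 =", u2, " U0 =", u0)
print("one-tailed p =", round(p_one_tailed, 4), " two-tailed p =", round(2 * p_one_tailed, 4))
```

Doubling the one-tailed value gives the two-tailed p-value, matching the "x2" instruction in the example.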

Example problem 2

A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds; the mean loss for Atkins is going to look higher because of the bozo, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.

Atkins   Jenny Craig
 -100       -11
   -8       -15
   -4        -5
   +5        +6
   +8       -20
   +2

Answer

Ranks (1 = most weight lost):

Atkins   Jenny Craig
   1         4
   5         3
   7         6
   9        10
  11         2
   8

Sum of ranks for JC = 25 (n=5)
Sum of ranks for Atkins = 41 (n=6)

n1n2 = 5*6 = 30

Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 and U0 = 15

U1 = 30 + 15 - 25 = 20
U2 = 30 + 21 - 41 = 10

U0 = 10; n1=5, n2=6

Go to Mann-Whitney chart… p = .2143 x 2 = .43
