Lecture 8 • Resistance of two-sample t-tools and outliers (Chapters 3.3-3.4) • Transformations of the Data (Chapter 3.5)


Lecture 8

• Resistance of two-sample t-tools and outliers (Chapters 3.3-3.4)

• Transformations of the Data (Chapter 3.5)


Outliers and resistance

• Outliers are observations relatively far from their estimated means.

• Outliers may arise either
– (a) because the population distribution is long-tailed, or
– (b) because they do not belong to the population of interest (they come from a contaminating population).

• A statistical procedure is resistant if one or a few outliers cannot have an undue influence on the result.


Resistance

• Illustration for understanding resistance: the sample mean is not resistant; the sample median is.
– Sample: 9, 3, 5, 8, 100
– Mean with outlier: 25, without: 6.25
– Median with outlier: 8, without: 6.5

• t-tools are not resistant to outliers because they are based on sample means.
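The mean/median illustration can be checked directly. The following is a minimal sketch using only Python's standard library; the variable names are illustrative.

```python
from statistics import mean, median

sample = [9, 3, 5, 8, 100]                      # the slide's sample; 100 is the outlier
without_outlier = [x for x in sample if x != 100]

# The mean is pulled strongly toward the outlier; the median barely moves.
mean_with, mean_without = mean(sample), mean(without_outlier)
median_with, median_without = median(sample), median(without_outlier)

print(f"mean:   {mean_with} with outlier, {mean_without} without")
print(f"median: {median_with} with outlier, {median_without} without")
```

Dropping the single value 100 moves the mean from 25 to 6.25, while the median only moves from 8 to 6.5.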


Strategy for Dealing With Outliers

• Follow Display 3.6

• Important aspect of strategy: An outlier does not get swept under the rug simply because it is different from the other observations. To warrant its removal, an explanation for why it is different must be established.


Excluding Observations from Analysis in JMP for Investigating Outliers

• Click on the row you want to exclude.

• Click on the Rows menu and then click Exclude/Unexclude. A red circle with a line through it will appear next to the excluded observation.

• Multiple observations can be excluded.

• To include an excluded observation back in the analysis, click on the excluded row, click on the Rows menu, and then click Exclude/Unexclude. The red circle next to the observation should disappear.


Conceptual Question #6

• (a) What course of action would you propose for the statistical analysis if it was learned that Vietnam veteran #646 (the largest observation in Display 3.6) worked for several years, after Vietnam, handling herbicides with dioxin?

• (b) What would you propose if this was learned instead for Vietnam veteran #645 (second largest observation)?


Rules of thumb for validity of t-tools

• Assumptions and rules of thumb for validity of t-tools in the face of violations
– Normality: Look for gross skewness. Okay if both sample sizes are greater than 30.
– Equal spread: Validity is okay if the ratio of the larger sample standard deviation to the smaller sample standard deviation is less than 2 and the ratio of the larger group size to the smaller group size is less than 2. Consider transformations.
– Outliers: Look for outliers in box plots, especially very extreme points (more than 3 box-lengths away from the box). Apply the examination strategy in Display 3.6.
– Independence: If independence is not appropriate, apply matched pairs if appropriate, or other tools later in the course.
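The numeric parts of these rules of thumb (sample sizes, SD ratio, group-size ratio) can be computed mechanically. The sketch below is one way to do so; the function name `check_t_tool_rules` and the data are hypothetical, and the graphical checks for skewness and outliers still have to be done by eye.

```python
from statistics import stdev

def check_t_tool_rules(group1, group2):
    """Numeric rule-of-thumb diagnostics for two samples (lists of numbers).

    Assumes each group has at least two values and a nonzero standard deviation.
    """
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    sd_ratio = max(s1, s2) / min(s1, s2)        # larger SD over smaller SD
    size_ratio = max(n1, n2) / min(n1, n2)      # larger group over smaller group
    return {
        "both_n_over_30": min(n1, n2) > 30,
        "sd_ratio": sd_ratio,
        "size_ratio": size_ratio,
        "equal_spread_ok": sd_ratio < 2 and size_ratio < 2,
    }

# Hypothetical data, made up for illustration only.
a = [12.1, 14.3, 11.8, 13.5, 12.9, 15.0]
b = [10.2, 18.7, 9.5, 21.3, 14.8, 7.9, 16.4]
result = check_t_tool_rules(a, b)
print(result)
```

For this made-up data the SD ratio is well above 2, so the equal-spread rule fails and a transformation would be worth considering.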


Case Study 3.1.1: Cloud Seeding

• A randomized experiment was conducted to test the hypothesis that massive injection of silver iodide into cumulus clouds can lead to increased rainfall.

• On each of 52 days that were deemed suitable for cloud seeding, a random mechanism was used to decide whether to seed the target cloud on that day or leave it unseeded as a control.

• An airplane flew through the cloud in both cases, and the experimenters were blind to whether seeding was used (a double-blind trial).

• Question of interest: Did cloud seeding cause higher rainfall in this experiment?


[Figure: Oneway Analysis of Rainfall (acre-feet) By Group. Vertical axis: Rainfall (acre-feet), from -500 to 3000. Horizontal axis: Group (Seeded, Unseeded).]

How do the seeded and unseeded groups differ? Is an additive treatment effects model appropriate?


The log transformation

• Let log denote the logarithm to the base e (ln): $\log(x) = c$ means $e^c = x$.

• $\log(2.718) \approx 1$, $\log(2.718^2) \approx 2$, etc.

• Procedure:
– Transform to get two new columns: $Z_1 = \log(Y_1)$, $Z_2 = \log(Y_2)$.
– Graphically examine $Z_1, Z_2$ to see if the t-tools are appropriate for them.
– If appropriate, use t-tools on $Z_1, Z_2$.
– Interpret results on the original scale.
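The numeric steps of the procedure (transform, apply the pooled two-sample t statistic to the logs, interpret on the original scale) can be sketched as follows. The function name `pooled_t_on_logs` and the data are hypothetical, the graphical examination step is not shown, and a p-value would additionally need a t-distribution routine from a statistics package.

```python
import math
from statistics import mean, stdev

def pooled_t_on_logs(y1, y2):
    """Pooled two-sample t statistic computed on Z = log(Y); requires all Y > 0."""
    z1 = [math.log(y) for y in y1]
    z2 = [math.log(y) for y in y2]
    n1, n2 = len(z1), len(z2)
    diff = mean(z2) - mean(z1)                   # difference in means on the log scale
    pooled_var = ((n1 - 1) * stdev(z1) ** 2 + (n2 - 1) * stdev(z2) ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff, se, diff / se, n1 + n2 - 2      # estimate, std error, t, degrees of freedom

# Hypothetical data, made up for illustration only.
y1 = [1.0, 5.0, 20.0, 80.0, 300.0]
y2 = [4.0, 30.0, 100.0, 400.0, 2500.0]
diff, se, t_stat, df = pooled_t_on_logs(y1, y2)

# Interpretation on the original scale: exp(diff) is a multiplicative factor.
factor = math.exp(diff)
print(f"log-scale difference {diff:.3f}, t = {t_stat:.2f} on {df} df, factor {factor:.2f}")
```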


Cloud seeding data after log transformation

[Figure: Oneway Analysis of Log Rainfall By Group. Vertical axis: Log Rainfall, from -1 to 7. Horizontal axis: Group (Seeded, Unseeded).]


Interpretation – Causal Inference

• If the randomized experiment model with additive treatment effect is thought to hold for the log-transformed data, then an experimental unit that would respond to treatment 1 with a logged outcome of $\log(Y)$ would respond to treatment 2 with a logged outcome of $\log(Y) + \delta$.

• That is, the experimental unit responds to treatment 1 with an outcome of $Y$ and to treatment 2 with an outcome of $Y e^{\delta}$.

• Multiplicative treatment effect model: the effect of treatment 2 is to multiply the treatment 1 outcome by $e^{\delta}$.
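The step from an additive effect on the log scale to a multiplicative effect on the original scale is a single exponentiation. Writing $\delta$ for the additive treatment effect on the log scale and $Y_{\text{trt 2}}$ for the treatment 2 outcome (both symbols are assumed here for illustration):

```latex
\begin{align*}
\log(Y_{\text{trt 2}}) &= \log(Y) + \delta \\
Y_{\text{trt 2}} &= e^{\log(Y) + \delta} = e^{\log(Y)}\, e^{\delta} = Y e^{\delta}
\end{align*}
```

So $\delta = 0$ corresponds to a multiplier of 1 (no effect), and $\delta > 0$ corresponds to a multiplier greater than 1.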


Inference for multiplicative treatment effects

• To test whether there is any treatment effect, perform the usual t-test for $H_0: \delta = 0$ with the log-transformed data.

• To describe the treatment effect, “back-transform” the estimate of $\delta$ and the endpoints of the confidence interval for $\delta$ from the log-transformed data.


Oneway Analysis of Log Rainfall By Group

t-Test
            Difference   t-Test   DF   Prob > |t|
Estimate       1.14446    2.546   50       0.0140
Std Error      0.44959
Lower 95%      0.24144
Upper 95%      2.04748

There is moderate evidence that cloud seeding causes a change in rainfall (p-value = 0.0140). We estimate that cloud seeding causes rainfall to be multiplied by $e^{1.144} = 3.139$. A 95% confidence interval for the multiplicative treatment effect is $(e^{0.241}, e^{2.047}) = (1.273, 7.745)$.
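The back-transformation is just exponentiation of the log-scale estimate and interval endpoints. A minimal sketch, using the unrounded values from the JMP output above (the helper name `back_transform` is hypothetical):

```python
import math

def back_transform(estimate, lower, upper):
    """Exponentiate a log-scale estimate and its confidence interval endpoints."""
    return math.exp(estimate), (math.exp(lower), math.exp(upper))

# Log-scale estimate and 95% limits from the cloud seeding t-test output.
effect, (lo, hi) = back_transform(1.14446, 0.24144, 2.04748)
print(f"multiplicative effect: {effect:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

Because exp is increasing, the lower and upper endpoints keep their order. Results computed from unrounded inputs can differ in the third decimal place from figures obtained by exponentiating rounded values.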


Log Transformation for Population Inference

• Consider comparing means of two populations. If the populations appear skewed, with the population having the larger mean also having the larger spread, using the t-tools to analyze the log-transformed data $Z_1 = \log(Y_1)$, $Z_2 = \log(Y_2)$ might be more appropriate.

• Using the t-tools on the log-transformed data is appropriate (i.e., produces approximately valid results) if $Z_1$ and $Z_2$ are approximately normally distributed.


Inference for Population Medians

• If the distributions of $Z_1 = \log(Y_1)$ and $Z_2 = \log(Y_2)$ appear approximately normal with equal SD, then we can make inferences about the ratio of population medians for $Y_1$ and $Y_2$ as follows:
– To test if the population medians are the same, test the null hypothesis that the means of $Z_1$ and $Z_2$ are the same.
– An estimate of the ratio of the population 2 median to the population 1 median is $\exp(\bar{Z}_2 - \bar{Z}_1)$.
– To form a confidence interval for the ratio of population medians, form a confidence interval $(L, U)$ for the difference in the means of $Z_2$ and $Z_1$ (mean of $Z_2$ minus mean of $Z_1$). A confidence interval for the ratio of the population 2 median to the population 1 median is $(e^L, e^U)$.


Oneway Analysis of Log Salary By Sex (Case Study 1.1.2)

t-Test
            Difference   t-Test   DF   Prob > |t|
Estimate      -0.14694   -6.171   91       <.0001
Std Error      0.02381
Lower 95%     -0.19424
Upper 95%     -0.09965
Assuming equal variances

(Assuming fictitious random sampling model) There is convincing evidence that the median salaries of men and women are different. It is estimated that the median salary for males is $e^{0.147} = 1.16$ times as large as the median salary for females. Since a 95% confidence interval for the difference in means between males and females on the log scale is 0.100 to 0.194, a 95% confidence interval for the ratio of the male population median to the female population median is $(e^{0.100}, e^{0.194}) = (1.11, 1.21)$.
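The reported difference is negative while the interpretation uses positive values. Assuming the output reports the female mean minus the male mean on the log scale (an inference from the signs, not stated in the output), the steps are: negate the estimate, negate and swap the interval endpoints, then exponentiate.

```latex
\begin{align*}
\bar{Z}_{\text{female}} - \bar{Z}_{\text{male}} &= -0.147, \quad 95\%\ \text{CI } (-0.194,\ -0.100) \\
\bar{Z}_{\text{male}} - \bar{Z}_{\text{female}} &= 0.147, \quad 95\%\ \text{CI } (0.100,\ 0.194) \\
\frac{\text{median}_{\text{male}}}{\text{median}_{\text{female}}} &\approx e^{0.147} \approx 1.16, \quad 95\%\ \text{CI } (e^{0.100},\ e^{0.194}) \approx (1.11,\ 1.21)
\end{align*}
```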


When to use log transformation

What indicates that log might work?
– Distributions are skewed
– Spread is greater in the distribution with larger center
– The data values differ by orders of magnitude, e.g., as a rough guide, the ratio of the largest to the smallest is >10 (or perhaps >4)
– Multiplicative statement is desirable
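The "ratio of the largest to the smallest" guide is easy to compute. A minimal sketch with a hypothetical helper and made-up data follows; the other indicators (skewness, spread increasing with center) are judged from plots.

```python
def largest_to_smallest_ratio(values):
    """Ratio of the largest to the smallest value; requires all values > 0."""
    if min(values) <= 0:
        raise ValueError("the ratio guide only makes sense for positive data")
    return max(values) / min(values)

# Hypothetical measurements, made up for illustration only.
measurements = [2.0, 15.0, 40.0, 310.0, 1200.0]
ratio = largest_to_smallest_ratio(measurements)
suggests_log = ratio > 10        # rough guide from the slide (or perhaps > 4)
print(f"largest/smallest = {ratio:.0f}, log transformation suggested: {suggests_log}")
```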


Other transformations

• Square root transformation ($\sqrt{Y}$) - applies to data that are counts and to measurements of area

• Reciprocal transformation ($1/Y$) - applies to data that are waiting times (e.g., time to failure of lightbulbs); the reciprocal of a time measurement can often be interpreted directly as a rate or a speed

• Goals of transformation: Establish a scale on which the two groups have roughly the same spread.
– Inferences from the log transformation are directly interpretable when converted back to the original scale of measurement. Other transformations are not so easily interpretable, e.g., the square of the difference between the means of $\sqrt{Y_1}$ and $\sqrt{Y_2}$ is not so easily interpretable.