STAT 3022 SLIDES UMN CHAPTER3


  • 7/29/2019 STAT 3022 SLIDES UMN CHAPTER3


    Introduction Robustness Resistance Transformation Outlier

    Chapter 3

    A Closer Look at Assumptions

    STAT 3022, School of Statistics, University of Minnesota

    Spring 2013


    Introduction

    In Chapter 2, we discussed the mechanics of using t-procedures to perform statistical inference, namely t-tests and confidence intervals.

    We base these procedures on certain assumptions:

    we have random samples, representative of the populations

    the data come from Normal populations

    the samples are drawn independently

    in pooled two-sample settings, the populations have equal standard deviations ($\sigma_1 = \sigma_2 = \sigma$)

    In practice, these assumptions are usually not strictly met. When are these procedures still appropriate?


    Case Study: Making it Rain

    Data were collected in southern Florida between 1968 and 1972 to test the hypothesis that massive injection of silver iodide (AgI) into cumulus clouds can lead to increased rainfall.

    This process is called cloud seeding. On each of 52 days, researchers either seeded a target cloud or left it unseeded (as a control). The treatment was randomly assigned.

    Researchers were blind to the treatment: pilots flew through the cloud every day, whether treatment or control, and a mechanism in the plane either seeded the cloud or left it unseeded.

    Question: Did cloud seeding have an effect on rainfall? If so, how much?


    Graphical Summaries

    library("Sleuth2")

    boxplot(Rainfall ~ Treatment, ylab='Rainfall (acre-feet)', data=case0301)

    [Figure: side-by-side box plots of Rainfall (acre-feet) for the Unseeded and Seeded groups; vertical axis from 0 to 2500.]


    Graphical Summaries

    par(mfrow=c(2,1), mar=c(4,4,1,0.5))
    hist(case0301$Rainfall[case0301$Treatment=="Seeded"], breaks=10,
         main="Seeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
    hist(case0301$Rainfall[case0301$Treatment=="Unseeded"], breaks=8,
         main="Unseeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")

    [Figure: two stacked histograms, "Seeded - Rainfall" and "Unseeded - Rainfall"; horizontal axis from 0 to 3000, vertical axis Frequency.]


    Numerical Summaries and Interpretations

    Numerical Summaries: Do it yourself (follow the R code on page 42 of the Chapter 2 slides).

    Graphical and numerical summaries indicate that rainfall tended to be greater on seeded days. However, there are problems with our necessary assumptions:

    both distributions are very skewed

    both distributions have outliers

    variability is much greater in the seeded group than in the unseeded group

    Can we use our usual t-tools to analyze these data? How?
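For the numerical summaries mentioned above, one possible sketch is shown here. It assumes a data frame with a numeric response and a two-level grouping factor; the `rain` data frame below is simulated and is only a stand-in for `case0301`, so its numbers are not the real cloud-seeding values.

```r
# Simulated stand-in for the cloud-seeding data (NOT the real case0301 values).
set.seed(1)
rain <- data.frame(
  Rainfall  = c(rlnorm(26, meanlog = 4, sdlog = 1.6),
                rlnorm(26, meanlog = 5, sdlog = 1.6)),
  Treatment = factor(rep(c("Unseeded", "Seeded"), each = 26),
                     levels = c("Unseeded", "Seeded"))
)

# Min, quartiles, median, mean, and max: one row per group.
group_summary <- t(sapply(split(rain$Rainfall, rain$Treatment), summary))

# Sample size and standard deviation per group.
group_n  <- tapply(rain$Rainfall, rain$Treatment, length)
group_sd <- tapply(rain$Rainfall, rain$Treatment, sd)

print(group_summary)
print(cbind(n = group_n, sd = group_sd))
```

With the real data, replacing `rain` by `case0301` (after `library("Sleuth2")`) should give the corresponding summaries.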


    Can we do this?

    > t.test(Rainfall ~ Treatment, alternative="two.sided",

    + var.equal=TRUE, data=case0301)

    Two Sample t-test

    data: Rainfall by Treatment

    t = -1.9982, df = 50, p-value = 0.05114

    alternative hypothesis: true difference in means is not equal to 0

    95 percent confidence interval:

    -556.224179 1.431851

    sample estimates:

    mean in group Unseeded mean in group Seeded

    164.5885 441.9846

    How much did the violations of our assumptions affect these results?


    Robustness

    The t-tools may be used even when assumptions are violated, to a certain degree, because the t-tools are robust.

    Robustness: A statistical procedure is robust to departures from a particular assumption if it is valid even when the assumption is not met.


    Type 1: Robustness Against Departures from Normality

    Recall that the Central Limit Theorem (CLT) states that sample averages have approximately Normal sampling distributions, regardless of the shape of the population distribution, for large samples.

    As long as the samples are large enough, the t-ratio will follow an approximate t-distribution even if the data are non-Normal.
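A small simulation can illustrate this robustness claim. The sketch below is an illustration only, not the textbook's simulation: it assumes a skewed Exponential(1) population (an arbitrary choice) and estimates how often a nominal 95% one-sample t-interval covers the true mean at a small and a large sample size.

```r
set.seed(3022)
n_reps    <- 2000
true_mean <- 1  # mean of the Exponential(rate = 1) population

# Fraction of simulated 95% t-intervals that contain the true mean.
coverage <- function(n) {
  covered <- replicate(n_reps, {
    x  <- rexp(n, rate = 1)      # skewed, non-Normal data
    ci <- t.test(x)$conf.int     # nominal 95% interval
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(covered)
}

cov_small <- coverage(5)
cov_large <- coverage(100)
cat("estimated coverage, n = 5:  ", cov_small, "\n")
cat("estimated coverage, n = 100:", cov_large, "\n")
```

The estimated coverage should sit closer to the nominal 95% for the larger sample size, which is the CLT effect described above.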


    Type 1: Robustness Against Departures from Normality

    Effects of Skewness

    If the two populations have the same standard deviations and approximately the same shapes, and if $n_1 \approx n_2$, then the validity of the t-tools is affected very little by skewness.

    If the two populations have the same standard deviations and approximately the same shapes, but $n_1 \neq n_2$, then the validity of the t-tools is affected substantially by skewness. Larger sample sizes diminish this effect.

    If the skewness in the two populations differs considerably, the t-tools can be very misleading with small and moderate sample sizes.

    See Display 3.4 in the textbook for simulation results.


    Type 2: Robustness Against Differing Standard Deviations

    When we cannot assume $\sigma_1 = \sigma_2$, more serious problems may arise:

    $s_p$ no longer estimates any parameter

    $SE(\bar{x}_1 - \bar{x}_2)$ no longer estimates the standard deviation of the difference between averages

    the t-ratio no longer follows a t-distribution

    What can we do?

    If $n_1 \approx n_2$, the t-tools remain fairly valid even when $\sigma_1 \neq \sigma_2$.

    When $n_1$ and $n_2$ are very different, we need the ratio $\sigma_1/\sigma_2$ to be between 1/2 and 2 to have reliable results.

    See Display 3.5 in the textbook for simulation results.

    > t.test(x1, x2, alternative = 'two.sided', var.equal = FALSE)
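To make the `var.equal = FALSE` call above concrete, here is a hedged sketch on simulated data. The vectors `x1` and `x2` are hypothetical samples with deliberately different sizes and spreads; they are not from the course data. The sketch checks the sample standard deviation ratio and compares the pooled analysis with the unequal-variance (Welch) analysis that `t.test` performs when `var.equal = FALSE`.

```r
set.seed(42)
x1 <- rnorm(10, mean = 0, sd = 1)   # small sample, small spread (simulated)
x2 <- rnorm(40, mean = 0, sd = 4)   # large sample, large spread (simulated)

# Rough check of the "between 1/2 and 2" guideline using sample SDs.
sd_ratio <- sd(x1) / sd(x2)

pooled <- t.test(x1, x2, alternative = "two.sided", var.equal = TRUE)
welch  <- t.test(x1, x2, alternative = "two.sided", var.equal = FALSE)

cat("sample SD ratio:", sd_ratio, "\n")
cat("pooled df:", pooled$parameter, " p-value:", pooled$p.value, "\n")
cat("Welch df: ", welch$parameter, " p-value:", welch$p.value, "\n")
```

The pooled test always uses n1 + n2 - 2 degrees of freedom, while the Welch approximation uses a smaller, data-dependent value when the spreads differ.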


    Resistance and Outliers

    An outlier is an observation judged to be far from its group average.

    A statistical procedure is resistant if it does not change very much when a small part of the data changes, perhaps drastically.

    Whether or not we should simply remove such observations depends on how resistant our tools are to changes in the data.

    Question: Can you tell the difference between Robustness and Resistance?


    Example of Outlier

    [Figure: two example plots illustrating outliers; only the axis tick labels survive in this copy.]


    Example of Resistance

    Consider a hypothetical sample:

    10, 20, 30, 50, 70

    The sample mean is 36, and the sample median is 30.

    Now consider the sample:

    10, 20, 30, 50, 700

    What happens to the sample mean? What about the sample median?

    The sample median is resistant to any change in a single observation, while the sample mean is not.
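The two hypothetical samples above can be checked directly in R; this short sketch only uses the numbers given on the slide.

```r
original <- c(10, 20, 30, 50, 70)
changed  <- c(10, 20, 30, 50, 700)   # one observation changed drastically

mean_original   <- mean(original)     # 36
median_original <- median(original)   # 30
mean_changed    <- mean(changed)      # 162
median_changed  <- median(changed)    # 30

cat("mean:  ", mean_original, "->", mean_changed, "\n")
cat("median:", median_original, "->", median_changed, "\n")
```

The mean moves from 36 to 162 while the median stays at 30, which is what resistance means here.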


    Practical Strategies for the Two-Sample Problem

    Our task is to size up the actual conditions, using the available data, and evaluate the appropriateness of the t-tools:

    1. think about possible cluster and serial effects

    2. evaluate the suitability of the t-tools by examining graphical displays (side-by-side histograms or box plots)

    3. consider alternatives

    a. Transform the data (Section 3.5) to see whether the transformed data look closer to meeting the assumptions

    b. Use alternative tools that do not require model assumptions (Chapter 4)


    Transformations of Data

    For positive data, the most useful transformation is the logarithm (log), particularly the natural (base e) logarithm (e = 2.71828...).

    $\log(1) = 0$

    $\log(e^x) = x$

    [Figure: "log function"; horizontal axis x from 0 to 10, vertical axis log(x).]
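The two identities above, plus the property that makes the log useful for comparing groups (ratios become differences), can be checked numerically. The values of `x`, `a`, and `b` below are arbitrary illustrative numbers, not data from the slides.

```r
x <- c(0.5, 1, 2, 10)    # arbitrary positive values
a <- 8; b <- 2           # arbitrary positive values

log_of_one <- log(1)            # natural log by default in R
roundtrip  <- log(exp(x))       # should recover x

# A ratio on the original scale is a difference on the log scale.
log_ratio <- log(a / b)
log_diff  <- log(a) - log(b)

print(log_of_one)
print(roundtrip)
print(c(log_ratio = log_ratio, log_diff = log_diff))
```

This ratio-to-difference property is why a difference in means of logged data is later read as a multiplicative effect.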


    Cloud Seeding - Transformation

    Recall that both groups are skewed, with the seeded days having a larger average and a greater spread.

    > max(case0301$Rainfall[case0301$Treatment=="Seeded"])/

    + min(case0301$Rainfall[case0301$Treatment=="Seeded"])

    [1] 669.6586

    > max(case0301$Rainfall[case0301$Treatment=="Unseeded"])/

    + min(case0301$Rainfall[case0301$Treatment=="Unseeded"])

    [1] 1202.6

    > case0301$logRain <- log(case0301$Rainfall)

    > head(case0301)

    Rainfall Treatment logRain

    1 1202.6 Unseeded 7.092241

    2 830.1 Unseeded 6.721546

    3 372.4 Unseeded 5.919969

    4 345.5 Unseeded 5.844993

    5 321.2 Unseeded 5.772064

    6 244.3 Unseeded 5.498397
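As a quick check of the transformation, the six rainfall values printed above can be logged by hand. This sketch uses only those six values (the vector name `rainfall_head` is made up for the illustration) and should reproduce the `logRain` column shown in the output.

```r
# First six Rainfall values, copied from the head() output above.
rainfall_head <- c(1202.6, 830.1, 372.4, 345.5, 321.2, 244.3)

# Natural log, as in the logRain column.
log_rain_head <- log(rainfall_head)

print(data.frame(Rainfall = rainfall_head,
                 logRain  = round(log_rain_head, 6)))
```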

    [Figure: side-by-side box plots for the Unseeded and Seeded groups in two panels: "before transformation" (scale 0 to 2500) and "after transformation" (log scale, 0 to 8).]


    Two-Sample t-Analysis

    Before:

    > t.test(Rainfall ~ Treatment, alternative="two.sided",

    + var.equal=TRUE, data=case0301)

    Two Sample t-test

    data: Rainfall by Treatment

    t = -1.9982, df = 50, p-value = 0.05114

    alternative hypothesis: true difference in means is not equal to 0

    95 percent confidence interval:

    -556.224179 1.431851

    sample estimates:

    mean in group Unseeded mean in group Seeded

    164.5885 441.9846

    After:

    > t.test(logRain ~ Treatment, data=case0301,

    + alternative="less", var.equal=TRUE)

    Two Sample t-test

    data: logRain by Treatment

    t = -2.5444, df = 50, p-value = 0.007041

    alternative hypothesis: true difference in means is less than 0

    95 percent confidence interval:

    -Inf -0.3904045

    sample estimates:

    mean in group Unseeded mean in group Seeded

    3.990406 5.134187

    There is convincing evidence that seeding increased rainfall (one-sided p-value = 0.007 on the log scale).


    Multiplicative Treatment Effect

    Definition: Suppose $Z = \log(Y)$. It is estimated that the response of an experimental unit to treatment 2 will be $e^{\bar{Z}_2 - \bar{Z}_1}$ times as large as its response to treatment 1 (where $\bar{Z}_1$ is the average of $\log(Y_1)$ and $\bar{Z}_2$ is the average of $\log(Y_2)$).

    > m1 m2 (diffmeans (est.mult.effect
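The definition above can be sketched numerically from the group means of `logRain` reported in the earlier t-test output (3.990406 for Unseeded, 5.134187 for Seeded). The variable names below are hypothetical, and treating Seeded as "treatment 2" is an assumption made for this illustration; this is not necessarily the code that was on the original slide.

```r
# Group means of log(Rainfall), copied from the t-test output on the log scale.
mean_log_unseeded <- 3.990406   # treated here as Z1-bar (assumption)
mean_log_seeded   <- 5.134187   # treated here as Z2-bar (assumption)

# Difference of averages on the log scale.
diff_means <- mean_log_seeded - mean_log_unseeded

# Back-transform: estimated multiplicative effect of seeding.
est_mult_effect <- exp(diff_means)

cat("difference in mean log rainfall:", diff_means, "\n")
cat("estimated multiplicative effect:", est_mult_effect, "\n")
```

Under these assumptions the estimate is a little over 3, meaning rainfall on seeded days is estimated to be roughly three times that on unseeded days.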


    Confidence Interval

    > (test test$conf.int

    [1] -2.0466973 -0.2408651

    attr(,"conf.level")

    [1] 0.95

    > exp(test$conf.int)

    [1] 0.1291608 0.7859476

    attr(,"conf.level")

    [1] 0.95

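    A 95% confidence interval for the multiplicative effect of unseeded relative to seeded is 0.129 to 0.786; that is, rainfall on unseeded days is estimated to be between 0.129 and 0.786 times the rainfall on seeded days.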
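The interval above is stated as unseeded relative to seeded. A hedged sketch of the back-transformation, using only the log-scale endpoints printed above, also shows how to express the same interval as seeded relative to unseeded by taking reciprocals. The variable names are made up for this illustration.

```r
# Log-scale 95% interval for (Unseeded - Seeded), copied from test$conf.int above.
ci_log <- c(-2.0466973, -0.2408651)

# Back-transform: interval for the ratio unseeded / seeded.
ci_unseeded_over_seeded <- exp(ci_log)

# Reciprocals (in reversed order) give the interval for seeded / unseeded.
ci_seeded_over_unseeded <- rev(1 / ci_unseeded_over_seeded)

print(ci_unseeded_over_seeded)
print(ci_seeded_over_unseeded)
```

The second interval is the more natural reading for the research question: it brackets how many times larger rainfall is estimated to be on seeded days.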


    Removing Outliers and Other Data Points

    > library(Sleuth2); ex0327[15:17, ]

    Country Life Income Type

    15 Portugal 68.1 956 Industrialized

    16 South_Africa 68.2 NaN Industrialized

    17 Sweden 74.7 5596 Industrialized

    > range(ex0327$Income, na.rm=TRUE)

    [1] 110 5596

    > data

    > d1 ### dealing with Missing data ###

    > (cc data2
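The R commands on this slide are truncated in this copy, so the following is only a sketch of one common way to handle the missing `Income` value, not a reconstruction of the original code. It rebuilds the three rows printed above in a small data frame (named `rows` here for illustration) and assumes that `cc` referred to something like `complete.cases`.

```r
# The three rows shown above, rebuilt by hand (Type column omitted).
rows <- data.frame(
  Country = c("Portugal", "South_Africa", "Sweden"),
  Life    = c(68.1, 68.2, 74.7),
  Income  = c(956, NaN, 5596)
)

# TRUE for rows with no missing values; NaN counts as missing.
cc <- complete.cases(rows)

# Keep only the complete rows.
rows_complete <- rows[cc, ]

# Alternatively, skip missing values inside a single computation.
income_range <- range(rows$Income, na.rm = TRUE)

print(cc)
print(rows_complete)
print(income_range)
```

Dropping rows changes the sample for every later analysis, while `na.rm = TRUE` only affects the one computation; which is appropriate depends on the analysis.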


    Q: How many conservative economists does it take to change a light bulb?

    A: None, they're all waiting for the unseen hand of the market to correct the lighting disequilibrium.