Robust Statistics
Osnat Goren-Peyser, 25.3.08
Outline
1. Introduction
2. Motivation
3. Measuring robustness
4. M estimators
5. Order statistics approaches
6. Summary and conclusions
1. Introduction
Problem definition: Let x_1, x_2, …, x_n be a set of iid variables with distribution F, sorted in increasing order so that x_1 ≤ x_2 ≤ … ≤ x_n, where x_m is the value of the m'th order statistic.
An estimator θ̂ = θ̂(x_1, x_2, …, x_n) is a function of the observations.
We are looking for estimators θ̂ that are close to the unknown parameter θ.
Asymptotic value of an estimator. Define θ̂∞(F) such that θ̂_n →_p θ̂∞(F), where θ̂∞(F) is the asymptotic value of the estimator at F.
An estimator θ̂ is consistent for θ if θ̂∞(F) = θ.
We say that θ̂ is asymptotically normal with parameters θ, V(θ) if √n (θ̂_n − θ) →_d N(0, V(θ)).
Efficiency. An unbiased estimator θ̂ is efficient if I(θ)·var(θ̂) = 1, where I(θ) is the Fisher information.
An unbiased estimator is asymptotically efficient if lim_{n→∞} n·I(θ)·var(θ̂_n) = 1.
Relative efficiency. For a fixed underlying distribution, assume two unbiased estimators θ̂_1 and θ̂_2 of θ. We say that θ̂_1 is more efficient than θ̂_2 if var(θ̂_1) < var(θ̂_2).
The relative efficiency (RE) of estimator θ̂_2 with respect to θ̂_1 is defined as the ratio of their variances:
RE(θ̂_2; θ̂_1) = var(θ̂_1) / var(θ̂_2)
Asymptotic Relative efficiency
The asymptotic relative efficiency (ARE) is the limit of the RE as the sample size n → ∞.
For two estimators θ̂_1, θ̂_2 which are each consistent for θ and also asymptotically normal [1]:
ARE(θ̂_2; θ̂_1) = lim_{n→∞} var(θ̂_1) / var(θ̂_2)
The location model: x_i = μ + u_i, i = 1, 2, …, n,
where μ is the unknown location parameter, u_i are the errors, and x_i are the observations.
The errors u_i are i.i.d. random variables, each with the same distribution function F_0.
The observations x_i are i.i.d. random variables with common distribution function F(x) = F_0(x − μ).
Normality. Classical statistical methods rely on the assumption that F is exactly known. The assumption that F = N(μ, σ²) is commonly used.
But normality often holds only approximately → robust methods.
Approximate normality: the majority of observations are normally distributed, while some observations follow a different pattern (not normal) or no pattern at all. Suggested model: a mixture model.
A mixture model. Formalizing the idea of F being approximately normal:
Assume that a proportion 1 − ε of the observations is generated by the normal model G, while a proportion ε is generated by an unknown model H.
The mixture model: F = (1 − ε)G + εH.
F is a contamination "neighborhood" of G, also called the gross error model.
F is called a normal mixture model when both G and H are normal.
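As a sketch of the gross error model, a contaminated sample can be generated as below. The choices ε = 0.1 and H = N(μ, 10²) are illustrative assumptions, not values from the slides:

```python
import random

def sample_mixture(n, eps=0.1, mu=0.0, sigma=1.0, sigma_h=10.0, seed=0):
    """Draw n observations from F = (1 - eps)*G + eps*H, with
    G = N(mu, sigma^2) and H = N(mu, sigma_h^2) (a normal mixture model)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # with probability eps the observation comes from the contaminating model H
        scale = sigma_h if rng.random() < eps else sigma
        draws.append(rng.gauss(mu, scale))
    return draws

data = sample_mixture(1000)
```

Roughly a tenth of the draws come from the wide component H and show up as gross errors far from the bulk of the sample.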
2. Motivation
Outliers. An outlier is an atypical observation that is well separated from the bulk of the data.
Statistics derived from data sets that include outliers will often be misleading.
Even a single outlier can have a large distorting influence on classical statistical methods.
[Figure: two histograms of observation values; one sample without outliers, and one with a cluster of outliers well separated from the bulk of the data.]
Estimators not sensitive to outliers are said to be robust.
Mean and standard deviation
The sample mean, a classical estimate of the location (center) of the data, is defined by
x̄ = (1/n) Σ_{i=1}^n x_i
For F = N(μ, σ²), the sample mean is unbiased, with x̄ ~ N(μ, σ²/n).
The sample standard deviation (SD), a classical estimate of the dispersion of the data, is defined by
s = √( (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)² )
How much influence can a single outlier have on these classical estimators?
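A minimal numeric illustration of the outlier influence on both statistics (the small data set is made up for the sketch):

```python
from statistics import mean, stdev

clean = [2.2, 2.5, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8]

# Push a single added observation further and further out:
# both the sample mean and the sample SD follow it without bound.
for outlier in (10.0, 100.0, 1000.0):
    xs = clean + [outlier]
    print(f"outlier={outlier:7.1f}  mean={mean(xs):9.2f}  sd={stdev(xs):9.2f}")
```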
Example 1 – the flour example. Consider the following 24 determinations of the copper content in wholemeal flour (in parts per million), sorted in ascending order [6]:
2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95
The value 28.95 is considered an outlier. Two cases:
Case A – taking into account the whole data; Case B – deleting the suspicious outlier.
[Figure: estimated probability of the observation values with the sample mean marked. Left panel: Case A – using the whole data for estimation. Right panel: Case B – deleting the outlier.]
Example 1 – PDFs
Case A (with the outlier): x̄ = 4.28, s = 5.30. Case B (outlier deleted): x̄ = 3.21, s = 0.69.
Example 1 – arising question. Question: how much influence can a single outlier have on the sample mean and sample SD? Assume the outlier value 28.95 is replaced by an arbitrary value varying from −∞ to +∞:
The value of the sample mean changes from −∞ to +∞.
The value of the sample SD grows without bound.
Conclusion: a single outlier has an unbounded influence on these two classical estimators!
This is related to the sensitivity curve and the influence function, as we will see later.
Approaches for handling outliers
Detect and remove outliers from the data set:
- Manual screening
- The normal Q-Q plot
- The "three-sigma edit" rule
Robust estimators!
Manual screening. Why is screening the data and removing outliers not sufficient?
- Users do not always screen the data.
- Outliers are not always errors! Outliers may be correct, and very important for seeing the whole picture, including extreme cases.
- It can be very difficult to spot outliers in multivariate or highly structured data.
- It is a subjective decision without any unified criterion: different users, different results.
- It is difficult to determine the statistical behavior of the complete procedure.
The normal Q-Q plot. A manual screening tool for an underlying normal distribution: a quantile-quantile plot of the sample quantiles of X versus theoretical quantiles from a normal distribution.
If the distribution of X is normal, the plot will be close to linear.
The "three-sigma edit" rule. An outlier detection tool for an underlying normal distribution. Define the ratio between the distance of x_i from the sample mean and the sample SD:
t_i = (x_i − x̄) / s
The "three-sigma edit rule": observations with |t_i| > 3 are deemed suspicious.
Example 1: the largest observation in the flour data has t_i = 4.65, and so is suspicious.
Disadvantages:
- In very small samples the rule is ineffective.
- Masking: when there are several outliers, their effects may interact in such a way that some or all of them remain unnoticed.
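The rule translates directly into code; checked here on the flour data from Example 1 (the helper name is my own):

```python
from statistics import mean, stdev

def three_sigma_suspects(xs, cutoff=3.0):
    """Return the observations whose |t_i| = |x_i - mean| / SD exceeds the cutoff."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) / s > cutoff]

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37,
         3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

print(three_sigma_suspects(flour))  # only 28.95 is flagged
```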
Example 2 – velocity of light. Consider the following 20 determinations of the time (in microseconds) needed for light to travel a distance of 7442 m [6]:
28, 26, 33, 24, 34, −44, 27, 16, 40, −2, 29, 22, 24, 21, 25, 30, 23, 29, 31, 19
The actual times are the table values × 0.001 + 24.8.
The values −2 and −44 are suspicious as outliers.
Example 2 – QQ plot
[Figure: QQ plot of the sample data versus standard normal quantiles. The observations −2 and −44 fall far off the line and are marked as outliers.]
Example 2 – masking. Results, based on the three-sigma edit rule:
For x_i = −2: t_i = −1.35; for x_i = −44: t_i = −3.73.
The value of |t_i| for the observation −2 does not indicate that it is an outlier: the value −44 "masks" the value −2.
Detect and remove outliers. There are many other methods for detecting outliers. Deleting an outlier poses a number of problems:
- It affects the distribution theory, underestimating the data variability.
- It depends on the user's subjective decisions.
- It is difficult to determine the statistical behavior of the complete procedure.
Robust estimators provide automatic ways of detecting and removing outliers.
Example 1 – Comparing the sample median to the sample mean
[Figure: Case C – testing the median estimator: estimated probability of the flour data, with the Case A and Case B sample means and sample medians marked.]
- Case A: med = 3.3850. Case B: med = 3.3700.
- The sample median fits the bulk of the data in both cases.
- The value of the sample median does not change from −∞ to +∞, as was the case for the sample mean.
- The sample median is a good robust alternative to the sample mean.
Robust alternative to the mean. The sample median is a very old method for estimating the "middle" of the data. The sample median is defined for some integer m by
Med(x) = x_(m) if n is odd (n = 2m − 1), and (x_(m) + x_(m+1))/2 if n is even (n = 2m)
For large n and F = N(μ, σ²), the sample median is approximately N(μ, σ²π/(2n)).
At the normal distribution: ARE(median; mean) = 2/π ≈ 64%.
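The definition above in code, checked on the flour data from Example 1:

```python
def sample_median(xs):
    """Median per the slide's definition: the middle order statistic when
    n = 2m - 1 is odd, the average of x_(m) and x_(m+1) when n = 2m is even."""
    s = sorted(xs)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 == 1 else 0.5 * (s[m - 1] + s[m])

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37,
         3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

print(sample_median(flour))       # Case A: med ~ 3.385
print(sample_median(flour[:-1]))  # Case B (outlier deleted): med ~ 3.37
```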
Effect of a single outlier
- The sample mean can be upset completely by a single outlier.
- The sample median is little affected by a single outlier.
The median is resistant to gross errors, whereas the mean is not. The median will tolerate up to 50% gross errors before it can be made arbitrarily large.
Breakdown point: median 50%, mean 0%.
Mean & median – robustness vs. efficiency
For the mixture model F = (1 − ε)N(μ, 1) + εN(μ, σ²):
- The sample mean variance is (1 − ε + εσ²)/n.
- The sample median variance is approximately π / (2n [1 − ε + ε/σ]²).
The gain in robustness due to using the median is paid for by a loss in efficiency when F is very close to normal.
So why not always use the sample median? If the data do not contain outliers, the sample median has statistical performance poorer than that of the classical sample mean.
Robust estimation goal: "the best of both worlds".
We shall develop estimators which combine the low variance of the mean at the normal with the robustness of the median under contamination.
3. Measuring robustness
Analysis tools
- Sensitivity curve (SC)
- Influence function (IF)
- Breakdown point (BP)
Sensitivity curve. The SC measures the effect of different locations of an outlier on the sample. The sensitivity curve of an estimator θ̂ for the sample x_1, x_2, …, x_n is:
SC(x_0) = θ̂(x_1, …, x_n, x_0) − θ̂(x_1, …, x_n)
where x_0 is the location of a single outlier. Bounded SC(x_0) → high robustness!
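The SC can be computed empirically; a sketch (the grid sample is an illustrative assumption):

```python
from statistics import mean, median

def sensitivity_curve(estimator, xs, x0):
    """SC(x0) = estimator(x_1, ..., x_n, x0) - estimator(x_1, ..., x_n)."""
    return estimator(list(xs) + [x0]) - estimator(xs)

xs = [i / 10 for i in range(-50, 51)]  # a symmetric sample around 0

# The mean's SC grows linearly with the outlier location x0,
# while the median's SC stays bounded.
for x0 in (10.0, 100.0, 1000.0):
    print(x0, sensitivity_curve(mean, xs, x0), sensitivity_curve(median, xs, x0))
```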
SC of mean & median
[Figure: sensitivity curves for n = 200, F = N(0, 1), SC scale ×10⁻³. Left panel: sample mean, unbounded SC growing with the outlier location. Right panel: sample median, bounded SC.]
Standardized sensitivity curve
The standardized sensitivity curve is defined by
SC_n(x_0) = [θ̂(x_1, …, x_n, x_0) − θ̂(x_1, …, x_n)] / (1/(n + 1))
What happens if we add one more observation to a very large sample?
Influence function. The influence function of an estimator θ̂ (Hampel, 1974) is an asymptotic version of its sensitivity curve. It is an approximation to the behavior of θ̂∞ when the sample contains a small fraction ε of identical outliers. It is defined as
IF(x_0; θ̂, F) = lim_{ε→0⁺} [θ̂∞((1 − ε)F + εδ_{x_0}) − θ̂∞(F)] / ε
where δ_{x_0} is the point-mass at x_0 and ε→0⁺ stands for the limit from the right.
IF main uses. θ̂∞((1 − ε)F + εδ_{x_0}) is the asymptotic value of the estimate when the underlying distribution is F and a fraction ε of outliers is equal to x_0. The IF has two main uses:
- Assessing the relative influence of individual observations on the value of the estimate. Unbounded IF → less robustness.
- Allowing a simple heuristic assessment of the asymptotic variance of an estimate.
IF as a limit version of the SC. The SC is a finite-sample version of the IF.
If ε is small:
θ̂∞((1 − ε)F + εδ_{x_0}) ≈ θ̂∞(F) + ε·IF(x_0; θ̂, F)
bias = θ̂∞((1 − ε)F + εδ_{x_0}) − θ̂∞(F) ≈ ε·IF(x_0; θ̂, F)
For large n, with ε = 1/(n + 1):
SC_n(x_0) ≈ IF(x_0; θ̂, F)
Breakdown point. The BP is the proportion of arbitrarily large observations an estimator can handle before giving an arbitrarily large result.
Maximum possible BP = 50%. High BP → more robustness!
As seen before: mean 0%, median 50%.
Summary
- The SC measures the effect of different outliers on estimation.
- The IF is the asymptotic behavior of the SC.
- The IF and the BP consider extreme situations in the study of contamination: the IF deals with "infinitesimal" values of ε, while the BP deals with the largest ε an estimate can tolerate.
4. M estimators
Maximum likelihood of μ. Consider the location model, and assume that F_0 has density f_0.
The likelihood function is:
L(x_1, …, x_n; μ) = Π_{i=1}^n f_0(x_i − μ)
The maximum likelihood estimate (MLE) of μ is:
μ̂ = arg max_μ L(x_1, …, x_n; μ)
M estimators of location (μ). MLE-like estimators: generalizing ML estimators. If we have a density f_0 which is everywhere positive, the MLE solves
μ̂ = arg min_μ Σ_{i=1}^n ρ(x_i − μ)   (*)
with ρ = −log f_0. Let ψ = ρ′; if this exists, then
Σ_{i=1}^n ψ(x_i − μ̂) = 0   (**)
An M estimator can almost equivalently be described by ρ or ψ:
- If ρ is everywhere differentiable and ψ is monotonic, the forms (*) and (**) are equivalent [6].
- If ψ is continuous and increasing, the solution is unique [6].
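For a monotone ψ, equation (**) can be solved by simple bisection. This generic sketch (names are mine) recovers the sample mean for ψ(x) = x:

```python
def m_estimate(xs, psi, tol=1e-9):
    """Solve sum_i psi(x_i - mu) = 0 for mu by bisection.

    Assumes psi is nondecreasing, so g(mu) = sum_i psi(x_i - mu) is
    nonincreasing in mu and the root lies inside [min(xs), max(xs)].
    """
    lo, hi = min(xs), max(xs)

    def g(mu):
        return sum(psi(x - mu) for x in xs)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:   # the root is still to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# psi(x) = x corresponds to rho(x) = x^2 (up to scaling) and gives the mean
print(m_estimate([1.0, 2.0, 3.0, 6.0], lambda x: x))  # ~3.0
```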
Special cases
The sample mean: ρ(x) = x², ψ(x) = 2x. Then
Σ_{i=1}^n ψ(x_i − μ̂) = 0 ⟹ Σ_{i=1}^n (x_i − μ̂) = 0 ⟹ μ̂ = (1/n) Σ_{i=1}^n x_i
The sample median: ρ(x) = |x|, ψ(x) = sign(x), with sign(x) = 1 for x > 0, −1 for x < 0, and 0 for x = 0. Then
Σ_{i=1}^n sign(x_i − μ̂) = 0 ⟹ #{x_i > μ̂} = #{x_i < μ̂} ⟹ μ̂ = Med(x)
Special cases: ρ and ψ
[Figure: ρ and ψ for the two special cases. Sample mean: squared-error ρ(x) = x² and linear ψ(x). Sample median: absolute-error ρ(x) = |x| and step ψ(x) = sign(x).]
Asymptotic behavior of location M estimators
For a given distribution F, assume ρ is differentiable and ψ is increasing, and define μ_0 as the solution of
E_F ψ(x − μ_0) = 0   (***)
For large n, μ̂ →_p μ_0, and the distribution of the estimator is approximately N(μ_0, v/n) with
v = E_F ψ²(x − μ_0) / [E_F ψ′(x − μ_0)]²
If μ_0 is uniquely defined, then μ̂ is consistent at F [3].
Desirable properties
- M estimators are robust to large proportions of outliers: when ψ is odd, bounded and monotonically increasing, the BP is 0.5.
- The IF is proportional to ψ, so the ψ function may be chosen to bound the influence of outliers and achieve high efficiency.
- M estimators are asymptotically normal, and can also be consistent for μ.
- M estimators can be chosen to completely reject outliers (redescending M estimators) while maintaining a large BP and high efficiency.
Disadvantages
- They are in general only implicitly defined and must be found by iterative search.
- In general they are not scale equivariant.
Huber functions. A popular family of M estimators (Huber, 1964). This estimator has an odd, nondecreasing ψ function which minimizes the asymptotic variance among all estimators satisfying |IF(x)| ≤ c for a given bound c.
Advantages:
- Combines the sample mean for small errors with the sample median for gross errors.
- Boundedness of ψ.
Huber ρ and ψ functions
ρ_k(x) = x²  if |x| ≤ k;  2k|x| − k²  if |x| > k
with derivative ρ_k′(x) = 2ψ_k(x), where
ψ_k(x) = x  if |x| ≤ k;  k·sign(x)  if |x| > k
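A sketch of Huber's ψ and of the resulting location estimate via a simple fixed-point iteration (the iteration scheme and median starting point are my own choices; k = 1.37 is the tuning used later in the slides):

```python
def huber_psi(x, k=1.37):
    """Huber's psi: the identity for |x| <= k, clipped to +/- k beyond it."""
    return max(-k, min(k, x))

def huber_location(xs, k=1.37, iters=200):
    """Iterate mu <- mu + mean(psi(x_i - mu)), starting from the sample median;
    a fixed point satisfies sum_i psi(x_i - mu) = 0, i.e. equation (**)."""
    mu = sorted(xs)[len(xs) // 2]
    for _ in range(iters):
        mu += sum(huber_psi(x - mu, k) for x in xs) / len(xs)
    return mu

xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 100.0]
# The gross error at 100 drags the mean to ~16.7 but barely moves the M estimate.
print(huber_location(xs))
```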
Huber ρ and ψ functions
[Figure: Huber ρ(x) and ψ(x); from Maronna, Martin and Yohai, Robust Statistics: Theory and Methods, Wiley, 2006.]
Huber functions – robustness & efficiency tradeoff
Special cases: k → 0 gives the sample median; k → ∞ gives the sample mean.
[Figure: asymptotic variances at the normal mixture model with G = N(0, 1) and H = N(0, 10), with the mean and median as limiting cases: increasing v, decreasing robustness; from Maronna, Martin and Yohai, 2006.]
The larger the asymptotic variance, the less efficient the estimator, but the more robust: efficiency comes at the expense of robustness.
Redescending M estimators. Redescending M estimators have ψ functions which are non-decreasing near the origin but then decrease toward the axis far from the origin. They usually satisfy ψ(x) = 0 for all |x| ≥ r, where r is the minimum rejection point.
Besides their ability to reject outliers completely, they:
- Do not suffer from the masking effect.
- Have the potential for a high BP.
- Their ψ functions can be chosen to redescend smoothly to zero, so that information in moderately large outliers is not ignored completely → improved efficiency!
A popular family of redescending M estimators (Tukey) is called the bisquare or biweight.
Bisquare ρ and ψ functions
ρ(x) = 1 − [1 − (x/k)²]³  if |x| ≤ k;  1  if |x| > k
with derivative ρ′(x) = (6/k²)ψ(x), where
ψ(x) = x [1 − (x/k)²]² · I(|x| ≤ k)
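The bisquare ψ as a sketch, using k = 4.68 (the ~95%-efficiency tuning from the table below):

```python
def bisquare_psi(x, k=4.68):
    """Tukey's bisquare psi: x * (1 - (x/k)^2)^2 for |x| < k, zero outside,
    so sufficiently large outliers are rejected completely."""
    if abs(x) >= k:
        return 0.0
    return x * (1.0 - (x / k) ** 2) ** 2

# Redescending: the influence rises, peaks, then falls back to zero.
print(bisquare_psi(1.0), bisquare_psi(4.0), bisquare_psi(10.0))
```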
Bisquare ρ and ψ functions
[Figure: bisquare ρ(x) and ψ(x); from Maronna, Martin and Yohai, Robust Statistics: Theory and Methods, Wiley, 2006.]
Bisquare function – efficiency
The achieved ARE(bisquare; MLE) is close to 1, depending on the tuning constant k:

ARE  0.80  0.85  0.90  0.95
k    3.14  3.44  3.88  4.68
Choice of ψ and ρ
In practice, the choice of the ρ and ψ functions is not critical to obtaining a good robust estimate (Huber, 1981).
Redescending and bounded ψ functions are to be preferred, as are bounded ρ functions.
The bisquare function is a popular choice.
5. Order statistics approaches
The β-trimmed mean. Let β ∈ (0, 0.5) and m = [nβ]. The β-trimmed mean is defined by
x̄_β = 1/(n − 2m) · Σ_{i=m+1}^{n−m} x_(i)
where [.] stands for the integer part and x_(i) denotes the i'th order statistic.
x̄_β is the sample mean after the m largest and the m smallest observations have been discarded.
The β-trimmed mean – cont.
Limit cases: β = 0 → the sample mean; β → 0.5 → the sample median.
Distribution of the trimmed mean: the exact distribution is intractable. For large n, the distribution under the location model is approximately normal.
BP of the β-trimmed mean = β.
Example 1 – trimmed mean

Estimator          All data  Delete outlier
Mean               4.28      3.21
Median             3.38      3.37
Trimmed mean 10%   3.20      3.11
Trimmed mean 25%   3.17      3.17

The median and the trimmed mean are less sensitive to the presence of outliers.
The Winsorized mean. The Winsorized mean is defined by
x̄_w = (1/n) [ (m + 1)·x_(m+1) + Σ_{i=m+2}^{n−m−1} x_(i) + (m + 1)·x_(n−m) ]
The m smallest observations are replaced by the (m+1)'st smallest observation, and the m largest observations are replaced by the (m+1)'st largest observation.
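Both order-statistics estimators in code (a sketch; the helper names are mine):

```python
def trimmed_mean(xs, beta):
    """beta-trimmed mean: the average after discarding the m = [n*beta]
    smallest and the m largest order statistics."""
    s = sorted(xs)
    m = int(len(s) * beta)
    kept = s[m:len(s) - m]
    return sum(kept) / len(kept)

def winsorized_mean(xs, m):
    """Winsorized mean: the m smallest observations are replaced by x_(m+1)
    and the m largest by x_(n-m); then the ordinary mean is taken."""
    s = sorted(xs)
    w = [s[m]] * m + s[m:len(s) - m] + [s[len(s) - m - 1]] * m
    return sum(w) / len(w)

xs = [1.0, 2.0, 3.0, 100.0]
print(trimmed_mean(xs, 0.25), winsorized_mean(xs, 1))  # neither is dragged toward 100
```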
Trimmed and Winsorized mean properties
- They use more information from the sample than the sample median.
- Unless the underlying distribution is symmetric, they are unlikely to produce an unbiased estimator for either the mean or the median.
- They do not have a normal distribution.
L estimators. The trimmed and Winsorized means are special cases of L estimators. An L estimator is defined as a linear combination of order statistics:
θ̂ = Σ_{i=1}^n a_i x_(i)
where the a_i's are given constants. For the β-trimmed mean:
a_i = 1/(n − 2m) · I(m + 1 ≤ i ≤ n − m)
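A generic L-estimator sketch; the trimmed-mean and median weight vectors below are illustrative examples:

```python
def l_estimate(xs, weights):
    """L estimator: a linear combination sum_i a_i * x_(i) of order statistics."""
    s = sorted(xs)
    return sum(a * x for a, x in zip(weights, s))

xs = [3.0, 1.0, 2.0, 100.0]
n, m = len(xs), 1
# Trimmed-mean weights: a_i = 1/(n - 2m) for m+1 <= i <= n-m, else 0.
trim_w = [1.0 / (n - 2 * m) if m + 1 <= i + 1 <= n - m else 0.0 for i in range(n)]
print(l_estimate(xs, trim_w))  # the 25%-trimmed mean of xs
```

Putting all the weight on the middle order statistic recovers the median of an odd-length sample.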
L vs. M estimators
M estimators are more flexible, can be generalized straightforwardly to multi-parameter problems, and have a high BP.
L estimators are less efficient because they completely ignore part of the data.
6. Summary & conclusions
SC of location M estimators
[Figure: sensitivity curves of location M estimators for n = 20, x_i ~ N(0, 1); trimmed mean α = 25%, Huber k = 1.37, bisquare k = 4.68.]
The effect of increasing contamination on a sample: replace m points by a fixed value x_0 = 1000 and plot the biased SC
SC(m) = θ̂(x_0, …, x_0, x_{m+1}, …, x_n) − θ̂(x_1, …, x_n)
[Figure: biased SC as a function of the number of contaminated points m, for n = 20, x_i ~ N(0, 1); trimmed mean α = 8.5%, Huber k = 1.37, bisquare k = 4.68.]
IF of location M estimator
The IF is proportional to ψ (Huber, 1981). In general,
IF(x_0; μ̂, F) = ψ(x_0 − μ_0) / E_F ψ′(x − μ_0)
BP of location M estimators. In general, when ψ is odd, bounded and monotonically increasing, the BP is 50%.
Assume k_1 = −ψ(−∞) and k_2 = ψ(+∞) are finite; then the BP is min(k_1, k_2) / (k_1 + k_2).
Special cases: sample mean = 0%, sample median = 50%.
Comparison between different location estimators

Estimator        BP    SC/IF/ψ                    Redescending ψ  Efficiency in mixture model
Mean             0%    unbounded                  No              low
Median           50%   bounded                    No              low
Huber            50%   bounded                    No              high
Bisquare         50%   bounded, redescends to 0   Yes             high
β-trimmed mean   β     bounded                    No
Conclusions
- Robust statistics provides an alternative approach to classical statistical methods.
- Robust statistics seeks to provide methods that emulate classical methods, but which are not unduly affected by outliers or other small departures from model assumptions.
- In order to quantify the robustness of a method, it is necessary to define some measures of robustness.
Efficiency vs. robustness. Efficiency can be achieved by taking ψ proportional to the derivative of the log-likelihood defined by the density f of F: ψ(x) = −c(f′/f)(x), where c ≠ 0 is a constant.
Robustness is achieved by choosing a ψ that is smooth and bounded, to reduce the influence of a small proportion of observations.
References
1. Robert G. Staudte and Simon J. Sheather, Robust Estimation and Testing, Wiley, 1990.
2. Elvezio Ronchetti, "The Historical Development of Robust Statistics", ICOTS-7, University of Geneva, Switzerland, 2006.
3. P. Huber, Robust Statistics, New York: Wiley, 1981.
4. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions, New York: Wiley, 1986.
5. Tukey.
6. Ricardo A. Maronna, R. Douglas Martin and Víctor J. Yohai, Robust Statistics: Theory and Methods, John Wiley & Sons, 2006.
7. B. D. Ripley, M.Sc. in Applied Statistics MT2004, Robust Statistics, 1992–2004.
8. Robust statistics, Wikipedia.