Robust Statistics
Osnat Goren-Peyser, 25.3.08
Outline
1. Introduction
2. Motivation
3. Measuring robustness
4. M estimators
5. Order statistics approaches
6. Summary and conclusions
1. Introduction
Problem definition: Let x_1, x_2, …, x_n be a set of iid variables with distribution F, sorted in increasing order so that x_1 ≤ x_2 ≤ … ≤ x_n, where x_m is the value of the m'th order statistic.
An estimator θ̂ = θ̂(x_1, x_2, …, x_n) is a function of the observations.
We are looking for estimators θ̂ that are close to the unknown parameter θ.
Asymptotic value of an estimator. Define θ̂∞(F) such that θ̂_n →_p θ̂∞(F), where θ̂∞(F) is the asymptotic value of the estimator at F.
An estimator θ̂ is consistent for θ if θ̂∞(F) = θ.
We say that θ̂ is asymptotically normal with parameters θ, V(θ) if √n (θ̂_n − θ) →_d N(0, V(θ)).
Efficiency. An unbiased estimator θ̂ is efficient if I(θ)·var(θ̂) = 1, where I(θ) is the Fisher information.
An unbiased estimator is asymptotically efficient if lim_{n→∞} n·I(θ)·var(θ̂_n) = 1.
Relative efficiency. For a fixed underlying distribution, assume two unbiased estimators θ̂_1 and θ̂_2 of θ. We say that θ̂_1 is more efficient than θ̂_2 if var(θ̂_1) < var(θ̂_2).
The relative efficiency (RE) of estimator θ̂_2 with respect to θ̂_1 is defined as the ratio of their variances:
RE(θ̂_2; θ̂_1) = var(θ̂_1) / var(θ̂_2)
Asymptotic Relative efficiency
The asymptotic relative efficiency (ARE) is the limit of the RE as the sample size n → ∞.
For two estimators θ̂_1, θ̂_2 which are each consistent for θ and also asymptotically normal [1]:
ARE(θ̂_2; θ̂_1) = lim_{n→∞} var(θ̂_1) / var(θ̂_2)
The location model: x_i = μ + u_i, i = 1, 2, …, n,
where μ is the unknown location parameter, u_i are the errors, and x_i are the observations.
The errors u_i are i.i.d. random variables, each with the same distribution function F_0.
The observations x_i are i.i.d. random variables with common distribution function F(x) = F_0(x − μ).
Normality. Classical statistical methods rely on the assumption that F is exactly known. The assumption that F = N(μ, σ²) is commonly used.
But normality often holds only approximately → robust methods.
Approximate normality: the majority of observations are normally distributed, while some observations follow a different pattern (not normal) or no pattern at all. Suggested model: a mixture model.
A mixture model. Formalizing the idea of F being approximately normal:
Assume that a proportion 1 − ε of the observations is generated by the normal model G, while a proportion ε is generated by an unknown model H.
The mixture model: F = (1 − ε)G + εH.
F is a contamination "neighborhood" of G, also called the gross error model.
F is called a normal mixture model when both G and H are normal.
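As a sketch of the gross error model, a contaminated sample can be generated as below. The choices ε = 0.1 and H = N(μ, 10²) are illustrative assumptions, not values from the slides:

```python
import random

def sample_mixture(n, eps=0.1, mu=0.0, sigma=1.0, sigma_h=10.0, seed=0):
    """Draw n observations from F = (1 - eps)*G + eps*H, with
    G = N(mu, sigma^2) and H = N(mu, sigma_h^2) (a normal mixture model)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # with probability eps the observation comes from the contaminating model H
        scale = sigma_h if rng.random() < eps else sigma
        draws.append(rng.gauss(mu, scale))
    return draws

data = sample_mixture(1000)
```

Roughly a tenth of the draws come from the wide component H and show up as gross errors far from the bulk of the sample.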
2. Motivation
Outliers. An outlier is an atypical observation that is well separated from the bulk of the data.
Statistics derived from data sets that include outliers will often be misleading.
Even a single outlier can have a large distorting influence on classical statistical methods.
[Figure: two histograms of observation values; one sample without outliers, and one with a cluster of outliers well separated from the bulk of the data.]
Estimators not sensitive to outliers are said to be robust.
Mean and standard deviation
The sample mean, a classical estimate of the location (center) of the data, is defined by
x̄ = (1/n) Σ_{i=1}^n x_i
For F = N(μ, σ²), the sample mean is unbiased, with x̄ ~ N(μ, σ²/n).
The sample standard deviation (SD), a classical estimate of the dispersion of the data, is defined by
s = √( (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)² )
How much influence can a single outlier have on these classical estimators?
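A minimal numeric illustration of the outlier influence on both statistics (the small data set is made up for the sketch):

```python
from statistics import mean, stdev

clean = [2.2, 2.5, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8]

# Push a single added observation further and further out:
# both the sample mean and the sample SD follow it without bound.
for outlier in (10.0, 100.0, 1000.0):
    xs = clean + [outlier]
    print(f"outlier={outlier:7.1f}  mean={mean(xs):9.2f}  sd={stdev(xs):9.2f}")
```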
Example 1 – the flour example. Consider the following 24 determinations of the copper content in wholemeal flour (in parts per million), sorted in ascending order [6]:
2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95
The value 28.95 is considered an outlier. Two cases:
Case A – taking into account the whole data; Case B – deleting the suspicious outlier.
[Figure: estimated probability of the observation values with the sample mean marked. Left panel: Case A – using the whole data for estimation. Right panel: Case B – deleting the outlier.]
Example 1 – PDFs
Case A (with the outlier): x̄ = 4.28, s = 5.30. Case B (outlier deleted): x̄ = 3.21, s = 0.69.
Example 1 – arising question. Question: how much influence can a single outlier have on the sample mean and sample SD? Assume the outlier value 28.95 is replaced by an arbitrary value varying from −∞ to +∞:
The value of the sample mean changes from −∞ to +∞.
The value of the sample SD grows without bound.
Conclusion: a single outlier has an unbounded influence on these two classical estimators!
This is related to the sensitivity curve and the influence function, as we will see later.
Approaches for handling outliers
Detect and remove outliers from the data set:
- Manual screening
- The normal Q-Q plot
- The "three-sigma edit" rule
Robust estimators!
Manual screening. Why is screening the data and removing outliers not sufficient?
- Users do not always screen the data.
- Outliers are not always errors! Outliers may be correct, and very important for seeing the whole picture, including extreme cases.
- It can be very difficult to spot outliers in multivariate or highly structured data.
- It is a subjective decision without any unified criterion: different users, different results.
- It is difficult to determine the statistical behavior of the complete procedure.
The normal Q-Q plot. A manual screening tool for an underlying normal distribution: a quantile-quantile plot of the sample quantiles of X versus theoretical quantiles from a normal distribution.
If the distribution of X is normal, the plot will be close to linear.
The "three-sigma edit" rule. An outlier detection tool for an underlying normal distribution. Define the ratio between the distance of x_i from the sample mean and the sample SD:
t_i = (x_i − x̄) / s
The "three-sigma edit rule": observations with |t_i| > 3 are deemed suspicious.
Example 1: the largest observation in the flour data has t_i = 4.65, and so is suspicious.
Disadvantages:
- In very small samples the rule is ineffective.
- Masking: when there are several outliers, their effects may interact in such a way that some or all of them remain unnoticed.
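The rule translates directly into code; checked here on the flour data from Example 1 (the helper name is my own):

```python
from statistics import mean, stdev

def three_sigma_suspects(xs, cutoff=3.0):
    """Return the observations whose |t_i| = |x_i - mean| / SD exceeds the cutoff."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) / s > cutoff]

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37,
         3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

print(three_sigma_suspects(flour))  # only 28.95 is flagged
```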
Example 2 – velocity of light. Consider the following 20 determinations of the time (in microseconds) needed for light to travel a distance of 7442 m [6]:
28, 26, 33, 24, 34, −44, 27, 16, 40, −2, 29, 22, 24, 21, 25, 30, 23, 29, 31, 19
The actual times are the table values × 0.001 + 24.8.
The values −2 and −44 are suspicious as outliers.
Example 2 – QQ plot
[Figure: QQ plot of the sample data versus standard normal quantiles. The observations −2 and −44 fall far off the line and are marked as outliers.]
Example 2 – masking. Results, based on the three-sigma edit rule:
For x_i = −2: t_i = −1.35; for x_i = −44: t_i = −3.73.
The value of |t_i| for the observation −2 does not indicate that it is an outlier: the value −44 "masks" the value −2.
Detect and remove outliers. There are many other methods for detecting outliers. Deleting an outlier poses a number of problems:
- It affects the distribution theory, underestimating the data variability.
- It depends on the user's subjective decisions.
- It is difficult to determine the statistical behavior of the complete procedure.
Robust estimators provide automatic ways of detecting and removing outliers.
Example 1 – Comparing the sample median to the sample mean
[Figure: Case C – testing the median estimator: estimated probability of the flour data, with the Case A and Case B sample means and sample medians marked.]
- Case A: med = 3.3850. Case B: med = 3.3700.
- The sample median fits the bulk of the data in both cases.
- The value of the sample median does not change from −∞ to +∞, as was the case for the sample mean.
- The sample median is a good robust alternative to the sample mean.
Robust alternative to the mean. The sample median is a very old method for estimating the "middle" of the data. The sample median is defined for some integer m by
Med(x) = x_(m) if n is odd (n = 2m − 1), and (x_(m) + x_(m+1))/2 if n is even (n = 2m)
For large n and F = N(μ, σ²), the sample median is approximately N(μ, σ²π/(2n)).
At the normal distribution: ARE(median; mean) = 2/π ≈ 64%.
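The definition above in code, checked on the flour data from Example 1:

```python
def sample_median(xs):
    """Median per the slide's definition: the middle order statistic when
    n = 2m - 1 is odd, the average of x_(m) and x_(m+1) when n = 2m is even."""
    s = sorted(xs)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 == 1 else 0.5 * (s[m - 1] + s[m])

flour = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37,
         3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95]

print(sample_median(flour))       # Case A: med ~ 3.385
print(sample_median(flour[:-1]))  # Case B (outlier deleted): med ~ 3.37
```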
Effect of a single outlier
- The sample mean can be upset completely by a single outlier.
- The sample median is little affected by a single outlier.
The median is resistant to gross errors, whereas the mean is not. The median will tolerate up to 50% gross errors before it can be made arbitrarily large.
Breakdown point: median 50%, mean 0%.
Mean & median – robustness vs. efficiency
For the mixture model F = (1 − ε)N(μ, 1) + εN(μ, σ²):
- The sample mean variance is (1 − ε + εσ²)/n.
- The sample median variance is approximately π / (2n [1 − ε + ε/σ]²).
The gain in robustness due to using the median is paid for by a loss in efficiency when F is very close to normal.
So why not always use the sample median? If the data do not contain outliers, the sample median has statistical performance poorer than that of the classical sample mean.
Robust estimation goal: "the best of both worlds".
We shall develop estimators which combine the low variance of the mean at the normal with the robustness of the median under contamination.
3. Measuring robustness
Analysis tools
- Sensitivity curve (SC)
- Influence function (IF)
- Breakdown point (BP)
Sensitivity curve. The SC measures the effect of different locations of an outlier on the sample. The sensitivity curve of an estimator θ̂ for the sample x_1, x_2, …, x_n is:
SC(x_0) = θ̂(x_1, …, x_n, x_0) − θ̂(x_1, …, x_n)
where x_0 is the location of a single outlier. Bounded SC(x_0) → high robustness!
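The SC can be computed empirically; a sketch (the grid sample is an illustrative assumption):

```python
from statistics import mean, median

def sensitivity_curve(estimator, xs, x0):
    """SC(x0) = estimator(x_1, ..., x_n, x0) - estimator(x_1, ..., x_n)."""
    return estimator(list(xs) + [x0]) - estimator(xs)

xs = [i / 10 for i in range(-50, 51)]  # a symmetric sample around 0

# The mean's SC grows linearly with the outlier location x0,
# while the median's SC stays bounded.
for x0 in (10.0, 100.0, 1000.0):
    print(x0, sensitivity_curve(mean, xs, x0), sensitivity_curve(median, xs, x0))
```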
SC of mean & median
[Figure: sensitivity curves for n = 200, F = N(0, 1), SC scale ×10⁻³. Left panel: sample mean, unbounded SC growing with the outlier location. Right panel: sample median, bounded SC.]
Standardized sensitivity curve
The standardized sensitivity curve is defined by
SC_n(x_0) = [θ̂(x_1, …, x_n, x_0) − θ̂(x_1, …, x_n)] / (1/(n + 1))
What happens if we add one more observation to a very large sample?
Influence function. The influence function of an estimator θ̂ (Hampel, 1974) is an asymptotic version of its sensitivity curve. It is an approximation to the behavior of θ̂∞ when the sample contains a small fraction ε of identical outliers. It is defined as
IF(x_0; θ̂, F) = lim_{ε→0⁺} [θ̂∞((1 − ε)F + εδ_{x_0}) − θ̂∞(F)] / ε
where δ_{x_0} is the point-mass at x_0 and ε→0⁺ stands for the limit from the right.
IF main uses. θ̂∞((1 − ε)F + εδ_{x_0}) is the asymptotic value of the estimate when the underlying distribution is F and a fraction ε of outliers is equal to x_0. The IF has two main uses:
- Assessing the relative influence of individual observations on the value of the estimate. Unbounded IF → less robustness.
- Allowing a simple heuristic assessment of the asymptotic variance of an estimate.
IF as a limit version of the SC. The SC is a finite-sample version of the IF.
If ε is small:
θ̂∞((1 − ε)F + εδ_{x_0}) ≈ θ̂∞(F) + ε·IF(x_0; θ̂, F)
bias = θ̂∞((1 − ε)F + εδ_{x_0}) − θ̂∞(F) ≈ ε·IF(x_0; θ̂, F)
For large n, with ε = 1/(n + 1):
SC_n(x_0) ≈ IF(x_0; θ̂, F)
Breakdown point. The BP is the proportion of arbitrarily large observations an estimator can handle before giving an arbitrarily large result.
Maximum possible BP = 50%. High BP → more robustness!
As seen before: mean 0%, median 50%.
Summary
- The SC measures the effect of different outliers on estimation.
- The IF is the asymptotic behavior of the SC.
- The IF and the BP consider extreme situations in the study of contamination: the IF deals with "infinitesimal" values of ε, while the BP deals with the largest ε an estimate can tolerate.
4. M estimators
Maximum likelihood of μ. Consider the location model, and assume that F_0 has density f_0.
The likelihood function is:
L(x_1, …, x_n; μ) = Π_{i=1}^n f_0(x_i − μ)
The maximum likelihood estimate (MLE) of μ is:
μ̂ = arg max_μ L(x_1, …, x_n; μ)
M estimators of location (μ). MLE-like estimators: generalizing ML estimators. If we have a density f_0 which is everywhere positive, the MLE solves
μ̂ = arg min_μ Σ_{i=1}^n ρ(x_i − μ)   (*)
with ρ = −log f_0. Let ψ = ρ′; if this exists, then
Σ_{i=1}^n ψ(x_i − μ̂) = 0   (**)
An M estimator can almost equivalently be described by ρ or ψ:
- If ρ is everywhere differentiable and ψ is monotonic, the forms (*) and (**) are equivalent [6].
- If ψ is continuous and increasing, the solution is unique [6].
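For a monotone ψ, equation (**) can be solved by simple bisection. This generic sketch (names are mine) recovers the sample mean for ψ(x) = x:

```python
def m_estimate(xs, psi, tol=1e-9):
    """Solve sum_i psi(x_i - mu) = 0 for mu by bisection.

    Assumes psi is nondecreasing, so g(mu) = sum_i psi(x_i - mu) is
    nonincreasing in mu and the root lies inside [min(xs), max(xs)].
    """
    lo, hi = min(xs), max(xs)

    def g(mu):
        return sum(psi(x - mu) for x in xs)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:   # the root is still to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# psi(x) = x corresponds to rho(x) = x^2 (up to scaling) and gives the mean
print(m_estimate([1.0, 2.0, 3.0, 6.0], lambda x: x))  # ~3.0
```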
Special cases
The sample mean: ρ(x) = x², ψ(x) = 2x. Then
Σ_{i=1}^n ψ(x_i − μ̂) = 0 ⟹ Σ_{i=1}^n (x_i − μ̂) = 0 ⟹ μ̂ = (1/n) Σ_{i=1}^n x_i
The sample median: ρ(x) = |x|, ψ(x) = sign(x), with sign(x) = 1 for x > 0, −1 for x < 0, and 0 for x = 0. Then
Σ_{i=1}^n sign(x_i − μ̂) = 0 ⟹ #{x_i > μ̂} = #{x_i < μ̂} ⟹ μ̂ = Med(x)
Special cases: ρ and ψ
[Figure: ρ and ψ for the two special cases. Sample mean: squared-error ρ(x) = x² and linear ψ(x). Sample median: absolute-error ρ(x) = |x| and step ψ(x) = sign(x).]
Asymptotic behavior of location M estimators
For a given distribution F, assume ρ is differentiable and ψ is increasing, and define μ_0 as the solution of
E_F ψ(x − μ_0) = 0   (***)
For large n, μ̂ →_p μ_0, and the distribution of the estimator is approximately N(μ_0, v/n) with
v = E_F ψ²(x − μ_0) / [E_F ψ′(x − μ_0)]²
If μ_0 is uniquely defined, then μ̂ is consistent at F [3].
Desirable properties
- M estimators are robust to large proportions of outliers: when ψ is odd, bounded and monotonically increasing, the BP is 0.5.
- The IF is proportional to ψ, so the ψ function may be chosen to bound the influence of outliers and achieve high efficiency.
- M estimators are asymptotically normal, and can also be consistent for μ.
- M estimators can be chosen to completely reject outliers (redescending M estimators) while maintaining a large BP and high efficiency.
Disadvantages
- They are in general only implicitly defined and must be found by iterative search.
- In general they are not scale equivariant.
Huber functions. A popular family of M estimators (Huber, 1964). This estimator has an odd, nondecreasing ψ function which minimizes the asymptotic variance among all estimators satisfying |IF(x)| ≤ c for a given bound c.
Advantages:
- Combines the sample mean for small errors with the sample median for gross errors.
- Boundedness of ψ.
Huber ρ and ψ functions
ρ_k(x) = x²  if |x| ≤ k;  2k|x| − k²  if |x| > k
with derivative ρ_k′(x) = 2ψ_k(x), where
ψ_k(x) = x  if |x| ≤ k;  k·sign(x)  if |x| > k
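A sketch of Huber's ψ and of the resulting location estimate via a simple fixed-point iteration (the iteration scheme and median starting point are my own choices; k = 1.37 is the tuning used later in the slides):

```python
def huber_psi(x, k=1.37):
    """Huber's psi: the identity for |x| <= k, clipped to +/- k beyond it."""
    return max(-k, min(k, x))

def huber_location(xs, k=1.37, iters=200):
    """Iterate mu <- mu + mean(psi(x_i - mu)), starting from the sample median;
    a fixed point satisfies sum_i psi(x_i - mu) = 0, i.e. equation (**)."""
    mu = sorted(xs)[len(xs) // 2]
    for _ in range(iters):
        mu += sum(huber_psi(x - mu, k) for x in xs) / len(xs)
    return mu

xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 100.0]
# The gross error at 100 drags the mean to ~16.7 but barely moves the M estimate.
print(huber_location(xs))
```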
Huber ρ and ψ functions
[Figure: Huber ρ(x) and ψ(x); from Maronna, Martin and Yohai, Robust Statistics: Theory and Methods, Wiley, 2006.]
Huber functions – robustness & efficiency tradeoff
Special cases: k → 0 gives the sample median; k → ∞ gives the sample mean.
[Figure: asymptotic variances at the normal mixture model with G = N(0, 1) and H = N(0, 10), with the mean and median as limiting cases: increasing v, decreasing robustness; from Maronna, Martin and Yohai, 2006.]
The larger the asymptotic variance, the less efficient the estimator, but the more robust: efficiency comes at the expense of robustness.
Redescending M estimators. Redescending M estimators have ψ functions which are non-decreasing near the origin but then decrease toward the axis far from the origin. They usually satisfy ψ(x) = 0 for all |x| ≥ r, where r is the minimum rejection point.
Besides their ability to reject outliers completely, they:
- Do not suffer from the masking effect.
- Have the potential for a high BP.
- Their ψ functions can be chosen to redescend smoothly to zero, so that information in moderately large outliers is not ignored completely → improved efficiency!
A popular family of redescending M estimators (Tukey) is called the bisquare or biweight.
Bisquare ρ and ψ functions
ρ(x) = 1 − [1 − (x/k)²]³  if |x| ≤ k;  1  if |x| > k
with derivative ρ′(x) = (6/k²)ψ(x), where
ψ(x) = x [1 − (x/k)²]² · I(|x| ≤ k)
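The bisquare ψ as a sketch, using k = 4.68 (the ~95%-efficiency tuning from the table below):

```python
def bisquare_psi(x, k=4.68):
    """Tukey's bisquare psi: x * (1 - (x/k)^2)^2 for |x| < k, zero outside,
    so sufficiently large outliers are rejected completely."""
    if abs(x) >= k:
        return 0.0
    return x * (1.0 - (x / k) ** 2) ** 2

# Redescending: the influence rises, peaks, then falls back to zero.
print(bisquare_psi(1.0), bisquare_psi(4.0), bisquare_psi(10.0))
```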
Bisquare ρ and ψ functions
[Figure: bisquare ρ(x) and ψ(x); from Maronna, Martin and Yohai, Robust Statistics: Theory and Methods, Wiley, 2006.]
Bisquare function – efficiency
The achieved ARE(bisquare; MLE) is close to 1, depending on the tuning constant k:

ARE  0.80  0.85  0.90  0.95
k    3.14  3.44  3.88  4.68
Choice of ψ and ρ
In practice, the choice of the ρ and ψ functions is not critical to obtaining a good robust estimate (Huber, 1981).
Redescending and bounded ψ functions are to be preferred, as are bounded ρ functions.
The bisquare function is a popular choice.
5. Order statistics approaches
The β-trimmed mean. Let β ∈ (0, 0.5) and m = [nβ]. The β-trimmed mean is defined by
x̄_β = 1/(n − 2m) · Σ_{i=m+1}^{n−m} x_(i)
where [.] stands for the integer part and x_(i) denotes the i'th order statistic.
x̄_β is the sample mean after the m largest and the m smallest observations have been discarded.
The β-trimmed mean – cont.
Limit cases: β = 0 → the sample mean; β → 0.5 → the sample median.
Distribution of the trimmed mean: the exact distribution is intractable. For large n, the distribution under the location model is approximately normal.
BP of the β-trimmed mean = β.
Example 1 – trimmed mean

Estimator          All data  Delete outlier
Mean               4.28      3.21
Median             3.38      3.37
Trimmed mean 10%   3.20      3.11
Trimmed mean 25%   3.17      3.17

The median and the trimmed mean are less sensitive to the presence of outliers.
The Winsorized mean. The Winsorized mean is defined by
x̄_w = (1/n) [ (m + 1)·x_(m+1) + Σ_{i=m+2}^{n−m−1} x_(i) + (m + 1)·x_(n−m) ]
The m smallest observations are replaced by the (m+1)'st smallest observation, and the m largest observations are replaced by the (m+1)'st largest observation.
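Both order-statistics estimators in code (a sketch; the helper names are mine):

```python
def trimmed_mean(xs, beta):
    """beta-trimmed mean: the average after discarding the m = [n*beta]
    smallest and the m largest order statistics."""
    s = sorted(xs)
    m = int(len(s) * beta)
    kept = s[m:len(s) - m]
    return sum(kept) / len(kept)

def winsorized_mean(xs, m):
    """Winsorized mean: the m smallest observations are replaced by x_(m+1)
    and the m largest by x_(n-m); then the ordinary mean is taken."""
    s = sorted(xs)
    w = [s[m]] * m + s[m:len(s) - m] + [s[len(s) - m - 1]] * m
    return sum(w) / len(w)

xs = [1.0, 2.0, 3.0, 100.0]
print(trimmed_mean(xs, 0.25), winsorized_mean(xs, 1))  # neither is dragged toward 100
```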
Trimmed and Winsorized mean properties
- They use more information from the sample than the sample median.
- Unless the underlying distribution is symmetric, they are unlikely to produce an unbiased estimator for either the mean or the median.
- They do not have a normal distribution.
L estimators. The trimmed and Winsorized means are special cases of L estimators. An L estimator is defined as a linear combination of order statistics:
θ̂ = Σ_{i=1}^n a_i x_(i)
where the a_i's are given constants. For the β-trimmed mean:
a_i = 1/(n − 2m) · I(m + 1 ≤ i ≤ n − m)
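A generic L-estimator sketch; the trimmed-mean and median weight vectors below are illustrative examples:

```python
def l_estimate(xs, weights):
    """L estimator: a linear combination sum_i a_i * x_(i) of order statistics."""
    s = sorted(xs)
    return sum(a * x for a, x in zip(weights, s))

xs = [3.0, 1.0, 2.0, 100.0]
n, m = len(xs), 1
# Trimmed-mean weights: a_i = 1/(n - 2m) for m+1 <= i <= n-m, else 0.
trim_w = [1.0 / (n - 2 * m) if m + 1 <= i + 1 <= n - m else 0.0 for i in range(n)]
print(l_estimate(xs, trim_w))  # the 25%-trimmed mean of xs
```

Putting all the weight on the middle order statistic recovers the median of an odd-length sample.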
L vs. M estimators
M estimators are more flexible, can be generalized straightforwardly to multi-parameter problems, and have a high BP.
L estimators are less efficient because they completely ignore part of the data.
6. Summary & conclusions
SC of location M estimators
[Figure: sensitivity curves of location M estimators for n = 20, x_i ~ N(0, 1); trimmed mean α = 25%, Huber k = 1.37, bisquare k = 4.68.]
The effect of increasing contamination on a sample: replace m points by a fixed value x_0 = 1000 and plot the biased SC
SC(m) = θ̂(x_0, …, x_0, x_{m+1}, …, x_n) − θ̂(x_1, …, x_n)
[Figure: biased SC as a function of the number of contaminated points m, for n = 20, x_i ~ N(0, 1); trimmed mean α = 8.5%, Huber k = 1.37, bisquare k = 4.68.]
IF of location M estimator
The IF is proportional to ψ (Huber, 1981). In general,
IF(x_0; μ̂, F) = ψ(x_0 − μ_0) / E_F ψ′(x − μ_0)
BP of location M estimators. In general, when ψ is odd, bounded and monotonically increasing, the BP is 50%.
Assume k_1 = −ψ(−∞) and k_2 = ψ(+∞) are finite; then the BP is min(k_1, k_2) / (k_1 + k_2).
Special cases: sample mean = 0%, sample median = 50%.
Comparison between different location estimators

Estimator        BP    SC/IF/ψ                    Redescending ψ  Efficiency in mixture model
Mean             0%    unbounded                  No              low
Median           50%   bounded                    No              low
Huber            50%   bounded                    No              high
Bisquare         50%   bounded, redescends to 0   Yes             high
β-trimmed mean   β     bounded                    No
Conclusions
- Robust statistics provides an alternative approach to classical statistical methods.
- Robust statistics seeks to provide methods that emulate classical methods, but which are not unduly affected by outliers or other small departures from model assumptions.
- In order to quantify the robustness of a method, it is necessary to define some measures of robustness.
Efficiency vs. robustness. Efficiency can be achieved by taking ψ proportional to the derivative of the log-likelihood defined by the density f of F: ψ(x) = −c(f′/f)(x), where c ≠ 0 is a constant.
Robustness is achieved by choosing a ψ that is smooth and bounded, to reduce the influence of a small proportion of observations.
References
1. Robert G. Staudte and Simon J. Sheather, Robust Estimation and Testing, Wiley, 1990.
2. Elvezio Ronchetti, "The Historical Development of Robust Statistics", ICOTS-7, University of Geneva, Switzerland, 2006.
3. P. Huber, Robust Statistics, New York: Wiley, 1981.
4. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions, New York: Wiley, 1986.
5. Tukey.
6. Ricardo A. Maronna, R. Douglas Martin and Víctor J. Yohai, Robust Statistics: Theory and Methods, John Wiley & Sons, 2006.
7. B. D. Ripley, M.Sc. in Applied Statistics MT2004, Robust Statistics, 1992–2004.
8. Robust statistics, Wikipedia.