7/29/2019 ShortCourse on Robust Statistics
1/92
A SHORT COURSE ON
ROBUST STATISTICS
David E. Tyler
Rutgers
The State University of New Jersey
Web-Site
www.rci.rutgers.edu/~dtyler/ShortCourse.pdf
References
Huber, P.J. (1981). Robust Statistics. Wiley, New York.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
PART 1
CONCEPTS AND BASIC METHODS
MOTIVATION

Data Set: $X_1, X_2, \ldots, X_n$
Parametric Model: $F(x_1, \ldots, x_n \mid \theta)$
$\theta$: unknown parameter
$F$: known function

e.g. $X_1, X_2, \ldots, X_n$ i.i.d. $\mathrm{Normal}(\mu, \sigma^2)$

Q: Is it realistic to believe we don't know $(\mu, \sigma^2)$, but we know e.g. the shape of the tails of the distribution?
A: The model is assumed to be approximately true, e.g. symmetric and unimodal (past experience).
Q: Are statistical methods which are good under the model reasonably good if the model is only approximately true?

ROBUST STATISTICS formally addresses this issue.
CLASSIC EXAMPLE: MEAN vs. MEDIAN

Symmetric distributions: $\theta$ = population mean = population median.

Sample mean: $\bar{X} \approx \mathrm{Normal}(\theta, \sigma^2/n)$
Sample median: $\mathrm{Median} \approx \mathrm{Normal}\!\left(\theta, \frac{1}{n}\,\frac{1}{4 f(\theta)^2}\right)$
At the normal: $\mathrm{Median} \approx \mathrm{Normal}\!\left(\theta, \frac{\pi \sigma^2}{2n}\right)$

Asymptotic Relative Efficiency of Median to Mean:
$\mathrm{ARE}(\mathrm{Median}, \bar{X}) = \frac{\mathrm{avar}(\bar{X})}{\mathrm{avar}(\mathrm{Median})} = \frac{2}{\pi} = 0.6366$
CAUCHY DISTRIBUTION

$X \sim \mathrm{Cauchy}(\theta, \sigma^2)$ with density
$f(x; \theta, \sigma) = \frac{1}{\pi\sigma}\left\{1 + \left(\frac{x - \theta}{\sigma}\right)^2\right\}^{-1}$

Mean: $\bar{X} \sim \mathrm{Cauchy}(\theta, \sigma^2)$, i.e. the sample mean is no better than a single observation.
Median: $\mathrm{Median} \approx \mathrm{Normal}\!\left(\theta, \frac{\pi^2 \sigma^2}{4n}\right)$

$\mathrm{ARE}(\mathrm{Median}, \bar{X}) = \infty$, or $\mathrm{ARE}(\bar{X}, \mathrm{Median}) = 0$.

For $t$ on $\nu$ degrees of freedom:
$\mathrm{ARE}(\mathrm{Median}, \bar{X}) = \frac{4}{(\nu - 2)\,\pi}\left\{\frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\right\}^2$

$\nu$                          2        3        4        5
ARE(Median, $\bar{X}$)    $\infty$   1.621    1.125    0.960
ARE($\bar{X}$, Median)       0       0.617    0.888    1.041
MIXTURE OF NORMALS

Theory of errors: the Central Limit Theorem gives plausibility to normality.

$X \sim \begin{cases} \mathrm{Normal}(\mu, \sigma^2) & \text{with probability } 1 - \epsilon \\ \mathrm{Normal}(\mu, (3\sigma)^2) & \text{with probability } \epsilon \end{cases}$

i.e. not all measurements are equally precise.

$X \sim (1 - \epsilon)\,\mathrm{Normal}(\mu, \sigma^2) + \epsilon\,\mathrm{Normal}(\mu, (3\sigma)^2)$, e.g. $\epsilon = 0.10$.

Classic paper: Tukey (1960), A survey of sampling from contaminated distributions.
For $\epsilon > 0.10$: $\mathrm{ARE}(\mathrm{Median}, \bar{X}) > 1$.
The mean absolute deviation is more efficient than the sample standard deviation for $\epsilon > 0.01$.
PRINCETON ROBUSTNESS STUDIES

Andrews et al. (1972). Other estimates of location.

$\alpha$-trimmed mean: Trim a proportion $\alpha$ from both ends of the data set and then take the mean. (Throwing away data?)
$\alpha$-Winsorized mean: Replace a proportion $\alpha$ at both ends of the data set by the next closest observation and then take the mean.

Example: 2, 4, 5, 10, 200
Mean = 44.2   Median = 5
20% trimmed mean = (4 + 5 + 10) / 3 = 6.33
20% Winsorized mean = (4 + 4 + 5 + 10 + 10) / 5 = 6.6
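The worked example above can be reproduced with a short sketch (illustrative code, not from the course; computing the per-side trim count as `int(alpha * n)` is a simplifying assumption):

```python
# Minimal sketch: alpha-trimmed and alpha-Winsorized means for a sample,
# matching the worked example above.

def trimmed_mean(xs, alpha):
    """Drop a proportion alpha from each end, then average the rest."""
    xs = sorted(xs)
    k = int(alpha * len(xs))              # points trimmed per side
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def winsorized_mean(xs, alpha):
    """Replace a proportion alpha at each end by the nearest kept value."""
    xs = sorted(xs)
    k = int(alpha * len(xs))
    xs = [xs[k]] * k + xs[k:len(xs) - k] + [xs[-k - 1]] * k
    return sum(xs) / len(xs)

data = [2, 4, 5, 10, 200]
print(trimmed_mean(data, 0.20))           # (4 + 5 + 10) / 3  = 6.33...
print(winsorized_mean(data, 0.20))        # (4 + 4 + 5 + 10 + 10) / 5 = 6.6
```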
Measuring the robustness of a statistic

Relative efficiency over a range of distributional models. There exist estimates of location which are asymptotically most efficient for the center of any symmetric distribution (adaptive estimation, semi-parametrics). Robust?
Influence function over a range of distributional models.
Maximum bias function and the breakdown point.
Measuring the effect of an outlier (not modeled)

Good Data Set: $x_1, \ldots, x_{n-1}$. Statistic: $T_{n-1} = T(x_1, \ldots, x_{n-1})$
Contaminated Data Set: $x_1, \ldots, x_{n-1}, x$. Contaminated Value: $T_n = T(x_1, \ldots, x_{n-1}, x)$

THE SENSITIVITY CURVE (Tukey, 1970)
$SC_n(x) = n\,(T_n - T_{n-1})$, i.e. $T_n = T_{n-1} + \frac{1}{n} SC_n(x)$

THE INFLUENCE FUNCTION (Hampel, 1969, 1974)
Population version of the sensitivity curve.
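As an illustration (not from the course; the sample values are hypothetical), the sensitivity curve can be computed directly from its definition for the mean and the median:

```python
# Minimal sketch: SC_n(x) = n * (T(x_1,...,x_{n-1},x) - T(x_1,...,x_{n-1}))
# for the mean and the median.
import statistics

def sensitivity_curve(stat, good_data, x):
    n = len(good_data) + 1
    return n * (stat(good_data + [x]) - stat(good_data))

good = [8.2, 9.1, 9.6, 10.0, 10.3, 10.9, 11.4]   # hypothetical "good" sample
for x in (10.0, 100.0):
    print("mean  ", x, sensitivity_curve(statistics.mean, good, x))
    print("median", x, sensitivity_curve(statistics.median, good, x))
# The mean's SC grows without bound in x; the median's stays bounded.
```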
THE INFLUENCE FUNCTION

Statistic $T_n = T(F_n)$ estimates $T(F)$. Consider $F$ and the $\epsilon$-contaminated distribution
$F_\epsilon = (1 - \epsilon) F + \epsilon\, \delta_x$
where $\delta_x$ is the point-mass distribution at $x$.

Compare functional values: $T(F)$ vs. $T(F_\epsilon)$.
Given qualitative robustness (continuity): $T(F_\epsilon) \to T(F)$ as $\epsilon \to 0$ (e.g. the mode is not qualitatively robust).

Influence Function (infinitesimal perturbation: Gâteaux derivative):
$IF(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{T(F_\epsilon) - T(F)}{\epsilon} = \left.\frac{\partial}{\partial \epsilon} T(F_\epsilon)\right|_{\epsilon = 0}$
EXAMPLES

Mean: $T(F) = E_F[X]$.
$T(F_\epsilon) = E_{F_\epsilon}[X] = (1 - \epsilon) E_F[X] + \epsilon\, E_{\delta_x}[X] = (1 - \epsilon)\,T(F) + \epsilon\, x$
$IF(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{(1 - \epsilon)\,T(F) + \epsilon\, x - T(F)}{\epsilon} = x - T(F)$

Median: $T(F) = F^{-1}(1/2)$
$IF(x; T, F) = \{2 f(T(F))\}^{-1}\,\mathrm{sign}(x - T(F))$
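These two influence functions can be checked numerically (a sketch, not from the course; the contamination point $x = 3$ and $\epsilon = 10^{-6}$ are arbitrary choices) by contaminating a standard normal $F$:

```python
# Minimal sketch: numerically check the IF formulas by contaminating a
# standard normal F at the point x and letting eps -> 0.
from statistics import NormalDist
import math

F = NormalDist()          # standard normal: T(F) = 0 for mean and median
x = 3.0                   # contamination point (x > median)
eps = 1e-6

# Mean of F_eps = (1-eps)*0 + eps*x, so (T(F_eps) - T(F))/eps -> x - T(F) = 3.
if_mean = ((1 - eps) * 0.0 + eps * x) / eps

# Median m of F_eps solves (1-eps)*Phi(m) + eps*1{x <= m} = 1/2; for x > m,
# m = Phi^{-1}(0.5 / (1-eps)).
m = F.inv_cdf(0.5 / (1 - eps))
if_median = m / eps

print(if_mean)                      # ~ 3.0 = x - T(F)
print(if_median)                    # ~ {2 f(0)}^{-1} = sqrt(pi/2) = 1.2533...
print(math.sqrt(math.pi / 2))
```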
Plots of Influence Functions

Gives insight into the behavior of a statistic: $T_n \approx T_{n-1} + \frac{1}{n} IF(x; T, F)$.

[Plots: influence functions of the mean, the median, the $\alpha$-trimmed mean, and the $\alpha$-Winsorized mean (somewhat unexpected?).]
Desirable robustness properties for the influence function

SMALL Gross Error Sensitivity:
$GES(T; F) = \sup_x |\, IF(x; T, F)\, |$
$GES < \infty \Leftrightarrow$ B-robust (bias-robust).

Asymptotic variance. Note: $E_F[\, IF(X; T, F)\, ] = 0$, and
$AV(T; F) = E_F[\, IF(X; T, F)^2\, ]$
Under general conditions, e.g. Fréchet differentiability,
$\sqrt{n}\,(T(X_1, \ldots, X_n) - T(F)) \to \mathrm{Normal}(0, AV(T; F))$

Trade-off at the normal model: smaller AV $\Leftrightarrow$ larger GES.
SMOOTH (local shift sensitivity): protects e.g. against rounding error.
REDESCENDING to 0.
REDESCENDING INFLUENCE FUNCTION

Example: data set of male heights in cm: 180, 175, 192, . . ., 185, 2020, 190, . . .
Redescender = automatic outlier detector.
CLASSES OF ESTIMATES

L-statistics: linear combinations of order statistics.
Let $X_{(1)} \le \cdots \le X_{(n)}$ represent the order statistics.
$T(X_1, \ldots, X_n) = \sum_{i=1}^n a_{i,n} X_{(i)}$
where the $a_{i,n}$ are constants.

Examples:
Mean: $a_{i,n} = 1/n$.
Median: for $n$ odd,
$a_{i,n} = \begin{cases} 1 & i = \frac{n+1}{2} \\ 0 & i \ne \frac{n+1}{2} \end{cases}$
and for $n$ even,
$a_{i,n} = \begin{cases} \frac{1}{2} & i = \frac{n}{2},\ \frac{n}{2}+1 \\ 0 & \text{otherwise.} \end{cases}$
$\alpha$-trimmed mean. $\alpha$-Winsorized mean.

A general form for the influence function exists. Can obtain any desirable monotonic shape, but not redescending. Do not readily generalize to other settings.
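A generic L-statistic is straightforward to code (illustrative sketch, not from the course; function names and data are arbitrary):

```python
# Minimal sketch: a generic L-statistic T = sum_i a_{i,n} X_(i),
# with the median's weights as one example.

def l_statistic(xs, weights):
    """Weighted sum of the order statistics of xs."""
    assert len(xs) == len(weights)
    return sum(a * x for a, x in zip(weights, sorted(xs)))

def median_weights(n):
    """a_{i,n} for the sample median (1-based index i)."""
    if n % 2 == 1:
        return [1.0 if i == (n + 1) // 2 else 0.0 for i in range(1, n + 1)]
    return [0.5 if i in (n // 2, n // 2 + 1) else 0.0 for i in range(1, n + 1)]

data = [2, 4, 5, 10, 200]
print(l_statistic(data, [1 / 5] * 5))          # mean = 44.2
print(l_statistic(data, median_weights(5)))    # median = 5
```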
M-ESTIMATES
Huber (1964, 1967)
"Maximum likelihood type estimates under non-standard conditions."

One-parameter case: $X_1, \ldots, X_n$ i.i.d. $f(x; \theta)$, $\theta \in \Theta$.

Maximum likelihood estimates.
Likelihood function: $L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta)$
Minimize the negative log-likelihood:
$\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$ where $\rho(x_i; \theta) = -\log f(x_i; \theta)$.
Or solve the likelihood equations:
$\sum_{i=1}^n \psi(x_i; \theta) = 0$ where $\psi(x_i; \theta) = -\frac{\partial}{\partial \theta}\rho(x_i; \theta)$
DEFINITIONS OF M-ESTIMATES

Objective function approach: $\hat{\theta} = \arg\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$.
M-estimating equation approach: $\sum_{i=1}^n \psi(x_i; \theta) = 0$.
Note: the solution is unique when $\psi(x; \theta)$ is strictly monotone in $\theta$.

Basic examples.
Mean. MLE for Normal: $f(x) = (2\pi)^{-1/2} e^{-\frac{1}{2}(x - \theta)^2}$ for $-\infty < x < \infty$.
$\rho(x; \theta) = (x - \theta)^2$ or $\psi(x; \theta) = x - \theta$
Median. MLE for Double Exponential: $f(x) = \frac{1}{2} e^{-|x - \theta|}$ for $-\infty < x < \infty$.
$\rho(x; \theta) = |\, x - \theta\, |$ or $\psi(x; \theta) = \mathrm{sign}(x - \theta)$

$\rho$ and $\psi$ need not be related to any density or to each other.
Estimates can be evaluated under various distributions.
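The objective-function approach can be illustrated by brute force (a sketch, not from the course; the grid resolution and data are arbitrary choices): minimizing $\sum_i \rho(x_i; \theta)$ over a grid recovers the mean for the squared-error $\rho$ and the median for the absolute-error $\rho$.

```python
# Minimal sketch: argmin_theta sum_i rho(x_i; theta) over a coarse grid.

def m_estimate(xs, rho, grid):
    return min(grid, key=lambda t: sum(rho(x, t) for x in xs))

data = [2, 4, 5, 10, 200]
grid = [i / 100 for i in range(0, 25001)]      # theta in [0, 250], step 0.01

theta_sq = m_estimate(data, lambda x, t: (x - t) ** 2, grid)   # -> mean
theta_abs = m_estimate(data, lambda x, t: abs(x - t), grid)    # -> median
print(theta_sq, theta_abs)    # 44.2 (mean) and 5.0 (median)
```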
M-ESTIMATES OF LOCATION

A symmetric and translation equivariant M-estimate.
Translation equivariance: $X_i \to X_i + a \;\Rightarrow\; T_n \to T_n + a$ gives $\rho(x; t) = \rho(x - t)$ and $\psi(x; t) = \psi(x - t)$.
Symmetry: $X_i \to -X_i \;\Rightarrow\; T_n \to -T_n$ gives $\rho(-r) = \rho(r)$ or $\psi(-r) = -\psi(r)$.

Alternative derivation: generalization of the MLE for the center of symmetry of a given family of symmetric distributions,
$f(x; \theta) = g(|\, x - \theta\, |)$.
INFLUENCE FUNCTION OF M-ESTIMATES

M-functional: $T(F)$ is the solution to $E_F[\psi(X; T(F))] = 0$.

$IF(x; T, F) = c(T, F)\, \psi(x; T(F))$
where $c(T, F) = -1 \big/ E_F\!\left[\frac{\partial}{\partial\theta}\psi(X; \theta)\right]$ evaluated at $\theta = T(F)$.

Note: $E_F[\, IF(X; T, F)\, ] = 0$.
One can decide what shape is desired for the influence function and then construct an appropriate M-estimate.

[Plots: $\psi$-functions of the mean and the median.]
EXAMPLE

Choose
$\psi(r) = \begin{cases} -c & r \le -c \\ r & |r| < c \\ c & r \ge c \end{cases}$
where $c$ is a tuning constant.

Huber's M-estimate: an adaptively trimmed mean,
i.e. the proportion trimmed depends upon the data.
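Since Huber's $\psi$ is monotone, the M-estimating equation has a unique root that simple bisection can find (a sketch, not from the course; $c = 1.345$ is the common tuning constant for 95% efficiency at the normal, and the data are illustrative):

```python
# Minimal sketch: Huber's M-estimate of location, found by bisection on the
# monotone M-estimating equation sum_i psi(x_i - theta) = 0.

def psi_huber(r, c):
    return max(-c, min(c, r))        # -c for r <= -c, r for |r| < c, c for r >= c

def huber_location(xs, c=1.345, tol=1e-10):
    lo, hi = min(xs), max(xs)        # the root lies within the data range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        s = sum(psi_huber(x - mid, c) for x in xs)
        if s > 0:                    # sum psi decreases in theta: root is right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [2, 4, 5, 10, 200]
print(huber_location(data))          # near the bulk of the data (~5.17),
                                     # unlike the mean (44.2)
```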
DERIVATION OF THE INFLUENCE FUNCTION OF M-ESTIMATES

Sketch. Let $T_\epsilon = T(F_\epsilon)$, and so
$0 = E_{F_\epsilon}[\psi(X; T_\epsilon)] = (1 - \epsilon) E_F[\psi(X; T_\epsilon)] + \epsilon\, \psi(x; T_\epsilon)$.
Taking the derivative with respect to $\epsilon$:
$0 = -E_F[\psi(X; T_\epsilon)] + (1 - \epsilon)\frac{\partial}{\partial\epsilon} E_F[\psi(X; T_\epsilon)] + \psi(x; T_\epsilon) + \epsilon\,\frac{\partial}{\partial\epsilon}\psi(x; T_\epsilon)$.
Let $\tilde{\psi}(x; \theta) = \partial\psi(x; \theta)/\partial\theta$. Using the chain rule,
$0 = -E_F[\psi(X; T_\epsilon)] + (1 - \epsilon) E_F[\tilde{\psi}(X; T_\epsilon)]\,\frac{\partial T_\epsilon}{\partial\epsilon} + \psi(x; T_\epsilon) + \epsilon\, \tilde{\psi}(x; T_\epsilon)\,\frac{\partial T_\epsilon}{\partial\epsilon}$.
Letting $\epsilon \to 0$ and using qualitative robustness, i.e. $T(F_\epsilon) \to T(F)$, the first term vanishes since $E_F[\psi(X; T(F))] = 0$, which gives
$0 = E_F[\tilde{\psi}(X; T(F))]\, IF(x; T, F) + \psi(x; T(F))$,
and the result follows.
ASYMPTOTIC NORMALITY OF M-ESTIMATES

Sketch. Let $\theta = T(F)$. A Taylor series expansion of $\psi(x; \theta)$ about $\theta$ gives
$0 = \sum_{i=1}^n \psi(x_i; \hat{\theta}) = \sum_{i=1}^n \psi(x_i; \theta) + (\hat{\theta} - \theta)\sum_{i=1}^n \tilde{\psi}(x_i; \theta) + \cdots$,
or
$0 = \sqrt{n}\left\{\tfrac{1}{n}\sum_{i=1}^n \psi(x_i; \theta)\right\} + \sqrt{n}\,(\hat{\theta} - \theta)\left\{\tfrac{1}{n}\sum_{i=1}^n \tilde{\psi}(x_i; \theta)\right\} + O_p(1/\sqrt{n})$.
By the CLT, $\sqrt{n}\left\{\tfrac{1}{n}\sum_{i=1}^n \psi(x_i; \theta)\right\} \to_d Z \sim \mathrm{Normal}(0, E_F[\psi(X; \theta)^2])$.
By the WLLN, $\tfrac{1}{n}\sum_{i=1}^n \tilde{\psi}(x_i; \theta) \to_p E_F[\tilde{\psi}(X; \theta)]$.
Thus, by Slutsky's theorem,
$\sqrt{n}\,(\hat{\theta} - \theta) \to_d -Z / E_F[\tilde{\psi}(X; \theta)] \sim \mathrm{Normal}(0, \sigma^2)$, where
$\sigma^2 = E_F[\psi(X; \theta)^2] \big/ E_F[\tilde{\psi}(X; \theta)]^2 = E_F[\, IF(X; T, F)^2\, ]$

NOTE: Proving Fréchet differentiability is not necessary.
M-estimates of location: adaptively weighted means

Recall that translation equivariance and symmetry imply $\psi(x; t) = \psi(x - t)$ and $\psi(-r) = -\psi(r)$.
Express $\psi(r) = r\,u(r)$ and let $w_i = u(x_i - \hat{\theta})$. Then
$0 = \sum_{i=1}^n \psi(x_i - \hat{\theta}) = \sum_{i=1}^n (x_i - \hat{\theta})\,u(x_i - \hat{\theta}) = \sum_{i=1}^n w_i\,\{x_i - \hat{\theta}\}$
$\Rightarrow \hat{\theta} = \frac{\sum_{i=1}^n w_i\, x_i}{\sum_{i=1}^n w_i}$
The weights are determined by the data cloud; outlying observations are heavily downweighted.
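The weight interpretation can be made concrete (a sketch, not from the course; $\theta = 5.2$ is just a rough robust center for the illustrative data):

```python
# Minimal sketch: the implied weights w_i = u(x_i - theta) for Huber's psi,
# where psi(r) = r * u(r) gives u(r) = min(1, c/|r|).

def u_huber(r, c):
    return 1.0 if abs(r) <= c else c / abs(r)

data = [2, 4, 5, 10, 200]
theta = 5.2          # rough robust center for illustration (near the median)
c = 1.345
weights = [u_huber(x - theta, c) for x in data]
print([round(w, 3) for w in weights])
# The outlier 200 receives weight c / |200 - 5.2| ~ 0.007, while points
# within c of theta receive full weight 1.
```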
SOME COMMON M-ESTIMATES OF LOCATION

Huber's M-estimate:
$\psi(r) = \begin{cases} -c & r \le -c \\ r & |r| < c \\ c & r \ge c \end{cases}$

Given a bound on the GES, it has maximum efficiency at the normal model.
It is the MLE for the least favorable distribution, i.e. the symmetric unimodal model with smallest Fisher information within a neighborhood of the normal.
LFD = normal in the middle and double exponential in the tails.
Tukey's bi-weight M-estimate (or bi-square):
$u(r) = \left\{\left(1 - \frac{r^2}{c^2}\right)_{\!+}\right\}^2$ where $a_+ = \max\{0, a\}$

Linear near zero. Smooth (continuous second derivatives). Strongly redescending to 0. NOT AN MLE.
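A direct transcription of the bi-weight function (a sketch, not from the course; $c = 4.685$ is the usual 95%-efficiency tuning constant):

```python
# Minimal sketch: Tukey's bi-weight u(r) = ((1 - r^2/c^2)_+)^2,
# so psi(r) = r * u(r) vanishes entirely for |r| >= c.

def u_biweight(r, c=4.685):
    t = 1.0 - (r / c) ** 2
    return max(0.0, t) ** 2

print(u_biweight(0.0))     # 1.0: full weight at the center
print(u_biweight(10.0))    # 0.0: points beyond c are rejected entirely
```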
CAUCHY MLE

$\psi(r) = \frac{r/c}{1 + r^2/c^2}$

NOT A STRONG REDESCENDER.
COMPUTATIONS

IRLS: Iterative Re-weighted Least Squares algorithm.
$\hat{\theta}_{k+1} = \frac{\sum_{i=1}^n w_{i,k}\, x_i}{\sum_{i=1}^n w_{i,k}}$ where $w_{i,k} = u(x_i - \hat{\theta}_k)$
and $\hat{\theta}_0$ is any initial value.

Re-weighted mean = one-step M-estimate:
$\hat{\theta}_1 = \frac{\sum_{i=1}^n w_{i,0}\, x_i}{\sum_{i=1}^n w_{i,0}}$,
where $\hat{\theta}_0$ is a preliminary robust estimate of location, such as the median.
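The IRLS iteration, with Huber's weight function and the median as the preliminary estimate (a sketch, not from the course; $c = 1.345$, the iteration count, and the data are illustrative choices):

```python
# Minimal sketch: IRLS for an M-estimate of location with Huber's
# weight function u(r) = min(1, c/|r|), started from the median.
import statistics

def u_huber(r, c):
    return 1.0 if abs(r) <= c else c / abs(r)

def irls_location(xs, c=1.345, n_iter=50):
    theta = statistics.median(xs)               # preliminary robust start
    for _ in range(n_iter):
        w = [u_huber(x - theta, c) for x in xs]
        theta = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return theta

data = [2, 4, 5, 10, 200]
print(irls_location(data))   # converges near 5.17, far from the mean 44.2
```

Stopping after one pass from the median gives exactly the one-step M-estimate described above.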
PROOF OF CONVERGENCE

Sketch.
$\hat{\theta}_{k+1} - \hat{\theta}_k = \frac{\sum_{i=1}^n w_{i,k}\, x_i}{\sum_{i=1}^n w_{i,k}} - \hat{\theta}_k = \frac{\sum_{i=1}^n w_{i,k}\,(x_i - \hat{\theta}_k)}{\sum_{i=1}^n w_{i,k}} = \frac{\sum_{i=1}^n \psi(x_i - \hat{\theta}_k)}{\sum_{i=1}^n u(x_i - \hat{\theta}_k)}$

Note: if $\hat{\theta}_k > \hat{\theta}$, then $\hat{\theta}_k > \hat{\theta}_{k+1}$, and if $\hat{\theta}_k < \hat{\theta}$, then $\hat{\theta}_k < \hat{\theta}_{k+1}$.

Decreasing objective function:
Let $\rho(r) = \rho_0(r^2)$ and suppose $\rho_0'(s) \ge 0$ and $\rho_0''(s) \le 0$. Then
$\sum_{i=1}^n \rho(x_i - \hat{\theta}_{k+1}) < \sum_{i=1}^n \rho(x_i - \hat{\theta}_k)$.

Examples: Mean, $\rho_0(s) = s$. Median, $\rho_0(s) = \sqrt{s}$. Cauchy MLE, $\rho_0(s) = \log(1 + s)$. These include redescending M-estimates of location.
Generalization of the EM algorithm for a mixture of normals.
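The decreasing-objective property can be checked numerically for the Cauchy MLE (a sketch, not from the course; $c = 1$, the starting value, and the data are illustrative):

```python
# Minimal sketch: verify that each IRLS step decreases the objective
# sum_i rho(x_i - theta) for the Cauchy MLE, rho_0(s) = log(1 + s),
# i.e. rho(r) = log(1 + r^2) with c = 1, so u(r) = 1/(1 + r^2).
import math

def rho(r):
    return math.log(1.0 + r * r)

def u(r):
    return 1.0 / (1.0 + r * r)       # psi(r) = r * u(r) = r / (1 + r^2)

def objective(xs, theta):
    return sum(rho(x - theta) for x in xs)

data = [2, 4, 5, 10, 200]
theta = 5.0                           # start at the median
objs = []
for _ in range(20):
    w = [u(x - theta) for x in data]
    theta = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    objs.append(objective(data, theta))

# The objective is non-increasing along the iterates, as the theorem claims.
print(all(b <= a + 1e-12 for a, b in zip(objs, objs[1:])))
```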
PROOF OF MONOTONE CONVERGENCE

Sketch. Let $r_{i,k} = (x_i - \hat{\theta}_k)$. By a Taylor series with remainder term,
$\rho_0(r_{i,k+1}^2) = \rho_0(r_{i,k}^2) + (r_{i,k+1}^2 - r_{i,k}^2)\,\rho_0'(r_{i,k}^2) + \frac{1}{2}(r_{i,k+1}^2 - r_{i,k}^2)^2\,\rho_0''(r_{i,*}^2)$
So,
$\sum_{i=1}^n \rho_0(r_{i,k+1}^2)$