7/29/2019 ShortCourse on Robust Statistics
1/92
A SHORT COURSE ON
ROBUST STATISTICS
David E. Tyler
Rutgers
The State University of New Jersey
Web-Site
www.rci.rutgers.edu/~dtyler/ShortCourse.pdf
References
Huber, P.J. (1981). Robust Statistics. Wiley, New York.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
PART 1
CONCEPTS AND BASIC METHODS
MOTIVATION

Data Set: $X_1, X_2, \ldots, X_n$
Parametric Model: $F(x_1, \ldots, x_n \mid \theta)$
$\theta$: unknown parameter
$F$: known function

e.g. $X_1, X_2, \ldots, X_n$ i.i.d. $\mathrm{Normal}(\mu, \sigma^2)$

Q: Is it realistic to believe we don't know $(\mu, \sigma^2)$, but we know e.g. the shape of the tails of the distribution?
A: The model is assumed to be approximately true, e.g. symmetric and unimodal (past experience).
Q: Are statistical methods which are good under the model reasonably good if the model is only approximately true?

ROBUST STATISTICS formally addresses this issue.
CLASSIC EXAMPLE: MEAN vs. MEDIAN

Symmetric distributions: $\theta$ = population mean = population median.

Sample mean: $\bar{X} \approx \mathrm{Normal}(\theta, \sigma^2/n)$
Sample median: $\mathrm{Median} \approx \mathrm{Normal}\!\left(\theta, \frac{1}{n}\,\frac{1}{4 f(\theta)^2}\right)$
At the normal: $\mathrm{Median} \approx \mathrm{Normal}\!\left(\theta, \frac{\pi \sigma^2}{2n}\right)$

Asymptotic Relative Efficiency of Median to Mean:
$\mathrm{ARE}(\mathrm{Median}, \bar{X}) = \frac{\mathrm{avar}(\bar{X})}{\mathrm{avar}(\mathrm{Median})} = \frac{2}{\pi} = 0.6366$
CAUCHY DISTRIBUTION

$X \sim \mathrm{Cauchy}(\theta, \sigma^2)$ with density
$f(x; \theta, \sigma) = \frac{1}{\pi\sigma}\left\{1 + \left(\frac{x - \theta}{\sigma}\right)^2\right\}^{-1}$

Mean: $\bar{X} \sim \mathrm{Cauchy}(\theta, \sigma^2)$, i.e. the sample mean is no better than a single observation.
Median: $\mathrm{Median} \approx \mathrm{Normal}\!\left(\theta, \frac{\pi^2 \sigma^2}{4n}\right)$

$\mathrm{ARE}(\mathrm{Median}, \bar{X}) = \infty$, or $\mathrm{ARE}(\bar{X}, \mathrm{Median}) = 0$.

For $t$ on $\nu$ degrees of freedom:
$\mathrm{ARE}(\mathrm{Median}, \bar{X}) = \frac{4}{(\nu - 2)\,\pi}\left\{\frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\right\}^2$

$\nu$                          2        3        4        5
ARE(Median, $\bar{X}$)    $\infty$   1.621    1.125    0.960
ARE($\bar{X}$, Median)       0       0.617    0.888    1.041
MIXTURE OF NORMALS

Theory of errors: the Central Limit Theorem gives plausibility to normality.

$X \sim \begin{cases} \mathrm{Normal}(\mu, \sigma^2) & \text{with probability } 1 - \epsilon \\ \mathrm{Normal}(\mu, (3\sigma)^2) & \text{with probability } \epsilon \end{cases}$

i.e. not all measurements are equally precise.

$X \sim (1 - \epsilon)\,\mathrm{Normal}(\mu, \sigma^2) + \epsilon\,\mathrm{Normal}(\mu, (3\sigma)^2)$, e.g. $\epsilon = 0.10$.

Classic paper: Tukey (1960), A survey of sampling from contaminated distributions.
For $\epsilon > 0.10$: $\mathrm{ARE}(\mathrm{Median}, \bar{X}) > 1$.
The mean absolute deviation is more efficient than the sample standard deviation for $\epsilon > 0.01$.
PRINCETON ROBUSTNESS STUDIES

Andrews et al. (1972). Other estimates of location.

$\alpha$-trimmed mean: Trim a proportion $\alpha$ from both ends of the data set and then take the mean. (Throwing away data?)
$\alpha$-Winsorized mean: Replace a proportion $\alpha$ at both ends of the data set by the next closest observation and then take the mean.

Example: 2, 4, 5, 10, 200
Mean = 44.2   Median = 5
20% trimmed mean = (4 + 5 + 10) / 3 = 6.33
20% Winsorized mean = (4 + 4 + 5 + 10 + 10) / 5 = 6.6
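The worked example above can be reproduced with a short sketch (illustrative code, not from the course; computing the per-side trim count as `int(alpha * n)` is a simplifying assumption):

```python
# Minimal sketch: alpha-trimmed and alpha-Winsorized means for a sample,
# matching the worked example above.

def trimmed_mean(xs, alpha):
    """Drop a proportion alpha from each end, then average the rest."""
    xs = sorted(xs)
    k = int(alpha * len(xs))              # points trimmed per side
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def winsorized_mean(xs, alpha):
    """Replace a proportion alpha at each end by the nearest kept value."""
    xs = sorted(xs)
    k = int(alpha * len(xs))
    xs = [xs[k]] * k + xs[k:len(xs) - k] + [xs[-k - 1]] * k
    return sum(xs) / len(xs)

data = [2, 4, 5, 10, 200]
print(trimmed_mean(data, 0.20))           # (4 + 5 + 10) / 3  = 6.33...
print(winsorized_mean(data, 0.20))        # (4 + 4 + 5 + 10 + 10) / 5 = 6.6
```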
Measuring the robustness of a statistic

Relative efficiency over a range of distributional models. There exist estimates of location which are asymptotically most efficient for the center of any symmetric distribution (adaptive estimation, semi-parametrics). Robust?
Influence function over a range of distributional models.
Maximum bias function and the breakdown point.
Measuring the effect of an outlier (not modeled)

Good Data Set: $x_1, \ldots, x_{n-1}$. Statistic: $T_{n-1} = T(x_1, \ldots, x_{n-1})$
Contaminated Data Set: $x_1, \ldots, x_{n-1}, x$. Contaminated Value: $T_n = T(x_1, \ldots, x_{n-1}, x)$

THE SENSITIVITY CURVE (Tukey, 1970)
$SC_n(x) = n\,(T_n - T_{n-1})$, i.e. $T_n = T_{n-1} + \frac{1}{n} SC_n(x)$

THE INFLUENCE FUNCTION (Hampel, 1969, 1974)
Population version of the sensitivity curve.
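As an illustration (not from the course; the sample values are hypothetical), the sensitivity curve can be computed directly from its definition for the mean and the median:

```python
# Minimal sketch: SC_n(x) = n * (T(x_1,...,x_{n-1},x) - T(x_1,...,x_{n-1}))
# for the mean and the median.
import statistics

def sensitivity_curve(stat, good_data, x):
    n = len(good_data) + 1
    return n * (stat(good_data + [x]) - stat(good_data))

good = [8.2, 9.1, 9.6, 10.0, 10.3, 10.9, 11.4]   # hypothetical "good" sample
for x in (10.0, 100.0):
    print("mean  ", x, sensitivity_curve(statistics.mean, good, x))
    print("median", x, sensitivity_curve(statistics.median, good, x))
# The mean's SC grows without bound in x; the median's stays bounded.
```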
THE INFLUENCE FUNCTION

Statistic $T_n = T(F_n)$ estimates $T(F)$. Consider $F$ and the $\epsilon$-contaminated distribution
$F_\epsilon = (1 - \epsilon) F + \epsilon\, \delta_x$
where $\delta_x$ is the point-mass distribution at $x$.

Compare functional values: $T(F)$ vs. $T(F_\epsilon)$.
Given qualitative robustness (continuity): $T(F_\epsilon) \to T(F)$ as $\epsilon \to 0$ (e.g. the mode is not qualitatively robust).

Influence Function (infinitesimal perturbation: Gâteaux derivative):
$IF(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{T(F_\epsilon) - T(F)}{\epsilon} = \left.\frac{\partial}{\partial \epsilon} T(F_\epsilon)\right|_{\epsilon = 0}$
EXAMPLES

Mean: $T(F) = E_F[X]$.
$T(F_\epsilon) = E_{F_\epsilon}[X] = (1 - \epsilon) E_F[X] + \epsilon\, E_{\delta_x}[X] = (1 - \epsilon)\,T(F) + \epsilon\, x$
$IF(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{(1 - \epsilon)\,T(F) + \epsilon\, x - T(F)}{\epsilon} = x - T(F)$

Median: $T(F) = F^{-1}(1/2)$
$IF(x; T, F) = \{2 f(T(F))\}^{-1}\,\mathrm{sign}(x - T(F))$
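These two influence functions can be checked numerically (a sketch, not from the course; the contamination point $x = 3$ and $\epsilon = 10^{-6}$ are arbitrary choices) by contaminating a standard normal $F$:

```python
# Minimal sketch: numerically check the IF formulas by contaminating a
# standard normal F at the point x and letting eps -> 0.
from statistics import NormalDist
import math

F = NormalDist()          # standard normal: T(F) = 0 for mean and median
x = 3.0                   # contamination point (x > median)
eps = 1e-6

# Mean of F_eps = (1-eps)*0 + eps*x, so (T(F_eps) - T(F))/eps -> x - T(F) = 3.
if_mean = ((1 - eps) * 0.0 + eps * x) / eps

# Median m of F_eps solves (1-eps)*Phi(m) + eps*1{x <= m} = 1/2; for x > m,
# m = Phi^{-1}(0.5 / (1-eps)).
m = F.inv_cdf(0.5 / (1 - eps))
if_median = m / eps

print(if_mean)                      # ~ 3.0 = x - T(F)
print(if_median)                    # ~ {2 f(0)}^{-1} = sqrt(pi/2) = 1.2533...
print(math.sqrt(math.pi / 2))
```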
Plots of Influence Functions

Gives insight into the behavior of a statistic: $T_n \approx T_{n-1} + \frac{1}{n} IF(x; T, F)$.

[Plots: influence functions of the mean, the median, the $\alpha$-trimmed mean, and the $\alpha$-Winsorized mean (somewhat unexpected?).]
Desirable robustness properties for the influence function

SMALL Gross Error Sensitivity:
$GES(T; F) = \sup_x |\, IF(x; T, F)\, |$
$GES < \infty \Leftrightarrow$ B-robust (bias-robust).

Asymptotic variance. Note: $E_F[\, IF(X; T, F)\, ] = 0$, and
$AV(T; F) = E_F[\, IF(X; T, F)^2\, ]$
Under general conditions, e.g. Fréchet differentiability,
$\sqrt{n}\,(T(X_1, \ldots, X_n) - T(F)) \to \mathrm{Normal}(0, AV(T; F))$

Trade-off at the normal model: smaller AV $\Leftrightarrow$ larger GES.
SMOOTH (local shift sensitivity): protects e.g. against rounding error.
REDESCENDING to 0.
REDESCENDING INFLUENCE FUNCTION

Example: data set of male heights in cm: 180, 175, 192, . . ., 185, 2020, 190, . . .
Redescender = automatic outlier detector.
CLASSES OF ESTIMATES

L-statistics: linear combinations of order statistics.
Let $X_{(1)} \le \cdots \le X_{(n)}$ represent the order statistics.
$T(X_1, \ldots, X_n) = \sum_{i=1}^n a_{i,n} X_{(i)}$
where the $a_{i,n}$ are constants.

Examples:
Mean: $a_{i,n} = 1/n$.
Median: for $n$ odd,
$a_{i,n} = \begin{cases} 1 & i = \frac{n+1}{2} \\ 0 & i \ne \frac{n+1}{2} \end{cases}$
and for $n$ even,
$a_{i,n} = \begin{cases} \frac{1}{2} & i = \frac{n}{2},\ \frac{n}{2}+1 \\ 0 & \text{otherwise.} \end{cases}$
$\alpha$-trimmed mean. $\alpha$-Winsorized mean.

A general form for the influence function exists. Can obtain any desirable monotonic shape, but not redescending. Do not readily generalize to other settings.
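A generic L-statistic is straightforward to code (illustrative sketch, not from the course; function names and data are arbitrary):

```python
# Minimal sketch: a generic L-statistic T = sum_i a_{i,n} X_(i),
# with the median's weights as one example.

def l_statistic(xs, weights):
    """Weighted sum of the order statistics of xs."""
    assert len(xs) == len(weights)
    return sum(a * x for a, x in zip(weights, sorted(xs)))

def median_weights(n):
    """a_{i,n} for the sample median (1-based index i)."""
    if n % 2 == 1:
        return [1.0 if i == (n + 1) // 2 else 0.0 for i in range(1, n + 1)]
    return [0.5 if i in (n // 2, n // 2 + 1) else 0.0 for i in range(1, n + 1)]

data = [2, 4, 5, 10, 200]
print(l_statistic(data, [1 / 5] * 5))          # mean = 44.2
print(l_statistic(data, median_weights(5)))    # median = 5
```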
M-ESTIMATES
Huber (1964, 1967)
"Maximum likelihood type estimates under non-standard conditions."

One-parameter case: $X_1, \ldots, X_n$ i.i.d. $f(x; \theta)$, $\theta \in \Theta$.

Maximum likelihood estimates.
Likelihood function: $L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta)$
Minimize the negative log-likelihood:
$\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$ where $\rho(x_i; \theta) = -\log f(x_i; \theta)$.
Or solve the likelihood equations:
$\sum_{i=1}^n \psi(x_i; \theta) = 0$ where $\psi(x_i; \theta) = -\frac{\partial}{\partial \theta}\rho(x_i; \theta)$
DEFINITIONS OF M-ESTIMATES

Objective function approach: $\hat{\theta} = \arg\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$.
M-estimating equation approach: $\sum_{i=1}^n \psi(x_i; \theta) = 0$.
Note: the solution is unique when $\psi(x; \theta)$ is strictly monotone in $\theta$.

Basic examples.
Mean. MLE for Normal: $f(x) = (2\pi)^{-1/2} e^{-\frac{1}{2}(x - \theta)^2}$ for $-\infty < x < \infty$.
$\rho(x; \theta) = (x - \theta)^2$ or $\psi(x; \theta) = x - \theta$
Median. MLE for Double Exponential: $f(x) = \frac{1}{2} e^{-|x - \theta|}$ for $-\infty < x < \infty$.
$\rho(x; \theta) = |\, x - \theta\, |$ or $\psi(x; \theta) = \mathrm{sign}(x - \theta)$

$\rho$ and $\psi$ need not be related to any density or to each other.
Estimates can be evaluated under various distributions.
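The objective-function approach can be illustrated by brute force (a sketch, not from the course; the grid resolution and data are arbitrary choices): minimizing $\sum_i \rho(x_i; \theta)$ over a grid recovers the mean for the squared-error $\rho$ and the median for the absolute-error $\rho$.

```python
# Minimal sketch: argmin_theta sum_i rho(x_i; theta) over a coarse grid.

def m_estimate(xs, rho, grid):
    return min(grid, key=lambda t: sum(rho(x, t) for x in xs))

data = [2, 4, 5, 10, 200]
grid = [i / 100 for i in range(0, 25001)]      # theta in [0, 250], step 0.01

theta_sq = m_estimate(data, lambda x, t: (x - t) ** 2, grid)   # -> mean
theta_abs = m_estimate(data, lambda x, t: abs(x - t), grid)    # -> median
print(theta_sq, theta_abs)    # 44.2 (mean) and 5.0 (median)
```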
M-ESTIMATES OF LOCATION

A symmetric and translation equivariant M-estimate.
Translation equivariance: $X_i \to X_i + a \;\Rightarrow\; T_n \to T_n + a$ gives $\rho(x; t) = \rho(x - t)$ and $\psi(x; t) = \psi(x - t)$.
Symmetry: $X_i \to -X_i \;\Rightarrow\; T_n \to -T_n$ gives $\rho(-r) = \rho(r)$ or $\psi(-r) = -\psi(r)$.

Alternative derivation: generalization of the MLE for the center of symmetry of a given family of symmetric distributions,
$f(x; \theta) = g(|\, x - \theta\, |)$.
INFLUENCE FUNCTION OF M-ESTIMATES

M-functional: $T(F)$ is the solution to $E_F[\psi(X; T(F))] = 0$.

$IF(x; T, F) = c(T, F)\, \psi(x; T(F))$
where $c(T, F) = -1 \big/ E_F\!\left[\frac{\partial}{\partial\theta}\psi(X; \theta)\right]$ evaluated at $\theta = T(F)$.

Note: $E_F[\, IF(X; T, F)\, ] = 0$.
One can decide what shape is desired for the influence function and then construct an appropriate M-estimate.

[Plots: $\psi$-functions of the mean and the median.]
EXAMPLE

Choose
$\psi(r) = \begin{cases} -c & r \le -c \\ r & |r| < c \\ c & r \ge c \end{cases}$
where $c$ is a tuning constant.

Huber's M-estimate: an adaptively trimmed mean,
i.e. the proportion trimmed depends upon the data.
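Since Huber's $\psi$ is monotone, the M-estimating equation has a unique root that simple bisection can find (a sketch, not from the course; $c = 1.345$ is the common tuning constant for 95% efficiency at the normal, and the data are illustrative):

```python
# Minimal sketch: Huber's M-estimate of location, found by bisection on the
# monotone M-estimating equation sum_i psi(x_i - theta) = 0.

def psi_huber(r, c):
    return max(-c, min(c, r))        # -c for r <= -c, r for |r| < c, c for r >= c

def huber_location(xs, c=1.345, tol=1e-10):
    lo, hi = min(xs), max(xs)        # the root lies within the data range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        s = sum(psi_huber(x - mid, c) for x in xs)
        if s > 0:                    # sum psi decreases in theta: root is right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [2, 4, 5, 10, 200]
print(huber_location(data))          # near the bulk of the data (~5.17),
                                     # unlike the mean (44.2)
```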
DERIVATION OF THE INFLUENCE FUNCTION OF M-ESTIMATES

Sketch. Let $T_\epsilon = T(F_\epsilon)$, and so
$0 = E_{F_\epsilon}[\psi(X; T_\epsilon)] = (1 - \epsilon) E_F[\psi(X; T_\epsilon)] + \epsilon\, \psi(x; T_\epsilon)$.
Taking the derivative with respect to $\epsilon$:
$0 = -E_F[\psi(X; T_\epsilon)] + (1 - \epsilon)\frac{\partial}{\partial\epsilon} E_F[\psi(X; T_\epsilon)] + \psi(x; T_\epsilon) + \epsilon\,\frac{\partial}{\partial\epsilon}\psi(x; T_\epsilon)$.
Let $\tilde{\psi}(x; \theta) = \partial\psi(x; \theta)/\partial\theta$. Using the chain rule,
$0 = -E_F[\psi(X; T_\epsilon)] + (1 - \epsilon) E_F[\tilde{\psi}(X; T_\epsilon)]\,\frac{\partial T_\epsilon}{\partial\epsilon} + \psi(x; T_\epsilon) + \epsilon\, \tilde{\psi}(x; T_\epsilon)\,\frac{\partial T_\epsilon}{\partial\epsilon}$.
Letting $\epsilon \to 0$ and using qualitative robustness, i.e. $T(F_\epsilon) \to T(F)$, the first term vanishes since $E_F[\psi(X; T(F))] = 0$, which gives
$0 = E_F[\tilde{\psi}(X; T(F))]\, IF(x; T, F) + \psi(x; T(F))$,
and the result follows.
ASYMPTOTIC NORMALITY OF M-ESTIMATES

Sketch. Let $\theta = T(F)$. A Taylor series expansion of $\psi(x; \theta)$ about $\theta$ gives
$0 = \sum_{i=1}^n \psi(x_i; \hat{\theta}) = \sum_{i=1}^n \psi(x_i; \theta) + (\hat{\theta} - \theta)\sum_{i=1}^n \tilde{\psi}(x_i; \theta) + \cdots$,
or
$0 = \sqrt{n}\left\{\tfrac{1}{n}\sum_{i=1}^n \psi(x_i; \theta)\right\} + \sqrt{n}\,(\hat{\theta} - \theta)\left\{\tfrac{1}{n}\sum_{i=1}^n \tilde{\psi}(x_i; \theta)\right\} + O_p(1/\sqrt{n})$.
By the CLT, $\sqrt{n}\left\{\tfrac{1}{n}\sum_{i=1}^n \psi(x_i; \theta)\right\} \to_d Z \sim \mathrm{Normal}(0, E_F[\psi(X; \theta)^2])$.
By the WLLN, $\tfrac{1}{n}\sum_{i=1}^n \tilde{\psi}(x_i; \theta) \to_p E_F[\tilde{\psi}(X; \theta)]$.
Thus, by Slutsky's theorem,
$\sqrt{n}\,(\hat{\theta} - \theta) \to_d -Z / E_F[\tilde{\psi}(X; \theta)] \sim \mathrm{Normal}(0, \sigma^2)$, where
$\sigma^2 = E_F[\psi(X; \theta)^2] \big/ E_F[\tilde{\psi}(X; \theta)]^2 = E_F[\, IF(X; T, F)^2\, ]$

NOTE: Proving Fréchet differentiability is not necessary.
M-estimates of location: adaptively weighted means

Recall that translation equivariance and symmetry imply $\psi(x; t) = \psi(x - t)$ and $\psi(-r) = -\psi(r)$.
Express $\psi(r) = r\,u(r)$ and let $w_i = u(x_i - \hat{\theta})$. Then
$0 = \sum_{i=1}^n \psi(x_i - \hat{\theta}) = \sum_{i=1}^n (x_i - \hat{\theta})\,u(x_i - \hat{\theta}) = \sum_{i=1}^n w_i\,\{x_i - \hat{\theta}\}$
$\Rightarrow \hat{\theta} = \frac{\sum_{i=1}^n w_i\, x_i}{\sum_{i=1}^n w_i}$
The weights are determined by the data cloud; outlying observations are heavily downweighted.
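The weight interpretation can be made concrete (a sketch, not from the course; $\theta = 5.2$ is just a rough robust center for the illustrative data):

```python
# Minimal sketch: the implied weights w_i = u(x_i - theta) for Huber's psi,
# where psi(r) = r * u(r) gives u(r) = min(1, c/|r|).

def u_huber(r, c):
    return 1.0 if abs(r) <= c else c / abs(r)

data = [2, 4, 5, 10, 200]
theta = 5.2          # rough robust center for illustration (near the median)
c = 1.345
weights = [u_huber(x - theta, c) for x in data]
print([round(w, 3) for w in weights])
# The outlier 200 receives weight c / |200 - 5.2| ~ 0.007, while points
# within c of theta receive full weight 1.
```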
SOME COMMON M-ESTIMATES OF LOCATION

Huber's M-estimate:
$\psi(r) = \begin{cases} -c & r \le -c \\ r & |r| < c \\ c & r \ge c \end{cases}$

Given a bound on the GES, it has maximum efficiency at the normal model.
It is the MLE for the least favorable distribution, i.e. the symmetric unimodal model with smallest Fisher information within a neighborhood of the normal.
LFD = normal in the middle and double exponential in the tails.
Tukey's bi-weight M-estimate (or bi-square):
$u(r) = \left\{\left(1 - \frac{r^2}{c^2}\right)_{\!+}\right\}^2$ where $a_+ = \max\{0, a\}$

Linear near zero. Smooth (continuous second derivatives). Strongly redescending to 0. NOT AN MLE.
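A direct transcription of the bi-weight function (a sketch, not from the course; $c = 4.685$ is the usual 95%-efficiency tuning constant):

```python
# Minimal sketch: Tukey's bi-weight u(r) = ((1 - r^2/c^2)_+)^2,
# so psi(r) = r * u(r) vanishes entirely for |r| >= c.

def u_biweight(r, c=4.685):
    t = 1.0 - (r / c) ** 2
    return max(0.0, t) ** 2

print(u_biweight(0.0))     # 1.0: full weight at the center
print(u_biweight(10.0))    # 0.0: points beyond c are rejected entirely
```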
CAUCHY MLE

$\psi(r) = \frac{r/c}{1 + r^2/c^2}$

NOT A STRONG REDESCENDER.
COMPUTATIONS

IRLS: Iterative Re-weighted Least Squares algorithm.
$\hat{\theta}_{k+1} = \frac{\sum_{i=1}^n w_{i,k}\, x_i}{\sum_{i=1}^n w_{i,k}}$ where $w_{i,k} = u(x_i - \hat{\theta}_k)$
and $\hat{\theta}_0$ is any initial value.

Re-weighted mean = one-step M-estimate:
$\hat{\theta}_1 = \frac{\sum_{i=1}^n w_{i,0}\, x_i}{\sum_{i=1}^n w_{i,0}}$,
where $\hat{\theta}_0$ is a preliminary robust estimate of location, such as the median.
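The IRLS iteration, with Huber's weight function and the median as the preliminary estimate (a sketch, not from the course; $c = 1.345$, the iteration count, and the data are illustrative choices):

```python
# Minimal sketch: IRLS for an M-estimate of location with Huber's
# weight function u(r) = min(1, c/|r|), started from the median.
import statistics

def u_huber(r, c):
    return 1.0 if abs(r) <= c else c / abs(r)

def irls_location(xs, c=1.345, n_iter=50):
    theta = statistics.median(xs)               # preliminary robust start
    for _ in range(n_iter):
        w = [u_huber(x - theta, c) for x in xs]
        theta = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return theta

data = [2, 4, 5, 10, 200]
print(irls_location(data))   # converges near 5.17, far from the mean 44.2
```

Stopping after one pass from the median gives exactly the one-step M-estimate described above.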
PROOF OF CONVERGENCE

Sketch.
$\hat{\theta}_{k+1} - \hat{\theta}_k = \frac{\sum_{i=1}^n w_{i,k}\, x_i}{\sum_{i=1}^n w_{i,k}} - \hat{\theta}_k = \frac{\sum_{i=1}^n w_{i,k}\,(x_i - \hat{\theta}_k)}{\sum_{i=1}^n w_{i,k}} = \frac{\sum_{i=1}^n \psi(x_i - \hat{\theta}_k)}{\sum_{i=1}^n u(x_i - \hat{\theta}_k)}$

Note: if $\hat{\theta}_k > \hat{\theta}$, then $\hat{\theta}_k > \hat{\theta}_{k+1}$, and if $\hat{\theta}_k < \hat{\theta}$, then $\hat{\theta}_k < \hat{\theta}_{k+1}$.

Decreasing objective function:
Let $\rho(r) = \rho_0(r^2)$ and suppose $\rho_0'(s) \ge 0$ and $\rho_0''(s) \le 0$. Then
$\sum_{i=1}^n \rho(x_i - \hat{\theta}_{k+1}) < \sum_{i=1}^n \rho(x_i - \hat{\theta}_k)$.

Examples: Mean, $\rho_0(s) = s$. Median, $\rho_0(s) = \sqrt{s}$. Cauchy MLE, $\rho_0(s) = \log(1 + s)$. These include redescending M-estimates of location.
Generalization of the EM algorithm for a mixture of normals.
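The decreasing-objective property can be checked numerically for the Cauchy MLE (a sketch, not from the course; $c = 1$, the starting value, and the data are illustrative):

```python
# Minimal sketch: verify that each IRLS step decreases the objective
# sum_i rho(x_i - theta) for the Cauchy MLE, rho_0(s) = log(1 + s),
# i.e. rho(r) = log(1 + r^2) with c = 1, so u(r) = 1/(1 + r^2).
import math

def rho(r):
    return math.log(1.0 + r * r)

def u(r):
    return 1.0 / (1.0 + r * r)       # psi(r) = r * u(r) = r / (1 + r^2)

def objective(xs, theta):
    return sum(rho(x - theta) for x in xs)

data = [2, 4, 5, 10, 200]
theta = 5.0                           # start at the median
objs = []
for _ in range(20):
    w = [u(x - theta) for x in data]
    theta = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    objs.append(objective(data, theta))

# The objective is non-increasing along the iterates, as the theorem claims.
print(all(b <= a + 1e-12 for a, b in zip(objs, objs[1:])))
```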
PROOF OF MONOTONE CONVERGENCE

Sketch. Let $r_{i,k} = (x_i - \hat{\theta}_k)$. By a Taylor series with remainder term,
$\rho_0(r_{i,k+1}^2) = \rho_0(r_{i,k}^2) + (r_{i,k+1}^2 - r_{i,k}^2)\,\rho_0'(r_{i,k}^2) + \frac{1}{2}(r_{i,k+1}^2 - r_{i,k}^2)^2\,\rho_0''(r_{i,*}^2)$
So,
$\sum_{i=1}^n \rho_0(r_{i,k+1}^2)$