6
Exploratory plots in the analysis of extremes Bikramjit Das 1 and Souvik Ghosh 2,3 1 Singapore University of Technology and Design, Singapore 2 LinkedIn Corporation, Mountain View, California, USA 3 Corresponding author: Souvik Ghosh, e-mail: [email protected] Abstract Exploratory data analysis is often used to test the goodness-of-fit of sample ob- servations to specific target distributions. A few such graphical tools have been extensively used to detect heavy-tailed behavior or extramal behavior in observed data. In this paper we discuss recent theoretical advancements in the understanding of asymptotic behavior of two such plotting tools: the Quantile-Quantile plot and the Mean Excess plot. As an application we suggest better practices in the use of these plots in practice. Key words: extreme values, heavy-tailed distributions, quantile-quantile plots, mean excess plots 1. Introduction Graphical tools play an important role in statistical analysis. Various kinds of plots are used in most, if not all, data analysis. The Quantile-Quantile (QQ) plot and the Mean Excess (ME) plot are among the most popular plots used in the analysis of heavy-tailed data. The ME plot is more broadly used in studying extreme-valued data. In this paper we survey the recent developments that help us better understand the large sample properties of these plots. We use that knowledge to demonstrate how we can use these plots to conduct better analysis and make better decisions. 1.1. QQ plots for heavy tails The QQ plot is popular graphical tool for detecting the goodness-of-fit for the ob- servations in a dataset to some known distribution F . The QQ plot is a plot of the empirical quantiles from the data against the theoretical quantiles of F . If the true distribution of the sample is F then the QQ plot should converge to a straight line. The QQ plot used in studying heavy-tailed data is a little different and specifically designed to check for distributions F where ¯ F := 1-F is regularly varying with some index -1, ξ> 0, also denoted ¯ F RV -1, cf. Resnick (2007), de Haan and Fer- reira (2006) or Bingham, Goldie and Teugels (1987). For a sample X 1 ,X 2 ,...,X n , the QQ plot in the context of heavy-tailed distributions is defined by Q n = n - log j k , log X (j ) X (k) :1 j k o , k < n, (1.1)

Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

Exploratory plots in the analysis of extremes

Bikramjit Das1 and Souvik Ghosh2,3

1Singapore University of Technology and Design, Singapore2LinkedIn Corporation, Mountain View, California, USA

3Corresponding author: Souvik Ghosh, e-mail: [email protected]

Abstract

Exploratory data analysis is often used to test the goodness-of-fit of sample ob-servations to specific target distributions. A few such graphical tools have beenextensively used to detect heavy-tailed behavior or extramal behavior in observeddata. In this paper we discuss recent theoretical advancements in the understandingof asymptotic behavior of two such plotting tools: the Quantile-Quantile plot andthe Mean Excess plot. As an application we suggest better practices in the use ofthese plots in practice.

Key words: extreme values, heavy-tailed distributions, quantile-quantile plots, meanexcess plots

1. Introduction

Graphical tools play an important role in statistical analysis. Various kinds of plotsare used in most, if not all, data analysis. The Quantile-Quantile (QQ) plot and theMean Excess (ME) plot are among the most popular plots used in the analysis ofheavy-tailed data. The ME plot is more broadly used in studying extreme-valueddata. In this paper we survey the recent developments that help us better understandthe large sample properties of these plots. We use that knowledge to demonstratehow we can use these plots to conduct better analysis and make better decisions.

1.1. QQ plots for heavy tails

The QQ plot is popular graphical tool for detecting the goodness-of-fit for the ob-servations in a dataset to some known distribution F . The QQ plot is a plot of theempirical quantiles from the data against the theoretical quantiles of F . If the truedistribution of the sample is F then the QQ plot should converge to a straight line.The QQ plot used in studying heavy-tailed data is a little different and specificallydesigned to check for distributions F where F := 1−F is regularly varying with someindex −1/ξ, ξ > 0, also denoted F ∈ RV−1/ξ, cf. Resnick (2007), de Haan and Fer-reira (2006) or Bingham, Goldie and Teugels (1987). For a sample X1, X2, . . . , Xn,the QQ plot in the context of heavy-tailed distributions is defined by

Qn ={(− log

j

k, log

X(j)

X(k)

): 1 ≤ j ≤ k

}, k < n, (1.1)

Page 2: Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

where X(1) ≥ X(2) ≥ . . . ≥ X(n) denote the decreasing order statistics from thesample. We concentrate on the top k quantiles of the data and that is justified bythe fact that F ∈ RV−1/ξ is only an asymptotic property of the right tail of thedistribution F . The value k in (1.1) is a function of the data size n and is chosensuch that k is large (k →∞) but small compared to n (k/n→ 0). Throughout thepaper we will assume that k satisfies this property.

1.2. ME plots

The generalized Pareto distribution (GPD) is an important class of distributionsand is the centerpiece for the peaks-over-threshold method used in extreme valueanalysis (Davison and Smith, 1990). The GPD is characterized by its cumulativedistribution function Gξ,β :

Gξ,β(x) =

{1− (1 + ξx/β)−1/ξ if ξ 6= 0,1− exp(−x/β) if ξ = 0,

(1.2)

The ME plot is often used as a graphical test to check if data conform to a GPD.The ME plot is the plot of the points {(X(k), M(X(k))) : 1 < k ≤ n} where,

M(u) :=

∑ni=1(Xi − u)I[Xi>u]∑n

i=1 I[Xi>u], u ≥ 0. (1.3)

is the empirical estimator of the ME function

M(u) := E[X − u|X > u

], (1.4)

provided EX+ < ∞. For a random variable X ∼ Gξ,β , we have E(X) < ∞ if andonly if ξ < 1 and in this case, the ME function of X is linear in u:

M(u) =β

1− ξ+

ξ

1− ξu, (1.5)

where 0 ≤ u < ∞ if 0 ≤ ξ < 1 and 0 ≤ u ≤ −β/ξ if ξ < 0. The linearity of theME function characterizes the GPD class, and it suggests that if the ME plot fora data set is close to a straight line for high values of the threshold then there isno evidence against the use of a GPD model. See also McNeil, Frey and Embrechts(2005), Embrechts, Kluppelberg and Mikosch (1997) and Hogg and Klugman (1984)for the traditional implementation of this plot.

2. QQ plots and ME plots: theory and practice

Historically, we have depended a lot on heuristics while interpreting the results ofexploratory plots. While using the traditional QQ plot, in order to test the nullhypothesis that a data set is an IID sample from a distribution F , we check if theQQ plot is close to a diagonal straight line. This is an artifact of the Glivenko-Cantelli theorem, cf. Shorack and Wellner (1986). Still there is a gap that remainsto be bridged given that a plot is a random closed set in R2 and the Glivenko-Cantelli theorem describes convergence of the empirical CDF to the theoretical CDFin the space of functions on R. The heuristic arguments have also been applied ininterpreting the results of the QQ plots and ME plots for the analysis of extremes.

Page 3: Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

2.1. Convergence in probability

Das and Resnick (2008) study the convergence of the QQ plot (1.1) in an appro-priate topology of random sets. Molchanov (2005) is a good reference on the topicof convergence of random sets. Das and Resnick (2008) show that under the nullhypothesis of F ∈ RV−1/ξ for some ξ > 0, the QQ plot converges in probability toa straight line with slope 1/ξ.

Ghosh and Resnick (2010) discusses convergence in probability of the ME plots.They show that if F is a distribution in the domain of attraction of a GPD Gξ,βwith ξ < 1, i.e. belongs to the maximal domain of attraction of an extreme valuedistribution, then the suitably normalized ME plot converges in probability to astraight line with slope ξ/(1− ξ). The slope of the limiting line is positive, negativeor zero, depending on whether ξ is positive, negative or zero, which is equivalentto F being in domain of attraction of the Frechet, the Weibull or the Gumbeldistributions respectively. Ghosh and Resnick (2011) proved the converse that if theME plot converges to a straight line for some normalization then F must be in thedomain of attraction of a GPD.

The advantage of the ME plot over the QQ plot is that it works when −∞ < ξ < 1whereas the QQ plot works for ξ > 0 only. Hence the ME plot can be used when-ever the sample is in the maximal domain of attraction of any generalized extremevalue distribution with finite mean (Frechet, Weibull or Gumbel distribution). TheQQ plot is restricted to the domain of attraction of Frechet distribution only. Thedisadvantage of the ME plot is that it requires ξ < 1 to make proper sense of theresult, i.e., the underlying distribution should have a finite mean. Still, limits canand have been obtained for the ME plots even when the distributional mean is notfinite, see Ghosh and Resnick (2010).

Fig 1. QQ Plot for right-skewed stable random variables with ξ = 2/3.

2.2. Weak convergence

One data set leads to just one single QQ or ME plot and a single plot is often notenough to statistically detect proximity between the plot and the intended fixed

Page 4: Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

limit set. Creating appropriate confidence bounds around these plots though canhelp us to test the null hypothesis with some degree of confidence.

Fig 2. ME Plot for right-skewed stable random variables with ξ = 2/3.

0 100 200 300 400 500 600

−2.

0−

1.5

−1.

0−

0.5

0.0

Pickands

number of order statistics

0 2000 4000 6000 8000 10000

−6

−4

−2

0

Moment

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

scaled order statistics

ME

func

tion

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

scaled order statistics

ME

func

tion

ME plot: k= 1000 , delta= 0.2 (Weibull par: 0.52 )

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

ME

func

tion

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

ME

func

tion

ME plot: k= 1000 , delta= 0.3 (Weibull par: 0.52 )

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

scaled order statistics

ME

func

tion

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

scaled order statistics

ME

func

tion

ME plot: k= 800 , delta= 0.2 (Weibull par: 0.51 )

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

ME

func

tion

0.0 0.2 0.4 0.6 0.8

−0.

20.

00.

20.

40.

6

ME

func

tion

ME plot: k= 800 , delta= 0.3 (Weibull par: 0.51 )

Fig 3. ME Plot for Beta(2,2) random variables, ξ = −0.5.

Das and Ghosh (2013a) proved the weak convergence of the QQ and the ME plotsfor samples from heavy-tailed distributions in the space of random sets of R2. Thisstudy is very helpful as the results can be used to to construct confidence boundsaround the plots and test the null hypothesis that the sample conforms to the class ofGPD. In Das and Ghosh (2013b), the authors extend the results to random variablesin the maximal domain of attraction of the Gumbel and the Weibull distributions.

It is important to note that using the weak limit to obtain the confidence boundsis a more stable method compared to other statistical tools like bootstrap. The diffi-culty of using bootstrap for data from heavy-tailed distribution is well documentedin Resnick (2007).

Page 5: Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

1995 2000 2005 2010

050

100

150

200

Daily Maximum Ozone content in Zurich urban area

time

max

Ozo

ne (

mcg

/cu.

m)

1995 2000 2005 2010

02

46

8

Daily Max−Ozone content (homoscedastecized)

time

max

Ozo

ne (

mcg

/cu.

m)

1995 2000 2005 2010

−2

02

4

Daily Max−Ozone content (after AR(30)−fit)

time

max

Ozo

ne (

mcg

/cu.

m)

0 10 20 30

0.0

0.2

0.4

0.6

0.8

1.0

Lag

AC

F

ACF after AR−fit

0 100 200 300 400

−0.

6−

0.4

−0.

20.

00.

20.

4

Pickands

number of order statistics

Pic

kand

s es

timat

e of

gam

ma

0 500 1000 1500 2000 2500 3000

−4

−3

−2

−1

0

Moment

number of order statistics

Mom

ent e

stim

ate

of g

amm

a

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.5

1.0

1.5

2.0

2.5

scaled order statistics

ME

func

tion

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.5

1.0

1.5

2.0

2.5

scaled order statistics

ME

func

tion

ME plot: k= 329 , delta= 0.1 (Gumbel)

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.5

1.0

1.5

2.0

2.5

scaled order statistics

ME

func

tion

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.5

1.0

1.5

2.0

2.5

scaled order statistics

ME

func

tion

ME plot: k= 329 , delta= 0.2 (Gumbel)

0.0 0.5 1.0 1.5 2.0 2.5 3.0

−0.

50.

00.

51.

01.

52.

02.

5

scaled order statistics

ME

func

tion

0.0 0.5 1.0 1.5 2.0 2.5 3.0

−0.

50.

00.

51.

01.

52.

02.

5

scaled order statistics

ME

func

tion

ME plot: k= 658 , delta= 0.1 (Gumbel)

0.0 0.5 1.0 1.5 2.0 2.5 3.0

−0.

50.

00.

51.

01.

52.

02.

5

scaled order statistics

ME

func

tion

0.0 0.5 1.0 1.5 2.0 2.5 3.0

−0.

50.

00.

51.

01.

52.

02.

5

scaled order statistics

ME

func

tion

ME plot: k= 658 , delta= 0.2 (Gumbel)

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

scaled order statistics

ME

func

tion

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

scaled order statistics

ME

func

tion

ME plot: k= 329 , delta= 0.1 (Weibull par: 0 )

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

scaled order statistics

ME

func

tion

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

scaled order statistics

ME

func

tion

ME plot: k= 329 , delta= 0.2 (Weibull par: 0 )

1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4

−0.

20.

00.

20.

40.

60.

81.

0

scaled order statistics

ME

func

tion

1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4

−0.

20.

00.

20.

40.

60.

81.

0

scaled order statistics

ME

func

tion

ME plot: k= 657 , delta= 0.1 (Frechet par: 0.04 )

1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4

−0.

20.

00.

20.

40.

60.

81.

0

scaled order statistics

ME

func

tion

1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4

−0.

20.

00.

20.

40.

60.

81.

0

scaled order statistics

ME

func

tion

ME plot: k= 657 , delta= 0.2 (Frechet par: 0.04 )

Fig 4. ME Plot for Beta(2,2) random variables ξ = −0.5.

3. Examples and conclusion

Figure 1 shows the QQ plot for a sample of simulated IID stable(1.5) distribution(ξ = 2/3). Figure 2 shows the ME plot for the same data set. In this example thedistribution is heavy-tailed and belongs to the maximal domain of attraction ofthe Frechet distribution. Figure 3 shows the ME plot for a Beta(2,2) distribution(ξ = −0.5) which is in the maximal domain of attraction of the Weibull distribution.In both these cases the confidence bounds of the plots allow us to accept the nullhypothesis.

We observe a data set, freely available from www.eea.europa.eu. The data setcontains daily maxima of ozone concentration (in µg/m3) from one station in Zurich,Switzerland (station code CH 0010A, Zurich-Kaserne) located 410 mts above sea-level. Data is observed from January 1, 1992 to December 31, 2009. Measurementswere unavailable for 22 days, which we impute by the average value of ozone con-centration on the same day for other available years. As seen in the plots in Figure

Page 6: Exploratory plots in the analysis of extremesExploratory plots in the analysis of extremes Bikramjit Das1 and Souvik Ghosh2;3 1Singapore University of Technology and Design, Singapore

4 the data clearly admits periodicity. We model the seasonality and the stationarycomponent using an AR process. We then analyze the extremal behavior of theresiduals of the model. The confidence bounds indicates that we should not rejectthe null hypothesis that the residuals belong to the maximal domain of attractionof the Gumbel distribution.

The figures clearly demonstrate the usefulness of having the confidence boundsalong with the plots. In many situations (Figure 2 for example) we might wronglyreject the null hypothesis in absence of the confidence bounds. The bounds showus which portion of the plot we should concentrate on when we make the decision.When on one hand we should concentrate on the tail portion of the data, i.e., takek to be small, on the other hand we must take k to be large enough so as to see anylarge sample behavior in the plot.

References

Bingham, N. H., Goldie, C. M. and Teugels, J. L. (1987). Regular Variation.Cambridge University Press.

Das, B. and Ghosh, S. (2013). Weak limits for exploratory plots in the analysis ofextremes. Bernoulli 19 308-343.

Das, B. and Ghosh, S. (2013). Weak limits for mean excess plots for light tailedrandom variables. Preprint.

Das, B. and Resnick, S. I. (2008). QQ Plots, Random Sets and Data from a HeavyTailed Distribution. Stochastic Models 24 103-132.

Davison, A. C. and Smith, R. L. (1990). Models for exceedances over high thresh-olds (with discussion). Journal of Royal Statistical Society, Series B 52 393-442.

de Haan, L. and Ferreira, A. (2006). Extreme Value Theory: An Introduction.Springer-Verlag, New York.

Embrechts, P., Kluppelberg, C. and Mikosch, T. (1997). Modelling ExtremeEvents for Insurance and Finance. Springer-Verlag, Berlin.

Flaschmeyer, J. (1963). Verschiedene topologisierungen im Raum derabgeschlossenen Mengen. Mathematische Nachrichten 26 321–337.

Ghosh, S. and Resnick, S. I. (2010). A Discussion on Mean Excess Plots. Stochas-tic Processes and their Applications 120 1492-1517.

Ghosh, S. and Resnick, S. I. (2011). When does the mean excess plot look linear?.Stochastic Models 27 705-722.

Hogg, R. and Klugman, S. (1984). Loss Distributions. Wiley, New York.McNeil, A. J., Frey, R. and Embrechts, P. (2005). Quantitative Risk Man-

agement: Concepts, Techniques and Tools. Princeton University Press, Princeton,NJ.

Molchanov, I. (2005). Theory of Random Sets. Probability and its Applications(New York). Springer-Verlag London Ltd., London.

Resnick, S. I. (2007). Heavy Tail Phenomena: Probabilistic and Statistical Model-ing. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, New York.

Shorack, G. R. and Wellner, J. A. (1986). Empirical processes with applicationsto statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley& Sons Inc., New York.