
Extremes (2007) 10:1–19, DOI 10.1007/s10687-007-0032-4

Vector generalized linear and additive extreme value models

Thomas W. Yee · Alec G. Stephenson

Received: 24 January 2006 / Revised: 30 January 2007 / Accepted: 20 February 2007 / Published online: 8 June 2007
© Springer Science + Business Media, LLC 2007

Abstract Over recent years parametric and nonparametric regression has slowly been adopted into extreme value data analysis. Its introduction has been characterized by piecemeal additions and embellishments, which has had a negative effect on software development and usage. The purpose of this article is to convey the classes of vector generalized linear and additive models (VGLMs and VGAMs) as offering significant advantages for extreme value data analysis, providing flexible smoothing within a unifying framework. In particular, VGLMs and VGAMs allow all parameters of extreme value distributions to be modelled as linear or smooth functions of covariates. We implement new auxiliary methodology by incorporating a quasi-Newton update for the working weight matrices within an iteratively reweighted least squares (IRLS) algorithm. A software implementation by the first author, called the vgam package for R, is used to illustrate the potential of VGLMs and VGAMs.

Keywords Fisher scoring · Iteratively reweighted least squares · Maximum likelihood estimation · Penalized likelihood · Smoothing · Extreme value modelling · Vector splines

AMS 2000 Subject Classification 62P99

T. W. Yee (B)
Department of Statistics, University of Auckland, Private Bag 92019, Auckland, New Zealand
e-mail: [email protected]

A. G. Stephenson
Department of Statistics & Applied Probability, National University of Singapore, Singapore 117546, Singapore
e-mail: [email protected]


1 Introduction

Over the last decade, nonparametric regression or smoothing has had a large and powerful influence in many areas of applied statistics. Unfortunately, its application to extreme value data analysis has been rather slow and piecemeal. Understandably, parameters of distributions associated with extreme value theory were originally modelled as scalars due to the theoretical and computational limitations at that time. Parameters were subsequently modelled as linear functions of covariates, e.g., Smith (1986) extends Weissman (1978) and allows the location parameter of the generalized extreme value (GEV) distribution to be a function of time with linear and sinusoidal components, and Tawn (1988) uses a linear trend for a marginal location parameter of a bivariate extreme value model. More recently, parameters have been modelled as fully smooth functions of covariates, e.g., Rosen and Cohen (1996) extend Smith (1986) by using cubic splines to model both parameters of the Gumbel distribution, Pauli and Coles (2001) use a spline smoother for the GEV location parameter, and Chavez-Demoulin and Davison (2005) apply splines to the parameters of a peaks-over-threshold model.

In general statistical data analysis, smoothing allows for a data-driven approach rather than a model-driven approach so the data can speak for itself, and this is an invaluable asset. With extreme value data, care must be taken when allowing a data-driven approach because many extreme value models are based on asymptotics that require extrapolation and thus are fundamentally model-driven. Nevertheless, when justifiable, a data-driven approach nested within this model-driven framework can be very useful. A modern text illustrating the wide applicability and power of smoothing is Schimek (2000).

In recent years, several pieces of software have appeared for extreme value data analysis: see Stephenson and Gilleland (2006) for a comprehensive review. These have given extreme value analysts tools for exploring their data and fitting various statistical models. Each software package provides useful utilities, but these are generally constrained within a narrow modelling context. It is hoped that the reader will appreciate that a unified theory should result in more streamlined software that makes life for the practitioner much easier. Similarly, the extreme value modelling literature may be assimilated more successfully if seen in the context of an overarching structure. Ultimately, we believe the lack of a unified theoretical framework among extreme value models has had negative consequences, and this paper seeks to redress this issue.

The purpose of this article is to convey the advantage obtained by considering the classes of vector generalized linear models (VGLMs; Yee and Hastie 2003) and vector generalized additive models (VGAMs; Yee and Wild 1996) in the context of extreme value analysis. We have developed new methodology in order to incorporate extreme value models within the VGAM framework, namely, a quasi-Newton update within the iteratively reweighted least squares (IRLS) algorithm, the details of which are given in Section 2.2. The VGLM/VGAM classes of models are very large; they have been successfully applied to categorical data analysis, LMS quantile regression, and scores of standard and nonstandard univariate and continuous distributions. VGLMs/VGAMs provide a substantial and unifying framework for many statistical methods applicable for extreme value models. They also provide a seamless transition between parametric and nonparametric analyses, allowing parameters to be modelled as linear or smooth functions of covariates.

The VGLM/VGAM classes are implemented in the vgam package (Yee 2007) for the R statistical computing environment (Ihaka and Gentleman 1996). This package provides a number of additional benefits above what is available elsewhere, and the modular nature of vgam also makes it easier for programmers to add new methodology through an object-oriented programming framework. The vgam package currently works only in R and not S-PLUS; however, the latter may be supported in the future.

An outline of this article is as follows. Section 2 reviews VGLMs and VGAMs, and Section 3 looks at extreme value models within this context. Section 4 demonstrates the vgam package with analyses of sea-level and rainfall data, and a simulation study. The paper concludes with a discussion in Section 5. Appendix 1 provides some technical implementation details, and Appendix 2 gives sample R code for the extreme value analyses presented in Section 4.

2 Vector Generalized Linear and Additive Models

In this section a skeleton of necessary details is presented regarding VGLMs and VGAMs so that the purposes of this paper can be achieved. Fuller details can be found in Yee and Hastie (2003) and Yee and Wild (1996), respectively.

Suppose the observed response y is a q-dimensional vector. VGLMs are defined as a model for which the conditional distribution of Y given explanatory x is of the form

$$f(\mathbf{y} \mid \mathbf{x}; \mathbf{B}) = h(\mathbf{y}, \eta_1, \ldots, \eta_M) \qquad (2.1)$$

for some known function h(·), where $\mathbf{B} = (\boldsymbol{\beta}_1\ \boldsymbol{\beta}_2\ \cdots\ \boldsymbol{\beta}_M)$ is a p × M matrix of unknown regression coefficients, and the jth linear predictor is

$$\eta_j = \eta_j(\mathbf{x}) = \boldsymbol{\beta}_j^T \mathbf{x} = \sum_{k=1}^{p} \beta_{(j)k}\, x_k, \qquad j = 1, \ldots, M, \qquad (2.2)$$

where $\mathbf{x} = (x_1, \ldots, x_p)^T$ with $x_1 = 1$ if there is an intercept. VGLMs are thus like GLMs but allow for multiple linear predictors, and they encompass models outside the limited confines of the exponential family. Indeed, the breadth of the class covers a wide range of multivariate response types and models including univariate and multivariate distributions, categorical data analysis, time series, survival analysis, generalized estimating equations, correlated binary data, bioassay data and nonlinear least-squares problems.

The $\eta_j$ of VGLMs may be applied directly to parameters of a distribution rather than just to means as for GLMs. A simple example is a univariate distribution with a location parameter μ and a scale parameter σ > 0, where we may take $\eta_1 = \mu$ and $\eta_2 = \log \sigma$. In general, $\eta_j = g_j(\theta_j)$ for some parameter link function $g_j$ and parameter $\theta_j$. In vgam, there are currently over a dozen links to choose from, of which any can be assigned to any parameter, ensuring maximum flexibility.
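For instance (a minimal sketch, not from the paper), an identity link can be assigned to the GEV shape parameter through the lshape argument that also appears in the Appendix 2 code; the response y and data frame dat below are placeholders.

library(VGAM)
# Assign an identity link to the shape parameter; the scale keeps its
# default log link. 'y' and 'dat' are hypothetical.
fit <- vglm(y ~ 1, family = gev(lshape = "identity"), data = dat)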


There is no relationship between q and M in general: it depends specifically on the model or distribution to be fitted. For simple extreme value models M is usually the number of parameters to be estimated and q may be any positive integer. For the stationary GEV and GP models defined in Section 3 we have q = 1, with M = 3 and M = 2 respectively.

VGLMs are estimated using IRLS. Most models that can be fitted have a log-likelihood $\ell = \sum_{i=1}^{n} w_i \ell_i$ and this will be assumed here. The $w_i$ are known positive prior weights. Let $\mathbf{x}_i$ denote the explanatory vector for the ith observation, for $i = 1, \ldots, n$. Then one can write

$$\boldsymbol{\eta}_i = \boldsymbol{\eta}(\mathbf{x}_i) = \begin{pmatrix} \eta_1(\mathbf{x}_i) \\ \vdots \\ \eta_M(\mathbf{x}_i) \end{pmatrix} = \mathbf{B}^T \mathbf{x}_i = \begin{pmatrix} \boldsymbol{\beta}_1^T \mathbf{x}_i \\ \vdots \\ \boldsymbol{\beta}_M^T \mathbf{x}_i \end{pmatrix}.$$

In IRLS, an adjusted dependent vector $\mathbf{z}_i = \boldsymbol{\eta}_i + \mathbf{W}_i^{-1} \mathbf{d}_i$ is regressed upon a large model matrix, with $\mathbf{d}_i = w_i\, \partial \ell_i / \partial \boldsymbol{\eta}_i$. The working weights $\mathbf{W}_i$ are usually taken as $\mathbf{W}_i = -w_i\, \partial^2 \ell_i / (\partial \boldsymbol{\eta}_i\, \partial \boldsymbol{\eta}_i^T)$ or $\mathbf{W}_i = -w_i\, E[\partial^2 \ell_i / (\partial \boldsymbol{\eta}_i\, \partial \boldsymbol{\eta}_i^T)]$, giving rise to the Newton-Raphson and Fisher-scoring algorithms respectively. Fisher scoring is to be preferred for numerical stability because the $\mathbf{W}_i$ are positive-definite over a larger region of the parameter space. The price to pay for this stability is typically a slower convergence rate and less accurate standard errors (Efron and Hinkley 1978).
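Written out, one IRLS iteration amounts to a generalized least squares fit of the $\mathbf{z}_i$. The following is a conceptual sketch of such a step, not the vgam implementation; the list arguments (per-observation blocks of the big model matrix, linear predictors, scores and working weights) are hypothetical.

# One IRLS step as a generalized least squares solve (conceptual sketch).
#   Xbig[[i]] : M x (pM) block of the big model matrix for observation i
#   eta[[i]]  : current linear predictor eta_i (length M)
#   d[[i]]    : score vector d_i = w_i * dl_i/deta_i
#   W[[i]]    : working weight matrix W_i (M x M)
irls.step <- function(Xbig, eta, d, W) {
  z <- Map(function(ei, Wi, di) ei + solve(Wi, di), eta, W, d)  # z_i = eta_i + W_i^{-1} d_i
  XtWX <- Reduce(`+`, Map(function(Xi, Wi) t(Xi) %*% Wi %*% Xi, Xbig, W))
  XtWz <- Reduce(`+`, Map(function(Xi, Wi, zi) t(Xi) %*% Wi %*% zi, Xbig, W, z))
  solve(XtWX, XtWz)   # updated regression coefficients
}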

VGAMs provide additive-model extensions to VGLMs, that is, Eq. 2.2 is generalized to

$$\eta_j(\mathbf{x}) = \beta_{(j)1} + \sum_{k=2}^{p} f_{(j)k}(x_k), \qquad j = 1, \ldots, M,$$

a sum of smooth functions of the individual covariates, just as with ordinary GAMs (Hastie and Tibshirani 1990). The $\mathbf{f}_k = (f_{(1)k}(x_k), \ldots, f_{(M)k}(x_k))^T$ are centered for uniqueness, and are estimated simultaneously using vector smoothers. VGAMs are thus a visual data-driven method that is well suited to exploring data, and they retain the simplicity of interpretation that GAMs possess.

In practice we may wish to constrain the effect of a covariate to be the same for some of the $\eta_j$ and to have no effect for others. For example, for VGAMs we may wish to take

$$\eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + f_{(1)3}(x_3), \qquad \eta_2 = \beta_{(2)1} + f_{(1)2}(x_2),$$

so that $f_{(1)2} \equiv f_{(2)2}$ and $f_{(2)3} \equiv 0$. For VGAMs, we can represent these models using

$$\boldsymbol{\eta}(\mathbf{x}) = \boldsymbol{\beta}_{(1)} + \sum_{k=2}^{p} \mathbf{f}_k(x_k) = \mathbf{H}_1 \boldsymbol{\beta}^*_{(1)} + \sum_{k=2}^{p} \mathbf{H}_k \mathbf{f}^*_k(x_k)$$

where $\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_p$ are known full-column rank constraint matrices, $\mathbf{f}^*_k$ is a vector containing a possibly reduced set of component functions and $\boldsymbol{\beta}^*_{(1)}$ is a vector of unknown intercepts. With no constraints at all, $\mathbf{H}_1 = \mathbf{H}_2 = \cdots = \mathbf{H}_p = \mathbf{I}_M$ and $\boldsymbol{\beta}^*_{(1)} = \boldsymbol{\beta}_{(1)}$. Like the $\mathbf{f}_k$, the $\mathbf{f}^*_k$ are centered for uniqueness. For VGLMs, the $\mathbf{f}_k$ are linear so that

$$\mathbf{B}^T = \left( \mathbf{H}_1 \boldsymbol{\beta}^*_{(1)} \;\; \mathbf{H}_2 \boldsymbol{\beta}^*_{(2)} \;\; \cdots \;\; \mathbf{H}_p \boldsymbol{\beta}^*_{(p)} \right).$$
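For concreteness (this worked example is not in the original text), the two-$\eta$ example above corresponds to $M = 2$ with constraint matrices

$$\mathbf{H}_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \mathbf{H}_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \mathbf{H}_3 = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$$

so that $\mathbf{f}_2 = \mathbf{H}_2 f^*_{(1)2}$ enforces $f_{(1)2} \equiv f_{(2)2}$, and $\mathbf{f}_3 = \mathbf{H}_3 f^*_{(1)3}$ enforces $f_{(2)3} \equiv 0$.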

VGAMs can be estimated by applying a modified vector backfitting algorithm (Buja et al. 1989) to the $\mathbf{z}_i$ (this is called the local scoring algorithm). This entails decomposing $f_{(j)k}(x_k) = \beta_{(j)k} x_k + g_{(j)k}(x_k)$ and simultaneously fitting all the linear parts first, followed by ordinary backfitting to estimate the nonlinear parts $g_{(j)k}(x_k)$. Modified backfitting improves on ordinary backfitting in terms of convergence rate and numerical stability.

2.1 Vector Splines and Penalized Likelihood

Just as GAMs employ smoothers in their estimation, so VGAMs employ vector smoothers in their estimation. For scatterplot data $(x_i, \mathbf{y}_i, \boldsymbol{\Sigma}_i)$, $i = 1, \ldots, n$, vector smoothers fit a vector of smooth functions $\mathbf{f}(x) = (f_1(x), \ldots, f_M(x))^T$ to the vector measurement model

$$\mathbf{y}_i = \mathbf{f}(x_i) + \boldsymbol{\varepsilon}_i, \qquad i = 1, \ldots, n, \qquad (2.3)$$

where the $\boldsymbol{\varepsilon}_i$ are independent, $E(\boldsymbol{\varepsilon}_i) = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol{\varepsilon}_i) = \boldsymbol{\Sigma}_i$. Here the $\boldsymbol{\Sigma}_i$ are known symmetric and positive-definite error covariances. The $\mathbf{W}_i = \boldsymbol{\Sigma}_i^{-1}$ are known functions of the parameters within the Newton-Raphson or Fisher-scoring algorithms.

Vector splines minimize the quantity

$$\sum_{i=1}^{n} \{\mathbf{y}_i - \mathbf{f}(x_i)\}^T \boldsymbol{\Sigma}_i^{-1} \{\mathbf{y}_i - \mathbf{f}(x_i)\} + \sum_{j=1}^{M} \lambda_j \int_a^b \{f_j''(t)\}^2\, dt \qquad (2.4)$$

where $a < \min(x_1, \ldots, x_n)$ and $b > \max(x_1, \ldots, x_n)$. The first term measures lack of fit while the second term is a smoothing penalty. Equation 2.4 simplifies to an ordinary cubic smoothing spline when M = 1 (see, e.g., Green and Silverman 1994). Each component function $f_j$ has a non-negative smoothing parameter $\lambda_j$ which operates as with an ordinary cubic spline. Fessler (1991) proposed an $O(nM^3)$ algorithm to fit vector splines based essentially on the Reinsch algorithm (Reinsch 1967); however, Yee (2000) proposed an alternative algorithm based on B-splines (de Boor 1978) that has advantages such as improved numerical stability and ease of prediction of the functions.

Now consider the penalized likelihood

$$\ell - \frac{1}{2} \sum_{k=1}^{p} \sum_{j=1}^{M} \lambda_{(j)k} \int_{a_k}^{b_k} \{f_{(j)k}''(x_k)\}^2\, dx_k. \qquad (2.5)$$

Special cases of this quantity have appeared in a wide range of situations in the statistical literature to introduce cubic spline smoothing into a model's parameter estimation (for a basic overview of this roughness penalty approach see Green and Silverman 1994). For the framework of VGAMs, Yee (1993) has shown that the VGAM local scoring algorithm can be justified via maximizing the penalized likelihood (2.5) with vector spline smoothing.


From a practitioner's viewpoint the smoothness of a smoothing spline can be more conveniently controlled by specifying the degrees of freedom of the smooth rather than $\lambda_j$. Given a value of the degrees of freedom, the software can search for the $\lambda_j$ producing this value. In general, the higher the degrees of freedom, the more wiggly the curve. In the examples below, 1 degree of freedom corresponds to a linear fit.

One null hypothesis of interest with VGAMs is to test whether a component function $f_{(j)k}(x_k)$ is linear in $x_k$. The alternative hypothesis is that it is a nonlinear function. If $H_0$ cannot be rejected then it is more parsimonious to model it as a linear function. The generic function summary() applied to a vgam() object conducts a very approximate score test for $H_0$, along the lines of the details given in Section 7.4.5 of Chambers and Hastie (1993).

With VGAMs approximate inference via a deviance test is available. This is described in Section 6.8 of Hastie and Tibshirani (1990); see also Hastie and Tibshirani (1987). Briefly, for full-likelihood models, the difference in deviance between two models (one nested within the other) is approximately $\chi^2$ with degrees of freedom equal to the difference in degrees of freedom between the models.

2.2 Quasi-Newton IRLS

In the VGAM context we typically prefer Fisher scoring for the numerical stability it provides. Unfortunately, Fisher scoring is not as practical in the extreme value domain because of the general difficulty of obtaining the expected information matrix. For example, Oakes and Manatunga (1992) and Shi (1995) each give the expected information matrix for specific extreme value models, but they typically involve numerical evaluation of integrals and are tedious to compute. The alternative Newton-Raphson approach can also be problematic for extreme value models because the individual working weight matrices may not be positive-definite even at the solution. For some extreme value models we have therefore adopted a quasi-Newton algorithm within the IRLS procedure. To our knowledge, the application of a quasi-Newton method to update the working weight matrices within IRLS is novel. We use a rank-2 BFGS (Broyden 1970) update for each $\mathbf{W}_i$ matrix. The basic details are as follows.

The idea is to use a rank-2 update of the Hessian matrix at each iteration step, thus building it up so that it becomes closer to the real value at the solution as the iterations converge to the solution. Importantly, only first order derivative information is used. At the first iteration we use a steepest descent step $\mathbf{W}_i^{(1)} = w_i \mathbf{I}_M$. In vgam we apply the BFGS update to each weight matrix $\mathbf{W}_i$. We have, for each $i = 1, \ldots, n$ and at iteration $a$,

$$\mathbf{W}_i^{(a+1)} = \mathbf{W}_i^{(a)} + \frac{\mathbf{q}_i^{(a)} \mathbf{q}_i^{(a)T}}{\mathbf{s}_i^{(a)T} \mathbf{q}_i^{(a)}} - \frac{\mathbf{W}_i^{(a)} \mathbf{s}_i^{(a)} \mathbf{s}_i^{(a)T} \mathbf{W}_i^{(a)}}{\mathbf{s}_i^{(a)T} \mathbf{W}_i^{(a)} \mathbf{s}_i^{(a)}} \qquad (2.6)$$

where

$$\mathbf{q}_i^{(a)} = \mathbf{d}_i^{(a)} - \mathbf{d}_i^{(a+1)}, \qquad \mathbf{s}_i^{(a)} = \boldsymbol{\eta}_i^{(a+1)} - \boldsymbol{\eta}_i^{(a)}. \qquad (2.7)$$

For this method, symmetry and positive-definiteness are assured provided $\mathbf{s}_i^{(a)T} \mathbf{q}_i^{(a)} > 0$. The update of $\mathbf{W}_i$ is skipped if $\mathbf{s}_i^{(a)T} \mathbf{q}_i^{(a)} \leq 0$ to ensure that the weight matrices remain positive-definite. In practice, we have found that this method has given good results.
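As a concrete illustration, Eq. 2.6 together with the skipping rule can be coded directly; the following is a sketch of the update formula only, not the vgam internals.

# Rank-2 BFGS update of one working weight matrix W_i (Eq. 2.6).
# d.old, d.new are d_i^(a), d_i^(a+1); eta.old, eta.new are
# eta_i^(a), eta_i^(a+1), so q and s are as in Eq. 2.7.
bfgs.update <- function(W, d.old, d.new, eta.old, eta.new) {
  q <- d.old - d.new
  s <- eta.new - eta.old
  sq <- sum(s * q)                  # s^T q
  if (sq <= 0) return(W)            # skip the update: keep W positive-definite
  Ws <- W %*% s
  W + tcrossprod(q) / sq - tcrossprod(Ws) / sum(s * Ws)
}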

3 Extreme Value Models

In this section we give a brief outline of the extreme value models and methodology included in the vgam package. Table 1 lists some of the central vgam family and plotting functions at the heart of this paper. A vgam family function specifies the type of model to be used, and is assigned to the family argument of vglm() and vgam(). The vgam package contains many other family functions related to extreme value analyses, such as the Weibull, bivariate logistic, Fréchet, and log-gamma distributions. We will focus here on models based on the generalized extreme value (GEV) and generalized Pareto (GP) distributions. For a more detailed review of these models see, e.g., Coles (2001) or Beirlant et al. (2004).

Let $Z_1, \ldots, Z_m$ be a random sample from a distribution F, and let $Y_m = \max\{Z_1, \ldots, Z_m\}$. Extreme value theory specifies that as $m \to \infty$ the class of non-degenerate limiting distributions for $Y_m$, under linear normalization, is the GEV distribution

$$G(y) = \exp\!\left[-\{1 + \xi (y - \mu)/\psi\}_{+}^{-1/\xi}\right], \qquad (3.1)$$

where μ, ψ > 0 and ξ are the location, scale and shape parameters respectively, and $h_+ = \max(h, 0)$. Within the VGLM/VGAM framework we implement non-stationary models where the ith maximum is assumed to be distributed as GEV with parameters $(\mu_i, \psi_i, \xi_i)$ for $i = 1, \ldots, n$.

The above result also leads to an approximation to the conditional exceedance probability $\Pr(Y > y \mid Y > u)$ for all y greater than some predetermined large threshold u. Specifically, the generalized Pareto (GP) distribution (Pickands 1975) can be obtained as an approximation of the conditional distribution of exceedances. Given n data points $y_1, \ldots, y_n$, a non-stationary extension of this model leads to the likelihood function

$$\prod_{i=1}^{n} \left\{ \frac{1}{\sigma_i} \left[ 1 + \frac{\xi_i (y_i - u_i)}{\sigma_i} \right]^{-1/\xi_i - 1} \right\}^{\delta_i} (1 - P_i)^{1 - \delta_i} P_i^{\delta_i} \qquad (3.2)$$

where $\delta_i$ is an indicator for the exceedance of the ith data point over the threshold $u_i$ and $P_i = \Pr(Y_i > u_i)$. The likelihood can be separated into two parts. The first term approximates the conditional exceedance probability for the ith response as GP with parameters $(\sigma_i, \xi_i)$ for $i = 1, \ldots, n$. This term can be easily modelled under the VGLM/VGAM framework. The second term, which only involves the probability $P_i$, is the likelihood for a binomial GLM/GAM where $\delta_i$ is the response. A binomial GLM/GAM can therefore be used to model $P_i$ separately. If n is large an alternative approach is to divide the observation period into subintervals, average the covariates within each interval, and use a Poisson GLM/GAM where the number of exceedances within each interval is the response. Offsets can be used to account for subintervals containing different numbers of observations.

Table 1 Some central R functions for extreme value data currently available in the vgam package

  gev       GEV distribution (matrix responses allowed)
  gpd       Generalized Pareto distribution
  gumbel    Gumbel (matrix responses allowed)
  egumbel   Gumbel (univariate response)
  egev      GEV (univariate response)

  guplot    Gumbel plot
  meplot    Mean excess (mean residual life) plot
  qtplot    Quantile plot
  rlplot    Return level plot

The first group are vgam family (modelling) functions and the second are plotting functions.
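Because the second factor of likelihood (3.2) is an ordinary binomial likelihood, the $P_i$ can be modelled with standard tools; a minimal sketch (the data frame dat and the columns delta and x are hypothetical) is:

# Separate binomial GLM for the exceedance probabilities P_i;
# 'delta' is the 0/1 exceedance indicator and 'x' a covariate.
pfit <- glm(delta ~ x, family = binomial, data = dat)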

The models outlined above assume the observations are independent. For GEV models this assumption is reasonable if the observed maxima are taken over a relatively large observational period. For GP models this assumption may not be true because extreme events may form clusters, and observed series of exceedances will then show varying levels of short-range dependence. In practice this problem can be solved by identifying independent clusters and working with the cluster maxima. See, e.g., Leadbetter et al. (1983) for the theoretical basis of this approach and Ferro and Segers (2003) for a discussion of cluster identification.

4 Examples

To illustrate the flavour of vgam modelling the following analyses were performed (R 2.4.1 was used). R is one of two implementations of the S language (the other being S-PLUS), and the modelling ideas of Chambers and Hastie (1993) are prominent in this section. The core of the code used here is given in Appendix 2.

4.1 Fremantle Data Example

Coles (2001) describes a data set consisting of the annual maximum sea-levels recorded at Fremantle, near Perth in Australia. There are 86 annual measurements from 1897 to 1989 (some years are missing), and there are two variables, Year and SOI, which give the year and the annual mean values of the Southern Oscillation Index, a proxy for meteorological volatility.

An initial GEV VGAM was fitted using both variables and giving a nominal three degrees of freedom to each smooth. The shape parameter ξ was fitted as an intercept-only term for three reasons. Firstly, data provides little information on this parameter in general. Secondly, estimation is numerically fraught when this parameter is allowed to be too flexible. Thirdly, the data set is small here. (The first two reasons are why the software default is an intercept-only ξ.) We chose additive predictors

$$\mu(\mathbf{x}) = \eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + f_{(1)3}(x_3),$$
$$\log \psi(\mathbf{x}) = \eta_2 = \beta_{(2)1} + f_{(2)2}(x_2) + f_{(2)3}(x_3),$$
$$\xi = \eta_3 = \beta_{(3)1}, \qquad (4.1)$$


where $x_2$ = Year and $x_3$ = SOI. Note that $\xi > -\tfrac{1}{2}$ corresponds to the regular case where the classical asymptotic theory of maximum likelihood estimators is applicable (Smith 1985) and the working weight matrices are positive-definite. Figure 1 shows the fitted functions. It appears that all but one of the functions, $\hat{f}_{(2)2}(x_2)$, are linear, and an approximate score test for linearity supported this view. These results agree with Coles (2001), who concluded that μ is linear in Year; however, our method of seeing this feature directly from a plot that allows nonlinearity is more convenient and reassuring.

The fitted functions of Fig. 1 are plotted with pointwise ±2 standard error bands about them. It is common in these types of plots for the standard error bands to be widest at the boundaries because there are less data there. The estimate of ξ was −0.27 with standard error 0.06, which indicates a Weibull-type GEV. The function $\hat{f}_{(2)2}(x_2)$ is "L" shaped due to the smaller scale around the year 1945. These results suggest the model behind Fig. 2, where the three functions are linear and the smooth term $f_{(2)2}(x_2)$ is modelled using a piecewise-linear regression spline, i.e., a broken-stick model. An informal deviance test between the two models gives a p-value of $0.43 \approx P[\chi^2_{250-242.91} > 2(63.33 - 59.80)]$, indicating that this simplified model suffices.

In the analysis by Coles (2001) of these data, he also concluded that the effect of SOI is influential on annual maximum sea-levels at Fremantle, even after the allowance for time-variation, i.e., both Year and SOI need to be in the model for the μ parameter. Our analysis tends to agree with this conclusion. Furthermore, our second model agrees with an estimated time trend in the location parameter of around 2 mm per year.

Fig. 1 Initial VGAM fitted to the Fremantle data (called fit1 in Appendix 2). Each function has three degrees of freedom and the dashed lines are ±2 SE bands. From top left going clockwise, the fitted functions are $\hat{f}_{(1)2}(x_2)$, $\hat{f}_{(2)2}(x_2)$, $\hat{f}_{(2)3}(x_3)$, $\hat{f}_{(1)3}(x_3)$ in Eq. 4.1. (Figure panels not reproduced; the panels plot the fitted smooths against Year and SOI.)


Fig. 2 A more parsimonious VGAM fitted to the Fremantle data (called fit2 in Appendix 2). It is a parametric version of Fig. 1. The pointwise ±2 standard error bands intersect at the sample mean of each covariate that has a linear function fitted to it. (Figure panels not reproduced; the panels plot the partial fits against Year and SOI.)

The SOI can be interpreted as a measure of meteorological volatility due to ocean-atmosphere phenomena such as the El Niño effect, and the analysis shows that, accounting for the time trend, such effects explain some of the observed variation of extreme sea-levels occurring on the Western Australian coastline (Coles 2001).
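As a numerical check (not in the original text), the informal deviance test quoted above can be reproduced from the reported log-likelihoods and residual degrees of freedom:

# p-value of the deviance test between fit1 and fit2, using the
# quantities reported in the text; gives approximately 0.43.
1 - pchisq(2 * (63.33 - 59.80), df = 250 - 242.91)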

4.2 Rainfall Data Example

In this example we illustrate the GP model for exceedances using a large dataset of daily rainfall records, recorded throughout the period 1935–1970 at a site in Southwest England. There are some days with missing values, and these were omitted from the analysis. We incorporated spatial information through the records gathered at four neighbouring sites, labelled A–D. Sites A, B, C and D respectively lie to the North-East, East, South and West of the central site of interest. The observation times are also included in the model, giving five covariates in total. For pairwise bivariate Bayesian analyses of yearly site maxima, see Smith and Walshaw (2003).

One of the difficulties of non-stationary modelling under the GP model is the issue of threshold choice. For stationary models the threshold is typically chosen to be constant across time, with the constant threshold specified on the basis of standard graphical diagnostics (Beirlant et al. 2004). With non-stationarity it may not be reasonable to choose a constant threshold, given the expectation that there may be more exceedances in some periods than others; yet there are no published procedures (known to the authors) that assist in the determination of a non-constant threshold. For our analysis we used a fairly ad hoc two-stage procedure. At the first stage, we ignored the non-stationarity and chose a constant threshold using standard methods. Approximately 4.6% of the observations at our site of interest exceeded this threshold. At the second stage we chose a non-constant threshold across time using quantile regression, implementing a cubic regression spline with time as the covariate, and estimating the quantile corresponding to an upper tail probability of 4.6%. There were 436 exceedances of the resulting non-constant threshold.
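A sketch of this second stage is given below, assuming the quantreg package (the paper does not name the implementation used) and the column names of the ysrain data frame from Appendix 2; the spline degrees of freedom are an illustrative choice.

library(quantreg)   # assumption: the quantile regression software is not named in the paper
library(splines)
# Cubic regression spline quantile fit in time, at the quantile
# matching an upper tail probability of 4.6%.
thfit <- rq(CenSite ~ bs(Year, df = 4), tau = 1 - 0.046, data = ysrain)
thresh <- predict(thfit)   # plays the role of ysrain$Thresh in Appendix 2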

An additional modelling difficulty is the dependence that may be exhibited by the data series. Environmental processes typically exhibit little or no long-range dependence, but often exhibit short-range dependence in the form of extreme clusters. For our data, the intervals estimator (Ferro and Segers 2003) of the extremal index (Leadbetter et al. 1983) was approximately 0.83, giving an estimated limiting mean cluster size of about 1.2, implying that there is a slight tendency to cluster at high levels of daily rainfall. We therefore decided to de-cluster the data by implementing the runs method (Coles 2001). Using the non-constant threshold derived earlier and a run length of 3 days, the method identified 326 extreme clusters from our 436 exceedances, and the 326 cluster maxima were used for our GP model.

The results shown here are for the model

$$\log \sigma(\mathbf{x}) = \eta_1 = \beta_{(1)} + f_A(x_A) + f_B(x_B) + f_C(x_C) + f_D(x_D) + \beta_T x_T,$$
$$\xi(\mathbf{x}) = \eta_2 = \beta_{(2)},$$

where $x_T$ is the time covariate and, e.g., $x_A$ is the rainfall covariate for site A. In order to illustrate the different features of the vgam package we model the functions as cubic regression splines, rather than the smoothing splines used in the previous example. Each spline is implemented using a standard B-spline basis, and hence this model lies within the VGLM framework. Given the large number of cluster maxima we also considered a time trend for the shape parameter, but this was subsequently found to be statistically insignificant.

The fitted regression splines are given in Fig. 3. They generally depict a positive linear relationship, though there is some suggestion of a decreasing scale with increasing rainfall at site B over the [0, 200] mm interval. The estimated values for the log-scale and shape intercepts were about 4.4 and −0.4 respectively, and there was a small but statistically significant decreasing trend on the log-scale, of about 0.1 every 8 years.

When the site covariates are omitted from the analysis the estimated log-scale is larger than the intercept of the VGLM model, and consequently the estimated shape is also larger, taking a value close to zero. When we regress on the site values, the information reduces the extremal variation and hence the tail of the rainfall distribution at the central site becomes lighter under the conditioning. For prediction purposes, this example is merely illustrative; clearly the rainfall at the neighbouring sites cannot be known in advance. However, it does show the potential for incorporating other geographical and meteorological processes into models of the form presented here. If such processes can be predicted with accuracy, the flexible regression tools provided by vgam offer the potential for considerable improvements in the estimation of future return levels.

4.3 Simulation Example

In this example we illustrate the GP model for exceedances using simulated data.


Fig. 3 A VGAM fitted to the rainfall data (called rainfit in Appendix 2). (Figure panels not reproduced; the panels plot the fitted B-spline terms against SiteA, SiteB, SiteC, SiteD and Year.)

We simulate 500 exceedances of the threshold u = 0 from a GP model with a covariate $x_2$ generated as an equally spaced sequence between 0 and 1, and $x_3 \sim \mathrm{Uniform}(0, 1)$ independently, where $\log \sigma(\mathbf{x}) = \tfrac{1}{2}\sin(2\pi x_2) + 0.1\, x_3$ and $\xi = -0.1$ define the true model. This involves a constant shape parameter and a scale parameter with a periodic component. Our experience with the GP shape parameter has mirrored that of others who have found it is often most parsimonious to fit this parameter as an intercept-only term. For this reason, this is the software default. A reasonable first fitted model can be written

$$\log \sigma(\mathbf{x}) = \eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + f_{(1)3}(x_3), \qquad (4.2)$$
$$\log\left(\xi + \tfrac{1}{2}\right) = \eta_2 = \beta_{(2)1} \qquad (4.3)$$

for cubic smoothing spline functions $f_{(1)2}$ and $f_{(1)3}$. Note that if $\xi \approx 0$ then $\eta_2 \approx -\log 2 + 2\xi$ in Eq. 4.3.
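The stated approximation follows from a first-order expansion (a brief check, not in the original text):

$$\log\left(\xi + \tfrac{1}{2}\right) = \log\!\left\{\tfrac{1}{2}(1 + 2\xi)\right\} = -\log 2 + \log(1 + 2\xi) \approx -\log 2 + 2\xi \quad \text{for small } \xi.$$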


Fig. 4 VGAM model fitted to some simulated GPD data with n = 500. The functions are estimates of $f_{(1)2}(x_2)$ and $f_{(1)3}(x_3)$ in Eq. 4.2 and each have four degrees of freedom. (Figure panels not reproduced; the panels plot s(x2) and s(x3).)

A typical fit of Eqs. 4.2 and 4.3 is illustrated in Fig. 4. It is immediately apparent that the VGAM picks up both the sine curve and the linearity straightaway. The ±2 SE bands in the latter plot are wide, and based on this result, the user would typically be parsimonious and constrain $f_{(1)3}$ to be linear in $x_3$ (or drop $x_3$ altogether). Of interest then are the estimates of: (a) the shape parameter $\xi = e^{\eta_2} - \tfrac{1}{2}$, and (b) $\beta_{(1)3}$ where

$$\log \sigma(\mathbf{x}) = \eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + \beta_{(1)3}\, x_3. \qquad (4.4)$$

In order to illustrate the small sample properties of our estimators in this particular case, we sample exceedances from the true model described above for sample sizes n = 100, 500, and 1,000. For each sample size we simulate 500 realizations and fit the model (4.3) and (4.4). Then over these simulations, the standard errors and bias can be computed for $\hat{\xi}$ and $\hat{\beta}_{(1)3}$. The results are summarized in Table 2. Histograms of the estimates (not given) show that the distributions of $\hat{\xi}$ and $\hat{\beta}_{(1)3}$ are skewed away from zero for n = 100, a little skewed for n = 500, and very normal-like for n = 1,000. The bias and standard errors can be estimated by computing the sample means and standard deviations of the estimates. The results of Table 2 tend to agree with what is to be expected: as n increases both the bias and standard errors decrease. The mean square error (MSE) can be calculated for $\hat{\xi}$ as 0.0367, 0.00237, 0.00105 (for n = 100, 500, 1,000 respectively), and for the slope $\hat{\beta}_{(1)3}$ the values are 0.116, 0.0216, 0.0104. Thus the estimates appear to be consistent as $n \to \infty$.
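For example (a check not in the original text), the n = 100 column of Table 2 gives

$$\mathrm{MSE}(\hat{\xi}) \approx \mathrm{Bias}^2(\hat{\xi}) + \mathrm{SD}^2(\hat{\xi}) = (-0.136)^2 + (0.135)^2 \approx 0.0367,$$

in agreement with the first MSE value quoted above.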

Table 2 Simulation results of the bias and standard deviation of the estimates from a sample size of n GPD observations

                   n = 100    n = 500    n = 1000
  Bias(ξ̂)          −0.136     −0.0231     0.0310
  Bias(β̂(1)3)      −0.0117    −0.0136     0.0143
  SD(ξ̂)             0.135      0.0429     0.0310
  SD(β̂(1)3)         0.341      0.146      0.101

The results are from 500 simulations each.


5 Discussion

One of the major breakthroughs of the 1970s in statistics was the proposal of the class of GLMs (Nelder and Wedderburn 1972). It unified several important and different regression models at that time under a common framework, and has had a major impact on statistics. Nowadays, courses on GLMs are taught by many statistics departments, and most major statistical software packages now offer routines for fitting and analysing GLMs. In the 1980s GAMs added smoothing capabilities to GLMs. This paper has described the class of VGLMs and VGAMs, which extend GLMs and GAMs to handle multiple predictors, multivariate responses and many statistical models and distributions. We have attempted to convey the many advantages that VGLMs/VGAMs can offer for extreme value analyses; for example, they provide a broad and attractive family for representing non-stationarity in extreme value datasets.

Although we have illustrated only GEV and GP modelling, the versatility of VGLMs and VGAMs for extreme value analysis is far broader. The q-dimensional multivariate extreme value distribution, obtained as the class of limiting distributions of linearly normalized componentwise maxima, can also be modelled within the VGLM/VGAM framework. The censored likelihood approach for modelling multivariate threshold exceedances (Ledford and Tawn 1996) can similarly be employed. There are no additional difficulties in implementing these larger classes of extreme value models under VGLMs/VGAMs, other than those which exist under any modelling framework, such as model choice and increasing computational burden. The implementation of these models in the vgam package is an area of further work.

The virtues of the VGLM/VGAM framework are extensive, but VGLMs/VGAMs are not without some weaknesses. For such a wide range of models, reliability and numerical stability are important issues. Complex parameter constraints and non-regular likelihood conditions can be problematic. Expected information matrices are often difficult to derive or too expensive to evaluate, and this often necessitates the use of different algorithms for different types of models, such as those described in Section 2.2. Furthermore, issues of model choice and diagnostic tools require more specific methodologies, and different implementations are required to accommodate them. The addition of modelling diagnostics and improved handling of parameter constraints are areas of further work. Despite these issues, we feel that the VGAM framework offers many compelling advantages to theoreticians and practitioners of extreme value analysis.

Acknowledgements The first author wishes to thank Stuart Coles for encouragement to start this work during a period of sabbatical leave at the Department of Mathematics, University of Bristol. This work was completed while the first author was on another sabbatical at the Department of Statistics & Applied Probability at the National University of Singapore. He thanks both departments for their hospitality. We thank the three reviewers and Editor for their many detailed comments which improved the paper markedly.

Appendix 1: Implementation Details

This section briefly describes some implementation details used by the vgam package, especially those relating to numerical analysis. Online documentation is provided with the software, and additional documentation is provided at the first author's website. The package is under continual development, therefore improvements and changes to the details and usage given here are possible.

As all extreme value models in the vgam package have a log-likelihood function, half-stepsizing is useful because it ensures that an improvement is made at each iteration. This extension takes a half step if the next iteration 'overshoots' the solution because the quadratic approximation to the log-likelihood function is not accurate. If necessary an even smaller step is taken (e.g., 1/4 or 1/8, etc.), and an improvement is guaranteed since the step direction is ascending. Half-stepsizing is particularly useful at early stages of the iteration if the initial values are poor. Furthermore, a very large negative value of the log-likelihood can be returned to prevent an iteration from falling outside the parameter space, e.g., $1 + \xi_i(y_i - \mu_i)/\psi_i > 0$ is needed for the GEV distribution.
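A minimal sketch of the half-stepsizing rule (hypothetical arguments, not the vgam internals):

# Halve the proposed step until the log-likelihood does not decrease.
# 'loglik' is a function of the coefficients, 'beta' the current value,
# and 'delta' the proposed IRLS step (all hypothetical).
halfstep <- function(loglik, beta, delta, max.halvings = 10) {
  step <- 1
  for (h in seq_len(max.halvings)) {
    if (loglik(beta + step * delta) >= loglik(beta)) break
    step <- step / 2
  }
  beta + step * delta
}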

Initial values are often obtained using moment estimation. For example, for the Gumbel distribution the mean and variance are given by $\mu + \sigma\gamma$ and $\pi^2\sigma^2/6$ respectively, where $\gamma \approx 0.58$ is Euler's constant, and hence moment estimation leads to the starting values $\hat{\sigma}_i^{(0)} = \sqrt{6}\, s/\pi$ and $\hat{\mu}_i^{(0)} = m - \hat{\sigma}_i^{(0)}\gamma$, where m and s are the sample mean and standard deviation. All the vgam family functions listed in Table 1 evaluate initial values internally, i.e., they are self-starting. This compares with the functions fext() and forder() in the R package evd (Stephenson 2002), which are not. Self-starting family functions are useful for novices who may not know how to obtain a reasonable initial value. The family functions of Table 1 also allow the user to input initial values if desired. The user can therefore employ other forms of moment estimation such as probability weighted moment estimation (Hosking et al. 1985) or L-moment estimation (Hosking and Wallis 1997). For further details on moment estimation for threshold models, see Hosking and Wallis (1987).
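As an illustration, the Gumbel starting values described above can be computed as follows (a sketch of the moment-based rule, not the internal vgam code):

# Moment-based starting values for a Gumbel fit:
# sigma0 = sqrt(6) * s / pi and mu0 = m - sigma0 * gamma.
gumbel.init <- function(y) {
  m <- mean(y); s <- sd(y); gam <- 0.5772157   # Euler's constant
  sigma0 <- sqrt(6) * s / pi
  c(location = m - sigma0 * gam, scale = sigma0)
}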

One technicality of extreme value models is the presence of singularities in the derivatives that need to be programmed as special cases. For example, for the GP distribution,

$$\frac{\partial \ell_i}{\partial \xi_i} = \frac{1}{\xi_i^2} \log\left(1 + \frac{\xi_i y_i}{\sigma_i}\right) - \frac{(1 + \xi_i^{-1})}{1 + \xi_i y_i/\sigma_i}\,\frac{y_i}{\sigma_i},$$

and there is a singularity at $\xi_i = 0$. Upon series expansion, one obtains the limit

$$\lim_{\xi_i \to 0} \frac{\partial \ell_i}{\partial \xi_i} = \frac{y_i}{\sigma_i}\left[\frac{y_i}{2\sigma_i} - 1\right].$$
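To see this (a brief derivation, not in the original text), expand both terms about $\xi_i = 0$:

$$\frac{1}{\xi_i^2}\log\left(1+\frac{\xi_i y_i}{\sigma_i}\right) = \frac{y_i}{\sigma_i \xi_i} - \frac{y_i^2}{2\sigma_i^2} + O(\xi_i), \qquad \frac{(1+\xi_i^{-1})}{1+\xi_i y_i/\sigma_i}\,\frac{y_i}{\sigma_i} = \frac{y_i}{\sigma_i \xi_i} + \frac{y_i}{\sigma_i} - \frac{y_i^2}{\sigma_i^2} + O(\xi_i);$$

taking the difference gives the limit $y_i^2/(2\sigma_i^2) - y_i/\sigma_i$ as $\xi_i \to 0$.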

Another example is the expected second derivatives of the GEV log-likelihood. For example, Prescott and Walden (1980) derived

$$E\!\left[\frac{\partial^2 \ell_i}{\partial \mu_i\, \partial \psi_i}\right] = \frac{1}{\psi_i^2 \xi_i}\left\{(1+\xi_i)^2\,\Gamma(1+2\xi_i) - \Gamma(2+\xi_i)\right\},$$

and using L'Hospital's rule, it can be shown that

$$\lim_{\xi_i \to 0} E\!\left[\frac{\partial^2 \ell_i}{\partial \mu_i\, \partial \psi_i}\right] = \frac{2(1-\gamma) - \psi^*(2)}{\psi_i^2} \approx \frac{0.42278}{\psi_i^2},$$

where $\psi^*(\cdot)$ is the digamma function.

Parameter link functions are used to increase the chances of successful convergence by avoiding numerical problems, e.g., by using a log link for positive quantities. Additionally this makes a quadratic approximation to the log-likelihood function more realistic, and consequently, standard errors are more accurate. Users can very easily choose from a range of parameter link functions. Note that if $\eta_j = g_j(\theta_j)$ then $\mathrm{se}(\hat{\eta}_j) \approx |g_j'(\hat{\theta}_j)| \cdot \mathrm{se}(\hat{\theta}_j)$. Suitable parameter link functions also tend to speed up the convergence rate.
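For example, with a log link $\eta = g(\sigma) = \log \sigma$ this gives

$$\mathrm{se}(\widehat{\log \sigma}) \approx \frac{\mathrm{se}(\hat{\sigma})}{\hat{\sigma}},$$

so standard errors reported on the $\eta$ scale are relative standard errors of the original parameter.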

The zero argument in a vgam family function is used to model parameters as intercept-only. This is useful for parameters that require large amounts of data for them to be estimated accurately, since for small datasets the practitioner should model these parameters as simply as possible. This often occurs in extreme value models, since small datasets contain little information about the shape parameter ξ.

A particularly noteworthy innovation incorporated into vgam is smart prediction (written by T. Yee and T. Hastie), whereby the parameters of parameter-dependent functions are saved on the object. This means that functions such as scale(), poly(), bs() and ns() will always predict correctly. In contrast, in ordinary R, terms such as I(poly()) and bs(scale()) will not predict properly, while in S-PLUS even terms such as poly() and scale() will not predict properly.

Now a note about the software defaults. At present, the vgam family functions for the GEV and GP distributions have

$$\log\left(\xi + \tfrac{1}{2}\right) = \eta_3 = \beta_{(3)1} \qquad (5.1)$$

as the default for the shape parameter ξ, i.e., a log link with an offset of $\tfrac{1}{2}$. There are a number of reasons for this. Firstly, standard ML inference applies when $\xi > -\tfrac{1}{2}$, so that generic functions such as summary() and vcov() applied to the fit should give valid results provided the estimate satisfies this constraint. Secondly, if the estimate actually conflicts with this constraint then this will give numerical problems during estimation, which should alert the user to think more deeply about the ramifications of allowing $\xi \leq -\tfrac{1}{2}$. Thirdly, the default is easily overridden, e.g., by simply setting lshape="identity"; however, output from summary() and vcov() could then be very misleading. Another software default for the GEV and GP distributions is zero=3 and zero=2 respectively. This argument causes the shape parameter to be modelled as intercept-only (e.g., as in Eq. 5.1). This choice minimizes the chances of numerical problems arising from allowing the shape parameter to be a linear/additive model in the covariates; and as before, the default can very easily be overridden (e.g., setting zero=NULL).
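Both defaults can be overridden at once, as in the following sketch (y, x and dat are placeholders):

# Identity link for the shape and no intercept-only constraint,
# overriding both defaults described above ('dat' is hypothetical).
fit <- vglm(y ~ x, family = gev(lshape = "identity", zero = NULL), data = dat)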

For some distributions vgam uses functions such as deriv3() to obtain derivatives, meaning that laborious derivatives can be handled. However, only observed second order derivatives can be evaluated with this method. Advances in symbolic computing such as mathStatica (Rose and Smith 2002) may result in improvements to the present code.


In contrast to evd, the unifying theoretical framework advocated here naturally makes use of object-oriented software. vgam is written in Version 4 of the S language (Chambers 1998) and uses the modelling syntax of Chambers and Hastie (1993), including formulas and data frames, as well as object-oriented programming. This means other users can add functionality by writing their own methods functions. The unified theory of Section 2 also means that inference and model checking are more streamlined, e.g., resid() can return three common forms of residuals for VGAMs, called working, Pearson and response residuals. These can be plotted to help detect outliers and lack of fit.

Appendix 2: Code for Extreme Value Analysis

A copy of this code is available from http://www.stat.auckland.ac.nz/~yee/updates, as well as the rainfall data.

Example 1: Fremantle Data

library(ismev); data(fremantle); fremantle = fremantle; detach()
library(VGAM)

fit1 = vgam(SeaLevel ~ s(Year, df=3) + s(SOI, df=3),
            family=gev(lshape="identity", zero=3),
            data=fremantle, maxit=50)
par(mfrow=c(2,2), las=1)
plot(fit1, se=TRUE, lcol="blue", scol="darkgreen")

cm = list("(Intercept)"=diag(3), "Year"=rbind(1,0,0),
          "bs(Year, degree = 1, knot = 1945)"=rbind(0,1,0),
          "SOI"=diag(3)[,1:2])
fit2 = vglm(SeaLevel ~ Year + bs(Year, degree=1, knot=1945) + SOI,
            family=gev(lshape="identity", zero=3),
            data=fremantle, constraints=cm)
par(mfrow=c(2,2), las=1)
plotvgam(fit2, se=TRUE, lcol="blue", scol="darkgreen")
coef(fit2, matrix=TRUE)
1 - pchisq(2*(logLik(fit1)-logLik(fit2)),
           df=df.residual(fit2)-df.residual(fit1))  # Deviance test

Example 2: Rainfall Data

threshold = ysrain$Thresh[ysrain$CluMax]
clmax = ysrain$CluMax
rainfit = vglm(CenSite ~ bs(SiteA) + bs(SiteB) + bs(SiteC) +
                         bs(SiteD) + Year,
               family = gpd(threshold, lshape="identity", zero=2),
               data = ysrain, subset = clmax, maxit = 50)
par(mfrow=c(3,2), las=1)
plotvgam(rainfit, se=TRUE, lcol="blue", scol="darkgreen",
         ylim=c(-0.5,0.8))

Example 3: Simulation Example

n = 500
threshold = 0
x2 = seq(0, 1, len=n)
x3 = runif(n)
y = rgpd(n, loc=threshold, scale=exp(0.5 * sin(2*pi*x2) + 0.1*x3),
         shape = -0.1)
fit = vgam(y ~ s(x2) + s(x3),
           family=gpd(threshold, lshape="logoff", zero=2))
par(mfrow=c(1,2), las=1)
plot(fit, se=TRUE, lcol="blue", scol="darkgreen", rug=FALSE)

References

Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J.: Statistics of Extremes. Wiley, Chichester, England (2004)
Broyden, C.G.: The convergence of a class of double-rank minimization algorithms. J. Inst. Math. Appl. 6, 76–90 (1970)
Buja, A., Hastie, T., Tibshirani, R.: Linear smoothers and additive models (with discussion). Ann. Stat. 17, 453–510 (1989)
Chambers, J.M.: Programming with Data: A Guide to the S Language. Springer, New York (1998)
Chambers, J.M., Hastie, T.J. (eds.): Statistical Models in S. Chapman & Hall, New York (1993)
Chavez-Demoulin, V., Davison, A.C.: Generalized additive modelling of sample extremes. Appl. Stat. 54, 207–222 (2005)
Coles, S.: An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London (2001)
de Boor, C.: A Practical Guide to Splines. Springer, Berlin (1978)
Efron, B., Hinkley, D.V.: Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65, 457–481 (1978)
Ferro, C.A.T., Segers, J.: Inference for clusters of extreme values. J. R. Stat. Soc., B 65, 545–556 (2003)
Fessler, J.A.: Nonparametric fixed-interval smoothing with vector splines. IEEE Trans. Acoust. Speech Signal Process. 39, 852–859 (1991)
Green, P.J., Silverman, B.W.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London (1994)
Hastie, T., Tibshirani, R.: Generalized additive models: some applications. J. Am. Stat. Assoc. 82, 371–386 (1987)
Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, London (1990)
Hosking, J.R.M., Wallis, J.R.: Parameter and quantile estimation for the generalized Pareto distribution. Technometrics 29, 339–349 (1987)
Hosking, J.R.M., Wallis, J.R.: Regional Frequency Analysis. Cambridge University Press, Cambridge, UK (1997)
Hosking, J.R.M., Wallis, J.R., Wood, E.F.: Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics 27, 251–261 (1985)
Ihaka, R., Gentleman, R.: R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996)
Leadbetter, M.R., Lindgren, G., Rootzén, H.: Extremes and Related Properties of Random Sequences and Series. Springer-Verlag, New York (1983)
Ledford, A.W., Tawn, J.A.: Statistics for near independence in multivariate extreme values. Biometrika 83, 169–187 (1996)
Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc., A 135, 370–384 (1972)
Oakes, D., Manatunga, A.K.: Fisher information for a bivariate extreme value distribution. Biometrika 79, 827–832 (1992)
Pauli, F., Coles, S.: Penalized likelihood inference in extreme value analyses. J. Appl. Stat. 28, 547–560 (2001)
Pickands, J.: Statistical inference using extreme order statistics. Ann. Stat. 3, 119–131 (1975)
Prescott, P., Walden, A.T.: Maximum likelihood estimation of the parameters of the generalized extreme-value distribution. Biometrika 67, 723–724 (1980)
Reinsch, C.: Smoothing by spline functions. Numer. Math. 10, 177–183 (1967)
Rose, C., Smith, M.D.: Mathematical Statistics with Mathematica. Springer, New York (2002)
Rosen, O., Cohen, A.: Extreme percentile regression. In: Härdle, W., Schimek, M.G. (eds.) Statistical Theory and Computational Aspects of Smoothing: Proceedings of the COMPSTAT '94 Satellite Meeting held in Semmering, Austria, August 1994, pp. 27–28. Physica-Verlag, Heidelberg (1996)
Schimek, M.G. (ed.): Smoothing and Regression: Approaches, Computation, and Application. Wiley, New York (2000)
Shi, D.: Fisher information for a multivariate extreme value distribution. Biometrika 82, 644–649 (1995)
Smith, E.L., Walshaw, D.: Modelling bivariate extremes in a region. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., West, M. (eds.) Bayesian Statistics 7. Oxford University Press, Oxford, UK (2003)
Smith, R.L.: Maximum likelihood estimation in a class of nonregular cases. Biometrika 72, 67–90 (1985)
Smith, R.L.: Extreme value theory based on the r largest annual events. J. Hydrol. 86, 27–43 (1986)
Stephenson, A.G.: evd: Extreme value distributions. R News 2(2), 31–32 (2002)
Stephenson, A.G., Gilleland, E.: Software for the analysis of extreme events: the current state and future directions. Extremes 8, 87–109 (2006)
Tawn, J.A.: Bivariate extreme value theory: models and estimation. Biometrika 75, 397–415 (1988)
Weissman, I.: Estimation of parameters and large quantiles based on the k largest observations. J. Am. Stat. Assoc. 73, 812–815 (1978)
Yee, T.W.: The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. thesis, Department of Mathematics & Statistics, University of Auckland, New Zealand (1993)
Yee, T.W.: Vector splines and other vector smoothers. In: Bethlehem, J.G., van der Heijden, P.G.M. (eds.) Proceedings in Computational Statistics COMPSTAT 2000. Physica-Verlag, Heidelberg (2000)
Yee, T.W.: A User's Guide to the vgam Package. http://www.stat.auckland.ac.nz/~yee/VGAM (2007)
Yee, T.W., Hastie, T.J.: Reduced-rank vector generalized linear models. Stat. Modelling 3, 15–41 (2003)
Yee, T.W., Wild, C.J.: Vector generalized additive models. J. R. Stat. Soc., B 58, 481–493 (1996)