Likelihood Asymptotics and Location-Scale-Shape Analysis
Nathan Asher Taback
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistics, University of Toronto
© Copyright by Nathan Taback 1998
National Library of Canada, Acquisitions and Bibliographic Services, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Likelihood Asymptotics and Location-Scale-Shape Analysis
Nathan Asher Taback
Doctor of Philosophy 1998
Department of Statistics
University of Toronto
Abstract
Regularity conditions are presented and a rigorous proof is given showing that the posterior distribution or normalized likelihood function, based on observations from a stochastic process, of a vector parameter with respect to either a proper or improper prior converges, almost surely, in distribution to the normal distribution. These conditions bear a strong resemblance to Wald's conditions under which the maximum likelihood estimator is consistent (Wald 1949).

A method is presented for generating error distributions with a shape parameter that, for example, can be used in location-scale-shape models. An error distribution that does not depend on a shape parameter can be transformed via a one-parameter continuous group of transformations, constructed using an infinitesimal transformation, into a family of error distributions indexed by a shape parameter.

A third order approximation to the significance level in testing either the location or scale parameter without any prior information concerning the remaining parameters in a location-scale-shape model is included. The third order approximation is developed via the asymptotic method, based on exponential models and the saddlepoint approximation, presented in Fraser and Reid (1995). Techniques are presented in the thesis for numerical computation of all quantities required for the third order approximation. To compare the accuracy of various asymptotic methods, numerical examples and simulations are included when the error distribution is the Student($\lambda$) family. Finally, possible extensions are suggested.
Acknowledgements

I would like to thank my supervisor, Professor D.A.S. Fraser, for his guidance, support, and patience.

I appreciate the conversations I had with Professors D. Brenner, K. Knight, P. McDunnough, and N. Reid while this thesis was being written.

I would like to thank Alison Gibbs who provided useful comments and criticisms while I was preparing my dissertation.

Financial support from the Government of Ontario and the Department of Statistics, University of Toronto is appreciated.

This thesis is dedicated to my wife Monika. Her love and support greatly facilitated the writing and completion of this thesis.
Contents
1 Introduction

2 Statistical Inference
  2.1 Definition of the Likelihood Function & Related Quantities
  2.2 Statistical Significance
  2.3 Bayesian Inference
  2.4 Inference In Structural Models
  2.5 The Relationship Between Bayesian and Structural Inference

3 First Order Likelihood Asymptotics
  3.1 Likelihood For The Location Normal
  3.2 First Order Tests and Confidence Intervals Based on Likelihood
    3.2.1 The Limiting Quadratic Shape of the Likelihood Function
  3.3 Appendix

4 Higher Order Likelihood Asymptotics
  4.1 The Edgeworth Expansion
  4.2 The Saddlepoint Expansion Via The Edgeworth Expansion
  4.3 Inference For A Scalar Parameter In the Presence Of Nuisance Parameters
    4.3.1 First Derivative Ancillary
    4.3.2 Approximating Exponential Model
    4.3.3 Marginal Density For A Scalar Interest Parameter
    4.3.4 First-Order Significance Function For A Scalar Interest Parameter
    4.3.5 Third-Order Significance Function For A Scalar Interest Parameter

5 Improper Priors, Posterior Asymptotic Normality, and Conditional Inference
  5.1 Assumptions
  5.2 Main Result
    5.2.1 Example
  5.3 Appendix

6 Inference In Location-Scale-Shape Models with Small Samples
  6.1 Conditional Inference For Location-Scale-Shape Analysis
    6.1.1 Inference For Shape
    6.1.2 Inference For Location
    6.1.3 Inference for Scale
  6.2 Asymptotic Location-Scale-Shape Analysis
    6.2.1 Linear Regression with A Shape Parameter

7 Analysis of Error Models with a Shape Parameter
  7.1 Some Familiar Error Distributions with a Shape Parameter
    7.1.1 Student ($\mu, \sigma, \lambda$) Family
    7.1.2 Slash Distributions
    7.1.3 Contaminated Normal
    7.1.4 Exponential Power Family
  7.2 Generating Error Models
    7.2.1 The Construction of a Continuous Group using Infinitesimal Transformations
    7.2.2 Generating Error Distributions with a Shape Parameter via the Infinitesimal Transformation
    7.2.3 Power Transformations

8 Location-Scale-Shape Analysis with Student ($\lambda$) Errors
  8.1 Asymptotic Location-Scale-Shape Analysis with Student ($\lambda$) Errors
  8.2 Maximum Likelihood Estimation in the Student ($\mu, \sigma, \lambda$) Family
  8.3 Modeling Real Data With Student ($\lambda$) Errors
    8.3.1 Cauchy Data
    8.3.2 Cushny-Peebles Data
  8.4 Simulation Study
    8.4.1 Purpose
    8.4.2 Methods
    8.4.3 Results
    8.4.4 Simulation Conclusions

9 Conclusion
Chapter 1
Introduction
Testing a parameter in a statistical model is often difficult to carry out if the tail area corresponding to the magnitude of departure from a null hypothesis involves an arduous computation or, even worse, is infeasible. In these cases it is appropriate to develop an asymptotic procedure to approximate the tail area. When the parameter of interest is scalar and nuisance parameters are present there are three common first order asymptotic procedures available based on the likelihood function: the likelihood ratio statistic, the score statistic, and the Wald statistic. These are valid approximations provided that the sample size is large and the statistical model satisfies regularity conditions ensuring that the normalized likelihood function or posterior distribution, almost surely, converges in distribution to the normal distribution; see Fraser, McDunnough, and Taback (1997).
The major weakness of the aforementioned first order procedures, especially the score and Wald methods, is that when the sample size is small the approximations can be very inaccurate. More accurate methods for approximating tail areas, when the sample size is small, have been developed recently. The saddlepoint approximation has produced formulas of remarkable practical accuracy for exponential models. For general statistical models third order accurate tail area formulae are available (Barndorff-Nielsen and Cox, 1994; Fraser and Reid, 1995), although their implementation usually requires the specification of an approximate ancillary statistic that complements the maximum likelihood estimator, hence limiting their scope of application to either exponential or transformation models. One way to circumvent this problem, proposed by Fraser and Reid (1995), is to approximate a general statistical model by an exponential model and apply the saddlepoint approximation to the approximating exponential model. Moreover, Fraser and Reid show that the only information needed concerning the ancillary statistic is its tangent directions at the observed data.
An important statistical model that is neither a transformation model nor an exponential model is the location-scale-shape model. This is an ordinary location-scale model except that the error distribution includes a parameter to describe the shape of the error distribution. For instance, the degrees of freedom parameter in the Student distribution can be regarded as a shape parameter. These models are useful for real data sets where, for example, the assumption of normality is often violated due to there being substantially more probability in the tails of the distribution. The significance level for a scalar interest parameter, either location or scale, can be approximated using the asymptotic methods outlined above.
This thesis has three objectives: the first is to provide regularity conditions and a rigorous proof that the posterior distribution or normalized likelihood function, based on observations from a stochastic process, of a vector parameter with respect to either a proper or improper prior converges, almost surely, in distribution to the normal distribution; the second is to present a method for generating error distributions with a shape parameter that, for example, can be used in location-scale-shape models; and finally the third objective is to present a third order approximation to the significance level in testing either the location or scale parameter without any prior information concerning the remaining parameters in the location-scale-shape model. For the first objective regularity conditions are set forth under which it is proven that the normalized likelihood function is asymptotically normal. These conditions bear a strong resemblance to Wald's conditions under which the maximum likelihood estimator is consistent (Wald 1949). An error distribution that does not depend on a shape parameter can be transformed via a one-parameter continuous group of transformations, constructed using an infinitesimal transformation, into a family of error distributions indexed by a shape parameter. Finally, a third order approximation to the significance level for a scalar interest parameter in location-scale-shape models is developed via the asymptotic method, based on exponential models and the saddlepoint approximation, presented in Fraser and Reid (1995). Techniques are presented in the thesis to numerically compute all quantities required for the approximation. In particular, when the error distribution is the Student($\lambda$) family, software is available from the author for computing the significance level of a scalar interest parameter using the third order method presented in Fraser and Reid.
Chapter 2 is a review of some concepts and notation used in likelihood based inference, including a review of inference in structural models.

Chapter 3 contains some background material on first order tests and confidence intervals based on the observed likelihood function.

Chapter 4 is a review of higher order likelihood asymptotics including the derivation of a third order approximation to the significance level for a scalar interest parameter.

In Chapter 5 regularity conditions are given under which a rigorous proof that the posterior distribution, based on observations from a stochastic process, of a vector parameter with respect to either a proper or improper prior converges, almost surely, in distribution to the normal distribution is presented.

Chapter 6 includes a derivation of a third order approximation to the significance level of a scalar parameter in location-scale-shape models. Moreover, a review of conditional inference in location-scale-shape models is given.

Chapter 7 contains a method for generating error distributions with a shape parameter by constructing a one-parameter group of transformations, and an analysis of some common error distributions with a shape parameter.

Chapter 8 is devoted to location-scale-shape analysis when the error distribution is Student($\lambda$). The derivations of all the formulae required to implement the third order approximation to the significance level of a scalar parameter are provided. Numerical examples are included to compare the accuracies of various methods. A large simulation study to assess the difference, in repeated sampling, between first order and third order methods was carried out using Student($\lambda$) errors.

Chapter 9 contains conclusions and suggestions for extensions and further research.
Chapter 2
Statistical Inference
This chapter reviews some key concepts and definitions in parametric statistical inference. In the first section the likelihood function is defined and notation is introduced for various derivatives of the likelihood function. The second section contains the definition, that will be used throughout this thesis, of the significance function. The remaining sections discuss Bayesian inference, inference in the structural model, and the relation between the two.
2.1 Definition of the Likelihood Function & Related Quantities
Let $\{f(\cdot\,; \theta) : \theta \in \Omega\}$ be a statistical model and $y = (y_1, \ldots, y_n)$ a sample from $f$. The likelihood function from the observed response value $y$ is defined as
$$L_n(\theta) = L(\theta; y) = c\, f(y; \theta),$$
where $c \in (0, \infty)$ is an arbitrary constant. It is often more convenient to work with the log-likelihood function
$$l_n(\theta) = l(\theta; y) = a + \log f(y; \theta),$$
where $a \in (-\infty, \infty)$ is an arbitrary constant. We will often suppress the data $y$, parameter $\theta$, or subscript $n$ and refer to $l$ as the likelihood function when there is no ambiguity in doing so.
If $L_n(\theta)$ is differentiable with respect to a parameter $\theta$ then the score function is defined as
$$l'(\theta) = \frac{\partial}{\partial \theta}\, l_n(\theta; y).$$
If $L_n(\theta)$ is twice continuously differentiable then the observed Fisher information is defined as
$$j(\theta) = -l''(\theta) = -\frac{\partial^2}{\partial \theta\, \partial \theta'}\, l_n(\theta; y).$$
The expected Fisher information is defined as
$$i(\theta) = \mathrm{Var}_\theta\!\left(l'(\theta)\right).$$
Under certain regularity conditions (Lehmann 1991, p. 118) an alternative expression for the expected information is given by the expression
$$i(\theta) = E_\theta\!\left(j(\theta)\right).$$
Suppose that $\theta = (\lambda, \psi) \in \Re^{p-1} \times \Re$ is a partitioned parameter with maximum likelihood estimator $\hat\theta = (\hat\lambda, \hat\psi)$ and constrained maximum likelihood estimator $\hat\theta_\psi = (\hat\lambda_\psi, \psi)$ for a particular value of $\psi$. We will often find it convenient to write the observed information matrix as a block matrix
$$j(\theta) = \begin{pmatrix} j_{\lambda\lambda}(\theta) & j_{\lambda\psi}(\theta) \\ j_{\psi\lambda}(\theta) & j_{\psi\psi}(\theta) \end{pmatrix},$$
with inverse
$$j^{-1}(\theta) = \begin{pmatrix} j^{\lambda\lambda}(\theta) & j^{\lambda\psi}(\theta) \\ j^{\psi\lambda}(\theta) & j^{\psi\psi}(\theta) \end{pmatrix}.$$
We will sometimes have occasion to differentiate the likelihood function with respect to both the parameter and the data. These derivatives will be denoted by
$$l_{;y}(\theta; y) = \frac{\partial}{\partial y}\, l(\theta; y), \qquad l_{\theta;y}(\theta; y) = \frac{\partial^2}{\partial \theta\, \partial y}\, l(\theta; y).$$
If $V = \{v_1, \ldots, v_p\}$ is a set of $p$ vectors such that $v_i \in \Re^n$ then the likelihood gradient in the direction of $V$ is given by the $1 \times p$ row vector
$$l_{;V}(\theta) = \frac{d}{dV}\, l(\theta; y) = l_{;y}(\theta; y)\, V.$$
2.2 Statistical Significance
Let $y_1, \ldots, y_n$ be a sample from a distribution with density $f(y; \theta)$ where $\theta \in \Omega \subset \Re$.

Definition 1. The significance function $p : \Omega \to [0, 1]$ is taken to be
$$p(\theta) = P\!\left(\hat\theta \le \hat\theta^{0};\ \theta\right),$$
the probability, under $\theta$, of a value of a suitable statistic such as the maximum likelihood estimator at or below its observed value $\hat\theta^{0}$.

$p(\theta)$ is often called the confidence distribution function since all possible confidence intervals are obtained by inverting $p(\theta)$. A $100(1 - \alpha)\%$ confidence interval for $\theta$ is
$$\left(p^{-1}(1 - \alpha/2),\ p^{-1}(\alpha/2)\right).$$
This definition of the significance function is discussed in Fraser (1991).
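As a concrete illustration (mine, not the thesis's), the following sketch computes and inverts a significance function for the mean $\theta$ of an exponential sample, where $p(\theta) = P(\sum Y_i \le \sum y_i^{0};\ \theta)$ has a closed Gamma form; the data are invented:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Significance function for the mean theta of an exponential sample:
# p(theta) = P(sum(Y) <= sum(y_obs); theta), with sum(Y) ~ Gamma(n, scale=theta).
y = np.array([0.7, 1.9, 0.4, 2.8, 1.1])   # illustrative data
n, s = len(y), y.sum()

def p(theta):
    return stats.gamma.cdf(s, a=n, scale=theta)

# p is decreasing in theta, so inverting at 0.975 and 0.025 gives a central 95% CI.
lo = brentq(lambda t: p(t) - 0.975, 1e-6, 100.0)
hi = brentq(lambda t: p(t) - 0.025, 1e-6, 100.0)
print(f"95% confidence interval for theta: ({lo:.3f}, {hi:.3f})")
```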
2.3 Bayesian Inference
Suppose that in addition to the likelihood function it is assumed that the parameter 0
is a reslization of the random variable 8 having density function n (8) du ( O ) , the prior
distribution of 8. Moreover, if the data y = (yl, ...,y,) is assumed to have been generated
from the conditional density of y given 8, f (y$) , then by Bayes theorem we have that
the conditional distribution of 8 given y has density function
feIY (Oly) is called the posterior distribution of 0 and can be used to make probability
statements about the parameter 8 in light of the data y.
2.4 Inference In Structural Models
A structural model is an ordinary statistical model where the error variable z is formdy
taken to be the source of variation for the response variable y. In addition the S ~ N C ~ U ~
model provides a transformation B that presents the response variable in terms of the
error variable.
The structural model provides a more detailed description of the physical process un-
der investigation than the ordinary statisticd model. In fad, "this more detailed model
predicates certain inference methods of analysis" (Raser 1988). Narnely, inference for a
parameter shouid be based on the appropriate conditional distribution, where the con-
ditioning is based on an ancillary statistic. Indeed, the conditioning is seen as necessary
in the context of a structural model (Raser 1979).
Suppose that z has density fimction f (*) on 31" and let 0 : 9t" -r !Rn belong to 0, a
group of transformations. The stmctural equation (Raser 1968, ch.2) is given by
where y = (y*, ... , y,) , a = (zi , . .. , h) . As an illustration consider the location-scale
model y =pl + oz, where z is N (O, 1) and B = (p , O ) . In anticipation of presenting the
relationship between the iikelihood function and the conditional-inference distribution
the following assumptions are introduced (Fraser 1968).
Assumption 1. $\Omega$ is an open subset of $\Re^k$; the transformations $\theta = \theta_1\theta_2$ and $y = \theta_1\theta_2 z$ are continuously differentiable in $\theta_1$, $\theta_2$, $z$.

Assumption 2. There exists a continuously differentiable function $\bar\theta_n(y)$ taking values in $\Omega$ such that
$$\bar\theta_n(\theta y) = \theta\, \bar\theta_n(y) \quad \text{for all } \theta \in \Omega.$$

Let $D_n(y) = \bar\theta_n^{-1}(y)\, y$, so that
$$D_n(\theta y) = D_n(y) \quad \text{for all } \theta \in \Omega.$$
Moreover we have that
$$D_n(y) = D_n(z)$$
from the equation $y = \theta z$.

Define the sample space Jacobian to be
$$J_n(y) = \left|\frac{\partial\,(\theta D)}{\partial D}\right|_{\theta = \bar\theta_n(y),\ D = D_n(y)}. \tag{2.5}$$
The sample space Jacobian can be written as
$$J_n(\theta z) = \left|\frac{\partial\, \bar\theta_n(\theta z)\, D_n(\theta z)}{\partial\,(\theta z)}\right|^{-1}.$$
In a similar way we can define a Jacobian on the group to be
$$J_L(g : h) = \left|\frac{\partial\,(g h)}{\partial h}\right|, \qquad J_L(g) = J_L(g : i), \tag{2.6}$$
for any $g, h \in \Omega$, where $i$ is the identity element of $\Omega$.
The density of $z$ is $f(z)\, dz$. By the change of variables formula we obtain the density of $y$, and hence the likelihood function of $\theta$, which we label (2.7). Fraser (1968, 1979) argues that inference about $\theta$ should be based on the conditional distribution of $\bar\theta_n(y)$ given $D_n(y) = D_n$, the conditional-inference distribution (2.8), a density function with respect to Lebesgue measure on $\Omega$. As in Brenner, Fraser and McDunnough (1982) we record the density of the pivotal quantity $h = \bar\theta_n(z) = \theta^{-1}\bar\theta_n(y)$ to illustrate the relationship between (2.8) and (2.7): this is a density with respect to left-invariant measure on $\Omega$. If we consider $h$ as a function of $\theta$ we obtain (2.7), but as a function of $\bar\theta_n(y)$ we obtain (2.8). The close connection between the conditional-inference distribution and the likelihood function was first presented by Fisher (1934) for location and location-scale models.
Example 2. Let $\Omega = \{[a, c] : a \in \Re,\ c \in \Re^+\}$ be the location-scale group on $\Re^n$, where the group action is
$$[a, c]\, x = a 1 + c x$$
for any $x \in \Re^n$.

The location-scale model can be written as
$$y = [\mu, \sigma]\, e,$$
where $e$ has density $f(\cdot\,; \lambda)$ on $\Re^n$, $\lambda \in \Lambda$, and $[\bar\mu_n(y), \bar s_n(y)]$ is any function that satisfies Assumption 2 (e.g., $[\bar y, s_y]$, with $D_n(y) = [\bar y, s_y]^{-1} y = (y - \bar y 1)/s_y$), provided that we delete the set $\{a 1 : a \in \Re\}$ from $\Re^n$. From the equation $z = [\mu, \sigma]^{-1} y$ the conditional-inference distribution (2.8) can then be written down explicitly.
2.5 The Relationship Between Bayesian and Structural Inference
The conditional-inference distribution derived in the previous section (2.8) is a posterior distribution that is obtained from a structural model together with the data. In this section it will be shown that the conditional-inference distribution is proportional to the posterior distribution of $\theta$ with respect to a right invariant prior.

Suppose that we choose a prior for $\theta$ that has a density function $\pi(\theta)$ with respect to right (invariant) Haar measure on the parameter space $\Omega$. That is, the a priori probability element for $\theta$ is given by
$$\pi(\theta)\, d\nu(\theta),$$
where $\nu(A) = \nu(A\theta)$ for all Borel sets $A \subset \Omega$ and $\theta \in \Omega$.

It is not difficult to show that the measure defined by
$$d\mu(\theta) = \frac{d\theta}{J_L(\theta)}$$
is left Haar measure on $\Omega$ (Fraser 1968, 1979), where $J_L(\cdot)$ is defined in (2.6).

Similarly we can obtain right Haar measure on $\Omega$ by setting
$$J_R(\theta) = J_L(\theta^{-1})$$
to produce the measure
$$d\nu(\theta) = \frac{d\theta}{J_R(\theta)}.$$
To investigate the relationship between left and right Haar measure we quote one of the main properties of Haar measure: any two left Haar measures on a group can be expressed as a constant times the other (see for example Folland 1984, p. 317).

Define the new measure
$$\mu_h(B) = \mu(Bh);$$
clearly $\mu_h$ is left invariant. Hence there is a positive number $\Delta(h)$ such that
$$\mu_h = \Delta(h)\, \mu.$$
The number $\Delta(h)$ is called the modular function of the group. It can be shown that left and right Haar measure are related to each other via the modular function of the group:
$$d\nu(g) = \frac{d\mu(g)}{\Delta(g)};$$
see Fraser (1979, p. 148).
We can now rewrite (2.8) as a density with respect to right Haar measure on $\Omega$; call this (2.12). Thus, the conditional-inference distribution (2.12) has the same form as the posterior distribution of $\theta$ in (2.2). Moreover, since right Haar measure is unique up to a multiplicative constant, the posterior density of $\theta$ with respect to a right invariant prior is proportional to the conditional-inference distribution of $\theta$ (Fraser 1961).
Chapter 3
First Order Likelihood Asymptotics
This chapter presents background material on first order likelihood asymptotics. In the first section the log-likelihood function for the location normal is derived as the difference of two quadratic terms. The second section records the proof given in Fraser (1979) that, in a neighborhood of the true value of a scalar parameter, the log-likelihood function converges almost surely to the aforementioned quadratic form. The relations to three first order tests for a scalar parameter are also noted.
3.1 Likelihood For The Location Normal

In order to ascertain the limiting form of the likelihood function for a scalar parameter it will be useful to write the log-likelihood function in terms of an adjusted variable and parameter. For this, let $y_1, \ldots, y_n$ be a sample from the $N(\mu, 1)$ distribution. The likelihood function for $\mu$ can be written as
$$L_n(\mu) = c\, \exp\left\{-\frac{n}{2}\, (\bar y - \mu)^2\right\},$$
where the term $\exp\{-\frac{1}{2}\sum_{i=1}^n (y_i - \bar y)^2\}$ has been incorporated into the constant term. Taking logarithms of both sides and letting $\mu_*$ denote the true value of $\mu$, the log-likelihood function for $\mu$, standardized with respect to $\mu_*$, can be expressed as
$$l_n(\mu) - l_n(\mu_*) = \delta \hat\delta - \frac{\delta^2}{2},$$
where $\delta = \delta(\mu) = \sqrt n\,(\mu - \mu_*)$ and $\hat\delta = \hat\delta(y_1, \ldots, y_n) = \sqrt n\,(\bar y - \mu_*)$ are an adjusted parameter and variable respectively (Fraser 1976, ch. 8).
3.2 First Order Tests and Confidence Intervals Based
on Likelihood
Let $y_1, \ldots, y_n$ be a sample from a distribution with density function $f(\cdot\,; \theta)$ that depends on a scalar parameter $\theta$. Under mild regularity conditions on the density function $f(\cdot\,; \theta)$ it can be established that in a neighborhood of the true value of $\theta$ the log-likelihood function of $\theta$ converges in distribution to the (normal) quadratic form
$$\delta w - \frac{\delta^2}{2}\, i(\theta_*), \tag{3.1}$$
where $\delta$ and $w$ will be defined below.

Fraser (1968, ch. 8) provides a careful proof that in the neighborhood of the true value of a scalar parameter $\theta$ the log-likelihood function converges a.s. to the quadratic form (3.1) and a.s. tends to $-\infty$ outside the neighborhood. Fraser (1979) discusses the relationship to first order tests and confidence intervals.
3.2.1 The Limiting Quadratic Shape of the Likelihood Function
In order for the likelihood function of a scalar parameter $\theta$, based on a sample $y = (y_1, \ldots, y_n)$, to have the limiting quadratic form (3.1) the following conditions on the density function $f(y; \theta)$ are sufficient (Fraser 1979): the expected information $E_\theta(j(\theta)) > 0$, and $|\partial^3 \log f(y; \theta)/\partial\theta^3| < N(y)$ where $N(y)\, f(y; \theta)$ is integrable on $\Re$.

Throughout this section assume that the density function $f(y; \theta)$ satisfies the two conditions above. Also let
$$l_n(\theta) = l(\theta; y) - l(\theta_*; y)$$
denote the log-likelihood function based on a sample, standardized with respect to the true value $\theta_*$, and let $i(\theta_*)$ denote the Fisher information based on a sample of size $n = 1$.
An application of Taylor's theorem yields the following expression for the log-likelihood:
$$l_n(\theta_* + \delta n^{-1/2}) = \delta n^{-1/2}\, l_n'(\theta_*) + \frac{\delta^2}{2n}\, l_n''(\theta_*) + \frac{\delta^3}{6 n^{3/2}}\, R_n,$$
where $|R_n| \le \sum_{i=1}^n N(y_i)$, $|\partial^3 \log f(y; \theta)/\partial\theta^3| < N(y)$, and $E|N(y)| < \infty$. Letting $\theta = \theta_* + \delta n^{-1/2}$ and writing $w_n = n^{-1/2}\, l_n'(\theta_*)$, $v_n = n^{-1}\, l_n''(\theta_*)$, we note that $l_n'(\theta_*)$ is a sum of i.i.d. random variables with $E\, l_n'(\theta_*) = 0$ and $\mathrm{Var}(l_n'(\theta_*)) = n\, i(\theta_*)$, so the central limit theorem can be applied to yield the limit
$$w_n \xrightarrow{d} w,$$
where $w \sim N(0, i(\theta_*))$. Moreover, by the Strong Law of Large Numbers
$$v_n \xrightarrow{a.s.} -i(\theta_*).$$
Therefore by Slutsky's Theorem (see appendix) we have that
$$l_n(\theta_* + \delta n^{-1/2}) = \delta w_n + \frac{\delta^2}{2}\, v_n + O_p(n^{-1/2}) \xrightarrow{d} \delta w - \frac{\delta^2}{2}\, i(\theta_*).$$
If we complete the square and set
$$\hat\delta = \hat\delta(y) = \frac{w_n}{i(\theta_*)}$$
and let $n \to \infty$ it follows that the limit can be written as
$$\hat\delta \delta\, i(\theta_*) - \frac{\delta^2}{2}\, i(\theta_*),$$
which is the (normal) quadratic form (3.1) (Fraser 1976, p. 352).
As $n \to \infty$, the likelihood function depends only on the sample characteristic $\hat\delta(y)$. Thus, with first order accuracy, $\hat\delta(y)$ can be seen as the large-sample likelihood statistic and it can be used as a basis for testing $\theta$.

The significance function of $\theta$, to the first order, can be obtained by computing any of three standard quantities: two asymptotically standard normal departures (maximum likelihood and score based) and the likelihood ratio statistic compared with the chi-square(1) distribution. The quantities $z_\alpha$ and $\chi^2_\alpha$ are the $\alpha$ quantiles of the standard normal distribution and the chi-square(1) distribution respectively. The accuracy of all three approximations is $O(n^{-1/2})$. The first and last approximations seem to perform better in applications than the middle approximation.
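The displayed formulas, (3.3) in the thesis's numbering, did not survive the scan; in the notation of this section the usual first order quantities (a standard reconstruction, not a verbatim recovery, and not necessarily in the thesis's order) are
$$p(\theta) \approx \Phi\!\left((\hat\theta - \theta)\, j^{1/2}(\hat\theta)\right), \qquad
p(\theta) \approx \Phi\!\left(\frac{l'(\theta)}{\{n\, i(\theta)\}^{1/2}}\right), \qquad
p(\theta) \approx \Phi\!\left(\mathrm{sgn}(\hat\theta - \theta)\left\{2\left[l(\hat\theta) - l(\theta)\right]\right\}^{1/2}\right),$$
where the last corresponds to comparing the likelihood ratio statistic $2[l(\hat\theta) - l(\theta)]$ with the chi-square(1) distribution.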
Example 3. The Pareto($\alpha$) distribution has a density function defined on $\Re^+$ given by
$$f(z; \alpha) = \alpha\, (1 + z)^{-(1+\alpha)}, \qquad \alpha > 0.$$
Let $z = (z_1, \ldots, z_n)$ be an i.i.d. sample from the Pareto($\alpha$). The observed log-likelihood is
$$l(\alpha; z) = n \log \alpha - (1 + \alpha) \sum_{i=1}^n \log(1 + z_i).$$
The score function and observed information are obtained by differentiating the expression above:
$$l'(\alpha) = \frac{n}{\alpha} - \sum_{i=1}^n \log(1 + z_i), \qquad j(\alpha) = \frac{n}{\alpha^2}.$$
The maximum likelihood estimate of $\alpha$ is $\hat\alpha = \hat\alpha(z) = n / \sum_{i=1}^n \log(1 + z_i)$. The regularity conditions stated above are clearly satisfied, hence tests and confidence intervals can be based on (3.3).

First, consider the hypothesis $\alpha = \alpha_0$. The hypothesis can be tested by calculating the standardized maximum likelihood departure and comparing the value obtained with the standard normal distribution. On the other hand, the hypothesis can be tested by calculating the standardized score statistic and comparing the value obtained with the standard normal distribution. Alternatively it can be tested by calculating the likelihood ratio statistic and comparing the value with the chi-square distribution on one degree of freedom. As $n \to \infty$ these tests are equivalent; for small to moderate samples they may give different results.

Consider the artificial data set given by $z = (1, 2, 3, 4, 4)$ and suppose that we are interested in testing $H_0 : \alpha = 1$. The p-values for the three tests, presented in the same order as above, are: 0.313, 0.267, 0.434. Indeed, with small data sets, the three first order approximations can give very different answers.
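The exact standardizations behind the thesis's three reported numbers are not recoverable from the scan, but the common variants of the three statistics are easy to compute; the following sketch (illustrative, not the author's code) evaluates them for the artificial data above:

```python
import numpy as np
from scipy import stats

z = np.array([1.0, 2.0, 3.0, 4.0, 4.0])   # artificial data from the example
n, S = len(z), np.log(1.0 + z).sum()
alpha0 = 1.0                               # null hypothesis H0: alpha = 1
alpha_hat = n / S                          # maximum likelihood estimate

def loglik(a):
    return n * np.log(a) - (1.0 + a) * S

# Standardized MLE departure (Wald), standardized score, and likelihood ratio.
wald = (alpha_hat - alpha0) * np.sqrt(n / alpha_hat**2)   # uses j(alpha_hat)
score = (n / alpha0 - S) / np.sqrt(n / alpha0**2)         # uses i(alpha0)
lr = 2.0 * (loglik(alpha_hat) - loglik(alpha0))
r = np.sign(alpha_hat - alpha0) * np.sqrt(lr)             # signed likelihood root

print("one-sided p-values:",
      stats.norm.cdf(wald), stats.norm.cdf(score), stats.norm.cdf(r))
print("chi-square(1) p-value for LR:", stats.chi2.sf(lr, df=1))
```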
3.3 Appendix
Theorem 4 (Slutsky). Let $(X_n, n \ge 1)$ and $(Y_n, n \ge 1)$ be two sequences of random variables such that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$, where $c$ is a constant; then $X_n + Y_n \xrightarrow{d} X + c$.

Proof: A proof can be found in Cramér (1946).
Chapter 4
Higher Order Likelihood
Asymptotics
This chapter contains background material on the Edgeworth expansion, the indirect Edgeworth expansion or saddlepoint expansion, and the method of approximating the significance function for a scalar component parameter presented in Fraser and Reid (1995), referred to as the FR method. In chapter 6 the FR method will be applied to the location-scale-shape model.
4.1 The Edgeworth Expansion
Let $x_1, x_2, \ldots$ be a sequence of i.i.d. random variables with density function $f(\cdot)$, moment generating function $M(\cdot)$, mean $\mu$, standard deviation $\sigma$, $r$th cumulant $\kappa_r$, and standardized cumulants $\rho_r = \kappa_r/\kappa_2^{r/2}$. Also, set $S_n = x_1 + \cdots + x_n$ and $S_n^* = (S_n - n\mu)/\sigma\sqrt n$. The i.i.d. assumption implies that $M_{S_n}(t) = [M(t)]^n$. Taking the logarithm of both sides we have that $K_{S_n}(t) = n \log M(t) = n K(t)$.

By expanding the cumulant generating function in a Taylor series about $t = 0$, exponentiating, and expanding $\exp(x)$ about $x = 0$, we obtain a series that can be inverted term by term to yield the Edgeworth expansion for the density function $f_{S_n^*}$ of $S_n^*$; integrating term by term then gives the corresponding expansion for the distribution function of $S_n^*$.
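The displayed expansion is lost in the scan; its standard form, consistent with the Hermite terms that reappear in the saddlepoint derivation below, is
$$f_{S_n^*}(x) = \phi(x)\left[1 + \frac{\rho_3}{6\sqrt n}\, H_3(x) + \frac{\rho_4}{24 n}\, H_4(x) + \frac{\rho_3^2}{72 n}\, H_6(x)\right] + O(n^{-3/2}).$$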
Here $H_r(x)$ is the Hermite polynomial of degree $r$ defined as
$$H_r(x) = (-1)^r\, \frac{\phi^{(r)}(x)}{\phi(x)}.$$
The first few Hermite polynomials are
$$H_1(x) = x, \quad H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_4(x) = x^4 - 6x^2 + 3.$$
In general the odd order Hermite polynomials vanish at $x = 0$. The leading term $\phi(x)$ then has error $O(n^{-1})$ at $x = 0$ and error $O(n^{-1/2})$ in the tails of the distribution.

For background on the Edgeworth expansion see Cramér (1946). The discussion in this section follows the treatment given in Barndorff-Nielsen and Cox (1989, ch. 4). For a survey of the multivariate Edgeworth expansion see McCullagh (1987, ch. 5).
4.2 The Saddlepoint Expansion Via The Edgeworth
Expansion
One major shortcoming of the Edgeworth expansion is its poor performance in the tails of the distribution. For statistical purposes this is precisely the part of the distribution where high precision is needed.

Let $f(\cdot)$ be a density function with moment generating function $M(\cdot)$ and cumulant generating function $K(\cdot)$. The density can be associated with the exponential family
$$f(x; \theta) = \exp\{\theta x - K(\theta)\}\, f(x), \tag{4.1}$$
where $K(\theta) = \log M(\theta)$. This procedure of forming $f(x; \theta)$ is called exponential tilting and $f(x; \theta)$ is called the tilted distribution. For a sample $x_1, \ldots, x_n$ the density function is given by
$$\prod_{i=1}^n f(x_i; \theta) = \exp\left\{\theta \sum_{i=1}^n x_i - n K(\theta)\right\} \prod_{i=1}^n f(x_i).$$
Thus the tilted distribution of $S_n = \sum_{i=1}^n x_i$ is
$$f_{S_n}(s; \theta) = \exp\{\theta s - n K(\theta)\}\, f_{S_n}(s). \tag{4.2}$$
To obtain an accurate approximation to the density function $f_{S_n}(s) = f_{S_n}(s; 0)$ the Edgeworth expansion can be applied to $f_{S_n}(s; \theta)$. If $\theta$ is chosen such that $s$ is in the center of the distribution then the error incurred would be $O(n^{-1})$ instead of $O(n^{-1/2})$. In other words we will choose $\theta = \hat\theta$, where $E_{\hat\theta}(S_n) = s$.
The log-likelihood function of the tilted distribution is
$$l(\theta; s) = \theta s - n K(\theta) + a,$$
where $a$ is a constant. The maximum likelihood estimator of $\theta$ is obtained by solving $l'(\hat\theta; s) = s - n K'(\hat\theta) = 0$. Since $E_\theta(S_n) = n K'(\theta)$ it follows that the maximum likelihood estimator, $\hat\theta$, corresponds to the value of $\theta$ such that $s$ is in the center of the distribution.

Applying the Edgeworth expansion to $f_{S_n}(s; \theta)$ we obtain the asymptotic expansion for $f_{S_n}(s)$:
$$f_{S_n}(s) = \frac{e^{n K(\theta) - \theta s}}{\sqrt{2\pi n K''(\theta)}}\left(1 + \frac{\rho_3(\theta)}{6\sqrt n}\, H_3(z) + \frac{\rho_4(\theta)}{24 n}\, H_4(z) + \frac{\rho_3^2(\theta)}{72 n}\, H_6(z) + O(n^{-3/2})\right), \tag{4.3}$$
where
$$z = \frac{s - n K'(\theta)}{\sqrt{n K''(\theta)}}, \qquad \rho_r(\theta) = \frac{K^{(r)}(\theta)}{[K''(\theta)]^{r/2}}.$$
Choosing $\theta = \hat\theta$ we have
$$E_{\hat\theta}(S_n) = n K'(\hat\theta) = s, \tag{4.4}$$
which leads to $z = 0$. Now (4.3) becomes
$$f_{S_n}(s) = \frac{e^{n K(\hat\theta) - \hat\theta s}}{\sqrt{2\pi n K''(\hat\theta)}}\left(1 + \frac{\rho_4(\hat\theta)}{24 n}\, H_4(0) + \frac{\rho_3^2(\hat\theta)}{72 n}\, H_6(0) + O(n^{-2})\right). \tag{4.5}$$
The leading term of the last equation in (4.5),
$$\hat f_{S_n}(s) = \frac{e^{n K(\hat\theta) - \hat\theta s}}{\sqrt{2\pi n K''(\hat\theta)}},$$
is called the saddlepoint approximation to the density of $S_n$.
The $n^{-1}$ term at the origin in the Edgeworth expansion of $f_{S_n}$ is
$$\frac{1}{n}\left(\frac{\rho_4(\hat\theta)}{24}\, H_4(0) + \frac{\rho_3^2(\hat\theta)}{72}\, H_6(0)\right) = \frac{1}{n}\left(\frac{\rho_4(\hat\theta)}{8} - \frac{5\rho_3^2(\hat\theta)}{24}\right). \tag{4.6}$$
If (4.6) is independent of $\hat\theta$ and we normalize the leading term in (4.5), then (4.6) will be incorporated into the normalizing constant and the renormalized approximation will have error $O(n^{-3/2})$ uniformly in $s$ (Barndorff-Nielsen and Cox 1989, p. 107). In the one-dimensional case, there are just three families for which (4.6) does not depend on $\hat\theta$: the normal, gamma, and inverse Gaussian (Blaesild and Jensen 1985). Renormalization of (4.5) yields
$$f_{S_n}(s) = c\, \frac{e^{n K(\hat\theta) - \hat\theta s}}{\sqrt{2\pi n K''(\hat\theta)}}\left(1 + O(n^{-3/2})\right), \tag{4.7}$$
provided that (4.6) is constant.
From (4.4) the transformation from $s$ to $\hat\theta$ is one-to-one, with
$$\frac{ds}{d\hat\theta} = n K''(\hat\theta) = \hat\jmath,$$
the observed information, so that $ds = \hat\jmath\, d\hat\theta$. So (4.7) can also be expressed as a density with respect to the maximum likelihood estimator:
$$f(\hat\theta; \theta) = c\, |\hat\jmath|^{1/2}\, \exp\{l(\theta) - l(\hat\theta)\}\left(1 + O(n^{-3/2})\right). \tag{4.8}$$
The saddlepoint expansion for the exponential model (4.1) has a corresponding distribution function approximation (Lugannani and Rice 1980)
$$P(S_n \le s) = \Phi(r) + \phi(r)\left(\frac{1}{r} - \frac{1}{q}\right)\left(1 + O(n^{-1})\right), \tag{4.9}$$
where
$$r = \mathrm{sgn}(\hat\theta)\left[2\{\hat\theta s - n K(\hat\theta)\}\right]^{1/2}, \qquad q = \hat\theta\left[n K''(\hat\theta)\right]^{1/2}. \tag{4.10}$$
Both $r$ and $q$ are asymptotically standard normal to the first order.

Daniels (1954) first introduced saddlepoint methods to statistics. An excellent review of saddlepoint methods in statistics is given in Reid (1988). The exposition presented here is based on Barndorff-Nielsen and Cox (1989, ch. 4). Kolassa (1994) and Jensen (1995) both provide a thorough treatment of the regularity conditions involved in the saddlepoint approximation.
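As a quick numerical check of (4.7) and (4.9), the following sketch (an illustration of mine, not code from the thesis) applies the saddlepoint density and the Lugannani-Rice tail formula to a sum of $n$ unit exponentials, for which $K(t) = -\log(1 - t)$ and the exact distribution is Gamma($n$, 1):

```python
import numpy as np
from scipy import stats

n = 5                          # sample size; exact S_n ~ Gamma(n, 1)

def saddlepoint(s):
    th = 1.0 - n / s           # solves the saddlepoint equation n K'(th) = s
    K, K2 = -np.log(1.0 - th), 1.0 / (1.0 - th) ** 2
    dens = np.exp(n * K - th * s) / np.sqrt(2 * np.pi * n * K2)
    r = np.sign(th) * np.sqrt(2 * (th * s - n * K))
    q = th * np.sqrt(n * K2)
    cdf = stats.norm.cdf(r) + stats.norm.pdf(r) * (1 / r - 1 / q)
    return dens, cdf

for s in [2.0, 4.0, 12.0]:     # points away from the mean n, where r, q != 0
    dens, cdf = saddlepoint(s)
    print(f"s={s:5.1f}  saddle pdf {dens:.5f}  exact {stats.gamma.pdf(s, n):.5f}  "
          f"saddle cdf {cdf:.5f}  exact {stats.gamma.cdf(s, n):.5f}")
```

The gamma family is one of the three for which (4.6) is constant, so the renormalized density approximation here is exact up to the constant.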
4.3 Inference For A Scalar Parameter In the Presence Of Nuisance Parameters
This section is intended to provide background material on the FR method (Fraser and Reid 1995) for approximating the significance function of a scalar interest parameter. The FR method consists of approximating a general statistical model with an exponential model and then applying the saddlepoint approximation to the approximating model. The only information needed concerning the ancillary statistic is the tangent directions to a first derivative ancillary. A general method for computing these directions is outlined in this section.
4.3.1 First Derivative Ancillary
The concept of a first derivative ancillary was introduced in Fraser (1964) and Fraser (1968, ch. 6) for a scalar parameter and extended in Fraser and Reid (1995) to the vector parameter case. The importance of a first derivative ancillary is that it provides all of the required ancillary information for third order inference without having to specify an approximate ancillary statistic explicitly.

Definition 5. A random variable $A$ with density function $g(a; \theta)$ is first derivative ancillary at $\theta_0$ if and only if
$$\left.\frac{\partial}{\partial \theta}\, g(a; \theta)\right|_{\theta = \theta_0} = 0.$$
Construction of a First Derivative Ancillary

Let $y_1, \ldots, y_n$ be a sample from a distribution with distribution function $F(y; \theta)$ where $\theta \in \Re$. In addition assume that $F$ is stochastically increasing in a neighborhood of $\theta_0$; that is, for all $\theta$ near $\theta_0$ and $y \in \Re$,
$$\frac{\partial F(y; \theta)}{\partial \theta} < 0.$$
Holding $F(y; \theta_0)$ constant we find that
$$\frac{dy}{d\theta} = -\frac{\partial F(y; \theta)/\partial \theta}{\partial F(y; \theta)/\partial y} > 0;$$
an increase in $\theta$ at $\theta_0$ causes the entire distribution to shift to the right. If we let $x(y)$ rescale the variable so that this shift has unit speed at $\theta_0$, then the distribution function of $x(y)$ is
$$G(x; \theta) = F(y(x); \theta),$$
where $y(x)$ is the inverse function of $x(y)$. In this case we find
$$\left.\frac{\partial G(x; \theta)}{\partial \theta}\right|_{\theta_0} = -\left.\frac{\partial G(x; \theta)}{\partial x}\right|_{\theta_0}.$$
But then
$$G(x; \theta) = G\!\left(x - (\theta - \theta_0); \theta_0\right) + O\!\left((\theta - \theta_0)^2\right).$$
So we say that the statistical model for $x(y)$ can be written as
$$G\!\left(x(y) - (\theta - \theta_0)\, 1;\ \theta_0\right)$$
to first derivative at $\theta_0$. Thus the model, to a first derivative approximation at $\theta_0$, can be seen as a location model $x = (\theta - \theta_0) 1 + \epsilon$, where $\epsilon$ has distribution function $G(\cdot\,; \theta_0)$, having as corresponding ancillary statistic the residual configuration of $x$, whose tangent direction vector is $dx/d\theta = 1$.
4.3.2 Approximating Exponential Model

Let $f(x; \varphi)$ be a continuous exponential model with $p$-dimensional parameter $\varphi$. Thus $f(x; \varphi)$ is known to have the form
$$f(x; \varphi) = \exp\left\{\theta'(\varphi)\, y(x) - K(\theta(\varphi)) + h(y(x))\right\}, \tag{4.11}$$
but $\theta(\varphi)$, $y(x)$, $K(\theta(\varphi))$ and $h(y(x))$ may not be explicitly available.
Example 6. Let $x_1, \ldots, x_n$ be a random sample from a $N(\mu, \sigma^2)$ population. The density of the sample is
$$f(x; \mu, \sigma) = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right\},$$
which can be written as
$$\exp\left\{\frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \log\frac{1}{\sigma^n} - \frac{n\mu^2}{2\sigma^2} + \log\frac{1}{(2\pi)^{n/2}}\right\}.$$
In this case $\varphi = (\mu, \sigma)$, $\theta(\varphi) = (\mu/\sigma^2,\ -1/2\sigma^2)$, $y(x_1, \ldots, x_n) = (\sum_{i=1}^n x_i,\ \sum_{i=1}^n x_i^2)'$, $-K(\theta(\varphi)) = \log(1/\sigma^n) - n\mu^2/2\sigma^2$, and $h(y) = \log(1/(2\pi)^{n/2})$.

The canonical parameter $\theta$ and variable $y$, nominal cumulant generating function $K$ and underlying $h$ are not uniquely determined. To standardize (4.11) with respect to an observed value $x^0$ having maximum likelihood estimate $\hat\varphi^0 = \hat\varphi(x^0)$ we require:
$$y(x^0) = 0, \quad \theta(\hat\varphi^0) = 0, \quad K(0) = 0, \quad \left[\frac{\partial \theta(\varphi)}{\partial \varphi'}\right]_{\hat\varphi^0} = I.$$
Let
$$s = s(\varphi; x) = \left.\frac{\partial\, l(\varphi; x)}{\partial \varphi}\right|_{\varphi = \hat\varphi^0} \tag{4.12}$$
be the score variable. At the data point $x^0$ we have that $s(\hat\varphi^0; x^0) = 0$. From (4.11) we have that
$$\left.\frac{\partial\, l(\varphi; x)}{\partial \varphi}\right|_{\hat\varphi^0} = \left[\frac{\partial \theta'(\varphi)}{\partial \varphi}\, y(x) - \frac{\partial K(\theta(\varphi))}{\partial \varphi}\right]_{\varphi = \hat\varphi^0} = y(x) = s,$$
using the standardization above. We can write (4.11) as $l(\varphi; x) = \theta'(\varphi)\, s - K(\theta(\varphi))$, which gives the canonical parameter in terms of sample space derivatives,
$$\theta'(\varphi) = \left.\frac{\partial\, l(\varphi; x)}{\partial s}\right|_{x = x^0}, \tag{4.13}$$
and the cumulant generating function $K(\theta) = -l(\varphi^{-1}(\theta); x^0)$. Thus
$$l(\varphi; x) = \theta'(\varphi)\, s - K(\theta(\varphi)), \tag{4.14}$$
which shows that $l$ and $l_{;x}$ fully determine an exponential linear model.

Consider a continuous statistical model where the dimension of the variable is the same as the dimension of the parameter. The tangent exponential model, at an observed point in the sample space, is defined (Fraser 1990) as the exponential model (4.11) that agrees with the given model at the observed sample point: this is given by (4.14) in terms of the score variable defined in (4.12). This approximating exponential model has canonical parameter given by (4.13) and cumulant generating function $K(\theta)$.

Consider the scalar variable, scalar parameter case. An approximation to the left-tail probability for the maximum likelihood estimator is given by (4.9), with (4.10) for $r$ and with $q$ computed from the tangent exponential model.
4.3.3 Marginal Density For A Scalar Interest Parameter
Fraser and Reid (1995) show that a first derivative ancillary at $\hat\theta^0 = \hat\theta(y_1, \ldots, y_n)$ can be upgraded to a second order ancillary without changing the tangent directions at the data. Moreover, they show that it suffices to obtain the tangent directions to a second order ancillary in order to approximate the significance function, for a scalar interest parameter, to the third order. Thus, for third order inference it suffices to obtain the tangent directions to a first derivative ancillary.

Let $y^0 = (y_1, \ldots, y_n)$ be a sample from $f(y; \theta)$ where $\theta = (\lambda, \psi) \in \Re^{p-1} \times \Re$. The ancillary directions $V$ for $\theta$ can be found from the method presented in section 4.3.1. The ancillary direction is given by
$$v_i = \left.\frac{\partial y_i}{\partial \theta}\right|_{(y^0, \hat\theta^0)} = -\left.\frac{\partial F(y_i; \theta)/\partial \theta}{\partial F(y_i; \theta)/\partial y_i}\right|_{(y^0, \hat\theta^0)}, \tag{4.17}$$
where $i = 1, \ldots, n$.

In the vector case we would have $\theta = (\theta_1, \ldots, \theta_p)$ and the ancillary directions $V = (v_{ij})$ are given by
$$v_{ij} = -\left.\frac{\partial F(y_i; \theta)/\partial \theta_j}{\partial F(y_i; \theta)/\partial y_i}\right|_{(y^0, \hat\theta^0)},$$
where $i = 1, \ldots, n$ and $j = 1, \ldots, p$.
The tangent exponential model at $y^0$ is
$$f_{TEM}(s; \theta)\, ds = \exp\left\{l^0(\theta) + \varphi'(\theta)\, s\right\}\, h(s)\, ds, \tag{4.18}$$
where
$$\varphi'(\theta) = l_{;V}(\theta; y^0) = \left.\frac{d}{dV}\, l(\theta; y)\right|_{y^0} \tag{4.19}$$
and $l^0(\theta) = l(\theta; y^0)$.

Since (4.18) is a tilted density of the form (4.1) we can approximate it using the saddlepoint approximation (4.7); denote the resulting approximation by (4.20), in which the information determinant $|\tilde\jmath_{\varphi\varphi}|$ is calculated from the tilted likelihood in the exponent of (4.18). The accuracy of (4.20) is $O(n^{-3/2})$ in a first derivative neighborhood of $y^0$ and $O(n^{-1})$ in a compact region for the variable, save an $O(n^{-1})$ constant (Cakmak, Fraser and Reid 1994). Writing $\varphi = (\varphi_1, \varphi_2)$ and $s' = (s_1', s_2')$, the marginal distribution for $s_2$ on the maximum likelihood surface $\hat\theta_\psi$, or $s_1 = 0$, is the ratio of the joint density for $(s_1, s_2)$ to the conditional density of $s_1 | s_2$; denote this marginal density by (4.21). Note that $\varphi_\lambda(\theta)$ is a $p \times (p - 1)$ matrix with corresponding matrix volume $|\varphi_\lambda'(\theta)\, \varphi_\lambda(\theta)|^{1/2}$. Formula (4.21) was derived in Fraser and Reid (1995).
4.3.4 First-Order Significance Function For A Scalar Interest Parameter

Suppose that $f(y; \theta)$ is a $p$-dimensional statistical model with $\theta = (\lambda, \psi)$ where $\lambda$ is a $p - 1$ dimensional nuisance parameter. The likelihood based quantities, the signed likelihood root and the standardized maximum likelihood departure, are asymptotically $N(0, 1)$ to the first order (Barndorff-Nielsen and Cox 1994). Thus with accuracy $O(n^{-1/2})$ we have three first order approximations to the significance function of $\psi$.
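The displayed quantities were lost in scanning; in standard notation (a reconstruction consistent with chapter 3, not a verbatim recovery) the two basic quantities are
$$r(\psi) = \mathrm{sgn}(\hat\psi - \psi)\left\{2\left[l(\hat\theta) - l(\hat\theta_\psi)\right]\right\}^{1/2}, \qquad
q(\psi) = (\hat\psi - \psi)\left\{\frac{|j(\hat\theta)|}{|j_{\lambda\lambda}(\hat\theta_\psi)|}\right\}^{1/2},$$
giving the first order significance functions $\Phi(r)$ and $\Phi(q)$, together with the chi-square version based on $r^2$.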
4.3.5 Third-Order Significance Function For A Scalar Interest Parameter

Linear Exponential Models

The saddlepoint approximation (4.7) to an exponential model with partitioned parameter $\theta = (\lambda, \psi) \in \Re^{p-1} \times \Re$ has a corresponding distribution function approximation for a scalar interest parameter,
$$\Phi(R) + \phi(R)\left(\frac{1}{R} - \frac{1}{Q}\right), \tag{4.24}$$
where $R = r^*$ and $Q = q^*$ (see Reid 1995(b) and the references therein).

An asymptotically equivalent version of (4.24) is
$$\Phi(R^*), \qquad R^* = R - \frac{1}{R}\, \log\frac{R}{Q}. \tag{4.25}$$
Barndorff-Nielsen (1986, 1991) introduced the $R^*$ version as an alternative to the Lugannani-Rice formula. $R^*$ is standard normal to the third order and (4.25) has the advantage of always producing p-values inside the interval $[0, 1]$.
General Statistical Models

For a general statistical model we obtained the density approximation in (4.21) for a scalar variable when $\psi$ is fixed, conditional on some third order ancillary for $\lambda$. Cheah, Fraser and Reid (1995) have derived the Lugannani and Rice formula (4.24) for an exponential model multiplied by an adjustment factor as in (4.21). The Lugannani and Rice formula for this type of model uses $r^*$ as in (4.23) and a $q^*$ of the form (4.26), in which the nuisance information is recalibrated to the canonical parameterization:
$$|\tilde\jmath_{\lambda\lambda}(\hat\theta_\psi)| = |j_{\lambda\lambda}(\hat\theta_\psi)|\ |\varphi_\lambda'(\hat\theta_\psi)\, \varphi_\lambda(\hat\theta_\psi)|^{-1},$$
as defined in (4.22).

Towards obtaining an expression for $Q$ in (4.26) for the model in (4.21), the Jacobian from $\theta$ to $\varphi$ and its inverse give the connection between $\theta$ and $\varphi$ in the neighborhood of $\hat\theta_\psi$. The scalar parameter
$$\chi(\theta) = \frac{\psi_{\varphi'}(\hat\theta_\psi)}{\left|\psi_{\varphi'}(\hat\theta_\psi)\right|}\, \varphi(\theta) \tag{4.27}$$
behaves like $\psi$ in the neighborhood of $\hat\theta_\psi$. A standardized maximum likelihood departure for testing $\psi$ then has the form
$$\hat q = \left[\chi(\hat\theta) - \chi(\hat\theta_\psi)\right]\left\{\frac{|\hat\jmath_{\varphi\varphi}(\hat\theta)|}{|\tilde\jmath_{\lambda\lambda}(\hat\theta_\psi)|}\right\}^{1/2}. \tag{4.26}$$
Thus (4.24) or (4.25) can be used with $R = r^*$ and $Q = \hat q$ to provide a third order approximation to the significance function for $\psi$. Calculation of $r^*$ and $\hat q$ is tantamount to implementing the FR method. If the parameter of interest is a vector parameter, Fraser and Reid (1995) suggest testing successive components of the parameter vector as described in Fraser and McKay (1975).
Chapter 5
Improper Priors, Posterior
Asymptotic Normality, and
Condit ional Inference
There has been a modest amount of literature on the posterior asymptotic normality of a scalar parameter based on a proper prior distribution. Heyde and Johnstone (1979) extended the proof of Walker (1969) to stochastic processes. A proof of the multiparameter case can be found in Johnstone (1978). Sweeting and Adekola (1987) extended the regularity conditions found in Heyde and Johnstone (1979) to cover a wider class of processes, again based on a proper prior.

Posterior asymptotic normality where the prior distribution is not assumed to be proper was first examined in Brenner, Fraser and McDunnough (1982) for a sample from a scalar location parameter model. Fraser and McDunnough (1984) extended the result to a general model with a scalar parameter and mention that for a random sample the multiparameter case should hold. Indeed, the literature does not seem to contain a proof of the multiparameter case without the assumption of a proper prior. The purpose of this section is to establish the asymptotic normality of the posterior distribution of a vector parameter based on either a random sample or a sample from a stochastic process without the assumption of a proper prior. The results contained in this section appear in Fraser, McDunnough and Taback (1997).
In the previous section we saw that posterior distributions or normalized likelihood functions have an intimate connection with the confidence distribution generated by a transformation or structural model using conditional inference. Theorem 3.1 in Brenner, Fraser and McDunnough (1982) gives conditions under which convergence almost surely of the suitably normalized likelihood function or the posterior to the standard multivariate normal distribution is sufficient for almost sure convergence of the standardized conditional-inference distribution to the standard multivariate normal. In a transformation model the likelihood function can be normalized with respect to an invariant measure on the parameter space, as a natural non-informative prior. If the parameter space is a compact group then the invariant measure will be proper, but for location, location-scale, regression models and others with noncompact groups, this prior is improper. Consider the classical regression model $y = X\beta + \sigma z$ where $z$ is $N_k(0, I)$ with the parameter $\theta = (\beta, \sigma)$; a proof of the strict asymptotic normality of the confidence distribution of $\hat\Sigma_n^{-1/2}(\theta - \hat\theta)$ does not seem to be available in the literature, but is the concern of this section.
Let the parameter space $\Omega = \Re^k$ be $k$-dimensional Euclidean space. We find it convenient to use the norm
$$\|x\| = \max_{1 \le i \le k} |x_i|$$
on $\Re^k$ rather than the Euclidean norm. If $A$ is a linear transformation on $\Re^k$ defined by the matrix $[A_{ij}]$, then the norm of $A$ will be taken as
$$\|A\| = \max_{i, j} |A_{ij}|.$$
We then have that
$$\|A x\| \le k\, \|A\|\, \|x\|$$
for all $x \in \Re^k$. It also follows from the Cauchy-Schwarz inequality that
$$\|A x\| \le k^{1/2}\left(\max_i \sum_{j=1}^k A_{ij}^2\right)^{1/2} \|x\|$$
for all $x \in \Re^k$.

An open rectangle in $\Re^k$ is defined to be a subset of $\Re^k$ of the form
$$A = \left\{x \in \Re^k :\ a_i < x_i < b_i,\ \text{for } i = 1, \ldots, k\right\}.$$
Closed rectangles are defined in an analogous way.

Let $f : A \to \Re$ be a bounded function, where $A$ is a rectangle in $\Re^k$.
5.1 Assumptions
The following assumptions concerning the model and a prior $w(\theta)$ are closely related to those in Fraser and McDunnough (1984) and Johnstone (1978).

Assumption 1.
$$\int_\Omega w(\theta)\, L_n(\theta)\, d\theta < \infty \quad \text{a.s. for each } n,$$
and $w(\theta)$ is continuous and positive at the true $\theta$.

Assumption 2. $l_n(\theta)$ is twice continuously differentiable, and $\det(j(\theta)) \in (0, \infty)$. Define
$$\Sigma_n(\theta) = j^{-1}(\theta)$$
and assume that both $\Sigma_n(\theta) \to 0$ and $\hat\Sigma_n \to 0$, where $\hat\Sigma_n = \Sigma_n(\hat\theta)$ with $\hat\theta$ the maximum likelihood estimate of $\theta$.

Assumption 3. For every $\delta > 0$,
$$\limsup_{n \to \infty}\ \sup_{\theta' : \|\theta' - \theta\| \ge \delta}\ \frac{1}{n}\left[l_n(\theta') - l_n(\theta)\right] < 0 \quad \text{a.s.}$$

Assumption 4. For every $\epsilon < 1$ there exists a $\delta > 0$ such that
$$\limsup_{n \to \infty}\ \sup_{s : \|s - \theta\| \le \delta}\ \left\|\Sigma_n^{1/2\,\prime}(\theta)\left\{l_n''(s) - l_n''(\theta)\right\}\Sigma_n^{1/2}(\theta)\right\| \le \epsilon \quad \text{a.s.}$$

Assumption 3 is a modified version of Assumption 1' in Fraser and McDunnough (1984). Assumption 4 ensures that in a neighborhood of its maximum the likelihood function has a multivariate normal form, that is,
$$l_n(\hat\theta + \hat\Sigma_n^{1/2} t) - l_n(\hat\theta) \to -\frac{t' t}{2}. \tag{5.1}$$
For this last statement expand $l_n$ in a Taylor series (see appendix) about $\hat\theta$ to obtain
$$l_n(\hat\theta + \hat\Sigma_n^{1/2} t) - l_n(\hat\theta) = \frac{1}{2}\, t'\, \hat\Sigma_n^{1/2\,\prime}\, l_n''(\bar\theta)\, \hat\Sigma_n^{1/2}\, t,$$
where $\bar\theta$ lies somewhere on the line joining $\hat\theta$ to $\hat\theta + \hat\Sigma_n^{1/2} t$. Assumption 4 then gives (5.1).
5.2 Main Result
Theorem 7. Let $\{X_t, t \in T\}$ be a stochastic process and suppose that we observe $n$ realizations of the process $x = (x_1, \ldots, x_n)$ having density $f_n(x|\theta)$ with respect to a $\sigma$-finite measure not dependent on $\theta$. If Assumptions 1, 2, 3, 4 hold, then the posterior distribution of $\hat\Sigma_n^{-1/2}(\theta - \hat\theta)$ converges, almost surely, in distribution to the standard multivariate normal distribution.

Proof: From Assumption 1 and (5.1) we see that it suffices to show that
$$\int w(u)\, L_n(u)\, du \sim (2\pi)^{k/2}\, \det(\hat\Sigma_n^{1/2})\, w(\hat\theta)\, L_n(\hat\theta). \tag{5.3}$$
Towards this we note that
$$\frac{\int w(u)\, L_n(u)\, du}{w(\hat\theta)\, L_n(\hat\theta)\, \det(\hat\Sigma_n^{1/2})} = \int \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt.$$
For any rectangle $A \subset \Re^k$ the dominated-convergence theorem yields
$$\int_A \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt \to \int_A e^{-t' t/2}\, dt.$$
As in Fraser and McDunnough (1984) we define sets
$$B_{\delta, n} = \left\{t : \hat\theta + \hat\Sigma_n^{1/2} t \notin A,\ \|\hat\Sigma_n^{1/2} t\| \le \delta\right\}, \qquad B_n = \left\{t : \hat\theta + \hat\Sigma_n^{1/2} t \notin A,\ \|\hat\Sigma_n^{1/2} t\| > \delta\right\},$$
with $\delta > 0$. $B_{\delta, n}$ is the region outside the rectangle $A \subset \Re^k$ but in a $\delta$-neighborhood of the true parameter, while $B_n$ is the region outside a rectangle and not in a neighborhood of the true parameter. Towards (5.3) we note that it now suffices to show that
$$\int_{B_{\delta, n}} \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt \to 0 \tag{5.4}$$
and
$$\int_{B_n} \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt \to 0. \tag{5.5}$$
If we can show that
$$\frac{L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{L_n(\hat\theta)} \le e^{c_\epsilon\, t' t} \quad \text{on } B_{\delta, n}, \tag{5.6}$$
then (5.4) follows immediately. Towards (5.6) let $R_n = \hat\Sigma_n^{1/2\,\prime}\left(l_n''(\bar\theta) - l_n''(\hat\theta)\right)\hat\Sigma_n^{1/2}$ and $t \in B_{\delta, n}$; then the multivariate version of Taylor's theorem yields
$$l_n(\hat\theta + \hat\Sigma_n^{1/2} t) - l_n(\hat\theta) = -\frac{t' t}{2} + \frac{1}{2}\, t'\, R_n\, t \le -\frac{t' t}{2}\,(1 - \epsilon).$$
The last inequality follows from Assumption 4. Thus,
$$\frac{L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{L_n(\hat\theta)} \le e^{c_\epsilon\, t' t},$$
where the constant $c_\epsilon < 0$ for any $\epsilon < 1$. Finally for suitable $\delta > 0$ we have that (5.4) holds.

In anticipation of (5.5), we write
$$L_n(\theta) = L_{n-1}(\theta)\, L_1(\theta),$$
where $L_{n-1} \propto f(x_1, \ldots, x_n; \theta)/f(x_1; \theta)$ for $n > 1$ and $L_1 \propto f(x_1; \theta)$. An application of Assumption 3 then yields
$$\frac{L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{L_n(\hat\theta)} \le e^{-n d}$$
for all $t \in B_n$ and for a constant $d > 0$ (see Fraser and McDunnough, 1984). It then follows from Assumption 1 that (5.5) holds; here $\theta_0$ denotes the true value of $\theta$.
Proof: The multivariate analog of a theorem due to Scheffé (1947) yields the result.

Let
$$y = \mu 1 + \sigma q,$$
i.e., where $q \sim$ Student($\lambda$). Suppose that for a given value of $\lambda$ we want to make inferences about $\mu$. The structural distribution function of the parameter $\mu$ (see chapter 6) is $G(\mu; \lambda)$, where $f(\cdot\,; \lambda)$ is the Student($\lambda$) density.

A $1 - \alpha$ confidence interval for $\mu$ is given by $(a, b)$, where $a < b$ satisfy
$$G_\lambda(b) - G_\lambda(a) = 1 - \alpha.$$
So a $100(1 - \alpha)\%$ confidence interval has the form
$$\left(G^{-1}(\alpha/2; \lambda),\ G^{-1}(1 - \alpha/2; \lambda)\right). \tag{5.7}$$
To implement the confidence interval (5.7) we require values of the quantile function $G^{-1}(x; \lambda)$ for selected values of $x$. Since the assumptions about the error density hold we can apply the theorem. Thus with accuracy $O(n^{-1/2})$ we can approximate $G(\cdot\,; \lambda)$ with a normal distribution function centered at $\hat\mu(y)$ with standard deviation determined by $\hat s(y)$ and $\hat\sigma(y)$.
5.3 Appendix
Theorem 9 (Taylor). Let $f : U \subset \Re^p \to \Re$ have continuous partial derivatives of second order. Then we may write
$$f(x) = f(x_0) + f'(x_0)\,(x - x_0) + R,$$
where $R = O(\|x - x_0\|^2)$ and
$$R = \frac{1}{2}\, (x - x_0)'\, f''(\bar x)\,(x - x_0),$$
where $\bar x$ lies somewhere on the line joining $x_0$ to $x$.

Proof: A proof can be found in Marsden and Tromba (1988).
Chapter 6
Inference In Location-Scale-Shape
Models with Small Samples
The first section of this chapter is devoted to exact conditional inference in the location-scale-shape model, for a fixed value of the shape parameter. In the next section the FR method is applied to this model to obtain a third order approximation to the significance function for either the location or scale parameter.
6.1 Conditional Inference For Location-Scale-Shape Analysis
Fraser (1976b, 1979) provides a thorough treatment of conditional inference for a component parameter in a location-scale-shape model. The marginal likelihood function can be used for inference about the shape parameter. Then for a fixed value of the shape parameter inference is available for a component of $(\mu, \sigma)$, while the remaining components are assumed to be unknown.

Suppose that the structural equation (2.3) takes the form
$$y = \mu 1 + \sigma e, \tag{6.1}$$
where $e$ has known density function $f(\cdot\,; \lambda)$ that depends on a scalar parameter $\lambda \in (0, \infty)$. Throughout this chapter the parameter $\lambda$ is assumed to be the shape or kurtosis parameter of the distribution with density $f$, but the results of this chapter are applicable to statistical models where the parameter $\lambda$ is, for example, skewness.

The conditional distribution of $(\bar\mu, \bar s)$ given $D$ has density function (2.10), namely
$$g(\bar\mu, \bar s \mid D;\ \mu, \sigma) = k_\lambda^{-1}(D)\, \prod_{i=1}^n f\!\left(\frac{\bar\mu - \mu + \bar s D_i}{\sigma};\ \lambda\right) \frac{\bar s^{\,n-2}}{\sigma^n}, \tag{6.2}$$
where
$$k_\lambda(D) = \int_0^\infty \int_{-\infty}^\infty \prod_{i=1}^n f(p + s D_i;\ \lambda)\, s^{n-2}\, dp\, ds.$$
6.1.1 Inference For Shape
The only characteristic of the realized $e$ that is observed is the unit residual vector $D$. The probability for this observed event is
$$k_\lambda(D)\, da,$$
where the differential $da$ denotes the volume of the unit sphere formed by the points $D$; then $\bar s^{\,n-2} da$ is the volume on the sphere of radius $\bar s$ (Fraser 1979). Thus the observed likelihood function for the parameter $\lambda$ is
$$L(\lambda) = c\, k_\lambda(D),$$
where $c > 0$. Different values of the constant $c$ give similarly shaped functions of $\lambda$ (Fraser 1979).
6.1.2 Inference For Location
The relation in (2.4) becomes
$$y = \bar\mu(y)\, 1 + \bar s(y)\, D, \tag{6.3}$$
with $\bar\mu = \bar\mu(e)$, $\bar s = \bar s(e)$ for the error variable.

For a given value of $\lambda$, inferences concerning $\mu$ can be obtained by first rearranging (6.3) to obtain the pivotal quantity
$$t = \frac{\bar\mu(y) - \mu}{\bar s(y)}.$$
The marginal distribution of $t$ has density function
$$g(t; \lambda) = k_\lambda^{-1}(D) \int_0^\infty \prod_{i=1}^n f\!\left(s\,(t + D_i);\ \lambda\right) s^{n-1}\, ds. \tag{6.5}$$
For a given value of $\lambda$ a $1 - \alpha$ confidence interval for $t$ is $(t_1, t_2)$, where
$$\int_{t_1}^{t_2} g(t; \lambda)\, dt = 1 - \alpha,$$
so that a $1 - \alpha$ confidence interval for $\mu$ is
$$\left(\bar\mu(y) - t_2\, \bar s(y),\ \bar\mu(y) - t_1\, \bar s(y)\right).$$
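A small numerical sketch (mine, not from the thesis) of the pivot density (6.5) and the resulting interval, for Student($\lambda$) errors with the $s$-integral done by quadrature; the data and the choice of position/scale statistics are illustrative:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

y = np.array([0.2, 1.5, -0.8, 2.1, 0.9, 1.3])   # illustrative data
lam = 5.0                                        # fixed shape (degrees of freedom)
mu_bar, s_bar = y.mean(), y.std(ddof=0)          # one choice of position and scale
D = (y - mu_bar) / s_bar                         # unit residual vector
n = len(y)

def f(x):                                        # Student(lam) error density
    return stats.t.pdf(x, df=lam)

def g_unnorm(t):                                 # integrand of (6.5), up to k(D)
    integrand = lambda s: np.prod(f(s * (t + D))) * s ** (n - 1)
    return quad(integrand, 0, np.inf)[0]

# Normalize g on a grid and invert its distribution function for a 95% interval.
tg = np.linspace(-6, 6, 601)
gv = np.array([g_unnorm(t) for t in tg])
cdf = np.cumsum(gv); cdf /= cdf[-1]
t1, t2 = np.interp([0.025, 0.975], cdf, tg)
print(f"95% CI for mu: ({mu_bar - t2 * s_bar:.3f}, {mu_bar - t1 * s_bar:.3f})")
```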
6.1.3 Inference for Scale
The marginal distribution for $\bar s$ can be obtained by integrating $\bar\mu$ out of (6.2):
$$h(s; \lambda) = k_\lambda^{-1}(D) \int_{-\infty}^{\infty} \prod_{i=1}^n f\!\left(p + s D_i;\ \lambda\right) s^{n-2}\, dp. \tag{6.6}$$
Thus for a given value of $\lambda$ a $1 - \alpha$ confidence interval for $\sigma$ is
$$\left(\bar s(y)/s_2,\ \bar s(y)/s_1\right),$$
where $s_1 < s_2$ satisfy
$$\int_{s_1}^{s_2} h(s; \lambda)\, ds = 1 - \alpha.$$
6.2 Asymptotic Location-Scale-Shape Analysis
For a fixed value of $\lambda$ exact inference is available for a component of $(\mu, \sigma)$ from either (6.5) or (6.6). Reid (1995a) points out two problems if $\lambda$ is treated as unknown:

There is no exact ancillary statistic available to condition on, in order to reduce the dimension of the sufficient statistic to the dimension of the parameter, and no exact method of eliminating the nuisance parameters to provide a one dimensional distribution for inference about the parameter of interest. In the absence of an 'exact' solution, it is difficult to derive an approximate solution, since it is not clear what should be approximated. However, the more general work of Barndorff-Nielsen (1991, 1994), DiCiccio and Martin (1991), and Fraser and Reid (1995) suggests that given a suitable method for dimension reduction (such as conditioning on an approximate ancillary statistic), the derivation of the tail area is straightforward.
In this section we derive the likelihood related quantities needed in order to implement the FR method for the location-scale-shape model with general error distribution dependent on a shape parameter.

The likelihood function of $\theta$ for the location-scale-shape model in (6.1) is
$$l(\mu, \sigma, \lambda; y) = \sum_{i=1}^n l_i(\mu, \sigma, \lambda; y),$$
where $l_i(\mu, \sigma, \lambda; y) = \log\{\frac{1}{\sigma} f(\frac{y_i - \mu}{\sigma}; \lambda)\}$.

At the beginning of section 4.3.3 it was noted that for third order inference it suffices to obtain the tangent directions, at the data, to a second order ancillary. Moreover, a first derivative ancillary has the same tangent directions as a second order ancillary. Thus, the tangent directions to some third order ancillary are given by (4.17). The directions can be described by the vectors $V = (v_1, v_2, v_3)$ and for the location-scale-shape model are given by
$$v_{1i} = 1, \qquad v_{2i} = \hat z_i = \frac{y_i - \hat\mu}{\hat\sigma}, \qquad v_{3i} = -\hat\sigma\, \frac{\partial F(\hat z_i; \lambda)/\partial\lambda}{f(\hat z_i; \hat\lambda)},$$
where
$$F(x; \lambda) = \int_{-\infty}^x f(t; \lambda)\, dt.$$
If $F(x; \lambda)$ is not available in closed form then we can approximate the numerator of $v_{3i}$ using
$$\frac{\partial F(x; \lambda)}{\partial \lambda} \approx \frac{F(x; \lambda + \epsilon) - F(x; \lambda - \epsilon)}{2\epsilon} + R,$$
where $R = -\frac{\epsilon^2}{6}\, \partial^3 F(x; \lambda)/\partial\lambda^3$. Press, Teukolsky, Vetterling and Flannery (1988) discuss the accuracy of using the above equation with $R = 0$ and how to choose the value of $\epsilon$ in practice.
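To make this concrete, a brief sketch (illustrative, not the author's software) that assembles the ancillary direction vectors for Student($\lambda$) errors, using a central difference in $\lambda$ for the third column as just described; the data and step size are my own choices:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

y = np.array([0.2, 1.5, -0.8, 2.1, 0.9, 1.3])   # illustrative data

def negloglik(par):                              # -l(mu, sigma, lambda; y)
    mu, log_s, log_lam = par
    s, lam = np.exp(log_s), np.exp(log_lam)
    return -np.sum(stats.t.logpdf((y - mu) / s, df=lam) - np.log(s))

fit = minimize(negloglik, x0=[y.mean(), np.log(y.std()), np.log(5.0)])
mu_hat, s_hat, lam_hat = fit.x[0], np.exp(fit.x[1]), np.exp(fit.x[2])
z_hat = (y - mu_hat) / s_hat                     # standardized residuals

eps = 1e-4 * lam_hat                             # step for the central difference
dF_dlam = (stats.t.cdf(z_hat, df=lam_hat + eps)
           - stats.t.cdf(z_hat, df=lam_hat - eps)) / (2 * eps)

V = np.column_stack([np.ones_like(y),            # direction for mu
                     z_hat,                      # direction for sigma
                     -s_hat * dF_dlam / stats.t.pdf(z_hat, df=lam_hat)])
print(V)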
The reparameterization in (4.19) can be expressed as
$$\varphi'(\theta) = l_{;V}(\theta; y^0) = \left(\sum_{i=1}^n \frac{\partial l_i}{\partial y_i}\, v_{1i},\ \sum_{i=1}^n \frac{\partial l_i}{\partial y_i}\, v_{2i},\ \sum_{i=1}^n \frac{\partial l_i}{\partial y_i}\, v_{3i}\right).$$
The Jacobian of the parameter change is
$$J_3(\theta) = \frac{\partial \varphi(\theta)}{\partial \theta'}.$$
Suppose, for example, that the interest parameter is $\mu$. The scalar linear parameter that corresponds to $\mu$ in a neighborhood of $\hat\theta_\mu$ is given by (4.27),
$$\chi(\theta) = K_\mu(\hat\theta_\mu)\, \varphi(\theta),$$
where $K_\mu(\theta)$ corresponds to the $\mu$th row of $K(\theta) = J^{-1}(\theta)$.

The significance function for $\mu$ is available from either the Lugannani and Rice formula (4.24) or the Barndorff-Nielsen formula (4.25), with $R$ and $Q$ computed as in (4.26), where $J_2(\theta)$ is the matrix obtained from the last two columns of $J_3(\theta)$ and the nuisance information matrix is computed from the constrained fit at $\hat\theta_\mu$, recalibrated by the matrix volume $|J_2'(\hat\theta_\mu)\, J_2(\hat\theta_\mu)|^{1/2}$.
6.2.1 Linear Regression with A Shape Parameter
Fraser, Monette, Ng and Wong (1995) applied the FR method to generalized linear models with non-linear link function, under the assumption that the error distribution did not depend on any unknown parameters. Fraser (1979, ch. 6) considered linear regression models where the error distribution depends on a shape parameter and used exact conditional methods to obtain inference for a scalar interest parameter for a fixed value of the shape parameter. Lange et al. (1989) fit regression models with Student($\lambda$) errors and used first order methods to obtain confidence intervals for a scalar interest parameter with the shape parameter $\lambda$ unknown.

Consider the linear regression model
$$y = X\beta + \sigma e,$$
where $X$ is an $n \times p$ design matrix of rank $p$, $\beta = (\beta_1, \ldots, \beta_p)'$, and $e$ has density function $f(\cdot\,; \lambda)$. The likelihood function is
$$l(\beta, \sigma, \lambda; y) = \sum_{i=1}^n \log\left\{\frac{1}{\sigma}\, f\!\left(\frac{y_i - x_i'\beta}{\sigma};\ \lambda\right)\right\}.$$
The $p + 2$ tangent directions to some third order ancillary are given by
$$v_i = \left(x_{i1}, \ldots, x_{ip},\ \hat e_i,\ -\hat\sigma\, \frac{\partial F(\hat e_i; \lambda)/\partial\lambda}{f(\hat e_i; \hat\lambda)}\right),$$
where $\hat e_i = (y_i - x_i'\hat\beta)/\hat\sigma$. The remaining likelihood based quantities are easily obtained. Third-order tests and confidence intervals for a scalar component of $(\beta_1, \ldots, \beta_p, \sigma, \lambda)$ can be obtained from (4.24) or (4.25) together with (4.26).
Chapter 7
Analysis of Error Models with a
Shape Parameter
A method for generating error distributions with a shape parameter is developed in this chapter. In addition, an analysis of some familiar error models with a shape parameter is presented.
7.1 Some Familiar Error Distributions with a Shape Parameter
When the population under study has a heavy tailed distribution, so that the occurrence of large values is not rare, the following families have been cited in the literature as being appropriate error distributions: the Student($\mu, \sigma, \lambda$) family, slash distributions, contaminated normal, and exponential power family. All of these families, with the exception of the exponential power family, can be expressed as the distribution of the ratio
$$y = \mu + \frac{r - \mu}{\tau}.$$
The scaling random variable $\tau$, which is assumed to be independent of $r \sim N(\mu, \sigma^2)$, has density function $g(\cdot\,; \lambda)$, which forms a one-parameter exponential family. The density of $y$ then has the form
$$h(y; \mu, \sigma, \lambda) = \int_{-\infty}^{\infty} \frac{|t|}{\sigma}\, \phi\!\left(\frac{t\,(y - \mu)}{\sigma}\right) g(t; \lambda)\, dt. \tag{7.1}$$
The ratio given above is of no direct importance for the inferential problem, but is useful for generating different families of error distributions with a shape parameter where the practical computation of maximum likelihood estimates is available using the EM algorithm (Lange et al. 1989).
7.1.1 Student ($\mu, \sigma, \lambda$) Family

If $\tau$ is chosen such that
$$\tau = \left(\chi^2_\lambda/\lambda\right)^{1/2}, \qquad \text{i.e., } \tau^2 \sim \mathrm{Gamma}(\lambda/2,\ \lambda/2),$$
where the $\mathrm{Gamma}(\alpha, \beta)$ density is
$$g(t; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, t^{\alpha - 1}\, e^{-\beta t}, \qquad t > 0,$$
then the marginal distribution of $y$ is given by (7.1), namely the Student($\mu, \sigma, \lambda$) distribution (Dickey 1968).
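A quick simulation check of this representation (an illustration of mine): drawing $\tau^2$ from Gamma($\lambda/2, \lambda/2$) and forming $y = \mu + (r - \mu)/\tau$ should reproduce Student($\mu, \sigma, \lambda$) quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, lam, N = 1.0, 2.0, 4.0, 200_000

r = rng.normal(mu, sigma, N)                     # r ~ N(mu, sigma^2)
tau = np.sqrt(rng.gamma(lam / 2, 2 / lam, N))    # tau^2 ~ Gamma(lam/2, rate lam/2)
y = mu + (r - mu) / tau

# Compare sample quantiles with the Student(mu, sigma, lam) quantiles.
for p in [0.05, 0.25, 0.5, 0.75, 0.95]:
    print(p, np.quantile(y, p).round(3),
          (mu + sigma * stats.t.ppf(p, df=lam)).round(3))
```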
7.1.2 Slash Distributions

Rogers and Tukey (1962) advocated choosing $\tau$ to have finite support, with density function $(\lambda + 1)\, t^{\lambda}\, I_{[0,1]}(t)$, where $I_A(\cdot)$ is the indicator function. The slash family is defined as the corresponding family of distributions of $y$ in (7.1). In particular when $\lambda = 0$ then $\tau$ is uniform and $y$ has the standard slash distribution (i.e., $\mu = 0$ and $\sigma = 1$) with density
$$h(y) = \frac{\phi(0) - \phi(y)}{y^2}, \qquad y \neq 0,$$
and $h(0) = \phi(0)/2$.

This model may have better statistical properties than the Student($\lambda$) for data with extreme outliers, but is less convenient from a computational point of view. In particular the likelihood function involves the computation of the incomplete gamma function for each observation (Lange, Little and Taylor 1989).
7.1.3 Contaminated Normal

Suppose that $\tau$ is chosen to be a binary random variable with density function
$$g(t; \lambda) = \begin{cases} 1 - \lambda, & \text{if } t = 1, \\ \lambda, & \text{if } t = w, \\ 0, & \text{otherwise,} \end{cases}$$
where $\lambda \in (0, 1)$ and $w > 0$ are assumed to be known. The marginal density of $y$ is given by (7.1),
$$h(y; \mu, \sigma, \lambda) = (1 - \lambda)\, \frac{1}{\sigma}\, \phi\!\left(\frac{y - \mu}{\sigma}\right) + \lambda\, \frac{w}{\sigma}\, \phi\!\left(\frac{w\,(y - \mu)}{\sigma}\right),$$
a mixture of a $N(\mu, \sigma^2)$ random variable and a $N(\mu, (\sigma/w)^2)$ random variable. If $\lambda$ is chosen small and $w$ is small (say 0.1) then the resulting error distribution is an appropriate model when the data are contaminated with a small fraction of outliers. Little (1988) compared the contaminated normal and the Student($\lambda$) in a simulation study and found they yielded comparable robust estimates of location and scale, although the contaminated model requires specification or estimation of two robustness parameters, $\lambda$ and $w$, rather than just one for the Student($\lambda$) model.
7.1.4 Exponential Power Family

Another family that comprises distributions with tails that are both longer and shorter than the normal is the exponential power family (Box and Tiao 1973). This family has densities of the form
$$f(y; \mu, \sigma, \beta) = c(\beta)\, \sigma^{-1}\, \exp\left\{-\frac{1}{2}\left|\frac{y - \mu}{\sigma}\right|^{2/(1+\beta)}\right\}, \qquad \beta \in (-1, 1],$$
where $c(\beta)$ is the normalizing constant. The shape parameter $\beta$ measures the amount of kurtosis, indicating the amount of non-normality. When $\beta = 0$ the distribution is normal. When $\beta = 1$ the distribution is Laplace, and letting $\beta \to -1$ the density function tends to the uniform distribution. Fraser (1979) and Lange et al. (1989) found that the distributional form of the power exponential family is unrealistic for modelling real data as $|\beta| \to 1$. Lange et al. reported more computational problems with maximum likelihood estimation than with the Student($\lambda$) family, caused by the shape parameter $|\beta| \to 1$. The EM algorithm is unavailable for maximum likelihood estimation since the power exponential family is not expressible as a mixture of members of the exponential family (Lange et al. 1989). Moreover the derivatives of either the density or log density with respect to the data and location or scale parameters are discontinuous, a very unattractive feature of the exponential power family.
7.2 Generating Error Models
In the previous section a method, based on the ratio of a normal random variable to some independent random variable, was used to generate some familiar error distributions with a shape parameter. In this section a method is introduced to generate error distributions with a shape parameter by transforming an error variable z via a transformation belonging to a one-parameter group.
7.2.1 The Construction of a Continuous Group using Infinitesimal Transformations

In this section we review the construction of a one-parameter group of diffeomorphisms, a Lie group, using infinitesimal generators.
A one-parameter group of transformations G = {η(·; λ) : λ ∈ ℝ} of a subset M of Euclidean space is a mapping

$$ \eta : M \times \mathbb{R} \to M $$

such that

- η is a differentiable mapping;
- the mapping η(·; λ) : M → M is a diffeomorphism for every λ ∈ ℝ.

The equation

$$ y = \eta(x; \lambda) \tag{7.2} $$

for each λ ∈ ℝ defines a transformation of a point x ∈ ℝ into a point y ∈ ℝ. The set

$$ Gx = \{\eta(x; \lambda) : \lambda \in \mathbb{R}\} $$

is called the orbit of the point x. A point x traces out an orbit as all transformations η ∈ G are applied to it.
Suppose that λ₀ is the value of λ for the identity transformation, that is,

$$ \eta(x; \lambda_0) = x. \tag{7.3} $$

The function η is the solution of the ordinary differential equation (7.4) of the group (Eisenhart 1961) satisfying the initial condition (7.3), so that at λ = λ₀ the transformation is the identity.

Consider the reparameterization

$$ a = \beta(\lambda) $$

with inverse β⁻¹(a). It follows that a = 0 yields the identity, and the differential equation of the group (7.4) becomes

$$ \frac{\partial \eta(x; a)}{\partial a} = \xi(\eta(x; a)) \tag{7.5} $$

with initial condition η(x; 0) = x. Expanding η as a power series in a we have that

$$ \eta(x; a) = x + a\,\xi(x) + \frac{a^2}{2!}\,\xi(x)\,\xi'(x) + \cdots. \tag{7.6} $$

In (7.6) we have used the fact that

$$ \frac{\partial^2 \eta}{\partial a^2} = \xi'(\eta)\,\frac{\partial \eta}{\partial a} = \xi'(\eta)\,\xi(\eta), $$

where ξ′(t) = dξ(t)/dt.
Let U be the linear operator defined by

$$ Uf(x) = \xi(x)\, \frac{df(x)}{dx}; $$

U will be called the infinitesimal operator. Using operator notation, equation (7.6) may be written as

$$ \eta(x; a) = \sum_{m=0}^{\infty} \frac{a^m}{m!}\, U^m x, \tag{7.7} $$

where U^m f is the result of composing U precisely m times on f.
Suppose that the series (7.7) is convergent for a ∈ [0, a₁]; then equation (7.7) is equivalent to equation (7.2) for λ ∈ β⁻¹([0, a₁]), although this may mean that only a portion of the orbit of x is given by (7.7). By a transformation of coordinates, (7.7) can take the form of a translation for limited values of a, so that the orbit of x is the line segment parallel to the 1 vector passing through x. Whence, if P(w) is a point on the line segment of the orbit of x, for λ ∈ β⁻¹([0, a₁]), then on applying (7.7) to P(w) we get another line segment of the orbit. Hence by sufficient repetition of (7.7) we obtain the entire orbit of x in so far as it is defined by (7.2) (Eisenhart 1961).
Replacing a in (7.6) by an infinitesimal Δa and neglecting powers of Δa higher than the first, we obtain

$$ \eta(x; \Delta a) = x + \xi(x)\,\Delta a, $$

which is called the infinitesimal transformation of the group G. The transformation in (7.7) is said to be generated by the infinitesimal transformation (Eisenhart 1961) in the sense described in the preceding paragraph.
Let G = {η(·; λ) : λ ∈ ℝ} denote a one-parameter group of transformations of a domain U and let

$$ \xi(x) = \left. \frac{\partial \eta(x; \lambda)}{\partial \lambda} \right|_{\lambda = 0}. $$

Then p(λ) = η(x; λ) is a solution of the partial differential equation (7.5). Conversely, if ξ(x) is a differentiable function such that ξ(x) is identically zero for sufficiently large |x|, then the partial differential equation

$$ \frac{\partial \eta(x; \lambda)}{\partial \lambda} = \xi(\eta(x; \lambda)) \tag{7.9} $$

has a solution η(x; λ) such that η(x; 0) = x. Moreover, the family {η(·; λ) : λ ∈ ℝ} defines a group of transformations on ℝ consisting of transformations generated by the infinitesimal transformation (Eisenhart 1961, pp. 54). The solution η is unique in a neighborhood of the origin and satisfies

$$ \lambda = \int_x^{\eta(x; \lambda)} \frac{du}{\xi(u)} $$

provided that ξ(u) ≠ 0, and η(x; λ) = x if ξ(u) = 0 (Arnold 1973).
If the function ξ(·) is defined on a noncompact set, then it is possible that the set {η(·; λ) : λ ∈ ℝ} will not form a one-parameter group on ℝ. Arnold (1973, pp. 22-23) provides a simple counterexample by defining ξ(x) = x², so that the solution of (7.9) is η(x; λ) = x/(1 − λx). The domain of η is λ < 1/x and λ > 1/x, so that the restriction of η to these two intervals gives two unrelated solutions. One reason why the set {x/(1 − λx) : λ ∈ ℝ} does not form a one-parameter group on ℝ is simply that η(·; λ) (λ ≠ 0) is not defined on all of ℝ.
Example 10 This is problem 1 from Arnold (1973, pp. 15). Suppose that

$$ \xi(x) = \sin(x), \qquad x \in \mathbb{R}. $$

The solution η of (7.9) satisfies

$$ \lambda = \int_x^{\eta} \frac{du}{\sin(u)} = \int_x^{\eta} \csc(u)\, du = \Big[\log |\csc(u) - \cot(u)|\Big]_x^{\eta}, $$

so that

$$ \eta(x; \lambda) = 2 \arctan\big(\exp(\lambda)\, |\tan(x/2)|\big). $$

The identity transformation occurs at λ = 0:

$$ \eta(x; 0) = 2 \arctan(|\tan(x/2)|) = x, \qquad x \in (0, \pi). $$

To verify that η is a solution of (7.9), first observe that

$$ \frac{\partial \eta(x; \lambda)}{\partial \lambda} = \frac{2 \exp(\lambda)\, |\tan(x/2)|}{1 + \exp(2\lambda) \tan^2(x/2)}. $$

Letting y = arctan(exp(λ)|tan(x/2)|) and using the double-angle formula for the sine, we have that

$$ \sin(\eta(x; \lambda)) = \sin(2y) = 2 \sin(y) \cos(y) = 2 \tan(y) \cos^2(y) = \frac{2 \exp(\lambda)\, |\tan(x/2)|}{1 + \exp(2\lambda) \tan^2(x/2)}, $$

so that ∂η(x; λ)/∂λ = ξ(η(x; λ)). In addition, it is not difficult to show that {η(·; λ) : λ ∈ ℝ} is a one-parameter group of transformations, with inverse η⁻¹(y; λ) = 2 arctan(exp(−λ) tan(y/2)).
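The computations in Example 10 can also be checked numerically: a central-difference derivative of η in λ should agree with ξ(η) = sin(η), and the stated inverse should recover x. A small sketch, assuming x ∈ (0, π) so the absolute value is inactive:

    # Solution of the group ODE for xi(x) = sin(x).
    eta <- function(x, lam) 2 * atan(exp(lam) * tan(x/2))

    x <- 1.3; lam <- 0.7; h <- 1e-6
    deriv.fd <- (eta(x, lam + h) - eta(x, lam - h))/(2 * h)   # central difference
    deriv.fd - sin(eta(x, lam))                 # should be numerically zero

    # The inverse transformation recovers x:
    2 * atan(exp(-lam) * tan(eta(x, lam)/2)) - x   # also near zero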
Example 11 This is example 2.6 from Arnold (1973, pp. 15). Suppose that

$$ \xi(x) = kx, $$

where k is a constant. The solution of (7.9) satisfying η(x; 0) = x is simply

$$ \eta(x; \lambda) = x \exp(k\lambda). $$

It is not difficult to verify that the set {η(·; λ) : λ ∈ ℝ} is a one-parameter group of transformations, the scale group.
7.2.2 Generating Error Distributions with a Shape Parameter via the Infinitesimal Transformation

Let ξ(·) be a differentiable function defined on ℝ such that ξ(x) is zero for large |x|. It then follows that the differential equation (7.9) has a unique solution η(·; λ) such that η(x; 0) = x, and the set G = {η(·; λ) : λ ∈ ℝ} forms a one-parameter group of transformations. The transformations are generated by the infinitesimal transformation

$$ x + \xi(x)\,\Delta\lambda. $$

The function ξ(·) is approximately equal to the derivative of η at the parameter value yielding the identity transformation:

$$ \eta(x; \Delta\lambda) \approx x + \xi(x)\,\Delta\lambda $$

for small Δλ.
In addition, suppose that we only consider functions ξ(·), yielding a solution η of (7.9), such that condition (7.10) holds for each λ, where γ ∈ {−1, 0, 1}.

Let z be a symmetric continuous error variable with distribution function F_z defined on ℝ that is standardized so that the interval (−1, 1) contains 68.26% of the probability, that is,

$$ F_z(1) - F_z(-1) = 0.6826. \tag{7.11} $$

Then it must be the case that

$$ F_\eta(1) - F_\eta(-1) = 0.6826, $$

where F_η is the distribution function of η. The standardized distribution of η(z; λ) is symmetric and has 68.26% of its probability in the interval (−1, 1) for each λ. So if μ and σ are the median and standard error of η(z; λ), then the interval (μ − σ, μ + σ) will contain 68.26% of the probability for the distribution of η(z; λ) for each λ. Fraser (1976a, 1976b, 1979) used a similar approach to standardize the Student(λ) family.
Thus, given a continuous symmetric error variable z with distribution function F_z which satisfies (7.11), and a differentiable function ξ(·) that vanishes outside some closed interval and yields a solution η(z; λ) of the partial differential equation (7.9) such that (7.10) and η(z; 0) = z hold, continuous symmetric error distributions with a shape parameter can be generated. Examples of functions ξ which satisfy the above conditions are currently being investigated by the author and D.A.S. Fraser.
7.2.3 Power Transformations

Instead of generating a continuous group of transformations via the infinitesimal transformation, a specific group can be specified. By changing the distribution of the error variable z, different families of error distributions with a shape parameter can be generated. In this section we consider the group of power transformations as an example.

Suppose that the group action is defined as η_λ(z) = |z|^λ, z ∈ ℝ, λ > 0, the power transformation of z. It is easy to verify that the set of all power transformations P = {η_λ : λ > 0} forms a commutative group. The composition of two power transformations is given by

$$ \eta_\alpha \circ \eta_\beta(z) = \big(|z|^\beta\big)^\alpha = |z|^{\alpha\beta} = \eta_{\alpha\beta}(z) $$

for α, β > 0. The group identity is η₁(z) = |z|¹ = |z| ∈ P. The inverse of a power transformation η_λ(z) ∈ P is given by the power transformation η_λ⁻¹(z) = |z|^{1/λ} ∈ P. Hence P is a group.
Example 12 This is an exercise in Fraser (1976b). Suppose that z ~ exp(1), so that f(z) = exp(−z), z > 0. Then the power transformation w = η_λ(z) = z^λ, λ > 0, z > 0, has density function

$$ f_w(w) = \frac{1}{\lambda}\, w^{1/\lambda - 1} \exp\big(-w^{1/\lambda}\big), \qquad w > 0, $$

the Weibull(1/λ) distribution.
Example 13 Let z ~ N(0, 1). The power transformation w = η_λ(z) = |z|^λ, λ ≥ 1, z ∈ ℝ, has distribution function

$$ F_w(w) = P\big(|z|^\lambda \le w\big) = 2\Phi\big(w^{1/\lambda}\big) - 1, \qquad w > 0. $$

The density function is then obtained by differentiation:

$$ f_w(w) = \frac{2}{\lambda}\, w^{1/\lambda - 1}\, \phi\big(w^{1/\lambda}\big), \qquad w > 0. $$

The power transformation of a standard normal random variable yields a family of error distributions where the density functions are right-skewed. In particular, when λ = 1 the distribution of w is the standard normal density folded onto (0, ∞), and when λ = 2 the distribution of w is the chi-square with one degree of freedom.
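A Monte Carlo check of Example 13 is immediate: the empirical distribution function of w = |z|^λ should track the claimed F_w(w) = 2Φ(w^{1/λ}) − 1. A sketch:

    lambda <- 3
    w <- abs(rnorm(20000))^lambda

    # Compare the empirical distribution function with the claimed
    # F_w(w) = 2*pnorm(w^(1/lambda)) - 1 at a few points.
    pts <- c(0.1, 0.5, 1, 2, 5)
    cbind(empirical = sapply(pts, function(p) mean(w <= p)),
          claimed   = 2 * pnorm(pts^(1/lambda)) - 1)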
7.2.4 Location-Scale-Shape Analysis

Consider the location-scale-shape model

$$ y = \mu + \sigma w, $$

where the error variable w, with shape parameter λ, is generated from a group transformation acting on another error variable z. The FR method for location-scale-shape models was derived in Chapter 6. Implementation of the FR method requires the tangent directions, at the data, to a second order ancillary. In addition, a first derivative ancillary has the same tangent directions as a second order ancillary. The directions obtained in (6.7) for a general error variable can be described by the vectors v = (v₁, v₂, v₃), where F is a distribution function or some equivalent pivotal quantity. When the error variable is generated from a group of transformations acting on z = η⁻¹((y − μ)/σ; λ), the λ direction vector v₃ involves ρ(z; λ) = ∂η(z; λ)/∂λ.
Chapter 8

Location-Scale-Shape Analysis with Student(λ) Errors

The Student(λ) distribution

$$ f(z; \lambda) = \frac{\Gamma\{(\lambda+1)/2\}}{\Gamma(\lambda/2)\sqrt{\lambda\pi}} \left(1 + \frac{z^2}{\lambda}\right)^{-(\lambda+1)/2} $$

is a family of continuous symmetric densities that range from the thick-tailed Cauchy (λ = 1) up to the normal (λ → ∞). The Student(λ) distribution for small values of λ is often cited as being a more realistic error distribution than the normal. Fraser (1979) showed that for a given value of λ, location-scale analysis with Student(λ) errors is robust, including resistance to outliers. Lange, Little and Taylor (1989) studied the robustness of both linear and non-linear regression models (with the possibility of missing data) assuming that the errors followed a Student(λ) distribution. Lange et al. used first order methods to study the precision of the least squares estimates they obtained. Indeed, the Student(λ) family has been cited in the literature as a reasonable family for the error distribution of a statistical model that combines computational and conceptual simplicity with generality. The FR method applied to the location-scale-shape problem will produce more accurate p-values and confidence intervals than were previously available.
8.1 Asymptotic Location-Scale-Shape Analysis with Student(λ) Errors

In this section we derive the likelihood related quantities needed in order to implement the FR method for the location-scale-shape model with Student(λ) errors.

The log-likelihood function for θ = (μ, σ, λ) is

$$ l(\theta) = n \log c(\lambda) - \frac{n}{2} \log(\lambda\pi) - n \log \sigma - \frac{\lambda + 1}{2} \sum_{i=1}^{n} \log\big(1 + \lambda^{-1} z_i^2\big), $$

where

$$ z_i = \frac{y_i - \mu}{\sigma}, \qquad c(\lambda) = \frac{\Gamma\{(\lambda+1)/2\}}{\Gamma(\lambda/2)}. $$

The maximum likelihood estimator for θ cannot be obtained in closed form; hence an iterative algorithm must be used to obtain the maximum likelihood estimate for any particular data set.
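The log-likelihood above is straightforward to code, and a general-purpose optimizer can then carry out the iterative maximization. A minimal S-Plus/R sketch, offered only as an illustration using nlminb with σ and λ kept on the log scale for positivity; it is not the thesis's C or Lisp-Stat implementation:

    # Negative log-likelihood for theta = (mu, log sigma, log lambda)
    # under the Student(mu, sigma, lambda) model above.
    negloglik <- function(theta, y)
    {
        mu <- theta[1]
        sigma <- exp(theta[2])
        lambda <- exp(theta[3])
        z <- (y - mu)/sigma
        n <- length(y)
        logc <- lgamma((lambda + 1)/2) - lgamma(lambda/2)   # log c(lambda)
        -(n*logc - (n/2)*log(lambda*pi) - n*log(sigma) -
          ((lambda + 1)/2)*sum(log(1 + z^2/lambda)))
    }

    # Hypothetical usage, starting from the median and a
    # half-interquartile-range scale:
    y <- rt(25, 3)
    start <- c(median(y), log(diff(quantile(y, c(0.25, 0.75)))/2), log(5))
    fit <- nlminb(start, negloglik, y = y)
    # S-Plus returns fit$parameters; R returns fit$par.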
The observed information matrix is

$$ j(\theta) = -\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^{\mathsf T}}. $$

The entries of j(θ) are obtained by differentiating l(θ) twice; for example,

$$ j_{\mu\mu} = \frac{\lambda + 1}{\sigma^2} \sum_{i=1}^{n} \frac{\lambda - z_i^2}{(\lambda + z_i^2)^2}. $$
We can obtain expressions for both c′(λ) and c″(λ) in terms of c(λ), the digamma function

$$ \psi(\lambda) = \frac{d \log \Gamma(\lambda)}{d\lambda}, $$

and the trigamma function ψ′(λ). Indeed, the first derivative of c(λ) can be written as

$$ c'(\lambda) = \frac{c(\lambda)}{2} \left[ \psi\!\left(\frac{\lambda+1}{2}\right) - \psi\!\left(\frac{\lambda}{2}\right) \right]. $$

Differentiating again would yield an expression for c″(λ) in terms of c(λ), c′(λ), ψ(λ) and ψ′(λ). Asymptotic formulae are available for both ψ(λ) and ψ′(λ) (see Abramowitz and Stegun 1972). The information matrix can also be found by obtaining the Hessian of −l numerically.
Lange, Little and Taylor (1989) derive the expected information matrix for an elliptically symmetric family of densities. The expected information is block diagonal with the mean component in one block and the scale and shape components in another block. Hence the maximum likelihood estimators of μ and (σ, λ) are asymptotically uncorrelated to the first order. Thus, insofar as first order asymptotics are concerned, treating λ as unknown, or estimating λ from the data and treating it as fixed, should not have much effect on the standard error of μ̂.
The tangent directions to some third order ancillary are given by (6.7), with the numerator of v₃ given by the derivative (8.1) of the distribution function with respect to λ, where

$$ F(z; \lambda) = \int_{-\infty}^{z} \frac{\Gamma\{(\lambda+1)/2\}}{\Gamma(\lambda/2)\sqrt{\lambda\pi}} \big(1 + \lambda^{-1} t^2\big)^{-(\lambda+1)/2}\, dt. $$

Since F(z; λ) is not available in closed form, (8.1) will be approximated using (6.8). The reparameterization (4.19) follows from the general construction. As λ → ∞, clearly ∂F/∂λ → 0 and φ tends to the reparameterization obtained from the N(μ, σ²) model.
The Jacobian of the parameter change is J₃(θ) = ∂φ(θ)/∂θ; its columns, for j = 1, 2, 3, are obtained by differentiating φ with respect to each of μ, σ and λ, with z_i = (y_i − μ)/σ.
Thus with accuracy O(n^{−3/2}) we can use (4.24) or (4.25) with (6.9) from the previous section to obtain the significance function for μ. The significance function for σ can be obtained by similar computations.
8.2 Maximum Likelihood Estimation in the Student (μ, σ, λ) Family

If we assume that the error distribution is the Student(λ) distribution, then in order to implement (4.24) or (4.25) we will need maximum likelihood estimates in the full and constrained models.

Two different algorithms have been used to compute the maximum likelihood estimate: a quasi-Newton method and the EM algorithm. The EM algorithm is implemented in Lisp-Stat (Tierney 1990) and the quasi-Newton method in C.
Let θ^{(k)} be the Newton iterate at step k. The quasi-Newton method computes

$$ \theta^{(k+1)} = \theta^{(k)} + \tau_k\, j^{-1}(\theta^{(k)})\, S(\theta^{(k)}), $$

where j(θ^{(k)}) is the approximation to the observed information at step k and S(θ) is the score vector. The term j^{-1}(θ^{(k)}) S(θ^{(k)}) gives the direction of the current increment and τ_k ∈ (0, 1] gives its length. Taking the full Newton step j^{-1}(θ^{(k)}) S(θ^{(k)}) is not guaranteed to decrease −l sufficiently, but if we move to a point along the direction of the Newton step by decreasing τ_k then we can often decrease −l(θ^{(k+1)}) according to our criterion. For further details see Press, Teukolsky, Vetterling and Flannery (1988) and the references therein.
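The step-halving idea can be sketched in a few lines. A generic damped Newton update, assuming user-supplied functions nll, score and info for −l(θ), S(θ) and j(θ) (all hypothetical names):

    # One damped Newton update: shrink the step until the
    # negative log-likelihood nll decreases.
    newton.step <- function(theta, nll, score, info)
    {
        direction <- solve(info(theta), score(theta))   # j^{-1} S
        tau <- 1
        while (nll(theta + tau * direction) > nll(theta) && tau > 1e-8)
            tau <- tau/2                                # halve the step length
        theta + tau * direction
    }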
The EM algorithm (Dempster, Laird and Rubin 1977) is an algorithm for computing the maximum likelihood estimate from incomplete data. The EM algorithm augments the data (y₁, ..., y_n) with additional data (τ₁, ..., τ_n) such that the maximum likelihood estimate of θ given (y, τ) is easy to compute. Suppose that θ^{(k)} is the value of θ from the kth iteration of the algorithm; the (k + 1)st iteration of EM consists of: (1) E-step: computing the expected value of l(θ; y, τ) with respect to the conditional distribution of τ given (y, θ^{(k)}); (2) M-step: maximizing the resulting function with respect to θ. Maximum likelihood estimation in the Student (μ, σ, λ) family using the EM algorithm is described in great detail in Liu and Rubin (1995).
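For fixed λ the E- and M-steps take the familiar closed form used by Lange, Little and Taylor (1989): the E-step reduces to the weights w_i = (λ + 1)/(λ + z_i²) and the M-step to a weighted mean and scale. A sketch for the location-scale case with λ held fixed; an illustrative helper, not the Lisp-Stat implementation:

    # EM for Student(mu, sigma, lambda) with lambda held fixed.
    em.student <- function(y, lambda, mu = median(y), sigma = 1, niter = 100)
    {
        for (k in 1:niter) {
            z <- (y - mu)/sigma
            w <- (lambda + 1)/(lambda + z^2)       # E-step: conditional weights
            mu <- sum(w * y)/sum(w)                # M-step: weighted mean
            sigma <- sqrt(sum(w * (y - mu)^2)/length(y))   # M-step: weighted scale
        }
        c(mu = mu, sigma = sigma)
    }

    em.student(rt(50, 3), lambda = 3)   # example call on simulated data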
Starting values for θ = (μ, σ, λ) were the median for μ, the interquartile range for σ, and a grid search over various λ values. Personal experience suggests that these starting values work better, especially in simulations, than the starting values suggested by Lange, Little and Taylor (1989) (fitting a normal model by least squares and using the kurtosis of the residuals from the normal fit for the starting value of λ). Newton's method seemed to be very sensitive to starting values, which was not the case for the EM algorithm.
8.3 Modeling Real Data with Student(λ) Errors

In this section one simulated data set and one real data set are analyzed using the location-scale-shape model with Student(λ) errors. Both data sets have one extreme observation, indicating that a normal analysis is inappropriate for these data sets. Three methods are used to obtain confidence intervals for the location parameter μ: the FR method, the signed likelihood ratio statistic, and numerical integration of the exact conditional density. The last method can only be used for a fixed value of the shape parameter λ, since there is no exact conditional density function when λ is assumed to be unknown.
An appropriate definition for the location and scale parameters μ and σ in the Student (μ, σ, λ) family are the median and the standard error (the half-length of the interval, symmetric about the median, that contains 68.26% of the probability) respectively. Correspondingly, the Student (μ, σ, λ) family can be rescaled so that the interval (−1, 1) contains 68.26% of the probability. The scaling factor can be obtained for each λ by solving the equation

$$ P(-1 \le Y \le 1) = 0.6826, $$

where Y = a(λ)Z, Z ~ Student(λ). From symmetry we have that

$$ F_Y(1) = 0.8413, $$

so a(λ) = 1/F⁻¹(0.8413; λ), where F⁻¹ is the inverse cumulative distribution function of the Student(λ) distribution. The Student(λ) distribution has 68.26% probability in (−t_λ, t_λ); some values of t_λ are given below.

[Table: values of t_λ for selected λ.]

These values were obtained from Fraser (1976b, pp. 467).

In this section let μ be the median and σ the standard error, such that (μ ± σ) contains 68.26% of the probability.
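Since F⁻¹(0.8413; λ) is available as the function qt, the rescaling factor is one line in S-Plus or R; for example (using λ = 100 as a stand-in for the normal limit):

    lambda <- c(1, 2, 4, 8, 100)
    t.lam <- qt(0.8413, lambda)   # 68.26% of probability lies in (-t.lam, t.lam)
    a.lam <- 1/t.lam              # scaling: a(lambda)*Z has 68.26% in (-1, 1)
    rbind(lambda, t.lam, a.lam)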
8.3.1 Cauchy Data
The following 10 observations were generated from the Cauchy distribution.
[Figure: histogram of the sample of 10 from the Cauchy distribution.]
The profile likelihood function for λ evaluated at λ = 2^k, k = 0, 1, 2, 3, ∞, is given in the table below.

[Table: profile likelihood of λ.]

Twice the difference between the likelihood of the best fitting Student model and the normal model is 2(l(μ̂, σ̂, 1) − l(μ̂, σ̂, ∞)) = 4.92806 and to the first order follows a chi-square distribution on 1 degree of freedom; P(χ²₁ > 4.92806) = 0.02642, a significant improvement in fit over the normal model.
The following tables record 95% confidence intervals for μ, with λ = 2^k, k = 0, 1, 2, 3, ∞, and unknown (uk), using the FR method, the signed likelihood ratio statistic R, and numerical integration of the exact conditional distribution.

|       | λ = 1              | λ = 2               | λ = 4               |
| Exact | (0.05506, 0.80387) | (−0.15400, 0.81528) | (−0.40106, 0.83428) |
| FR    | (0.05716, 0.80954) | (−0.16298, 0.81498) | (−0.41191, 0.83652) |

|       | λ = 8               | λ = ∞               | λ = uk |
| Exact | (−0.61012, 0.83808) | (−0.87620, 0.83428) | −      |

The maximum likelihood estimates are given in the table below.

[Table: maximum likelihood estimates of θ for each λ.]

As λ increases the location estimate shifts to the left, corresponding to the extreme value −3.02 on the left tail of the distribution; the estimate of the standard error also increases substantially as λ increases. The Student (μ, σ, λ) family, for smaller values of λ, is more resistant to the extreme value on the left tail of the distribution.
Discussion

When the value of λ is fixed the FR method provides a more accurate approximation to the exact answer than R. For instance, when λ = 1 the lower endpoints for the exact, FR, and R methods are 0.055, 0.057, and 0.134 respectively. The approximation using R results in a lower limit that is considerably larger than the exact, overstating the tightness of the confidence intervals, or precision, that is usually achieved by a Student(λ) analysis for small values of λ (Fraser 1976a, 1979; Lange et al. 1989). The FR method yields an interval whose resistance to extreme values is similar to the exact. As λ becomes larger the exact lower endpoint is pulled to the left due to the extreme observation, and the approximation to the lower endpoint using R is adequate, but is still outperformed by the FR method.
When the value of λ is unknown there is no exact method to provide a performance benchmark. Hence, it is difficult to ascertain how well the approximate methods compare to each other. However, when the shape parameter λ is assumed to be unknown we might expect the resulting confidence intervals to be a 'blend' of the fixed λ intervals, since we are averaging over the nuisance parameter distribution to obtain the marginal distribution. Indeed, this is the case for the FR method but not for R. The FR method yields: a lower limit when λ is unknown, −0.01220, that is between the lower limits for λ = 1, 0.05506, and λ = 2, −0.15400; and an upper limit when λ is unknown, 0.83015, that is between λ = 2, 0.81528, and λ = ∞, 0.83428. The first order method using R results in an upper limit, 0.79438, that is less than the smallest exact upper limit; and a lower limit, −0.04360, that is between the lower limits for λ = 1 and λ = 2. Thus, when λ is assumed to be unknown the FR method produces an interval that has lower endpoint falling between the fixed λ = 1 and λ = 2 intervals; smaller values of λ correspond to greater resistance, which is needed due to the extreme value on the left tail of the distribution. The upper endpoint, when λ is unknown, is a blend of the upper endpoints between the λ = 2 and λ = ∞ cases; larger values of λ correspond to less resistance.
8.3.2 Cushny-Peebles Data

Cushny and Peebles (1905) collected data that was analyzed by Student (1908) and later by Fisher (1925). The 10 observations are a measure of improvement under a change in drug therapy.

A histogram of the data is shown below.

[Figure: histogram of the Cushny and Peebles data.]
The profile likelihood function of λ evaluated at λ = 2^k, k = 0, 1, 2, 3, ∞, is given in the table below.

[Table: profile likelihood of λ.]

Twice the difference between the likelihood of the best fitting Student model and the normal model is 2(l(μ̂, σ̂, 1) − l(μ̂, σ̂, ∞)) = 4.69984 and to the first order follows a chi-square distribution on 1 degree of freedom; P(χ²₁ > 4.69984) = 0.03017, a significant improvement in fit over the normal model.
As in the previous example, 95% confidence intervals for the location parameter μ are recorded using the three methods discussed previously.

|       | λ = 1              | λ = 2              | λ = 4             |
| Exact | (0.89932, 1.66168) | (0.82153, 1.83282) | (0.76319, 2.0273) |
The maximum likelihood estimates of θ are recorded in the table below.

[Table: maximum likelihood estimates of θ for each λ.]
|       | λ = 8              | λ = ∞              | λ = uk             |
| Exact | (0.72429, 2.21011) | (0.70484, 2.45516) | −                  |
| FR    | (0.71810, 2.19654) | (0.70735, 2.45265) | (0.83020, 1.71193) |
The location estimate shifts to the right as λ becomes large, corresponding to the extreme value on the right tail of the distribution. Moreover, the standard error becomes larger as λ increases due to the increasing influence of the extreme value.
Discussion

The confidence intervals shift to the left as λ becomes smaller, providing higher resistance to the extreme value 4.6 on the right tail of the distribution. Both approximate methods preserve this behavior, although since the FR method is more accurate, the magnitude of the shift is almost identical to the exact.

The FR method provides a much better approximation to the exact, for fixed λ, than the first order R. Indeed, the inaccuracy of the first order R produces intervals that are tighter than the exact, overstating the precision of the estimate for μ. For example, when λ = 4 the first order interval is (0.80910, 1.94882), compared to the exact interval (0.76319, 2.0273) and the FR interval (0.76950, 2.02572).

When λ is assumed to be unknown, as mentioned in the previous example, there is no exact answer to use as the gold standard. However, the FR method produces an interval where the endpoints fall between the endpoints of the λ = 1 and λ = 2 cases. The first order method R produces an interval whose endpoints fall outside the smallest exact lower endpoint and the largest exact upper endpoint.
8.4 Simulation Study

8.4.1 Purpose

To gain a better understanding of how the third order FR method compares to the first order signed likelihood ratio statistic in repeated sampling, a simulation study of 100,000 trials was designed and implemented. The major research interest is to understand how each method behaves, for a small sample size, when the shape parameter λ is estimated from the data or fixed a priori, across different populations.
8.4.2 Methods

The algorithm used is described in the following pseudocode:

    1. for numsim = 1 to N
    2.   let u = generate_uniform(n)
    3.   for pop = 1 to P
    4.     let sample = inverse_cdf(u, pop)
    5.     for lam_method = 1 to 2
    6.       let
    7.         first_order = normal_cdf(r(lam_method, sample))
    8.         third_order = normal_cdf(rstar(lam_method, sample))
A sample of size n is generated from the uniform distribution on [0, 1]; for each of the P populations a sample is generated by computing the inverse probability transform inverse_cdf; finally, for fixed (lam_method = 1) and estimated (lam_method = 2) λ, a first order p-value is calculated using the signed likelihood ratio and a third order p-value is calculated using the FR method. The procedure is repeated N times.

The simulation was carried out on a sample size of 10. Two distributions were considered, the Student(6) and the N(0, 1), the latter corresponding to the λ = ∞ case in the Student(λ) family.
8.4.3 Results

The tables below summarize the results for 100,000 simulations. The observed p-values for testing μ = 0 were computed by the two approximate methods for estimated and fixed λ; the percentages of one-sided p-values less than the nominal 0.5%, 1.5%, 2.5%, 5%, 95%, 97.5%, 98.5%, and 99.5% were recorded in the tables below. Two types of Q-Q plots are also included at the end of the section: a detrended Q-Q plot and a standard Q-Q plot. The detrended Q-Q plot has the ordered quantiles of the theoretical distribution on the horizontal axis and the residuals from regressing the observed quantiles on the theoretical quantiles on the vertical axis; the line y = 0 is superimposed on the scatterplot to aid in the assessment of the fit. The S-Plus code used in constructing the detrended Q-Q plots for R* and R is:
detrended <- function(data)
{
    z <- qqnorm(data, plot = F)   # normal Q-Q coordinates, no plot
    # Residuals from a resistant regression of observed on
    # theoretical quantiles.
    plot(z$x, ltsreg(z$x, z$y)$residuals,
         cex = 0.5, xlab = "", ylab = "", axes = F)
    abline(h = 0)                 # draw horizontal line
}

The titles and axes were added after the plot was constructed using the function axes(). The standard Q-Q plots were obtained by using the two standard S-Plus functions qqnorm() and qqline() (Venables and Ripley 1994).
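For example, the detrended function above can be exercised on simulated normal deviates, with the annotation added afterwards (title() is used here for illustration):

    z <- rnorm(1000)
    detrended(z)   # residual Q-Q plot; should hug the line y = 0
    title(main = "Detrended Q-Q Plot", xlab = "Quantiles of Standard Normal")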
The Anderson-Darling A² test was used as a goodness-of-fit test in assessing the normality of R and R*. D'Agostino and Stephens (1986, pp. 406) include the A² test among their recommended tests for checking the normality of data. The S-Plus program for computing the A² test statistic is:
ad <- function(x)
{
    x1 <- sort(x)
    z <- (x1 - mean(x1))/sqrt(var(x1))   # standardize the ordered sample
    p <- pnorm(z)
    n <- length(x)
    res <- 0
    for (i in 1:n) {
        res[i] <- ((-1) * (2*i - 1) * (log(p[i]) + log(1 - p[n + 1 - i])))/n
    }
    (sum(res) - n) * (1 + 0.75/n + 2.25/n^2)   # modified A-squared statistic
}
The null hypothesis of the A² test, normality, is rejected at level α if the observed value of the test statistic falls into the critical region. Some of the critical values for the A² test are given in the following table (D'Agostino and Stephens 1986, pp. 373).

| α    | Critical Value of A² Test |
| 0.01 | 1.035                     |
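As a quick check of the ad function, the statistic can be computed for a normal and a heavy-tailed simulated sample and compared with the 1% critical value 1.035:

    ad(rnorm(100))   # typically well below 1.035: normality not rejected
    ad(rt(100, 1))   # Cauchy sample: typically far above 1.035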
Tables

[Tables of observed p-value percentages for the Student(6) with estimated λ and with fixed λ.]

Observed value of A², estimated λ:

|    | Student(6)   | N(0, 1)      |
| R* | 0.16209 (NS) | 0.42675 (NS) |
| R  | 1.33768      | 0.51522 (NS) |

[Table of observed values of A², fixed λ.]
[Figures: standard and detrended Q-Q plots of the 100,000 simulations of R* and R against the quantiles of the standard normal: Student(6) with λ estimated (free) and with λ = 6; N(0, 1) with λ estimated (free) and with λ = ∞.]
8.4.4 Simulation Conclusions

When λ is estimated, the tables of p-values indicate that the distribution for the third order FR method is much closer to the theoretical uniform distribution than the first order signed likelihood ratio statistic R, for the Student(6). The Q-Q plots indicate non-normality, and the A² test rejects the null hypothesis of normality, for the first order method when λ is estimated from the data for the Student(6). For fixed λ the FR method performs much better than R when the data are generated from the Student(6). When the data are generated from the N(0, 1) both the third order and first order methods produce satisfactory results.
Chapter 9

Conclusion

A proof of the asymptotic normality of the posterior distribution of a vector parameter, based on a sample from a stochastic process, with respect to either a proper or improper prior is given. The limiting distribution provides a means of approximating the significance level of an interest parameter to the first order. When the sample size is small, first order asymptotic methods can be very inaccurate. For small samples in a general statistical model the third order FR method outperforms the standard first order methods. A method is introduced to generate error distributions with a shape parameter, and the FR method is applied to the location-scale-shape model. Techniques are presented for numerical computation of all quantities required for the FR method. Numerical examples and simulations are included when the error distribution is the Student(λ).

A generalization in one direction of the posterior asymptotic normality proof is to obtain the limiting distribution while the number of nuisance parameters is allowed to grow at, say, a rate slower than the number of observations from the stochastic process. The solution to this problem would have applications to inference in branching processes.
A different but related question is: suppose that the parameter space is infinite dimensional, ℝ^∞ for example; can the proof be modified to obtain the limiting distribution of the posterior distribution? What seems to be required to answer this is: a norm on the infinite dimensional parameter space that satisfies ‖Tx‖ ≤ ‖T‖·‖x‖, where x belongs to the parameter space and T is a linear transformation on the parameter space; and a replacement for the observed Fisher information in assumption 2 (Chapter 3) by another type of information measure that can be defined on an infinite dimensional space. Research is currently under way to find alternative definitions of information that would lead to posterior asymptotic normality.
In Chapter 7 a method was introduced for generating error distributions with a shape parameter by using infinitesimal transformations to construct a one-parameter group of continuous transformations. It is not obvious which functions ξ(·) would yield a symmetric family of error distributions with a shape parameter that satisfies (7.10). Research is currently under way in this area.
Bibliography
[1] Abramowitz, M. and Stegun, I. (1972). Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. Wiley, New York.

[2] Arnold, V.I. (1973). Ordinary Differential Equations. MIT, Cambridge.

[3] Barndorff-Nielsen, O.E. (1980). Conditionality resolutions. Biometrika 67, 293-310.

[4] Barndorff-Nielsen, O.E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-65.

[5] Barndorff-Nielsen, O.E. (1986). Inference on full and partial parameters based on the standardized signed log likelihood ratio. Biometrika 73, 307-22.

[6] Barndorff-Nielsen, O.E. (1990). Approximate interval probabilities. J. R. Statist. Soc. B 52, 485-96.

[7] Barndorff-Nielsen, O.E. (1991). Modified signed likelihood ratio. Biometrika 78, 557-63.

[8] Barndorff-Nielsen, O.E. and Cox, D.R. (1979). Edgeworth and saddlepoint approximations with statistical applications. J. R. Statist. Soc. B 41, 279-312.

[9] Barndorff-Nielsen, O.E. and Cox, D.R. (1989). Asymptotic Techniques in Statistics. Chapman-Hall, London.

[10] Barndorff-Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. Chapman-Hall, London.
[11] Blaesild, P. and Jensen, J.L. (1985). Saddlepoint formulas for reproductive exponential models. Scand. J. Statist. 12, 193-202.

[12] Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, Mass.

[13] Brenner, D., Fraser, D.A.S., and McDunnough, P. (1982). On asymptotic normality of likelihood and conditional analysis. Can. J. Statist. 10, 163-72.

[14] Cakmak, S., Fraser, D.A.S., and Reid, N. (1994). Multivariate asymptotic model, exponential and location type approximations. Utilitas Math. 46, 21-31.

[15] Campbell, J.E. (1903). Introductory Treatise on Lie's Theory of Finite Continuous Transformation Groups. Oxford, London.

[16] Cheah, P.K., Fraser, D.A.S., and Reid, N. (1995). Adjustments to likelihoods and densities; calculating significance. J. Statist. Res. 29, 1-13.

[17] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.

[18] Cushny, A.R. and Peebles, A.R. (1905). The action of optical isomers II. Hyoscines. J. Physiol. 32, 501-10.

[19] D'Agostino, R.B. and Stephens, M.A. (1986). Goodness-of-Fit Techniques. Marcel Dekker, New York.

[20] Daniels, H.E. (1954). Saddlepoint approximations in statistics. Ann. Math. Statist. 25, 631-50.

[21] Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.

[22] DiCiccio, T.J. and Martin, M.A. (1991). Approximations of marginal tail probabilities for a class of smooth functions with applications to Bayesian and conditional inference. Biometrika 78, 891-902.

[23] Dickey, J.M. (1968). Three multidimensional integral identities with Bayesian applications. Ann. Math. Statist. 39, 1615-28.
[24] Eisenhart, L.P. (1961). Continuous Groups of Transformations. Dover, New York.

[25] Feller, W. (1971). An Introduction to Probability Theory and its Applications 2, 2nd ed. Wiley, New York.

[26] Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, London.

[27] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proc. Roy. Soc. London Ser. A 144, 285-307.

[28] Folland, G.B. (1984). Real Analysis: Modern Techniques and Their Applications. Wiley, New York.

[29] Fraser, D.A.S. (1961). The fiducial method and invariance. Biometrika 48, 261-80.

[30] Fraser, D.A.S. (1964). Local conditional sufficiency. J. R. Statist. Soc. B 26, 52-62.

[31] Fraser, D.A.S. (1968). The Structure of Inference. Wiley, New York.

[32] Fraser, D.A.S. (1976a). Necessary analysis and adaptive inference (with discussion). J. Amer. Statist. Assoc. 71, 99-113.

[33] Fraser, D.A.S. (1976b). Probability and Statistics. DAI Press, Toronto.

[34] Fraser, D.A.S. (1979). Inference and Linear Models. McGraw Hill, New York.

[35] Fraser, D.A.S. (1988). Structural inference. In Encyclopedia of Statistical Sciences, Volume 9 (eds S. Kotz, N.L. Johnson). Wiley, New York.

[36] Fraser, D.A.S. (1990). Tail probabilities from observed likelihood. Biometrika 77, 65-76.

[37] Fraser, D.A.S. (1991). Statistical inference: likelihood to significance. J. Amer. Statist. Assoc. 86, 258-65.

[38] Fraser, D.A.S. and McKay, J. (1975). Parameter factorization and inference based on significance, likelihood, and objective posterior. Ann. Statist. 3, 559-72.

[39] Fraser, D.A.S. and McDunnough, P. (1984). Further remarks on asymptotic normality of likelihood and conditional analysis. Can. J. Statist. 12, 183-90.

[40] Fraser, D.A.S., McDunnough, P. and Taback, N. (1997). Improper priors, posterior asymptotic normality, and conditional inference. In Advances in the Theory and Practice of Statistics: A Volume in Honor of Samuel Kotz (eds N.L. Johnson and N. Balakrishnan). Wiley, New York.
[41] Fraser, D.A.S., Monette, G., Ng, K.W., and Wong, A. (1995). Higher order approximations with general linear models. Proceedings of the Symposium on Multivariate Analysis, Hong Kong.

[42] Fraser, D.A.S. and Reid, N. (1990). From multiparameter likelihood to tail probability for a scalar parameter. Technical Report No. 9003, University of Toronto.

[43] Fraser, D.A.S. and Reid, N. (1995). Ancillaries and third order significance. Utilitas Math. 47, 33-53.

[44] Heyde, C.C. and Johnstone, I.M. (1979). On asymptotic posterior normality for stochastic processes. J. R. Statist. Soc. B 41, 184-89.

[45] Jensen, J.L. (1995). Saddlepoint Approximations in Statistics. Oxford University Press, New York.

[46] Johnstone, I.M. (1978). Problems in limit theory for martingales and posterior distributions from stochastic processes. M.Sc. thesis, Australian National University.

[47] Kolassa, J.E. (1994). Series Approximation Methods in Statistics. Springer-Verlag, New York.

[48] Lange, K.L., Little, J.A. and Taylor, J.M.G. (1989). Robust statistical modelling using the t distribution. J. Amer. Statist. Assoc. 84, 881-96.

[49] Lehmann, E.L. (1991). Theory of Point Estimation. 2nd ed. Wadsworth, Belmont.

[50] Little, J.A. (1988). Robust estimation of the mean and covariance matrix with missing values. Appl. Statist. 37, 23-38.

[51] Liu, C. and Rubin, D.B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statist. Sinica 5, 19-39.

[52] Lugannani, R. and Rice, S.O. (1980). Saddlepoint approximation for the distribution of the sums of independent random variables. Adv. Appl. Prob. 12, 475-90.

[53] Marsden, J.E. and Tromba, A.J. (1988). Vector Calculus. Freeman, New York.

[54] McCullagh, P. (1987). Tensor Methods in Statistics. Chapman-Hall, London.

[55] Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. (1988). Numerical Recipes in C. Cambridge University Press, Cambridge.

[56] Reid, N. (1988). Saddlepoint methods and statistical inference (with discussion). Statist. Sci. 3, 213-27.

[57] Reid, N. (1995a). Likelihood and higher order approximations to tail areas: a review and annotated bibliography. Can. J. Statist. 24, 141-66.

[58] Reid, N. (1995b). The roles of conditioning in inference (with discussion). Statist. Sci. 10, 138-57.
[59] Rogers, W.H. and Tukey, J.W. (1962). Understanding some long-tailed distributions. Statistica Neerlandica 26, 211-26.

[60] Scheffé, H. (1947). A useful convergence theorem for probability distributions. Ann. Math. Statist. 18, 434-38.

[61] Skovgaard, I.M. (1990). On the density of minimum contrast estimators. Ann. Statist. 18, 779-89.

[62] Sweeting, T.J. and Adekola, O.A. (1987). Asymptotic posterior normality for stochastic processes revisited. J. R. Statist. Soc. B 49, 215-22.

[63] Tierney, L. (1990). Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. Wiley, New York.

[64] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20, 595-601.

[65] Walker, A.M. (1969). On the asymptotic behavior of posterior distributions. J. R. Statist. Soc. B 31, 80-88.

[66] Venables, W.N. and Ripley, B.D. (1994). Modern Applied Statistics with S-Plus. Springer-Verlag, New York.

[67] Von Montfort, M.A.J. and Otten, A. (1978). On testing a shape parameter in the presence of a location and a scale parameter. Math. Op. Statist. 9, 91-104.