Likelihood Asymptotics and Location-Scale-Shape Analysis
Nathan Asher Taback
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Statistics, University of Toronto
© Copyright by Nathan Taback 1998
National Library of Canada, Acquisitions and Bibliographic Services, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Likelihood Asymptotics and Location-Scale-Shape Analysis
Nathan Asher Taback
Doctor of Philosophy 1998
Department of Statistics
University of Toronto
Abstract
Regularity conditions are presented and a rigorous proof is given showing that the posterior distribution or normalized likelihood function, based on observations from a stochastic process, of a vector parameter with respect to either a proper or improper prior converges, almost surely, in distribution to the normal distribution. These conditions bear a strong resemblance to Wald's conditions under which the maximum likelihood estimator is consistent (Wald 1949).

A method is presented for generating error distributions with a shape parameter that, for example, can be used in location-scale-shape models. An error distribution that does not depend on a shape parameter can be transformed via a one-parameter continuous group of transformations, constructed using an infinitesimal transformation, into a family of error distributions indexed by a shape parameter.

A third order approximation to the significance level in testing either the location or scale parameter without any prior information concerning the remaining parameters in a location-scale-shape model is included. The third order approximation is developed via the asymptotic method, based on exponential models and the saddlepoint approximation, presented in Fraser and Reid (1995). Techniques are presented in the thesis for numerical computation of all quantities required for the third order approximation. To compare the accuracy of various asymptotic methods, numerical examples and simulations are included when the error distribution is the Student($\lambda$) family. Finally, possible extensions are suggested.
Acknowledgements

I would like to thank my supervisor, Professor D.A.S. Fraser, for his guidance, support, and patience.

I appreciate the conversations I had with Professors D. Brenner, K. Knight, P. McDunnough, and N. Reid while this thesis was being written.

I would like to thank Alison Gibbs who provided useful comments and criticisms while I was preparing my dissertation.

Financial support from the Government of Ontario and the Department of Statistics, University of Toronto is appreciated.

This thesis is dedicated to my wife Monika. Her love and support greatly facilitated the writing and completion of this thesis.
Contents
1 Introduction

2 Statistical Inference
  2.1 Definition of the Likelihood Function & Related Quantities
  2.2 Statistical Significance
  2.3 Bayesian Inference
  2.4 Inference In Structural Models
  2.5 The Relationship Between Bayesian and Structural Inference

3 First Order Likelihood Asymptotics
  3.1 Likelihood For The Location Normal
  3.2 First Order Tests and Confidence Intervals Based on Likelihood
    3.2.1 The Limiting Quadratic Shape of the Likelihood Function
  3.3 Appendix

4 Higher Order Likelihood Asymptotics
  4.1 The Edgeworth Expansion
  4.2 The Saddlepoint Expansion Via The Edgeworth Expansion
  4.3 Inference For A Scalar Parameter In the Presence Of Nuisance Parameters
    4.3.1 First Derivative Ancillary
    4.3.2 Approximating Exponential Model
    4.3.3 Marginal Density For A Scalar Interest Parameter
    4.3.4 First-Order Significance Function For A Scalar Interest Parameter
    4.3.5 Third-Order Significance Function For A Scalar Interest Parameter

5 Improper Priors, Posterior Asymptotic Normality, and Conditional Inference
  5.1 Assumptions
  5.2 Main Result
    5.2.1 Example
  5.3 Appendix

6 Inference In Location-Scale-Shape Models with Small Samples
  6.1 Conditional Inference For Location-Scale-Shape Analysis
    6.1.1 Inference For Shape
    6.1.2 Inference For Location
    6.1.3 Inference for Scale
  6.2 Asymptotic Location-Scale-Shape Analysis
    6.2.1 Linear Regression with A Shape Parameter

7 Analysis of Error Models with a Shape Parameter
  7.1 Some Familiar Error Distributions with a Shape Parameter
    7.1.1 Student ($\mu, \sigma, \lambda$) Family
    7.1.2 Slash Distributions
    7.1.3 Contaminated Normal
    7.1.4 Exponential Power Family
  7.2 Generating Error Models
    7.2.1 The Construction of a Continuous Group using Infinitesimal Transformations
    7.2.2 Generating Error Distributions with a Shape Parameter via the Infinitesimal Transformation
    7.2.3 Power Transformations

8 Location-Scale-Shape Analysis with Student ($\lambda$) Errors
  8.1 Asymptotic Location-Scale-Shape Analysis with Student ($\lambda$) Errors
  8.2 Maximum Likelihood Estimation in the Student ($\mu, \sigma, \lambda$) Family
  8.3 Modeling Real Data With Student ($\lambda$) Errors
    8.3.1 Cauchy Data
    8.3.2 Cushny-Peebles Data
  8.4 Simulation Study
    8.4.1 Purpose
    8.4.2 Methods
    8.4.3 Results
    8.4.4 Simulation Conclusions

9 Conclusion
Chapter 1
Introduction
Testing a parameter in a statistical model is often difficult to carry out if the tail area corresponding to the magnitude of departure from a null hypothesis involves an arduous computation or, even worse, is infeasible. In these cases it is appropriate to develop an asymptotic procedure to approximate the tail area. When the parameter of interest is scalar and nuisance parameters are present there are three common first order asymptotic procedures available based on the likelihood function: the likelihood ratio statistic, the score statistic, and the Wald statistic. These are valid approximations provided that the sample size is large and the statistical model satisfies regularity conditions ensuring that the normalized likelihood function or posterior distribution, almost surely, converges in distribution to the normal distribution; see Fraser, McDunnough, and Taback (1997).
The major weakness of the aforementioned first order procedures, especially the score and Wald methods, is that when the sample size is small the approximations can be very inaccurate. More accurate methods for approximating tail areas, when the sample size is small, have been developed recently. The saddlepoint approximation has produced formulas of remarkable practical accuracy for exponential models. For general statistical models third order accurate tail area formulae are available (Barndorff-Nielsen and Cox, 1994; Fraser and Reid, 1995), although their implementation usually requires the specification of an approximate ancillary statistic that complements the maximum likelihood estimator, hence limiting their scope of application to either exponential or transformation models. One way to circumvent this problem, proposed by Fraser and Reid (1995), is to approximate a general statistical model by an exponential model and apply the saddlepoint approximation to the approximating exponential model. Moreover, Fraser and Reid show that the only information needed concerning the ancillary statistic is its tangent directions at the observed data.
An important statistical model that is neither a transformation model nor an exponential model is the location-scale-shape model. This is an ordinary location-scale model except that the error distribution includes a parameter to describe the shape of the error distribution. For instance, the degrees of freedom parameter in the Student distribution can be regarded as a shape parameter. These models are useful for real data sets where, for example, the assumption of normality is often violated due to there being substantially more probability in the tails of the distribution. The significance level for a scalar interest parameter, either location or scale, can be approximated using the asymptotic methods outlined above.
This thesis has three objectives: the first is to provide regularity conditions and a rigorous proof that the posterior distribution or normalized likelihood function, based on observations from a stochastic process, of a vector parameter with respect to either a proper or improper prior converges, almost surely, in distribution to the normal distribution; the second is to present a method for generating error distributions with a shape parameter that, for example, can be used in location-scale-shape models; and finally the third objective is to present a third order approximation to the significance level in testing either the location or scale parameter without any prior information concerning the remaining parameters in the location-scale-shape model. For the first objective regularity conditions are set forth under which it is proven that the normalized likelihood function is asymptotically normal. These conditions bear a strong resemblance to Wald's conditions under which the maximum likelihood estimator is consistent (Wald 1949). An error distribution that does not depend on a shape parameter can be transformed via a one-parameter continuous group of transformations, constructed using an infinitesimal transformation, into a family of error distributions indexed by a shape parameter. Finally, a third order approximation to the significance level for a scalar interest parameter in location-scale-shape models is developed via the asymptotic method, based on exponential models and the saddlepoint approximation, presented in Fraser and Reid (1995). Techniques are presented in the thesis to numerically compute all quantities required for the approximation. In particular, when the error distribution is the Student($\lambda$) family, software is available from the author for computing the significance level of a scalar interest parameter using the third order method presented in Fraser and Reid.
Chapter 2 is a review of some concepts and notation used in likelihood based inference, including a review of inference in structural models.

Chapter 3 contains some background material on first order tests and confidence intervals based on the observed likelihood function.

Chapter 4 is a review of higher order likelihood asymptotics including the derivation of a third order approximation to the significance level for a scalar interest parameter.

In Chapter 5 regularity conditions are given under which a rigorous proof that the posterior distribution, based on observations from a stochastic process, of a vector parameter with respect to either a proper or improper prior converges, almost surely, in distribution to the normal distribution is presented.

Chapter 6 includes a derivation of a third order approximation to the significance level of a scalar parameter in location-scale-shape models. Moreover, a review of conditional inference in location-scale-shape models is given.

Chapter 7 contains a method for generating error distributions with a shape parameter by constructing a one-parameter group of transformations, and an analysis of some common error distributions with a shape parameter.

Chapter 8 is devoted to location-scale-shape analysis when the error distribution is Student($\lambda$). The derivations of all the formulae required to implement the third order approximation to the significance level of a scalar parameter are provided. Numerical examples are included to compare the accuracies of various methods. A large simulation study to assess the difference, in repeated sampling, between first order and third order methods was carried out using Student($\lambda$) errors.

Chapter 9 contains conclusions and suggestions for extensions and further research.
Chapter 2
Statistical Inference
This chapter reviews some key concepts and definitions in parametric statistical inference. In the first section the likelihood function is defined and notation is introduced for various derivatives of the likelihood function. The second section contains the definition, that will be used throughout this thesis, of the significance function. The remaining sections discuss Bayesian inference, inference in the structural model, and the relation between the two.
2.1 Definition of the Likelihood Function & Related Quantities
Let $\{f(\cdot\,; \theta) : \theta \in \Omega\}$ be a statistical model and $y = (y_1, \ldots, y_n)$ a sample from $f$. The likelihood function from the observed response value $y$ is defined as
$$L_n(\theta) = L(\theta; y) = c\, f(y; \theta),$$
where $c \in (0, \infty)$ is an arbitrary constant. It is often more convenient to work with the log-likelihood function
$$l_n(\theta) = l(\theta; y) = a + \log f(y; \theta),$$
where $a \in (-\infty, \infty)$ is an arbitrary constant. We will often suppress the data $y$, parameter $\theta$, or subscript $n$ and refer to $l$ as the likelihood function when there is no ambiguity in doing so.
If $L_n(\theta)$ is differentiable with respect to a parameter $\theta$ then the score function is defined as
$$l'(\theta) = \frac{\partial}{\partial \theta}\, l_n(\theta; y).$$
If $L_n(\theta)$ is twice continuously differentiable then the observed Fisher information is defined as
$$j(\theta) = -l''(\theta) = -\frac{\partial^2}{\partial \theta\, \partial \theta'}\, l_n(\theta; y).$$
The expected Fisher information is defined as
$$i(\theta) = \mathrm{Var}_\theta\!\left(l'(\theta)\right).$$
Under certain regularity conditions (Lehmann 1991, p. 118) an alternative expression for the expected information is given by the expression
$$i(\theta) = E_\theta\!\left(j(\theta)\right).$$
Suppose that $\theta = (\lambda, \psi) \in \Re^{p-1} \times \Re$ is a partitioned parameter with maximum likelihood estimator $\hat\theta = (\hat\lambda, \hat\psi)$ and constrained maximum likelihood estimator $\hat\theta_\psi = (\hat\lambda_\psi, \psi)$ for a particular value of $\psi$. We will often find it convenient to write the observed information matrix as a block matrix
$$j(\theta) = \begin{pmatrix} j_{\lambda\lambda}(\theta) & j_{\lambda\psi}(\theta) \\ j_{\psi\lambda}(\theta) & j_{\psi\psi}(\theta) \end{pmatrix},$$
with inverse
$$j^{-1}(\theta) = \begin{pmatrix} j^{\lambda\lambda}(\theta) & j^{\lambda\psi}(\theta) \\ j^{\psi\lambda}(\theta) & j^{\psi\psi}(\theta) \end{pmatrix}.$$
We will sometimes have occasion to differentiate the likelihood function with respect to both the parameter and the data. These derivatives will be denoted by
$$l_{;y}(\theta; y) = \frac{\partial}{\partial y}\, l(\theta; y), \qquad l_{\theta;y}(\theta; y) = \frac{\partial^2}{\partial \theta\, \partial y}\, l(\theta; y).$$
If $V = \{v_1, \ldots, v_p\}$ is a set of $p$ vectors such that $v_i \in \Re^n$ then the likelihood gradient in the direction of $V$ is given by the $1 \times p$ row vector
$$l_{;V}(\theta) = \frac{d}{dV}\, l(\theta; y) = l_{;y}(\theta; y)\, V.$$
2.2 Statistical Significance
Let $y_1, \ldots, y_n$ be a sample from a distribution with density $f(y; \theta)$ where $\theta \in \Omega \subset \Re$.

Definition 1. The significance function $p : \Omega \to [0, 1]$ is taken to be
$$p(\theta) = P\!\left(\hat\theta \le \hat\theta^{0};\ \theta\right),$$
the probability, under $\theta$, of a value of a suitable statistic such as the maximum likelihood estimator at or below its observed value $\hat\theta^{0}$.

$p(\theta)$ is often called the confidence distribution function since all possible confidence intervals are obtained by inverting $p(\theta)$. A $100(1 - \alpha)\%$ confidence interval for $\theta$ is
$$\left(p^{-1}(1 - \alpha/2),\ p^{-1}(\alpha/2)\right).$$
This definition of the significance function is discussed in Fraser (1991).
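As a concrete illustration (mine, not the thesis's), the following sketch computes and inverts a significance function for the mean $\theta$ of an exponential sample, where $p(\theta) = P(\sum Y_i \le \sum y_i^{0};\ \theta)$ has a closed Gamma form; the data are invented:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Significance function for the mean theta of an exponential sample:
# p(theta) = P(sum(Y) <= sum(y_obs); theta), with sum(Y) ~ Gamma(n, scale=theta).
y = np.array([0.7, 1.9, 0.4, 2.8, 1.1])   # illustrative data
n, s = len(y), y.sum()

def p(theta):
    return stats.gamma.cdf(s, a=n, scale=theta)

# p is decreasing in theta, so inverting at 0.975 and 0.025 gives a central 95% CI.
lo = brentq(lambda t: p(t) - 0.975, 1e-6, 100.0)
hi = brentq(lambda t: p(t) - 0.025, 1e-6, 100.0)
print(f"95% confidence interval for theta: ({lo:.3f}, {hi:.3f})")
```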
2.3 Bayesian Inference
Suppose that in addition to the likelihood function it is assumed that the parameter 0
is a reslization of the random variable 8 having density function n (8) du ( O ) , the prior
distribution of 8. Moreover, if the data y = (yl, ...,y,) is assumed to have been generated
from the conditional density of y given 8, f (y$) , then by Bayes theorem we have that
the conditional distribution of 8 given y has density function
feIY (Oly) is called the posterior distribution of 0 and can be used to make probability
statements about the parameter 8 in light of the data y.
2.4 Inference In Structural Models
A structural model is an ordinary statistical model where the error variable z is formdy
taken to be the source of variation for the response variable y. In addition the S ~ N C ~ U ~
model provides a transformation B that presents the response variable in terms of the
error variable.
The structural model provides a more detailed description of the physical process un-
der investigation than the ordinary statisticd model. In fad, "this more detailed model
predicates certain inference methods of analysis" (Raser 1988). Narnely, inference for a
parameter shouid be based on the appropriate conditional distribution, where the con-
ditioning is based on an ancillary statistic. Indeed, the conditioning is seen as necessary
in the context of a structural model (Raser 1979).
Suppose that z has density fimction f (*) on 31" and let 0 : 9t" -r !Rn belong to 0, a
group of transformations. The stmctural equation (Raser 1968, ch.2) is given by
where y = (y*, ... , y,) , a = (zi , . .. , h) . As an illustration consider the location-scale
model y =pl + oz, where z is N (O, 1) and B = (p , O ) . In anticipation of presenting the
relationship between the iikelihood function and the conditional-inference distribution
the following assumptions are introduced (Fraser 1968).
Assumption 1. $\Omega$ is an open subset of $\Re^k$; the transformations $\theta = \theta_1\theta_2$ and $y = \theta_1\theta_2 z$ are continuously differentiable in $\theta_1$, $\theta_2$, $z$.

Assumption 2. There exists a continuously differentiable function $\bar\theta_n(y)$ taking values in $\Omega$ such that
$$\bar\theta_n(\theta y) = \theta\, \bar\theta_n(y) \quad \text{for all } \theta \in \Omega.$$

Let $D_n(y) = \bar\theta_n^{-1}(y)\, y$, so that
$$D_n(\theta y) = D_n(y) \quad \text{for all } \theta \in \Omega.$$
Moreover we have that
$$D_n(y) = D_n(z)$$
from the equation $y = \theta z$.

Define the sample space Jacobian to be
$$J_n(y) = \left|\frac{\partial\,(\theta D)}{\partial D}\right|_{\theta = \bar\theta_n(y),\ D = D_n(y)}. \tag{2.5}$$
The sample space Jacobian can be written as
$$J_n(\theta z) = \left|\frac{\partial\, \bar\theta_n(\theta z)\, D_n(\theta z)}{\partial\,(\theta z)}\right|^{-1}.$$
In a similar way we can define a Jacobian on the group to be
$$J_L(g : h) = \left|\frac{\partial\,(g h)}{\partial h}\right|, \qquad J_L(g) = J_L(g : i), \tag{2.6}$$
for any $g, h \in \Omega$, where $i$ is the identity element of $\Omega$.
The density of $z$ is $f(z)\, dz$. By the change of variables formula we obtain the density of $y$, and hence the likelihood function of $\theta$, which we label (2.7). Fraser (1968, 1979) argues that inference about $\theta$ should be based on the conditional distribution of $\bar\theta_n(y)$ given $D_n(y) = D_n$, the conditional-inference distribution (2.8), a density function with respect to Lebesgue measure on $\Omega$. As in Brenner, Fraser and McDunnough (1982) we record the density of the pivotal quantity $h = \bar\theta_n(z) = \theta^{-1}\bar\theta_n(y)$ to illustrate the relationship between (2.8) and (2.7): this is a density with respect to left-invariant measure on $\Omega$. If we consider $h$ as a function of $\theta$ we obtain (2.7), but as a function of $\bar\theta_n(y)$ we obtain (2.8). The close connection between the conditional-inference distribution and the likelihood function was first presented by Fisher (1934) for location and location-scale models.
Example 2. Let $\Omega = \{[a, c] : a \in \Re,\ c \in \Re^+\}$ be the location-scale group on $\Re^n$, where the group action is
$$[a, c]\, x = a 1 + c x$$
for any $x \in \Re^n$.

The location-scale model can be written as
$$y = [\mu, \sigma]\, e,$$
where $e$ has density $f(\cdot\,; \lambda)$ on $\Re^n$, $\lambda \in \Lambda$, and $[\bar\mu_n(y), \bar s_n(y)]$ is any function that satisfies Assumption 2 (e.g., $[\bar y, s_y]$, with $D_n(y) = [\bar y, s_y]^{-1} y = (y - \bar y 1)/s_y$), provided that we delete the set $\{a 1 : a \in \Re\}$ from $\Re^n$. From the equation $z = [\mu, \sigma]^{-1} y$ the conditional-inference distribution (2.8) can then be written down explicitly.
2.5 The Relationship Between Bayesian and Structural Inference
The conditional-inference distribution derived in the previous section (2.8) is a posterior distribution that is obtained from a structural model together with the data. In this section it will be shown that the conditional-inference distribution is proportional to the posterior distribution of $\theta$ with respect to a right invariant prior.

Suppose that we choose a prior for $\theta$ that has a density function $\pi(\theta)$ with respect to right (invariant) Haar measure on the parameter space $\Omega$. That is, the a priori probability element for $\theta$ is given by
$$\pi(\theta)\, d\nu(\theta),$$
where $\nu(A) = \nu(A\theta)$ for all Borel sets $A \subset \Omega$ and $\theta \in \Omega$.

It is not difficult to show that the measure defined by
$$d\mu(\theta) = \frac{d\theta}{J_L(\theta)}$$
is left Haar measure on $\Omega$ (Fraser 1968, 1979), where $J_L(\cdot)$ is defined in (2.6).

Similarly we can obtain right Haar measure on $\Omega$ by setting
$$J_R(\theta) = J_L(\theta^{-1})$$
to produce the measure
$$d\nu(\theta) = \frac{d\theta}{J_R(\theta)}.$$
To investigate the relationship between left and right Haar measure we quote one of the main properties of Haar measure: any two left Haar measures on a group can be expressed as a constant times the other (see for example Folland 1984, p. 317).

Define the new measure
$$\mu_h(B) = \mu(Bh);$$
clearly $\mu_h$ is left invariant. Hence there is a positive number $\Delta(h)$ such that
$$\mu_h = \Delta(h)\, \mu.$$
The number $\Delta(h)$ is called the modular function of the group. It can be shown that left and right Haar measure are related to each other via the modular function of the group:
$$d\nu(g) = \frac{d\mu(g)}{\Delta(g)};$$
see Fraser (1979, p. 148).
We can now rewrite (2.8) as a density with respect to right Haar measure on $\Omega$; call this (2.12). Thus, the conditional-inference distribution (2.12) has the same form as the posterior distribution of $\theta$ in (2.2). Moreover, since right Haar measure is unique up to a multiplicative constant, the posterior density of $\theta$ with respect to a right invariant prior is proportional to the conditional-inference distribution of $\theta$ (Fraser 1961).
Chapter 3
First Order Likelihood Asymptotics
This chapter presents background material on first order likelihood asymptotics. In the first section the log-likelihood function for the location normal is derived as the difference of two quadratic terms. The second section records the proof given in Fraser (1979) that, in a neighborhood of the true value of a scalar parameter, the log-likelihood function converges almost surely to the aforementioned quadratic form. The relations to three first order tests for a scalar parameter are also noted.
3.1 Likelihood For The Location Normal

In order to ascertain the limiting form of the likelihood function for a scalar parameter it will be useful to write the log-likelihood function in terms of an adjusted variable and parameter. For this, let $y_1, \ldots, y_n$ be a sample from the $N(\mu, 1)$ distribution. The likelihood function for $\mu$ can be written as
$$L_n(\mu) = c\, \exp\left\{-\frac{n}{2}\, (\bar y - \mu)^2\right\},$$
where the term $\exp\{-\frac{1}{2}\sum_{i=1}^n (y_i - \bar y)^2\}$ has been incorporated into the constant term. Taking logarithms of both sides and letting $\mu_*$ denote the true value of $\mu$, the log-likelihood function for $\mu$, standardized with respect to $\mu_*$, can be expressed as
$$l_n(\mu) - l_n(\mu_*) = \delta \hat\delta - \frac{\delta^2}{2},$$
where $\delta = \delta(\mu) = \sqrt n\,(\mu - \mu_*)$ and $\hat\delta = \hat\delta(y_1, \ldots, y_n) = \sqrt n\,(\bar y - \mu_*)$ are an adjusted parameter and variable respectively (Fraser 1976, ch. 8).
3.2 First Order Tests and Confidence Intervals Based
on Likelihood
Let $y_1, \ldots, y_n$ be a sample from a distribution with density function $f(\cdot\,; \theta)$ that depends on a scalar parameter $\theta$. Under mild regularity conditions on the density function $f(\cdot\,; \theta)$ it can be established that in a neighborhood of the true value of $\theta$ the log-likelihood function of $\theta$ converges in distribution to the (normal) quadratic form
$$\delta w - \frac{\delta^2}{2}\, i(\theta_*), \tag{3.1}$$
where $\delta$ and $w$ will be defined below.

Fraser (1968, ch. 8) provides a careful proof that in the neighborhood of the true value of a scalar parameter $\theta$ the log-likelihood function converges a.s. to the quadratic form (3.1) and a.s. tends to $-\infty$ outside the neighborhood. Fraser (1979) discusses the relationship to first order tests and confidence intervals.
3.2.1 The Limiting Quadratic Shape of the Likelihood Function
In order for the likelihood function of a scalar parameter $\theta$, based on a sample $y = (y_1, \ldots, y_n)$, to have the limiting quadratic form (3.1) the following conditions on the density function $f(y; \theta)$ are sufficient (Fraser 1979): the expected information $E_\theta(j(\theta)) > 0$, and $|\partial^3 \log f(y; \theta)/\partial\theta^3| < N(y)$ where $N(y)\, f(y; \theta)$ is integrable on $\Re$.

Throughout this section assume that the density function $f(y; \theta)$ satisfies the two conditions above. Also let
$$l_n(\theta) = l(\theta; y) - l(\theta_*; y)$$
denote the log-likelihood function based on a sample, standardized with respect to the true value $\theta_*$, and let $i(\theta_*)$ denote the Fisher information based on a sample of size $n = 1$.
An application of Taylor's theorem yields the following expression for the log-likelihood:
$$l_n(\theta_* + \delta n^{-1/2}) = \delta n^{-1/2}\, l_n'(\theta_*) + \frac{\delta^2}{2n}\, l_n''(\theta_*) + \frac{\delta^3}{6 n^{3/2}}\, R_n,$$
where $|R_n| \le \sum_{i=1}^n N(y_i)$, $|\partial^3 \log f(y; \theta)/\partial\theta^3| < N(y)$, and $E|N(y)| < \infty$. Letting $\theta = \theta_* + \delta n^{-1/2}$ and writing $w_n = n^{-1/2}\, l_n'(\theta_*)$, $v_n = n^{-1}\, l_n''(\theta_*)$, we note that $l_n'(\theta_*)$ is a sum of i.i.d. random variables with $E\, l_n'(\theta_*) = 0$ and $\mathrm{Var}(l_n'(\theta_*)) = n\, i(\theta_*)$, so the central limit theorem can be applied to yield the limit
$$w_n \xrightarrow{d} w,$$
where $w \sim N(0, i(\theta_*))$. Moreover, by the Strong Law of Large Numbers
$$v_n \xrightarrow{a.s.} -i(\theta_*).$$
Therefore by Slutsky's Theorem (see appendix) we have that
$$l_n(\theta_* + \delta n^{-1/2}) = \delta w_n + \frac{\delta^2}{2}\, v_n + O_p(n^{-1/2}) \xrightarrow{d} \delta w - \frac{\delta^2}{2}\, i(\theta_*).$$
If we complete the square and set
$$\hat\delta = \hat\delta(y) = \frac{w_n}{i(\theta_*)}$$
and let $n \to \infty$ it follows that the limit can be written as
$$\hat\delta \delta\, i(\theta_*) - \frac{\delta^2}{2}\, i(\theta_*),$$
which is the (normal) quadratic form (3.1) (Fraser 1976, p. 352).
As $n \to \infty$, the likelihood function depends only on the sample characteristic $\hat\delta(y)$. Thus, with first order accuracy, $\hat\delta(y)$ can be seen as the large-sample likelihood statistic and it can be used as a basis for testing $\theta$.

The significance function of $\theta$, to the first order, can be obtained by computing any of three standard quantities: two asymptotically standard normal departures (maximum likelihood and score based) and the likelihood ratio statistic compared with the chi-square(1) distribution. The quantities $z_\alpha$ and $\chi^2_\alpha$ are the $\alpha$ quantiles of the standard normal distribution and the chi-square(1) distribution respectively. The accuracy of all three approximations is $O(n^{-1/2})$. The first and last approximations seem to perform better in applications than the middle approximation.
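The displayed formulas, (3.3) in the thesis's numbering, did not survive the scan; in the notation of this section the usual first order quantities (a standard reconstruction, not a verbatim recovery, and not necessarily in the thesis's order) are
$$p(\theta) \approx \Phi\!\left((\hat\theta - \theta)\, j^{1/2}(\hat\theta)\right), \qquad
p(\theta) \approx \Phi\!\left(\frac{l'(\theta)}{\{n\, i(\theta)\}^{1/2}}\right), \qquad
p(\theta) \approx \Phi\!\left(\mathrm{sgn}(\hat\theta - \theta)\left\{2\left[l(\hat\theta) - l(\theta)\right]\right\}^{1/2}\right),$$
where the last corresponds to comparing the likelihood ratio statistic $2[l(\hat\theta) - l(\theta)]$ with the chi-square(1) distribution.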
Example 3. The Pareto($\alpha$) distribution has a density function defined on $\Re^+$ given by
$$f(z; \alpha) = \alpha\, (1 + z)^{-(1+\alpha)}, \qquad \alpha > 0.$$
Let $z = (z_1, \ldots, z_n)$ be an i.i.d. sample from the Pareto($\alpha$). The observed log-likelihood is
$$l(\alpha; z) = n \log \alpha - (1 + \alpha) \sum_{i=1}^n \log(1 + z_i).$$
The score function and observed information are obtained by differentiating the expression above:
$$l'(\alpha) = \frac{n}{\alpha} - \sum_{i=1}^n \log(1 + z_i), \qquad j(\alpha) = \frac{n}{\alpha^2}.$$
The maximum likelihood estimate of $\alpha$ is $\hat\alpha = \hat\alpha(z) = n / \sum_{i=1}^n \log(1 + z_i)$. The regularity conditions stated above are clearly satisfied, hence tests and confidence intervals can be based on (3.3).

First, consider the hypothesis $\alpha = \alpha_0$. The hypothesis can be tested by calculating the standardized maximum likelihood departure and comparing the value obtained with the standard normal distribution. On the other hand, the hypothesis can be tested by calculating the standardized score statistic and comparing the value obtained with the standard normal distribution. Alternatively it can be tested by calculating the likelihood ratio statistic and comparing the value with the chi-square distribution on one degree of freedom. As $n \to \infty$ these tests are equivalent; for small to moderate samples they may give different results.

Consider the artificial data set given by $z = (1, 2, 3, 4, 4)$ and suppose that we are interested in testing $H_0 : \alpha = 1$. The p-values for the three tests, presented in the same order as above, are: 0.313, 0.267, 0.434. Indeed, with small data sets, the three first order approximations can give very different answers.
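The exact standardizations behind the thesis's three reported numbers are not recoverable from the scan, but the common variants of the three statistics are easy to compute; the following sketch (illustrative, not the author's code) evaluates them for the artificial data above:

```python
import numpy as np
from scipy import stats

z = np.array([1.0, 2.0, 3.0, 4.0, 4.0])   # artificial data from the example
n, S = len(z), np.log(1.0 + z).sum()
alpha0 = 1.0                               # null hypothesis H0: alpha = 1
alpha_hat = n / S                          # maximum likelihood estimate

def loglik(a):
    return n * np.log(a) - (1.0 + a) * S

# Standardized MLE departure (Wald), standardized score, and likelihood ratio.
wald = (alpha_hat - alpha0) * np.sqrt(n / alpha_hat**2)   # uses j(alpha_hat)
score = (n / alpha0 - S) / np.sqrt(n / alpha0**2)         # uses i(alpha0)
lr = 2.0 * (loglik(alpha_hat) - loglik(alpha0))
r = np.sign(alpha_hat - alpha0) * np.sqrt(lr)             # signed likelihood root

print("one-sided p-values:",
      stats.norm.cdf(wald), stats.norm.cdf(score), stats.norm.cdf(r))
print("chi-square(1) p-value for LR:", stats.chi2.sf(lr, df=1))
```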
3.3 Appendix
Theorem 4 (Slutsky). Let $(X_n, n \ge 1)$ and $(Y_n, n \ge 1)$ be two sequences of random variables such that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$, where $c$ is a constant; then $X_n + Y_n \xrightarrow{d} X + c$.

Proof: A proof can be found in Cramér (1946).
Chapter 4
Higher Order Likelihood
Asymptotics
This chapter contains background material on the Edgeworth expansion, the indirect Edgeworth expansion or saddlepoint expansion, and the method of approximating the significance function for a scalar component parameter presented in Fraser and Reid (1995), referred to as the FR method. In chapter 6 the FR method will be applied to the location-scale-shape model.
4.1 The Edgeworth Expansion
Let $x_1, x_2, \ldots$ be a sequence of i.i.d. random variables with density function $f(\cdot)$, moment generating function $M(\cdot)$, mean $\mu$, standard deviation $\sigma$, $r$th cumulant $\kappa_r$, and standardized cumulants $\rho_r = \kappa_r/\kappa_2^{r/2}$. Also, set $S_n = x_1 + \cdots + x_n$ and $S_n^* = (S_n - n\mu)/\sigma\sqrt n$. The i.i.d. assumption implies that $M_{S_n}(t) = [M(t)]^n$. Taking the logarithm of both sides we have that $K_{S_n}(t) = n \log M(t) = n K(t)$.

By expanding the cumulant generating function in a Taylor series about $t = 0$, exponentiating, and expanding $\exp(x)$ about $x = 0$, we obtain a series that can be inverted term by term to yield the Edgeworth expansion for the density function $f_{S_n^*}$ of $S_n^*$; integrating term by term then gives the corresponding expansion for the distribution function of $S_n^*$.
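The displayed expansion is lost in the scan; its standard form, consistent with the Hermite terms that reappear in the saddlepoint derivation below, is
$$f_{S_n^*}(x) = \phi(x)\left[1 + \frac{\rho_3}{6\sqrt n}\, H_3(x) + \frac{\rho_4}{24 n}\, H_4(x) + \frac{\rho_3^2}{72 n}\, H_6(x)\right] + O(n^{-3/2}).$$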
Here $H_r(x)$ is the Hermite polynomial of degree $r$ defined as
$$H_r(x) = (-1)^r\, \frac{\phi^{(r)}(x)}{\phi(x)}.$$
The first few Hermite polynomials are
$$H_1(x) = x, \quad H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_4(x) = x^4 - 6x^2 + 3.$$
In general the odd order Hermite polynomials vanish at $x = 0$. The leading term $\phi(x)$ then has error $O(n^{-1})$ at $x = 0$ and error $O(n^{-1/2})$ in the tails of the distribution.

For background on the Edgeworth expansion see Cramér (1946). The discussion in this section follows the treatment given in Barndorff-Nielsen and Cox (1989, ch. 4). For a survey of the multivariate Edgeworth expansion see McCullagh (1987, ch. 5).
4.2 The Saddlepoint Expansion Via The Edgeworth
Expansion
One major shortcoming of the Edgeworth expansion is its poor performance in the tails of the distribution. For statistical purposes this is precisely the part of the distribution where high precision is needed.

Let $f(\cdot)$ be a density function with moment generating function $M(\cdot)$ and cumulant generating function $K(\cdot)$. The density can be associated with the exponential family
$$f(x; \theta) = \exp\{\theta x - K(\theta)\}\, f(x), \tag{4.1}$$
where $K(\theta) = \log M(\theta)$. This procedure of forming $f(x; \theta)$ is called exponential tilting and $f(x; \theta)$ is called the tilted distribution. For a sample $x_1, \ldots, x_n$ the density function is given by
$$\prod_{i=1}^n f(x_i; \theta) = \exp\left\{\theta \sum_{i=1}^n x_i - n K(\theta)\right\} \prod_{i=1}^n f(x_i).$$
Thus the tilted distribution of $S_n = \sum_{i=1}^n x_i$ is
$$f_{S_n}(s; \theta) = \exp\{\theta s - n K(\theta)\}\, f_{S_n}(s). \tag{4.2}$$
To obtain an accurate approximation to the density function $f_{S_n}(s) = f_{S_n}(s; 0)$ the Edgeworth expansion can be applied to $f_{S_n}(s; \theta)$. If $\theta$ is chosen such that $s$ is in the center of the distribution then the error incurred would be $O(n^{-1})$ instead of $O(n^{-1/2})$. In other words we will choose $\theta = \hat\theta$, where $E_{\hat\theta}(S_n) = s$.
The log-likelihood function of the tilted distribution is
$$l(\theta; s) = \theta s - n K(\theta) + a,$$
where $a$ is a constant. The maximum likelihood estimator of $\theta$ is obtained by solving $l'(\hat\theta; s) = s - n K'(\hat\theta) = 0$. Since $E_\theta(S_n) = n K'(\theta)$ it follows that the maximum likelihood estimator, $\hat\theta$, corresponds to the value of $\theta$ such that $s$ is in the center of the distribution.

Applying the Edgeworth expansion to $f_{S_n}(s; \theta)$ we obtain the asymptotic expansion for $f_{S_n}(s)$:
$$f_{S_n}(s) = \frac{e^{n K(\theta) - \theta s}}{\sqrt{2\pi n K''(\theta)}}\left(1 + \frac{\rho_3(\theta)}{6\sqrt n}\, H_3(z) + \frac{\rho_4(\theta)}{24 n}\, H_4(z) + \frac{\rho_3^2(\theta)}{72 n}\, H_6(z) + O(n^{-3/2})\right), \tag{4.3}$$
where
$$z = \frac{s - n K'(\theta)}{\sqrt{n K''(\theta)}}, \qquad \rho_r(\theta) = \frac{K^{(r)}(\theta)}{[K''(\theta)]^{r/2}}.$$
Choosing $\theta = \hat\theta$ we have
$$E_{\hat\theta}(S_n) = n K'(\hat\theta) = s, \tag{4.4}$$
which leads to $z = 0$. Now (4.3) becomes
$$f_{S_n}(s) = \frac{e^{n K(\hat\theta) - \hat\theta s}}{\sqrt{2\pi n K''(\hat\theta)}}\left(1 + \frac{\rho_4(\hat\theta)}{24 n}\, H_4(0) + \frac{\rho_3^2(\hat\theta)}{72 n}\, H_6(0) + O(n^{-2})\right). \tag{4.5}$$
The leading term of the last equation in (4.5),
$$\hat f_{S_n}(s) = \frac{e^{n K(\hat\theta) - \hat\theta s}}{\sqrt{2\pi n K''(\hat\theta)}},$$
is called the saddlepoint approximation to the density of $S_n$.
The $n^{-1}$ term at the origin in the Edgeworth expansion of $f_{S_n}$ is
$$\frac{1}{n}\left(\frac{\rho_4(\hat\theta)}{24}\, H_4(0) + \frac{\rho_3^2(\hat\theta)}{72}\, H_6(0)\right) = \frac{1}{n}\left(\frac{\rho_4(\hat\theta)}{8} - \frac{5\rho_3^2(\hat\theta)}{24}\right). \tag{4.6}$$
If (4.6) is independent of $\hat\theta$ and we normalize the leading term in (4.5), then (4.6) will be incorporated into the normalizing constant and the renormalized approximation will have error $O(n^{-3/2})$ uniformly in $s$ (Barndorff-Nielsen and Cox 1989, p. 107). In the one-dimensional case, there are just three families for which (4.6) does not depend on $\hat\theta$: the normal, gamma, and inverse Gaussian (Blaesild and Jensen 1985). Renormalization of (4.5) yields
$$f_{S_n}(s) = c\, \frac{e^{n K(\hat\theta) - \hat\theta s}}{\sqrt{2\pi n K''(\hat\theta)}}\left(1 + O(n^{-3/2})\right), \tag{4.7}$$
provided that (4.6) is constant.
From (4.4) the transformation from $s$ to $\hat\theta$ is one-to-one, with
$$\frac{ds}{d\hat\theta} = n K''(\hat\theta) = \hat\jmath,$$
the observed information, so that $ds = \hat\jmath\, d\hat\theta$. So (4.7) can also be expressed as a density with respect to the maximum likelihood estimator:
$$f(\hat\theta; \theta) = c\, |\hat\jmath|^{1/2}\, \exp\{l(\theta) - l(\hat\theta)\}\left(1 + O(n^{-3/2})\right). \tag{4.8}$$
The saddlepoint expansion for the exponential model (4.1) has a corresponding distribution function approximation (Lugannani and Rice 1980)
$$P(S_n \le s) = \Phi(r) + \phi(r)\left(\frac{1}{r} - \frac{1}{q}\right)\left(1 + O(n^{-1})\right), \tag{4.9}$$
where
$$r = \mathrm{sgn}(\hat\theta)\left[2\{\hat\theta s - n K(\hat\theta)\}\right]^{1/2}, \qquad q = \hat\theta\left[n K''(\hat\theta)\right]^{1/2}. \tag{4.10}$$
Both $r$ and $q$ are asymptotically standard normal to the first order.

Daniels (1954) first introduced saddlepoint methods to statistics. An excellent review of saddlepoint methods in statistics is given in Reid (1988). The exposition presented here is based on Barndorff-Nielsen and Cox (1989, ch. 4). Kolassa (1994) and Jensen (1995) both provide a thorough treatment of the regularity conditions involved in the saddlepoint approximation.
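As a quick numerical check of (4.7) and (4.9), the following sketch (an illustration of mine, not code from the thesis) applies the saddlepoint density and the Lugannani-Rice tail formula to a sum of $n$ unit exponentials, for which $K(t) = -\log(1 - t)$ and the exact distribution is Gamma($n$, 1):

```python
import numpy as np
from scipy import stats

n = 5                          # sample size; exact S_n ~ Gamma(n, 1)

def saddlepoint(s):
    th = 1.0 - n / s           # solves the saddlepoint equation n K'(th) = s
    K, K2 = -np.log(1.0 - th), 1.0 / (1.0 - th) ** 2
    dens = np.exp(n * K - th * s) / np.sqrt(2 * np.pi * n * K2)
    r = np.sign(th) * np.sqrt(2 * (th * s - n * K))
    q = th * np.sqrt(n * K2)
    cdf = stats.norm.cdf(r) + stats.norm.pdf(r) * (1 / r - 1 / q)
    return dens, cdf

for s in [2.0, 4.0, 12.0]:     # points away from the mean n, where r, q != 0
    dens, cdf = saddlepoint(s)
    print(f"s={s:5.1f}  saddle pdf {dens:.5f}  exact {stats.gamma.pdf(s, n):.5f}  "
          f"saddle cdf {cdf:.5f}  exact {stats.gamma.cdf(s, n):.5f}")
```

The gamma family is one of the three for which (4.6) is constant, so the renormalized density approximation here is exact up to the constant.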
4.3 Inference For A Scalar Parameter In the Presence Of Nuisance Parameters
This section is intended to provide background material on the FR method (Fraser and Reid 1995) for approximating the significance function of a scalar interest parameter. The FR method consists of approximating a general statistical model with an exponential model and then applying the saddlepoint approximation to the approximating model. The only information needed concerning the ancillary statistic is the tangent directions to a first derivative ancillary. A general method for computing these directions is outlined in this section.
4.3.1 First Derivative Ancillary
The concept of a first derivative ancillary was introduced in Fraser (1964) and Fraser (1968, ch. 6) for a scalar parameter and extended in Fraser and Reid (1995) to the vector parameter case. The importance of a first derivative ancillary is that it provides all of the required ancillary information for third order inference without having to specify an approximate ancillary statistic explicitly.

Definition 5. A random variable $A$ with density function $g(a; \theta)$ is first derivative ancillary at $\theta_0$ if and only if
$$\left.\frac{\partial}{\partial \theta}\, g(a; \theta)\right|_{\theta = \theta_0} = 0.$$
Construction of a First Derivative Ancillary

Let $y_1, \ldots, y_n$ be a sample from a distribution with distribution function $F(y; \theta)$ where $\theta \in \Re$. In addition assume that $F$ is stochastically increasing in a neighborhood of $\theta_0$; that is, for all $\theta$ near $\theta_0$ and $y \in \Re$,
$$\frac{\partial F(y; \theta)}{\partial \theta} < 0.$$
Holding $F(y; \theta_0)$ constant we find that
$$\frac{dy}{d\theta} = -\frac{\partial F(y; \theta)/\partial \theta}{\partial F(y; \theta)/\partial y} > 0;$$
an increase in $\theta$ at $\theta_0$ causes the entire distribution to shift to the right. If we let $x(y)$ rescale the variable so that this shift has unit speed at $\theta_0$, then the distribution function of $x(y)$ is
$$G(x; \theta) = F(y(x); \theta),$$
where $y(x)$ is the inverse function of $x(y)$. In this case we find
$$\left.\frac{\partial G(x; \theta)}{\partial \theta}\right|_{\theta_0} = -\left.\frac{\partial G(x; \theta)}{\partial x}\right|_{\theta_0}.$$
But then
$$G(x; \theta) = G\!\left(x - (\theta - \theta_0); \theta_0\right) + O\!\left((\theta - \theta_0)^2\right).$$
So we say that the statistical model for $x(y)$ can be written as
$$G\!\left(x(y) - (\theta - \theta_0)\, 1;\ \theta_0\right)$$
to first derivative at $\theta_0$. Thus the model, to a first derivative approximation at $\theta_0$, can be seen as a location model $x = (\theta - \theta_0) 1 + \epsilon$, where $\epsilon$ has distribution function $G(\cdot\,; \theta_0)$, having as corresponding ancillary statistic the residual configuration of $x$, whose tangent direction vector is $dx/d\theta = 1$.
4.3.2 Approximating Exponential Model

Let $f(x; \varphi)$ be a continuous exponential model with $p$-dimensional parameter $\varphi$. Thus $f(x; \varphi)$ is known to have the form
$$f(x; \varphi) = \exp\left\{\theta'(\varphi)\, y(x) - K(\theta(\varphi)) + h(y(x))\right\}, \tag{4.11}$$
but $\theta(\varphi)$, $y(x)$, $K(\theta(\varphi))$ and $h(y(x))$ may not be explicitly available.
Example 6. Let $x_1, \ldots, x_n$ be a random sample from a $N(\mu, \sigma^2)$ population. The density of the sample is
$$f(x; \mu, \sigma) = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right\},$$
which can be written as
$$\exp\left\{\frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \log\frac{1}{\sigma^n} - \frac{n\mu^2}{2\sigma^2} + \log\frac{1}{(2\pi)^{n/2}}\right\}.$$
In this case $\varphi = (\mu, \sigma)$, $\theta(\varphi) = (\mu/\sigma^2,\ -1/2\sigma^2)$, $y(x_1, \ldots, x_n) = (\sum_{i=1}^n x_i,\ \sum_{i=1}^n x_i^2)'$, $-K(\theta(\varphi)) = \log(1/\sigma^n) - n\mu^2/2\sigma^2$, and $h(y) = \log(1/(2\pi)^{n/2})$.

The canonical parameter $\theta$ and variable $y$, nominal cumulant generating function $K$ and underlying $h$ are not uniquely determined. To standardize (4.11) with respect to an observed value $x^0$ having maximum likelihood estimate $\hat\varphi^0 = \hat\varphi(x^0)$ we require:
$$y(x^0) = 0, \quad \theta(\hat\varphi^0) = 0, \quad K(0) = 0, \quad \left[\frac{\partial \theta(\varphi)}{\partial \varphi'}\right]_{\hat\varphi^0} = I.$$
Let
$$s = s(\varphi; x) = \left.\frac{\partial\, l(\varphi; x)}{\partial \varphi}\right|_{\varphi = \hat\varphi^0} \tag{4.12}$$
be the score variable. At the data point $x^0$ we have that $s(\hat\varphi^0; x^0) = 0$. From (4.11) we have that
$$\left.\frac{\partial\, l(\varphi; x)}{\partial \varphi}\right|_{\hat\varphi^0} = \left[\frac{\partial \theta'(\varphi)}{\partial \varphi}\, y(x) - \frac{\partial K(\theta(\varphi))}{\partial \varphi}\right]_{\varphi = \hat\varphi^0} = y(x) = s,$$
using the standardization above. We can write (4.11) as $l(\varphi; x) = \theta'(\varphi)\, s - K(\theta(\varphi))$, which gives the canonical parameter in terms of sample space derivatives,
$$\theta'(\varphi) = \left.\frac{\partial\, l(\varphi; x)}{\partial s}\right|_{x = x^0}, \tag{4.13}$$
and the cumulant generating function $K(\theta) = -l(\varphi^{-1}(\theta); x^0)$. Thus
$$l(\varphi; x) = \theta'(\varphi)\, s - K(\theta(\varphi)), \tag{4.14}$$
which shows that $l$ and $l_{;x}$ fully determine an exponential linear model.

Consider a continuous statistical model where the dimension of the variable is the same as the dimension of the parameter. The tangent exponential model, at an observed point in the sample space, is defined (Fraser 1990) as the exponential model (4.11) that agrees with the given model at the observed sample point: this is given by (4.14) in terms of the score variable defined in (4.12). This approximating exponential model has canonical parameter given by (4.13) and cumulant generating function $K(\theta)$.

Consider the scalar variable, scalar parameter case. An approximation to the left-tail probability for the maximum likelihood estimator is given by (4.9), with (4.10) for $r$ and with $q$ computed from the tangent exponential model.
4.3.3 Marginal Density For A Scalar Interest Parameter
Fraser and Reid (1995) show that a first derivative ancillary at $\hat\theta^0 = \hat\theta(y_1, \ldots, y_n)$ can be upgraded to a second order ancillary without changing the tangent directions at the data. Moreover, they show that it suffices to obtain the tangent directions to a second order ancillary in order to approximate the significance function, for a scalar interest parameter, to the third order. Thus, for third order inference it suffices to obtain the tangent directions to a first derivative ancillary.

Let $y^0 = (y_1, \ldots, y_n)$ be a sample from $f(y; \theta)$ where $\theta = (\lambda, \psi) \in \Re^{p-1} \times \Re$. The ancillary directions $V$ for $\theta$ can be found from the method presented in section 4.3.1. The ancillary direction is given by
$$v_i = \left.\frac{\partial y_i}{\partial \theta}\right|_{(y^0, \hat\theta^0)} = -\left.\frac{\partial F(y_i; \theta)/\partial \theta}{\partial F(y_i; \theta)/\partial y_i}\right|_{(y^0, \hat\theta^0)}, \tag{4.17}$$
where $i = 1, \ldots, n$.

In the vector case we would have $\theta = (\theta_1, \ldots, \theta_p)$ and the ancillary directions $V = (v_{ij})$ are given by
$$v_{ij} = -\left.\frac{\partial F(y_i; \theta)/\partial \theta_j}{\partial F(y_i; \theta)/\partial y_i}\right|_{(y^0, \hat\theta^0)},$$
where $i = 1, \ldots, n$ and $j = 1, \ldots, p$.
The tangent exponential model at $y^0$ is
$$f_{TEM}(s; \theta)\, ds = \exp\left\{l^0(\theta) + \varphi'(\theta)\, s\right\}\, h(s)\, ds, \tag{4.18}$$
where
$$\varphi'(\theta) = l_{;V}(\theta; y^0) = \left.\frac{d}{dV}\, l(\theta; y)\right|_{y^0} \tag{4.19}$$
and $l^0(\theta) = l(\theta; y^0)$.

Since (4.18) is a tilted density of the form (4.1) we can approximate it using the saddlepoint approximation (4.7); denote the resulting approximation by (4.20), in which the information determinant $|\tilde\jmath_{\varphi\varphi}|$ is calculated from the tilted likelihood in the exponent of (4.18). The accuracy of (4.20) is $O(n^{-3/2})$ in a first derivative neighborhood of $y^0$ and $O(n^{-1})$ in a compact region for the variable, save an $O(n^{-1})$ constant (Cakmak, Fraser and Reid 1994). Writing $\varphi = (\varphi_1, \varphi_2)$ and $s' = (s_1', s_2')$, the marginal distribution for $s_2$ on the maximum likelihood surface $\hat\theta_\psi$, or $s_1 = 0$, is the ratio of the joint density for $(s_1, s_2)$ to the conditional density of $s_1 | s_2$; denote this marginal density by (4.21). Note that $\varphi_\lambda(\theta)$ is a $p \times (p - 1)$ matrix with corresponding matrix volume $|\varphi_\lambda'(\theta)\, \varphi_\lambda(\theta)|^{1/2}$. Formula (4.21) was derived in Fraser and Reid (1995).
4.3.4 First-Order Significance Function For A Scalar Interest Parameter

Suppose that $f(y; \theta)$ is a $p$-dimensional statistical model with $\theta = (\lambda, \psi)$ where $\lambda$ is a $p - 1$ dimensional nuisance parameter. The likelihood based quantities, the signed likelihood root and the standardized maximum likelihood departure, are asymptotically $N(0, 1)$ to the first order (Barndorff-Nielsen and Cox 1994). Thus with accuracy $O(n^{-1/2})$ we have three first order approximations to the significance function of $\psi$.
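The displayed quantities were lost in scanning; in standard notation (a reconstruction consistent with chapter 3, not a verbatim recovery) the two basic quantities are
$$r(\psi) = \mathrm{sgn}(\hat\psi - \psi)\left\{2\left[l(\hat\theta) - l(\hat\theta_\psi)\right]\right\}^{1/2}, \qquad
q(\psi) = (\hat\psi - \psi)\left\{\frac{|j(\hat\theta)|}{|j_{\lambda\lambda}(\hat\theta_\psi)|}\right\}^{1/2},$$
giving the first order significance functions $\Phi(r)$ and $\Phi(q)$, together with the chi-square version based on $r^2$.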
4.3.5 Third-Order Significance Function For A Scalar Interest Parameter

Linear Exponential Models

The saddlepoint approximation (4.7) to an exponential model with partitioned parameter $\theta = (\lambda, \psi) \in \Re^{p-1} \times \Re$ has a corresponding distribution function approximation for a scalar interest parameter,
$$\Phi(R) + \phi(R)\left(\frac{1}{R} - \frac{1}{Q}\right), \tag{4.24}$$
where $R = r^*$ and $Q = q^*$ (see Reid 1995(b) and the references therein).

An asymptotically equivalent version of (4.24) is
$$\Phi(R^*), \qquad R^* = R - \frac{1}{R}\, \log\frac{R}{Q}. \tag{4.25}$$
Barndorff-Nielsen (1986, 1991) introduced the $R^*$ version as an alternative to the Lugannani-Rice formula. $R^*$ is standard normal to the third order and (4.25) has the advantage of always producing p-values inside the interval $[0, 1]$.
General Statistical Models

For a general statistical model we obtained the density approximation in (4.21) for a scalar variable when $\psi$ is fixed, conditional on some third order ancillary for $\lambda$. Cheah, Fraser and Reid (1995) have derived the Lugannani and Rice formula (4.24) for an exponential model multiplied by an adjustment factor as in (4.21). The Lugannani and Rice formula for this type of model uses $r^*$ as in (4.23) and a $q^*$ of the form (4.26), in which the nuisance information is recalibrated to the canonical parameterization:
$$|\tilde\jmath_{\lambda\lambda}(\hat\theta_\psi)| = |j_{\lambda\lambda}(\hat\theta_\psi)|\ |\varphi_\lambda'(\hat\theta_\psi)\, \varphi_\lambda(\hat\theta_\psi)|^{-1},$$
as defined in (4.22).

Towards obtaining an expression for $Q$ in (4.26) for the model in (4.21), the Jacobian from $\theta$ to $\varphi$ and its inverse give the connection between $\theta$ and $\varphi$ in the neighborhood of $\hat\theta_\psi$. The scalar parameter
$$\chi(\theta) = \frac{\psi_{\varphi'}(\hat\theta_\psi)}{\left|\psi_{\varphi'}(\hat\theta_\psi)\right|}\, \varphi(\theta) \tag{4.27}$$
behaves like $\psi$ in the neighborhood of $\hat\theta_\psi$. A standardized maximum likelihood departure for testing $\psi$ then has the form
$$\hat q = \left[\chi(\hat\theta) - \chi(\hat\theta_\psi)\right]\left\{\frac{|\hat\jmath_{\varphi\varphi}(\hat\theta)|}{|\tilde\jmath_{\lambda\lambda}(\hat\theta_\psi)|}\right\}^{1/2}. \tag{4.26}$$
Thus (4.24) or (4.25) can be used with $R = r^*$ and $Q = \hat q$ to provide a third order approximation to the significance function for $\psi$. Calculation of $r^*$ and $\hat q$ is tantamount to implementing the FR method. If the parameter of interest is a vector parameter, Fraser and Reid (1995) suggest testing successive components of the parameter vector as described in Fraser and McKay (1975).
Chapter 5
Improper Priors, Posterior
Asymptotic Normality, and
Condit ional Inference
There has been a modest amount of literature on the posterior asymptotic normality of a scalar parameter based on a proper prior distribution. Heyde and Johnstone (1979) extended the proof of Walker (1969) to stochastic processes. A proof of the multiparameter case can be found in Johnstone (1978). Sweeting and Adekola (1987) extended the regularity conditions found in Heyde and Johnstone (1979) to cover a wider class of processes, again based on a proper prior.

Posterior asymptotic normality where the prior distribution is not assumed to be proper was first examined in Brenner, Fraser and McDunnough (1982) for a sample from a scalar location parameter model. Fraser and McDunnough (1984) extended the result to a general model with a scalar parameter and mention that for a random sample the multiparameter case should hold. Indeed, the literature does not seem to contain a proof of the multiparameter case without the assumption of a proper prior. The purpose of this section is to establish the asymptotic normality of the posterior distribution of a vector parameter based on either a random sample or a sample from a stochastic process without the assumption of a proper prior. The results contained in this section appear in Fraser, McDunnough and Taback (1997).
In the previous section we saw that posterior distributions or normalized likelihood functions have an intimate connection with the confidence distribution generated by a transformation or structural model using conditional inference. Theorem 3.1 in Brenner, Fraser and McDunnough (1982) gives conditions under which convergence almost surely of the suitably normalized likelihood function or the posterior to the standard multivariate normal distribution is sufficient for almost sure convergence of the standardized conditional-inference distribution to the standard multivariate normal. In a transformation model the likelihood function can be normalized with respect to an invariant measure on the parameter space, as a natural non-informative prior. If the parameter space is a compact group then the invariant measure will be proper, but for location, location-scale, regression models and others with noncompact groups, this prior is improper. Consider the classical regression model $y = X\beta + \sigma z$ where $z$ is $N_k(0, I)$ with the parameter $\theta = (\beta, \sigma)$; a proof of the strict asymptotic normality of the confidence distribution of $\hat\Sigma_n^{-1/2}(\theta - \hat\theta)$ does not seem to be available in the literature, but is the concern of this section.
Let the parameter space $\Omega = \Re^k$ be $k$-dimensional Euclidean space. We find it convenient to use the norm
$$\|x\| = \max_{1 \le i \le k} |x_i|$$
on $\Re^k$ rather than the Euclidean norm. If $A$ is a linear transformation on $\Re^k$ defined by the matrix $[A_{ij}]$, then the norm of $A$ will be taken as
$$\|A\| = \max_{i, j} |A_{ij}|.$$
We then have that
$$\|A x\| \le k\, \|A\|\, \|x\|$$
for all $x \in \Re^k$. It also follows from the Cauchy-Schwarz inequality that
$$\|A x\| \le k^{1/2}\left(\max_i \sum_{j=1}^k A_{ij}^2\right)^{1/2} \|x\|$$
for all $x \in \Re^k$.

An open rectangle in $\Re^k$ is defined to be a subset of $\Re^k$ of the form
$$A = \left\{x \in \Re^k :\ a_i < x_i < b_i,\ \text{for } i = 1, \ldots, k\right\}.$$
Closed rectangles are defined in an analogous way.

Let $f : A \to \Re$ be a bounded function, where $A$ is a rectangle in $\Re^k$.
5.1 Assumptions
The following assumptions concerning the model and a prior $w(\theta)$ are closely related to those in Fraser and McDunnough (1984) and Johnstone (1978).

Assumption 1.
$$\int_\Omega w(\theta)\, L_n(\theta)\, d\theta < \infty \quad \text{a.s. for each } n,$$
and $w(\theta)$ is continuous and positive at the true $\theta$.

Assumption 2. $l_n(\theta)$ is twice continuously differentiable, and $\det(j(\theta)) \in (0, \infty)$. Define
$$\Sigma_n(\theta) = j^{-1}(\theta)$$
and assume that both $\Sigma_n(\theta) \to 0$ and $\hat\Sigma_n \to 0$, where $\hat\Sigma_n = \Sigma_n(\hat\theta)$ with $\hat\theta$ the maximum likelihood estimate of $\theta$.

Assumption 3. For every $\delta > 0$,
$$\limsup_{n \to \infty}\ \sup_{\theta' : \|\theta' - \theta\| \ge \delta}\ \frac{1}{n}\left[l_n(\theta') - l_n(\theta)\right] < 0 \quad \text{a.s.}$$

Assumption 4. For every $\epsilon < 1$ there exists a $\delta > 0$ such that
$$\limsup_{n \to \infty}\ \sup_{s : \|s - \theta\| \le \delta}\ \left\|\Sigma_n^{1/2\,\prime}(\theta)\left\{l_n''(s) - l_n''(\theta)\right\}\Sigma_n^{1/2}(\theta)\right\| \le \epsilon \quad \text{a.s.}$$

Assumption 3 is a modified version of Assumption 1' in Fraser and McDunnough (1984). Assumption 4 ensures that in a neighborhood of its maximum the likelihood function has a multivariate normal form, that is,
$$l_n(\hat\theta + \hat\Sigma_n^{1/2} t) - l_n(\hat\theta) \to -\frac{t' t}{2}. \tag{5.1}$$
For this last statement expand $l_n$ in a Taylor series (see appendix) about $\hat\theta$ to obtain
$$l_n(\hat\theta + \hat\Sigma_n^{1/2} t) - l_n(\hat\theta) = \frac{1}{2}\, t'\, \hat\Sigma_n^{1/2\,\prime}\, l_n''(\bar\theta)\, \hat\Sigma_n^{1/2}\, t,$$
where $\bar\theta$ lies somewhere on the line joining $\hat\theta$ to $\hat\theta + \hat\Sigma_n^{1/2} t$. Assumption 4 then gives (5.1).
5.2 Main Result
Theorem 7. Let $\{X_t, t \in T\}$ be a stochastic process and suppose that we observe $n$ realizations of the process $x = (x_1, \ldots, x_n)$ having density $f_n(x|\theta)$ with respect to a $\sigma$-finite measure not dependent on $\theta$. If Assumptions 1, 2, 3, 4 hold, then the posterior distribution of $\hat\Sigma_n^{-1/2}(\theta - \hat\theta)$ converges, almost surely, in distribution to the standard multivariate normal distribution.

Proof: From Assumption 1 and (5.1) we see that it suffices to show that
$$\int w(u)\, L_n(u)\, du \sim (2\pi)^{k/2}\, \det(\hat\Sigma_n^{1/2})\, w(\hat\theta)\, L_n(\hat\theta). \tag{5.3}$$
Towards this we note that
$$\frac{\int w(u)\, L_n(u)\, du}{w(\hat\theta)\, L_n(\hat\theta)\, \det(\hat\Sigma_n^{1/2})} = \int \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt.$$
For any rectangle $A \subset \Re^k$ the dominated-convergence theorem yields
$$\int_A \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt \to \int_A e^{-t' t/2}\, dt.$$
As in Fraser and McDunnough (1984) we define sets
$$B_{\delta, n} = \left\{t : \hat\theta + \hat\Sigma_n^{1/2} t \notin A,\ \|\hat\Sigma_n^{1/2} t\| \le \delta\right\}, \qquad B_n = \left\{t : \hat\theta + \hat\Sigma_n^{1/2} t \notin A,\ \|\hat\Sigma_n^{1/2} t\| > \delta\right\},$$
with $\delta > 0$. $B_{\delta, n}$ is the region outside the rectangle $A \subset \Re^k$ but in a $\delta$-neighborhood of the true parameter, while $B_n$ is the region outside a rectangle and not in a neighborhood of the true parameter. Towards (5.3) we note that it now suffices to show that
$$\int_{B_{\delta, n}} \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt \to 0 \tag{5.4}$$
and
$$\int_{B_n} \frac{w(\hat\theta + \hat\Sigma_n^{1/2} t)\, L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{w(\hat\theta)\, L_n(\hat\theta)}\, dt \to 0. \tag{5.5}$$
If we can show that
$$\frac{L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{L_n(\hat\theta)} \le e^{c_\epsilon\, t' t} \quad \text{on } B_{\delta, n}, \tag{5.6}$$
then (5.4) follows immediately. Towards (5.6) let $R_n = \hat\Sigma_n^{1/2\,\prime}\left(l_n''(\bar\theta) - l_n''(\hat\theta)\right)\hat\Sigma_n^{1/2}$ and $t \in B_{\delta, n}$; then the multivariate version of Taylor's theorem yields
$$l_n(\hat\theta + \hat\Sigma_n^{1/2} t) - l_n(\hat\theta) = -\frac{t' t}{2} + \frac{1}{2}\, t'\, R_n\, t \le -\frac{t' t}{2}\,(1 - \epsilon).$$
The last inequality follows from Assumption 4. Thus,
$$\frac{L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{L_n(\hat\theta)} \le e^{c_\epsilon\, t' t},$$
where the constant $c_\epsilon < 0$ for any $\epsilon < 1$. Finally for suitable $\delta > 0$ we have that (5.4) holds.

In anticipation of (5.5), we write
$$L_n(\theta) = L_{n-1}(\theta)\, L_1(\theta),$$
where $L_{n-1} \propto f(x_1, \ldots, x_n; \theta)/f(x_1; \theta)$ for $n > 1$ and $L_1 \propto f(x_1; \theta)$. An application of Assumption 3 then yields
$$\frac{L_n(\hat\theta + \hat\Sigma_n^{1/2} t)}{L_n(\hat\theta)} \le e^{-n d}$$
for all $t \in B_n$ and for a constant $d > 0$ (see Fraser and McDunnough, 1984). It then follows from Assumption 1 that (5.5) holds; here $\theta_0$ denotes the true value of $\theta$.
Proof: The multivariate analog of a theorem due to Scheffé (1947) yields the result.

Let
$$y = \mu 1 + \sigma q,$$
i.e., where $q \sim$ Student($\lambda$). Suppose that for a given value of $\lambda$ we want to make inferences about $\mu$. The structural distribution function of the parameter $\mu$ (see chapter 6) is $G(\mu; \lambda)$, where $f(\cdot\,; \lambda)$ is the Student($\lambda$) density.

A $1 - \alpha$ confidence interval for $\mu$ is given by $(a, b)$, where $a < b$ satisfy
$$G_\lambda(b) - G_\lambda(a) = 1 - \alpha.$$
So a $100(1 - \alpha)\%$ confidence interval has the form
$$\left(G^{-1}(\alpha/2; \lambda),\ G^{-1}(1 - \alpha/2; \lambda)\right). \tag{5.7}$$
To implement the confidence interval (5.7) we require values of the quantile function $G^{-1}(x; \lambda)$ for selected values of $x$. Since the assumptions about the error density hold we can apply the theorem. Thus with accuracy $O(n^{-1/2})$ we can approximate $G(\cdot\,; \lambda)$ with a normal distribution function centered at $\hat\mu(y)$ with standard deviation determined by $\hat s(y)$ and $\hat\sigma(y)$.
5.3 Appendix
Theorem 9 (Taylor). Let $f : U \subset \Re^p \to \Re$ have continuous partial derivatives of second order. Then we may write
$$f(x) = f(x_0) + f'(x_0)\,(x - x_0) + R,$$
where $R = O(\|x - x_0\|^2)$ and
$$R = \frac{1}{2}\, (x - x_0)'\, f''(\bar x)\,(x - x_0),$$
where $\bar x$ lies somewhere on the line joining $x_0$ to $x$.

Proof: A proof can be found in Marsden and Tromba (1988).
Chapter 6
Inference In Location-Scale-Shape
Models with Small Samples
The first section of this chapter is devoted to exact conditional inference in the location-scale-shape model, for a fixed value of the shape parameter. In the next section the FR method is applied to this model to obtain a third order approximation to the significance function for either the location or scale parameter.
6.1 Conditional Inference For Location-Scale-Shape Analysis
Fraser (1976b, 1979) provides a thorough treatment of conditional inference for a component parameter in a location-scale-shape model. The marginal likelihood function can be used for inference about the shape parameter. Then for a fixed value of the shape parameter inference is available for a component of $(\mu, \sigma)$, while the remaining components are assumed to be unknown.

Suppose that the structural equation (2.3) takes the form
$$y = \mu 1 + \sigma e, \tag{6.1}$$
where $e$ has known density function $f(\cdot\,; \lambda)$ that depends on a scalar parameter $\lambda \in (0, \infty)$. Throughout this chapter the parameter $\lambda$ is assumed to be the shape or kurtosis parameter of the distribution with density $f$, but the results of this chapter are applicable to statistical models where the parameter $\lambda$ is, for example, skewness.

The conditional distribution of $(\bar\mu, \bar s)$ given $D$ has density function (2.10), namely
$$g(\bar\mu, \bar s \mid D;\ \mu, \sigma) = k_\lambda^{-1}(D)\, \prod_{i=1}^n f\!\left(\frac{\bar\mu - \mu + \bar s D_i}{\sigma};\ \lambda\right) \frac{\bar s^{\,n-2}}{\sigma^n}, \tag{6.2}$$
where
$$k_\lambda(D) = \int_0^\infty \int_{-\infty}^\infty \prod_{i=1}^n f(p + s D_i;\ \lambda)\, s^{n-2}\, dp\, ds.$$
6.1.1 Inference For Shape
The only characteristic of the realized $e$ that is observed is the unit residual vector $D$. The probability for this observed event is
$$k_\lambda(D)\, da,$$
where the differential $da$ denotes the volume of the unit sphere formed by the points $D$; then $\bar s^{\,n-2} da$ is the volume on the sphere of radius $\bar s$ (Fraser 1979). Thus the observed likelihood function for the parameter $\lambda$ is
$$L(\lambda) = c\, k_\lambda(D),$$
where $c > 0$. Different values of the constant $c$ give similarly shaped functions of $\lambda$ (Fraser 1979).
6.1.2 Inference For Location
The relation in (2.4) becomes
$$y = \bar\mu(y)\, 1 + \bar s(y)\, D, \tag{6.3}$$
with $\bar\mu = \bar\mu(e)$, $\bar s = \bar s(e)$ for the error variable.

For a given value of $\lambda$, inferences concerning $\mu$ can be obtained by first rearranging (6.3) to obtain the pivotal quantity
$$t = \frac{\bar\mu(y) - \mu}{\bar s(y)}.$$
The marginal distribution of $t$ has density function
$$g(t; \lambda) = k_\lambda^{-1}(D) \int_0^\infty \prod_{i=1}^n f\!\left(s\,(t + D_i);\ \lambda\right) s^{n-1}\, ds. \tag{6.5}$$
For a given value of $\lambda$ a $1 - \alpha$ confidence interval for $t$ is $(t_1, t_2)$, where
$$\int_{t_1}^{t_2} g(t; \lambda)\, dt = 1 - \alpha,$$
so that a $1 - \alpha$ confidence interval for $\mu$ is
$$\left(\bar\mu(y) - t_2\, \bar s(y),\ \bar\mu(y) - t_1\, \bar s(y)\right).$$
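A small numerical sketch (mine, not from the thesis) of the pivot density (6.5) and the resulting interval, for Student($\lambda$) errors with the $s$-integral done by quadrature; the data and the choice of position/scale statistics are illustrative:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

y = np.array([0.2, 1.5, -0.8, 2.1, 0.9, 1.3])   # illustrative data
lam = 5.0                                        # fixed shape (degrees of freedom)
mu_bar, s_bar = y.mean(), y.std(ddof=0)          # one choice of position and scale
D = (y - mu_bar) / s_bar                         # unit residual vector
n = len(y)

def f(x):                                        # Student(lam) error density
    return stats.t.pdf(x, df=lam)

def g_unnorm(t):                                 # integrand of (6.5), up to k(D)
    integrand = lambda s: np.prod(f(s * (t + D))) * s ** (n - 1)
    return quad(integrand, 0, np.inf)[0]

# Normalize g on a grid and invert its distribution function for a 95% interval.
tg = np.linspace(-6, 6, 601)
gv = np.array([g_unnorm(t) for t in tg])
cdf = np.cumsum(gv); cdf /= cdf[-1]
t1, t2 = np.interp([0.025, 0.975], cdf, tg)
print(f"95% CI for mu: ({mu_bar - t2 * s_bar:.3f}, {mu_bar - t1 * s_bar:.3f})")
```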
6.1.3 Inference for Scale
The marginal distribution for $\bar s$ can be obtained by integrating $\bar\mu$ out of (6.2):
$$h(s; \lambda) = k_\lambda^{-1}(D) \int_{-\infty}^{\infty} \prod_{i=1}^n f\!\left(p + s D_i;\ \lambda\right) s^{n-2}\, dp. \tag{6.6}$$
Thus for a given value of $\lambda$ a $1 - \alpha$ confidence interval for $\sigma$ is
$$\left(\bar s(y)/s_2,\ \bar s(y)/s_1\right),$$
where $s_1 < s_2$ satisfy
$$\int_{s_1}^{s_2} h(s; \lambda)\, ds = 1 - \alpha.$$
6.2 Asymptotic Location-Scale-Shape Analysis
For a fixed value of $\lambda$ exact inference is available for a component of $(\mu, \sigma)$ from either (6.5) or (6.6). Reid (1995a) points out two problems if $\lambda$ is treated as unknown:

There is no exact ancillary statistic available to condition on, in order to reduce the dimension of the sufficient statistic to the dimension of the parameter, and no exact method of eliminating the nuisance parameters to provide a one dimensional distribution for inference about the parameter of interest. In the absence of an 'exact' solution, it is difficult to derive an approximate solution, since it is not clear what should be approximated. However, the more general work of Barndorff-Nielsen (1991, 1994), DiCiccio and Martin (1991), and Fraser and Reid (1995) suggests that given a suitable method for dimension reduction (such as conditioning on an approximate ancillary statistic), the derivation of the tail area is straightforward.
In this section we derive the likelihood related quantities needed in order to implement the FR method for the location-scale-shape model with general error distribution dependent on a shape parameter.

The likelihood function of $\theta$ for the location-scale-shape model in (6.1) is
$$l(\mu, \sigma, \lambda; y) = \sum_{i=1}^n l_i(\mu, \sigma, \lambda; y),$$
where $l_i(\mu, \sigma, \lambda; y) = \log\{\frac{1}{\sigma} f(\frac{y_i - \mu}{\sigma}; \lambda)\}$.

At the beginning of section 4.3.3 it was noted that for third order inference it suffices to obtain the tangent directions, at the data, to a second order ancillary. Moreover, a first derivative ancillary has the same tangent directions as a second order ancillary. Thus, the tangent directions to some third order ancillary are given by (4.17). The directions can be described by the vectors $V = (v_1, v_2, v_3)$ and for the location-scale-shape model are given by
$$v_{1i} = 1, \qquad v_{2i} = \hat z_i = \frac{y_i - \hat\mu}{\hat\sigma}, \qquad v_{3i} = -\hat\sigma\, \frac{\partial F(\hat z_i; \lambda)/\partial\lambda}{f(\hat z_i; \hat\lambda)},$$
where
$$F(x; \lambda) = \int_{-\infty}^x f(t; \lambda)\, dt.$$
If $F(x; \lambda)$ is not available in closed form then we can approximate the numerator of $v_{3i}$ using
$$\frac{\partial F(x; \lambda)}{\partial \lambda} \approx \frac{F(x; \lambda + \epsilon) - F(x; \lambda - \epsilon)}{2\epsilon} + R,$$
where $R = -\frac{\epsilon^2}{6}\, \partial^3 F(x; \lambda)/\partial\lambda^3$. Press, Teukolsky, Vetterling and Flannery (1988) discuss the accuracy of using the above equation with $R = 0$ and how to choose the value of $\epsilon$ in practice.
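To make this concrete, a brief sketch (illustrative, not the author's software) that assembles the ancillary direction vectors for Student($\lambda$) errors, using a central difference in $\lambda$ for the third column as just described; the data and step size are my own choices:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

y = np.array([0.2, 1.5, -0.8, 2.1, 0.9, 1.3])   # illustrative data

def negloglik(par):                              # -l(mu, sigma, lambda; y)
    mu, log_s, log_lam = par
    s, lam = np.exp(log_s), np.exp(log_lam)
    return -np.sum(stats.t.logpdf((y - mu) / s, df=lam) - np.log(s))

fit = minimize(negloglik, x0=[y.mean(), np.log(y.std()), np.log(5.0)])
mu_hat, s_hat, lam_hat = fit.x[0], np.exp(fit.x[1]), np.exp(fit.x[2])
z_hat = (y - mu_hat) / s_hat                     # standardized residuals

eps = 1e-4 * lam_hat                             # step for the central difference
dF_dlam = (stats.t.cdf(z_hat, df=lam_hat + eps)
           - stats.t.cdf(z_hat, df=lam_hat - eps)) / (2 * eps)

V = np.column_stack([np.ones_like(y),            # direction for mu
                     z_hat,                      # direction for sigma
                     -s_hat * dF_dlam / stats.t.pdf(z_hat, df=lam_hat)])
print(V)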
The reparameterization in (4.19) can be expressed as
$$\varphi'(\theta) = l_{;V}(\theta; y^0) = \left(\sum_{i=1}^n \frac{\partial l_i}{\partial y_i}\, v_{1i},\ \sum_{i=1}^n \frac{\partial l_i}{\partial y_i}\, v_{2i},\ \sum_{i=1}^n \frac{\partial l_i}{\partial y_i}\, v_{3i}\right).$$
The Jacobian of the parameter change is
$$J_3(\theta) = \frac{\partial \varphi(\theta)}{\partial \theta'}.$$
Suppose, for example, that the interest parameter is $\mu$. The scalar linear parameter that corresponds to $\mu$ in a neighborhood of $\hat\theta_\mu$ is given by (4.27),
$$\chi(\theta) = K_\mu(\hat\theta_\mu)\, \varphi(\theta),$$
where $K_\mu(\theta)$ corresponds to the $\mu$th row of $K(\theta) = J^{-1}(\theta)$.

The significance function for $\mu$ is available from either the Lugannani and Rice formula (4.24) or the Barndorff-Nielsen formula (4.25), with $R$ and $Q$ computed as in (4.26), where $J_2(\theta)$ is the matrix obtained from the last two columns of $J_3(\theta)$ and the nuisance information matrix is computed from the constrained fit at $\hat\theta_\mu$, recalibrated by the matrix volume $|J_2'(\hat\theta_\mu)\, J_2(\hat\theta_\mu)|^{1/2}$.
6.2.1 Linear Regression with A Shape Parameter
Fraser, Monette, Ng and Wong (1995) applied the FR method to generalized linear models with non-linear link function, under the assumption that the error distribution did not depend on any unknown parameters. Fraser (1979, ch. 6) considered linear regression models where the error distribution depends on a shape parameter and used exact conditional methods to obtain inference for a scalar interest parameter for a fixed value of the shape parameter. Lange et al. (1989) fit regression models with Student($\lambda$) errors and used first order methods to obtain confidence intervals for a scalar interest parameter with the shape parameter $\lambda$ unknown.

Consider the linear regression model
$$y = X\beta + \sigma e,$$
where $X$ is an $n \times p$ design matrix of rank $p$, $\beta = (\beta_1, \ldots, \beta_p)'$, and $e$ has density function $f(\cdot\,; \lambda)$. The likelihood function is
$$l(\beta, \sigma, \lambda; y) = \sum_{i=1}^n \log\left\{\frac{1}{\sigma}\, f\!\left(\frac{y_i - x_i'\beta}{\sigma};\ \lambda\right)\right\}.$$
The $p + 2$ tangent directions to some third order ancillary are given by
$$v_i = \left(x_{i1}, \ldots, x_{ip},\ \hat e_i,\ -\hat\sigma\, \frac{\partial F(\hat e_i; \lambda)/\partial\lambda}{f(\hat e_i; \hat\lambda)}\right),$$
where $\hat e_i = (y_i - x_i'\hat\beta)/\hat\sigma$. The remaining likelihood based quantities are easily obtained. Third-order tests and confidence intervals for a scalar component of $(\beta_1, \ldots, \beta_p, \sigma, \lambda)$ can be obtained from (4.24) or (4.25) together with (4.26).
Chapter 7
Analysis of Error Models with a
Shape Parameter
A method for generating error distributions with a shape parameter is developed in this chapter. In addition, an analysis of some familiar error models with a shape parameter is presented.
7.1 Some Familiar Error Distributions with a Shape Parameter
When the population under study has a heavy tailed distribution, so that the occurrence of large values is not rare, the following families have been cited in the literature as being appropriate error distributions: the Student($\mu, \sigma, \lambda$) family, slash distributions, contaminated normal, and exponential power family. All of these families, with the exception of the exponential power family, can be expressed as the distribution of the ratio
$$y = \mu + \frac{r - \mu}{\tau}.$$
The scaling random variable $\tau$, which is assumed to be independent of $r \sim N(\mu, \sigma^2)$, has density function $g(\cdot\,; \lambda)$, which forms a one-parameter exponential family. The density of $y$ then has the form
$$h(y; \mu, \sigma, \lambda) = \int_{-\infty}^{\infty} \frac{|t|}{\sigma}\, \phi\!\left(\frac{t\,(y - \mu)}{\sigma}\right) g(t; \lambda)\, dt. \tag{7.1}$$
The ratio given above is of no direct importance for the inferential problem, but is useful for generating different families of error distributions with a shape parameter where the practical computation of maximum likelihood estimates is available using the EM algorithm (Lange et al. 1989).
7.1.1 Student ($\mu, \sigma, \lambda$) Family

If $\tau$ is chosen such that
$$\tau = \left(\chi^2_\lambda/\lambda\right)^{1/2}, \qquad \text{i.e., } \tau^2 \sim \mathrm{Gamma}(\lambda/2,\ \lambda/2),$$
where the $\mathrm{Gamma}(\alpha, \beta)$ density is
$$g(t; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, t^{\alpha - 1}\, e^{-\beta t}, \qquad t > 0,$$
then the marginal distribution of $y$ is given by (7.1), namely the Student($\mu, \sigma, \lambda$) distribution (Dickey 1968).
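A quick simulation check of this representation (an illustration of mine): drawing $\tau^2$ from Gamma($\lambda/2, \lambda/2$) and forming $y = \mu + (r - \mu)/\tau$ should reproduce Student($\mu, \sigma, \lambda$) quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, lam, N = 1.0, 2.0, 4.0, 200_000

r = rng.normal(mu, sigma, N)                     # r ~ N(mu, sigma^2)
tau = np.sqrt(rng.gamma(lam / 2, 2 / lam, N))    # tau^2 ~ Gamma(lam/2, rate lam/2)
y = mu + (r - mu) / tau

# Compare sample quantiles with the Student(mu, sigma, lam) quantiles.
for p in [0.05, 0.25, 0.5, 0.75, 0.95]:
    print(p, np.quantile(y, p).round(3),
          (mu + sigma * stats.t.ppf(p, df=lam)).round(3))
```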
7.1.2 Slash Distributions

Rogers and Tukey (1962) advocated choosing $\tau$ to have finite support, with density function $(\lambda + 1)\, t^{\lambda}\, I_{[0,1]}(t)$, where $I_A(\cdot)$ is the indicator function. The slash family is defined as the corresponding family of distributions of $y$ in (7.1). In particular when $\lambda = 0$ then $\tau$ is uniform and $y$ has the standard slash distribution (i.e., $\mu = 0$ and $\sigma = 1$) with density
$$h(y) = \frac{\phi(0) - \phi(y)}{y^2}, \qquad y \neq 0,$$
and $h(0) = \phi(0)/2$.

This model may have better statistical properties than the Student($\lambda$) for data with extreme outliers, but is less convenient from a computational point of view. In particular the likelihood function involves the computation of the incomplete gamma function for each observation (Lange, Little and Taylor 1989).
7.1.3 Contaminated Normal

Suppose that $\tau$ is chosen to be a binary random variable with density function
$$g(t; \lambda) = \begin{cases} 1 - \lambda, & \text{if } t = 1, \\ \lambda, & \text{if } t = w, \\ 0, & \text{otherwise,} \end{cases}$$
where $\lambda \in (0, 1)$ and $w > 0$ are assumed to be known. The marginal density of $y$ is given by (7.1),
$$h(y; \mu, \sigma, \lambda) = (1 - \lambda)\, \frac{1}{\sigma}\, \phi\!\left(\frac{y - \mu}{\sigma}\right) + \lambda\, \frac{w}{\sigma}\, \phi\!\left(\frac{w\,(y - \mu)}{\sigma}\right),$$
a mixture of a $N(\mu, \sigma^2)$ random variable and a $N(\mu, (\sigma/w)^2)$ random variable. If $\lambda$ is chosen small and $w$ is small (say 0.1) then the resulting error distribution is an appropriate model when the data are contaminated with a small fraction of outliers. Little (1988) compared the contaminated normal and the Student($\lambda$) in a simulation study and found they yielded comparable robust estimates of location and scale, although the contaminated model requires specification or estimation of two robustness parameters, $\lambda$ and $w$, rather than just one for the Student($\lambda$) model.
7.1.4 Exponential Power Family

Another family that comprises distributions with tails that are both longer and shorter than the normal is the exponential power family (Box and Tiao 1973). This family has densities of the form
$$f(y; \mu, \sigma, \beta) = c(\beta)\, \sigma^{-1}\, \exp\left\{-\frac{1}{2}\left|\frac{y - \mu}{\sigma}\right|^{2/(1+\beta)}\right\}, \qquad \beta \in (-1, 1],$$
where $c(\beta)$ is the normalizing constant. The shape parameter $\beta$ measures the amount of kurtosis, indicating the amount of non-normality. When $\beta = 0$ the distribution is normal. When $\beta = 1$ the distribution is Laplace, and letting $\beta \to -1$ the density function tends to the uniform distribution. Fraser (1979) and Lange et al. (1989) found that the distributional form of the power exponential family is unrealistic for modelling real data as $|\beta| \to 1$. Lange et al. reported more computational problems with maximum likelihood estimation than with the Student($\lambda$) family, caused by the shape parameter $|\beta| \to 1$. The EM algorithm is unavailable for maximum likelihood estimation since the power exponential family is not expressible as a mixture of members of the exponential family (Lange et al. 1989). Moreover the derivatives of either the density or log density with respect to the data and location or scale parameters are discontinuous, a very unattractive feature of the exponential power family.
7.2 Generating Error Models
In the previous section a method, based on the ratio of a normal random variable to some independent random variable, was used to generate some familiar error distributions with a shape parameter. In this section a method is introduced to generate error distributions with a shape parameter by transforming an error variable z via a transformation belonging to a one-parameter group.
7.2.1 The Construction of a Continuous Group using Infinitesimal Transformations

In this section we review the construction of a one-parameter group of diffeomorphisms, a Lie group, using infinitesimal generators.
A one-parameter group of transformations G = {η(·; λ) : λ ∈ ℝ} of a subset M of Euclidean space is a mapping

$$ \eta : M \times \mathbb{R} \to M $$

such that

- η is a differentiable mapping;
- the mapping η(·; λ) : M → M is a diffeomorphism for every λ ∈ ℝ.

The equation

$$ y = \eta(x; \lambda) \tag{7.2} $$

for each λ ∈ ℝ defines a transformation of a point x ∈ ℝ into a point y ∈ ℝ. The set

$$ Gx = \{\eta(x; \lambda) : \lambda \in \mathbb{R}\} $$

is called the orbit of the point x. A point x traces out an orbit as all transformations η ∈ G are applied to it.
Suppose that λ₀ is the value of λ for the identity transformation, that is,

$$ \eta(x; \lambda_0) = x. \tag{7.3} $$

The function η is the solution of the ordinary differential equation (7.4) of the group (Eisenhart 1961) satisfying the initial condition (7.3), so that at λ = λ₀ the transformation is the identity.

Consider the reparameterization

$$ a = \beta(\lambda) $$

with inverse β⁻¹(a). It follows that a = 0 yields the identity, and the differential equation of the group (7.4) becomes

$$ \frac{\partial \eta(x; a)}{\partial a} = \xi(\eta(x; a)) \tag{7.5} $$

with initial condition η(x; 0) = x. Expanding η as a power series in a we have that

$$ \eta(x; a) = x + a\,\xi(x) + \frac{a^2}{2!}\,\xi(x)\,\xi'(x) + \cdots. \tag{7.6} $$

In (7.6) we have used the fact that

$$ \frac{\partial^2 \eta}{\partial a^2} = \xi'(\eta)\,\frac{\partial \eta}{\partial a} = \xi'(\eta)\,\xi(\eta), $$

where ξ′(t) = dξ(t)/dt.
Let U be the linear operator defined by

$$ Uf(x) = \xi(x)\, \frac{df(x)}{dx}; $$

U will be called the infinitesimal operator. Using operator notation, equation (7.6) may be written as

$$ \eta(x; a) = \sum_{m=0}^{\infty} \frac{a^m}{m!}\, U^m x, \tag{7.7} $$

where U^m f is the result of composing U precisely m times on f.
Suppose that the series (7.7) is convergent for a ∈ [0, a₁]; then equation (7.7) is equivalent to equation (7.2) for λ ∈ β⁻¹([0, a₁]), although this may mean that only a portion of the orbit of x is given by (7.7). By a transformation of coordinates, (7.7) can take the form of a translation for limited values of a, so that the orbit of x is the line segment parallel to the 1 vector passing through x. Whence, if P(w) is a point on the line segment of the orbit of x, for λ ∈ β⁻¹([0, a₁]), then on applying (7.7) to P(w) we get another line segment of the orbit. Hence by sufficient repetition of (7.7) we obtain the entire orbit of x in so far as it is defined by (7.2) (Eisenhart 1961).
Replacing a in (7.6) by an infinitesimal Δa and neglecting powers of Δa higher than the first, we obtain

$$ \eta(x; \Delta a) = x + \xi(x)\,\Delta a, $$

which is called the infinitesimal transformation of the group G. The transformation in (7.7) is said to be generated by the infinitesimal transformation (Eisenhart 1961) in the sense described in the preceding paragraph.
Let G = {η(·; λ) : λ ∈ ℝ} denote a one-parameter group of transformations of a domain U and let

$$ \xi(x) = \left. \frac{\partial \eta(x; \lambda)}{\partial \lambda} \right|_{\lambda = 0}. $$

Then p(λ) = η(x; λ) is a solution of the partial differential equation (7.5). Conversely, if ξ(x) is a differentiable function such that ξ(x) is identically zero for sufficiently large |x|, then the partial differential equation

$$ \frac{\partial \eta(x; \lambda)}{\partial \lambda} = \xi(\eta(x; \lambda)) \tag{7.9} $$

has a solution η(x; λ) such that η(x; 0) = x. Moreover, the family {η(·; λ) : λ ∈ ℝ} defines a group of transformations on ℝ consisting of transformations generated by the infinitesimal transformation (Eisenhart 1961, pp. 54). The solution η is unique in a neighborhood of the origin and satisfies

$$ \lambda = \int_x^{\eta(x; \lambda)} \frac{du}{\xi(u)} $$

provided that ξ(u) ≠ 0, and η(x; λ) = x if ξ(u) = 0 (Arnold 1973).
If the function ξ(·) is defined on a noncompact set, then it is possible that the set {η(·; λ) : λ ∈ ℝ} will not form a one-parameter group on ℝ. Arnold (1973, pp. 22-23) provides a simple counterexample by defining ξ(x) = x², so that the solution of (7.9) is η(x; λ) = x/(1 − λx). The domain of η is λ < 1/x and λ > 1/x, so that the restriction of η to these two intervals gives two unrelated solutions. One reason why the set {x/(1 − λx) : λ ∈ ℝ} does not form a one-parameter group on ℝ is simply that η(·; λ) (λ ≠ 0) is not defined on all of ℝ.
Example 10 This is problem 1 from Arnold (1973, pp. 15). Suppose that

$$ \xi(x) = \sin(x), \qquad x \in \mathbb{R}. $$

The solution η of (7.9) satisfies

$$ \lambda = \int_x^{\eta} \frac{du}{\sin(u)} = \int_x^{\eta} \csc(u)\, du = \Big[\log |\csc(u) - \cot(u)|\Big]_x^{\eta}, $$

so that

$$ \eta(x; \lambda) = 2 \arctan\big(\exp(\lambda)\, |\tan(x/2)|\big). $$

The identity transformation occurs at λ = 0:

$$ \eta(x; 0) = 2 \arctan(|\tan(x/2)|) = x, \qquad x \in (0, \pi). $$

To verify that η is a solution of (7.9), first observe that

$$ \frac{\partial \eta(x; \lambda)}{\partial \lambda} = \frac{2 \exp(\lambda)\, |\tan(x/2)|}{1 + \exp(2\lambda) \tan^2(x/2)}. $$

Letting y = arctan(exp(λ)|tan(x/2)|) and using the double-angle formula for the sine, we have that

$$ \sin(\eta(x; \lambda)) = \sin(2y) = 2 \sin(y) \cos(y) = 2 \tan(y) \cos^2(y) = \frac{2 \exp(\lambda)\, |\tan(x/2)|}{1 + \exp(2\lambda) \tan^2(x/2)}, $$

so that ∂η(x; λ)/∂λ = ξ(η(x; λ)). In addition, it is not difficult to show that {η(·; λ) : λ ∈ ℝ} is a one-parameter group of transformations, with inverse η⁻¹(y; λ) = 2 arctan(exp(−λ) tan(y/2)).
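The computations in Example 10 can also be checked numerically: a central-difference derivative of η in λ should agree with ξ(η) = sin(η), and the stated inverse should recover x. A small sketch, assuming x ∈ (0, π) so the absolute value is inactive:

    # Solution of the group ODE for xi(x) = sin(x).
    eta <- function(x, lam) 2 * atan(exp(lam) * tan(x/2))

    x <- 1.3; lam <- 0.7; h <- 1e-6
    deriv.fd <- (eta(x, lam + h) - eta(x, lam - h))/(2 * h)   # central difference
    deriv.fd - sin(eta(x, lam))                 # should be numerically zero

    # The inverse transformation recovers x:
    2 * atan(exp(-lam) * tan(eta(x, lam)/2)) - x   # also near zero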
Example 11 This is example 2.6 from Arnold (1973, pp. 15). Suppose that

$$ \xi(x) = kx, $$

where k is a constant. The solution of (7.9) satisfying η(x; 0) = x is simply

$$ \eta(x; \lambda) = x \exp(k\lambda). $$

It is not difficult to verify that the set {η(·; λ) : λ ∈ ℝ} is a one-parameter group of transformations, the scale group.
7.2.2 Generating Error Distributions with a Shape Parameter via the Infinitesimal Transformation

Let ξ(·) be a differentiable function defined on ℝ such that ξ(x) is zero for large |x|. It then follows that the differential equation (7.9) has a unique solution η(·; λ) such that η(x; 0) = x, and the set G = {η(·; λ) : λ ∈ ℝ} forms a one-parameter group of transformations. The transformations are generated by the infinitesimal transformation

$$ x + \xi(x)\,\Delta\lambda. $$

The function ξ(·) is approximately equal to the derivative of η at the parameter value yielding the identity transformation:

$$ \eta(x; \Delta\lambda) \approx x + \xi(x)\,\Delta\lambda $$

for small Δλ.
In addition, suppose that we only consider functions ξ(·), yielding a solution η of (7.9), such that condition (7.10) holds for each λ, where γ ∈ {−1, 0, 1}.

Let z be a symmetric continuous error variable with distribution function F_z defined on ℝ that is standardized so that the interval (−1, 1) contains 68.26% of the probability, that is,

$$ F_z(1) - F_z(-1) = 0.6826. \tag{7.11} $$

Then it must be the case that

$$ F_\eta(1) - F_\eta(-1) = 0.6826, $$

where F_η is the distribution function of η. The standardized distribution of η(z; λ) is symmetric and has 68.26% of its probability in the interval (−1, 1) for each λ. So if μ and σ are the median and standard error of η(z; λ), then the interval (μ − σ, μ + σ) will contain 68.26% of the probability for the distribution of η(z; λ) for each λ. Fraser (1976a, 1976b, 1979) used a similar approach to standardize the Student(λ) family.
Thus, given a continuous symmetric error variable z with distribution function F_z which satisfies (7.11), and a differentiable function ξ(·) that vanishes outside some closed interval and yields a solution η(z; λ) of the partial differential equation (7.9) such that (7.10) and η(z; 0) = z hold, continuous symmetric error distributions with a shape parameter can be generated. Examples of functions ξ which satisfy the above conditions are currently being investigated by the author and D.A.S. Fraser.
7.2.3 Power Transformations

Instead of generating a continuous group of transformations via the infinitesimal transformation, a specific group can be specified. By changing the distribution of the error variable z, different families of error distributions with a shape parameter can be generated. In this section we consider the group of power transformations as an example.

Suppose that the group action is defined as η_λ(z) = |z|^λ, z ∈ ℝ, λ > 0, the power transformation of z. It is easy to verify that the set of all power transformations P = {η_λ : λ > 0} forms a commutative group. The composition of two power transformations is given by

$$ \eta_\alpha \circ \eta_\beta(z) = \big(|z|^\beta\big)^\alpha = |z|^{\alpha\beta} = \eta_{\alpha\beta}(z) $$

for α, β > 0. The group identity is η₁(z) = |z|¹ = |z| ∈ P. The inverse of a power transformation η_λ(z) ∈ P is given by the power transformation η_λ⁻¹(z) = |z|^{1/λ} ∈ P. Hence P is a group.
Example 12 This is an exercise in Fraser (1976b). Suppose that z ~ exp(1), so that f(z) = exp(−z), z > 0. Then the power transformation w = η_λ(z) = z^λ, λ > 0, z > 0, has density function

$$ f_w(w) = \frac{1}{\lambda}\, w^{1/\lambda - 1} \exp\big(-w^{1/\lambda}\big), \qquad w > 0, $$

the Weibull(1/λ) distribution.
Example 13 Let z ~ N(0, 1). The power transformation w = η_λ(z) = |z|^λ, λ ≥ 1, z ∈ ℝ, has distribution function

$$ F_w(w) = P\big(|z|^\lambda \le w\big) = 2\Phi\big(w^{1/\lambda}\big) - 1, \qquad w > 0. $$

The density function is then obtained by differentiation:

$$ f_w(w) = \frac{2}{\lambda}\, w^{1/\lambda - 1}\, \phi\big(w^{1/\lambda}\big), \qquad w > 0. $$

The power transformation of a standard normal random variable yields a family of error distributions where the density functions are right-skewed. In particular, when λ = 1 the distribution of w is the standard normal density folded onto (0, ∞), and when λ = 2 the distribution of w is the chi-square with one degree of freedom.
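A Monte Carlo check of Example 13 is immediate: the empirical distribution function of w = |z|^λ should track the claimed F_w(w) = 2Φ(w^{1/λ}) − 1. A sketch:

    lambda <- 3
    w <- abs(rnorm(20000))^lambda

    # Compare the empirical distribution function with the claimed
    # F_w(w) = 2*pnorm(w^(1/lambda)) - 1 at a few points.
    pts <- c(0.1, 0.5, 1, 2, 5)
    cbind(empirical = sapply(pts, function(p) mean(w <= p)),
          claimed   = 2 * pnorm(pts^(1/lambda)) - 1)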
7.2.4 Location-Scale-Shape Analysis

Consider the location-scale-shape model

$$ y = \mu + \sigma w, $$

where the error variable w, with shape parameter λ, is generated from a group transformation acting on another error variable z. The FR method for location-scale-shape models was derived in Chapter 6. Implementation of the FR method requires the tangent directions, at the data, to a second order ancillary. In addition, a first derivative ancillary has the same tangent directions as a second order ancillary. The directions obtained in (6.7) for a general error variable can be described by the vectors v = (v₁, v₂, v₃), where F is a distribution function or some equivalent pivotal quantity. When the error variable is generated from a group of transformations acting on z = η⁻¹((y − μ)/σ; λ), the λ direction vector v₃ involves ρ(z; λ) = ∂η(z; λ)/∂λ.
Chapter 8

Location-Scale-Shape Analysis with Student(λ) Errors

The Student(λ) distribution

$$ f(z; \lambda) = \frac{\Gamma\{(\lambda+1)/2\}}{\Gamma(\lambda/2)\sqrt{\lambda\pi}} \left(1 + \frac{z^2}{\lambda}\right)^{-(\lambda+1)/2} $$

is a family of continuous symmetric densities that range from the thick-tailed Cauchy (λ = 1) up to the normal (λ → ∞). The Student(λ) distribution for small values of λ is often cited as being a more realistic error distribution than the normal. Fraser (1979) showed that for a given value of λ, location-scale analysis with Student(λ) errors is robust, including resistance to outliers. Lange, Little and Taylor (1989) studied the robustness of both linear and non-linear regression models (with the possibility of missing data) assuming that the errors followed a Student(λ) distribution. Lange et al. used first order methods to study the precision of the least squares estimates they obtained. Indeed, the Student(λ) family has been cited in the literature as a reasonable family for the error distribution of a statistical model that combines computational and conceptual simplicity with generality. The FR method applied to the location-scale-shape problem will produce more accurate p-values and confidence intervals than were previously available.
8.1 Asymptotic Location-Scale-Shape Analysis with Student(λ) Errors

In this section we derive the likelihood related quantities needed in order to implement the FR method for the location-scale-shape model with Student(λ) errors.

The log-likelihood function for θ = (μ, σ, λ) is

$$ l(\theta) = n \log c(\lambda) - \frac{n}{2} \log(\lambda\pi) - n \log \sigma - \frac{\lambda + 1}{2} \sum_{i=1}^{n} \log\big(1 + \lambda^{-1} z_i^2\big), $$

where

$$ z_i = \frac{y_i - \mu}{\sigma}, \qquad c(\lambda) = \frac{\Gamma\{(\lambda+1)/2\}}{\Gamma(\lambda/2)}. $$

The maximum likelihood estimator for θ cannot be obtained in closed form; hence an iterative algorithm must be used to obtain the maximum likelihood estimate for any particular data set.
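The log-likelihood above is straightforward to code, and a general-purpose optimizer can then carry out the iterative maximization. A minimal S-Plus/R sketch, offered only as an illustration using nlminb with σ and λ kept on the log scale for positivity; it is not the thesis's C or Lisp-Stat implementation:

    # Negative log-likelihood for theta = (mu, log sigma, log lambda)
    # under the Student(mu, sigma, lambda) model above.
    negloglik <- function(theta, y)
    {
        mu <- theta[1]
        sigma <- exp(theta[2])
        lambda <- exp(theta[3])
        z <- (y - mu)/sigma
        n <- length(y)
        logc <- lgamma((lambda + 1)/2) - lgamma(lambda/2)   # log c(lambda)
        -(n*logc - (n/2)*log(lambda*pi) - n*log(sigma) -
          ((lambda + 1)/2)*sum(log(1 + z^2/lambda)))
    }

    # Hypothetical usage, starting from the median and a
    # half-interquartile-range scale:
    y <- rt(25, 3)
    start <- c(median(y), log(diff(quantile(y, c(0.25, 0.75)))/2), log(5))
    fit <- nlminb(start, negloglik, y = y)
    # S-Plus returns fit$parameters; R returns fit$par.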
The observed information matrix is

$$ j(\theta) = -\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^{\mathsf T}}. $$

The entries of j(θ) are obtained by differentiating l(θ) twice; for example,

$$ j_{\mu\mu} = \frac{\lambda + 1}{\sigma^2} \sum_{i=1}^{n} \frac{\lambda - z_i^2}{(\lambda + z_i^2)^2}. $$
We can obtain expressions for both c′(λ) and c″(λ) in terms of c(λ), the digamma function

$$ \psi(\lambda) = \frac{d \log \Gamma(\lambda)}{d\lambda}, $$

and the trigamma function ψ′(λ). Indeed, the first derivative of c(λ) can be written as

$$ c'(\lambda) = \frac{c(\lambda)}{2} \left[ \psi\!\left(\frac{\lambda+1}{2}\right) - \psi\!\left(\frac{\lambda}{2}\right) \right]. $$

Differentiating again would yield an expression for c″(λ) in terms of c(λ), c′(λ), ψ(λ) and ψ′(λ). Asymptotic formulae are available for both ψ(λ) and ψ′(λ) (see Abramowitz and Stegun 1972). The information matrix can also be found by obtaining the Hessian of −l numerically.
Lange, Little and Taylor (1989) derive the expected information matrix for an elliptically symmetric family of densities. The expected information is block diagonal with the mean component in one block and the scale and shape components in another block. Hence the maximum likelihood estimators of μ and (σ, λ) are asymptotically uncorrelated to the first order. Thus, insofar as first order asymptotics are concerned, treating λ as unknown, or estimating λ from the data and treating it as fixed, should not have much effect on the standard error of μ̂.
The tangent directions to some third order ancillary are given by (6.7), with the numerator of v₃ given by the derivative (8.1) of the distribution function with respect to λ, where

$$ F(z; \lambda) = \int_{-\infty}^{z} \frac{\Gamma\{(\lambda+1)/2\}}{\Gamma(\lambda/2)\sqrt{\lambda\pi}} \big(1 + \lambda^{-1} t^2\big)^{-(\lambda+1)/2}\, dt. $$

Since F(z; λ) is not available in closed form, (8.1) will be approximated using (6.8). The reparameterization (4.19) follows from the general construction. As λ → ∞, clearly ∂F/∂λ → 0 and φ tends to the reparameterization obtained from the N(μ, σ²) model.
The Jacobian of the parameter change is J₃(θ) = ∂φ(θ)/∂θ; its columns, for j = 1, 2, 3, are obtained by differentiating φ with respect to each of μ, σ and λ, with z_i = (y_i − μ)/σ.
Thus with accuracy O(n^{−3/2}) we can use (4.24) or (4.25) with (6.9) from the previous section to obtain the significance function for μ. The significance function for σ can be obtained by similar computations.
8.2 Maximum Likelihood Estimation in the Student (μ, σ, λ) Family

If we assume that the error distribution is the Student(λ) distribution, then in order to implement (4.24) or (4.25) we will need maximum likelihood estimates in the full and constrained models.

Two different algorithms have been used to compute the maximum likelihood estimate: a quasi-Newton method and the EM algorithm. The EM algorithm is implemented in Lisp-Stat (Tierney 1990) and the quasi-Newton method in C.
Let θ^{(k)} be the Newton iterate at step k. The quasi-Newton method computes

$$ \theta^{(k+1)} = \theta^{(k)} + \tau_k\, j^{-1}(\theta^{(k)})\, S(\theta^{(k)}), $$

where j(θ^{(k)}) is the approximation to the observed information at step k and S(θ) is the score vector. The term j^{-1}(θ^{(k)}) S(θ^{(k)}) gives the direction of the current increment and τ_k ∈ (0, 1] gives its length. Taking the full Newton step j^{-1}(θ^{(k)}) S(θ^{(k)}) is not guaranteed to decrease −l sufficiently, but if we move to a point along the direction of the Newton step by decreasing τ_k then we can often decrease −l(θ^{(k+1)}) according to our criterion. For further details see Press, Teukolsky, Vetterling and Flannery (1988) and the references therein.
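The step-halving idea can be sketched in a few lines. A generic damped Newton update, assuming user-supplied functions nll, score and info for −l(θ), S(θ) and j(θ) (all hypothetical names):

    # One damped Newton update: shrink the step until the
    # negative log-likelihood nll decreases.
    newton.step <- function(theta, nll, score, info)
    {
        direction <- solve(info(theta), score(theta))   # j^{-1} S
        tau <- 1
        while (nll(theta + tau * direction) > nll(theta) && tau > 1e-8)
            tau <- tau/2                                # halve the step length
        theta + tau * direction
    }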
The EM algorithm (Dempster, Laird and Rubin 1977) is an algorithm for computing the maximum likelihood estimate from incomplete data. The EM algorithm augments the data (y₁, ..., y_n) with additional data (τ₁, ..., τ_n) such that the maximum likelihood estimate of θ given (y, τ) is easy to compute. Suppose that θ^{(k)} is the value of θ from the kth iteration of the algorithm; the (k + 1)st iteration of EM consists of: (1) E-step: computing the expected value of l(θ; y, τ) with respect to the conditional distribution of τ given (y, θ^{(k)}); (2) M-step: maximizing the resulting function with respect to θ. Maximum likelihood estimation in the Student (μ, σ, λ) family using the EM algorithm is described in great detail in Liu and Rubin (1995).
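For fixed λ the E- and M-steps take the familiar closed form used by Lange, Little and Taylor (1989): the E-step reduces to the weights w_i = (λ + 1)/(λ + z_i²) and the M-step to a weighted mean and scale. A sketch for the location-scale case with λ held fixed; an illustrative helper, not the Lisp-Stat implementation:

    # EM for Student(mu, sigma, lambda) with lambda held fixed.
    em.student <- function(y, lambda, mu = median(y), sigma = 1, niter = 100)
    {
        for (k in 1:niter) {
            z <- (y - mu)/sigma
            w <- (lambda + 1)/(lambda + z^2)       # E-step: conditional weights
            mu <- sum(w * y)/sum(w)                # M-step: weighted mean
            sigma <- sqrt(sum(w * (y - mu)^2)/length(y))   # M-step: weighted scale
        }
        c(mu = mu, sigma = sigma)
    }

    em.student(rt(50, 3), lambda = 3)   # example call on simulated data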
Starting values for θ = (μ, σ, λ) were the median for μ, the interquartile range for σ, and a grid search over various λ values. Personal experience suggests that these starting values work better, especially in simulations, than the starting values suggested by Lange, Little and Taylor (1989) (fitting a normal model by least squares and using the kurtosis of the residuals from the normal fit for the starting value of λ). Newton's method seemed to be very sensitive to starting values, which was not the case for the EM algorithm.
8.3 Modeling Real Data with Student(λ) Errors

In this section one simulated data set and one real data set are analyzed using the location-scale-shape model with Student(λ) errors. Both data sets have one extreme observation, indicating that a normal analysis is inappropriate for these data sets. Three methods are used to obtain confidence intervals for the location parameter μ: the FR method, the signed likelihood ratio statistic, and numerical integration of the exact conditional density. The last method can only be used for a fixed value of the shape parameter λ, since there is no exact conditional density function when λ is assumed to be unknown.
An appropriate definition for the location and scale parameters μ and σ in the Student (μ, σ, λ) family are the median and the standard error (the half-length of the interval, symmetric about the median, that contains 68.26% of the probability) respectively. Correspondingly, the Student (μ, σ, λ) family can be rescaled so that the interval (−1, 1) contains 68.26% of the probability. The scaling factor can be obtained for each λ by solving the equation

$$ P(-1 \le Y \le 1) = 0.6826, $$

where Y = a(λ)Z, Z ~ Student(λ). From symmetry we have that

$$ F_Y(1) = 0.8413, $$

so a(λ) = 1/F⁻¹(0.8413; λ), where F⁻¹ is the inverse cumulative distribution function of the Student(λ) distribution. The Student(λ) distribution has 68.26% probability in (−t_λ, t_λ); some values of t_λ are given below.

[Table: values of t_λ for selected λ.]

These values were obtained from Fraser (1976b, pp. 467).

In this section let μ be the median and σ the standard error, such that (μ ± σ) contains 68.26% of the probability.
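Since F⁻¹(0.8413; λ) is available as the function qt, the rescaling factor is one line in S-Plus or R; for example (using λ = 100 as a stand-in for the normal limit):

    lambda <- c(1, 2, 4, 8, 100)
    t.lam <- qt(0.8413, lambda)   # 68.26% of probability lies in (-t.lam, t.lam)
    a.lam <- 1/t.lam              # scaling: a(lambda)*Z has 68.26% in (-1, 1)
    rbind(lambda, t.lam, a.lam)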
8.3.1 Cauchy Data
The following 10 observations were generated from the Cauchy distribution.
[Figure: histogram of the sample of 10 from the Cauchy distribution.]
The profile likelihood function for λ evaluated at λ = 2^k, k = 0, 1, 2, 3, ∞, is given in the table below.

[Table: profile likelihood of λ.]

Twice the difference between the likelihood of the best fitting Student model and the normal model is 2(l(μ̂, σ̂, 1) − l(μ̂, σ̂, ∞)) = 4.92806 and to the first order follows a chi-square distribution on 1 degree of freedom; P(χ²₁ > 4.92806) = 0.02642, a significant improvement in fit over the normal model.
The following tables record 95% confidence intervals for μ, with λ = 2^k, k = 0, 1, 2, 3, ∞, and unknown (uk), using the FR method, the signed likelihood ratio statistic R, and numerical integration of the exact conditional distribution.

|       | λ = 1              | λ = 2               | λ = 4               |
| Exact | (0.05506, 0.80387) | (−0.15400, 0.81528) | (−0.40106, 0.83428) |
| FR    | (0.05716, 0.80954) | (−0.16298, 0.81498) | (−0.41191, 0.83652) |

|       | λ = 8               | λ = ∞               | λ = uk |
| Exact | (−0.61012, 0.83808) | (−0.87620, 0.83428) | −      |

The maximum likelihood estimates are given in the table below.

[Table: maximum likelihood estimates of θ for each λ.]

As λ increases the location estimate shifts to the left, corresponding to the extreme value −3.02 on the left tail of the distribution; the estimate of the standard error also increases substantially as λ increases. The Student (μ, σ, λ) family, for smaller values of λ, is more resistant to the extreme value on the left tail of the distribution.
Discussion

When the value of λ is fixed the FR method provides a more accurate approximation to the exact answer than R. For instance, when λ = 1 the lower endpoints for the exact, FR, and R methods are 0.055, 0.057, and 0.134 respectively. The approximation using R results in a lower limit that is considerably larger than the exact, overstating the tightness of the confidence intervals, or precision, that is usually achieved by a Student(λ) analysis for small values of λ (Fraser 1976a, 1979; Lange et al. 1989). The FR method yields an interval whose resistance to extreme values is similar to the exact. As λ becomes larger the exact lower endpoint is pulled to the left due to the extreme observation, and the approximation to the lower endpoint using R is adequate, but is still outperformed by the FR method.
When the value of λ is unknown there is no exact method to provide a performance benchmark. Hence, it is difficult to ascertain how well the approximate methods compare to each other. However, when the shape parameter λ is assumed to be unknown we might expect the resulting confidence intervals to be a 'blend' of the fixed λ intervals, since we are averaging over the nuisance parameter distribution to obtain the marginal distribution. Indeed, this is the case for the FR method but not for R. The FR method yields: a lower limit when λ is unknown, −0.01220, that is between the lower limits for λ = 1, 0.05506, and λ = 2, −0.15400; and an upper limit when λ is unknown, 0.83015, that is between λ = 2, 0.81528, and λ = ∞, 0.83428. The first order method using R results in an upper limit, 0.79438, that is less than the smallest exact upper limit; and a lower limit, −0.04360, that is between the lower limits for λ = 1 and λ = 2. Thus, when λ is assumed to be unknown the FR method produces an interval that has lower endpoint falling between the fixed λ = 1 and λ = 2 intervals; smaller values of λ correspond to greater resistance, which is needed due to the extreme value on the left tail of the distribution. The upper endpoint, when λ is unknown, is a blend of the upper endpoints between the λ = 2 and λ = ∞ cases; larger values of λ correspond to less resistance.
8.3.2 Cushny-Peebles Data

Cushny and Peebles (1905) collected data that was analyzed by Student (1908) and later by Fisher (1925). The 10 observations are a measure of improvement under a change in drug therapy.

A histogram of the data is shown below.

[Figure: histogram of the Cushny and Peebles data.]
The profile likelihood function of λ evaluated at λ = 2^k, k = 0, 1, 2, 3, ∞, is given in the table below.

[Table: profile likelihood of λ.]

Twice the difference between the likelihood of the best fitting Student model and the normal model is 2(l(μ̂, σ̂, 1) − l(μ̂, σ̂, ∞)) = 4.69984 and to the first order follows a chi-square distribution on 1 degree of freedom; P(χ²₁ > 4.69984) = 0.03017, a significant improvement in fit over the normal model.
As in the previous example, 95% confidence intervals for the location parameter μ are recorded using the three methods discussed previously.

|       | λ = 1              | λ = 2              | λ = 4             |
| Exact | (0.89932, 1.66168) | (0.82153, 1.83282) | (0.76319, 2.0273) |
The maximum likelihood estimates of θ are recorded in the table below.

[Table: maximum likelihood estimates of θ for each λ.]
|       | λ = 8              | λ = ∞              | λ = uk             |
| Exact | (0.72429, 2.21011) | (0.70484, 2.45516) | −                  |
| FR    | (0.71810, 2.19654) | (0.70735, 2.45265) | (0.83020, 1.71193) |
The location estimate shifts to the right as λ becomes large, corresponding to the extreme value on the right tail of the distribution. Moreover, the standard error becomes larger as λ increases due to the increasing influence of the extreme value.
Discussion

The confidence intervals shift to the left as λ becomes smaller, providing higher resistance to the extreme value 4.6 on the right tail of the distribution. Both approximate methods preserve this behavior, although since the FR method is more accurate, the magnitude of the shift is almost identical to the exact.

The FR method provides a much better approximation to the exact, for fixed λ, than the first order R. Indeed, the inaccuracy of the first order R produces intervals that are tighter than the exact, overstating the precision of the estimate for μ. For example, when λ = 4 the first order interval is (0.80910, 1.94882), compared to the exact interval (0.76319, 2.0273) and the FR interval (0.76950, 2.02572).

When λ is assumed to be unknown, as mentioned in the previous example, there is no exact answer to use as the gold standard. However, the FR method produces an interval where the endpoints fall between the endpoints of the λ = 1 and λ = 2 cases. The first order method R produces an interval whose endpoints fall outside the smallest exact lower endpoint and the largest exact upper endpoint.
8.4 Simulation Study

8.4.1 Purpose

To gain a better understanding of how the third order FR method compares to the first order signed likelihood ratio statistic in repeated sampling, a simulation study of 100,000 trials was designed and implemented. The major research interest is to understand how each method behaves, for a small sample size, when the shape parameter λ is estimated from the data or fixed a priori, across different populations.
8.4.2 Methods

The algorithm used is described in the following pseudocode:

    1. for numsim = 1 to N
    2.   let u = generate_uniform(n)
    3.   for pop = 1 to P
    4.     let sample = inverse_cdf(u, pop)
    5.     for lam_method = 1 to 2
    6.       let
    7.         first_order = normal_cdf(r(lam_method, sample))
    8.         third_order = normal_cdf(rstar(lam_method, sample))
A sample of size n is generated from the uniform distribution on [0, 1]; for each of the P populations a sample is generated by computing the inverse probability transform inverse_cdf; finally, for fixed (lam_method = 1) and estimated (lam_method = 2) λ, a first order p-value is calculated using the signed likelihood ratio and a third order p-value is calculated using the FR method. The procedure is repeated N times.

The simulation was carried out on a sample size of 10. Two distributions were considered, the Student(6) and the N(0, 1), the latter corresponding to the λ = ∞ case in the Student(λ) family.
8.4.3 Results

The tables below summarize the results for 100,000 simulations. The observed p-values for testing μ = 0 were computed by the two approximate methods for estimated and fixed λ; the percentages of one-sided p-values less than the nominal 0.5%, 1.5%, 2.5%, 5%, 95%, 97.5%, 98.5%, and 99.5% were recorded in the tables below. Two types of Q-Q plots are also included at the end of the section: a detrended Q-Q plot and a standard Q-Q plot. The detrended Q-Q plot has the ordered quantiles of the theoretical distribution on the horizontal axis and the residuals from regressing the observed quantiles on the theoretical quantiles on the vertical axis; the line y = 0 is superimposed on the scatterplot to aid in the assessment of the fit. The S-Plus code used in constructing the detrended Q-Q plots for R* and R is:
detrended <- function(data)
{
    z <- qqnorm(data, plot = F)   # normal Q-Q coordinates, no plot
    # Residuals from a resistant regression of observed on
    # theoretical quantiles.
    plot(z$x, ltsreg(z$x, z$y)$residuals,
         cex = 0.5, xlab = "", ylab = "", axes = F)
    abline(h = 0)                 # draw horizontal line
}

The titles and axes were added after the plot was constructed using the function axes(). The standard Q-Q plots were obtained by using the two standard S-Plus functions qqnorm() and qqline() (Venables and Ripley 1994).
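For example, the detrended function above can be exercised on simulated normal deviates, with the annotation added afterwards (title() is used here for illustration):

    z <- rnorm(1000)
    detrended(z)   # residual Q-Q plot; should hug the line y = 0
    title(main = "Detrended Q-Q Plot", xlab = "Quantiles of Standard Normal")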
The Anderson-Darling A² test was used as a goodness-of-fit test in assessing the normality of R and R*. D'Agostino and Stephens (1986, pp. 406) include the A² test among their recommended tests for checking the normality of data. The S-Plus program for computing the A² test statistic is:
ad <- function(x)
{
    x1 <- sort(x)
    z <- (x1 - mean(x1))/sqrt(var(x1))   # standardize the ordered sample
    p <- pnorm(z)
    n <- length(x)
    res <- 0
    for (i in 1:n) {
        res[i] <- ((-1) * (2*i - 1) * (log(p[i]) + log(1 - p[n + 1 - i])))/n
    }
    (sum(res) - n) * (1 + 0.75/n + 2.25/n^2)   # modified A-squared statistic
}
The null hypothesis of the A² test, normality, is rejected at level α if the observed value of the test statistic falls into the critical region. Some of the critical values for the A² test are given in the following table (D'Agostino and Stephens 1986, pp. 373).

| α    | Critical Value of A² Test |
| 0.01 | 1.035                     |
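As a quick check of the ad function, the statistic can be computed for a normal and a heavy-tailed simulated sample and compared with the 1% critical value 1.035:

    ad(rnorm(100))   # typically well below 1.035: normality not rejected
    ad(rt(100, 1))   # Cauchy sample: typically far above 1.035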
Tables

[Tables of observed p-value percentages for the Student(6) with estimated λ and with fixed λ.]

Observed value of A², estimated λ:

|    | Student(6)   | N(0, 1)      |
| R* | 0.16209 (NS) | 0.42675 (NS) |
| R  | 1.33768      | 0.51522 (NS) |

[Table of observed values of A², fixed λ.]
[Figures: standard and detrended Q-Q plots of the 100,000 simulations of R* and R against the quantiles of the standard normal: Student(6) with λ estimated (free) and with λ = 6; N(0, 1) with λ estimated (free) and with λ = ∞.]
8.4.4 Simulation Conclusions

When λ is estimated, the tables of p-values indicate that the distribution for the third order FR method is much closer to the theoretical uniform distribution than the first order signed likelihood ratio statistic R, for the Student(6). The Q-Q plots indicate non-normality, and the A² test rejects the null hypothesis of normality, for the first order method when λ is estimated from the data for the Student(6). For fixed λ the FR method performs much better than R when the data are generated from the Student(6). When the data are generated from the N(0, 1) both the third order and first order methods produce satisfactory results.
Chapter 9

Conclusion

A proof of the asymptotic normality of the posterior distribution of a vector parameter, based on a sample from a stochastic process, with respect to either a proper or improper prior is given. The limiting distribution provides a means of approximating the significance level of an interest parameter to the first order. When the sample size is small, first order asymptotic methods can be very inaccurate. For small samples in a general statistical model the third order FR method outperforms the standard first order methods. A method is introduced to generate error distributions with a shape parameter, and the FR method is applied to the location-scale-shape model. Techniques are presented for numerical computation of all quantities required for the FR method. Numerical examples and simulations are included when the error distribution is the Student(λ).

A generalization in one direction of the posterior asymptotic normality proof is to obtain the limiting distribution while the number of nuisance parameters is allowed to grow at, say, a rate slower than the number of observations from the stochastic process. The solution to this problem would have applications to inference in branching processes.
A different but related question is: suppose that the parameter space is infinite dimensional, ℝ^∞ for example; can the proof be modified to obtain the limiting distribution of the posterior distribution? What seems to be required to answer this is: a norm on the infinite dimensional parameter space that satisfies ‖Tx‖ ≤ ‖T‖·‖x‖, where x belongs to the parameter space and T is a linear transformation on the parameter space; and a replacement for the observed Fisher information in assumption 2 (Chapter 3) by another type of information measure that can be defined on an infinite dimensional space. Research is currently under way to find alternative definitions of information that would lead to posterior asymptotic normality.
In Chapter 7 a method was introduced for generating error distributions with a shape parameter by using infinitesimal transformations to construct a one-parameter group of continuous transformations. It is not obvious which functions ξ(·) would yield a symmetric family of error distributions with a shape parameter that satisfies (7.10). Research is currently under way in this area.
Bibliography
[1] Abramowitz, M. and Stegun, I. (1972). Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. Wiley, New York.

[2] Arnold, V.I. (1973). Ordinary Differential Equations. MIT, Cambridge.

[3] Barndorff-Nielsen, O.E. (1980). Conditionality resolutions. Biometrika 67, 293-310.

[4] Barndorff-Nielsen, O.E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70, 343-65.

[5] Barndorff-Nielsen, O.E. (1986). Inference on full and partial parameters based on the standardized signed log likelihood ratio. Biometrika 73, 307-22.

[6] Barndorff-Nielsen, O.E. (1990). Approximate interval probabilities. J. R. Statist. Soc. B 52, 485-96.

[7] Barndorff-Nielsen, O.E. (1991). Modified signed likelihood ratio. Biometrika 78, 557-63.

[8] Barndorff-Nielsen, O.E. and Cox, D.R. (1979). Edgeworth and saddlepoint approximations with statistical applications. J. R. Statist. Soc. B 41, 279-312.

[9] Barndorff-Nielsen, O.E. and Cox, D.R. (1989). Asymptotic Techniques in Statistics. Chapman-Hall, London.

[10] Barndorff-Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. Chapman-Hall, London.
[11] Blaesild, P. and Jensen, J.L. (1985). Saddlepoint formulas for reproductive exponential models. Scand. J. Statist. 12, 193-202.

[12] Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, Mass.

[13] Brenner, D., Fraser, D.A.S., and McDunnough, P. (1982). On asymptotic normality of likelihood and conditional analysis. Can. J. Statist. 10, 163-72.

[14] Cakmak, S., Fraser, D.A.S., and Reid, N. (1994). Multivariate asymptotic model, exponential and location type approximations. Utilitas Math. 46, 21-31.

[15] Campbell, J.E. (1903). Introductory Treatise on Lie's Theory of Finite Continuous Transformation Groups. Oxford, London.

[16] Cheah, P.K., Fraser, D.A.S., and Reid, N. (1995). Adjustments to likelihoods and densities; calculating significance. J. Statist. Res. 29, 1-13.

[17] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.

[18] Cushny, A.R. and Peebles, A.R. (1905). The action of optical isomers II. Hyoscines. J. Physiol. 32, 501-10.

[19] D'Agostino, R.B. and Stephens, M.A. (1986). Goodness-of-Fit Techniques. Marcel Dekker, New York.

[20] Daniels, H.E. (1954). Saddlepoint approximations in statistics. Ann. Math. Statist. 25, 631-50.

[21] Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.

[22] DiCiccio, T.J. and Martin, M.A. (1991). Approximations of marginal tail probabilities for a class of smooth functions with applications to Bayesian and conditional inference. Biometrika 78, 891-902.

[23] Dickey, J.M. (1968). Three multidimensional integral identities with Bayesian applications. Ann. Math. Statist. 39, 1615-28.
[24] Eisenhart, L.P. (1961). Continuous Groups of Transformations. Dover, New York.

[25] Feller, W. (1971). An Introduction to Probability Theory and its Applications 2, 2nd ed. Wiley, New York.

[26] Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, London.

[27] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proc. Roy. Soc. London Ser. A 144, 285-307.

[28] Folland, G.B. (1984). Real Analysis: Modern Techniques and Their Applications. Wiley, New York.

[29] Fraser, D.A.S. (1961). The fiducial method and invariance. Biometrika 48, 261-80.

[30] Fraser, D.A.S. (1964). Local conditional sufficiency. J. R. Statist. Soc. B 26, 52-62.

[31] Fraser, D.A.S. (1968). The Structure of Inference. Wiley, New York.

[32] Fraser, D.A.S. (1976a). Necessary analysis and adaptive inference (with discussion). J. Amer. Statist. Assoc. 71, 99-113.

[33] Fraser, D.A.S. (1976b). Probability and Statistics. DAI Press, Toronto.

[34] Fraser, D.A.S. (1979). Inference and Linear Models. McGraw Hill, New York.

[35] Fraser, D.A.S. (1988). Structural inference. In Encyclopedia of Statistical Sciences, Volume 9 (eds S. Kotz, N.L. Johnson). Wiley, New York.

[36] Fraser, D.A.S. (1990). Tail probabilities from observed likelihood. Biometrika 77, 65-76.

[37] Fraser, D.A.S. (1991). Statistical inference: likelihood to significance. J. Amer. Statist. Assoc. 86, 258-65.

[38] Fraser, D.A.S. and McKay, J. (1975). Parameter factorization and inference based on significance, likelihood, and objective posterior. Ann. Statist. 3, 559-72.

[39] Fraser, D.A.S. and McDunnough, P. (1984). Further remarks on asymptotic normality of likelihood and conditional analysis. Can. J. Statist. 12, 183-90.

[40] Fraser, D.A.S., McDunnough, P. and Taback, N. (1997). Improper priors, posterior asymptotic normality, and conditional inference. In Advances in the Theory and Practice of Statistics: A Volume in Honor of Samuel Kotz (eds N.L. Johnson and N. Balakrishnan). Wiley, New York.
[41] Fraser, D.A.S., Monette, G., Ng, K.W., and Wong, A. (1995). Higher order approximations with general linear models. Proceedings of the Symposium on Multivariate Analysis, Hong Kong.

[42] Fraser, D.A.S. and Reid, N. (1990). From multiparameter likelihood to tail probability for a scalar parameter. Technical Report No. 9003, University of Toronto.

[43] Fraser, D.A.S. and Reid, N. (1995). Ancillaries and third order significance. Utilitas Math. 47, 33-53.

[44] Heyde, C.C. and Johnstone, I.M. (1979). On asymptotic posterior normality for stochastic processes. J. R. Statist. Soc. B 41, 184-89.

[45] Jensen, J.L. (1995). Saddlepoint Approximations in Statistics. Oxford University Press, New York.

[46] Johnstone, I.M. (1978). Problems in limit theory for martingales and posterior distributions from stochastic processes. M.Sc. thesis, Australian National University.

[47] Kolassa, J.E. (1994). Series Approximation Methods in Statistics. Springer-Verlag, New York.

[48] Lange, K.L., Little, J.A. and Taylor, J.M.G. (1989). Robust statistical modelling using the t distribution. J. Amer. Statist. Assoc. 84, 881-96.

[49] Lehmann, E.L. (1991). Theory of Point Estimation. 2nd ed. Wadsworth, Belmont.

[50] Little, J.A. (1988). Robust estimation of the mean and covariance matrix with missing values. Appl. Statist. 37, 23-38.

[51] Liu, C. and Rubin, D.B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statist. Sinica 5, 19-39.

[52] Lugannani, R. and Rice, S.O. (1980). Saddlepoint approximation for the distribution of the sums of independent random variables. Adv. Appl. Prob. 12, 475-90.

[53] Marsden, J.E. and Tromba, A.J. (1988). Vector Calculus. Freeman, New York.

[54] McCullagh, P. (1987). Tensor Methods in Statistics. Chapman-Hall, London.

[55] Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. (1988). Numerical Recipes in C. Cambridge University Press, Cambridge.

[56] Reid, N. (1988). Saddlepoint methods and statistical inference (with discussion). Statist. Sci. 3, 213-27.

[57] Reid, N. (1995a). Likelihood and higher order approximations to tail areas: a review and annotated bibliography. Can. J. Statist. 24, 141-66.

[58] Reid, N. (1995b). The roles of conditioning in inference (with discussion). Statist. Sci. 10, 138-57.
[59] Rogers, W.H. and Tukey, J.W. (1962). Understanding some long-tailed distributions. Statistica Neerlandica 26, 211-26.

[60] Scheffé, H. (1947). A useful convergence theorem for probability distributions. Ann. Math. Statist. 18, 434-38.

[61] Skovgaard, I.M. (1990). On the density of minimum contrast estimators. Ann. Statist. 18, 779-89.

[62] Sweeting, T.J. and Adekola, O.A. (1987). Asymptotic posterior normality for stochastic processes revisited. J. R. Statist. Soc. B 49, 215-22.

[63] Tierney, L. (1990). Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. Wiley, New York.

[64] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20, 595-601.

[65] Walker, A.M. (1969). On the asymptotic behavior of posterior distributions. J. R. Statist. Soc. B 31, 80-88.

[66] Venables, W.N. and Ripley, B.D. (1994). Modern Applied Statistics with S-Plus. Springer-Verlag, New York.

[67] Von Montfort, M.A.J. and Otten, A. (1978). On testing a shape parameter in the presence of a location and a scale parameter. Math. Op. Statist. 9, 91-104.