
Nonparametric Statistics

Olaf Wittich

TU/e 2009


This course was intended as a two-block course (2h lecture + 1h instruction per week) serving as an introduction to nonparametrics. Due to personal preferences, the focus is on two basic ideas, namely

• using invariance under group actions as a construction principle,

• using empirical processes as a tool for asymptotics.

As paradigms, rank tests and goodness of fit tests were used.

Most material is stolen elsewhere: the treatment of rank statistics from [4], the functional delta method from [6] and the considerations about invariance from [5].


Contents

1 Rank Tests
  1.1 Nonparametric assumptions
  1.2 A first example: The one-sided sign test
  1.3 The sign test in a parametric setting
    1.3.1 The parametric test
    1.3.2 The nonparametric test
    1.3.3 Pitman's asymptotic efficiency
  1.4 Group actions and invariant tests
    1.4.1 Group actions
    1.4.2 Example 1. Permutations and order statistics
    1.4.3 Example 2. Monotone maps and rank statistics
    1.4.4 Invariant tests
  1.5 A testing problem on domination
  1.6 A preliminary remark
  1.7 Construction of critical regions
  1.8 Three two sample rank tests
  1.9 Two sample problems and linear rank tests
    1.9.1 Tests on Location
    1.9.2 Tests on Scale
    1.9.3 The distribution of the Wilcoxon test statistic
  1.10 Asymptotic Normality

2 Goodness of Fit
  2.1 A functional limit theorem
  2.2 The Kolmogorov-Smirnov test
  2.3 The Chi-square idea
  2.4 A Chi-square test on independence

A The functional delta method
  A.1 The Mann-Whitney statistic
  A.2 Differentiability and asymptotic normality

B Some Exercises


Chapter 1

Rank Tests

1.1 Nonparametric assumptions

In contrast to the situation in parametric statistics, where we usually assume that the underlying distribution belongs to a family indexed by a few real parameters, nonparametric statistics deals with problems where hardly anything is known about the underlying distribution. In a sense that will be made precise below, nonparametric families cannot be indexed by finitely many parameters.

For instance, a typical parametric assumption is that the true distribution underlying our random experiment is N(µ, σ²), where µ ∈ R and σ² > 0 are two real parameters. In contrast to that, we consider two examples of nonparametric assumptions.

Example 1. The probability distribution of a random variable X is called subgaussian iff there is a constant β > 0 such that

E[e^{aX}] ≤ e^{β a²}

for all a ∈ R. The subgaussian distributions form a nonparametric family. The characterising condition above is called an (exponential) moment condition.

Example 2. The probability distribution of a random variable X is called symmetric around its mean µX iff X − µX and −(X − µX) are equal in distribution. The symmetric distributions also form a nonparametric family.


Whereas nonparametric ideas are even indispensable for some testing problems, such as goodness-of-fit tests where we want to check whether a given assumption on the underlying distribution is reasonable, we must be well aware of the fact that less information about the underlying distributions inevitably leads to weaker results. The more powerful parametric methods should therefore be used whenever a reasonable assumption on the underlying distribution family is available.

Remark. Since every set which has the same cardinality as the set of real numbers can be mapped one-to-one and onto R, we could in principle parameterize the nonparametric families above even by a single real parameter. So the distinction between parametric and nonparametric families of distributions given above is meaningless without a restriction on the kind of parameterizations considered. It is customary to allow only parameterizations that are continuous in the weak topology on the set of probability measures. Doing so, the nonparametric families considered above can indeed not be parameterized by a finite number of real values.

1.2 A first example: The one-sided sign test

As the basic paradigm for the construction of nonparametric tests, we consider a one-sided test on location. Note that we will observe here a basic feature that we will meet again and again in the sequel: the nonparametric problem will be translated into a parametric problem, while on the way the initial information contained in the sample is considerably reduced and only some rough qualitative properties of the actual sample serve as input for the final parametric test.

Let X be a continuous random variable with cumulative distribution function F. Consider the median of X defined as follows: Let

m−(X) := sup{ t ∈ R : P(X ≤ t) < 1/2 },

m+(X) := inf{ t ∈ R : P(X ≤ t) > 1/2 }.


Then, the median is given by

med(X) := (m+(X) + m−(X)) / 2.

Without any further knowledge about X, we will now construct a test of the hypothesis H0 : med(X) = m0 against the alternative H1 : med(X) > m0.

Since X was assumed to be continuous, we have P(X = med(X)) = 0 and hence

P(X > med(X)) = P(X < med(X)) = 1/2.    (1.1)

The basic idea to construct the test is now the following: Consider a random sample X1, ..., Xn and derive from it the number of realisations that are larger than m0. To be precise, let

Si := 1_{Xi > m0} = 1 if Xi > m0, and 0 else.

Then

S := Σ_{i=1}^n Si

is the required random variable which, under H0, is by formula (1.1) binomially distributed with parameter 1/2, i.e. S ∼ Bin(n, 1/2).

On the other hand, under the alternative med(X) > m0, the random variable S is distributed according to a binomial distribution Bin(n, p) with p > 1/2. Thus we have reduced the nonparametric problem to a parametric one, namely a test for the parameter p of a binomial distribution Bin(n, p) where the hypothesis H′0 : p = 1/2 is tested against the alternative H′1 : p > 1/2.

The solution of this problem is well known: Fix a level of significance α > 0. Then the corresponding rejection or critical region is given by

C := {n0, n0 + 1, ..., n}

where n0 is chosen to be the smallest integer such that (assuming H′0)

P(S ≥ n0) = 2^{−n} Σ_{k=n0}^{n} (n choose k) ≤ α.
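As an illustration, the critical value n0 and the p-value of the one-sided sign test can be computed directly from the Bin(n, 1/2) distribution. The following is a minimal sketch using only the Python standard library; the function names are our own and not part of any statistics package.

```python
from math import comb

def sign_test_critical_value(n, alpha):
    """Smallest n0 with P(S >= n0) <= alpha for S ~ Bin(n, 1/2)."""
    tail = 0.0
    # walk the upper tail downwards until it would exceed alpha
    for k in range(n, -1, -1):
        tail += comb(n, k) / 2**n
        if tail > alpha:
            return k + 1
    return 0

def sign_test_p_value(sample, m0):
    """One-sided p-value P(S >= s_obs) of H0: med(X) = m0 vs. H1: med(X) > m0."""
    n = len(sample)
    s_obs = sum(1 for x in sample if x > m0)  # the statistic S
    return sum(comb(n, k) for k in range(s_obs, n + 1)) / 2**n

# toy usage: n = 20 observations, test at level alpha = 0.05
print(sign_test_critical_value(20, 0.05))   # 15: reject H0 iff S >= 15
print(sign_test_p_value([0.3, 1.2, -0.4, 2.1, 0.8, 1.5, -0.2, 0.9], 0))
```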


Observe that the information in the sample is reduced from real values to the information whether each value is larger or smaller than m0 (coarsening of information). For the final parametric test, only this information about the initial observations is used.

1.3 Performance of the sign test in a parametric setting

To underline the statement from the introduction that using more information usually yields stronger results, i.e. that parametric methods are usually more powerful in dealing with parametric problems, we compare the sign test with one of the usual parametric tests where the distribution family is assumed to be Xµ ∼ N(µ, σ²), and we assume for simplicity that the variance σ² > 0 is known.

The test on location is here H0 : µ = µ0 versus H1 : µ > µ0, we fix the significance level again to some α > 0, and the random sample is X1, ..., Xn. We now strive for a comparison of the two tests in terms of efficiency, meaning: which sample size is necessary to achieve a given power?

You might want to argue that this comparison is not entirely fair, because we compare the two tests exactly in the situation the parametric test is designed for. The following criterion provides absolutely no information about how the two tests compare when the underlying distribution family is not the one given above. So what we expect to be the strength of the sign test, namely that its performance does not alter dramatically if the underlying assumptions on the distribution family are not valid, is actually not measured by this approach.

Remark. We will see later (Definition 11) that we can also design nonparametric tests which perform better in comparison to the t-test with respect to the criterion which we will consider now.


1.3.1 The parametric test

We consider the sample mean

X̄ = (1/n) Σ_{k=1}^n Xk

with E(X̄) = µ and Var(X̄) = σ²/n. Hence

Z = ( (X̄ − µ0)/σ ) √n ∼ N(0, 1)

is standard normally distributed under H0, and the rejection region is given by

C(α) = [ µ0 + zα σ/√n, ∞ )

where zα is given by P(Z > zα) = α and H0 is rejected when X̄ ∈ C(α). For µtrue > µ0, the type II error is given by

β := P( ( (X̄ − µtrue)/σ ) √n ≤ ( (µ0 − µtrue)/σ ) √n + zα )

and the variable on the left hand side is standard normal. Hence, if we want the test to have significance level α and power 1 − β, we need

( (µ0 − µtrue)/σ ) √n + zα ≤ z_{1−β} = −zβ,

or

n ≥ σ² ( (zα + zβ) / (µ0 − µtrue) )².    (1.2)

1.3.2 The nonparametric test

Note first that for the normal distribution, the mean value and the median coincide (as is the case for all symmetric distributions). We may thus use the sign test from the preceding section as our nonparametric test. The type II error for the sign test is given by

β = P(S < n0) = Σ_{k=0}^{n0−1} (n choose k) ptrue^k (1 − ptrue)^{n−k}    (1.3)

where

ptrue = P(X > µ0) = P( (X − µtrue)/σ > (µ0 − µtrue)/σ ) = 1 − Φ( (µ0 − µtrue)/σ ).

For large n, we can use the central limit theorem to approximate the type II error (1.3). We have

β = P(S < n0) = P( (S − n ptrue)/√(n ptrue(1 − ptrue)) < (n0 − n ptrue)/√(n ptrue(1 − ptrue)) )
  ≈ Φ( (n0 − n ptrue)/√(n ptrue(1 − ptrue)) ).

Hence

(n0 − n ptrue)/√(n ptrue(1 − ptrue)) ≤ −zβ.    (1.4)

The same approximation for the determination of n0 from the significance level α yields (here the parameter is 1/2, since we consider the distribution under the hypothesis)

(2n0 − n)/√n ≥ zα.    (1.5)

Inserting n0 ≥ (√n zα + n)/2 from (1.5) back into (1.4), we obtain

n ≥ ( (zα + 2 √(ptrue(1 − ptrue)) zβ) / (2 ptrue − 1) )².    (1.6)

To finally compare the sample sizes for both tests, we have to find a relation between ptrue and µtrue. By power expanding the cumulative distribution function of the standard normal distribution, we obtain

ptrue = 1/2 + (1/√(2π)) (µtrue − µ0)/σ + O(|µtrue − µ0|³)

and inserting this into (1.6) finally yields

n ≥ ( ( zα + √(1 − (2/π)(µtrue − µ0)²/σ²) zβ ) / ( √(2/π) (µtrue − µ0)/σ ) )².    (1.7)


1.3.3 Pitman’s asymptotic efficiency

Now we are going to make precise what we mean by the statement that nonparametric tests yield weaker results in a parametric setup.

Let H0 : θ = θ0 versus H1 : θ > θ0 be a testing problem in a parametric family Fθ. Let α > 0 be a fixed level of significance. We consider two different statistical tests A and B and denote by nA,α,β(θ) and nB,α,β(θ) the minimal sample sizes such that the power of the respective test is 1 − β given the true parameter is θ.

The following quantity is hence a measure for the relative quality of the tests A and B.

Definition 1 (Pitman's asymptotic efficiency) The asymptotic relative efficiency of the tests A, B is given by

eAB(α, β) := lim_{θ→θ0} nA,α,β(θ) / nB,α,β(θ).

We can now calculate this relative efficiency for the two different tests from the preceding subsections. Combining formula (1.2) with (1.7), we obtain

eAB(α, β) = lim_{µtrue→µ0} [ σ² ( (zα + zβ)/(µ0 − µtrue) )² ] / [ ( ( zα + √(1 − (2/π)(µtrue − µ0)²/σ²) zβ ) / ( √(2/π) (µtrue − µ0)/σ ) )² ] = 2/π ≈ 0.64.

That means, in the sense of this asymptotic result, you need only about 64 percent (roughly two thirds) of the sample size of the sign test for a parametric (z- or t-)test with the same power and significance level.
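The sample size bounds (1.2) and (1.6) are easy to evaluate numerically, and doing so illustrates the limit 2/π. The following sketch is our own illustration (not part of the original script); it uses statistics.NormalDist from the Python standard library for Φ and Φ⁻¹ and prints the ratio of the two bounds as µtrue approaches µ0.

```python
from math import pi, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
Phi_inv = NormalDist().inv_cdf

def n_z_test(mu0, mu_true, sigma, alpha, beta):
    """Sample size bound (1.2) for the one-sided z-test."""
    z_a, z_b = Phi_inv(1 - alpha), Phi_inv(1 - beta)
    return sigma**2 * ((z_a + z_b) / (mu0 - mu_true))**2

def n_sign_test(mu0, mu_true, sigma, alpha, beta):
    """Sample size bound (1.6) for the sign test under N(mu_true, sigma^2)."""
    z_a, z_b = Phi_inv(1 - alpha), Phi_inv(1 - beta)
    p = 1 - Phi((mu0 - mu_true) / sigma)  # p_true
    return ((z_a + 2 * sqrt(p * (1 - p)) * z_b) / (2 * p - 1))**2

alpha, beta, sigma, mu0 = 0.05, 0.10, 1.0, 0.0
for delta in [0.5, 0.2, 0.05, 0.01]:
    ratio = (n_z_test(mu0, mu0 + delta, sigma, alpha, beta)
             / n_sign_test(mu0, mu0 + delta, sigma, alpha, beta))
    print(f"delta = {delta:5.2f}: n_z / n_sign = {ratio:.4f}")
print(f"2/pi = {2 / pi:.4f}")  # the ratio approaches this value
```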

1.4 Group actions and invariant tests

So far, we just saw an example of how to construct a test without having to assume specific properties of the underlying distributions. In this section, we want to make precise what we mean by the statement that the full information contained in the observations building up the sample is reduced to a few


characteristic features. Reduction to a few features means that we actually decompose the set of random samples into subclasses of observations to which we assign the same reduced information. But that means these subclasses form equivalence classes of observations, in the sense that two samples are equivalent if we extract from them the same reduced amount of information. For example, the two observations (1.3, −2, 3.6, −22) and (0.1, −0.1, 0.1, −0.1) are equivalent with respect to the reduction used for the sign test. For both observations, we extract the same information (+, −, +, −) for the signs of the observations.

The next step will be to present one of the important construction principles for those equivalence relations, which is particularly useful for statistical problems that are invariant under some transformation group acting on the space of random samples.

1.4.1 Group actions

In the sequel, we will think of a random sample always as an element x = (x1, ..., xn) ∈ Rn, and G will always denote a group.

Definition 2 An (effective) group action of the group G on Rn consists of a (one-to-one) identification of group elements g ∈ G with bijective maps φg : Rn → Rn such that:

(i) the mapping associated to the neutral element e ∈ G is the identity φe = id : Rn → Rn,

(ii) g, h ∈ G implies φ_{gh} = φg ∘ φh, i.e. the composition of the associated maps respects the group multiplication.

Given a group action, we decompose the sample space Rn into its orbits. The set of orbits can be parameterized by a maximal invariant mapping.

Definition 3 Let x0 ∈ Rn. The orbit of x0 under the operation of the group G on Rn is given by orbG(x0) := { φg(x0) : g ∈ G }. A maximal invariant map is a one-to-one assignment of points in a parameter set J to the orbits of the G-action,

jG : J → Orb(G),

where the orbit space is given by Orb(G) := { orbG(x) : x ∈ Rn }.


First of all, we consider two examples:

1.4.2 Example 1. Permutations and order statistics

Let Πn be the group of permutations of n elements, i.e. Πn consists of all one-to-one maps π : {1, ..., n} → {1, ..., n}. Πn acts on Rn by permuting the components of the random sample, i.e.

π(x1, ..., xn) := (x_{π(1)}, ..., x_{π(n)}).

To determine the orbits of this action, we first consider the case n = 2: The group Π2 has two elements, namely e and the transposition π with π(1, 2) = (2, 1). Thus φe(x1, x2) = (x1, x2) and φπ(x1, x2) = (x2, x1), and the orbits of the operation are given by

orbΠ(x) = {x} if x ∈ ∆2, and orbΠ(x) = {x, π(x)} if x ∉ ∆2,

where ∆2 := { (x, x) : x ∈ R } ⊂ R² denotes the diagonal. Every orbit contains exactly one sample x = (x1, x2) with x1 ≤ x2. Thus, we obtain a one-to-one correspondence of the set of orbits with the set J := { x ∈ R² : x1 ≤ x2 }. For the point x′ ∈ J which is associated to the orbit orbΠ(x), we will use the notation

x′ = jΠ^{−1}(orbΠ(x)) = (x(1), x(2)).

For general n, there is an analogous result.

Lemma 1 (significance of order statistics) Let Πn operate on Rn by permuting components. Then every orbit contains exactly one point x′ with x′1 ≤ ... ≤ x′n. Hence, the map

jΠn : J → Orb(Πn)

with jΠn(x) := orbΠn(x) and J := { x ∈ Rn : x1 ≤ ... ≤ xn } is a maximal invariant map.

Proof: Exercise.

Definition 4 (order statistics) Given a sample x = (x1, ..., xn), the corresponding sample

o(x1, ..., xn) := jΠn^{−1}(orbΠn(x))

is called the order statistics of x. We will often write

(x(1), ..., x(n)) := o(x1, ..., xn).


1.4.3 Example 2. Monotone maps and rank statistics

Let M be the group of monotone maps, i.e. the maps f : R → R which are continuous, surjective and strictly monotone in the sense that x > x′ implies f(x) > f(x′). Then M acts on Rn componentwise, i.e.

φf(x) = (f(x1), ..., f(xn)).

The group multiplication on M is given by the composition of maps.

Lemma 2 (significance of rank statistics) Under the action of M on Rn, a point x′ ∈ Rn lies in the orbit of x ∈ Rn, i.e. x′ ∈ orbM(x), if and only if

r(x) := (r1(x), ..., rn(x)) = r(x′) = (r1(x′), ..., rn(x′)),

where

ri(x) := |{ xj : 1 ≤ j ≤ n, xj ≤ xi }|

denotes the rank of the component xi of x, i.e. the number of components (including xi) which are less than or equal to xi.

Proof: (i) Let first f be monotone. Hence xj < xi implies f(xj) < f(xi) and

ri(f(x)) := |{ f(xj) : 1 ≤ j ≤ n, f(xj) ≤ f(xi) }| = |{ xj : 1 ≤ j ≤ n, xj ≤ xi }| = ri(x).

Thus, two sample vectors in the same orbit have the same ranks ri.

(ii) Let x and x′ be given with ri(x) = ri(x′) and let (x(1), ..., x(n)) and (x′(1), ..., x′(n)) be the corresponding order statistics. Let

qk := (x′(k+1) − x′(k)) / (x(k+1) − x(k)) if x(k+1) ≠ x(k), and qk := 0 else,    (1.8)

and

u(t) := Σ_{k=1}^{n−1} qk 1_{[x(k), x(k+1))}(t).

Let now

f(x) = x + x′(1) − x(1) for x ≤ x(1),
f(x) = x′(1) + ∫_{x(1)}^{x} u(t) dt for x(1) < x ≤ x(n),
f(x) = x + x′(n) − x(n) for x > x(n).

Then f is continuous, surjective and piecewise linear. f is hence strictly monotone iff for all x ∈ (x(k), x(k+1)], k = 0, ..., n, the (left) derivative of f is strictly positive. Here we understand x(0) = −∞, x(n+1) = +∞. But that follows from

(x(k), x(k+1)] ≠ ∅ ⇔ x(k+1) > x(k) ⇔ x′(k+1) > x′(k)

(due to the rank condition) and formula (1.8).

Definition 5 (rank statistics) Given a sample x = (x1, ..., xn), the corresponding sample

r(x) := (r1(x), ..., rn(x))

is called the rank statistics of x.
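For concreteness, order statistics and ranks as defined above can be computed in a few lines. The following sketch is our own illustration (the function names are hypothetical), using the convention ri(x) = |{j : xj ≤ xi}| from Lemma 2; it also demonstrates the invariance of the ranks under a strictly monotone map.

```python
def order_statistics(x):
    """o(x): the unique sorted representative of the permutation orbit of x."""
    return tuple(sorted(x))

def ranks(x):
    """r(x): r_i = number of components x_j (including x_i) with x_j <= x_i."""
    return tuple(sum(1 for xj in x if xj <= xi) for xi in x)

x = (1.3, -2.0, 3.6, -22.0)
print(order_statistics(x))  # (-22.0, -2.0, 1.3, 3.6)
print(ranks(x))             # (3, 2, 4, 1)
# ranks are invariant under strictly monotone maps, e.g. f(t) = t**3:
print(ranks(tuple(t**3 for t in x)))  # again (3, 2, 4, 1)
```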

To find the maximal invariant map in this case, we have to find every possible rank statistics. For that, consider integers 1 ≤ a1 < a2 < ... < as ≤ n, write a = (a1, ..., as), and construct the vector r(a) = (r1, ..., rn) with

r1 = a1, ..., r_{a1} = a1, r_{a1+1} = a2, ..., r_{a2} = a2, r_{a2+1} = a3, ..., rn = as.

The vector r(a) is the rank statistics of some sample vector x ∈ Rn (for instance, it is its own rank statistics). All possible rank statistics are permutations of vectors obtained in this way. Hence, the rank statistics are in one-to-one correspondence with the set

J := ⋃_{1 ≤ a1 < a2 < ... < as ≤ n, 1 ≤ s ≤ n} orbΠn(r(a))    (1.9)

and jM(r) := orbM(r) ⊂ Rn.

Remark. Note that the consideration of rank statistics simplifies drastically if we consider random samples of continuous random variables. In that case, we have P(∃ 1 ≤ i < j ≤ n : Xi = Xj) = 0 and therefore P(r(X) = r) = 0 for all r ∉ orbΠn(r0) with r0 = (1, 2, 3, ..., n).


We made precise what we will consider in the sequel as our way to reduce information: we consider two samples as equivalent if they belong to the same orbit of a group action on sample space. The reduced amount of information that we assign to both of them is a representative of the corresponding orbit.

Example. We consider again the sign test example. Here, we have an action of the multiplicative group of n-tuples of positive real numbers (R+)n := { λ = (λ1, ..., λn) : λi > 0 } on Rn given by (λ, x) ↦ (λ1 x1, ..., λn xn). With that definition, we see immediately that the two samples (1.3, −2, 3.6, −22) and (0.1, −0.1, 0.1, −0.1) both lie in the orbit of (1, −1, 1, −1).

1.4.4 Invariant tests

The reason why the group approach is useful is that there are often a priori group operations which should not affect the result of your test. Consider the following example:

Example. Scientists in London and in Amsterdam want to test the hypothesis that the average height of the population in both countries is the same against the alternative that Dutch people are taller on average. The British scientists measure the height of about 1000 randomly chosen people in Great Britain and provide a list with heights in inches. The Dutch scientists measure the height of about 1000 randomly chosen people in the Netherlands and provide a list with heights in centimeters. Now the British scientists convert the Dutch list into inches and compare it to their own list, whereas the Dutch scientists convert the British list into centimeters and compare it to their own list. Both groups use the same test, but with the observations measured in their own height scale. If the two tests did not come to the same conclusion, we would think that there is something wrong.

Where is the group? Well, we translate the statement that the decision should not depend on the length scale used to collect the data into the statement that the test decision is invariant with respect to the action of R+ on the space of observations Rn given by (λ, x) ↦ (λx1, ..., λxn), where λ > 0 and x = (x1, ..., xn) ∈ Rn, since changing the length scale (for instance by switching from inches to centimeters) means nothing else but multiplying all numbers in the sample by a constant conversion factor.

Thus, we are led to the following notion.


Definition 6 (invariant test) A test with test statistic T and critical region C is called invariant iff the test decision is invariant under the action of the group G, i.e.

T(φg(x)) ∈ C ⇔ T(x) ∈ C

for all g ∈ G.

An invariant test statistic can always be reduced to a test statistic on J, and this is how we will use the above considerations all the time in the sequel. The following proposition summarizes the relevant facts.

Proposition 1 (i) Let T be a G-invariant test statistic and jG a maximal invariant map for the action of G on Rn. Then there is a map τ : J → R such that

T(x) = τ ∘ jG^{−1}(orbG(x)).

(ii) Let T = τ ∘ jG^{−1} ∘ orbG be an invariant test statistic. Then the distribution of T depends only on the distribution of jG^{−1} ∘ orbG(x).

If we have a priori information about the invariance of our testing problem with respect to some group action, it is natural to reduce the full information in the sample by considering two samples as equivalent if they belong to the same orbit. Every invariant test can then be written as a function on the orbit space alone.

Remark. Please note that we were cheating quite a bit in this section. Except for part (ii) of Proposition 1, we can get along with that. But for this last statement, we need that the map τ can be chosen to be measurable in order to transport the measure from the orbit space. For that, we have to require that the group action G × Rn → Rn is a measurable map with respect to some suitably chosen sigma algebras on the respective sets. And even then it is a theorem that jG and τ can be chosen to be measurable. Since this theorem holds under very weak conditions on the underlying spaces, which are met in every situation that we consider in the sequel, we will completely ignore these problems from now on, but we have to be well aware of the fact that they are there.

A possible rigorous version of the statements above can be obtained as follows: Let (X, σX) be a measurable space and G × X → X be a measurable group action. We denote the orbit space by X/G and by q : X → X/G the orbit map q(x) := orbG(x). Then q is a surjective map, and it is also measurable if X/G is equipped with the sigma algebra

σ_{X/G} := { B ⊂ X/G : q^{−1}(B) ∈ σX }.

σ_{X/G} is the maximal sigma algebra with that property. If T : X → R is a G-invariant and (σX, σR)-measurable map, then by invariance T^{−1}(σR) ⊂ q^{−1}(σ_{X/G}). By the Doob-Dynkin factorization lemma, that implies that we can find a measurable map τ : X/G → R with T = τ ∘ q. The orbit space X/G thus plays the role of the parameter set.

1.5 A testing problem on domination

We will now apply Proposition 1 to a testing problem on domination with multiple applications. It also presents a class of examples where we are facing a natural operation of the group M.

Definition 7 Let X and Y be random variables with cumulative distribution functions FX and FY, respectively. We say that X is stochastically larger than Y and write X ⪰ Y iff FY(t) ≥ FX(t) for all t ∈ R. We say that X is strictly stochastically larger than Y and write X ≻ Y iff there is in addition at least one t ∈ R such that FY(t) ≠ FX(t).

Remark. If you are puzzled by the inverse relation between the cumulative distribution functions, note that X being stochastically larger than Y is equivalent to P(X ≥ t) ≥ P(Y ≥ t) for all t ∈ R.

The testing problem that we consider now is a two-sample problem:

Test on domination. Let X, Y be continuous random variables. Construct, on the basis of the two independent random samples (X1, ..., Xn) and (Y1, ..., Ym), a test for H0 : X = Y versus H1 : Y ≻ X.

Why should one be interested in such tests?

Example. Suppose you are interested in the question whether the treatment with a certain medication is effective, or, alternatively, whether the patients suffer some side effects. It is a difficult problem to measure these 'effects' on the basis of a numerical evaluation. In order not to have to go into this, we consider an example where there is a canonical numerical evaluation. We consider as side effect of a medication (for instance a sedative) 'treated patients become very sleepy'. Let there thus be two groups of patients, only one of which is treated with the medication in question. Let Y be the random variable 'sleeping time during the day of a treated patient' and X be the random variable 'sleeping time of an untreated patient'. We assume that there are cumulative distribution functions FX and FY and that they are continuous. Whether this assumption is reasonable or not depends among other things on the choice of the patients and is not a priori clear. Also, the independence of the measurements Xi and Yj depends heavily on the design of the experiment. But assuming all that, the linguistic question 'Are treated patients sleepier?' can be translated into testing H0 : X = Y against the alternative H1 : Y ≻ X, independently of any assumption on the specific shape of the distribution (except continuity).

To construct a test on domination, we now try to follow the ideas developed in the preceding section.

Lemma 3 Let f ∈ M. Then

(i) X ⪰ Y is equivalent to f(X) ⪰ f(Y),

(ii) X = Y is equivalent to f(X) = f(Y),

(iii) X ≻ Y is equivalent to f(X) ≻ f(Y).

Proof: (i) The maps f are invertible. Denoting the inverse map by f^{−1}, i.e. f ∘ f^{−1} = id, we hence have

P(f(X) ≥ t) = P(X ≥ f^{−1}(t)) ≥ P(Y ≥ f^{−1}(t)) = P(f(Y) ≥ t).

(ii), (iii) follow analogously.

Lemma 3 therefore means that hypothesis and alternative of our testing problem are invariant with respect to the action of M, or:

The test on domination considered above is invariant under the action of M.


Thus, it seems appropriate to look for an invariant decision rule, and to get one, it is appropriate to look for an invariant test statistic. But by Proposition 1 and Lemma 2, invariant test statistics only depend on the rank statistics r(X1, ..., Xn, Y1, ..., Ym). Thus, as indicated in the previous section, the assumption of invariance of the test statistic under the action of a certain group, based on the lack of knowledge about the underlying distribution, immediately leads to a specific coarsening of the information obtained by the random sample. Thus, by Proposition 1, (i):

All M-invariant decision rules for the test on domination are functions of the joint rank statistics r(X1, ..., Xn, Y1, ..., Ym).

But we can do even a bit better. First we recall the definition of sufficiency.

Definition 8 Let (X1, ..., Xn) be a random sample of the random variable X with probability density function fX. Let S : Rn → Rk be a map. S = S(X1, ..., Xn) is called a sufficient statistic iff for every other statistic T : Rn → Rm, the conditional probability distribution given S = s, denoted by f_{T|s}(t), does not depend on fX.

We now consider the rank statistic r(X1, ..., Xn, Y1, ..., Ym) and compose it with the rank statistic of the ordered individual samples (X(1), ..., X(n)) and (Y(1), ..., Y(m)), namely:

Definition 9 (joint ordered rank statistics) We call the expression

ρ(X1, ..., Ym) = r(o(X1, ..., Xn), o(Y1, ..., Ym))

the joint ordered rank statistics. The set of all joint ordered rank statistics depends on the respective sample sizes and is denoted by R(n, m).

Since the underlying random variables X and Y were assumed to be continuous, we have with probability one that

X(1) < ... < X(n),  Y(1) < ... < Y(m),  X(i) ≠ Y(j)

simultaneously. Now we show that the joint ordered rank statistics are sufficient for the rank statistics.


Lemma 4 Let s = (s1, ..., sn) ∈ Nn and t = (t1, ..., tm) ∈ Nm. Under the hypothesis FX = FY, we have for the conditional probability

P( r(X1, ..., Ym) = (s, t) | ρ(X1, ..., Ym) = (s′, t′) ) = 1/(n! m!) if s′ = r(s) and t′ = r(t), and = 0 else.    (1.10)

Proof: We have for the joint probability

P( r(X1, ..., Ym) = (s, t), ρ(X1, ..., Ym) = (s′, t′) ) = P( r(X1, ..., Ym) = (s, t) ) if s′ = r(s), t′ = r(t), and = 0 else.

Hence, the conditional probability is either 0 in the latter case, or in the former case given by the quotient

P( r(X1, ..., Ym) = (s, t) | ρ(X1, ..., Ym) = (r(s), r(t)) ) = P( r(X1, ..., Ym) = (s, t) ) / P( ρ(X1, ..., Ym) = (r(s), r(t)) ).    (1.11)

But now, under the hypothesis FX = FY, the joint sample is distributed according to a product measure, and thus the probabilities are not altered by a permutation of the random variables in the joint sample. To be precise,

P( r(X1, ..., Ym) = (s, t) ) = P( r(X_{π(1)}, ..., Y_{π(m)}) = (s, t) )

for all π ∈ Π_{n+m}. However, if we used a general permutation, we would interchange X- and Y-values and thus leave the orbit of the fixed joint rank statistic. Hence we only use permutations of the type π = (π1, π2) ∈ Πn × Πm where π1 ∈ Πn, π2 ∈ Πm. There are n! m! of them. Thus

P( ρ(X1, ..., Ym) = (r(s), r(t)) ) = Σ_{π ∈ Πn×Πm} P( r(X_{π1(1)}, ..., Y_{π2(m)}) = (s, t) ) = n! m! P( r(X1, ..., Ym) = (s, t) ),

and inserting this into (1.11) yields the statement.

In other words, the joint ordered rank statistic is sufficient with respect to the rank statistics of the data. That means that in fact:

All M-invariant decision rules for the test on domination are functions of the joint ordered rank statistics ρ(X1, ..., Ym). In order to determine the distribution of a test statistic under H0, we have to determine the distribution of the joint ordered rank statistics under H0.


By reducing the problem of constructing a test by means of sufficiency and invariance, we thus finally arrive at the statement that any invariant decision rule only uses the information how the two samples (X1, ..., Xn) and (Y1, ..., Ym) interlace; if for instance n = 3, m = 2 and

X1 < Y2 < X3 < X2 < Y1,

it uses only the information represented by the shorthand xyxxy on the order of the observations of the two samples. Thus, so far without constructing a single test, we found out that the structure of invariant decisions for a test on domination is rather restricted.

Remark. In the sequel, we will use interlacing patterns xyxx...xyx of the type considered above to represent elements of R(n, m) without further mentioning it.
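In code, the interlacing pattern of two samples is a matter of one sort. The following sketch (our own illustration; the helper name is hypothetical) computes the pattern and illustrates its invariance under the action of M.

```python
def interlacing_pattern(xs, ys):
    """Joint ordered rank statistic as a pattern string such as 'xyxxy'."""
    labelled = [(v, "x") for v in xs] + [(v, "y") for v in ys]
    return "".join(label for _, label in sorted(labelled))

xs, ys = (0.5, 2.1, 1.1), (0.9, 3.0)   # here X1 < Y1 < X3 < X2 < Y2
print(interlacing_pattern(xs, ys))      # xyxxy
# invariance under the action of M, e.g. f(t) = exp(t):
from math import exp
print(interlacing_pattern(tuple(map(exp, xs)), tuple(map(exp, ys))))  # xyxxy
```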

Now we compute the distribution of the joint ordered rank statistics. In particular, having in mind that both random variables are still assumed to be continuous, so that the observations are all mutually different with probability one, we obtain:

Lemma 5 Under H0, the joint ordered rank statistics are equidistributed, i.e. for every ρ ∈ R(n, m) we have

P(ρ) = 1 / (n+m choose m).    (1.12)

Proof: (Exercise.) An interlacing pattern ρ = xxy...yyx ∈ R(n, m) is determined by the positions of the x's in this pattern. There are n + m possible positions, and you have to choose n of them for the x-values. The fact that all patterns have the same probability follows from permutation invariance under H0.

By Proposition 1, (ii), the distribution of any invariant decision rule T(ρ) is given by the distribution of T under (1.12).

1.6 A preliminary remark about the construction of critical regions

We saw now that all invariant tests of the hypothesis H0 : X = Y against H1 : Y ≻ X only use the information of the ordered rank statistics of the sample. But this immediately implies that there can be no uniformly best invariant test of this hypothesis, in the following sense: the set of possible alternative random variables Y with Y ≻ X is so large and consists of so many different types of distribution functions that there is no unique choice of a critical region which would be optimal to distinguish X from all these alternatives. This will be illustrated by the following example.

Example. Let X be uniformly distributed on [0, 1]. We consider two possible random variables 0 ≤ Yi ≤ 1 which are contained in the alternative Y ≻ X, namely (it is recommended to draw the graphs of these functions)

1. Y1 with density f1 = 2 (1_{[1/4,1/2)} + 1_{[3/4,1)}),

2. Y2 with density f2 = 2 (1_{[1/8,1/4)} + 1_{[3/8,1/2)} + 1_{[5/8,3/4)} + 1_{[7/8,1)}),

where 1_{[a,b)} denotes the indicator function of the set [a, b). (Note that each density must integrate to one, which fixes the constants.)

Suppose now that m = 5 and n = 6, and that we want to construct a critical region to a significance level α > 0 which is so large that for the last interlacing pattern that can be added to C, we have to choose one of the two alternatives

ρ1 = xxxyyyxxxyy,
ρ2 = xyxxxyyxxyy.

In case the true distribution of Y is given by f1, the first choice is better; in case it is f2, we better take the second pattern.

Remark. The situation that you have to choose between ρ1 and ρ2 is not as artificial as you might think. Both patterns have the same rank sum (see Definition 10 and the subsequent paragraph) of 36. Thus you can consider a Wilcoxon test together with a significance level such that all interlacing patterns with rank sum larger than 36 belong to the critical region, and you can choose exactly one more pattern with rank sum 36.

1.7 Construction of critical regions

In this subsection, we will establish a criterion for the construction of critical regions. First of all, when we fix a significance level of α > 0, the critical region for an M-invariant test must be a subset C ⊂ R(n, m) such that the type I error

P(ρ ∈ C) = |C| / (n+m choose m) ≤ α

is less than or equal to α. Here |C| denotes the cardinality of C. That implies that the maximum number of interlacing patterns ρ that we are allowed to put into the critical region is already fixed to

|C| = ⌊ (n+m choose m) α ⌋    (1.13)

where ⌊u⌋ denotes the largest integer less than or equal to u.

However, that does not even bring us close to answering the question which of the interlacing patterns we should put into the critical region. For that we have to cross another conceptual gap. In the preceding subsection we already argued that there can be no unique optimal choice for the whole alternative. In the sequel, we ask which choices of the critical region are optimal for a given parametric alternative. Depending on this restricted alternative, we construct several tests which are optimal with respect to Pitman's asymptotic efficiency.

For simplicity, we will assume in the sequel that the cumulative distribution functions FX and FY are given by strictly positive and smooth densities fX and fY.

Proposition 2 Let X1, ..., Xn and Y1, ..., Ym be two random samples, the first one distributed according to FX and the second one according to FY. Then

P(ρ = (s, t)) = (1/(n+m choose m)) E[ (fY(Y(t1)) ⋯ fY(Y(tm))) / (fX(Y(t1)) ⋯ fX(Y(tm))) ]    (1.14)

where Y(t1), ..., Y(tm) denote the joint order statistics at the positions t1, ..., tm, and the expectation is taken with respect to the probability distribution of the hypothesis.

Proof: Let U(ρ) ⊂ R^{n+m} be the subset consisting of tuples (u1, ..., un, v1, ..., vm) ∈ R^{n+m} such that

u1 < ... < u_{t1−1} < v1 < u_{t1} < ... < vm < ... < un,

i.e. both groups of coordinates are increasing and interlace according to the pattern ρ = (s, t). A permutation π = (π1, π2) ∈ Πn × Πm acts on R^{n+m} by permuting the u-coordinates according to π1 and the v-coordinates according to π2. The images π(U(ρ)) are pairwise disjoint, and their union is the event {ρ = (s, t)} up to a Lebesgue zero set N (the tuples containing ties). Since the product density fX(u1)⋯fX(un) fY(v1)⋯fY(vm) is invariant under these permutations, Fubini's theorem yields

P(ρ = (s, t)) = n! m! ∫_{U(ρ)} du1⋯dun dv1⋯dvm fX(u1)⋯fX(un) fY(v1)⋯fY(vm).

On the other hand, under the hypothesis FX = FY the joint sample consists of n + m independent variables with density fX, and its vector of order statistics (Y(1), ..., Y(n+m)) has the density (n+m)! ∏_{k=1}^{n+m} fX(wk) on the set {w1 < ... < w_{n+m}}. Relabelling the coordinates at the positions t1, ..., tm as v1, ..., vm and the remaining ones as u1, ..., un identifies this set with U(ρ), so that

E[ (fY(Y(t1)) ⋯ fY(Y(tm))) / (fX(Y(t1)) ⋯ fX(Y(tm))) ] = (n+m)! ∫_{U(ρ)} du1⋯dun dv1⋯dvm fX(u1)⋯fX(un) fY(v1)⋯fY(vm).

Comparing the two expressions and using (n+m)! / (n! m!) = (n+m choose m) proves (1.14).


So far, we considered how the distribution of the ordered rank statistics changes if the true distribution of Y is given by fY. That is reminiscent of the computation of a type II error in parametric statistics, where you also have to know the true distribution. But due to the discussion in the preceding section, we still have to close a conceptual gap before we can actually construct critical regions.

The example given in the preceding section indicates that there can be no optimal choice of a critical region covering all alternatives. In order to nevertheless construct critical regions for tests based on joint ordered rank statistics, we turn this around in the sense that we now consider critical regions which are optimal choices for very restricted, special alternatives provided by one-dimensional location families F_{Yθ}(x) = FX(x − θ).

We now compute the type II error for small values of θ. Assuming that the density fX is sufficiently smooth, we Taylor-expand it, getting

f_{Yθ}(x) = fX(x − θ) = fX(x) − (∂fX/∂x)(x) θ + O(θ²)    (1.15)

where O(−) denotes the Landau symbol:

f(θ) = O(θ²) ⇔ limsup_{θ→0} |f(θ)|/θ² ≤ C < ∞ for some constant C.

Lemma 6 Let C ⊂ R(n, m) be a subset of joint ordered rank statistics serving as the critical region of a rank test. Then we have, for the type II error if the true distribution is the one of Yθ,

β′(θ)|_{θ=0} = Σ_{ρ∈C} Σ_{j=1}^m (1/(n+m choose m)) E[ (d/dx) ln fX(Y(tj)) ].

Proof: We have by (1.14), using the shorthand fθ = f_{Yθ},

β(θ) = 1 − Σ_{ρ∈C} P(ρ) = 1 − Σ_{ρ∈C} (1/(n+m choose m)) E[ (fθ(Y(t1)) ⋯ fθ(Y(tm))) / (fX(Y(t1)) ⋯ fX(Y(tm))) ]

and thus, using (1.15) and after interchanging differentiation and integration,

β′(θ)|_{θ=0} = Σ_{ρ∈C} (1/(n+m choose m)) E[ Σ_{j=1}^m f′X(Y(tj)) / fX(Y(tj)) ] = Σ_{ρ∈C} Σ_{j=1}^m (1/(n+m choose m)) E[ (d/dx) ln fX(Y(tj)) ].

The function β(θ) is expected to decrease for increasing θ > θ0 and to be equal to 1 − α for θ = θ0. Thus, β′(θ0) is a measure for how fast the type II error decreases with increasing θ > θ0. The more negative β′(θ0) is, the faster the power of the test increases as θ deviates from θ0. So we obtain, for small values of θ and for fixed values of α > 0:

A critical region leading to the smallest asymptotic type II error at significance level α > 0 is given by a subset C ⊂ R(n, m) with cardinality

|C| ≤ ⌊ (n+m choose m) α ⌋

for which

− Σ_{ρ∈C} Σ_{j=1}^m E[ (d/dx) ln fX(Y(tj)) ]    (1.16)

is maximal.

Remark. This condition may still not be sufficient to enforce uniqueness of the critical region for all levels of significance α > 0. It might still happen that, to fill up the critical region, we have to make a choice among several interlacing patterns with the same value of the test statistic. In that case you can choose either one of the possible patterns, or randomize over all possible critical regions thus obtained.

1.8 Three two sample rank tests

Now we will for the first time benefit from our preparations, in the sense that we will now construct the first real tests for the two sample problem under consideration. We start with:


Definition 10 (Wilcoxon two sample test) The Wilcoxon test is the test associated to the logistic distribution with cumulative distribution function Flog(x) = (1 + e^{−x})^{−1}. The test statistic is given by

TW(X1, ..., Ym) := Σ_{j=1}^m tj    (1.17)

where tj is the rank of Yj in the joint ordered sample.

To compute the critical region for a random sample X1, ..., Xn, Y1, ..., Ym and significance level α > 0, we have to compute by (1.16) the derivative of the logarithmic density

ln flog(x) = ln( e^{−x} (1 + e^{−x})^{−2} ) = −x − 2 ln(1 + e^{−x}),

hence

−(d/dx) ln flog(x) = 1 − 2e^{−x}/(1 + e^{−x}) = (1 − e^{−x})/(1 + e^{−x}) = 2/(1 + e^{−x}) − 1 = 2 Flog(x) − 1.

Thus, maximizing a sum of expectations

−E[ (d/dx) ln flog(Y(tj)) ]

is equivalent to maximizing E[ Flog(Y(tj)) ].

Lemma 7 If Z is a continuous random variable with cumulative distribution function FZ, then the random variable U = FZ(Z) is uniformly distributed on [0, 1].

Proof: Exercise. You may assume for simplicity that F is continuous and strictly increasing.

The i-th order statistic Z(i) of a random sample Z1, ..., Zn is distributed according to

P(Z(i) ≤ t) = Σ_{s=i}^n (n choose s) FZ(t)^s (1 − FZ(t))^{n−s},

which means in the special case of uniform random variables

P(U(i) ≤ t) = Σ_{s=i}^n (n choose s) t^s (1 − t)^{n−s} =: F(i)(t).

That implies

E[Flog(Y(tj))] = E[U(tj)] = ∫_0^1 t dF(tj)(t) = [ t F(tj)(t) ]_0^1 − ∫_0^1 F(tj)(t) dt = 1 − ∫_0^1 F(tj)(t) dt.

By definition of the Beta function

B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt

and the connection with the Gamma function,

B(x, y) = Γ(x) Γ(y) / Γ(x + y),

we have (under H0, where X and Y are identically distributed, and using Γ(k + 1) = k!)

E[Flog(Y(tj))] = 1 − Σ_{s=tj}^{n+m} (n+m choose s) B(s + 1, n + m − s + 1)
= 1 − Σ_{s=tj}^{n+m} (n+m choose s) s! (n + m − s)! / (n + m + 1)!
= 1 − Σ_{s=tj}^{n+m} 1/(n + m + 1) = tj / (n + m + 1).
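A quick Monte Carlo check of this identity is easy to set up: the sketch below (our own illustration, plain Python standard library) simulates joint samples under H0 with standard logistic marginals and compares the empirical mean of Flog at a fixed joint rank tj with tj/(n + m + 1).

```python
import random
from math import exp, log

def logistic_sample():
    """Standard logistic variable via inverse transform: F^{-1}(u) = log(u/(1-u))."""
    u = random.random()
    return log(u / (1 - u))

def F_log(x):
    return 1 / (1 + exp(-x))

n, m, tj, runs = 3, 2, 4, 200_000
random.seed(1)
acc = 0.0
for _ in range(runs):
    joint = sorted(logistic_sample() for _ in range(n + m))
    acc += F_log(joint[tj - 1])  # F_log at the tj-th joint order statistic
print(acc / runs)                # empirical mean
print(tj / (n + m + 1))          # theoretical value tj/(N+1)
```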

Thus, maximizing (1.16) is equivalent to maximizing Σ_{ρ=(s,t)∈C} TW(ρ), where the test statistic

TW(ρ) := Σ_{j=1}^m tj

is the rank sum of the observations coming from the Y-sample in the tuples ρ in the critical region. To maximize this for the construction of a critical region for a given significance level, we start with collecting the tuples with


maximal rank sum and subsequently add those of the remaining tuples with the largest rank sum.

The problem is that, since rank sums coincide for many tuples, you might have to make choices, so that the critical regions are not always uniquely defined.

The critical region for the Wilcoxon test is a collection of tuples such that there is no tuple outside the critical region with a larger rank sum than any of the tuples inside.

Remark. Usually, a simplified version of the rejection region is used: the hypothesis is rejected if the rank sum (1.17) exceeds a given value. This value is chosen to be the minimal rank sum of all tuples chosen for the rejection region following the procedure described above. Thus, for certain values of α > 0, the rejection region will be larger than the one constructed above.
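For small sample sizes, the exact null distribution of TW, and hence the simplified critical value, can be obtained by brute force, since by Lemma 5 all (n+m choose m) interlacing patterns are equally likely. A minimal sketch (our own illustration, standard library only):

```python
from itertools import combinations
from math import comb

def wilcoxon_null_distribution(n, m):
    """Exact distribution of T_W = rank sum of the Y-sample under H0."""
    N = n + m
    counts = {}
    # choose the m joint ranks occupied by Y-observations
    for y_ranks in combinations(range(1, N + 1), m):
        t = sum(y_ranks)
        counts[t] = counts.get(t, 0) + 1
    total = comb(N, m)
    return {t: c / total for t, c in sorted(counts.items())}

def upper_critical_value(n, m, alpha):
    """Smallest c with P(T_W >= c) <= alpha (simplified rejection rule)."""
    dist = wilcoxon_null_distribution(n, m)
    tail = 0.0
    for t in sorted(dist, reverse=True):
        tail += dist[t]
        if tail > alpha:
            return t + 1
    return min(dist)

print(upper_critical_value(6, 5, 0.05))  # critical value for n = 6, m = 5
```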

The next example is the Fisher-Yates test:

Definition 11 (Fisher-Yates test) The Fisher-Yates test is the test associated to the normal distribution, i.e. to

f_{µ,σ}(x) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) ).

In that case, we have

−(d/dx) ln f_{µ,σ}(x) = (x − µ)/σ²

and thus

−Σ_{j=1}^m E[ (d/dx) ln f_{µ,σ}(Y(tj)) ] = (1/σ) Σ_{j=1}^m E[ (Y(tj) − µ)/σ ] = (1/σ) Σ_{j=1}^m E[ W(tj) ]

where the W(tj) are order statistics of a sample of standard normal variables. The test statistic is hence

TFY(X1, ..., Ym) = Σ_{j=1}^m E[W(tj)].

In order to construct the rejection region, we have to compute the expectations E[W(tj)]. These numbers can only be calculated numerically, but tables are available for them.


The critical region for the Fisher-Yates test is chosen in a similar way as for the Wilcoxon test, where tables for the expectation values are used. Often you can also see the following simplified version: the hypothesis is rejected if the test statistic exceeds a given value (from a table) depending on the significance level α and sample size n.

Another way to get hold of the E[W(tj)] is to use the following asymptotic result, which we state without proof.

Theorem 1 Let X1, ..., XN be independent identically distributed continuous random variables with cumulative distribution function F and density f. If f has a derivative at ξ and f(ξ) > 0, then the density of

ZN := √(N/(pq)) f(ξ) (X(kN) − ξ)

converges to that of N(0, 1) as N → ∞. Here ξ = F^{−1}(p), q = 1 − p and kN = Np or Np + 1.

Proof: [1], p. 191 ff.

Using this result, we can construct a simplified but asymptotically equivalent version of the Fisher-Yates test. If we specialize to F(t) = Φ(t), the cumulative distribution function of standard normal variables, N = n + m, and p = kN/(N + 1), we obtain that

ZN := √(N/(pq)) (1/√(2π)) e^{−Φ^{−1}(p)²/2} (X(kN) − Φ^{−1}(p))

is approximately standard normally distributed. Hence X(kN) is approximately normally distributed with expectation value

E[X(kN)] ≈ Φ^{−1}( kN/(N + 1) )

and standard deviation

σ(X(kN)) ≈ √(2π p q / N) e^{Φ^{−1}(p)²/2} → 0

as N → ∞. Thus, for large values of N = n + m, we may approximate the expectation E[X(kN)] of the order statistic in the Fisher-Yates test simply by Φ^{−1}(kN/(N + 1)). That leads to the following simplified version of the Fisher-Yates test:


Definition 12 (Van der Waerden X-test) The van der Waerden test is the rank test with test statistic

TX(X1, ..., Ym) = Σ_{j=1}^m Φ^{−1}( tj/(n + m + 1) ).
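The van der Waerden statistic is straightforward to evaluate. The sketch below is our own illustration (statistics.NormalDist from the Python standard library provides Φ⁻¹); it also compares the score Φ⁻¹(k/(N + 1)) with a Monte Carlo estimate of the exact expectation E[W(k)] used by the Fisher-Yates test.

```python
import random
from statistics import NormalDist

Phi_inv = NormalDist().inv_cdf

def van_der_waerden_statistic(xs, ys):
    """T_X = sum over Y-observations of Phi^{-1}(rank/(N+1))."""
    N = len(xs) + len(ys)
    joint = sorted([(v, "x") for v in xs] + [(v, "y") for v in ys])
    return sum(Phi_inv(k / (N + 1))
               for k, (_, lab) in enumerate(joint, start=1) if lab == "y")

# compare the approximate score with E[W_(k)] for N = 11, k = 9
random.seed(0)
N, k, runs = 11, 9, 100_000
mc = sum(sorted(random.gauss(0, 1) for _ in range(N))[k - 1]
         for _ in range(runs)) / runs
print(Phi_inv(k / (N + 1)), mc)  # the two values roughly agree (asymptotic approximation)
```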

The constructions of the three tests imply different behavior with respect to Pitman's asymptotic efficiency:

The Fisher-Yates test was constructed using the location problem for the normal distribution (where we also made no assumption on the variance). It is thus not surprising that it competes well with the t-test. In fact, its Pitman asymptotic efficiency relative to the t-test is actually one. Since the Fisher-Yates and the van der Waerden X-test are very close for large sample sizes, the relative efficiency of the latter is one as well.

The Wilcoxon test is constructed to be optimal in the location problem for the logistic distribution. In comparison to the t-test, which always means in the situation the t-test is designed for, Pitman's relative efficiency is 0.95. That means you need about five percent more data to achieve the same asymptotic power.

1.9 Two sample problems and linear rank tests

So far, we were considering examples of nonparametric tests and how they were constructed. Now we aim at a more systematic point of view in order to classify problems and the corresponding tests. First of all, we restrict ourselves to the two sample problem, having two independent random samples X1, ..., Xn and Y1, ..., Ym. Again we will assume, if not stated otherwise, that the random variables under consideration are continuous, and denote their distribution functions by FX and FY, respectively.

Definition 13 (location and scale parameter) Let P ⊂ R and let X_θ, θ ∈ P, be a family of random variables with cumulative distribution functions F_θ, θ ∈ P.

(i) θ is called a location parameter if there is one cumulative distribution function F such that

F_θ(x) = F(x − θ)

for all θ ∈ P.

(ii) θ is called a scale parameter if there is one cumulative distribution function F such that

F_θ(x) = F(x/θ)

for all θ ∈ P.

Now we consider several testing problems, applying the knowledge about tests on domination gained so far. To be precise, we test the hypothesis H0 : FX = FY against the following alternatives:

(1) Tests on location. Under the assumption that FY(x) = FX(x − θ) is a location family, we consider the alternatives

H1,L^{+,−,≠} : θ > 0, θ < 0, θ ≠ 0.

(2) Tests on scale. Under the assumption that FY(x) = FX(x/θ) is a scale family, we consider the alternatives

H1,S^{+,−,≠} : θ > 1, θ < 1, θ ≠ 1.

The problem is approached by reformulating these tests as tests on domination. This works pretty well for tests on location, but not without modifications for general scale families. On the basis of our knowledge about tests on domination, we will construct some tests which are well adapted to the corresponding problems. We restrict ourselves to the class of linear rank tests, and constructing such a test is equivalent to choosing proper regression coefficients. Note that how to choose these coefficients properly is subject to some intuition that we basically get from our knowledge about the tests on domination.

Let R := R(n, m) again be the set of all possible joint ordered rank statistics (= interlacing patterns) of two samples of sizes n and m, respectively. In order to define the notion of linear rank statistics below, we first introduce an alternative representation of an element ρ ∈ R.

Another representation of interlacing patterns. Instead of interlacing structures of the type ρ = (xxyx...yx), we consider

R(X1, ..., Ym) = (R1, ..., R_{n+m})

where Ri = 0 or 1 according to whether the i-th component of the ordered rank statistic comes from an observation in the X-sample or from the Y-sample. For example, ρ = (xxyxxyy) in the old notation translates to R = (0, 0, 1, 0, 0, 1, 1) in the new one. The i-th component of R will in the sequel be denoted by Ri = Ri(X1, ..., Ym).

Definition 14 (Linear rank statistic) A statistic T is called a linear rank statistic if

T(X1, ..., Ym) := Σ_{i=1}^{n+m} C_i^{(T)} Ri(X1, ..., Ym).    (1.18)

The real numbers C_i^{(T)}, i = 1, ..., n + m, are called the regression coefficients of T.

Different choices for the regression coefficients naturally lead to different tests.

Lemma 8 Under H0, we have for all i, j = 1, ..., n + m with i ≠ j,

E[Ri] = m/N,  Var(Ri) = mn/N²,  Cov(Ri, Rj) = −mn/(N²(N − 1)),

where N = n + m.

Proof: By Lemma 5, we have

P(Ri = 1) = |{ρ ∈ R(n, m) : Ri = 1}| × P(ρ) = |{R ∈ {0, 1}^{n+m} : Σ_{s=1}^{n+m} Rs = m, Ri = 1}| × P(ρ) = (n+m−1 choose m−1) / (n+m choose m) = m/(m + n).

Hence P(Ri = 0) = 1 − P(Ri = 1) = n/(n + m); the variables Ri are Bernoulli distributed with parameter p = m/(m + n). Expectation and variance are hence given by

E[Ri] = p = m/N,  Var(Ri) = pq = mn/N².

For the joint moments we have, for i ≠ j,

E[Ri Rj] = P(Ri = 1, Rj = 1) = (n+m−2 choose m−2) / (n+m choose m) = m(m − 1)/(N(N − 1)),

and thus for the covariance

Cov(Ri, Rj) = E[Ri Rj] − E[Ri] E[Rj] = m(m − 1)/(N(N − 1)) − (m/N)² = −mn/(N²(N − 1)).

From that, we immediately conclude

Proposition 3 Let T be a linear rank statistic. Then

E[T] = (m/N) Σ_{k=1}^N C_k^{(T)},   Var(T) = (mn/(N²(N − 1))) ( N Σ_{k=1}^N (C_k^{(T)})² − ( Σ_{k=1}^N C_k^{(T)} )² ),

where N = n + m.

Proof: Exercise.
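Proposition 3 can be sanity-checked by exhaustive enumeration over all equally likely patterns (Lemma 5). The following sketch (our own illustration) does this for the Wilcoxon coefficients C_k = k:

```python
from itertools import combinations
from math import comb

def exact_moments(coeffs, n, m):
    """Exact E[T] and Var(T) of a linear rank statistic by enumeration."""
    N = n + m
    assert len(coeffs) == N
    values = [sum(coeffs[i] for i in y_pos)
              for y_pos in combinations(range(N), m)]
    mean = sum(values) / comb(N, m)
    var = sum((v - mean) ** 2 for v in values) / comb(N, m)
    return mean, var

n, m = 4, 3
N = n + m
C = list(range(1, N + 1))       # Wilcoxon coefficients C_k = k
print(exact_moments(C, n, m))   # compare with Proposition 3:
S1, S2 = sum(C), sum(c * c for c in C)
print(m * S1 / N, m * n * (N * S2 - S1**2) / (N**2 * (N - 1)))
```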

Proposition 4 Under H0, the distribution of T is symmetric around its mean if there is some constant C ∈ R such that

C_k^{(T)} + C_{N−k+1}^{(T)} = C

for all k = 1, ..., N.

Proof: For R = (R1, ..., RN) ∈ R(n, m), let R′ = (RN, ..., R1). We consider the map σ : R(n, m) → R(n, m) given by σ(R) = R′. The map σ is bijective with σ² = id. Hence

T(R) + T(σ(R)) = Σ_{k=1}^N C_k^{(T)} (Rk + R_{N−k+1}) = Σ_{k=1}^N (C_k^{(T)} + C_{N−k+1}^{(T)}) Rk = C Σ_{k=1}^N Rk = C m.

Since σ is bijective and under H0 we have an equidistribution on the set of all ordered rank statistics, this implies

P(T(R) = t) = P({R : T(R) = t}) = P(σ({R : T(R) = t})) = P(T(R′) = Cm − t).

That implies P(T(R) = Cm/2 + s) = P(T(R′) = Cm/2 − s), and since by the above E[T(R)] = E[T(σ(R))] = Cm/2, the distribution is symmetric around its mean.

Example. For the Wilcoxon statistic

TW(X1, ..., Ym) = Σ_{j=1}^m tj,

we have

C_k^{(W)} + C_{N−k+1}^{(W)} = k + N − k + 1 = N + 1,

and hence TW is symmetrically distributed around

E[TW] = (m/N) Σ_{k=1}^N C_k^{(W)} = (m/N) Σ_{k=1}^N k = (m/2)(N + 1).

The value of the Wilcoxon statistic ranges from

TW ≥ Σ_{k=1}^m k = (m/2)(m + 1)

in the case where the Y-values are the smallest, to

TW ≤ Σ_{k=n+1}^N k = (m/2)(N + n + 1).

The symmetry of the distribution is important to construct the critical region for a two-sided alternative θ ≠ θ0. That means: to construct a critical region given a significance level α > 0, we choose in the case of a one-sided alternative everything as before, and in the case of a two-sided alternative, we choose ordered rank statistics with very small and very large values of the test statistic, distributing them with probability α/2 on both sides. As before, this choice is in general not unique.


1.9.1 Tests on Location

The location problem can also be interpreted as a problem on domination of random variables: θ > 0, for example, implies FY(x) = FX(x − θ) ≤ FX(x) for all x, i.e. Y ⪰ X. All tests discussed in the preceding section are thus examples of linear rank tests on location, and there is not much more to say except listing them.

Example. (i) The Wilcoxon test is the linear rank test with regression coefficients C_k^{(W)} = k. (ii) The Fisher-Yates test is the linear rank test with regression coefficients C_k^{(FY)} = E[ξ(k)], where ξ(k) is the k-th order statistic of a sample from a standard normal population. The Fisher-Yates test is also frequently called the Terry-Hoeffding test. (iii) The van der Waerden test is the linear rank test with regression coefficients C_k^{(X)} = Φ^{−1}(k/(n + m + 1)), where Φ is the cumulative distribution function of the standard normal distribution.

1.9.2 Tests on Scale

Tests on scale cannot be interpreted as tests on domination without further modifications.

Example. (Tests on scale, the positive case) The basic paradigm for a test on scale is the test on the variance of a normal distribution. For the nonparametric setup, we consider a scale family

Fσ(x) = F(x/σ)

and test H0 : σ = σ0 against the usual alternatives. In the positive case F(0) = 0, i.e. when the underlying random variable is positive, we see that H1 : σ > σ0 implies x/σ < x/σ0 for x > 0, and thus, due to the monotonicity of cumulative distribution functions,

F(x/σ) ≤ F(x/σ0).

Hence Y dominates X stochastically, and we can apply the same tests as for location again, since they were all based on domination.

If the random variables under consideration are not strictly positive, we have to understand that differences in scale also cause differences in location. As an example, we consider the effect of a scale change for an N(µ, 1)-distributed random variable X. It turns out that in this case, the scale family FY(x) = FX(x/θ) yields random variables Y = θX distributed according to an N(θµ, θ²)-distribution. Thus, a scale change even affects the location parameter. It is therefore not obvious how to distinguish scale and location alternatives. However, that is different if the data are normalized such that the location parameter is zero, or, equivalently, if we parameterize the underlying family of distributions in a different way. If the location parameter is zero, we observe that the change of scale results in a picture where, for σ > 1, FX(x) ≥ FY(x) for x > 0 and FX(x) < FY(x) for x < 0. For σ < 1, it is the other way round. This observation will in the sequel serve as an intuition for how to choose the regression coefficients for the associated tests on scale.

First we formalize the normalization approach by the notion of a location/scale family below.

Definition 15 (location/scale family) A location/scale family is a family X_{θ,σ}, θ ∈ R, σ > 0, of random variables such that there is a function F : R → [0, 1] and

F_{θ,σ}(x) = F( (x − θ)/σ )

are the cumulative distribution functions for all values of θ, σ.

In the case of a location/scale family, we now have some indication how to construct linear rank tests for the scaling problem. In the case that the hypothesis FX = FY, or equivalently σ = 1, holds, we expect that the average rank of the observations in the X- and in the Y-sample is the same.

Lemma 9 Under H0, the expected average rank for the Y- and X-variables is (N + 1)/2, where N = n + m.

Proof: Using the random variables Rk from the definition of linear rank statistics, we may write the average rank (a random variable) of the values of the Y-sample as

R̄Y = (1/m) Σ_{k=1}^N k Rk.

Therefore, by Lemma 8, we have

E[R̄Y] = (1/m) Σ_{k=1}^N k E[Rk] = (1/N) Σ_{k=1}^N k = (N + 1)/2.

The same holds for the ranks of the X-sample, due to the fact that in the calculation above the sample size m cancels.

We will now construct two rank statistics based on the observation made above: we believe that if σ > 1, i.e. the distribution of Y is more dispersed than the distribution of X, interlacing patterns of the type yyyxxxxxxyyy, where the observations in the Y-sample take the small and the large values, are more likely to appear than under the hypothesis.

Definition 16 (Mood test) The Mood test is a linear rank test on dispersion with test statistic given by

M := \sum_{k=1}^{N} \left( k − \frac{N + 1}{2} \right)^2 R_k.

By the observation above, a large value of M would lead us to the conclusion that the Y-sample is more dispersed, whereas small values would support the conclusion that X is more dispersed than Y. The rejection regions for the Mood test are therefore given for the different alternatives by:

H1:              σ > 1        σ < 1        σ ≠ 1
critical region: M > m^+_α    M < m^−_α    M > m^+_{α/2} or M < m^−_{α/2}

For the two-sided alternative, note that the coefficients of the Mood test satisfy the assumptions of Proposition 4.
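To make the construction concrete, here is a minimal sketch (in Python; the function name is ours, not from the text) of how M can be computed from two samples, under the standing assumption of continuous data without ties:

    import numpy as np

    def mood_statistic(x, y):
        """Mood statistic M = sum over Y-ranks k of (k - (N+1)/2)^2 (a sketch)."""
        x, y = np.asarray(x), np.asarray(y)
        pooled = np.concatenate([x, y])
        N = pooled.size
        ranks = pooled.argsort().argsort() + 1   # ranks 1,...,N of the pooled sample
        y_ranks = ranks[x.size:]                 # ranks belonging to the Y-sample
        return np.sum((y_ranks - (N + 1) / 2.0) ** 2)

A standardized version of this test ships with SciPy as scipy.stats.mood.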

Remark. A variant of the idea of the Mood test is provided by the so called Freund-Ansari-Bradley-David-Barton test, where the test statistic is given by

A := \sum_{k=1}^{N} \left| k − \frac{N + 1}{2} \right| R_k.

Another test that is using the same basic idea is the Siegel-Tukey test. Here the weights for the different Y-ranks are just the integers 1, ..., N = n + m, arranged in a suitable manner such as (for instance, for N even):

k:           1   2   ...   N/2   N/2 + 1   ...   N − 1   N
C^{(ST)}_k:  1   4   ...   N     N − 1     ...   3       2


The idea is clearly the same as above; however, large dispersions are now weighted with low regression coefficients. Therefore the construction of the critical regions is the other way round.

Definition 17 (Siegel-Tukey test) The Siegel-Tukey test is the linear test on dispersion given by the test statistic

S := \sum_{k=1}^{N} C^{(ST)}_k R_k

where

C^{(ST)}_k =  2k,             k even, 1 ≤ k ≤ N/2,
              2k − 1,         k odd,  1 ≤ k ≤ N/2,
              2(N − k) + 2,   k even, N/2 < k ≤ N,
              2(N − k) + 1,   k odd,  N/2 < k ≤ N.

S takes large values if Y is less dispersed than X. Hence, the rejection regions for the various alternatives are given by:

H1:              σ > 1        σ < 1        σ ≠ 1
critical region: S < s^+_α    S > s^−_α    S > s^+_{α/2} or S < s^−_{α/2}

and the critical values are again tabulated. The special point about the Siegel-Tukey test is the following:

Remark. Under H0, the distribution of S is the same as for the Wilcoxon statistic W_{n,m}. That was also the initial reason for this kind of reordering of the ranks. Thus, we do not even have to calculate a new table.
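Since the zig-zag assignment is easy to get wrong by hand, a small sketch (ours) of the coefficients from Definition 17 may help; the final assertion checks that they form a permutation of 1, ..., N, which is exactly why S inherits the null distribution of the Wilcoxon statistic:

    def siegel_tukey_coefficients(N):
        """Coefficients C_k, k = 1,...,N, of Definition 17 (N even; a sketch)."""
        C = []
        for k in range(1, N + 1):
            if k <= N // 2:
                C.append(2 * k if k % 2 == 0 else 2 * k - 1)
            else:
                C.append(2 * (N - k) + 2 if k % 2 == 0 else 2 * (N - k) + 1)
        return C

    print(siegel_tukey_coefficients(10))                      # [1, 4, 5, 8, 9, 10, 7, 6, 3, 2]
    assert sorted(siegel_tukey_coefficients(10)) == list(range(1, 11))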

1.9.3 The distribution of the Wilcoxon test statistic

We now start a little detour about the business of exact distributions. Asymptotic distribution results and the use of tables for these probabilities are still very useful. However, since computers became easily accessible, it is possible to calculate exact distributions of the test statistic for finite sample sizes. As an example of a recursion relation which can easily be implemented, we consider the exact distribution of the Wilcoxon statistic for the two sample problem with samples of size n, m, respectively. We denote the corresponding statistic by W_{n,m}.


Lemma 10 Let

P_{n,m}(k) = P(W_{n,m} = k).

Then we have the following recursive formula:

(m + n) P_{n,m}(k) = m P_{n,m−1}(k − N) + n P_{n−1,m}(k).    (1.19)

Proof: Let L_{n,m}(k) denote the number of ordered rank statistics (interlacing patterns) with rank sum k, so that

P_{n,m}(k) = L_{n,m}(k) × P(ρ),

where P(ρ) = \binom{n+m}{m}^{−1} is the probability of any single pattern ρ. Since m \binom{N−1}{m−1}^{−1} = n \binom{N−1}{m}^{−1} = N \binom{N}{m}^{−1}, the statement is proved if we show that

L_{n,m}(k) = L_{n,m−1}(k − N) + L_{n−1,m}(k).

This corresponds to the decomposition of the ordered rank statistics with rank sum k into

(a) those where there is some i with r(Y_i) = N (there are L_{n,m−1}(k − N) of them),

(b) those where there is some i with r(X_i) = N (there are L_{n−1,m}(k) of them).

The preceding lemma gives all information that is needed to write a program that calculates all probabilities P_{n,m}(k) in principle.
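A minimal sketch of such a program in Python (the base cases are ours: W_{n,0} = 0 and W_{0,m} = m(m+1)/2 with probability one); it implements recursion (1.19) with memoization and exact rational arithmetic:

    from fractions import Fraction
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def wilcoxon_pmf(n, m, k):
        """P(W_{n,m} = k) via recursion (1.19), returned as an exact fraction."""
        if k < 0:
            return Fraction(0)
        if m == 0:                                  # no Y-observations: rank sum 0
            return Fraction(1) if k == 0 else Fraction(0)
        if n == 0:                                  # only Y-observations
            return Fraction(1) if k == m * (m + 1) // 2 else Fraction(0)
        N = n + m
        return (m * wilcoxon_pmf(n, m - 1, k - N)
                + n * wilcoxon_pmf(n - 1, m, k)) / N

    # exact null distribution of W_{2,2}: rank sums 3,...,7
    print([wilcoxon_pmf(2, 2, k) for k in range(3, 8)])   # 1/6, 1/6, 1/3, 1/6, 1/6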

1.10 Asymptotic Normality

For large values of N, the distributions of linear rank statistics are all approximately normally distributed. Variance and expectation value of the rank statistics were already computed in Proposition 3. Thus, we ask the question whether, for

T_N = \sum_{k=1}^{N} C^N_k R_k


and sample sizes m_N + n_N = N, the distribution of

Z_N := \frac{T_N − \frac{m_N}{N} \sum_{k=1}^{N} C^N_k}{\sqrt{\frac{m_N n_N}{N^2(N−1)} \left( N \sum_{k=1}^{N} (C^N_k)^2 − \left( \sum_{k=1}^{N} C^N_k \right)^2 \right)}}    (1.20)

converges to a suitable limit as N → ∞. The problem is that the random variables R_k are not independent. However, they are Bernoulli variables with covariance

Cov(R_i, R_j) = −\frac{m_N n_N}{N^2(N − 1)}.

Under the assumption m_N/N → λ, 0 < λ < 1, we have that

Cov(R_i, R_j) ≈ −\frac{λ(1 − λ)}{N − 1} → 0

as N tends to infinity. That implies that "asymptotically, the variables R_i and R_j are independent". This is the reason why we obtain the following result, using a central limit theorem for dependent variables.

To compare linear rank statistics for different values of N, we assume the existence of a function

ϕ : [0, 1] → R

such that

1. ϕ is either nondecreasing on [0, 1], or nonincreasing on [0, a] and nondecreasing on [a, 1] for some 0 < a < 1,

2. 0 < \int_0^1 (ϕ(t) − \bar{ϕ})^2 dt < ∞, where \bar{ϕ} = \int_0^1 ϕ(t) dt.

Furthermore, we assume that the regression coefficients C^N_k for the rank tests for different values of N are given by

C^N_k = ϕ\left( \frac{k}{N + 1} \right).

Under these assumptions, we can prove the following asymptotic statement.


Theorem 2 (Asymptotic normality) For every ε > 0, there exists an M = M(ε) such that for all N with

min{m_N, n_N} > M

we have

sup_{t∈R} |P(Z_N ≤ t) − Φ(t)| < ε

where Φ denotes the cumulative distribution function of the standard normal distribution.

Proof: see [4], Theorem 4B, p. 15.

Example. To illustrate the usage of the function ϕ, we consider the Wilcoxon statistic T_W (unlike above, we suppress the N-dependence in the notation) where the regression coefficients are given by C^{(W)}_k = k. Normalizing this to

\tilde{T}_W = \frac{1}{N + 1} \sum_{k=1}^{N} k R_k,

we can use the function ϕ(x) = x in the theorem above. By

E(T_W) = \frac{m_N (N + 1)}{2},   Var(T_W) = \frac{m_N n_N (N + 1)}{12}

we obtain asymptotic normality of

Z_N = \frac{T_W − E T_W}{\sqrt{Var T_W}} = \frac{(T_W − E T_W)/(N + 1)}{\sqrt{Var T_W /(N + 1)^2}} = \frac{\tilde{T}_W − m_N/2}{\sqrt{m_N n_N / (12(N + 1))}}.
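A quick Monte Carlo check (ours, not part of the text) makes the approximation visible: under H0 the Y-ranks are a uniform random m-subset of {1, ..., N}, so we can simulate Z_N directly and compare with Φ:

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(0)
    n, m = 15, 10
    N = n + m
    mean, var = m * (N + 1) / 2, m * n * (N + 1) / 12

    # rank sums of uniform random m-subsets of {1,...,N}, standardized
    Z = np.array([(rng.choice(N, size=m, replace=False) + 1).sum()
                  for _ in range(20000)], dtype=float)
    Z = (Z - mean) / sqrt(var)

    Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
    for t in (-1.64, 0.0, 1.64):
        print(t, (Z <= t).mean(), Phi(t))    # empirical cdf vs. normal cdf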


Chapter 2

Goodness of Fit

2.1 A functional limit theorem

Another application of nonparametric statistics is to test whether a given distribution is really the underlying distribution of a given random variable. The basic fact underlying this kind of analysis is the convergence of the empirical cumulative distribution function to the true one. Let thus once again X1, ..., Xn be independent identically distributed random variables with cumulative distribution function F. Note that we can drop the assumption of continuity of F in this whole section.

Definition 18 (empirical distribution function) The empirical distribution function of a random sample X1, ..., Xn is given by

F_n(x) := \frac{1}{n} \sum_{i=1}^{n} 1_{[X_i, ∞)}(x).

Remark. We can also write F_n in the form

F_n(x) := \frac{1}{n} |\{1 ≤ i ≤ n : X_i ≤ x\}|.
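For illustration, a short sketch (ours): F_n is the right-continuous step function that jumps by 1/n at every observation.

    import numpy as np

    def ecdf(sample):
        """Return F_n as a function: F_n(x) = #{i : X_i <= x} / n."""
        xs = np.sort(np.asarray(sample))
        # side="right" counts the sample points <= x
        return lambda x: np.searchsorted(xs, x, side="right") / xs.size

    Fn = ecdf([0.3, 0.1, 0.7, 0.7])
    print(Fn(0.0), Fn(0.1), Fn(1.0))        # 0.0 0.25 1.0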

In the sequel, let X1, X2, ... be an i.i.d. sequence of random variables with cumulative distribution function F. That the empirical distribution function converges pointwise to the true cumulative distribution function F is a consequence of the strong law of large numbers. Let t ∈ R be fixed and

Y_i = 1_{(−∞,t]}(X_i), i.e. Y_i = 1 if X_i ≤ t and Y_i = 0 if X_i > t.

Then the random variables Y1, Y2, ... are i.i.d. Bernoulli-distributed with

E Y_i = P(X_i ≤ t) = F(t).

Hence

F_n(t) = \frac{1}{n} |\{1 ≤ i ≤ n : X_i ≤ t\}| = \frac{1}{n} \sum_{i=1}^{n} 1_{(−∞,t]}(X_i) = \frac{1}{n} \sum_{i=1}^{n} Y_i

and by the strong law of large numbers

P\left( \lim_{n→∞} F_n(t) = F(t) \right) = P\left( \lim_{n→∞} \frac{1}{n} \sum_{i=1}^{n} Y_i = E Y_1 \right) = 1.

The next statement shows that this is even true in a much stronger, uniform sense.

Theorem 3 (Glivenko-Cantelli) Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with distribution function F. Let

d_n := sup_{x∈R} |F_n(x) − F(x)|.

Then

P( lim_{n→∞} d_n = 0 ) = 1.

Proof: See e.g. [3], 11.4.2, p. 314.

Remark. The metric

d(F, G) := sup_{x∈R} |F(x) − G(x)|

on the space of distribution functions is called the Kolmogorov distance.
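For a continuous F, the supremum defining d_n is attained at the jump points of F_n, which makes d_n exactly computable; a sketch (ours), assuming F is given as a vectorized function:

    import numpy as np

    def kolmogorov_distance(sample, F):
        """Exact d_n = sup_x |F_n(x) - F(x)| for a continuous, vectorized cdf F.

        At the i-th order statistic F_n jumps from (i-1)/n to i/n, so it
        suffices to compare these two values with F at the order statistics.
        """
        xs = np.sort(np.asarray(sample))
        n = xs.size
        Fx = F(xs)
        i = np.arange(1, n + 1)
        return max(np.max(i / n - Fx), np.max(Fx - (i - 1) / n))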


If there is a strong law, it is reasonable to expect that there is also a central limit theorem. Let us again consider the situation for fixed t ∈ R: We already saw that E Y_i = F(t). Since Y_i is a Bernoulli variable, we obtain for the variance

Var Y_i = F(t)(1 − F(t)).

Hence, by the central limit theorem, the random variable

Z_n(t) := \sqrt{n} (F_n(t) − F(t)) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} Y_i − E\left[ \frac{1}{n} \sum_{i=1}^{n} Y_i \right] \right)

converges in distribution to a normal variable with mean zero and variance F(t)(1 − F(t)).

Also for this statement, there is a much stronger uniform version. In order to make that understandable, we first calculate the covariance structure Cov(Z_n(t), Z_n(s)) for s ≤ t.

Lemma 11 Let s ≤ t, 1 ≤ i, j ≤ n, and denote by Y_i(s), Y_j(t) the Bernoulli variables constructed as above for time s and t, respectively. Then

E Y_i(s) Y_j(t) = δ_{ij} F(min{s, t}) + (1 − δ_{ij}) F(s) F(t)

where the Kronecker symbol is given by δ_{ij} = 1 if i = j and δ_{ij} = 0 if i ≠ j.

Proof: We have

E Y_i(s) Y_j(t) = E[1_{(−∞,s]}(X_i) 1_{(−∞,t]}(X_j)] = P(X_i ≤ s, X_j ≤ t),

which equals P(X_i ≤ min{s, t}) = F(min{s, t}) for i = j, and P(X_i ≤ s) P(X_j ≤ t) = F(s) F(t) for i ≠ j by independence.

That implies for the covariances

Cov(Z_n(t), Z_n(s)) = E[(Z_n(t) − E Z_n(t))(Z_n(s) − E Z_n(s))]

the following statement:


Lemma 12 The covariance structure is given by

Cov(Z_n(t), Z_n(s)) = F(min{s, t}) − F(s) F(t).

Proof: By Lemma 11, we have

n E[(F_n(t) − F(t))(F_n(s) − F(s))]
= n E\left[ \frac{1}{n^2} \sum_{i,j=1}^{n} Y_i(t) Y_j(s) − \frac{1}{n} \sum_{i=1}^{n} Y_i(t) F(s) − \frac{1}{n} \sum_{j=1}^{n} Y_j(s) F(t) + F(s) F(t) \right]
= \frac{1}{n} \sum_{i,j=1}^{n} \left( δ_{ij} F(min{s, t}) + (1 − δ_{ij}) F(s) F(t) \right) − n F(s) F(t)
= F(min{s, t}) − F(s) F(t).

For every value of t ∈ R we can thus determine the limit random variable X_t by the central limit theorem. But as in the case of the Glivenko-Cantelli theorem, there is a corresponding uniform statement about the weak convergence of the difference between the empirical and the cumulative distribution function to a Gaussian stochastic process. This is an instance of a functional limit theorem.

Definition 19 (Gaussian process) A stochastic process (X_t)_{t∈R} is called Gaussian if for every finite index set t = (t_1, ..., t_n) the vector valued random variable

X_t = (X_{t_1}, ..., X_{t_n})

is Gaussian. If E X_t = 0 for all t ∈ R, the Gaussian process is called centered.

Remark. A centered Gaussian process is uniquely determined by its covariance structure

C(s, t) = Cov(X_s, X_t) = E X_s X_t.

Theorem 4 (Donsker) As n → ∞, the empirical process

X^{(n)}_t := \sqrt{n} (F_n(t) − F(t))

converges in distribution to a centered Gaussian process X_t with covariance structure

Cov(X_s, X_t) = F(min{s, t}) − F(s) F(t).


Remark. By monotonicity of cumulative distribution functions, we have

F(min{s, t}) = min{F(s), F(t)}.

By that, we can actually identify the limiting process. The centered Gaussian process (b_t)_{t∈[0,1]} with covariance structure

C(s, t) = min{s, t} − st

is called the standard Brownian bridge. Thus, we can write X_t more explicitly as

X_t = b_{F(t)}.

Some remarks on the proof of Donsker's theorem. Convergence of X^{(n)}_t to X_t in finite dimensional distributions follows from the multidimensional central limit theorem for the i.i.d. random vectors

Y_i := (Y_i(s_1), ..., Y_i(s_k))

for every finite s_1 < ... < s_k, with

Y_i ∈ {(0, ..., 0), (0, ..., 0, 1), (0, ..., 0, 1, 1), ..., (1, ..., 1)}

and

P(Y^r_i = 1 for all r ≥ l, Y^r_i = 0 for r < l) = P(X_i ∈ (s_{l−1}, s_l]) = F(s_l) − F(s_{l−1}),

where we use the convention s_0 = −∞ and s_{k+1} = ∞. Note that this does not provide a full proof of weak convergence of the processes. For that, we would have to discuss tightness of the approximating sequence X^{(n)}_t. For a full proof, see the original article [7].

2.2 The Kolmogorov Smirnov test

The significance of Theorem 4 is more important for us than its proof. It means that as n tends to infinity, or, approximately for large values of n, we have

sup_{t∈R} |X^{(n)}_t| := sup_{t∈R} \sqrt{n} |F_n(t) − F(t)| ≈ sup_{t∈R} |b_{F(t)}| = sup_{s∈[0,1]} |b_s|.    (2.1)


And this statement is independent of F. Thus, this result fits into our general strategy in nonparametric statistics. Even the distribution of the supremum of a Brownian bridge can be computed more or less explicitly.

Theorem 5 (Supremum of modulus of Brownian bridge) We have

P\left( sup_{s∈[0,1]} |b_s| ≥ a \right) = 2 \sum_{n≥1} (−1)^{n+1} \exp(−2n^2 a^2).    (2.2)

Proof: See e.g. [3], 12.3.4, p. 364.

That implies the basic idea behind the so-called Kolmogorov-Smirnov goodness of fit test.

Corollary 1 (Kolmogorov-Smirnov) For every z ≥ 0

lim_{n→∞} P\left( sup_{t∈R} |F_n(t) − F(t)| ≤ z/\sqrt{n} \right) = L(z)

where

L(z) := 1 − 2 \sum_{k≥1} (−1)^{k+1} \exp(−2k^2 z^2).

Another consequence of Theorem 4 is that we can also calculate the values of other related statistics. For instance, let

D^+_n := sup_{t∈R} (F_n(t) − F(t))    (2.3)

without the absolute value. By Theorem 4 we have, as in (2.1), for large values of n

\sqrt{n} D^+_n ≈ sup_{t∈R} b_{F(t)} = sup_{s∈[0,1]} b_s.

But also the distribution of the supremum of the Brownian bridge is known:

Theorem 6 (Suprema of Brownian bridges) We have

P\left( sup_{s∈[0,1]} b_s ≥ a \right) = e^{−2a^2}.

Proof: See e.g. [3], 12.3.5, p. 365.

That implies


Corollary 2 For every z ≥ 0 we have

lim_{n→∞} P(4n (D^+_n)^2 ≤ z) = χ²_2(z)

where χ²_2 denotes the cumulative distribution function of a χ²-distribution with two degrees of freedom.

Proof: Note first that b_0 = b_1 = 0 implies that

sup_{s∈[0,1]} b_s ≥ 0,

which you can (almost surely) also conclude from Theorem 6 with a = 0. Hence, we also have for z ≥ 0 that

4n (D^+_n)^2 ≤ z  ⇔  2\sqrt{n} D^+_n ≤ \sqrt{z},

since D^+_n ≥ 0 for the same reason. That implies, again by Theorem 6,

lim_{n→∞} P(4n (D^+_n)^2 ≤ z) = lim_{n→∞} P(\sqrt{n} D^+_n ≤ \sqrt{z}/2) = 1 − e^{−2(\sqrt{z}/2)^2} = 1 − e^{−z/2} = χ²_2(z).

From these considerations, we may now derive goodness of fit tests for all the relevant alternatives: Consider the hypothesis H0 : F = F0 that the random variables in the sample are distributed with cumulative distribution F0. To test it against one of the alternatives

H1 : F ≻ F0,  F ≺ F0,  F ≠ F0,

we use the test statistics

D_n := sup_{t∈R} |F_n(t) − F0(t)|,
D^{+/−}_n := sup_{t∈R} ±(F_n(t) − F0(t)).

Under H0, the asymptotic distributions of these test statistics are given by Corollaries 1 and 2. The asymptotic distributions of D^+_n and D^−_n coincide, since they are given by the suprema of the Brownian bridge b_s and the process −b_s, respectively, and it is easy to see that these two processes are centered Gaussian with the same covariance structure, so their distributions are identical.


Definition 20 (Kolmogorov-Smirnov goodness of fit test) The Kolmogorov-Smirnov test to a significance level α > 0 is given by

H1:              F ≠ F0            F ≻ F0              F ≺ F0
test statistic:  D_n               D^+_n               D^−_n
critical region: D_n > D_{n,α}     D^+_n > D^+_{n,α}   D^−_n > D^+_{n,α}

where the critical values are given by D_{n,α} := min{z ≥ 0 : L(z) ≥ 1 − α}/\sqrt{n} and D^+_{n,α} := min{z ≥ 0 : e^{−2z^2} ≤ α}/\sqrt{n}, respectively.
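Combining the sketches kolmogorov_distance and ks_cdf from above gives a minimal two-sided one-sample test (ours, not from the text); it reports D_n together with the asymptotic p-value 1 − L(\sqrt{n} D_n):

    import numpy as np
    from scipy.stats import norm

    def ks_test(sample, F0):
        """Two-sided one-sample Kolmogorov-Smirnov test against the cdf F0 (a sketch)."""
        dn = kolmogorov_distance(sample, F0)
        return dn, 1 - ks_cdf(np.sqrt(len(sample)) * dn)

    rng = np.random.default_rng(1)
    print(ks_test(rng.normal(size=200), norm.cdf))   # H0 true: p-value typically large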

Remark. The Kolmogorov-Smirnov statistic cannot be written as a linear rank statistic, but it can be written as the maximum of a finite number of linear rank statistics (cf. [4], p. 62).

An analogous test can be performed in the two sample case. Let X_1, ..., X_n, Y_1, ..., Y_m be independent random samples, and suppose we want to test the hypothesis H0 : F_X = F_Y against the three relevant alternatives. Under H0, we have by the preceding considerations (F_X = F_Y = F)

\sqrt{m}(F_m(t) − F(t)) → b_{F(t)},   \sqrt{n}(F_n(t) − F(t)) → b_{F(t)}

in distribution as m and n tend to infinity, by Theorem 4. Consider now the statistic

D_{n,m} := sup_{t∈R} |F_n(t) − F_m(t)|.

Due to

\sqrt{\frac{nm}{n+m}} (F_n(t) − F_m(t)) = \sqrt{\frac{nm}{n+m}} (F_n(t) − F(t) + F(t) − F_m(t))
= \sqrt{\frac{m}{n+m}} \sqrt{n}(F_n(t) − F(t)) − \sqrt{\frac{n}{n+m}} \sqrt{m}(F_m(t) − F(t)),

which converges for m, n → ∞ and m/n → c > 0 to

β_{F(t)} = \frac{b^{(1)}_{F(t)}}{\sqrt{1 + 1/c}} − \frac{b^{(2)}_{F(t)}}{\sqrt{1 + c}}


where b^{(1)} and b^{(2)} are independent Brownian bridges. Since this is a sum of independent centered Gaussian processes, β is also a centered Gaussian process, with covariance structure

E(β_s β_t) = \frac{c}{1 + c} E(b^{(1)}_s b^{(1)}_t) + \frac{1}{1 + c} E(b^{(2)}_s b^{(2)}_t) = min{s, t} − st,

and hence β is a standard Brownian bridge, too. That implies

Lemma 13 We have

(i) lim_{m,n→∞, m/n→c>0} P\left( \sqrt{\frac{nm}{n+m}} D_{n,m} ≤ z \right) = L(z),

(ii) lim_{m,n→∞, m/n→c>0} P\left( \sqrt{\frac{nm}{n+m}} D^+_{n,m} ≤ z \right) = 1 − e^{−2z^2},

where D^+_{n,m} := sup_{t∈R} (F_n(t) − F_m(t)).

Proof: Exercise.

Definition 21 (Kolmogorov-Smirnov two sample test) The Kolmogorov-Smirnov two sample test to a significance level α > 0 is given by

H1:              F_X ≠ F_Y                               F_Y ≻ F_X
test statistic:  \sqrt{nm/(n+m)} D_{n,m}                 \sqrt{nm/(n+m)} D^+_{n,m}
critical region: \sqrt{nm/(n+m)} D_{n,m} > D_{n+m,α}     \sqrt{nm/(n+m)} D^+_{n,m} > D^+_{n+m,α}

where the critical values are the same as in Definition 20. There is also the analogous statement for the test statistic D^−_{n,m} := sup_{t∈R} (F_m(t) − F_n(t)).
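Since F_n and F_m are step functions that jump only at the pooled observations, the two-sample statistic can be evaluated exactly on the pooled sample; a sketch (ours):

    import numpy as np

    def ks_two_sample(x, y):
        """sqrt(nm/(n+m)) * sup_t |F_n(t) - F_m(t)|, evaluated on the pooled sample."""
        x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
        n, m = x.size, y.size
        pooled = np.concatenate([x, y])
        Fn = np.searchsorted(x, pooled, side="right") / n
        Fm = np.searchsorted(y, pooled, side="right") / m
        return np.sqrt(n * m / (n + m)) * np.max(np.abs(Fn - Fm))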

Remark. Due to the invariance principle Theorem 4, we can base a goodness of fit test on other functionals of the empirical process as well. For instance, the Kuiper test considers the functional

D^K_n := D^+_n + D^−_n.

By Theorem 4, \sqrt{n} D^K_n is asymptotically distributed as

D^K_∞ = sup_{s∈[0,1]} b_s − inf_{u∈[0,1]} b_u,

where the supremum and the infimum are taken along the same path of the Brownian bridge. The joint distribution of (sup b_s, inf b_u) can also be calculated explicitly, and the statement about the distribution of D^K_∞ then follows from the continuous mapping principle, which we will use more explicitly in the next section.


2.3 The Chi-square idea

Another idea to construct a test on goodness of fit is the following: We have a sample x = (x_1, ..., x_n) of observations on the real line, but we decompose the real line into bins I_k := [t_k, t_{k+1}), k = 0, ..., K, where t_0 = −∞ < t_1 < ... < t_K < t_{K+1} = ∞. Then the number of observations n_k in I_k is given by

nk = n(Fn(tk+1)− Fn(tk))

and the expected number of observations in Ik is given by

Nk = nP(X ∈ Ik) = n(F (tk+1)− F (tk)) = n pk

where p_k = P(X ∈ I_k) is the probability that X falls into the kth bin. Consider now, assuming that the t_k are chosen in a way such that N_k > 0 for all k:

\frac{(n_k − N_k)^2}{N_k} = \frac{\left( n(F_n(t_{k+1}) − F(t_{k+1})) − n(F_n(t_k) − F(t_k)) \right)^2}{n(F(t_{k+1}) − F(t_k))} = \frac{\left( \sqrt{n}(F_n(t_{k+1}) − F(t_{k+1})) − \sqrt{n}(F_n(t_k) − F(t_k)) \right)^2}{F(t_{k+1}) − F(t_k)}.

In the sequel, the main idea to measure the deviation from the hypothesis will be to consider test statistics consisting of terms of the type

\frac{(\text{observed} − \text{expected})^2}{\text{expected}}.

By Theorem 4, we have that

\sqrt{n}(F_n(t_{k+1}) − F(t_{k+1})) − \sqrt{n}(F_n(t_k) − F(t_k)) → b_{F(t_{k+1})} − b_{F(t_k)}

in distribution. Now by the

Continuous mapping principle. If a sequence X_n of random variables converges in law to another random variable X and if φ is a continuous map, then also

φ(X_n) → φ(X)

in distribution.


(for an exact statement and a proof see for instance [3], 9.3.7, p. 232), we obtain that

\frac{(n_k − N_k)^2}{N_k} → \frac{\left( b_{F(t_{k+1})} − b_{F(t_k)} \right)^2}{F(t_{k+1}) − F(t_k)}

as n tends to infinity, since

φ(x) = \frac{x^2}{F(t_{k+1}) − F(t_k)}

is continuous. Let now T_k := F(t_k). Then by monotonicity of the cumulative distribution function

is continuous. Let now Tk := F (tk). Then by monotonicity of the cumulativedistribution function

0 = T0 ≤ T1 ≤ ... ≤ TK+1 = 1

and again applying the continuous mapping principle, this time to the mapψ(x1, ..., xn) =

∑ni=1 xi, we obtain

Corollary 3 As n → ∞, the random variable

S^2_n := \sum_{k=0}^{K} \frac{(n_k − N_k)^2}{N_k}

converges to

S^2 := \sum_{k=0}^{K} \frac{\left( b_{T_{k+1}} − b_{T_k} \right)^2}{T_{k+1} − T_k}

in distribution. Here N_k = n p_k.

In the sequel, we will use the following facts about χ2-distributions:

Two facts about the χ²-distribution for integer values n, m ∈ N of the number of degrees of freedom:

(i) Let X_1, ..., X_n ∼ N(0, 1) be independent standard normal variables. Then the sum X := X_1^2 + ... + X_n^2 is χ²_n-distributed.

(ii) Let X ∼ χ²_n, Y ∼ χ²_m be independent. Then X + Y ∼ χ²_{n+m}.


The distribution of S^2 is given by the following lemma (note that there are K + 1 bins).

Lemma 14 The distribution of S^2 is a χ²-distribution with K degrees of freedom.

Proof: First, we calculate the covariance structure of the centered Gaussian variables

X_k := \frac{b_{T_{k+1}} − b_{T_k}}{\sqrt{T_{k+1} − T_k}},

k = 0, ..., K. That yields

C_{kl} = E X_k X_l = δ_{kl} − \sqrt{(T_{k+1} − T_k)(T_{l+1} − T_l)}.    (2.4)

In particular, the increments of a Brownian bridge are far from being independent. It seems thus that we cannot use characterization (i) of the χ²-distribution above. But in fact we can, and what follows now is a very useful idea in dealing with these distributions.

First of all, the covariance matrix C_{kl} is symmetric. Hence there is an orthogonal matrix U such that U^+ C U = D = diag(λ_0, ..., λ_K) is diagonal. That implies that the variables (Y_0, ..., Y_K) = (X_0, ..., X_K) U are independent centered normal variables with variance

Var Y_i = Var \sum_j X_j U_{ji} = \sum_{j,s} E X_j U_{ji} X_s U_{si} = \sum_{j,s} U_{ji} U_{si} E X_j X_s = \sum_{j,s} U_{ji} U_{si} C_{js} = λ_i.

In particular, if λi = 0 then Yi = 0 almost surely.

The crucial fact is now that the covariance matrix in our case is idempotent, i.e. C^2 = C. We see this by (2.4) and the explicit calculation

C^2_{ij} = \sum_s C_{is} C_{sj}
= \sum_s \left( δ_{is} − \sqrt{(T_{i+1} − T_i)(T_{s+1} − T_s)} \right)\left( δ_{sj} − \sqrt{(T_{s+1} − T_s)(T_{j+1} − T_j)} \right)
= δ_{ij} − \sqrt{(T_{i+1} − T_i)(T_{j+1} − T_j)} = C_{ij},

using in the last step that \sum_s (T_{s+1} − T_s) = T_{K+1} − T_0 = 1.


Hence, C is symmetric and idempotent; that means it is a projection, and projections only have eigenvalues zero or one. The diagonal matrix D therefore has the form

D = diag(1, ..., 1, 0, ..., 0)

where the number of ones is equal to the rank rk D = rk C of the covariance matrix. Thus, the independent normal variables Y_i are either almost surely zero or standard normal, and since U is an orthogonal matrix, we have almost surely

Y_0^2 + ... + Y_{rk C − 1}^2 = Y_0^2 + ... + Y_K^2 = ‖(Y_0, ..., Y_K)‖^2 = ‖(X_0, ..., X_K) U‖^2 = X_0^2 + ... + X_K^2,

and the Y_k, k = 0, ..., rk C − 1, are independent standard normal variables. Hence X_0^2 + ... + X_K^2 is χ²-distributed with rk C degrees of freedom.

It remains to show that for the covariance matrix above, rk C = K. Here we use again that the eigenvalues are only zero or one. That implies that the rank of D equals the number of eigenvalues equal to one, and this is the trace of D. Thus


rk D = tr D = tr U^+ C U = tr C = \sum_s C_{ss} = \sum_s (1 − (T_{s+1} − T_s)) = (K + 1) − 1 = K.

From that, we can derive the second test on goodness of fit.

Definition 22 (χ²-goodness of fit test) The χ²-goodness of fit test of the hypothesis H0 : F = F0 against H1 : F ≠ F0 is given by the test statistic

X^2 := \sum_{k=0}^{K} \frac{(n_k − N_k)^2}{N_k}

which, under H0, is asymptotically χ²_K-distributed. The critical region for significance level α > 0 is therefore given by

C = \{X^2 > χ²_{1−α,K}\}

where χ²_{1−α,K} denotes the corresponding quantile.
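A compact sketch (ours; the quantile χ²_{1−α,K} is taken from SciPy). The inner bin edges t_1 < ... < t_K are passed explicitly, and the outer bins extend to ±∞:

    import numpy as np
    from scipy.stats import chi2, norm

    def chi_square_gof(sample, F0, edges, alpha=0.05):
        """Chi-square goodness of fit test with K+1 bins I_k = [t_k, t_{k+1})."""
        t = np.concatenate([[-np.inf], edges, [np.inf]])
        n_k, _ = np.histogram(sample, bins=t)    # observed counts
        N_k = len(sample) * np.diff(F0(t))       # expected counts N_k = n * p_k
        X2 = np.sum((n_k - N_k) ** 2 / N_k)
        return X2, X2 > chi2.ppf(1 - alpha, df=len(edges))   # df = K

    rng = np.random.default_rng(2)
    print(chi_square_gof(rng.normal(size=500), norm.cdf, np.linspace(-2, 2, 9)))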


Remark. Even though the test was constructed for samples of real random variables, we can apply it even to categorical data with just a discrete probability distribution determining which observation falls into which bin. Even the proof does not change if you just construct an artificial random variable X and choose the bins in a way that the probability P(X ∈ I_k) equals the probability that the initial (categorical) data falls within bin number k.

2.4 A Chi square test on independence

Now we will present another way to use the basic chi square idea, namely to construct a test on independence of two random variables.

Let X and Y be random variables with cumulative distribution functions F_X, F_Y, respectively. We divide the range of X into K + 1 subintervals I_0, I_1, ..., I_K and the range of Y into L + 1 subintervals J_0, J_1, ..., J_L. Then we perform an experiment with N paired observations (X_i, Y_i), i = 1, ..., N. We keep track of how many of the data pairs fall into the different bins according to the following scheme:

        I_0      I_1      I_2      ...   I_{K−1}    I_K
J_0     n_{00}   n_{01}   n_{02}   ...   n_{0,K−1}  n_{0K}   | m_0
...     ...      ...      ...            ...        ...      | ...
J_{L−1} n_{L−1,0} n_{L−1,1} n_{L−1,2} ... n_{L−1,K−1} n_{L−1,K} | m_{L−1}
J_L     n_{L0}   n_{L1}   n_{L2}   ...   n_{L,K−1}  n_{LK}   | m_L
        n_0      n_1      n_2      ...   n_{K−1}    n_K

where

n_k = \sum_{l=0}^{L} n_{lk},   m_l = \sum_{k=0}^{K} n_{lk},   N = \sum_{l=0}^{L} \sum_{k=0}^{K} n_{lk}.    (2.5)

Our aim is now to prove that the statistic

S^2_N = \sum_{l=0}^{L} \sum_{k=0}^{K} \frac{\left( n_{lk} − \frac{m_l n_k}{N} \right)^2}{\frac{m_l n_k}{N}}

is asymptotically χ²-distributed with K × L degrees of freedom.


Assume first that the marginal distributions are given by P(X ∈ I_k) = p_k, P(Y ∈ J_l) = q_l. We consider the statistic

X^2_N := \sum_{l=0}^{L} \sum_{k=0}^{K} \frac{(n_{lk} − N p_k q_l)^2}{N p_k q_l}

and our aim is to test the hypothesis H0: X and Y are independent against the alternative H1: X and Y are not independent. By Lemma 14, X^2_N is asymptotically χ²-distributed with (K + 1) × (L + 1) − 1 degrees of freedom.

The difference between X^2_N and S^2_N is that in the case that p_k and q_l are not known, we have to estimate the marginals from the sample. Thus, in S^2_N, the true marginals are substituted by the maximum likelihood estimators

\hat{p}_k = \frac{n_k}{N},   \hat{q}_l = \frac{m_l}{N}.

It is a not at all obvious fact that this procedure reduces the number of degrees of freedom by the number of parameters which have to be estimated. In this case, we have to estimate the marginal probabilities p_0, ..., p_K and q_0, ..., q_L. These are K + L + 2 probabilities, but since both the p_i's and the q_i's sum up to one, we effectively have to estimate only K + L numbers. That yields (K + 1)(L + 1) − 1 − (K + L) = KL degrees of freedom for the resulting χ²-distribution. Please note again that this is far from being a proof, for which we refer to the reference below.

Theorem 7 Under H0, the statistic S^2_N is asymptotically χ²-distributed with K × L degrees of freedom.

Proof: See [2], Sec. 30.3, p. 426 ff.

Using this, we can finally construct an asymptotic test on independence.

Definition 23 (χ²-independence test) The χ²-test on independence for a paired sample (X_i, Y_i)_{i=1,...,N} is given by the test statistic

S^2_N = \sum_{l=0}^{L} \sum_{k=0}^{K} \frac{\left( n_{lk} − \frac{m_l n_k}{N} \right)^2}{\frac{m_l n_k}{N}}

where n_k = \sum_{l=0}^{L} n_{lk}, m_l = \sum_{k=0}^{K} n_{lk}. The hypothesis that X and Y are independent is rejected at a significance level α > 0 if S^2_N > χ²_{1−α,KL}.
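Given the table of counts (n_{lk}), the test is a few lines (our sketch):

    import numpy as np
    from scipy.stats import chi2

    def chi_square_independence(counts, alpha=0.05):
        """Chi-square independence test for an (L+1) x (K+1) table of counts n_lk."""
        counts = np.asarray(counts, dtype=float)
        N = counts.sum()
        # expected counts m_l n_k / N from the row and column sums
        expected = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / N
        S2 = np.sum((counts - expected) ** 2 / expected)
        df = (counts.shape[0] - 1) * (counts.shape[1] - 1)   # K * L
        return S2, S2 > chi2.ppf(1 - alpha, df=df)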


Remark. For the same reasons as in the preceding paragraph, we can apply this test to categorical data as well.

Degrees of freedom. Note that the number of degrees of freedom can also be calculated from (2.5). According to I. J. Good [8], the number of degrees of freedom of a statistical problem is – independently of the occurrence of F- or χ²-distributions or quadratic forms – defined as the codimension of the hypothesis in a "larger hypothesis", meaning the full space of all distributions under consideration. In our case, an arbitrary distribution on the (K+1) × (L+1) bins is characterized by the same number of non-negative numbers with sum 1, meaning that we consider in total a (K+1) × (L+1) − 1 = KL + K + L dimensional simplex of all possible distributions in the larger hypothesis. Product distributions are characterized completely by the marginals, which are given by a K- and an L-dimensional simplex, respectively. The hypothesis that the two variables are independent thus consists of a space of distributions of dimension K + L. By the definition above, the number of degrees of freedom is thus the codimension KL + K + L − (K + L) = KL. For further examples of this interpretation see [8].


Appendix A

The functional delta method

The purpose of this appendix is an informal discussion of the so-called functional delta method. It considerably extends the technique of formulating and solving asymptotic problems in terms of the empirical distribution function. You can also consider the final example as a comment on the asymptotic normality result Theorem 2, p. 41. For a concise outline of the general method and the technical details involved, see the encyclopedia article [9].

A.1 The Mann-Whitney statistic

Recall that the continuous mapping principle enabled us to conclude from Donsker's result that

\sqrt{n}(F_n(x) − F(x)) → b_{F(x)}

implies

Φ(\sqrt{n}(F_n(x) − F(x))) → Φ(b_{F(x)})

for continuous Φ.

Consider now the Wilcoxon statistic in the two sample case. Recall that the underlying random variables were supposed to be continuous. We have

W(x_1, ..., y_m) = \sum_{k=1}^{n+m} k R_k(x_1, ..., y_m) = \sum_{j=1}^{m} \left( n F_n(y_j) + m G_m(y_j) \right)


where the two samples are given by (x_1, ..., x_n), (y_1, ..., y_m) and F_n, G_m are the empirical distribution functions of X, Y, respectively. Clearly,

\sum_{j=1}^{m} m G_m(y_j) = \sum_{j=1}^{m} j = \frac{m}{2}(m + 1)

independent of the sample. Thus we may reduce the statistic to

W = n \sum_{j=1}^{m} F_n(y_j) = nm \int_R F_n \, dG_m.

Thus, the Wilcoxon statistic in terms of the empirical distribution functions is basically given by

Ω(F_n, G_m) := \int_R F_n \, dG_m    (A.1)

and this is called the Mann-Whitney form of the Wilcoxon statistic. It can be shown that Ω is continuous in both arguments with respect to convergence in probability of the random variables associated to the distribution functions. Convergence in Kolmogorov distance of the cumulative distribution functions implies convergence in probability of the associated variables, and therefore we have

Ω(Fn, Gm) → Ω(F,G)

as m, n → ∞ by the Glivenko-Cantelli theorem.

Remark. We can even compute the limit (assuming for simplicity that G has a density g) to

Ω(F, G) = \int F \, dG = \int_R dy \, g(y) P(X ≤ y) = P(X ≤ Y).    (A.2)

A.2 Hadamard differentiability and asymptotic normality

The difference is now that we have to compute an asymptotic of the form N → ∞, m/n → λ/(1 − λ) for

\sqrt{\frac{nm}{N}} \left( Ω(F_n, G_m) − Ω(F, G) \right)


to prove Theorem 2 for the special case of the Wilcoxon statistic.

The idea is to use Taylor expansion: If (u_n, v_n) → (u, v) and f is sufficiently differentiable at (u, v), then

f(u_n, v_n) = f(u, v) + \frac{∂f}{∂u}(u, v)(u_n − u) + \frac{∂f}{∂v}(u, v)(v_n − v) + R(u_n, v_n)

where the remainder R is asymptotically negligible. The problem is clearly that the space of cumulative distribution functions is infinite dimensional and that it is very hard to find suitable notions of differentiability. In fact, there are several ways to do that, depending on the given problem.

The space of distribution functions is not a vector space because the sum of two distribution functions is not a distribution function any more. But for µ, ν ≥ 0, µ + ν = 1, the convex combination µF + νG of two distribution functions F and G is again a distribution function. Thus, the set of distribution functions D forms a convex subset of the vector space of distribution functions of signed and finite measures

S_fin = {aF − bG : a, b ∈ R, F, G ∈ D}.

In the definition below, L_F is thus meant to be linear if it is a linear map on S_fin. In that picture, we may think of the set

T_F D := {U ∈ S_fin : U = lim_{t→0} |t|^{−1}(F_t − F), F_t, F ∈ D, F_t → F} ⊂ T_F S_fin

as the tangent cone along the submanifold D. Here, the limit is again understood in the sense of convergence in probability.

Definition 24 A functional Ψ on the space of distribution functions is called Hadamard differentiable at F if there is some linear functional L_F with

lim_{t→0} \frac{1}{|t|} (Ψ(F_t) − Ψ(F)) = L_F(U)    (A.3)

for all sequences F_t for which the limit

|t|^{−1}(F_t − F) → U ∈ T_F D

exists.


Remark. A proper description of the limit in (A.3) requires the fixation of a suitable metric on the space of these functionals. Choosing this metric appropriately for a given problem is an important technical point.

Without a proof, we will use that Ω is Hadamard differentiable with respect to both arguments F and G. That implies

Ω(F_n, G_m) − Ω(F, G) = \int F_n \, dG_m − \int F \, dG
= \int (F_n − F) \, dG + \int F \, d(G_m − G) + \int (F_n − F) \, d(G_m − G)
= \int (F_n − F) \, dG − \int (G_m − G) \, dF + \int (F_n − F) \, d(G_m − G),

where we used the partial integration

\int_R F \, d(G_m − G) + \int_R (G_m − G) \, dF = (G_m − G) F \big|_{−∞}^{∞} = 0

for the last step. That implies, with t = 1/\sqrt{m}, s = 1/\sqrt{n}, that

\sqrt{\frac{nm}{N}} \left( Ω(F_n, G_m) − Ω(F, G) \right)
= \sqrt{\frac{m}{N}} \int \sqrt{n}(F_n − F) \, dG − \sqrt{\frac{n}{N}} \int \sqrt{m}(G_m − G) \, dF + \sqrt{\frac{m}{N}} \int \sqrt{n}(F_n − F) \, d(G_m − G)

and thus, by the continuous mapping principle, as N → ∞ and m/n → λ/(1 − λ), this converges to

\sqrt{λ} \int b^{(1)}_{F(x)} \, dG(x) − \sqrt{1 − λ} \int b^{(2)}_{G(x)} \, dF(x)    (A.4)

where b^{(1)} and b^{(2)} are two independent Brownian bridges (the third term vanishes in the limit). Hence (A.4) is the difference of two independent normal variables, which implies that the Wilcoxon statistic is asymptotically normal.


Furthermore, under the hypothesis F = G, we have for the limit variable

Z = \int_R \left( \sqrt{λ} b^{(1)}_{F(x)} − \sqrt{1 − λ} b^{(2)}_{F(x)} \right) dF(x) = \int_0^1 \left( \sqrt{λ} b^{(1)}_s − \sqrt{1 − λ} b^{(2)}_s \right) ds

that it is centered Gaussian with variance

Var Z = E Z^2
= E \int_0^1 ds \int_0^1 dt \left( \sqrt{λ} b^{(1)}_s − \sqrt{1−λ} b^{(2)}_s \right)\left( \sqrt{λ} b^{(1)}_t − \sqrt{1−λ} b^{(2)}_t \right)
= E \int_0^1 ds \int_0^1 dt \left( λ b^{(1)}_s b^{(1)}_t − 2\sqrt{λ(1−λ)} b^{(1)}_t b^{(2)}_s + (1−λ) b^{(2)}_s b^{(2)}_t \right)
= \int_0^1 ds \int_0^1 dt \left( min(s, t) − st \right)
= 2 \int_0^1 ds \, s \int_s^1 dt − \int_0^1 ds \, s \int_0^1 dt \, t
= \frac{1}{3} − \frac{1}{4} = \frac{1}{12},

where the mixed term drops out since b^{(1)} and b^{(2)} are independent and centered.

Altogether, we showed for N → ∞, m/N → λ > 0 that

\sqrt{\frac{nm}{N}} \left( Ω(F_n, G_m) − Ω(F, G) \right) = \frac{1}{\sqrt{Nnm}} (W − E W)

converges to a normal variable Z with mean zero and variance 1/12.

This provides one example of how the general idea of the functional delta method works: Hadamard differentiability of a functional Ψ together with Donsker's theorem automatically implies an asymptotic result by letting F_t = F_n, t = 1/\sqrt{n} and

\sqrt{n}(Ψ(F_n) − Ψ(F)) = \frac{Ψ(F_t) − Ψ(F)}{t} → L_F Ψ\left( \lim_n \sqrt{n}(F_n − F) \right) = L_F Ψ(b_F),

provided the differential L_F Ψ exists.


Appendix B

Some Exercises

Exercise 1. Prove that Θ is a location parameter if and only if the distribution of X_Θ − Θ does not depend on Θ ∈ U. What is the analogous result for scale parameters?

Exercise 2. (Bain/Engelhardt, p. 495) The following 20 observations are obtained from a random number generator:

0.48, 0.10, 0.29, 0.31, 0.86, 0.91, 0.81, 0.92, 0.27, 0.21,

0.31, 0.39, 0.39, 0.47, 0.84, 0.81, 0.97, 0.51, 0.59, 0.70

1. Test H0 : med = 0.5 against H1 : med > 0.5 at level α = 0.1.

2. Test H0 : med = 0.25 against H1 : med > 0.25 at level α = 0.1.

For the first test, is it necessary to actually compute the rejection region?

Exercise 3. (i) Can you draw the cumulative distribution function of a random variable X with P(X = med(X)) > 0? (ii) Can you modify the sign test to include also such distributions?

Exercise 4. Prove that for symmetric distributions, the mean coincides with the median.

Exercise 5. What happens to Pitman's asymptotic efficiency if in the parametric location problem for the normal distribution N(µ, σ²) the variance σ² is unknown?


Exercise 6. Prove Lemma 4.1 on the significance of order statistics.

Exercise 7. Let X be a random variable with cumulative distribution function F and (X_1, ..., X_n) be an associated random sample. Ordering this random sample by magnitude X_{(1)} ≤ ... ≤ X_{(k)} ≤ ... ≤ X_{(n)} yields the order statistics X_{(k)}, k = 1, ..., n. In particular, X_{(1)} = min{X_1, ..., X_n} and X_{(n)} = max{X_1, ..., X_n}.

1. Compute the distribution function of X(k).

2. Compute the expectation value of X_{(k)}, if X ∼ Unif(0, 1) is uniformly distributed on the interval [0, 1].

Exercise 8. (i) Let G = R operate on R^n by u(x_1, ..., x_n) := (x_1 + u, ..., x_n + u). Construct a maximal invariant map for this operation. (ii) Let G = R_+ := {u ∈ R : u > 0} operate on R^n by u(x_1, ..., x_n) := (u x_1, ..., u x_n). Construct a maximal invariant map for this operation.

Exercise 9. A composition of an integer n ≥ 1 is one way to write n as a sum of positive integers where the order of the summation is taken into account. The sixteen compositions of 5 are for instance given by 5, 4+1, 1+4, 2+3, 3+2, 1+1+3, 1+3+1, 3+1+1, 2+2+1, 2+1+2, 1+2+2, 1+1+1+2, 1+1+2+1, 1+2+1+1, 2+1+1+1, 1+1+1+1+1.

1. Prove that there are actually 2^{n−1} compositions of n ≥ 1.

2. Prove that the number of compositions of n into k parts is given by \binom{n−1}{k−1}.

Exercise 10. Prove that the set

M := {f : R → R : f strictly monotone, continuous and onto}

forms a group if we take as group multiplication the composition of maps (f ∘ g)(x) := f(g(x)).

Exercise 11. Prove Lemma 5 on the distribution of ordered rank statistics. Hint: Use the representation [xyxx...xyy] of the information that is contained in the ordered rank statistic mentioned in the lecture.


Exercise 12. Let X be a two-dimensional random vector distributed according to one of the distributions with density

f_σ(x, y) = \frac{1}{2πσ^2} \exp\left( −\frac{1}{2σ^2}(x^2 + y^2) \right)

and σ > 0. Given a sample X = (X_1, ..., X_n), prove that

S(X) := (‖X_1‖, ..., ‖X_n‖)

is a sufficient statistic (‖·‖ denotes the Euclidean norm).

Exercise 13. Consider paired observations (X_i, Y_i)_{i=1,...,n} where the X_i are observations of the continuous random variable X and the Y_i are observations of the continuous random variable Y.

1. If you compare this situation to the two-sample situation in the patient example, in which kind of situations would you assume that the observations are paired?

2. (adapted from Engelhardt, p. 469) Twelve pairs of twin male lambs were selected; diet plan I was given to one twin and diet plan II was given to the other twin in each case. The weights at eight months are given by the table below. Use a sign test for the differences X_i − Y_i to test the conjecture that diet I is more effective than diet II at a significance level of α = 0.05.

3. Which piece of (probably useful) information about the sample do you not take into account when you use a sign test?

I (X_i):  111  102   90  110  108  125   99  121  133  115   90  101
II (Y_i):  97   90   96   95  110  107   85  104  119   98   97  104

Exercise 14. Let U_1, ..., U_n be independent U(0, 1)-distributed. Denote by U_{(k)} the value of the kth order statistic. Let k_n be a sequence of numbers such that k_n/n → x ∈ [0, 1] as n tends to infinity. Prove that

\lim_{n→∞} E f(U_{(k_n)}) = f(x)


for all bounded continuous functions f .

Exercise 15. Consider a one-sample location problem where we assume that the underlying distribution is continuous and symmetric. Based on a random sample X_1, ..., X_n we want to test the hypothesis H0 : med(X) = µ_0 versus the alternative H1 : med(X) > µ_0. We decide for the Wilcoxon signed rank test, i.e. the test statistic is constructed as follows: We consider the rank statistic of (Z_1 := |X_1 − µ_0|, ..., Z_n := |X_n − µ_0|) and sum up the ranks of those numbers |X_k − µ_0| for which X_k − µ_0 > 0.

1. Do you see intuitively why we have to assume that the distributions are symmetric?

2. Is it reasonable to expect that this assumption is fulfilled for the data in Exercise 13?

3. Denote the test statistic by R^+. Show that you can write

R^+(X) = \sum_{i=1}^{n} V_i(X_1, ..., X_n) r_i(Z)

where X = (X_1, ..., X_n) denotes the random sample, V_i : R^n → R is a suitably chosen function and r_i(Z) denotes the rank of Z_i in Z. (Hint: It just looks complicated.)

4. Is the signed rank test invariant under monotone transformations?

Exercise 16. Perform the Wilcoxon signed rank test for the data from Exercise 13. In the case of paired samples (X, Y), you use Z_i := |X_i − Y_i| instead of the Z_i for the one-sample case defined above (why?). The hypothesis is rejected if R^+ exceeds a given (tabulated) critical value. You can find another description of the signed rank test in the English Wikipedia at

(http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test).

A table of the critical values is provided by the first of the external links at the bottom of the page. What does the assumption of symmetry of the distribution (cf. Exercise 15) mean for paired observations? Do you think this assumption is justified? Compare the result with those obtained by using a sign test and a t-test for the same problem.

Exercise 17. Use the recursion relation (1.19) from Lemma 10, p. 41 to write a program (in R, C, whatever) with which you can generate a list of the tail probabilities P(W_{n,m} ≥ k) for given n, m.

Exercise 18. Use Proposition 9.1 to calculate expectation and variance of the Mood test. Is the distribution of the test statistic symmetric around its mean?

Exercise 19. Prove that the distribution of the Wilcoxon statistic W_N equals the distribution of the Siegel-Tukey statistic S_N.

Exercise 20. Prove that two identically distributed Bernoulli variables are independent if and only if their covariance is zero.

Exercise 21. Let X_1, ..., X_n be independent, identically distributed standard normal variables with sample mean

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.

Prove that (n − 1)s^2_n = \sum_{i=1}^{n} (X_i − \bar{X})^2 is χ²-distributed with n − 1 degrees of freedom (at least for n = 3).

Exercise 22. Prove that the law of a centered Gaussian process is uniquely determined by its covariance structure.

Exercise 23. Give an alternative proof that the statistics D_n, D^+_n and D^−_n from the Kolmogorov-Smirnov one-sample test are distribution free. Assume that the cumulative distribution function F is continuous and strictly monotone. Proceed as follows:

1. Rewrite for instance D^+_n by using the order statistic (X_{(1)}, ..., X_{(n)}) of the sample to

D^+_n = max\left\{ \max_{1≤i≤n} \left[ \frac{i}{n} − F(X_{(i)}) \right], 0 \right\}.    (∗)


2. Give an argument why the distribution of D^+_n does not depend on F any more.

3. Give a short argument how this implies the same for D^−_n and D_n.

Exercise 24. Show that the alternative representation (∗) of the Kolmogorov-Smirnov statistic in Exercise 23 is a representation as a maximum of finitely many rank statistics.

Exercise 25. A Brownian bridge is a process (X_t)_{t∈[0,1]} such that for 0 < t_1 < t_2 < ... < t_n < 1 and Borel subsets A_1, ..., A_n ⊂ R we have

P(X_{t_1} ∈ A_1, ..., X_{t_n} ∈ A_n) = \sqrt{2π} \int_{A_1} dx_1 \int_{A_2} dx_2 ... \int_{A_n} dx_n \prod_{k=1}^{n+1} \frac{1}{\sqrt{2π(t_k − t_{k−1})}} \exp\left( −\frac{(x_k − x_{k−1})^2}{2(t_k − t_{k−1})} \right)

where t_0 = 0, x_0 = 0, t_{n+1} = 1, x_{n+1} = 0. Show that

S_n := \sum_{k=1}^{n+1} \frac{(X_{t_k} − X_{t_{k−1}})^2}{t_k − t_{k−1}}

is χ²_n-distributed, at least for n = 2, 3.

Exercise 26. Let X and Y be independent continuous random variables with cumulative distribution functions F and G, respectively. Let F_n and G_m be the empirical distribution functions given two independent samples (x_1, ..., x_n), (y_1, ..., y_m) of size n, m, respectively. Show that

rank(y_k) = n F_n(y_k) + m G_m(y_k),

where rank(y_k) denotes the rank of y_k in the joint sample.

Exercise 27. Let P be a symmetric and idempotent n × n-matrix, i.e. P^+ = P and P^2 = P.

1. Prove that P can have no eigenvalues other than 0 or 1.

2. Show that the rank of P (the number of linearly independent columns) is equal to the trace of P (i.e. the sum of diagonal elements). (Hint: You may use cyclicity of the trace, i.e. trace(ABC) = trace(BCA) = trace(CAB).)


Exercise 28. Let x = (x1, ..., x4) ∈ R4.

1. Determine the orbits of R^4 under the action of the permutation group Π_4 of four elements given by

π(x_1, ..., x_4) = (x_{π(1)}, ..., x_{π(4)}), π ∈ Π_4.

2. Determine all possible rank statistics r(x), x ∈ R^4, modulo permutations of the x_i.

3. When is r(x) a permutation of (1, 2, 3, 4)?

Exercise 29. Let (X_1, ..., X_n) be a random sample. Assume that the cumulative distribution function F is continuous. Prove that

P(∃ π ∈ Π_n : r(X_1, ..., X_n) = π(1, ..., n)) = 1.

Exercise 30. The following tables were obtained from two different machines producing steel bolts. The data represent the deviation (in mm) from the distinguished length of the bolts.

Machine I: 0.15, −1.99, −1.08, −1.98, 2.87, 5.19, −0.37, −0.53, −1.09, 0.56, 1.15, −0.02, −1.32, 0.06, −0.21, −0.25, −1.35, −1.68, −1.41, −0.82

Machine II: 1.18, 1.26, 3.65, −0.81, 2.64, 0.31, 2.92, −3.60, 1.81, 1.38, 2.76, −3.25, −1.085, 1.19, 1.92, 1.53, 1.56, 3.09

1. Draw a qq-plot to justify your suspicion that the deviation is not normally distributed. (Use R for convenience.)

2. Perform a Wilcoxon rank-sum test at significance level α = 0.05 to check whether the means of the deviations of the two machines fulfil µ_II > µ_I.

Exercise 31. You suspect that the deviations above are distributed according to a two-sided exponential distribution with density

ρ(x) = \frac{1}{2} e^{−|x|}


(but with different locations). Construct a rank statistic which in that case performs best in the sense of Pitman's asymptotic efficiency.

Exercise 32. (Removal of ties) A lazy observer only tabulated the data from Exercise 30 rounded off to one digit, getting

Machine I: 0.2, −2.0, −1.1, −2.0, 2.9, 5.2, −0.4, −0.5, −1.1, 0.6, 1.2, −0.0, −1.3, 0.1, −0.2, −0.3, −1.4, −1.7, −1.4, −0.8

Machine II: 1.2, 1.3, 3.7, −0.8, 2.6, 0.3, 2.9, −3.6, 1.8, 1.4, 2.8, −3.3, −1.1, 1.2, 1.9, 1.5, 1.6, 3.1

Some of the numbers are now equal. To perform a Wilcoxon rank test again, you proceed as follows: To every number X_i in the pooled sample you simulate a Uniform(0,1) random variable U_i such that the U_i are independent. Then you assign modified ranks r^∗ to the X_i according to: r^∗(X_i) < r^∗(X_j) if and only if either X_i < X_j, or X_i = X_j and U_i < U_j.

1. Construct a critical region for the Wilcoxon rank-sum test based on these modified ranks r^∗. What is the distribution of the joint ordered rank statistic based on the modified ranks r^∗?

2. Discuss possible weak points of this method.

Exercise 33. Let X be a random variable with strictly monotone cumulative distribution function F. Let Y = f(X) with f ∈ C^∞_0(R) smooth with compact support.

1. Prove the strong law of large numbers for Y by using Glivenko-Cantelli.

2. Prove the central limit theorem for Y by using Donsker’s Theorem.

Exercise 34. Compare the example on the asymptotics of the Wilcoxon statistic T_W following Theorem 2 with the asymptotic result for the Mann-Whitney statistic Ω at the end of Appendix A.

Exercise 35. Let x_1, ..., x_n be a sample from a random variable X and F_n the associated empirical distribution function. Show that if the real function f is continuous at the points x_1, ..., x_n, we have

\int_R f(t) \, dF_n(t) = \frac{1}{n} \sum_{i=1}^{n} f(x_i).


Bibliography

[1] Chandra, T. K. (1999). A First Course in Asymptotic Theory of Statistics. Narosa Publ. House, New Delhi.

[2] Cramér, H. (1999). Mathematical Methods of Statistics. 19th printing, Princeton University Press, Princeton, NJ.

[3] Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, New York.

[4] Hájek, J. (1969). Nonparametric Statistics. Holden-Day, San Francisco.

[5] Krengel, U. Mathematische Statistik, Vorlesungsausarbeitung. Göttingen, WS 73/74.

[6] van der Vaart, A., Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

[7] Donsker, M. D. (1952). Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics, 23, pp. 277–281.

[8] Good, I. J. (1973). What are Degrees of Freedom? The American Statistician, Vol. 27, No. 5.

[9] Römisch, W. (2006). Delta Method, Infinite Dimensional. In: Encyclopedia of Statistical Sciences, J. Wiley & Sons.


Index

Brownian bridge, 49

chi-square distribution, 55

Donsker's theorem, 48

empirical distribution function, 45

functional delta method, 61

Gaussian process, 48
    centered, 48
Glivenko-Cantelli theorem, 46
group action, 12
    effective, 12

Hadamard differentiable, 63

interlacing pattern
    first representation, 22
    second representation, 33

joint ordered rank statistics, 20

Kolmogorov distance, 46
Kronecker symbol, 47

linear rank statistic, 34
location/scale family, 38

maximal invariant map, 12
median, 7
monotone maps
    group, 14

orbit, 12
orbit space, 12
order statistics, 13

parameter
    location, 32
    scale, 33
Pitman's asymptotic efficiency, 11

rank statistics, 15

Siegel-Tukey test, 40
stochastic domination, 18
sufficiency, 20

tangent cone, 63
test
    chi-square goodness of fit, 57
    chi-square independence, 60
    Fisher-Yates, 30
    Freund-Ansari-Bradley-David-Barton, 39
    invariant, 17
    Kolmogorov-Smirnov, 52
    Kolmogorov-Smirnov, two sample case, 53
    Kuiper, 53
    Mood, 39
    Terry-Hoeffding, 37
    van der Waerden X, 32
    Wilcoxon, 28
test on domination, 18

Wilcoxon statistic
    distribution, 40
    Mann-Whitney form, 62