
Summer Workshop on

Distribution Theory & its Summability Perspective

[Cover figure: a histogram of the drink mix fill weights; x-axis: Amount of Drink Mix (in ounces), y-axis: Frequencies, with the center of mass marked near 18.9.]

M. Kazım Khan, Kent State University (USA)

Place: Ankara University, Department of Mathematics

Dates: 16 May - 27 May 2011

Supported by: The Scientific and Technical Research Council of Turkey (TÜBİTAK)


Preface

This is a collection of lecture notes I gave at Ankara University, Department of Mathematics, during the last two weeks of May, 2011. I am grateful to Professor Cihan Orhan and the Scientific and Technical Research Council of Turkey (TÜBİTAK) for the invitation and support.

The primary focus of the lectures was to introduce the basic components of distribution theory and bring out how summability theory plays out its role in it. I did not assume any prior knowledge of probability theory on the part of the participants. Therefore, the first few lectures were completely devoted to building the language of probability and distribution theory. These are then used freely in the rest of the lectures. To save some time, I did not prove most of these results.

Then a few lectures deal with Fourier inversion theory, specifically from the summability perspective. The next batch consists of convergence concepts, where I introduce the weak and the strong laws of large numbers. Again, long proofs were omitted. A notable exception deals with the results that involve the uniformly integrable sequence spaces. Since this is a new concept from the summability perspective, I have tried to sketch some of the proofs.

I must acknowledge the legendary Turkish hospitality of all the people I came to meet. As always, it was a pleasure visiting Turkey and I hope to have the chance to visit again.

Mohammad Kazım Khan, Kent State University, Kent, Ohio, USA.


List of Participants

1- AYDIN, Didem Ankara Universitesi

2- AYGAR, Yelda Ankara Universitesi

3- AYKOL, Canay Ankara Universitesi

4- BASCANBAZ TUNCA, Gulen Ankara Universitesi

5- CAN, Cagla Ankara Universitesi

6- CEBESOY, Serifenur Ankara Universitesi

7- COSKUN, Cafer Ankara Universitesi

8- CETIN, Nursel Ankara Universitesi

9- DONE, Yesim Ankara Universitesi

10- ERDAL, Ibrahim Ankara Universitesi

11- GUREL, Ovgu Ankara Universitesi

12- IPEK, Pembe, Ankara Universitesi

13- KATAR, Deniz Ankara Universitesi

14- ORHAN, Cihan Ankara Universitesi

15- SAKAOGLU, Ilknur Ankara Universitesi

16- SOYLU, Elis Ankara Universitesi

17- SAHIN, Nilay Ankara Universitesi

18- TAS, Emre Ankara Universitesi

19- UNVER, Mehmet Ankara Universitesi

20- YARDIMCI, Seyhmus Ankara Universitesi

21- YILMAZ, Basar Ankara Universitesi

22- YURDAKADIM, Tugba Ankara Universitesi


Contents

Preface
List of Participants
List of Figures
1 Modeling Distributions
   1.1 Distributions
   1.2 Probability Space & Random Variables
2 Probability Spaces & Random Variables
3 Expectations
   3.1 Properties of Lebesgue integral
   3.2 Covariance
4 Various Inequalities
   4.1 Hölder & Minkowski's Inequalities
   4.2 Jensen's Inequality
5 Classification of Distributions
   5.1 Absolute Continuity & Singularity
6 Conditional Distributions
   6.1 Conditional Expectations
7 Conditional Expectations & Martingales
   7.1 Properties of E(X|Y)
   7.2 Martingales
8 Independence & Transformations
   8.1 Transformations of Random Variables
   8.2 Sequences of Independent Random Variables
   8.3 Generating Functions
9 Ranks, Order Statistics & Records
10 Fourier Transforms
   10.1 Examples
11 Summability Assisted Inversion
12 General Inversion
   12.1 Fourier & Dirichlet Series
13 Basic Limit Theorems
   13.1 Convergence in Distribution
   13.2 Convergence in Probability & WLLN
14 Almost Sure Convergence & SLLN
15 The Lp Spaces & Uniform Integrability
   15.1 Uniform Integrability
16 Laws of Large Numbers
   16.1 Subsequences & Kolmogorov Inequality
17 WLLN, SLLN & Uniform SLLN
   17.1 Glivenko-Cantelli Theorem
18 Random Series
   18.1 Zero-One Laws & Random Series
   18.2 Refinements of SLLN
19 Kolmogorov's Three Series Theorem
20 The Law of Iterated Logarithms


List of Figures

1.1 A Histogram for the Drink Mix Distribution
1.2 Inverse Image of an Interval
8.1 Inverse of a Distribution Function
11.1 Triangular Density
12.1 Dirichlet Kernels for n = 5 and n = 8
12.2 Fejér Kernels for T = 5 and T = 8
12.3 Poisson Kernels for r = 0.8 and r = 0.9
14.1 Density of Random Harmonic Series


Lecture 1

Modeling Distributions

A phenomenon, when repeatedly observed, gives rise to a distribution. In other words, a distribution is our way of capturing the variability in the phenomenon. Such distributions arise in almost all fields of endeavor. In the social sciences they are used to keep tabs on social indicators; in finance they are used to study and quantify the financial health of corporations and to price various assets and derived securities such as options and bonds. Data distributions appear in statistics. In mathematics, distributions of zeros of orthogonal polynomials appear, and the distribution of primes is a fundamental entity. In the natural sciences, about one and a half centuries ago Maxwell conjured up a distribution to describe the speed of molecules in an ideal gas, which was later observed to be quite accurate. Genetic diversity and its quantification is still in its infancy in terms of discovering the underlying distributions that it hides.

In this chapter we will collect the tools that are quite effective in studying distributions. We will present the following basic notions.

• Some examples of distributions,

• A framework by which distributions can be modeled,

• Transforms of distributions, such as moment generating functions and characteristic functions,

• Conditional probabilities and conditional expectations.

These results will be used in the remainder of the book.

1.1 Distributions

Any characteristic, when repeatedly measured, yields a collection of measured/collected responses. The word “variable” is used for the characteristic that is being measured, since it may vary from measurement to measurement. The collection of all the measured responses is called the “distribution” of the variable. Sometimes, the word data is also used to refer to the distribution of the variable. Distributions may be real or just imagined entities. Here we collect a few examples of distributions of the following sorts to show their vast diversity.


• (i) Distributions arising while measuring mass produced products.

• (ii) Distributions arising in categorical populations.

• (iii) Distribution of Stirling numbers.

• (iv) Distribution of zeros of orthogonal polynomials.

• (v) Distributional convergence of summability theory.

• (vi) Distributions of eigenvalues of Toeplitz matrices.

• (vii) Maxwell’s law of ideal gas.

• (viii) Distribution of primes.

• (ix) The Feynman-Kac formula and partial differential equations.

Of course, this is just a tiny sample of topics from an enormous field. One obvious omission is the field of Schwartz distributions. This is purely because there are excellent books on the subject.¹ We will, however, briefly visit this branch while discussing summability assisted Fourier inversion theory.

Example - 1.1.1 - (Measurement distributions — accuracy of automatic filling machines) Kountry Times makes 20 ounce cans of lemonade drink mix. Due to unknown random fluctuations, the actual fill weight of each can is rarely equal to 20 oz. Here is a collection of fill weights of 200 randomly chosen cans.

18.3 19.4 18.8 19.6 19.8 17.7 18.2 20.1 17.2 18.8 19.0

18.6 18.0 18.9 19.1 17.2 17.3 19.4 18.6 20.5 20.8 19.9

18.7 16.7 19.2 18.8 18.3 18.3 18.3 17.9 18.2 17.5 17.6

19.7 20.5 19.5 18.6 19.9 19.3 18.5 19.9 18.7 20.3 19.2

18.9 18.6 19.4 18.7 18.5 19.2 17.3 18.0 17.7 19.2 19.1

18.8 18.3 21.0 18.0 18.9 19.9 21.4 18.8 19.0 18.9 18.7

18.9 19.2 17.6 20.0 19.5 19.4 18.3 19.9 18.4 18.3 18.6

19.4 17.7 18.8 17.8 19.2 18.6 20.2 19.0 18.3 18.3 19.0

18.4 19.4 19.4 17.9 19.2 18.5 17.7 19.3 19.0 16.7 18.3

19.7 18.8 19.4 20.3 18.3 18.6 19.4 18.4 18.6 19.1 18.0

18.8 18.3 18.7 19.1 17.8 17.5 17.0 19.4 19.2 19.8 18.6

17.7 17.9 19.1 18.2 19.5 19.6 20.4 20.7 19.8 18.9 19.2

17.8 21.0 17.5 17.9 18.5 21.1 19.8 18.3 20.2 17.4 18.8

18.5 19.7 19.0 18.3 19.3 18.8 18.1 17.8 19.1 20.1 19.9

21.0 17.9 18.3 17.1 18.7 18.5 19.1 17.6 20.4 19.2 19.2

20.2 17.4 18.4 18.9 18.4 18.8 18.3 19.8 18.7 19.1 20.4

18.7 18.9 18.0 20.7 20.8 19.9 20.6 19.2 18.4 18.5 18.5

18.4 19.9 17.9 19.4 19.2 20.4 19.7 17.5 19.0 17.9 18.4

19.7 19.1

In this example, the feature being measured is the fill weight (measured in ounces). We see an unexpectedly large amount of variability. The issue is:

“Does the distribution say anything about whether the advertised average fill weight of 20 oz is being met or not?”

¹For instance, see ...


The average (or mean) and the variance of this collected distribution, now denoted as x_1, x_2, · · · , x_n, are

\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i = 18.940,

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1} = 0.8574.

The mean gives a feeling that the automatic filling machine may be malfunctioning. Since the units of variance are squared, we work with its positive square root, S, called the standard deviation. For our fill weights distribution S = 0.926 oz. The standard deviation typically gives a scale by which we can gauge the “width” of the distribution. Typically, plus/minus three times the standard deviation around the mean contains most of the values of the distribution. Note that the smallest value of our data distribution is 16.7 oz, and the largest value is 21.4 oz. In this case all the values of the distribution lie within 3S of the mean.

Of course, knowing this distribution of 200 observations is only partially interesting. The real aim is to conclude something about the source of these 200 observations, called the population distribution, which is a mythical entity and represents how the automatic filling machine behaves. To get a feel for, and then model, the shape of the source distribution we resort to figures. We make some groups, also called bins, say J_1 = (16.5, 17.0], J_2 = (17.0, 17.5], etc., and count the number of observations that fall into these bins. Dividing the frequencies by the total number of observations gives the relative frequency distribution, which does not change the shape. A plot of this frequency distribution is called a histogram of the distribution.

Fill Weights     Frequency   Relative Frequency
16.5 - 17.0          2           0.010
17.0 - 17.5          8           0.040
17.5 - 18.0         23           0.115
18.0 - 18.5         33           0.165
18.5 - 19.0         44           0.220
19.0 - 19.5         43           0.215
19.5 - 20.0         23           0.115
20.0 - 20.5         12           0.060
20.5 - 21.0          7           0.035
21.0 - 21.5          5           0.025

Total              200           1.000

Figure 1.1 shows the resulting histogram for the distribution, where the bins are on the x-axis and the frequencies are on the y-axis, so that the areas of the rectangles are proportional to the frequencies. The general shape is captured by the superimposed bell shaped curve. This form is a little more revealing than the original long list of 200 data values. We clearly see that a majority of the cans had less than the advertised amount of 20 oz of drink mix in them. Also, only about 0.06 + 0.035 + 0.025 = 0.12, i.e., 12 percent, of the observations were above 20 oz.
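To make the binning procedure concrete, here is a minimal Python sketch (not part of the original notes) that reproduces a frequency table of this kind; the list `weights`, assumed to hold the 200 fill weights listed above, and the bin edges are assumptions of the sketch.

# A minimal sketch of the binning used for the frequency table above.
# `weights` is assumed to hold the 200 fill weights listed in the example.
from collections import Counter

def frequency_table(weights, lo=16.5, hi=21.5, width=0.5):
    """Count observations falling in the half-open bins (lo, lo+width], ..."""
    edges = [lo + i * width for i in range(int(round((hi - lo) / width)) + 1)]
    counts = Counter()
    for w in weights:
        for a, b in zip(edges, edges[1:]):
            if a < w <= b:
                counts[(a, b)] += 1
                break
    n = len(weights)
    for a, b in zip(edges, edges[1:]):
        f = counts[(a, b)]
        print(f"{a:4.1f} - {b:4.1f}   {f:3d}   {f / n:.3f}")

# Example call: frequency_table([18.3, 19.4, 18.8, 19.6, 19.8, 17.7])  # first few values only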

[Figure 1.1: A Histogram for the Drink Mix Distribution. x-axis: Amount of Drink Mix (in ounces); y-axis: Frequencies; the center of mass is marked near 18.9.]

It is an amazing fact of life that for most data sets which measure heights, weights or lengths of components produced by factories on a mass scale, one tends to observe such bell shaped histograms. The superimposed curve is called a normal curve and is proportional to

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}, \quad -\infty < x < \infty.

Most measurement type distributions are mathematically modeled by a normal curve. Symbolically we denote this by X ∼ N(µ, σ²), where X represents the fill weight of a randomly chosen can. The letter N stands for “normal distribution”, µ is the center or mean, and σ is the standard deviation. In words, the modeled density describes where a randomly selected can's fill weight will fall. The histogram reflects the empirical² evidence for our model.

The quantity P(a < X ≤ b) represents the model-predicted percentage of cans whose fill weights lie between a and b oz. Thanks to modern calculators,

P(18 < X \le 21) = \frac{1}{\sigma\sqrt{2\pi}} \int_{18}^{21} e^{-(x-\mu)^2/(2\sigma^2)}\, dx = \frac{1}{\sqrt{2\pi}} \int_{(18-18.9)/0.926}^{(21-18.9)/0.926} e^{-u^2/2}\, du \approx 0.8224.
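The predicted value can be checked with a few lines of Python (an illustration added here, not part of the notes), using the standard normal cdf written in terms of the error function; the values mu = 18.9 and sigma = 0.926 are taken from the display above.

# Check of P(18 < X <= 21) for X ~ N(mu, sigma^2), via Phi(z) = (1 + erf(z/sqrt(2)))/2.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 18.9, 0.926
p = normal_cdf((21 - mu) / sigma) - normal_cdf((18 - mu) / sigma)
print(round(p, 4))  # roughly 0.82, close to the observed 81%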

When we consult the observed distribution and actually count we find that

\frac{33 + 44 + 43 + 23 + 12 + 7}{200} = \frac{162}{200} = 0.81 = 81\%

cans had their fill weights between 18 and 21 ounces. The agreement is remarkably good, indicating that our normal curve model is quite useful. We don't have to carry the n = 200 observations in our pocket anymore. One simple mathematical curve captures it. Even more importantly, the curve describes the fill weights of all those cans that the automatic filling machine produced that we never checked. Due to the popularity of normal curves as models, they have been given special emphasis in the statistical literature.³

²“Empirical” stands for “derived from experience”.

³Normal distributions are also called Gaussian distributions since the German mathematician/astronomer Carl Friedrich Gauss (1777-1855) showed their importance as models of measurement errors in celestial objects.

Example - 1.1.2 - (Modeling categorical populations — US voters) A week before the 2000 US presidential elections, the voter preferences for the two presidential candidates were as follows.

\underbrace{\text{Bush}, \text{Bush}, \cdots, \text{Bush}}_{80\ \text{million}}, \quad \underbrace{\text{Gore}, \text{Gore}, \cdots, \text{Gore}}_{85\ \text{million}}.

This is a very large categorical distribution. However, a very simple way to represent this distribution, without losing any information, is to write it in its frequency format, namely write B (for Bush) once and put its frequency next to it, and write G (for Gore) once and put its frequency next to it. We may code the categories (letters) B, G by numbers, if we like. For instance, denoting B by 0 and G by 1, we may write the distribution of our coded variable, say X, as

Values of X        0            1
Frequencies    80,000,000   85,000,000

The resulting relative frequency distribution is specified by the proportion

p = \frac{85{,}000{,}000}{80{,}000{,}000 + 85{,}000{,}000} = \frac{85}{165} = 0.51515,

where X denotes the preference of an individual, represented in the coded form of 0 or 1. Note that the population mean is p and the population variance is p(1 − p).

Example - 1.1.3 - (Distribution of zeros of orthogonal polynomials) So far we talked about data distributions and their sources. As mentioned in the beginning, distributions appear everywhere.

Consider a sequence of polynomials p_n(x), n = 0, 1, 2, · · · , where p_0(x) ≡ 1, and p_n(x) is of degree n. Suppose there exist real constants a_n, b_n such that a_n > 0 for all n ≥ 0,

a_{n+1}\, p_{n+1}(x) + (b_n - x)\, p_n(x) + a_n\, p_{n-1}(x) \equiv 0, \quad n \ge 1.

There is a result of Favard which says that each p_n(x) has exactly n distinct real zeros, which we denote as

x_{1n} < x_{2n} < \cdots < x_{nn}.

The issue is: what are they? Having these zeros gives us an extremely fast numerical integration method (called Gaussian quadrature), among other benefits. If it is too painful to write all of the zeros down for all n, then the next best thing to ask is: what is their distribution, at least approximately? The answer is that the curve

f(x) = \begin{cases} \dfrac{1}{\pi\sqrt{(b + 2a - x)(x - b + 2a)}}, & \text{for } x \in (b - 2a,\, b + 2a), \\[6pt] 0, & \text{otherwise}, \end{cases}

has a lot to do with the approximate distribution of the zeros. From the summability perspective, we can say more. Suppose a_n and b_n are two sequences of real numbers such that a_n ≥ 0 and, for some constants a, b, and every ε > 0, we have

\sum_{k \le n:\ |a_k - a| \ge \varepsilon} \frac{1}{n+1} \to 0, \qquad \sum_{k \le n:\ |b_k - b| \ge \varepsilon} \frac{1}{n+1} \to 0.

In the language of summability theory, a_k is Cesàro-statistically convergent to a and b_k is Cesàro-statistically convergent to b. If a > 0 then the histogram of the distribution of the zeros of p_n(x) is approximately the curve f(x). This is just the tip of the iceberg. Much more can be deduced in much more general settings.
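As a concrete illustration of this statement (added here, not from the notes), the Chebyshev polynomials of the first kind satisfy the above recurrence with a_n = 1/2 and b_n = 0, so the limiting curve is f(x) = 1/(π√(1 − x²)) on (−1, 1); the sketch below compares a histogram of the zeros of T_n with this curve. The closed-form expression for the zeros is a known fact about Chebyshev polynomials, not something derived in the text.

# Zeros of the Chebyshev polynomial T_n are x_k = cos((2k - 1) pi / (2n)).
# Their histogram approaches f(x) = 1 / (pi sqrt(1 - x^2)) on (-1, 1).
import numpy as np

n = 200
zeros = np.cos((2 * np.arange(1, n + 1) - 1) * np.pi / (2 * n))

counts, edges = np.histogram(zeros, bins=20, range=(-1.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f = 1.0 / (np.pi * np.sqrt(1.0 - centers**2))

for c, emp, th in zip(centers, counts, f):
    print(f"x = {c:+.2f}   empirical {emp:.3f}   f(x) {th:.3f}")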

Example - 1.1.4 - (Convergence and summability) A matrix summability method consists of numbers a_{nk}, n, k = 0, 1, 2, · · · , arranged in matrix form, A = [a_{nk}]. Such a matrix is constructed with the aim of converting a nonconvergent sequence x_0, x_1, · · · into a convergent one. In other words, if

y_n := \sum_{k=0}^{\infty} a_{nk} x_k, \quad n = 0, 1, 2, \cdots,

then our hope is that y_n should converge. However, when (x_k) is itself convergent to some number ℓ, then we insist that (y_n) should also be convergent to the same ℓ. A matrix A = [a_{nk}] which has this “convergence reproducing” property is called regular. To handle the kind of examples we will present, we need a somewhat more general concept that allows x = (x_{kn}) to be a matrix as well, and the x_{kn} need not be numbers but could be functions. When x_{kn} = x_k for all n, we revert to classical summability. There are four notions of convergence; a numerical sketch of notion (iii) is given after this list.

• (i) Let y_n = \sum_{k=0}^{\infty} x_{kn}\, a_{nk} be defined for all n, called the A-transform of x. We say that x is A-summable to α if y_n → α. This notion can be extended to the case when the x_{kn} and α lie in a normed linear space.

• (ii) Let the x_{kn} be real, and let F(t) be a distribution, i.e., a nondecreasing right continuous function with F(−∞) = 0 and F(+∞) = 1. We say x is A-distributionally convergent to F if for all t at which F is continuous we have

\lim_{n\to\infty} \sum_{k:\ x_{kn} \le t} a_{nk} = F(t).

This notion can be extended to higher dimensional forms when both x_{kn} and t are d-dimensional vectors.

• (iii) We say x = (x_{kn}) is A-statistically convergent to α if for every ε > 0 we have

\lim_{n\to\infty} \sum_{k:\ |x_{kn} - \alpha| > \varepsilon} a_{nk} = 0.

This notion can be extended to the case when the x_{kn} and α lie in a topological space. Example 1.1.3 uses this notion of convergence for the a_k and b_k sequences, with the matrix A being the Cesàro matrix.

• (iv) We say x = (x_{kn}) is A-strongly convergent to α if

\lim_{n\to\infty} \sum_{k=0}^{\infty} |x_{kn} - \alpha|\, a_{nk} = 0.

This notion can be extended to the case when the x_{kn} and α lie in a metric space.
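A small numerical sketch of notion (iii) follows (an added illustration, not from the notes): with A the Cesàro matrix a_{nk} = 1/(n + 1) for k ≤ n, a sequence that is large only on the sparse set of perfect squares is A-statistically convergent to 5 even though it does not converge; the particular sequence and the tolerance eps are choices made for this example.

# x_k = k when k is a perfect square, and 5 otherwise.  The Cesaro weight of the
# indices where |x_k - 5| > eps tends to 0, so x is Cesaro-statistically convergent to 5.
from math import isqrt

def x(k):
    return k if isqrt(k) ** 2 == k else 5

def deviation_weight(n, alpha=5.0, eps=0.5):
    """Sum of a_{nk} = 1/(n+1) over k <= n with |x_k - alpha| > eps."""
    return sum(1.0 / (n + 1) for k in range(n + 1) if abs(x(k) - alpha) > eps)

for n in (10, 100, 1000, 10000):
    print(n, round(deviation_weight(n), 4))   # tends to 0 as n grows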

Example - 1.1.5 - (Distribution of primes) Let π(n) be the prime counting function. That is, π(n) is the total number of primes that lie in the interval (0, n]. Gauss as a teenager conjectured that

\pi(n) \sim \frac{n}{\ln n}.

The prime number theorem says that

\pi(n) \sim \mathrm{Li}(n) := \int_2^n \frac{1}{\ln x}\, dx \sim n \sum_{j=0}^{\infty} \frac{j!}{(\ln n)^{j+1}}.

This was proved by both Hadamard and de la Vallée Poussin in 1896, by showing that the Riemann zeta function ζ(z) has no zeros of the type 1 + it. Hardy and Wright's⁴ book provides more details.

In 1914 Littlewood⁵ showed that π(n) − Li(n) is positive and negative infinitely often. Since Li(n) \sim \frac{n}{\ln n} + \frac{n}{(\ln n)^2} + \frac{2n}{(\ln n)^3} + \cdots, Chebyshev asked about the behavior of the ratio

X_n := \frac{\pi(n)}{n/\ln n}, \quad n = 1, 2, \cdots.

Chebyshev showed that \frac{7}{8} < \liminf_n X_n \le \limsup_n X_n < \frac{9}{8}. The recent book of Havil⁶ shows that if \lim_{n\to\infty} X_n exists then \lim_n X_n = 1. As evidence of the deep roots of π(x), the Riemann hypothesis is equivalent to the statement

|\pi(n) - \mathrm{Li}(n)| = O(\sqrt{n}\, \ln n).

For more, see Ingham.⁷

⁴Hardy, G. H. and Wright, E. M., An Introduction to the Theory of Numbers, 5th ed. Oxford University Press, 1979.

⁵“Sur la distribution des nombres premiers”, Comptes Rendus Acad. Sci. Paris, vol. 158, pp. 1869-1872, 1914.

⁶Havil, J., Gamma: Exploring Euler's Constant. Princeton University Press, NJ, 2003.

⁷Ingham, A. E., The Distribution of Prime Numbers. Cambridge University Press, London, 1990.
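For readers who want to see these quantities numerically, here is a rough Python sketch (added here, not part of the notes); it uses a simple sieve for π(n) and a crude trapezoidal approximation for Li(n), both of which are choices of this illustration.

# Chebyshev's ratio X_n = pi(n) / (n / ln n) and a crude value of Li(n).
from math import log

def prime_pi(n):
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytes(len(range(p * p, n + 1, p)))
    return sum(sieve)

def li(n, steps=200_000):
    # trapezoidal rule for int_2^n dx / ln x (a rough approximation)
    h = (n - 2) / steps
    ys = [1.0 / log(2 + i * h) for i in range(steps + 1)]
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

for n in (10**3, 10**4, 10**5, 10**6):
    pi_n = prime_pi(n)
    print(n, pi_n, round(pi_n / (n / log(n)), 4), round(li(n), 1))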


Example - 1.1.6 - Of course there are many more examples, such as

• (i) Asymptotic normality of the Stirling numbers.

• (ii) Distributions of eigenvalues of Toeplitz matrices.

• (iii) Distributions of eigenvalues of random matrices — Wigner's law.

• (iv) Maxwell's law of ideal gas — distribution of the speed of molecules in an ideal gas.

• (v) The Feynman-Kac formula. Solutions of many types of PDEs and Brownian motion go hand in hand, the poster child being the heat equation. This link has found an unexpected admirer, namely the financial industry, since it ties very nicely into the pricing of various derived financial securities such as call and put and many other options.

The above examples give a glimpse of the importance of the concept of a distribution. Probability theory provides the ideal language to express the concepts of distribution theory. Therefore we start off by building some basic structures of probability theory.

1.2 Probability Space & Random Variables

Our aim is to construct a mathematical structure to house the concept of a distribution. Distributions always describe some features of some variables. Since variables may have random components in them, distributions are often linked to probability theory through a concept called a random variable. A random variable is a function defined over a probability space. To see what we need, consider the following diagram in which ω belongs to some abstract set, Ω, shown as the horizontal axis for convenience.

[Figure 1.2: Inverse Image of an Interval — a point ω ∈ Ω, the map X(ω), an interval J = (a, b] on the vertical axis, and its inverse image A ⊆ Ω.]

To connect to the histogram of X, if J = (a, b] is any bin, the area of the rectangle over it represents the “size” of the event a < X ≤ b. So, we need to require that the abstract set Ω should be such that for every J = (a, b], its inverse set A = X^{-1}(J) should have some concept of “size”, which we may call its “measure” or its “chance” or its “probability” of occurrence. To avoid logical inconsistencies we set up some ground rules for the collection of all sets for which the concept of size needs to be defined. To be precise, denote this collection by the symbol E.

• (i) We insist that a concept of size be attached to each subset of the type A = X^{-1}(J), when J = (a, b] and a < b. That is, each such A ∈ E.

• (ii) Since combining two or more J_1 = (a_1, b_1], J_2 = (a_2, b_2], · · · , makes practical sense, the concept of size should apply to

X^{-1}\left( \bigcup_i J_i \right) = \bigcup_i X^{-1}(J_i) = \bigcup_i A_i.

In general, ∪_i A_i ∈ E whenever A_i ∈ E. In particular, since R = ∪_i (−i, i], we insist that Ω = ∪_i X^{-1}((−i, i]) ∈ E.

• (iii) Since J^c = (−∞, a] ∪ (b, ∞) is a union of some other J's, we insist that A^c = X^{-1}(J^c) should also be in our collection. More generally, if A ∈ E then A^c ∈ E.

• (iv) The concept of size should be defined for all A ∈ E. Furthermore, the concept of size should respect disjointness. That is, if A_1, A_2, · · · are pairwise disjoint then their individual sizes should add up to the size attached to their union ∪_i A_i.

Around 1930 A. N. Kolmogorov realized that all of the above requirements were part and parcel of the then newly discovered Lebesgue measure and integration theory. His 1933 book on the foundations of probability theory detailing this is now a classic. Let us collect and freeze these notions for our future use.

Definition - 1.2.1 - (Probability space) A probability space (Ω, E, P) consists of the following items.

• Ω = the set of all possible outcomes of an experiment, also called the sample space.

• E = the set of all subsets of Ω for which a (probability) function (measure) P can be defined. Each member of E is called an event (or a measurable set).

The collection of all events, E, must obey the conditions

• (i) Ω ∈ E,

• (ii) if A ∈ E then A^c ∈ E,

• (iii) if A_1, A_2, · · · ∈ E then ∪_i A_i ∈ E.

Any collection of subsets of the space Ω that obeys the above conditions is called a sigma field. The probability measure, P, is a real valued function over E, with the following requirements:

• (i) P(Ω) = 1,

• (ii) 0 ≤ P(A) ≤ 1, for any A ∈ E,

• (iii) if A_1, A_2, · · · ∈ E are disjoint then P(∪_i A_i) = \sum_i P(A_i).


The last property of P is called the (disjoint) countable additivity.

Any function X : Ω → R having the property that X^{-1}((a, b]) ∈ E for all a < b is called a random variable (or measurable function). Every such X has a cumulative distribution associated with it, denoted as F(t) or F_X(t). It is obtained by

F(t) = P(X^{-1}((-\infty, t])) = P(X \le t), \quad t \in \mathbb{R}.
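A toy example may help fix the definitions (an added sketch, not from the notes): two fair dice give a finite sample space on which the counting measure plays the role of P, and the sum of the faces is a random variable whose cdf can be tabulated directly. The names `omega`, `P`, `X`, and `F` below are just labels chosen for this sketch.

# A finite probability space: Omega = 36 ordered pairs, E = all subsets,
# P = counting measure / 36, and X = sum of the faces with cdf F(t) = P(X <= t).
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))           # sample space
P = lambda A: Fraction(len(A), len(omega))              # equilikely probability
X = lambda w: w[0] + w[1]                               # a random variable

def F(t):
    return P([w for w in omega if X(w) <= t])           # cdf of X

print(F(4), F(7), F(12))    # 1/6, 7/12, 1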

Remark - 1.2.1 - Note that the definition of P is tied to the collection of events (sigma field). Condition (i) of the definition of a sigma field is needed for condition (i) of the definition of P to avoid logical inconsistencies. The same is the case with the third conditions of the two concepts.

Exercise - 1.2.1 - Let (Ω, E, P) be a probability space for a random experiment. Show that P satisfies the following properties for any A, B ∈ E:

(i) P(∅) = 0,
(ii) P(A^c) = 1 − P(A),
(iii) P(A^c ∩ B) = P(B) − P(A ∩ B),
(iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
(v) if A ⊆ B, then P(A) ≤ P(B).


Lecture 2

Probability Spaces & Random Variables

Remark - 2.0.2 - (Is P a continuous function?) The usual notion of continuity does not apply to the function P since its domain has not been given any topological structure. The problem is that we cannot talk about \lim_{n\to\infty} A_n for an arbitrary sequence of sets (events) in E. For some special sequences of sets, “convergence of sets” can be defined. When A_1 ⊆ A_2 ⊆ · · · , then we define \lim_{n\to\infty} A_n = \bigcup_{n=1}^{\infty} A_n. Similarly, if A_1 ⊇ A_2 ⊇ · · · , then \lim_{n\to\infty} A_n = \bigcap_{n=1}^{\infty} A_n. A question arises: for such sets, is P continuous in the sense that

P\left( \lim_{n\to\infty} A_n \right) = \lim_{n\to\infty} P(A_n)?

Here is a result that answers this question.

Theorem - 2.0.1 - (The continuity property of P) If A_n, n ≥ 1, and B_n, n ≥ 1, are sequences of events such that A_1 ⊆ A_2 ⊆ · · · and B_1 ⊇ B_2 ⊇ · · · , then

(i) \lim_{n\to\infty} P(A_n) = P\left( \lim_{n\to\infty} A_n \right), \qquad \text{and} \qquad (ii) \lim_{n\to\infty} P(B_n) = P\left( \lim_{n\to\infty} B_n \right).

Proof: Note that \lim_{n\to\infty} A_n = \bigcup_{n=1}^{\infty} A_n = A_1 \cup (A_2 - A_1) \cup (A_3 - A_2) \cup \cdots, where the unions on the right side are disjoint. Thus,

P\left( \bigcup_{n=1}^{\infty} A_n \right) = P(A_1) + P(A_2 - A_1) + P(A_3 - A_2) + \cdots
= P(A_1) + \lim_{n\to\infty} \sum_{i=1}^{n-1} P(A_{i+1} - A_i)
= P(A_1) + \lim_{n\to\infty} \sum_{i=1}^{n-1} \left( P(A_{i+1}) - P(A_i) \right)
= P(A_1) + \lim_{n\to\infty} P(A_n) - P(A_1).


The reader should prove part (ii) (cf. Exercise (2.0.2)). ♠

As we saw above, a monotone sequence of sets has a limit. In general, if A_1, A_2, · · · is any sequence of sets, the new sequence B_1 = ∪_{k≥1} A_k, B_2 = ∪_{k≥2} A_k, · · · , B_n = ∪_{k≥n} A_k, for n = 1, 2, · · · , is monotone. That is, B_1 ⊇ B_2 ⊇ B_3 ⊇ · · · . Hence, the sequence B_1, B_2, · · · has a limit, which is called \limsup_n A_n and stands for

\limsup_n A_n := \lim_{n\to\infty} B_n = \bigcap_{n=1}^{\infty} B_n = \bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k.

Similarly, the new sequence C_n = ∩_{k≥n} A_k is a monotone sequence since C_1 ⊆ C_2 ⊆ · · · . It also has a limit, called \liminf_n A_n, which stands for

\liminf_n A_n = \lim_{n\to\infty} C_n = \bigcup_{n=1}^{\infty} C_n = \bigcup_{n=1}^{\infty} \bigcap_{k \ge n} A_k.

Since C_n ⊆ B_n for every n, their respective limits also share the same relationship, namely \liminf_n A_n ⊆ \limsup_n A_n. Note that the definition of E ensures that both \liminf_n A_n and \limsup_n A_n are in E whenever all the A_n are in E.

In the probability literature, the event ∪_i A_i is often read as “at least one of the A_i occurs”. Similarly, the event ∩_i A_i is often read as “every one of the A_i occurs”. Continuing this further, the event \limsup_n A_n stands for “infinitely many of the A_i occur” and \liminf_n A_n stands for “all but finitely many of the A_i occur”. The reader should try to see why this interpretation is justified. Here is another consequence of the definition of a probability function.

Theorem - 2.0.2 - (The first Borel-Cantelli lemma) Let A_1, A_2, ... be a sequence of events. If \sum_n P(A_n) < \infty then P(\limsup_n A_n) = 0.

Proof: Note that if B_n = \bigcup_{k \ge n} A_k then B_1 ⊇ B_2 ⊇ · · · . Thus, by the continuity property of P,

0 \le P\left( \limsup_n A_n \right) = \lim_n P(B_n) = \lim_n P\left( \bigcup_{k \ge n} A_k \right).

By the subadditivity property of P, we get P\left( \bigcup_{k \ge n} A_k \right) \le \sum_{k \ge n} P(A_k). Since the tail of a convergent series goes to zero,

0 \le P\left( \limsup_n A_n \right) = \lim_n P\left( \bigcup_{k \ge n} A_k \right) \le \lim_n \sum_{k \ge n} P(A_k) = 0. ♠

Remark - 2.0.3 - (Inclusion-exclusion principle) It is a natural question to ask: can one find the probability of a union of events when one knows only the probabilities of the individual events? The answer is yes, provided the probabilities of their intersections are known; the resulting formula is called the inclusion-exclusion principle and is due to H. Poincaré:

P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) + \cdots + (-1)^{n+1} P\left( \bigcap_{i=1}^{n} A_i \right)
= B_1 - B_2 + B_3 - \cdots + (-1)^{n+1} B_n,

where B_j = \sum_{1 \le i_1 < i_2 < \cdots < i_j \le n} P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_j}), \quad j = 1, 2, \cdots, n.

Therefore,

P(A_1^c \cap A_2^c \cap \cdots \cap A_n^c) = 1 - P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \sum_{j=1}^{n} (-1)^{j-1} B_j = \sum_{j=0}^{n} (-1)^{j} B_j, \quad \text{where } B_0 = 1,

and it represents the probability that none of the A_i occur. This is a special case of a yet more general result due to Jordan, who proved it in 1927. It says that

P(\text{exactly } k \text{ events among } A_1, \cdots, A_n \text{ occur}) = \sum_{j=k}^{n} (-1)^{j-k} \binom{j}{k} B_j,

which reduces to the result of Poincaré for k = 0. Both of these results can be proved by induction and are left for the reader as exercises.
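Both formulas are easy to check by brute force on a small equilikely space; the following sketch (added here, not from the notes) does so for three arbitrarily chosen events on a 12-point space. The particular events are assumptions of the example.

# Brute-force check of the Poincare and Jordan formulas on a small space.
from fractions import Fraction
from itertools import combinations
from math import comb

omega = range(12)
events = [set(range(0, 6)), set(range(3, 9)), set(range(2, 11, 2))]
n = len(events)
P = lambda A: Fraction(len(A), len(omega))

def B(j):
    # B_j = sum of P over all j-fold intersections; B_0 = 1 as in the text
    if j == 0:
        return Fraction(1)
    return sum(P(set.intersection(*c)) for c in combinations(events, j))

# Poincare: inclusion-exclusion for P(union) versus direct counting
assert P(set.union(*events)) == sum((-1) ** (j + 1) * B(j) for j in range(1, n + 1))

# Jordan: P(exactly k of the events occur), for k = 0, 1, ..., n
for k in range(n + 1):
    direct = P({w for w in omega if sum(w in A for A in events) == k})
    jordan = sum((-1) ** (j - k) * comb(j, k) * B(j) for j in range(k, n + 1))
    print(k, direct, jordan, direct == jordan)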

Remark - 2.0.4 - (Various assignment methods) How should one define the function P : E → [0, 1] so that the three requirements of its definition are fulfilled and, at the same time, it is realistic? The word “realistic” points towards our desire that it should be applicable in various real life situations. This is a modeling issue. Typically any one of the following four techniques is invoked, for various reasons:

• (Counting method). When Ω has only finitely many members we take E to be the power set of Ω. Now if one can justify that each member should have the same chance assigned to it then, by the third requirement of a probability function, we get

P(A) = \frac{\text{number of elements of } A}{\text{number of elements of } \Omega}.

In this case (Ω, E, P) is called an equilikely probability space.

• (Lengths/areas/volumes method). When Ω is too large, such as an interval or a subset of the plane or R^k, even when one may be able to justify that the outcomes should show no preference, the above counting method breaks down. Its natural analog then becomes

P(A) = \frac{\text{size of } A}{\text{size of } \Omega},

where the “size” is taken to mean the length in case Ω is a bounded subset of R, or area or volume for R^k when k > 1. In this case, typically E is taken to be the smallest sigma field containing the bounded rectangles, called the Borel sigma field (cf. Exercise (2.0.6)).

• (Weighted versions). A large class of probability models comes in weighted forms of the above two items. In the following, several standard models of probability and statistics of this sort are listed. Such models are often justified based either on analytic derivations or on a preponderance of empirical evidence.

• (Independence). Another distinct modeling technique that sets probability theory apart from other disciplines is the modeling tool of independence. We will briefly describe this concept a bit later.

Example - 2.0.1 - (Secretary's matching problem — equilikely probability space) Here we illustrate the use of the inclusion-exclusion principle applied to a particular equilikely probability space and solve the secretary's matching problem. A secretary types n letters addressed to n different people. Then he types n envelopes with the same n addresses. However, he puts the letters into the envelopes at random. (The word “random” here stands for no preference for any particular letter going into any particular envelope. This can then be interpreted to mean that the resulting probability space is equilikely.) We would like to know the probability that at least one of the letters is correctly put into its own envelope.¹ Let A_i be the event that letter i goes into its own envelope (1 ≤ i ≤ n) (i.e., a match occurs for the i-th letter). We want the probability that at least one of the A_i occurs, i.e., P(∪_{i=1}^{n} A_i). To find P(A_i), P(A_i ∩ A_j), P(A_i ∩ A_j ∩ A_k), i ≠ j ≠ k, etc., we proceed as follows. There are n! ways to place the letters into the n envelopes. Therefore, we see that P(A_i) = (n−1)!/n! = 1/n, for i = 1, 2, · · · , n. Similarly, P(A_i ∩ A_j) = (n−2)!/n! = 1/(n(n−1)), and P(A_i ∩ A_j ∩ A_k) = 1/(n(n−1)(n−2)), etc. So, by the result of Poincaré,

P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{i=1}^{n} \frac{1}{n} - \sum_{i<j} \frac{1}{n(n-1)} + \sum_{i<j<k} \frac{1}{n(n-1)(n-2)} + \cdots
= 1 - \frac{\binom{n}{2}}{n(n-1)} + \frac{\binom{n}{3}}{n(n-1)(n-2)} + \cdots + (-1)^{n+1} \frac{\binom{n}{n}}{n!}
= 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n+1} \frac{1}{n!} = \sum_{j=1}^{n} (-1)^{j+1} \frac{1}{j!}.

¹B. D. Choi gives a generalization of the matching problem in which we only select a subset of the recipients and see how many of them got the correct letters. For this, see his paper “Limiting distribution for the generalized matching problem”, (1987), The Amer. Math. Monthly, vol. 94, no. 4, pp. 356-360.
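The last sum converges to 1 − 1/e ≈ 0.6321 as n → ∞. A short simulation (an added illustration, not from the notes) compares the exact formula with empirical frequencies obtained from random permutations.

# Matching problem: exact probability versus a Monte Carlo estimate.
import random
from math import factorial, e

def exact(n):
    return sum((-1) ** (j + 1) / factorial(j) for j in range(1, n + 1))

def simulate(n, trials=100_000):
    hits = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        hits += any(perm[i] == i for i in range(n))
    return hits / trials

for n in (3, 5, 10):
    print(n, round(exact(n), 4), round(simulate(n), 4), round(1 - 1 / e, 4))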


Example - 2.0.2 - (Random selection) Pick a point “at random” from (0, 2π]. What does this mean? Well, it might be painful to construct an experiment which will do just that. At least approximately it can be performed with the help of a spinner which does not have any preferential stopping spots. Mathematically speaking it stands for the following probability space.

• Ω = (0, 2π].

• E is the smallest sigma field of subsets of (0, 2π] which contains all the intervals of (0, 2π]. We will call it the Borel sigma field over (0, 2π].

• P((a, b]) = \frac{b - a}{2\pi - 0} for any 0 ≤ a < b ≤ 2π.

Note that the actual act of selecting a point from the interval (0, 2π], or how one physically performs this operation, is none of our concern. This is an idealization, and the phrase “selection of a point at random” points towards this mathematical model (abstraction) of the physical act. The resulting cumulative distribution is

F(t) = \begin{cases} 0 & \text{if } t \le 0, \\ \dfrac{t}{2\pi} & \text{if } 0 < t < 2\pi, \\ 1 & \text{if } t \ge 2\pi. \end{cases}
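A quick simulation (an added sketch, not from the notes) of the spinner model shows the empirical distribution of uniform draws settling down to F(t) = t/(2π); the sample size is an arbitrary choice.

# Empirical check of F(t) = t / (2 pi) for uniform draws from (0, 2 pi].
import random
from math import pi

draws = [random.uniform(0.0, 2.0 * pi) for _ in range(100_000)]

for t in (1.0, pi, 5.0):
    empirical = sum(x <= t for x in draws) / len(draws)
    print(f"t = {t:.3f}   empirical {empirical:.4f}   F(t) = {t / (2 * pi):.4f}")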

Exercise - 2.0.2 - Finish the proof of part (ii) of Theorem (2.0.1).

Exercise - 2.0.3 - By using induction, prove the result of Poincaré.

Exercise - 2.0.4 - (Continuity property of P revisited) Show that for any sequence of events A_n, n ≥ 1, we have

P(\liminf_n A_n) \le \liminf_n P(A_n) \le \limsup_n P(A_n) \le P(\limsup_n A_n).

Hence, deduce that if \liminf_n A_n = \limsup_n A_n =: \lim_n A_n, then P(\lim_n A_n) = \lim_n P(A_n), giving a slight extension of the continuity property of P.

Exercise - 2.0.5 - (Intersection of sigma fields is a sigma field) Let Ω be a nonempty set and let G_α, α ∈ Λ, be any nonempty collection of sigma fields of subsets of Ω. Show that F = ∩_{α∈Λ} G_α is again a sigma field.

Exercise - 2.0.6 - (Smallest sigma field containing a class) Let Ω be a nonempty set and let A be a collection of subsets of Ω. (Note that A need not have any properties.) Show that F_A = ∩_{G⊃A} G is again a sigma field, where the intersection is over all sigma fields G that contain A. [Example: On R the smallest sigma field containing the collection of all intervals is called the Borel sigma field.]

Exercise - 2.0.7 - (Generated sigma field) The collection

σ(X) := \{ X^{-1}(B) : B \in \mathcal{B} \}

is always a sigma field (as can be verified easily), and is called the sigma field generated by X. So, a real valued function X over Ω is a random variable for the probability space (Ω, E, P) if and only if σ(X) ⊆ E. Verify that σ(X) is a sigma field when X is any real valued function on Ω.


Remark - 2.0.5 - (What's observable?) The actual elements of Ω may or may not be observable. Even worse, the probability of events, P(A), is NEVER observable. Often the observables are the values of certain random variables that turn up as a result of performing the experiment.

• Probabilists propose models for the unknown (unobservable) probability function P and then, using those models, deduce results by knowing some partial information concerning the experiment, or without performing the random experiment at all. The results are only as good as the models.

• Statisticians test the validity of the proposed models for P after performing the experiment a large number of times and observing certain random variables (this is called data analysis and statistical inference).

Every random variable has its own (unique) distribution function (or distribution, for short). All the probabilistic properties of the random variable are stored in its distribution. Probability theory essentially is the study of these distributional properties.

Definition - 2.0.2 - (Multivariate distribution) Let X be a d-dimensional random vector (i.e., d random variables, X_1, X_2, · · · , X_d, all defined over the same probability space (Ω, E, P)). The multivariate (or joint) distribution of X is a function,

F(x_1, x_2, \cdots, x_d) = P(X \le x) = P\left( \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_d \end{pmatrix} \le \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \right) = P(X_1 \le x_1,\, X_2 \le x_2,\, \cdots,\, X_d \le x_d), \quad x \in \mathbb{R}^d.

Here the inequality between vectors means that the componentwise inequalities must hold for all the components, and the commas separating the events {X_i ≤ x_i} stand for the intersection operations.

Distributions are nonnegative, right continuous functions (in each variable) which may or may not be differentiable. Even more, some distributions may have jumps, but the points of jump can always be counted (i.e., they form a countable set). For a distribution F of one random variable, a point x is a point of jump of F if

P(X = x) = F(x) - F(x^-) > 0.

For a distribution F of two random variables, a point (x_1, x_2) is a point of jump of F if

P(X_1 = x_1, X_2 = x_2) = F(x_1, x_2) - F(x_1^-, x_2) - F(x_1, x_2^-) + F(x_1^-, x_2^-) > 0.

Most of the commonly used distributions fall into two categories: the differentiable kind, which are (strangely) given the name “continuous”, and the jump type, which are given the name “discrete”. However, there are other types of distributions which do not fit into these two categories.


Definition - 2.0.3 - (Continuous case) The (joint) density of a random vector with (joint) distribution F, when it exists, is a nonnegative function

f(x) = \frac{\partial^d}{\partial x_1 \cdots \partial x_d} F(x), \quad \text{with} \quad \int_{\mathbb{R}^d} f(x)\, dx = 1.

In this case, for any (Borel) subset B of R^d, we take P(X ∈ B) = \int_B f(x)\, dx.

(Discrete case) The joint (discrete) density of a random vector with (joint) distribution F, when it exists, is a nonnegative function f with a countable subset D ⊆ R^d, so that

f(x) = P(X = x) > 0, \ x \in D, \quad \text{with} \quad \sum_{\text{all } x \in D} f(x) = 1.

Remark - 2.0.6 - (Notation) X ∼ F or X ∼ f. The actual probability space (Ω, E, P) over which X is defined is often suppressed once the distribution F (or the density f) is obtained. One may safely assume that there is some probability space from which the specified random variable, with its distribution, came. This observation was proved by Kolmogorov. Hence, X, F and f are related as

P(X \le t) = F(t), \qquad f(x) = \begin{cases} \dfrac{d}{dx} F(x), & \text{continuous case}, \\ F(x) - F(x^-), & \text{discrete case}. \end{cases}

Example - 2.0.3 - Here are some commonly used (discrete and continuous) models for distributions (actually densities) of random variables.

• (Normal). X ∼ N(µ, σ²) stands for X having the density

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty.

The parameters, µ and σ > 0, control the shape of the density.

• (Lognormal). X ∼ LN(µ, σ²) stands for X having the density

f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x-\mu)^2/(2\sigma^2)}, \quad x > 0.

The parameters, µ and σ > 0, control the shape of the density.

• (Chi square). X ∼ χ²(k) stands for X having the density

f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2-1} e^{-x/2}, \quad 0 < x < \infty.

Here the single parameter k > 0, called the degrees of freedom, controls the shape of the density. By the way,

\Gamma(\tfrac{1}{2}) = \sqrt{\pi}, \quad \Gamma(1) = \Gamma(2) = 1, \quad \Gamma(x+1) = x\,\Gamma(x), \ x > 0. \tag{0.1}


• (Exponential). X ∼ Exp(λ) stands for X having the density

f(x) = \lambda e^{-\lambda x}, \quad x > 0.

Here λ > 0 is the parameter of the density.

• (Gamma). X ∼ Gamma(λ, α) (also sometimes denoted as X ∼ G(λ, α)) stands for X having the density

f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \quad x > 0.

Here λ, α > 0 are the parameters of the density. When α = 1 we get the exponential density as a special case. If we take λ = 1/2 and α = k/2, we get the chi square density with k degrees of freedom (a numerical check of this appears after this list).

• (Multivariate normal). X ∼ N(µ, V) stands for X = [X_1, · · · , X_d]' having the (joint) density

f(x) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\mathbf{V})}} \exp\left\{ -\frac{1}{2}(x-\mu)'\mathbf{V}^{-1}(x-\mu) \right\}, \quad x \in \mathbb{R}^d.

Actually, a multivariate normal random variable is defined through its moment generating function (mgf) since it uses V only, even when V is not invertible. The vector µ = [µ_1, · · · , µ_d]' controls the center of the density and the d × d positive-definite matrix V controls the spread and shape of the density.

• (Binomial). X ∼ B(n, p) stands for X having the (discrete) density

f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \cdots, n.

• (Poisson). X ∼ Poisson(λ) represents a random variable whose (discrete) density is

f(x) = e^{-\lambda}\, \frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \cdots.

• (Geometric). X ∼ Geometric(p) represents a random variable whose (discrete) density is

f(x) = p\,(1-p)^x, \quad x = 0, 1, 2, \cdots.
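As a small check of the Gamma/chi-square relation mentioned above (an added illustration, not from the notes), the following sketch evaluates both densities at a few points with λ = 1/2 and α = k/2; the test points and k = 5 are arbitrary choices.

# Gamma(1/2, k/2) density versus the chi-square density with k degrees of freedom.
from math import gamma, exp

def gamma_pdf(x, lam, alpha):
    return lam ** alpha / gamma(alpha) * x ** (alpha - 1) * exp(-lam * x)

def chi2_pdf(x, k):
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

k = 5
for x in (0.5, 1.0, 2.5, 7.0):
    print(x, round(gamma_pdf(x, 0.5, k / 2), 6), round(chi2_pdf(x, k), 6))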

Definition - 2.0.4 - (Independence) Two events, A,B, are called independent if

P(A ∩ B) = P(A)P(B).

This is a distinguishing concept of probability theory. The early literature of probability theory relied heavily on it. Much of modern probability theory evolved while proving old results that assumed this structure by trying to relax it as much as possible.


Definition - 2.0.5 - (Independence of events) A (finite or infinite) sequence of events A_1, A_2, · · · is called independent if for any finite subset of them their joint probability is the product of the individual probabilities. That is,

P\left( \bigcap_{i \in J} A_i \right) = \prod_{i \in J} P(A_i),

for any finite subset J of positive integers.

Theorem - 2.0.3 - (2nd Borel-Cantelli lemma) If A_1, A_2, · · · are independent events such that \sum_{n=1}^{\infty} P(A_n) = \infty, then P\left( \limsup_n A_n \right) = 1.

Proof: By the continuity property of P, we have

P\left( \limsup_n A_n \right) = P\left( \bigcap_{k \ge 1} \bigcup_{n \ge k} A_n \right) = \lim_{k\to\infty} P\left( \bigcup_{n \ge k} A_n \right), \quad \text{and} \quad P\left( \bigcup_{n \ge k} A_n \right) = \lim_{m\to\infty} P\left( \bigcup_{n=k}^{m} A_n \right).

Now, the independence of the events gives that

P\left( \bigcup_{n=k}^{m} A_n \right) = 1 - P\left( \bigcap_{n=k}^{m} A_n^c \right) = 1 - \prod_{n=k}^{m} (1 - P(A_n)) = 1 - \exp\left\{ \sum_{n=k}^{m} \ln(1 - P(A_n)) \right\}.

Here, if P(A_n) = 1, then the equality remains valid if we agree to take ln 0 = −∞. By the Taylor series \sum_{k=1}^{\infty} \frac{x^k}{k} = -\ln(1-x), we see that

\ln(1 - P(A_n)) = -\left( P(A_n) + \frac{(P(A_n))^2}{2} + \cdots \right) \le -P(A_n).

This inequality remains valid when P(A_n) = 1 since ln 0 = −∞ < −1. Thus, \sum_{n=k}^{m} \ln(1 - P(A_n)) \le -\sum_{n=k}^{m} P(A_n). This implies that

0 \le \lim_{m\to\infty} \exp\left\{ \sum_{n=k}^{m} \ln(1 - P(A_n)) \right\} \le \lim_{m\to\infty} \exp\left\{ -\sum_{n=k}^{m} P(A_n) \right\} = 0.

This gives that

P\left( \bigcup_{n \ge k} A_n \right) = \lim_{m\to\infty} \left( 1 - \exp\left\{ \sum_{n=k}^{m} \ln(1 - P(A_n)) \right\} \right) = 1.

Since this is true for all k, P (lim supn An) = 1. ♠


Remark - 2.0.7 - (Zero-one property) When A_1, A_2, · · · is a sequence of independent events, the two Borel-Cantelli lemmas together show that P(\limsup_n A_n) is always either 0 or 1. It cannot have any other value. This fact is a special case of a more general result, known as Kolmogorov's zero-one law.

Definition - 2.0.6 - (Independence of random variables) When the distribution of X = [X_1, · · · , X_d]' can be written as a product, i.e.,

F(x) = P(X \le x) = P\left( \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_d \end{pmatrix} \le \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \right) = \prod_{i=1}^{d} P(X_i \le x_i), \quad x \in \mathbb{R}^d,

we say that X_1, X_2, · · · , X_d are mutually independent (or just independent). If for every pair (X_i, X_j), i ≠ j, the two random variables are independent, then X_1, X_2, · · · , X_d are called pairwise independent. The single word “independent” will always refer to mutual independence.

Remark - 2.0.8 - (The iid notation & the notion of random sample) The notation X_1, X_2, · · · , X_d \overset{iid}{\sim} stands for the case when

F(x) = P(X \le x) = P\left( \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_d \end{pmatrix} \le \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \right) = \prod_{i=1}^{d} P(X_1 \le x_i), \quad x \in \mathbb{R}^d.

If we denote the common function P(X_1 ≤ t) by G(t), then the above iid notation takes the form X_1, X_2, · · · , X_d \overset{iid}{\sim} G, or G is replaced by the name given to G. In this case the collection X_1, X_2, · · · , X_d is called a random sample from G, and G is called the population distribution.

Example - 2.0.4 - Recall that N(0, 1) is the name given to the standard normal distribution. So, the notation X_1, X_2, · · · , X_d \overset{iid}{\sim} N(0, 1) uses the (common) distribution function

G(t) = P(X_1 \le t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-u^2/2}\, du, \quad \text{with density} \quad f(u) = \frac{e^{-u^2/2}}{\sqrt{2\pi}}, \ u \in \mathbb{R}.

Remark - 2.0.9 - (Independence of events & sigma fields) For a probability space (Ω, E, P), two events A, B ∈ E are independent if P(A ∩ B) = P(A)P(B). Extending this idea, if F, G are subsigma fields of E, then F, G are called independent if for any A ∈ F and any B ∈ G we have P(A ∩ B) = P(A)P(B). Using this notion of independence of sigma fields, it turns out that two random variables X, Y are independent if and only if their respective generated sigma fields, σ(X), σ(Y), are independent. For the most part we will not work at this generality.

Exercise - 2.0.8 - When X ∼ N(0, 1), by rewriting P(X² ≤ t) as P(−√t ≤ X ≤ √t) and then differentiating with respect to t, show that Y = X² ∼ χ²(1).


Lecture 3

Expectations

The concept of an average or a mean of the values of h(X), when X is a random variable and h is a function of interest, is captured by the Lebesgue integral. We present a brief heuristic argument here, along with a collection of its basic properties that we will need.

We illustrate the basic idea with the help of the distribution F_X of any random variable X defined over some probability space (Ω, E, P). The Riemann-Stieltjes integral \int_a^b h(t)\, dF_X(t) partitions the x-axis using intervals, while Lebesgue's method of integration instead partitions the y-axis using intervals. Then the two methods perform distinctly different actions. When (x_{i-1}, x_i] is one of the partitioning intervals, the Riemann-Stieltjes integral measures its size by using the distribution, as F_X(x_i) − F_X(x_{i-1}). Furthermore, it uses an arbitrary point a_i ∈ [x_{i-1}, x_i] and evaluates h(a_i) to create the Riemann-Stieltjes sum

R(h, F_X, \mathcal{P}) = \sum_{i=1}^{n} h(a_i)\, (F_X(x_i) - F_X(x_{i-1})) = \sum_{i=1}^{n} h(a_i)\, P(x_{i-1} < X \le x_i).

Instead, Lebesgue suggested partitioning the y-axis into intervals and constructing the integral for nonnegative functions first. If J is one such partitioning interval on the y-axis, the inverse image need not be a nice interval at all. Nor do we need it to be an interval to measure its size! All we want is to find the probability of the inverse set, A, which is given by the relationship P(A) = P(h(X) ∈ J). So, let A_1, A_2, · · · , A_n be such inverse sets for the intervals J_1 := (0, y_1], J_2 = (y_1, y_2], · · · , J_n := (y_n, ∞), and take y_0 = 0. For some choice of a_i ∈ A_i, when h is nonnegative and bounded, we create the Lebesgue sum as

\sum_{i=1}^{n} h(a_i)\, P(X \in A_i) = \sum_{i=1}^{n} h(a_i)\, P(h(X) \in J_i) = \sum_{i=1}^{n} h(a_i)\, P(y_{i-1} < h(X) \le y_i).

When the partition of the y-axis is made finer, if the resulting limit exists we call the limit the Lebesgue integral. The limit of the left side is appropriately denoted as \int_\Omega h(X)\, dP, while that of the right side is denoted as \int_{\mathbb{R}} h(t)\, dF_X(t). But the two integrals are the same. The general case is then handled by writing h(t) = h^+(t) - h^-(t), the difference of the positive and negative parts of h. This not only gives us the change of variables formula, but, more importantly, this way we are able to construct integrals over abstract probability spaces. Using this integral, we define the expectation (or mean) of a random variable X to be

E(X) = \int_\Omega X\, dP = \int_{\mathbb{R}} t\, dF_X(t), \quad \text{whenever} \quad E|X| = \int_\Omega |X|\, dP < \infty.

Higher order moments are defined using h(X) = X^k for positive integers k. In particular, the variability of a distribution is captured by

Var(X) = E(X^2) - (E(X))^2, \qquad Std(X) = \sqrt{Var(X)},

called the variance and standard deviation respectively.

3.1 Properties of Lebesgue integral

Ignoring some technical details the resulting integral has the usual properties:

• (Linearity) E(ah(X) + bg(X) + c) = aE(h(X)) + bE(g(X)) + c for any constants a, b, c. The multivariate extensions go along the same lines. For instance, for two random variables X, Y,

E(ah(X, Y) + bg(X, Y) + c) = aE(h(X, Y)) + bE(g(X, Y)) + c.

• (Positivity) If h_1(t, s) ≥ h_2(t, s) then E(h_1(X, Y)) ≥ E(h_2(X, Y)), and

|E(h(X, Y))| \le E|h(X, Y)|.

• (Change of variable formula, CVF) If Z = h(X, Y) with distribution F_Z(t), then

E(Z) = \int_\Omega Z\, dP = \int_{\mathbb{R}} t\, dF_Z(t) = \int_{\mathbb{R}^2} h(x, y)\, dF_{X,Y}(x, y).

In particular, E(Xn) is called the n-th moment of X.

• (Integration by parts) When F, G are nondecreasing and right continuous,

\int_{[a,b]} F(x)\, dG(x) + \int_{[a,b]} G(x^-)\, dF(x) = F(b)G(b) - F(a^-)G(a^-).

• (Fatou's lemma) For any sequence of nonnegative random variables X_n, n ≥ 1,

E\left( \liminf_n X_n \right) = \int_\Omega \liminf_n X_n\, dP \le \liminf_n \int_\Omega X_n\, dP = \liminf_n E(X_n).

• (Monotone convergence theorem) If 0 ≤ X_n(ω) ≤ X_{n+1}(ω) → X(ω) and E(X) < ∞, then

E(X) = \int_\Omega X\, dP = \lim_n \int_\Omega X_n\, dP = \lim_n E(X_n).

• (Lebesgue dominated convergence theorem) If X_n(ω) → X(ω) and |X_n| ≤ Y for some random variable Y with E(Y) < ∞, then

E(X) = \int_\Omega X\, dP = \lim_n \int_\Omega X_n\, dP = \lim_n E(X_n).

• (Fubini-Tonelli's theorem) If F, G are the distributions of X, Y respectively, then Tonelli's theorem says that

\int_{\mathbb{R}} \int_{\mathbb{R}} |h(x, y)|\, dF(x)\, dG(y) = \int_{\mathbb{R}} \int_{\mathbb{R}} |h(x, y)|\, dG(y)\, dF(x).

Fubini’s theorem says that if either side of above equation is finite then theabove interchange of integrals can be performed without the absolute valuesaround h(x, y) as well.

The integral has some further properties which we will mention as they are needed.

Example - 3.1.1 - When X ≥ 0 with distribution F(x) and E|X| < ∞, Tonelli's theorem gives

E(X) = \int_\Omega X\, dP = \int_\Omega \int_0^X 1\, dx\, dP = \int_0^\infty \int_{\{\omega:\, X(\omega) > x\}} 1\, dP\, dx = \int_0^\infty (1 - F(x))\, dx.
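A numerical check of this tail-integral identity for the exponential distribution is sketched below (an added illustration, not from the notes); the truncation point and step count of the trapezoidal rule are choices of the sketch.

# For X ~ Exp(lambda), 1 - F(x) = e^{-lambda x} and E(X) = 1/lambda.
from math import exp

def tail_integral(lam, upper=50.0, steps=200_000):
    h = upper / steps
    ys = [exp(-lam * i * h) for i in range(steps + 1)]      # 1 - F(x) on the grid
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

for lam in (0.5, 1.0, 2.0):
    print(lam, round(tail_integral(lam), 6), 1 / lam)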

3.2 Covariance

Definition - 3.2.1 - (Variance-covariance matrix) If E(|X_1|^2 + · · · + |X_d|^2) is finite, the variance-covariance matrix of the random vector X = [X_1, X_2, · · · , X_d]' is

\mathbf{V} = E\left\{ \begin{pmatrix} X_1 \\ \vdots \\ X_d \end{pmatrix} [X_1 \cdots X_d] \right\} - \begin{pmatrix} EX_1 \\ \vdots \\ EX_d \end{pmatrix} [EX_1 \cdots EX_d] = \begin{pmatrix} Var(X_1) & & Cov(X_i, X_j) \\ & \ddots & \\ Cov(X_i, X_j) & & Var(X_d) \end{pmatrix},

where Var(X_i) = EX_i^2 - (EX_i)^2 and Cov(X_i, X_j) = E(X_i X_j) - (EX_i)(EX_j). We take the correlation to be Corr(X, Y) = Cov(X, Y)/\sqrt{Var(X)\,Var(Y)} whenever the variances are finite.

Remark - 3.2.1 - (Hoeffding formula) There is an old result of Wassily Hoeffding (1914-1991), which he proved¹ in 1940, that says

Cov(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( H(x, y) - F(x)G(y) \right) dx\, dy,

¹Hoeffding, Wassily (1940), “Maßstabinvariante Korrelationstheorie”. Schriften des Math. Instituts und des Instituts für Angewandte Mathematik der Universität Berlin 5, pp. 179-233.


for any bivariate distribution H, as long as the variances of the random variables exist and F and G are the marginal distributions of H. Note that this approach does not need the joint density to compute the covariance.

Note from the definition that the covariance obeys the following properties, known as the symmetry and bilinearity properties:

• (i) Cov(X, Y) = Cov(Y, X),

• (ii) Cov(cX, Y) = Cov(X, cY) = c Cov(X, Y), for any constant c,

• (iii) Cov(X_1 + X_2, Y) = Cov(X_1, Y) + Cov(X_2, Y),

• (iv) Cov(c, Y) = 0, for any constant c.

In particular, Var(X) = Cov(X, X) when E(X²) < ∞. We say that X and Y are uncorrelated if Cov(X, Y) = 0. Next, we turn our attention towards inequalities.

Proposition - 3.2.1 - (Existence of moments) Let h : [0, ∞) → R be a strictly increasing continuous function. Then for any nonnegative random variable X we have

\sum_{n=1}^{\infty} P(X \ge h(n)) \le \int_S h^{-1} \circ X(s)\, dP = \int_{\mathbb{R}} h^{-1}(t)\, dF_X(t) = E\left( h^{-1}(X) \right) \le \sum_{n=0}^{\infty} P(X \ge h(n)).

Proof: Just note that

\sum_{n=1}^{\infty} P(X \ge h(n)) = \sum_{n=1}^{\infty} P(h^{-1}(X) \ge n) = \sum_{n=1}^{\infty} \sum_{i=n}^{\infty} P(i \le h^{-1}(X) < i+1)
= \sum_{i=1}^{\infty} \sum_{n=1}^{i} P(i \le h^{-1}(X) < i+1) = \sum_{i=1}^{\infty} i\, P(i \le h^{-1}(X) < i+1).

Use the notation A_i = \{ i \le h^{-1}(X) < i+1 \}, for i = 0, 1, 2, · · · . Then the A_i are disjoint events with ∪_i A_i = S. Therefore,

\sum_{n=1}^{\infty} P(X \ge h(n)) = \sum_{i=1}^{\infty} i\, P(A_i) = \sum_{i=1}^{\infty} \int_{A_i} i\, dP \le \sum_{i=1}^{\infty} \int_{A_i} h^{-1}(X(s))\, dP(s)
\le \sum_{i=0}^{\infty} \int_{A_i} h^{-1}(X(s))\, dP(s) = \int_S h^{-1}(X(s))\, dP(s), \quad \text{completing half of the proof,}

= \sum_{i=0}^{\infty} \int_{A_i} h^{-1}(X(s))\, dP(s) \le \sum_{i=0}^{\infty} \int_{A_i} (i+1)\, dP(s) = \sum_{i=0}^{\infty} (i+1)\, P(A_i)
= \sum_{i=0}^{\infty} \sum_{n=0}^{i} P(i \le h^{-1}(X) < i+1) = \sum_{n=0}^{\infty} \sum_{i=n}^{\infty} P(i \le h^{-1}(X) < i+1)
= \sum_{n=0}^{\infty} P(h^{-1}(X) \ge n) = \sum_{n=0}^{\infty} P(X \ge h(n)).

This finishes the proof. ♠

Example - 3.2.1 - If X ∼ N(0, σ^2) and Y = X^2, then E(X) = 0 and E(Y) = E(X^2) = σ^2. Here are the justifications.

E(X) = (1/(σ√(2π))) ∫_{−∞}^∞ u e^{−u^2/(2σ^2)} du = 0,

since the integrand is an odd function. Next,

E(Y) = E(X^2) = (1/(σ√(2π))) ∫_{−∞}^∞ u^2 e^{−u^2/(2σ^2)} du
= (2/(σ√(2π))) ∫_0^∞ u^2 e^{−u^2/(2σ^2)} du, since the integrand is an even function,
= (σ^2/√(2π)) ∫_0^∞ t^{1/2} e^{−t/2} dt, by substituting t = u^2/σ^2,
= (σ^2/√(2π)) 2^{3/2} Γ(3/2), since the total area under the χ^2(3) density is one,
= σ^2, by (0.1).


Lecture 4

Various Inequalities

A useful inequality involving a random variable is due to Jensen. It says that if f is a convex function over an interval and a random variable X takes values in that interval, then

E(f(X)) ≥ f(E(X)),

when the expectation on the left side exists. In particular, by taking f(t) = t^2 applied to |X|, we get

E(X^2) ≥ (E(|X|))^2.

This implies that Var(X) ≥ 0.

Let X be a random variable with variance σ^2. Chebyshev's inequality says that

P(|X − E(X)| > ε) ≤ σ^2/ε^2, for any ε > 0.

This is a rather crude inequality, yet it is surprisingly pervasive in probability theory and analysis. Here we collect some of the standard inequalities from analysis.

Exercise - 4.0.1 - For any p > 0, show that the following statements are equivalent.

• (i) E|Y|^p < ∞,

• (ii) ∑_{n=1}^∞ P(|Y|^p ≥ n) < ∞,

• (iii) ∫_[0,∞) P(|Y|^p ≥ t) dt < ∞,

• (iv) ∑_{n=1}^∞ n^{p−1} P(|Y| ≥ n) < ∞.

Then show that

E(|Y|^p) = p ∫_0^∞ t^{p−1} P(|Y| > t) dt = p ∫_0^∞ t^{p−1} P(|Y| ≥ t) dt.
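The last identity is easy to check numerically. The sketch below (an added illustration) assumes Y ∼ Exp(1) and p = 2, for which E(Y^2) = 2:

import numpy as np

# E|Y|^p = p ∫_0^∞ t^{p-1} P(|Y| > t) dt, checked for Y ~ Exp(1) and p = 2.
p = 2.0
dt = 0.001
t = (np.arange(60000) + 0.5) * dt          # midpoints on [0, 60]
tail = np.exp(-t)                          # P(Y > t) = e^{-t}
moment = np.sum(p * t**(p - 1) * tail) * dt
print(moment)                              # ≈ 2.0 = E(Y^2)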

Example - 4.0.2 - (AM-GM inequality) The arithmetic mean μ of n numbers a_1, a_2, · · · , a_n is μ = (a_1 + a_2 + · · · + a_n)/n. If these numbers are positive, their geometric mean is

μ̃ = antilog_e( (log a_1 + log a_2 + · · · + log a_n)/n ) = (∏_{i=1}^n a_i)^{1/n}.


The AM-GM inequality says that μ ≥ μ̃. To prove this, define a random variable X ∈ ∆ = {a_1, a_2, · · · , a_n} with equal probabilities assigned to the elements of ∆ (the set ∆ contains the given positive numbers and repetitions are allowed). Then it is easy to see that E(X) = μ. Also, by CVF, we have

E log(X) = (log a_1 + log a_2 + · · · + log a_n)/n = log μ̃.

Since −log x is a convex function, Jensen's inequality gives

−log μ = −log E(X) ≤ −E log(X) = −log μ̃.

Multiplying both sides by −1 and then taking antilogs gives μ ≥ μ̃.
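A two-line numerical illustration (added here; not in the original notes): for any sample of positive numbers the arithmetic mean dominates the geometric mean.

import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 10.0, size=1000)      # arbitrary positive numbers
arithmetic = a.mean()
geometric = np.exp(np.log(a).mean())       # antilog of the mean of the logs
print(arithmetic >= geometric, arithmetic, geometric)   # True: AM ≥ GM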

Exercise - 4.0.2 - (AGH inequality) Let a, b be two positive numbers. The quantity ((1/a + 1/b)/2)^{−1} is called the harmonic mean of the two numbers. Use the AM-GM inequality to prove the following AGH inequality:

Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean.

4.1 Hölder & Minkowski's Inequalities

An extension of the CBS (Cauchy-Bunyakovsky-Schwarz) inequality is known as the Hölder inequality. For this we need an extended version of the AM-GM inequality, known as Young's inequality.

Proposition - 4.1.1 - (Young's inequality) Let X ∈ ∆ = {a, b} with density f(a) = p, f(b) = 1 − p, where 0 < p < 1 and a, b ≥ 0. Then

ap + b(1 − p) = E(X) ≥ a^p · b^{1−p},

where the equality holds if and only if a = b. (For p = 1/2 this reduces to the AM-GM inequality.)

Proof: Look at the inequality backwards. We need to show that

ap + b(1 − p) ≥ (a/b)^p b, assuming that b ≠ 0.

(The inequality is trivially true when either a = 0 or b = 0.) So, we need to prove that

(a/b) p + (1 − p) ≥ (a/b)^p,

or equivalently,

f(t) = −t^p + tp + (1 − p) ≥ 0, for all t > 0.

The minimum of this function can be obtained by brute force, and we leave that to the reader. Instead, there is an easier way. Define a random variable U that takes two values, a/b and 1, with respective probabilities p and 1 − p. Now apply the AM-GM inequality to get

(a/b) p + (1 − p) = μ ≥ μ̃ = (a/b)^p. ♠


Proposition - 4.1.2 - (Hölder's inequality) Let p, q be numbers such that p, q > 1 and 1/p + 1/q = 1. If X, Y are random variables with E(|X|^p) < ∞ and E(|Y|^q) < ∞ then

E|XY| ≤ (E|X|^p)^{1/p} · (E|Y|^q)^{1/q},

with equality holding if and only if α|X|^p = β|Y|^q for some constants α, β. (In mathematics texts, the quantity (E|X|^p)^{1/p} is often represented as ||X||_p and is called the p-norm of X.)

Proof: If E|X|^p = 1 and E|Y|^q = 1, take a = |X|^p and b = |Y|^q and replace the p of Young's inequality by our 1/p. (The 1 − p of Young's inequality is then the 1/q of this proposition: either we work with p, 1 − p where 0 < p < 1, as we did in Young's inequality, or we work with p > 1, q > 1 with 1/p + 1/q = 1, as we do here. This is just a difference of notation.) This gives

a(1/p) + b(1/q) ≥ a^{1/p} b^{1/q},   i.e.,   |X|^p/p + |Y|^q/q ≥ |XY|.

Taking expectations of both sides we get

E|XY| ≤ (1/p) E|X|^p + (1/q) E|Y|^q = 1/p + 1/q = 1 = (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.

If E|X|^p = 0 or if E|Y|^q = 0 then one of the random variables is zero almost surely and the inequality is trivially true. Otherwise, define

U := X/(E|X|^p)^{1/p},   V := Y/(E|Y|^q)^{1/q}.

This gives E|U|^p = 1 and E|V|^q = 1. Thus, we have

E|UV| ≤ 1,   i.e.,   E|XY| ≤ (E|X|^p)^{1/p} · (E|Y|^q)^{1/q}. ♠
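A Monte Carlo sanity check of Hölder's inequality (an added sketch, assuming X, Y independent standard normals and the conjugate pair p = 3, q = 3/2):

import numpy as np

rng = np.random.default_rng(1)
p, q = 3.0, 1.5                            # conjugate exponents: 1/p + 1/q = 1
X = rng.standard_normal(10**6)
Y = rng.standard_normal(10**6)
lhs = np.mean(np.abs(X * Y))
rhs = np.mean(np.abs(X)**p)**(1/p) * np.mean(np.abs(Y)**q)**(1/q)
print(lhs <= rhs, lhs, rhs)                # True (up to Monte Carlo error)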

Exercise - 4.1.1 - Finish the proof of the above proposition by showing when the equality holds.

Proposition - 4.1.3 - (Minkowski's inequality) Let X, Y be two random variables with E|X|^p < ∞ and E|Y|^p < ∞. Then E|X + Y|^p < ∞ and

(E|X + Y|^p)^{1/p} ≤ (E|X|^p)^{1/p} + (E|Y|^p)^{1/p},   for any p ≥ 1.

Proof: The case p = 1 is trivial, so assume that p > 1. The triangle inequality, |X + Y| ≤ |X| + |Y|, gives that

|X + Y|^p ≤ (|X| + |Y|)^p ≤ (2 max{|X|, |Y|})^p = 2^p max{|X|^p, |Y|^p} ≤ 2^p (|X|^p + |Y|^p).

This gives that E|X + Y|^p < ∞. Since 1/p + 1/q = 1, we have 1 + p/q = p, which gives p = (p − 1)q. Note that E(|X + Y|^{p−1})^q = E(|X + Y|^p) < ∞. Now apply the Hölder inequality to get

E(|X + Y|^{p−1} |X|) ≤ (E|X|^p)^{1/p} (E(|X + Y|^{p−1})^q)^{1/q}.

Similarly, we get E(|X + Y|^{p−1} |Y|) ≤ (E|Y|^p)^{1/p} (E|X + Y|^p)^{1/q}. Now we put these pieces together as follows.

E|X + Y|^p = E(|X + Y|^{p−1} |X + Y|)
≤ E(|X + Y|^{p−1} |X|) + E(|X + Y|^{p−1} |Y|)
≤ (E|X|^p)^{1/p} (E|X + Y|^p)^{1/q} + (E|Y|^p)^{1/p} (E|X + Y|^p)^{1/q}
= (E|X + Y|^p)^{1/q} ( (E|X|^p)^{1/p} + (E|Y|^p)^{1/p} ).

Dividing by the first factor on the right hand side (and noting that 1/p = 1 − 1/q) gives

(E|X + Y|^p)^{1 − 1/q} ≤ (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}.

If E|X + Y|^p = 0 then there was nothing to prove in the first place. ♠

4.2 Jensen’s Inequality

We now explain Jensen's inequality in a bit more detail.

Definition - 4.2.1 - (Convex functions) Let f be a real valued function on an interval (α, β). We say f is convex on (α, β) if, for any subinterval [a, b] ⊆ (α, β), the graph of f on [a, b] lies on or below the line segment connecting (a, f(a)) and (b, f(b)). This is equivalent to saying that

f(θa + (1 − θ)b) ≤ θf(a) + (1 − θ)f(b),   (0 ≤ θ ≤ 1),

for all a, b ∈ (α, β), a < b.

HW1 Exercise - 4.2.1 - If φ is a convex function over (a, b) then show that φ is continuous.

Proposition - 4.2.1 - A function f defined over an interval (α, β) is convex if and only if for every random variable X_n taking values in (α, β) and having a finite range {a_1, a_2, · · · , a_n} we have f(E(X_n)) ≤ Ef(X_n).

Proof: The "if" part follows easily by taking n = 2. The "only if" part is obtained by repeated use of the definition and induction. We illustrate the induction argument for n = 3 only.

f(E(X)) = f(p_1 a_1 + p_2 a_2 + p_3 a_3)
= f( (1 − p_3) (p_1 a_1 + p_2 a_2)/(1 − p_3) + p_3 a_3 )
≤ (1 − p_3) f( (p_1/(1 − p_3)) a_1 + (p_2/(1 − p_3)) a_2 ) + p_3 f(a_3)
≤ (1 − p_3) ( (p_1/(1 − p_3)) f(a_1) + (p_2/(1 − p_3)) f(a_2) ) + p_3 f(a_3)
= p_1 f(a_1) + p_2 f(a_2) + p_3 f(a_3) = Ef(X).

The same argument works for higher values of n. ♠

Proposition - 4.2.2 - Let f be a convex function over an interval I and let X be a random variable taking values in I with E|X| < ∞ and E|f(X)| < ∞. Then there exists a sequence of simple random variables X_n taking values in I so that X_n → X, E(X_n) → E(X), E|f(X_n)| → E|f(X)|, as well as Ef(X_n) → Ef(X).

Proof: As in approximating any random variable by a sequence of simple random variables, we subdivide the real line R by considering the intervals [i/2^n, (i+1)/2^n) for i = 0, 1, 2, · · · , n2^n to cover [0, n), and the intervals [i/2^n, (i+1)/2^n) for i = −1, −2, · · · , −n2^n to cover the interval [−n, 0). We need only consider those subintervals that intersect I and ignore the rest. Over each such subinterval the value of X is approximated by a value taken by X_n as follows. Since |f| is a continuous function, over the closed interval [i/2^n, (i+1)/2^n] let a_n be the point where |f(t)| attains its minimum, and let X_n = a_n over this subinterval. When X falls in the interval [n, ∞), even though the minimum of |f(t)| may not exist, inf_{t ∈ [n,∞)} |f(t)| is still a finite value. (Draw all five possible shapes of f and then all eight or so corresponding shapes of |f(t)|. There are only two shapes, corresponding to those f which are nonnegative, monotone and convex with a finite asymptote on the decreasing side, for which the minimum of |f(t)| over [n, ∞) or (−∞, −n] does not occur at t = ±n. For all the other shapes, and large enough n, |f(t)| attains its minimum at t = ±n.) Consider the other six shapes first. In these cases, take X_n = n when X ≥ n and take X_n = −n when X < −n. Hence, by construction, |f(X_n)| ≤ |f(X)| for all n. Furthermore, |X_n − X| χ_{|X| < n} ≤ 1/2^n and

E(|X_n − X| χ_{|X| ≥ n}) = E(|n − X| χ_{|X| ≥ n}) ≤ 2E(|X| χ_{|X| ≥ n}) → 0.

Hence, not only X_n → X but also E|X_n − X| → 0. Furthermore, the continuity of f implies that f(X_n) → f(X). Since |f(X_n)| ≤ |f(X)| and E|f(X)| < ∞, the Lebesgue dominated convergence theorem gives that E|f(X_n)| → E|f(X)|. Now consider the two shapes in which f(t) ≥ 0 is convex and either entirely increasing or entirely decreasing. For simplicity consider the case when f is decreasing and let a be its right asymptote. Then consider another convex function g(t) = f(t) − a − 1. This g is one of the cases considered above, and hence we can find simple random variables X_n → X with E(X_n) → E(X) and Eg(X_n) → Eg(X), which is another way of saying Ef(X_n) → Ef(X). But since f ≥ 0, this is the same as saying that E|f(X_n)| → E|f(X)|. A similar argument takes care of an increasing, nonnegative convex function. Finally, to show that Ef(X_n) → Ef(X), just note that by our construction (since in all the constructions |f(X_n)| ≤ |f(X)|, or we use g instead of f, which has the same property),

|f(X_n) − f(X)| ≤ |f(X_n)| + |f(X)| ≤ 2|f(X)|.

The left side goes to zero and the right side has finite expectation. Hence, the Lebesgue dominated convergence theorem gives that E|f(X_n) − f(X)| → 0. ♠
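The dyadic discretization used in this proof is easy to experiment with. The sketch below (an addition, not from the notes) uses the simpler truncated choice X_n = min(max(floor(2^n X)/2^n, −n), n) rather than the minimizing choice of the proof, and watches E(X_n) and E f(X_n) converge for the convex function f(t) = t^2 with X standard normal.

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal(10**6)             # target random variable
f = lambda t: t**2                         # a convex function with E|f(X)| < ∞

for n in (1, 2, 4, 8):
    Xn = np.clip(np.floor(X * 2**n) / 2**n, -n, n)   # simple rv: dyadic grid, truncated to [-n, n]
    print(n, Xn.mean(), f(Xn).mean())      # E(Xn) → E(X) = 0 and E f(Xn) → E(X^2) = 1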

HW2 Exercise - 4.2.2 - Let f be a convex function over an interval I and let X be a random variable taking values in I with E|X| < ∞. Prove that Ef(X) is well defined by showing that E(f^−(X)) < ∞.

Proposition - 4.2.3 - (Convexity & Jensen's inequality) Let I be an interval. The following statements are equivalent:

• (i) f is convex over I,

• (ii) E(f(X)) ≥ f(E(X)) for every simple r.v. X on I,

• (iii) E(f(X)) ≥ f(E(X)) for every r.v. X on I with E|X| < ∞.

Proof: It is clear that (iii) implies (ii) since a simple random variable automatically has a finite expectation. Proposition (4.2.1) gives that (ii) implies (i). The statement that (i) implies (iii) is known as Jensen's inequality, which we now prove. So, let f be a convex function and let X be a random variable taking values in I with E|X| < ∞. Since Ef(X) is always well defined and E(f^−(X)) < ∞ (cf. Exercise (4.2.2)), the only troublesome possibility is that Ef(X) = ∞, in which case we have nothing to prove. So, assume that E|f(X)| < ∞. Take a sequence of simple random variables X_n taking values in I so that X_n → X, E(X_n) → E(X) and Ef(X_n) → Ef(X) (cf. Proposition (4.2.2)). Note that

f(E(X)) = f( lim_{n→∞} E(X_n) ),   by the construction of X_n,
= lim_{n→∞} f(E(X_n)),   by the continuity of f,
≤ lim_{n→∞} E(f(X_n)),   by Proposition (4.2.1),
= E(f(X)),   by Proposition (4.2.2).

This finishes the proof. ♠

Remark - 4.2.1 - (Careful) Now we consider more than one random variable at a time. When X, Y have an induced measure on R^2 (and, without loss of generality, ignoring the infinite values of X and Y), then for any real valued measurable function h over R^2 the change of variable formula (once again) gives that

E(h(X, Y)) = ∫_S h(X(s), Y(s)) dP(s) = ∫_{R^2} h(x, y) dP_{X,Y}(x, y) =: ∫_{R^2} h(x, y) dF_{X,Y}(x, y),


provided the first integral exists. In particular, if X, Y have finite means then by the linearity of the Lebesgue integral,

E(X + Y) = E(X) + E(Y).

If X, Y are two random variables (defined over the same probability space), the covariance of X and Y is given by the expectation

Cov(X, Y) = E((X − E(X))(Y − E(Y))),

which can be defined when E|X|, E|Y| and E|XY| exist as finite numbers, since the linearity of expectations allows us to write it as

Cov(X, Y) = E(XY) − E(X)E(Y).

Note that the existence of the first moments of X, Y does not imply that E|XY| < ∞. For instance, take U ∼ Uniform(0, 1) and then let X = Y = U^{−1/2}. It is clear that E|X| < ∞ but E|XY| = E(U^{−1}) = ∞. Similarly, assuming the existence of E|X|, E|Y| and E|XY| does not imply the existence of E(X^2). For instance, take U ∼ Uniform(0, 1) and let X = U^{−1/2} and Y = U^{−1/4}. In some books, covariance is defined only when Var(X) and Var(Y) are finite. However, we separate this case by defining a closely related concept, the correlation, for which we assume the existence of the second moments of X and Y. Correlation eliminates the units of measurement of X and Y and gives an absolute constant.

HW3 Exercise - 4.2.3 - Prove that E|Y| < ∞ if and only if for any constant c > 0, ∑_{n=1}^∞ P(|Y| ≥ cn) < ∞. Deduce that P(|Y| ≥ cn i.o.) = 0.

HW4 Exercise - 4.2.4 - (Lyapunov inequality) For any s > 1, deduce from the Hölder inequality that

E(|X|) ≤ (E|X|^s)^{1/s}.

Then deduce the Lyapunov inequality

(E|X|^r)^{1/r} ≤ (E|X|^s)^{1/s},   for any 0 < r < s.

HW5 Exercise - 4.2.5 - Let X, Y be two nonnegative random variables and let p ≥ 0 be a number. Prove the following:

E(X + Y)^p ≤ E(X^p) + E(Y^p) if p ∈ [0, 1],   and   E(X + Y)^p ≤ 2^{p−1}(E(X^p) + E(Y^p)) if p > 1.

HW6 Exercise - 4.2.6 - If X is a nonnegative random variable with distribution F and finite variance then show that

E(X^2) = ∫_0^∞ 2x (1 − F(x)) dx.


D Exercise - 4.2.7 - Let X be (any) random variable and define a sequence of discrete random variables

X_n = j/n if j/n < X ≤ (j+1)/n;   j = 0, ±1, ±2, · · ·.

If E(X_n) exists for some n = 1, 2, · · ·, then show that E(X_n) exists for every value of n and that lim_{n→∞} E(X_n) exists. [Hint: Writing X_n = X_N + (X_n − X_N) and verifying that X_n − X_N is bounded by 2, show that E(X_n) forms a Cauchy sequence.]

Exercise - 4.2.8 - Continuing Exercise (4.2.7), for any continuous random variable X for which E(X_n) exists for some n, show that E(X) = lim_{n→∞} E(X_n). That is, our old definition of expectation for a continuous random variable matches the limit obtained from Exercise (4.2.7).

Exercise - 4.2.9 - Redo Exercise (4.2.8) for discrete random variables.


Lecture 5

Classification of Distributions

First, let us recall some elementary facts about non-decreasing functions and left and right limits.

Definition - 5.0.2 - (Discontinuities of type I and type II) Let f be any real valued function over an interval I ⊆ R. For any x ∈ I, the symbols f(x+) and f(x−) are defined as

f(x+) := lim_{h↓0} f(x + h),   f(x−) := lim_{h↑0} f(x + h),

when the limits exist, and are called the right limit and the left limit of f at x respectively. We say f is right continuous at x if f(x+) = f(x) and f is left continuous at x if f(x−) = f(x). (If I = [a, b], then by default f is right continuous at b and left continuous at a.) The quantities

f(x+) − f(x),   f(x) − f(x−),   f(x+) − f(x−)

are called the right jump, the left jump and the jump of f at x respectively. A point x ∈ I is called a point of discontinuity of the first kind if f(x+) and f(x−) both exist and f is not continuous at x. All other points of discontinuity are considered to be of the second kind.

For example, let f be the Dirichlet function on [0, 1], i.e.,

f(x) := 1 if x is rational, x ∈ [0, 1], and f(x) := 0 otherwise.

Then all points are points of discontinuity of the second kind. For χ_{1/2}(t) the point t = 1/2 is a point of discontinuity of the first kind.

As a second example, let {a_n} be an enumeration of the rationals and let {b_n} be a sequence of non-negative numbers so that the series ∑_n b_n converges. Define a function

f(x) = ∑_{n=1}^∞ b_n χ_{[a_n, ∞)}(x);   x ∈ R.

It is easy to see that f is non-decreasing and the series is uniformly convergent. The following proposition implies that all points of discontinuity of f are of the first kind.

Proposition - 5.0.4 - (Basic facts for nondecreasing functions) Let f be non-decreasing on an interval I ⊆ R. Then the following results hold.

• (First kind discontinuities) All points of discontinuity, if any, of f are of the first kind. Therefore, we may write

f(x−) = sup_{t<x} f(t);   f(x+) = inf_{x<t} f(t).

• (Discontinuities are countable) The set D ⊆ I over which f is discontinuous is at most countably infinite.

• (Continuity) f is continuous at x if and only if f(x−) = f(x) = f(x+).

Proof: Since f is non-decreasing, for any 0 < h′ < h,

f(x) ≤ f(x + h′) ≤ f(x + h).

That is, f(x + h) decreases as h decreases and is bounded below by f(x). Thus the right limit must exist. A similar argument works for the left limit. Hence any non-decreasing function has discontinuities only of the first kind (i.e., jump discontinuities).

Note that when f is bounded,

D = ∪_{n=1}^∞ {x ∈ I : f(x+) − f(x−) ≥ 1/n} = ∪_{n=1}^∞ D_n.

Each D_n is a finite set, since it consists of those points where f has a jump of at least 1/n. (If D_n had infinitely many points, adding all these jumps would give an infinite total, which would violate the boundedness of f.) Hence D must be countable.

When f is not bounded, let E_n = [−n, n] and let D_n be the points of discontinuity of f which are in E_n. Since f, when restricted to the inverse image of E_n, is a bounded non-decreasing function, D_n is a countable set. Taking the union over all D_n gives that D must be countable as well.

Finally, for any ε > 0, we have f(x − ε) ≤ f(x) ≤ f(x + ε). Letting ε ↓ 0 gives that f(x−) ≤ f(x) ≤ f(x+). So, if f is continuous at x then the left side must equal the right side. Conversely, when the left side equals the right side, f must be continuous at x. ♠

Proposition - 5.0.5 - Let f_1 and f_2 be non-decreasing functions over R. If f_1(x) = f_2(x) for x in a dense subset D ⊆ R, then f_1 and f_2 must have the same points of jump (if any) and f_1(x) = f_2(x) at all x at which f_1 and f_2 are continuous.

Proof: Let x ∈ R be fixed and let t_n ↑ x with t_n ∈ D; then

f_1(x−) = lim_{t_n↑x} f_1(t_n) = lim_{t_n↑x} f_2(t_n) = f_2(x−).

Similarly, we see that f_1(x+) = f_2(x+). Hence, we see that

f_1(x+) − f_1(x−) = f_2(x+) − f_2(x−).

So f_1 and f_2 must have the same points of jump, and at the points of continuity we have f_1(x) = f_2(x). ♠

Remark - 5.0.2 - (Normalization of monotone functions) We cannot say anything about the values of the functions at the points of jump. An easy example is to take f_1(x) = χ_{[0,∞)}(x) and to let f_2(x) take the value zero over (−∞, 0) and the value one over (0, ∞); at the jump point, f_2 could take any fixed value between zero and one.

There are three commonly used ways of rectifying misbehavior at the points of jump (called normalizing) of a non-decreasing function. If f is a non-decreasing function, then we define

• (i) f_R(x) = f(x−), (Chinese & Eastern European way),

• (ii) f_A(x) = f(x+), (American & Western European way),

• (iii) f_F(x) = (f(x+) + f(x−))/2, (applied mathematicians' way).

Proposition - 5.0.6 - (Right continuous normalization) If f(x) is a non-decreasing function then f_A(x) is a non-decreasing and right continuous function.

Proof: If x < y then, picking any ε > 0 small enough so that x + ε < y, we have

f(x + ε) ≤ f(y) ≤ f(y+) = f_A(y).

Letting ε ↓ 0 gives that f_A(x) = f(x+) ≤ f_A(y). To show that f_A is right continuous, let x ∈ R be fixed. We want to show that

f_A(x+) = f_A(x).

We already know that f_A(x) ≤ f_A(x+) since f_A is non-decreasing. Now, for any ε > 0,

f_A(x+) ≤ f_A(x + ε) = f((x + ε)+) ≤ f(x + 2ε).

The first inequality follows from the fact that f_A is non-decreasing, the middle equality is just the definition of f_A, and the last inequality follows from the fact that f is non-decreasing. Letting ε drop to zero gives that

f_A(x+) ≤ f(x+) = f_A(x).

So f_A is right continuous. ♠


HW7 Exercise - 5.0.10 - Let F be a nondecreasing function which is bounded, so that a ≤ F(x) ≤ b for all x ∈ R. Prove that for any ε > 0, the total number of points of jump of F having jump size greater than ε is no more than (b − a)/ε.

Exercise - 5.0.11 - (Specification over a dense subset suffices) Let f be a non-decreasing function defined over a dense subset D ⊆ R. Let

f̃(x) = inf_{t ∈ D, t > x} f(t);   x ∈ R.

Show that f̃(x) ≥ f(x) on D, and that f̃ is non-decreasing and right continuous on R.

HW8 Exercise - 5.0.12 - Let f be a nondecreasing function defined over a dense subset D and let f̃ be its nondecreasing, right continuous version over R as in Exercise (5.0.11). If f is uniformly continuous on D then show that f̃ is also uniformly continuous on R. By an example show that the assumption of uniform continuity of f on D cannot be relaxed to just continuity on D.

The following is one of the main results of this section. It says that any distribution can be uniquely decomposed into a convex combination of a discrete and a continuous distribution.

Let a_1, a_2, · · · be an enumeration of all the points of discontinuity of a nondecreasing right continuous function F, along with their corresponding jumps

b_j := F(a_j) − F(a_j−) > 0;   j = 1, 2, · · ·.

Define functions D and C as follows:

D(x) := ∑_{j=1}^∞ b_j χ_{[a_j, ∞)}(x),   C(x) := F(x) − D(x),   x ∈ R.

It is clear that D(−∞) = 0 and D(∞) ≤ F(+∞). So the infinite series D(x) is uniformly convergent when F(∞) < ∞.

Theorem - 5.0.1 - (Decomposition into discrete and continuous distributions) For a distribution F let C, D be defined as above. Then

• (a) C(x) and D(x) are non-decreasing functions,

• (b) D is right continuous and C is continuous,

• (c) the above decomposition F(x) = C(x) + D(x) is unique,

• (d) every non-trivial (non-constant) bounded non-decreasing right continuous function F can be written as

F(x) = α F_d(x) + (F(+∞) − α) F_c(x),

for an α ∈ [0, F(+∞)], where F_d is a discrete probability distribution and F_c is a continuous probability distribution.


Proof: D is a non-decreasing function since x < y implies χ_{[a_j,∞)}(x) ≤ χ_{[a_j,∞)}(y) for all j. To show that C is non-decreasing, let x < y. Then

D(y) − D(x) = ∑_{x<a_j≤y} b_j = ∑_{x<a_j≤y} (F(a_j) − F(a_j−)) ≤ F(y) − F(x).   (0.1)

The inequality follows from the fact that F is non-decreasing and the sum of all the jumps within (x, y] is not greater than the total rise of F over the same interval. Rearranging the two extreme sides,

C(x) = F(x) − D(x) ≤ F(y) − D(y) = C(y).

So C is a non-decreasing function as well. Next, each χ_{[a_j,∞)}(x) is a right continuous function for each fixed j. The series

D(x) := ∑_{j=1}^∞ b_j χ_{[a_j,∞)}(x)

converges uniformly. Therefore D(x) is right continuous. Hence C(x) = F(x) − D(x) is also right continuous. Now we show that C is left continuous. For this, just note that for any x ∈ R we have

F(x) − F(x−) = D(x) − D(x−) = 0 if x ≠ a_j for any j, and = b_j if x = a_j for some j.

Hence, for any x ∈ R, we have

C(x) − C(x−) = (F(x) − F(x−)) − (D(x) − D(x−)) = 0.

To prove the uniqueness of the above decomposition, suppose there exist a continuous function K(x) and another function ∆(x) of the type

∆(x) := ∑_{j=1}^∞ β_j χ_{[α_j, ∞)}(x),   ∑_j |β_j| < ∞,

where the β_j are not zero and {α_j} is a sequence of real numbers, such that

F(x) = K(x) + ∆(x);   x ∈ R.

We want to prove that K must necessarily be C and ∆ must be D. Suppose D(x) ≠ ∆(x) for some x ∈ R. Then one of the following two possibilities must occur.

• The set {a_1, a_2, · · ·} ≠ {α_1, α_2, · · ·}. That is, there is a point t which is in one set and not the other, so that D(t) − D(t−) ≠ ∆(t) − ∆(t−). Note that the left and right limits of ∆ must exist since its series converges uniformly and the χ_{[α_j, ∞)}(x) are non-decreasing.

• The two sets are equal, i.e., {a_1, a_2, · · ·} = {α_1, α_2, · · ·}. In this case relabel the α_j so that both D and ∆ have the same points of jump but, for some t, the jump sizes are different. That is, D(t) − D(t−) ≠ ∆(t) − ∆(t−).

Hence, in either case, we have a point t so that

(D(t) − D(t−)) − (∆(t) − ∆(t−)) ≠ 0.

Just note that

K(x) − C(x) = F(x) − ∆(x) − F(x) + D(x) = D(x) − ∆(x).

This gives that

0 = (K(t) − K(t−)) − (C(t) − C(t−)) = (K(t) − C(t)) − (K(t−) − C(t−))
= (D(t) − ∆(t)) − (D(t−) − ∆(t−)) = (D(t) − D(t−)) − (∆(t) − ∆(t−)) ≠ 0.

This contradiction proves the result. To prove the last part, we already know that F(x) = D(x) + C(x). If D(∞) = α ∈ (0, F(∞)), then we may write

F(x) = α (D(x)/α) + (F(∞) − α) (C(x)/(F(∞) − α)) = α F_d(x) + (F(∞) − α) F_c(x).

If D(∞) = 0, then we may take α = 0 and

F(x) = F(∞) (C(x)/F(∞)) = F(∞) F_c(x),

since D(x) = 0 for all x. And if D(∞) = F(∞) then we may take α = F(∞) and

F(x) = α (D(x)/α) = α F_d(x),

since C(x) = 0 for all x. ♠
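To make the decomposition concrete, here is a small numerical illustration (an addition to the notes, using an assumed example distribution): take F = 0.3 χ_{[0,∞)} + 0.2 χ_{[1,∞)} + 0.5 (1 − e^{−x}) χ_{[0,∞)}, so that D collects the two jumps, C = F − D, α = D(∞) = 0.5, F_d = D/α and F_c = C/(1 − α).

import numpy as np

# Assumed example: a mixed distribution with jumps at 0 and 1 plus an Exp(1) part.
def F(x):
    x = np.asarray(x, dtype=float)
    jumps = 0.3 * (x >= 0) + 0.2 * (x >= 1)
    cont = 0.5 * (1.0 - np.exp(-np.maximum(x, 0.0))) * (x >= 0)
    return jumps + cont

def D(x):                                   # discrete (pure jump) part
    x = np.asarray(x, dtype=float)
    return 0.3 * (x >= 0) + 0.2 * (x >= 1)

x = np.linspace(-1.0, 5.0, 13)
C = F(x) - D(x)                             # continuous part
alpha = 0.5                                 # D(∞), the total jump mass
Fd, Fc = D(x) / alpha, C / (1.0 - alpha)    # the probability distributions F_d and F_c
print(np.allclose(F(x), alpha * Fd + (1 - alpha) * Fc))   # True: F = αF_d + (1−α)F_c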

HW9 Exercise - 5.0.13 - Let F be a distribution function and let {a_j} be the collection of all (if any) points of jump. Prove that

lim_{ε↓0} ∑_{a_j ∈ (x−ε, x)} (F(a_j) − F(a_j−)) = 0.

Does the conclusion remain true if the interval (x − ε, x) is replaced by (x − ε, x] in the above sum?

HW10 Exercise - 5.0.14 - Let F be a distribution function. A point x is called a point of support of F if for every ε > 0 we have F(x + ε) − F(x − ε) > 0. The collection of all such points is the support of F. Show that if F is a non-decreasing right continuous function then the following results hold.

1. Each point of jump of F belongs to the support of F.

2. Each isolated point of the support of F is a point of jump of F.

3. The support of a distribution function is a closed set.

4. Provide an example of a discrete distribution whose support is the whole real line.

5. The support E of a continuous distribution is a perfect set (i.e., each point of E is a limit point of E).

5.1 Absolute Continuity & Singularity

To see how to decompose a distribution into "smooth" and "non-smooth" ones we need to introduce a new concept.

Definition - 5.1.1 - (Absolutely continuous & singular functions) A real-valued function G defined over R (or a closed interval [a, b]) is called absolutely continuous if for any ε > 0 there exists a δ > 0 such that for any finite collection of disjoint subintervals (x_i, y_i), i = 1, 2, · · · , n, of its domain with total length ∑_{i=1}^n (y_i − x_i) < δ, we have

∑_{i=1}^n |G(y_i) − G(x_i)| < ε.

A real-valued function G is called singular if G′(t) = 0 for "almost all"¹ t.

Example - 5.1.1 - By taking n = 1 in the above definition we immediately see that any absolutely continuous function is automatically continuous.

Any function G having the property that |G(x) − G(y)| ≤ K|x − y| for a fixed constant K is called a Lipschitz function. By taking δ = ε/K in the definition of absolute continuity, we see that any Lipschitz function is absolutely continuous.

It is worth spending several minutes staring at the definitions of absolute continuity and singularity and soaking in what they are trying to say. To see their proper context, consider the nondecreasing and continuous function I(x) := x. This function actually is the distribution of the Lebesgue measure on R; however, we don't need to worry about that here.² It looks like a ramp of slope one and goes "up" forever. This distribution is the poster child of all the "smooth" distributions. On the other extreme, to see a prototype of a "non-smooth" distribution, consider H(x) = [x], where [x] equals the greatest integer less than or equal to x. It looks like stairs that go up forever. These two distributions are used to compare all other distributions. All our discrete distributions are "related to" the stair function H. All the absolutely continuous distributions are "related to" the ramp function I(x). To see what the phrase "related to" means, let us see the definition of absolute continuity in the context of distributions.

¹The concept of almost all is the same as the concept of almost sure. Here it means that the set where the property fails has length zero, abbreviated as a.e.[F]. If F is a non-decreasing function then the F-length of an interval (x, y] is the amount F(y) − F(x). When F(x) = x the F-length is the same as the ordinary length of the interval. If F is a probability distribution then it is just the probability of the interval.

²A measure, µ, is defined just like the P of a probability space. The only thing that may be different is that µ(Ω) need not equal 1, while P(Ω) = 1 always.


Definition - 5.1.2 - (Absolutely continuous distributions) Let F, G be any two distributions over R. We say G is absolutely continuous with respect to F if for any ε > 0 there exists a δ > 0 such that for any finite collection of disjoint subintervals (x_i, y_i), i = 1, 2, · · · , n, of its domain with ∑_{i=1}^n (F(y_i) − F(x_i)) < δ, we have

∑_{i=1}^n |G(y_i) − G(x_i)| < ε.

When G is absolutely continuous with respect to F we denote this by G ≪ F. When we take F(x) = I(x) = x and G ≪ I, we often simply say that G is absolutely continuous, without mentioning that it is this I(x) = x that we used.³

Example - 5.1.2 - (Continuous random variables) All absolutely convergent integrals give rise to absolutely continuous distributions. Suppose g is the density of a continuous random variable X, with distribution

G(t) = ∫_{−∞}^t g(x) dx.

Note that for any x_i < y_i ≤ x_{i+1} < y_{i+1}, i = 1, 2, · · · , n,

∑_{i=1}^n (G(y_i) − G(x_i)) = ∑_{i=1}^n ∫_{x_i}^{y_i} g(x) dx.

By the mean value theorem for integrals, there exist ξ_i ∈ (x_i, y_i), i = 1, 2, · · ·, such that

∑_{i=1}^n (G(y_i) − G(x_i)) = ∑_{i=1}^n g(ξ_i)(y_i − x_i).

Now if g happens to be bounded, say by K, then by taking δ = ε/K we see that G is absolutely continuous. For instance, if X ∼ Exp(λ), then its distribution is absolutely continuous, since its density is bounded by K = λ. It turns out that boundedness of g is not needed; only integrability of g is needed. Hence the cdfs G of our continuous random variables are all absolutely continuous. Actually, it is better to say it backwards. That is, we call a random variable continuous if its distribution is absolutely continuous with respect to I(x) = x. The usual normal, beta, gamma, exponential and chi-square densities are all examples of this type.

Example - 5.1.3 - (Integer valued random variables) All absolutely convergent series also give rise to absolutely continuous distributions, but in this case the comparison is made with the stair function H(x) = [x]. Suppose the density of an integer valued random variable X is P(X = j) = p_j, j = 0, ±1, ±2, · · ·, with ∑_j p_j = 1. The distribution of X is

G(t) = ∑_{j: j ≤ t} p_j,   t ∈ R.

³Our earlier definition of absolute continuity used this special distribution I(x) = x instead of F. Now we are allowed to use any distribution for F instead.


Note that for any −∞ < x_i < y_i ≤ x_{i+1} < y_{i+1} < ∞, i = 1, 2, · · · , n, only finitely many integers live inside each interval (x_i, y_i], i = 1, 2, · · · , n. Also,

∑_{i=1}^n (G(y_i) − G(x_i)) = ∑_{i=1}^n ∑_{j: j ∈ (x_i, y_i]} p_j
≤ ∑_{i=1}^n (number of integers in (x_i, y_i]), since p_j ≤ 1,
≤ ∑_{i=1}^n (H(y_i) − H(x_i)).

Hence, by taking δ = ε we see that G ≪ H. All our integer valued random variables are absolutely continuous with respect to the distribution H(x) = [x]. The usual binomial, geometric, hypergeometric, Poisson and negative binomial densities are examples of this type. Similarly, all discrete distributions are absolutely continuous with respect to a "counting distribution" (see Exercise (5.1.5) for more).

Remark - 5.1.1 - (Radon-Nikodym theorem) So far we have not brought the concept of a derivative into the picture. Recall the fundamental theorem of calculus, which says that if

G(t) = ∫_a^t g(x) dx,

then G is differentiable and G′(t) = g(t). There is a remarkable result due to Radon and Nikodym which says that absolute continuity, G ≪ F, holds if and only if there exists an integrable function g so that

G(t) = ∫_{−∞}^t g(x) dF(x),   and   (dG/dF)(t) = g(t), a.e.[F].

When F(x) = x we see the resemblance with the fundamental theorem of calculus. When F(x) = [x] the integral is interpreted as a series. The function g is called the density of G with respect to F and is denoted (dG/dF)(t) = g(t).

Example - 5.1.4 - (Singularity and the Cantor function) Note that if G ≪ F and F is continuous at t, then so is G. Conversely, if G(t) − G(t−) > 0 and F(t) − F(t−) = 0 then it is impossible to have G ≪ F. So, if G is the distribution of a discrete random variable then it cannot happen that G ≪ I, where I(x) = x. In fact, the total length, as measured by I, of the support of G is zero when G is a discrete probability distribution. In this sense we say that G is singular with respect to I and write G ⊥ I. So, all discrete distributions are singular with respect to I(x) = x.

Can a non-discrete distribution be singular with respect to I(x) = x? The answer is yes, and there are many examples. The most famous of them all was given by Georg Cantor over a hundred years ago.

Define f : [0, 1] → [0, 1] as follows. Let f = 1/2 over [1/3, 2/3]; then let f = 1/2^2 on [1/3^2, 2/3^2] and f = 3/2^2 on [7/3^2, 8/3^2]. Then make f take the values 1/2^3, 3/2^3, 5/2^3, 7/2^3 on the middle third closed subintervals of the previous four intervals over which f was as yet undefined. Continuing in this way we get the Cantor function f. One can prove that f is uniformly continuous and f′ = 0 almost everywhere on [0, 1]. Furthermore f is singular with respect to I(x) = x on [0, 1].

By the way, if we throw away the middle third open subintervals successively, as we did while constructing f, the remaining subset of [0, 1] is called the Cantor set. The reader can verify that the total length of the set that we threw away is 1. Hence the length of the Cantor set is 0, yet it happens to have uncountably many points. An easy way to describe the Cantor set is that it consists of all those t ∈ (0, 1] for which a ternary expansion has no 1 in it. That is, t = 0.a_1a_2a_3 · · · (base 3) is in the Cantor set if each a_i ∈ {0, 2}, i = 1, 2, · · ·.

Exercise - 5.1.1 - (Cantor set) Show that the Cantor set is uncountable. Prove that the Cantor function, f, is uniformly continuous and f′ = 0 almost everywhere on [0, 1], making f singular with respect to I(x) = x on [0, 1].

Remark - 5.1.2 - In the above examples, the set of nonnegative integers was used just for convenience's sake. Any countable set ∆ could have been used instead (putting the counting distribution over ∆). This set ∆, in terms of random variables, is nothing but the range of the random variable X. The above examples show that when the range ∆ of X is a countable set, it is enough to give P(X = x) for each x ∈ ∆ to fully characterize the distribution G of X. In this case, G is singular with respect to I(x) = x. However, G is absolutely continuous with respect to the counting distribution, F_c, which assigns the value 1 to each element of ∆ and zero elsewhere. The Radon-Nikodym derivative of G with respect to F_c is precisely P(X = x) for x ∈ ∆ and zero otherwise. This derivative, P(X = x), x ∈ ∆, of G with respect to F_c is the density of X.

Remark - 5.1.3 - (Discrete & continuous singular distributions) Another example of a singular distribution (with respect to F(x) = x) can be constructed as follows. Let p_n > 0 be any sequence of positive numbers so that ∑_n p_n = 1. Let {b_n} be an enumeration of the set of rationals. Define a function G(x) by

G(x) = ∑_{n=1}^∞ p_n χ_{(−∞, x]}(b_n),   x ∈ R.

Note that G(−∞) = 0, G(∞) = 1 and G(x) is nondecreasing. One can show that it is also a right continuous function but singular with respect to F(x) = x. Note that the set of discontinuities of G forms a dense set in R.

Another example can be constructed as follows. Let ρ > 0 be a fixed number. Define a function G_ρ(x) on [0, 1] by G_ρ(0) = 0 and, for x ∈ (0, 1], by

G_ρ(x) = ∑_{i=0}^∞ ρ^i (1 + ρ)^{−a_i},

where x is written in its nonterminating binary expansion x = ∑_{i=0}^∞ 2^{−a_i}, with a_0 < a_1 < · · · positive integers and a_i > i. One can show that this is a continuous singular distribution.⁴

⁴The interested reader may consult Lajos Takacs (1978), "An increasing continuous singular function", Amer. Math. Monthly, vol. 85, no. 1, pp. 35-37.


Remark - 5.1.4 - (Decomposition into absolutely continuous and singular parts) Now we spend some time showing how to decompose a continuous distribution function into continuous singular and absolutely continuous components. Recall that by the Radon-Nikodym theorem, if G is absolutely continuous with respect to I(x) = x, then there exists a nonnegative integrable function g so that

G(t) = ∫_{(−∞, t]} g(x) dx,   for all t ∈ R.

So the function g is the density of G.

On the other hand, if G is singular with respect to I(x) = x then G must be concentrated on a set whose length is zero. For instance, any discrete distribution must be singular with respect to I(x) = x. We may also have a continuous distribution which is singular, as the above example of the Cantor function shows. How do we separate a singular portion from its distribution?

To construct this, let G be any probability distribution. It is a fact from real variables theory that G(x) is always differentiable a.e.[I], where I(x) = x, with a nonnegative derivative G′(t) = g(t). Now for any x < y, Fatou's lemma gives that

∫_{(x,y]} G′(t) dt = ∫_{(x,y]} lim_{n→∞} (G(t + 1/n) − G(t)) / (1/n) dt
≤ lim inf_n n ∫_{(x,y]} (G(t + 1/n) − G(t)) dt,   Fatou's lemma,
= lim inf_n n ( ∫_{(y, y+1/n]} G(t) dt − ∫_{(x, x+1/n]} G(t) dt ),   cancel the common area,
≤ lim inf_n n ( (1/n) G(y + 1/n) − (1/n) G(x) ),   largest minus smallest,
= G(y) − G(x),   right continuity.

Hence we see that for any distribution G we always have

G(y) − G(x) ≥ ∫_{(x,y]} G′(t) dt.

And of course, should the ≥ sign become equality, then G ≪ I, where I(x) = x, and the density G′(t) = (dG/dI)(t) (now denoted dG/dt instead) becomes the Radon-Nikodym derivative.

Note that for the Lebesgue integral the fundamental theorem of calculus is not guaranteed. This is a small price we pay for using Lebesgue integrals. To see how we decompose a continuous distribution into its absolutely continuous and singular parts, let us define

G_ac(x) := ∫_{(−∞, x]} G′(t) dt,   G_s(x) := G(x) − G_ac(x);   x ∈ R.

By the Radon-Nikodym theorem G_ac is an absolutely continuous distribution (with respect to I(x) = x). Since, for each x < y,

G_s(y) − G_s(x) = G(y) − G_ac(y) − G(x) + G_ac(x) = G(y) − G(x) − (G_ac(y) − G_ac(x)) = G(y) − G(x) − ∫_{(x,y]} G′(t) dt ≥ 0,

we see that G_s(x) is also a nondecreasing function. Since G_s′(x) = G′(x) − G_ac′(x) = G′(x) − G′(x) = 0 a.e., we see that G_s(x) is singular with respect to I(x) = x (i.e., a singular distribution). If G is a continuous distribution then the corresponding G_s will also be continuous but still singular.

Hence, any distribution can be decomposed into its absolutely continuous and singular parts. Combining this with our earlier decomposition results, we see that any distribution can be decomposed into discrete, absolutely continuous and continuous singular components. Such a decomposition becomes unique if we insist that the component distributions be probability distributions.

HW11 Exercise - 5.1.2 - Show that if the support of a distribution G has length (Lebesgue measure) zero then G is singular. Give an example of a singular distribution whose support is the whole real line.

HW12 Exercise - 5.1.3 - Let G be a probability distribution that can be written as

G(t) = ∫_{−∞}^t g(x) dx,   t ∈ R,

for a continuous function g. Show that G′(t) = g(t) ≥ 0 for all t ∈ R.

Exercise - 5.1.4 - Let X and Y be two random variables (perhaps coming from two different probability spaces) having the same distribution over R. For any measurable function h : R → R prove that h(X) and h(Y) also have the same distribution.

Exercise - 5.1.5 - A discrete distribution is absolutely continuous with respect to a counting distribution.

Exercise - 5.1.6 - (Integration by parts) Let f and g be two non-decreasing functions over R with the corresponding generated Borel measures F and G respectively. Show that the following integration by parts formula holds:

∫_[a,b] f(x−) dG(x) + ∫_[a,b] g(x+) dF(x) = F(b)G(b) − F(a−)G(a−).

Also show that

∫_[a,b] f(x+) dG(x) + ∫_[a,b] g(x−) dF(x) = F(b)G(b) − F(a−)G(a−).

Putting the two together we also get

F(b)G(b) − F(a−)G(a−) = ∫_[a,b] (f(x+) + f(x−))/2 dG(x) + ∫_[a,b] (g(x+) + g(x−))/2 dF(x).


Exercise - 5.1.7 - (Integration by parts revisited) Let f and g be two non-decreasing functions over R with the corresponding generated Borel measures F and G respectively. For any −∞ < a < b < ∞, prove the following integration by parts formula:

∫_(a,b] f(x−) dG(x) + ∫_(a,b] g(x+) dF(x) = F(b)G(b) − F(a)G(a),

where ∫_E f dG stands for the integral with respect to the measure represented by G. Then deduce that

∫_(a,b] f(x+) dG(x) + ∫_(a,b] g(x−) dF(x) = F(b)G(b) − F(a)G(a),

and hence deduce that

F(b)G(b) − F(a)G(a) = ∫_(a,b] (f(x+) + f(x−))/2 dG(x) + ∫_(a,b] (g(x+) + g(x−))/2 dF(x).

Exercise - 5.1.8 - (Functions of bounded variation & Jordan decomposition) A real-valued function f is said to be of bounded variation, denoted f ∈ BV, if f = F − G for some nondecreasing functions F, G. Let P = {a = x_0, x_1, x_2, · · · , x_n = b} be a partition of [a, b]. If f is a real valued function over [a, b], we define

P^+(f) := ∑_{i=1}^n (f(x_i) − f(x_{i−1}))^+,   P^−(f) := ∑_{i=1}^n (f(x_i) − f(x_{i−1}))^−,

and P(f) := ∑_{i=1}^n |f(x_i) − f(x_{i−1})|, where x^+, x^− stand for the positive and the negative parts of x. Note that P(f) = P^+(f) + P^−(f). The positive, negative and total variation of f are respectively defined by

V^+_[a,b](f) := sup_P P^+(f),   V^−_[a,b](f) := sup_P P^−(f),   V_[a,b](f) := sup_P P(f).

(a) Prove the following decomposition of f, known as the Jordan decomposition:

f(x) = f(a) + V^+_[a,x](f) − V^−_[a,x](f);   x ∈ [a, b],

where the two functions on the right are the positive and the negative variations of f.
(b) Is it true that V^+_[a,x](f) and V^−_[a,x](f) are right continuous?
(c) Show that f ∈ BV[a, b] if and only if V_[a,b](f) < ∞.
(d) Can any comparison be made between the spaces BV[a, b] and BV(R)?

Exercise - 5.1.9 - (Distributions on R^k) Consider R^2 for simplicity. If G(x, y) is non-decreasing and right continuous in both of its variables, then in order for it to be called a bivariate distribution it must obey the extra condition that its G-area (an analog of our concept of G-length) of a rectangle (a, b] × (c, d], defined as

G-area((a, b] × (c, d]) = G(b, d) − G(a, d) − G(b, c) + G(a, c),

is nonnegative. If G is a probability distribution, the G-area is the probability of the rectangle. Show that all of the following functions are distributions and give the same G-area to any rectangle [a, b] × [c, d]:

G_1(x, y) = xy,
G_2(x, y) = xy + x,
G_3(x, y) = xy + y,
G_4(x, y) = xy + x + y,
G_5(x, y) = xy + x + y + 13.

Is any of them a probability distribution? Show that if G(x, y) is a distribution then, for any fixed constant k and a fixed interval A, the new function

H(x, y) := G(x, y) + G-area(A × (k, y])

is also a distribution and gives the same H-area to any rectangle as G does. (Hence H is another "version" of G.) Finally, if G is a bounded bivariate distribution, explain why, without affecting the G-areas of rectangles, we may conveniently define a version of G to be

H(x, y) := G-area((−∞, x] × (−∞, y]);   (x, y) ∈ R^2.


Lecture 6

Conditional Distributions

When A, B are two events, the conditional probability of A given B, denoted P(A|B), is defined as

P(A|B) = P(A ∩ B) / P(B),   when P(B) > 0.

Bayes' theorem shows how to flip the positions of A and B. That is,

P(B|A) = P(A|B) P(B) / P(A),   when P(A) > 0, P(B) > 0.

The theorem of total probability (TTP) states that if C_1, C_2, · · · forms a partition of the sample space then

P(A) = ∑_{i=1}^∞ P(A|C_i) P(C_i),   when P(C_i) > 0 for all i.

Example - 6.0.5 - (Memoryless property) For X ∼ Geometric(p), prove that P(X ≥ j + k | X ≥ j) = P(X ≥ k), for any nonnegative integers j, k. Does this property hold when j, k are not necessarily integers?

When j, k are integers, we have

P(X ≥ j + k | X ≥ j) = P({X ≥ j + k} ∩ {X ≥ j}) / P(X ≥ j) = P(X ≥ j + k) / P(X ≥ j) = (1 − p)^{j+k} / (1 − p)^j = P(X ≥ k).

This property may fail when j, k are not integers. For instance, j = k = 1/2 gives a counterexample.
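A quick simulation check of the memoryless property (an added illustration, using the convention P(X ≥ k) = (1 − p)^k from the computation above):

import numpy as np

rng = np.random.default_rng(3)
p, j, k = 0.3, 2, 5
X = rng.geometric(p, size=10**6) - 1        # shifted so that P(X >= k) = (1 - p)^k
lhs = np.mean(X[X >= j] >= j + k)           # estimate of P(X >= j + k | X >= j)
rhs = np.mean(X >= k)                       # estimate of P(X >= k)
print(lhs, rhs, (1 - p)**k)                 # all ≈ 0.168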

The above notions involving probabilities carry over to densities. Suppose (X, Y) have a bivariate density f(x, y). The conditional density of X given Y = y is taken to be

f_{X|Y=y}(x) = f_{X,Y}(x, y) / f_Y(y),   x ∈ R,

both in the discrete and the continuous cases. A conditional density acts just like any usual density as a function of x. However, as a function of y it does not act like a density. Bayes' theorem in density format is

f_{Y|X=x}(y) = f_{X|Y=y}(x) f_Y(y) / f_X(x),   y ∈ R.

The theorem of total probability, in density format, is

f_X(x) = ∫_{−∞}^∞ f_{X|Y=y}(x) f_Y(y) dy.

Example - 6.0.6 - (Conditional densities in the bivariate normal case) Let (X, Y) ∼ BVN((μ_1, μ_2), (σ_1^2, σ_2^2), ρ). To make our life a bit simpler, let us first take μ_1 = μ_2 = 0 and σ_1 = σ_2 = 1. In this case, the marginal density of Y is obtained by simply integrating out the unwanted variable x. That is,

f_Y(y) = ∫_{−∞}^∞ f(x, y) dx = (1/(2π√(1 − ρ^2))) ∫_{−∞}^∞ exp{ −[x^2 − 2ρxy + y^2] / (2(1 − ρ^2)) } dx
= (e^{−(y^2 − (ρy)^2)/(2(1 − ρ^2))} / (2π√(1 − ρ^2))) ∫_{−∞}^∞ exp{ −[x^2 − 2ρxy + (ρy)^2] / (2(1 − ρ^2)) } dx
= (e^{−y^2/2} / √(2π)) · ( (1/√(2π(1 − ρ^2))) ∫_{−∞}^∞ exp{ −[x − ρy]^2 / (2(1 − ρ^2)) } dx )
= e^{−y^2/2} / √(2π).

Here we used the fact that the area under the normal density N(ρy, (1 − ρ^2)) is one. Hence we see that Y ∼ N(0, 1). Therefore, the conditional density of X given Y = y is

f_{X|Y=y}(x) = f(x, y) / f_Y(y) = [ (1/(2π√(1 − ρ^2))) exp{ −[x^2 − 2ρxy + y^2] / (2(1 − ρ^2)) } ] / [ (1/√(2π)) e^{−y^2/2} ]
= (1/√(2π(1 − ρ^2))) exp{ −[x − ρy]^2 / (2(1 − ρ^2)) }.


Hence, we see that the conditional density of X given Y = y is N(ρy, (1 − ρ^2)). More generally, when (U, V) ∼ BVN((μ_1, μ_2), (σ_1^2, σ_2^2), ρ), the reader can verify (along the above lines) that the marginal density of U is N(μ_1, σ_1^2), the marginal density of V is N(μ_2, σ_2^2), and the conditional density of U given V = v is

U | V = v ∼ N( μ_1 + (ρσ_1/σ_2)(v − μ_2),  σ_1^2(1 − ρ^2) ).   (0.1)

Remark - 6.0.5 - It should be kept in mind that the conditional density is defined, in the continuous random variable case, even though the given event has zero probability. We should not use the usual rules of conditional probability on the conditioning events in such situations. If we still apply those rules, we may run into contradictory results. Such contradictions are called Borel paradoxes. Here is one such paradox.

Example - 6.0.7 - (A Borel paradox) Let X, Y be iid N(0, 1). Let us find the conditional density of X given X = Y. (Note that P(X = Y) = 0.)

Method 1. Let Z = X − Y and note that we want to find the conditional density of X given that Z = 0. This requires us to find

f_{X|Z=0}(x) = f_{X,Z}(x, 0) / f_Z(0).

It is not difficult to see that Z ∼ N(0, 2). The absolute value of the Jacobian is 1. The joint density of X and Z is

f_{X,Z}(x, z) = (1/√(2π)) e^{−x^2/2} · (1/√(2π)) e^{−(x−z)^2/2}.

Thus, the required conditional density is f_{X|Z=0}(x) = (1/√π) e^{−x^2}, which is N(0, 1/2).

Method 2. Consider the random variable Z = X/Y. Now X = Y if and only if Z = 1. We now want to find the conditional density of X given X/Y = 1,

f_{X|Z=1}(x) = f_{X,Z}(x, 1) / f_Z(1).

By using the change of variable method, the absolute value of the Jacobian of the transformation is |x|/z^2. Therefore,

f_{X,Z}(x, z) = (|x|/z^2) f_{X,Y}(x, x/z) = (|x|/z^2) (1/(2π)) e^{−x^2(1 + z^{−2})/2}.

Integrating out x, the marginal density of Z is Cauchy(0, 1). Thus, the required conditional density is

f_{X|Z=1}(x) = ( |x| (1/(2π)) e^{−x^2} ) / (1/(2π)) = |x| e^{−x^2},

which is not even a normal density. This example underscores the importance of keeping the conditioning event as specified, in order to avoid arbitrary results.


6.1 Conditional Expectations

A conditional expectation is just an ordinary expectation, but obtained from a conditional density. For instance, in the case of continuous densities,

E(X|Y = y) = ∫_{−∞}^∞ x f_{X|Y=y}(x) dx,   (conditional expectation),
E(X^2|Y = y) = ∫_{−∞}^∞ x^2 f_{X|Y=y}(x) dx,   (conditional second moment),
Var(X|Y = y) = E(X^2|Y = y) − (E(X|Y = y))^2,   (conditional variance).

In the discrete case, we use summations instead of integrals.

Remark - 6.1.1 - So far we have defined conditional distributions and conditional expectations for the discrete and continuous cases. To define conditional expectations in the general case, and to stay away from Borel paradoxes, one really needs the concept of differentiation on abstract spaces, in particular a theorem due to Johann Radon and Otton Nikodym. For the time being we will bypass this detail.

Example - 6.1.1 - (Convolutions) Here we show that sometimes the theorem of total probability can also be used for obtaining new distributions. If X, Y are two (say continuous) random variables with a joint density f(x, y), and the marginal density of Y is denoted g(y), then the theorem of total probability can be used to find the density of their sum, Z = X + Y. Indeed,

P(Z ≤ z) = P(X ≤ z − Y) = ∫_{−∞}^∞ P(X ≤ z − y | Y = y) g_Y(y) dy.

Differentiating with respect to z gives the density of Z (which is a slightly more general version of the convolution):

f_Z(z) = ∫_{−∞}^∞ f_{X|Y=y}(z − y) g_Y(y) dy = ∫_{−∞}^∞ f(z − y, y) dy.

If the rvs are discrete, the integral is replaced by a summation. If X, Y are independent with their respective densities h(x) and g(y), then the resulting density, f_Z, is the convolution of h and g. The distribution of W = X − Y can be obtained along the above lines as well. What about the distribution of R = X/Y or S = XY? The same theorem of total probability can be used to obtain

f_R(z) = ∫_{−∞}^∞ |y| f(zy, y) dy,   f_S(z) = ∫_{−∞}^∞ (1/|y|) f(z/y, y) dy.

So far we have seen the usual four basic operations of algebra +, −, ÷ and × performed on random variables and how to obtain their corresponding distributions. One natural question is: what about the composition operation? Conditioning can handle this operation as well. See Example (6.1.3).
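As a numerical illustration of the convolution formula (an addition to the notes), take X, Y independent Exp(1); then f_Z(z) = ∫ f(z − y, y) dy should reproduce the Gamma(2, 1) density z e^{−z}:

import numpy as np

# f_Z(z) = ∫ f(z - y, y) dy for independent X, Y ~ Exp(1); the exact answer is z e^{-z}.
def f_joint(x, y):
    return np.where((x > 0) & (y > 0), np.exp(-x) * np.exp(-y), 0.0)

dy = 1e-4
y = (np.arange(50000) + 0.5) * dy           # midpoint grid on [0, 5]
for z in (0.5, 1.0, 2.0, 4.0):
    fz = np.sum(f_joint(z - y, y)) * dy     # the convolution integral at the point z
    print(z, fz, z * np.exp(-z))            # numerical vs exact Gamma(2, 1) density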


Remark - 6.1.2 - (Conditional expectation & conditional variance random variables) It should be noted that E(X|Y = y) is a real number. However, this number may change for different values of y. For instance, E(X|Y = 3) may be different from E(X|Y = 13.4). That is, E(X|Y = y) is a function of y. If, in this function of y, we replace the dummy variable y by the random variable Y, then we obtain the conditional expectation random variable, denoted E(X|Y). This random variable has its own distribution. Similarly, the conditional variance random variable Var(X|Y) is obtained by replacing the dummy variable y by the random variable Y in Var(X|Y = y).

Usually we are not interested in the distributions of the random variables E(X|Y) and Var(X|Y). However, these two very special random variables have two fundamental properties that we now recall. The theorem of total expectation (TTE) says that

E(E(X|Y)) = E(X).

The theorem of total variance (TTV) says that

E(Var(X|Y)) + Var(E(X|Y)) = Var(X).

Example - 6.1.2 - (Conditional expectations in the bivariate normal case) When (X, Y) ∼ BVN((μ_1, μ_2), (σ_1^2, σ_2^2), ρ), recall from (0.1) that the conditional density of X given Y = y is again normal, with

E(X|Y = y) = μ_1 + (ρσ_1/σ_2)(y − μ_2),   Var(X|Y = y) = σ_1^2(1 − ρ^2).   (1.2)

In this case the conditional variance random variable, Var(X|Y), is a constant and the conditional expectation random variable, E(X|Y), is

E(X|Y) = μ_1 + (ρσ_1/σ_2)(Y − μ_2).

By Example (6.0.6), since we know that the marginal of Y is N(μ_2, σ_2^2), we see that

E(E(X|Y)) = μ_1 + (ρσ_1/σ_2)(E(Y) − μ_2) = μ_1 = E(X),

which verifies the theorem of total expectation. The theorem of total variance is also easily verified, since

E(Var(X|Y)) + Var(E(X|Y)) = E(σ_1^2(1 − ρ^2)) + Var( μ_1 + (ρσ_1/σ_2)(Y − μ_2) ) = σ_1^2(1 − ρ^2) + (ρσ_1/σ_2)^2 σ_2^2 = σ_1^2 = Var(X).

This verifies the theorem of total variance in this example.
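A Monte Carlo verification of (1.2) and of the TTE/TTV identities (an added sketch, with assumed parameter values μ_1 = 1, μ_2 = −2, σ_1 = 2, σ_2 = 3, ρ = 0.6):

import numpy as np

rng = np.random.default_rng(4)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 3.0, 0.6
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
X, Y = rng.multivariate_normal([mu1, mu2], cov, size=10**6).T

EX_given_Y = mu1 + rho * s1 / s2 * (Y - mu2)       # the random variable E(X|Y) from (1.2)
print(EX_given_Y.mean(), X.mean())                 # TTE: both ≈ mu1 = 1
ttv = s1**2 * (1 - rho**2) + EX_given_Y.var()      # E(Var(X|Y)) + Var(E(X|Y))
print(ttv, X.var())                                # TTV: both ≈ sigma1^2 = 4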

Exercise - 6.1.1 - Verify statements (1.2).


HW13 Exercise - 6.1.2 - Let X ∼ Exp(λ). Compare P(X ≥ x + y | X ≥ x) with P(X ≥ y), when x, y > 0.

HW14 Exercise - 6.1.3 - Let (U, V) ∼ BVN((μ_1, μ_2), (σ_1^2, σ_2^2), ρ). Verify that the marginal density of U is N(μ_1, σ_1^2), the marginal density of V is N(μ_2, σ_2^2), and the conditional density of U given V = v is

U | V = v ∼ N( μ_1 + (ρσ_1/σ_2)(v − μ_2),  σ_1^2(1 − ρ^2) ).

Exercise - 6.1.4 - Let (X, Y) have the following bivariate density:

f(x, y) = 3x if 0 < y < x < 1, and f(x, y) = 0 otherwise.

Find (i) the marginal density of Y, (ii) the conditional density of X given Y = y, (iii) P(X ≤ 0.5 | Y = 0.25), (iv) P(X ≤ 0.5 | Y ≤ 0.25), (v) E(X|Y = y).

HW15 Exercise - 6.1.5 - (Empirical distribution function) Let X1,X2, · · · ,Xn be asequence of iid random variables having some common distribution F . Conditionedon X = [X1,X2, · · · ,Xn]′, we define a discrete uniform random variable, U , overthe values X1,X2, · · · ,Xn. That is,

P(U = Xi|X) =1

n, i = 1, 2, · · · , n.

Here ties are allowed in the sense that if X1 = X2 then P(U = X1|X) = 1n + 1

n = 2n .

The conditional distribution of U | X is called the empirical distribution function. A better way is to define the distribution

F_{U|X}(x) = P(U ≤ x | X) = (the number of Xi which are ≤ x) / n.

Using the empirical distribution, show that

E(U | X) = X̄n,   Var(U | X) = (1/n) Σ_{i=1}^{n} (Xi − X̄n)^2.

Then by using the theorems of total expectation and total variance, verify that E(U) = E(X1) and Var(U) = Var(X1). From the statistical point of view, the conditional distribution of U given X itself is considered as an estimator of the whole F. More precisely, for a fixed real number x, let Yi(x) := 1 if Xi ≤ x and zero otherwise, i = 1, 2, · · · , n. Verify that Y1(x), Y2(x), · · · , Yn(x) iid∼ B(1, F(x)), and

F_{U|X}(x) = (1/n) Σ_{i=1}^{n} Yi(x),   E(F_{U|X}(x)) = F(x),   Var(F_{U|X}(x)) = F(x)(1 − F(x))/n.

The quantity

sup_{−∞ < x < ∞} | F_{U|X}(x) − F(x) |

measures the largest discrepancy between the two distributions, F_{U|X}(x) and F(x), and plays a basic role in model selection.
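The empirical distribution function and this sup-discrepancy are easy to compute numerically. The following is a minimal sketch, assuming NumPy is available; the sample size and the choice of F = Uniform(0,1) are illustrative assumptions, not part of the exercise.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x_sample = rng.uniform(0.0, 1.0, size=n)        # X1,...,Xn iid from F = Uniform(0,1)

    def empirical_cdf(sample, x):
        # F_{U|X}(x) = (number of Xi <= x) / n, evaluated at each point of x
        return np.mean(sample[:, None] <= x[None, :], axis=0)

    grid = np.linspace(0.0, 1.0, 2001)               # evaluation points for the sup
    F_emp = empirical_cdf(x_sample, grid)
    F_true = grid                                    # CDF of Uniform(0,1) on [0,1]

    discrepancy = np.max(np.abs(F_emp - F_true))     # sup_x |F_{U|X}(x) - F(x)|
    print("sup discrepancy:", discrepancy)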



Example - 6.1.3 - (Compositions of random variables) Your friend rolls a fair die and tells you the value he observed. Then you toss a coin (with probability of heads being p) that many times. What is the distribution of the number of heads you observed?

To model this, let N be the number of heads you observed and let Y be the face value your friend observed. Note that N = X(Y) is a composition of two random variables, where X(n) ∼ B(n, p) and Y ∼ Uniform{1, 2, 3, 4, 5, 6}. By the theorem of total probability,

P(N = k) = Σ_{j=1}^{6} P(N = k | Y = j) P(Y = j)
= (1/6) Σ_{j=k}^{6} C(j, k) p^k (1 − p)^{j−k}
= (p^k/6) Σ_{i=0}^{6−k} C(i + k, i) (1 − p)^i,   k = 1, 2, · · · , 6,

while P(N = 0) = (1/6) Σ_{j=1}^{6} (1 − p)^j. (Here C(j, k) denotes the binomial coefficient "j choose k.")

Furthermore, by the theorem of total expectation, E(N) = E(E(N|Y)) = E(Yp) = 3.5p, and by the theorem of total variance,

Var(N) = Var(Yp) + E(Yp(1 − p)) = p^2 Var(Y) + 3.5p(1 − p),   where Var(Y) = 35/12.
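These formulas are easy to check by simulation. Here is a minimal Monte Carlo sketch, assuming NumPy is available; the value of p and the number of replications are arbitrary illustrative choices.

    import numpy as np
    from math import comb

    rng = np.random.default_rng(1)
    p, reps = 0.3, 200_000

    Y = rng.integers(1, 7, size=reps)      # die rolls, uniform on {1,...,6}
    N = rng.binomial(Y, p)                 # toss the coin Y times, count heads

    # Compare with E(N) = 3.5 p and Var(N) = p^2 (35/12) + 3.5 p (1-p).
    print("E(N):", N.mean(), "vs", 3.5 * p)
    print("Var(N):", N.var(), "vs", p**2 * 35/12 + 3.5 * p * (1 - p))

    # Exact distribution via the first line of the display (sum over die values j = 1..6).
    for k in range(7):
        exact = sum(comb(j, k) * p**k * (1 - p)**(j - k) for j in range(max(k, 1), 7)) / 6
        print(k, round((N == k).mean(), 4), round(exact, 4))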



Lecture 7

Conditional Expectations & Martingales

7.1 Properties of E(X|Y )

It should now be self-evident as to what the notation

E(X|Y1, Y2)

must stand for. Indeed, it is a random variable that is a function of Y1, Y2 and is obtained by computing

E(X|Y1 = v, Y2 = u)

and then replacing v by Y1 and u by Y2 in it. Of course the more general version

E(X|Y1 = v, Y2 = u, · · · , Yn = w)

is obtained similarly. Here are the main properties.

• (i) (Function of given info) E(X|Y ) is always a function of Y .

• (ii) (Indicator property) For any indicator random variable χB, where the event B is determined by Y, we have

E(X χB) = E(χB E(X|Y )).

• (iii) (TTE) E(E(X|Y )) = E(X).

• (iv) (TTV) E(V ar(X|Y )) + V ar(E(X|Y )) = V ar(X).

• (v) (Linearity on the left side of the vertical line)

E(a h(X,Y) + b g(X,Y) + c | Y) = a E(h(X,Y)|Y) + b E(g(X,Y)|Y) + c,

where a, b, c are constants.


• (vi) (Given factor comes out) Any factor determined by the information on the right side of the vertical line can be treated as a constant: For instance,

E(h(Y )g(X,Y )|Y ) = h(Y )E(g(X,Y )|Y ).

• (vii) (Compression property)

E(E(Z|X)|X,Y ) = E(E(Z|X,Y )|X) = E(Z|X).

That is, the smaller given information survives.

• (viii) (Conditional Jensen's inequality) For any convex function h, if E|X| < ∞ and E|h(X)| < ∞ then

E(h(X)|Y) ≥ h(E(X|Y)).

In particular, E(|X| |Y) ≥ |E(X|Y)|.

• (ix) (Covariance property) Cov(Y, E(X|Y)) = Cov(X, Y) when variances are finite.

• (x) (Independence property) If X, Y are independent then E(X|Y) = E(X).

• (xi) (Projection property) When X, Y have finite variances, among all functions h(Y) with finite variance,

E(X − h(Y))^2

is the least for the function h(Y) = E(X|Y).

Let us prove the last two properties. Note that when X, Y are independent, then

f_{X,Y}(x, y) = fX(x) fY(y)   =⇒   f_{X|Y=y}(x) = f_{X,Y}(x, y)/fY(y) = fX(x).

This gives E(X|Y) = E(X). For the projection property, add and subtract E(X|Y) and expand to get

E(X − h(Y))^2 = E[ (X − E(X|Y)) + (E(X|Y) − h(Y)) ]^2
= E(X − E(X|Y))^2 + E(E(X|Y) − h(Y))^2 + 2E[ (X − E(X|Y))(E(X|Y) − h(Y)) ].

Now notice something surprising. The TTE and Property (vi) give

E[ (X − E(X|Y))(E(X|Y) − h(Y)) ]
= E( E[ (X − E(X|Y))(E(X|Y) − h(Y)) | Y ] ),   TTE,
= E( (E(X|Y) − h(Y)) E[ X − E(X|Y) | Y ] ),   Prop (vi),
= E( (E(X|Y) − h(Y)) ( E(X|Y) − E[ E(X|Y) | Y ] ) ),   Linearity,
= E( (E(X|Y) − h(Y)) ( E(X|Y) − E(X|Y) ) ),   Prop (vi),
= 0.

So, we see that

E(X − h(Y))^2 = E(X − E(X|Y))^2 + E(E(X|Y) − h(Y))^2 ≥ E(X − E(X|Y))^2.

Need say no more!



Example - 7.1.1 - Let X1, X2, · · · iid∼ F, having finite variance. Let Sn = X1 + X2 + · · · + Xn. Let us find

(i) E(Sn|X1), and (ii) E(X1|Sn).

The first one is easy since

E(Sn|X1) = X1 + E(X2) + E(X3) + · · ·+ E(Xn) = X1 + (n− 1)E(X1).

The reader should justify the above steps. The second one is not too hard either, thanks to the above properties of the conditional expectation random variable. Indeed, let E(X1|Sn) = Z. Since the Xi are iid, one may not object to noting E(X2|Sn) = · · · = E(Xn|Sn) = Z. Adding these up gives nZ = E(Sn|Sn) = Sn. Hence, E(X1|Sn) = Z = Sn/n.

Exercise - 7.1.1 - Consider X, Y with joint density

f(x, y) = 6(1 − y) if 0 < x < y < 1, and 0 otherwise.

Compute E(X), E(Y), E(X^2), E(Y^2), Var(X), Var(Y), E(XY), Cov(X, Y), Corr(X, Y), E(X|Y = y), E(X^2|Y = y) and Var(X|Y = y), and the corresponding conditional random variables E(X|Y), Var(X|Y). Then verify the TTE and TTV.

Exercise - 7.1.2 - Let X, Y iid∼ N(0, 1). Find the mgf of XY.

HW16 Exercise - 7.1.3 - Let X|Λ = λ be distributed as Poisson(λ) and let Λ ∼ Exp(β). Find the density of X and the conditional random variables E(X|Λ), Var(X|Λ).

Exercise - 7.1.4 - Let X|R = r be distributed as B(n, r), where n is a fixed positive integer, and let R ∼ Beta(a, b). Find the density of X and obtain the conditional random variables E(X|R), Var(X|R).

Exercise - 7.1.5 - Let X|N = n be distributed as B(n, r), where r ∈ (0, 1) is a fixed real number, and let N ∼ Poisson(λ). Find the density of X and obtain the conditional random variables E(X|N), Var(X|N).

7.2 Martingales

Two of the most useful constructs of modern probability theory are martingales and Markov chains. While a martingale is based on conditional expectations, Markov chains are based on conditional probabilities. Although the two concepts are distinct, they have some common ground. Here we collect a few results concerning martingales.


Definition - 7.2.1 - (Martingale) Let Y1, Y2, · · · be a sequence of (finitely or infinitely many) random variables. Let Mn = fn(Y1, Y2, · · · , Yn), n = 1, 2, · · · , be a new sequence made up from the Y1, Y2, · · · , so that E|Mn| < ∞ for all n = 1, 2, · · · . We say that the collection, M1, M2, · · · , forms a martingale if

E( Mn+1 | Y1, Y2, · · · , Yn ) = Mn,   n = 1, 2, · · · .

Note that the value of Mn is determined by the values of Y1, Y2, · · · , Yn, for each n. We rephrase this by saying "Mn is an adapted process to Y1, Y2, · · · ."

Example - 7.2.1 - (Centered random walks are martingales) Let Y1, Y2, · · · be any sequence of independent random variables with finite means µn = E(Yn), n = 1, 2, · · · . Let

Mn = (Y1 − µ1) + (Y2 − µ2) + · · · + (Yn − µn) = Σ_{i=1}^{n} (Yi − µi),   n = 1, 2, · · · .

Note that the value of Mn is determined by the values of Y1, · · · , Yn. Also E|Mn| < ∞. To see that the martingale property holds, note that

E(Mn+1 | Y1, · · · , Yn) = E( (Yn+1 − µn+1) + Σ_{i=1}^{n} (Yi − µi) | Y1, · · · , Yn )
= E(Yn+1 − µn+1) + Mn
= Mn,   n = 1, 2, · · · .
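A quick numerical sanity check (not a proof, only a check of two consequences of the martingale property): E(Mn) stays equal to E(M1) = 0, and the increment Mn+k − Mn is uncorrelated with the past value Mn. The sketch below assumes NumPy; the exponential step distribution and the sample sizes are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)
    reps, n, lam = 100_000, 20, 2.0

    Y = rng.exponential(1.0 / lam, size=(reps, n))   # independent steps with mean 1/lam
    M = np.cumsum(Y - 1.0 / lam, axis=1)             # Mn = sum_{i<=n} (Yi - mu_i)

    print("E(M_5), E(M_20):", M[:, 4].mean(), M[:, 19].mean())          # both near 0
    # The increment M_20 - M_10 should be uncorrelated with M_10.
    print("corr(M_10, M_20 - M_10):", np.corrcoef(M[:, 9], M[:, 19] - M[:, 9])[0, 1])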

Example - 7.2.2 - (Doob’s martingale) Let Z, Y1, Y2, · · · be any random vari-ables so that E|Z| <∞. Define

Mn := E (Z |Y1, · · · , Yn) , n = 1, 2, · · · .

It is clear that the value of Mn is determined by the values of Y1, Y2, · · · , Yn. Bythe TTP and the conditional Jensen’s inequality,

E|Mn| = E (|E (Z|Y1, · · · , Yn)|) ≤ E (E (|Z| |Y1, · · · , Yn)) = E(|Z|) < ∞.

Using the compression property we get

E (Mn+1 |Y1, · · · , Yn) = E (E (Z |Y1, · · · , Yn+1) |Y1, · · · , Yn)

= E (Z |Y1, · · · , Yn)

= Mn, n = 1, 2, · · · .

Exercise - 7.2.1 - (Wald’s martingale) Let Y1, Y2, · · · be a sequence of indepen-dent and identically distributed random variables such that the moment generatingfunction φ(θ) := E(eθY1) exists in a neighborhood of zero. Fix a real number θ anddefine

Mn :=expθ(Y1 + Y2 + · · ·+ Yn)

(φ(θ))n=

n∏

i=1

(eθYi

φ(θ)

).

Show that M1,M2, · · · obeys the martingale property.



Exercise - 7.2.2 - (Variance martingale) Let Y1, Y2, · · · be a sequence of independent and identically distributed random variables such that E(Yn) = 0 and Var(Yn) = σ^2 < ∞. Define a sequence of random variables,

Mn := ( Σ_{i=1}^{n} Yi )^2 − nσ^2,   n = 1, 2, · · · .

Show that M1,M2, · · · obeys the martingale property.

Exercise - 7.2.3 - (Random harmonic series) Let U1, U2, · · · iid∼ B(1, p) represent the zero/one outcomes of a coin toss. Define

Xn := Σ_{i=1}^{n} (2Ui − 1)/i,   n ≥ 1.

Let Fn denote the information regarding U1, U2, · · · , Un, as a shorthand notation so that E(W|Fn) stands for the conditional expectation E(W|U1, U2, · · · , Un). For which value(s) of p does X1, X2, · · · obey the martingale property with respect to F1, F2, · · · ?

Exercise - 7.2.4 - Let Y1, Y2, · · · be a sequence of iid random variables with E(Y1) = 0 and mgf E(e^{uY1}) = φ(u). Let Mn := e^{uSn − n ln φ(u)}, where Sn = Y1 + Y2 + · · · + Yn. Show that for each u, Mn is a martingale.

HW17 Exercise - 7.2.5 - (Martingale generation technique) Consider Exercise (7.2.4). (i) Formally differentiate Mn with respect to u once to deduce that Sn is a centered random walk martingale of Example (7.2.1). (ii) Formally differentiate Mn twice with respect to u to deduce that Sn^2 − nσ^2 is a variance martingale of Exercise (7.2.2), where σ^2 = Var(Y1). (iii) By differentiating Mn three times and letting u → 0, obtain an expression for a possible martingale containing an Sn^3 term. Then verify directly.

Exercise - 7.2.6 - Let Y1, Y2, · · · be iid N(µ, σ^2) and let Sn = Σ_{k=1}^{n} Yk. Find a nonzero value of θ so that Mn := e^{θSn} is a martingale.

HW18 Exercise - 7.2.7 - Let Y1, Y2, · · · be iid N(µ, σ^2) and let Sn = Σ_{k=1}^{n} Yk. For a fixed value of r > 0, find two nonzero values of θ for which Mn := e^{θSn − nr} becomes a martingale.



Lecture 8

Independence & Transformations

We will now provide some of the standard techniques of distribution theory. There are three (somewhat unrelated) themes here.

1. The first theme shows a little bit of the normal sampling theory, which is explored in more detail in the following chapter.

2. The second theme deals with the ranks and order statistics, which lie behind the nonparametric side of statistical inference.

3. The third theme provides convolution densities, i.e., the density of a sum (or linear combination) of (usually independent) random variables. Random walks form a rich subclass of this topic. This theme also leads one to define splines and interpolation techniques, and culminates with Peano kernels of interpolation and other approximation techniques. The last section of this chapter shows how (unexpectedly) one finds the merger of order statistics (from uniforms) with the concepts of splines, and together they lead up to the Peano kernels of various approximation techniques.

But first we start off by explaining how one can construct independent random variables.

8.1 Transformations of Random Variables

When one transforms a random variable X into another random variable Y with the help of a function, Y = h(X), then there are at least three kinds of questions that can be raised:

• (i) What is the distribution of the new rv Y = h(X)?

• (ii) What is the expectation of the new rv, E(h(X))?

• (iii) By collecting E(h(X)) for several h's, can we recover the distribution of X from them?


Question (ii) was answered in the last chapter. Now we try to answer questions (i) and (iii).

It turns out that the density of Y can be obtained from the density of X in a straightforward way, especially when h is a one-one function. For instance, when X is a continuous random variable with density fX, then the density of Y is

fY(y) = fX(h^{-1}(y)) | d h^{-1}(y) / dy |.

In the discrete case the last derivative term is omitted. More generally, if h is a k-one function, we break up the domain of h over intervals where it is one-one, apply the above result over each subinterval, and then add up all the terms. More precisely,

Proposition - 8.1.1 - Let X be a continuous random variable taking values in an open set ∆ ⊆ R with density fX, where ∆ is the disjoint union of open sets ∆1, ∆2, · · · , ∆k. If h is a k-one function which is strictly monotone over each ∆i with continuously differentiable inverse functions h1^{-1}, h2^{-1}, · · · , hk^{-1}, then the density of Z = h(X) is given by

fZ(z) = Σ_{i=1}^{k} χ_{h(∆i)}(z) fX(hi^{-1}(z)) · | (d/dz) hi^{-1}(z) |,   z ∈ h(∆),

where χ_{h(∆i)}(z) = 1 if z ∈ h(∆i), and otherwise the whole entry in the summand is taken to be zero.

Proof: For any small ǫ > 0, consider P(z − ǫ < Z < z + ǫ). Since h is a k-one function, there are at most k numbers xi such that h(xi) = z, i = 1, 2, · · · , k. Since h is strictly monotone over ∆i, the inverse image of {h(x) ∈ (z − ǫ, z + ǫ)} is the union of at most k intervals I1, I2, · · · , Ik, where Ii = (xi − δi, xi + τi). That is, h(xi − δi) = z − ǫ and h(xi + τi) = z + ǫ when h is increasing on Ii, and the other way round if h is decreasing on Ii. If z ∉ h(∆i) then no such δi and τi come into the picture and we simply ignore such ∆i. So,

P(z − ǫ < Z < z + ǫ) = P({X ∈ I1} ∪ {X ∈ I2} ∪ · · · ∪ {X ∈ Ik})
= Σ_{i=1}^{k} P(X ∈ Ii)
= Σ_{i=1}^{k} [ FX(xi + τi) − FX(xi − δi) ].

Some terms in the summation may become zero if χ_{h(∆i)}(z) = 0. (For convenience of notation we ignore this indicator term.) Dividing by 2ǫ on both sides gives

P(z − ǫ < Z < z + ǫ)/(2ǫ) = Σ_{i=1}^{k} ( (τi + δi)/(2ǫ) ) · [ FX(hi^{-1}(z) + τi) − FX(hi^{-1}(z) − δi) ] / (τi + δi).



As ǫ drops to zero, both δi and τi drop to zero. The left side gives the density of Z and so,

fZ(z) = Σ_{i=1}^{k} lim_{ǫ→0} ( (τi + δi)/(2ǫ) ) fX(hi^{-1}(z)).

Now note that τi + δi = hi^{-1}(z + ǫ) − hi^{-1}(z − ǫ) when h is increasing on Ii, and τi + δi = hi^{-1}(z − ǫ) − hi^{-1}(z + ǫ) when h is decreasing on Ii. Consider the increasing case:

lim_{ǫ→0} (τi + δi)/(2ǫ) = lim_{ǫ→0} [ hi^{-1}(z + ǫ) − hi^{-1}(z − ǫ) ] / (2ǫ)
= lim_{ǫ→0} [ (hi^{-1})′(z + ǫ) + (hi^{-1})′(z − ǫ) ] / 2   (L'Hopital's rule)
= (d/dz) hi^{-1}(z)   (continuity of (hi^{-1})′).

In the decreasing case we get the negative sign with the derivative. ♠

Example - 8.1.1 - In Exercise (2.0.8) the reader showed that Y = X^2 ∼ χ^2(1) when X ∼ N(0, 1). Now we verify this result using the above proposition. Since h(x) = x^2 is a 2-one function, its inverse functions being h1^{-1}(y) = √y and h2^{-1}(y) = −√y for y > 0, we see that the density of Y is

fY(y) = fX(h1^{-1}(y)) · | (d/dy) h1^{-1}(y) | + fX(h2^{-1}(y)) · | (d/dy) h2^{-1}(y) |
= (1/√(2π)) e^{−y/2} (1/(2√y)) + (1/√(2π)) e^{−y/2} (1/(2√y))
= ( (1/2)^{1/2} / √π ) y^{1/2 − 1} e^{−y/2},   y > 0.

This is the density of Gamma(1/2, 1/2) ∼ χ^2(1) since Γ(1/2) = √π.
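As a numerical illustration (not part of the original text), one can compare a histogram of X^2 for simulated standard normals with the density just derived. A minimal sketch follows, assuming NumPy; the sample size and evaluation grid are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.standard_normal(200_000) ** 2                        # samples of Y = X^2

    grid = np.linspace(0.05, 4.0, 10)
    f_formula = np.exp(-grid / 2) / np.sqrt(2 * np.pi * grid)    # (1/sqrt(2 pi y)) e^{-y/2}

    hist, edges = np.histogram(y, bins=200, range=(0.0, 4.0), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    f_empirical = np.interp(grid, centers, hist)

    for g, fe, ff in zip(grid, f_empirical, f_formula):
        print(f"y={g:5.2f}  histogram={fe:6.3f}  formula={ff:6.3f}")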

Example - 8.1.2 - (From uniform to any F, i.e., Simulation) Let U ∼ Uniform(0, 1), and let F(x) be a given distribution. If F is invertible, then X = F^{-1}(U) has the distribution F. This idea works even when F is not one-one, if we construct an inverse carefully. Define an "inverse" function

H(u) := inf{ t : F(t) ≥ u }.   (1.1)

One can prove that (i) the infimum is attained, and (ii) X := H(U) ∼ F. (i) Since F is a right continuous function, the minimum is always attained. Figure (8.1) explains how the above defined inverse operation works out for various values of u, v, w and z. (ii) We need to show that P(X ≤ x) = F(x) for all real numbers x. Since {X ≤ x} = {U : H(U) ≤ x}, let A := {U : H(U) ≤ x}. Define another event B := {U : U ≤ F(x)}. We claim that A = B. To prove this we need to show that A ⊆ B and B ⊆ A. Let U ∈ A. Then H(U) ≤ x. In other words,

H(U) = min{ t : F(t) ≥ U } ≤ x.   (1.2)


[Figure 8.1: Inverse of a Distribution Function — the graph of F(t) against t, marking the values H(u), H(v), H(w), H(z) that correspond to the levels u, v, w, z.]

The monotonicity of F applied to the left and the far right sides of (1.2) gives that F(H(U)) ≤ F(x). Since the minimum is attained, the left equality part of (1.2) also gives that U ≤ F(H(U)). Putting these two facts together gives U ≤ F(x). Hence, U ∈ B. We have shown that A ⊆ B. Conversely, if U ∈ B then U ≤ F(x). Hence, x is among all those values of t for which F(t) ≥ U. Therefore,

H(U) = min{ t : F(t) ≥ U } ≤ x.

This gives that U ∈ A, and hence B ⊆ A. Hence, A = B, making their probabilities equal, namely P(X ≤ x) = P(U ≤ F(x)) = F(x), since U is a uniform random variable and F(x) ∈ [0, 1] for all real values of x.
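The generalized inverse H is exactly what a simulation routine implements. Below is a minimal sketch, assuming NumPy; it samples from Exp(λ) through the closed-form inverse H(u) = −ln(1 − u)/λ, and also shows a purely numerical H built from a tabulated F, which works even when no closed-form inverse is known. The tabulation grid is an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    lam = 1.5
    U = rng.uniform(size=100_000)

    # Closed-form inverse: F(x) = 1 - exp(-lam x)  =>  H(u) = -ln(1 - u)/lam.
    X = -np.log(1.0 - U) / lam
    print("sample mean:", X.mean(), " theoretical mean:", 1.0 / lam)

    # Numerical generalized inverse H(u) = inf{t : F(t) >= u} from a tabulated F.
    t_grid = np.linspace(0.0, 20.0 / lam, 4001)
    F_grid = 1.0 - np.exp(-lam * t_grid)

    def H(u):
        # first tabulated index where F(t) >= u, i.e. the smallest t achieving level u
        idx = np.searchsorted(F_grid, u, side="left")
        return t_grid[np.minimum(idx, len(t_grid) - 1)]

    X2 = H(U)
    print("tabulated-inverse sample mean:", X2.mean())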

HW19 Exercise - 8.1.1 - Let F be a continuous cdf and let H(p) be the inverse function as defined in Example (8.1.2) for p ∈ (0, 1). Prove that

• (i) F (H(p)) = p.

• (ii) H(p) is strictly increasing in p and H(F (x)) ≤ x, for F (x) ∈ (0, 1).

• (iii) H(p) is left continuous.

The multivariate extension of the above proposition is similar. For simplicity we mention the bivariate case, since the multivariate case is analogous. Suppose that U, V are two continuous random variables with joint density f(u, v). Suppose we want to find the density of a new random variable X = h(U, V). One way this is accomplished is by the following few steps.

• Step 1. Introduce another random variable Y = k(U, V), so that the transformation (h(u, v), k(u, v)) is a one-one mapping.

• Step 2. Find the inverse of the transformation, which we denote by u = r1(x, y), v = r2(x, y).

• Step 3. Find the Jacobian of the inverse transformation,

J = det [ ∂r1/∂x   ∂r1/∂y ; ∂r2/∂x   ∂r2/∂y ].



• Step 4. Obtain the joint density of X, Y by the formula

g(x, y) = f(u, v) |J|, where u = r1(x, y), v = r2(x, y) are inserted.

(In the discrete case Step 3 is not needed and |J| is ignored.)

• Step 5. Finally get the marginal of X by integrating out (or summing out, in the discrete case) the variable y.

Example - 8.1.3 - Let us prove that if U, V iid∼ N(0, 1) then X = U + V ∼ N(0, 2). Let us introduce a new random variable Y := U − V so that the transformation is one-one. Now the inverse transformation is u = (x + y)/2, v = (x − y)/2. This gives that the Jacobian of the inverse transformation is J = 1/2. Hence the joint density of X, Y is

g(x, y) = f(u, v) |J| = (1/(2 × 2π)) e^{−((x+y)^2 + (x−y)^2)/8} = (1/(4π)) e^{−x^2/4} e^{−y^2/4}.

This shows that X, Y are independent and X ∼ N(0, 2), as well as Y ∼ N(0, 2).

Example - 8.1.4 - (Ratio of standard normals is Cauchy) Finding the density of a ratio of two random variables is very similar to the convolution density. We illustrate this by showing that the density of the ratio of two independent standard normals happens to be a Cauchy density. Let X, Y iid∼ N(0, 1), and let Z = X/Y. Let V = Y. Now the inverse transformation is y = v and x = vz. In other words,

J = det [ v  z ; 0  1 ] = v.

Therefore, the joint density of Z, V is

f_{Z,V}(z, v) = fX(vz) fY(v) |v|.

Integrating out the unwanted variable gives the density of Z,

fZ(z) = ∫_{−∞}^{∞} |v| fX(vz) fY(v) dv.

In our case, X, Y are standard normals, therefore,

fZ(z) = (1/(2π)) ∫_{−∞}^{∞} |v| e^{−(1+z^2)v^2/2} dv
= (2/(2π)) ∫_{0}^{∞} v e^{−(1+z^2)v^2/2} dv
= (1/(π(1 + z^2))) ∫_{0}^{∞} e^{−t} dt,   t = (1 + z^2)v^2/2,
= 1/(π(1 + z^2)),   z ∈ R.
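A quick simulation makes the Cauchy limit visible, e.g. by comparing quartiles (the Cauchy(0,1) quartiles are −1, 0, 1), since the sample mean of a Cauchy sample never settles down. NumPy is assumed; the sample size and evaluation points are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_normal(300_000)
    y = rng.standard_normal(300_000)
    z = x / y                                        # ratio of independent standard normals

    q25, q50, q75 = np.percentile(z, [25, 50, 75])
    print("quartiles of Z:", q25, q50, q75)          # Cauchy(0,1) has quartiles -1, 0, 1

    # crude density check at a few points against 1/(pi (1+z^2))
    for a in (0.0, 1.0, 2.0):
        emp = np.mean(np.abs(z - a) < 0.05) / 0.10
        print(f"f_Z({a}) ~ {emp:.3f}  vs  {1/(np.pi*(1+a**2)):.3f}")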


HW20 Exercise - 8.1.2 - (Separate functions of independent r.vs. are independent) Let X, Y be independent random variables. If U = h(X) and V = g(Y) are such that U, V have a joint mgf, prove that U, V are also independent. (Hint: use the independence property of mgfs.)

Exercise - 8.1.3 - (Separate functions of uncorrelated r.vs. are not necessarily uncorrelated) Construct two random variables X, Y that are uncorrelated, and functions h, g, so that U = h(X) and V = g(Y) are correlated.

8.2 Sequences of Independent Random Variables

Earlier we gave a definition of independence of a sequence of infinitely many events. How do we know that such a sequence exists? Intuitively it does not take much to imagine an infinite sequence of fair coin tosses. But what sort of (Ω, E, P) corresponds to such an experiment? Here we present a heuristic argument, which can be made as precise as one desires, and which gives the slightly more general result of an infinite sequence of independent random variables. Then we present some standard transforms, such as their partial sums (of which random walks are a famous example), rank statistics, record statistics and order statistics.

Remark - 8.2.1 - (Infinite sequence of independent events, Rademacher sequence) If we identify a head by 1 and a tail by 0, then members of Ω must be infinite sequences of 0's and 1's, denoted as ω = (u1, u2, · · · ). Such a sequence also gives us a real number,

ω ↔ r := u1/2^1 + u2/2^2 + u3/2^3 + · · · ,   r ∈ [0, 1],   (2.3)

which lies in the interval [0, 1]. Conversely, every number in [0, 1] may be written in a (nonterminating) binary expansion of the above type. Therefore, we may identify each infinite coin toss experiment to represent a point in [0, 1].

Next note that the subset of Ω that describes the fate of the first n tosses corresponds to an interval in [0, 1] and vice-versa. Therefore, it is natural to consider the smallest sigma field containing the subintervals of [0, 1] to be our event space E = B, i.e., the Borel sigma field over [0, 1].

Finally, as our model, we insist that the probability function P that we aim to define must reflect the intuitive nature of the experiment. Namely,

• (i) Unrelatedness of the coin tosses must be preserved. That is, the fate of the first n coin tosses should not influence (or be influenced by) the fate of another batch of m tosses beyond the first n tosses.

• (ii) Since the coin is fair, we may identify a "random selection" of a point from the interval [0, 1] through the outcome of an infinite fair coin toss experiment.

It is not too difficult, then, to show that a function P can be defined which satisfies these requirements. In fact, that function P assigns the probability of any interval (a, b) ⊆ [0, 1] to be its length,

P([a, b]) = P((a, b]) = P([a, b)) = P((a, b)) = b− a, 0 ≤ a ≤ b ≤ 1.



With this probability space ([0, 1], E, P) we can now construct an example of an infinite sequence of independent events. As long as the events describe separate portions of the infinite coin tosses, they will give an infinite sequence of independent events.

Next, if we define a random variable U : [0, 1] → [0, 1] to be the identity function, then U ∼ Uniform(0, 1). By (2.3), we also see that

U = Σ_{i=1}^{∞} ui/2^i.

In fact, with the help of this example, we can construct an infinite sequence of independent random variables, Y1, Y2, · · · , having any specified distributions F1, F2, · · · respectively. The idea is very simple and here it is.

Instead of writing the outcomes of the infinitely many tosses as a sequence, write it as a double array as presented in the adjacent figure. That is, we put at the head of each arrow the outcome of the coin toss. So, u1 goes at the head of the first arrow, u2 goes at the head of the second arrow, and so on.

[Diagram: the outcomes u1, u2, · · · placed at the heads of arrows that trace a zigzag path through a double array, in the same way one enumerates the rationals.]

Note that this is how we also show that the set of rationals is countable. So, the positions of the heads of the arrows may be identified with the rationals p/q, where q = 1, 2, · · · and p = 0, ±1, ±2, · · · . Considering them column-wise, we have partitioned the sequence of infinite coin tosses into infinitely many separate infinite subsequences of fair coin tosses. Each column, say p, gives rise to a uniform random variable, which we may denote as Up ∼ Uniform(0, 1). Since they are made from separate portions (columns), U1, U2, · · · forms an independent sequence (actually an iid sequence) of Uniform(0, 1) random variables. Finally, by taking Yi = Hi(Ui), where Hi is as defined in (1.1) from Fi, we see that Y1, Y2, · · · will be an infinite sequence of independent random variables with Yi ∼ Fi.

So, for instance, we may define random variables X, Y, Z on ([0, 1], B, P) so that X ∼ N(µ, σ^2), Y ∼ Poisson(13) and Z ∼ Cauchy(0, 1), or a random vector X = [X1, X2, · · · , Xd]′ ∼ MVN(µ, V). In some sense ([0, 1], B, P) may be taken to be the "mother" of all random variables.
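The double-array construction can be imitated on a computer by splitting the binary digits of one stream of fair coin tosses into two subsequences, one built from the odd-indexed tosses and one from the even-indexed tosses. The sketch below (assuming NumPy, with 30 bits per variable and a modest sample size, all illustrative choices) checks that the two resulting variables behave like independent uniforms.

    import numpy as np

    rng = np.random.default_rng(6)
    reps, bits = 200_000, 30
    u = rng.integers(0, 2, size=(reps, 2 * bits))        # the coin tosses u1, u2, ...

    w = 0.5 ** np.arange(1, bits + 1)                    # weights 1/2, 1/4, 1/8, ...
    U1 = u[:, 0::2] @ w                                   # built from tosses in odd positions
    U2 = u[:, 1::2] @ w                                   # built from tosses in even positions

    print("means:", U1.mean(), U2.mean())                 # both near 1/2
    print("correlation:", np.corrcoef(U1, U2)[0, 1])      # near 0
    # joint uniformity spot check: P(U1 <= 0.3, U2 <= 0.7) should be close to 0.21
    print("P(U1<=0.3, U2<=0.7):", np.mean((U1 <= 0.3) & (U2 <= 0.7)))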

HW21 Exercise - 8.2.1 - Let U, V iid∼ Exp(λ). Show that U + V ∼ Gamma(λ, 2). More generally, U1 + U2 + · · · + Un ∼ Gamma(λ, n) when U1, U2, · · · , Un iid∼ Exp(λ).

Exercise - 8.2.2 - If X1, X2 iid∼ B(n, p), find P(X1 + X2 = k) for k = 0, 1, · · · , 2n.

Exercise - 8.2.3 - If X, Y are independent with X ∼ Poisson(λ), Y ∼ Poisson(µ), find P(X + Y = k) for k = 0, 1, 2, · · · .


HW22 Exercise - 8.2.4 - Let U, V iid∼ Gamma(λ, 1/2). Show that U + V ∼ Gamma(λ, 1) ∼ Exp(λ). (Here you may use the fact that Γ(1/2) = √π.) Let X, Y iid∼ N(0, 1); deduce that X^2 + Y^2 ∼ χ^2(2).

8.3 Generating Functions

Finally we should mention that there are several types of transforms that are used in probability theory. The context is a bit different from the one we have presented above. Now the aim is not to find the distribution but instead to find the expectation. These transforms store some features of the distribution of X which may be easier to study in the transformed domain than in the original distribution. Typical examples along these lines are as follows.

• Generating functions (designed for nonnegative integer valued rvs):

E(t^X) = Σ_{k=0}^{∞} t^k P(X = k),   t ∈ (0, 1].

• Laplace transforms (designed for nonnegative continuous rvs):

E(e^{−λX}) = ∫_0^∞ e^{−λx} f(x) dx,   λ ∈ (0, ∞).

• Moment generating functions, or bilateral Laplace transforms (defined for all varieties of random variables, as long as they exist). The moment generating function (mgf) of X = [X1, · · · , Xd]′ is the following expectation (provided it exists for all θ in a neighborhood of 0):

E(e^{θ·X}) = E(e^{θ1X1 + · · · + θdXd}),   θ = [θ1, · · · , θd]′.

• Fourier-Stieltjes transforms, also called characteristic functions (defined for all varieties of random variables). The characteristic function of X = [X1, · · · , Xd]′ is defined to be the expectation

E(e^{iθ·X}) = E(e^{iθ1X1 + · · · + iθdXd}),   θ ∈ R^d.

A useful device to handle sums of independent random variables (and for several other uses) is the concept of a transform or a generating function.

Definition - 8.3.1 - (Moment generating function) The moment generating function (mgf) of X = [X1, · · · , Xd]′ is the following expectation (provided it exists for all θ in a neighborhood of 0):

E(e^{θ·X}) = E(e^{θ1X1 + · · · + θdXd}),   θ = [θ1, · · · , θd]′.

The characteristic function of X = [X1, · · · , Xd]′ is defined to be the expectation

E(e^{iθ·X}) = E(e^{iθ1X1 + · · · + iθdXd}),   θ ∈ R^d.



Example - 8.3.1 - If X ∼ N(0, σ^2) then, by completing the square, we have

E(e^{θX}) = (1/(σ√(2π))) ∫_{−∞}^{∞} e^{θu} e^{−u^2/(2σ^2)} du
= e^{σ^2θ^2/2} ( (1/(σ√(2π))) ∫_{−∞}^{∞} e^{−(u − θσ^2)^2/(2σ^2)} du )
= e^{σ^2θ^2/2}.

Here we used the fact that the area under a normal (or any) density is one. This is a useful trick and we will use it whenever applicable. A bit more generally, if U ∼ N(µ, σ^2), then X = U − µ ∼ N(0, σ^2). That is, U = X + µ, and the above result gives the mgf of U ∼ N(µ, σ^2) as

E(e^{θU}) = e^{µθ + σ^2θ^2/2}.

Similarly, the characteristic function of U ∼ N(µ, σ^2) is E(e^{iθU}) = e^{iµθ − σ^2θ^2/2}.

Example - 8.3.2 - When X ∼ Gamma(λ, α), one can show that the mgf of X is

E(e^{tX}) = ( λ/(λ − t) )^α,   t < λ.

This is easy to prove. Indeed,

E(e^{tX}) = (λ^α/Γ(α)) ∫_0^∞ e^{tx} x^{α−1} e^{−λx} dx
= (λ^α/(λ − t)^α) ( ((λ − t)^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx )
= λ^α/(λ − t)^α,   t < λ.

Here we used that the area under the gamma (or any) density is one. In particular, when X ∼ χ^2(d), its mgf is

E(e^{tX}) = 1/(1 − 2t)^{d/2},   for t < 1/2.

Remark - 8.3.1 - (Properties of mgf) Some of the nice features of moment generating functions are listed below. Similar results hold for characteristic functions as well. Throughout we will take these facts for granted.

• (Generation of moments) By differentiation we can derive the moments of the random variable from its moment generating function. More precisely, if φ(θ) = E(e^{θX}) is the moment generating function of X then

E(X) = (d/dθ) φ(θ) |_{θ=0},   E(X^2) = (d^2/dθ^2) φ(θ) |_{θ=0},   E(X^3) = (d^3/dθ^3) φ(θ) |_{θ=0},

and so on. (A small symbolic check of this property appears right after this list.)


• (Uniqueness) If two random variables have the same moment generating function then they both have the same distribution.

• (Independence) If the joint moment generating function of X, Y,

φ(θ1, θ2) = E(e^{θ1X + θ2Y}),

happens to be the product of the individual moment generating functions of X and Y,

E(e^{θ1X}) E(e^{θ2Y}),   for all (θ1, θ2) in a neighborhood of (0, 0),

then the two random variables are independent, and vice-versa.
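As a small symbolic check of the moment-generation property (assuming SymPy is available), differentiating the mgf of N(0, σ^2) found in Example (8.3.1) at θ = 0 reproduces the first few moments; the same two lines work for the Gamma mgf of Example (8.3.2).

    import sympy as sp

    theta, sigma = sp.symbols("theta sigma", positive=True)
    mgf = sp.exp(sigma**2 * theta**2 / 2)         # mgf of N(0, sigma^2)

    # k-th derivative at theta = 0 gives E(X^k): expect [0, sigma**2, 0, 3*sigma**4]
    moments = [sp.diff(mgf, theta, k).subs(theta, 0) for k in range(1, 5)]
    print(moments)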

Example - 8.3.3 - (Two normal random variables are independent if and only if they are uncorrelated) In general, if two random variables are independent then they must be uncorrelated (provided their moments exist, of course). However, the converse does not hold, as one can easily construct counterexamples (cf. Exercise (??)). One nice exception exists for normal random variables, for which the two concepts are equivalent, as we now show. A bit more general version is in the next chapter (cf. Example (??)).

We will find the mgf for (U, V) ∼ BVN((0, 0), (1, 1), ρ). The trick is to use the fact that the area under any normal density is 1. That is,

(1/(σ√(2π))) ∫_{−∞}^{∞} e^{−(x − µ)^2/(2σ^2)} dx = 1,   for any µ ∈ R and any σ > 0.

The joint mgf of U, V is

E(e^{tU + sV})
= (1/√(2π)) ∫_{−∞}^{∞} e^{tx} [ 1/√(2π(1 − ρ^2)) ] ∫_{−∞}^{∞} e^{sy} exp{ −(x^2 + y^2 − 2ρxy)/(2(1 − ρ^2)) } dy dx
= (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x^2/(2(1−ρ^2))} [ 1/√(2π(1 − ρ^2)) ] ∫_R exp{ −( y^2 − 2(ρx + s(1 − ρ^2))y )/(2(1 − ρ^2)) } dy dx
= (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x^2/(2(1−ρ^2))} e^{ (ρx + s(1−ρ^2))^2/(2(1−ρ^2)) } dx
= (1/√(2π)) ∫_{−∞}^{∞} exp{ −( x^2 − (ρx + s(1 − ρ^2))^2 − 2(1 − ρ^2)tx )/(2(1 − ρ^2)) } dx
= (1/√(2π)) ∫_{−∞}^{∞} exp{ −( x^2(1 − ρ^2) − 2s(1 − ρ^2)ρx − 2(1 − ρ^2)tx − s^2(1 − ρ^2)^2 )/(2(1 − ρ^2)) } dx
= e^{s^2(1−ρ^2)/2} (1/√(2π)) ∫_{−∞}^{∞} exp{ −( x^2 − 2(sρ + t)x )/2 } dx
= exp{ s^2(1 − ρ^2)/2 + (sρ + t)^2/2 }
= exp{ (t^2 + s^2 + 2tsρ)/2 }.



If (X, Y) ∼ BVN((µ1, µ2), (σ1^2, σ2^2), ρ) then it is easy to see that U = (X − µ1)/σ1 and V = (Y − µ2)/σ2 have joint cdf

P(U ≤ u, V ≤ v) = P(X ≤ µ1 + uσ1, Y ≤ µ2 + vσ2).

Differentiating with respect to u, v shows that (U, V) ∼ BVN((0, 0), (1, 1), ρ). Hence, the mgf of (X, Y) is

Ψ(t, s) = E(e^{tX + sY}) = E(e^{t(µ1 + σ1U) + s(µ2 + σ2V)}) = e^{tµ1 + sµ2} E(e^{(tσ1)U + (sσ2)V})
= e^{tµ1 + sµ2} exp{ ((tσ1)^2 + (sσ2)^2 + 2tsρσ1σ2)/2 }.

Once again we see that X, Y are independent if and only if ρ = 0.

HW23 Exercise - 8.3.1 - If X, Y are independent random variables so that X ∼ N(µ1, σ1^2), Y ∼ N(µ2, σ2^2), then find the mgf of X + Y. What might be its distribution?

HW24 Exercise - 8.3.2 - Verify that the joint mgf of X̄n and (Xi − X̄n), i = 1, 2, · · · , n, when X1, X2, · · · , Xn iid∼ N(µ, σ^2), is as follows (see also Exercise (??)):

E( e^{tX̄n + Σ_{i=1}^{n} si(Xi − X̄n)} ) = e^{µt + σ^2t^2/(2n)} e^{(σ^2/2) Σ_{i=1}^{n} (si − s̄n)^2}.

Classify the following statements as true or false, with justifications.

• (i) X̄n is independent of (Xi − X̄n), i = 1, 2, · · · , n.

• (ii) X̄n ∼ N(µ, σ^2/n).

• (iii) X̄n is independent of Σ_{i=1}^{n} (Xi − X̄n)^2.

• (iv) X̄n is independent of Σ_{i=1}^{n} (Xi − X̄n)^3.

• (v) X̄n is independent of Σ_{i=1}^{n} (Xi − X̄n)^4.



Lecture 9

Ranks, Order Statistics & Records

For two independent random variables their addition yielded the convolution density. Now consider the operations of minimum and maximum performed on independent random variables. What will be their densities?

More elaborately, suppose X1, X2, · · · , Xn are independent and identically distributed continuous random variables with cumulative distribution function FX. Define new random variables,

X(1) := Y1 := min{X1, X2, · · · , Xn},
X(2) := Y2 := next to the min{X1, X2, · · · , Xn},
X(3) := Y3 := second next to the min{X1, X2, · · · , Xn},
...
X(n) := Yn := max{X1, X2, · · · , Xn}.

These random variables are called the order statistics¹. When n is an odd integer, the middle of these Y's is called the median of the sample. When n is an even integer, the median of the sample is taken to be the average of the middle two Y's. Note that the average of all the Y's is the same as the average of all the X's. Sometimes the order statistics are also denoted as X(i) = Yi, i = 1, 2, · · · , n. In order to show their dependence on n, we sometimes use the more elaborate notation

X(n:i) = Yi, i = 1, 2, · · · , n.

The vector of order statistics, D(X) := (X(1), X(2), · · · , X(n)), where X = (X1, X2, · · · , Xn), provides ranks, denoted as R1, R2, · · · , Rn, for the original X1, X2, · · · , Xn. In words, R1 is the position of X1 when X1, X2, · · · , Xn are rearranged in increasing order. Therefore, Y_{R1} = X1. Similarly, R2 is the position of X2 when X1, X2, · · · , Xn are rearranged in increasing order, and so on. More mathematically, Xj = Y_{Rj}, j = 1, 2, · · · , n. Yet in other words,

Rk = Σ_{i=1}^{n} χ{Xi ≤ Xk} = the number of Xi which are ≤ Xk.

¹For more on order statistics, see the last chapter.

Note that R(X) = (R1, R2, · · · , Rn) forms a permutation of the positive integers 1, 2, · · · , n. Knowing (Y1, Y2, · · · , Yn) does not allow us to reconstruct our X1, X2, · · · , Xn. However, knowing both (Y1, Y2, · · · , Yn) and (R1, R2, · · · , Rn) does allow us to reconstruct (X1, X2, · · · , Xn), by the relationship Xj = Y_{Rj}, j = 1, 2, · · · , n. Hence, (D(X), R(X)) is a one-to-one function of X = (X1, X2, · · · , Xn).

Now consider a slightly different setup. Now allow n to be unrestricted and consider X1, X2, · · · . Let N0 := 1, indicating the starting time as a base, and take X1 as our starting (trivial) base value to compare the future record values with. Define N1 to be the time at which the first record is achieved. That is,

N1 := min{ n ≥ 1 : Xn+1 ≥ X1 }.

The amount X_{N1} is the first record value and N1 is the first record time. Future record times and record values are defined similarly in a recursive fashion. That is, after Nm and X_{Nm} have been defined, let

Nm+1 := min{ n > Nm : Xn ≥ X_{Nm} },   m = 0, 1, 2, · · · ,

be the (m + 1)-th record time and X_{Nm+1} be the (m + 1)-th record value. The amounts ∆m := Nm − Nm−1 are called inter-record times.

The aim of this section is to find the distributions of order and rank statistics as well as the record times and record values. As we shall see, a bit of parallelism exists between the two scenarios: we will show that

• The distribution of the ranks does not depend on F , however, the distributionof the order statistics is heavily dependent on F .

• Similarly, the distribution of the record times does not depend on F , however,the distribution of the record values is heavily dependent on F .

Example - 9.0.4 - (How do record and order statistics arise in real life?) It is quite intuitive to imagine how records are set in a continuing process where identically distributed experiments are performed, such as the Olympic games or floods or tides or earthquakes. Note that

1 − P(N1 ≤ n) = P(N1 > n) = P(all of X2, X3, · · · , Xn+1 are ≤ X1)
= ∫_{−∞}^{∞} F(x)^n f(x) dx = 1/(n + 1),   n = 1, 2, · · · .

This gives that P(N1 = 1) = 1/2 and

P(N1 = j) = P(N1 ≤ j) − P(N1 ≤ j − 1) = 1/(j(j + 1)),   j = 2, 3, · · · .
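The fact that the law of N1 does not depend on F is easy to check by simulation. The sketch below (NumPy assumed, with Exp(1) observations as an arbitrary continuous choice and a finite horizon) estimates P(N1 = j) and compares it with 1/(j(j + 1)).

    import numpy as np

    rng = np.random.default_rng(7)
    reps, horizon = 200_000, 200
    X = rng.exponential(size=(reps, horizon))

    # N1 = min{n >= 1 : X_{n+1} >= X_1}; argmax finds the first True in each row.
    beats = X[:, 1:] >= X[:, :1]
    has_record = beats.any(axis=1)                 # drop the rare paths with no record in the horizon
    N1 = beats[has_record].argmax(axis=1) + 1      # +1 because column 0 corresponds to n = 1

    for j in (1, 2, 3, 5, 10):
        print(j, round(np.mean(N1 == j), 4), round(1.0 / (j * (j + 1)), 4))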



Note that the density of the first record time, N1, does not depend on F. As we will see, the distribution of the record value X_{N1} does depend on F.

Order statistics also arise quite naturally in real life. For instance, suppose a crane is lifting a tank with the help of a metal chain having n identical links. There is some possibility that, due to the heavy load, the chain might break. We can find the probability distribution of the breaking load of the chain by using the first order statistic Y1. Indeed, the chain when suspended with the load is going to expose all its links to the same weight. The weakest of the links will break first and at that moment the chain will break. If X1, X2, · · · , Xn are the breaking strengths of the individual links, then the breaking strength of the weakest link is the first order statistic Y1. Thus,

P(chain breaks for weight ≤ y) = P(Y1 ≤ y) = 1 − P(Y1 > y)
= 1 − P(X1 > y, X2 > y, · · · , Xn > y)
= 1 − (1 − FX(y))^n.

Differentiating with respect to y gives the density of the breaking load of the chain:

f_{Y1}(y) = (d/dy) P(Y1 ≤ y) = n(1 − FX(y))^{n−1} fX(y).

Note that this clearly depends on F. We will see that the distribution of the rank R1 of X1, however, will not depend on F. Consider the order statistics first. We will denote the density of Yk by fk or f_{Yk}.

Proposition - 9.0.1 - (Distribution of Yk) When X1, X2, · · · , Xn iid∼ F are continuous random variables, the density of the k-th order statistic, Yk, is as follows:

fk(y) = [ n! / ((k − 1)! (n − k)!) ] (F(y))^{k−1} f(y) (1 − F(y))^{n−k}.

Proof: We will first find the distribution of Yk. Its derivative will then give the density.

Fk(y) = P(Yk ≤ y) = P(at least k of X1, X2, · · · , Xn are ≤ y)
= Σ_{i=k}^{n} C(n, i) (F(y))^i (1 − F(y))^{n−i}.

The derivative gives the density. Indeed,

fk(y) = Σ_{i=k}^{n} C(n, i) i (F(y))^{i−1} f(y) (1 − F(y))^{n−i}
− Σ_{i=k}^{n} C(n, i) (n − i) (F(y))^i f(y) (1 − F(y))^{n−i−1}.

Here when i = n, the second term becomes zero since (n − i) becomes zero. So, we see that

fk(y) = n f(y) [ Σ_{i=k}^{n} C(n−1, i−1) (F(y))^{i−1} (1 − F(y))^{n−i} − Σ_{i=k}^{n−1} C(n−1, i) (F(y))^i (1 − F(y))^{n−i−1} ].

Now if we let j = i + 1 in the second sum, we see that the two sums become identical after separating the first term of the first sum. This gives

fk(y) = n f(y) C(n−1, k−1) (F(y))^{k−1} (1 − F(y))^{n−k}.

This finishes the proof. ♠
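For a Uniform(0, 1) sample the proposition gives fk(y) = [n!/((k−1)!(n−k)!)] y^{k−1}(1 − y)^{n−k}, the Beta(k, n−k+1) density, so E(Yk) = k/(n + 1). The following sketch (NumPy assumed; n, k and the replication count are illustrative choices) checks this by simulation.

    import numpy as np
    from math import factorial

    rng = np.random.default_rng(8)
    n, k, reps = 7, 3, 100_000

    samples = rng.uniform(size=(reps, n))
    Yk = np.sort(samples, axis=1)[:, k - 1]          # k-th smallest in each row

    print("simulated E(Y_k):", Yk.mean(), " theory:", k / (n + 1))

    # density check at y = 0.4 against the formula of Proposition 9.0.1
    y = 0.4
    f_formula = factorial(n) / (factorial(k - 1) * factorial(n - k)) * y**(k - 1) * (1 - y)**(n - k)
    f_empirical = np.mean(np.abs(Yk - y) < 0.01) / 0.02
    print("f_k(0.4):", f_empirical, " vs ", f_formula)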

Remark - 9.0.2 - (A faster way of obtaining the density of Yk) There is an easy (heuristic) way to get the density fk of the order statistic Yk. Just use the fact that for h > 0,

fk(y) = lim_{h→0} [ Fk(y + h) − Fk(y) ]/h = lim_{h→0} (1/h) P(y < Yk ≤ y + h).

The following argument is similar for the case when h < 0. The event {y < Yk ≤ y + h} means that (for a very small h) we must have k − 1 of the X's fall below y, one of the X's must fall in the interval (y, y + h], and the remaining n − k of the X's fall above y + h. By using the multinomial distribution we get

fk(y) = lim_{h→0} (1/h) P(y < Yk ≤ y + h)
= lim_{h→0} (1/h) [ n!/( (k−1)! (n−k)! ) ] (F(y))^{k−1} (F(y + h) − F(y)) (1 − F(y + h))^{n−k}
= [ n! (F(y))^{k−1} / ( (k−1)! (n−k)! ) ] lim_{h→0} { [ F(y + h) − F(y) ]/h } (1 − F(y + h))^{n−k}
= [ n!/( (k−1)! (n−k)! ) ] (F(y))^{k−1} f(y) (1 − F(y))^{n−k}.

Remark - 9.0.3 - (Joint Density of Yk, Yj) The above trick that we used to find the density of Yk goes through without much difficulty for the joint density of two (and more) order statistics. Note that for k < j,

Fk,j(y, z) = P(Yk ≤ y, Yj ≤ z) = P(Yk ≤ y) − P(Yk ≤ y, Yj > z);   (y < z).

When we differentiate with respect to both y and z, the first term on the right will vanish. The second term can be taken care of as follows.

P(Yk ≤ y, Yj > z) = P(at least k of the X's ∈ (−∞, y) and at most j − 1 fall in (−∞, z)).

Now consider a three-sided die with probabilities

p1 := P (X1 ≤ y) = F (y),



p2 := P (y < X1 ≤ z) = F (z) − F (y); (y < z),

p3 := P (X1 > z) = 1− F (z).

We roll this die n times and need to find the probability of at least k faces of the first type and at most j − 1 faces of the first and second types. This is just summing the appropriate multinomial probabilities.

P(Yk ≤ y, Yj > z) = Σ_{i=k}^{j−1} Σ_{ℓ=0}^{j−1−i} [ n!/( i! ℓ! (n − i − ℓ)! ) ] (F(y))^i (F(z) − F(y))^ℓ (1 − F(z))^{n−i−ℓ}.

Differentiating gives the joint density fk,j(y, z) for y < z. However, an easier way is as follows.

Remark - 9.0.4 - (A quick way to find the joint density) We want to find the joint density of Yk, Yj where k < j. Then there must be k − 1 of the X's which fall below y, one should be at y, and n − j should fall above z. One should fall at z and the remaining j − k − 1 must fall between y and z. The probability of falling below y is F(y). The probability of falling above z is (1 − F(z)). The probability of falling between y and z is (F(z) − F(y)). The "probability" of falling at y is f(y). We use the quotes to emphasize that this is certainly not a probability, since the actual probability is zero and the number f(y) could even be greater than 1. What we have is the probability of falling in a small neighborhood of y and then letting the neighborhood collapse to y. Similarly, the "probability" of falling at z is f(z). Thus, if we view the n random variables X1, · · · , Xn as n independent repetitions of rolling a five-sided die with the above probabilities, then the joint density is the probability of the simultaneous occurrence of the above five events. This gives the joint density, fk,j(y, z) for y < z, to be

[ n!/( (k−1)! 1! (j−k−1)! 1! (n−j)! ) ] (F(y))^{k−1} f(y) (F(z) − F(y))^{j−k−1} f(z) (1 − F(z))^{n−j}.

When y ≥ z, fk,j(y, z) must be zero.

Continuing the above argument, we see that the joint density of the order statistics (Y1, Y2, · · · , Yn) is

f1,2,··· ,n(y1, y2, · · · , yn) = n! f(y1) f(y2) · · · f(yn), y1 ≤ y2 ≤ · · · ≤ yn.

Theorem - 9.0.1 - (Independence of ranks and order statistics) Let X1, X2, · · · , Xn iid∼ F, where F is absolutely continuous. The following results hold:

• The joint density of the order statistics (Y1, Y2, · · · , Yn) is

f_{Y1,···,Yn}(y1, · · · , yn) = n! Π_{i=1}^{n} f(yi),   y1 ≤ y2 ≤ · · · ≤ yn.


• The joint distribution of the ranks (R1, R2, · · · , Rn) is

P(R1 = i1, R2 = i2, · · · , Rn = in) = 1/n!,

where (i1, i2, · · · , in) is a permutation of (1, 2, · · · , n).

• The vector of order statistics (Y1, Y2, · · · , Yn) and the vector of ranks (R1, R2,· · · , Rn) are independent.

Proof: We already know the result of part (i). Recall Xj = Y_{Rj}, j = 1, 2, · · · , n. Let y1 < y2 < · · · < yn be fixed. Since F is absolutely continuous there is no chance of any ties. Pick ǫ > 0 so small that the intervals (yi − ǫ, yi + ǫ] do not overlap. Let (r1, r2, · · · , rn) be any permutation of (1, 2, · · · , n). Since (D(X), R(X)) is a one-one transformation,

P( X(i) ∈ (yi − ǫ, yi + ǫ], Ri = ri, i = 1, 2, · · · , n )
= P( Xi ∈ (y_{ri} − ǫ, y_{ri} + ǫ], i = 1, 2, · · · , n ) = Π_{i=1}^{n} P( yi − ǫ < Xi ≤ yi + ǫ ).

Divide by (2ǫ)^n and let ǫ go to zero to get the joint density of (D(X), R(X)):

f_{D(X),R(X)}(y1, · · · , yn, r1, · · · , rn) = Π_{i=1}^{n} f_{X1}(yi) = ( n! Π_{i=1}^{n} f_{X1}(yi) ) (1/n!)
= f_{D(X)}(y1, · · · , yn) f_{R(X)}(r1, · · · , rn).

This gives parts (ii) and (iii). There are several other ways one can prove the second and the third parts.² ♠

We close this section by stating the joint distribution of the first two record values and the corresponding record times. The simple proof is left for the reader to provide. More results are provided in Exercises (??), (??), (??).

Proposition - 9.0.2 - (Distributions of first two record times and record values)

• The joint distribution of (X_{N1}, X_{N2}) and (N1, N2) is

P(X_{N1} ≤ x, X_{N2} ≤ y, N1 = n, N2 = m) = (F(x))^{m+1} / ( n m (m + 1) ) + (F(x))^m (F(y) − F(x)) / (n m),

for x < y, m ≥ 2 and n ∈ {1, 2, · · · , m − 1}.

• The marginal density of the first two record times is

P(N1 = n, N2 = m) = 1/( n m (m + 1) ),   m ≥ 2, n ∈ {1, 2, · · · , m − 1}.

²For instance, one may invoke Basu's theorem of mathematical statistics, along with the second part, to get a short proof of independence.



• The joint density of the first two record values, X_{N1}, X_{N2}, is

[ − ln(1 − F(x)) / (1 − F(x)) ] f(x) f(y),   x < y.

Exercise - 9.0.3 - Find the expected breaking strength of a 50-link chain when the links have common breakage distribution Exp(λ). What happens to the mean breaking strength of the chain as the number of links increases?

Exercise - 9.0.4 - Provide a real life situation in which the largest order statistic,Yn, would be useful.

Exercise - 9.0.5 - Suppose that a bridge is constructed with 25 concrete pillars which are standing in a river bed. With the pressure of flowing water, the probability distribution for each pillar to wash away is Exp(λ). The bridge would collapse if any of its 25 pillars is washed away. What is the expected amount of water pressure that would collapse the bridge? Find the variance of the amount of water pressure that will collapse the bridge. What is the probability that a water pressure of 1/(5λ) would make the bridge collapse?

HW25 Exercise - 9.0.6 - Find the density of the median of the sample when n is an odd integer and X1, X2, · · · , Xn iid∼ Exp(λ).

Exercise - 9.0.7 - Find the expectation of Yk when the sample of size n is drawn from Exp(λ).

Exercise - 9.0.8 - Let X1, X2, · · · , Xn iid∼ Uniform(0, 1). Find Corr(Yk, Yj) when 1 ≤ k < j ≤ n.

HW26 Exercise - 9.0.9 - Find the conditional density of Yk given that Yj = z when the sample is taken from Uniform(0, 1) and 1 ≤ k < j ≤ n.

HW27 Exercise - 9.0.10 - For the above exercise find the random variable E(Yk |Yj).

HW28 Exercise - 9.0.11 - Continue the above exercise and find the random variable Var(Yk | Yj).

Exercise - 9.0.12 - Prove Proposition (9.0.2).

Exercise - 9.0.13 - (Bernoulli process & record time process) An infinite sequence of independent and identically distributed random variables U1, U2, · · · is called a Bernoulli process if Uk ∼ B(1, p). An infinite sequence of independent random variables T1, T2, · · · is called a record time process if T1 = 1 and if Tk ∼ B(1, 1/k) for k = 2, 3, · · · . To explain why, let X1, X2, · · · iid∼ F, where F is a continuous distribution. Take T1 ≡ 1 and for k = 2, 3, · · · , let

Tk = 1 if Xk is a record, i.e., Xk > max_{1≤i≤k−1} Xi, and 0 otherwise.

Show that T1, T2, · · · forms a record time process.


Exercise - 9.0.14 - (Infinite number of records, but finitely many consecutive records) Let U1, U2, · · · be a Bernoulli process and let T1, T2, · · · be a record time process (cf. Exercise (9.0.13)).

• (i) For the Bernoulli process show that with probability one infinitely many heads will take place. [Hint: Use Borel-Cantelli lemma II.]

• (ii) For the record time process show that with probability one infinitely many records will take place.

• (iii) For the Bernoulli process show that with probability one infinitely many times consecutive (pairs of) heads will be observed.

• (iv) For the record time process show that with probability zero infinitely many times consecutive (pairs of) records will be observed.

Exercise - 9.0.15 - (Expected waiting time for the first record is infinite) Let U1, U2, · · · be a Bernoulli process and let T1, T2, · · · be a record time process (cf. Exercise (9.0.13)).

• (i) Let X be the number of tails before the first head is observed. Show that X ∼ Geometric(p) and E(X) = (1 − p)/p.

• (ii) Let N1 be the number of games before the first record is observed. That is, N1 = k if Tk = 1 and Ti = 0 for 1 < i < k, for k = 2, 3, · · · . Show that

P(N1 = k) = 1/( k(k − 1) ),   k = 2, 3, · · · .

This implies E(N1) =∞.

Even though the expected waiting time for the first record is infinite, the total number of records among the first n games does have a limiting distribution.


Lecture 10

Fourier Transforms

Recall that the characteristic function, φX(t), of a random variable X having distribution F is the expectation

φX(t) = E(e^{itX}) = ∫_R e^{itx} dF(x)
= ∫_{−∞}^{∞} e^{itx} f(x) dx   (continuous case, f(x) = dF(x)/dx),
= Σ_k e^{itk} P(X = k)   (integer case, P(X = k) = F(k) − F(k − 1)),
= Σ_k e^{itx_k} P(X = x_k)   (discrete case, P(X = x_k) = F(x_k) − F(x_k^-)).

In the continuous case, the above integral is also called the Fourier transform of f, and is the subject matter of the next section. The integer case leads to Fourier series and the general discrete case leads to Dirichlet series.

10.1 Examples

The Fourier transform is a versatile tool of analysis with numerous applications. Here, we list a few tricks of the trade.

Definition - 10.1.1 - Let f be a function such that ∫_{−∞}^{∞} |f(x)| dx < ∞ (such functions will be called absolutely integrable, or just integrable). Then the Fourier transform of f, denoted by F(f, t), is defined to be

F(f, t) = ∫_{−∞}^{∞} e^{itx} f(x) dx = ∫_{−∞}^{∞} cos(tx) f(x) dx + i ∫_{−∞}^{∞} sin(tx) f(x) dx.

Remark - 10.1.1 - In probability applications f is usually a density (satisfying the absolute integrability condition automatically) and the resulting Fourier transform is known as the characteristic function of the density, or the characteristic function of the random variable having the density f. So, if X is a random variable with density f then the characteristic function, φX(t), is

E e^{itX} = φX(t) = ∫_{−∞}^{∞} e^{itx} f(x) dx = F(f, t).


Example - 10.1.1 - (Characteristic function of a standard uniform) Let X ∼ Uniform(0, 1); then the characteristic function is obtained as follows.

φX(t) = E e^{itX} = E{cos(tX)} + i E{sin(tX)},
E{cos(tX)} = ∫_0^1 cos(tx) dx = (sin t)/t,
E{sin(tX)} = ∫_0^1 sin(tx) dx = −(cos t)/t + 1/t,
φX(t) = E e^{itX} = (sin t − i cos t + i)/t = i(1 − e^{it})/t = (e^{it} − 1)/(it).

As is the case in this example, the complex integral often behaves just like the real integral of an exponential function. Also, note that when t = 0, we get φX(0) = E(e^0) = 1. By using L'Hopital's rule it is easy to see that the function (e^{it} − 1)/(it) also has the limit 1 as t → 0. So, we can say that the characteristic function is continuous at t = 0.
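The closed form (e^{it} − 1)/(it) is easy to confirm numerically, either by Monte Carlo or by quadrature; a minimal Monte Carlo sketch follows, assuming NumPy. The sample size and the chosen t values are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(9)
    U = rng.uniform(size=200_000)

    for t in (0.5, 1.0, 3.0):
        phi_mc = np.mean(np.exp(1j * t * U))                 # Monte Carlo estimate of E e^{itU}
        phi_formula = (np.exp(1j * t) - 1) / (1j * t)
        print(t, np.round(phi_mc, 4), np.round(phi_formula, 4))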

Example - 10.1.2 - (Characteristic function of a triangular random variable) Let X ∼ Uniform(−a/2, a/2), where a > 0, and let X, Y be independent and identically distributed random variables. It is not hard to see that the density of the new random variable, Z = X + Y, is

f(z) = (a − |z|)/a^2 if −a < z < a, and 0 otherwise.

Due to its shape, this is called the triangular density. Now we find the characteristic function of Z. Indeed,

φZ(t) = ∫_{−a}^{a} e^{itx} ( (a − |x|)/a^2 ) dx.

By using the fact that e^{itx} = cos(tx) + i sin(tx), as we did in the previous example, the above integral reduces to

φZ(t) = 2(1 − cos at)/(a^2 t^2) = ( sin(at/2) / (at/2) )^2.

Remark - 10.1.2 - Note that, by using the result of Exercise (10.1.1), this characteristic function is the product of the characteristic functions of X and Y. This should not be surprising, keeping in mind the convolution property that we have often seen in the past for discrete random variables.

Example - 10.1.3 - (Characteristic function of an exponential random variable) Let X ∼ Exp(λ). Its characteristic function is

φX(t) = ∫_0^∞ cos(tx) λe^{−λx} dx + i ∫_0^∞ sin(tx) λe^{−λx} dx.



By (twice) integrating by parts, we get that

∫_0^∞ cos(tx) λe^{−λx} dx = λ^2/(λ^2 + t^2).

Also, integration by parts gives that

∫_0^∞ sin(tx) λe^{−λx} dx = (t/λ) ∫_0^∞ cos(tx) λe^{−λx} dx = tλ/(λ^2 + t^2).

Thus, the characteristic function of X is φX(t) = (λ^2 + itλ)/(λ^2 + t^2) = λ/(λ − it).

Example - 10.1.4 - Note that the function f(x) = sin x e^{−|x|} is integrable. To find its Fourier transform, F(f, t), note that

sin x = (−i e^{ix} + i e^{−ix})/2.

Therefore, the Fourier transform is

(i/2) ∫_{−∞}^{∞} (e^{−ix} − e^{ix}) e^{−|x|} e^{itx} dx
= i [ ∫_{−∞}^{∞} (1/2) e^{−|x|} e^{i(t−1)x} dx − ∫_{−∞}^{∞} (1/2) e^{−|x|} e^{i(t+1)x} dx ]
= i [ 1/(1 + (t − 1)^2) − 1/(1 + (t + 1)^2) ].

The last equality comes by using the result of Exercise (10.1.2).

Remark - 10.1.3 - It is a standard exercise in contour integration to show that the characteristic function of the Cauchy(0, 1) density is e^{−|t|}. We will bypass the contour integration by first inventing the inversion formula and then deducing this result. The characteristic function of N(0, 1) is obtained below by using some general properties of Fourier transforms.

Now we obtain some useful properties of F(f, t). In particular, we will get the properties of the characteristic functions of continuous random variables. Here is the main result of this section.

Theorem - 10.1.1 - (Fundamental properties of Fourier transforms) Every Fourier transform enjoys the following properties.

• (i) (Linearity) F(αf + βg, t) = αF(f, t) + βF(g, t), for any constants α, β.

• (ii) (Translation) F(e^{icx}f(x), t) = F(f, t + c), for any constant c.

• (iii) (Scale) F(f(x), tc) = (1/c) F(f(x/c), t), for any constant c > 0.

• (iv) (Uniform continuity) If f is integrable and piecewise continuous then F(f, t) is a bounded function of t and the bound is

|F(f, t)| ≤ ∫_{−∞}^{∞} |f(x)| dx < ∞.

Furthermore, F(f, t) is a uniformly continuous function of t.


• (v) (Smoothness) Let f be piecewise continuous and absolutely integrable. If xf(x) is also absolutely integrable then F(f, t) is continuously differentiable and

(d/dt) F(f, t) = i F(xf(x), t).

• (vi) (Anti-derivative) Let f be an integrable and differentiable function so that f′ is continuous and integrable as well. Then

−it F(f, t) = F( (d/dx) f(x), t ).

• (vii) (Expansion) Let f be an integrable function so that x^{n+1} f is also absolutely integrable (for some positive integer n). Then

F(f, t) = Σ_{j=0}^{n} (it)^j F(gj f, 0)/j! + ξn |t|^{n+1} F(|g_{n+1} f|, 0)/(n + 1)!,

where gj(x) = x^j and ξn is a complex number with |ξn| ≤ 1.

Proof: (Property (iv)) It is trivial to see that

|F(f, t)| ≤ ∫_{−∞}^{∞} |e^{itx}| |f(x)| dx = ∫_{−∞}^{∞} |f(x)| dx < ∞.

To show continuity, note that

|F(f, t + h) − F(f, t)| = | ∫_{−∞}^{∞} (e^{ix(t+h)} − e^{itx}) f(x) dx |
≤ ∫_{−∞}^{∞} |e^{itx}| |e^{ihx} − 1| |f(x)| dx
= ∫_{−∞}^{∞} |e^{ihx} − 1| |f(x)| dx.

The last integral does not depend on t. Its integrand is bounded by 2|f(x)|, whose integral is finite, and the integrand goes to 0 as h goes to zero. Thus, we can take the limit inside as h goes to zero to get the property. (Property (v)) We can interchange the integral and the derivative operations because the integrand and its derivative (with respect to t) are integrable. To show the continuity of F(xf, t), note that

|F(xf, t + h) − F(xf, t)| = | ∫_{−∞}^{∞} x f(x) e^{itx} (e^{ihx} − 1) dx | ≤ ∫_{−∞}^{∞} |x f(x)| |e^{ihx} − 1| dx.

The integrand is bounded by 2|xf(x)|, which is integrable, and the integrand drops to zero as h goes to zero. So, taking the limit inside the integral gives the result. (Property (vi)) By the fundamental theorem of calculus,

f(x) = ∫_0^x f′(u) du + f(0).


10.1 Examples 87

By the integrability of f′, the limits lim_{x→±∞} f(x) exist. Now the integrability of f implies that these limits must be zero. So, by integration by parts,

F(f′, t) = e^{itx} f(x) |_{−∞}^{∞} − it ∫_{−∞}^{∞} e^{itx} f(x) dx = −it F(f, t).

(Property (vii)) For j ∈ {1, 2, · · · , n} write g_j f = g_j f^{j/(n+1)} · f^{(n+1−j)/(n+1)}. Now apply the Hölder inequality, with p = (n + 1)/j, to see that the g_j f are absolutely integrable as well. Next recall the fact that

e^{itx} = Σ_{j=0}^{n} (itx)^j/j! + R_n(itx),   |R_n(itx)| ≤ |tx|^{n+1}/(n + 1)!.

Therefore,

F(f, t) = Σ_{j=0}^{n} (it)^j/j! ∫_{−∞}^{∞} x^j f(x) dx + ∫_{−∞}^{∞} R_n(itx) f(x) dx = Σ_{j=0}^{n} (it)^j/j! F(g_j f, 0) + ∫_{−∞}^{∞} R_n(itx) f(x) dx.

Finally, the last term is a complex number with magnitude

| ∫_{−∞}^{∞} R_n(itx) f(x) dx | ≤ |t|^{n+1}/(n + 1)! ∫_{−∞}^{∞} |x|^{n+1} |f(x)| dx = |t|^{n+1}/(n + 1)! · F(|g_{n+1} f|, 0).

This finishes the proof. ♠

Example - 10.1.5 - (Characteristic function of N(0, 1)) Let X ∼ N(0, 1). By the smoothness property,

d/dt φ(t) = (1/√(2π)) ∫_{−∞}^{∞} i x e^{itx} e^{−x^2/2} dx,   where φ(t) = E(e^{itX}).

Since x cos(tx) is an odd function of x and the density is an even function,

d/dt φ(t) = i^2 √(2/π) ∫_0^∞ x sin(tx) e^{−x^2/2} dx.

By integration by parts, we get

d/dt φ(t) = −t √(2/π) ∫_0^∞ cos(tx) e^{−x^2/2} dx = −t φ(t),   or   (1/φ(t)) dφ(t) = −t dt.

Integrating both sides, ln φ(t) = −t^2/2 + c, or φ(t) = e^{−t^2/2} e^c. Now for t = 0, we get φ(0) = e^c. For any random variable, φ(0) = 1. This gives that the characteristic function of N(0, 1) must be φ(t) = e^{−t^2/2}.


Remark - 10.1.4 - (Convolution property) Let f, g be piecewise continuous and integrable functions. Recall that the convolution of these two functions, denoted by f ∗ g(t), is

f ∗ g(t) = ∫_{−∞}^{∞} f(x) g(t − x) dx.

One way the convolution function arises is via adding independent random variables. More precisely, if X, Y are independent continuous random variables with respective densities f and g, then the density of Z = X + Y is the convolution f ∗ g(t). This shows that the characteristic function of Z is the product of the characteristic functions of X and Y,

F(f ∗ g, θ) = F(f, θ) F(g, θ),   θ ∈ R.
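The convolution property is easy to see in a simulation: for independent X, Y the empirical characteristic function of Z = X + Y should be close to the product of the empirical characteristic functions of X and Y. The following NumPy sketch illustrates this; the distributions, sample size, seed, and test values of t are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.exponential(scale=1.0, size=n)     # X ~ Exp(1)
Y = rng.normal(size=n)                     # Y ~ N(0, 1), independent of X

def ecf(sample, t):
    """Empirical characteristic function (1/n) sum_j exp(i t X_j)."""
    return np.mean(np.exp(1j * t * sample))

for t in (0.3, 1.0, 2.0):
    lhs = ecf(X + Y, t)
    rhs = ecf(X, t) * ecf(Y, t)
    exact = (1 / (1 - 1j * t)) * np.exp(-t**2 / 2)   # product of the two known ch.f.'s
    print(t, lhs, rhs, exact)   # all three agree up to Monte Carlo error
```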

HW29 Exercise - 10.1.1 - Find the characteristic function of X ∼ Uniform(−a, a) where a > 0. Deduce that (sin t)/t is a characteristic function.

HW30 Exercise - 10.1.2 - Find the characteristic function of the Laplace density (also called the double exponential density) f(x) = (λ/2) e^{−λ|x|}, −∞ < x < ∞. (Hint: the answer is λ^2/(λ^2 + t^2).)

Exercise - 10.1.3 - Find the Fourier transform of f(x) = cos x · e^{−|x|}.

Exercise - 10.1.4 - Prove the first three properties of the Fourier transforms and deduce that, for any constant c and a random variable X,

φ_{X+c}(t) = e^{ict} φ_X(t),   φ_{cX}(t) = φ_X(tc).

Exercise - 10.1.5 - Prove that the characteristic function of a continuous random variable is always uniformly continuous and bounded by 1.

HW31 Exercise - 10.1.6 - Suppose the first moment of X is finite. Show that its characteristic function is differentiable and

d/dt φ_X(t) = E(iX e^{itX}),   d/dt φ_X(t) |_{t=0} = iE(X).

HW32 Exercise - 10.1.7 - Let f be piecewise continuous and absolutely integrable. If x^2 f(x) is absolutely integrable then prove that F(f, t) is twice differentiable and

d^2/dt^2 F(f, t) = −F(x^2 f(x), t).

Suppose that the second moment of X is finite. Deduce that its characteristic function is twice differentiable with

E(X^2) = − d^2/dt^2 φ_X(t) |_{t=0}.


Lecture 11

Summability Assisted Inversion

Now we show how to recover the function f from its Fourier transform F(f, t). In this regard, the fundamental insight is due to Dirichlet.

Let f be an integrable function (i.e., ∫_{−∞}^{∞} |f(x)| dx < ∞). The Fourier transform of f is

F(f, θ) := ∫_R e^{iθx} f(x) dx,   θ ∈ R.

The fundamental question of all transform theories¹ is: "does the transform, F(f, θ), contain all the information needed to reconstruct f from it?" One of the first reconstruction schemes was suggested by Dirichlet. He wondered if f could be reconstructed from F(f, θ) as follows:

f(x) ?= lim_{t→∞} (1/2π) ∫_{−t}^{t} e^{−iθx} F(f, θ) dθ.   (0.1)

Amazingly this works, at least for some special functions f, such as those integrable f's for which F(f, θ) is again integrable. Even in the case of those f's for which (0.1) does not work, some averaging (summability smoothing) operations performed on top of the above Dirichlet scheme can, again, reconstruct f from F(f, θ). Their underlying ideas are summarized below.

Step 1. Summability assistance. Let L_x(t) := (1/2π) ∫_{−t}^{t} e^{−iθx} F(f, θ) dθ. The most basic averaging (summability) operation is the simple averaging,

(1/2T) ∫_0^{2T} L_x(t) dt.

¹ Such as generating functions, moment generating functions, Laplace transforms, Fourier transforms, Fourier–Stieltjes transforms, etc.


This is called the Cesàro method. Another averaging (summability) operation is

(1/T) ∫_0^∞ e^{−t/T} L_x(t) dt,

and is known as the Cauchy/Abel method. Note that in both cases we used a probabilistic expectation of L_x: in the first case we used a Uniform(0, 2T) density and in the second case an Exp(1/T) density. In general, if p_T(t) is some density then the summability smoothing of L_x(·) is the following expectation:

∫_0^∞ p_T(t) L_x(t) dt = (1/2π) ∫_0^∞ p_T(t) ∫_{−t}^{t} e^{−iθx} F(f, θ) dθ dt = (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(f, θ) g_T(θ) dθ,   where g_T(θ) := ∫_{|θ|}^{∞} p_T(t) dt.

Step 2. Parseval's relation. When f, g_T are integrable, Parseval's relation says that

∫_R e^{−iθx} F(f, θ) g_T(θ) dθ = ∫_R f(t) F(g_T, t − x) dt = ∫_R f(x + θ) F(g_T, θ) dθ.

This relationship follows by writing the Fourier transform as an integral and then switching the order of integration. The reader should fill in the details.

Example - 11.0.6 - (Three examples) Note that regardless of which p_T(t) density we choose to use, we will have g_T(θ) ≥ 0. If g_T(θ) is integrable, it is only off by a constant from being a probability density itself. If c g_T is a probability density, for some constant c, we have

(1/2π) ∫_R e^{−iθx} F(f, θ) g_T(θ) dθ = (1/(2πc)) ∫_R f(x + θ) F(c g_T, θ) dθ.

Since g_T(θ) is symmetric, F(c g_T, θ) will be a real-valued function. Consider the following three examples.

• (i) When we take p_T(t) to be the Uniform(0, 2T) density, we get g_T(θ) = 1 − |θ|/(2T) for |θ| ≤ 2T, and zero otherwise. Note that g_T is a constant multiple of the Triangular(−2T, 2T) density, (1/(2T))(1 − |θ|/(2T)), with c = 1/(2T). The characteristic function (Fourier transform) of c g_T is (sin(Tθ)/(Tθ))^2. So, if we let C_T have density (T/π)(sin(Tθ)/(Tθ))^2, θ ∈ R, the Cesàro inversion formula becomes

lim_{T→∞} (1/2π) ∫_{−2T}^{2T} e^{−iθx} F(f, θ) (1 − |θ|/(2T)) dθ = lim_{T→∞} (T/π) ∫_R f(x + θ) (sin(Tθ)/(Tθ))^2 dθ = lim_{T→∞} E(f(x + C_T)).

• (ii) When we take p_T(t) = (1/T) e^{−t/T} for t > 0 and zero otherwise, i.e., the Exp(1/T) density, then g_T(θ) = e^{−|θ|/T}. It is a constant multiple of the double exponential density, (1/(2T)) e^{−|θ|/T}, with c = 1/(2T). The characteristic function


(Fourier transform) of c g_T is 1/(1 + (Tθ)^2). So, the Cauchy inversion formula becomes

lim_{T→∞} (1/2π) ∫_R e^{−iθx} F(f, θ) e^{−|θ|/T} dθ = lim_{T→∞} (T/π) ∫_R f(x + θ) · 1/(1 + (Tθ)^2) dθ = lim_{T→∞} E(f(x + K_T)),

where K_T has the density (T/π) · 1/(1 + (Tθ)^2), θ ∈ R.

• (iii) When we take p_T(t) to be the density of a √(Exp(1/(2T))) random variable, we get g_T(θ) = e^{−θ^2/(2T)}. It is a constant multiple of the N(0, T) density, with c = 1/√(2πT). The characteristic function (Fourier transform) of c g_T is e^{−Tθ^2/2}. Now the inversion scheme, called the Gauss–Weierstrass inversion formula, becomes

lim_{T→∞} (1/2π) ∫_R e^{−iθx} F(f, θ) e^{−θ^2/(2T)} dθ = lim_{T→∞} (√T/√(2π)) ∫_R f(x + θ) e^{−Tθ^2/2} dθ = lim_{T→∞} E(f(x + W_T)),

where W_T ∼ N(0, 1/T).
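To see the Gauss–Weierstrass scheme of case (iii) in action, the following NumPy sketch pretends that only the Fourier transform F(f, θ) = 1/(1 − iθ) of the Exp(1) density is known and recovers f(1) = e^{−1} numerically. The values of T, the integration window, and the grid size are ad-hoc choices.

```python
import numpy as np

def gauss_weierstrass_inverse(cf, x, T, L=200.0, n=400_001):
    """(1/2pi) * int e^{-i theta x} cf(theta) e^{-theta^2/(2T)} d theta, by trapezoid rule."""
    theta = np.linspace(-L, L, n)
    integrand = np.exp(-1j * theta * x) * cf(theta) * np.exp(-theta**2 / (2 * T))
    return np.trapz(integrand, theta).real / (2 * np.pi)

cf_exp = lambda theta: 1.0 / (1.0 - 1j * theta)   # Fourier transform of f(x) = e^{-x}, x > 0

for T in (10, 100, 1000):
    print(T, gauss_weierstrass_inverse(cf_exp, x=1.0, T=T), np.exp(-1.0))
# as T grows the recovered value approaches f(1) = e^{-1} ~ 0.3679
```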

Step 3. Approximate identity. So, continuing from Step 2, the issue is: "when is it that

f(x) ?= lim_{T→∞} (1/2π) ∫_R e^{−iθx} F(f, θ) g_T(θ) dθ = lim_{T→∞} (1/2π) ∫_R f(x + θ) F(g_T, θ) dθ?"

The answer depends on f. If f is integrable so that F(f, θ) is also integrable, then even the limit in Dirichlet's original idea, (0.1), holds² and we get

f(x) = (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(f, θ) dθ,   x ∈ R.   (0.2)

When F(f, θ) is not integrable, in each of the above three examples,

lim_{T→∞} E f(x + C_T) = lim_{T→∞} E f(x + K_T) = lim_{T→∞} E f(x + W_T) = f(x),

if |f(x)| does not grow "too rapidly" as |x| → ∞, and if x is a point of continuity of f. The key reason behind this is that each random variable, C_T, K_T, W_T, converges to zero in probability as T gets large. In other words, if h_T(θ) denotes the density of any of these three random variables, then in each case,

lim_{T→∞} ∫_{|θ|>δ} h_T(θ) dθ = 0, for any δ > 0.

A sequence of functions, h_T(θ), T = 1, 2, · · · , is called an approximate identity if

² Thanks to a theorem called the Lebesgue dominated convergence theorem.


• (i) lim_{T→∞} ∫_R h_T(θ) dθ = 1,

• (ii) sup_T ∫_R |h_T(θ)| dθ < ∞,

• (iii) lim_{T→∞} ∫_{|θ|>δ} |h_T(θ)| dθ = 0 for any δ > 0.

The densities of the random variables C_T, K_T, W_T are all approximate identities. Now for any bounded function f, if x is any point of continuity of f, then for any ǫ > 0 there exists a δ > 0 so that |f(x + θ) − f(x)| < ǫ whenever |θ| ≤ δ. This gives that

| ∫_R f(x + θ) h_T(θ) dθ − f(x) | ≤ |f(x)| | ∫_R h_T(θ) dθ − 1 | + K_f ∫_{|θ|>δ} |h_T(θ)| dθ + ∫_{|θ|≤δ} |f(x + θ) − f(x)| |h_T(θ)| dθ,   (0.3)

where K_f is a bound for 2f. The last term is bounded by ǫ sup_T ∫_R |h_T(θ)| dθ, so it can be made arbitrarily small by picking ǫ small enough. The first two terms go to zero as T gets large. By using a slightly more elaborate argument (see Khan [?]), we can somewhat relax the continuity assumption on x and get

lim_{T→∞} ∫_R f(x + θ) h_T(θ) dθ = (f(x+) + f(x−))/2,

at any point of simple discontinuity. Also, the boundedness assumption on f can be relaxed somewhat, but the amount of relaxation is tied to which inversion technique we choose to use. This gives us the general inversion theorem:

lim_{T→∞} (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(f, θ) g_T(θ) dθ = (f(x+) + f(x−))/2.
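To make the approximate identity conditions concrete, the following NumPy sketch takes the Cesàro case, h_T(θ) = (T/π)(sin(Tθ)/(Tθ))^2 (the density of C_T), and checks numerically that its total mass stays near 1 while the mass outside a fixed δ shrinks as T grows. The window, grid, and the value δ = 0.5 are ad-hoc choices.

```python
import numpy as np

def h(theta, T):
    # (T/pi) * (sin(T theta)/(T theta))^2, with the value T/pi at theta = 0
    s = np.sinc(T * theta / np.pi)          # np.sinc(x) = sin(pi x)/(pi x)
    return (T / np.pi) * s**2

theta = np.linspace(-60, 60, 2_000_001)
for T in (5, 50, 500):
    mass = np.trapz(h(theta, T), theta)
    tail = np.trapz(np.where(np.abs(theta) > 0.5, h(theta, T), 0.0), theta)
    print(T, mass, tail)    # mass stays near 1, the tail beyond delta = 0.5 shrinks like 1/T
```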

Remark - 11.0.5 - (Summary) The moral of this elaborate story can be summarized as follows:

• Take any density p_T(t) defined over (0, ∞) and set g_T(θ) = ∫_{t>|θ|} p_T(t) dt. If the summability density p_T(t) has a finite first moment, then g_T will be integrable. Hence, h_T(θ) := (1/2π) F(g_T, θ) will be well defined and will be real, because of the symmetry of g_T.

• If the collection h_T(θ), T = 1, 2, · · · , forms an approximate identity, then

lim_{T→∞} ∫_0^∞ p_T(t) L_x(t) dt = (1/2π) lim_{T→∞} ∫_0^∞ p_T(t) ∫_{−t}^{t} e^{−iθx} F(f, θ) dθ dt = f(x),

for any bounded integrable function f and any point of continuity x of f. This is the underlying idea behind Fourier inversion.


Remark - 11.0.6 - (Delta function) We may convey the above summary/idea in a language, and a notation, that avoids mentioning which summability technique was deployed. This is done with the help of the delta function, which you may think of as the limit of the density of any of the three random variables C_T, K_T, W_T:

δ(θ) = lim_{T→∞} (T/π) (sin(Tθ)/(Tθ))^2 = lim_{T→∞} (T/π) · 1/(1 + (Tθ)^2) = lim_{T→∞} (√T/√(2π)) e^{−Tθ^2/2},   θ ∈ R.

The delta function is given the following properties:

• (i) δ(0) = +∞, and δ(x) = 0 for any x ≠ 0, so that δ(x) = δ(−x).

• (ii) ∫_R δ(x) dx = 1, or more generally, as our last three examples suggested,

∫_R f(x + θ) δ(θ) dθ = (f(x+) + f(x−))/2.

• (iii) ∫_R e^{iθx} δ(x) dx ≡ 1, i.e., F(δ, θ) ≡ 1, which is visible from our three examples since

lim_{T→∞} (1 − |θ|/(2T)) = lim_{T→∞} e^{−|θ|/T} = lim_{T→∞} e^{−θ^2/(2T)} ≡ 1.

• (iv) ∫_{−∞}^{t} δ(x) dx = ∆(t), where ∆(t) = 0 for t < 0 and ∆(t) = 1 for t ≥ 0. The idea behind this is the fact that each of the three random variables, C_T, K_T, W_T, converges to zero in probability, and ∆ is the distribution of a r.v. which takes the value zero with probability one.

It should be noted that these properties of δ(θ) are only operational devices that abbreviate more elaborate approximation (limit) arguments. Such functions are studied under the topic of "Schwartz distributions", which is a somewhat different notion from the usual probability distribution.

Theorem - 11.0.2 - (Uniqueness) Let f, g be two piecewise continuous and integrable functions so that F(f, θ) = F(g, θ). Then f and g are essentially³ the same functions.

Proof: By the above inversion theorem,

(f(x+) + f(x−))/2 = lim_{T→∞} (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(f, θ) g_T(θ) dθ = lim_{T→∞} (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(g, θ) g_T(θ) dθ = (g(x+) + g(x−))/2.

³ They may differ on a "negligible" set, such as artificially making the functions differ at the points of discontinuity.


Hence, f, g must agree at their points of continuity. (At the points of discontinuity they may differ.) ♠

The following proposition shows that we really cannot do better than this when there are points of discontinuity. The integrability of the Fourier transform is a genuinely strong assumption: it forces the function f to have no points of discontinuity.

Proposition - 11.0.1 - Let f be a piecewise continuous and integrable function with Fourier transform F(f, t). If f has a point of discontinuity then F(f, t) cannot be integrable.

Proof: Suppose F(f, t) is integrable. Recall that F(f, t) is always continuous for integrable f. When F(f, t) is also integrable then our inversion theorem gives

(f(x+) + f(x−))/2 = lim_{T→∞} (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(f, θ) g_T(θ) dθ = (1/2π) ∫_{−∞}^{∞} e^{−iθx} F(f, θ) dθ.

The last expression, being a Fourier transform of F(f, t), must be a continuous function. So the left side must also be a continuous function. ♠

Remark - 11.0.7 - Many of the continuous densities which are defined on the half real line or on finite intervals have points of discontinuity. The above proposition shows that no such density can have an integrable characteristic function. Thus, we have to resort to the summability assisted limit operation to invert such Fourier transforms. The inversion formula is more useful for proving other theoretical results than for actually performing the inversion.

Example - 11.0.7 - Now we find the Fourier transform of the Cauchy(0, 1) density. Recall (Exercise (10.1.2)) that the Laplace density (1/2)e^{−|x|} had the Fourier transform (characteristic function) F(f, t) = 1/(1 + t^2). Now both the Laplace density and its Fourier transform are continuous and integrable. So, the first inversion theorem gives that

(1/2π) ∫_{−∞}^{∞} e^{−ixt} · 1/(1 + t^2) dt = (1/2) e^{−|x|}.

By making the transformation y = −t we get

∫_{−∞}^{∞} e^{ixy} · 1/(π(1 + y^2)) dy = e^{−|x|}.

Since g(y) = 1/(π(1 + y^2)) is the Cauchy(0, 1) density, we see that the Fourier transform (characteristic function) of this density is F(g, t) = e^{−|t|}.
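A direct numerical check of this conclusion is easy, although the slowly decaying 1/(1 + y^2) tail forces a wide integration window. The sketch below assumes NumPy; the window and grid are ad-hoc, so the agreement is only up to the truncation error of the window (roughly a few times 1e-4).

```python
import numpy as np

y = np.linspace(-2000, 2000, 2_000_001)
cauchy = 1.0 / (np.pi * (1.0 + y**2))        # Cauchy(0,1) density

for t in (0.0, 0.5, 1.0, 2.0):
    numeric = np.trapz(cauchy * np.exp(1j * t * y), y).real
    print(t, numeric, np.exp(-abs(t)))       # close to e^{-|t|}, up to window truncation
```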

Here is another nice example of the many uses of the inversion formula. In this example we see how to find some not too obvious integrals.


[Figure 11.1: Triangular Density. The density f of X + Y is supported on [−2, 2] with peak f(0) = 1/2.]

Example - 11.0.8 - Recall (Exercise (10.1.1)) that the Uniform(−1, 1) density has characteristic function sin t/t, and that if X, Y iid∼ Uniform(−1, 1) then X + Y has the triangular density f shown in Figure 11.1. Its characteristic function is (sin t/t)^2, since E(e^{it(X+Y)}) = E(e^{itX}) E(e^{itY}). Note that both the density and its characteristic function are continuous and integrable functions. So, the inversion formula immediately gives that

(1/2π) ∫_{−∞}^{∞} e^{−ixt} (sin t)^2/t^2 dt = f(x).

In particular, when x = 0,

(1/2π) ∫_{−∞}^{∞} (sin t)^2/t^2 dt = f(0) = 1/2,   or   ∫_{−∞}^{∞} (sin t)^2/t^2 dt = π.


Lecture 12

General Inversion

Now we present a particular form of the inversion theorem that is useful in probability and statistics.

Theorem - 12.0.3 - (General inversion) Let φ(θ) = ∫_R e^{iθx} dF(x) be the characteristic function of a distribution F of a random variable X.

• (i) (Clever trick). Define a new random variable Z = X + U, where U is independent of X and U ∼ Uniform(−h, 0). Prove that Z has the following probability density:

f_Z(x) = (F(x + h) − F(x))/h,   x ∈ R.

(The fact that f_Z is a density, and hence integrable, opens the door for inversion by our earlier summability assisted tools.)

• (ii) Show that the Fourier transform of f_Z is F(f_Z, θ) = φ(θ) (1 − e^{−iθh})/(iθh).

• (iii) Show that f_Z(x) is a bounded function of x and that if x, x + h are points of continuity of F, then x is a point of continuity of f_Z.

• (iv) Explain why the following equalities and convergences hold as T → ∞.

(1/2π) ∫_R e^{−iθx} φ(θ) (1 − e^{−iθh})/(iθh) e^{−θ^2/(2T)} dθ = (√T/√(2π)) ∫_R f_Z(x + θ) e^{−Tθ^2/2} dθ → f_Z(x) := (F(x + h) − F(x))/h,

(1/2π) ∫_{−2T}^{2T} e^{−iθx} φ(θ) (1 − e^{−iθh})/(iθh) (1 − |θ|/(2T)) dθ → f_Z(x),

(1/2π) ∫_{−∞}^{∞} e^{−iθx} φ(θ) (1 − e^{−iθh})/(iθh) e^{−|θ|/T} dθ → f_Z(x).

• (v) Explain why the last part characterizes F.


Proof: It does not take much effort (by the usual convolution argument) to show directly that the said function, f_Z(x), is indeed a density of Z. Moving along, if (F(x + h) − F(x))/h is a density then note that its characteristic function (by using the anti-derivative property of Fourier transforms) has to be

∫_{−∞}^{∞} e^{itx} (F(x + h) − F(x))/h dx = (1/h) [ F(f(x + h), t)/(−it) − F(f, t)/(−it) ] = (1/h) [ e^{−ith} φ(t)/(−it) − φ(t)/(−it) ] = φ(t) (1 − e^{−ith})/(ith).

The last term is indeed a characteristic function; it is a bounded function since it is the product of two characteristic functions, each of which is bounded. The continuity of f_Z(x) is trivial when x, x + h are points of continuity of F. Now take W ∼ N(0, n) and use Parseval's relation to get

(1/√(2πn)) ∫_{−∞}^{∞} e^{−itx} φ(t) (1 − e^{−ith})/(ith) e^{−t^2/(2n)} dt = ∫_{−∞}^{∞} f_Z(y) e^{−n(y−x)^2/2} dy = ∫_{−∞}^{∞} f_Z(y + x) e^{−ny^2/2} dy.

Multiplying both sides by (n/(2π))^{1/2} and invoking the Gauss–Weierstrass inversion gives the result. Finally, the set of points of discontinuity of (any distribution) F is countable. So, taking an infinite sequence of points x = t_0 < t_1 < t_2 < · · · → ∞, all points of continuity of F, we see that

1 − F(x) = lim_{M→∞} [F(t_{M+1}) − F(x)].

Each of the terms of the last limit is obtained by the inversion scheme that we have proved. Hence, F(x) can be computed at each of its points of continuity. But then, by the right continuity of F, it is fully specified at all x ∈ R. ♠
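The general inversion theorem works even when X has no density. The NumPy sketch below applies the Gauss–Weierstrass-smoothed version of part (iv) to X ∼ Bernoulli(0.3), with φ(θ) = 0.7 + 0.3 e^{iθ}, x = −0.5 and h = 1, for which (F(x + h) − F(x))/h = P(X = 0) = 0.7; both x and x + h are continuity points of F. The smoothing level T, the window, and the grid are ad-hoc choices.

```python
import numpy as np

phi = lambda th: 0.7 + 0.3 * np.exp(1j * th)     # ch.f. of Bernoulli(0.3)
x, h, T = -0.5, 1.0, 2000.0

theta = np.linspace(-300, 300, 600_001)
den = np.where(theta == 0, 1.0, 1j * theta * h)                      # avoid 0/0 at theta = 0
smoother = np.where(theta == 0, 1.0, (1 - np.exp(-1j * theta * h)) / den)   # ch.f. of U ~ Unif(-h, 0)

integrand = np.exp(-1j * theta * x) * phi(theta) * smoother * np.exp(-theta**2 / (2 * T))
estimate = np.trapz(integrand, theta).real / (2 * np.pi)
print(estimate)    # close to (F(x+h) - F(x))/h = 0.7
```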

HW33 Exercise - 12.0.8 - Let f be any integrable function. Prove that,¹ for any real number x,

(1/2π) ∫_{−T}^{T} e^{−iθx} F(f, θ) dθ = ∫_{−∞}^{∞} f(x + y) sin(Ty)/(πy) dy.

D Exercise - 12.0.9 - (Plancherel's identity) Let X be a continuous random variable with bounded density f, distribution F and characteristic function φ(t). Then |φ(t)|^2 is integrable if and only if f^2 is integrable, and in this case

∫_{−∞}^{∞} f^2(x) dx = (1/2π) ∫_{−∞}^{∞} |φ(t)|^2 dt.

¹ This approach leads to the Dirichlet inversion formula. However, its proof requires some more work. Note that the right hand side function, h_T(y) := sin(Ty)/(πy), is neither nonnegative nor is |h_T(y)| integrable. It is not an approximate identity.


HW34 Exercise - 12.0.10 - (Parseval's relation) Prove the following results.

• (i) For any integrable functions f(x), g(x) having respective Fourier transforms F(f, t), F(g, t), we have

∫_{−∞}^{∞} g(t) F(f, t) dt = ∫_{−∞}^{∞} f(y) F(g, y) dy.

• (ii) For any random variables X, U with respective characteristic functions φ_X(t), φ_U(t), we have E φ_X(U) = E φ_U(X).

• (iii) Let X, W be two random variables with respective characteristic functions φ_X(t) and φ_W(t). Then, for any x ∈ R,

E( e^{−ixW} φ_X(W) ) = E( φ_W(X − x) ).

• (iv) For any integrable functions f, g with respective Fourier transforms F(f, t), F(g, t), and any x ∈ R, we have

∫_{−∞}^{∞} e^{−ixt} g(t) F(f, t) dt = ∫_{−∞}^{∞} f(y) F(g, y − x) dy.

Exercise - 12.0.11 - (Continuity theorem) Let X_n be a sequence of random variables with respective characteristic functions φ_n(t). Then P(X_n ≤ x) converges to P(X ≤ x) at all x at which F(x) := P(X ≤ x) is continuous, if and only if the sequence φ_n(t) converges to a continuous limit φ(t) (in which case φ is the characteristic function of X). Give a sketch of the proof in the backward direction by using the result of Theorem (12.0.3).

12.1 Fourier & Dirichlet Series

The Fourier transform is a function-to-function transform. Fourier and Dirichlet series are function-to-sequence transformations.

Definition - 12.1.1 - (Trigonometric & Fourier series) Any series of the form

a_0/2 + Σ_{k=1}^{∞} (a_k cos kt + b_k sin kt)

is called a trigonometric series. If the coefficients a_k, b_k are obtained from an integrable function f over [−π, π], which is 2π-periodic, by the formulas

a_n = (1/π) ∫_{−π}^{π} f(x) cos nx dx,   b_m = (1/π) ∫_{−π}^{π} f(x) sin mx dx,   (1.1)

where n = 0, 1, 2, · · · and m = 1, 2, · · · , then the resulting trigonometric series is called the Fourier series of f.


So, someone comes along and gives us the Fourier coefficients (1.1) of a 2π-periodic integrable function f. One of the fundamental issues regarding Fourier series is to see if the sequence of partial sums of the resulting Fourier series,

S_n(f, t) := a_0/2 + Σ_{k=1}^{n} (a_k cos kt + b_k sin kt),   n = 1, 2, · · · ,

converges in some sense, and if it does, to which limiting function. This is again an inversion question. To see why, it will be more illuminating if we rewrite the partial sums in complex form,

S_n(f, t) = a_0/2 + Σ_{k=1}^{n} [ a_k (e^{ikt} + e^{−ikt})/2 + b_k (e^{ikt} − e^{−ikt})/(2i) ] = Σ_{k=−n}^{n} f̂(k) e^{ikt},

where we take b_0 = 0 and, for k = 0, 1, · · · , n,

f̂(k) = (a_k − i b_k)/2 = (1/2π) ∫_{−π}^{π} f(u) e^{−iku} du,   f̂(−k) = (a_k + i b_k)/2 = (1/2π) ∫_{−π}^{π} f(u) e^{iku} du.

Conversely, a_k = f̂(k) + f̂(−k) and b_k = i(f̂(k) − f̂(−k)).
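These relations between the real coefficients a_k, b_k and the complex coefficients f̂(k) are easy to cross-check numerically. The short NumPy sketch below does this for the 2π-periodic test function f(x) = exp(cos x); the test function and the quadrature grid are arbitrary choices.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 20_001)
f = np.exp(np.cos(x))                      # a smooth 2*pi-periodic test function

def a(k):    return np.trapz(f * np.cos(k * x), x) / np.pi
def b(k):    return np.trapz(f * np.sin(k * x), x) / np.pi
def fhat(k): return np.trapz(f * np.exp(-1j * k * x), x) / (2 * np.pi)

for k in (0, 1, 2, 3):
    print(k, fhat(k), (a(k) - 1j * b(k)) / 2)   # the two columns coincide
```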

Remark - 12.1.1 - (Reconstruction of 2π-periodic functions) The aim of the following set of exercises is to make a connection with our earlier discussion about Fourier inversion formulas and the convergence of Fourier series. When f is 2π-periodic and integrable, let the Fourier transform of f be

F(f, j) := ∫_{−π}^{π} f(u) e^{iju} du,   j = 0, ±1, ±2, · · · .

Again, the issue is: "does the collection {F(f, j), j = 0, ±1, · · · } contain all the information needed to reconstruct f?" Once again, Dirichlet proposed a reconstruction scheme (which happens to be the partial sums sequence),

S_n(f, x) := Σ_{j=−n}^{n} f̂(j) e^{ijx} = (1/2π) Σ_{k=−n}^{n} e^{−ikx} F(f, k),   (1.2)

and showed (now known as the Dirichlet–Jordan theorem [?], p. 57, [?]) that it does converge to f(x) provided

• f is 2π-periodic and continuous and,

• f′ exists, is bounded, and has at most finitely many points of discontinuity.

It was discovered by du Bois-Reymond (see Hardy and Rogosinski [?]) that assuming only continuity of f is not enough, in the sense that there exist continuous f for which lim sup_n |S_n(f, 0)| = ∞. But once again, this failure is primarily due to Dirichlet's reconstruction scheme. The collection {F(f, k), k = 0, ±1, · · · } does store enough information to reconstruct f, if we use an averaged version (summability assisted form) of the Dirichlet reconstruction scheme. This is the subject matter of the following developments.


Theorem - 12.1.1 - (Dirichlet kernel) Let f be a 2π-periodic integrable function with partial sums of its Fourier series S_n(f, x) = (1/2π) Σ_{k=−n}^{n} e^{−ikx} F(f, k), where F(f, k) = ∫_{−π}^{π} f(x) e^{ikx} dx, k = 0, ±1, ±2, · · · . Then we have

S_n(f, x) = ∫_{−π}^{π} f(x + u) D_n(u) du,   D_n(u) := sin((n + 0.5)u)/(2π sin(u/2)),   ∫_{−π}^{π} D_n(u) du = 1.

The kernel D_n(u) is called the Dirichlet kernel.

Proof: Figure (12.1) gives two representative shapes of D_n(u), for n = 5 and n = 8. We will use the fact that cos(A − B) = cos A cos B + sin A sin B.


Figure 12.1: Dirichlet Kernels for n = 5 and n = 8.

S_N(f, x) = (1/2π) ∫_{−π}^{π} f(t) dt + Σ_{k=1}^{N} (1/π) ∫_{−π}^{π} f(t) [cos kt cos kx + sin kt sin kx] dt = (1/2π) ∫_{−π}^{π} f(t) [ 1 + 2 Σ_{k=1}^{N} cos k(t − x) ] dt.

Note that 2 sin A cos B = sin(A + B) + sin(A − B); taking A = x/2 and B = kx, by the telescoping effect we get

Σ_{k=1}^{N} 2 sin(x/2) cos(kx) = Σ_{k=1}^{N} [ sin((k + 1/2)x) − sin((k − 1/2)x) ] = sin((N + 1/2)x) − sin(x/2).

Dividing by sin(x/2), we get

2 Σ_{k=1}^{N} cos(kx) = (1/sin(x/2)) Σ_{k=1}^{N} 2 sin(x/2) cos(kx) = −1 + sin((N + 1/2)x)/sin(x/2).


This gives that

S_N(f, x) = (1/2π) ∫_{−π}^{π} f(t) sin((N + 0.5)(t − x))/sin((t − x)/2) dt = ∫_{−π−x}^{π−x} f(u + x) D_N(u) du,   u = t − x,

where we take D_N(u) := sin((N + 0.5)u)/(2π sin(u/2)) (and by 2π-periodicity of the integrand the limits may be shifted back to [−π, π]). Finally, for f(u) ≡ 1 over [−π, π], by the orthogonality of the sin, cos functions we have a_0 = 2 and a_k = b_k = 0 for k ≥ 1. Hence, S_n(f, x) = a_0/2 = 1 for all n ≥ 1. This gives that

∫_{−π}^{π} D_n(u) du = ∫_{−π}^{π} f(x + u) D_n(u) du = S_n(f, x) = 1.

This finishes the proof. ♠

Here is the main result of this section, which shows how to get the inversion for any continuous 2π-periodic function.

Theorem - 12.1.2 - (Fejér's theorem) Let f be a 2π-periodic integrable function with partial sums of its Fourier series S_n(f, x) = (1/2π) Σ_{k=−n}^{n} e^{−ikx} F(f, k), where F(f, k) = ∫_{−π}^{π} f(x) e^{ikx} dx, k = 0, ±1, ±2, · · · . Then the following results hold.

(i) The Cesàro mean of S_n(f, x), n = 0, 1, 2, · · · , is

σ_T(f, x) := (1/(T + 1)) Σ_{n=0}^{T} S_n(f, x) = (1/2π) Σ_{j=−T}^{T} (1 − |j|/(T + 1)) F(f, j) e^{−ijx}.

(ii) We also have

(1/2π) Σ_{j=−T}^{T} (1 − |j|/(T + 1)) F(f, j) e^{−ijx} = ∫_{−π}^{π} f(x + θ) h_T(θ) dθ = E f(x + Y_T),

where Y_T is a random variable with density h_T(θ) = (1/(2π(T + 1))) (sin((T + 1)θ/2)/sin(θ/2))^2 for θ ∈ (−π, π), and zero otherwise.

(iii) The sequence of functions h_T(θ), T = 1, 2, · · · , is an approximate identity. That is,

lim_{T→∞} ∫_{|θ|>δ} h_T(θ) dθ = 0, for any δ > 0.

(iv) If f is continuous at x, then σ_T(f, x) converges to f(x).

(v) If f is a continuous function then σ_T(f, x) converges to f(x) uniformly in x.

(vi) (Uniqueness). When f, g are 2π-periodic and continuous functions, and if f̂(j) = ĝ(j) for all j, then f(t) = g(t) for all t.


(vii) (Weierstrass approximation theorem). Show that for any continuous 2π-periodic function f, there exists a sequence of trigonometric polynomials that converges to f uniformly.

Proof: The function h_T(θ) is shown in Figure (12.2), and is called the Fejér kernel.


Figure 12.2: Fejer Kernels for T = 5 and T = 8.

(i) The fact that

σ_T(f, x) := (1/(T + 1)) Σ_{n=0}^{T} S_n(f, x) = (1/2π) Σ_{j=−T}^{T} (1 − |j|/(T + 1)) F(f, j) e^{−ijx}

is just the observation that the sum of the uniform probabilities p_T(t) (over {0, 1, · · · , T}) attached to the values greater than or equal to |j| is

Σ_{t=|j|}^{T} 1/(T + 1) = (T + 1 − |j|)/(T + 1) = 1 − |j|/(T + 1).

(ii) One direct way of obtaining the Fejér kernel is to take the Cesàro average of the Dirichlet kernels. Indeed,

h_T(u) = (1/(T + 1)) Σ_{k=0}^{T} D_k(u) = (1/(T + 1)) Σ_{k=0}^{T} sin((k + 0.5)u)/(2π sin(u/2)) = (1/(2π(T + 1))) · (1/sin(u/2)) Σ_{k=0}^{T} sin((k + 0.5)u).

Here we can use some trigonometry (as before). Multiplying by 2 sin(u/2) and using the identity

2 sin(u/2) sin((k + 0.5)u) = cos(ku) − cos((k + 1)u),

we get

2 sin(u/2) Σ_{k=0}^{T} sin((k + 0.5)u) = Σ_{k=0}^{T} [cos(ku) − cos((k + 1)u)] = cos 0 − cos((T + 1)u) = 1 − cos((T + 1)u).

Thus we have

h_T(u) = (1/(2π(T + 1))) · (1/sin(u/2)) · (1 − cos((T + 1)u))/(2 sin(u/2)) = (1/(2π(T + 1))) · (sin((T + 1)u/2))^2/(sin(u/2))^2.

To see how Parseval's relation gives the same result, note that

(1/2π) Σ_{j=−T}^{T} (1 − |j|/(T + 1)) F(f, j) e^{−ijx} = ∫_{−π}^{π} f(u) · (1/2π) Σ_{j=−T}^{T} (1 − |j|/(T + 1)) e^{ij(u−x)} du.

The reader may simplify the sum to get the same result. Next, it is clear that h_T(θ) ≥ 0. Since for f(u) ≡ 1 we have S_n(f, x) ≡ 1 for all x and all n, its average will also be σ_T(f, x) = 1 for all T and all x. Hence, ∫_{−π}^{π} h_T(θ) dθ = 1. This shows that Y_T is a random variable with density h_T(θ).

(iii) The graph of the density shows that it ought to be an approximate identity, since most of the area is concentrated near zero. Indeed, for any δ > 0, note that if |θ| > δ then sin(θ/2) > c_δ > 0 for some constant c_δ. This shows that, as T → ∞,

∫_{|θ|>δ} h_T(θ) dθ ≤ 1/((T + 1) c_δ^2) → 0.

Parts (iv) and (v) now follow from our earlier argument concerning approximations via approximate identities. Part (vi) follows immediately, since the j-th Fourier coefficient of f − g is f̂(j) − ĝ(j) = 0 for all j. Now f − g is a continuous function with all Fourier coefficients equal to zero, so its Cesàro means are identically zero and must converge to it uniformly by part (v). Therefore, f(t) = g(t) for all t. Now part (vii) is obvious once we note that σ_n(f, x) is a trigonometric polynomial. ♠
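Fejér's theorem is easy to watch in action. The NumPy sketch below computes the complex Fourier coefficients of the continuous 2π-periodic function f(x) = |x| on [−π, π] by quadrature and compares the maximum error of the Dirichlet partial sums S_T with that of the Cesàro (Fejér) means σ_T; the test function, grids, and values of T are arbitrary choices.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4001)        # evaluation grid
f = np.abs(x)
u = np.linspace(-np.pi, np.pi, 20_001)      # quadrature grid for the coefficients
fu = np.abs(u)

def fhat(k):
    return np.trapz(fu * np.exp(-1j * k * u), u) / (2 * np.pi)

def S(n):        # Dirichlet partial sum
    return sum(fhat(k) * np.exp(1j * k * x) for k in range(-n, n + 1)).real

def sigma(T):    # Cesaro (Fejer) mean
    return sum((1 - abs(j) / (T + 1)) * fhat(j) * np.exp(1j * j * x)
               for j in range(-T, T + 1)).real

for T in (5, 20, 80):
    print(T, np.max(np.abs(S(T) - f)), np.max(np.abs(sigma(T) - f)))
# both maximum errors shrink here; Fejer's theorem guarantees the uniform convergence
# of sigma_T for every continuous periodic f, while for the raw partial sums it can fail
```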

Exercise - 12.1.1 - (Abel/Poisson theorem) Let f be a 2π-periodic integrable function with partial sums of its Fourier series S_n(f, x) = (1/2π) Σ_{k=−n}^{n} e^{−ikx} F(f, k), where F(f, k) = ∫_{−π}^{π} f(x) e^{ikx} dx, k = 0, ±1, ±2, · · · .

(i) Verify that the Abel/Poisson mean of S_n(f, x), n = 0, 1, 2, · · · , is

A_r(f, x) := (1 − r) Σ_{n=0}^{∞} r^n S_n(f, x) = (1/2π) Σ_{j=−∞}^{∞} r^{|j|} F(f, j) e^{−ijx}.


(ii) Either by Parseval's relation, or directly, verify that

(1/2π) Σ_{j=−∞}^{∞} r^{|j|} F(f, j) e^{−ijx} = ∫_{−π}^{π} f(x + θ) h_r(θ) dθ = E f(x + Z_r),

where Z_r is a random variable with density h_r(θ) := (1 − r^2) / (2π [(1 − r)^2 + 4r sin^2(θ/2)]) for θ ∈ (−π, π), and zero otherwise. The function h_r(θ) is shown in Figure (12.3) and is called the Poisson kernel.


Figure 12.3: Poisson Kernels for r = 0.8 and r = 0.9.

(iii) Verify that h_r(θ), r ∈ (0, 1), is an approximate identity by showing that

lim_{r→1−} ∫_{|θ|>δ} h_r(θ) dθ = 0, for any δ > 0.

(iv) Now conclude that if f is continuous at x, then A_r(f, x) converges to f(x) as r → 1−.

(v) Conclude that if f is a continuous function then A_r(f, x) converges to f(x) uniformly in x as r → 1−.

Exercise - 12.1.2 - (The moral of Fourier series convergence) Let f be a 2π-periodic integrable function with partial sums of its Fourier series S_n(f, x) = (1/2π) Σ_{k=−n}^{n} e^{−ikx} F(f, k), where F(f, k) = ∫_{−π}^{π} f(x) e^{ikx} dx, k = 0, ±1, ±2, · · · . Explain the general steps behind the convergence of summability assisted inversion.

Remark - 12.1.2 - (Is the Dirichlet inversion scheme the only game in town?) There are many functions which do not fit into the varieties that we have studied so far. For instance,

f(t) = Σ_{k=−∞}^{∞} f_k e^{itλ_k},   t ∈ R.

Assume that Σ_k |f_k| < ∞. These functions are not periodic when the λ_k are not integers, and such series are called Dirichlet series. Note that we cannot talk about the Fourier transform of such f's, since the defining integral would involve meaningless terms of the type ∫_{−∞}^{∞} e^{itθ} e^{itλ_k} dt. Such functions commonly arise in probability theory, as the characteristic functions of discrete (but non-lattice) random variables. Here a probabilist faces the following question: "if I give you the function f(t) and tell you that it is some Dirichlet series, can you find its f_k and the corresponding λ_k?" The answer is yes, and is the subject matter of the following exercise. This point of view is useful for the spectral representation theory of second order stationary processes.

Exercise - 12.1.3 - (Inversion for nonintegrable functions) Let f be integrable over every interval of the type [−T, T] and represent an absolutely convergent Dirichlet series Σ_k f_k e^{iλ_k t}.

• (i) Show that if x is a point which does not equal any λ_k of the function f, then

L_x(T) := (1/2T) ∫_{−T}^{T} e^{−itx} f(t) dt = Σ_{j=−∞}^{∞} f_j sin(T(λ_j − x))/(T(λ_j − x)).

• (ii) Show that if x = λ_k, then

L_x(T) = (1/2T) ∫_{−T}^{T} e^{−itx} f(t) dt = Σ_{j: λ_j ≠ λ_k} f_j sin(T(λ_j − x))/(T(λ_j − x)) + f_k.

• (iii) Letting T get large in parts (i) and (ii), deduce that

lim_{T→∞} (1/2T) ∫_{−T}^{T} e^{−itx} f(t) dt = f_k if x = λ_k for some k = 0, ±1, ±2, · · · , and 0 otherwise.

(Hence, we may recover the coefficients and the exponents of the Dirichlet series by knowing the function f. Note that this "inversion" scheme uses 2T instead of 2π, compared to the Fourier coefficients.)
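The averages in this exercise are easy to evaluate numerically. The NumPy sketch below builds a small Dirichlet series with non-integer exponents and checks that (1/2T) ∫_{−T}^{T} e^{−ixt} f(t) dt approaches the coefficient f_k when x = λ_k and tends to 0 otherwise; the exponents, coefficients, values of T, and the grid are arbitrary choices.

```python
import numpy as np

lams = np.array([-np.sqrt(2), 0.0, np.pi / 3])   # non-integer exponents lambda_k
coefs = np.array([0.5, -1.0, 2.0])               # coefficients f_k

def f(t):
    return (coefs[None, :] * np.exp(1j * np.outer(t, lams))).sum(axis=1)

def L(x, T, n=400_001):
    t = np.linspace(-T, T, n)
    return np.trapz(np.exp(-1j * x * t) * f(t), t) / (2 * T)

for T in (50, 500, 5000):
    print(T, [np.round(L(lam, T), 3) for lam in lams], np.round(L(1.0, T), 3))
# at x = lambda_k the averages approach f_k; at x = 1 (not an exponent) they tend to 0
```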


Lecture 13

Basic Limit Theorems

Much of modern probability theory is based on various types of limit theorems. Here we look at three or four broad categories.

Recall that a random variable is a function on a sample space, Ω. The heart of all limit theorems lies in specifying how we decide that two random variables, U, V, are "close".

1. We could say U, V are "close" if |U(ω) − V(ω)| is small for each ω ∈ Ω. This approach is called pointwise comparison. By the way, this assumes that the random variables are defined for the same experiment. Sometimes we add a disclaimer that the closeness may fail for a few ω's, as long as these "bad" ω's form a set (event) whose probability is zero. This version of pointwise comparison is called almost sure comparison.

2. We could say U, V are "close" if on the average (in the sense of expectations) their difference, |U − V|, is small. (Once again this approach would be meaningless if U, V are defined over two different sample spaces.) Here we could consider several sub-varieties. For instance:

• Pick a p ≥ 1, and we may compute the L_p distance

(E|U − V|^p)^{1/p};   (0.1)

• We may compute the so-called L_0 distance,

E( |U − V| / (1 + |U − V|) ),   (0.2)

and see if this is small. This unusual looking expectation approach can be studied via probabilities. The expectation in (0.2) is finite for all random variables, whereas the L_p distance may not be finite for some random variables.

3. All of the above comparisons are stringent in the sense that they lock us into requiring that both U and V must be defined for the same random experiment. It is quite possible that U could be coming from one random experiment while V could be coming from a totally different random experiment.


For instance, U could be a binomial random variable and V could be a Poisson random variable, or a normal random variable. How should we compare U, V in such a case? Well, the natural thing to do is to see if their respective cdf's (and/or densities) are similar. This is called the distributional comparison and it compares

F_U(x) = P(U ≤ x) with F_V(x) = P(V ≤ x), for each x ∈ R.   (0.3)

The weak laws of large numbers use (0.2) to measure closeness of two random variables. The strong laws of large numbers use the above mentioned almost sure sense, and the central limit theorem uses (0.3) for comparison.

The above three comparison methods are distinctly different and lead to different types of limit theorems. There are, however, some general links between them that we will provide in this lecture.

13.1 Convergence in Distribution

Definition - 13.1.1 - (Convergence in distribution) We say that a sequence X_1, X_2, · · · of random variables converges in distribution to a random variable X if

F_n(x) := P(X_n ≤ x) → P(X ≤ x) =: F(x),

for all real numbers x at which the distribution F(x) is continuous. We denote this type of convergence by X_n dist→ X, or F_n dist→ F.

It is natural to ask: is the limiting distribution unique? To see that it is, suppose that G is another possible such limit. Now

|F(x) − G(x)| ≤ |F(x) − F_n(x)| + |F_n(x) − G(x)|

goes to zero when x is a point of continuity of both F and G. For other points, we use the right continuity of both F, G to get that F(x) = G(x) for all x.

Proposition - 13.1.1 - (Equivalent form for dist→) Let X, X_1, X_2, · · · be a sequence of random variables. The following statements are equivalent.

• (i) X_n dist→ X.

• (ii) E(f(X_n)) → E(f(X)) for every bounded continuous function f over R.

Proof: Let F, F_n be the distributions of X, X_n respectively. Assume (i) holds and let f be a nonzero bounded continuous function over R. Denote its bound by B = sup_x |f(x)| > 0. Now cut the tails off of the distribution of X. That is, for any ǫ > 0, find continuity points ±c of F so that P(|X| > c) ≤ ǫ/B. Since F_n(±c) → F(±c), there exists an N such that for all n ≥ N we have P(|X_n| > c) ≤ 2ǫ/B. Next, for this ǫ, over the interval [−c, c] approximate the continuous function f by a step function h so that h(t) = Σ_{i=1}^{m} a_i χ_{(c_{i−1}, c_i]}(t), where −c = c_0 < c_1 < · · · < c_m = c


and all these c_i are points of continuity of F, and sup_{t∈[−c,c]} |f(t) − h(t)| < ǫ. Extend h(t) = 0 for t ∉ [−c, c]. Note that

E(h(X_n)) = Σ_{i=1}^{m} a_i [F_n(c_i) − F_n(c_{i−1})] → Σ_{i=1}^{m} a_i [F(c_i) − F(c_{i−1})] = E(h(X)).

Furthermore, for all n ≥ N, we also have

|E(f(X_n)) − E(f(X))|
 ≤ |E(f(X_n) χ_{|X_n|≤c}) − E(f(X) χ_{|X|≤c})| + E|f(X_n) χ_{|X_n|>c}| + E|f(X) χ_{|X|>c}|
 ≤ |E(f(X_n) χ_{|X_n|≤c}) − E(f(X) χ_{|X|≤c})| + B P(|X_n| > c) + B P(|X| > c)
 ≤ |E(f(X_n) χ_{|X_n|≤c}) − E(f(X) χ_{|X|≤c})| + 3ǫ
 ≤ |E(f(X_n) χ_{|X_n|≤c}) − E(h(X_n))| + |E(h(X_n)) − E(h(X))| + |E(h(X)) − E(f(X) χ_{|X|≤c})| + 3ǫ
 ≤ |E(h(X_n)) − E(h(X))| + 5ǫ → 5ǫ.

Since ǫ is arbitrary, the left side must go to zero.

The converse is easy. Let x, x + ǫ, x − ǫ be points of continuity of F. Consider the bounded continuous function f(t) sketched below.

the bounded continuous functions f(t) shown below.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

6

-

1

x + ǫx0

By part (ii) we see that

F(x + ǫ) ≥ E(f(X)) = lim_n E(f(X_n)) ≥ lim sup_n P(X_n ≤ x).

Letting ǫ drop to zero over those x + ǫ which are points of continuity of F gives that lim sup_n F_n(x) ≤ F(x). For the other side, consider the shifted version, h(t) = f(t + ǫ), which is also bounded and continuous. This gives

F(x − ǫ) ≤ E(h(X)) = lim_n E(h(X_n)) ≤ lim inf_n P(X_n ≤ x).

Letting ǫ drop to zero over those x − ǫ which are points of continuity of F gives that lim inf_n F_n(x) ≥ F(x). ♠

HW35 Exercise - 13.1.1 - If X_n dist→ X then show that h(X_n) dist→ h(X) for any continuous function h over R.


Exercise - 13.1.2 - Show that X, Y have the same distribution if and only if E(h(X)) = E(h(Y)) for all bounded real-valued continuous functions h.

Remark - 13.1.1 - (The continuity theorem and the Cramér–Wold device) Let X, X_1, X_2, · · · be a sequence of random variables with respective characteristic functions φ(t), φ_1(t), φ_2(t), · · · . If X_n dist→ X then Proposition (13.1.1) gives that φ_n(t) → φ(t) for every t ∈ R. This is because cos(tx), sin(tx) are bounded continuous functions of x for each fixed t ∈ R. The converse holds as well, namely if φ_n(t) → φ(t) for every t ∈ R then X_n dist→ X. This is known as the continuity theorem. We will prove this later after studying Fourier–Stieltjes transforms.

A d-dimensional version of convergence in distribution is defined analogously and the corresponding analog of Proposition (13.1.1) holds as well. Furthermore, the continuity theorem also holds. More precisely, if X, X_1, X_2, · · · is a sequence of d-dimensional random vectors then X_n dist→ X if and only if t′X_n dist→ t′X for all vectors t′ = [t_1, t_2, · · · , t_d] consisting of real numbers. This result is known as the Cramér–Wold device.

Example - 13.1.1 - (Limiting distributions of extreme order statistics) It turns out that there are essentially three types of limiting distributions when one tries to find the limiting distribution of

Z_n := (X_{(n)} − b_n)/a_n,   X_{(n)} = max{X_1, · · · , X_n},

where a_n > 0 and b_n are constants chosen so that Z_n dist→ G for some distribution G. The following three cases give the three varieties of G.

Case 1: (Gumbel's extreme value distribution) Let X_1, X_2, · · · iid∼ Exp(1). Now P(X_{(n)} ≤ t) = (1 − e^{−t})^n for any t > 0. Therefore, for any a_n > 0 and b_n, we see that

P(Z_n ≤ t) = P(X_{(n)} ≤ b_n + t a_n) = (1 − e^{−(b_n + t a_n)})^n,   t > −b_n/a_n.

When we take a_n = 1 and b_n = ln n, we see that a limiting distribution exists and equals

lim_{n→∞} P(Z_n ≤ t) = lim_{n→∞} (1 − e^{−t} e^{−ln n})^n = exp{−e^{−t}} =: G(t),   t ∈ R.

This G is called Gumbel's extreme value distribution.

Case 2: (Fréchet's extreme value distribution) Let X_1, X_2, · · · iid∼ Pareto(a, 1), that is, f(x) = a/x^{a+1} for x > 1. Now P(X_{(n)} ≤ t) = (1 − 1/t^a)^n for any t > 1. Therefore, for any a_n > 0 and b_n, we see that

P(Z_n ≤ t) = P(X_{(n)} ≤ b_n + t a_n) = (1 − 1/(b_n + t a_n)^a)^n,   t > (1 − b_n)/a_n.


So, if we take b_n = 0 and a_n = n^{1/a} then

lim_{n→∞} P(Z_n ≤ t) = lim_{n→∞} (1 − (1/t)^a/n)^n = exp{−(1/t)^a} =: G(t),   t ∈ (0, ∞).

This G is called Fréchet's extreme value distribution, for any fixed constant a > 0.

Case 3: (Weibull's extreme value distribution) Let X_1, X_2, · · · iid∼ Uniform(0, 1). Now P(X_{(n)} ≤ t) = t^n for any t ∈ (0, 1). Therefore, for any a_n > 0 and b_n, we see that

P(Z_n ≤ t) = P(X_{(n)} ≤ b_n + t a_n) = (b_n + t a_n)^n,   −b_n/a_n < t < (1 − b_n)/a_n.

If we try b_n = 1 and a_n = 1/n, we get

lim_{n→∞} P(Z_n ≤ t) = lim_{n→∞} (1 + t/n)^n = e^t = e^{−(−t)},   t ∈ (−∞, 0).

This is a special case of G(t) = exp{−(−t)^α} for t < 0 and G(t) = 1 for t ≥ 0, for a positive constant α, known as Weibull's extreme value distribution.
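Case 1 is easy to see by simulation. In the NumPy sketch below the maximum of n Exp(1) variables is sampled directly through its exact distribution function (1 − e^{−t})^n, centered by ln n, and its empirical distribution is compared with the Gumbel limit exp(−e^{−t}); the values of n, the number of replications, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 10_000, 200_000

U = rng.random(reps)
X_max = -np.log(1.0 - U ** (1.0 / n))     # sample of X_(n) via inverse of (1 - e^{-t})^n
Z = X_max - np.log(n)                     # centered maximum Z_n

for t in (-1.0, 0.0, 1.0, 2.0):
    print(t, (Z <= t).mean(), np.exp(-np.exp(-t)))   # empirical P(Z_n <= t) vs Gumbel G(t)
```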

13.2 Convergence in Probability & WLLN

Let us start off by giving an official name to the unusual looking expectation sense, (0.2), of distance.

Definition - 13.2.1 - (Convergence in probability) A sequence of random variables Y_n, n = 1, 2, · · · (all defined over the same probability space) is said to converge in probability to a random variable Y, denoted by Y_n prob→ Y, if

lim_{n→∞} E( |Y_n − Y| / (1 + |Y_n − Y|) ) = 0.

The reason the above form of convergence is called convergence in probability is that it can be performed via probabilities, instead of the above types of expectations.

Proposition - 13.2.1 - (Equivalent form for convergence in probability) Let Y, Y_n, n = 1, 2, · · · , be random variables all defined over the same probability space (Ω, E, P). Then the following statements are equivalent.

1. (i) Y_n converge to Y in probability, in the sense of Definition (13.2.1).

2. (ii) For every ε > 0, we have

lim_{n→∞} P(|Y_n − Y| ≥ ε) = 0.

Uniqueness: If Y_n prob→ Y and Y_n prob→ Z then P(Y = Z) = 1.


Proof: Since x/(1 + x) is a continuous strictly increasing function of x > 0, we have

|Y_n − Y| ≥ ε if and only if |Y_n − Y|/(1 + |Y_n − Y|) ≥ ε/(1 + ε).

Therefore, Markov's inequality shows that if (i) holds then

P(|Y_n − Y| ≥ ε) = P( |Y_n − Y|/(1 + |Y_n − Y|) ≥ ε/(1 + ε) ) ≤ ((1 + ε)/ε) E( |Y_n − Y|/(1 + |Y_n − Y|) ) → 0.

This gives (ii). For the converse, a "Hungarian trick" gives

E( |Y_n − Y|/(1 + |Y_n − Y|) ) = E( [|Y_n − Y|/(1 + |Y_n − Y|)] χ_{|Y_n−Y|≥ε} ) + E( [|Y_n − Y|/(1 + |Y_n − Y|)] χ_{|Y_n−Y|<ε} )
 ≤ E( χ_{|Y_n−Y|≥ε} ) + ε/(1 + ε) = P(|Y_n − Y| ≥ ε) + ε/(1 + ε).

When (ii) holds, the right side goes to ε/(1 + ε), which can be made arbitrarily close to zero since ε > 0 is arbitrary. For the uniqueness of the limit,

P(|Y − Z| ≥ ε) ≤ P(|Y_n − Y| + |Y_n − Z| ≥ ε) ≤ P(|Y_n − Y| ≥ ε/2) + P(|Y_n − Z| ≥ ε/2),

which goes to zero. As ε ↓ 0, by the continuity property P(|Y − Z| > 0) = 0. ♠

Remark - 13.2.1 - (Lp→ implies prob→) The last proposition says that convergence in probability can be proved by showing that either of the two quantities

P(|Y_n − Y| ≥ ε), or E( |Y_n − Y| / (1 + |Y_n − Y|) ),

gets small as n gets large. Unfortunately, neither of these two expressions is easy to compute exactly. Instead, often E(Y_n − Y)^2 is not hard to compute when it is finite. Via Chebyshev's inequality,

P(|Y_n − Y| ≥ ε) ≤ E|Y_n − Y|^p / ε^p, for any p > 0.

When E|Y_n − Y|^p → 0 we say that Y_n converge to Y in L_p, denoted Y_n Lp→ Y. The above observation shows that L_p convergence implies convergence in probability. So, if Y, Z are two potential limits of L_p convergence, then by the uniqueness of the limit obtained by convergence in probability, it must be that P(Y = Z) = 1 as well. By the way, convergence in probability does not imply convergence in L_p. We will address the converse after introducing the concept of uniform integrability.

Proposition - 13.2.2 - (prob→ implies dist→) Let Y, Y_n, n = 1, 2, · · · , be random variables all defined over the same probability space (Ω, E, P). Then the following results hold.


• (a) Y_n prob→ Y implies Y_n dist→ Y.

• (b) Y_n dist→ Y and P(Y = c) = 1 for some constant c imply Y_n prob→ Y.

Proof: For any ǫ > 0, the fact P(|Y_n − Y| ≥ ǫ) → 0 says that there exists a positive integer N such that for all n ≥ N we have

P(|Y_n − Y| ≥ ǫ) ≤ ǫ.

If F, F_n are the distributions of Y and Y_n respectively then another Hungarian trick gives

F(x − ǫ) = P(Y ≤ x − ǫ) = P(Y ≤ x − ǫ, |Y_n − Y| ≥ ǫ) + P(Y ≤ x − ǫ, |Y_n − Y| < ǫ)
 ≤ P(|Y_n − Y| ≥ ǫ) + P(Y_n ≤ x)
 ≤ ǫ + F_n(x), for all n ≥ N.

Starting with F_n on the left side gives an analogous inequality. So,

F(x − ǫ) ≤ ǫ + F_n(x),   F_n(x − ǫ) ≤ ǫ + F(x), for all n ≥ N.

Replacing x by x + ǫ in the second inequality, we get

F(x − ǫ) ≤ ǫ + F_n(x) ≤ 2ǫ + F(x + ǫ).

Hence, we see that

F(x − ǫ) ≤ ǫ + lim inf_n F_n(x) ≤ ǫ + lim sup_n F_n(x) ≤ 2ǫ + F(x + ǫ).

When x is a point of continuity of F, letting ǫ drop to zero gives that F_n(x) → F(x). This proves part (a). To prove part (b), the reader may verify that for any ǫ > 0 we have

P(|Y_n − Y| ≥ ǫ) ≤ 1 − P(Y_n ≤ c + ǫ) + P(Y_n ≤ c − 0.5ǫ).

The right side goes to zero. ♠

Remark - 13.2.2 - (Bernoulli's, Chebyshev's and Khintchin's WLLN) Bernoulli was the first one to notice that if X_1, X_2, · · · forms a sequence of iid fair coin toss random variables (i.e., P(X_i = 1) = 1 − P(X_i = 0) = 1/2) then

Y_n := (X_1 + X_2 + · · · + X_n)/n prob→ E(X_1) = 1/2.

This is called Bernoulli's weak law of large numbers. Chebyshev extended Bernoulli's WLLN by noting that there was nothing special about fair coin tosses in his proof. One could have used any sequence of independent and identically distributed random variables X_1, X_2, · · · , as long as they had finite variance. The proof takes one line, where we take P(Y = µ) = 1. For any ε > 0, by Chebyshev's inequality,

P(|Y_n − Y| ≥ ε) = P(|Y_n − µ| ≥ ε) ≤ E(Y_n − µ)^2/ε^2 = Var(Y_n)/ε^2 = σ^2/(nε^2) → 0,


as n gets large. In fact, Chebyshev invented his inequality for this purpose. By the way, here we have proved a bit more by showing that the convergence occurs in the L_2 sense.

Not too long after Chebyshev, Khintchin improved Chebyshev's version of the WLLN substantially in two directions. He showed that (1/n) Σ_{i=1}^{n} X_i prob→ E(X_1) holds under only the following two conditions:

• (i) The sequence X_1, X_2, · · · consists of only pairwise independent random variables.

• (ii) Each X_i has the same distribution, with finite mean.

That is, now Var(X_i) need not be finite, nor need the X_i be mutually independent anymore. We will postpone its proof for now. A slightly more "expensive" version is within our reach, provided we do not mind taking the continuity theorem of Remark (13.1.1) for granted. At this moment we will also take for granted that if E|X_1| < ∞ then the characteristic function of X_1 is differentiable.

Proposition - 13.2.3 - (A WLLN) Let X_1, X_2, · · · be a sequence of independent and identically distributed random variables with finite mean E(X_1) = µ. Then the sample mean X̄_n := (1/n) Σ_{i=1}^{n} X_i converges to µ in probability.

Proof: By part (b) of Proposition (13.2.2) we need only show that the sample mean X̄_n converges in distribution to a constant random variable Y, namely P(Y = µ) = 1. Since the characteristic function of this Y is E(e^{itY}) = e^{itµ}, by the continuity theorem we need only verify that E(e^{itX̄_n}) converges to e^{itµ} for all real t. For this purpose, note that if φ(t) = E(e^{itX_1}) then the characteristic function of X̄_n is (φ(t/n))^n. When E|X_1| < ∞ the characteristic function is differentiable, giving φ′(0) = iµ. Thus, L'Hopital's rule gives

lim_{n→∞} ln(φ(t/n)^n) = lim_{n→∞} ln(φ(t/n)) / (1/n),   (a 0/0 form),
 = lim_{n→∞} [φ′(t/n)(−t/n^2)] / [φ(t/n)(−1/n^2)] = t φ′(0)/φ(0) = itµ.

Therefore, lim_n E(e^{itX̄_n}) = e^{itµ} = E(e^{itY}). ♠
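Khintchin's point that only a finite mean is needed can be illustrated by simulation. In the NumPy sketch below the X_i are Pareto(a, 1) with a = 1.5, so the mean a/(a − 1) = 3 is finite but the variance is infinite; the sample mean nevertheless concentrates around 3. The threshold 0.25, the sample sizes, the number of replications, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
a, mu = 1.5, 3.0

def sample_means(n, reps=1000):
    U = rng.random((reps, n))
    X = U ** (-1.0 / a)                 # inverse-CDF sampling from the Pareto(a, 1) law
    return X.mean(axis=1)

for n in (100, 1000, 10_000):
    means = sample_means(n)
    print(n, np.mean(np.abs(means - mu) > 0.25))   # P(|X_bar_n - mu| > 0.25) shrinks with n
```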

Exercise - 13.2.1 - (Slutsky's theorem) Let X_n, X, Y_n be defined over the same probability space.

• (i) Let Y_n p→ 0 and let X_n dist→ X. Then show that X_n + Y_n dist→ X and X_n Y_n dist→ 0.

• (ii) If Y_n p→ c, where c is any real number, and X_n dist→ X, then show that X_n + Y_n dist→ X + c and X_n Y_n dist→ cX.

[Hints: For the first part of item (i) verify

P(X_n ≤ x − ǫ) ≤ P(|Y_n| > ǫ) + P(X_n + Y_n ≤ x), and


P (Xn + Yn ≤ x) ≤ P (|Yn| > ǫ) + P (Xn ≤ x + ǫ).

Prove the second part of item (i) by verifying

P (|XnYn| > ǫ) ≤ P (|Yn| > δ) + P (|Xn| > ǫ/δ).

For part (ii) consider Zn := Yn − c.]

Exercise - 13.2.2 - Let X_1, X_2, · · · be a sequence of independent random variables from N(µ, σ^2). Prove that the sample variance

(1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄_n)^2

converges to σ^2 in probability. [Hint: You may use Khintchin's WLLN.]

Exercise - 13.2.3 - Do the above exercise when the X_i form a random sample from some distribution having finite variance σ^2.

Exercise - 13.2.4 - If Z_n converge to Z in distribution, where Z ∼ F, and if x_n → x where x is a point of continuity of F, then

lim_{n→∞} P(Z_n ≤ x_n) = F(x).

Lemma - 13.2.1 - (Taylor expansion) Suppose f is well defined in [a, b]. Fix a point c ∈ (a, b). Suppose f is continuous in a neighborhood around c and differentiable at c. (Note that we are not asking that f be differentiable in a neighborhood around c.) Then we have

f(c + x) = f(c) + x f′(c) − x δ(x),

for a function δ with δ(x) → 0 as x → 0.

Proof: Define a new function g by

g(x) = f(c + x) − f(c) − x[f′(c) − ǫ],   −(c − a) ≤ x ≤ (b − c),

where ǫ > 0 is a number. Note that g(0) = 0. Since g may not be differentiable in a neighborhood of 0, we will compare the slopes of the secant lines. The differentiability of f at c gives that

lim_{x→0} [f(c + x) − f(c)]/x = f′(c).

Therefore, for any ǫ > 0 and all sufficiently small x, x ∈ (0, ǫ),

[f(c + x) − f(c)]/x > f′(c) − ǫ   and   [f(c + x) − f(c)]/x < f′(c) + ǫ.

After multiplying by x on both sides and combining these two inequalities we get

x[f′(c) − ǫ] < f(c + x) − f(c) < x[f′(c) + ǫ]

for any ǫ > 0 and all sufficiently small x ∈ (0, ǫ). In particular, we can always find a δ(x) such that

f(c + x) − f(c) = x f′(c) − x δ(x).

The above two inequalities imply that, for all sufficiently small x ∈ (0, ǫ), it must be that 0 ≤ |δ(x)| < ǫ. A similar argument gives the same conclusion for small negative values of x. Now, since ǫ > 0 was arbitrary, this forces δ(x) → 0 as x → 0. That is,

f(c + x) = f(c) + x f′(c) − x δ(x),

where δ(x) → 0 as x → 0. ♠

Exercise - 13.2.5 - (Delta method) Let k_n be a sequence of positive numbers diverging to infinity. Let T_n be a sequence of random variables so that for some constant µ we have k_n(T_n − µ) dist→ Z.

• (i) For any function f which is differentiable at µ with f′(µ) ≠ 0, show that

k_n(f(T_n) − f(µ)) dist→ f′(µ) Z.

• (ii) For any function f which is twice differentiable at µ with f″(µ) ≠ 0 and f′(µ) = 0, show that

k_n^2 (f(T_n) − f(µ)) dist→ (f″(µ)/2) Z^2.
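Part (i) of the delta method can be checked by simulation. In the NumPy sketch below, T_n is the mean of n iid Exp(1) variables (sampled through the Gamma(n, 1) distribution of the sum), so k_n = √n, µ = 1 and Z ∼ N(0, 1); for f(x) = x^2 one has f′(1) = 2, so √n (T_n^2 − 1) should be approximately N(0, 4). The values of n, the number of replications, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 200_000

for n in (20, 200, 2000):
    Tn = rng.gamma(n, size=reps) / n          # mean of n iid Exp(1) ~ Gamma(n, 1)/n
    W = np.sqrt(n) * (Tn**2 - 1.0)            # delta-method quantity
    print(n, W.std(), (W <= 2.0).mean())      # std -> 2, P(W <= 2) -> Phi(1) ~ 0.841
```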


Lecture 14

Almost Sure Convergence & SLLN

Definition - 14.0.2 - On a probability space a property holds almost surely if there exists an event A with P(A) = 1 and the property holds for each ω ∈ A. (We do not care whether the property holds or fails on A^c.)

What this phrase means is that those ω ∈ Ω for which the property may not hold (or which we are unable, or do not want, to check for some reason) form a "bad" set that has zero probability and that we intend to ignore. In the above definition that ignorable "bad" set is A^c.

Example - 14.0.1 - Let X ∼ N(0, 1). If we define Y = 1/(X − 3) then the new random variable Y is not completely well defined, since X can take the value 3 and we will have the unpleasant situation "1/0". However, P(X = 3) = 0. Therefore, Y is well defined for each ω in the set A = {ω : X(ω) ≠ 3} and P(A) = 1. Hence, even though Y is not a well defined function mathematically, Y is a well defined function probabilistically, in the almost sure sense. Probabilists ignore events of zero probability, even countably infinitely many of them, since

P(∪_i B_i) ≤ Σ_i P(B_i),

and a sum of infinitely many zeros is still zero.

Definition - 14.0.3 - (Almost sure convergence) We say that a sequence of random variables Y_n converges almost surely to another random variable Y, denoted by Y_n a.s.→ Y, if there exists a set A with P(A) = 1 and lim_{n→∞} Y_n(ω) = Y(ω) for each ω ∈ A. That is, for each fixed ω ∈ A, we have the ordinary type of convergence: for every ε > 0, there exists a positive integer N(ω, ε) such that

|Y_n(ω) − Y(ω)| < ε, for all n ≥ N(ω, ε).


The first order of business is to know where this kind of convergence fits into the grand scheme of things in relation to the other forms of convergence that we have defined so far.

Proposition - 14.0.4 - (Equivalent forms for a.s.→) Let Y_1, Y_2, · · · be a sequence of random variables defined on a probability space (Ω, E, P) and let A_n(ǫ) = {ω ∈ Ω : |Y_n(ω)| ≥ ǫ}. The following statements are equivalent.

• (i) Y_n converge to 0 almost surely.

• (ii) For every ǫ > 0 we have P(lim sup_n A_n(ǫ)) = 0.

• (iii) For every ǫ > 0 we have lim_{n→∞} P(∪_{k≥n} A_k(ǫ)) = 0.

• (iv) sup_{k≥n} |Y_k| prob→ 0.

(Uniqueness): If Y_n a.s.→ Y and Y_n a.s.→ Z then P(Y = Z) = 1.

Proof: The main thing to notice is that almost sure convergence of Y_n to zero can be symbolically stated as

P({ω ∈ Ω : ∀ǫ > 0 ∃N ∈ N so that ∀n ≥ N, |Y_n(ω)| < ǫ}) = 1.

We may restate this in terms of set notation as

P(∩_{ǫ>0} ∪_{N=1}^{∞} ∩_{n≥N} {ω ∈ Ω : |Y_n(ω)| < ǫ}) = 1.

Note that ǫ could be taken over the positive rationals. Since the probability is one for the intersection over all ǫ > 0, and it cannot get any higher, it must be that for each individual ǫ > 0 we have

P(∪_{N=1}^{∞} ∩_{n≥N} {ω ∈ Ω : |Y_n(ω)| < ǫ}) = 1.

This probability is that of lim inf_n A_n(ǫ)^c, which is the same as (lim sup_n A_n(ǫ))^c. So, parts (i) and (ii) are equivalent. The continuity property of P shows that parts (ii) and (iii) are equivalent. Now part (iv) says that for every ǫ > 0 we have

0 = lim_{n→∞} P( sup_{k≥n} |Y_k| ≥ ǫ ) ≥ lim_{n→∞} P(∪_{k≥n} {|Y_k| ≥ ǫ}),

giving part (iii). The converse holds since {sup_{k≥n} |Y_k| ≥ 2ǫ} ⊆ ∪_{k≥n} {|Y_k| ≥ ǫ}. The uniqueness of the limit follows trivially. ♠

Corollary - 14.0.1 - For the notation of Proposition (14.0.4) we have

• (a) (a.s.→ implies prob→) If Y_n a.s.→ 0 then Y_n prob→ 0.

• (b) (Sufficient condition for a.s.→) If Σ_{k=1}^{∞} P(A_k(ǫ)) < ∞ for every ǫ > 0 (called complete convergence) then Y_n a.s.→ 0.


• (c) (Necessary condition fora.s.→ under independence) If Yn is a se-

quence of independent random variables and Yna.s.→ 0 then for every ǫ > 0,∑∞

k=1 P(Ak(ǫ)) <∞.

• (d) (a.s.→ versus

L1

→) If Yna.s.→ 0 and if supn E|Yn|p <∞ for some p > 1 then

YnL1

→ 0.

• (e) (prob→ implies

a.s.→ subsequentially) If Ynprob→ 0 then there exists a

increasing subsequence of positive integers, k1, k2, · · · so that Ykn

a.s.→ 0.

• (f) (Subsequential characterization ofprob→ ) Yn

prob→ 0 if and only if forevery subsequence Yn(k) there exists a further subsequence Yn(kj) that con-verges to 0 almost surely.

Proof: Part (a) follows from part (iv) of Proposition (14.0.4) and the fact that $0 \le |Y_n| \le \sup_{k\ge n} |Y_k|$. For part (b), when $\sum_{k=1}^\infty P(A_k(\epsilon)) < \infty$ then automatically the tail of this convergent series must go to zero, i.e., $\lim_{n\to\infty} P(\cup_{k\ge n} A_k(\epsilon)) = 0$. So by part (iii) of Proposition (14.0.4), part (b) holds. For part (c) note that the sequence $A_k(\epsilon)$ is given to be independent and, by part (ii) of Proposition (14.0.4), $P(\limsup_n A_n(\epsilon)) = 0$. Contrapositivity of the second Borel-Cantelli lemma gives that $\sum_{k=1}^\infty P(A_k(\epsilon)) < \infty$. For part (d) Hölder's inequality gives that
$$\begin{aligned}
E|Y_n| &= \int_\Omega |Y_n|\, \chi_{\{|Y_n|\ge\epsilon\}}\, dP + \int_\Omega |Y_n|\, \chi_{\{|Y_n|<\epsilon\}}\, dP \\
&\le \int_\Omega |Y_n|\, \chi_{\{|Y_n|\ge\epsilon\}}\, dP + \epsilon \\
&\le (E|Y_n|^p)^{1/p}\, (P(|Y_n|\ge\epsilon))^{1/q} + \epsilon, \qquad \tfrac{1}{p}+\tfrac{1}{q}=1, \\
&\le \left(\sup_n E|Y_n|^p\right)^{1/p} (P(|Y_n|\ge\epsilon))^{1/q} + \epsilon \;\to\; \epsilon \quad \text{as } n \to \infty.
\end{aligned}$$
Since $\epsilon > 0$ is arbitrary, $Y_n \xrightarrow{L_1} 0$. For part (e), consider the events
$$A_n(\epsilon_n) = \{|Y_n| \ge \epsilon_n\}, \quad \text{for } \epsilon_n = \frac{1}{n}.$$
Since $Y_n$ converges to zero in probability, $P(A_n(\epsilon_1)) \to 0$, giving a $k_1$ so that $P(A_n(\epsilon_1)) < \epsilon_1^2$ for all $n \ge k_1$. Similarly, $P(A_n(\epsilon_2)) \to 0$. From this pick a $k_2 > k_1$ so that $P(A_n(\epsilon_2)) < \epsilon_2^2$ for all $n \ge k_2$. Continuing this way we obtain a subsequence $k_1 < k_2 < \cdots$, and we see that
$$P\left(|Y_{k_n}| \ge \frac{1}{n}\right) < \frac{1}{n^2}, \quad \text{for all } n = 1, 2, \cdots.$$
Therefore $\sum_n P(|Y_{k_n}| \ge \frac{1}{n}) < \infty$. So $P(\limsup_n \{|Y_{k_n}| \ge \frac{1}{n}\}) = 0$ by the first Borel-Cantelli lemma. Note that if $\omega \notin \limsup_n \{|Y_{k_n}| \ge \frac{1}{n}\}$ then only for finitely many $n$ do we have $|Y_{k_n}(\omega)| \ge \frac{1}{n}$. That is, $|Y_{k_n}(\omega)| < \frac{1}{n}$ for all but finitely many $n$. That is, $Y_{k_n}(\omega)$ must be converging to zero. Finally, to prove part (f), since convergence in probability is ordinary convergence of $E\left(\frac{|Y_n|}{1+|Y_n|}\right)$ to zero, if $Y_n$ converges to 0 in probability then every subsequence $Y_{n(k)}$ will also converge to 0 in probability. For it, part (e) gives a further subsequence that converges to zero almost surely. Now to prove the converse, we assume that for every subsequence $Y_{n(k)}$ there exists a further subsequence $Y_{n(k_j)}$ that converges to 0 almost surely. Now if $Y_n$ did not converge to 0 in probability, then $E\left(\frac{|Y_n|}{1+|Y_n|}\right) \not\to 0$. Hence, for some $\epsilon > 0$ there exists an infinite subsequence $Y_{n(k)}$ such that $E\left(\frac{|Y_{n(k)}|}{1+|Y_{n(k)}|}\right) > \epsilon$ for all $k$. No subsequence of $Y_{n(k)}$ can therefore converge to 0 in probability, but this contradicts the fact that it has a subsequence which converges to 0 almost surely. This contradiction proves the corollary. ♠

In the above discussion the limiting random variable is specified. In the absence of the limiting random variable, we can proceed to characterize almost sure convergence as follows. This point of view will be needed while discussing random series.

Exercise - 14.0.6 - (Convergence without specifying the limit) Let $S_n$, $n = 1, 2, \cdots$ be a sequence of measurable functions on some probability space $(S, \Sigma, P)$. Let $C$ be the set of all $s \in S$ so that $S_n(s)$ converges as $n$ gets large. Verify that
$$C = \bigcap_{k=1}^\infty \bigcup_{n=1}^\infty \bigcap_{m=n+1}^\infty \left\{ s : \max_{n<j<i\le m} |S_j(s) - S_i(s)| \le \frac{1}{k} \right\}.$$
Then prove that
$$P(C) = \lim_{k\to\infty} \lim_{n\to\infty} \lim_{m\to\infty} P\left( \max_{n<j<i\le m} |S_j - S_i| \le \frac{1}{k} \right).$$

HW36 Exercise - 14.0.7 - Let $f : \mathbb{R} \to \mathbb{R}$ be continuous. If $X_n \xrightarrow{prob} X$ then prove that $f(X_n) \xrightarrow{prob} f(X)$.

Theorem - 14.0.1 - (Skorokhod's representation theorem) Let $X_n$ and $X$ be random variables with respective distributions $F_n$ and $F$ such that $X_n \xrightarrow{dist} X$. Then there exists a probability space $(S, \Sigma, P)$ and random variables $Y_n$ and $Y$ defined on it, having respective distributions $F_n$ and $F$, such that $Y_n \xrightarrow{a.s.} Y$.

Proof: Take a uniform random variable $U \sim \text{Uniform}(0,1)$ defined over some probability space $(S, \mathcal{E}, P)$. (For instance, $S = [0,1]$, $\mathcal{E} = \mathcal{B}$ and $P((a,b)) = b-a$ with $U(u) = u$.) Now take
$$Y(u) := \inf\{t : F(t) \ge u\}, \qquad Y_n(u) := \inf\{t : F_n(t) \ge u\}, \qquad u \in (0,1).$$
By Example (8.1.2), we have $Y_n(U) \sim F_n$ and $Y(U) \sim F$. To show that $Y_n$ converges to $Y$ almost surely, take any $s \in (0,1)$ and fix it. Compute the corresponding $Y(s)$ and consider the interval $(Y(s)-\epsilon, Y(s))$. Pick an
$$x \in (Y(s)-\epsilon, Y(s))$$
which is a continuity point of $F$. We can do this since there are at most countably many points of discontinuity of $F$. So $x < Y(s)$ implies that $F(x) < s$, by the definition of $Y(s)$. But we know that $F_n(x) \to F(x)$, so it must be that $F_n(x) < s$ for all large values of $n$. Hence, for all large values of $n$ we must have
$$Y(s)-\epsilon < x < Y_n(s) \;\Longrightarrow\; Y(s) < Y_n(s) + \epsilon \;\Longrightarrow\; Y(s) \le \liminf_n Y_n(s) + \epsilon.$$
This being true for every $\epsilon > 0$, we must have $Y(s) \le \liminf_n Y_n(s)$.

To go the other way, pick any $t$ such that $s < t < 1$. Now compute $Y(t)$ and consider the interval $(Y(t), Y(t)+\epsilon)$. Pick a continuity point $x$ of $F$ such that
$$Y(t) < x < Y(t) + \epsilon.$$
Now $Y(t) < x$ implies that $F(x) \ge t > s$. The convergence of $F_n(x) \to F(x)$ implies that $F_n(x) > s$ for each large $n$. The definition of $Y_n(s)$ as the smallest such point gives
$$Y_n(s) \le x < Y(t) + \epsilon \quad \text{for all large } n.$$
Making $n$ large and then letting $\epsilon$ drop to zero gives that
$$\limsup_n Y_n(s) \le Y(t), \quad \text{whenever } t > s.$$
Now $Y(t)$ is a nondecreasing function, so it has at most countably many points of jump. Ignoring those $s$, letting $t \to s$ gives that $\limsup_n Y_n(s) \le Y(s)$. Hence $Y_n(s)$ converges to $Y(s)$ for all $s \in (0,1)$ except perhaps countably many $s$. ♠

Remark - 14.0.3 - (Summability & the strong law of large numbers) Summability theory came into being while trying to create an algorithm that assigns a limit to nonconvergent sequences. Of course, to avoid any arbitrariness in the assigned limit we require that the algorithm, when applied to a convergent sequence, must give the correct limit. Any such limit assignment algorithm is called a regular summability method.1 There are many such methods and one of the most popular of them all is called the Cesaro method. For any sequence $a_1, a_2, \cdots$ of real (or complex) numbers the Cesaro method assigns the value
$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^n a_k,$$
provided this limit exists. It is not difficult to show that if $a_k$, $k \ge 1$, is a convergent sequence then the Cesaro method does give the correct answer (i.e., the Cesaro method is regular). There are many nonconvergent sequences for which the Cesaro method does not work (it is ineffective). The following proposition gives uncountably many such examples. On the other hand, there are quite a few nonconvergent sequences for which it is effective. For instance, for $a_k = (-1)^k$ it gives the limit 0. The reader may try to find a few more examples.

1 Although most well known summability methods are linear operations, they do not have to be.

The strong laws of large numbers use the Cesaro method to assign an almost sure limit to a sequence of random variables $X_1, X_2, \cdots$, especially when the $X_i$'s are independent and identically distributed. The following result shows that to have any hope that the Cesaro method will be effective for independent and identically distributed random variables we must assume that the first moment of the random variables is finite.
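Before turning to the proposition, here is a small numerical illustration of the Cesaro method (a sketch added for these notes; the particular sequences are chosen only as examples). It computes the Cesaro means of the convergent sequence $a_k = 1/k$, of the nonconvergent but Cesaro-summable sequence $a_k = (-1)^k$, and of $a_k = (-1)^k k$, for which the Cesaro means keep oscillating; in this third case $|a_k|/k \not\to 0$, matching the necessary condition in the proposition that follows.

```python
import numpy as np

def cesaro_means(a):
    """Return the sequence of Cesaro means (1/n) * sum_{k<=n} a_k."""
    a = np.asarray(a, dtype=float)
    n = np.arange(1, len(a) + 1)
    return np.cumsum(a) / n

k = np.arange(1, 10_001)
for name, a in [("1/k", 1.0 / k),
                ("(-1)^k", (-1.0) ** k),
                ("(-1)^k * k", (-1.0) ** k * k)]:
    means = cesaro_means(a)
    print(f"{name:12s} last few Cesaro means: {means[-3:]}")
```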

Proposition - 14.0.5 - (Cesaro method is ineffective when the mean does not exist) Let $a_1, a_2, \cdots$ be a sequence of real numbers so that $\frac{1}{n}\sum_{k=1}^n a_k$ converges to a finite limit. Then
$$\frac{|a_k|}{k} \to 0, \quad \text{(usually written as } a_k = o(k)\text{)}.$$
Let $X_1, X_2, \cdots$ be a sequence of independent and identically distributed random variables so that $E|X_1| = +\infty$. Then the following results hold:

• (i) $P(\{\omega \in \Omega : |X_k(\omega)| \ge k \text{ infinitely often}\}) = 1$.

• (ii) $P(\{\omega \in \Omega : \frac{1}{n}\sum_{k=1}^n X_k \text{ converges to a finite limit}\}) = 0$.

Proof: Let $\frac{1}{n}\sum_{k=1}^n a_k \to L$. Therefore,
$$\frac{|a_n|}{n} = \left| \frac{1}{n}\sum_{k=1}^n (a_k - L) - \frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{k=1}^{n-1} (a_k - L) + \frac{L}{n} \right| \to 0.$$
To prove (i) just note that $P(|X_1| > x)$ is a decreasing function of $x$. Therefore,
$$+\infty = E|X_1| = \int_0^\infty P(|X_1| > x)\, dx = \sum_{k=0}^\infty \int_k^{k+1} P(|X_1| > x)\, dx \le \sum_{k=0}^\infty P(|X_1| > k) = \sum_{k=0}^\infty P(|X_k| > k).$$
In the last equality we used the fact that $X_1, X_2, \cdots$ are identically distributed. The events $A_k = \{|X_k| > k\}$ being independent, the second Borel-Cantelli lemma gives that $|X_k| > k$ occurs infinitely often almost surely. This gives part (i). Now part (ii) follows since the Cesaro method is ineffective for any such sequence. ♠

Example - 14.0.2 - Let $X_1, X_2, \cdots \overset{iid}{\sim} \text{Cauchy}(0,1)$. Since $E|X_1| = +\infty$, the above proposition shows that $P\left(\frac{1}{n}\sum_{i=1}^n X_i \text{ converges}\right) = 0$. Either by using the uniqueness of characteristic functions (a result that we will prove later) or by using convolutions, $\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \sim \text{Cauchy}(0,1)$, so $\overline{X}_n \xrightarrow{dist} \text{Cauchy}(0,1)$.
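A quick simulation makes the failure of the Cesaro (sample mean) method for Cauchy data visible; this sketch is an added illustration with arbitrary sample sizes and seed. The running means of Cauchy draws keep jumping no matter how long the run, while the same recipe applied to standard normal data settles down.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
checkpoints = [10**k for k in range(2, 7)]

cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

cauchy_means = np.cumsum(cauchy) / np.arange(1, n + 1)
normal_means = np.cumsum(normal) / np.arange(1, n + 1)

for m in checkpoints:
    # Cauchy running means do not settle; normal running means approach 0.
    print(f"n={m:>8d}  mean(Cauchy)={cauchy_means[m-1]:+.3f}  "
          f"mean(Normal)={normal_means[m-1]:+.3f}")
```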

Example - 14.0.3 - (A strong law of large numbers, SLLN) In light of the last result one must impose finiteness of some moment in order to hope that a strong law of large numbers might hold. A remarkable result of Kolmogorov says that assuming $E|X_1| < \infty$ does make the Cesaro method effective, and that the Cesaro limit turns out to be $E(X_1)$. We will postpone the proof of this result until we build several necessary results. However, if $E(X_1^4)$ is finite, the proof is easy. Note that, by Markov's inequality, for any $\varepsilon > 0$,
$$\begin{aligned}
\sum_{n=1}^\infty P(|\overline{X}_n - \mu| \ge \varepsilon) &\le \sum_{n=1}^\infty \frac{E|\overline{X}_n - \mu|^4}{\varepsilon^4} \\
&= \sum_{n=1}^\infty \frac{n E(X_1-\mu)^4 + 3n(n-1)\sigma^4}{\varepsilon^4 n^4}, \quad \text{(by Exercise (14.0.8))}, \\
&\le \frac{E(X_1-\mu)^4}{\varepsilon^4} \sum_{n=1}^\infty \frac{1}{n^3} + \frac{3\sigma^4}{\varepsilon^4} \sum_{n=1}^\infty \frac{1}{n^2} < \infty.
\end{aligned}$$
Item (b) of Corollary (14.0.1) shows that $\overline{X}_n$ converges to $\mu$ almost surely. Furthermore, since we have shown that $\sup_{n\ge 1} E|\overline{X}_n - \mu|^4 < \infty$, item (d) of Corollary (14.0.1) shows that $\overline{X}_n$ converges to $\mu$ in the $L_1$ sense as well, i.e., $\lim_{n\to\infty} E|\overline{X}_n - \mu| = 0$.

As a special case, a result of Borel falls out, which he proved when $X_i \sim B(1, \frac{1}{2})$. Borel's result is also known as Borel's normal number theorem.

Exercise - 14.0.8 - Verify that
$$E(\overline{X}_n - \mu)^4 = \frac{n E(X_1-\mu)^4 + 3n(n-1)\sigma^4}{n^4}.$$

Remark - 14.0.4 - (Fatou's lemma & asymptotic variance) If $(Y_n - E(Y_n)) \xrightarrow{a.s.} Y$, then by Fatou's lemma all we can say is
$$Var(Y) \le E(Y^2) \le \liminf_n E(Y_n - E(Y_n))^2 = \liminf_n Var(Y_n).$$
In general the inequality can be strict. However, for certain martingales (as mentioned below) equality can be guaranteed.

Example - 14.0.4 - (Martingale convergence theorem) One nice thing about martingales is that they converge almost surely under one simple condition. If $M_1, M_2, \cdots$ is a martingale with $\sup_n E|M_n|^p < \infty$ for some constant $p \ge 1$, then there exists a random variable $Z$ such that $M_n \xrightarrow{a.s.} Z$. This is called the martingale convergence theorem. In fact, if $p = 2$ then we can further say that $E|M_n - Z|^2 \to 0$ as well. In particular, $Var(M_n) \to Var(Z)$. Its proof is quite deep and we will postpone it until we build the necessary tools later on.

To see an example of its use, consider the random harmonic series,
$$M_n := \sum_{k=1}^n \frac{2U_k - 1}{k}, \qquad n = 1, 2, \cdots,$$
where $U_1, U_2, \cdots \overset{iid}{\sim} B(1, \frac{1}{2})$, i.e., fair coin toss outcomes: $U_i = 1$ if a head occurs and $U_i = 0$ otherwise. Note that $M_n$ is a martingale and
$$E|M_n|^2 = Var(M_n) = \sum_{k=1}^n \frac{Var(2U_k - 1)}{k^2} = \sum_{k=1}^n \frac{1}{k^2}.$$

[Figure 14.1: Density of Random Harmonic Series — four histograms of the values of $M_n$ for $n = 30, 50, 100, 500$, each superimposed with the $N(0, \pi^2/6)$ density.]

This gives that $\sup_n E|M_n|^2 = \sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6}$. The martingale convergence theorem therefore guarantees the existence of a random variable, say $Z$, so that $M_n \to Z$ almost surely. In other words,
$$Z = \sum_{k=1}^\infty \frac{2U_k - 1}{k}.$$
What is the distribution of $Z$? It is known that $Z$ has a density (with respect to Lebesgue measure), but no one knows its closed form expression. However, $E(Z) = 0$ and $Var(Z) = \frac{\pi^2}{6}$. Many types of extensions and variants have been explored in the literature. Four simulation approximations of the density of $Z$ are presented in Figure (14.1). The superimposed curve is the density of $N(0, \frac{\pi^2}{6})$ for comparison purposes.
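The simulations in Figure (14.1) can be reproduced with a few lines of code; the following sketch (added here as an illustration, with an arbitrary number of replications) simulates $M_n$ for $n = 500$ and compares its first two sample moments with $E(Z) = 0$ and $Var(Z) = \pi^2/6$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 50_000
k = np.arange(1, n + 1)

# Each row: one realization of M_n = sum_{k<=n} (2 U_k - 1)/k, U_k a fair coin.
signs = 2 * rng.integers(0, 2, size=(reps, n)) - 1
M_n = (signs / k).sum(axis=1)

print("sample mean     :", M_n.mean())          # close to 0
print("sample variance :", M_n.var())           # close to pi^2/6
print("pi^2/6          :", np.pi**2 / 6)
```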

HW37 Exercise - 14.0.9 - (Martingale convergence) Let $U_1, U_2, \cdots \overset{iid}{\sim} \text{Uniform}\{0, 1, 2, \cdots, 9\}$ and
$$X_n := \sum_{i=1}^n \frac{U_i - 4.5}{10^i}, \qquad n \ge 1.$$
Does $X_1, X_2, \cdots$ obey the martingale property with respect to $U_1, U_2, \cdots$? If so, does the martingale converge? If so, to which random variable and in which sense?


Exercise - 14.0.10 - (Martingale convergence) For Exercise (14.0.9), write a computer program and simulate the distribution of $X_n$ for $n = 5, 10, 15$ and $20$.
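One possible sketch for this simulation exercise (added here only as an illustration, not as the intended solution; the number of replications is arbitrary) is the following. For each requested $n$ it draws many copies of $X_n$ and prints summary statistics; a histogram of the samples approximates the distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 100_000

for n in (5, 10, 15, 20):
    digits = rng.integers(0, 10, size=(reps, n))            # U_i ~ Uniform{0,...,9}
    weights = 10.0 ** -np.arange(1, n + 1)                   # 1/10^i
    X_n = ((digits - 4.5) * weights).sum(axis=1)             # X_n for each replication

    # X_n lies in [-0.5, 0.5]; its distribution is already nearly uniform for small n.
    print(f"n={n:2d}  mean={X_n.mean():+.5f}  var={X_n.var():.5f}  "
          f"min={X_n.min():+.4f}  max={X_n.max():+.4f}")
```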

HW38 Exercise - 14.0.11 - (How to pick a point at random from [−0.5, 0.5]?) Using Exercise (14.0.9), explain how you can pick a point (approximately) at random from $[-0.5, 0.5]$.


Lecture 15

The Lp Spaces & Uniform Integrability

We start off with an exercise for the reader. It is a special case of the concept of uniform integrability, which we will pick up later in this lecture.

HW39 Exercise - 15.0.12 - (Uniform integrability of an integrable r.v.) Let $X$ be a random variable. Prove that the following three statements are equivalent.

• (a) $E|X| < \infty$.

• (b) If $A_n$ is a sequence of events so that $P(A_n) \to 0$ then $\int_{A_n} |X|\, dP \to 0$.

• (c) $\lim_{n\to\infty} \int_{\{|X|\ge n\}} |X|\, dP = 0$.

In particular, deduce that (a) implies $\lim_n n\, P(|X| \ge n) = 0$, but the converse may not hold.

The linear space consisting of all the random variables (actually equivalence classes) defined over a probability space $(\Omega, \mathcal{E}, P)$ will be denoted by $L_0$. The space $L_0$ is broken down into various subsets with the help of a constant $p \in [0, +\infty]$. When $p > 0$, the set $L_p$ consists of all those random variables for which $E|X|^p < \infty$. It is not difficult to see that $L_p$ itself is linear; it is called the $L_p$ space.

Definition - 15.0.4 - Let $L$ be a linear space. A real valued function $\|\cdot\|$ on $L$ is called a seminorm if for any $f, g \in L$,

• (i) $\|f\| \ge 0$.

• (ii) $\|\alpha f\| = |\alpha|\, \|f\|$ for any $\alpha \in \mathbb{R}$.

• (iii) $\|f + g\| \le \|f\| + \|g\|$.

Furthermore, $\|\cdot\|$ is called a norm if in addition

• (iv) $\|f\| = 0$ if and only if $f \equiv 0$.

(Note that $f \equiv 0$ is always a member of $L$ since for any $g \in L$, $0 = g - g \in L$.)

Minkowski's inequality shows that on $L_p$, for each $1 \le p < \infty$,
$$\|X\|_p := \left( \int_S |X|^p\, dP \right)^{1/p}$$
is a seminorm. To make it a norm, i.e., to make $\|X\|_p = 0$ imply that $X = 0$ (it only implies that $X \overset{a.s.}{=} 0$), we identify all those random variables which are equal almost surely as one equivalence class of random variables. We denote this set by $L_p$ as well. More formally, if $[X]$ is the class of functions which are equal almost surely to $X$, and $[Y]$ is another such class, then we define
$$\alpha[X] := [\alpha X] \ \text{ for any } \alpha \in \mathbb{R}, \qquad [X] + [Y] := [X + Y].$$
This definition clearly does not depend on the choice of $X_1 \in [X]$ and $Y_1 \in [Y]$. Therefore, $L_p$ becomes a linear space. If we define
$$\int_S |[X]|^p\, dP := \int_S |X|^p\, dP$$
then it is a well defined number and $L_p$ becomes a normed linear space. In fact, if $[X], [Y] \in L_p$ then for $p \ge 1$,
$$\rho([X],[Y]) := \left( \int_S |X - Y|^p\, dP \right)^{1/p} = \|X - Y\|_p.$$
To avoid this complicated notation we will denote the elements $[X]$ by $X$ from now on and continue to use the terminology "random variable in $L_p$".

Here are the four basic varieties.

• (i) ($L_0$ case) Over the whole of $L_0$ we may define the distance between $X$, $Y$ as
$$\rho_0(X,Y) = E\left( \frac{|X-Y|}{1+|X-Y|} \right).$$
One can verify that $\rho_0$ obeys the requirements for a distance function. In this case $X_n \xrightarrow{L_0} X$, or $\rho_0(X_n, X) \to 0$, stands for $X_n \xrightarrow{prob} X$.

• (ii) ($L_p$, $p \in (0,1)$ case) It turns out that for $p \in (0,1)$, if we define
$$\rho_p(X,Y) := E|X - Y|^p,$$
then again the resulting linear space becomes a metric space. In this case $X_n \xrightarrow{L_p} X$, or $\rho_p(X_n, X) \to 0$, stands for $E|X_n - X|^p \to 0$.

• (iii) ($L_p$, $1 \le p < \infty$ case) Using Minkowski's inequality, when $1 \le p < \infty$,
$$\rho_p(X,Y) := \|X - Y\|_p = \left(E|X - Y|^p\right)^{1/p}$$
makes $L_p$ a normed linear space. In this case $X_n \xrightarrow{L_p} X$, or $\rho_p(X_n, X) \to 0$, stands for $(E|X_n - X|^p)^{1/p} \to 0$.

• (iv) ($L_\infty$ case) Finally, there is one further case, namely that of $L_\infty$. This is the space of essentially bounded random variables. That is, $X \in L_\infty$ if and only if there exists a positive real number $K$ such that $|X| \overset{a.s.}{\le} K$. In this space the size of a random variable is measured by
$$\rho_\infty(X,Y) = \inf\left\{ K : |X - Y| \overset{a.s.}{\le} K \right\} = \inf\{ K : P(|X-Y| > K) = 0 \}.$$
The number $\rho_\infty(X,Y)$ is called the essential supremum of $|X-Y|$. It is again a metric space. In this case we write $X_n \xrightarrow{L_\infty} X$ if $\rho_\infty(X_n, X) \to 0$.

As subsets, the various $L_p$ spaces form a tower, thanks to the Lyapunov inequality:
$$L_\infty \subseteq L_2 \subseteq L_1 \subseteq L_{1/2} \subseteq L_0.$$
As far as convergence is concerned, Fatou's lemma gives that for any $X_1, X_2, \cdots \in L_0$ we always have
$$E\left( \liminf_n |X_n|^p \right) \le \liminf_n E|X_n|^p.$$

Remark - 15.0.5 - (Some other "Lp" spaces) We should mention that there are several other classification techniques that use the tail $P(|X| > t)$ itself, rather than the moments, to define the so-called weak $L_p$ spaces.

We should also mention the weighted $L_p$ space used for defining orthogonal polynomials. In this space, a single random variable $X$ which lies in $L_p$ for all $p \ge 1$ (i.e., $X$ has all moments) is used to collect all functions $h(t)$ so that $E|h(X)|^2 < \infty$. The resulting $L_2(X)$ space consists of deterministic functions $h$. Such a space has all polynomials as its members and infinitely many orthogonal polynomials, when $X$ takes infinitely many values.

Proposition - 15.0.6 - Let $0 < p < \infty$ be a number. If $X_n \xrightarrow{L_p} 0$ then $X_n \xrightarrow{prob} 0$. Conversely, if $X_n \xrightarrow{prob} 0$ and there exists a random variable $Y$ such that $|X_n| < Y$ with $E|Y|^p < \infty$, then $X_n \xrightarrow{L_p} 0$.

Proof: The forward case is obvious. Conversely,
$$E(|X_n|^p) = \int_{\{|X_n| < \epsilon\}} |X_n|^p\, dP + \int_{\{|X_n| \ge \epsilon\}} |X_n|^p\, dP \le \epsilon^p + \int_{\{|X_n| \ge \epsilon\}} |Y|^p\, dP.$$
The last integral goes to zero by Exercise (15.0.12). Since $\epsilon$ is arbitrary, the result follows. ♠

Since the concepts of convergence in probability and $L_p$ convergence are defined over metric spaces, it is natural to ask whether the Cauchy criterion can be invoked. That is, does every Cauchy sequence have to converge to some element of the space? If the answer is yes, we say that the metric space is complete. It turns out that all these $L_p$ spaces are complete metric spaces for their respective metrics. The completeness of the $L_p$ spaces for $p \ge 1$ is a theorem due to Riesz and Fischer, which is the main focus of this section. In the proof of the Riesz-Fischer theorem we will need the following proposition.

Proposition - 15.0.7 - Let $0 < p < \infty$ and let $X_n$, $n \ge 1$, be a sequence in $L_p$ so that
$$\left( E|X_n - X_{n+1}|^p \right)^{1/p} < \frac{1}{4^n}$$
for all $n \ge 1$. Then $X_n$ converges almost surely to a random variable.

Proof: Consider the set
$$A_n := \left\{ s \in S : |X_n(s) - X_{n+1}(s)| \ge 2^{-n} \right\}, \qquad (n \ge 1).$$
By the Chebyshev inequality we get
$$P(A_n) \le 2^{np} \int_{A_n} |X_n - X_{n+1}|^p\, dP \le 2^{np}\, E|X_n - X_{n+1}|^p < 2^{-np}.$$
Therefore, $\sum_{n=1}^\infty P(A_n) < \infty$. By the first Borel-Cantelli lemma, $P\left( \limsup_n A_n \right) = 0$. That is, for almost all $s \in S$, there exists an $N(s)$ such that
$$|X_n(s) - X_{n+1}(s)| < 2^{-n} \quad \text{for all } n \ge N(s).$$
Thus, for all $m > n \ge N(s)$,
$$|X_n(s) - X_m(s)| \le \sum_{k=n}^{m-1} |X_k(s) - X_{k+1}(s)| < \sum_{k=n}^{m-1} 2^{-k} \le 2^{-n+1}.$$
That is, $X_n(s)$, $n \ge 1$, is a Cauchy sequence of real numbers for almost all $s \in S$ and hence it converges. ♠

Remark - 15.0.6 - The existence of the limiting random variable in the above proposition does not guarantee that it is a member of $L_p$. Of course, the following Riesz-Fischer theorem shows when it is.

Note that if $X_n \to X$ in the metric space $L_p$, $1 \le p < \infty$, then $X$ is unique almost surely. Also, it implies that
$$\left| \|X_n\|_p - \|X\|_p \right| \le \|X_n - X\|_p \to 0 \quad \text{as } n \to \infty.$$
That is, $\|\cdot\|_p$ is a continuous real valued function.


Theorem - 15.0.2 - (Riesz-Fischer — completeness of Lp) For $1 \le p < \infty$ the space $L_p$ is a complete metric space.

Proof: Let $X_n$, $n \ge 1$, be a Cauchy sequence in $(L_p, \rho_p)$. That is, for any $\varepsilon > 0$, there exists $N$ such that $\|X_n - X_m\|_p < \varepsilon$ for all $n, m \ge N$. For $\varepsilon = \frac{1}{4}$ let $N_1$ be an integer so that $\|X_n - X_m\|_p < \frac{1}{4}$ for all $n, m \ge N_1$. For $\varepsilon = \left(\frac{1}{4}\right)^2$ let $N_2 > N_1$ be an integer so that $\|X_n - X_m\|_p < \left(\frac{1}{4}\right)^2$ for all $n, m \ge N_2$. Continue this way and consider the subsequence $X_{N_i}$, $i \ge 1$. Since
$$\left\| X_{N_i} - X_{N_{i+1}} \right\|_p < \frac{1}{4^i}, \qquad i \ge 1,$$
the previous proposition says that $X_{N_i}$, $i \ge 1$, converges a.s. on $S$ to a random variable $X$. Therefore, $X$ is measurable and Fatou's lemma gives
$$\int_S |X_n - X|^p\, dP \le \liminf_i \int_S |X_n - X_{N_i}|^p\, dP = \liminf_i \|X_n - X_{N_i}\|_p^p.$$
Now, for any $\varepsilon > 0$, picking $n$ large enough, the Cauchy property of $X_n$ gives that
$$\|X_n - X_{N_i}\|_p^p < \varepsilon \quad \text{for all large } i.$$
That is, $\int_S |X_n - X|^p\, dP < \infty$ for large $n$ and $\lim_{n\to\infty} \int_S |X_n - X|^p\, dP = 0$. Thus, $X = -(X_n - X) + X_n \in L_p$ and $\|X_n - X\|_p \to 0$ as $n \to \infty$. ♠

Remark - 15.0.7 - Any complete normed linear space is called a Banach space. We have just proved that the space $L_p$, for each $p \ge 1$, is a Banach space. Another important example of a Banach space is $C[a,b]$ with
$$\|f\| := \max_{a \le x \le b} |f(x)|.$$
Other common examples are $(\mathbb{R}, |\cdot|)$, $(\mathbb{R}^n, \|\cdot\|_p)$, $(\mathbb{R}^n, \|\cdot\|_\infty)$, and $\ell_p$, $1 \le p \le \infty$. The list of examples is extremely large (infinite).

Exercise - 15.0.13 - In $(\mathbb{R}, |\cdot|)$ we have a simple way of checking summability of a series. One useful result for characterizing completeness of a normed linear space is the following comparison of absolute (norm) summability and ordinary summability.

Let $(B, \|\cdot\|)$ be a normed linear space. Then show that the following are equivalent:

• (i) $(B, \|\cdot\|)$ is a Banach space.

• (ii) For any $f_n$, $n \ge 1$, in $B$, $\sum_{n=1}^\infty \|f_n\| < \infty$ implies that $\sum_{i=1}^n f_i$ converges to an element $f$ of the metric space $(B, \|\cdot\|)$.

Exercise - 15.0.14 - (L0 is complete) Show that the space $L_0$ is complete. That is, show that for any Cauchy sequence $X_n$ in $L_0$, there exists a random variable $X$ so that $X_n \xrightarrow{prob} X$. [Hint: Use Exercise (14.0.7).]


Exercise - 15.0.15 - (Lp, 0 < p < 1 case) For any fixed $p \in (0,1)$, prove or disprove that $L_p$ is a complete metric space.

Exercise - 15.0.16 - (Completeness of ℓ∞) Let $\ell_\infty$ be the space of all bounded real (or complex) sequences, that is, bounded functions defined over $\{1, 2, \cdots\}$, with norm $\|f\| = \sup_k |f(k)|$. Show that the space is complete.

Exercise - 15.0.17 - (Completeness of L∞(µ)) Let $(\Omega, \mathcal{E}, \mu)$ be a measure space. Let $L_\infty(\mu)$ be the space consisting of those functions $f$ for which
$$\|f\|_{ess} := \inf_{K>0} \{ K : \mu(\{s : |f(s)| > K\}) = 0 \}$$
is finite. Show that $\|\cdot\|_{ess}$ is a norm. Is $L_\infty(\mu)$ complete?

15.1 Uniform Integrability

The basic theme of this section is to improve the Lebesgue dominated convergence theorem. Recall that the Lebesgue dominated convergence theorem gives sufficient conditions on a sequence of random variables $X_n$, converging almost surely to another random variable $X$, so that
$$\lim_{n\to\infty} \int_\Omega |X_n - X|\, dP = 0.$$
Now we will provide some necessary and sufficient conditions for this to hold. This requires the notion of uniform integrability.

Definition - 15.1.1 - Let $(\Omega, \mathcal{E}, P)$ be a probability space. Let $K = \{X_\alpha, \alpha \in I\}$ be a collection of random variables. We say the collection is uniformly integrable if
$$\lim_{t\to\infty} \sup_{\alpha\in I} \int_{\{x\in\Omega : |X_\alpha(x)| \ge t\}} |X_\alpha|\, dP = 0.$$
Note that any collection consisting of only one random variable $X$ with $E|X| < \infty$ is automatically uniformly integrable, cf. Exercise (15.0.12). Using this we see that any finite collection of such random variables will also be uniformly integrable.

Natanson uses the term equi-absolutely continuous integrals to describe uniform integrability. Uniform integrability plays a major role in probability, analysis and other fields. Here is another way of looking at uniform integrability.

Proposition - 15.1.1 - A collection $\{X_\alpha, \alpha \in I\}$ is uniformly integrable if and only if the following two conditions hold:

(i) $\sup_{\alpha\in I} E|X_\alpha| < \infty$,

(ii) For any $\varepsilon > 0$, there exists a $\delta > 0$ such that for any event $A$ with $P(A) < \delta$ we have
$$\sup_{\alpha\in I} \int_A |X_\alpha|\, dP < \varepsilon.$$


Proof: Assume the two conditions hold. By Chebyshev's inequality,
$$P(|X_\alpha| \ge t) \le \frac{E|X_\alpha|}{t} \le \frac{\sup_{\alpha\in I} E|X_\alpha|}{t}.$$
Thus, if we let $A_{\alpha,t} := \{|X_\alpha| \ge t\}$ and let $t > \frac{1}{\delta} \sup_{\alpha\in I} E|X_\alpha|$, then $P(A_{\alpha,t}) < \delta$ for all $\alpha \in I$. Thus,
$$\int_{A_{\alpha,t}} |X_\alpha|\, dP \le \sup_{\beta\in I} \int_{A_{\alpha,t}} |X_\beta|\, dP < \varepsilon, \quad \text{for each } \alpha \in I.$$
Since this is true for each $\alpha \in I$, we have
$$\sup_{\alpha\in I} \int_{A_{\alpha,t}} |X_\alpha|\, dP < \varepsilon \quad \text{for all large } t.$$
To prove the converse, let $\varepsilon > 0$. Then there exists a $T(\varepsilon) > 0$ such that
$$\sup_{\alpha\in I} \int_{\{|X_\alpha|\ge t\}} |X_\alpha|\, dP < \frac{\varepsilon}{2} \quad \text{for all } t \ge T(\varepsilon).$$
We use this $T = T(\varepsilon)$ carefully to show that both (i) and (ii) hold. Indeed,
$$\begin{aligned}
\sup_{\alpha\in I} \int_\Omega |X_\alpha|\, dP &= \sup_{\alpha\in I} \left( \int_{\{|X_\alpha|\ge T\}} |X_\alpha|\, dP + \int_{\{|X_\alpha|<T\}} |X_\alpha|\, dP \right) \\
&\le \sup_{\alpha\in I} \int_{\{|X_\alpha|\ge T\}} |X_\alpha|\, dP + T \;\le\; \frac{\varepsilon}{2} + T < \infty.
\end{aligned}$$
This gives (i). To show (ii), take $\delta = \varepsilon/(2T)$. Then for any event $A$ with $P(A) < \delta$,
$$\begin{aligned}
\sup_{\alpha\in I} \int_A |X_\alpha|\, dP &= \sup_{\alpha\in I} \left( \int_{A\cap\{|X_\alpha|\ge T\}} |X_\alpha|\, dP + \int_{A\cap\{|X_\alpha|<T\}} |X_\alpha|\, dP \right) \\
&\le \sup_{\alpha\in I} \int_{\{|X_\alpha|\ge T\}} |X_\alpha|\, dP + T\, P(A) \;\le\; \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned}$$
This finishes the proof. ♠

Here is the main result of this section.

Theorem - 15.1.1 - (Characterization of L1 convergence) Let $X_n$, $n \ge 1$, be a sequence of random variables with finite expectations. Then the following statements are equivalent.

• (1) There is a random variable $X$ such that
$$\lim_{n\to\infty} E|X_n - X| = 0.$$

• (2) The following three results hold:

– (i) $X_n \xrightarrow{prob} X$,

– (ii) $X$ has finite expectation,

– (iii) $\{X_n, n \ge 1\}$ is uniformly integrable.

Proof: Assume (1) holds. For $n = 1$ we see that
$$E|X| \le E|X_1 - X| + E|X_1|.$$
This shows that $X$ has finite expectation, giving (ii). By Chebyshev's inequality,
$$P(|X_n - X| > \varepsilon) \le \frac{E|X_n - X|}{\varepsilon} \to 0$$
as $n \to \infty$. This gives part (i). To show uniform integrability note that, for $\varepsilon = 1$,
$$E|X_n| \le E|X_n - X| + E|X| \le 1 + E|X| =: K_1, \quad \text{for all } n \ge N_1.$$
Since the $X_n$ are integrable, we see that
$$\sup_n E|X_n| \le \max\{ E|X_1|, \ldots, E|X_{N_1-1}|, K_1 \} < \infty.$$
Next, for any $\varepsilon > 0$, using the uniform integrability of $X$ and the given information,

(a) there exists a $\delta > 0$ such that $\int_A |X|\, dP < \frac{\varepsilon}{2}$ whenever $P(A) < \delta$,

(b) there exists an $N_2$ such that $E|X_n - X| < \frac{\varepsilon}{2}$ for all $n \ge N_2$.

By the integrability of $X_1, X_2, \cdots, X_{N_2}$ we have $\delta_1, \delta_2, \ldots, \delta_{N_2} > 0$ so that
$$\int_A |X_i|\, dP < \frac{\varepsilon}{2} \quad \text{whenever } P(A) < \delta_i, \quad 1 \le i \le N_2.$$
Take $\delta^* = \min\{\delta, \delta_1, \delta_2, \ldots, \delta_{N_2}\}$. Then, for any event $A$ with $P(A) < \delta^*$,
$$\int_A |X_n|\, dP \le \int_A |X_n - X|\, dP + \int_A |X|\, dP \le \int_\Omega |X_n - X|\, dP + \int_A |X|\, dP \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon \quad \text{for all } n \ge N_2.$$
Therefore, (2) holds. To prove the converse, assume (2) holds. Since $X$ has finite expectation, it is uniformly integrable, so
$$\lim_{t\to\infty} \int_{\{|X|\ge t\}} |X|\, dP = 0.$$
Therefore, it is easy to see that the sequence $\{X_n - X,\ n = 1, 2, \cdots\}$ is also uniformly integrable. Hence,
$$\lim_{t\to\infty} \sup_n \int_{\{|X_n - X| \ge t\}} |X_n - X|\, dP = 0.$$
So, for any $\epsilon > 0$, we have
$$E|X_n - X| = \int_{\{|X_n-X|\ge t\}} |X_n - X|\, dP + \int_{\{|X_n-X|<t\}} |X_n - X|\, dP \le \sup_n \int_{\{|X_n-X|\ge t\}} |X_n - X|\, dP + \int_\Omega |X_n - X|\, \chi_{\{|X_n-X|<t\}}\, dP.$$
As $n$ gets large, the second term on the right goes to zero by the ordinary Lebesgue dominated convergence theorem. Then letting $t$ get large makes the first term on the right go to zero. ♠

HW40 Exercise - 15.1.1 - If the $X_i$ are identically distributed and have finite expectations, then prove that $\{X_i\}$ is uniformly integrable.

Exercise - 15.1.2 - Let $\{X_t, t \in T\}$ and $\{Y_s, s \in U\}$ be uniformly integrable families over the same probability space. Prove that

• (1) $\{X_t + Y_s,\ t \in T,\ s \in U\}$ is also uniformly integrable.

• (2) Any subcollection of a uniformly integrable family is again uniformly integrable.

Exercise - 15.1.3 - Let $\{X_t, t \in T\}$ be a family of random variables defined over a probability space so that $|X_t| \le X$ almost surely for each $t \in T$ and $E|X| < \infty$. Prove that the family is uniformly integrable.

Exercise - 15.1.4 - Let $X_n \xrightarrow{a.s.} X$ (or $X_n \xrightarrow{prob} X$, or $X_n \xrightarrow{dist} X$) and let $\{X_n\}$ be uniformly integrable. Prove that $X$ has finite expectation and that
$$\lim_{n\to\infty} E(X_n) = E(X).$$

Exercise - 15.1.5 - Consider the probability space $((-1,1), \mathcal{B}, P)$, where $P((a,b]) = \frac{b-a}{2}$, and let $X_n$ be the function over $(-1,1)$ which takes the value $n$ over the interval $(0, 1/n)$, the value $-n$ over the interval $(-1/n, 0)$, and zero otherwise. Show that the $X_n$ are not uniformly integrable. However, $X_n \xrightarrow{a.s.} 0 =: X$ and
$$E(X_n) \to E(X).$$
Does this contradict Theorem (15.1.1)?
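A quick numerical check of this example (an illustrative sketch, not required by the exercise) can be done by direct integration of the piecewise-constant $X_n$ against the density $\frac{1}{2}$ on $(-1,1)$: the means $E(X_n)$ are all $0$, yet the tail contribution $\int_{\{|X_n|\ge n\}} |X_n|\, dP$ stays at $1$, so the family cannot be uniformly integrable.

```python
# X_n equals n on (0, 1/n), -n on (-1/n, 0), and 0 elsewhere, under the
# uniform probability P(dx) = dx/2 on (-1, 1).
for n in (1, 10, 100, 1000):
    mean = n * (1 / n) / 2 + (-n) * (1 / n) / 2          # E(X_n)
    tail = n * (2 / n) / 2                               # integral of |X_n| over {|X_n| >= n}
    print(f"n={n:5d}  E(X_n)={mean:.1f}  tail integral={tail:.1f}")
```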

Exercise - 15.1.6 - Let $X_n$, $n \ge 1$, be a collection of random variables having finite expectations. Show that the following two statements are equivalent.

• (a) $\lim_{n\to\infty} E|X_n - X| = 0$.

• (b) $E|X| < \infty$, $E|X_n| \to E|X|$ and $X_n \xrightarrow{prob} X$.


Exercise - 15.1.7 - (Beppo Levi's theorem) Let $X_n$ be random variables with finite expectations such that
$$\sup_n E(X_n) < \infty.$$
If $X_n \uparrow X$ then prove that $E|X| < \infty$ and that $\lim_{n\to\infty} E(X_n) = E(X)$.

Exercise - 15.1.8 - (Scheffe's theorem) Let $X_n$ be a sequence of random variables such that the distribution of $X_n$ has a density $f_n$ (with respect to the Lebesgue measure $m$). If $f_n \xrightarrow{m} f$ (or $f_n \xrightarrow{a.e.} f$ with respect to the Lebesgue measure), where $f$ is also a density of some random variable $X$, then prove that
$$\lim_{n\to\infty} \int_{\mathbb{R}} |f_n(t) - f(t)|\, dm(t) = 0,$$
and hence $E h(X_n) \to E h(X)$ for any bounded continuous function $h$.

Exercise - 15.1.9 - (Modified LDCT) Let $|X_n| \le Y_n$ where the $Y_n$ have finite expectations, and suppose $X_n \xrightarrow{prob} X$ and $Y_n \xrightarrow{prob} Y$ with $Y$ having finite expectation and $E(Y_n) \to E(Y)$. Then show that
$$\lim_{n\to\infty} E|X_n - X| = 0.$$

Exercise - 15.1.10 - Let $K = \{X_\alpha, \alpha \in I\}$ be a collection of random variables in $L_p$ for some $p > 1$. If $\sup_{\alpha\in I} \|X_\alpha\|_p < \infty$ then show that $K$ is uniformly integrable.

Exercise - 15.1.11 - Let $p > 0$ (note that $p$ could be less than one). Let $K = \{X_n, n = 1, 2, \cdots\}$ be a set in $L_p$. Prove that the following are equivalent.

• (1) $\{|X_n|^p, n \ge 1\}$ is uniformly integrable and $X_n \xrightarrow{prob} X$,

• (2) $E|X_n - X|^p \to 0$,

• (3) $X \in L_p$, $E|X_n|^p \to E|X|^p$ and $X_n \xrightarrow{prob} X$.

Exercise - 15.1.12 - Let $(S, \mathcal{F}, P)$ be a probability space. Let $\Phi : [0,\infty) \to [0,\infty)$ be an increasing function so that $\Phi(x)/x \to \infty$ as $x \to \infty$. Let $K \subseteq L_1(P)$ be a collection of random variables (automatically having finite means). If
$$\sup_{X\in K} E(\Phi(|X|)) = U < \infty,$$
then prove that $K$ is uniformly integrable.

Exercise (15.1.12) contains the gist of uniform integrability over probability spaces. The following theorem shows why.


Theorem - 15.1.2 - (de la Vallée Poussin) Let $(S, \Sigma, P)$ be a probability space and let $K = \{X_\alpha, \alpha \in I\}$ be a collection of random variables having finite first moments. Then $K$ is uniformly integrable if and only if there exists a convex, even, real valued function $\Phi$ over the real line such that

(1) $\Phi(0) = 0$,

(2) $\lim_{t\to\infty} \frac{\Phi(t)}{t} = \infty$,

(3) $\sup_{\alpha\in I} E\Phi(|X_\alpha|) < \infty$.

Proof: Exercise (15.1.12) gives the reverse direction of the proof. To prove the converse, let $K$ be uniformly integrable. We will construct the function $\Phi$ as
$$\Phi(x) = \int_0^x \phi(t)\, dt, \qquad x > 0,$$
where $\phi(t)$ is a non-negative, non-decreasing function with $\phi(t) \to \infty$. By the uniform integrability of $K$, for $\varepsilon = 1/2^n$ we find positive numbers $T_n$ (we can take the $T_n$ to be positive integers with $T_n < T_{n+1}$, if we like) so that
$$\sup_{\alpha\in I} \int_{\{|X_\alpha| > T_n\}} |X_\alpha|\, dP < \frac{1}{2^n}, \qquad n = 1, 2, \cdots.$$

Now use the standard trick of expectations: for any integer $T \ge 1$ and any (a.s. finite) random variable $X$,
$$\begin{aligned}
\sum_{k=T}^\infty P(|X| > k) &= \sum_{k=T}^\infty \sum_{j=k}^\infty P(j < |X| \le j+1) = \sum_{j=T}^\infty (j - T + 1)\, P(j < |X| \le j+1) \\
&\le \sum_{j=T}^\infty j\, P(j < |X| \le j+1) \le \sum_{j=T}^\infty \int_{\{j < |X| \le j+1\}} |X|\, dP = \int_{\{|X| > T\}} |X|\, dP.
\end{aligned}$$
Hence we see that for any $\alpha \in I$,
$$\frac{1}{2^n} \ge \int_{\{|X_\alpha| > T_n\}} |X_\alpha|\, dP \ge \sum_{k=T_n}^\infty P(|X_\alpha| > k).$$
Summing this inequality over $n$ gives that
$$1 \ge \sum_{n=1}^\infty \sum_{k=T_n}^\infty P(|X_\alpha| > k) = \sum_{k=1}^\infty \sum_{n=1}^\infty \chi_{\{T_n \le k\}}(n)\, P(|X_\alpha| > k) = \sum_{k=1}^\infty P(|X_\alpha| > k)\, \phi_k,$$

where $\phi_k$ is the number of values of $n$ for which $T_n \le k$. Since $T_n < T_{n+1}$, we see that $\phi_k$ is a non-negative integer with $\phi_k \le \phi_{k+1}$, and $\phi_k \to \infty$ as $k \to \infty$ (also $\phi_0 = 0$, since the $T_n$ are positive integers). Define the function $\phi(t)$ over $[0,\infty)$ by $\phi(t) = \phi_k$ if $t \in [k, k+1)$, $k = 0, 1, 2, \cdots$. Thus $\phi(t) \nearrow \infty$ as $t \to \infty$. Now define
$$\Phi(x) = \int_0^{|x|} \phi(t)\, dt, \qquad x \in \mathbb{R}.$$

It is clearly a non-negative, even function with $\Phi(0) = 0$. To show that $\Phi$ is convex, just note that $\Phi'(x) = \phi(x)$ for $x > 0$, which is non-decreasing over $[0,\infty)$. To show that $\Phi$ obeys condition (2), note that for $x > 0$,
$$\frac{\Phi(x)}{x} = \frac{1}{x} \int_0^x \phi(t)\, dt \ge \frac{1}{x} \int_{x/2}^x \phi(t)\, dt \ge \frac{\phi(x/2)}{x} \int_{x/2}^x 1\, dt = \frac{\phi(x/2)}{2} \to \infty.$$

Finally, to show condition (3), use the same change-of-variable trick for expectations. Since $\Phi(0) = 0$ and $\Phi(|X_\alpha|) \le \Phi(k+1) = \sum_{j=0}^k \phi_j$ on the event $\{k < |X_\alpha| \le k+1\}$,
$$\begin{aligned}
E\Phi(|X_\alpha|) &= \sum_{k=0}^\infty E\left( \Phi(|X_\alpha|)\, \chi_{\{k < |X_\alpha| \le k+1\}} \right) \le \sum_{k=0}^\infty \sum_{j=0}^{k} \phi_j\, P(k < |X_\alpha| \le k+1) \\
&= \sum_{j=0}^\infty \phi_j \sum_{k \ge j} P(k < |X_\alpha| \le k+1) = \sum_{j=0}^\infty \phi_j\, P(|X_\alpha| > j) \le 1.
\end{aligned}$$
Since the bound does not depend on $\alpha$, condition (3) follows. ♠

Exercise - 15.1.13 - If $\{X_n\}$ is uniformly integrable then prove that $\{\overline{X}_n\}$, where $\overline{X}_n = \frac{1}{n}\sum_{k=1}^n X_k$, is also uniformly integrable. (Averages preserve uniform integrability of random variables.)

Exercise - 15.1.14 - (This assumes the WLLN) If the $X_n$ are independent and identically distributed random variables with $E|X_1| < \infty$, then prove that $\overline{X}_n \xrightarrow{L_1} E(X_1)$.

Exercise - 15.1.15 - Extend the above two exercises as follows. Let $A = (a_{nk})$ be a non-negative summability method whose rows add up to 1. If $\{X_n\}$ is uniformly integrable then prove that $\{(AX)_n, n = 1, 2, \cdots\}$ is also uniformly integrable.

Exercise - 15.1.16 - (This assumes the WLLN for summability methods) Let the $X_n$ be independent and identically distributed random variables with $E|X_1| < \infty$ and let $A$ be a non-negative summability method whose rows add up to one. If $(AX)_n$ obeys the WLLN, then prove that $(AX)_n \xrightarrow{L_1} E(X_1)$.


Lecture 16

Laws of Large Numbers

Suppose $X_n$, $n = 1, 2, \cdots$ is a sequence of random variables defined over a probability space $(S, \mathcal{E}, P)$. Consider the new sequence of partial sums
$$S_n := X_1 + X_2 + \cdots + X_n, \qquad n = 1, 2, \cdots.$$
Here are some of the typical questions probability theory deals with:

• What is the limiting behavior of $S_n$ as $n$ gets large? Does the series $\sum_n X_n$ itself make sense? For instance, if $|X_n| \le n^{-1-\epsilon}$ for sure, for some $\epsilon > 0$, then the series converges absolutely and the random series is a genuine random variable. This restriction is, however, very strong and we need to explore what we can do without it. When the $X_n$ are independent random variables, it turns out that we can give a precise answer about the convergence of the random series. This is the celebrated "three series theorem" of Kolmogorov.

• The distributional properties of $S_n$ when the $X_i$ are iid are explored under the title of "random walk". A further specialization of the random walk occurs when the $X_i$ are non-negative random variables; this goes under the title of "renewal theory".

• Estimates of the probabilities $P(|S_n| > n\epsilon)$ are called "large deviations" results.

• If we use normalizing constants and consider
$$Z_n := \frac{S_n - a_n}{b_n}, \qquad n = 1, 2, \cdots,$$
then the asymptotic behavior of $Z_n$ opens the door to a rich history. For convergence in distribution these topics come under the "central limit theorem".

• Topics such as the weak and strong "laws of large numbers" consider the convergence aspects of $\frac{S_n}{n}$.

• Further refinements of the rate of convergence in the law of large numbers, i.e., $\frac{S_n}{a_n}$ with $a_n = o(n)$, are considered by the "law of iterated logarithms".


Here our aim is to focus on the laws of large numbers. The following tools/techniques show up again and again.

1. The subsequence method.

2. The truncation method along with Kolmogorov-type inequalities.

3. Usage of Kronecker's lemma.

4. And the relatively newer approach of martingales.

We will postpone the topic of martingales to be picked up later. The rest are explained one by one.

16.1 Subsequences & Kolmogorov Inequality

Chebyshev invented his famous inequality in order to prove the weak law of large numbers, which goes as follows. If $X_k$, $k \ge 1$, are pairwise uncorrelated random variables with $E(X_n^2) < M$ for all $n$, then
$$Z_n := \frac{1}{n} \sum_{k=1}^n (X_k - E(X_k)) = \frac{S_n - E(S_n)}{n} \xrightarrow{L_2} 0,$$
and hence automatically in probability as well. This result of Chebyshev was extended by Rajchman, who showed by considering subsequences that the convergence takes place in the almost sure sense. Note that
$$E(Z_n) = 0, \qquad Var(Z_n) = \frac{1}{n^2} \sum_{k=1}^n \sigma_k^2, \quad \text{where } \sigma_k^2 = Var(X_k).$$

Theorem - 16.1.1 - (Subsequence method of Rajchman) Let $X_k$, $k \ge 1$, be pairwise uncorrelated random variables with $E(X_n^2) < M$ for all $n$. Then
$$Z_n := \frac{1}{n} \sum_{k=1}^n (X_k - E(X_k)) = \frac{S_n - E(S_n)}{n} \xrightarrow{a.s.} 0.$$

Proof: Without loss of any generality, we may assume that $E(X_k) = 0$ for all $k$. By the Chebyshev inequality we have
$$\sum_{i=1}^\infty P(|Z_{n(i)}| > \epsilon) \le \frac{1}{\epsilon^2} \sum_{i=1}^\infty \frac{1}{n(i)^2} \sum_{k=1}^{n(i)} \sigma_k^2 \le \frac{1}{\epsilon^2} \sum_{i=1}^\infty \frac{M}{n(i)}.$$
If we pick $n(i)$ growing fast enough so that the last series converges, then, by the first Borel-Cantelli lemma, we have $|Z_{n(i)}| \xrightarrow{a.s.} 0$ as $i \to \infty$. For instance, take $n(i) = i^2$.

Take $g(m) = ([\sqrt{m}])^2$. That is, $g(m)$, for $m = 1, 2, 3, 4, \cdots$, looks like
$$1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9, 9, 9, 16, \cdots.$$
Since $Z_{n(i)}$ converges to zero almost surely, we have
$$Z_1, Z_4, Z_9, Z_{16}, \cdots \xrightarrow{a.s.} 0.$$
If we repeat a few terms of the convergent sequence, we should still get a convergent sequence. That is,
$$Z_1, Z_1, Z_1, Z_4, Z_4, Z_4, Z_4, Z_4, Z_9, Z_9, Z_9, Z_9, Z_9, Z_9, Z_9, \cdots \xrightarrow{a.s.} 0.$$

The triangle inequality gives that
$$|Z_m| \le |Z_m - Z_{g(m)}| + |Z_{g(m)}|.$$
So, all we need to show is that the first term on the right side also goes to zero almost surely. For this we again use the Chebyshev inequality and the first Borel-Cantelli lemma. Now,
$$\begin{aligned}
E(Z_m - Z_{g(m)})^2 &= Var(Z_m - Z_{g(m)}) = Var(Z_m) + Var(Z_{g(m)}) - 2\,Cov(Z_m, Z_{g(m)}) \\
&= \frac{1}{m^2} \sum_{k=1}^m Var(X_k) + \frac{1}{g(m)^2} \sum_{k=1}^{g(m)} Var(X_k) - \frac{2}{m\,g(m)} \sum_{k=1}^m \sum_{i=1}^{g(m)} Cov(X_k, X_i).
\end{aligned}$$
Since $Cov(X_k, X_i) = 0$ whenever $k \ne i$, and $g(m) \le m$ for all $m$, we have
$$E(Z_m - Z_{g(m)})^2 = \frac{1}{m^2} \sum_{k=1}^m \sigma_k^2 + \frac{1}{g(m)^2} \sum_{k=1}^{g(m)} \sigma_k^2 - \frac{2}{m\,g(m)} \sum_{i=1}^{g(m)} \sigma_i^2.$$
Therefore, we have
$$\begin{aligned}
E(Z_m - Z_{g(m)})^2 &= \frac{1}{m^2} \left( \sum_{k=1}^{g(m)} + \sum_{k=g(m)+1}^{m} \right) \sigma_k^2 + \frac{1}{g(m)^2} \sum_{k=1}^{g(m)} \sigma_k^2 - \frac{2}{m\,g(m)} \sum_{i=1}^{g(m)} \sigma_i^2 \\
&= \frac{1}{m^2} \sum_{k=g(m)+1}^{m} \sigma_k^2 + \sum_{k=1}^{g(m)} \left( \frac{\sigma_k}{m} - \frac{\sigma_k}{g(m)} \right)^2 \\
&\le M\, \frac{m - g(m)}{m^2} + M\, g(m) \left( \frac{1}{m} - \frac{1}{g(m)} \right)^2 \\
&= M\, \frac{m - g(m)}{m^2} + M\, \frac{(m - g(m))^2}{m^2\, g(m)}.
\end{aligned}$$
Now note the fact that
$$m - 2\sqrt{m} \le (\sqrt{m} - 1)^2 \le ([\sqrt{m}])^2 = g(m) \le m.$$


Hence we have $0 \le m - g(m) \le 2\sqrt{m}$. By the Chebyshev inequality we get
$$\sum_{m=1}^\infty P(|Z_m - Z_{g(m)}| > \epsilon) \le \sum_{m=1}^\infty \frac{E(Z_m - Z_{g(m)})^2}{\epsilon^2} \le \frac{M}{\epsilon^2} \sum_{m=1}^\infty \left( \frac{2\sqrt{m}}{m^2} + \frac{4m}{m^2\, g(m)} \right) < \infty.$$
Therefore, by the first Borel-Cantelli lemma, $|Z_m - Z_{g(m)}|$ goes to zero almost surely as $m$ goes to infinity. ♠

Remark - 16.1.1 - (Cantelli's SLLN) Let $X_n$ be independent random variables (only pairwise independence is needed, however) and let $E(X_n^4) < K$ for all $n$. Then
$$\frac{S_n - E(S_n)}{n} := \frac{(X_1 - E(X_1)) + (X_2 - E(X_2)) + \cdots + (X_n - E(X_n))}{n}$$
goes to zero almost surely. Of course this is just a simple consequence of the above SLLN of Rajchman and the CBS (Cauchy-Schwarz) inequality
$$E(X_n^2) \le (E(X_n^4))^{1/2} \le \sqrt{K}, \quad \text{for all } n.$$
Direct calculations are not hard either. Again, without loss of generality assume that $E(X_n) = 0$ for all $n$. Then just note that, under (mutual) independence, all cross terms containing an isolated factor $X_i$ vanish, so
$$E(S_n^4) = \sum_i \sum_j \sum_k \sum_\ell E(X_i X_j X_k X_\ell) = \sum_{i=1}^n E(X_i^4) + 3 \sum_{i \ne j} E(X_i^2 X_j^2) \le K n + 3 K n(n-1).$$
Therefore,
$$\sum_{n=1}^\infty P\left( \left| \frac{S_n}{n} \right| > \epsilon \right) \le \sum_{n=1}^\infty \frac{Kn + 3Kn(n-1)}{\epsilon^4 n^4} < \infty.$$
This finishes the result.

Exercise - 16.1.1 - Construct a sequence of random variables so that $E(X_n^2) \to 0$ but the almost sure convergence of $\frac{1}{n}\sum_{k=1}^n X_k$ fails.

HW41 Exercise - 16.1.2 - Let $A = [a_{nk}]$ be a regular summability method. If $X_n$ is any sequence of random variables such that $E(X_n^2) \to 0$, then prove the following results.

• (i) $\sum_{k=1}^\infty E|X_k|\, a_{nk} \to 0$.

• (ii) $(AX)_n := \sum_{k=1}^\infty X_k\, a_{nk}$ exists almost surely for each $n$.

• (iii) $(AX)_n \xrightarrow{L_1} 0$.

• (iv) $(AX)_n \xrightarrow{L_2} 0$.

(Both (iii) and (iv) imply $(AX)_n \xrightarrow{prob} 0$.) Construct an example of independent random variables for which $E(X_n^2) \not\to 0$, $X_n \xrightarrow{prob} 0$, and $\frac{1}{n}\sum_{k=1}^n X_k$ fails to converge to zero in probability. So, automatically, $\frac{1}{n}\sum_{k=1}^n X_k$ will fail to go to zero in $L_2$.

In contrast to the last exercise, if the random variables are assumed to be pairwise uncorrelated, the variances can be allowed to be unbounded, as the following exercise shows.

Exercise - 16.1.3 - (L2-version of Rajchman's SLLN) Let $X_1, X_2, \cdots$ be pairwise uncorrelated random variables so that $Var(X_k) = o(k)$. Let $A = [a_{nk}]$ be a row-finite regular summability method for which $\sup_n \sum_k k\, |a_{nk}|^2 < \infty$. Prove that
$$\sum_k (X_k - E(X_k))\, a_{nk} \xrightarrow{L_2} 0.$$
In particular, this result holds for the Cesaro method.

Remark - 16.1.2 - The above results point out two counterbalancing types of assumptions:

• (i) Some sort of lack of "dependence" among the terms $X_k$.

• (ii) Some kind of control on the moments of the $X_k$.

Their judicious mix gives rise to various forms of the weak and strong laws of large numbers.

Remark - 16.1.3 - (Komlos' remarkable result) Existence of the variance is a strong (very restrictive) condition. Consider the above exercises along with a result of Revesz,1 which says that if $\sup_n E(X_n^2) < \infty$ then there is a subsequence along which the strong law of large numbers holds. (In fact, he proved a little more than that, concerning the convergence of a random series. Note that if the random variables were uncorrelated then Rajchman's result applies in Revesz's situation.) Later Komlos2 proved that for any sequence of random variables, if $\sup_n E|X_n| < \infty$ then there always exists a subsequence $X_{n_j}$ and a random variable $Y$ having finite mean such that every subsequence of $X_{n_j}$ has its Cesaro means converge to $Y$ almost surely.

1 P. Revesz (1965), "On a Problem of Steinhaus", Acta Math. Acad. Sci. Hungar., vol. 16, pp. 310-318.

2 J. Komlos (1967), "A Generalization of a Problem of Steinhaus", Acta Math. Acad. Sci. Hungar., vol. 18, pp. 217-229.


HW42 Exercise - 16.1.4 - (Improvement of Rajchman's result) Let $\{X_n\}$ be pairwise uncorrelated random variables and let $E(X_n^2) < K$ for all $n$. Then prove that
$$\frac{S_n - E(S_n)}{n^\alpha} \xrightarrow{a.s.} 0, \quad \text{for any } \alpha > \frac{3}{4}.$$

Exercise - 16.1.5 - (A variant of Rajchman's result) Let $\{X_n\}$ be pairwise uncorrelated random variables and let $Var(X_n) = O(n^\theta)$ for some $\theta \in [0, \frac{1}{2}]$. Then find the various combinations of $\alpha > 0$ and $\theta$ so that
$$\frac{S_n - E(S_n)}{n^\alpha} \xrightarrow{a.s.} 0.$$

Remark - 16.1.4 - (How Kolmogorov's inequality comes into the picture) Now we explain the main reason why Kolmogorov's inequality (and its generalization, the Hajek-Renyi inequality) is so useful from the point of view of proving the strong law of large numbers. Recall that a sequence of random variables $W_n$ converges almost surely to a random variable $W$ if and only if
$$P\left( \sup_{m\ge n} |W_m - W| \ge \epsilon \right) \to 0 \quad \text{as } n \to \infty,$$
for any $\epsilon > 0$. That is, if $A_m := \{|W_m - W| \ge \epsilon\}$, then we need to show that
$$P\left( \cup_{m\ge n} \{|W_m - W| \ge \epsilon\} \right) \to 0 \quad \text{as } n \to \infty.$$
If the crude bound is finite, that is, if
$$P\left( \cup_{m\ge n} \{|W_m - W| \ge \epsilon\} \right) \le \sum_{m=n}^\infty P(A_m) < \infty,$$
then the desired convergence takes place without any independence assumption and without the help of any other inequality. If the crude bound is infinite, we could still lump some of the events $A_m$ together judiciously and apply the crude bound to these lumps, hoping that the resulting bound is finite. That is, we pick $n = m_0 < m_1 < m_2 < \cdots$ and write
$$\bigcup_{m\ge n} A_m = \left( \bigcup_{k=m_0}^{m_1} A_k \right) \cup \left( \bigcup_{k=m_1+1}^{m_2} A_k \right) \cup \cdots =: B_0 \cup B_1 \cup \cdots.$$
Then again we have the bound $P(\cup_{m\ge n} \{|W_m - W| \ge \epsilon\}) \le \sum_{j=0}^\infty P(B_j)$. If this bound is finite for some such sequence $m_j$ and goes to zero as $n$ tends to infinity, then again we do not need any independence assumption, nor any other inequality, to get the almost sure convergence of $W_n$ to $W$. This is exactly what the subsequence method was all about while proving the SLLN for $W_n = S_n/n$. However, often finding the probability $P(B_j)$ is not easy and requires an upper bound. For proving the SLLN, it is exactly at this point that Kolmogorov's inequality comes into the picture (and hence forces us to assume the independence of the random variables). To see why, note that if $W_k = S_k/k$ where $E(X_i) = 0$, we may take $W = 0$. Then
$$B_j = \bigcup_{k=m_j+1}^{m_{j+1}} A_k = \left\{ \max_{m_j < k \le m_{j+1}} |W_k - W| \ge \epsilon \right\} = \left\{ \max_{m_j < k \le m_{j+1}} \frac{1}{k} |S_k| \ge \epsilon \right\}.$$
If we assume that the $X_k$ are independent and have finite variances, Kolmogorov's inequality (or its extension, the Hajek-Renyi inequality) gives a much better bound on $P(B_j)$. This then leads to Kolmogorov's criterion for the strong law of large numbers. This criterion pays the price of assuming independence of the random variables.

Theorem - 16.1.2 - (Hajek-Renyi inequality)3 Let $X_1, X_2, \cdots$ be a sequence of independent random variables with mean zero and finite variances. Let $c_n$ be a sequence of non-increasing positive numbers and let $S_n = X_1 + X_2 + \cdots + X_n$. Then, for any $\epsilon > 0$ and any positive integers $m$ and $n$ with $n > m$, we have
$$P\left( \max_{m\le k\le n} c_k |S_k| \ge \epsilon \right) \le \frac{1}{\epsilon^2} \left( c_m^2 \sum_{k=1}^m Var(X_k) + \sum_{k=m+1}^n c_k^2\, Var(X_k) \right).$$

Proof: For $\epsilon > 0$, define a sequence of indicator functions $\chi_k$,
$$\chi_k := \begin{cases} 1 & \text{if } c_m|S_m| < \epsilon, \cdots, c_{k-1}|S_{k-1}| < \epsilon,\ c_k|S_k| \ge \epsilon, \\ 0 & \text{otherwise}, \end{cases}$$
for $k = m, m+1, \cdots, n$. That is, $\chi_k$ is the indicator function which checks whether $c_j|S_j| \ge \epsilon$ happens for the first time at $j = k$. So we have
$$\left\{ \max_{m\le j\le n} c_j|S_j| \ge \epsilon \right\} = \{ c_j|S_j| \ge \epsilon \text{ for some } m \le j \le n \} = \{ \chi_m = 1 \text{ or } \cdots \text{ or } \chi_n = 1 \} = \{ \chi_m + \chi_{m+1} + \cdots + \chi_n \ge 1 \}.$$
Note that $\chi_i + \chi_j > 1$ cannot happen, since this would imply that both $\chi_i = 1$ and $\chi_j = 1$, which is contradictory. The same argument shows that
$$\{ \chi_m + \chi_{m+1} + \cdots + \chi_n \ge 1 \} = \{ \chi_m + \chi_{m+1} + \cdots + \chi_n = 1 \}.$$
So,
$$P\left( \max_{m\le j\le n} c_j|S_j| \ge \epsilon \right) = P\{ \chi_m + \chi_{m+1} + \cdots + \chi_n = 1 \} = E\{ \chi_m + \chi_{m+1} + \cdots + \chi_n \} = \sum_{k=m}^n E(\chi_k).$$

3 Hajek, J. and Renyi, A. (1955). Generalization of an inequality of Kolmogorov. Acta Math. Acad. Sci. Hung., vol. 6, pp. 281-283.


We will prove that
$$\sum_{k=m}^n E(\chi_k) \le \frac{E(Y)}{\epsilon^2}, \quad \text{where } Y = c_n^2 S_n^2 + \sum_{k=m}^{n-1} (c_k^2 - c_{k+1}^2) S_k^2. \tag{$*$}$$
It is easy to verify (see Exercise (16.1.6) below) that
$$E(Y) = c_m^2 \sum_{k=1}^m Var(X_k) + \sum_{k=m+1}^n c_k^2\, Var(X_k).$$
So we need only prove the inequality in ($*$). For this purpose, let $\{\chi_0 = 1\}$ be the event that none of the $\chi_k$ equals one, $k = m, m+1, \cdots, n$. So,
$$E(Y) = \sum_{k=m}^n \int_{\{\chi_k=1\}} Y\, dP + \int_{\{\chi_0=1\}} Y\, dP \ge \sum_{k=m}^n \int_{\{\chi_k=1\}} Y\, dP, \quad \text{(since } Y \ge 0\text{)}.$$
Now we will show that $E(\chi_k S_j^2) \ge E(\chi_k S_k^2)$ for any $j > k$. Indeed,
$$E(\chi_k S_j^2) = E(\chi_k (S_k + S_j - S_k)^2) = E(\chi_k S_k^2) + E(\chi_k (S_j - S_k)^2) + 2E(\chi_k S_k (S_j - S_k)).$$
Note that the random variable $\chi_k S_k$ depends only on $X_1, \cdots, X_k$, whereas the random variable $S_j - S_k$ depends only on $X_{k+1}, \cdots, X_j$. By independence we then get that $E(\chi_k S_k (S_j - S_k)) = E(\chi_k S_k)\, E(S_j - S_k) = 0$. Therefore, for $k < j$, we have $E(\chi_k S_k^2) \le E(\chi_k S_j^2)$. Finally, for $m \le k \le n$,

$$\begin{aligned}
E(Y) &\ge \sum_{k=m}^n \int_{\{\chi_k=1\}} Y\, dP \\
&= \sum_{k=m}^n \left( c_n^2 \int_{\{\chi_k=1\}} S_n^2\, dP + \sum_{j=m}^{n-1} (c_j^2 - c_{j+1}^2) \int_{\{\chi_k=1\}} S_j^2\, dP \right) \\
&\ge \sum_{k=m}^n \left( c_n^2 \int_{\{\chi_k=1\}} S_n^2\, dP + \sum_{j=k}^{n-1} (c_j^2 - c_{j+1}^2) \int_{\{\chi_k=1\}} S_j^2\, dP \right) \\
&\ge \sum_{k=m}^n \left( c_n^2 \int_{\{\chi_k=1\}} S_k^2\, dP + \sum_{j=k}^{n-1} (c_j^2 - c_{j+1}^2) \int_{\{\chi_k=1\}} S_k^2\, dP \right) \\
&\ge \sum_{k=m}^n \left( c_n^2\, \frac{\epsilon^2}{c_k^2} \int_{\{\chi_k=1\}} 1\, dP + \frac{\epsilon^2}{c_k^2} \sum_{j=k}^{n-1} (c_j^2 - c_{j+1}^2) \int_{\{\chi_k=1\}} 1\, dP \right) \\
&= \sum_{k=m}^n \left( \frac{c_n^2\, \epsilon^2}{c_k^2}\, P(\chi_k = 1) + \frac{\epsilon^2\, P(\chi_k = 1)}{c_k^2} \sum_{j=k}^{n-1} (c_j^2 - c_{j+1}^2) \right) \\
&= \epsilon^2 \sum_{k=m}^n E(\chi_k) \left( \frac{c_n^2}{c_k^2} + \frac{1}{c_k^2} \sum_{j=k}^{n-1} (c_j^2 - c_{j+1}^2) \right) \\
&= \epsilon^2 \sum_{k=m}^n E(\chi_k) \left( \frac{c_n^2}{c_k^2} + \frac{c_k^2 - c_n^2}{c_k^2} \right) = \epsilon^2 \sum_{k=m}^n E(\chi_k) = \epsilon^2\, P\left( \max_{m\le j\le n} c_j |S_j| \ge \epsilon \right).
\end{aligned}$$

This finishes the proof. ♠

HW43 Exercise - 16.1.6 - Show that for the $Y$ defined in ($*$) above,
$$E(Y) = c_m^2 \sum_{k=1}^m Var(X_k) + \sum_{k=m+1}^n c_k^2\, Var(X_k).$$

Exercise - 16.1.7 - (Kolmogorov inequality) Let $X_1, X_2, \cdots$ be a sequence of independent random variables with mean zero and finite variances. Let $S_n = X_1 + X_2 + \cdots + X_n$. Then, for any $\epsilon > 0$, explain why
$$P\left( \max_{1\le j\le n} |S_j| \ge \epsilon \right) \le \frac{1}{\epsilon^2} \sum_{k=1}^n Var(X_k).$$
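A Monte Carlo sanity check of this inequality is easy to code; the sketch below (added for illustration, with arbitrary choices of $n$, $\epsilon$ and the distribution of the $X_k$) estimates the left side for centered $\pm 1$ steps and compares it with the right-side bound $\frac{1}{\epsilon^2}\sum_k Var(X_k) = n/\epsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps, reps = 100, 25.0, 20_000

# X_k = +/-1 with equal probability, so Var(X_k) = 1 and the bound is n / eps^2.
steps = rng.choice([-1.0, 1.0], size=(reps, n))
S = np.cumsum(steps, axis=1)
lhs = np.mean(np.max(np.abs(S), axis=1) >= eps)   # estimated P(max_j |S_j| >= eps)
rhs = n / eps**2

print(f"estimated P(max|S_j| >= {eps}) = {lhs:.4f}")
print(f"Kolmogorov bound              = {rhs:.4f}")
```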

Exercise - 16.1.8 - Let $S_n = X_1 + X_2 + \cdots + X_n$, where $X_1, X_2, \cdots$ are iid with $E(X_1) = 0$ and $Var(X_1) = 1$. Prove that
$$P\left( \max_{1\le j\le n} S_j \ge x \right) \le 2\, P\left( S_n \ge x - (2n)^{1/2} \right).$$
[Hint: consider $A_k = \{S_1 < x, S_2 < x, \cdots, S_{k-1} < x, S_k \ge x\}$, $k = 1, 2, \cdots, n$.]

Theorem - 16.1.3 - (Kolmogorov criterion (1930)) Let $X_k$ be mutually independent random variables with finite variances $\sigma_k^2$. If $\sum_k \sigma_k^2/k^2 < \infty$ then
$$\frac{1}{n} \sum_{k=1}^n (X_k - E(X_k)) \xrightarrow{a.s.} 0.$$

Proof: Without loss of generality, take $E(X_k) = 0$ for all $k = 1, 2, \cdots$ and let $S_n = X_1 + X_2 + \cdots + X_n$. Take $c_k = 1/k$ in the Hajek-Renyi inequality to get
$$P\left( \max_{m\le j\le n} \frac{1}{j}|S_j| \ge \epsilon \right) \le \frac{1}{\epsilon^2} \left( \frac{1}{m^2} \sum_{j=1}^m Var(X_j) + \sum_{j=m+1}^n \frac{Var(X_j)}{j^2} \right).$$
Letting $n$ go to infinity, the continuity property of probability gives that
$$P\left( \sup_{j\ge m} \frac{1}{j}|S_j| \ge \epsilon \right) \le \frac{1}{\epsilon^2} \left( \frac{1}{m^2} \sum_{j=1}^m Var(X_j) + \sum_{j=m+1}^\infty \frac{Var(X_j)}{j^2} \right).$$
We need only show that the right hand side goes to zero. For this, just note that, for any fixed $N \le m$,
$$\frac{1}{m^2} \sum_{j=1}^m Var(X_j) \le \frac{1}{m^2} \sum_{j=1}^N Var(X_j) + \sum_{j=N+1}^m \frac{Var(X_j)}{j^2}.$$
Letting $m$ and then $N$ go to infinity ends the proof. ♠
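To see the criterion in action, the following sketch (an added illustration with arbitrary parameters, not from the original notes) uses independent $X_k \sim N(0, \sigma_k^2)$ with $\sigma_k^2 = \sqrt{k}$, so that the variances are unbounded yet $\sum_k \sigma_k^2/k^2 = \sum_k k^{-3/2} < \infty$; the centered averages still drift to zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
k = np.arange(1, n + 1)
sigma = k ** 0.25                    # Var(X_k) = sqrt(k), so sum Var(X_k)/k^2 < infinity

X = rng.standard_normal(n) * sigma   # independent, mean zero
averages = np.cumsum(X) / k

for m in (1_000, 10_000, 100_000, n):
    print(f"n={m:>7d}  (1/n) * S_n = {averages[m-1]:+.4f}")
```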

Exercise - 16.1.9 - Does convergence in $L_2$ take place in the above Kolmogorov criterion (Theorem (16.1.3))? Justify. If yes, what if we only had pairwise independence; would convergence in the $L_2$ sense still hold?

Exercise - 16.1.10 - Let $X_1, X_2, \cdots$ be a sequence of independent random variables with finite variances $\sigma_k^2$, $k = 1, 2, \cdots$. Let $a_1, a_2, \cdots$ be any sequence of real numbers.

• (i) If $\sum_{k=1}^\infty \frac{a_k^2 \sigma_k^2}{k^2} < \infty$, then show that $\frac{1}{n}\sum_{k=1}^n a_k(X_k - E(X_k)) \xrightarrow{a.s.} 0$.

• (ii) In particular, if $\sum_{k=1}^\infty a_k^2/k^2 < \infty$ then prove that $\frac{1}{n}\sum_{k=1}^n a_k(U_k - p) \xrightarrow{a.s.} 0$, where $U_1, U_2, \cdots \overset{iid}{\sim} B(1, p)$.


Lecture 17

WLLN, SLLN & Uniform SLLN

The truncation method was invented to get rid of the need to assume the finiteness of the second moments of the random variables, as demanded by the Kolmogorov criterion (Theorem (16.1.3)).

Remark - 17.0.5 - (The truncation method) So, let $X_k$ be identically distributed random variables with finite first moment. There are many forms of truncation; however, the following one will suffice for us:
$$Y_k := \begin{cases} X_k & \text{if } |X_k| \le k, \\ 0 & \text{if } |X_k| > k. \end{cases}$$
Note that
$$\sum_{k=1}^\infty P(X_k \ne Y_k) = \sum_{k=1}^\infty P(|X_k| > k) = \sum_{k=1}^\infty P(|X_1| > k) < \infty,$$
since $E|X_1| < \infty$. Hence, by using the first Borel-Cantelli lemma we see that $P(\limsup_k \{X_k \ne Y_k\}) = 0$. That is, $X_k(\omega) \ne Y_k(\omega)$ for infinitely many $k$ only over a set of probability zero. So, almost surely, $X_k = Y_k$ for all but finitely many values of $k$. This gives that
$$\sum_{k=1}^n (X_k - Y_k) \text{ converges a.s.} \quad \text{and} \quad \frac{1}{n}\sum_{k=1}^n (X_k - Y_k) \xrightarrow{a.s.} 0.$$
So, writing
$$\frac{1}{n}\sum_{k=1}^n X_k = \frac{1}{n}\sum_{k=1}^n (X_k - Y_k) + \frac{1}{n}\sum_{k=1}^n Y_k$$
shows that the limiting behavior of the last term on the right side is the same as the limiting behavior of the term on the left side.
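The effect of this truncation is easy to observe numerically. The sketch below (an added illustration; here the $X_k$ are taken iid Student-$t$ with 2 degrees of freedom, which have mean 0 and finite first moment but infinite variance) compares the running averages of $X_k$ and of the truncated $Y_k = X_k\, \chi_{\{|X_k|\le k\}}$: the two sequences differ in only finitely many terms and their averages track each other.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
k = np.arange(1, n + 1)

X = rng.standard_t(df=2, size=n)        # E|X_1| < infinity, but Var(X_1) = infinity
Y = np.where(np.abs(X) <= k, X, 0.0)    # truncated sequence Y_k = X_k * 1{|X_k| <= k}

print("number of indices where X_k != Y_k:", int(np.sum(X != Y)))
avg_X = np.cumsum(X) / k
avg_Y = np.cumsum(Y) / k
for m in (1_000, 10_000, n):
    print(f"n={m:>6d}  avg X = {avg_X[m-1]:+.4f}   avg Y = {avg_Y[m-1]:+.4f}")
```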


To get rid of the assumption on the second moments of the random variables (in the above strong law of large numbers) we often use the truncation approach. However, we still pay a price by assuming that the random variables are (at least pairwise) independent. This section pulls together the main benefits that the truncation method brings out.

Definition - 17.0.1 - (Equivalence of random sequences) Two sequences of random variables $\{X_k, k \ge 1\}$ and $\{Y_k, k \ge 1\}$ are said to be equivalent if
$$\sum_k P(X_k \ne Y_k) < \infty.$$
(This implies that $P(\limsup_k \{X_k \ne Y_k\}) = P(X_k \ne Y_k \text{ i.o.}) = 0$.)

So, the above truncation method shows that if the $X_k$ are iid with $E|X_1| < \infty$, then our truncated sequence is equivalent to $\{X_k\}$. The following proposition pulls out the main benefit of the truncated sequence.

Proposition - 17.0.1 - If two sequences of random variables $(X_k)$ and $(Y_k)$ are equivalent, then

• (i) $\sum_{k=1}^n (X_k - Y_k)$ converges almost surely as $n$ gets large.

• (ii) If $a_k \nearrow \infty$, then $\frac{1}{a_n}\sum_{k=1}^n (X_k - Y_k) \xrightarrow{a.s.} 0$ as $n$ gets large.

Proof: Since $P(X_k \ne Y_k \text{ i.o.}) = 0$, for any $\omega \in B^c$ we have $X_k(\omega) = Y_k(\omega)$ for all but finitely many $k$, where $B := \{\omega : X_k(\omega) \ne Y_k(\omega) \text{ i.o.}\}$. Therefore, for all except finitely many values of $k$ we have $X_k(\omega) - Y_k(\omega) = 0$, for each $\omega \in B^c$. Hence,
$$\sum_{k=1}^\infty (X_k(\omega) - Y_k(\omega)) \text{ converges (it has only finitely many non-zero terms) for each } \omega \in B^c.$$
Since $P(B^c) = 1$, part (i) follows. Part (ii) is then a trivial consequence of part (i). ♠

HW44 Exercise - 17.0.11 - If two sequences of random variables $\{X_k, k \ge 1\}$ and $\{Y_k, k \ge 1\}$ are equivalent, then show that almost surely the limiting behavior of
$$\frac{1}{a_n}\sum_{k=1}^n X_k \quad \text{is the same as that of} \quad \frac{1}{a_n}\sum_{k=1}^n Y_k$$
for any sequence $(a_k)$ of positive numbers going to infinity. Furthermore, if for some random variable $X$,
$$\frac{1}{a_n}\sum_{k=1}^n X_k \xrightarrow{prob} X, \quad \text{then show that} \quad \frac{1}{a_n}\sum_{k=1}^n Y_k \xrightarrow{prob} X.$$

Let $p_k$, $k \ge 1$, be any sequence of numbers, and let $\{X_k, k \ge 1\}$ and $\{Y_k, k \ge 1\}$ be equivalent. Then the almost sure convergence of $\sum_k p_k X_k$ is equivalent to that of $\sum_k p_k Y_k$.

HW45 Exercise - 17.0.12 - Prove the following results.

• (i) Let $(X_k)$ be a sequence of random variables such that $\sum_k P(|X_k| > k) < \infty$. If
$$Y_k := \begin{cases} X_k & \text{if } |X_k| \le k, \\ 0 & \text{if } |X_k| > k, \end{cases}$$
then prove that $(Y_k)$ is equivalent to $(X_k)$.

• (ii) Let $(X_k)$ be an identically distributed sequence of random variables with $E|X_1| < \infty$. Let $Y_k$ be defined as in part (i). Prove that $(Y_k)$ is equivalent to $(X_k)$. [Hint: $E|X| < \infty$ is related to $\sum_k P(|X| > k) < \infty$.]

UC Exercise - 17.0.13 - Let $(X_k)$ be any sequence of random variables such that $\sum_k P(|X_k| > g(k)) < \infty$ for some sequence of numbers $g(k)$. If
$$Y_k := \begin{cases} X_k & \text{if } |X_k| \le g(k), \\ 0 & \text{if } |X_k| > g(k), \end{cases}$$
then prove that $(Y_k)$ is equivalent to $(X_k)$.

Now we present an application of the truncation method. We will prove the WLLN by assuming only pairwise independence. When we have mutual independence, the characteristic function approach gives the WLLN rather easily.

Theorem - 17.0.4 - (Khintchin's WLLN — pairwise independent case) Let $(X_k)$ be identically distributed random variables which are pairwise independent, and let $X_1$ have finite mean $\mu$. Then
$$\frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{prob} E(X_1).$$

Proof: By using the truncation scheme of Remark (17.0.5), we need only show that $\overline{Y}_n \to \mu$ in probability. As in the SLLN of Kolmogorov, if we show that
$$\frac{1}{n}\sum_{k=1}^n (Y_k - E(Y_k)) \xrightarrow{prob} 0 \quad \text{as } n \to \infty,$$
then we will be done (note that $E(Y_k) = E(X_1 \chi_{\{|X_1|\le k\}}) \to \mu$, so the Cesaro means of the $E(Y_k)$ also converge to $\mu$). We will use the Chebyshev inequality and some clever tricks. First, Chebyshev's inequality gives that
$$P(|\overline{Y}_n - E(\overline{Y}_n)| > \epsilon) \le \frac{\sum_{i=1}^n Var(Y_i)}{n^2\epsilon^2} \le \frac{\sum_{i=1}^n E(Y_i^2)}{n^2\epsilon^2}.$$
At this stage, if we use the crude estimate of $E(Y_i^2)$ given by
$$E(Y_i^2) = E\left(X_i^2 \chi_{\{|X_i|\le i\}}\right) = E\left(X_1^2 \chi_{\{|X_1|\le i\}}\right) \le i\, E|X_1|,$$

154 WLLN, SLLN & Uniform SLLN

then the upper bound

∑ni=1 E(Y 2

i )

n2ǫ2≤ E|X1|n(n + 1)

2n2ǫ2

does not go down to zero. The trick is to use a sequence an of positive integersgoing to infinity slower than n as shown below. So,

n∑

i=1

E(Y 2i ) =

n∑

i=1

E(X2

1χ|X1|≤i

)=

(an∑

i=1

+n∑

i=an+1

)

E(X2

1χ|X1|≤i

).

We use the crude estimate on the first part and a sharp estimate on the secondpart. That is,

an∑

i=1

E(X2

1χ|X1|≤i

)≤

an∑

i=1

iE|X1| =an(an + 1)

2E|X1|.

So, if an/n goes to zero (e.g., an = lnn) then this upper bound would go down tozero. For the second part, use the fact that

|X1| ≤ i = |X1| ≤ an ∪ an < |X1| ≤ i ,

where the above union is disjoint. So we have

n∑

i=an+1

E(X2

1χ|X1|≤i

)

=n∑

i=an+1

E(X2

1χ|X1|≤an

)+

n∑

i=an+1

E(X2

1χan<|X1|≤i

)

≤ an(n− an)E|X1|+n∑

i=an+1

iE(|X1|χan<|X1|≤i

)

≤ an(n− an)E|X1|+n∑

i=an+1

nE(|X1|χan<|X1|

)

= an(n− an)E|X1|+ n(n− an)E(|X1|χan<|X1|

)

≤ nanE|X1|+ n2E(|X1|χ|X1|>an

).

When we divide this by n2 and let n go to infinity, both of the terms drop to zero.This finishes the proof. ♠

If we assume that the sequence Xk is iid then we may hope to avoid makingthe assumption that E|X1| <∞. However, what should then be the limiting value?Well, we may not have any. But the following result shows that there is a sequenceµn which picks up the limiting behavior in probabilistic sense.

Page 83: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

WLLN, SLLN & Uniform SLLN 155

Theorem - 17.0.5 - (WLLN of Khintchin — iid case) Let Xk be an iidsequence so that

limn

nP (|X1| > n) = 0. (0.1)

Let µn = E(X1χ|X1|≤n

). Then

1

n

n∑

i=1

Xi − µnprob→ 0.

Instead of proving the result directly, we pull out the main steps into a separateproposition as follows.

Proposition - 17.0.2 - Consider the triangular array Xn,k, 1 ≤ k ≤ n, n = 1, 2, · · ·of random variables where for each fixed n, Xn,k, k = 1, 2, · · · , n are independent.Define a truncated triangular array

Yn,k =

Xn,k if |Xn,k| ≤ bn,0 if |Xn,k| > bn,

where bn is a sequence of positive numbers going to infinity so that

limn→∞

n∑

k=1

P (|Xn,k| > bn) = 0, and limn→∞

1

b2n

n∑

k=1

E(Y 2n,k) = 0.

If Sn = Xn,1 + Xn,2 + · · ·+ Xn,n then we have

Sn −∑n

k=1 E(Yn,k)

bn

prob→ 0.

Proof: Let ǫ > 0 be arbitrary and let Tn = Yn,1 + Yn,2 + · · · + Yn,n. Just notethat

P

(∣∣∣∣Sn −

∑nk=1 E(Yn,k)

bn

∣∣∣∣ > ǫ

)

≤ P (Sn 6= Tn) + P

(∣∣∣∣Tn −

∑nk=1 E(Yn,k)

bn

∣∣∣∣ > ǫ

)

≤ P (Xn,k 6= Yn,k, for some k ≤ n) + P

(∣∣∣∣Tn −

∑nk=1 E(Yn,k)

bn

∣∣∣∣ > ǫ

)

≤n∑

k=1

P (Xn,k 6= Yn,k) + P

(∣∣∣∣Tn −

∑nk=1 E(Yn,k)

bn

∣∣∣∣ > ǫ

)

=n∑

k=1

P (|Xn,k| > bn) + P

(∣∣∣∣Tn −

∑nk=1 E(Yn,k)

bn

∣∣∣∣ > ǫ

)

≤n∑

k=1

P (|Xn,k| > bn) +V ar(Tn)

ǫ2b2n

, (Chebyshev’s inequality)

≤n∑

k=1

P (|Xn,k| > bn) +1

ǫ2b2n

n∑

k=1

E(Y 2n,k).

156 WLLN, SLLN & Uniform SLLN

Both of these terms go to zero as given. ♠

Proof of Theorem (17.0.5). We will use the result of Proposition (17.0.2) withXn,k = Xk and take bn = n. Then note that

n∑

k=1

P (|Xn,k| > n) = nP (|X1| > n)→ 0.

To prove the second condition of Proposition (17.0.2), let Yn,k = Xn,kχ|Xn,k|≤n

.

Since Xn,k = Xk, we see that Yn,k = Xkχ|Xk|≤n

. Hence, we need to prove that

1

n2

n∑

k=1

E(X2

kχ|Xk|≤n

)=

1

n2

n∑

k=1

E(X2

1χ|X1|≤n

)=

1

nE(X2

1χ|X1|≤n

)

goes to zero as n gets large. Now note that

1

nE(X2

1χ|X1|≤n

)=

1

n

∫ ∞

0

2xP(|X1|χ|X1|≤n

> x)

dx

=2

n

∫ n

0

xP(|X1|χ|X1|≤n

> x)

dx

≤ 2

n

∫ n

0

xP (|X1| > x) dx.

It is given to us that the integrand goes to zero. The integral version of the Cesarotransfrom will converge to the same limit by regularity. ♠

Exercise - 17.0.14 - Give an example of a sequence of iid random variables forwhich condition (0.1) does not hold and Theorem (17.0.5) fails.

Remark - 17.0.6 - (Converse of WLLN of Khintchin — iid case) It turnsout that the conditions in Khintchin’s WLLN (Theorem (17.0.5)) happen to besufficient as well. Let X1,X2, · · · be iid random variables. The following statementsare equivalent.

• (i) 1n

∑nk=1 Xk

prob→ µ.

• (ii) limn nP (|X1| > n) = 0 and limn µn = limn E(X1χ|X1|≤n

)= µ.

• (iii) φ(t) := E(eitX1) is differentiable at t = 0 with φ′(0) = iµ.

For a proof see Laha and Rohatgi p. 320.One can have a sequence of iid random variables for which the WLLN holds

and the SLLN does not hold. For instance, P(X1 ≤ x) = 1− 1x ln x as x→∞.

Here is a second application of truncation method in conjunction with theKolmogorov inequality.

Page 84: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

WLLN, SLLN & Uniform SLLN 157

Theorem - 17.0.6 - (Kolmogorov’s SLLN) Let X1,X2, · · · be independent andidentically distributed random variables. Then,

Xn :=1

n

n∑

k=1

Xka.s.→ µ if and only if E|X1| <∞, E(X1) = µ.

Furthermore, if E|X1| =∞ then lim supn |Xn| a.s.= ∞.

Proof: First assume that E|X1| <∞ and let E(X1) = µ. We start off just as inKhintchin’s WLLN by defining

Yk :=

Xk if |Xk| ≤ k0 if |Xk| > k.

Just as in Khintchin’s WLLN, we see that we need only prove that Yna.s.→ µ

due to Remark (17.0.5). Since the first moment exists,

E(Yk) = E(Xkχ

|Xk|≤k

)= E

(X1χ|X1|≤k

)→ E(X1); as k →∞.

Therefore, by the regularity of the Cesaro summability method we have

1

n

n∑

k=1

E(Yk)→ E(X1); as n→∞.

So, by Kolmogorov criterion, Theorem (16.1.3), we need only prove that

∞∑

k=1

V ar(Yk)

k2<∞.

We do not need to be as sophisticated as we were in Khintchin’s weak law. Indeed,

E(Y 2k ) = E

(X2

1χ|X1|≤k

)

= E

X21

k∑

j=1

χj−1<|X1|≤j

=

k∑

j=1

E(X2

1χj−1<|X1|≤j

)

≤k∑

j=1

jE(|X1|χj−1<|X1|≤j

).

Therefore, we have

∞∑

k=1

V ar(Yk)

k2≤

∞∑

k=1

E(Y 2k )

k2

≤∞∑

k=1

1

k2

k∑

j=1

jE(|X1|χj−1<|X1|≤j

)

158 WLLN, SLLN & Uniform SLLN

=

∞∑

j=1

∞∑

k=j

j

k2E(|X1|χj−1<|X1|≤j

)

=

∞∑

j=1

jE(|X1|χj−1<|X1|≤j

) ∞∑

k=j

1

k2

≤∞∑

j=1

jE(|X1|χj−1<|X1|≤j

) C

j

= C∞∑

j=1

E(|X1|χj−1<|X1|≤j

)

= CE

∞∑

j=1

|X1|χj−1<|X1|≤j

= CE|X1| <∞.

This gives the first part.To prove the converse, use the fact that (

∑nk=1 ak)/n → µ implies that ak =

o(k), cf. Proposition (14.0.5). Hence, Sn/n converges to µ almost surely im-plies that Xn/n goes to zero almost surely. If E|X1| = ∞ then it must be that∑

n P (|X1| ≥ n) = ∞. But then it must be that∑

n P (|Xn| ≥ n) = ∞. Nowthe second Borel-Cantelli lemma gives that P (|Xn| ≥ ni.o.) = 1. This contra-dicts with the fact that Xn/n goes to zero almost surely. Hence, it must be thatE|X1| <∞. And the first part of the proof now gives that E(X1) must be µ.

To finish the proof, assume that E|X1| = ∞. Then, for any k > 0 we haveE(|X1|/k) = ∞. So,

∑j P (|X1| > kj) = ∞ for each k > 0. Since, Xj are

identically distributed we have

∞∑

j=1

P (|Xj | > jk) =∞; for each k > 0.

Since, the events in question are independent, the second Borel-Cantelli lemmagives that

P (|Xj | > jk i.o.) = 1; for each k > 0.

Now, notice the inequality

|Xj | =∣∣∣∣∣

j∑

i=1

Xi −j−1∑

i=1

Xi

∣∣∣∣∣ ≤∣∣∣∣∣

j∑

i=1

Xi

∣∣∣∣∣+

∣∣∣∣∣

j−1∑

i=1

Xi

∣∣∣∣∣ .

So, if |Xj | > jk then |∑ji=1 Xi| > jk/2 or |∑j−1

i=1 Xi| > jk/2. Hence, if for some

ω ∈ S, |Xj(ω)| > jk for infinitely many j then |∑ji=1 Xi(ω)| > jk/2 for infinity

many j. Thus, we have

P (|Xj | >k

2i.o.) = 1.

This gives that lim supj |Xj | ≥ k/2 almost surely. Since, k is arbitrarily large. weget the result. ♠

Page 85: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

WLLN, SLLN & Uniform SLLN 159

N. Etemadi1 gave a remarkably simple method to prove that when E(X1) =µ and Xi are only pairwise independent and identically distributed then Sn/nconverges to µ almost surely. The proof uses a mixup of the truncation argumentand the subsequence argument.

Theorem - 17.0.7 - (SLLN of Etemadi) Let X1,X2, · · · be pairwise independentand identically distributed random variables with E|X1| < ∞ and let E(X1) = µ.If Sn = X1 +X2 + · · ·+Xn is the partial sum then Sn

n converges to µ almost surely.

Proof: Once again we use the truncation argument with

Yk :=

Xk if |Xk| ≤ k0 if |Xk| > k.

As in the last proof, we need only show that the averages of Yk sequences convergealmost surely to µ. As before, E(Yk) → µ. And the previous argument verbatimgives that

∞∑

k=1

V ar(Yk)

k2≤

∞∑

k=1

E(Y 2k )

k2< ∞. (0.2)

The problem is that now we cannot invoke the Kolmogorov criteria to conclude theresult, since it requires mutual independence of the random sequence. So, Etemaditook the following detours. Since Xn = X+

n − X−n , if the result is proved for the

sequence X+n a similar argument will work for the other sequence. So, without

loss of generality assume that Xn ≥ 0 for each n. Now follow the line of attack of thesubsequence method. Take any a > 1 and consider the subsequence m(n) = [an],n = 1, 2, · · · . Over this subsequence we will prove that

1

m(n)

m(n)∑

k=1

(Yk − E(Yk))a.s.→ 0. (0.3)

Now, since E(Yk) → µ, the regularity of the Cesaro means and (0.3) will implythat

1

m(n)

m(n)∑

k=1

Yka.s.→ µ.

Using this then we fill the inbetween terms as follows. Note that the fact thatXk ≥ 0, implies that Yk ≥ 0 and therefore, for any m(n) ≤ j < m(n + 1), we have

1

m(n + 1)

m(n)∑

k=1

Yk ≤1

j

j∑

k=1

Yk ≤1

m(n)

m(n+1)∑

k=1

Yk.

Now sincem(n + 1)

m(n)=

[an+1]

[an]→ a,

1Etemadi, Nasrollah. (1981). An elementary proof of the strong law of large numbers.Z. Wahrsch. verw. Geb. vol. 55, pp. 119-122.

160 WLLN, SLLN & Uniform SLLN

we get that

1

aE(X1) ≤ lim inf

j

1

j

j∑

k=1

Yk ≤ lim supj

1

j

j∑

k=1

Yk ≤ aE(X1).

Since a > 1 is arbitrarily close to 1, the result is proved. So, all we need to proveis (0.3). In fact we will show a little more that the left side in (0.3) convergescompletely to zero as n gets large. For this purpose let Y1+Y2+· · ·+Ym(n) = Tm(n).Now, for any ǫ > 0, we have

∞∑

n=1

P

∣∣∣∣∣∣1

m(n)

m(n)∑

k=1

(Yk − E(Yk))

∣∣∣∣∣∣> ǫ

=∞∑

n=1

P(∣∣Tm(n) −E(Tm(n))

∣∣ > ǫm(n))

≤∞∑

n=1

V ar(Tm(n))

ǫ2m(n)2, (Chebyshev),

=1

ǫ2

∞∑

n=1

1

m(n)2

m(n)∑

k=1

V ar(Yk), (pairwise independence),

=1

ǫ2

∞∑

k=1

V ar(Yk)∞∑

n: m(n)≥k

1

m(n)2, (Tonelli),

=1

ǫ2

∞∑

k=1

V ar(Yk)

∞∑

n: m(n)≥k

(1

[an]

)2

≤ 1

ǫ2

∞∑

k=1

V ar(Yk)∞∑

n: an≥k

(2

an

)2

≤ 4

ǫ2

∞∑

k=1

V ar(Yk)1

1− 1a2

1

(aloga k)2

=4a2

(a2 − 1)ǫ2

∞∑

k=1

V ar(Yk)

k2<∞, by (0.2).

This finishes the proof. ♠

Remark - 17.0.7 - (Kolmogorov’s versus Etemadi’s SLLN) Compare Exer-cises (16.1.10), with (17.0.15) and (17.0.16) below. In Exercise (16.1.10) the se-quence of independent random variables was assumed to have finite variances, whilethe sequence ak is allowed to be unbounded. In Exercise (16.1.10) the indepen-dent sequence of random variables are only guaranteed to have the first momentwith ak being constrained further to be bounded. In Exercise (17.0.16) we havefurther weaker conditions on the sequence of random variables (being only pairwiseindependent) with a further stronger condition on the bounded sequence ak,namely it should be Cesaro summable.

Page 86: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

WLLN, SLLN & Uniform SLLN 161

Exercises (17.0.15), (17.0.16) reveal a qualitative difference between the SLLNof Kolmogorov and the SLLN of Etemadi. Of course SLLN of Etemadi is a gen-eralization of the SLLN of Kolmogorov. However, the difference between them is

revealed when we introduce an arbitrary bounded sequence ak into the Cesarotransforms,

1

n

n∑

k=1

akXk.

In the case of the Kolmogorov SLLN this transform is guaranteed to reproduce theconvergence or oscillatory behavior of 1

n

∑nk=1 ak (with the assumption on mutual

independence of Xk). While Etemadi’s SLLN guarantees that the transform willreproduce the convergence behavior of 1

n

∑nk=1 ak (with only pairwise independence

assumption on Xk). Somehow it seems as if mutual independence may be necessaryfor this extra benefit that Komogorov’s SLLN provides. However, this is only aconjecture at this moment.

Exercise - 17.0.15 - (Weighted version of SLLN of Kolmogorov) Let X1,X2, · · ·be any sequence of independent and identically distributed random variables with

E|X1| <∞. If a1, a2, · · · is any bounded sequence of real numbers then

1

n

n∑

k=1

ak(Xk − E(X1))a.s.→ 0.

[Note that ak does not have to be Cesaro summable. It says that the convergence

or oscillatory behavior of 1n

∑nk=1 ak is reproduced by 1

n

∑nk=1 akXk. So, a bounded

sequence is C1-summable if and only if 1n

∑nk=1 akXk converges almost surely.]

Exercise - 17.0.16 - (Weighted version of SLLN of Etemadi) Let X1,X2, · · ·be any sequence of pairwise independent and identically distributed random vari-

ables with E|X1| < ∞. If a1, a2, · · · is any bounded sequence of real numbers

which is Cesaro summable to α then

1

n

n∑

k=1

akXka.s.→ αE(X1).

Exercise - 17.0.17 - (Uniform integrability is preserved by regular meth-ods)

• (i) Show that if E(|X|) <∞ then X is uniformly integrable.

• (ii) Let X1,X2, · · · be any sequence of identically distributed random vari-ables with E(|X1|) < ∞. Prove that the collection Xi, i ≥ 1 is uniformlyintegrable.

• (iii) Let X1,X2, · · · be any sequence of random variables that is uniformlyintegrable. Let A = [ank] by a regular summability method. Prove that thecollection ∑∞

k=1 ankXk, n ≥ 1 is uniformly integrable.

162 WLLN, SLLN & Uniform SLLN

• (iv) Conclude that in Exercises (17.0.15) and (17.0.16) L1 convergence takesplace as well.

D Exercise - 17.0.18 - (Almost sure convergence may imply L2 convergence)Let U1, U2, · · · , be independent and identically distributed B(1, p) for some fixedp ∈ (0, 1). For any real sequence ak if

1

n

n∑

k=1

ak (Uk − p)a.s.→ 0,

then∑n

k=1 a2k = o(n2) and hence the L2-convergence must take place. [Hint:

(Uk − p)(Uj − p)/(p(1− p)), j > k ≥ 1 is an orthonormal system. Egorov’s the-orem says that almost sure convergence implies uniform convergence over a set, A,of positive probability. Expand χA by the orthogonal system.]

UC Exercise - 17.0.19 - Let X1,X2, · · · be any sequence of independent random vari-ables and let Zn = 1

n

∑nk=1 Xk, n = 1, 2, · · · . Prove that the following three

statements are equivalent.

• (i) Zna.s.→ 0.

• (ii) (a) Znprob→ 0 and (b) Z2n

a.s.→ 0.

• (ii) (a) Znprob→ 0 and (b)

∑n P(|2Z2n+1 − Z2n | > ǫ) <∞ for any ǫ > 0.

Exercise - 17.0.20 - Let X1,X2, · · · be iid random variables. If E(X+1 ) = +∞ and

E(X−1 ) <∞ then show that 1

n

∑nk=1 Xk

a.s.→ +∞.

UC Exercise - 17.0.21 - Suppose we have two coins, a dime which is assumed to befair and a quarter which is biased. A nickle (fair) is tossed and if a head appears,the dime is chosen otherwise the quarter is chosen. The chosen coin is then tossedinfinitely many times with outcomes X1,X2, · · · , where Xk = 1 if the k-th tosslands heads and Xk = 0 otherwise. Does the strong law of large numbers hold forthe sequence X1,X2, · · · . Prove your assertions.

HW46 Exercise - 17.0.22 - Prove the following results.

• (i) For any ǫ > 0 prove that

limn→∞

supp∈[0,1]

k:|k−np|≥nǫ

(n

k

)pk(1− p)n−k = 0.

• (ii) For any continuous function f on [0, 1],

limn→∞

supp∈[0,1]

∣∣∣∣∣

n∑

k=0

f(

kn

)(n

k

)pk(1− p)n−k − f(p)

∣∣∣∣∣ = 0.

Bn(f, p) :=∑n

k=0 f(

kn

) (nk

)pk(1− p)n−k is called the Bernstein polynomial.

Page 87: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

17.1 Glivenko-Cantelli Theorem 163

17.1 Glivenko-Cantelli Theorem

We start off with a result of Polya concerning convergence in distribution.

Proposition - 17.1.1 - (George Polya) If Fn, n ≥ 1 converges in distributionto another distribution G and if G is continuous then supt∈R

|Fn(t)−G(t)| → 0.

Proof: Suppose not and let there be an ǫ > 0 and a subsequence nk, k ≥ 1, sothat

supt∈R

|Fnk(t)−G(t)| > ǫ, for all k ≥ 1.

By the definition of supt apllied for each fixed k ≥ 1, we get a sequence xk, k ≥ 1so that

|Fnk(xk)−G(xk)| >

ǫ

2, for all k ≥ 1. (1.4)

We will soon show that xk has to be a bounded sequence. Therefore, it musthave a limit point, say α, so that there exists a subsequence xkj

→ α as j gets

large. That is |xkj− α| ≤ δ

4 for all j ≥ J for any δ > 0. Of course, G is continuousat α giving G(xkj

)→ G(α) and we are given that Fnkj(α)→ G(α). Therefore, for

all j ≥ J ,

G(α− δ)← Fnkj(α− δ) ≤ Fnkj

(xkj) ≤ Fnkj

(α + δ)→ G(α + δ)

Since δ > 0 is arbitrary, this contradicts (1.4). To finish the proof we now showthat xk is bounded. First for our ǫ > 0 cut the left tail by picking an a so thatG(a) < ǫ

8 . Next, since Fnk(a)→ G(a), we have a K so that for all k ≥ K

G(a)− ǫ

8≤ Fnk

(a) ≤ G(a) +ǫ

8.

Now for any k ≥ K, if xk < a then

Fnk(xk) ≤ Fnk

(a) ≤ G(a) +ǫ

8

implying |Fnk(xk) − G(xk)| ≤ 2G(a) + ǫ

8 ≤ 3ǫ8 contradicting (1.4). Therefore,

xn ≥ a for all k ≥ K. For an upper bound, see Exercise (17.1.1). ♠

Let X1,X2, · · · ,Xniid∼ F The empirical distribution function is defined as

Fn(x) :=the number of Xi which are ≤ x

n, x ∈ R.

Note that Fn(x) is a random variable, i.e., a function of X1,X2, · · · ,Xn. Moreprecisely, for a fixed real number x, let Yi(x) := 1 if Xi ≤ x and zero otherwise,

i = 1, 2, · · · , n. Then Y1(x), Y2(x), · · · , Yn(x)iid∼ B(1, F (x)), and

Fn(x) =1

n

n∑

i=1

Yi(x), E(Fn(x)) = F (x), V ar(Fn(x)) =F (x)(1− F (x))

n.

164 WLLN, SLLN & Uniform SLLN

By the SLLN we see that Fn(x)a.s.→ F (x). When F is continuous the above result of

Polya shows that convergence must be uniform. The followoing theorem of Glivenkoand Cantelli says that the convergence is uniform for any F .

Theorem - 17.1.1 - (Glivenko-Cantelli (1933) If X1,X2, · · · iid∼ F , where Fis any probability distribution, and Fn(x) is the empirical distribution as definedabove, then

sup−∞<x<∞

|Fn(x)− F (x)| a.s.→ 0.

Proof: For any distribution F , let ∆ be the set of points of jump of F with j ∈ ∆if and only if F (j) − F (j−) = pj > 0. Clearly this set is countable. If this set isempty just ignore the portion of the argument that relies on it being nonempty.Define a discrete distribution Fd(t) and a continuous distribution Fc(t) by

Fd(t) :=1

p

j∈∆: j≤t

pj , Fc(t) :=1

1− p(F (t)− pFd(t)) , t ∈ R,

where p =∑

j∈∆ pj . So, we see that F (t) = (1 − p)Fc(t) + pFd(t). What is notusually noticed is that the random variable X ∼ F can also be similarly decomposedas X = U Xd + (1 − U)Xc, where Xd ∼ Fd and Xc ∼ Fc and U ∼ Bernoulli(p),with all three random variables, U,Xd,Xc, being mutulally independent. Indeed,

P(UXd + (1− U)Xc ≤ t) = pP(Xd ≤ t) + (1− p)P(Xc ≤ t)

= (1− p)Fc(t) + pFd(t) = F (t).

Let Ukiid∼ U , Xd,k

iid∼ Fd, and Xc,kiid∼ Fc. Now it is easy to see that

Fn(t) =1

n

n∑

k=1

I(UkXd,k + (1 − Uk)Xc,k ≤ t)

=1

n

n∑

k=1

I(Uk = 1) I(Xd,k ≤ t) +1

n

n∑

k=1

I(Uk = 0) I(Xc,k ≤ t)

=1

n

n∑

k=1

Uk I(Xd,k ≤ t) +1

n

n∑

k=1

(1− Uk) I(Xc,k ≤ t)

=: Dn(t) + Cn(t), (say).

Let us use the notations

Fd,n(t) =1

n

n∑

k=1

I(Xd,k ≤ t), Fc,n(t) =1

n

n∑

k=1

I(Xc,k ≤ t).

By the SLLN, Fc,n(t)a.s.→ Fc(t) for all t ∈ R, and the weighted version gives

1

1− pCn(t) =

1

n(1− p)

n∑

k=1

(1 − Uk) I(Xc,k ≤ t)a.s→ Fc(t).

Page 88: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

17.1 Glivenko-Cantelli Theorem 165

By Polya’s theorem (actually Exercise (17.1.2)),

supt|Cn(t)− (1− p)Fc(t)| a.s→ 0.

Here almost surely over the intersection of the two events (one for ω of U ’s and onefor ω′ of Xc,k’s) each with probability one. To take care of Fd,n(t) part we take aslightly different tack. The SLLN and its weighted version, for each j ∈ ∆, give

1

n

n∑

k=1

I(Xd,k = j)a.s.→ P(X = j),

1

np

n∑

k=1

Uk I(Xd,k = j)a.s→ P (X = j).

Consider the matrix B = [bnj ], where bnj = 1np

∑nk=1 Uk I(Xd,k = j). This matrix

obeys the following properties (almost surely).

• (i) bnj → βj for each j ∈ ∆.

• (ii) supn

∑j |bnj | ≤ C <∞, and

• (iii) limn→∞∑

j bnj = σ.

(where βj = P(X = j), C = 1p and σ = 1 in our case). Any such matrix is called

a conservative matrix. A famous theorem of Kojima-Schur says that B takes aconvergent sequence to a convergent sequence if and only if it is a conservativematrix. In our case βj ≥ 0,

∑j βj = σ, and another result, which we will call

discrete Scheffe’s theorem see Exercise (17.1.3), says that we must have

limn→∞

j

|bnj − βj | = 0.

Thus we have

|Dn(t)− pFd(t)| = p

∣∣∣∣∣∣

j∈∆, j≤t

(1

np

n∑

k=1

Uk I(Xd,k = j)− P(X = j)

)∣∣∣∣∣∣

≤ p∑

j∈∆

∣∣∣∣∣1

np

n∑

k=1

Uk I(Xd,k = j)− P(X = j)

∣∣∣∣∣ = p∑

j

|bnj − βj |.

The right side does not depend on t and goes to zero, giving uniform convergencein t. Hence, we see that

supt|Fn(t)− F (t)| = sup

t|Dn(t)− pFd(t) + Cn(t)− (1− p)Fc(t)|

≤ supt|Dn(t)− pFd(t)|+ sup

t|Cn(t)− (1− p)Fc(t)| → 0,

over an event of probability one. ♠

The above theorem is just the tip of the iceberg. In order to see the rate ofconvergence, we bring in an inequality of Hoeffding.

166 WLLN, SLLN & Uniform SLLN

Proposition - 17.1.2 - (Hoeffding’s inequality) Let X be any random variablewith P(a ≤ X ≤ b) = 1. Then for any ǫ > 0 we have

E(et(X−E(X))) ≤ exp

t2(b− a)2

8

, for all t > 0.

When X1,X2, · · · ,Xn are independent with P(ai ≤ Xi ≤ bi) = 1 and Sn = X1 +· · ·+ Xn then

P(Sn − E(Sn) ≥ ǫ) ≤ exp

−2ǫ2∑ni=1(bi − ai)2

.

Proof: Without loss of generality we may assume E(X) = 0. By the convexity ofthe function etx we have

etX ≤ X − a

b− aetb +

b−X

b− aeta.

This gives that E(etX) ≤ beta−aetb

b−a =: h(t). The reader may verify that h(0) = 1,

h′(0) = 0 and h′′(t) = (ba2eta − ab2etb)/(b − a). Therefore, if we write h(t) =

eln h(t) =: eg(t) then g(0) = 0, g′(0) = 0 and g′′(t) = h(t)h′′(t)−(h′(t))2

(h(t))2 . Therefore,

beta − aetb

b− a= exp

t2

2g′′(ξ)

= exp

t2(b− a)2

2

−abeξ(a+b)

(beξa − aeξb)2

, ξ ∈ [0, t].

To maximize the exponent in ξ, we may write it as

−abeξ(a+b)

(beξa − aeξb)2= u(ξ) (1− u(ξ)), u(ξ) =

beξa

beξa − aeξb,

Note that E(X) = 0 implies that a ≤ 0 ≤ b, making u(ξ) ∈ [0, 1]. Hence, u(ξ)(1 −u(ξ)) ≤ 1

4 . This gives the first inequality. For the second inequality we apply thefirst inequality to get

P(Sn − E(Sn) ≥ ǫ) = P(et(Sn−E(Sn)) ≥ eǫt), t ≥ 0,

≤ 1

eǫtE(et(Sn−E(Sn)))

=1

eǫt

n∏

i=1

E(et(Xi−E(Xi)))

≤ 1

eǫt

n∏

i=1

et2(bi−ai)2/8)

=1

eǫtexp

t2

8

n∑

i=1

(bi − ai)2

.

Taking t = 4ǫP

ni=1(bi−ai)2

gives the second inequality. ♠

Page 89: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

17.1 Glivenko-Cantelli Theorem 167

Theorem - 17.1.2 - (Improved Glivenko-Cantelli theorem) For the Glivenko-Cantelli theorem we have

P

(supt∈R

|Fn(t)− F (x)| > ǫ

)≤ 8(n + 1)e−nǫ2/32.

Instead of proving this result, we explain its significance and the context inwhich it should be viewed. First consider the collection of events F := χ(−∞,t](x), t ∈R Note that P(X ≤ t) = F (t) = Eft(X), where ft(x) = χ(−∞,t](x) ∈ F . If

X1,X2, · · · ,Xniid∼ F then the empirical distribution function, Fn(x), is 1

n

∑ni=1 ft(Xi)

and the Glivenko-Cantelli theorem may be stated as a uniform form of the stronglaw of large numbers,

supf∈F

∣∣∣∣∣1

n

n∑

i=1

f(Xi)− Ef(X1)

∣∣∣∣∣a.s.→ 0.

In other words, the convergence is uniform over all the functions f ∈ F .

Exercise - 17.1.1 - Finish the proof of Polya’s theorem by showing that the se-quence xk is bounded above.

Exercise - 17.1.2 - (A modified Polya’s result) Let G be a probability dis-tribution and let F1, F2, · · · be a sequence of nondecreasing and right continuousfunctions with Fn(−∞) ≡ 0, and Fn(+∞)→ 1. If G is a continuous function suchthat Fn(t)→ G(t) for all t then prove that supt |Fn(t)−G(t)| → 0.

Exercise - 17.1.3 - (Discrete Scheffe’s theorem) Let B = [bnj ] be a nonnega-tive conservative matrix, i.e., having the following properties.

• (i) bnj → βj for each j ∈ ∆.

• (ii) supn

∑j bnj ≤ C <∞, and

• (iii) limn→∞∑

j bnj = σ.

If∑

j βj = σ then show that limn→∞∑

j xnj (bnj − βj) = 0, for any boundeddouble array [xnj ].

Exercise - 17.1.4 - (Scheffe’s theorem) Let fn be a sequence of probability den-sities so that fn → f (pointwise) where f is also a density. Then show that

limn→∞

R

|fn(t)− f(t)| dt = 0.

168 WLLN, SLLN & Uniform SLLN

Page 90: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

Lecture 18

Random Series

We start off with the zero-one law of Kolmogorov. Then we present the famousresult of Kolmogorov concerning the convergence of a series whose components arerandom variables. Then we present some refinements of the strong laws of largenumbers.

18.1 Zero-One Laws & Random Series

For this, let us recall the definition of σ(X1,X2, · · · ,Xn) ⊆ ‖∑. It is the smallestsigma field with respect to which X1,X2, · · · ,Xn are measurable. For an infinitesequence of random variables X1,X2, · · · , we take σ(X1,X2, · · · ) to be the smallestsigma field with respect to which any finite subcollection of the Xi’s is measurable.That is,

σ(X1,X2, · · · ) = σ

( ∞⋃

n=1

σ(X1,X2, · · · ,Xn)

)

.

Another sigma field, called the tail sigma field, T , is defined as follows.

T =∞⋂

\=∞σ(X\,X\+∞, · · · ).

This is a sigma field since intersection of sigma fields is again a sigma field.

Example - 18.1.1 - Here are some examples of elements of T . Let X1,X2, · · · bea sequence of random variables.

∞∑

k=1

Xk converges

,

1

n

n∑

k=1

Xk → 0

.

The first event belongs to T is clear since the convergence of the series takes place ifand only if

∑∞k=n Xk converges for each n. To see why the second event also belongs

to T note that the averages 1n

∑nk=1 Xk converge to zero if and only if 1

n

∑Nk=1 Xk

170 Random Series

+ 1n

∑nk=N+1 Xk converge to zero, for each N . But the first term always goes to

zero for each N . Hence, 1n

∑nk=1 Xk converge to zero if and only if 1

n

∑nk=N+1 Xk

converge to zero, for each N . Hence, the event is a member of σ(XN+1,XN+2, · · · )for each N .

Exercise - 18.1.1 - Is it true that lim infn∑∞

k=1 Xk ≤ 0 is a member of T ?(Justify your answer).

Theorem - 18.1.1 - (Kolmogorov’s 0-1 Law) Let X1,X2, · · · be a sequence ofindependent random variables and let T be the tail sigma field. For any A ∈ T ,either P (A) is 0 or 1.

Proof: Just note that σ(X1,X2, · · · ,Xn) and B ∈ σ(Xn+1,Xn+2, · · · ) are inde-pendent. To prove this, we proceed as follows.

By the independence of random variables, we see that σ(X1,X2, · · · ,Xn) isindependent of σ(Xn+1,Xn+2, · · · ,Xn+m) for each m ≥ 1. Therefore, for any A ∈σ(X1,X2, · · · ,Xn) and any B ∈ ∪∞m=1σ(Xn+1,Xn+2, · · · ,Xn+m) we have P (A ∩B) = P (A)P (B). But the set ∪∞m=1σ(Xn+1,Xn+2, · · · ,Xn+m) forms a π-system.Hence, by Proposition (??),1 σ(X1,X2, · · · ,Xn) is independent of σ(Xn+1,Xn+2, · · · ).

Next, since T ⊆ σ(X\+∞,X\+∈, · · · ), we see that σ(X1,X2, · · · ,Xn) is inde-pendent of T for each n. That is, for any A ∈ σ(X1,X2, · · · ,Xn) and any B ∈ T wehave P (A∩B) = P (A)P (B). Thus, the collection (π-system) ∪∞n=1σ(X1,X2, · · · ,Xn)is independent of T . Once again, by Proposition (??), it must be that

σ(∪∞n=1σ(X1,X2, · · · ,Xn)) = σ(X1,X2, · · · )

is independent of T . But T ⊆ σ(X∞,X∈, · · · ). Hence, T is independent of T .That is, if A,B ∈ T then P (A ∩ B) = P (A)P (B). Taking B = A gives thatP (A) = P (A)2. Thus, it must be that P (A) is either zero or 1. ♠

Exercise - 18.1.2 - Let Xk, k ≥ be a sequence of random variables so that Xka.s.→

X. Prove the following results.

• (i) Show that there exists an event A with P(A) > 0 so that Xk converges toX uniformly on A.

• (ii) Show that there exists an event A with P(A) > 0 and a constant N sothat supω∈A |Xn(ω)−X(ω)| ≤ 1 for all n ≥ N .

[Hint: you may use Egorov’s theorem.]

UC Exercise - 18.1.3 - (Buck-Pollard)2Let ([0, 1], E , P) be the usual probability spacewith P((a, b)) = b − a and E is the Borel sigma field. For any t ∈ (0, 1] lett = U1

2 + U2

22 + · · · be the usual nonterminating dyadic expansion. This makes

1which says that if two π-systems are independent then their respective generatedsigma fields are also independent.

2R. Creighton Buck & Harry Pollard. (1943), Convergence and summability of subse-quences, Bull. Am. Math. Soc. vol. 49, pp. 924–931.

Page 91: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

18.1 Zero-One Laws & Random Series 171

U1, U2, · · · iid∼ B(1, 12 ). For a given sequence ak of constants assume there exists

a random variable Z so that

Zn :=1

n

n∑

k=1

ak (2Uk − 1)a.s.→ Z.

Prove the following results.

• (i) If A is an event with P(A) > 0 so that Zn converges uniformly, then thereexists a constant K so that |Z(t)| ≤ K for all t ∈ A.

• (ii) There exists a constant K so that the event D = t : Z(t) ≤ K ∩ t :Z(t) ≥ −K has probability P(D) = 1. Hence, E(Z) must exist, say µ.

• (iii) For any ǫ > 0 explain why exactly one of the events, Z > µ + ǫ,Z ≤ µ− ǫ, |Z − µ| ≤ ǫ has probability one.

• (iv) Verify that if t = U1(t)2 + U2(t)

22 + · · · then 1− t = 1−U1(t)2 + 1−U2(t)

22 + · · · .That is, Un(1 − t) = 1 − Un(t) for all n. Therefore, there exists a t ∈ (0, 1)for which Z(t) and Z(1− t) exist with Z(t) + Z(1− t) = 0, and hence Z = 0almost surely.

• (v) Using Exercise (17.0.18) deduce that ZnL2

→ 0. Hence,∑n

k=1 a2k = o(n2).

[For more on this, see Exercise (18.2.1).]

Now we present a third application of truncation in conjunction with the Kol-mogorov inequality. Here we will turn our attention to the question of whether aninfinite seires of random variables converges or not. Our goal is to see when doesthe series

∞∑

k=1

Xk

converge to some random variable? Here, convergence is in the sense of partialsums. That is, we consider

Sn :=n∑

k=1

Xk,

and ask do the random variables Sn converge to a random variable in some sense.We will assume that the random variables are mutually independent. The key re-sults are the Khintchin-Kolmogorov theorem and the famous “three series theorem”due to Kolmogorov. Recall that if Xk are independent random variables then

V ar(X1 + X2 + · · ·+ Xn) =n∑

k=1

V ar(Xk).

This gives that

limn→∞

V ar

(n∑

k=1

Xk

)

=∞∑

k=1

V ar(Xk).

172 Random Series

The issue is, can we take the left side limit inside the V ar(·) operation? In otherwords, is V ar(·) operation continuous in this sense so that we may have

V ar(X1 + X2 + · · ·+ Xn + · · · ) =

∞∑

k=1

V ar(Xk).

One major problem is to define the random variable on the left side whose varianceis sought. The random series need to converge to a random variable before we canask about its variance. This is answered in the following theorem which says thata sufficient condition is to have

∑k V ar(Xk) <∞ for both issues to be settled.

Theorem - 18.1.2 - (Khintchin-Kolmogorov criterion) Let X1,X2, · · · be asequence of independent random variables with means E(Xk) = µk, and finitevariances V ar(Xk) = σ2

k so that∑

k σ2k <∞. Then the follwoing statements hold:

• (i) The random series∑∞

k=1(Xk − µk) almost surely converges to a rv L.

• (ii) E(L2) <∞ and E (∑n

i=1(Xi − µi)− L)2 → 0.

• (iii) L∑nk=1(Xk − µk), n = 1, 2, 3, · · · is uniformly integrable.

• (iv) E(L) = 0, and V ar(L) =∑∞

k=1 V ar(Xk).

Proof: Let Sn =∑n

i=1(Xi − µi). Now we will prove the existence of a ran-dom variable L so that Sn converge to L almost surely. By using Kolmogorov’sinequality, (by the first n terms to be zero)

P

(max

n<k≤m|Sk − Sn| > ǫ

)= P

maxn<k≤m

∣∣∣∣∣∣

k∑

j=n+1

(Xj − µj)

∣∣∣∣∣∣> ǫ

≤ 1

ǫ2

m∑

k=n+1

σ2k.

Since,

ǫ < maxn<k≤m

|Sk − Sn| ≤ maxn<k≤m+1

|Sk − Sn| ≤ supk>n|Sk − Sn|,

the continuity property of probability measures gives that

P(supk>n|Sk − Sn| > ǫ) = lim

m→∞P

(max

n<k≤m|Sk − Sn| > ǫ

)≤ 1

ǫ2

∞∑

k=n+1

σ2k.

Since infn supk>n |Sk − Sn| ≤ supk>n |Sk − Sn|, we see that

P

(infn≥1

supk>n|Sk − Sn| > ǫ

)≤ P(sup

k>n|Sk − Sn| > ǫ) ≤ 1

ǫ2

∞∑

k=n+1

σ2k

for all n ≥ 1. Hence, it must be that

P

(infn≥1

supk>n|Sk − Sn| > ǫ

)= 0.

Page 92: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

18.1 Zero-One Laws & Random Series 173

Since, ǫ > 0 is arbitrary, we have

P

(infn≥1

supk>n|Sk − Sn| = 0

)= 1.

This implies that Sk is a Cauchy sequence on a set of probability 1, (see Exercise(18.1.4)) making Sk almost surely convergent to some limiting random variableL. This proves part (i) of the theorem.

For the proof of the second part, we will use the fact that the space L2 of squareintegrable random variables is complete. Since E(Sn) = 0, for any m > n, we have

E|Sn − Sm|2 = V ar(Sn − Sm) =

m∑

i=n+1

V ar(Xi) ≤∑

i≥n+1

V ar(Xi)→ 0.

Therefore, Sn, n ≥ 1 is a Cauchy sequence in L2. Hence, there exists a randomvariable, T , in L2 so that Sn converge to T in the L2 norm. This implies thatSn converge to T in probability as well. And almost sure convergence of Sn to Limplies convergence in probability to L as well. Hence, it must be that T = L withprobability one. This gives part (ii).

Note that supn E(S2n) =

∑k σ2

k < ∞. Hence, Sn is uniformly integrable.Since, Sn converge to L almost surely, we have E(L) = limn E(Sn) = 0. Now weshow that LSn is also uniformly integrable. Indeed, by the CBS inequality,

supn

E(|LSn|) ≤ supn

(E(L2)

)1/2√E(S2

n) ≤(

E(L2)

∞∑

k=1

σ2k

)1/2

< ∞.

Denote the sum∑

k σ2k by K. By the fact that L2 is integrable, for any ǫ > 0 there

exists a δ(ǫ) > 0 so that∫

A

L2 dP ≤ ǫ, whenever P (A) < δ.

For any given ǫ > 0, using δ( ǫ2

K ), once again the CBS inequality gives that

supn

A

|LSn| dP

≤ supn

(∫

A

L2 dP

)1/2√E(S2

n) = K1/2

(∫

A

L2 dP

)1/2

≤ K1/2

(ǫ2

K

)1/2

= ǫ.

To prove the last part, since LSn converges to L2 almost surely, by part (iii), wehave E(LSn)→ E(L2). Finally, by part (ii),

0 = limn

E

(n∑

i=1

(Xi − µi)− L

)2

= limn

n∑

k=1

σ2k + E(L2)− 2E(LSn)

=

∞∑

k=1

σ2k − E(L2).

This finishes the proof. ♠

174 Random Series

Exercise - 18.1.4 - Verify that a sequence of real numbers, ak, is a Cauchy se-quence if and only if infn≥1 supk>n |ak − an| = 0.

Exercise - 18.1.5 - Let X1,X2, · · · be any sequence of iid random variables withE|X1| < 1. Prove that the random series

∑∞n=1

∏nk=1 Xk converges absolutely

almost surely.

Exercise - 18.1.6 - (Random harmonic series revisited) Recall Exercise (7.2.3),

where U1, U2, · · · iid∼ B(1, 12 ) represent the zero/one outcomes of a fair coin toss. De-

fine

Xn :=n∑

i=1

2Ui − 1

i, n ≥ 1.

Show that there exists a random variable L so that

• (i) E(L) = 0, V ar(L) = π2/6.

• (ii) Xna.s.→ L.

• (iii) XnL2

→ L.

Exercise - 18.1.7 - (Pick a point at random revisited) Recall Exercise (14.0.9),

where U1, U2, · · · iid∼ Uniform0, 1, · · · , 9, and

Xn :=

n∑

i=1

Ui − 4.5

10i, n ≥ 1.

Show that there exists a random variable L so that

• (i) L ∼ Uniform(−0.5, 0.5).

• (ii) Xna.s.→ L.

• (iii) XnL2

→ L.

Exercise - 18.1.8 - (Cantor distribution) Let U1, U2, · · · iid∼ B(1, 12),

Xn :=n∑

i=1

2Ui

3i, n ≥ 1.

Show that there exists a random variable L so that

• (i) Xna.s.→ L.

• (ii) XnL2

→ L.

• (iii) Show that L has the Cantor distribution.

• (iv) Show that E(L) = 12 and V ar(L) = 1

8 .

Page 93: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

18.2 Refinements of SLLN 175

• (v) Write a computer program that simulates one million approximate out-comes of L and plot the resulting histograms for various number of bins, say20, 50, 100 and 500. Describe what happens to be resulting histograms.

[Hint: for part (iv) you cannot use the density of L since it does not exist.]

Exercise - 18.1.9 - (Brownian motion a.k.a Wiener process) Take Z0, Z1, Z2, · · ·to be independent and identically distributed standard normal random variables.For any positive t, define

Wn(t) :=tZ0√

π+

√2

π

n∑

k=1

sin kt

kZk, Wn(0) := 0.

Fix t, and consider the sequence W0(t),W1(t),W2(t), · · · . Show that there exists

a random varible, say W (t), so that Wn(t)a.s→ W (t). What is V ar(W (t)) for

0 ≤ t ≤ π? (As a function of t, W (t) turns out to be the Brownian motion.)

UC Exercise - 18.1.10 - (Weiner’s orginial Brownian motion)

18.2 Refinements of SLLN

Now, we will see a rather elementary result from summability theory which has hada profound impact on the question of covergence of sums of independent randomvariables.

Proposition - 18.2.1 - (Kronecker’s lemma) Let Pn be a sequence of strictlypositive numbers with Pn ≤ Pn+1 and Pn → ∞. Let x1, x2, · · · be a sequence ofreal numbers and let

sn := x1 + x2 + · · ·+ xn.

Then, as n→∞,

sn

Pn=

1

Pn

n∑

k=1

xk → 0, providedn∑

k=1

(xk

Pk

)converges.

Proof: Let us take

an :=n∑

k=1

(xk

Pk

),

and let a0 = 0 and let P0 = 0. We are given that an → α, (say). We will applyAbel’s summation by parts to the sequences ak and Pk. Indeed, we have

sn

Pn=

1

Pn

n∑

k=1

xk

=1

Pn

n∑

k=1

Pkxk

Pk

176 Random Series

=1

Pn

n∑

k=1

Pk (ak − ak−1)

=1

Pn

Pnan −n∑

k=1

(Pk − Pk−1) ak−1

= an −1

Pn

n∑

k=1

(Pk − Pk−1) ak−1

→ α− α = 0.

Here we used the regularity of the Riesz means. ♠

An analog of this result holds for the Euler/Borel methods, cf. Exercise??. Asan immediate consequence we prove a strong law of large numbers.

Proposition - 18.2.2 - For Xn be any sequence of independent random variables,

∞∑

k=1

1

k2V ar(Xk) <∞ implies

1

n

n∑

k=1

(Xk − E(Xk))a.s.→ 0.

Proof: By Khintchin-Kolmogorov criterion,∑∞

k=11k2 V ar(Xk) < ∞ implies that

∑k

Xk−E(Xk)k converges almost surely. We take Pn = n in Kronecker’s lemma and

get the conclusion. ♠

Remark - 18.2.1 - (Beta matrices & Kronecker’s lemma) A summability ma-trix B = [bnk] is called a beta matrix if for every convergent series

∑k xk we have

yn :=∑

k xk bnk exists for all n and yn converges to some value. For a characteri-zation of beta matrices see Exericse (??).

For a beta matrix, a non-decreasing sequence, Pk, of positive numbers willbe called a beta-null sequence of B = [bnk] if for every convergent series of the type∑

kxk

Pkthe transform yn :=

∑k xk bnk exists for all n and yn → 0.

Kronecker’s lemma says that for the Pk-Riesz method, the sequence Pkitself is its own beta-null sequence. In particular for the Cesaro method we havePk = k is its beta-null sequence. It is not difficult to show that for the regularEuler and Borel methods their beta-null sequence is Pk =

√k.

Remark - 18.2.2 - Our goal is to avoid assuming the existence of finite variance forthe sequence of random variables in the strong law of large numbers. This involvesa truncation argument. Not only the end result is important but also the methodis of some value since this method has other applications.

The next theorem shows the main refinement if the mean exists but the variancemay not. Then after this result we present a theorem of Feller which gives arefinement of the SLLN when the mean does not exist either.

Page 94: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

18.2 Refinements of SLLN 177

Theorem - 18.2.1 - (Extension of Khintchin-Kolmogorov criterion — due

to Chung, Marcinkiewicz & Zygmund)3 Let f(x) be a nonnegative , even

and continuous function on the real line. Also, let f(x)/x be increasing for x > 0and f(x)/x2 be decreasing for x > 0. (Such as f(x) = |x|p; for some 1 ≤ p ≤ 2). Let(Xk) be a sequence of independent random variables with finite means E(Xk) = µk,and let bk be a sequence of positive increasing numbers going to infinity. If

∞∑

k=1

Ef(Xk − µk)

f(bk)<∞

then (i)∑n

k=1(Xk−µk)

bkconverges a.s. and (ii) 1

bn

∑nk=1(Xk − µk)

a.s.→ 0.

Proof: Since, all the assumptions and conclusions are in terms of Xn−µn, withoutloss of generality, take µn = 0. The trick is to truncate the random variables anduse the Khintchin-Kolmogorov criterion for convergence of a random series. Thiswill give part (i). The second part is just Kronecker’s lemma. So, let us define

Yn :=

Xn if |Xn| ≤ bn

0 if |Xn| > bn.

Now, we will show that Xn and Yn are equivalent. That is,∑

k

P (Xk 6= Yk) =∑

k

P (|Xk| > bk) < ∞.

In this regard, we use the fact that f(x)/|x| is increasing as x goes to infinity. Note,that

|Xn| > bn ⊆

f(bn)

bn≤ f(Xn)

|Xn|

⊆ |Xn|

bn≤ f(Xn)

f(bn)

.

So, we get that

P (|Xn| > bn) = E(χ

|Xn|>bn

)≤ E

( |Xn|bn

χ|Xn|>bn

)≤ E

(f(Xn)

f(bn)

).(2.1)

So, we have

n

P (Xn 6= Yn) =∑

n

P (|Xn| > bn) ≤∑

n

Ef(Xn)

f(bn)< ∞.

Thus, the two random sequences Xn and Yn are equivalent. Therefore, thetwo random sequences Xn/bn and Yn/bn are equivalent. To show that therandom series

∑k Yk/bk converges almost surely, we use the Khintchin-Kolmogorov

criterion. We need only prove that∑

nV ar(Yk)

bk< ∞. In this regard, we use the

other given fact that f(x)/x2 is decreasing. So, over the set where Xk equals Yk,we have,

|Xn| ≤ bn ⇒ f(bn)

b2n

≤ f(Xn)

X2n

⇒ X2n

b2n

≤ f(Xn)

f(bn).

3Chung, Kai Lai (1947), A note on some strong laws of large numbers. Amer. J.

Math., vol. 69, pp. 189-192.

178 Random Series

This gives that

n

V ar(Yn)

b2n

≤∑

n

E

(Y 2

n

b2n

)=

n

E

(X2

n

b2n

χ|Xn|≤bn

)≤∑

n

E

(f(Xn)

f(bn)

)<∞.

By Khintchin-Kolmogorov criterion we must have

k

1

bk(Yk − E(Yk))

a.s.→

to some random variable. Now we need to show that∑

kE(Yk)

bkconverges. For this

we will show a bit more by showing that∑

k|E(Yk)|

bkconverges. The key fact is that

E(Xk) = 0 gives that

E(Xk) = 0 = E(Xkχ|Xk|≤bk

)+ E

(Xkχ|Xk|>bk

)

Therefore, we see that

|E(Yk)| =∣∣E(Xkχ|Xk|≤bk

)∣∣ =∣∣E(Xkχ|Xk|>bk

)∣∣

Hence, by the last inequality of (2.1) we have

k

|E(Yk)|bk

=∑

k

|E(Xkχ|Xk|>bk)|bk

≤∑

k

Ef(Xk)

f(bk)< ∞.

This gives part (i) of the theorem. ♠

In the end we mention a beautiful refinement due to Feller.

Theorem - 18.2.2 - (Feller’s extension of SLLN (1946)) Let (Xk) be a se-quence of mutually independent and identically distributed random variables withE|X1| =∞. Let bn be a sequence of positive numbers with bn/nր. Then,

(i) If∞∑

n=1

P (|X1| > bn) <∞ then lim supn

∣∣∣∣

∑ni=1 Xi

bn

∣∣∣∣a.s.= 0.

(ii) If∞∑

n=1

P (|X1| > bn) =∞ then lim supn

∣∣∣∣

∑ni=1 Xi

bn

∣∣∣∣a.s.= ∞.

Proof: The proof proceeds by using a slightly modified form of the truncationargument. Define

Yn :=

Xn − µn if |Xn| ≤ bn

−µn if |Xn| > bn,where µn := E

(X1χ|X1|≤bn

).

This truncation has the property that

E(Yn) = E((Xn − µn)χ

|Xn|≤bn

)− µnP (|Xn| > bn)

Page 95: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

18.2 Refinements of SLLN 179

= E(X1χ|X1|≤bn

)− µnP (|Xn| ≤ bn)− µnP (|Xn| > bn)

= µn − µn = 0,

as well as

n

P (Yn 6= (Xn − µn)) ≤∑

n

P (|Xn| > bn) =∑

n

P (|X1| > bn).

So, if we assume that the last series converges (part (i)) then the random sequenceXn − µn is equivalent to the sequence Yn. To prove (i), we use the Khintchin-Kolmogorov criterion for which we show that

(a)∑

k

V ar(Yk)

b2k

<∞ and (b)1

bn

n∑

k=1

µk → 0.

The first result, (a), will then imply that the random series∑

kYk

bkconverges almost

surely as well as that 1bn

∑nk=1 Yk converges to zero almost surely. The equivalence

then gives that the random series∑

kXk−µk

bkconverges almost surely as well as that

1bn

∑nk=1(Xk−µk) converges to zero almost surely. But then the second result, (b),

gives that

1

bn

n∑

k=1

Xk =1

bn

n∑

k=1

(Xk − µk) +1

bn

n∑

k=1

µka.s.→= 0.

So, we now prove (a) and (b). Just note that

∞∑

k=1

V ar(Yk)

b2k

≤∞∑

k=1

E(Y 2k )

b2k

=∞∑

k=1

1

b2k

µ2

kP (|Xk| > bk) + µ2kP (|Xk| ≤ bk)

−2µkE(Xkχ

|Xk|≤bk

)+ E

(X2

kχ|Xk|≤bk

)

=∞∑

k=1

1

b2k

(µ2

k − 2µk · µk + E(X2

kχ|Xk|≤bk

))

≤∞∑

k=1

1

b2k

E(X2

kχ|Xk|≤bk

)

=∞∑

k=1

1

b2k

E(X2

1χ|X1|≤bk

)

=

∞∑

k=1

1

b2k

k∑

j=1

bj−1<|X1|≤bjX2

1 dP, take b0 = 0,

=

∞∑

j=1

bj−1<|X1|≤bjX2

1 dP

∞∑

k=j

1

b2k

180 Random Series

≤∞∑

j=1

bj−1<|X1|≤bjX2

1 dP

∞∑

k=j

j2

k2b2j

, sincebk

k≥ bj

j,

=∞∑

j=1

j2

b2j

bj−1<|X1|≤bjX2

1 dP

∞∑

k=j

1

k2

≤ C

∞∑

j=1

j2

b2j

bj−1<|X1|≤bjX2

1 dP

(1

j

), where C is a constant,

≤ C

∞∑

j=1

j

b2j

bj−1<|X1|≤bjX2

1 dP

= C

∞∑

j=1

j

bj−1<|X1|≤bj

X21

b2j

dP

≤ C

∞∑

j=1

j

bj−1<|X1|≤bjdP, since

|X1|bj≤ 1,

= C∞∑

j=1

j P (bj−1 < |X1| ≤ bj)

= C

∞∑

j=1

j∑

k=1

P (bj−1 < |X1| ≤ bj)

= C

∞∑

k=1

∞∑

j=k

P (bj−1 < |X1| ≤ bj)

= C∞∑

k=1

P (|X1| > bk−1)

= CP (|X1| > b0) + C∞∑

k=2

P (|X1| > bk−1)

= CP (|X1| > b0) + C

∞∑

j=1

P (|X1| > bj) <∞, j = k − 1,

since the last series is given to be convergent. This proves (a). To prove (b), wewill use the above proved fact that

∞∑

j=1

j P (bj−1 < |X1| ≤ bj) <∞. (2.2)

Note that for any positive integer N < n,

∣∣∣∣∣1

bn

n∑

k=1

µk

∣∣∣∣∣ ≤1

bn

n∑

k=1

E(|X1|χ|X1|≤bk

)

Page 96: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

18.2 Refinements of SLLN 181

=1

bn

N∑

k=1

E(|X1|χ|X1|≤bk

)+

1

bn

n∑

k=N+1

E(|X1|χ|X1|≤bk

)

=1

bn

N∑

k=1

E(|X1|χ|X1|≤bk

)+

1

bn

n∑

k=N+1

E(|X1|χ|X1|≤bN

)

+1

bn

n∑

k=N+1

E(|X1|χbN <|X1|≤bk

)

≤ 1

bn

N∑

k=1

E(bNχ|X1|≤bN

)+

1

bn

n∑

k=N+1

E(bNχ|X1|≤bN

)

+1

bn

n∑

k=N+1

E(|X1|χbN <|X1|≤bk

)

=nbN

bn+

1

bn

n∑

k=N+1

E(|X1|χbN <|X1|≤bk

)

≤ nbN

bn+

1

bn

n∑

k=N+1

E(|X1|χbN <|X1|≤bn

)

≤ nbN

bn+

n

bnE(|X1|χbN<|X1|≤bn

)

=nbN

bn+

n

bn

n∑

j=N+1

E(|X1|χbj−1<|X1|≤bj

)

≤ nbN

bn+

n

bn

n∑

j=N+1

bjP (bj−1 < |X1| ≤ bj)

=nbN

bn+

n∑

j=N+1

bj

j· n

bn· jP (bj−1 < |X1| ≤ bj)

≤ nbN

bn+

n∑

j=N+1

j P (bj−1 < |X1| ≤ bj), sincebj

j≤ bn

n,

≤ nbN

bn+

∞∑

j=N+1

j P (bj−1 < |X1| ≤ bj).

First let n go to infinity and then let N go to infinity. The first term drops to zerosince bn

n must go to infinity, for otherwise, if it remains bounded then∑

k P (|X1| ≥bk) being convergent would imply that E|X1| <∞ which would contradict E|X1| =∞. Then as N gets large the second term drops to zero, due to being the tail ofa convergent series (cf. (2.2)). This finishes the proof of part (i). Part (ii) has asimilar proof whose details we omit. ♠

Exercise - 18.2.1 - (Buck-Pollard revisited) Continuing Exercise (18.1.3) provethe following further results.

182 Random Series

• (i) For any bounded sequence ak, or more general sequence obeying∑∞

k=1a2

k

k2 <

∞, we have Zna.s.→ 0. Therefore, any such sequence is Cesaro summable if

and only if 1n

∑nk=1 akUk converges almost surely. That is, any sequence of

the type∑∞

k=1a2

k

k2 < ∞, is Cesaro summable if and only if almost all of itssubsequences are Cesaro summable.

• (ii) For any arbitrary sequence ak, if almost all of its subsequences areCesaro summable then it must be that the sequence itself is Cesaro summable.

[Remark: Condition∑n

k=1 a2k = o(n2) of part (vii) of Exercise (18.1.3) is only

a necessary condition while the above condition of∑∞

k=1a2

k

k2 < ∞ is a sufficientcondition for the almost sure convergence of Zn = 1

n

∑nk=1 ak(2Uk − 1). There are

other known sufficient conditions.]

UC Exercise - 18.2.2 - If X1,X2, · · · are independent random variables then show that∑k Xk converges almost surely if and only if

∑k Xk converges in probability.

Page 97: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

Lecture 19

Kolmogorov’s Three Series

Theorem

Now we turn our attention to the famous three series theorem of Kolmogorov. Forthe proof of the three series theorem we will need the following lower inequalityalso due to Kolmogorov.

Proposition - 19.0.3 - (Kolmogorov’s lower inequality) Let X1,X2, · · · be in-dependent random variables having finite expectations. If there exists a constantK such that |Xn − E(Xn)| ≤ K, n = 1, 2, · · · , then for any ǫ > 0, we have

P

(

max1≤k≤n

∣∣∣∣∣

k∑

i=1

Xi

∣∣∣∣∣ ≤ ǫ

)

≤ (2K + 4ǫ)2∑nk=1 V ar(Xk)

.

Proof: Note that the given condition on Xn makes Xn to be a bounded randomvariable, however, the location of the support of the random variables may driftaway without any bounds. Let Sn =

∑ni=1 Xi and let S0 = 0. Denote the event of

interest by

An :=

max

1≤k≤n|Sk| ≤ ǫ

.

If P (An) = 0 then there is nothing to prove. Otherwise, since An−1 ⊇ An, wesee that P (Ak) > 0 for each k = 0, 1, 2, · · · , n. If we denote the centered randomvariable by Yk = Xk − E(Xk), then let Tn =

∑ni=1 Yi and once again let T0 = 0.

Furthermore, let

ck =1

P (Ak)

Ak

Tk dP, k = 0, 1, 2, · · · .

Note that∫

Ak

(Tk − ck) dP = ckP (Ak)− ckP (Ak) = 0. (0.1)

184 Kolmogorov’s Three Series Theorem

Furthermore,

|ck − ck+1|

=

∣∣∣∣∣1

P (Ak)

Ak

Tk dP − 1

P (Ak+1)

Ak+1

Tk+1 dP

∣∣∣∣∣

=

∣∣∣∣∣1

P (Ak)

Ak

Tk dP − 1

P (Ak+1)

Ak+1

(Tk + Yk+1) dP

∣∣∣∣∣

=

∣∣∣∣∣1

P (Ak)

Ak

Sk dP − E(Sk)− 1

P (Ak+1)

Ak+1

Sk dP + E(Sk)

− 1

P (Ak+1)

Ak+1

Yk+1 dP

∣∣∣∣∣

=

∣∣∣∣∣1

P (Ak)

Ak

Sk dP − 1

P (Ak+1)

Ak+1

Sk dP − 1

P (Ak+1)

Ak+1

Yk+1 dP

∣∣∣∣∣

≤∣∣∣∣

1

P (Ak)

Ak

Sk dP

∣∣∣∣+

∣∣∣∣∣1

P (Ak+1)

Ak+1

Sk dP

∣∣∣∣∣+

∣∣∣∣∣1

P (Ak+1)

Ak+1

Yk+1 dP

∣∣∣∣∣

≤ 2ǫ + K, since |Sk| ≤ ǫ over Ak and Ak+1.

Also note that

| −E(Sk)− ck| =

∣∣∣∣−E(Sk)− 1

P (Ak)

Ak

Tk dP

∣∣∣∣

=

∣∣∣∣−E(Sk)− 1

P (Ak)

Ak

Sk dP − E(Sk)

∣∣∣∣

≤ 1

P (Ak)

Ak

|Sk| dP < ǫ.

Now the monotonicity of the sequence Ak gives that∫

Ak+1

(Tk+1 − ck+1)2 dP =

Ak

(Tk+1 − ck+1)2 dP −

Ak−Ak+1

(Tk+1 − ck+1)2 dP.

We will deal with the two integrals one at a time. Over the event Ak − Ak+1, wenotice that |Sk| ≤ ǫ. By adding and subtracting ck in the integrands we have

Ak−Ak+1

(Tk+1 − ck+1)2 dP

=

Ak−Ak+1

(Tk+1 − ck + ck − ck+1)2 dP

=

Ak−Ak+1

(Tk − ck + ck − ck+1 + Yk+1)2 dP

≤∫

Ak−Ak+1

(|Sk|+ | −E(Sk)− ck|+ |ck − ck+1|+ |Yk+1|)2 dP

Page 98: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

Kolmogorov’s Three Series Theorem 185

≤∫

Ak−Ak+1

(ǫ + ǫ + (2ǫ + A) + A)2 dP

= (4ǫ + 2A)2P (Ak −Ak+1).

On the other hand,∫

Ak

(Tk+1 − ck+1)2 dP

=

Ak

(Tk − ck + ck − ck+1 + Yk+1)2 dP

=

Ak

(Tk − ck)2 dP +

Ak

(ck − ck+1)2 dP +

Ak

Y 2k+1 dP +

2

Ak

(Tk − ck)(ck − ck+1) dP + 2

Ak

(Tk − ck)Yk+1 dP +

2

Ak

(ck − ck+1)Yk+1 dP.

The last three terms drop out by using (0.1) and the fact that Yk+1 is independentof all the previous random variables and E(Yk+1) = 0. Hence, we have

Ak

(Tk+1 − ck+1)2 dP

=

Ak

(Tk − ck)2 dP +

Ak

(ck − ck+1)2 dP +

Ak

Y 2k+1 dP

≥∫

Ak

(Tk − ck)2 dP +

Ak

Y 2k+1 dP +

=

Ak

(Tk − ck)2 dP + E(Y 2k+1)

Ak

1dP, (independence)

=

Ak

(Tk − ck)2 dP + V ar(Xk+1)P (Ak).

Hence, we see that∫

Ak+1

(Tk+1 − ck+1)2 dP

=

Ak

(Tk+1 − ck+1)2 dP −

Ak−Ak+1

(Tk+1 − ck+1)2 dP

≥∫

Ak

(Tk − ck)2 dP + V ar(Xk+1)P (Ak)− (4ǫ + 2K)2P (Ak − Ak+1).

Stating differently, (by using Ak ⊇ An for k ≤ n) we have

Ak+1

(Tk+1 − ck+1)2 dP −

Ak

(Tk − ck)2 dP

≥ V ar(Xk+1)P (Ak)− (4ǫ + 2K)2P (Ak − Ak+1)

186 Kolmogorov’s Three Series Theorem

≥ V ar(Xk+1)P (An)− (4ǫ + 2K)2P (Ak − Ak+1).

Adding over k = 0, 1, 2, · · · , n− 1, the telescoping effect gives that

An

(Tn − cn)2 dP ≥ P (An)n∑

k=1

V ar(Xk)− (4ǫ + 2K)2n−1∑

k=0

P (Ak − Ak+1)

≥ P (An)n∑

k=1

V ar(Xk)− (4ǫ + 2K)2P (S − An).

Now we use the fact that∫

An

(Tn− cn)2 dP ≤∫

An

(|Sn|+ | −E(Sn)− cn|)2 dP ≤∫

An

(ǫ+ ǫ)2 dP = 4ǫ2P (An).

Hence we have

4ǫ2P (An) ≥ P (An)

n∑

k=1

V ar(Xk)− (4ǫ + 2K)2 + (4ǫ + 2K)2P (An).

Rearranging gives that

(4ǫ + 2K)2 ≥ P (An)n∑

k=1

V ar(Xk) +((4ǫ + 2K)2 − 4ǫ2

)P (An)

≥ P (An)

n∑

k=1

V ar(Xk).

This finishes the proof. ♠

Theorem - 19.0.3 - (Kolmogrov’s 3-series theorem) Let (Xk) be mutuallyindependent random variables. And let K > 0 (for Kolmogrov) be a fixed constant.Define a truncated sequence (Yj) by

Yj :=

Xj if |Xj | ≤ K0 if |Xj | > K.

Then, the partial sums∑n

j=1 Xj converges almost surely if and only if the followingthree series converge:

• (i)∑∞

j=1 P (Xj 6= Yj) =∑∞

j=1 P (|Xj | > K)

• (ii)∑∞

j=1 E(Yj)

• (iii)∑∞

j=1 V ar(Yj).

Proof: One way the proof is easy. Assume (i), (ii) and (iii) hold for some constantK > 0. Then by item (iii) and Khintchin-Kolmogorov criterion, we have

∞∑

j=1

(Yj − E(Yj)) converges almost surely.

Page 99: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

Kolmogorov’s Three Series Theorem 187

Therefore, by item (ii) it must be that

∞∑

j=1

Yj converges almost surely.

Therefore, by item (i) it must be that

∞∑

j=1

Xj converges almost surely.

Conversely, suppose now that∑n

j=1 Xj converges almost surely. So, Xn =∑n

j=1 Xj−∑n−1j=1 Xj → 0 almost surely. Therefore, for any constant K > 0, P (|Xn| >

K i.o.) = 0. By the independence of events and the second Borel-Cantelli lemma,

∞∑

n=1

P (|Xn| > K) <∞.

Fix a constant K > 0 and this gives part item (i). Hence, the two random sequencesXn and Yn are equivalent implying that the random series

∞∑

j=1

Yj converges almost surely.

Since |Yn| ≤ K, we see that |Yn−E(Yn)| ≤ 2K. Now Kolmogorov’s lower inequalitygives that

P

(

maxn<k≤m

|k∑

i=n+1

Yi| ≤ 1

)

≤ (4A + 4)2∑mi=n+1 V ar(Yi)

.

If the series in the denominator diverged as m increased, then

limm→∞

P

(

maxn<k≤m

|k∑

i=n+1

Yi| ≤ 1

)

= 0.

But the left side is P(supk>n |∑k

i=n+1 Yi| ≤ 1). Hence, for sure the tail∑

i>n Yi

of the random series∑

i Yi does not go to zero, implying that the series∑

i Yi

is not convergent. This contradiction implies that∑

i V ar(Yi) < ∞, giving item(iii). Now Khintchin-Kolmogorov criterion implies that

∑k(Yk − E(Yk)) converges

almost surely as well. Hence item (ii) must hold as well. ♠

Exercise - 19.0.3 - By using Kolmogorov’s three series theorem give another proofof Theorem (18.2.1).

Exercise - 19.0.4 - (Random p-series) Let U1, U2, · · · iid∼ B(1, 12 ). Show that the

random series

Zp :=∞∑

k=1

2Uk − 1

kp

188 Kolmogorov’s Three Series Theorem

converges almost surely if and only if p > 12 . Then write a computer program

that simulates its density and V ar(Zp) for various values of p > 12 . [In particular,

if Xiiid∼ F with 0 < V ar(X1) <∞, then

∑k

Xk√k

does not converge almost surely.]

Exercise - 19.0.5 - Let Xk = 2Uk−1√k

, k = 1, 2, · · · where U1, U2, · · · iid∼ B(1, 12 ).

Prove the following results.

• (i)∑

k Yk diverges almost surely, where Yk = −(Xk + Xk−1), k ≥ 2, Y1 =−Xk.

• (ii)∑

k Zk converges almost surely, where Zk = Xk−Xk−1, k ≥ 2, Z1 = X1.

Exercise - 19.0.6 - (Randomly modulating random series) Let X1,X2, · · · be

any sequence of random variables and let U1(t), U2(t), · · · iid∼ B(1, 12 ) be the dyadic

expansion of t ∈ [0, 1]. Assume Xk and Uk are independent. Prove that thefollowing statements are equivalent.

• (i)∑

k Xk(2Uk − 1) converges almost surely.

• (ii)∑

k X2k converges almost surely.

• (iii)∑

k |Xk|(2Uk − 1) converges almost surely.

Furthermore any one of the above items implies∑

k X2k(2Uk − 1) converges almost

surely, as well as if Xk(j) is any subsequence of Xk then∑

j Xk(j)(2Uj − 1)converges almost surely. Give an example of Xk for which all of the above resultshold but

∑k |Xk| = +∞ almost surely.

Exercise - 19.0.7 - Let X1,X2, · · · be a sequence of independent random variables.Prove that the following statements are equivalent.

• (i)∑

k X2k converges almost surely.

• (ii)∑

k E

(X2

k

1+X2k

)<∞.

Page 100: Distribution Theory & its Summability Perspectivekazim/Istanbul/talks2.pdf14- ORHAN, Cihan Ankara Universitesi¨ 15- SAKAOGLU,˘ ˙Ilknur Ankara Universitesi¨ 16- SOYLU, Elis Ankara

Lecture 20

The Law of Iterated Logarithms

Recall the Borel normal number theorem, which says that $\frac{S_n}{n} \stackrel{a.s.}{\to} 0$ when we take $S_n = \sum_{i=1}^{n} (2U_i - 1)$ with $U_i \stackrel{iid}{\sim} B(1, \frac{1}{2})$. The question arose: what is the rate of convergence? Hausdorff (1913) showed that $S_n \stackrel{a.s.}{=} o(n^{0.5+\epsilon})$ for any $\epsilon > 0$. A year later, Hardy and Littlewood (1914) showed that $S_n \stackrel{a.s.}{=} O((n \log n)^{0.5})$. Steinhaus (1922) improved the constant of Hardy and Littlewood by showing that $\limsup_n \frac{S_n}{\sqrt{2n \log n}} \stackrel{a.s.}{\le} 1$. A year later, Khintchin settled the upper bound issue by proving the following result. Just as Kolmogorov proved his famous inequality to derive his famous results, Khintchin also discovered an inequality for his results, and it has found many other uses and generalizations. We start off with his inequality.¹

Proposition - 20.0.4 - (Khintchin's inequality) Let $R_k = 2U_k - 1$, where $U_1, U_2, U_3, \cdots \stackrel{iid}{\sim} B(1, \frac{1}{2})$. Then for any real numbers $x_1, x_2, \cdots, x_n$ we have
\[
\Big( A_m \sum_{k=1}^{n} x_k^2 \Big)^m \;\le\; E\Big( \sum_{k=1}^{n} x_k R_k \Big)^{2m} \;\le\; \frac{(2m)!}{2^m\, m!} \Big( \sum_{k=1}^{n} x_k^2 \Big)^m, \qquad m = 1, 2, \cdots,
\]
for a constant $A_m$ that depends only on $m$.

Proof: We will only prove the upper inequality, since that is all we need. The proof critically uses the facts that $R_1, R_2, \cdots$ are independent and that
\[
E(R_k^{2m+1}) = 0, \qquad E(R_k^{2m}) = 1, \qquad m = 0, 1, 2, \cdots.
\]
Just note that
\[
E\Big(\sum_{k=1}^{n} x_k R_k\Big)^{2m} = E\Big( \sum \frac{(2m)!}{j_1!\, j_2! \cdots j_r!}\, x_{k_1}^{j_1} x_{k_2}^{j_2} \cdots x_{k_r}^{j_r}\, R_{k_1}^{j_1} R_{k_2}^{j_2} \cdots R_{k_r}^{j_r} \Big),
\]

¹A. Khintchin (1923), Über dyadische Brüche, Math. Z., vol. 18, pp. 109-116.


where $j_1 + j_2 + \cdots + j_r = 2m$ and the sum is over all such terms with $j_i \ge 1$ and $r \ge 1$. Should any of the $j_1, j_2, \cdots, j_r$ be odd, that term drops to zero when its expectation is taken. Hence only the terms in which every exponent is even survive, and for those the expectation of the ``$R$'' factors is the product of their expectations, each of which equals one. So, we may write the above sum as
\[
E\Big(\sum_{k=1}^{n} x_k R_k\Big)^{2m} = \sum \frac{(2m)!}{(2j_1)!\,(2j_2)! \cdots (2j_r)!}\, x_{k_1}^{2j_1} x_{k_2}^{2j_2} \cdots x_{k_r}^{2j_r},
\]
where $2j_1 + 2j_2 + \cdots + 2j_r = 2m$ and the sum is over all such terms with $j_i \ge 1$ and $r \ge 1$. This may be rewritten as
\[
E\Big(\sum_{k=1}^{n} x_k R_k\Big)^{2m} = \sum \frac{(2m)!}{(2j_1)!\,(2j_2)! \cdots (2j_r)!}\, \big(x_{k_1}^2\big)^{j_1} \big(x_{k_2}^2\big)^{j_2} \cdots \big(x_{k_r}^2\big)^{j_r},
\]

where $j_1 + j_2 + \cdots + j_r = m$ and the sum is over all such terms. Hence we may bound this sum as follows:
\begin{align*}
E\Big(\sum_{k=1}^{n} x_k R_k\Big)^{2m}
&= \sum \frac{(2m)!\; j_1!\, j_2! \cdots j_r!}{(2j_1)!\,(2j_2)! \cdots (2j_r)!\; m!} \times \frac{m!}{j_1!\, j_2! \cdots j_r!}\, \big(x_{k_1}^2\big)^{j_1} \big(x_{k_2}^2\big)^{j_2} \cdots \big(x_{k_r}^2\big)^{j_r} \\
&\le \sup_{\substack{r,\, j_1, j_2, \cdots, j_r \ge 1 \\ j_1 + j_2 + \cdots + j_r = m}} \frac{(2m)!\; j_1!\, j_2! \cdots j_r!}{(2j_1)!\,(2j_2)! \cdots (2j_r)!\; m!} \;\; \sum \frac{m!}{j_1!\, j_2! \cdots j_r!}\, \big(x_{k_1}^2\big)^{j_1} \big(x_{k_2}^2\big)^{j_2} \cdots \big(x_{k_r}^2\big)^{j_r} \\
&= \sup_{\substack{r,\, j_1, j_2, \cdots, j_r \ge 1 \\ j_1 + j_2 + \cdots + j_r = m}} \frac{(2m)!\; j_1!\, j_2! \cdots j_r!}{(2j_1)!\,(2j_2)! \cdots (2j_r)!\; m!} \times \Big(\sum_{k=1}^{n} x_k^2\Big)^m.
\end{align*}

To get an upper bound on the ``sup'' term, just note that
\begin{align*}
\frac{(2m)!\; j_1!\, j_2! \cdots j_r!}{(2j_1)!\,(2j_2)! \cdots (2j_r)!\; m!}
&= \frac{(2m)!}{m!} \prod_{\ell=1}^{r} \frac{j_\ell!}{(2j_\ell)!}
 = \frac{(2m)!}{m!} \prod_{\ell=1}^{r} \frac{1}{(j_\ell+1)(j_\ell+2)\cdots(2j_\ell)} \\
&\le \frac{(2m)!}{m!} \prod_{\ell=1}^{r} \frac{1}{(1+1)\cdot 2 \cdots 2}
 = \frac{(2m)!}{m!} \prod_{\ell=1}^{r} \frac{1}{2^{j_\ell}}
 = \frac{(2m)!}{m!} \cdot \frac{1}{2^{j_1+j_2+\cdots+j_r}}
 = \frac{(2m)!}{m!\, 2^m},
\end{align*}
when $j_1 + j_2 + \cdots + j_r = m$.


This finishes the proof. ♠
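Since only the upper bound is used below, a quick Monte Carlo sanity check of it may be helpful; the following rough sketch assumes NumPy and an arbitrary test vector $x$. It estimates $E\big(\sum_k x_k R_k\big)^{2m}$ by simulation and compares it with $\frac{(2m)!}{2^m m!}\big(\sum_k x_k^2\big)^m$; for $m = 1$ the two sides coincide, since the bound reduces to $\sum_k x_k^2$.
\begin{verbatim}
import math
import numpy as np

def khintchin_upper_check(x, m, n_samples=200_000, seed=0):
    """Compare a Monte Carlo estimate of E(sum_k x_k R_k)^(2m) with the
    Khintchin upper bound (2m)!/(2^m m!) * (sum_k x_k^2)^m."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    signs = rng.choice([-1.0, 1.0], size=(n_samples, x.size))  # iid Rademacher rows
    moment = np.mean((signs @ x) ** (2 * m))
    bound = math.factorial(2 * m) / (2 ** m * math.factorial(m)) * (x @ x) ** m
    return moment, bound

if __name__ == "__main__":
    x = [1.0, 0.5, 0.25, 2.0, 0.1]
    for m in (1, 2, 3):
        moment, bound = khintchin_upper_check(x, m)
        print(f"m={m}: estimated moment {moment:.3f} <= bound {bound:.3f}")
\end{verbatim}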

We could bypass the use of the following inequality; however, we have elected this route to highlight a use of Khintchin's inequality. The proof of this inequality is based on the proof of Kolmogorov's inequality, and we leave it for the reader (Exercise (20.0.8)):
\[
E\Big( \max_{1 \le k \le n} |S_k|^p \Big) \le \Big( \frac{p}{p-1} \Big)^p E\big( |S_n|^p \big), \qquad p > 1.
\]
Note that $(\frac{p}{p-1})^p \le 4$ for any $p \ge 2$.

Theorem - 20.0.4 - (Khintchin's LIL, 1923) Let $U_1, U_2, \cdots \stackrel{iid}{\sim} B(1, \frac{1}{2})$ and let $R_k = 2U_k - 1$ be the (so-called) Rademacher functions. If $S_n = \sum_{k=1}^{n} R_k$, then
\[
\limsup_n \Big| \frac{S_n}{\sqrt{2n \log\log n}} \Big| \le 1, \quad a.s.
\]

Proof: Let $L_n := \sqrt{2n \log\log n}$. We will prove the statement by showing that the contrary holds only with probability zero. In other words, we will show that
\[
P\Big( \limsup_n \frac{|S_n|}{L_n} > 1 \Big) = 0.
\]
This is equivalent to showing that, for any $\varepsilon > 0$,
\[
P\Big( \limsup_n \frac{|S_n|}{L_n} > (1+\varepsilon) \Big) = 0.
\]
Setting $c^2 := 1 + \varepsilon$, so that $1 < c < 1 + \varepsilon$, it therefore suffices to show that
\[
P\Big( \limsup_n \frac{|S_n|}{L_n} > c \Big) = 0.
\]

Define
\[
S_n := \sum_{k=1}^{n} R_k, \qquad B_n = \mathrm{Var}(S_n) = \sum_{k=1}^{n} 1 = n, \qquad S_n^* := \max_{1 \le k \le n} |S_k|.
\]
Note that $\log\log n$ is positive for all $n$ greater than some $n_0$, which we will take to be the case. Also, consider all large $k > n_0$ so that $c^k$ is strictly increasing (which happens after some point $n_{00}$), and define
\[
E_k := \bigcup_{n=c^{k-1}+1}^{c^k} \Big\{ \frac{|S_n|}{L_n} > (1+\varepsilon) \Big\}
     = \Big\{ \sup_{c^{k-1} < n \le c^k} \frac{|S_n|}{L_n} > (1+\varepsilon) \Big\}.
\]

Note that $c^k = c\, c^{k-1}$ and
\[
\frac{L_{c^k}}{L_{c^{k-1}}} = \frac{\sqrt{c^k \log\log c^k}}{\sqrt{c^{k-1} \log\log c^{k-1}}}
\le \sqrt{c}\, \Big( \frac{\log\log c + \log k}{\log\log c + \log(k-1)} \Big)^{1/2} \to \sqrt{c}.
\]


Since $\sqrt{c} < c$, this gives that for all large $k$ we have $L_{c^k} \le c\, L_{c^{k-1}}$. We consider only such $k$'s. For such $k$'s we see that
\begin{align*}
E_k &= \Big\{ \sup_{c^{k-1} < n \le c^k} \frac{|S_n|}{L_n} > (1+\varepsilon) \Big\}
     = \Big\{ \sup_{c^{k-1} < n \le c^k} \frac{|S_n|}{L_n} > c^2 \Big\} \\
    &\subseteq \Big\{ \sup_{c^{k-1} < n \le c^k} \frac{|S_n|}{L_{c^{k-1}}} > c^2 \Big\}
     \subseteq \big\{ S^*_{c^k} > c^2 L_{c^{k-1}} \big\}
     \subseteq \big\{ S^*_{c^k} > c\, L_{c^k} \big\} =: G_k.
\end{align*}

It will be enough to show that
\[
\sum_{k=1}^{\infty} P(G_k) < \infty \quad \text{for every } \varepsilon > 0,
\]
since then the Borel-Cantelli lemma will give that $P(E_k \text{ i.o.}) \le P(G_k \text{ i.o.}) = 0$ for every $\varepsilon > 0$. Note that, for any real $a$,

\begin{align*}
E\big(e^{a S_n^*}\big)
&\le 2\, E\Big( \frac{e^{a S_n^*} + e^{-a S_n^*}}{2} \Big) \\
&= 2\Big( 1 + \frac{a^2 E\big((S_n^*)^2\big)}{2!} + \frac{a^4 E\big((S_n^*)^4\big)}{4!} + \frac{a^6 E\big((S_n^*)^6\big)}{6!} + \cdots \Big) \\
&\le 8\Big( 1 + \frac{a^2 E(S_n^2)}{2!} + \frac{a^4 E(S_n^4)}{4!} + \frac{a^6 E(S_n^6)}{6!} + \cdots \Big) \\
&\le 8\Big( 1 + \sum_{m=1}^{\infty} \frac{a^{2m} (2m)!\, B_n^m}{(2m)!\, 2^m\, m!} \Big), \qquad \text{Khintchin's inequality with } B_n = n, \\
&= 8\Big( 1 + \sum_{m=1}^{\infty} \frac{(a^2 n/2)^m}{m!} \Big) = 8\, e^{a^2 n / 2}.
\end{align*}
Here the second inequality used the maximal inequality stated before the theorem, together with $(\frac{p}{p-1})^p \le 4$ for $p = 2m \ge 2$.

So, by Markov’s inequality, for any a > 0 we have

P (Gk) ≤E

(eaS∗

ck

)

eacLck

≤ 8ea2ck/2

eacLck

.

The above is true for any choice of $a > 0$. Now take $a = c L_{c^k} / c^k$ and get
\begin{align*}
P(G_k) &\le 8\, e^{c^2 (L_{c^k})^2 / (2 c^k)}\; e^{-c^2 (L_{c^k})^2 / c^k}
      = 8\, e^{-c^2 (L_{c^k})^2 / (2 c^k)} \\
      &= 8\, e^{-c^2 \log\log(c^k)}
      = 8\, e^{-c^2 \log(k \log c)}
      = \frac{8}{(k \log c)^{c^2}}.
\end{align*}
Since $c > 1$, $\sum_k P(G_k) < \infty$. ♠
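To get a feel for the normalization $\sqrt{2n\log\log n}$, the rough sketch below (assuming NumPy is available) follows one long Rademacher walk and records the running maximum of $|S_n|/\sqrt{2n\log\log n}$. The theorem says the lim sup of this ratio is at most $1$ (in fact equal to $1$, by the next theorem), but on a finite path the ratio typically approaches $1$ only very slowly.
\begin{verbatim}
import numpy as np

def lil_ratio(n_steps=1_000_000, seed=0):
    """One Rademacher walk S_n and the ratio |S_n| / sqrt(2 n log log n),
    reported for n >= 16 so that log log n is safely positive."""
    rng = np.random.default_rng(seed)
    s = np.cumsum(rng.choice([-1, 1], size=n_steps))
    n = np.arange(1, n_steps + 1)
    keep = n >= 16
    n = n[keep]
    ratio = np.abs(s[keep]) / np.sqrt(2 * n * np.log(np.log(n)))
    return n, ratio

if __name__ == "__main__":
    n, ratio = lil_ratio()
    for cut in (10**3, 10**4, 10**5, 10**6):
        print(f"max of |S_n|/sqrt(2 n log log n) over 16 <= n <= {cut}: "
              f"{ratio[n <= cut].max():.3f}")
\end{verbatim}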

Within a year,² Khintchin settled the whole problem by showing that equality actually holds in his earlier result.

Theorem - 20.0.5 - (Khintchin's LIL, 1924) Let $U_1, U_2, \cdots \stackrel{iid}{\sim} B(1, \frac{1}{2})$ and let $R_k = 2U_k - 1$ be the (so-called) Rademacher functions. If $S_n = \sum_{k=1}^{n} R_k$, then
\[
\limsup_n \frac{|S_n|}{\sqrt{2n \log\log n}} = 1, \quad a.s.
\]

Proof: By Khintchin’s LIL applied to −(2Uk − 1), we see that

− lim infn

Sn√2n log log n

= lim supn

−Sn√2n log log n

≤ lim supn

|Sn|√2n log log n

a.s.≤ 2.

Therefore, for all but finitely many n, we can say that

Sn ≥ −2√

2n log log n, a.s.

Let $A_1$ be the set with $P(A_1) = 1$ on which the above result holds for every $\omega \in A_1$.

Now we modify the argument of Khintchin's proof in the reverse direction, with $(1+\varepsilon)$ replaced by $(1-\varepsilon)$, hoping to use the second Borel-Cantelli lemma, for which we need independent events. So the old events $E_k$ or $G_k$ from Khintchin's proof won't do. Instead, for a constant $0 < \beta < 1$, yet unspecified, consider the events
\[
D_k := \Big\{ S_{n_k} - S_{n_{k-1}} > \beta \sqrt{2(n_k - n_{k-1}) \log\log(n_k - n_{k-1})} \Big\}, \qquad k = 2, 3, \cdots,
\]

where we take $n_k = c^k$, and $c > 1$ is still unspecified. Note that $D_2, D_3, \cdots$ are independent events. For the moment, assume that we can prove that $\sum_k P(D_k) = +\infty$. Then, by the second Borel-Cantelli lemma, it must be that $P(D_k \text{ i.o.}) = 1$. So there exists a set $A_2$ with $P(A_2) = 1$ such that for each $\omega \in A_2$ infinitely many $D_k$ occur.

Hence, over $A_1 \cap A_2$, which is an event with probability one, we see that

\begin{align*}
S_{n_k} &\ge S_{n_{k-1}} + \beta \sqrt{2(n_k - n_{k-1}) \log\log(n_k - n_{k-1})} \\
&\ge -2\sqrt{2 n_{k-1} \log\log n_{k-1}} + \beta \sqrt{2(n_k - n_{k-1}) \log\log(n_k - n_{k-1})} \\
&\ge \sqrt{2 n_k \log\log n_k}\, \Bigg( -\frac{2}{\sqrt{c}} + \beta \sqrt{\Big(1 - \frac{1}{c}\Big) \frac{\log\log\big(c^{k-1}(c-1)\big)}{\log k + \log\log c}} \Bigg) \\
&\ge \sqrt{2 n_k \log\log n_k}\, \Bigg( -\frac{2}{\sqrt{c}} + \beta \Big(1 - \frac{\varepsilon}{2}\Big) \sqrt{\Big(1 - \frac{1}{c}\Big)} \Bigg),
\end{align*}

²Khintchin, A. (1924), Über einen Satz der Wahrscheinlichkeitsrechnung, Fund. Math., vol. 6, pp. 9-20.


for all $k$ large enough so that $\frac{\log\log(c^{k-1}(c-1))}{\log k + \log\log c} \ge (1 - \frac{\varepsilon}{2})^2$. Now we fix the constants $\beta \in (0,1)$ and $c > 1$, which are picked so that
\[
-\frac{2}{\sqrt{c}} + \beta \Big(1 - \frac{\varepsilon}{2}\Big) \sqrt{\Big(1 - \frac{1}{c}\Big)} > 1 - \varepsilon.
\]
This is certainly possible when $\beta$ is close enough to $1$ and $c$ is large enough. Because of this we see that $|S_n| \ge S_n > (1-\varepsilon)\sqrt{2n \log\log n}$ infinitely often with probability one. So we are left only with the task of showing that $\sum_k P(D_k) = +\infty$ for any choice of $0 < \beta < 1$ and any $c > 1$. There are several techniques for finding a lower bound on the probability $P\big(\sum_{k=n+1}^{m} (2U_k - 1) > t\big)$ when $m > n$, but none is trivial. After proving the central limit theorem we will use it to derive this result. See Exercise (20.0.10) for the normal case. ♠

HW47 Exercise - 20.0.8 - (An extension of the Hajek-Renyi inequality) Let $S_n = X_1 + X_2 + \cdots + X_n$, $n \ge 1$, be the sequence of partial sums of independent random variables having mean zero and finite $p$-th moment, where $p > 1$ is a fixed constant. Let $c_n$ be a sequence of non-increasing positive numbers. Then for any $n > m$ we have
\[
P\Big( \max_{m \le k \le n} c_k |S_k| \ge \epsilon \Big) \le \frac{1}{\epsilon^p} \Big( c_n^p\, E|S_n|^p + \sum_{k=m}^{n-1} (c_k^p - c_{k+1}^p)\, E|S_k|^p \Big).
\]
For $c_k = 1$, if $S_n^* = \max_{1 \le k \le n} |S_k|$, then show that
\[
P(S_n^* > \epsilon) \le \frac{1}{\epsilon}\, E\big( |S_n|\, \chi_{\{S_n^* > \epsilon\}} \big), \qquad E\big( |S_n^*|^p \big) \le \Big( \frac{p}{p-1} \Big)^p E\big( |S_n|^p \big).
\]

HW48 Exercise - 20.0.9 - Show that $E(X^{2m}) = \frac{(2m)!}{2^m\, m!}$ when $X \sim N(0,1)$. Furthermore, for any $t > 0$ prove the following inequalities:
\[
\frac{t}{1+t^2}\, e^{-t^2/2} \;\le\; \int_t^{\infty} e^{-u^2/2}\, du \;\le\; \frac{1}{t}\, e^{-t^2/2}.
\]
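A quick numerical check of both facts in this exercise is sketched below, using only the Python standard library; the tail integral is computed exactly as $\int_t^\infty e^{-u^2/2}\,du = \sqrt{\pi/2}\,\operatorname{erfc}(t/\sqrt{2})$, and the even moments are estimated by Monte Carlo.
\begin{verbatim}
import math
import random

# Monte Carlo check of E(X^{2m}) = (2m)!/(2^m m!) for X ~ N(0,1).
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(500_000)]
for m in (1, 2, 3):
    mc = sum(x ** (2 * m) for x in samples) / len(samples)
    exact = math.factorial(2 * m) / (2 ** m * math.factorial(m))
    print(f"m={m}: Monte Carlo {mc:.2f} vs (2m)!/(2^m m!) = {exact:.0f}")

# Check of t/(1+t^2) e^{-t^2/2} <= int_t^oo e^{-u^2/2} du <= (1/t) e^{-t^2/2}.
for t in (0.5, 1.0, 2.0, 4.0):
    tail = math.sqrt(math.pi / 2) * math.erfc(t / math.sqrt(2))
    lower = t / (1 + t * t) * math.exp(-t * t / 2)
    upper = math.exp(-t * t / 2) / t
    print(f"t={t}: {lower:.6f} <= {tail:.6f} <= {upper:.6f}")
\end{verbatim}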

HW49 Exercise - 20.0.10 - (LIL for the normal case) Prove the law of the iterated logarithm when $R_i \stackrel{iid}{\sim} N(0, \sigma^2)$, namely
\[
\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} \stackrel{a.s.}{=} \sigma, \qquad \liminf_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} \stackrel{a.s.}{=} -\sigma,
\]
by proving the following steps.

• (i) Explain why it is sufficient to assume that $\sigma = 1$.

• (ii) Show that Khintchin's proof of the upper inequality goes through verbatim without using Khintchin's inequality, after proving that $S_n/\sqrt{n}$ is distributed as a standard normal random variable and using its moments as given in Exercise (20.0.9).


• (iii) Show that Khintchin's lower bound proof goes through almost verbatim as well, after proving that $S_m - S_n = \sum_{i=n+1}^{m} R_i \sim N(0, m-n)$ and using the lower inequality for $P(S_m - S_n > t\sqrt{m-n})$ from Exercise (20.0.9).

Remark - 20.0.3 - (LIL of Kolmogorov, 1929) Kolmogorov then extended Khintchin's result by allowing non-identically distributed random variables into the picture, and proved the following version of the LIL.³

Let $R_1, R_2, \cdots$ be a sequence of independent random variables with mean zero and $\mathrm{Var}(R_k) = \sigma_k^2$. Let $S_n = \sum_{k=1}^{n} R_k$ and $B_n^2 = \mathrm{Var}(S_n) = \sum_{k=1}^{n} \sigma_k^2$. If the random variables $R_k$ are bounded, with
\[
|R_n| = o\Big( \frac{B_n}{\sqrt{\log\log B_n}} \Big), \tag{0.1}
\]
then the LIL holds (applying the result to both $R_k$ and $-R_k$):
\[
\limsup_n \frac{S_n}{\sqrt{2 B_n^2 \log\log B_n}} \stackrel{a.s.}{=} 1, \qquad \liminf_n \frac{S_n}{\sqrt{2 B_n^2 \log\log B_n}} \stackrel{a.s.}{=} -1.
\]
Or, stating both parts together,
\[
\limsup_n \frac{|S_n|}{\sqrt{2 B_n^2 \log\log B_n}} \stackrel{a.s.}{=} 1.
\]

Marcinkiewicz and Zygmund⁴ showed that condition (0.1) cannot be replaced by a big-``O'' version of it.
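For instance, in the iid uniformly bounded case $|R_k| \le C$ with $\mathrm{Var}(R_k) = \sigma^2 > 0$, we have $B_n = \sigma\sqrt{n} \to \infty$, hence $B_n/\sqrt{\log\log B_n} \to \infty$ and condition (0.1) holds automatically. In particular, for the Rademacher variables $R_k = 2U_k - 1$ one has $B_n^2 = n$ and $\log\log B_n = \log\log n - \log 2 \sim \log\log n$, so Kolmogorov's theorem recovers Khintchin's LIL (Theorem 20.0.5) up to this asymptotically negligible change in the normalizing sequence.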

Remark - 20.0.4 - (LIL of Hartman and Wintner, 1941) Hartman and Wintner showed that Kolmogorov's condition (0.1) could indeed be relaxed considerably if one adds the assumption that the random variables $R_1, R_2, \cdots$ are iid. In fact, they proved the following result for the non-iid case:⁵ Let $R_1, R_2, \cdots$ be a sequence of independent random variables with mean zero and $\mathrm{Var}(R_k) = \sigma_k^2$. Let $S_n = \sum_{k=1}^{n} R_k$ and $B_n = \mathrm{Var}(S_n) = \sum_{k=1}^{n} \sigma_k^2$ be such that $\frac{B_n}{n} > c$ for some $c > 0$. Then
\[
\limsup_n \frac{S_n}{\sqrt{2 B_n \log\log B_n}} = 1, \quad a.s.,
\]
provided the $R_k$ are random variables whose distributions have a common (uniform) tail bound by another random variable $Y$ with finite variance:
\[
\sup_n P(|R_n| \ge r) = O\big( P(|Y| \ge r) \big), \quad \text{as } r \to \infty.
\]
A new proof of this result was provided by Alejandro de Acosta.⁶

³A. Kolmogorov (1929), Über das Gesetz des iterierten Logarithmus, Math. Annalen, vol. 101, pp. 126-135.

⁴Marcinkiewicz and Zygmund (1937), Remarque sur la loi du logarithme itéré, Fund. Math., vol. 29, pp. 215-222.

⁵Philip Hartman and Aurel Wintner (1941), Amer. J. Math., vol. 63, no. 1, pp. 169-176.

⁶Alejandro de Acosta (1983), A New Proof of the Hartman-Wintner Law of the Iterated Logarithm, Ann. Prob., vol. 11, no. 2, pp. 270-276.


Remark - 20.0.5 - (Fine tuning of the LIL by Slivka) John Slivka was a Ph.D. student of Norman C. Severo at the State University of New York at Buffalo. In his Ph.D. dissertation he considered the following type of fine tuning.

As seen in the above discussion, if $S_n = \sum_{k=1}^{n} R_k$, where the $R_k$ are independent with mean zero and $B_n = \sum_{k=1}^{n} \mathrm{Var}(R_k)$, and if we let $b_\varepsilon(n) = (1+\varepsilon)(2 B_n \log\log B_n)^{1/2}$ for $n \ge 3$, the logarithm having base $e$, then the event $\{S_n > b_\varepsilon(n)\}$ occurs only finitely many times, almost surely. Hence we may define a sequence of $0$'s and $1$'s by
\[
Y_n(\varepsilon) := \begin{cases} 1 & \text{if } S_n > b_\varepsilon(n), \\ 0 & \text{otherwise.} \end{cases}
\]
So, for any $\varepsilon > 0$, $N(\varepsilon) := \sum_{n=1}^{\infty} Y_n(\varepsilon)$ is a finite function (a random variable). Slivka proved that whenever the LIL holds, then for any $\varepsilon > 0$,
\[
E\big( N(\varepsilon) \big) = +\infty.
\]
In fact, he showed something even more remarkable. He proved that for any $\lambda > 0$,
\[
E\big( N(\varepsilon)^{\lambda} \big) = +\infty.
\]
This result appeared in Slivka's 1969 paper.⁷

Remark - 20.0.6 - (The current state of affairs) To this day, various improvements and variants of the LIL continue to appear in the literature in different settings, and no doubt finer details will appear in the future. One finely tuned result is an improvement of Kolmogorov's theorem due to Feller. It goes as follows. Let $S_n = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are independent random variables with $E(X_i) = 0$ and $\sigma_i^2 = \mathrm{Var}(X_i)$. Let $B_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$. For any increasing sequence of positive numbers $\varphi_n$, the following results hold (see the worked example after this list):

• (a) If $\sum_k \frac{\varphi_k}{k}\, e^{-\varphi_k^2/2} < \infty$, then $P(S_n > B_n \varphi_n \text{ i.o.}) = 0$.

• (b) If $\sum_k \frac{\varphi_k}{k}\, e^{-\varphi_k^2/2} = \infty$, then $P(S_n > B_n \varphi_n \text{ i.o.}) = 1$.
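As a quick illustration of how this test recovers the earlier results (glossing over the side conditions under which Feller's refinement applies), take the Rademacher case $\sigma_i^2 = 1$, so $B_n = \sqrt{n}$, and $\varphi_k = (1+\varepsilon)\sqrt{2\log\log k}$ for large $k$. Then $e^{-\varphi_k^2/2} = (\log k)^{-(1+\varepsilon)^2}$, so
\[
\sum_k \frac{\varphi_k}{k}\, e^{-\varphi_k^2/2} = \sum_k \frac{(1+\varepsilon)\sqrt{2\log\log k}}{k\,(\log k)^{(1+\varepsilon)^2}},
\]
which, by the integral test (substituting $u = \log k$), converges exactly when $(1+\varepsilon)^2 > 1$, that is, for $\varepsilon > 0$, and diverges when $\varepsilon = 0$. Since $B_n \varphi_n = (1+\varepsilon)\sqrt{2n\log\log n}$, items (a) and (b) say that $S_n$ exceeds $(1+\varepsilon)\sqrt{2n\log\log n}$ only finitely often for every $\varepsilon > 0$, yet exceeds $\sqrt{2n\log\log n}$ infinitely often, in agreement with Theorems 20.0.4 and 20.0.5.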

⁷John Slivka (1969), On the Law of the Iterated Logarithm, Proc. National Academy of Sci., vol. 63, pp. 289-291.