

Inequalities on Probability Spaces

Sopassakis Pantelis

November 2, 2011


Contents

Preface

1 Probability Spaces
1.1 Introduction
1.2 Probability spaces
1.3 Expectation Operator

2 Inequalities on Probability Spaces
2.1 Boole's Inequality
2.2 Cauchy-Schwarz and Related Inequalities
2.3 Cramér-Rao Bound
2.4 Markov's and Chebyshev's Inequalities
2.5 Jensen's Inequality
2.6 Chernoff-Cramér and Bennett's Bound
2.7 Various Inequalities


Preface

This essay is concerned with the study of inequalities on probability spaces. Mathematical structures such as equalities incorporate an explicitness that is not consistent with the vagueness that characterizes probability and real phenomena. Such inequalities are a powerful tool for mathematicians, engineers and everyone dealing with probability theory, statistics, measure spaces, information theory and many other scientific fields. The essay consists of two chapters. In the first one, I try to summarize some points of probability and measure theory and define the expectation operator. Its properties are discussed and some useful propositions are proved. In the second chapter, various inequalities are discussed. Some proofs were omitted, although the reader may consult the references.

The whole text was typed in LaTeX.

Sopassakis P., Chemical Engineer


Chapter 1

Probability Spaces

1.1 Introduction

Throughout this essay, I will try to present some of the most popular inequalities on probability spaces and explain their practical significance. Boole's, Bonferroni's, Kolmogorov's, Cramér-Rao, Hoeffding's, Azuma's and other inequalities will be discussed. At first, various definitions will be given from the theory of probability spaces and measure theory.

1.2 Probability spaces

We first give the definition of a measure:

Measure: Let Ω be a nonempty set and F a σ-algebra on Ω. The pair (Ω,F) is called a measurable space. A map µ : F → [0,+∞] is called a measure if the following hold:

1. µ(∅) = 0

2. For all Aᵢ ∈ F with Aᵢ ∩ Aⱼ = ∅ for i ≠ j:

$$\mu\Big(\bigcup_{i\in I} A_i\Big) = \sum_{i\in I}\mu(A_i).$$

The triplet (Ω,F, µ) is called a measure space.

Probability Space: A measure space (Ω,F, µ) is called a probability space and µ a probability measure if µ(Ω) = 1. Hereinafter we will denote the probability measure by P.

In stochastic models, the notion of random variables is crucial. The term 'variable' is somewhat misleading because, as we will see, random variables are maps from a nonempty set Ω to ℝ which describe certain random phenomena. To introduce random variables we first need the notion of a measurable step function.


Measurable Step Function: Let (Ω,F) be a measurable space. A map f : Ω → ℝ is called a measurable step function, or just a step function, if there exist a₁, a₂, ..., aₙ ∈ ℝ and A₁, A₂, ..., Aₙ ∈ F such that f can be represented as

$$f(\omega) = \sum_{i=1}^{n} a_i I_{A_i}(\omega),$$

where I_{Aᵢ} is the characteristic (or indicator) function of Aᵢ, that is, I_{Aᵢ}(ω) equals 1 for every ω ∈ Aᵢ and 0 if ω ∉ Aᵢ.

We now introduce the notion of random variables in order to study stochastic processes. The goal is the computation of expressions like

$$P\big(\{\omega\in\Omega : f(\omega)\in(a,b)\}\big), \quad a \le b,$$

and for this we require that

$$\{\omega\in\Omega : f(\omega)\in(a,b)\} \in \mathcal{F}.$$

Random Variable: Let (Ω,F) be a measurable space. A map f : Ω → ℝ is called a random variable if there is a sequence (fᵢ)ᵢ∈ℕ of measurable step functions fᵢ : Ω → ℝ such that

$$f(\omega) = \lim_{i\to\infty} f_i(\omega) \quad \text{for every } \omega\in\Omega.$$

The following proposition characterizes random variables:

Proposition: Let (Ω,F) be a measurable space and f : Ω → ℝ a map. The following are equivalent:

1. f is a random variable;

2. for every a, b ∈ ℝ with a < b we have that

$$f^{-1}((a,b)) = \{\omega\in\Omega : f(\omega)\in(a,b)\} \in \mathcal{F}.$$

Proof: The proof of (1) ⇒ (2) is a direct consequence of the definition of random variables. Assume that

$$f(\omega) = \lim_{i\to\infty} f_i(\omega),$$

where (fᵢ)ᵢ is a sequence of measurable step functions. For every fₙ we know that f_n^{-1}((a,b)) ∈ F. Therefore

$$f^{-1}((a,b)) = \Big\{\omega\in\Omega : \lim_{n\to\infty} f_n(\omega)\in(a,b)\Big\} = \bigcup_{m=1}^{\infty}\bigcup_{N=1}^{\infty}\bigcap_{n=N}^{\infty}\Big\{\omega\in\Omega : a+\tfrac{1}{m} < f_n(\omega) < b-\tfrac{1}{m}\Big\} \in \mathcal{F}.$$

We are now going to prove that (2) ⇒ (1). The proof is based on the observation that

$$f^{-1}([a,b)) = \bigcap_{m=1}^{\infty}\Big\{\omega\in\Omega : a-\tfrac{1}{m} < f(\omega) < b\Big\} \in \mathcal{F},$$

so f can be represented as the pointwise limit of the step functions

$$f_n(\omega) = \sum_{k=-4^n}^{4^n-1}\frac{k}{2^n}\, I_{\{\frac{k}{2^n}\le f < \frac{k+1}{2^n}\}}(\omega),$$

which converge to f(ω) for every ω ∈ Ω.

1.3 Expectation Operator

In this section we define the expectation (also called integration) operator of a random variable on a probability space (Ω,F, µ) and discuss its basic properties. We give the definition of the expectation operator right away:

Expectation Operator: Given a probability space (Ω,F, µ) and an F-measurable mapping f : Ω → ℝ with step-function representation

$$f = \sum_{i=1}^{n} a_i I_{A_i},$$

we define the expectation operator E as

$$Ef = \int_\Omega f\,dP = \int_\Omega f(\omega)\,dP(\omega) = \sum_{i=1}^{n} a_i P(A_i).$$

Basic Properties of E: We say that a property Q(ω) holds P-almost surely, or just almost surely, if the set {ω ∈ Ω : Q(ω) holds} is in F and is of unit measure. The operator E has various useful properties. The basic properties of E on a probability space (Ω,F, µ) are:

1. E is a linear operator, that is, E(αf + βg) = αEf + βEg for every α, β ∈ ℝ.

2. If 0 ≤ f(ω) ≤ g(ω) then 0 ≤ Ef ≤ Eg.

3. The random variable f is integrable iff |f| is integrable. In this case |Ef| ≤ E|f|.


4. If almost surely f = 0 then Ef = 0

5. If almost surely f ≥ 0 and Ef = 0 then almost surely f = 0

6. If almost surely f = g and Ef exists then Ef = Eg

Proof:

1. Let (Ω,F, µ) be a probability space and f, g : Ω → ℝ be measurable step functions. Given two real numbers α, β we have to prove that

$$E(\alpha f + \beta g) = \alpha Ef + \beta Eg.$$

We have

$$f = \sum_{i=1}^{n}\kappa_i I_{A_i} \quad\text{and}\quad g = \sum_{j=1}^{m}\lambda_j I_{B_j}.$$

Hence

$$\alpha f + \beta g = \alpha\sum_{i=1}^{n}\kappa_i I_{A_i} + \beta\sum_{j=1}^{m}\lambda_j I_{B_j},$$

which is again a step function. Applying the expectation operator term by term we get E(αf + βg) = αEf + βEg.

2. This property is a direct consequence of the definition of E.

3. This property is shown by means of the definition of an integrable function. We say that a random variable f is integrable if both Ef⁺ and Ef⁻ are finite, where f⁺(ω) = max[f(ω), 0] and f⁻(ω) = max[−f(ω), 0]. Since

$$\{\omega\in\Omega : f^+(\omega)\neq 0\} \cap \{\omega\in\Omega : f^-(\omega)\neq 0\} = \emptyset$$

and since both of these sets are measurable, it follows that |f| = f⁺ + f⁻ is integrable iff f⁺ and f⁻ are integrable, and that

$$|Ef| = |Ef^+ - Ef^-| \le Ef^+ + Ef^- = E|f|.$$

4. If almost surely f = 0 then almost surely f⁺ = 0 and f⁻ = 0, thus without loss of generality we may consider that f(ω) ≥ 0. Consider a measurable step function g = ∑ₖ₌₁ⁿ αₖI_{Aₖ} with 0 ≤ g ≤ f. Since almost surely f = 0, we have almost surely g = 0, so whenever αₖ ≠ 0 it holds that P(Aₖ) = 0 and hence Eg = 0. Thus

$$Ef = \sup\{Eg : 0 \le g \le f,\ g \text{ a step function}\} = 0.$$


The operator E preserves the order: Let (Ω,F, µ) be a probability space and f, g : Ω → ℝ be random variables such that Ef and Eg both exist. If almost surely f ≤ g, then Ef ≤ Eg.

Proof: We first assume that both Ef⁻ and Eg⁺ are finite. We have that f ≤ g, hence 0 ≤ f⁺ ≤ g⁺ and 0 ≤ g⁻ ≤ f⁻. Thus

$$Ef = Ef^+ - Ef^- \le Eg^+ - Eg^- = Eg.$$

If one of Ef⁻, Eg⁺ is infinite, then Ef = −∞ or Eg = +∞ and the inequality is obvious.

We will state the following proposition without proving it, since the proof exceeds the purpose of this essay.

Proposition: Let (Ω,F, µ) be a probability space and f, g : Ω → ℝ be random variables such that Ef and Eg both exist. If Ef⁺ + Eg⁺ < ∞ or Ef⁻ + Eg⁻ < ∞, then E(f + g)⁺ < ∞ or E(f + g)⁻ < ∞, respectively, and E(f + g) = Ef + Eg.

Proof: See Christel Geiss and Stefan Geiss, 'An introduction to probability theory' [GEI].

Proposition: Let (Ω,F, µ) be a probability space and g, f₁, f₂, ... : Ω → ℝ be random variables, where g is integrable. If one of the following holds:

1. g(ω) ≤ fₙ(ω) ↑ f(ω) almost surely, or

2. g(ω) ≥ fₙ(ω) ↓ f(ω) almost surely,

then limₙ→∞ Efₙ = Ef.

Proof: Let us consider that (1) holds. We define the sequences hₙ = fₙ − g and h = f − g. Then almost surely

$$0 \le h_n(\omega) \uparrow h(\omega),$$

so by monotone convergence limₙ Ehₙ = Eh. Since fₙ⁻ and f⁻ are integrable, Ehₙ = Efₙ − Eg and Eh = Ef − Eg, and the claim follows. If we consider that (2) holds, the proof is exactly the same.

Having defined and described the expectation operator, we can define the notion of the variance of a random variable.


Variance: Let (Ω,F, µ) be a probability space and f : Ω → ℝ a random variable. Then we define σ² = E[(f − Ef)²]. This is called the variance of f.

Lemma of Fatou: Let (Ω,F, µ) be a probability space and g, f₁, f₂, ... : Ω → ℝ be random variables with |fₙ(ω)| ≤ g(ω) almost surely, where g is integrable. Then lim supₙ fₙ and lim infₙ fₙ are integrable random variables and

$$E\liminf_{n\to\infty} f_n \le \liminf_{n\to\infty} Ef_n \le \limsup_{n\to\infty} Ef_n \le E\limsup_{n\to\infty} f_n.$$

Proof: From the definitions of supremum and infimum it is obvious that

$$\liminf_{n\to\infty} Ef_n \le \limsup_{n\to\infty} Ef_n.$$

We will prove that

$$E\liminf_{n\to\infty} f_n \le \liminf_{n\to\infty} Ef_n.$$

We define the random variable

$$Z_k = \inf_{n\ge k} f_n.$$

Then Zₖ ↑ lim infₙ fₙ, and almost surely |Zₖ| ≤ g and |lim infₙ fₙ| ≤ g. Applying the previous proposition, one has that

$$E\liminf_{n\to\infty} f_n = \lim_k EZ_k = \lim_k E\Big(\inf_{n\ge k} f_n\Big) \le \lim_k\Big(\inf_{n\ge k} Ef_n\Big) = \liminf_{n\to\infty} Ef_n.$$

In the same way we prove that

$$\limsup_{n\to\infty} Ef_n \le E\limsup_{n\to\infty} f_n.$$


Chapter 2

Inequalities on Probability Spaces

In this chapter we discuss the most significant inequalities on probability spaces. In probability theory in general, we introduce the notion of the probability P(E) that an event E ∈ F occurs. Equalities incorporate an explicitness which is not consistent with that vagueness. In many cases, probability inequalities can deal with difficult problems more easily than equalities, yielding valuable results.

2.1 Boole's Inequality

Probably the simplest inequality, though a very useful one, Boole's inequality (named after George Boole) is also known as the union bound, since it places a bound on the probability of the union of a countable set of events. Boole's inequality says that for any countable set of events (Aᵢ)ᵢ ⊆ F, the probability that at least one happens is no greater than the sum of the probabilities of the individual events. Formally,

$$P\Big(\bigcup_{i\in I} A_i\Big) \le \sum_{i\in I} P(A_i),$$

where I is a countable set of indices.

Proof: The proof follows from the fact that every measure, and thus any probability measure, is σ-subadditive.
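As a quick numerical illustration (added here, not part of the original essay), the following Python sketch estimates the probability of a union of overlapping events by Monte Carlo and compares it with the sum of the individual probabilities; the sample space, events and random seed are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of Boole's inequality (union bound).
# Sample space: omega uniform on [0, 1); the events A_i are arbitrary sub-intervals.
rng = np.random.default_rng(0)
omega = rng.random(100_000)

events = [
    omega < 0.30,                        # A_1
    omega > 0.75,                        # A_2
    (omega > 0.25) & (omega < 0.40),     # A_3 (overlaps A_1)
]

p_union = np.mean(np.logical_or.reduce(events))   # estimate of P(union A_i)
sum_p = sum(np.mean(a) for a in events)           # sum of P(A_i)

print(f"P(union)  ~ {p_union:.4f}")
print(f"sum P(Ai) ~ {sum_p:.4f}")
assert p_union <= sum_p + 1e-12                   # Boole's inequality
```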

A kind of generalization of Boole's inequality is the Bonferroni inequalities, after Carlo Emilio Bonferroni. By means of the Bonferroni inequalities we can determine upper and lower bounds on the probability of finite unions of events. We define

$$S_1 = \sum_{i=1}^{n} P(A_i), \qquad S_2 = \sum_{i<j} P(A_i \cap A_j),$$

and, in general, for k = 1, ..., n,

$$S_k = \sum_{i_1<i_2<\ldots<i_k} P(A_{i_1}\cap A_{i_2}\cap\ldots\cap A_{i_k}).$$

Then the Bonferroni inequalities state that for every odd k ≥ 1

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{j=1}^{k}(-1)^{j+1}S_j,$$

and for every even k ≥ 2

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \ge \sum_{j=1}^{k}(-1)^{j+1}S_j.$$

Boole's inequality is recovered by setting k = 1.

2.2 Cauchy-Schwarz and Related Inequalities

Before proceeding we will define the notion of an inner product.

Inner product: Let V be a vector space over the field ℝ. A map ⟨·,·⟩ : V × V → ℝ is said to be an inner product on V if the following hold for every x, y, z ∈ V and λ, µ ∈ ℝ:

1. 〈x, x〉 ≥ 0

2. If 〈x, x〉 = 0, then x = 0V

3. 〈x, y〉 = 〈y, x〉 and

4. 〈λx+ µy, z〉 = λ 〈x, z〉+ µ 〈y, z〉

It is easy to show that an inner product induces a norm on the vector space V, defined as ‖x‖ = ⟨x,x⟩^{1/2}. The well-known Cauchy-Schwarz inequality states that, given a vector space V and two vectors u, v ∈ V, the modulus of their inner product is less than or equal to the product of their norms.

Cauchy-Schwarz Inequality: Let V be a vector space, ⟨·,·⟩ an inner product on V, and x, y ∈ V. Then

|〈x, y〉| ≤ ‖x‖ ‖y‖

Proof 1: If one of x, y is 0_V, the proof is trivial. We observe that equality holds only if x and y are linearly dependent. In that case there exists a λ ∈ ℝ such that x = λy, so ⟨x, y⟩² = ⟨λy, y⟩² = λ²‖y‖⁴ = ‖x‖²‖y‖². We now consider the polynomial function

$$R(z) = \|zx + y\|^2 = z^2\|x\|^2 + 2z\langle x,y\rangle + \|y\|^2.$$

If x, y are linearly independent, R(z) is strictly positive and thus has no real zeros. Hence its discriminant is negative:

$$\langle x,y\rangle^2 - \|x\|^2\|y\|^2 < 0 \;\Rightarrow\; \langle x,y\rangle^2 < \|x\|^2\|y\|^2,$$

as claimed.

Proof 2: We consider the case where neither x nor y is 0_V. The given inequality is equivalent to

$$\left\langle x, \frac{y}{\langle y,y\rangle^{1/2}}\right\rangle^2 \le \langle x,x\rangle.$$

We will prove the above inequality for y ∈ V with ‖y‖ = 1. One has

$$0 \le \langle x - \langle x,y\rangle y,\; x - \langle x,y\rangle y\rangle = \langle x,x\rangle - 2\langle x,y\rangle^2 + \langle x,y\rangle^2\langle y,y\rangle = \langle x,x\rangle - \langle x,y\rangle^2,$$

which proves the Cauchy-Schwarz inequality.

Expectation operator as an inner product: The expectation operator discussed above defines an inner product: for random variables f, g we set ⟨f, g⟩ = E(fg). Indeed, one can easily prove that it satisfies all of the properties (1)-(4). Consequently the triangle inequality holds for the induced norm. We define the operator

$$[f]_{rms} = \sqrt{E f^2},$$

where rms stands for root mean square. Then the triangle inequality becomes [f + g]_{rms} ≤ [f]_{rms} + [g]_{rms}, or equivalently, for random variables f, g, h,

$$[f - g]_{rms} \le [f - h]_{rms} + [h - g]_{rms}.$$


Cauchy-Schwarz in Probability Spaces: Let (Ω,F,P) be a probability space and f, g : Ω → ℝ be random variables. Then

$$\big(E(fg)\big)^2 \le E(f^2)\,E(g^2).$$

If Ef = 0 and Eg = 0, then (E(fg))² ≤ σ_f² σ_g².

Proof: We have to prove that ⟨f, g⟩ = E(fg) defines an inner product.

1. The operator is symmetric: E(fg) = E(gf), and thus ⟨f, g⟩ = ⟨g, f⟩.

2. Let ⟨f, f⟩ = 0, or equivalently E(f²) = 0; then, as proved above, almost surely f = 0.

3. ⟨f, f⟩ ≥ 0 clearly holds, since E(f²) ≥ 0.

4. Let λ, µ be real numbers; then

$$\langle \lambda f + \mu g, h\rangle = E\big[(\lambda f + \mu g)h\big] = E[\lambda fh + \mu gh] = \lambda E(fh) + \mu E(gh) = \lambda\langle f,h\rangle + \mu\langle g,h\rangle.$$

Hence ⟨·,·⟩ defines an inner product on the space of random variables f : Ω → ℝ, and the Cauchy-Schwarz inequality holds as proved above.

Hölder's Inequality: Hölder's inequality is known from real analysis. Given a vector space V with a well-defined p-norm ‖·‖_p, p ∈ [1,+∞], the triangle inequality

$$\|x + y\|_p \le \|x\|_p + \|y\|_p$$

holds (this is Minkowski's inequality), while Hölder's inequality states that |⟨x, y⟩| ≤ ‖x‖_p ‖y‖_q for conjugate exponents with 1/p + 1/q = 1. Now let (Ω,F,P) be a probability space and f, g : Ω → ℝ be random variables; then Hölder's inequality becomes

$$E|fg| \le \big[E(|f|^p)\big]^{1/p}\,\big[E(|g|^q)\big]^{1/q}.$$
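The following short Python sketch (an illustration added here, not part of the original text) checks the Cauchy-Schwarz and Hölder inequalities on empirical moments of simulated random variables; the distributions and the exponent p = 3 are arbitrary choices.

```python
import numpy as np

# Empirical check of Cauchy-Schwarz and Hoelder on simulated random variables.
rng = np.random.default_rng(1)
f = rng.normal(size=200_000)           # f ~ N(0, 1)
g = rng.exponential(size=200_000)      # g ~ Exp(1)

# Cauchy-Schwarz: (E[fg])^2 <= E[f^2] E[g^2]
lhs_cs = np.mean(f * g) ** 2
rhs_cs = np.mean(f**2) * np.mean(g**2)
print(f"Cauchy-Schwarz: {lhs_cs:.4f} <= {rhs_cs:.4f}")

# Hoelder with p = 3, q = 3/2 (1/p + 1/q = 1): E|fg| <= (E|f|^p)^(1/p) (E|g|^q)^(1/q)
p, q = 3.0, 1.5
lhs_h = np.mean(np.abs(f * g))
rhs_h = np.mean(np.abs(f)**p)**(1/p) * np.mean(np.abs(g)**q)**(1/q)
print(f"Hoelder (p=3):  {lhs_h:.4f} <= {rhs_h:.4f}")
```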

2.3 Cramér-Rao Bound

The Cramér-Rao bound, also known as the Cramér-Rao lower bound and named after Harald Cramér and Calyampudi Radhakrishna Rao, poses a lower bound on the variance of estimators of a deterministic parameter. The Cramér-Rao inequality is of great practical significance. Suppose that θ is an unknown parameter to be estimated from experimental data X, which are random variables described by some probability density function f(x;θ). Let ϑ be an unbiased estimator of θ and T(x) an unbiased estimator of a function ψ(θ) of the parameter θ. Note that the parameter θ may be a real scalar or a vector in ℝⁿ. Before proceeding to the statement of the Cramér-Rao bound, we should note that it requires certain weak conditions known as regularity conditions. These are:

1. For every x such that f(x;θ) > 0, the derivative

$$\frac{\partial}{\partial\theta}\ln f(x;\theta)$$

exists and is finite.

2. The operations of integration with respect to x and differentiation with respect to θ can be interchanged in the expectation of T, that is,

$$\frac{\partial}{\partial\theta}\left[\int T(x)f(x;\theta)\,dx\right] = \int T(x)\left[\frac{\partial}{\partial\theta}f(x;\theta)\right]dx,$$

provided that the right-hand side exists and is finite.

The regularity conditions are weak requirements that hold for almost every distribution.

Proposition: The second regularity condition holds when one of the following holds:

1. The function f(x;θ) has bounded support in x and the bounds do not depend on θ.

2. The function f(x;θ) has infinite support, is continuously differentiable, and the integral converges uniformly for all θ.

In the following, the CR inequality will be stated for various cases of increasing generality.

θ is scalar and unbiased: We first define a function of θ known as the Fisher information,

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2\right],$$

which is well defined due to the first regularity condition. Then the variance of ϑ is bounded:

$$\mathrm{var}(\vartheta) \ge \frac{1}{I(\theta)}.$$

We may define the efficiency of an unbiased estimator ϑ as

$$e(\vartheta) = \frac{1/I(\theta)}{\mathrm{var}(\vartheta)}.$$

From the CR inequality one has that e(ϑ) ≤ 1. We now state the CR inequality for a more general case of scalar θ.

General scalar case: Consider that θ is a real scalar parameter and ψ(θ) is a function of θ. Suppose that T(X) is an unbiased estimator of ψ(θ), that is, E[T(X)] = ψ(θ). In this case the CR lower bound is given by

$$\mathrm{var}(T) \ge \frac{[\psi'(\theta)]^2}{I(\theta)},$$

where ψ'(θ) is the derivative of ψ with respect to θ and I is the Fisher information. We can easily generalize this result to the case of biased estimators. Consider an estimator ϑ of θ with bias b(θ) = E[ϑ] − θ, and let ψ(θ) = θ + b(θ). By the result above, one has

$$\mathrm{var}(\vartheta) \ge \frac{[1 + b'(\theta)]^2}{I(\theta)}.$$

Proof: Let (Ω,F,P) be a probability space and X : Ω → ℝ a random variable with probability density function f(x;θ), where θ is a real scalar deterministic parameter. We denote by T = t(X) a statistic which is used as an estimator for ψ(θ). Let

$$V = \frac{\partial}{\partial\theta}\ln f(X;\theta)$$

be the score; then the expectation of V is zero. If we consider the covariance of V with T, we have cov(V,T) = E(VT), since E(V) = 0. Expanding this expression we have

$$\mathrm{cov}(V,T) = E\left[T\cdot\frac{\partial}{\partial\theta}\ln f(X;\theta)\right] = \int t(x)\left[\frac{\partial}{\partial\theta}f(x;\theta)\right]dx = \frac{\partial}{\partial\theta}\int t(x)f(x;\theta)\,dx = \psi'(\theta),$$

because, by the second regularity condition, the differentiation and integration operators commute. By the Cauchy-Schwarz inequality, one has that

$$\sqrt{\mathrm{var}(T)\,\mathrm{var}(V)} \ge |\mathrm{cov}(V,T)| = |\psi'(\theta)|.$$

Hence

$$\mathrm{var}(T) \ge \frac{[\psi'(\theta)]^2}{\mathrm{var}(V)} = \frac{[\psi'(\theta)]^2}{I(\theta)} = \left[\frac{\partial}{\partial\theta}E(T)\right]^2\frac{1}{I(\theta)},$$

as claimed.

Example: In this example we will use the Cramér-Rao inequality to assess the efficiency of the usual estimator of σ². Let X be a normally distributed random variable with known mean µ and unknown variance σ². We introduce the following statistic, which will be used as an unbiased estimator of the unknown variance from a finite number n of experimental data:

$$T = \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{n}.$$

Indeed T is unbiased for σ², as E[T] = σ². The variance of T is then

$$\mathrm{var}(T) = \frac{\mathrm{var}\big[(X-\mu)^2\big]}{n} = \frac{1}{n}\left[E\big[(X-\mu)^4\big] - \Big(E\big[(X-\mu)^2\big]\Big)^2\right].$$

We observe that the first term is the fourth central moment of X and has value 3(σ²)². The second term is the square of the variance, (σ²)². Therefore

$$\mathrm{var}(T) = \frac{2(\sigma^2)^2}{n}.$$

The score V is

$$V = \frac{\partial}{\partial\sigma^2}L(\sigma^2;X),$$

where L is the log-likelihood function. Substituting the density function of the normal distribution N(µ,σ²) one gets

$$V = \frac{\partial}{\partial\sigma^2}\log\left[\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(X-\mu)^2}{2\sigma^2}}\right] = \frac{(X-\mu)^2}{2(\sigma^2)^2} - \frac{1}{2\sigma^2}.$$

The Fisher information is minus the expectation of the derivative of V (this is a property of I that one can easily prove). We have

$$I_1 = -E\left(\frac{\partial V}{\partial\sigma^2}\right) = -E\left[\frac{1}{2(\sigma^2)^2} - \frac{(X-\mu)^2}{(\sigma^2)^3}\right] = \frac{1}{2(\sigma^2)^2}.$$

Thus the information in a sample of n independent observations is n times that, namely I = n/(2(σ²)²). The Cramér-Rao inequality states that

$$\mathrm{var}(T) \ge \frac{1}{I} = \frac{2(\sigma^2)^2}{n}.$$

In this case equality is achieved, and the efficiency of T is 100%.
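A minimal simulation of this example (added for illustration; the parameter values, sample size and seed are arbitrary choices): it estimates var(T) over many repetitions and compares it with the Cramér-Rao bound 2(σ²)²/n.

```python
import numpy as np

# Monte Carlo check that T = (1/n) sum (X_i - mu)^2 attains the Cramer-Rao bound.
rng = np.random.default_rng(2)
mu, sigma2, n, reps = 1.0, 4.0, 50, 20_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
T = np.mean((samples - mu) ** 2, axis=1)   # one estimate of sigma^2 per repetition

var_T = np.var(T)                          # empirical variance of the estimator
cr_bound = 2 * sigma2**2 / n               # 1 / I = 2 (sigma^2)^2 / n

print(f"empirical var(T) ~ {var_T:.4f}")
print(f"Cramer-Rao bound   {cr_bound:.4f}")   # the two should nearly coincide
```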


Vector case: The Cramér-Rao bound can be extended to the multivariate case. Assume that the parameter θ is a vector in ℝⁿ, that is,

$$\theta = [\theta_1, \theta_2, \ldots, \theta_n]^T \in \mathbb{R}^n,$$

with probability density function f(x;θ) which satisfies the two regularity conditions. Now the Fisher information is a matrix in ℝⁿˣⁿ with elements

$$I_{m,k} = E\left[\frac{\partial}{\partial\theta_m}\log f(x;\theta)\,\frac{\partial}{\partial\theta_k}\log f(x;\theta)\right].$$

Let T be a statistical estimator with

$$T(X) = [T_1(X), T_2(X), \ldots, T_q(X)]^T \in \mathbb{R}^q,$$

and denote its expectation (vector) by E[T(X)] = ψ(θ). Then the Cramér-Rao bound states that the covariance matrix of T(X) satisfies

$$\mathrm{cov}_\theta\big(T(X)\big) \ge \frac{\partial\psi(\theta)}{\partial\theta}\,[I(\theta)]^{-1}\left(\frac{\partial\psi(\theta)}{\partial\theta}\right)^T,$$

where the relation ≥ denotes positive semidefiniteness, that is, K ≥ L ⇔ K − L ≥ 0.

Chapman-Robbins Bound: In statistical estimation theory, the Cramér-Rao bound is a powerful tool which helps set a lower bound on the variance of estimators of deterministic parameters [CHR]. However, the differentiability assumption made on the probability density function is somewhat restrictive. Though more difficult to compute, the Chapman-Robbins bound (also known as the Hammersley-Chapman-Robbins bound) is based on much weaker assumptions. Let θ ∈ ℝⁿ be an unknown deterministic parameter and X a random variable on the n-dimensional probability space (ℝⁿ,F,P), interpreted as a measurement of θ. Suppose that the probability density function of X, given by p(x;θ), is well defined and positive for all values of x and θ. Let δ(X) be an unbiased estimator of an arbitrary function g(θ) of θ, that is, for all θ ∈ ℝⁿ:

$$E[\delta(X)] = g(\theta).$$

Then the Chapman-Robbins bound on var(δ(X)) states that

$$\mathrm{var}\big(\delta(X)\big) \ge \sup_{r}\frac{\big[g(\theta + r) - g(\theta)\big]^2}{E\left[\dfrac{p(X;\theta+r)}{p(X;\theta)} - 1\right]^2}.$$

The Chapman-Robbins bound converges to the Cramér-Rao bound as r → 0, assuming the regularity conditions hold. This implies that, when both bounds exist, the Chapman-Robbins version is always at least as tight as the Cramér-Rao bound. The great significance of the Chapman-Robbins bound is that it copes with non-differentiable probability density functions.


2.4 Markov's and Chebyshev's Inequalities

Markov's and Chebyshev's inequalities are useful in statistical estimation. Below we state these inequalities and give their proofs.

Markov's Inequality: Suppose that X is a non-negative random variable on a probability space (Ω,F,P) and d ∈ ℝ⁺ is a positive real number. Then Markov's inequality states that

$$P[X \ge d] \le \frac{1}{d}E[X].$$

Proof 1: Define

$$Y = \begin{cases} d & \text{if } X \ge d,\\ 0 & \text{otherwise.}\end{cases}$$

Then almost surely 0 ≤ Y ≤ X. It is easy to show that Y is a random variable (i.e. that it is measurable). Applying the expectation operator and comparing, we get

$$E[X] \ge E[Y] = d\,P[X \ge d],$$

and the inequality follows.

Proof 2: For any event E ∈ F we define the characteristic or indicator function of E as I_E : Ω → {0,1},

$$I_E = \begin{cases} 1 & \text{if the event } E \text{ occurs},\\ 0 & \text{otherwise.}\end{cases}$$

Thus

$$d\cdot I_{\{|X|\ge d\}} \le |X|,$$

hence

$$E\big[d\cdot I_{\{|X|\ge d\}}\big] \le E(|X|).$$

We may now observe that the left side of the above inequality is the same as

$$d\cdot E\big[I_{\{|X|\ge d\}}\big] = d\cdot P[|X|\ge d].$$

Therefore

$$d\cdot P[|X|\ge d] \le E(|X|).$$

Since d > 0, the inequality follows.


In measure-theoretic terms, Markov's inequality states that, given a measure space (Ω,F, µ), a measurable extended real-valued function f and t > 0,

$$\mu\big(\{\omega\in\Omega : |f(\omega)|\ge t\}\big) \le \frac{1}{t}\int_\Omega |f|\,d\mu.$$

In the special case where the space has unit measure, that is, (Ω,F, µ) is a probability space, Markov's inequality states that

$$P[X \ge d] \le \frac{1}{d}E[X].$$

We now give the proof in the general case of measure spaces.

Proof: Let A be a measurable set and I_A its characteristic (also known as indicator) function, that is,

$$I_A(x) = \begin{cases} 1 & \text{if } x\in A,\\ 0 & \text{otherwise.}\end{cases}$$

We also define A_t = {x ∈ Ω : |f(x)| ≥ t}. Then

$$0 \le t\,I_{A_t} \le |f|\,I_{A_t} \le |f|.$$

Therefore

$$\int_\Omega t\,I_{A_t}\,d\mu \le \int_{A_t}|f|\,d\mu \le \int_\Omega|f|\,d\mu.$$

We observe that the left side of the inequality we want to prove is the same as

$$t\int_\Omega I_{A_t}\,d\mu = t\,\mu(A_t).$$

Hence one has that

$$t\,\mu\big(\{x\in\Omega : |f(x)|\ge t\}\big) \le \int_\Omega|f|\,d\mu,$$

but since t > 0, both sides may be divided by t, preserving the order:

$$\mu\big(\{x\in\Omega : |f(x)|\ge t\}\big) \le \frac{1}{t}\int_\Omega|f|\,d\mu,$$

which is Markov's inequality in measure spaces.


Chebyshev's Inequality: Suppose that X is a random variable with mean µ and define Y = (X − µ)². Then Y is a non-negative random variable in L¹ with E[Y] = σ_X². Chebyshev's inequality states that

$$P\big[Y \ge t^2\big] \le \frac{\sigma_X^2}{t^2}.$$

Proof: The proof is an application of Markov's inequality as stated above, applied to Y with threshold t². From Chebyshev's inequality we deduce that

$$P\big[|X - \mu| \ge t\big] \le \frac{\sigma^2}{t^2}.$$
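As an added numerical illustration (not part of the original essay), the sketch below checks Markov's and Chebyshev's inequalities by Monte Carlo for an exponential random variable; the distribution and the thresholds d, t are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of Markov's and Chebyshev's inequalities for X ~ Exp(1).
rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=500_000)   # non-negative, mean 1, variance 1
mu, var = x.mean(), x.var()

d, t = 3.0, 2.0
markov_lhs = np.mean(x >= d)                   # P(X >= d)
markov_rhs = mu / d                            # E[X] / d

cheb_lhs = np.mean(np.abs(x - mu) >= t)        # P(|X - mu| >= t)
cheb_rhs = var / t**2                          # sigma^2 / t^2

print(f"Markov:    {markov_lhs:.4f} <= {markov_rhs:.4f}")
print(f"Chebyshev: {cheb_lhs:.4f} <= {cheb_rhs:.4f}")
```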

The following proposition is very interesting:

Proposition: Let X be a random variable defined on a probability space (Ω,F,P) such that EX² < ∞. Then for all λ > 0 the following holds:

$$P\big(|X - EX| \ge \lambda\big) \le \frac{E(X - EX)^2}{\lambda^2} \le \frac{EX^2}{\lambda^2}.$$

Proof: For 0 < p < q < ∞ one has that (E|X|^p)^{1/p} ≤ (E|X|^q)^{1/q}. By hypothesis EX² < ∞, hence E|X| < ∞ and EX exists. From Chebyshev's inequality one has that

$$P\big(|X - EX| \ge \lambda\big) \le \frac{E(X - EX)^2}{\lambda^2}.$$

Furthermore,

$$E(X - EX)^2 = EX^2 - (EX)^2 \le EX^2,$$

and the inequality follows.

By means of Markov's and Chebyshev's inequalities one may determine an upper bound on the probabilities of certain events. The Bernstein inequalities (after Sergei Bernstein, 1920s and 1930s) are an application of Chebyshev's inequality. In these inequalities, X₁, X₂, ..., Xₙ is a collection of independent random variables with zero expected value, that is, EXᵢ = 0 (later Bernstein proved a generalization to weakly dependent random variables). The Bernstein inequalities show that

$$P\Big(\sum_{j=1}^{n}X_j > t\Big)$$

is exponentially small.


Bernstein Inequalities (1): Let {Xᵢ}ᵢ₌₁ⁿ be a collection of independent random variables on a probability space (Ω,F,P) satisfying the conditions:

1. Xᵢ is in L², that is, EXᵢ² < ∞, so that one can write ∑ᵢ₌₁ⁿ EXᵢ² = v²;

2. there exists a c ∈ ℝ such that ∑ᵢ₌₁ⁿ E[|Xᵢ|ᵏ] ≤ ½ k! v² c^{k−2} for all integers k ≥ 3.

Then for any ε ≥ 0

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\frac{v^2}{c^2}\left(1 + \frac{c\varepsilon}{v^2} - \sqrt{1 + \frac{2c\varepsilon}{v^2}}\right)\right] \le \exp\left(-\frac{\varepsilon^2}{2(v^2 + c\varepsilon)}\right),$$

and the two-sided bound becomes

$$P\left(\left|\sum_{i=1}^{n}(X_i - E[X_i])\right| > \varepsilon\right) \le 2\exp\left[-\frac{v^2}{c^2}\left(1 + \frac{c\varepsilon}{v^2} - \sqrt{1 + \frac{2c\varepsilon}{v^2}}\right)\right] \le 2\exp\left(-\frac{\varepsilon^2}{2(v^2 + c\varepsilon)}\right).$$

Bernstein Inequalities (2): Let {Xᵢ}ᵢ₌₁ⁿ be a collection of independent random variables on a probability space (Ω,F,P), almost surely absolutely bounded, that is, P(|Xᵢ| ≤ M) = 1, and write v² = ∑ᵢ₌₁ⁿ EXᵢ² as before. Then for any ε ≥ 0:

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\frac{9v^2}{M^2}\left(1 + \frac{M\varepsilon}{3v^2} - \sqrt{1 + \frac{2M\varepsilon}{3v^2}}\right)\right] \le \exp\left(-\frac{\varepsilon^2}{2\big(v^2 + \frac{M}{3}\varepsilon\big)}\right),$$

and for the two-sided bound

$$P\left(\left|\sum_{i=1}^{n}(X_i - E[X_i])\right| > \varepsilon\right) \le 2\exp\left[-\frac{9v^2}{M^2}\left(1 + \frac{M\varepsilon}{3v^2} - \sqrt{1 + \frac{2M\varepsilon}{3v^2}}\right)\right] \le 2\exp\left(-\frac{\varepsilon^2}{2\big(v^2 + \frac{M}{3}\varepsilon\big)}\right).$$

Proof: To prove the Bernstein inequalities one needs the Chernoff-Cramér inequality; a proof of the first Bernstein inequality is given in Section 2.6. We will now study an application of Bernstein's inequality, namely Hoeffding's inequality [HW] (named after Wassily Hoeffding), which is stated right away:


Hoeffding's Inequality: Let X₁, X₂, ..., Xₙ be independent random variables defined on a probability space (Ω,F,P). Assume that Xᵢ is almost surely bounded for every i = 1, ..., n, that is, P(Xᵢ ∈ [aᵢ, bᵢ]) = 1. The sum of these random variables, denoted by S = ∑ᵢ₌₁ⁿ Xᵢ, is a random variable and the following inequality holds:

$$P\big(S - E[S] \ge nt\big) \le \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).$$

Proof: The proof is straightforward from the more general Bernstein inequality.

By means of Hoeffding's inequality we can see that the usual estimator of a random variable's mean converges to the mean exponentially fast. We restate Hoeffding's inequality as follows:

Hoeffding's Inequality (restatement): Assume a collection X₁, X₂, ..., Xₘ of independent random variables defined on a probability space (Ω,F,P). Let

$$p = E[X_i], \quad X_i \in [0,1], \quad \hat{p} = \frac{1}{m}\sum_{i=1}^{m}X_i, \quad \varepsilon \ge 0.$$

Then

$$P\big(|\hat{p} - p| \ge \varepsilon\big) \le 2e^{-2\varepsilon^2 m}.$$

The proof of Hoeffding's inequality in this form will be given after McDiarmid's inequality. McDiarmid's inequality (named after Colin McDiarmid) is a generalization of Hoeffding's inequality; it is a result in probability theory that gives an upper bound on the probability that the value of a function of multiple independent random variables deviates from its expected value.

McDiarmid's Inequality: Let {Xᵢ}ᵢ₌₁ⁿ be a collection of independent random variables taking values in a set Ω. Assume that f : Ωⁿ → ℝ is a random variable satisfying the bounded-differences condition

$$\sup_{x_1,\ldots,x_n,\,x_i'}\big|f(x_1,\ldots,x_n) - f(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n)\big| \le c_i$$

for i = 1, ..., n. Then for any ε > 0 the following inequality holds:

$$P\big(f(X_1,\ldots,X_n) - E[f(X_1,\ldots,X_n)] \ge \varepsilon\big) \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^{n}c_i^2}\right).$$

Proof: We may consider a random variable

Z = f(X1, X2, . . . , Xn)


We want to prove that, under the requirements of McDiarmid's inequality, the following holds:

$$P(Z - EZ \ge \varepsilon) = P\big(e^{\lambda(Z - EZ)} \ge e^{\lambda\varepsilon}\big) \le \frac{E\,e^{\lambda(Z - EZ)}}{e^{\lambda\varepsilon}}.$$

Writing Z − EZ = Z₁ + ... + Zₙ as a sum of martingale differences Zᵢ (where Zᵢ is the change in the conditional expectation of f when Xᵢ is revealed, so that |Zᵢ| is controlled by cᵢ), one has furthermore

$$\begin{aligned}
E\exp[\lambda(Z - EZ)] &= E\exp[\lambda(Z_1 + \ldots + Z_n)]\\
&= E\big[\exp[\lambda(Z_2 + \ldots + Z_n)]\,E_{x_1}e^{\lambda Z_1}\big]\\
&\le E\big[\exp[\lambda(Z_2 + \ldots + Z_n)]\big]\,e^{\lambda^2 c_1^2/2}\\
&= e^{\lambda^2 c_1^2/2}\,E\big[\exp[\lambda(Z_3 + \ldots + Z_n)]\,E_{x_2}e^{\lambda Z_2}\big]\\
&\le \exp\left[\lambda^2\frac{c_1^2 + c_2^2}{2}\right]E\exp[\lambda(Z_3 + \ldots + Z_n)]\\
&\le \ldots \le \exp\left(\frac{\lambda^2}{2}\sum_{i=1}^{n}c_i^2\right).
\end{aligned}$$

Hence

$$P(Z - EZ > \varepsilon) \le \exp\left(-\lambda\varepsilon + \frac{\lambda^2}{2}\sum_{i=1}^{n}c_i^2\right).$$

This holds for all positive λ; taking the infimum over λ > 0 proves the claim. (With the sharper estimate E_{xᵢ}e^{λZᵢ} ≤ e^{λ²cᵢ²/8}, which follows from Hoeffding's lemma, one obtains the constant 2 in the exponent of the stated bound.)

Hoeffding's inequality (in the restated form) can be deduced from McDiarmid's inequality by considering the function f(X₁, ..., Xₘ) = (1/m)∑ᵢ₌₁ᵐ Xᵢ. The proof is stated below:

Proof: Let us define p̂ such that

$$\hat{p} = \frac{1}{m}\sum_{i=1}^{m}X_i = f(X_1, X_2, \ldots, X_m).$$

Because Xᵢ ∈ [0,1], the most that Xᵢ can change is from 0 to 1, and the maximum change of f is then bounded by cᵢ = 1/m. Thus f satisfies the requirements of McDiarmid's inequality, and we have

$$P[f \ge Ef + \varepsilon] \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^{m}c_i^2}\right).$$

Hence

$$P[\hat{p} \ge p + \varepsilon] \le \exp\left(-\frac{2\varepsilon^2}{m(1/m)^2}\right) = \exp\left(-\frac{2\varepsilon^2}{1/m}\right) = e^{-2\varepsilon^2 m},$$

and the two-sided bound with the factor 2 follows by applying the same argument to −f, which completes the proof of Hoeffding's inequality.
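As an illustration of the restated inequality (added here; the Bernoulli parameter, sample size and seed are arbitrary choices), the following sketch compares the empirical probability that |p̂ − p| ≥ ε with the bound 2e^{−2ε²m}.

```python
import numpy as np

# Empirical check of Hoeffding's bound P(|p_hat - p| >= eps) <= 2 exp(-2 eps^2 m).
rng = np.random.default_rng(4)
p, m, eps, reps = 0.3, 200, 0.07, 50_000

samples = rng.binomial(1, p, size=(reps, m))   # X_i in {0, 1}
p_hat = samples.mean(axis=1)

empirical = np.mean(np.abs(p_hat - p) >= eps)
bound = 2 * np.exp(-2 * eps**2 * m)

print(f"empirical deviation probability ~ {empirical:.4f}")
print(f"Hoeffding bound                   {bound:.4f}")
```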

2.5 Jensen's Inequality

Convexity: A function f with domain a set A is said to be convex if the following hold for every x, y ∈ A with x ≠ y and for every λ ∈ (0,1):

1. A is a convex set. Geometrically, a set is convex if every line segment between two arbitrary distinct points of A lies entirely in A; analytically, λx + (1 − λ)y ∈ A.

2. f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

Jensen's inequality gives a lower bound on the expectation of convex images of random variables. The inequality is stated below.

Jensen's Inequality: Let X be a random variable and f a convex function. Then f(X) is a random variable and

$$E[f(X)] \ge f(E[X]).$$

Proof: It is easy to prove that f(X) is a random variable, that is, that f(X) is measurable. We define c = E(X). Since f is convex, there exists a supporting line for f at c, namely

$$\varphi(x) = a(x - c) + f(c)$$

for some a, with ϕ(x) ≤ f(x) for all x. Then

$$E[f(X)] \ge E[\varphi(X)] = E[a(X - c) + f(c)] = f(c) = f(E[X]),$$

as claimed.
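A quick empirical check of Jensen's inequality (added for illustration; the convex function exp and the normal distribution are arbitrary choices):

```python
import numpy as np

# Empirical check of Jensen's inequality E[f(X)] >= f(E[X]) for the convex f(x) = exp(x).
rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=1.0, size=500_000)

lhs = np.mean(np.exp(x))      # E[exp(X)]  (theoretically exp(1/2) ~ 1.6487)
rhs = np.exp(np.mean(x))      # exp(E[X])  (theoretically exp(0) = 1)

print(f"E[exp(X)] ~ {lhs:.4f}")
print(f"exp(E[X]) ~ {rhs:.4f}")
assert lhs >= rhs
```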


Proposition (Average-median inequality): For probability distributions having an expected value and a median, the mean (i.e. the expected value) and the median can never differ from each other by more than one standard deviation. To express this in mathematical notation, let µ, m, and σ be respectively the mean, the median, and the standard deviation. Then

$$|\mu - m| \le \sigma.$$

Proof 1: This proof makes use of Jensen's inequality. One has that

$$|\mu - m| = |E(X - m)| \le E(|X - m|) \le E(|X - \mu|) = E\Big(\sqrt{(X-\mu)^2}\Big) \le \sqrt{E\big((X-\mu)^2\big)} = \sigma.$$

The first inequality is an application of Jensen's inequality to the absolute value function, which is convex. The second comes from the fact that the median minimizes the mean absolute deviation, and the last one is an application of Jensen's inequality to the concave square root function.

Proof 2: The one-tailed version of Chebyshev's inequality is

$$P(X - \mu \ge \lambda\sigma) \le \frac{1}{1 + \lambda^2}.$$

By letting λ = 1 we get that P(X ≥ µ + σ) ≤ 1/2 and, similarly, P(X ≤ µ − σ) ≤ 1/2. Consequently the median must be within one standard deviation of the mean.

2.6 Chernoff-Cramér and Bennett's Bound

Chernoff-Cramér bound: The Chernoff-Cramér inequality is a very general and powerful way of bounding random variables. Compared with the famous Chebyshev bound, which implies inverse-polynomial decay inequalities, the Chernoff-Cramér method yields exponential decay inequalities, at the cost of needing a few more hypotheses on the random variables' behaviour. Let {Xᵢ}ᵢ₌₁ⁿ be a collection of independent random variables such that E[exp(tXᵢ)] < +∞ for all i in a right neighborhood of t = 0, i.e. for any t ∈ (0,c) (the Cramér condition). Then there exists a function Ψ(x) : [0,∞) → ℝ⁺, zero at x = 0, positive, strictly increasing and strictly convex, such that

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\big(-\Psi(\varepsilon)\big) \quad \forall\varepsilon\ge 0.$$

Namely, one has

$$\Psi(x) = \sup_{0<t<c}\big(tx - \psi(t)\big), \qquad \psi(t) = \sum_{i=1}^{n}\ln E\big[e^{t(X_i - EX_i)}\big] = \sum_{i=1}^{n}\big(\ln E[e^{tX_i}] - tE[X_i]\big),$$

that is, Ψ(x) is the Legendre transform of the cumulant generating function of the random variable ∑ᵢ₌₁ⁿ(Xᵢ − E[Xᵢ]).

Remarks:

1. Besides its importance for theoretical questions, the Chernoff-Cramér bound is also the starting point for deriving many deviation or concentration inequalities, among which the Bernstein, Kolmogorov, Bennett, Hoeffding and Chernoff ones are worth mentioning. All of these inequalities are obtained by imposing various further conditions on the random variables Xᵢ, which turn out to affect the general form of the cumulant generating function ψ(t).

2. Sometimes, instead of bounding the sum of n independent random variables, one needs to estimate their mean, i.e. the quantity (1/n)∑ᵢ₌₁ⁿ Xᵢ; in order to reuse the Chernoff-Cramér bound, it is enough to note that

$$P\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon'\right) = P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > n\varepsilon'\right),$$

so that one has only to replace, in the above stated inequality, ε with nε′.

3. It turns out that the Chernoff-Cramér bound is asymptotically sharp, as Cramér's limit theorem shows.
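To make the bound concrete, here is a small added sketch (not from the original essay): for independent Xᵢ ~ N(0,1) one has ψ(t) = nt²/2, so Ψ(ε) = sup_t(tε − nt²/2) = ε²/(2n); the snippet compares exp(−Ψ(ε)) with a Monte Carlo estimate of the tail probability. The choices of n, ε and the seed are arbitrary.

```python
import numpy as np

# Chernoff-Cramer bound for a sum of n i.i.d. N(0,1) variables:
# psi(t) = n t^2 / 2, hence Psi(eps) = eps^2 / (2n) and P(S_n > eps) <= exp(-eps^2 / (2n)).
rng = np.random.default_rng(6)
n, eps, reps = 30, 12.0, 200_000

sums = rng.normal(size=(reps, n)).sum(axis=1)
empirical = np.mean(sums > eps)
bound = np.exp(-eps**2 / (2 * n))

print(f"empirical P(S_n > {eps}) ~ {empirical:.5f}")
print(f"Chernoff-Cramer bound      {bound:.5f}")
```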

We will now give a proof of the Chernoff-Cramér bound.

Proof: Let h(x) be the step function (h(x) = 1 for x ≥ 0, h(x) = 0 for x < 0); then, by the generalized Markov inequality, for any t > 0 and any ε ≥ 0,

$$\begin{aligned}
P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) &= E\left[h\left(\sum_{i=1}^{n}(X_i - E[X_i]) - \varepsilon\right)\right]\\
&\le E\left[e^{t\left(\sum_{i=1}^{n}(X_i - E[X_i]) - \varepsilon\right)}\right]\\
&= \exp(-\varepsilon t)\,E\left[e^{\sum_{i=1}^{n}t(X_i - E[X_i])}\right]\\
&= \exp(-\varepsilon t)\,E\left[\prod_{i=1}^{n}e^{t(X_i - E[X_i])}\right]\\
\text{(by independence)}\quad &= \exp(-\varepsilon t)\prod_{i=1}^{n}E\left[e^{t(X_i - E[X_i])}\right]\\
&= \exp\left(-\varepsilon t + \sum_{i=1}^{n}\ln E\left[e^{t(X_i - E[X_i])}\right]\right)\\
&= \exp\big[-(t\varepsilon - \psi(t))\big].
\end{aligned}$$

Since this expression is valid for any t > 0, the best bound is obtained by taking the supremum:

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le e^{-\sup_{t>0}(t\varepsilon - \psi(t))},$$

which proves the bound itself. To prove that Ψ(0) = 0, let us observe that Ψ(0) = sup_{t>0}(−ψ(t)) = −inf_{t>0}ψ(t) and that

$$E\big[e^{t(X_i - E[X_i])}\big] \ge E\big[1 + t(X_i - E[X_i])\big] = E[1] + tE[X_i] - tE[E[X_i]] = 1 = E\big[e^{t(X_i - E[X_i])}\big]\Big|_{t=0},$$

that is, t = 0 is the infimum point of E[e^{t(Xᵢ−E[Xᵢ])}] for all i and consequently of ψ(t) = ∑ᵢ₌₁ⁿ ln E[e^{t(Xᵢ−EXᵢ)}]; so, as a conclusion, Ψ(0) = −ψ(0) = 0.

To prove positivity, let x > 0 be fixed and let t₀ be the supremum point of tx − ψ(t). We have to show that t₀x − ψ(t₀) > 0. By differentiation, ψ′(t₀) = x. Let us now recall that the moment generating function is convex, so ψ″(t) > 0. Writing the Taylor expansion of ψ(t) around t = t₀, we have, with a suitable t₁ < t₀,

$$0 = \psi(0) = \psi(t_0) - \psi'(t_0)t_0 + \tfrac{1}{2}\psi''(t_1)t_0^2,$$

that is,

$$\Psi(x) = t_0 x - \psi(t_0) = t_0\psi'(t_0) - \psi(t_0) = \tfrac{1}{2}\psi''(t_1)t_0^2 > 0.$$

The convexity of Ψ(x) follows from the fact that Ψ(x) is the supremum of the linear (and hence convex) functions tx − ψ(t), and so must be convex itself. Eventually, in order to prove that Ψ(x) is an increasing function, let us note that

$$\Psi'(0) = \lim_{x\to 0}\frac{\Psi(x) - \Psi(0)}{x} = \lim_{x\to 0}\frac{\Psi(x)}{x} > 0$$

and that, by the Taylor formula with Lagrange form remainder, for some ξ = ξ(x),

$$\Psi'(x) = \Psi'(0) + \Psi''(\xi)x \ge 0,$$

since Ψ″(ξ) ≥ 0 by convexity and x ≥ 0 by hypothesis.

Proof of Bernstein's inequalities: In a previous section we mentioned Bernstein's inequalities without proving them. We will now see a proof of Bernstein's first inequality. By the Chernoff-Cramér bound, we have

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\sup_{t>0}\big(t\varepsilon - \psi(t)\big)\right],$$

where

$$\psi(t) = \sum_{i=1}^{n}\big(\ln E[e^{tX_i}] - tE[X_i]\big).$$

Since ln x ≤ x − 1 for all x > 0,

$$\begin{aligned}
\psi(t) &\le \sum_{i=1}^{n}\big(E[e^{tX_i}] - tE[X_i] - 1\big) = \sum_{i=1}^{n}\left(E\left[1 + tX_i + \frac{1}{2}t^2X_i^2 + \sum_{k=3}^{+\infty}\frac{t^kX_i^k}{k!}\right] - tE[X_i] - 1\right)\\
&= \sum_{i=1}^{n}\left(\frac{1}{2}t^2E[X_i^2] + \sum_{k=3}^{+\infty}\frac{t^kE[X_i^k]}{k!}\right) = \frac{1}{2}t^2\sum_{i=1}^{n}E[X_i^2] + \sum_{k=3}^{+\infty}\frac{t^k\sum_{i=1}^{n}E[X_i^k]}{k!}\\
&\le \frac{1}{2}t^2\sum_{i=1}^{n}E[X_i^2] + \sum_{k=3}^{+\infty}\frac{t^k\sum_{i=1}^{n}E\big[|X_i|^k\big]}{k!},
\end{aligned}$$

and, keeping in mind the hypotheses,

$$\psi(t) \le \frac{1}{2}t^2v^2 + \sum_{k=3}^{+\infty}\frac{t^k}{2}v^2c^{k-2} = \frac{1}{2}t^2v^2 + \frac{1}{2}t^3v^2c\sum_{k=0}^{+\infty}(tc)^k.$$

Now, if tc < 1, we obtain

$$\psi(t) \le \frac{1}{2}t^2v^2\left(1 + \frac{tc}{1-tc}\right) = \frac{v^2t^2}{2(1-tc)},$$

whence

$$\sup_{t>0}\big(t\varepsilon - \psi(t)\big) \ge \sup_{0<t<\frac{1}{c}}\left(t\varepsilon - \frac{v^2t^2}{2(1-tc)}\right).$$

By elementary calculus, we obtain the value of t that maximizes the expression in brackets (out of the two roots of the second-degree polynomial equation, we choose the one which is < 1/c):

$$t_{opt} = \frac{v^2 + 2c\varepsilon - v^2\sqrt{1 + \frac{2c\varepsilon}{v^2}}}{c(v^2 + 2c\varepsilon)} = \frac{1}{c}\left(1 - \frac{1}{\sqrt{1 + \frac{2c\varepsilon}{v^2}}}\right),$$

which, once plugged into the bound, yields

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\frac{v^2}{c^2}\left(1 + \frac{c\varepsilon}{v^2} - \sqrt{1 + \frac{2c\varepsilon}{v^2}}\right)\right].$$

Observing that √(1+x) ≤ 1 + ½x, one gets

$$t_{opt} = \frac{1}{c}\left(1 - \frac{1}{\sqrt{1 + \frac{2c\varepsilon}{v^2}}}\right) \le \frac{1}{c}\left(1 - \frac{1}{1 + \frac{c\varepsilon}{v^2}}\right) = \frac{\varepsilon}{v^2 + c\varepsilon} = t' < \frac{1}{c}.$$

Plugging t′ into the bound expression, the sub-optimal yet more easily manageable formula is obtained:

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left(-\frac{\varepsilon^2}{2(v^2 + c\varepsilon)}\right),$$

which is obviously a worse bound than the preceding one, since t′ ≠ t_opt. One can also verify the consistency of this inequality directly by proving that, for any x ≥ 0,

$$1 + x - \sqrt{1 + 2x} \ge \frac{x^2}{2(1+x)}.$$


Bennett's inequality: Let {Xᵢ}ᵢ₌₁ⁿ be a collection of independent random variables satisfying the conditions:

1. E[Xᵢ²] < ∞ for all i, so that one can write ∑ᵢ₌₁ⁿ E[Xᵢ²] = v²;

2. P(|Xᵢ| ≤ M) = 1 for all i.

Then, for any ε ≥ 0,

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\frac{v^2}{M^2}\,\theta\!\left(\frac{\varepsilon M}{v^2}\right)\right] \le \exp\left[-\frac{\varepsilon}{2M}\ln\left(1 + \frac{\varepsilon M}{v^2}\right)\right],$$

where

$$\theta(x) = (1 + x)\ln(1 + x) - x.$$

Remark: Observing that

$$(1 + x)\ln(1 + x) - x \ge 9\left(1 + \frac{x}{3} - \sqrt{1 + \frac{2}{3}x}\right) \ge \frac{3x^2}{2(x + 3)} \quad \forall x \ge 0,$$

and plugging these expressions into the bound, one obtains immediately the Bernstein inequality under the hypothesis of boundedness of the random variables, as one might expect. However, the Bernstein inequalities, although weaker, hold under far more general hypotheses than Bennett's.

Proof: The proof of Bennett's inequality is quite similar to that of Bernstein's inequality stated above. By the Chernoff-Cramér bound once again, we have

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\sup_{t\ge 0}\big(t\varepsilon - \psi(t)\big)\right],$$

where

$$\psi(t) = \sum_{i=1}^{n}\big(\ln E[e^{tX_i}] - tE[X_i]\big).$$

Keeping in mind that the condition P(|Xᵢ| ≤ M) = 1 for all i implies that, for all i, E[|Xᵢ|ᵏ] ≤ Mᵏ for all k ≥ 0 and, moreover,

$$E\big[|X_i|^k\big] \le E\big[X_i^2\big]M^{k-2} \quad \forall k \ge 2,\ k\in\mathbb{N},$$

and since ln x ≤ x − 1 for all x > 0, one has:

$$\begin{aligned}
\psi(t) &= \sum_{i=1}^{n}\big(\ln E[e^{tX_i}] - tE[X_i]\big) \le \sum_{i=1}^{n}\big(E[e^{tX_i}] - tE[X_i] - 1\big)\\
&= \sum_{i=1}^{n}\left(E\left[\sum_{k=0}^{\infty}\frac{(tX_i)^k}{k!}\right] - tE[X_i] - 1\right) = \sum_{i=1}^{n}\left(\sum_{k=0}^{\infty}\frac{t^kE[X_i^k]}{k!} - tE[X_i] - 1\right)\\
&= \sum_{i=1}^{n}\sum_{k=2}^{\infty}\frac{t^kE[X_i^k]}{k!} \le \sum_{i=1}^{n}\sum_{k=2}^{\infty}\frac{t^kE\big[|X_i|^k\big]}{k!} \le \sum_{i=1}^{n}\sum_{k=2}^{\infty}\frac{t^kE[X_i^2]M^{k-2}}{k!}\\
&= \sum_{k=2}^{\infty}\frac{t^kM^{k-2}\sum_{i=1}^{n}E[X_i^2]}{k!} = \frac{v^2}{M^2}\sum_{k=2}^{\infty}\frac{(tM)^k}{k!} = \frac{v^2}{M^2}\big[\exp(tM) - tM - 1\big].
\end{aligned}$$

One can now write

$$\sup_{t\ge 0}\big(t\varepsilon - \psi(t)\big) \ge \sup_{t\ge 0}\left(t\varepsilon - \frac{v^2}{M^2}\big(e^{tM} - tM - 1\big)\right) = \sup_{t>0}\left[\frac{v^2}{M^2}\left(\frac{M^2\varepsilon}{v^2}t - \big(e^{tM} - tM - 1\big)\right)\right].$$

By elementary calculus, we obtain the value of t that maximizes the expression in round brackets:

$$t_{opt} = \frac{1}{M}\ln\left(1 + \frac{M\varepsilon}{v^2}\right),$$

which, once plugged into the bound, yields

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\frac{v^2}{M^2}\left(\left(1 + \frac{M\varepsilon}{v^2}\right)\ln\left(1 + \frac{M\varepsilon}{v^2}\right) - \frac{M\varepsilon}{v^2}\right)\right].$$


Observing that (1 + x)ln(1 + x) − x ≥ (x/2)ln(1 + x) for all x ≥ 0, one gets the sub-optimal yet more easily manageable formula

$$P\left(\sum_{i=1}^{n}(X_i - E[X_i]) > \varepsilon\right) \le \exp\left[-\frac{\varepsilon}{2M}\ln\left(1 + \frac{\varepsilon M}{v^2}\right)\right].$$

2.7 Various Inequalities

Kolmogorov's Inequality: In probability theory, Kolmogorov's inequality is a so-called maximal inequality that gives a bound on the probability that the partial sums of a finite collection of independent random variables exceed some specified bound. The inequality is named after the Russian mathematician Andrey Kolmogorov and is stated as follows:

Let X₁, ..., Xₙ be independent random variables on a probability space such that E[Xₖ] = 0 and var[Xₖ] < ∞ for k = 1, ..., n. Then, for each λ > 0,

$$P\left(\max_{1\le k\le n}|S_k| \ge \lambda\right) \le \frac{1}{\lambda^2}\mathrm{var}[S_n] = \frac{1}{\lambda^2}\sum_{k=1}^{n}\mathrm{var}[X_k],$$

where Sₖ = X₁ + ⋯ + Xₖ.

Proof: One can prove that the sequence (Sᵢ)ᵢ is a martingale (however, this exceeds the purpose of this proof). Without loss of generality, one may assume that S₀ = 0. We define (Zᵢ)ᵢ as follows: let Z₀ = 0 and

$$Z_{i+1} = \begin{cases} S_{i+1} & \text{if } \max_{1\le j\le i}|S_j| < \lambda,\\ Z_i & \text{otherwise.}\end{cases}$$

Then (Zᵢ)ᵢ₌₀ⁿ is also a martingale. Since E[Sᵢ | Sᵢ₋₁] = Sᵢ₋₁ for all i and E[E[X|Y]] = E[X] by the law of total expectation, one has

$$\begin{aligned}
\sum_{i=1}^{n}E\big[(S_i - S_{i-1})^2\big] &= \sum_{i=1}^{n}E\big[S_i^2 - 2S_iS_{i-1} + S_{i-1}^2\big]\\
&= \sum_{i=1}^{n}\Big(E[S_i^2] - 2E\big[S_{i-1}E[S_i\,|\,S_{i-1}]\big] + E[S_{i-1}^2]\Big)\\
&= \sum_{i=1}^{n}\Big(E[S_i^2] - E[S_{i-1}^2]\Big) = E[S_n^2] - E[S_0^2] = E[S_n^2].
\end{aligned}$$

Thus, by Chebyshev's inequality and the same computation applied to the martingale (Zᵢ), one has

$$\begin{aligned}
P\left(\max_{1\le i\le n}|S_i| \ge \lambda\right) &= P\big(|Z_n| \ge \lambda\big) \le \frac{1}{\lambda^2}E\big(Z_n^2\big) = \frac{1}{\lambda^2}\sum_{i=1}^{n}E\big[(Z_i - Z_{i-1})^2\big]\\
&\le \frac{1}{\lambda^2}\sum_{i=1}^{n}E\big[(S_i - S_{i-1})^2\big] = \frac{1}{\lambda^2}E\big[S_n^2\big] = \frac{1}{\lambda^2}\mathrm{var}[S_n].
\end{aligned}$$
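A small simulation of Kolmogorov's inequality (added for illustration; the step distribution, the number of steps and the threshold λ are arbitrary choices): it compares the empirical probability that the maximal partial sum exceeds λ with the bound var(Sₙ)/λ².

```python
import numpy as np

# Monte Carlo check of Kolmogorov's inequality for X_k uniform on [-1, 1].
rng = np.random.default_rng(7)
n, lam, reps = 100, 12.0, 50_000

steps = rng.uniform(-1.0, 1.0, size=(reps, n))   # E[X_k] = 0, var[X_k] = 1/3
partial = np.cumsum(steps, axis=1)               # S_1, ..., S_n for each run
max_abs = np.max(np.abs(partial), axis=1)

empirical = np.mean(max_abs >= lam)
bound = (n / 3.0) / lam**2                       # var(S_n) / lambda^2

print(f"empirical P(max |S_k| >= {lam}) ~ {empirical:.4f}")
print(f"Kolmogorov bound                  {bound:.4f}")
```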

We will now state a weaker version of Kolmogorov's inequality [ET].

Etemadi's Inequality: Let X₁, ..., Xₙ be independent random variables on a probability space, and let ε ≥ 0. Let Sₖ denote the partial sum Sₖ = X₁ + ... + Xₖ. Then

$$P\left(\max_{1\le k\le n}|S_k| \ge 3\varepsilon\right) \le 3\max_{1\le k\le n}P\big(|S_k| \ge \varepsilon\big).$$

Remarks:

1. Note that Etemadi's inequality does not require that the random variables have mean E(Xₖ) = 0.

2. Suppose that the random variables Xₖ have expected value zero. If one applies Chebyshev's inequality to the right-hand side of Etemadi's inequality and replaces ε by ε/3, the result is Kolmogorov's inequality with an extra factor of 27 on the right-hand side, that is,

$$P\left(\max_{1\le k\le n}|S_k| \ge \varepsilon\right) \le \frac{27}{\varepsilon^2}\mathrm{var}(S_n).$$

Paley-Zygmund Bound: In probability theory, the Paley-Zygmund inequality bounds the probability that a non-negative random variable is small, in terms of its mean and variance (that is, its first two moments). The inequality was proved by Raymond Paley and Antoni Zygmund. It is stated as follows:

If Z is an almost surely non-negative random variable with finite variance, and 0 ≤ ϑ ≤ 1, then

$$P\big(Z \ge \vartheta E(Z)\big) \ge (1-\vartheta)^2\,\frac{E(Z)^2}{E(Z^2)}.$$

Proof: We know that

$$EZ = E\big[Z\,I_{\{Z \le \vartheta EZ\}}\big] + E\big[Z\,I_{\{Z > \vartheta EZ\}}\big].$$

Obviously the first term is at most ϑ·EZ. According to the Cauchy-Schwarz inequality, the second one is at most

$$\sqrt{E(Z^2)}\,\sqrt{E\big[I_{\{Z \ge \vartheta EZ\}}\big]} = \sqrt{E(Z^2)}\,\sqrt{P\big(Z \ge \vartheta E(Z)\big)}.$$

Rearranging, (1 − ϑ)E(Z) ≤ √(E(Z²)) √(P(Z ≥ ϑE(Z))), and squaring both sides yields the claim.

Remark: The right-hand side of the Paley-Zygmund inequality can be written as

$$P\big(Z \ge \vartheta E(Z)\big) \ge (1-\vartheta)^2\,\frac{E(Z)^2}{E(Z)^2 + \mathrm{var}\,Z} = \frac{(1-\vartheta)^2\mu^2}{\mu^2 + \sigma_Z^2},$$

where µ is the mean of Z and σ_Z² its variance. The one-sided Chebyshev inequality gives a slightly better bound:

$$P\big(Z \ge \vartheta E(Z)\big) \ge \frac{(1-\vartheta)^2\mu^2}{(1-\vartheta)^2\mu^2 + \sigma_Z^2}.$$
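A quick empirical check of the Paley-Zygmund bound (added here for illustration; the exponential distribution and ϑ = 0.5 are arbitrary choices):

```python
import numpy as np

# Empirical check of the Paley-Zygmund bound for Z ~ Exp(1) and theta = 0.5.
rng = np.random.default_rng(8)
z = rng.exponential(size=500_000)
theta = 0.5

lhs = np.mean(z >= theta * z.mean())                # P(Z >= theta E[Z])
rhs = (1 - theta)**2 * z.mean()**2 / np.mean(z**2)  # Paley-Zygmund lower bound

print(f"P(Z >= theta E[Z]) ~ {lhs:.4f}")
print(f"Paley-Zygmund bound  {rhs:.4f}")
```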

Entropy Power Inequality: For a random variable X : Ω → ℝⁿ defined on a probability space (Ω,F,P), with probability density function f : ℝⁿ → ℝ, the information entropy of X, denoted by h(X), is defined as

$$h(X) = -\int_{\mathbb{R}^n} f(x)\log f(x)\,dx,$$

and the entropy power of X, denoted by N(X), is defined to be

$$N(X) = \frac{1}{2\pi e}\exp\left(\frac{2}{n}h(X)\right).$$

The entropy power inequality is stated as follows: let X and Y be independent random variables with probability density functions in the Lᵖ space Lᵖ(ℝⁿ) for some p > 1. Then

$$N(X + Y) \ge N(X) + N(Y).$$

Remarks:

1. Equality holds iff X and Y are (multivariate) normal variables with proportional covariance matrices.

2. What is stated by this inequality is that entropy power is a superadditive function. This inequality was proved by Shannon in 1948 [SH]. Stam [ST] proved the necessary and sufficient conditions for equality to hold.

Proof: Let p > 1 and define

$$N_p(X) = \frac{1}{2\pi}\,p^{-q/p}\,\|f\|_p^{-2q/n},$$

where q is the Hölder conjugate of p, that is, p⁻¹ + q⁻¹ = 1. Then N_p(X), which may be called the p-th Rényi entropy power of X, converges to N(X) as p → 1⁺. Suppose that 0 < λ < 1 and, for r > 1, let

$$p = p(r) = \frac{r}{(1-\lambda) + \lambda r} \quad\text{and}\quad q = q(r) = \frac{r}{\lambda + (1-\lambda)r}.$$

Then p, q satisfy

$$\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r}.$$

Let X and Y be independent random vectors in ℝⁿ with probability densities f ∈ Lᵖ(ℝⁿ) and g ∈ Lᵠ(ℝⁿ), respectively. Then, by Young's inequality, one has that

$$N_r(X + Y) \ge \left(\frac{N_p(X)}{1-\lambda}\right)^{1-\lambda}\left(\frac{N_q(Y)}{\lambda}\right)^{\lambda}.$$

As r → 1⁺ and p, q → 1, the above inequality becomes

$$N(X + Y) \ge \left(\frac{N(X)}{1-\lambda}\right)^{1-\lambda}\left(\frac{N(Y)}{\lambda}\right)^{\lambda}.$$

By differentiating the logarithm of the right-hand expression, it can be verified that it is maximized when

$$\lambda = \frac{N(Y)}{N(X) + N(Y)}.$$

Substituting this into the above inequality, we obtain the entropy power inequality, that is,

$$N(X + Y) \ge N(X) + N(Y).$$

Azuma's Inequality: [AZU] In probability theory, the Azuma or Azuma-Hoeffding inequality (named after Kazuoki Azuma and Wassily Hoeffding) gives a concentration result for the values of martingales that have bounded differences. The inequality is stated as follows:

Let {Xₖ : k = 0, 1, 2, ...} be a martingale with bounded successive differences, |Xₖ − Xₖ₋₁| ≤ cₖ. Let us denote u_N = ∑ₖ₌₁ᴺ cₖ² for some integer N. Then the Azuma-Hoeffding inequality states that for any positive real t

$$P\big(X_N - X_0 \ge t\big) \le \exp\left(-\frac{t^2}{2u_N}\right).$$

Applying Azuma's inequality to the martingale −X and employing the union bound (Boole's inequality), one obtains the two-sided bound

$$P\big(|X_N - X_0| \ge t\big) \le 2\exp\left(-\frac{t^2}{2u_N}\right).$$

Application of Azuma's Inequality to a coin-flip game: Suppose one bets money on coin tosses, playing a game of coin flips; consider a sequence of independent and identically distributed random coin flips with Ω = {Heads, Tails}. Assuming that the player wins 1$ for every right bet and loses 1$ for every wrong one, we may denote the gain of the i-th bet by Fᵢ, i = 1, 2, .... Defining Xᵢ = ∑ⱼ₌₁ⁱ Fⱼ yields a martingale with |Xₖ − Xₖ₋₁| ≤ 1, allowing us to apply Azuma's inequality:

$$P\big(X_N > X_0 + t\big) \le \exp\left(-\frac{t^2}{2N}\right).$$
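A simulation of this coin-flip game (added for illustration; the number of bets N, the threshold t and the seed are arbitrary choices) compares the empirical tail probability of the winnings with the Azuma bound.

```python
import numpy as np

# Coin-flip game: X_N is the sum of N fair +/-1 bets, X_0 = 0.
# Azuma gives P(X_N > t) <= exp(-t^2 / (2N)) since |X_k - X_{k-1}| <= 1.
rng = np.random.default_rng(9)
N, t, reps = 400, 40.0, 100_000

flips = rng.choice([-1, 1], size=(reps, N))
X_N = flips.sum(axis=1)

empirical = np.mean(X_N > t)
bound = np.exp(-t**2 / (2 * N))

print(f"empirical P(X_N > {t}) ~ {empirical:.4f}")
print(f"Azuma bound              {bound:.4f}")
```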

Brunn-Minkowski and Anderson Inequalities: [GAR] One form of the Brunn-Minkowski inequality often found in the literature states that, given two compact convex bodies K, L of ℝⁿ with nonempty interior, 0 < λ < 1, and vol denoting the volume function,

$$\mathrm{vol}\big((1-\lambda)K + \lambda L\big)^{1/n} \ge (1-\lambda)\,\mathrm{vol}(K)^{1/n} + \lambda\,\mathrm{vol}(L)^{1/n}.$$

The Brunn-Minkowski inequality can be thought of as a generalization of the classical isoperimetric inequality in the plane, which states that

$$L^2 \ge 4\pi A,$$

where A is the area of a domain enclosed by a curve of length L. In 1955, T. W. Anderson used the Brunn-Minkowski inequality to obtain the following results, stated here in three different ways, from the probability-theory and analysis points of view:

1. Assume that X and Y are independent random variables and I ⊆ Ω. Then the following holds:

$$P(X \in I) \ge P(X + Y \in I).$$

2. If a nonnegative function f on ℝ is (i) symmetric, that is, f(x) = f(−x), and (ii) unimodal, that is, f(cx) ≥ f(x) for 0 ≤ c ≤ 1, and I is an interval centered at the origin, then

$$\int_{I+y} f(x)\,dx$$

is maximized when y = 0.

3. Let K be an origin-symmetric convex body in ℝⁿ, and let f be a non-negative, symmetric, and unimodal function on ℝⁿ. Then

$$\int_K f(x + cy)\,dx \ge \int_K f(x + y)\,dx$$

for 0 ≤ c ≤ 1 and for every y ∈ ℝⁿ.

One can easily observe that the second inequality is a special case of the third one. We will now prove only the first inequality.

Proof: Let X and Y be independent random vectors on a probability space (ℝⁿ,F,P) with probability density functions f and g respectively. Then the convolution f ∗ g, defined as

$$(f * g)(x) = \int_{\mathbb{R}^n} f(x - y)\,g(y)\,dy,$$

is the probability density function of X + Y. Therefore, by Fubini's theorem, one has

$$\begin{aligned}
P(X + Y \in I) &= \int_I\int_{\mathbb{R}^n} f(z - y)\,g(y)\,dy\,dz = \int_{\mathbb{R}^n}\int_I f(z - y)\,g(y)\,dz\,dy\\
&= \int_{\mathbb{R}^n}\int_{I-y} f(x)\,g(y)\,dx\,dy \le \int_{\mathbb{R}^n}\int_I f(x)\,g(y)\,dx\,dy\\
&= \int_I f(x)\,dx = P(X \in I).
\end{aligned}$$

Interesting inequalities in probability and measure theory are the so-called Prékopa-Leindler inequality and the Borell-Brascamp-Lieb inequality, by means of which one can show that certain statistical tests are unbiased. Such inequalities have many applications in stochastic programming.


Vysochanskii-Petunin Inequality: [PET] In probability theory, the Vysochanskii-Petunin inequality gives a lower bound for the probability that a random variable with finite variance lies within a certain number of standard deviations of the variable's mean. The only requirements on the distribution are that it be unimodal and have finite variance. The inequality applies even to heavily skewed distributions and puts bounds on how much of the data is, or is not, "in the middle". The inequality is stated below:

Let X be a random variable on a probability space (ℝ,F,P) with mean µ and standard deviation σ, and let λ > √(8/3). Then for every unimodal probability distribution the following inequality holds:

$$P\big(|X - \mu| \ge \lambda\sigma\big) \le \frac{4}{9\lambda^2}.$$

The theorem refines Chebyshev's inequality by including the factor 4/9, made possible by the condition that the distribution be unimodal. Chebyshev's inequality alone does not allow us to justify the empirical "3σ rule", which asserts that for distributions occurring in practice

$$P\big(|X - \mu| \ge 3\sigma\big) \le 0.05,$$

but using the Vysochanskii-Petunin inequality for λ = 3 > √(8/3) one has that

$$P\big(|X - \mu| \ge 3\sigma\big) \le \frac{4}{81} \cong 0.049 < 0.05.$$
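A short empirical illustration of the 3σ bound (added here; the choice of a skewed but unimodal exponential distribution is arbitrary):

```python
import numpy as np

# Check of the Vysochanskii-Petunin bound P(|X - mu| >= 3 sigma) <= 4/81
# for a heavily skewed but unimodal distribution, X ~ Exp(1).
rng = np.random.default_rng(10)
x = rng.exponential(size=1_000_000)
mu, sigma = x.mean(), x.std()

empirical = np.mean(np.abs(x - mu) >= 3 * sigma)
print(f"empirical P(|X - mu| >= 3 sigma) ~ {empirical:.5f}")
print(f"Vysochanskii-Petunin bound         {4/81:.5f}")
```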


Bibliography

[AZU] K. Azuma, Weighted sums of certain dependent random variables, Tohoku Mathematical Journal, vol. 19, pp. 357-367 (1967).

[BIL] P. Billingsley, Probability and Measure, John Wiley & Sons, New York (1995), ISBN 0-471-00710-2.

[CHR] D. G. Chapman and H. Robbins, Minimum variance estimation without regularity assumptions, Ann. Math. Statist., vol. 22, no. 4, pp. 581-586 (1951).

[ET] N. Etemadi, On some classical results in probability theory, Sankhya Ser. A, vol. 47, pp. 215-221.

[GAR] R. J. Gardner, The Brunn-Minkowski inequality, Bulletin of the American Mathematical Society, vol. 39, no. 3, pp. 355-405 (2002).

[GEI] C. Geiss and S. Geiss, An introduction to probability theory, Lecture Notes (Feb. 2004), pp. 8-10 & 29-58.

[HW] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, vol. 58, no. 301, pp. 13-30 (1963).

[MD] C. McDiarmid, On the method of bounded differences, Surveys in Combinatorics, London Mathematical Society Lecture Notes 141, Cambridge University Press (1989), pp. 148-188.

[SH] C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp. 379-423 & 623-656 (1948).

[SJ] J. Shao, Mathematical Statistics, Springer, New York, ISBN 0-387-98674-X, section 1.3.1.

[ST] A. J. Stam, Some inequalities satisfied by the quantities of information of Fisher and Shannon, Information and Control, vol. 2, pp. 101-112.

[PET] D. A. Klyushin, Y. I. Petunin, R. I. Andrushkiw, N. V. Boroday and K. P. Ganina, Cancer diagnostic method based on pattern recognition of DNA changes in buccal epithelium in the pathology of the thyroid and mammary glands, available online: http://m.njit.edu/CAMS/Technical-Reports/CAMS02-03/report4.pdf

[PM] PlanetMath Online: http://www.planetmath.org

[WIK] Wikipedia - Online Encyclopaedia: http://www.wikipedia.org

[WF] W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1, 3rd edition, John Wiley & Sons, New York (1968), ISBN 0-471-25708-7.