Machine Learning Lecture Notes

Predrag Radivojac

January 13, 2015

1 Random Variables

Until now we operated on relatively simple sample spaces and produced measure functions over sets of outcomes. In many situations, however, we would like to use probabilistic modeling on sets (e.g. a group of people) where elements can be associated with various descriptors. For example, a person may be associated with his/her age, height, citizenship, IQ, or marital status, and we may be interested in events related to such descriptors. In other situations, we may be interested in transformations of sample spaces, such as those corresponding to digitizing an analog signal from a microphone into a set of integers based on some set of voltage thresholds. The mechanism of a random variable facilitates addressing all such situations in a simple, rigorous and unified manner.

A random variable is a variable that, from the observer's point of view, takes values non-deterministically, with generally different preferences for different outcomes. Mathematically, however, it is defined as a function that maps one sample space into another, with a few technical caveats we will introduce later. A selection of $\mathbb{R}$ as the co-domain for the mapping defines a real-valued random variable.

Let us motivate the need for random variables. Consider a probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a set of people, and let us investigate the probability that a randomly selected person $\omega \in \Omega$ has a cold (we may assume we have a diagnostic method and tools at hand). We start by defining an event $A$ as

$$A = \{\omega \in \Omega : \mathit{Disease}(\omega) = \text{cold}\}$$

and simply calculate the probability of this event. This is a perfectly legitimate approach, but it can be much simplified using the random variable mechanism. We first note that, technically, our diagnostic method corresponds to a function $\mathit{Disease} : \Omega \to \{\text{cold}, \text{not cold}\}$ that maps the sample space $\Omega$ to a new binary sample space $\Omega_{\mathit{Disease}} = \{\text{cold}, \text{not cold}\}$. Even more interestingly, our approach also maps the probability distribution $P$ to a new probability distribution $P_{\mathit{Disease}}$ that is defined on $\Omega_{\mathit{Disease}}$. For example, we can calculate $P_{\mathit{Disease}}(\{\text{cold}\})$ from the probability of the aforementioned event $A$; i.e. $P_{\mathit{Disease}}(\{\text{cold}\}) = P(A)$. This is a cluttered notation, so we may wish to simplify it by using $P(\mathit{Disease} = \text{cold})$, where $\mathit{Disease}$ is a "random variable".


We will use capital letters $X, Y, \ldots$ to denote random variables (such as $\mathit{Disease}$) and lowercase letters $x, y, \ldots$ to indicate elements (such as "cold") of $\Omega_X, \Omega_Y, \ldots$ Generally, we write such probabilities as $P(X = x)$, which is a notational relaxation from $P(\{\omega : X(\omega) = x\})$. Before we proceed to formally define random variables, we shall look at two illustrative examples.

Example 3. Consecutive tosses of a fair coin. Consider a process of three coin tosses and two random variables, $X$ and $Y$, defined on the sample space. We define $X$ as the number of heads in the first toss and $Y$ as the number of heads over all three tosses. Our goal is to find the probability spaces that are created after the transformations.

First, we have

    ω       HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
    X(ω)     1    1    1    1    0    0    0    0
    Y(ω)     3    2    2    1    2    1    1    0

Let us only focus on variable $Y$. Clearly, $Y : \Omega \to \{0, 1, 2, 3\}$, but we also need to find $\mathcal{F}_Y$ and $P_Y$. To calculate $P_Y$, a simple approach is to find its pmf $p_Y$. For example, let us calculate $p_Y(2) = P_Y(\{2\})$ as

$$P_Y(\{2\}) = P(Y = 2) = P(\{\omega : Y(\omega) = 2\}) = P(\{HHT, HTH, THH\}) = \frac{3}{8},$$

because of the uniform distribution in the original space $(\Omega, \mathcal{F}, P)$. In a similar way, we can calculate that $P(Y = 0) = P(Y = 3) = 1/8$, and that $P(Y = 1) = 3/8$. In this example, we took $\mathcal{F} = \mathcal{P}(\Omega)$ and $\mathcal{F}_Y = \mathcal{P}(\Omega_Y)$. As a final note, we mention that all the randomness is defined in the original probability space $(\Omega, \mathcal{F}, P)$ and that the new probability space $(\Omega_Y, \mathcal{F}_Y, P_Y)$ simply inherits it through a deterministic transformation.

Example 4. Quantization. Consider $(\Omega, \mathcal{F}, P)$ where $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}(\Omega)$, and $P$ is induced by a uniform probability density function. Define $X : \Omega \to \{0, 1\}$ as

$$X(\omega) = \begin{cases} 0 & \omega \le 0.5 \\ 1 & \omega > 0.5 \end{cases}$$

and find the transformed probability space.

Technically, we have changed the sample space to $\Omega_X = \{0, 1\}$. For an event space $\mathcal{F}_X = \mathcal{P}(\Omega_X)$ we would like to understand the new probability distribution $P_X$. We have

$$p_X(0) = P_X(\{0\}) = P(X = 0) = P(\{\omega : \omega \in [0, 0.5]\}) = \frac{1}{2}$$

and

$$p_X(1) = P_X(\{1\}) = P(X = 1) = P(\{\omega : \omega \in (0.5, 1]\}) = \frac{1}{2}.$$

From here we can easily see that $P_X(\{0, 1\}) = 1$ and $P_X(\emptyset) = 0$, and so $P_X$ is indeed a probability distribution. Again, $P_X$ is naturally defined using $P$. Thus, we have transformed the probability space $(\Omega, \mathcal{F}, P)$ into $(\Omega_X, \mathcal{F}_X, P_X)$.
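The quantizer can also be checked empirically; a small Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import random

random.seed(0)

def X(omega):
    # Quantizer: X(omega) = 0 if omega <= 0.5, else 1.
    return 0 if omega <= 0.5 else 1

# Draw omega uniformly from [0, 1] and estimate p_X empirically.
n = 100_000
counts = {0: 0, 1: 0}
for _ in range(n):
    counts[X(random.random())] += 1

p0 = counts[0] / n
print(p0)  # close to 1/2, as derived above
```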

1.1 Formal definition of a random variable

We now formally define a random variable. Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable $X$ is a function $X : \Omega \to \Omega_X$ such that for every $A \in \mathcal{B}(\Omega_X)$ it holds that $\{\omega : X(\omega) \in A\} \in \mathcal{F}$. It follows that

$$P_X(A) = P(\{\omega : X(\omega) \in A\}).$$

It is important to mention that, by default, we defined the event space of a random variable to be the Borel field of $\Omega_X$. This is convenient because the Borel field of a countable set $\Omega$ is its power set. Thus, we are working with the largest possible event spaces for both discrete and continuous random variables.

Consider now a discrete random variable $X$ defined on $(\Omega, \mathcal{F}, P)$. As we can see from the previous examples, the probability distribution for $X$ can be found as

$$p_X(x) = P_X(\{x\}) = P(\{\omega : X(\omega) = x\})$$

for all $x \in \Omega_X$. The probability of an event $A$ can be found as

$$P_X(A) = P(\{\omega : X(\omega) \in A\}) = \sum_{x \in A} p_X(x)$$

for all $A \subseteq \Omega_X$.

The case of continuous random variables is more complicated, but reduces to an approach that is similar to that of discrete random variables. Here we first define a cumulative distribution function (cdf) as

$$F_X(t) = P_X(\{x : x \le t\}) = P_X((-\infty, t]) = P(X \le t) = P(\{\omega : X(\omega) \le t\}),$$

where $P(X \le t)$, as before, presents a minor abuse of notation. If the cumulative distribution function is differentiable, the probability density function of a continuous random variable is defined as

$$p_X(x) = \left.\frac{dF_X(t)}{dt}\right|_{t=x}.$$

Alternatively, if $p_X$ exists, then

$$F_X(t) = \int_{-\infty}^{t} p_X(x)\,dx,$$

for each $t \in \mathbb{R}$. Our focus will be exclusively on random variables that have probability density functions; however, for a more general view, we should always keep in mind "if one exists" when referring to pdfs.
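The cdf/pdf relationship can be illustrated numerically. The sketch below uses the logistic distribution, an assumed example choice, because its cdf $F(t) = 1/(1 + e^{-t})$ has a closed form and a known derivative $p(t) = F(t)(1 - F(t))$:

```python
import math

# Logistic distribution: closed-form cdf and its known derivative (the pdf).
def F(t):
    return 1.0 / (1.0 + math.exp(-t))

def p(t):
    return F(t) * (1.0 - F(t))

# A central finite difference of the cdf should recover the pdf.
h = 1e-5
err = max(abs((F(t + h) - F(t - h)) / (2 * h) - p(t))
          for t in (-2.0, -0.5, 0.0, 1.0, 3.0))
print(err)  # tiny: the pdf is the derivative of the cdf
```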

The probability that a random variable will take a value from an interval $(a, b]$ can now be calculated as

$$P_X((a, b]) = P_X(a < X \le b) = \int_a^b p_X(x)\,dx = F_X(b) - F_X(a),$$

which follows from the properties of integration.

Suppose now that the random variable $X$ transforms a probability space $(\Omega, \mathcal{F}, P)$ into $(\Omega_X, \mathcal{F}_X, P_X)$. To describe the resulting probability space, we commonly use probability mass and density functions inducing $P_X$. For example, if $P_X$ is induced by a Gaussian distribution with parameters $\mu$ and $\sigma^2$, we use

$$X : N(\mu, \sigma^2)$$

or sometimes $X \sim N(\mu, \sigma^2)$. Both notations indicate that the probability density function for the random variable $X$ is

$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}.$$

This distribution in turn provides all the necessary information about the probability space; i.e. $\Omega_X = \mathbb{R}$ and $\mathcal{F}_X = \mathcal{B}(\mathbb{R})$.

A group of $k$ random variables $\{X_i\}_{i=1}^{k}$ defined on the same probability space $(\Omega, \mathcal{F}, P)$ is called a random vector or a multivariate (multidimensional) random variable. We have already seen an example of a random vector provided by random variables $(X, Y)$ in Example 3. A generalization of a random vector to infinite sets is referred to as a random process; i.e. $\{X_i : i \in T\}$, where $T$ is an index set usually interpreted as a set of time indices. In the case of discrete time indices (e.g. $T = \mathbb{N}$) the random process is called a discrete-time random process; otherwise (e.g. $T = \mathbb{R}$) it is called a continuous-time random process.

1.2 Joint and marginal distributions

Let us first look at two discrete random variables $X$ and $Y$ defined on the same probability space $(\Omega, \mathcal{F}, P)$. We define the joint probability distribution $p_{XY}(x, y)$ of $X$ and $Y$ as

$$p_{XY}(x, y) = P(X = x, Y = y) = P(\{\omega : X(\omega) = x\} \cap \{\omega : Y(\omega) = y\}).$$

We can extend this to a $k$-dimensional random variable $X = (X_1, X_2, \ldots, X_k)$ and define a multidimensional probability mass function as $p_X(x)$, where $x = (x_1, x_2, \ldots, x_k)$ is a vector of values, such that each $x_i$ is chosen from some $\Omega_{X_i}$. A marginal distribution is defined for a subset of $X = (X_1, X_2, \ldots, X_k)$ by summing or integrating over the remaining variables. A marginal distribution $p_{X_i}(x_i)$ is defined as

$$p_{X_i}(x_i) = \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_k} p_X(x_1, \ldots, x_k),$$

where the variable in the $j$-th sum takes values from $\Omega_{X_j}$. The previous equation directly follows from Eq. (1) in the Probability Theory section and is also referred to as the sum rule.

In the continuous case, we define a multidimensional cdf as

$$F_X(t) = P_X(\{x : x_i \le t_i,\ i = 1 \ldots k\}) = P(X_1 \le t_1, X_2 \le t_2, \ldots)$$

and the probability density function, if it exists, is defined as

$$p_X(x) = \left.\frac{\partial^k}{\partial t_1 \cdots \partial t_k} F_X(t_1, \ldots, t_k)\right|_{t=x}.$$

The marginal density $p_{X_i}(x_i)$ is defined as

$$p_{X_i}(x_i) = \int_{x_1} \cdots \int_{x_{i-1}} \int_{x_{i+1}} \cdots \int_{x_k} p_X(x)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_k.$$

If the sample space for each discrete random variable is seen as a countable subset of $\mathbb{R}$, then the probability space for any $k$-dimensional random variable $X$ (discrete or continuous) can be defined as $(\mathbb{R}^k, \mathcal{B}(\mathbb{R})^k, P_X)$.

Example 5. Three tosses of a fair coin (again). Consider two random variables from Example 3 and calculate their probability spaces, joint and marginal distributions.

A joint probability mass function $p(x, y) = P(X = x, Y = y)$ is shown below:

             Y = 0   Y = 1   Y = 2   Y = 3
    X = 0     1/8     1/4     1/8      0
    X = 1      0      1/8     1/4     1/8

but let us step back for a moment and show how we can calculate it. Let us consider two sets $A = \{HHH, HHT, HTH, HTT\}$ and $B = \{HHT, HTH, THH\}$, corresponding to the events that the first toss was heads and that there were exactly two heads over the three tosses, respectively. Now, let us look at the probability of the intersection of $A$ and $B$:

$$P(A \cap B) = P(\{HHT, HTH\}) = \frac{1}{4}.$$

We can represent the probability of the logical statement $X = 1 \wedge Y = 2$ as

$$p_{XY}(1, 2) = P(X = 1, Y = 2) = P(A \cap B) = P(\{HHT, HTH\}) = \frac{1}{4}.$$

The marginal probability distribution can be found in a straightforward way as

$$p_X(x) = \sum_{y \in \Omega_Y} p_{XY}(x, y),$$

where $\Omega_Y = \{0, 1, 2, 3\}$. Thus,

$$p_X(0) = \sum_{y \in \Omega_Y} p_{XY}(0, y) = \frac{1}{2}.$$
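The whole joint table and the marginal above can be reproduced by enumeration of the original space; a short sketch (names are illustrative):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))   # uniform space of 8 outcomes

X = lambda w: 1 if w[0] == "H" else 0      # heads in the first toss
Y = lambda w: w.count("H")                 # heads over all three tosses

# Joint pmf p_XY(x, y) by direct enumeration of the original space.
p_XY = {(x, y): Fraction(sum(1 for w in outcomes if X(w) == x and Y(w) == y), 8)
        for x in (0, 1) for y in range(4)}

# Marginal by the sum rule: p_X(x) = sum_y p_XY(x, y).
p_X = {x: sum(p_XY[(x, y)] for y in range(4)) for x in (0, 1)}

print(p_XY[(1, 2)], p_X[0])  # 1/4 1/2, matching the table above
```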


Finally, we note that we needed $|\Omega_X| \cdot |\Omega_Y| - 1$ numbers (because the sum must equal 1) to fully describe the joint distribution $p_{XY}(x, y)$. Asymptotically, this corresponds to an exponential growth of the number of entries in the table with the number of random variables $k$. For example, if $|\Omega_{X_i}| = 2$ for all $X_i$, there are $2^k - 1$ free elements in the joint probability distribution. Estimating such distributions from data is intractable and is one form of the curse of dimensionality.

1.3 Conditional distributions

The conditional probability distribution for two random variables $X$ and $Y$, $p(y|x)$, is defined as

$$p_{Y|X}(y|x) = P(Y = y \,|\, X = x) = P(\{\omega : Y(\omega) = y\} \,|\, \{\omega : X(\omega) = x\}) = \frac{P(\{\omega : X(\omega) = x\} \cap \{\omega : Y(\omega) = y\})}{P(\{\omega : X(\omega) = x\})} = \frac{P_{XY}(X = x, Y = y)}{P_X(X = x)} = \frac{p_{XY}(x, y)}{p_X(x)},$$

or simply $p(x, y) = p(y|x) \cdot p(x)$. This is a direct consequence of the product rule from Eq. (2) in the Probability Theory section. From this expression, we can calculate the probability of an event $A$ as

$$P_{Y|X}(Y \in A \,|\, X = x) = \begin{cases} \sum_{y \in A} p_{Y|X}(y|x) & Y \text{ discrete} \\ \int_{y \in A} p_{Y|X}(y|x)\,dy & Y \text{ continuous.} \end{cases}$$

The extension to more than two variables is straightforward. We can write

$$p(x_k \,|\, x_1, \ldots, x_{k-1}) = \frac{p(x_1, \ldots, x_k)}{p(x_1, \ldots, x_{k-1})}.$$

By a recursive application of the product rule, we obtain

$$p(x_1, \ldots, x_k) = p(x_1) \prod_{l=2}^{k} p(x_l \,|\, x_1, \ldots, x_{l-1}), \tag{1}$$

which is referred to as the chain rule.
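The chain rule can be verified numerically on any small joint table; a sketch with an illustrative, made-up joint pmf over three binary variables:

```python
from itertools import product

# A toy joint pmf over three binary variables (made-up numbers summing to 1).
p = {(0, 0, 0): .10, (0, 0, 1): .05, (0, 1, 0): .20, (0, 1, 1): .15,
     (1, 0, 0): .05, (1, 0, 1): .10, (1, 1, 0): .05, (1, 1, 1): .30}

def marg(prefix):
    # Marginal p(x1, ..., xj) of the first j variables.
    j = len(prefix)
    return sum(v for k, v in p.items() if k[:j] == prefix)

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2|x1) * p(x3|x1, x2).
err = 0.0
for x in product((0, 1), repeat=3):
    chain = marg(x[:1]) * (marg(x[:2]) / marg(x[:1])) * (p[x] / marg(x[:2]))
    err = max(err, abs(chain - p[x]))
print(err)  # zero up to floating-point round-off
```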


1.4 Independence of random variables

Two random variables are independent if their joint probability distribution can be expressed as

$$p_{XY}(x, y) = p_X(x) \cdot p_Y(y).$$

As before, $k$ random variables are (collectively, mutually) independent if the joint probability distribution of any subset of the variables can be expressed as a product of the individual (marginal) probability distributions of its components.

Another, different, form of independence can be found even more frequently in probabilistic calculations. It represents independence between variables in the presence of some other random variable (evidence), e.g.

$$p_{XY|Z}(x, y|z) = p_{X|Z}(x|z) \cdot p_{Y|Z}(y|z),$$

and is referred to as conditional independence. Interestingly, the two forms of independence are unrelated; i.e. neither one implies the other. We show this in two simple examples from Figure 1.
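Checking the factorization $p_{XY} = p_X \cdot p_Y$ on a finite table is mechanical; a minimal sketch with illustrative, made-up marginals:

```python
from itertools import product

# Joint pmf of two independent binary variables: p_XY(x, y) = p_X(x) p_Y(y).
p_X = {0: 0.3, 1: 0.7}
p_Y = {0: 0.6, 1: 0.4}
p_XY = {(x, y): p_X[x] * p_Y[y] for x, y in product((0, 1), repeat=2)}

def is_independent(joint, tol=1e-12):
    # Recover the marginals, then test the factorization entry by entry.
    mx = {x: sum(v for (a, _), v in joint.items() if a == x) for x in (0, 1)}
    my = {y: sum(v for (_, b), v in joint.items() if b == y) for y in (0, 1)}
    return all(abs(joint[(x, y)] - mx[x] * my[y]) <= tol for (x, y) in joint)

print(is_independent(p_XY))                    # True by construction
p_XY[(0, 0)] += 0.05; p_XY[(1, 1)] += 0.05     # move mass onto the diagonal;
p_XY[(0, 1)] -= 0.05; p_XY[(1, 0)] -= 0.05     # still a pmf, now correlated
print(is_independent(p_XY))                    # False
```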

1.5 Expectations and moments

Expectations of functions are defined as sums (or integrals) of function values weighted according to the probability distribution function. Given a probability space $(\Omega_X, \mathcal{B}(\Omega_X), P_X)$, we consider a function $f : \Omega_X \to \mathbb{C}$ and define its expectation function as

$$E_x[f(x)] = \begin{cases} \sum_{x \in \Omega_X} f(x) p_X(x) & X \text{ discrete} \\ \int_{\Omega_X} f(x) p_X(x)\,dx & X \text{ continuous.} \end{cases}$$

It can happen for continuous distributions that $E_x[f(x)] = \pm\infty$; in such cases we say that the expectation does not exist. For $f(x) = x$, we have the standard expectation $E[X] = E_x[x] = \sum x p_X(x)$, or the mean value of $X$. Using $f(x) = x^k$ results in the $k$-th moment, $f(x) = \log \frac{1}{p_X(x)}$ gives the well-known entropy function $H(X)$, or differential entropy for continuous random variables, and $f(x) = (x - E[X])^2$ provides the variance of a random variable $X$, denoted by $V[X]$. Interestingly, the probability of some event $A \subseteq \Omega_X$ can also be expressed in the form of an expectation; i.e.

$$P_X(A) = E[I_A(x)],$$

where

$$I_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases}$$

is an indicator function. With this, it is possible to express the cumulative distribution function as $F_X(t) = E[I_{(-\infty, t]}(x)]$.


Figure 1: Independence vs. conditional independence using probability distributions involving three binary random variables. Probability distributions are presented using the factorization $p(x, y, z) = p(x)p(y|x)p(z|x, y)$, where all constants $a, b, c, d, e \in [0, 1]$. (A) Variables $X$ and $Y$ are independent, but not conditionally independent given $Z$. When $c = 0$, $Z = X \oplus Y$, where $\oplus$ is an "exclusive or" operator. (B) Variables $X$ and $Z$ are conditionally independent given $Y$, but are not independent.


    f(x)                        Symbol            Name
    x                           E[X]              Mean
    (x − E[X])²                 V[X]              Variance
    x^k                         E[X^k]            k-th moment; k ∈ N
    (x − E[X])^k                E[(x − E[X])^k]   k-th central moment; k ∈ N
    e^{tx}                      M_X(t)            Moment generating function
    e^{itx}                     φ_X(t)            Characteristic function
    log(1/p_X(x))               H(X)              (Differential) entropy
    log(p_X(x)/q(x))            D(p_X‖q)          Kullback-Leibler divergence
    (∂/∂θ log p_X(x|θ))²        I(θ)              Fisher information

Table 1: Some important expectation functions $E_x[f(x)]$ for a random variable $X$ described by its distribution $p_X(x)$. The function $q(x)$ in the definition of the Kullback-Leibler divergence is non-negative and must sum (integrate) to 1; i.e. it is a probability distribution itself. The Fisher information is defined for a family of probability distributions specified by a parameter $\theta$. Note that the moment generating function may not exist for some distributions and all values of $t$; however, the characteristic function always exists, even when the density function does not.

The function $f(x)$ inside the expectation can also be complex-valued. For example, $\varphi_X(t) = E_x[e^{itx}]$, where $i$ is the imaginary unit, defines the characteristic function of $X$. The characteristic function is closely related to the inverse Fourier transform of $p_X(x)$ and is useful in many forms of statistical inference. Several expectation functions are summarized in Table 1.

Given two random variables $X$ and $Y$ and a specific value $x$ assigned to $X$, we define the conditional expectation as

$$E_y[f(y)|x] = \begin{cases} \sum_{y \in \Omega_Y} f(y) p_{Y|X}(y|x) & Y \text{ discrete} \\ \int_{\Omega_Y} f(y) p_{Y|X}(y|x)\,dy & Y \text{ continuous,} \end{cases}$$

where $f : \Omega_Y \to \mathbb{C}$ is some function. Again, using $f(y) = y$ results in $E[Y|x] = \sum y\, p_{Y|X}(y|x)$ or $E[Y|x] = \int y\, p_{Y|X}(y|x)\,dy$. We shall see later that under some conditions $E[Y|x]$ is referred to as the regression function. These types of integrals are often seen and evaluated in Bayesian statistics.

For two random variables $X$ and $Y$ we also define

$$E_{x,y}[f(x, y)] = \begin{cases} \sum_{x \in \Omega_X} \sum_{y \in \Omega_Y} f(x, y) p_{XY}(x, y) & X, Y \text{ discrete} \\ \int_{\Omega_X} \int_{\Omega_Y} f(x, y) p_{XY}(x, y)\,dx\,dy & X, Y \text{ continuous} \end{cases}$$

and note that we interchangeably use $E[XY]$ and $E[xy]$. Expectations can also be defined over a single variable:

$$E_x[f(x, y)] = \begin{cases} \sum_{x \in \Omega_X} f(x, y) p_X(x) & X \text{ discrete} \\ \int_{\Omega_X} f(x, y) p_X(x)\,dx & X \text{ continuous.} \end{cases}$$

Note that $E_x[f(x, y)]$ is a function of $y$.

We define the covariance function as

$$\mathrm{cov}(X, Y) = E_{x,y}[(x - E[X])(y - E[Y])] = E_{x,y}[xy] - E_x[x] E_y[y] = E[XY] - E[X]E[Y],$$

with $\mathrm{cov}(X) = \mathrm{cov}(X, X)$ being the variance of the random variable $X$. Similarly, we define the correlation function as

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{cov}(X) \cdot \mathrm{cov}(Y)}},$$

which is simply the covariance function normalized by the product of standard deviations. Both covariance and correlation functions have wide applicability in statistics, machine learning, signal processing and many other disciplines. Several important expectations for two random variables are listed in Table 2.
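The sample analogues of covariance and correlation follow the same formulas; a small sketch with illustrative toy data (the second series is roughly twice the first, so the correlation is near 1):

```python
import math

# Sample covariance and correlation from paired observations (toy numbers).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# cov(X, Y) = E[(x - E[X])(y - E[Y])], with expectations replaced by averages.
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vy = sum((y - my) ** 2 for y in ys) / n
corr = cov / math.sqrt(vx * vy)   # normalize by the standard deviations

print(cov, corr)  # correlation close to 1 for this nearly linear relationship
```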

Example 6. Three tosses of a fair coin (yet again). Consider the two random variables from Examples 3 and 5, and calculate the expectation and variance for both $X$ and $Y$. Then calculate $E[Y | X = 0]$.

We start by calculating $E[X] = 0 \cdot p_X(0) + 1 \cdot p_X(1) = \frac{1}{2}$. Similarly,

$$E[Y] = \sum_{y=0}^{3} y \cdot p_Y(y) = p_Y(1) + 2 p_Y(2) + 3 p_Y(3) = \frac{3}{2}.$$

The conditional expectation can be found as

$$E[Y | X = 0] = \sum_{y=0}^{3} y \cdot p_{Y|X}(y|0) = p_{Y|X}(1|0) + 2 p_{Y|X}(2|0) + 3 p_{Y|X}(3|0) = 1,$$

where $p_{Y|X}(y|x) = p_{XY}(x, y) / p_X(x)$.

    f(x, y)                               Symbol       Name
    (x − E[X])(y − E[Y])                  cov(X, Y)    Covariance
    (x − E[X])(y − E[Y]) / √(V[X]V[Y])    corr(X, Y)   Correlation
    log(p_XY(x, y) / (p_X(x) p_Y(y)))     I(X; Y)      Mutual information
    log(1 / p_XY(x, y))                   H(X, Y)      Joint entropy
    log(1 / p_X|Y(x|y))                   H(X|Y)       Conditional entropy

Table 2: Some important expectation functions $E_{x,y}[f(x, y)]$ for two random variables, $X$ and $Y$, described by their joint distribution $p_{XY}(x, y)$. Mutual information is sometimes referred to as average mutual information.

In many situations we need to analyze more than two random variables. A simple two-dimensional summary of all pairwise covariance values involving $k$ random variables $X_1, X_2, \ldots, X_k$ is called the covariance matrix. More formally, the covariance matrix is defined as $\Sigma = [\Sigma_{ij}]_{i,j=1}^{k}$, where

$$\Sigma_{ij} = \mathrm{cov}(X_i, X_j) = E_{x_i, x_j}[(x_i - E[X_i])(x_j - E[X_j])].$$

Here, the diagonal elements of a $k \times k$ covariance matrix are the individual variance values for each variable $X_i$, and the off-diagonal elements are the covariance values between pairs of variables. The covariance matrix is symmetric. It is sometimes called a variance-covariance matrix.
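A sample covariance matrix can be assembled directly from this definition; a sketch in plain Python with illustrative toy observations of $(X_1, X_2, X_3)$:

```python
# Covariance matrix of k variables estimated from n observations (toy data).
data = [  # rows are observations of (X1, X2, X3)
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 6.0, 0.6],
    [4.0, 7.9, 0.5],
]
n, k = len(data), len(data[0])
means = [sum(row[j] for row in data) / n for j in range(k)]

# Sigma[i][j] = average of (x_i - mean_i)(x_j - mean_j) over the observations.
Sigma = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / n
          for j in range(k)] for i in range(k)]

# Diagonal entries are variances; the matrix is symmetric by construction.
print(Sigma[0][0])  # 1.25, the variance of X1
```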

1.5.1 Properties of expectations

Here we review, without proofs, some useful properties of expectations. For any two random variables $X$ and $Y$, and constant $c \in \mathbb{R}$, it holds that:

1. $E[cX] = cE[X]$
2. $E[X + Y] = E[X] + E[Y]$
3. $V[c] = 0$
4. $V[X] \ge 0$
5. $V[cX] = c^2 V[X]$.

In addition, if $X$ and $Y$ are independent random variables, it holds that:

1. $E[XY] = E[X] \cdot E[Y]$
2. $V[X + Y] = V[X] + V[Y]$
3. $\mathrm{cov}(X, Y) = 0$.

1.6 Mixtures of distributions

In previous sections we saw that random variables are often described using particular families of probability distributions. This approach can be generalized by considering mixtures of distributions, i.e. linear combinations of other probability distributions. As before, we shall only consider random variables that have probability mass or density functions.

Given a set of $m$ probability distributions, $\{p_i(x)\}_{i=1}^{m}$, a finite mixture distribution function, or mixture model, $p(x)$ is defined as

$$p(x) = \sum_{i=1}^{m} w_i p_i(x), \tag{2}$$

where $w = (w_1, w_2, \ldots, w_m)$ is a set of non-negative real numbers such that $\sum_{i=1}^{m} w_i = 1$. We refer to $w$ as mixing coefficients or, sometimes, as mixing probabilities. A linear combination with such coefficients is called a convex combination. It is straightforward to verify that a function defined in this manner is indeed a probability distribution.

Here we will briefly look into the basic expectation functions of the mixture distribution. Suppose $\{X_i\}_{i=1}^{m}$ is a set of $m$ random variables described by their respective probability distribution functions $\{p_{X_i}(x)\}_{i=1}^{m}$. Suppose also that a random variable $X$ is described by a mixture distribution with coefficients $w$ and probability distributions $\{p_{X_i}(x)\}_{i=1}^{m}$. Then, assuming continuous random variables defined on $\mathbb{R}$, the expectation function is given as

$$E_x[f(x)] = \int_{-\infty}^{+\infty} f(x) p_X(x)\,dx = \int_{-\infty}^{+\infty} f(x) \sum_{i=1}^{m} w_i p_{X_i}(x)\,dx = \sum_{i=1}^{m} w_i \int_{-\infty}^{+\infty} f(x) p_{X_i}(x)\,dx = \sum_{i=1}^{m} w_i E_{x_i}[f(x)].$$

We can now apply this formula to obtain the mean, when $f(x) = x$, and the variance, when $f(x) = (x - E[X])^2$, of the random variable $X$ as

$$E[X] = \sum_{i=1}^{m} w_i E[X_i]$$

and

$$V[X] = \sum_{i=1}^{m} w_i V[X_i] + \sum_{i=1}^{m} w_i (E[X_i] - E[X])^2,$$

respectively. A mixture distribution can also be defined for countably and uncountably infinite numbers of components. Such distributions, however, are rare in practice.
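The mean and variance formulas above can be verified exactly for discrete components; a sketch with two illustrative, made-up component pmfs:

```python
# Exact mixture moments for two discrete components (made-up pmfs).
w = [0.3, 0.7]
comps = [{0: 0.5, 1: 0.5},      # component 1
         {1: 0.25, 3: 0.75}]    # component 2

E = lambda pmf: sum(x * pr for x, pr in pmf.items())
V = lambda pmf: sum((x - E(pmf)) ** 2 * pr for x, pr in pmf.items())

# Mixture pmf: p(x) = sum_i w_i p_i(x).
mix = {}
for wi, pmf in zip(w, comps):
    for x, pr in pmf.items():
        mix[x] = mix.get(x, 0.0) + wi * pr

# E[X] = sum w_i E[X_i];  V[X] = sum w_i V[X_i] + sum w_i (E[X_i] - E[X])^2.
mean_formula = sum(wi * E(pmf) for wi, pmf in zip(w, comps))
var_formula = (sum(wi * V(pmf) for wi, pmf in zip(w, comps))
               + sum(wi * (E(pmf) - mean_formula) ** 2 for wi, pmf in zip(w, comps)))

assert abs(E(mix) - mean_formula) < 1e-12
assert abs(V(mix) - var_formula) < 1e-12
print(mean_formula, var_formula)
```

Both formulas agree with the moments computed directly from the mixture pmf.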

Example 7. Signal communications. Consider transmission of a single binary digital signal (bit) over a noisy communication channel shown in Figure 2. The magnitude of the signal $X$ emitted by the source is equally likely to be 0 or 1 Volt. The signal is sent over a transmission line (e.g. radio communication, optical fiber, magnetic tape) in which a zero-mean normally distributed noise component $Y$ is added to $X$. Derive the probability distribution of the signal $Z = X + Y$ that enters the receiver.

We will consider a slightly more general situation where $X : \mathrm{Bernoulli}(\alpha)$ and $Y : \mathrm{Gaussian}(\mu, \sigma^2)$. To find $p_Z(z)$ we will use the characteristic functions of random variables $X$, $Y$ and $Z$, written as $\varphi_X(t) = E[e^{itx}]$, $\varphi_Y(t) = E[e^{ity}]$ and $\varphi_Z(t) = E[e^{itz}]$. Without derivation we write

$$\varphi_X(t) = 1 - \alpha + \alpha e^{it}$$
$$\varphi_Y(t) = e^{it\mu - \frac{\sigma^2 t^2}{2}}$$

and subsequently

$$\varphi_Z(t) = \varphi_{X+Y}(t) = \varphi_X(t) \cdot \varphi_Y(t) = \left(1 - \alpha + \alpha e^{it}\right) \cdot e^{it\mu - \frac{\sigma^2 t^2}{2}} = \alpha e^{it(\mu + 1) - \frac{\sigma^2 t^2}{2}} + (1 - \alpha) e^{it\mu - \frac{\sigma^2 t^2}{2}}.$$

By performing integration on $\varphi_Z(t)$ we can easily verify that

$$p_Z(z) = \alpha \cdot \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(z - \mu - 1)^2} + (1 - \alpha) \cdot \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(z - \mu)^2},$$

which is a mixture of two normal distributions, $N(\mu + 1, \sigma^2)$ and $N(\mu, \sigma^2)$, with coefficients $w_1 = \alpha$ and $w_2 = 1 - \alpha$, respectively. Observe that a convex combination of random variables $Z = w_1 X + w_2 Y$ does not imply $p_Z(z) = w_1 p_X(x) + w_2 p_Y(y)$.

Figure 2: A digital signal communication system with additive noise. The source emits $X : \mathrm{Bernoulli}(\alpha)$, the channel adds noise $Y : \mathrm{Gaussian}(\mu, \sigma^2)$, and the receiver observes $Z = X + Y$.
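The mixture result predicts $E[Z] = \alpha(\mu + 1) + (1 - \alpha)\mu$, which is easy to check by simulating the channel; a minimal sketch (parameter values, sample size and seed are arbitrary choices):

```python
import random

random.seed(7)

alpha, mu, sigma = 0.5, 0.0, 1.0   # X ~ Bernoulli(0.5), Y ~ N(0, 1)

# Simulate Z = X + Y: a random bit plus zero-mean Gaussian channel noise.
n = 200_000
zs = [(1 if random.random() < alpha else 0) + random.gauss(mu, sigma)
      for _ in range(n)]

# The derived mixture of N(mu + 1, sigma^2) and N(mu, sigma^2) has this mean.
predicted_mean = alpha * (mu + 1) + (1 - alpha) * mu
sample_mean = sum(zs) / n
print(sample_mean, predicted_mean)  # the two should be close
```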

1.7 Graphical representation of probability distributions

We saw earlier that a joint probability distribution can be factorized using the chain rule from Eq. (1). Such factorizations can be visualized using a directed graph representation, where nodes represent random variables and edges depict conditional dependence. For example,

$$p(x, y, z) = p(x)p(y|x)p(z|x, y)$$

is shown in Figure 3A.

Such visualization is especially convenient for showing conditional independence, which can be represented as missing edges in the graph. Figure 3B shows the same factorization of $p(x, y, z)$ where variable $Z$ is independent of $X$ given $Y$.


Figure 3: Bayesian network: graphical representation of two joint probability distributions for three discrete (binary) random variables $(X, Y, Z)$ using directed acyclic graphs. The probability mass function $p(x, y, z)$ is defined over $\{0, 1\}^3$. (A) Full factorization, $P(X = x, Y = y, Z = z) = P(X = x)P(Y = y|X = x)P(Z = z|X = x, Y = y)$. (B) Factorization that shows and ensures conditional independence between $Z$ and $X$, given $Y$: $P(X = x, Y = y, Z = z) = P(X = x)P(Y = y|X = x)P(Z = z|Y = y)$. Each node is associated with a conditional probability distribution; in discrete cases, these conditional distributions are referred to as conditional probability tables. In both panels, $P(X = 1) = 0.3$, $P(Y = 1|X = 0) = 0.5$ and $P(Y = 1|X = 1) = 0.9$; in panel B, $P(Z = 1|Y = 0) = 0.2$ and $P(Z = 1|Y = 1) = 0.7$.

16

Page 17: Machine Learning Lecture Notes - Computer Sciencepredrag/classes/2015springb555/3.pdf · Machine Learning Lecture Notes Predrag Radivojac ... It is important to mention that,

Graphical representations of probability distributions using directed acyclic graphs, together with conditional probability distributions, are called Bayesian networks. They facilitate interpretation as well as effective statistical inference. Given a set of $k$ random variables $X = (X_1, \ldots, X_k)$, Bayesian networks factorize the joint probability distribution of $X$ as

$$p(x) = \prod_{i=1}^{k} p\left(x_i \,|\, x_{\mathrm{Parents}(X_i)}\right),$$

where $\mathrm{Parents}(X)$ denotes the immediate ancestors of node $X$ in the graph. In Figure 3B, node $Y$ is a parent of $Z$, but node $X$ is not a parent of $Z$.

It is important to mention that there are multiple (how many?) ways of factorizing a distribution. For example, by reversing the order of variables, $p(x, y, z)$ can also be factorized as

$$p(x, y, z) = p(z)p(y|z)p(x|y, z),$$

which has a different graphical representation and its own conditional probability distributions, yet the same joint probability distribution as the earlier factorization. Selecting a proper factorization and estimating the conditional probability distributions from data will be discussed in detail later.
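The factorization of Figure 3B can be evaluated directly from its conditional probability tables ($P(X{=}1) = 0.3$; $P(Y{=}1|X) = 0.5, 0.9$; $P(Z{=}1|Y) = 0.2, 0.7$); a minimal sketch (function and variable names are ours):

```python
from itertools import product

# Conditional probability tables from Figure 3B: p(x, y, z) = p(x) p(y|x) p(z|y).
P_X1 = 0.3                        # P(X = 1)
P_Y1_given_X = {0: 0.5, 1: 0.9}   # P(Y = 1 | X = x)
P_Z1_given_Y = {0: 0.2, 1: 0.7}   # P(Z = 1 | Y = y)

def joint(x, y, z):
    px = P_X1 if x == 1 else 1 - P_X1
    py = P_Y1_given_X[x] if y == 1 else 1 - P_Y1_given_X[x]
    pz = P_Z1_given_Y[y] if z == 1 else 1 - P_Z1_given_Y[y]
    return px * py * pz

total = sum(joint(*xyz) for xyz in product((0, 1), repeat=3))
print(total)  # sums to 1 (up to round-off): a valid joint distribution
```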

Undirected graphs can also be used to factorize probability distributions. The main idea here is to decompose the graph into its maximal cliques $\mathcal{C}$ (the smallest set of cliques that covers the graph) and express the distribution in the following form:

$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$

where each $\psi_C(x_C) \ge 0$ is called a clique potential function and

$$Z = \int_x \prod_{C \in \mathcal{C}} \psi_C(x_C)\,dx$$

is called the partition function, used strictly for normalization purposes. In contrast to the conditional probability distributions in directed acyclic graphs, the clique potentials usually do not have conditional probability interpretations and, thus, normalization is necessary. One example of a maximal clique decomposition is shown in Figure 4.

The potential functions are typically taken to be strictly positive, $\psi_C(x_C) > 0$, and expressed as

$$\psi_C(x_C) = \exp\left(-E(x_C)\right),$$

where $E(x_C)$ is a user-specified energy function on the clique of random variables $X_C$. This leads to a probability distribution of the following form:

$$p(x) = \frac{1}{Z} \exp\left(\sum_{C \in \mathcal{C}} \log \psi_C(x_C)\right).$$

As formulated, this probability distribution is called the Boltzmann distribution or the Gibbs distribution.

Figure 4: Markov network: graphical representation of a probability distribution using maximal clique decomposition. Shown is a set of eight random variables with their interdependency structure and maximal clique decomposition (a clique is a fully connected subgraph of a given graph). A decomposition into maximal cliques covers all vertices and edges in a graph with the minimum number of cliques. Here, the set of variables is decomposed into four maximal cliques $\mathcal{C} = \{C_1, C_2, C_3, C_4\}$, with $C_1 = \{X_1, X_2, X_3, X_4\}$, $C_2 = \{X_2, X_5\}$, $C_3 = \{X_5, X_6\}$, and $C_4 = \{X_6, X_7, X_8\}$.
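A tiny Gibbs distribution can be built explicitly for a discrete case, where the partition function is a finite sum. The sketch below assumes a chain $X_1 - X_2 - X_3$ with pairwise cliques and a made-up energy that favors agreeing neighbors:

```python
import math
from itertools import product

def E_pair(a, b):
    # Toy energy on a pairwise clique: lower energy when neighbors agree.
    return 0.0 if a == b else 1.0

def unnorm(x):
    # prod_C psi_C(x_C) with psi_C = exp(-E), cliques {X1,X2} and {X2,X3}.
    return math.exp(-(E_pair(x[0], x[1]) + E_pair(x[1], x[2])))

states = list(product((0, 1), repeat=3))
Z = sum(unnorm(x) for x in states)        # partition function (finite sum here)
p = {x: unnorm(x) / Z for x in states}    # the Gibbs distribution

# Fully agreeing configurations are the most probable under this energy.
print(p[(0, 0, 0)] == max(p.values()), sum(p.values()))
```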

The energy function $E(x)$ must be lower for values of $x$ that are more likely. It may also involve parameters that are then estimated from the available training data. Of course, in a prediction problem, an undirected graph must be created to also involve the target variables, which were here considered to be a subset of $X$.

Consider now any probability distribution over all possible configurations of the random vector $X$ with its underlying graphical representation. If the following property

$$p(x_i \,|\, x_{-X_i}) = p\left(x_i \,|\, x_{N(X_i)}\right) \tag{3}$$

is satisfied, the probability distribution is referred to as a Markov network or a Markov random field. In the equation above,

$$X_{-X_i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_k)$$

and $N(X)$ is the set of random variables neighboring $X$ in the graph; i.e. there exists an edge between $X$ and every node in $N(X)$. The set of random variables in $N(X)$ is also called the Markov blanket of $X$.

It can be shown that every Gibbs distribution satisfies the property from Eq. (3) and, conversely, that every probability distribution for which Eq. (3) holds can be represented as a Gibbs distribution with some choice of parameters. This equivalence of Gibbs distributions and Markov networks was established by the Hammersley-Clifford theorem.