
Metric Distances Between Probability Distributions of Different Sizes

M. Vidyasagar
Cecil & Ida Green Chair, The University of Texas at Dallas
[email protected]
www.utdallas.edu/~m.vidyasagar

Johns Hopkins University, 20 October 2011


Outline

1 Motivation
    Problem Formulation
    Source of Difficulty
2 Information Theory Background
3 A Metric Distance Between Distributions
    Definition of the Metric Distance
    Computing the Distance
4 Optimal Order Reduction
5 Concluding Remarks


Motivation

Originally: Given a hidden Markov process with a very large state space, how can one approximate it ‘optimally’ with another HMP with a much smaller state space?

Analog control ← digital control.

Signals in R^d ← signals over finite sets.

Applications: Control over networks, data compression, reduced-size noise modeling, etc.

If u, y are input and output, view {ut, yt} as a stochastic process over some finite set U × Y; then ‘reduced order modeling’ is approximating {ut, yt} by another stochastic process over a ‘smaller cardinality’ set U′ × Y′.


General Problem: Simplified Modeling of Stochastic Processes

Suppose {Xt} is a stochastic process assuming values in a ‘large’ but finite set A with n elements; we wish to approximate it by another process {Yt} assuming values in a ‘small’ finite set B with m < n elements.

Questions:

How do we define the ‘distance’ (think ‘modeling error’) between the two processes {Xt} and {Yt}?

Given {Xt}, how do we find the ‘best possible’ reduced order approximation to {Xt} in the chosen distance?


Scope of Today’s Talk

Scope of this talk: i.i.d. (independent, identically distributed) processes.

An i.i.d. process {Xt} is completely described by its ‘one-dimensional marginal,’ i.e., the distribution of X1 (or of any Xt).

Questions:

Given two probability distributions φ, ψ on finite sets A, B, how can we define a ‘distance’ between them?

Given a distribution φ with n components and an integer m < n, how can we find the ‘best possible’ m-dimensional approximation to φ?

Full paper at: http://arxiv.org/pdf/1104.4521v2


Total Variation Metric

Suppose A = {a1, . . . , an} is a finite set, and φ, ψ are probability distributions on A. Then the total variation metric is

$$\rho(\phi, \psi) = \frac{1}{2} \sum_{i=1}^{n} |\phi_i - \psi_i| = \sum_{i=1}^{n} \{\phi_i - \psi_i\}_+ = -\sum_{i=1}^{n} \{\phi_i - \psi_i\}_-,$$

where $\{x\}_+ = \max\{x, 0\}$ and $\{x\}_- = \min\{x, 0\}$.
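As a quick numerical check, here is a minimal NumPy sketch of ρ (the code and function names are ours, not from the talk); it verifies that the three expressions in the definition agree:

```python
import numpy as np

def total_variation(phi, psi):
    """Total variation distance between two distributions on the same set."""
    phi, psi = np.asarray(phi, float), np.asarray(psi, float)
    return 0.5 * np.abs(phi - psi).sum()

phi, psi = [0.55, 0.45], [0.48, 0.52]   # the example on the next slide
diff = np.asarray(phi) - np.asarray(psi)
assert np.isclose(total_variation(phi, psi), np.maximum(diff, 0).sum())
assert np.isclose(total_variation(phi, psi), -np.minimum(diff, 0).sum())
print(total_variation(phi, psi))        # ≈ 0.07
```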

ρ is invariant when the same permutation is applied to the components of both φ and ψ (i.e., when the elements of A are rearranged). So if π is a permutation of the elements of A, then

ρ(π(φ), π(ψ)) = ρ(φ,ψ).

But what if φ,ψ are probability measures on different sets?


Permutation Invariance Example

Example:

A = {H, T}, B = {M, W}, φ = [0.55 0.45], ψ = [0.48 0.52].

What is the distance between φ and ψ?

If we identify H ↔ M, T ↔ W, then ρ(φ,ψ) = 0.07.

But if we identify H ↔ W, T ↔ M, then ρ(φ,ψ) = 0.03.

Which one is more ‘natural’? Answer: There is no natural association!


Permutation Invariance: An Inherent Feature

Suppose φ, ψ are probability distributions on distinct sets A, B, and we wish to ‘compare’ them. Suppose d(φ,ψ) is the distance (yet to be defined).

Claim: Suppose π is a permutation on A and ξ is a permutation on B. Then we must have

$$d(\phi,\psi) = d(\pi(\phi), \xi(\psi)).$$

Ergo: Any definition of the distance must be invariant under possibly different permutations of A and B.

In particular, if A, B are distinct sets, then even if |A| = |B|, our distance cannot reduce to ρ.


Notation

A = {1, . . . , n}¹, B = {1, . . . ,m}; φ is a probability distribution on A, ψ is a probability distribution on B; X is a random variable on A with distribution φ, and Y is an r.v. on B with distribution ψ.

$$S_n := \Big\{ v \in \mathbb{R}_+^n : \sum_{i=1}^{n} v_i = 1 \Big\}.$$

So φ ∈ Sn, ψ ∈ Sm. Sn×m denotes the set of n × m ‘stochastic matrices’, i.e., matrices whose rows add up to one:

$$S_{n \times m} = \{ P \in [0,1]^{n \times m} : P e_m = e_n \},$$

where en is the column vector of n ones.

¹ We don’t write A = {a1, . . . , an}, but that is what we mean.


Entropy

Suppose φ ∈ Sn. Then the (Shannon) entropy of φ is

$$H(\phi) = -\sum_{i=1}^{n} \phi_i \log \phi_i.$$

We can also call this H(X) (i.e., we may associate entropy with an r.v. or with its distribution).
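The talk does not fix the base of the logarithm; the sketch below (ours, not from the talk) uses base 2, so entropies come out in bits, with the usual convention 0 log 0 = 0:

```python
import numpy as np

def entropy(phi):
    """Shannon entropy H(phi) in bits; zero-probability terms contribute 0."""
    phi = np.asarray(phi, float)
    nz = phi[phi > 0]
    return float(-(nz * np.log2(nz)).sum())

print(entropy([0.5, 0.5]))   # 1.0
print(entropy([1.0, 0.0]))   # 0.0
```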


Mutual Information

Suppose X, Y are r.v.s on A, B, and let θ denote their joint distribution. Then θ ∈ Snm, with θA = φ, θB = ψ (the marginal distributions).

$$I(X,Y) := H(X) + H(Y) - H(X,Y)$$

is called the mutual information between X and Y.

Alternate formula:

$$I(X,Y) = \sum_{i \in A} \sum_{j \in B} \theta_{ij} \log \frac{\theta_{ij}}{\phi_i \psi_j}.$$

Note that I(X,Y ) = I(Y,X).
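A quick numerical check of the alternate formula and of the symmetry I(X,Y) = I(Y,X), on a made-up 2 × 2 joint distribution (the code is ours; base-2 logs as before):

```python
import numpy as np

def mutual_information(theta):
    """I(X,Y) in bits, computed from an n x m joint distribution theta."""
    theta = np.asarray(theta, float)
    phi = theta.sum(axis=1)            # marginal distribution of X
    psi = theta.sum(axis=0)            # marginal distribution of Y
    prod = np.outer(phi, psi)
    mask = theta > 0
    return float((theta[mask] * np.log2(theta[mask] / prod[mask])).sum())

theta = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
print(mutual_information(theta))       # ≈ 0.1245 bits
assert np.isclose(mutual_information(theta), mutual_information(theta.T))
```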


Conditional Entropy

The quantity

$$H(X|Y) = H(X,Y) - H(Y)$$

is called the conditional entropy of X given Y. Note that H(X|Y) ≠ H(Y|X) in general. In fact

$$H(Y|X) = H(X|Y) + H(Y) - H(X).$$

If θ is the joint distribution of X, Y, then

$$H(X|Y) = H(\theta) - H(\psi), \qquad H(Y|X) = H(\theta) - H(\phi).$$


Conditional Entropy: Alternate Formulation

Define

$$\Theta = [\theta_{ij}] = [\Pr\{X = i \ \&\ Y = j\}] \in [0,1]^{n \times m},$$

and define P = [Diag(φ)]⁻¹ Θ. Clearly

$$p_{ij} = \frac{\theta_{ij}}{\phi_i} = \Pr\{Y = j \mid X = i\}.$$

So P ∈ Sn×m, and

$$H(Y|X) = \sum_{i \in A} \phi_i H(p_i) =: J_\phi(P),$$

where pi is the i-th row of P.
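A small sketch (ours) verifying H(Y|X) = Jφ(P) = H(θ) − H(φ) on a toy joint distribution:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with 0 log 0 = 0; accepts vectors or matrices."""
    p = np.asarray(p, float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

Theta = np.array([[0.3, 0.2],           # joint distribution of (X, Y)
                  [0.1, 0.4]])
phi = Theta.sum(axis=1)                 # distribution of X
P = Theta / phi[:, None]                # P = Diag(phi)^{-1} Theta; row i is p_i
J = sum(phi[i] * entropy(P[i]) for i in range(len(phi)))   # J_phi(P)
assert np.isclose(J, entropy(Theta) - entropy(phi))        # H(Y|X) = H(theta) - H(phi)
print(J)                                                   # ≈ 0.8464 bits
```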


Maximizing Mutual Information (MMI)

Given φ ∈ Sn, ψ ∈ Sm, we look for a ‘joint’ distribution θ ∈ Snm such that θA = φ, θB = ψ, and in addition the entropy H(θ) is minimum, or equivalently the conditional entropy of X given Y is minimum, or again equivalently the mutual information between X and Y is maximum. In other words, define

$$W(\phi,\psi) := \min \{\, H(\theta) : \theta_A = \phi,\ \theta_B = \psi \,\} = \min \{\, H(X,Y) : \theta_A = \phi,\ \theta_B = \psi \,\},$$

$$V(\phi,\psi) := W(\phi,\psi) - H(\phi) = \min \{\, H(Y|X) : \theta_A = \phi,\ \theta_B = \psi \,\}.$$

We try to make Y ‘as deterministic as possible’ given X.


The Variation of Information Metric

The quantity

$$d(\phi,\psi) := V(\phi,\psi) + V(\psi,\phi)$$

is called the variation of information metric. Since

$$V(\psi,\phi) = V(\phi,\psi) + H(\phi) - H(\psi),$$

we need to compute only one of V(ψ,φ), V(φ,ψ).

The quantity d satisfies

d(φ,φ) = 0 and d(φ,ψ) ≥ 0 ∀φ,ψ.

d(φ,ψ) = d(ψ,φ).

The triangle inequality holds:

d(φ, ξ) ≤ d(φ,ψ) + d(ψ, ξ).


Proof of Triangle Inequality

Suppose X, Y, Z are r.v.s on finite sets A, B, C. Then conditional entropy satisfies a ‘one-sided’ triangle inequality:

$$H(X|Y) \le H(X|Z) + H(Z|Y).$$

Proof:

$$H(X|Y) \le H(X,Z|Y) = H(Z|Y) + H(X|Z,Y) \le H(Z|Y) + H(X|Z).$$

So if we define

$$v(X,Y) = H(X|Y) + H(Y|X),$$

then v satisfies the triangle inequality. Note that

$$d(\phi,\psi) = \min v(X,Y),$$

the minimum being taken over joint distributions of (X,Y) such that X, Y have distributions φ, ψ respectively.


A Key Property

Actually d is a pseudometric, not a metric. In other words, d(φ,ψ) can be zero even if φ ≠ ψ.

Theorem: d(φ,ψ) = 0 if and only if n = m and φ, ψ are permutations of each other.

Consequence: The metric d is not convex!

Example: Let n = m = 2,

φ = [0.75 0.25], ψ = [0.25 0.75], ξ = 0.5φ + 0.5ψ = [0.5 0.5].

Then d(φ,φ) = d(φ,ψ) = 0 but d(φ, ξ) > 0.
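This example can be checked exactly in the 2 × 2 case: the joint distributions with marginals φ, ψ form a line segment, and since H is concave, its minimum over the segment is attained at an endpoint. The sketch below (our construction, not from the talk; base-2 logs) exploits this to evaluate d:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def d_2x2(phi, psi):
    """Variation of information d(phi, psi) for distributions on 2-element sets.

    Joints with these marginals are theta(t) = [[t, phi[0]-t],
    [psi[0]-t, 1-phi[0]-psi[0]+t]]; H(theta(t)) is concave in t, so its
    minimum W is at an endpoint, and d = 2W - H(phi) - H(psi).
    """
    lo = max(0.0, phi[0] + psi[0] - 1.0)
    hi = min(phi[0], psi[0])
    W = min(entropy([t, phi[0] - t, psi[0] - t, 1.0 - phi[0] - psi[0] + t])
            for t in (lo, hi))
    return 2 * W - entropy(phi) - entropy(psi)

phi, psi, xi = [0.75, 0.25], [0.25, 0.75], [0.5, 0.5]
print(d_2x2(phi, psi))   # 0.0      (psi is a permutation of phi)
print(d_2x2(phi, xi))    # ≈ 1.1887 (> 0, so d is not convex)
```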


Computing the Metric Distance

Change the variable of optimization from θ, the joint distribution, to P, the matrix of conditional probabilities:

$$V(\phi,\psi) = \min_{\theta \in S_{nm}} \{\, H(\theta) - H(\phi) : \theta_A = \phi,\ \theta_B = \psi \,\} = \min_{P \in S_{n \times m}} \{\, J_\phi(P) : \phi P = \psi \,\},$$

where, as before, $J_\phi(P) = \sum_{i \in A} \phi_i H(p_i)$.

Since Jφ(·) is a strictly concave function, and the feasible region is polyhedral (the convex hull of a finite number of extreme points), the minimum occurs at one of these extreme points.

Also, a ‘principle of optimality’ allows us to break large problems down into smaller ones.


The m = 2 Case

Partition the set {1, . . . , n} into two sets I1, I2 such that

$$\Big| \psi_1 - \sum_{i \in I_1} \phi_i \Big| = \Big| \psi_2 - \sum_{i \in I_2} \phi_i \Big|$$

is minimized.

Interpretation: Given two bins with capacities ψ1, ψ2, assign φ1, . . . , φn to the two bins so that the overflow in one bin (and the under-utilization in the other) is minimized.

Theorem: If both bins are filled exactly, then V (φ,ψ) = 0.
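A brute-force sketch of this two-bin assignment (ours; it enumerates all 2ⁿ subsets, so it is for small examples only):

```python
from itertools import combinations

def best_two_bin_split(phi, psi1):
    """Exhaustively find I1 minimizing |psi1 - sum of phi_i over I1|."""
    n = len(phi)
    gap, I1 = min((abs(psi1 - sum(phi[i] for i in I1)), I1)
                  for r in range(n + 1)
                  for I1 in combinations(range(n), r))
    return gap, set(I1), set(range(n)) - set(I1)

phi = [0.4, 0.3, 0.2, 0.1]
gap, I1, I2 = best_two_bin_split(phi, psi1=0.5)
print(gap, I1, I2)   # gap 0.0: both bins filled exactly, so V(phi, psi) = 0
```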


The m = 2 Case (Continued)

Theorem: If the two bins cannot both be filled exactly, and if the smallest element of φ (call it φ0) belongs to the overstuffed bin, then

$$V(\phi,\psi) = \phi_0\, H([\, u/\phi_0 \ \ (\phi_0 - u)/\phi_0 \,]),$$

where u is the unutilized capacity (or overflow).

If φ0 belongs to the underutilized bin, the above is an upper bound for V(φ,ψ) but may not equal it.

Bad news: Computing the optimal partitioning is NP-hard!

So we need an approximation procedure to upper bound d(φ,ψ). And there is no special reason to restrict to m = 2.


Best Fit Algorithm for Bin-Packing

Think of ψ1, . . . , ψm as capacities of m bins.

Arrange ψj in descending order.

For i = 1, . . . , n, place each φi into the bin with maximum unutilized capacity.

If a bin overflows, don’t fill it anymore.

Complexity is O(n log m).

Provably suboptimal: the total bin size used is ≤ 1.25 times the optimal bin size; this is the best-known bound.

The corresponding bound on the overflow is not as good: it is ≤ 0.25 + 1.25 × the optimal value.
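A minimal sketch of the heuristic as just described (the code and names are ours; the talk does not specify tie-breaking, so ties go to the lowest-index bin, which may differ from the figure on the next slide):

```python
def best_fit(phi, psi):
    """Best-fit heuristic: pack items phi into bins with capacities psi.

    Each item goes into the bin with the largest unutilized capacity;
    once a bin has overflowed, it receives nothing further.
    Returns the item indices in each bin and the leftover capacities.
    """
    remaining = list(psi)
    bins = [[] for _ in psi]
    for i, item in enumerate(phi):
        open_bins = [j for j in range(len(psi)) if remaining[j] >= 0]
        j = max(open_bins, key=lambda k: remaining[k])
        bins[j].append(i)
        remaining[j] -= item
    return bins, remaining

# Data from the illustration on the next slide:
psi = [0.45, 0.30, 0.25]
phi = [0.14, 0.13, 0.12, 0.09, 0.08, 0.07, 0.06, 0.06,
       0.05, 0.05, 0.04, 0.04, 0.04, 0.03]
bins, leftover = best_fit(phi, psi)
print(bins)
```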


Illustration of Best Fit Algorithm

Suppose ψ = [0.45 0.30 0.25], and φ ∈ S14 is given by

φ = 10⁻² · [14 13 12 9 8 7 6 6 5 5 4 4 4 3]

[Figure: best-fit assignment of the φi to the three bins.
Bin 1 (capacity 0.45): φ1, φ2, φ6, φ8, φ11 (total 0.44).
Bin 2 (capacity 0.30): φ3, φ5, φ9, φ12, φ14 (total 0.32).
Bin 3 (capacity 0.25): φ4, φ7, φ10, φ13 (total 0.24).]


A Greedy Algorithm for Bounding d(φ,ψ)

Think of ψ1, . . . , ψm as m bins to be packed. Modify the ‘best fit’ algorithm: sort ψ in decreasing order, and put each φi into the bin with the largest unutilized capacity. If φi does not fit into any bin, put it aside (the departure from the best fit algorithm).

When all of φ has been processed, say k entries have been put aside; then k < m. Now solve a k-by-m bin-packing problem, and repeat. See the full paper for details.

This results in a sequence of bin-packing problems of decreasing size. The outcome is an upper bound for d(φ,ψ). Complexity is O((n + m²) log m).


Problem Formulation

Problem: Given φ ∈ Sn and m < n, find ψ ∈ Sm that minimizes d(φ,ψ).

Definition: ψ ∈ Sm is said to be an aggregation of φ ∈ Sn if there exists a partition I1, . . . , Im of {1, . . . , n} such that

$$\psi_j = \sum_{i \in I_j} \phi_i.$$
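In code, an aggregation is just a partition-wise sum (a trivial sketch, ours; indices are 0-based):

```python
import numpy as np

def aggregate(phi, partition):
    """psi_j = sum of phi_i over the j-th cell I_j of the partition."""
    return np.array([sum(phi[i] for i in cell) for cell in partition])

phi = [0.4, 0.3, 0.2, 0.1]
print(aggregate(phi, [{0, 3}, {1, 2}]))   # [0.5 0.5]
```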


Results

Theorem: Any optimal approximation of φ in the variation of information metric d must be an aggregation of φ.

Theorem: An aggregation φ(a) of φ is an optimal approximation of φ in the variation of information metric d if and only if it has maximum entropy (amongst all aggregations).

Note: um, the uniform distribution with m elements, has maximum entropy in Sm. So should we try to minimize ρ(φ(a), um)? This is yet another bin-packing problem, with all bin capacities equal to 1/m. But how valid is this approach?


NP-Hardness of Optimal Order Reduction

Suppose m = 2 and let φ(a) be an aggregation of φ ∈ Sn. Then φ(a) has maximum entropy if and only if the total variation distance ρ(φ(a), u2) is minimized.

Therefore:

When m = 2, minimizing ρ(φ(a), u2) gives an optimal reduced order approximation to φ.

For m ≥ 3, minimizing ρ(φ(a), um) gives a suboptimal reduced order approximation to φ.


Reformulation of Problem

Problem: Given φ ∈ Sn and m < n, find an aggregation φ(a) ∈ Sm with maximum entropy.

Reformulation: Since the uniform distribution um has maximum entropy in Sm, find an aggregation φ(a) such that the total variation distance ρ(φ(a), um) is minimized.

More general problem (not any harder): Given φ ∈ Sn, m < n and ξ ∈ Sm, find an aggregation φ(a) that minimizes the total variation distance ρ(φ(a), ξ).

Use the best-fit algorithm to find an aggregation of φ that is close to the uniform distribution. Complexity is O(n log m).
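Putting the pieces together, a sketch (ours) that aggregates φ toward ξ = um with the best-fit heuristic; for m = 2, an exhaustive search over partitions would instead give the exact optimum, per the previous slide:

```python
def best_fit_aggregation(phi, xi):
    """Aggregate phi by best-fit packing into bins with capacities xi;
    psi[j] collects the phi_i assigned to bin j, so psi is an aggregation of phi."""
    remaining = list(xi)
    psi = [0.0] * len(xi)
    for item in phi:
        open_bins = [j for j in range(len(xi)) if remaining[j] >= 0]
        j = max(open_bins, key=lambda k: remaining[k])
        psi[j] += item
        remaining[j] -= item
    return psi

phi = [0.30, 0.25, 0.20, 0.15, 0.10]
m = 2
psi = best_fit_aggregation(phi, [1.0 / m] * m)
print([round(p, 2) for p in psi])   # [0.55, 0.45]; the exact optimum here is [0.5, 0.5]
```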


Bound on Performance of Best Fit Algorithm

Theorem: Let φ(a) denote the aggregation of φ produced by the best fit algorithm. Then

$$\rho(\phi^{(a)}, \xi) \le 0.25\, m\, \phi_{\max},$$

where φmax is the largest component of φ and ρ is the total variation metric.

This can be turned into a (messy) bound on the entropy of φ(a).


Achievements

Definition of a proper metric between probability distributions defined on sets of different cardinalities, by maximizing the mutual information between the two distributions.

Study of the properties of the distance and of the problem of computing it; showing the close relationship to the bin-packing problem with overstuffing.

Since bin-packing is NP-hard, adapting the best-fit algorithm to generate upper bounds for the distance in polynomial time.

Characterization of the solution of the optimal order reduction problem in terms of aggregating to maximize entropy.

Formulation as a bin-packing problem with overstuffing.

Upper bound on performance of algorithm.


Next Steps

Extension of the metric to Markov processes, hidden Markov processes, and arbitrary (but stationary ergodic) stochastic processes.

Alternatives to best-fit heuristic algorithm.

Thank You!
