
Metric Distances Between Probability Distributions of Different Sizes

M. Vidyasagar
Cecil & Ida Green Chair, The University of Texas at Dallas
[email protected]
www.utdallas.edu/~m.vidyasagar

Johns Hopkins University, 20 October 2011


Outline

1 Motivation
    Problem Formulation
    Source of Difficulty
2 Information Theory Background
3 A Metric Distance Between Distributions
    Definition of the Metric Distance
    Computing the Distance
4 Optimal Order Reduction
5 Concluding Remarks


Motivation

Originally: Given a hidden Markov process with a very large state space, how can one approximate it ‘optimally’ with another HMP with a much smaller state space?

Analog control ← digital control.

Signals in R^d ← signals over finite sets.

Applications: Control over networks, data compression, reduced-size noise modeling, etc.

If u, y are input and output, view {ut, yt} as a stochastic process over some finite set U × Y; then ‘reduced order modeling’ is approximating {ut, yt} by another stochastic process over a ‘smaller cardinality’ set U′ × Y′.


General Problem: Simplified Modeling of Stochastic Processes

Suppose {Xt} is a stochastic process assuming values in a ‘large’ but finite set A with n elements; we wish to approximate it by another process {Yt} assuming values in a ‘small’ finite set B with m < n elements.

Questions:

How do we define the ‘distance’ (think ‘modeling error’) between the two processes {Xt} and {Yt}?

Given {Xt}, how do we find the ‘best possible’ reduced order approximation to {Xt} in the chosen distance?


Scope of Today’s Talk

Scope of this talk: i.i.d. (independent, identically distributed) processes.

An i.i.d. process {Xt} is completely described by its ‘one-dimensional marginal,’ i.e., the distribution of X1 (or of any Xt).

Questions:

Given two probability distributions φ, ψ on finite sets A, B, how can we define a ‘distance’ between them?

Given a distribution φ with n components and an integer m < n, how can we find the ‘best possible’ m-dimensional approximation to φ?

Full paper at: http://arxiv.org/pdf/1104.4521v2


Total Variation Metric

Suppose A = {a1, . . . , an} is a finite set, and φ, ψ are probability distributions on A. Then the total variation metric is

$$\rho(\phi, \psi) = \frac{1}{2} \sum_{i=1}^{n} |\phi_i - \psi_i| = \sum_{i=1}^{n} \{\phi_i - \psi_i\}_+ = -\sum_{i=1}^{n} \{\phi_i - \psi_i\}_-,$$

where $\{x\}_+ = \max\{x, 0\}$ and $\{x\}_- = \min\{x, 0\}$.
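As a quick numerical check, here is a minimal NumPy sketch of ρ (the code and function names are ours, not from the talk); it verifies that the three expressions in the definition agree:

```python
import numpy as np

def total_variation(phi, psi):
    """Total variation distance between two distributions on the same set."""
    phi, psi = np.asarray(phi, float), np.asarray(psi, float)
    return 0.5 * np.abs(phi - psi).sum()

phi, psi = [0.55, 0.45], [0.48, 0.52]   # the example on the next slide
diff = np.asarray(phi) - np.asarray(psi)
assert np.isclose(total_variation(phi, psi), np.maximum(diff, 0).sum())
assert np.isclose(total_variation(phi, psi), -np.minimum(diff, 0).sum())
print(total_variation(phi, psi))        # ≈ 0.07
```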

ρ is invariant when the same permutation is applied to the components of both φ and ψ (i.e., when the elements of A are rearranged). So if π is a permutation of the elements of A, then

ρ(π(φ), π(ψ)) = ρ(φ,ψ).

But what if φ,ψ are probability measures on different sets?


Permutation Invariance Example

Example:

A = {H, T}, B = {M, W}, φ = [0.55 0.45], ψ = [0.48 0.52].

What is the distance between φ and ψ?

If we identify H ↔ M, T ↔ W, then ρ(φ,ψ) = 0.07.

But if we identify H ↔ W, T ↔ M, then ρ(φ,ψ) = 0.03.

Which one is more ‘natural’? Answer: There is no natural association!


Permutation Invariance: An Inherent Feature

Suppose φ, ψ are probability distributions on distinct sets A, B, and we wish to ‘compare’ them. Suppose d(φ,ψ) is the distance (yet to be defined).

Claim: Suppose π is a permutation on A and ξ is a permutation on B. Then we must have

$$d(\phi,\psi) = d(\pi(\phi), \xi(\psi)).$$

Ergo: Any definition of the distance must be invariant under possibly different permutations of A and B.

In particular, if A, B are distinct sets, then even if |A| = |B|, our distance cannot reduce to ρ.


Notation

A = {1, . . . , n}¹, B = {1, . . . ,m}; φ is a probability distribution on A, ψ is a probability distribution on B; X is a random variable on A with distribution φ, and Y is an r.v. on B with distribution ψ.

$$S_n := \Big\{ v \in \mathbb{R}_+^n : \sum_{i=1}^{n} v_i = 1 \Big\}.$$

So φ ∈ Sn, ψ ∈ Sm. Sn×m denotes the set of n × m ‘stochastic matrices’, i.e., matrices whose rows add up to one:

$$S_{n \times m} = \{ P \in [0,1]^{n \times m} : P e_m = e_n \},$$

where en is the column vector of n ones.

¹ We don’t write A = {a1, . . . , an}, but that is what we mean.


Entropy

Suppose φ ∈ Sn. Then the (Shannon) entropy of φ is

$$H(\phi) = -\sum_{i=1}^{n} \phi_i \log \phi_i.$$

We can also call this H(X) (i.e., we may associate entropy with an r.v. or with its distribution).
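The talk does not fix the base of the logarithm; the sketch below (ours, not from the talk) uses base 2, so entropies come out in bits, with the usual convention 0 log 0 = 0:

```python
import numpy as np

def entropy(phi):
    """Shannon entropy H(phi) in bits; zero-probability terms contribute 0."""
    phi = np.asarray(phi, float)
    nz = phi[phi > 0]
    return float(-(nz * np.log2(nz)).sum())

print(entropy([0.5, 0.5]))   # 1.0
print(entropy([1.0, 0.0]))   # 0.0
```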


Mutual Information

Suppose X, Y are r.v.s on A, B, and let θ denote their joint distribution. Then θ ∈ Snm, with θA = φ, θB = ψ (the marginal distributions).

$$I(X,Y) := H(X) + H(Y) - H(X,Y)$$

is called the mutual information between X and Y.

Alternate formula:

$$I(X,Y) = \sum_{i \in A} \sum_{j \in B} \theta_{ij} \log \frac{\theta_{ij}}{\phi_i \psi_j}.$$

Note that I(X,Y ) = I(Y,X).
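A quick numerical check of the alternate formula and of the symmetry I(X,Y) = I(Y,X), on a made-up 2 × 2 joint distribution (the code is ours; base-2 logs as before):

```python
import numpy as np

def mutual_information(theta):
    """I(X,Y) in bits, computed from an n x m joint distribution theta."""
    theta = np.asarray(theta, float)
    phi = theta.sum(axis=1)            # marginal distribution of X
    psi = theta.sum(axis=0)            # marginal distribution of Y
    prod = np.outer(phi, psi)
    mask = theta > 0
    return float((theta[mask] * np.log2(theta[mask] / prod[mask])).sum())

theta = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
print(mutual_information(theta))       # ≈ 0.1245 bits
assert np.isclose(mutual_information(theta), mutual_information(theta.T))
```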


Conditional Entropy

The quantity

$$H(X|Y) = H(X,Y) - H(Y)$$

is called the conditional entropy of X given Y. Note that H(X|Y) ≠ H(Y|X) in general. In fact

$$H(Y|X) = H(X|Y) + H(Y) - H(X).$$

If θ is the joint distribution of X, Y, then

$$H(X|Y) = H(\theta) - H(\psi), \qquad H(Y|X) = H(\theta) - H(\phi).$$


Conditional Entropy: Alternate Formulation

Define

$$\Theta = [\theta_{ij}] = [\Pr\{X = i \ \&\ Y = j\}] \in [0,1]^{n \times m},$$

and define P = [Diag(φ)]⁻¹ Θ. Clearly

$$p_{ij} = \frac{\theta_{ij}}{\phi_i} = \Pr\{Y = j \mid X = i\}.$$

So P ∈ Sn×m, and

$$H(Y|X) = \sum_{i \in A} \phi_i H(p_i) =: J_\phi(P),$$

where pi is the i-th row of P.
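A small sketch (ours) verifying H(Y|X) = Jφ(P) = H(θ) − H(φ) on a toy joint distribution:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with 0 log 0 = 0; accepts vectors or matrices."""
    p = np.asarray(p, float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

Theta = np.array([[0.3, 0.2],           # joint distribution of (X, Y)
                  [0.1, 0.4]])
phi = Theta.sum(axis=1)                 # distribution of X
P = Theta / phi[:, None]                # P = Diag(phi)^{-1} Theta; row i is p_i
J = sum(phi[i] * entropy(P[i]) for i in range(len(phi)))   # J_phi(P)
assert np.isclose(J, entropy(Theta) - entropy(phi))        # H(Y|X) = H(theta) - H(phi)
print(J)                                                   # ≈ 0.8464 bits
```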


Maximizing Mutual Information (MMI)

Given φ ∈ Sn, ψ ∈ Sm, we look for a ‘joint’ distribution θ ∈ Snm such that θA = φ, θB = ψ, and in addition the entropy H(θ) is minimum, or equivalently the conditional entropy of X given Y is minimum, or again equivalently the mutual information between X and Y is maximum. In other words, define

$$W(\phi,\psi) := \min \{\, H(\theta) : \theta_A = \phi,\ \theta_B = \psi \,\} = \min \{\, H(X,Y) : \theta_A = \phi,\ \theta_B = \psi \,\},$$

$$V(\phi,\psi) := W(\phi,\psi) - H(\phi) = \min \{\, H(Y|X) : \theta_A = \phi,\ \theta_B = \psi \,\}.$$

We try to make Y ‘as deterministic as possible’ given X.


The Variation of Information Metric

The quantity

$$d(\phi,\psi) := V(\phi,\psi) + V(\psi,\phi)$$

is called the variation of information metric. Since

$$V(\psi,\phi) = V(\phi,\psi) + H(\phi) - H(\psi),$$

we need to compute only one of V(ψ,φ), V(φ,ψ).

The quantity d satisfies

d(φ,φ) = 0 and d(φ,ψ) ≥ 0 ∀φ,ψ.

d(φ,ψ) = d(ψ,φ).

The triangle inequality holds:

d(φ, ξ) ≤ d(φ,ψ) + d(ψ, ξ).


Proof of Triangle Inequality

Suppose X, Y, Z are r.v.s on finite sets A, B, C. Then conditional entropy satisfies a ‘one-sided’ triangle inequality:

$$H(X|Y) \le H(X|Z) + H(Z|Y).$$

Proof:

$$H(X|Y) \le H(X,Z|Y) = H(Z|Y) + H(X|Z,Y) \le H(Z|Y) + H(X|Z).$$

So if we define

$$v(X,Y) = H(X|Y) + H(Y|X),$$

then v satisfies the triangle inequality. Note that

$$d(\phi,\psi) = \min v(X,Y),$$

the minimum being taken over joint distributions of (X,Y) such that X, Y have distributions φ, ψ respectively.


A Key Property

Actually d is a pseudometric, not a metric. In other words, d(φ,ψ) can be zero even if φ ≠ ψ.

Theorem: d(φ,ψ) = 0 if and only if n = m and φ, ψ are permutations of each other.

Consequence: The metric d is not convex!

Example: Let n = m = 2,

φ = [0.75 0.25], ψ = [0.25 0.75], ξ = 0.5φ + 0.5ψ = [0.5 0.5].

Then d(φ,φ) = d(φ,ψ) = 0 but d(φ, ξ) > 0.
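This example can be checked exactly in the 2 × 2 case: the joint distributions with marginals φ, ψ form a line segment, and since H is concave, its minimum over the segment is attained at an endpoint. The sketch below (our construction, not from the talk; base-2 logs) exploits this to evaluate d:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def d_2x2(phi, psi):
    """Variation of information d(phi, psi) for distributions on 2-element sets.

    Joints with these marginals are theta(t) = [[t, phi[0]-t],
    [psi[0]-t, 1-phi[0]-psi[0]+t]]; H(theta(t)) is concave in t, so its
    minimum W is at an endpoint, and d = 2W - H(phi) - H(psi).
    """
    lo = max(0.0, phi[0] + psi[0] - 1.0)
    hi = min(phi[0], psi[0])
    W = min(entropy([t, phi[0] - t, psi[0] - t, 1.0 - phi[0] - psi[0] + t])
            for t in (lo, hi))
    return 2 * W - entropy(phi) - entropy(psi)

phi, psi, xi = [0.75, 0.25], [0.25, 0.75], [0.5, 0.5]
print(d_2x2(phi, psi))   # 0.0      (psi is a permutation of phi)
print(d_2x2(phi, xi))    # ≈ 1.1887 (> 0, so d is not convex)
```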


Computing the Metric Distance

Change the variable of optimization from θ, the joint distribution, to P, the matrix of conditional probabilities:

$$V(\phi,\psi) = \min_{\theta \in S_{nm}} \{\, H(\theta) - H(\phi) : \theta_A = \phi,\ \theta_B = \psi \,\} = \min_{P \in S_{n \times m}} \{\, J_\phi(P) : \phi P = \psi \,\},$$

where, as before, $J_\phi(P) = \sum_{i \in A} \phi_i H(p_i)$.

Since Jφ(·) is a strictly concave function, and the feasible region is polyhedral (the convex hull of a finite number of extreme points), the minimum occurs at one of these extreme points.

Also, a ‘principle of optimality’ allows us to break large problems down into smaller ones.


The m = 2 Case

Partition the set {1, . . . , n} into two sets I1, I2 such that

$$\Big| \psi_1 - \sum_{i \in I_1} \phi_i \Big| = \Big| \psi_2 - \sum_{i \in I_2} \phi_i \Big|$$

is minimized.

Interpretation: Given two bins with capacities ψ1, ψ2, assign φ1, . . . , φn to the two bins so that the overflow in one bin (and the under-utilization in the other) is minimized.

Theorem: If both bins are filled exactly, then V (φ,ψ) = 0.
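A brute-force sketch of this two-bin assignment (ours; it enumerates all 2ⁿ subsets, so it is for small examples only):

```python
from itertools import combinations

def best_two_bin_split(phi, psi1):
    """Exhaustively find I1 minimizing |psi1 - sum of phi_i over I1|."""
    n = len(phi)
    gap, I1 = min((abs(psi1 - sum(phi[i] for i in I1)), I1)
                  for r in range(n + 1)
                  for I1 in combinations(range(n), r))
    return gap, set(I1), set(range(n)) - set(I1)

phi = [0.4, 0.3, 0.2, 0.1]
gap, I1, I2 = best_two_bin_split(phi, psi1=0.5)
print(gap, I1, I2)   # gap 0.0: both bins filled exactly, so V(phi, psi) = 0
```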


The m = 2 Case (Continued)

Theorem: If the two bins cannot both be filled exactly, and if the smallest element of φ (call it φ0) belongs to the overstuffed bin, then

$$V(\phi,\psi) = \phi_0\, H([\, u/\phi_0 \ \ (\phi_0 - u)/\phi_0 \,]),$$

where u is the unutilized capacity (or overflow).

If φ0 belongs to the underutilized bin, the above is an upper bound for V(φ,ψ) but may not equal it.

Bad news: Computing the optimal partitioning is NP-hard!

So we need an approximation procedure to upper bound d(φ,ψ). And there is no special reason to restrict to m = 2.


Best Fit Algorithm for Bin-Packing

Think of ψ1, . . . , ψm as capacities of m bins.

Arrange ψj in descending order.

For i = 1, . . . , n, place each φi into the bin with maximum unutilized capacity.

If a bin overflows, don’t fill it anymore.

Complexity is O(n log m).

Provably suboptimal: the total bin size used is ≤ 1.25 times the optimal bin size; this is the best-known bound.

The corresponding bound on the overflow is not as good: it is ≤ 0.25 + 1.25 × the optimal value.
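A minimal sketch of the heuristic as just described (the code and names are ours; the talk does not specify tie-breaking, so ties go to the lowest-index bin, which may differ from the figure on the next slide):

```python
def best_fit(phi, psi):
    """Best-fit heuristic: pack items phi into bins with capacities psi.

    Each item goes into the bin with the largest unutilized capacity;
    once a bin has overflowed, it receives nothing further.
    Returns the item indices in each bin and the leftover capacities.
    """
    remaining = list(psi)
    bins = [[] for _ in psi]
    for i, item in enumerate(phi):
        open_bins = [j for j in range(len(psi)) if remaining[j] >= 0]
        j = max(open_bins, key=lambda k: remaining[k])
        bins[j].append(i)
        remaining[j] -= item
    return bins, remaining

# Data from the illustration on the next slide:
psi = [0.45, 0.30, 0.25]
phi = [0.14, 0.13, 0.12, 0.09, 0.08, 0.07, 0.06, 0.06,
       0.05, 0.05, 0.04, 0.04, 0.04, 0.03]
bins, leftover = best_fit(phi, psi)
print(bins)
```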


Illustration of Best Fit Algorithm

Suppose ψ = [0.45 0.30 0.25], and φ ∈ S14 is given by

φ = 10⁻² · [14 13 12 9 8 7 6 6 5 5 4 4 4 3]

[Figure: best-fit assignment of the φi to the three bins.
Bin 1 (capacity 0.45): φ1, φ2, φ6, φ8, φ11 (total 0.44).
Bin 2 (capacity 0.30): φ3, φ5, φ9, φ12, φ14 (total 0.32).
Bin 3 (capacity 0.25): φ4, φ7, φ10, φ13 (total 0.24).]


A Greedy Algorithm for Bounding d(φ,ψ)

Think of ψ1, . . . , ψm as m bins to be packed. Modify the ‘best fit’ algorithm: sort ψ in decreasing order, and put each φi into the bin with the largest unutilized capacity. If φi does not fit into any bin, put it aside (the departure from the best fit algorithm).

When all of φ has been processed, say k entries have been put aside; then k < m. Now solve a k-by-m bin-packing problem, and repeat. See the full paper for details.

This results in a sequence of bin-packing problems of decreasing size. The outcome is an upper bound for d(φ,ψ). Complexity is O((n + m²) log m).


Problem Formulation

Problem: Given φ ∈ Sn and m < n, find ψ ∈ Sm that minimizes d(φ,ψ).

Definition: ψ ∈ Sm is said to be an aggregation of φ ∈ Sn if there exists a partition I1, . . . , Im of {1, . . . , n} such that

$$\psi_j = \sum_{i \in I_j} \phi_i.$$
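In code, an aggregation is just a partition-wise sum (a trivial sketch, ours; indices are 0-based):

```python
import numpy as np

def aggregate(phi, partition):
    """psi_j = sum of phi_i over the j-th cell I_j of the partition."""
    return np.array([sum(phi[i] for i in cell) for cell in partition])

phi = [0.4, 0.3, 0.2, 0.1]
print(aggregate(phi, [{0, 3}, {1, 2}]))   # [0.5 0.5]
```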


Results

Theorem: Any optimal approximation of φ in the variation of information metric d must be an aggregation of φ.

Theorem: An aggregation φ(a) of φ is an optimal approximation of φ in the variation of information metric d if and only if it has maximum entropy (amongst all aggregations).

Note: um, the uniform distribution with m elements, has maximum entropy in Sm. So should we try to minimize ρ(φ(a), um)? This is yet another bin-packing problem, with all bin capacities equal to 1/m. But how valid is this approach?


NP-Hardness of Optimal Order Reduction

Suppose m = 2 and let φ(a) be an aggregation of φ ∈ Sn. Then φ(a) has maximum entropy if and only if the total variation distance ρ(φ(a), u2) is minimized.

Therefore:

When m = 2, minimizing ρ(φ(a), u2) gives an optimal reduced order approximation to φ.

For m ≥ 3, minimizing ρ(φ(a), um) gives a suboptimal reduced order approximation to φ.


Reformulation of Problem

Problem: Given φ ∈ Sn and m < n, find an aggregation φ(a) ∈ Sm with maximum entropy.

Reformulation: Since the uniform distribution um has maximum entropy in Sm, find an aggregation φ(a) such that the total variation distance ρ(φ(a), um) is minimized.

More general problem (not any harder): Given φ ∈ Sn, m < n and ξ ∈ Sm, find an aggregation φ(a) that minimizes the total variation distance ρ(φ(a), ξ).

Use the best-fit algorithm to find an aggregation of φ that is close to the uniform distribution. Complexity is O(n log m).
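Putting the pieces together, a sketch (ours) that aggregates φ toward ξ = um with the best-fit heuristic; for m = 2, an exhaustive search over partitions would instead give the exact optimum, per the previous slide:

```python
def best_fit_aggregation(phi, xi):
    """Aggregate phi by best-fit packing into bins with capacities xi;
    psi[j] collects the phi_i assigned to bin j, so psi is an aggregation of phi."""
    remaining = list(xi)
    psi = [0.0] * len(xi)
    for item in phi:
        open_bins = [j for j in range(len(xi)) if remaining[j] >= 0]
        j = max(open_bins, key=lambda k: remaining[k])
        psi[j] += item
        remaining[j] -= item
    return psi

phi = [0.30, 0.25, 0.20, 0.15, 0.10]
m = 2
psi = best_fit_aggregation(phi, [1.0 / m] * m)
print([round(p, 2) for p in psi])   # [0.55, 0.45]; the exact optimum here is [0.5, 0.5]
```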


Bound on Performance of Best Fit Algorithm

Theorem: Let φ(a) denote the aggregation of φ produced by the best fit algorithm. Then

$$\rho(\phi^{(a)}, \xi) \le 0.25\, m\, \phi_{\max},$$

where φmax is the largest component of φ and ρ is the total variation metric.

This can be turned into a (messy) bound on the entropy of φ(a).


Achievements

Definition of a proper metric between probability distributions defined on sets of different cardinalities, by maximizing the mutual information between the two distributions.

Study of the properties of the distance and of the problem of computing it; showing the close relationship to the bin-packing problem with overstuffing.

Since bin-packing is NP-hard, adapting the best-fit algorithm to generate upper bounds for the distance in polynomial time.

Characterization of the solution of the optimal order reduction problem in terms of aggregating to maximize entropy.

Formulation as a bin-packing problem with overstuffing.

Upper bound on performance of algorithm.


Next Steps

Extension of the metric to Markov processes, hidden Markov processes, and arbitrary (but stationary ergodic) stochastic processes.

Alternatives to best-fit heuristic algorithm.

Thank You!
