
Bottleneck Problems: Information and Estimation-Theoretic View · 2020.10.16



1

Bottleneck Problems: Information andEstimation-Theoretic View

Shahab Asoodeh and Flavio P. Calmon

Abstract

Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber's Lemma), strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding, and dependence dilution. Leveraging these connections, we prove a new cardinality bound on the auxiliary variable in IB, making its computation more tractable for discrete random variables.

In the second part, we introduce a general family of optimization problems, termed bottleneck problems, by replacing mutual information in IB and PF with other notions of mutual information, namely f-information and Arimoto's mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. Although the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelopes of certain functions. By applying this technique to the binary case, we derive closed-form expressions for several bottleneck problems.

CONTENTS

I Introduction
  I-A Related Work
  I-B Notation

II Information Bottleneck and Privacy Funnel: Definitions and Functional Properties
  II-A Gaussian IB and PF
  II-B Evaluation of IB and PF
  II-C Operational Meaning of IB and PF
    II-C1 Noisy Source Coding
    II-C2 Test Against Independence with Communication Constraint
    II-C3 Dependence Dilution
  II-D Cardinality Bound
  II-E Deterministic Information Bottleneck

III Family of Bottleneck Problems
  III-A Evaluation of U_{Φ,Ψ} and L_{Φ,Ψ}
  III-B Guessing Bottleneck Problems
  III-C Arimoto Bottleneck Problems
  III-D f-Bottleneck Problems

IV Summary and Concluding Remarks

References

Appendix A: Proofs from Section II

Appendix B: Proofs from Section III

This work was supported in part by NSF under grants CIF 1922971, 1815361, 1742836, 1900750, and CIF CAREER 1845852. S. Asoodeh and F. P. Calmon are with the School of Engineering and Applied Science, Harvard University (e-mails: {shahab, flavio}@seas.harvard.edu). Part of the results in this paper was presented at the International Symposium on Information Theory 2018 [1].


I. INTRODUCTION

Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck (IB) method [2]. Suppose Y is a target variable and X is an observable correlated variable with joint distribution PXY . The goal of IB is to learn a "compact" summary (aka bottleneck) T of X that is maximally "informative" for inferring Y . The bottleneck variable T is assumed to be generated from X by applying a random function F to X, i.e., T = F(X), in such a way that it is conditionally independent of Y given X, which we denote by

Y (−− X (−− T. (1)

The IB quantifies this goal by measuring the "compactness" of T using the mutual information I(X;T ) and, similarly, its "informativeness" by I(Y ;T ). For a given level of compactness R ≥ 0, IB extracts the bottleneck variable T that solves the constrained optimization problem

IB(R) := sup I(Y ;T ) subject to I(X;T ) ≤ R, (2)

where the supremum is taken over all randomized functions T = F(X) satisfying Y (−− X (−− T .

The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970s (see [3]–[6]) as a technique to prove impossibility results in information theory and also to study the common information between X and Y . Wyner and Ziv [3] explicitly determined the value of IB(R) for the special case of binary X and Y, a result widely known as Mrs. Gerber's Lemma [3], [7]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [2] and re-formulated in a data-analytic context. Here, the random variable X represents a high-dimensional observation with a corresponding low-dimensional feature Y . IB aims at specifying a compressed description of X that is maximally informative about the feature Y . This framework has led to several applications in clustering [8]–[10] and quantization [11], [12].

A closely related framework to IB is the privacy funnel (PF) problem [13]–[15]. In the PF framework, a bottleneck variable T is sought that maximally preserves the "information" contained in X while revealing as little about Y as possible. This framework aims to capture the inherent trade-off between revealing X perfectly and leaking a sensitive attribute Y . For instance, suppose a user wishes to share an image X for some classification tasks. The image might carry information about attributes, say Y, that the user might consider sensitive, even when such information is of limited use for the tasks, e.g., location or emotion. The PF framework seeks to extract a representation of X from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to Y . Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as

PF(r) := inf I(Y ;T ) subject to I(X;T ) ≥ r, (3)

where the infimum is taken over all randomized functions T = F(X) and r is the parameter specifying the level of informativeness. It is evident from the formulations (2) and (3) that IB and PF are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to the design of greedy algorithms [13], [16] for estimating PF based on the agglomerative information bottleneck algorithm [10]. A similar formulation has recently been proposed in [17] as a tool to train a neural network for learning a private representation of the data X . Solving the IB and PF optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both IB and PF (see Related Work).
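As a concrete illustration of the feasible region shared by (2) and (3), the sketch below (a toy joint distribution and random channels of our own choosing, not taken from the paper) samples channels P_{T|X} and checks that every resulting point obeys I(Y;T) ≤ I(X;T), the data processing inequality that both optimizations implicitly respect:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(P):
    """I(A;B) in bits for a joint pmf given as a matrix P[a, b]."""
    Pa = P.sum(axis=1, keepdims=True)
    Pb = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / (Pa * Pb)[mask])).sum())

# Toy joint pmf P_XY (rows: x, columns: y) -- chosen for illustration only.
P_XY = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
P_X = P_XY.sum(axis=1)

# Random channels P_{T|X}; generating T from X alone enforces Y (-- X (-- T.
for _ in range(200):
    P_T_given_X = rng.dirichlet(np.ones(4), size=3)  # |T| = 4, one row per x
    P_XT = P_X[:, None] * P_T_given_X                # joint pmf of (X, T)
    P_YT = P_XY.T @ P_T_given_X                      # joint pmf of (Y, T)
    # Every feasible point of (2)/(3) satisfies I(Y;T) <= I(X;T).
    assert mutual_info(P_YT) <= mutual_info(P_XT) + 1e-9
```

Each sampled channel gives one point (I(X;T), I(Y;T)); sweeping over many channels traces out the convex set whose boundaries are IB and PF.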

In this paper, we first give a cohesive overview of the existing results surrounding the IB and the PF formulations. We then provide a comprehensive analysis of IB and PF from an information-theoretic perspective, as well as a survey of several formulations connected to IB and PF that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source coding [18], testing against independence [19], and dependence dilution [20]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in IB, leading to a more tractable optimization problem for IB. We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto's mutual information [21] and f-information [22]. By replacing I(Y ;T ) and/or I(X;T ) with either of these measures, we generate a family of optimization problems that we refer to as the bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by IB and PF. More specifically, our main contributions are listed next.

• Computing IB and PF is notoriously challenging when X takes values in a set with infinite cardinality (e.g., X is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that X is a Gaussian perturbation of Y, i.e., X = Y + NG where NG is a noise variable sampled from a Gaussian distribution independent of Y . Building upon the recent advances in the entropy power inequality in [23], we derive a sharp upper bound for IB(R). As a special case, we consider jointly Gaussian (X,Y ) for which the upper bound becomes tight. This provides a significantly simpler proof than the original one given in [24] for the fact that, in this special case, the optimal bottleneck variable T is also Gaussian. In the second scenario, we assume that Y is a Gaussian perturbation of X, i.e., Y = X + NG. This corresponds to a practical setup where the feature Y can only be obtained from a noisy observation of X . Relying on recent results on the strong data processing inequality [25], we obtain an upper bound on IB(R) which is tight for small values of R. In the last scenario, we compute a second-order approximation of PF(r) under the assumption that T is obtained by Gaussian perturbation of X, i.e., T = X + NG. Interestingly, the rate of increase of PF(r) for small values of r is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [26].

• We extend the approach of Witsenhausen and Wyner [4] for analytically computing IB and PF. This technique converts solving the optimization problems in IB and PF into determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary X and Y and derive a closed-form expression for PF(r); we call this result Mr. Gerber's Lemma.

• Relying on the connection between IB and noisy source coding [18] in the information theory literature, we show that the optimal bottleneck variable T in optimization problem (2) takes values in a set T with cardinality |T | ≤ |X |. Compared to the best previously known cardinality bound (i.e., |T | ≤ |X | + 1), this result leads to a dimensionality reduction of the search space of optimization problem (2) from R^{|X |^2} to R^{|X |(|X |−1)}. Moreover, we show that this does not hold for PF, indicating a fundamental difference between the optimization problems (2) and (3).

• Following [15], [27], we study the deterministic IB and PF (denoted by dIB and dPF) in which T is assumed to be a deterministic function of X, i.e., T = f(X) for some function f. By connecting dIB and dPF with entropy-constrained scalar quantization problems in information theory [28], we obtain bounds on them explicitly in terms of |X |. Applying these bounds to IB, we obtain that IB(R)/I(X;Y ) is bounded from above by one and from below by min{R/H(X), (e^R − 1)/|X |}.

• By replacing I(Y ;T ) and/or I(X;T ) in (2) and (3) with Arimoto's mutual information or f-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by IB and PF. The main reason is three-fold: First, as illustrated in Section II-C, mutual information in IB and PF is mainly justified when n ≫ 1 independent samples (X1, Y1), . . . , (Xn, Yn) of PXY are considered. However, Arimoto's mutual information allows for an operational interpretation even in the single-shot regime (i.e., for n = 1). Second, I(Y ;T ) in IB and PF is meant to be a proxy for the efficiency of reconstructing Y given the observation T . However, this can be accurately formalized by the probability of correctly guessing Y given T (i.e., the Bayes risk) or the minimum mean-squared error (MMSE) in estimating Y given T . While I(Y ;T ) bounds these two measures, we show that they are precisely characterized by Arimoto's mutual information and f-information, respectively. Finally, when PXY is unknown, mutual information is notoriously difficult to estimate. Nevertheless, Arimoto's mutual information and f-information are easier to estimate: while mutual information can be estimated with an estimation error that scales as O(log n/√n) [29], Diaz et al. [30] showed that the estimation error for Arimoto's mutual information and f-information is O(1/√n).

We also generalize our computation technique to analytically compute these bottleneck problems. As before, this technique converts computing the bottleneck problems into determining the convex and concave envelopes of certain functions. Focusing on binary X and Y, we derive closed-form expressions for some of the bottleneck problems.

A. Related Work

The IB formulation has been extensively applied in clustering [9], [31]–[33]. Clustering based on IB results in algorithms that cluster data points in terms of the similarity of PY |X . When data points lie in a metric space, geometric clustering is usually preferred, where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [27], [34] proposed the deterministic IB (denoted by dIB) by enforcing that PT |X is a deterministic mapping: dIB(R) denotes the supremum of I(Y ; f(X)) over all functions f : X → T satisfying H(f(X)) ≤ R. This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function f : X → [M] := {1, . . . ,M} with a pre-determined output alphabet such that f optimizes some objective function. This objective might be maximizing or minimizing H(f(X)) [35] or maximizing I(Y ; f(X)) for a random variable Y correlated with X [28], [36]–[38]. Since H(f(X)) ≤ log M for f : X → [M], the latter problem provides lower bounds for dIB (and thus for IB). In particular, one can exploit [39, Theorem 1] to obtain I(X;Y ) − dIB(R) ≤ O(e^{−2R/(|Y|−1)}) provided that min{|X |, 2^R} > 2|Y|. This result establishes a linear gap between dIB and I(X;Y ) irrespective of |X |.

The connection between quantization and dIB further allows us to obtain multiplicative bounds. For instance, if Y ∼ Bernoulli(1/2) and X = Y + NG, where NG ∼ N(0, 1) is independent of Y, then it is well known in the information theory literature that I(Y ; f(X)) ≥ (2/π) I(X;Y ) for the optimal one-bit quantizer f : X → {0, 1} (see, e.g., [40, Section 2.11]); thus dIB(R) ≥ (2/π) I(X;Y ) for R ≤ 1. We further explore this connection to provide multiplicative bounds on dIB(R) in Section II-E.

The study of IB has recently gained increasing traction in the context of deep learning. By taking T to be the activity of the hidden layer(s), Tishby and Zaslavsky [41] (see also [42]) argued that neural network classifiers trained with the cross-entropy loss and stochastic gradient descent (SGD) inherently aim at solving the IB optimization problem. In fact, it is claimed that the graph of the function R 7→ IB(R) (the so-called information plane) characterizes the learning dynamics of different layers in the network: shallow layers correspond to maximizing I(Y ;T ) while deep layers' objective is minimizing I(X;T ). While the generality of this claim was refuted empirically in [43] and theoretically in [44], [45], it inspired significant follow-up studies.
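The 2/π quantization bound mentioned in this section can be probed numerically for the one-bit threshold rule f(x) = 1{x > 1/2} (a sketch with a numerical setup of our own; scipy is assumed to be available):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Y ~ Bernoulli(1/2), X = Y + N with N ~ N(0,1): X has a two-component
# Gaussian mixture density.
f = lambda x: 0.5 * norm.pdf(x) + 0.5 * norm.pdf(x - 1.0)
h_X = quad(lambda x: -f(x) * np.log2(f(x)), -12.0, 13.0)[0]  # diff. entropy of X
h_N = 0.5 * np.log2(2 * np.pi * np.e)                        # diff. entropy of N
I_XY = h_X - h_N                                             # I(X;Y) in bits

eps = norm.cdf(-0.5)   # error probability of the threshold rule 1{x > 1/2}
I_Yf = 1.0 - hb(eps)   # I(Y; f(X)): a BSC(eps) with uniform input
assert I_Yf >= (2.0 / np.pi) * I_XY   # the multiplicative 2/pi bound holds
```

Here the threshold quantizer is a natural (maximum-likelihood) choice for this symmetric setup; the assertion confirms that even a single output bit retains at least a 2/π fraction of I(X;Y).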


These include (i) modifying neural network training in order to solve the IB optimization problem [46]–[50]; (ii) creating connections between IB and generalization error [51], robustness [46], and detection of out-of-distribution data [52]; and (iii) using IB to understand specific characteristics of neural networks [50], [53]–[55].

In both IB and PF, mutual information poses some limitations. For instance, it may become infinite in deterministic neural networks [43]–[45] and may not lead to proper privacy guarantees [56]. As suggested in [50], [57], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantees have been proposed, including Rényi maximal correlation [20], [58], [59], probability of correctly recovering [60], [61], minimum mean-squared estimation error (MMSE) [62], [63], χ2-information [64] (a special case of f-information to be described in Section III), Arimoto's and Sibson's mutual information [56], [65] (to be discussed in Section III), maximal leakage [66], and local differential privacy [67]. All these measures ensure interpretable privacy guarantees. For instance, it is shown1 in [62], [63] that if the χ2-information between Y and T is sufficiently small, then no function of Y can be efficiently reconstructed given T, thus providing an interpretable privacy guarantee.

Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from n samples with an estimation error that scales as O(log n/√n) [29]. However, as shown by Diaz et al. [30], the estimation error for most of the above measures scales as O(1/√n). Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [69]–[71], present some fundamental limitations [72]: the variance of the estimator might grow exponentially with the ground-truth mutual information, and the estimator might not satisfy basic properties of mutual information such as the data processing inequality or additivity. McAllester and Stratos [73] showed that some of these limitations are inherent to a large family of mutual-information estimators.

B. Notation

We use capital letters, e.g., X, for random variables and calligraphic letters for their alphabets, e.g., X . If X is distributed according to a probability mass function (pmf) PX , we write X ∼ PX . Given two random variables X and Y, we write PXY and PY |X for the joint distribution and the conditional distribution of Y given X . We also interchangeably refer to PY |X as a channel from X to Y . We use H(X) to denote both the entropy and the differential entropy of X, i.e., we have

H(X) = −∑_{x∈X} PX(x) log PX(x)

if X is a discrete random variable taking values in X with pmf PX, and

H(X) = −∫ fX(x) log fX(x) dx

if X is an absolutely continuous random variable with probability density function (pdf) fX . If X is a binary random variable with PX(1) = p, we write X ∼ Bernoulli(p). In this case, its entropy is called the binary entropy function and is denoted by hb(p) := −p log p − (1 − p) log(1 − p). We use the superscript G to denote a standard Gaussian random variable, i.e., NG ∼ N (0, 1). Given two random variables X and Y, their (Shannon) mutual information is denoted by I(X;Y ) := H(Y ) − H(Y |X). We let P(X ) denote the set of all probability distributions on the set X . Given an arbitrary QX ∈ P(X ) and a channel PY |X, we let QX ⊗ PY |X denote the resulting output distribution on Y . For any a ∈ [0, 1], we use ā to denote 1 − a, and for any integer k ∈ N, [k] := {1, 2, . . . , k}.

Throughout the paper, we assume a pair of (discrete or continuous) random variables (X,Y ) ∼ PXY is given with a fixed joint distribution PXY , marginals PX and PY , and conditional distribution PY |X . We then use QX ∈ P(X ) to denote an arbitrary distribution with QY = QX ⊗ PY |X ∈ P(Y).
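A minimal sketch of the entropy definitions above in code (function names are ours; we fix base-2 logarithms, while the paper leaves the log base generic):

```python
import numpy as np

def H(p):
    """Shannon entropy (in bits) of a pmf given as a sequence of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 = 0 convention
    return float(-(p * np.log2(p)).sum())

def hb(p):
    """Binary entropy function h_b(p) = -p log p - (1-p) log(1-p)."""
    return H([p, 1.0 - p])

assert abs(hb(0.5) - 1.0) < 1e-12            # a fair coin carries one bit
assert hb(0.0) == 0.0 and hb(1.0) == 0.0     # deterministic outcomes carry none
```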

II. INFORMATION BOTTLENECK AND PRIVACY FUNNEL: DEFINITIONS AND FUNCTIONAL PROPERTIES

In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex-analytic approach which enables us to compute closed-form expressions for both functionals in some simple cases.

To precisely quantify the trade-off between these two conflicting goals, the IB optimization problem (2) was proposed [2]. Since any randomized function T = F(X) can be equivalently characterized by a conditional distribution, (2) can instead be expressed as

IB(PXY , R) := sup_{PT|X : Y(−−X(−−T, I(X;T) ≤ R} I(Y ;T ),   or   ĨB(PXY , R̃) := inf_{PT|X : Y(−−X(−−T, I(Y;T) ≥ R̃} I(X;T ). (4)

1The original results in [62], [63] involve Rényi maximal correlation instead of χ2-information. However, it can be shown that χ2-information is equal to the sum of the squares of the singular values of f(Y ) 7→ E[f(Y )|T ] minus one (the largest singular value), while Rényi maximal correlation is equal to the second largest singular value [68]. Thus, χ2-information upper bounds Rényi maximal correlation.


(a) |X | = |Y| = 2 (b) |X | = 3, |Y| = 2 (c) |X | = 5, |Y| = 2

Fig. 1. Examples of the set M, defined in (6). The upper and lower boundaries of this set correspond to IB and PF, respectively. It is worth noting that, while IB(R) = 0 only at R = 0, PF(r) = 0 can hold for r belonging to a non-trivial interval (here, only for |X | > 2). Also, note that in general neither the upper nor the lower boundary is smooth. A sufficient condition for smoothness is PY |X(y|x) > 0 (see Theorem 1); thus both IB and PF are smooth in the binary case.

where R and R̃ denote the desired levels of compression and informativeness, respectively. We use IB(R) and ĨB(R̃) to denote IB(PXY , R) and ĨB(PXY , R̃), respectively, when the joint distribution is clear from the context. Notice that if IB(PXY , R) = R̃, then ĨB(PXY , R̃) = R.

Now consider the setup where data X is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by Y . This goal was formulated by PF in (3). As before, replacing the randomized function T = F(X) with a conditional distribution PT |X , we can equivalently express (3) as

PF(PXY , r) := inf_{PT|X : Y(−−X(−−T, I(X;T) ≥ r} I(Y ;T ),   or   P̃F(PXY , r̃) := sup_{PT|X : Y(−−X(−−T, I(Y;T) ≤ r̃} I(X;T ). (5)

where r and r̃ denote the desired levels of informativeness and privacy, respectively. The case r̃ = 0 is particularly interesting in practice and specifies perfect privacy; see, e.g., [14], [74]. As before, we write PF(r) and P̃F(r̃) for PF(PXY , r) and P̃F(PXY , r̃) when PXY is clear from the context.

The following properties of IB and PF follow directly from their definitions.

Theorem 1. For a given PXY , the mappings IB(R) and PF(r) have the following properties:
• IB(0) = PF(0) = 0.
• IB(R) = I(X;Y ) for any R ≥ H(X) and PF(r) = I(X;Y ) for r ≥ H(X).
• 0 ≤ IB(R) ≤ min{R, I(X;Y )} for any R ≥ 0, and PF(r) ≥ max{r − H(X|Y ), 0} for any r ≥ 0.
• R 7→ IB(R) is continuous, strictly increasing, and concave on the range (0, I(X;Y )).
• r 7→ PF(r) is continuous, strictly increasing, and convex on the range (0, I(X;Y )).
• If PY |X(y|x) > 0 for all x ∈ X and y ∈ Y, then both R 7→ IB(R) and r 7→ PF(r) are continuously differentiable over (0, H(X)).
• R 7→ IB(R)/R is non-increasing and r 7→ PF(r)/r is non-decreasing.
• We have

IB(R) := sup_{PT|X : Y(−−X(−−T, I(X;T) = R} I(Y ;T ), and PF(r) := inf_{PT|X : Y(−−X(−−T, I(X;T) = r} I(Y ;T ).

According to this theorem, we can always restrict both R and r in (4) and (5), respectively, to [0, H(X)], as IB(R) = PF(r) = I(X;Y ) for all r, R ≥ H(X).

Define M = M(PXY ) ⊂ R^2 as

M := {(I(X;T ), I(Y ;T )) : Y (−− X (−− T, (X,Y ) ∼ PXY }. (6)

It can be directly verified that M is convex. According to Theorem 1, R 7→ IB(R) and r 7→ PF(r) correspond to the upper and lower boundaries of M, respectively. The convexity of M then implies the concavity of IB and the convexity of PF. Fig. 1 illustrates the set M for the simple case of binary X and Y .

While both IB(0) = 0 and PF(0) = 0, the behaviors of these functionals in a neighborhood of zero might be completely different. As illustrated in Fig. 1, IB(R) > 0 for all R > 0, whereas PF(r) = 0 for r ∈ [0, r0] for some r0 > 0. When such r0 > 0 exists, we say perfect privacy occurs: there exists a variable T satisfying Y (−− X (−− T such that I(Y ;T ) = 0 while I(X;T ) > 0, making T a representation of X with perfect privacy (i.e., no information leakage about Y ). A necessary and sufficient condition for the existence of such T is given in [20, Lemma 10] and [14, Theorem 3], described next.

Page 6: 1 Bottleneck Problems: Information and Estimation-Theoretic View · 2020. 10. 16. · technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms

6

Theorem 2 (Perfect privacy). Let (X,Y ) ∼ PXY be given and let A ⊂ [0, 1]^{|Y|} be the set of vectors {PY |X(·|x), x ∈ X}. Then there exists r0 > 0 such that PF(r) = 0 for r ∈ [0, r0] if and only if the vectors in A are linearly dependent.

In light of this theorem, we obtain that perfect privacy occurs whenever |X | > |Y|. It also follows from the theorem that, for binary X, perfect privacy cannot occur (see Fig. 1a).
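The linear-dependence condition of Theorem 2 can be checked mechanically: the vectors {PY|X(·|x)} are linearly dependent exactly when the matrix of conditionals has rank less than |X|. A sketch (the helper name and the toy distributions are ours):

```python
import numpy as np

def perfect_privacy_possible(P_XY):
    """Theorem 2: perfect privacy occurs iff the conditional vectors
    P_{Y|X}(.|x) are linearly dependent, i.e. the matrix of conditionals
    has rank strictly less than |X|."""
    P_Y_given_X = P_XY / P_XY.sum(axis=1, keepdims=True)
    return np.linalg.matrix_rank(P_Y_given_X) < P_Y_given_X.shape[0]

# |X| = 3 > |Y| = 2: three vectors in a 2-dimensional space must be
# linearly dependent, so perfect privacy necessarily occurs.
P1 = np.array([[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]])
assert perfect_privacy_possible(P1)

# Binary X with distinct conditionals: the two rows are linearly
# independent, so perfect privacy cannot occur (cf. Fig. 1a).
P2 = np.array([[0.40, 0.10], [0.10, 0.40]])
assert not perfect_privacy_possible(P2)
```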

Theorem 1 enables us to derive simple bounds for IB and PF. Specifically, the facts that PF(r)/r is non-decreasing and IB(R)/R is non-increasing immediately result in the following linear bounds.

Theorem 3 (Linear lower bound). For r, R ∈ (0, H(X)), we have

inf_{QX∈P(X), QX≠PX} DKL(QY ‖PY ) / DKL(QX‖PX) ≤ PF(r)/r ≤ I(X;Y )/H(X) ≤ IB(R)/R ≤ sup_{QX∈P(X), QX≠PX} DKL(QY ‖PY ) / DKL(QX‖PX) ≤ 1. (7)

In light of this theorem, if PF(r) = r, then I(X;Y ) = H(X), implying X = g(Y ) for some deterministic function g. Conversely, if X = g(Y ) then PF(r) = r, because for all T forming the Markov relation Y (−− g(Y ) (−− T we have I(Y ;T ) = I(g(Y );T ). On the other hand, IB(R) = R if and only if there exists a variable T* satisfying I(X;T*) = I(Y ;T*), and thus the following double Markov relations:

Y (−− X (−− T*, and X (−− Y (−− T*.

It can be verified (see [75, Problem 16.25]) that this double Markov condition is equivalent to the existence of a pair of functions f and g such that f(X) = g(Y ) and (X,Y ) (−− f(X) (−− T*. One special case of this setting, namely where g is the identity function, has recently been studied in detail in [48] and will be reviewed in Section II-E. Theorem 3 also enables us to characterize the "worst" joint distributions PXY with respect to IB and PF. As demonstrated in the following lemma, if PY |X is an erasure channel then PF(r)/r = IB(R)/R = I(X;Y )/H(X).

Lemma 1.
• Let PXY be such that Y = X ∪ {⊥}, PY |X(x|x) = 1 − δ, and PY |X(⊥|x) = δ for some δ > 0. Then

PF(r)/r = IB(R)/R = 1 − δ.

• Let PXY be such that X = Y ∪ {⊥}, PX|Y (y|y) = 1 − δ, and PX|Y (⊥|y) = δ for some δ > 0. Then

PF(r) = max{r − H(X|Y ), 0}.
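The erasure-channel claim in the first bullet rests on the identity I(X;Y ) = (1 − δ)H(X), which a short numerical sketch (with a toy PX of our own choosing) confirms:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

delta = 0.3
P_X = np.array([0.2, 0.5, 0.3])
# Erasure channel: Y = X w.p. 1-delta, Y = "erasure" w.p. delta.
# Joint pmf: 3 input symbols, 4 output symbols (last column = erasure).
P_XY = np.hstack([np.diag(P_X) * (1 - delta), (delta * P_X)[:, None]])
P_Y = P_XY.sum(axis=0)

H_X = entropy(P_X)
H_Y_given_X = sum(P_X[i] * entropy(P_XY[i] / P_X[i]) for i in range(3))
I_XY = entropy(P_Y) - H_Y_given_X          # I(X;Y) = H(Y) - H(Y|X)
assert abs(I_XY / H_X - (1 - delta)) < 1e-9  # ratio equals 1 - delta
```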

The bounds in Theorem 3 hold for all r and R in the interval [0, H(X)]. We can, however, improve them when r and R are sufficiently small. Let PF′(0) and IB′(0) denote the slopes of PF(·) and IB(·) at zero, i.e., PF′(0) := lim_{r→0+} PF(r)/r and IB′(0) := lim_{R→0+} IB(R)/R.

Theorem 4. Given (X,Y ) ∼ PXY , we have

inf_{QX∈P(X), QX≠PX} DKL(QY ‖PY ) / DKL(QX‖PX) = PF′(0) ≤ min_{x∈X : PX(x)>0} DKL(PY |X(·|x)‖PY ) / (− log PX(x)) ≤ max_{x∈X : PX(x)>0} DKL(PY |X(·|x)‖PY ) / (− log PX(x)) ≤ IB′(0) = sup_{QX∈P(X), QX≠PX} DKL(QY ‖PY ) / DKL(QX‖PX).

This theorem provides the exact values of PF′(0) and IB′(0) as well as simple bounds for them. Although the exact expressions for PF′(0) and IB′(0) are usually difficult to compute, a simple plug-in estimator was proposed in [76] for IB′(0); this estimator can be readily adapted to estimate PF′(0). Theorem 4 also reveals a profound connection between IB and the strong data processing inequality (SDPI) [77]. More precisely, thanks to the pioneering work of Anantharam et al. [78], it is known that the supremum of DKL(QY‖PY)/DKL(QX‖PX) over all QX ≠ PX is equal to the supremum of I(Y;T)/I(X;T) over all PT|X satisfying Y (−− X (−− T, and hence IB′(0) specifies the strengthening of the data processing inequality for mutual information. This connection may open a new avenue for theoretical results on IB, especially when X or Y are continuous random variables. In particular, the recent non-multiplicative SDPI results [25], [79] seem insightful for this purpose.

In many practical cases, we might have n i.i.d. samples (X1, Y1), ..., (Xn, Yn) of (X,Y) ∼ PXY. We now study how IB behaves in n. Let X^n := (X1, ..., Xn) and Y^n := (Y1, ..., Yn). Due to the i.i.d. assumption, we have PX^nY^n(x^n, y^n) = ∏_{i=1}^n PXY(xi, yi). This can also be described by independently feeding Xi, i ∈ [n], to the channel PY|X, producing Yi. The following theorem, demonstrated first in [4, Theorem 2.4], gives a formula for IB in terms of n.

Theorem 5 (Additivity). We have

(1/n) IB(PX^nY^n, nR) = IB(PXY, R).


This theorem demonstrates that an optimal channel PT^n|X^n for i.i.d. samples (X^n, Y^n) ∼ PXY is obtained by the Kronecker product of an optimal channel PT|X for (X,Y) ∼ PXY. This, however, may not hold in general for PF; that is, we might have PF(PX^nY^n, nr) < n PF(PXY, r).

A. Gaussian IB and PF

In this section, we turn our attention to a special, yet important, case where X = Y + σN^G, where σ > 0 and N^G ∼ N(0,1) is independent of Y. This setting subsumes the popular case of jointly Gaussian (X,Y), whose information bottleneck functional was computed in [80].

Lemma 2. Let {Yi}_{i=1}^n be n i.i.d. copies of Y ∼ PY and Xi = Yi + σN^G_i, where {N^G_i} are i.i.d. samples of N(0,1) independent of Y. Then, we have

(1/n) IB(PX^nY^n, nR) ≤ H(X) − (1/2) log[2πeσ² + e^{2(H(Y)−R)}].

This lemma establishes an upper bound for IB without any assumptions on PT|X. This upper bound provides a significantly simpler proof of the well-known fact that, for jointly Gaussian (X,Y), the optimal channel PT|X is Gaussian. This result was first proved in [24] and used in [80] to compute an expression of IB for the Gaussian case.

Corollary 1. If (X,Y) are jointly Gaussian with correlation coefficient ρ, then we have

IB(R) = (1/2) log( 1 / (1 − ρ² + ρ²e^{−2R}) ).  (8)

Moreover, the optimal channel PT|X is given by PT|X(·|x) = N(0, σ²) with σ² = σ²_Y e^{−2R}/(ρ²(1 − e^{−2R})), where σ²_Y is the variance of Y.
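The closed form (8) is easy to probe numerically. The sketch below (ρ = 0.8 is an arbitrary illustrative choice, and logarithms are taken to be natural) checks that IB(0) = 0, that IB(R) is strictly increasing, and that it saturates at I(X;Y) = (1/2) log(1/(1−ρ²)) as R grows.

```python
import numpy as np

def IB_gauss(R, rho):   # eq. (8), in nats
    return 0.5 * np.log(1.0 / (1 - rho**2 + rho**2 * np.exp(-2 * R)))

rho = 0.8
I_XY = 0.5 * np.log(1 / (1 - rho**2))           # mutual information of the pair
assert abs(IB_gauss(0.0, rho)) < 1e-12          # no rate, no information
assert IB_gauss(50.0, rho) <= I_XY + 1e-12      # never exceeds I(X;Y)
assert abs(IB_gauss(50.0, rho) - I_XY) < 1e-6   # and approaches it as R grows
R = np.linspace(0, 5, 100)
assert np.all(np.diff(IB_gauss(R, rho)) > 0)    # IB is strictly increasing in R
```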

In Lemma 2, we assumed that X is a Gaussian perturbation of Y. However, in some practical scenarios, we might instead have Y as a Gaussian perturbation of X. For instance, let X represent an image and Y be a feature of the image that can be perfectly obtained from a noisy observation of X. Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [25, Theorem 1], gives an upper bound for IB in this case.

Lemma 3. Let X^n be n i.i.d. copies of a random variable X satisfying E[X²] ≤ 1, and let Yi be the result of passing Xi, i ∈ [n], through a Gaussian channel Y = X + σN^G, where σ > 0 and N^G ∼ N(0,1) is independent of X. Then, we have

(1/n) IB(PX^nY^n, nR) ≤ R − Ψ(R, σ),  (9)

where

Ψ(R, σ) := max_{x∈[0, 1/2]} 2Q(√(1/(xσ²))) ( R − hb(x) − (x/2) log(1 + 1/(xσ²)) ),  (10)

Q(t) := ∫_t^∞ (1/√(2π)) e^{−u²/2} du is the Gaussian complementary CDF, and hb(a) := −a log(a) − (1−a) log(1−a) for a ∈ (0,1) is the binary entropy function. Moreover, we have

(1/n) IB(PX^nY^n, nR) ≤ R − e^{−(1/(Rσ²)) log(1/R) + Θ(log(1/R))}.  (11)

Note that Lemma 3 holds for arbitrary X (provided that E[X²] ≤ 1), and hence (9) bounds the information bottleneck functionals of a wide family of PXY. However, the bound is loose in general for large values of R. For instance, if (X,Y) are jointly Gaussian (implying Y = X + σN^G for some σ > 0), then the right-hand side of (9) does not reduce to (8). To show this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Fig. 2.

The privacy funnel functional is much less studied, even for the simple case of jointly Gaussian (X,Y). Solving the optimization in PF over PT|X without any assumptions is a difficult challenge. A natural assumption to make is that PT|X(·|x) is Gaussian for each x ∈ X. This leads to the following variant of PF:

PFG(r) := inf_{σ≥0: I(X;Tσ)≥r} I(Y;Tσ),

where

Tσ := X + σN^G,

and N^G ∼ N(0,1) is independent of X. This formulation is tractable and can be computed in closed form for jointly Gaussian (X,Y), as described in the following example.


Fig. 2. Comparison of (8), the exact value of IB for jointly Gaussian X and Y (i.e., Y = X + σN^G with X and N^G both standard Gaussian N(0,1)), with the general upper bound (9) for σ² = 0.5. It is worth noting that while the Gaussian IB converges to I(X;Y) ≈ 0.8, the upper bound diverges.

Example 1. Let X and Y be jointly Gaussian with correlation coefficient ρ. First note that since mutual information is invariant to scaling, we may assume without loss of generality that both X and Y are zero mean and unit variance, and hence we can write X = ρY + √(1−ρ²) M^G, where M^G ∼ N(0,1) is independent of Y. Consequently, we have

I(X;Tσ) = (1/2) log(1 + 1/σ²),  (12)

and

I(Y;Tσ) = (1/2) log(1 + ρ²/(1 − ρ² + σ²)).  (13)

In order to ensure I(X;Tσ) ≥ r, we must have σ ≤ (e^{2r} − 1)^{−1/2}. Plugging this choice of σ into (13), we obtain

PFG(r) = (1/2) log( 1 / (1 − ρ²(1 − e^{−2r})) ).  (14)
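The algebra of Example 1 can be checked numerically. The sketch below (ρ = 0.8 is an illustrative assumption, logarithms are natural) confirms that σ = (e^{2r} − 1)^{−1/2} meets the constraint I(X;Tσ) = r with equality, and that plugging it into (13) reproduces the closed form (14).

```python
import numpy as np

rho = 0.8  # illustrative correlation coefficient

def I_XT(sigma):   # eq. (12)
    return 0.5 * np.log(1 + 1 / sigma**2)

def I_YT(sigma):   # eq. (13)
    return 0.5 * np.log(1 + rho**2 / (1 - rho**2 + sigma**2))

def PFG(r):        # eq. (14), closed form
    return 0.5 * np.log(1 / (1 - rho**2 * (1 - np.exp(-2 * r))))

for r in [0.1, 0.5, 1.0, 2.0]:
    sigma = (np.exp(2 * r) - 1) ** -0.5   # the unique sigma with I(X;T_sigma) = r
    assert abs(I_XT(sigma) - r) < 1e-9    # constraint met with equality
    assert abs(I_YT(sigma) - PFG(r)) < 1e-9  # (13) at this sigma equals (14)
```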

This example indicates that for jointly Gaussian (X,Y), we have PFG(r) = 0 if and only if r = 0 (thus perfect privacy does not occur), and the constraint I(X;Tσ) = r is satisfied by a unique σ. These two properties in fact hold for all continuous variables X and Y with finite second moments, as demonstrated in Lemma 10 in Appendix A. We use these properties to derive a second-order approximation of PFG(r) when r is sufficiently small. For the following theorem, we use var(U) to denote the variance of the random variable U and var(U|V) := E[(U − E[U|V])²|V]. We write σ²_X = var(X) for short.

Theorem 6. For any pair of continuous random variables (X,Y) with finite second moments, we have, as r → 0,

PFG(r) = η(X,Y) r + ∆(X,Y) r² + o(r²),

where η(X,Y) := var(E[X|Y])/σ²_X and

∆(X,Y) := (2/σ⁴_X) [ E[var²(X|Y)] − σ²_X E[var(X|Y)] ].

It is worth mentioning that the quantity η(X,Y) was first defined by Rényi [26] as an asymmetric measure of correlation between X and Y. In fact, it can be shown that η(X,Y) = sup_f ρ²(X, f(Y)), where the supremum is taken over all measurable functions f and ρ(·,·) denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian X and Y with correlation coefficient ρ, for which PFG was computed in Example 1. In this case, it can be easily verified that η(X,Y) = ρ² and ∆(X,Y) = −2σ²_X ρ²(1 − ρ²). Hence, for jointly Gaussian (X,Y) with correlation coefficient ρ and unit variance, we have PFG(r) = ρ²r − 2ρ²(1 − ρ²)r² + o(r²). In Fig. 3, we compare the approximation given in Theorem 6 with the exact expression for this particular case.

B. Evaluation of IB and PF

The constrained optimization problems in the definitions of IB and PF are usually challenging to solve numerically due to the non-linearity of the constraints. In practice, however, both IB and PF are often approximated by their corresponding Lagrangian optimizations

LIB(β) := sup_{PT|X} I(Y;T) − βI(X;T) = H(Y) − βH(X) − inf_{PT|X} [H(Y|T) − βH(X|T)],  (15)


Fig. 3. Second-order approximation of PFG according to Theorem 6 for jointly Gaussian X and Y with correlation coefficient ρ = 0.8. For this particular case, the exact expression of PFG is computed in (14).

and

LPF(β) := inf_{PT|X} I(Y;T) − βI(X;T) = H(Y) − βH(X) − sup_{PT|X} [H(Y|T) − βH(X|T)],  (16)

where β ∈ R+ is the Lagrange multiplier that controls the tradeoff between compression and informativeness for IB, and between privacy and informativeness for PF. Notice that for the computation of LIB, we can assume, without loss of generality, that β ∈ [0,1], since otherwise the maximizer of (15) is trivial. It is worth noting that LIB(β) and LPF(β) in fact correspond to lines of slope β supporting M from above and below, thereby providing a new representation of M.

Let (X′, Y′) be a pair of random variables, where X′ ∼ QX for some QX ∈ P(X) and Y′ is the output of PY|X when the input is X′. Define

Fβ(QX) := H(Y′) − βH(X′).

This function is, in general, neither convex nor concave in QX; for instance, F0 is concave and F1 is convex in QX. The lower convex envelope (resp. upper concave envelope) of Fβ(QX) is defined as the largest convex (resp. smallest concave) function smaller (resp. larger) than Fβ(QX). Let K∪[Fβ(QX)] and K∩[Fβ(QX)] denote the lower convex and upper concave envelopes of Fβ(QX), respectively. If Fβ(QX) is convex at PX, that is, K∪[Fβ(QX)]|PX = Fβ(PX), then Fβ(QX) remains convex at PX for all β′ ≥ β, because

K∪[Fβ′(QX)] = K∪[Fβ(QX) − (β′−β)H(X′)]
 ≥ K∪[Fβ(QX)] + K∪[−(β′−β)H(X′)]
 = K∪[Fβ(QX)] − (β′−β)H(X′),

where the last equality follows from the fact that −(β′−β)H(X′) is convex. Hence, at PX we have

K∪[Fβ′(QX)]|PX ≥ K∪[Fβ(QX)]|PX − (β′−β)H(X) = Fβ(PX) − (β′−β)H(X) = Fβ′(PX),

and since the lower convex envelope never exceeds the function itself, equality must hold. Analogously, if Fβ(QX) is concave at PX, that is, K∩[Fβ(QX)]|PX = Fβ(PX), then Fβ(QX) remains concave at PX for all β′ ≤ β.

Notice that, according to (15) and (16), we can write

LIB(β) = H(Y) − βH(X) − K∪[Fβ(QX)]|PX,  (17)

and

LPF(β) = H(Y) − βH(X) − K∩[Fβ(QX)]|PX.  (18)

In light of the above arguments, we can write LIB(β) = 0 for all β > βIB, where βIB is the smallest β such that Fβ(PX) touches K∪[Fβ(QX)]. Similarly, LPF(β) = 0 for all β < βPF, where βPF is the largest β such that Fβ(PX) touches K∩[Fβ(QX)]. In the following proposition, we show that βIB and βPF are given by the values of IB′(0) and PF′(0), respectively, given in Theorem 4.


Proposition 1. We have

βIB = sup_{QX≠PX} DKL(QY‖PY)/DKL(QX‖PX),

and

βPF = inf_{QX≠PX} DKL(QY‖PY)/DKL(QX‖PX).

Kim et al. [76] have recently proposed an efficient algorithm to estimate βIB from samples of PXY via a simple optimization problem. This algorithm can be readily adapted for estimating βPF. Proposition 1 implies that, in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier β; that is,

LIB(β) = H(Y) − βH(X) − K∪[Fβ(QX)]|PX, for β ∈ [0, βIB],  (19)

and

LPF(β) = H(Y) − βH(X) − K∩[Fβ(QX)]|PX, for β ∈ [βPF, ∞).  (20)

Remark 1. As demonstrated by Kolchinsky et al. [48], the boundary points 0 and βIB are required for the computation of LIB(β). In fact, when Y is a deterministic function of X, only β = 0 and β = βIB are required to compute IB, and other values of β are vacuous. The same argument can also be used to justify the inclusion of βPF in computing LPF(β). Note also that since Fβ(QX) becomes convex for β > βIB, computing K∩[Fβ(QX)] becomes trivial for such values of β.

Remark 2. Observe that the lower convex envelope of any function f can be obtained by taking the Legendre-Fenchel transformation (a.k.a. convex conjugate) twice. Hence, one can use the existing linear-time algorithms for approximating the Legendre-Fenchel transformation (e.g., [81], [82]) to approximate K∪[Fβ(QX)].
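A minimal numerical sketch of this biconjugation idea follows, using a brute-force discrete Legendre-Fenchel transform rather than the linear-time algorithms of [81], [82]; the grid sizes and the test function are arbitrary assumptions for illustration.

```python
import numpy as np

def legendre(x, f, s):
    # discrete Legendre-Fenchel transform: f*(s) = max_x [s*x - f(x)]
    return np.max(s[:, None] * x[None, :] - f[None, :], axis=1)

def lower_convex_envelope(x, f, n_slopes=4001):
    # slope grid wide enough to cover all secant slopes of f
    smax = np.max(np.abs(np.diff(f) / np.diff(x))) + 1.0
    s = np.linspace(-smax, smax, n_slopes)
    return legendre(s, legendre(x, f, s), x)   # biconjugate f** = lower convex envelope

x = np.linspace(0.0, 1.0, 501)
f = np.sin(3 * np.pi * x)                      # a non-convex test function
env = lower_convex_envelope(x, f)

assert np.all(env <= f + 1e-9)                 # the envelope never exceeds f
assert np.all(np.diff(env, 2) >= -1e-9)        # and is convex on the grid
```

Since the result is a pointwise maximum of affine functions, convexity of the output holds by construction; accuracy is limited only by the grid resolutions.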

Once LIB(β) and LPF(β) are computed, we can derive IB and PF via standard results in optimization (see [4, Section IV] for more details):

IB(R) = inf_{β∈[0,βIB]} βR + LIB(β),  (21)

and

PF(r) = sup_{β∈[βPF,∞)} βr + LPF(β).  (22)

Following the convex-analysis approach outlined by Witsenhausen and Wyner [4], IB and PF can be directly computed from LIB(β) and LPF(β) by observing the following. Suppose that for some β, K∪[Fβ(QX)] (resp. K∩[Fβ(QX)]) at PX is obtained by a convex combination of points Fβ(Qi), i ∈ [k], for some Q1, ..., Qk in P(X), integer k ≥ 2, and weights λi ≥ 0 with ∑i λi = 1. Then ∑i λi Qi = PX, and T∗ with PT∗(i) = λi and PX|T∗=i = Qi attains the minimum (resp. maximum) of H(Y|T) − βH(X|T). Hence, (I(X;T∗), I(Y;T∗)) is a point on the upper (resp. lower) boundary of M, implying that IB(R) = I(Y;T∗) for R = I(X;T∗) (resp. PF(r) = I(Y;T∗) for r = I(X;T∗)). If for some β, K∪[Fβ(QX)] at PX coincides with Fβ(PX), then this corresponds to LIB(β) = 0; the same holds for K∩[Fβ(QX)] and LPF(β). Thus, all the information about the functional IB (resp. PF) is contained in the subset of the domain of K∪[Fβ(QX)] (resp. K∩[Fβ(QX)]) over which it differs from Fβ(QX). We will revisit and generalize this approach in Section III.

We can now instantiate this for the binary symmetric case. Suppose X and Y are binary variables and PY|X is a binary symmetric channel with crossover probability δ, denoted by BSC(δ) and defined as

BSC(δ) = [ 1−δ   δ
            δ   1−δ ],  (23)

for some δ ≥ 0. To describe the result compactly, we introduce the following notation: we let hb : [0,1] → [0,1] denote the binary entropy function, i.e., hb(p) = −p log p − (1−p) log(1−p). Since this function is strictly increasing on [0, 1/2], its inverse exists and is denoted by hb⁻¹ : [0,1] → [0, 1/2]. Also, a ∗ b := a(1−b) + b(1−a) for a, b ∈ [0,1].

Lemma 4 (Mr. and Mrs. Gerber's Lemma). For X ∼ Bernoulli(p) with p ≤ 1/2 and PY|X = BSC(δ) for δ ≥ 0, we have

IB(R) = hb(p ∗ δ) − hb(δ ∗ hb⁻¹(hb(p) − R)),  (24)

and

PF(r) = hb(p ∗ δ) − α hb(δ ∗ (p/z)) − (1−α) hb(δ),  (25)

where r = hb(p) − α hb(p/z), z = max(α, 2p), and α ∈ [0,1].
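Mrs. Gerber's expression (24) can be evaluated directly. The sketch below (natural logarithms; p = 0.3 and δ = 0.1 are illustrative assumptions, and hb⁻¹ is computed by bisection) checks the two endpoints IB(0) = 0 and IB(hb(p)) = I(X;Y) = hb(p ∗ δ) − hb(δ).

```python
import numpy as np

def hb(p):  # binary entropy in nats
    return 0.0 if p in (0.0, 1.0) else -p*np.log(p) - (1-p)*np.log(1-p)

def hb_inv(v, tol=1e-12):  # inverse of hb on [0, 1/2], by bisection
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if hb(mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def conv(a, b):  # binary convolution a * b
    return a*(1-b) + b*(1-a)

def IB_bsc(R, p, delta):  # Mrs. Gerber's Lemma, eq. (24)
    return hb(conv(p, delta)) - hb(conv(delta, hb_inv(hb(p) - R)))

p, delta = 0.3, 0.1
assert abs(IB_bsc(0.0, p, delta)) < 1e-9                 # zero rate: zero information
I_XY = hb(conv(p, delta)) - hb(delta)
assert abs(IB_bsc(hb(p), p, delta) - I_XY) < 1e-9        # full rate: I(X;Y)
```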

The result in (24) was proved by Wyner and Ziv [3] and is widely known as Mrs. Gerber's Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber's Lemma. As described above, to prove (24) and (25) it suffices to derive


(a) β = 0.55 (b) β = 0.53 (c) β = 0.49

Fig. 4. The mapping q ↦ Fβ(q) = H(Y′) − βH(X′), where X′ ∼ Bernoulli(q) and Y′ is the result of passing X′ through BSC(0.1); see (26).

the convex and concave envelopes of the mapping Fβ : [0,1] → R given by

Fβ(q) := Fβ(QX) = hb(q ∗ δ) − β hb(q),  (26)

where q ∗ δ := q(1−δ) + δ(1−q) is the parameter of the output distribution of BSC(δ) when the input distribution is Bernoulli(q) for some q ∈ (0,1). It can be verified that βIB ≤ (1−2δ)². This function is depicted in Fig. 4 for several values of β ≤ (1−2δ)².

C. Operational Meaning of IB and PF

In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both IB and PF. The operational interpretation of IB has recently been extensively studied in information-theoretic settings in [83], [84]. In particular, it was shown that IB specifies the rate-distortion region of the noisy source coding problem [18], [85] under logarithmic loss as the distortion measure, as well as the rate region of lossless source coding with side information at the decoder [86]. Here, we state the former setting (as it will be useful for our subsequent analysis of the cardinality bound) and also provide a new information-theoretic setting in which IB appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by PF. This in fact delineates an important difference between IB and PF: while IB describes the entire rate region of an information-theoretic setup, PF specifies only a corner point of a rate region. Other information-theoretic settings related to IB and PF include the CEO problem [87] and source coding for the Gray-Wyner network [88].

1) Noisy Source Coding: Suppose Alice has access only to a noisy version X of a source of interest Y. She wishes to transmit a rate-constrained description of her observation (i.e., X) to Bob such that he can recover Y with small average distortion. More precisely, let (X^n, Y^n) be n i.i.d. samples of (X,Y) ∼ PXY. Alice encodes her observation X^n through an encoder φ : X^n → {1, ..., Kn} and sends φ(X^n) to Bob. Upon receiving φ(X^n), Bob reconstructs a "soft" estimate of Y^n via a decoder ψ : {1, ..., Kn} → Ŷ^n, where Ŷ = P(Y). That is, the reproduction sequence ŷ^n consists of n probability measures on Y. For any source and reproduction sequences y^n and ŷ^n, respectively, the distortion is defined as

d(y^n, ŷ^n) := (1/n) ∑_{i=1}^n d(yi, ŷi),

where

d(y, ŷ) := log(1/ŷ(y)).  (27)
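The log-loss (27) is what ties this problem to conditional entropy: for a fixed joint distribution of (Y, T), the soft decoder that outputs the posterior PY|T achieves expected loss exactly H(Y|T), and any other soft decoder does worse. A small sketch with an arbitrary toy joint distribution (the values in PTY are illustrative assumptions):

```python
import numpy as np

# toy joint distribution of (T, Y): rows indexed by t, columns by y
PTY = np.array([[0.30, 0.10],
                [0.15, 0.45]])
PT = PTY.sum(axis=1)
PY_given_T = PTY / PT[:, None]

def exp_logloss(q):
    # expected log-loss E[log 1/q(Y|T)] for a soft decoder q(.|t)
    return -(PTY * np.log(q)).sum()

H_Y_given_T = -(PTY * np.log(PY_given_T)).sum()   # conditional entropy H(Y|T)

# the posterior P_{Y|T} attains the conditional entropy ...
assert abs(exp_logloss(PY_given_T) - H_Y_given_T) < 1e-12
# ... and any other soft decoder incurs a strictly larger loss
q_bad = np.array([[0.5, 0.5], [0.5, 0.5]])
assert exp_logloss(q_bad) > H_Y_given_T
```

The excess loss of a suboptimal decoder is exactly the average KL divergence between the posterior and the decoder's output, which is why logarithmic loss turns the distortion constraint into an entropy constraint.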

We say that a rate-distortion pair (R,D) is achievable if there exists a pair (φ, ψ) of encoder and decoder such that

lim sup_{n→∞} E[d(Y^n, ψ(φ(X^n)))] ≤ D, and lim sup_{n→∞} (1/n) log Kn ≤ R.  (28)

The noisy rate-distortion function Rnoisy(D), for a given D ≥ 0, is defined as the minimum rate R such that (R,D) is an achievable rate-distortion pair. This problem arises naturally in many data-analytic problems; examples include feature selection in high-dimensional datasets, clustering, and matrix completion. The problem was first studied by Dobrushin and Tsybakov [18], who showed that Rnoisy(D) is analogous to the classical rate-distortion function:

Rnoisy(D) = inf_{PŶ|X: E[d(Y,Ŷ)]≤D, Y (−− X (−− Ŷ} I(X; Ŷ).  (29)


It can be easily verified that E[d(Y, Ŷ)] = H(Y|Ŷ), and hence (after relabeling Ŷ as T)

Rnoisy(D) = inf_{PT|X: I(Y;T)≥R, Y (−− X (−− T} I(X;T),  (30)

where R = H(Y) − D, which is equal to IB defined in (4). For more details on the connection between noisy source coding and IB, the reader is referred to [83], [84], [87], [89]. Notice that one can study an essentially identical problem where the distortion constraint (28) is replaced by

lim_{n→∞} (1/n) I(Y^n; ψ(φ(X^n))) ≥ R, and lim sup_{n→∞} (1/n) log Kn ≤ R.

This problem is addressed in [90] for discrete alphabets X and Y and was recently extended in [91] to general alphabets.

2) Test Against Independence with Communication Constraint: As mentioned earlier, the connection between IB and noisy source coding, described above, was known and studied in [83], [84]. Here, we provide a new information-theoretic setting which provides yet another operational meaning for IB. Given n i.i.d. samples (X1,Y1), ..., (Xn,Yn) from a joint distribution Q, we wish to test whether the Xi are independent of the Yi, that is, whether Q is a product distribution. This task is formulated by the following hypothesis test:

H0 : Q = PXY,
H1 : Q = PX PY,  (31)

for a given joint distribution PXY with marginals PX and PY. Ahlswede and Csiszár [19] investigated this problem under a communication constraint: while the Y observations (i.e., Y1, ..., Yn) are fully available, the X observations need to be compressed at rate R; that is, instead of X^n, only φ(X^n) is available, where φ : X^n → {1, ..., Kn} satisfies (1/n) log Kn ≤ R.

For the type I error probability not exceeding a fixed ε ∈ (0,1), Ahlswede and Csiszár [19] derived the smallest possible type II error probability, defined as

βR(n, ε) = min_{φ: X^n→[Kn], (1/n) log Kn ≤ R}  min_{A ⊂ [Kn]×Y^n} { (Pφ(X^n) × PY^n)(A) : Pφ(X^n),Y^n(A) ≥ 1 − ε }.

The following theorem gives the asymptotic expression of βR(n, ε) for every ε ∈ (0,1); for the proof, refer to [19, Theorem 3].

Theorem 7 ([19]). For every R ≥ 0 and ε ∈ (0,1), we have

lim_{n→∞} −(1/n) log βR(n, ε) = IB(R).

In light of this theorem, IB(R) specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.

3) Dependence Dilution: Inspired by the problems of information amplification [92] and state masking [93], Asoodeh et al. [20] proposed the dependence dilution setup as follows. Consider a source sequence X^n of n i.i.d. copies of X ∼ PX. Alice observes the source X^n and wishes to encode it via the encoder

fn : X^n → {1, 2, ..., 2^{nR}},

for some R > 0. The goal is to ensure that any user observing fn(X^n) can construct a list, of fixed size, of sequences in X^n that contains likely candidates of the actual sequence X^n, while revealing negligible information about a correlated source Y^n. To formulate this goal, consider the decoder

gn : {1, 2, ..., 2^{nR}} → 2^{X^n},

where 2^{X^n} denotes the power set of X^n. A dependence dilution triple (R, Γ, ∆) ∈ R³₊ is said to be achievable if, for any δ > 0, there exists a pair of encoder and decoder (fn, gn) such that, for sufficiently large n,

Pr(X^n ∉ gn(J)) < δ,  (32)

with fixed list size |gn(J)| = 2^{n(H(X)−Γ)}, where J = fn(X^n), and simultaneously

(1/n) I(Y^n; J) ≤ ∆ + δ.  (33)

Notice that without the side information J, the decoder can only construct a list of size 2^{nH(X)} which contains X^n with probability close to one. However, after J is observed and the list gn(J) is formed, the decoder's list size can be reduced to 2^{n(H(X)−Γ)},


and thus reducing the uncertainty about X^n by nΓ ∈ [0, nH(X)]. This observation can be formalized to show (see [92] for details) that the constraint (32) is equivalent to

(1/n) I(X^n; J) ≥ Γ − δ,  (34)

which lower-bounds the amount of information J carries about X^n. Building on this equivalent formulation, Asoodeh et al. [20, Corollary 15] derived a necessary condition for achievable dependence dilution triples.

Theorem 8 ([20]). Any achievable dependence dilution triple (R, Γ, ∆) satisfies

R ≥ Γ,
Γ ≤ I(X;T),
∆ ≥ I(Y;T) − I(X;T) + Γ,

for some auxiliary random variable T satisfying Y (−− X (−− T and taking |T| ≤ |X| + 1 values.

According to this theorem, PF(Γ) specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate Γ. While this informs the operational interpretation of PF, Theorem 8 only provides an outer bound for the set of achievable dependence dilution triples (R, Γ, ∆). It is, however, not clear whether PF characterizes the rate region of an information-theoretic setup.

The fact that IB fully characterizes the rate region of a source coding setup has an important consequence: the cardinality of the auxiliary random variable T in IB can be improved to |X| instead of |X| + 1.

D. Cardinality Bound

Recall that in the definition of IB in (4), no assumption was imposed on the auxiliary random variable T. A straightforward application of the Carathéodory-Fenchel-Eggleston theorem² reveals that IB is attained by T taking values in a set T with cardinality |T| ≤ |X| + 1. Here, we improve this bound and show that it can be tightened to |T| ≤ |X|.

Theorem 9. For any joint distribution PXY and R ∈ (0, H(X)], the information bottleneck IB(R) is achieved by T taking at most |X| values.

The proof of this theorem hinges on the operational characterization of IB as the lower boundary of the rate-distortion region of the noisy source coding problem discussed in Section II-C. Specifically, we first show that the extreme points of this region are achieved by T taking |X| values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such T. It must be mentioned that this result was already claimed by Harremoës and Tishby in [95] without proof.

In many practical scenarios, the feature X has a large alphabet. Hence, the bound |T| ≤ |X|, albeit optimal, can still make the information bottleneck function computationally intractable over large alphabets. However, the label Y usually has a significantly smaller alphabet. While it is in general impossible to have a cardinality bound for T in terms of |Y|, one can consider approximating IB assuming T takes N values. The following result, recently proved by Hirche and Winter [96], is in this spirit.

Theorem 10 ([96]). For any (X,Y) ∼ PXY, we have

IB(R, N) ≤ IB(R) ≤ IB(R, N) + δ(N),

where δ(N) = 4N^{−1/|Y|} [ (log |Y|)/4 + (1/|Y|) log N ] and IB(R, N) denotes the information bottleneck functional (4) with the additional constraint that |T| ≤ N.

Recall that, unlike PF, the graph of IB characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section II-C), and hence any boundary point can be constructed via time-sharing of extreme points of the rate region. This lack of an operational characterization for PF translates into a worse cardinality bound than that of IB. In fact, for PF the cardinality bound |T| ≤ |X| + 1 cannot be improved in general. To demonstrate this, we numerically solve the optimization in PF assuming that |T| = |X| when both X and Y are binary. As illustrated in Fig. 5, this optimization does not lead to a convex function and hence cannot be equal to PF.

E. Deterministic Information Bottleneck

As mentioned earlier, IB formalizes an information-theoretic approach to clustering a high-dimensional feature X into cluster labels T that preserve as much information about the label Y as possible. The clustering labels are assigned by the soft operator PT|X that solves the IB formulation (4) according to the rule: X = x is likely assigned label T = t if DKL(PY|x‖PY|t) is

2This is a strengthening of the original Carathéodory theorem when the underlying space is connected, see e.g., [94, Section III] or [75, Lemma 15.4].


Fig. 5. The set {(I(X;T), I(Y;T))} with PX = Bernoulli(0.9), PY|X=0 = [0.9, 0.1], PY|X=1 = [0.85, 0.15], and T restricted to be binary. While the upper boundary of this set is concave, the lower boundary is not convex. This implies that, unlike IB, PF(r) cannot be attained by binary variables T.

small, where PY|t = ∑_x PY|x PX|t(x|t). That is, clustering is assigned based on the similarity of the conditional distributions. Since in many practical scenarios a hard clustering operator is preferred, Strouse and Schwab [27] suggested the following variant of IB, termed the deterministic information bottleneck (dIB):

dIB(PXY, R) := sup_{f: X→T, H(f(X))≤R} I(Y; f(X)),  (35)

where the maximization is taken over all deterministic functions f whose range is a finite set T. Similarly, one can define

dPF(PXY, r) := inf_{f: X→T, H(f(X))≥r} I(Y; f(X)).  (36)

One way to ensure that H(f(X)) ≤ R for a deterministic function f is to restrict the cardinality of the range of f: if f : X → [e^R], then H(f(X)) is necessarily smaller than R. Using this insight, we derive a lower bound for dIB(PXY, R) in the following lemma.

Lemma 5. For any given PXY, we have

dIB(PXY, R) ≥ ((e^R − 1)/|X|) I(X;Y),

and

dPF(PXY, r) ≤ ((e^r − 1)/|X|) I(X;Y) + Pr(X ≥ e^r) log(1/Pr(X ≥ e^r)).

Note that both R and r are smaller than H(X), and thus the multiplicative factors of I(X;Y) in the lemma are smaller than one. In light of this lemma, we can obtain

((e^R − 1)/|X|) I(X;Y) ≤ IB(R) ≤ I(X;Y),

and

PF(r) ≤ ((e^r − 1)/|X|) I(X;Y) + Pr(X ≥ e^r) log(1/Pr(X ≥ e^r)).

In most practical setups, |X| might be very large, making the above lower bound for IB vacuous. In the following lemma, we partially address this issue by deriving a bound independent of |X| when Y is binary.

Lemma 6. Let PXY be a joint distribution of arbitrary X and binary Y ∼ Bernoulli(q) for some q ∈ (0,1). Then, for any R ≥ log 5, we have

dIB(PXY, R) ≥ I(X;Y) − 2α hb( I(X;Y) / (2α(e^R − 4)) ),

where α = max{log(1/q), log(1/(1−q))}.

III. FAMILY OF BOTTLENECK PROBLEMS

In this section, we introduce a family of bottleneck problems by extending IB and PF to a large family of statistical measures. Similar to IB and PF, these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced


by a joint distribution PXY. Recall that R ↦ IB(PXY, R) and r ↦ PF(PXY, r) are the upper and lower boundaries of the set M defined in (6), repeated here for convenience:

M = { (I(X;T), I(Y;T)) : Y (−− X (−− T, (X,Y) ∼ PXY }.  (37)

Since PXY is given, H(X) and H(Y) are fixed. Thus, in characterizing M it is sufficient to consider only H(X|T) and H(Y|T). To generalize IB and PF, we must therefore generalize H(X|T) and H(Y|T).

Given a joint distribution PXY and two non-negative real-valued functions Φ : P(X) → R+ and Ψ : P(Y) → R+, we define

Φ(X|T) := E[Φ(PX|T)] = ∑_{t∈T} PT(t) Φ(PX|T=t),  (38)

and

Ψ(Y|T) := E[Ψ(PY|T)] = ∑_{t∈T} PT(t) Ψ(PY|T=t).  (39)

When X ∼ PX and Y ∼ PY, we interchangeably write Φ(X) for Φ(PX) and Ψ(Y) for Ψ(PY). These definitions provide natural generalizations of Shannon's entropy and mutual information. Moreover, as we discuss later in Sections III-B and III-C, they can also be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information- and estimation-theoretic quantities such as Arimoto's conditional entropy of order α for Φ(QX) = ‖QX‖α, the probability of correctly guessing for Φ(QX) = ‖QX‖∞, maximal correlation for the binary case, and f-information for Φ(QX) given by an f-divergence. We are able to generate a family of bottleneck problems using different instantiations of Φ(X|T) and Ψ(Y|T) in place of mutual information in IB and PF. As we argue later, these problems better capture the essence of "informativeness" and "privacy", thus providing analytical and interpretable guarantees similar in spirit to IB and PF.

Computing these bottleneck problems in general boils down to the following optimization problems:

U_{Φ,Ψ}(ζ) := sup_{PT|X: Y (−− X (−− T, Φ(X|T)≤ζ} Ψ(Y|T),  (40)

and

L_{Φ,Ψ}(ζ) := inf_{PT|X: Y (−− X (−− T, Φ(X|T)≥ζ} Ψ(Y|T).  (41)

Consider the set

M_{Φ,Ψ} := { (Φ(X|T), Ψ(Y|T)) : Y (−− X (−− T, (X,Y) ∼ PXY }.  (42)

Note that if both Φ and Ψ are continuous (with respect to the total variation distance), then M_{Φ,Ψ} is compact. Moreover, it can be easily verified that M_{Φ,Ψ} is convex. Hence, its upper and lower boundaries are well-defined and are characterized by the graphs of U_{Φ,Ψ} and L_{Φ,Ψ}, respectively. As mentioned earlier, these functionals are instrumental for computing the general bottleneck problems later. Hence, before we delve into examples of bottleneck problems, we extend the approach given in Section II-B to compute U_{Φ,Ψ} and L_{Φ,Ψ}.

A. Evaluation of U_{Φ,Ψ} and L_{Φ,Ψ}

Analogous to Section II-B, we first introduce the Lagrangians of U_{Φ,Ψ} and L_{Φ,Ψ} as

L^U_{Φ,Ψ}(β) := sup_{PT|X} Ψ(Y|T) − βΦ(X|T),  (43)

and

L^L_{Φ,Ψ}(β) := inf_{PT|X} Ψ(Y|T) − βΦ(X|T),  (44)

where β ≥ 0 is the Lagrange multiplier. Let (X′, Y′) be a pair of random variables with X′ ∼ QX and Y′ the result of passing X′ through the channel PY|X. Letting

F^{Φ,Ψ}_β(QX) := Ψ(Y′) − βΦ(X′),  (45)

we obtain that

L^U_{Φ,Ψ}(β) = K∩[F^{Φ,Ψ}_β(QX)]|PX and L^L_{Φ,Ψ}(β) = K∪[F^{Φ,Ψ}_β(QX)]|PX,  (46)

recalling that K∩ and K∪ are the upper concave and lower convex envelope operators. Once we compute L^U_{Φ,Ψ}(β) and L^L_{Φ,Ψ}(β) for all β ≥ 0, we can use standard results in optimization theory (similar to (21) and (22)) to recover U_{Φ,Ψ} and L_{Φ,Ψ}. However, we can instead extend the approach of Witsenhausen and Wyner [4] described in Section II-B. Suppose that for some β, K∩[F^{Φ,Ψ}_β(Q_X)] (resp. K∪[F^{Φ,Ψ}_β(Q_X)]) at P_X is obtained by a convex combination of the points F^{Φ,Ψ}_β(Q_i), i ∈ [k], for some Q_1, ..., Q_k in P(X), integer k ≥ 2, and weights λ_i ≥ 0 with Σ_i λ_i = 1. Then Σ_i λ_i Q_i = P_X, and the variable T* with P_{T*}(i) = λ_i and P_{X|T*=i} = Q_i attains the maximum (resp. minimum) of Ψ(Y|T) − βΦ(X|T), implying that (Φ(X|T*), Ψ(Y|T*)) is a point on the upper (resp. lower) boundary of M_{Φ,Ψ}. Consequently, such T* satisfies U_{Φ,Ψ}(ζ) = Ψ(Y|T*) for ζ = Φ(X|T*) (resp. L_{Φ,Ψ}(ζ) = Ψ(Y|T*) for ζ = Φ(X|T*)). The algorithm to compute U_{Φ,Ψ} and L_{Φ,Ψ} is then summarized in the following three steps:
• Construct the functional F^{Φ,Ψ}_β(Q_X) := Ψ(Y′) − βΦ(X′) for X′ ∼ Q_X and Y′ ∼ Q_X ⊗ P_{Y|X}, for all Q_X ∈ P(X) and β ≥ 0.
• Compute K∩[F^{Φ,Ψ}_β(Q_X)]|_{P_X} and K∪[F^{Φ,Ψ}_β(Q_X)]|_{P_X}, i.e., both envelopes evaluated at P_X.
• If for distributions Q_1, ..., Q_k in P(X), for some k ≥ 1, we have K∩[F^{Φ,Ψ}_β(Q_X)]|_{P_X} = Σ_{i=1}^k λ_i F^{Φ,Ψ}_β(Q_i) or K∪[F^{Φ,Ψ}_β(Q_X)]|_{P_X} = Σ_{i=1}^k λ_i F^{Φ,Ψ}_β(Q_i) for some λ_i ≥ 0 satisfying Σ_{i=1}^k λ_i = 1, then P_{X|T=i} = Q_i, i ∈ [k], and P_T(i) = λ_i give the optimal T* in U_{Φ,Ψ} and L_{Φ,Ψ}, respectively.

We will apply this approach to analytically compute U_{Φ,Ψ} and L_{Φ,Ψ} (and the corresponding bottleneck problems) for binary cases in the following sections.
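For a binary X, the three steps above can be carried out numerically: P(X) is the unit interval and the envelope operators K∪ and K∩ reduce to planar lower/upper convex hulls. The following Python sketch is our own illustration (the function names are not from the paper); it evaluates both envelopes on a grid for one instance of F_β, here with Φ and Ψ both taken to be Shannon entropy, so that F_β(q) = h_b(δ ∗ q) − β h_b(q).

```python
import numpy as np

def hb(q):
    """Binary entropy in nats, with the convention 0 log 0 = 0."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

def lower_convex_envelope(q, f):
    """K-cup[f] on the grid q: lower convex hull of the points (q_i, f_i),
    evaluated back on the grid by linear interpolation between hull vertices."""
    f = np.asarray(f, dtype=float)
    hull = [0]
    for i in range(1, len(q)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            # pop the middle vertex if it lies on or above the chord (i1, i)
            if (f[i2] - f[i1]) * (q[i] - q[i1]) >= (f[i] - f[i1]) * (q[i2] - q[i1]):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(q, q[hull], f[hull])

def upper_concave_envelope(q, f):
    """K-cap[f] = -K-cup[-f]."""
    return -lower_convex_envelope(q, -np.asarray(f, dtype=float))

# One instance of F_beta: Phi = Psi = Shannon entropy, BSC(delta)
delta, beta = 0.1, 0.5
conv = lambda a, b: a * (1 - b) + b * (1 - a)   # binary convolution
q = np.linspace(0.0, 1.0, 2001)
F = hb(conv(delta, q)) - beta * hb(q)
K_lower = lower_convex_envelope(q, F)   # evaluated at P_X = Bernoulli(p): K_lower at index of p
K_upper = upper_concave_envelope(q, F)
```

Where the envelope differs from F at P_X, the touching hull vertices play the role of the distributions Q_i in the third step above.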

B. Guessing Bottleneck Problems

Let P_{XY} be given with marginals P_X and P_Y and the corresponding channel P_{Y|X}. Let also Q_X ∈ P(X) be an arbitrary distribution on X and Q_Y = Q_X ⊗ P_{Y|X} be the output distribution of P_{Y|X} when fed with Q_X. Any channel P_{T|X}, together with the Markov structure Y (−− X (−− T, generates unique P_{X|T} and P_{Y|T}. We need the following basic definition from statistics.

Definition 1. Let U be a discrete random variable and V an arbitrary random variable, supported on U and V with |U| < ∞, respectively. Then P_c(U), the probability of correctly guessing U, and P_c(U|V), the probability of correctly guessing U given V, are given by

P_c(U) := max_{u∈U} P_U(u),

and

P_c(U|V) := max_g Pr(U = g(V)) = E[max_{u∈U} P_{U|V}(u|V)],

where the maximum is taken over all functions g : V → U. Moreover, the multiplicative gain of the observation V in guessing U is defined³ as

I_∞(U;V) := log [P_c(U|V) / P_c(U)].

As the names suggest, P_c(U|V) and P_c(U) characterize the optimal efficiency of guessing U with or without the observation V, respectively. Intuitively, I_∞(U;V) quantifies how useful the observation V is in estimating U: if it is small, then it is nearly as hard for an adversary to guess U when observing V as it is without V. This observation motivates the use of I_∞(Y;T) as a measure of privacy in lieu of I(Y;T) in PF.

It is worth noting that I_∞(U;V) is not symmetric in general, i.e., I_∞(U;V) ≠ I_∞(V;U). Since observing T can only improve the probability of correct guessing, we have P_c(Y|T) ≥ P_c(Y); thus I_∞(Y;T) ≥ 0. However, I_∞(Y;T) = 0 does not necessarily imply independence of Y and T; instead, it means T is useless for estimating Y. As an example, consider Y ∼ Bernoulli(p) and P_{T|Y=0} = Bernoulli(δ) and P_{T|Y=1} = Bernoulli(η) with δ, η ≤ 1/2 < p. Then P_c(Y) = p and

P_c(Y|T) = max{δ̄ p̄, η̄ p} + ηp,

where δ̄ := 1 − δ, η̄ := 1 − η, and p̄ := 1 − p. Thus, if δ̄ p̄ ≤ η̄ p, then P_c(Y|T) = P_c(Y). This then implies that I_∞(Y;T) = 0 whereas Y and T are clearly dependent, i.e., I(Y;T) > 0. While in general I(Y;T) and I_∞(Y;T) are not related, it can be shown that I(Y;T) ≤ I_∞(Y;T) if Y is uniform (see [60, Proposition 1]). Hence, only under this uniformity assumption does I_∞(Y;T) = 0 imply independence.
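This example is easy to verify numerically. The following sketch (helper names are ours, not from the paper) reproduces P_c(Y) = P_c(Y|T) for the stated channels, so that I_∞(Y;T) = 0 even though I(Y;T) > 0:

```python
import math

def pc_posterior(py, channel):
    """Pc(Y|T) = sum_t max_y P_Y(y) P_{T|Y}(t|y); channel[y][t] = P(T=t|Y=y)."""
    n_t = len(channel[0])
    return sum(max(py[y] * channel[y][t] for y in range(len(py))) for t in range(n_t))

def mutual_info(py, channel):
    """Shannon mutual information I(Y;T) in nats."""
    pt = [sum(py[y] * channel[y][t] for y in range(len(py))) for t in range(len(channel[0]))]
    return sum(
        py[y] * channel[y][t] * math.log(channel[y][t] / pt[t])
        for y in range(len(py)) for t in range(len(pt))
        if py[y] * channel[y][t] > 0
    )

p, delta, eta = 0.7, 0.2, 0.3           # delta, eta <= 1/2 < p
py = [1 - p, p]                         # Y ~ Bernoulli(p)
channel = [[1 - delta, delta],          # T | Y=0 ~ Bernoulli(delta)
           [1 - eta, eta]]              # T | Y=1 ~ Bernoulli(eta)

pc_y = max(py)                          # Pc(Y) = p
pc_yt = pc_posterior(py, channel)
i_inf = math.log(pc_yt / pc_y)          # I_inf(Y;T)
```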

Consider Φ(Q_X) = −Σ_{x∈X} Q_X(x) log Q_X(x) and Ψ(Q_Y) = ‖Q_Y‖_∞. Clearly, we have Φ(X|T) = H(X|T). Note that

Ψ(Y|T) = Σ_{t∈T} P_T(t)‖P_{Y|T=t}‖_∞ = P_c(Y|T),  (47)

thus both measures H(X|T) and P_c(Y|T) are special cases of the models described in the previous section. In particular, we can define the corresponding U_{Φ,Ψ} and L_{Φ,Ψ}. We will see later that I(X;T) and P_c(Y|T) correspond to Arimoto's mutual information of orders 1 and ∞, respectively. Define

IB^{(∞,1)}(R) := sup_{P_{T|X}: Y (−− X (−− T, I(X;T) ≤ R} I_∞(Y;T).  (48)

³ The reason for ∞ in the notation becomes clear later.


This bottleneck functional comes with an interpretable guarantee:

IB^{(∞,1)}(R) characterizes the best error probability in recovering Y among all R-bit summaries of X.

Recall that the functional PF(r) aims at extracting maximum information from X while protecting privacy with respect to Y. Measuring privacy in terms of P_c(Y|T), this objective can be better formulated as

PF^{(∞,1)}(r) := inf_{P_{T|X}: Y (−− X (−− T, I(X;T) ≥ r} I_∞(Y;T),  (49)

with the interpretable privacy guarantee:

PF^{(∞,1)}(r) characterizes the smallest probability of revealing the private feature Y among all representations of X preserving at least r bits of information about X.

Notice that the variable T in the formulations of IB^{(∞,1)} and PF^{(∞,1)} takes values in a set T of arbitrary cardinality. However, a straightforward application of the Carathéodory–Fenchel–Eggleston theorem (see, e.g., [75, Lemma 15.4]) reveals that the cardinality of T can be restricted to |X| + 1 without loss of generality. In the following lemma, we prove more basic properties of IB^{(∞,1)} and PF^{(∞,1)}.

Lemma 7. For any P_{XY} with Y supported on a finite set Y, we have:
• IB^{(∞,1)}(0) = PF^{(∞,1)}(0) = 0.
• IB^{(∞,1)}(R) = I_∞(Y;X) for any R ≥ H(X), and PF^{(∞,1)}(r) = I_∞(Y;X) for r ≥ H(X).
• R ↦ exp(IB^{(∞,1)}(R)) is strictly increasing and concave on the range (0, I_∞(Y;X)).
• r ↦ exp(PF^{(∞,1)}(r)) is strictly increasing and convex on the range (0, I_∞(Y;X)).

The proof follows the same lines as that of Theorem 1 and is hence omitted. Lemma 7 in particular implies that the inequalities I(X;T) ≤ R and I(X;T) ≥ r in the definitions of IB^{(∞,1)} and PF^{(∞,1)} can be replaced by the equalities I(X;T) = R and I(X;T) = r, respectively. It can be verified that I_∞ satisfies the data-processing inequality, i.e., I_∞(Y;T) ≤ I_∞(Y;X) for the Markov chain Y (−− X (−− T. Hence, both IB^{(∞,1)} and PF^{(∞,1)} can be no larger than I_∞(Y;X). The properties listed in Lemma 7 enable us to derive a slightly tighter upper bound for PF^{(∞,1)}, as demonstrated in the following.

Lemma 8. For any P_{XY} with Y supported on a finite set Y, we have

PF^{(∞,1)}(r) ≤ log[1 + (r/H(X))(e^{I_∞(Y;X)} − 1)],

and

log[1 + (R/H(X))(e^{I_∞(Y;X)} − 1)] ≤ IB^{(∞,1)}(R) ≤ I_∞(Y;X).

This lemma shows that, when R is sufficiently close to H(X), the gap between I_∞(Y;X) and IB^{(∞,1)}(R) behaves like

I_∞(Y;X) − IB^{(∞,1)}(R) ≤ I_∞(Y;X) − log[1 + (R/H(X))(e^{I_∞(Y;X)} − 1)] ≈ (1 − e^{−I_∞(Y;X)})(1 − R/H(X)).

Thus, IB^{(∞,1)}(R) approaches I_∞(Y;X) at least linearly as R → H(X).
In the following theorem, we apply the technique delineated in Section III-A to derive closed-form expressions for IB^{(∞,1)} and PF^{(∞,1)} in the binary symmetric case, thereby establishing results similar to Mr. and Mrs. Gerber's Lemma.

Theorem 11. For X ∼ Bernoulli(p) and P_{Y|X} = BSC(δ) with p, δ ≤ 1/2, we have

PF^{(∞,1)}(r) = log[ (δ̄ − (h_b(p) − r)(1/2 − δ)) / (1 − δ ∗ p) ],  (50)

and

IB^{(∞,1)}(R) = log[ (1 − δ ∗ h_b^{−1}(h_b(p) − R)) / (1 − δ ∗ p) ],  (51)

where δ̄ = 1 − δ.

As described in Section III-A, to compute IB^{(∞,1)} and PF^{(∞,1)} it suffices to derive the convex and concave envelopes of the mapping F^{(∞,1)}_β(q) := P_c(Y′) + βH(X′), where X′ ∼ Bernoulli(q) and Y′ is the result of passing X′ through BSC(δ), i.e., Y′ ∼ Bernoulli(δ ∗ q). In this case, P_c(Y′) = max{δ ∗ q, 1 − δ ∗ q} and F^{(∞,1)}_β can be expressed as

q ↦ F^{(∞,1)}_β(q) = max{δ ∗ q, 1 − δ ∗ q} + β h_b(q).  (52)


Fig. 6. The mapping q ↦ F^{(∞,1)}_β(q) = P_c(Y′) + βH(X′), where X′ ∼ Bernoulli(q) and Y′ ∼ Bernoulli(q) ⊗ BSC(0.1), for (a) β = 0.2, (b) β = 0.4, and (c) β = 0.5.

This function is depicted in Fig. 6. The detailed derivation of the convex and concave envelopes of F^{(∞,1)}_β is given in Appendix B. The proof of this theorem also reveals the following intuitive statements. If X ∼ Bernoulli(p) and P_{Y|X} = BSC(δ), then among all random variables T satisfying Y (−− X (−− T and H(X|T) ≤ λ, the minimum P_c(Y|T) is given by δ̄ − λ(0.5 − δ). Notice that, without any information constraint (i.e., λ = 0), P_c(Y|T) = P_c(Y|X) = δ̄. Perhaps surprisingly, this shows that the mutual information constraint has a linear effect on the privacy of Y. Similarly, to prove (51), we show that among all R-bit representations T of X, the best achievable accuracy P_c(Y|T) is given by 1 − δ ∗ h_b^{−1}(h_b(p) − R). This can be proved by combining Mrs. Gerber's Lemma (cf. Lemma 4) and Fano's inequality as follows. For all T such that H(X|T) ≥ λ, the minimum of H(Y|T) is given by h_b(δ ∗ h_b^{−1}(λ)). Since, by Fano's inequality, H(Y|T) ≤ h_b(1 − P_c(Y|T)), we obtain δ ∗ h_b^{−1}(λ) ≤ 1 − P_c(Y|T), which leads to the same result as above. Nevertheless, in Appendix B we give another proof based on the discussion of Section III-A.

C. Arimoto Bottleneck Problems

The bottleneck framework proposed in the last section benefited from interpretable guarantees brought forth by the quantityI∞. In this section, we define a parametric family of statistical quantities, the so-called Arimoto’s mutual information, whichincludes both Shannon’s mutual information and I∞ as extreme cases.

Definition 2 ([21]). Let U ∼ P_U and V ∼ P_V be two random variables supported over finite sets U and V, respectively. Their Arimoto mutual information of order α > 1 is defined as

I_α(U;V) = H_α(U) − H_α(U|V),  (53)

where

H_α(U) := (α/(1−α)) log ‖P_U‖_α,  (54)

is the Rényi entropy of order α and

H_α(U|V) := (α/(1−α)) log Σ_{v∈V} P_V(v)‖P_{U|V=v}‖_α,  (55)

is Arimoto's conditional entropy of order α.

By continuous extension, one can define I_α(U;V) for α = 1 and α = ∞ as I(U;V) and I_∞(U;V), respectively. That is,

lim_{α→1+} I_α(U;V) = I(U;V),  and  lim_{α→∞} I_α(U;V) = I_∞(U;V).  (56)

Arimoto's mutual information was first introduced by Arimoto [21] and later revisited by Liese and Vajda [97] and, more recently, by Verdú [98]. A more in-depth analysis of the properties of I_α can be found in [99]. It is shown in [66, Lemma 1] that I_α(U;V) for α ∈ [1,∞] quantifies the minimum loss in recovering U given V, where the loss is measured in terms of the so-called α-loss. This loss function reduces to the logarithmic loss (27) for α = 1 and to P_c(U|V) for α = ∞. This sheds light on the utility and/or privacy guarantee promised by a constraint on Arimoto's mutual information. It is now natural to use I_α to define a family of bottleneck problems.
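Definition 2 and the limits in (56) translate directly into a few lines of code. The sketch below is our own (the function name is not from the paper) and operates on a joint pmf P[u, v] over finite alphabets:

```python
import numpy as np

def arimoto_mi(P, alpha):
    """I_alpha(U;V) = H_alpha(U) - H_alpha(U|V) in nats, for a joint pmf P[u,v]."""
    assert alpha > 1
    PU = P.sum(axis=1)
    PV = P.sum(axis=0)
    h = alpha / (1 - alpha) * np.log(np.sum(PU ** alpha) ** (1 / alpha))   # Renyi entropy (54)
    col_norms = np.sum((P / PV) ** alpha, axis=0) ** (1 / alpha)           # ||P_{U|V=v}||_alpha
    h_cond = alpha / (1 - alpha) * np.log(np.sum(PV * col_norms))          # Arimoto cond. entropy (55)
    return h - h_cond

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

# alpha -> infinity recovers I_inf(U;V) = log(Pc(U|V)/Pc(U))
i_inf = np.log(P.max(axis=0).sum() / P.sum(axis=1).max())

# alpha -> 1+ recovers Shannon mutual information
PU, PV = P.sum(axis=1), P.sum(axis=0)
i_shannon = float(np.sum(P * np.log(P / np.outer(PU, PV))))
```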

Definition 3. Given a pair of random variables (X,Y) ∼ P_{XY} over finite sets X and Y, and α, γ ∈ [1,∞], we define IB^{(α,γ)} and PF^{(α,γ)} as

IB^{(α,γ)}(R) := sup_{P_{T|X}: Y (−− X (−− T, I_γ(X;T) ≤ R} I_α(Y;T),  (57)


Fig. 7. The structure of the optimal P_{T|X} for PF^{(∞,∞)} when P_{Y|X} = BSC(δ) and X ∼ Bernoulli(p) with δ, p ∈ [0, 1/2]. If the accuracy constraint is P_c(X|T) ≥ λ (or, equivalently, I_∞(X;T) ≥ log(λ/p̄)), then the crossover parameter of the optimal P_{T|X} is η = λ̄/p̄, leading to P_c(Y|T) = δ ∗ λ.

and

PF^{(α,γ)}(r) := inf_{P_{T|X}: Y (−− X (−− T, I_γ(X;T) ≥ r} I_α(Y;T).  (58)

Of course, IB^{(1,1)}(R) = IB(R) and PF^{(1,1)}(r) = PF(r). It is known that Arimoto's mutual information satisfies the data-processing inequality [99, Corollary 1], i.e., I_α(Y;T) ≤ I_α(Y;X) for the Markov chain Y (−− X (−− T. On the other hand, I_γ(X;T) ≤ H_γ(X). Thus, both IB^{(α,γ)}(R) and PF^{(α,γ)}(r) equal I_α(Y;X) for R, r ≥ H_γ(X). Note also that H_α(Y|T) = (α/(1−α)) log Ψ(Y|T), where Ψ(Y|T) (see (39)) corresponds to the function Ψ(Q_Y) = ‖Q_Y‖_α. Consequently, IB^{(α,γ)} and PF^{(α,γ)} are characterized by the upper and lower boundaries of M_{Φ,Ψ}, defined in (42), with respect to Φ(Q_X) = ‖Q_X‖_γ and Ψ(Q_Y) = ‖Q_Y‖_α. Specifically, we have

IB^{(α,γ)}(R) = H_α(Y) + (α/(α−1)) log U_{Φ,Ψ}(ζ),  (59)

where ζ = e^{−(1−1/γ)(H_γ(X)−R)}, and

PF^{(α,γ)}(r) = H_α(Y) + (α/(α−1)) log L_{Φ,Ψ}(ζ),  (60)

where ζ = e^{−(1−1/γ)(H_γ(X)−r)}, Φ(Q_X) = ‖Q_X‖_γ, and Ψ(Q_Y) = ‖Q_Y‖_α. This paves the way to apply the technique described in Section II-B to compute IB^{(α,γ)} and PF^{(α,γ)}. Doing so requires the upper concave and lower convex envelopes of the mapping Q_X ↦ ‖Q_Y‖_α − β‖Q_X‖_γ for some β ≥ 0, where Q_Y = Q_X ⊗ P_{Y|X}. In the following theorem, we derive these envelopes and give closed-form expressions for IB^{(α,γ)} and PF^{(α,γ)} in the special case α = γ ≥ 2.

Theorem 12. Let X ∼ Bernoulli(p) and P_{Y|X} = BSC(δ) with p, δ ≤ 1/2. For α ≥ 2, we have

PF^{(α,α)}(r) = (α/(1−α)) log [ ‖p ∗ δ‖_α / ‖q ∗ δ‖_α ],

where ‖a‖_α := ‖[a, ā]‖_α for a ∈ [0,1] and q ≤ p solves

(α/(1−α)) log [ ‖p‖_α / ‖q‖_α ] = r.

Moreover,

IB^{(α,α)}(R) = (α/(α−1)) log [ (λ‖δ‖_α + λ̄‖(q/z) ∗ δ‖_α) / ‖p ∗ δ‖_α ],

where z = max{2p, λ}, λ̄ = 1 − λ, and λ ∈ [0,1] solves

(α/(α−1)) log [ (λ + λ̄‖p/z‖_α) / ‖p‖_α ] = R.

By letting α → ∞, this theorem indicates that, for X and Y connected through BSC(δ) and all variables T forming Y (−− X (−− T, we have

P_c(X|T) ≥ λ ⟹ P_c(Y|T) ≥ δ ∗ λ,  (61)

which can be shown to be achieved by the T* generated by the following channel (see Fig. 7):

P_{T*|X} = [ (λ−p)/p̄    λ̄/p̄
                 0          1 ].  (62)

Note that, by assumption, p ≤ 1/2, and hence the event {X = 1} is less likely than {X = 0}. Therefore, (61) demonstrates that, to ensure correct recoverability of X with probability at least λ, the most private approach (with respect to Y) is to obfuscate the more likely event {X = 0} with probability λ̄/p̄. As demonstrated in (61), the optimal privacy guarantee is linear in the utility parameter in the binary symmetric case. This is in fact a special case of a more general result recently proved in [60, Theorem 1]: the infimum of P_c(Y|T) over all variables T such that P_c(X|T) ≥ λ is piecewise linear in λ, or, equivalently, the mapping r ↦ exp(PF^{(∞,∞)}(r)) is piecewise linear.

Computing PF^{(α,γ)} analytically for every α, γ > 1 seems challenging; however, the following lemma provides bounds for PF^{(α,γ)} and IB^{(α,γ)} in terms of PF^{(∞,∞)} and IB^{(∞,∞)}, respectively.

Lemma 9. For any pair of random variables (X,Y) over finite alphabets and α, γ > 1, we have

(α/(α−1)) PF^{(∞,∞)}(f(r)) − (α/(α−1)) H_∞(Y) + H_α(Y) ≤ PF^{(α,γ)}(r) ≤ PF^{(∞,∞)}(g(r)) + H_α(Y) − H_∞(Y),

and

(α/(α−1)) IB^{(∞,∞)}(f(R)) − (α/(α−1)) H_∞(Y) + H_α(Y) ≤ IB^{(α,γ)}(R) ≤ IB^{(∞,∞)}(g(R)) + H_α(Y) − H_∞(Y),

where f(a) = max{a − H_γ(X) + H_∞(X), 0} and g(b) = ((γ−1)/γ) b + H_∞(X) − ((γ−1)/γ) H_γ(X).

The previous lemma can be directly applied to derive upper and lower bounds for PF(α,γ) and IB(α,γ) given PF(∞,∞) andIB(∞,∞).

D. f -Bottleneck Problems

In this section, we describe another instantiation of the general framework, introduced in terms of the functions Φ and Ψ, that enjoys an interpretable estimation-theoretic guarantee.

Definition 4. Let f : (0,∞) → R be a convex function with f(1) = 0. Furthermore, let U and V be two real-valued random variables supported over U and V, respectively. Their f-information is defined by

I_f(U;V) := D_f(P_{UV}‖P_U P_V),  (63)

where D_f(·‖·) is the f-divergence [100] between distributions, defined as

D_f(P‖Q) := E_Q[f(dP/dQ)].

Due to the convexity of f and Jensen's inequality, we have D_f(P‖Q) ≥ f(1) = 0, and hence f-information is always non-negative. If, furthermore, f is strictly convex at 1, then equality holds if and only if P = Q. Csiszár introduced the f-divergence in [100] and applied it to several problems in statistics and information theory. More recent developments on the properties of f-divergence and f-information can be found in [22] and the references therein. Any convex function f with the property f(1) = 0 results in an f-information. Popular examples include f(t) = t log t, corresponding to Shannon's mutual information; f(t) = |t − 1|, corresponding to T-information [79]; and f(t) = t² − 1, corresponding to χ²-information [64]. It is worth mentioning that Arimoto's mutual information of order⁴ α ∈ (0,1) was also shown to be an f-information in the binary case for a certain function f; see [97, Theorem 8].
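For finite alphabets, I_f is a short computation from the joint pmf. The following sketch (our helper names) evaluates the three examples of f-information just listed:

```python
import numpy as np

def f_information(P, f):
    """I_f(U;V) = D_f(P_{UV} || P_U P_V) for a joint pmf P[u,v] and convex f with f(1) = 0."""
    Q = np.outer(P.sum(axis=1), P.sum(axis=0))   # product of the marginals
    return float(np.sum(Q * f(P / Q)))

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

i_kl  = f_information(P, lambda t: t * np.log(t))    # Shannon mutual information (nats)
i_tv  = f_information(P, lambda t: np.abs(t - 1))    # T-information
i_chi = f_information(P, lambda t: t**2 - 1)         # chi^2-information
```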

Let (X,Y) ∼ P_{XY} be given with marginals P_X and P_Y. Consider the functions Φ and Ψ on P(X) and P(Y) defined as

Φ(Q_X) := D_f(Q_X‖P_X)  and  Ψ(Q_Y) := D_f(Q_Y‖P_Y).

Given a conditional distribution P_{T|X}, it is easy to verify that Φ(X|T) = I_f(X;T) and Ψ(Y|T) = I_f(Y;T). This in turn implies that f-information can be utilized in (40) and (41) to define general bottleneck problems. Let f : (0,∞) → R and g : (0,∞) → R be two convex functions satisfying f(1) = g(1) = 0. Then we define

IB^{(f,g)}(R) := sup_{P_{T|X}: Y (−− X (−− T, I_g(X;T) ≤ R} I_f(Y;T),  (64)

and

PF^{(f,g)}(r) := inf_{P_{T|X}: Y (−− X (−− T, I_g(X;T) ≥ r} I_f(Y;T).  (65)

In light of the discussion in Section III-A, the optimization problems in IB^{(f,g)} and PF^{(f,g)} can be analytically solved by determining the upper concave and lower convex envelopes of the mapping

Q_X ↦ F^{(f,g)}_β(Q_X) := D_f(Q_Y‖P_Y) − βD_g(Q_X‖P_X),  (66)

where β ≥ 0 is the Lagrange multiplier and Q_Y = Q_X ⊗ P_{Y|X}.

⁴ Note that Arimoto's mutual information was defined in Definition 2 for α > 1. However, it was similarly defined in [97] for α ∈ (0, 1).


Consider the function f_α(t) = (t^α − 1)/(α − 1) with α ∈ (0,1) ∪ (1,∞). The corresponding f-divergence is sometimes called the Hellinger divergence of order α; see, e.g., [101]. Note that the Hellinger divergence of order 2 reduces to the χ²-divergence. Calmon et al. [63] and Asoodeh et al. [62] showed that if I_{f₂}(Y;T) ≤ ε for some ε ∈ (0,1), then the minimum mean-squared error (MMSE) of reconstructing any zero-mean unit-variance function of Y given T is lower bounded by 1 − ε; i.e., no function of Y can be reconstructed with small MMSE given an observation of T. This result serves as a natural justification for I_{f₂} as an operational measure of both privacy and utility in a bottleneck problem.

Unfortunately, our approach described in Section III-A cannot be used to compute IB^{(f₂,f₂)} or PF^{(f₂,f₂)} in the binary symmetric case. The difficulty lies in the fact that the function F^{(f₂,f₂)}_β, defined in (66), is, in the binary symmetric case, either convex or concave on its entire domain depending on the value of β. Nevertheless, one can consider the Hellinger divergence of order α ≠ 2 and then apply our approach to compute IB^{(f_α,f_α)} or PF^{(f_α,f_α)}. Since D_{f₂}(P‖Q) ≤ (1 + (α−1)D_{f_α}(P‖Q))^{1/(α−1)} − 1 (see [102, Corollary 5.6]), one can justify I_{f_α} as a measure of privacy and utility in a similar way as I_{f₂}.
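The relation between D_{f₂} and D_{f_α} quoted above can be sanity-checked numerically; the sketch below (helper names ours) evaluates both Hellinger divergences for a pair of binary distributions and verifies the inequality for α = 3:

```python
import numpy as np

def hellinger_div(P, Q, alpha):
    """Hellinger divergence of order alpha: D_{f_alpha}(P||Q) with f_alpha(t) = (t^alpha - 1)/(alpha - 1)."""
    return (np.sum(P ** alpha * Q ** (1 - alpha)) - 1) / (alpha - 1)

P = np.array([0.7, 0.3])
Q = np.array([0.5, 0.5])

alpha = 3.0
d2 = hellinger_div(P, Q, 2.0)                           # chi^2-divergence
d_alpha = hellinger_div(P, Q, alpha)
bound = (1 + (alpha - 1) * d_alpha) ** (1 / (alpha - 1)) - 1   # upper bound on d2
```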

We end this section with a remark about estimating the measures studied in this section. While we consider the information-theoretic regime where the underlying distribution P_{XY} is known, in practice only samples (x_i, y_i) are given. Consequently, the de facto guarantees of bottleneck problems might be considerably different from those shown in this work. It is therefore essential to assess the guarantees of bottleneck problems when accessing only samples. To do so, one must derive bounds on the discrepancy between P_c, I_α, and I_f computed on the empirical distribution and on the true (unknown) distribution. These bounds can then be used to shed light on the de facto guarantees of the bottleneck problems. Relying on [30, Theorem 1], one can show that the gaps between the measures P_c, I_α, and I_f computed on the empirical distribution and on the true one scale as O(1/√n), where n is the number of samples. This is in contrast with mutual information, for which the analogous upper bound scales as O(log n/√n), as shown in [29]. Therefore, the above measures appear to be easier to estimate than mutual information.

IV. SUMMARY AND CONCLUDING REMARKS

Following the recent surge in the use of the information bottleneck (IB) and privacy funnel (PF) in developing and analyzing machine learning models, we investigated the functional properties of these two optimization problems. Specifically, we showed that IB and PF correspond to the upper and lower boundaries of the two-dimensional convex set M = {(I(X;T), I(Y;T)) : Y (−− X (−− T}, where (X,Y) ∼ P_{XY} represents the observable data X and the target feature Y, and the auxiliary random variable T varies over all choices satisfying the Markov relation Y (−− X (−− T. This unifying perspective on IB and PF allowed us to adapt the classical technique of Witsenhausen and Wyner [4], devised for computing IB, to PF as well. We illustrated this by deriving a closed-form expression for PF in the binary case, a result reminiscent of Mrs. Gerber's Lemma [3] in the information theory literature. We then showed that both IB and PF are closely related to several information-theoretic coding problems, such as noisy source coding, hypothesis testing against independence, and dependence dilution. While these connections were partially known in previous work (see, e.g., [83], [84]), we showed that they lead to an improvement on the cardinality bound of T for computing IB. We then turned our attention to the continuous setting where X and Y are continuous random variables. Solving the optimization problems in IB and PF in this case without further assumptions seems a difficult challenge in general and leads to theoretical results only when (X,Y) is jointly Gaussian. Invoking recent results on the entropy power inequality [23] and the strong data processing inequality [25], we obtained tight bounds on IB in two different cases: (1) when Y is a Gaussian perturbation of X, and (2) when X is a Gaussian perturbation of Y. We also utilized the celebrated I-MMSE relationship [103] to derive a second-order approximation of PF when T is taken to be a Gaussian perturbation of X.

In the second part of the paper, we argued that the choice of (Shannon's) mutual information in both IB and PF does not seem to carry a specific operational significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [2] that can be solved iteratively (albeit without any convergence guarantee). In fact, this property is unique to mutual information among existing information measures [95]. Nevertheless, we argued that other information measures might lead to better interpretable guarantees for both IB and PF. For instance, statistical accuracy in IB and privacy leakage in PF can be shown to be precisely characterized by the probability of correctly guessing (aka Bayes risk) or the minimum mean-squared error (MMSE). Following this observation, we introduced a large family of optimization problems, which we call bottleneck problems, by replacing mutual information in IB and PF with Arimoto's mutual information [21] or f-information [22]. Invoking results from [29], [30], we also demonstrated that these information measures are in general easier to estimate from data than mutual information. Similar to IB and PF, the bottleneck problems were shown to be fully characterized by the boundaries of a two-dimensional convex set parameterized by two real-valued non-negative functions Φ and Ψ. This perspective enabled us to generalize the technique used to compute IB and PF to the evaluation of bottleneck problems. Applying this technique to the binary case, we derived closed-form expressions for several bottleneck problems.

REFERENCES

[1] H. Hsu, S. Asoodeh, S. Salamatian, and F. P. Calmon, “Generalizing bottleneck problems,” in 2018 IEEE International Symposium on InformationTheory (ISIT), 2018, pp. 531–535.


[2] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. of IEEE 37th Annual Allerton Conference on Communications,Control, and Computing, 2000, pp. 368–377.

[3] A. Wyner and J. Ziv, “A theorem on the entropy of certain binary sequences and applications: Part I,” IEEE Trans. Inf. Theory, vol. 19, no. 6, pp.769–772, Nov. 1973.

[4] H. Witsenhausen and A. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Trans. Inf. Theory, vol. 21, no. 5, pp.493–501, Sep. 1975.

[5] R. Ahlswede and J. Körner, “On the connection between the entropies of input and output distributions of discrete memoryless channels.” Proc. ofthe Fifth Conference on Probability Theory, 1974, Brasov, Romania, The Centra of Mathematical Statistics of the Ministry of Education 1977.

[6] A. Wyner, “A theorem on the entropy of certain binary sequences and applications–ii,” IEEE Transactions on Information Theory, vol. 19, no. 6, pp.772–777, 1973.

[7] Y. H. Kim and A. El Gamal, Network Information Theory. Cambridge University Press, Cambridge, 2012.[8] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proc. of the 23rd annual international

ACM SIGIR conference on Research and development in information retrieval. ACM, 2000, pp. 208–215.[9] S. Still and W. Bialek, “How many clusters? an information-theoretic perspective,” Neural Comput., vol. 16, no. 12, pp. 2483–2506, Dec. 2004.

[10] N. Slonim and N. Tishby, “Agglomerative information bottleneck,” in Proceedings of the 12th International Conference on Neural Information ProcessingSystems, ser. NIPS’99. MIT Press, 1999, pp. 617–623.

[11] J. Cardinal, “Compression of side information,” in 2003 International Conference on Multimedia and Expo. ICME ’03. Proceedings (Cat. No.03TH8698),vol. 2, 2003, pp. II–569.

[12] G. Zeitler, R. Koetter, G. Bauch, and J. Widmer, “Design of network coding functions in multihop relay networks,” in 2008 5th International Symposiumon Turbo Codes and Related Topics, 2008, pp. 249–254.

[13] A. Makhdoumi, S. Salamatian, N. Fawaz, and M. Médard, “From the information bottleneck to the privacy funnel,” 2014 IEEE Information TheoryWorkshop (ITW 2014), pp. 501–505, 2014.

[14] F. P. Calmon, A. Makhdoumi, and M. Médard, “Fundamental limits of perfect privacy,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2015, pp. 1796–1800.[15] S. Asoodeh, F. Alajaji, and T. Linder, “Notes on information-theoretic privacy,” in Proc. 52nd Annual Allerton Conference on Communication, Control,

and Computing, Sep. 2014, pp. 1272–1278.[16] N. Ding and P. Sadeghi, “A submodularity-based clustering algorithm for the information bottleneck and privacy funnel,” in 2019 IEEE Information

Theory Workshop (ITW), 2019, pp. 1–5.[17] M. Bertran, N. Martinez, A. Papadaki, Q. Qiu, M. Rodrigues, G. Reeves, and G. Sapiro, “Adversarially learned representations for information

obfuscation and inference,” ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach,California, USA: PMLR, 09–15 Jun 2019, pp. 614–623. [Online]. Available: http://proceedings.mlr.press/v97/bertran19a.html

[18] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE Trans. Inf. Theory, vol. 8, no. 5, pp. 293–304, Sep. 1962.[19] R. Ahlswede and I. Csiszar, “Hypothesis testing with communication constraints,” IEEE Trans. Inf. Theory, vol. 32, no. 4, pp. 533–542, July 1986.[20] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, “Information extraction under privacy constraints,” Information, vol. 7, 2016. [Online]. Available:

http://www.mdpi.com/2078-2489/7/1/15[21] S. Arimoto, “Information measures and capacity of order α for discrete memoryless channels,” in Topics in Information Theory, Coll. Math. Soc. J.

Bolyai (I. Csiszár and P. Elias Eds.), vol. 16. North-Holland, Amsterdam, 1977, pp. 41–52.[22] M. Raginsky, “Strong data processing inequalities and φ-Sobolev inequalities for discrete channels,” IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3355–

3389, June 2016.[23] T. A. Courtade, “Strengthening the entropy power inequality,” in IEEE International Symposium on Information Theory (ISIT), 2016, pp. 2294–2298.[24] A. Globerson and N. Tishby, “On the optimality of the gaussian information bottleneck curve,” Hebrew University Technical Report, Tech. Rep., 2004.[25] F. P. Calmon, Y. Polyanskiy, and Y. Wu, “Strong data processing inequalities in power-constrained gaussian channels,” in Proc. of IEEE International

Symposium on Information Theory (ISIT). IEEE, 2015, pp. 2558–2562.[26] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 3, pp. 441–451, 1959.[27] D. Strouse and D. J. Schwab, “The deterministic information bottleneck,” Neural Computation, vol. 29, no. 6, pp. 1611–1630, 2017.[28] A. Bhatt, B. Nazer, O. Ordentlich, and Y. Polyanskiy, “Information-distilling quantizers,” CoRR, vol. abs/1812.03031, 2018. [Online]. Available:

http://arxiv.org/abs/1812.03031[29] O. Shamir, S. Sabato, and N. Tishby, “Learning and generalization with the information bottleneck,” Theor. Comput. Sci., vol. 411, no. 29-30, pp.

2696–2711, June 2010.[30] M. Diaz, H. Wang, F. P. Calmon, and L. Sankar, “On the robustness of information-theoretic privacy measures and mechanisms,” IEEE Transactions

on Information Theory, vol. 66, no. 4, pp. 1949–1978, 2020.[31] N. Slonim and N. Tishby, “Document clustering using word clusters via the information bottleneck method,” in Proceedings of the 23rd Annual

International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’00. Association for Computing Machinery,2000, pp. 208–215.

[32] R. El-Yaniv and O. Souroujon, “Iterative double clustering for unsupervised and semi-supervised learning,” in Proceedings of the 12th EuropeanConference on Machine Learning. Berlin, Heidelberg: Springer-Verlag, 2001, pp. 121–132.

[33] G. Elidan and N. Friedman, “Learning hidden variable networks: The information bottleneck approach,” J. Mach. Learn. Res., vol. 6, pp. 81–127, Dec.2005.

[34] D. Strouse and D. J. Schwab, “Geometric Clustering with the Information Bottleneck,” Neural Computation, vol. 31, no. 3, pp. 596–612, 2019, pMID:30314426.

[35] F. Cicalese, L. Gargano, and U. Vaccaro, “Bounds on the entropy of a function of a random variable and their applications,” IEEE Transactions onInformation Theory, vol. 64, no. 4, pp. 2220–2230, 2018.

[36] T. Koch and A. Lapidoth, “At low snr, asymmetric quantizers are better,” IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5421–5445,2013.

[37] R. Pedarsani, S. H. Hassani, I. Tal, and E. Telatar, “On the construction of polar codes,” in 2011 IEEE International Symposium on Information TheoryProceedings, 2011, pp. 11–15.

[38] I. Tal, A. Sharov, and A. Vardy, “Constructing polar codes for non-binary alphabets and macs,” in 2012 IEEE International Symposium on InformationTheory Proceedings, 2012, pp. 2132–2136.

[39] A. Kartowsky and I. Tal, “Greedy-merge degrading has optimal power-law,” IEEE Transactions on Information Theory, vol. 65, no. 2, pp. 917–934,2019.

[40] A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding, 1st ed. USA: McGraw-Hill, Inc., 1979.

[41] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Proc. of IEEE Information Theory Workshop (ITW), 2015, pp. 1–5.

[42] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” vol. abs/1703.00810, 2017. [Online]. Available: http://arxiv.org/abs/1703.00810

[43] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=ry_WPG-A-

Page 23: 1 Bottleneck Problems: Information and Estimation-Theoretic View · 2020. 10. 16. · technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms

23

[44] Z. Goldfeld, E. Van Den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, “Estimating information flow in deep neural networks,” ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 2299–2308. [Online]. Available: http://proceedings.mlr.press/v97/goldfeld19a.html

[45] R. A. Amjad and B. C. Geiger, “Learning representations for neural network-based classification using the information bottleneck principle,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 9, pp. 2225–2239, 2020.

[46] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410, 2016.

[47] A. Kolchinsky, B. D. Tracey, and D. H. Wolpert, “Nonlinear information bottleneck,” arXiv preprint arXiv:1705.02436, 2017.

[48] A. Kolchinsky, B. D. Tracey, and S. V. Kuyk, “Caveats for information bottleneck in deterministic scenarios,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rke4HiAcY7

[49] M. Chalk, O. Marre, and G. Tkacik, “Relevant sparse codes with variational information bottleneck,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, pp. 1965–1973.

[50] K. Wickstrøm, S. Løkse, M. Kampffmeyer, S. Yu, J. Principe, and R. Jenssen, “Information plane analysis of deep neural networks via matrix-based Rényi’s entropy and tensor kernels,” 2020. [Online]. Available: https://openreview.net/forum?id=B1l0wp4tvr

[51] V. Matias, P. Piantanida, and L. Rey Vega, “The role of the information bottleneck in representation learning,” in IEEE International Symposium on Information Theory (ISIT 2018), Vail, United States, Jun. 2018. [Online]. Available: https://hal-centralesupelec.archives-ouvertes.fr/hal-01756003

[52] A. Alemi, I. Fischer, and J. Dillon, Eds., Uncertainty in the Variational Information Bottleneck, 2018. [Online]. Available: https://arxiv.org/abs/1807.00906

[53] S. Yu, R. Jenssen, and J. Príncipe, “Understanding convolutional neural network training with information theory,” ArXiv, vol. abs/1804.06537, 2018.

[54] H. Cheng, D. Lian, S. Gao, and Y. Geng, “Evaluating capability of deep neural networks for image classification via information plane,” in ECCV, 2018.

[55] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” in ICLR, 2017.

[56] I. Issa, A. B. Wagner, and S. Kamath, “An operational approach to information leakage,” IEEE Transactions on Information Theory, vol. 66, no. 3, pp. 1625–1657, 2020.

[57] M. Cvitkovic and G. Koliander, “Minimal achievable sufficient statistic learning,” ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, June 2019, pp. 1465–1474. [Online]. Available: http://proceedings.mlr.press/v97/cvitkovic19a.html

[58] S. Asoodeh, F. Alajaji, and T. Linder, “On maximal correlation, mutual information and data privacy,” in Proc. IEEE 14th Canadian Workshop on Inf. Theory (CWIT), June 2015, pp. 27–31.

[59] A. Makhdoumi and N. Fawaz, “Privacy-utility tradeoff under statistical uncertainty,” in Proc. 51st Allerton Conference on Communication, Control, and Computing, Oct 2013, pp. 1627–1634.

[60] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, “Estimation efficiency under privacy constraints,” IEEE Transactions on Information Theory, vol. 65, no. 3, pp. 1512–1534, 2019.

[61] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, “Privacy-aware guessing efficiency,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), June 2017.

[62] S. Asoodeh, F. Alajaji, and T. Linder, “Privacy-aware MMSE estimation,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), July 2016, pp. 1989–1993.

[63] F. d. P. Calmon, A. Makhdoumi, M. Médard, M. Varia, M. Christiansen, and K. R. Duffy, “Principal inertia components and applications,” IEEE Transactions on Information Theory, vol. 63, no. 8, pp. 5011–5038, 2017.

[64] H. Wang, L. Vo, F. P. Calmon, M. Médard, K. R. Duffy, and M. Varia, “Privacy with estimation guarantees,” IEEE Transactions on Information Theory, vol. 65, no. 12, pp. 8025–8042, 2019.

[65] S. Asoodeh, “Information and estimation theoretic approaches to data privacy,” Ph.D. dissertation, Queen’s University, May 2017.

[66] J. Liao, O. Kosut, L. Sankar, and F. du Pin Calmon, “Tunable measures for information leakage and applications to privacy-utility tradeoffs,” IEEE Transactions on Information Theory, vol. 65, no. 12, pp. 8043–8066, 2019.

[67] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Privacy aware learning,” Journal of the Association for Computing Machinery (ACM), vol. 61, no. 6, Dec. 2014.

[68] H. S. Witsenhausen, “On sequences of pairs of dependent random variables,” SIAM Journal on Applied Mathematics, vol. 28, no. 2, pp. 100–113, 1975.

[69] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker, “On variational bounds of mutual information,” ser. Proceedings of Machine Learning Research, vol. 97, 09–15 Jun 2019, pp. 5171–5180. [Online]. Available: http://proceedings.mlr.press/v97/poole19a.html

[70] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80, 2018, pp. 531–540. [Online]. Available: http://proceedings.mlr.press/v80/belghazi18a.html

[71] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” 2018. [Online]. Available: http://arxiv.org/abs/1807.03748

[72] J. Song and S. Ermon, “Understanding the limitations of variational mutual information estimators,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=B1x62TNtDS

[73] D. McAllester and K. Stratos, “Formal limitations on the measurement of mutual information,” ser. Proceedings of Machine Learning Research, S. Chiappa and R. Calandra, Eds., vol. 108, Online, 26–28 Aug 2020, pp. 875–884. [Online]. Available: http://proceedings.mlr.press/v108/mcallester20a.html

[74] B. Rassouli and D. Gunduz, “On perfect privacy,” in 2018 IEEE International Symposium on Information Theory (ISIT), 2018, pp. 2551–2555.

[75] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.

[76] H. Kim, W. Gao, S. Kannan, S. Oh, and P. Viswanath, “Discovering potential correlations via hypercontractivity,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4577–4587. [Online]. Available: http://papers.nips.cc/paper/7044-discovering-potential-correlations-via-hypercontractivity.pdf

[77] R. Ahlswede and P. Gács, “Spreading of sets in product spaces and hypercontraction of the Markov operator,” The Annals of Probability, vol. 4, no. 6, pp. 925–939, 1976.

[78] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover,” Preprint, arXiv:1304.6133v1, 2014.

[79] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan 2016.

[80] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, “Information bottleneck for Gaussian variables,” J. Mach. Learn. Res., vol. 6, pp. 165–188, Dec. 2005.

[81] L. Contento, A. Ern, and R. Vermiglio, “A linear-time approximate convex envelope algorithm using the double Legendre-Fenchel transform with application to phase separation,” Computational Optimization and Applications, vol. 60, no. 1, pp. 231–261, January 2015.

[82] Y. Lucet, “Faster than the fast Legendre transform, the linear-time Legendre transform,” Numerical Algorithms, vol. 16, pp. 171–185, 1997.

[83] Z. Goldfeld and Y. Polyanskiy, “The information bottleneck problem and its applications in machine learning,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 19–38, 2020.

[84] A. Zaidi, I. Estella-Aguerri, and S. Shamai (Shitz), “On the information bottleneck problems: Models, connections, applications and information theoretic views,” Entropy, vol. 22, no. 2, p. 151, Jan 2020.

[85] H. Witsenhausen, “Indirect rate distortion problems,” IEEE Transactions on Information Theory, vol. 26, no. 5, pp. 518–521, 1980.

[86] A. Wyner, “On source coding with side information at the decoder,” IEEE Trans. Inf. Theory, vol. 21, no. 3, pp. 294–300, May 1975.


[87] T. A. Courtade and T. Weissman, “Multiterminal source coding under logarithmic loss,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 740–761, 2014.

[88] C. T. Li and A. El Gamal, “Extended Gray-Wyner system with complementary causal side information,” IEEE Transactions on Information Theory, vol. 64, no. 8, pp. 5862–5878, 2018.

[89] M. Vera, L. Rey Vega, and P. Piantanida, “Collaborative information bottleneck,” IEEE Transactions on Information Theory, vol. 65, no. 2, pp. 787–815, 2019.

[90] R. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic tradeoff between complexity and accuracy,” in Learning Theory and Kernel Machines. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 595–609.

[91] G. Pichler and G. Koliander, “Information bottleneck on general alphabets,” in 2018 IEEE Int. Sym. Inf. Theory (ISIT), June 2018, pp. 526–530.

[92] Y. H. Kim, A. Sutivong, and T. Cover, “State amplification,” IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 1850–1859, April 2008.

[93] N. Merhav and S. Shamai, “Information rates subject to state masking,” IEEE Trans. Inf. Theory, vol. 53, no. 6, pp. 2254–2261, June 2007.

[94] H. Witsenhausen, “Some aspects of convexity useful in information theory,” IEEE Trans. Inf. Theory, vol. 26, no. 3, pp. 265–271, May 1980.

[95] P. Harremoës and N. Tishby, “The information bottleneck revisited or how to choose a good distortion measure,” in Proc. of IEEE International Symposium on Information Theory (ISIT), 2007, pp. 566–570.

[96] C. Hirche and A. Winter, “An alphabet size bound for the information bottleneck function,” in Proc. of IEEE International Symposium on Information Theory (ISIT), 2020.

[97] F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, 2006.

[98] S. Verdú, “α-mutual information,” in Proc. Information Theory and Applications Workshop (ITA), Feb. 2015, pp. 1–6.

[99] S. Fehr and S. Berens, “On the conditional Rényi entropy,” IEEE Trans. Inf. Theory, vol. 60, no. 11, pp. 6801–6810, Nov. 2014.

[100] I. Csiszár, “Information-type measures of difference of probability distributions and indirect observation,” Studia Scientiarum Mathematicarum Hungarica, no. 2, pp. 229–318, 1967.

[101] I. Sason and S. Verdú, “f-divergence inequalities,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.

[102] A. Guntuboyina, S. Saha, and G. Schiebinger, “Sharp inequalities for f-divergences,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 104–121, 2014.

[103] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005.

[104] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1997.

[105] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.

[106] T. Linder and R. Zamir, “On the asymptotic tightness of the Shannon lower bound,” IEEE Trans. Inf. Theory, vol. 40, no. 6, pp. 2026–2031, Nov. 1994.

[107] D. Guo, Y. Wu, S. S. Shitz, and S. Verdú, “Estimation in Gaussian noise: Properties of the minimum mean-square error,” IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2371–2385, 2011.

[108] S. Jana, “Alphabet sizes of auxiliary random variables in canonical inner bounds,” in Proc. 43rd Annual Conf. on Information Sciences and Systems, March 2009, pp. 67–71.

APPENDIX A
PROOFS FROM SECTION II

Proof of Theorem 1. • Note that R = 0 in the optimization problem (4) implies that X and T are independent. Since Y, X, and T form the Markov chain Y (−− X (−− T, independence of X and T implies independence of Y and T, and thus I(Y;T) = 0. Similarly for PF(0).

• Since I(X;T ) ≤ H(X) for any random variable T , we have T = X satisfies the information constraint I(X;T ) ≤ Rfor R ≥ H(X). Since I(Y ;T ) ≤ I(Y ;X), this choice is optimal. Similarly for PF, the constraint I(X;T ) ≥ r forr ≥ H(X) implies T = X . Hence, PF(r) = I(Y ;X).

• The upper bound on IB follows from the data processing inequality: I(Y;T) ≤ min{I(X;T), I(X;Y)} for all T satisfying the Markov condition Y (−− X (−− T.

• To prove the lower bound on PF, note that

I(Y ;T ) = I(X;T )− I(X;T |Y ) ≥ I(X;T )−H(X|Y ).

• The concavity of R 7→ IB(R) follows from the fact that it is the upper boundary of the convex set M defined in (6). This in turn implies the continuity of IB(·). Monotonicity of R 7→ IB(R) follows from the definition. Strict monotonicity follows from the concavity and the fact that IB(H(X)) = I(X;Y).

• A similar argument applies to PF.
• The differentiability of the map R 7→ IB(R) follows from [90, Lemma 6]. This result in fact implies the differentiability of the map r 7→ PF(r) as well. Continuity of the derivatives of IB and PF on (0, H(X)) is a straightforward application of [104, Theorem 25.5].

• Monotonicity of the mappings R 7→ IB(R)/R and r 7→ PF(r)/r follows from the concavity and convexity of IB(·) and PF(·), respectively.

• Strict monotonicity of IB(·) and PF(·) implies that the optima in (4) and (5) are attained when the inequality constraints are met with equality.

Proof of Theorem 3. Recall that, according to Theorem 1, the mappings R 7→ IB(R) and r 7→ PF(r) are concave and convex, respectively. This implies that IB(R) (resp. PF(r)) lies above (resp. below) the chord connecting (0, 0) and (H(X), I(X;Y)). This proves the lower bound IB(R) ≥ R I(X;Y)/H(X) (resp. the upper bound PF(r) ≤ r I(X;Y)/H(X)).


In light of the convexity of PF and the monotonicity of r 7→ PF(r)/r, we can write

PF(r)/r ≥ lim_{r→0+} PF(r)/r = inf_{r≠0} PF(r)/r = inf_{PT|X : Y(−−X(−−T, I(X;T)≠0} I(Y;T)/I(X;T) = inf_{P(X)∋QX≠PX} DKL(QY‖PY)/DKL(QX‖PX),

where the last equality is due to [14, Lemma 4] and QY is the output distribution of the channel PY|X when the input is distributed according to QX. Similarly, we can write

IB(R)/R ≤ lim_{R→0+} IB(R)/R = sup_{R≠0} IB(R)/R = sup_{PT|X : Y(−−X(−−T, I(X;T)≠0} I(Y;T)/I(X;T) = sup_{P(X)∋QX≠PX} DKL(QY‖PY)/DKL(QX‖PX),

where the last equality is due to [78, Theorem 4].

Proof of Theorem 5. Let Tn be an optimal summary of Xn, that is, Tn satisfies Tn (−− Xn (−− Yn and I(Xn;Tn) = nR. We can write

I(Xn;Tn) = H(Xn) − H(Xn|Tn) = Σ_{k=1}^{n} [H(Xk) − H(Xk|X^{k−1}, Tn)] = Σ_{k=1}^{n} I(Xk; X^{k−1}, Tn),

and hence, if Rk := I(Xk; X^{k−1}, Tn), then we have

R = (1/n) Σ_{k=1}^{n} Rk.   (67)

We can similarly write

I(Yn;Tn) = H(Yn) − H(Yn|Tn) = Σ_{k=1}^{n} [H(Yk) − H(Yk|Y^{k−1}, Tn)]
 ≤ Σ_{k=1}^{n} [H(Yk) − H(Yk|Y^{k−1}, X^{k−1}, Tn)]
 = Σ_{k=1}^{n} [H(Yk) − H(Yk|X^{k−1}, Tn)]
 = Σ_{k=1}^{n} I(Yk; X^{k−1}, Tn).

Since we have (Tn, Xk−1) (−− Xk (−− Yk for every k ∈ [n], we conclude from the above inequality that

I(Yn;Tn) ≤ Σ_{k=1}^{n} I(Yk; X^{k−1}, Tn) ≤ Σ_{k=1}^{n} IB(PXY, Rk) ≤ n IB(PXY, R),   (68)

where the last inequality follows from concavity of the map x 7→ IB(PXY , x) and (67). Consequently, we obtain

IB(PXnY n , nR) ≤ nIB(PXY , R). (69)

To prove the other direction, let PT|X be an optimal channel in the definition of IB, i.e., I(X;T) = R and IB(PXY, R) = I(Y;T). Then, using this channel n times, once for each pair (Xi, Yi), we obtain Tn = (T1, . . . , Tn) satisfying Tn (−− Xn (−− Yn. Since I(Xn;Tn) = nI(X;T) = nR and I(Yn;Tn) = nI(Y;T), we have IB(PXnYn, nR) ≥ nIB(PXY, R). This, together with (69), concludes the proof.

Proof of Theorem 4. First notice that

lim_{R→0} IB(R)/R = sup_{R>0} IB(R)/R = sup_{PT|X : Y(−−X(−−T} I(Y;T)/I(X;T) = sup_{QX∈P(X), QX≠PX} DKL(QY‖PY)/DKL(QX‖PX),

where the last equality is due to [78, Theorem 4]. Similarly,

lim_{r→0} PF(r)/r = inf_{r>0} PF(r)/r = inf_{PT|X : Y(−−X(−−T} I(Y;T)/I(X;T) = inf_{QX∈P(X), QX≠PX} DKL(QY‖PY)/DKL(QX‖PX),

where the last equality is due to [14, Lemma 4].

Fix x0 ∈ X with PX(x0) > 0 and let T be a Bernoulli random variable specified by the following channel:

PT|X(1|x) = δ 1{x = x0},


for some δ > 0. This channel induces T ∼ Bernoulli(δPX(x0)), PY|T(y|1) = PY|X(y|x0), and

PY|T(y|0) = [PY(y) − δPXY(x0, y)] / [1 − δPX(x0)].

It can be verified that

I(X;T) = −δPX(x0) log PX(x0) + o(δ),

and

I(Y;T) = δPX(x0) DKL(PY|X(·|x0)‖PY(·)) + o(δ).

Setting

δ = r / (−PX(x0) log PX(x0)),

we obtain

I(Y;T) = [DKL(PY|X(·|x0)‖PY(·)) / (−log PX(x0))] r + o(r),

and hence

PF(r) ≤ [DKL(PY|X(·|x0)‖PY(·)) / (−log PX(x0))] r + o(r).

Since x0 is arbitrary, the result follows. The proof for IB follows similarly.
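The two o(δ) expansions above can be spot-checked numerically. The sketch below (a toy joint distribution and helper names of our own, not from the paper) builds the perturbation channel PT|X(1|x) = δ1{x = x0} for a small δ and compares the exact mutual informations against the first-order terms:

```python
import numpy as np

def mi(pj):
    # Mutual information (in nats) of a joint pmf given as a 2-D array.
    pa = pj.sum(axis=1, keepdims=True)
    pb = pj.sum(axis=0, keepdims=True)
    mask = pj > 0
    return float((pj[mask] * np.log(pj[mask] / (pa @ pb)[mask])).sum())

# Toy joint distribution P_XY (rows: x, columns: y) -- our own choice.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.10, 0.20]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
x0, delta = 0, 1e-4

# Channel P_{T|X}(1|x) = delta * 1{x = x0}; build the joints P_XT and P_YT.
p_t1 = np.where(np.arange(len(px)) == x0, delta, 0.0)
pxt = np.stack([px * (1 - p_t1), px * p_t1], axis=1)
pyt = np.stack([pxy.T @ (1 - p_t1), pxy.T @ p_t1], axis=1)

# First-order approximations from the proof.
dkl_x0 = float((pxy[x0] / px[x0] * np.log(pxy[x0] / px[x0] / py)).sum())
ixt_approx = -delta * px[x0] * np.log(px[x0])
iyt_approx = delta * px[x0] * dkl_x0

assert abs(mi(pxt) - ixt_approx) < 1e-6
assert abs(mi(pyt) - iyt_approx) < 1e-6
```

The residual is O(δ²) here, consistent with the o(δ) terms in the proof.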

Proof of Lemma 1. When Y is an erasure of X, i.e., Y takes values in X ∪ {⊥} with PY|X(x|x) = 1 − δ and PY|X(⊥|x) = δ, it is straightforward to verify that DKL(QY‖PY) = (1 − δ)DKL(QX‖PX) for every PX and QX in P(X). Consequently, we have

inf_{QX≠PX} DKL(QY‖PY)/DKL(QX‖PX) = sup_{QX≠PX} DKL(QY‖PY)/DKL(QX‖PX) = 1 − δ.

Hence, Theorem 3 gives the desired result.

To prove the second part, i.e., when X is an erasure of Y, we need an improved upper bound on PF. Notice that if perfect privacy occurs for a given PXY, then the upper bound for PF(r) in Theorem 3 can be improved:

PF(r) ≤ (r − r0) I(X;Y) / (H(X) − r0),   (70)

where r0 is the largest r ≥ 0 such that PF(r) = 0. Here, we show that r0 = H(X|Y). This suffices to prove the result since (70), together with Theorem 1, gives

max{r − H(X|Y), 0} ≤ PF(r) ≤ (r − r0) I(X;Y) / (H(X) − r0) = r − H(X|Y).

To show that PF(H(X|Y )) = 0, consider the channel PT |X(t|x) = 1|Y|1{t 6=⊥,x 6=⊥} and PT |X(⊥ | ⊥) = 1. It can be verified

that this channel induces T which is independent of Y and that

I(X;T ) = H(T )−H(T |X) = H

(1− δ|Y|

, . . . ,1− δ|Y|

, δ

)− (1− δ) log |Y| = hb(δ) = H(X|Y ),

where hb(δ) := −δ log δ − (1− δ) log(1− δ) is the binary entropy function.
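The key identity behind the first part — DKL(QY‖PY) = (1 − δ)DKL(QX‖PX) when Y is an erasure of X — is easy to confirm numerically; a minimal sketch (the distributions and function names are our own):

```python
import numpy as np

def kl(q, p):
    mask = q > 0
    return float((q[mask] * np.log(q[mask] / p[mask])).sum())

delta = 0.3                      # erasure probability
px = np.array([0.5, 0.3, 0.2])   # reference input distribution P_X
qx = np.array([0.2, 0.2, 0.6])   # an arbitrary alternative Q_X

def erase(p, d):
    # Output pmf of the erasure channel: X's symbols scaled by 1-d, plus mass d on the erasure symbol.
    return np.concatenate([(1 - d) * p, [d]])

assert abs(kl(erase(qx, delta), erase(px, delta)) - (1 - delta) * kl(qx, px)) < 1e-12
```

The identity holds exactly: the erasure symbol contributes zero to the divergence, and every other term is scaled by 1 − δ.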

Proof of Lemma 4. As mentioned earlier, (24) was proved in [3]. We thus give a proof only for (25).

Consider the problem of minimizing the Lagrangian LPF(β) in (20) for β ≥ βPF. Let X′ ∼ QX = Bernoulli(q) for some q ∈ (0, 1) and let Y′ be the result of passing X′ through BSC(δ), i.e., Y′ ∼ Bernoulli(q ∗ δ). Recall that Fβ(q) := Fβ(QX) = hb(q ∗ δ) − βhb(q). It suffices to compute K∩[Fβ(q)], the upper concave envelope of q 7→ Fβ(q). It can be verified that βIB ≤ (1 − 2δ)² and hence for all β ≥ (1 − 2δ)², K∩[Fβ(q)] = Fβ(0). A straightforward computation shows that Fβ(q) is symmetric around q = 1/2 and is also concave in a region around q = 1/2, where it reaches a local maximum. Hence, if β is such that
• Fβ(1/2) < Fβ(0) (see Fig. 4a), then K∩[Fβ(q)] is given by the convex combination of Fβ(0) and Fβ(1).
• Fβ(1/2) = Fβ(0) (see Fig. 4b), then K∩[Fβ(q)] is given by the convex combination of Fβ(0), Fβ(1/2), and Fβ(1).
• Fβ(1/2) > Fβ(0) (see Fig. 4c), then there exists qβ ∈ [0, 1/2] such that for q ≤ qβ, K∩[Fβ(q)] is given by the convex combination of Fβ(0) and Fβ(qβ).
Hence, assuming p ≤ 1/2, we can construct T∗ that maximizes H(Y|T) − βH(X|T) in three different cases corresponding to the three cases above:
• In the first case, T∗ is binary and we have PX|T∗=0 = Bernoulli(0) and PX|T∗=1 = Bernoulli(1) with PT∗ = Bernoulli(p).
• In the second case, T∗ is ternary and we have PX|T∗=0 = Bernoulli(0), PX|T∗=1 = Bernoulli(1), and PX|T∗=2 = Bernoulli(1/2) with PT∗ = (1 − p − α/2, p − α/2, α) for some α ∈ [0, 2p].
• In the third case, T∗ is again binary and we have PX|T∗=0 = Bernoulli(0) and PX|T∗=1 = Bernoulli(p/α) with PT∗ = Bernoulli(α) for some α ∈ [2p, 1].

Combining these three cases, we obtain the result in (25).
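The case analysis above is driven entirely by the shape of the upper concave envelope of q ↦ Fβ(q). On a fine grid this envelope can be computed with a monotone-chain upper hull; the sketch below (our own helper names; the parameter values are arbitrary) checks that the computed envelope dominates Fβ and touches it at the endpoints:

```python
import numpy as np

def hb(q):
    # Binary entropy in nats, clipped away from 0 and 1 for numerical safety.
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def upper_concave_envelope(x, y):
    # Upper convex hull (monotone chain) of the graph, interpolated back onto the grid.
    hull = [0]
    for i in range(1, len(x)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            # Pop while the turn is counter-clockwise (point i2 lies below chord i1-i).
            if (x[i2] - x[i1]) * (y[i] - y[i1]) - (y[i2] - y[i1]) * (x[i] - x[i1]) >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])

delta, beta = 0.1, 0.5
q = np.linspace(0.0, 1.0, 2001)
F = hb(q * (1 - delta) + (1 - q) * delta) - beta * hb(q)   # F_beta(q) = h_b(q * delta) - beta*h_b(q)
K = upper_concave_envelope(q, F)

assert np.all(K >= F - 1e-9)       # the envelope dominates F_beta
assert abs(K[0] - F[0]) < 1e-9     # and touches it at the endpoints
```

Varying β and δ reproduces the three regimes of the proof: the hull vertices land on {0, 1}, on {0, 1/2, 1}, or on {0, qβ} for some interior qβ.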

Proof of Lemma 2. Let X = Y + σNG where σ > 0 and NG ∼ N(0, 1) is independent of Y. According to the improved entropy power inequality proved in [23, Theorem 1], we can write

e^{2(H(X)−I(Y;T))} ≥ e^{2(H(Y)−I(X;T))} + 2πeσ²,

for any random variable T forming Y (−− X (−− T. This, together with Theorem 5, implies the result.

Proof of Corollary 1. Since (X, Y) are jointly Gaussian, we can write X = Y + σNG where σ = σY √(1 − ρ²)/ρ and σY² is the variance of Y. Applying Lemma 2 and noticing that H(X) = (1/2) log(2πe(σY² + σ²)), we obtain

I(Y;T) ≤ (1/2) log [(σ² + σY²) / (σ² + σY² e^{−2I(X;T)})] = (1/2) log [1 / (1 − ρ² + ρ² e^{−2I(X;T)})],   (71)

for all channels PT|X satisfying Y (−− X (−− T. This bound is attained by a Gaussian PT|X. Specifically, letting T = X + MG where MG ∼ N(0, σ′²) is independent of X with σ′² = σY² e^{−2R} / (ρ²(1 − e^{−2R})) for R ≥ 0, it can be easily verified that I(X;T) = R and I(Y;T) = (1/2) log [1 / (1 − ρ² + ρ² e^{−2R})]. This, together with (71), implies

IB(R) = (1/2) log [1 / (1 − ρ² + ρ² e^{−2R})].
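Since all distributions involved are Gaussian, every mutual information above has a closed form in terms of covariances, so the claimed expressions can be verified exactly. A numerical sketch (the parameter values are our own choices):

```python
import numpy as np

rho, sy = 0.8, 1.3                       # correlation coefficient and std of Y (arbitrary)
sigma2 = sy**2 * (1 - rho**2) / rho**2   # noise variance in X = Y + sigma * N_G
R = 0.7                                  # target rate I(X;T) = R (nats)
s2 = sy**2 * np.exp(-2 * R) / (rho**2 * (1 - np.exp(-2 * R)))  # var of M_G in T = X + M_G

def gauss_mi(cov):
    # I(U;V) = 0.5 * log(var_U * var_V / det(Sigma)) for a 2x2 Gaussian covariance Sigma.
    return 0.5 * np.log(cov[0, 0] * cov[1, 1] / np.linalg.det(cov))

vx = sy**2 + sigma2                                     # var(X)
cov_xt = np.array([[vx, vx], [vx, vx + s2]])            # Cov of (X, T)
cov_yt = np.array([[sy**2, sy**2], [sy**2, vx + s2]])   # Cov of (Y, T)

ib = 0.5 * np.log(1 / (1 - rho**2 + rho**2 * np.exp(-2 * R)))
assert abs(gauss_mi(cov_xt) - R) < 1e-9    # I(X;T) = R with the chosen noise variance
assert abs(gauss_mi(cov_yt) - ib) < 1e-9   # I(Y;T) equals the claimed IB(R)
```

The check confirms that the Gaussian test channel meets the rate constraint with equality and achieves the right-hand side of (71).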

Next, we wish to prove Theorem 6. However, we need the following preliminary lemma before we delve into its proof.

Lemma 10. Let X and Y be continuous, correlated random variables with E[X²] < ∞ and E[Y²] < ∞. Then the mappings σ 7→ I(X;Tσ) and σ 7→ I(Y;Tσ) are continuous, strictly decreasing, and

I(X;Tσ)→ 0, and I(Y ;Tσ)→ 0 as σ →∞.

Proof. The finiteness of E[X²] and E[Y²] implies that H(X) and H(Y) are finite. A straightforward application of the entropy power inequality (cf. [105, Theorem 17.7.3]) implies that H(Tσ) is also finite. Thus, I(X;Tσ) and I(Y;Tσ) are well-defined. According to the data processing inequality, we have I(X;Tσ+δ) < I(X;Tσ) for all δ > 0 and also I(Y;Tσ+δ) ≤ I(Y;Tσ), where equality occurs if and only if X and Y are independent. Since, by assumption, X and Y are correlated, it follows that I(Y;Tσ+δ) < I(Y;Tσ). Thus, both I(X;Tσ) and I(Y;Tσ) are strictly decreasing.

For the proof of continuity, we consider the two cases σ = 0 and σ > 0 separately. We first give the proof for I(X;Tσ). Since H(σNG) = (1/2) log(2πeσ²), we have limσ→0 H(σNG) = −∞ and thus limσ→0 I(X;Tσ) = ∞, which is equal to I(X;T0). For σ > 0, let σn be a sequence of positive numbers converging to σ. In light of de Bruijn’s identity (cf. [105, Theorem 17.7.2]), we have H(Tσn) → H(Tσ), implying the continuity of σ 7→ I(X;Tσ).

Next, we prove the continuity of σ 7→ I(Y;Tσ). For a sequence of positive numbers σn converging to σ > 0, we have I(Y;Tσn) = H(Tσn) − H(Tσn|Y). We only need to show H(Tσn|Y) → H(Tσ|Y). Invoking again de Bruijn’s identity, we obtain H(Tσn|Y = y) → H(Tσ|Y = y) for each y ∈ Y. The desired result follows from the dominated convergence theorem. Finally, the continuity of σ 7→ I(Y;Tσ) at σ = 0 follows from [106, Page 2028], which states that H(Tσn|Y = y) → H(X|Y = y), and then applying the dominated convergence theorem.

Note that

0 ≤ I(Y;Tσ) ≤ I(X;Tσ) ≤ (1/2) log(1 + σX²/σ²),

where σX² is the variance of X and the last inequality follows from the fact that I(X; X + σNG) is maximized when X is Gaussian. Since by assumption σX < ∞, it follows that both I(X;Tσ) and I(Y;Tσ) converge to zero as σ → ∞.
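For a Gaussian input the quantities in Lemma 10 are themselves available in closed form, I(X;Tσ) = (1/2) log(1 + var(X)/σ²), which makes the monotonicity and vanishing claims easy to confirm numerically (a quick sketch with our own parameters):

```python
import numpy as np

P = 1.0                                    # var(X) for a Gaussian X
sigma = np.linspace(0.5, 50.0, 200)
I = 0.5 * np.log(1 + P / sigma**2)         # I(X; T_sigma) in closed form

assert np.all(np.diff(I) < 0)              # strictly decreasing in sigma
assert I[-1] < 1e-3                        # vanishes as sigma grows
```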

In light of this lemma, there exists a unique σ ≥ 0 such that I(X;Tσ) = r. Let σr denote this σ. Therefore, we have PF(r) = I(Y;Tσr). This enables us to prove Theorem 6.

Proof of Theorem 6. The proof relies on the I-MMSE relation from the information theory literature, which we briefly describe here for convenience. Given any pair of random variables U and V, the minimum mean-squared error (MMSE) of estimating U given V is given by

mmse(U|V) := inf_f E[(U − f(V))²] = E[(U − E[U|V])²] = E[var(U|V)],


where the infimum is taken over all measurable functions f and var(U|V) := E[(U − E[U|V])²|V]. Guo et al. [103] proved the following identity, referred to as the I-MMSE formula, relating the input-output mutual information of the additive Gaussian channel Tσ = X + σNG, where NG ∼ N(0, 1) is independent of X, with the MMSE of the input given the output:

d/d(σ²) I(X;Tσ) = −(1/(2σ⁴)) mmse(X|Tσ).   (72)

Since Y, X, and Tσ form the Markov chain Y (−− X (−− Tσ, it follows that I(Y;Tσ) = I(X;Tσ) − I(X;Tσ|Y). Thus, two applications of (72) yield

d/d(σ²) I(Y;Tσ) = −(1/(2σ⁴)) [mmse(X|Tσ) − mmse(X|Tσ, Y)].   (73)

The second derivatives of I(X;Tσ) and I(Y;Tσ) are also known via the formula [107, Proposition 9]:

d/d(σ²) mmse(X|Tσ) = (1/σ⁴) E[var²(X|Tσ)]  and  d/d(σ²) mmse(X|Tσ, Y) = (1/σ⁴) E[var²(X|Tσ, Y)].   (74)

With these results in mind, we now begin the proof. Recall that σr is the unique σ such that I(X;Tσ) = r, thus implying PFG(r) = I(Y;Tσr). We have

d/dr PFG(r) = [d/d(σ²) I(Y;Tσ)]_{σ=σr} · (d/dr) σr².   (75)

To compute the derivative of PFG(r), we therefore need to compute the derivative of σr² with respect to r. To do so, notice that from the identity I(X;Tσr) = r we obtain

1 = (d/dr) I(X;Tσr) = [d/d(σ²) I(X;Tσ)]_{σ=σr} · (d/dr) σr² = −(1/(2σr⁴)) mmse(X|Tσr) (d/dr) σr²,

implying

(d/dr) σr² = −2σr⁴ / mmse(X|Tσr).

Plugging this identity into (75) and invoking (73), we obtain

d/dr PFG(r) = [mmse(X|Tσr) − mmse(X|Tσr, Y)] / mmse(X|Tσr).   (76)

The second derivative can be obtained via (74):

d²/dr² PFG(r) = 2 E[var²(X|Tσr, Y)] / mmse²(X|Tσr) − 2 E[var²(X|Tσr)] mmse(X|Tσr, Y) / mmse³(X|Tσr).

Since σr → ∞ as r → 0, we can write

(d/dr) PFG(r)|_{r=0} = [σX² − E[var(X|Y)]] / σX² = var(E[X|Y]) / σX² = η(X, Y),

where var(E[X|Y]) is the variance of the conditional expectation of X given Y and the last equality comes from the law of total variance, and

(d²/dr²) PFG(r)|_{r=0} = (2/σX⁴) [E[var²(X|Y)] − σX² E[var(X|Y)]].

Taylor expansion of PF(r) around r = 0 gives the result.
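For a Gaussian input, both sides of the I-MMSE formula (72) are available in closed form (mmse(X|Tσ) = Pσ²/(P + σ²) for X ∼ N(0, P)), so the identity can be checked against a finite difference. A sketch with arbitrary parameter values of our own:

```python
import numpy as np

P = 2.0   # var(X) for a Gaussian X; then I and mmse have closed forms

def I(s2):
    return 0.5 * np.log(1 + P / s2)        # I(X; T_sigma) as a function of sigma^2

def mmse(s2):
    return P * s2 / (P + s2)               # mmse(X | T_sigma)

s2, h = 1.5, 1e-6
lhs = (I(s2 + h) - I(s2 - h)) / (2 * h)    # dI/d(sigma^2), central difference
rhs = -mmse(s2) / (2 * s2**2)              # right-hand side of (72)
assert abs(lhs - rhs) < 1e-8
```

Both sides agree to finite-difference accuracy, matching (72) for this input distribution.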

Proof of Theorem 9. The main ingredient of this proof is a result by Jana [108, Lemma 2.2] which provides a tight cardinality bound for the auxiliary random variables in the canonical problems of network information theory (including the noisy source coding problem described in Section II-C). Consider a pair of random variables (X, Y) ∼ PXY and let d : Y × Y → R be an arbitrary distortion measure defined for an arbitrary reconstruction alphabet Y.

Theorem 13 ( [108]). Let A be the set of all pairs (R, D) satisfying

I(X;T) ≤ R and E[d(Y, ψ(T))] ≤ D,

for some mapping ψ : T → Y and some joint distribution PXYT = PXY PT|X. Then every extreme point of A corresponds to some choice of auxiliary variable T with alphabet size |T| ≤ |X|.

Measuring the distortion in the above theorem in terms of the logarithmic loss as in (27), we obtain that

A = {(R,D) ∈ R2+ : R ≥ Rnoisy(D)},


where Rnoisy(D) is given in (29). We observed in Section II-C that IB is fully characterized by the mapping D 7→ Rnoisy(D) and thus by A. In light of Theorem 13, all extreme points of A are achieved by a choice of T with cardinality |T| ≤ |X|. Let {(Ri, Di)} be the set of extreme points of A, each constructed by a channel PTi|X and a mapping ψi. Due to the convexity of A, each point (R, D) ∈ A can be expressed as a convex combination of {(Ri, Di)} with coefficients {λi}; that is, there exist a channel PT|X = Σi λi PTi|X and a mapping ψ(T) = Σi λi ψi(Ti) such that I(X;T) = R and E[d(Y, ψ(T))] = D. This construction, often termed time sharing in the information theory literature, implies that all points in A (including the boundary points) can be achieved with a variable T with |T| ≤ |X|. Since the boundary of A is specified by the mapping R 7→ IB(R), we conclude that IB(R) is achieved by a variable T with cardinality |T| ≤ |X| for every R < H(X).

Proof of Lemma 5. The following proof is inspired by [28, Proposition 1]. Let X = {1, . . . , m}. We sort the elements of X such that

PX(1)DKL(PY |X=1‖PY ) ≥ · · · ≥ PX(m)DKL(PY |X=m‖PY ).

Now consider the function f : X → [M] given by f(x) = x if x < M and f(x) = M if x ≥ M, where M = e^R. Let Z = f(X). We have PZ(i) = PX(i) if i < M and PZ(M) = Σ_{j≥M} PX(j). We can now write

I(Y;Z) = Σ_{i=1}^{M−1} PX(i) D(PY|X=i‖PY) + PZ(M) D(PY|Z=M‖PY)
 ≥ Σ_{i=1}^{M−1} PX(i) D(PY|X=i‖PY)
 ≥ [(M − 1)/|X|] Σ_{i∈X} PX(i) D(PY|X=i‖PY)
 = [(M − 1)/|X|] I(X;Y).

Since f(X) takes values in [M], it follows that H(f(X)) ≤ R. Consequently, we have

dIB(PXY, R) ≥ sup_{f : X→[M]} I(Y; f(X)) ≥ [(M − 1)/|X|] I(X;Y).

For the privacy funnel, the proof proceeds as follows. We sort the elements in X such that

PX(1)DKL(PY |X=1‖PY ) ≤ · · · ≤ PX(m)DKL(PY |X=m‖PY ).

Consider now the function f : X → [M] given by f(x) = x if x < M and f(x) = M if x ≥ M. As before, let Z = f(X). Then we can write

I(Y;Z) = Σ_{i=1}^{M−1} PX(i) D(PY|X=i‖PY) + PZ(M) D(PY|Z=M‖PY)
 ≤ [(M − 1)/|X|] Σ_{i∈X} PX(i) D(PY|X=i‖PY) + PZ(M) D(PY|Z=M‖PY)
 = [(M − 1)/|X|] I(X;Y) + PZ(M) Σ_{y∈Y} PY|Z(y|M) log [PY|Z(y|M)/PY(y)]
 ≤ [(M − 1)/|X|] I(X;Y) + Pr(X ≥ M) Σ_{y∈Y} [Σ_i PY|X(y|i)PX(i)1{i≥M} / Pr(X ≥ M)] log [Σ_i PY|X(y|i)PX(i)1{i≥M} / (Pr(X ≥ M) Σ_i PY|X(y|i)PX(i))]
 ≤ [(M − 1)/|X|] I(X;Y) + Σ_{y∈Y} Σ_{i≥M} PY|X(y|i)PX(i) log [1/Pr(X ≥ M)]
 = [(M − 1)/|X|] I(X;Y) + Pr(X ≥ M) log [1/Pr(X ≥ M)],

where the last inequality is due to the log-sum inequality.
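The merging construction in the first part of the proof is concrete enough to test: sort the symbols by PX(i)D(PY|X=i‖PY), keep the top M − 1, and merge the rest. The sketch below (random joint distribution and helper names of our own) confirms I(Y;Z) ≥ [(M − 1)/|X|] I(X;Y):

```python
import numpy as np

def mi(pj):
    # Mutual information (nats) of a strictly positive joint pmf.
    pa = pj.sum(axis=1, keepdims=True)
    pb = pj.sum(axis=0, keepdims=True)
    return float((pj * np.log(pj / (pa @ pb))).sum())

rng = np.random.default_rng(0)
pxy = rng.random((6, 4))
pxy /= pxy.sum()                       # random strictly positive joint pmf (rows: x)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# Sort symbols by P_X(i) * D(P_{Y|X=i} || P_Y), largest first, as in the proof.
scores = (pxy * np.log(pxy / (px[:, None] * py[None, :]))).sum(axis=1)
order = np.argsort(-scores)

M = 4                                  # f keeps the top M-1 symbols and merges the rest
pzy = np.vstack([pxy[order[:M - 1]], pxy[order[M - 1:]].sum(axis=0)])
assert mi(pzy) >= (M - 1) / pxy.shape[0] * mi(pxy) - 1e-12
```

Since the merged symbol’s divergence term is nonnegative and the kept terms are the M − 1 largest of m, the inequality is guaranteed for any joint distribution.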

Proof of Lemma 6. Employing the same argument as in the proof of [28, Theorem 3], we obtain that there exists a function f : X → [M] such that

I(Y; f(X)) ≥ η I(X;Y)   (77)


for any η ∈ (0, 1) and

M ≤ 4 + [4 / ((1 − η) log 2)] log [2 / ((1 − η)I(X;Y))].

Since h_b^{−1}(x) ≤ x log 2 / log(1/x) for all x ∈ (0, 1], it follows from the above that (noticing that I(X;Y) ≤ α)

M ≤ 4 + [I(X;Y)/(2α)] · [1/h_b^{−1}(ζ)],

where ζ := (1 − η)I(X;Y)/(2α). Rearranging this, we obtain

h_b^{−1}(ζ) ≤ I(X;Y) / (2α(M − 4)).

Assuming M ≥ 5, we have I(X;Y)/(2α(M − 4)) ≤ 1/2 and hence

ζ ≤ hb(I(X;Y) / (2α(M − 4))),

implying

η ≥ 1 − [2α/I(X;Y)] hb(I(X;Y) / (2α(M − 4))).

Plugging this into (77), we obtain

I(Y; f(X)) ≥ I(X;Y) − 2α hb(I(X;Y) / (2α(M − 4))).

As before, if M = e^R, then H(f(X)) ≤ R. Hence,

dIB(PXY, R) ≥ I(X;Y) − 2α hb(I(X;Y) / (2α(e^R − 4))),

for all R ≥ log 5.
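The inverse-binary-entropy bound h_b^{−1}(x) ≤ x log 2 / log(1/x) invoked above can be spot-checked numerically. In the sketch below hb is measured in bits (so that its range is (0, 1], matching the stated domain) while the logs in the bound are natural; this reading of the units is our assumption:

```python
import math

def h2(p):
    # Binary entropy in bits.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def h2_inv(x, tol=1e-12):
    # Inverse of h2 on [0, 1/2] by bisection (h2 is increasing there).
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h2(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for x in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    assert h2_inv(x) <= x * math.log(2) / math.log(1 / x) + 1e-9
```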

APPENDIX B
PROOFS FROM SECTION III

Proof of Lemma 8. To prove the upper bound on PF(∞,1), recall that r ↦ e^{PF(∞,1)(r)} is convex. Thus, it lies below the chord connecting the points (0, 0) and (H(X), e^{I∞(X;Y)}). The lower bound on IB(∞,1) is obtained similarly, using the concavity of R ↦ e^{IB(∞,1)(R)}. This bound is achieved by an erasure channel. To see this, consider the random variable T_δ taking values in X ∪ {⊥} obtained from the conditional distributions P_{T_δ|X}(t|x) = (1−δ)1{t=x} and P_{T_δ|X}(⊥|x) = δ for some δ ≥ 0. It can be verified that I(X; T_δ) = (1−δ)H(X) and Pc(Y|T_δ) = (1−δ)Pc(Y|X) + δPc(Y). By taking δ = 1 − R/H(X), this channel meets the constraint I(X; T_δ) = R. Hence,
\[
\mathsf{IB}^{(\infty,1)}(R) \geq \log\left[\frac{\mathsf{P_c}(Y|T_\delta)}{\mathsf{P_c}(Y)}\right] = \log\left[1 - \frac{R}{H(X)} + \frac{R}{H(X)}\cdot\frac{\mathsf{P_c}(Y|X)}{\mathsf{P_c}(Y)}\right].
\]
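The two erasure-channel identities used above can be confirmed numerically. The following Python sketch (illustrative; a randomly drawn joint P_XY and δ = 0.3) builds the joints (X, T_δ) and (Y, T_δ) and verifies I(X; T_δ) = (1−δ)H(X) and Pc(Y|T_δ) = (1−δ)Pc(Y|X) + δPc(Y):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
Pxy = rng.random((m, n)); Pxy /= Pxy.sum()
Px, Py = Pxy.sum(1), Pxy.sum(0)
delta = 0.3                                  # erasure probability

def H(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def I(J):                                    # mutual information of a joint pmf matrix, in nats
    a, b = J.sum(1, keepdims=True), J.sum(0, keepdims=True)
    mask = J > 0
    return float((J[mask] * np.log(J[mask] / (a @ b)[mask])).sum())

# joint of (X, T_delta): T_delta = X w.p. 1-delta, erasure symbol m w.p. delta
Pxt = np.zeros((m, m + 1))
Pxt[np.arange(m), np.arange(m)] = (1 - delta) * Px
Pxt[:, m] = delta * Px
assert abs(I(Pxt) - (1 - delta) * H(Px)) < 1e-12      # I(X; T_delta) = (1-delta) H(X)

# joint of (Y, T_delta); Pc(Y|T) = sum_t max_y P(y, t)
Pyt = np.zeros((n, m + 1))
Pyt[:, :m] = (1 - delta) * Pxy.T
Pyt[:, m] = delta * Py
PcY = Py.max()
PcY_given_X = Pxy.max(axis=1).sum()          # sum_x max_y P(x, y)
PcY_given_T = Pyt.max(axis=0).sum()
assert abs(PcY_given_T - ((1 - delta) * PcY_given_X + delta * PcY)) < 1e-12
print("erasure-channel identities verified")
```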

Proof of Theorem 11. We begin with PF(∞,1). As described in Section III-A, and similarly to Mrs. Gerber's Lemma (Lemma 4), we need to construct the lower convex envelope K∪[F_β^{(∞,1)}] of F_β^{(∞,1)}(q) = Pc(Y′) + βH(X′), where X′ ∼ Bernoulli(q) and Y′ is the result of passing X′ through BSC(δ), i.e., Y′ ∼ Bernoulli(δ ∗ q). In this case, Pc(Y′) = max{δ ∗ q, 1 − δ ∗ q}. Hence, we need to determine the lower convex envelope of the map
\[
q \mapsto F_\beta^{(\infty,1)}(q) = \max\{\delta * q,\, 1 - \delta * q\} + \beta h_b(q). \tag{78}
\]
A straightforward computation shows that F_β^{(∞,1)}(q) is symmetric around q = 1/2 and concave in q on [0, 1/2] for any β. Hence, K∪[F_β^{(∞,1)}] is obtained as follows, depending on the value of β:

• If F_β^{(∞,1)}(1/2) < F_β^{(∞,1)}(0) (see Fig. 6a), then K∪[F_β^{(∞,1)}] is given by the convex combination of F_β^{(∞,1)}(0), F_β^{(∞,1)}(1), and F_β^{(∞,1)}(1/2).
• If F_β^{(∞,1)}(1/2) = F_β^{(∞,1)}(0) (see Fig. 6b), then K∪[F_β^{(∞,1)}] is given by the convex combination of F_β^{(∞,1)}(0), F_β^{(∞,1)}(1/2), and F_β^{(∞,1)}(1).
• If F_β^{(∞,1)}(1/2) > F_β^{(∞,1)}(0) (see Fig. 6c), then K∪[F_β^{(∞,1)}] is given by the convex combination of F_β^{(∞,1)}(0) and F_β^{(∞,1)}(1).


Hence, assuming p ≤ 1/2, we can construct a T∗ that minimizes Pc(Y|T) − βH(X|T). Considering the first two cases, we obtain that T∗ is ternary with P_{X|T∗=0} = Bernoulli(0), P_{X|T∗=1} = Bernoulli(1), and P_{X|T∗=2} = Bernoulli(1/2), with marginal P_{T∗} = [1 − p − α/2, p − α/2, α] for some α ∈ [0, 2p]. This leads to Pc(Y|T∗) = (1−α)(1−δ) + α/2 and I(X; T∗) = h_b(p) − α. Note that Pc(Y|T∗) covers the whole range [Pc(Y), 1−δ] as α varies over [0, 2p]. Replacing I(X; T∗) by r, we obtain α = h_b(p) − r, leading to Pc(Y|T∗) = (1−δ) − (h_b(p) − r)(1/2 − δ). Since Pc(Y) = 1 − δ ∗ p, the desired result follows.

To derive the expression for IB(∞,1), recall that we need to derive K∩[F_β^{(∞,1)}], the upper concave envelope of F_β^{(∞,1)}. It is clear from Fig. 6 that K∩[F_β^{(∞,1)}] is obtained by replacing F_β^{(∞,1)}(q) on the interval [q_β, 1 − q_β] by its maximum value over q, where
\[
q_\beta := \frac{1}{1 + e^{\frac{1-2\delta}{\beta}}}
\]
is the maximizer of F_β^{(∞,1)}(q) on [0, 1/2]. In other words,
\[
\mathsf K_\cap[F_\beta^{(\infty,1)}](q) = \begin{cases} F_\beta^{(\infty,1)}(q_\beta), & \text{for } q \in [q_\beta, 1-q_\beta],\\[2pt] F_\beta^{(\infty,1)}(q), & \text{otherwise.} \end{cases}
\]
Note that if p < q_β, then K∩[F_β^{(∞,1)}] evaluated at p coincides with F_β^{(∞,1)}(p). This corresponds to all trivial P_{T|X} such that Pc(Y|T) + βH(X|T) = Pc(Y) + βH(X). If, on the other hand, p ≥ q_β, then K∩[F_β^{(∞,1)}] is the convex combination of F_β^{(∞,1)}(q_β) and F_β^{(∞,1)}(1 − q_β). Hence, taking q_β as a free parameter (say, α), the optimal binary T∗ is constructed as follows: P_{X|T∗=0} = Bernoulli(α) and P_{X|T∗=1} = Bernoulli(1 − α) for α ≤ p. Such a channel induces
\[
\mathsf{P_c}(Y|T^*) = \max\{\alpha * \delta,\, 1 - \alpha * \delta\} = 1 - \alpha * \delta,
\]
as α ≤ p ≤ 1/2, and also
\[
I(X; T^*) = h_b(p) - h_b(\alpha).
\]
Combining these two, we obtain
\[
\mathsf{P_c}(Y|T^*) = 1 - \delta * h_b^{-1}(h_b(p) - R).
\]
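The ternary construction in the PF(∞,1) part can be verified numerically. In the sketch below (illustrative values p = 0.2, δ = 0.1, α = 0.25, with entropies measured in bits for this check), the marginal of X is confirmed to be Bernoulli(p), and Pc(Y|T∗) is checked against the closed form (1−δ) − (h_b(p) − r)(1/2 − δ) with r = I(X; T∗):

```python
import numpy as np

p, delta = 0.2, 0.1              # X ~ Bernoulli(p), BSC(delta), with p, delta <= 1/2
h2 = lambda q: 0.0 if q in (0, 1) else float(-q*np.log2(q) - (1-q)*np.log2(1-q))

alpha = 0.25                     # any alpha in [0, 2p]
# ternary T*: X|T*=0 ~ Bern(0), X|T*=1 ~ Bern(1), X|T*=2 ~ Bern(1/2)
Pt = np.array([1 - p - alpha/2, p - alpha/2, alpha])
PxGt = np.array([[1, 0], [0, 1], [0.5, 0.5]])         # rows: P_{X|T*=t}
assert np.allclose(Pt @ PxGt, [1 - p, p])             # marginal of X is Bern(p)

BSC = np.array([[1 - delta, delta], [delta, 1 - delta]])
PyGt = PxGt @ BSC
Pc_YgT = float((Pt * PyGt.max(axis=1)).sum())         # Pc(Y|T*) = sum_t P(t) max_y P(y|t)

# I(X; T*) in bits: H(X) - H(X|T*); only t = 2 contributes (1 bit)
I_XT = h2(p) - sum(Pt[t] * h2(PxGt[t, 1]) for t in range(3))
assert abs(I_XT - (h2(p) - alpha)) < 1e-12

# closed form: with r = I(X;T*), Pc(Y|T*) = (1-delta) - (h2(p) - r)(1/2 - delta)
r = I_XT
assert abs(Pc_YgT - ((1 - delta) - (h2(p) - r) * (0.5 - delta))) < 1e-12
print(Pc_YgT)
```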

Proof of Theorem 12. Let U_α and L_α denote U_{Φ,Ψ} and L_{Φ,Ψ}, respectively, when Ψ(Q_X) = Φ(Q_X) = ‖Q_X‖_α. In light of (59) and (60), it is sufficient to compute L_α and U_α. To do so, we need to construct the lower convex envelope K∪[F_β^{(α)}] and the upper concave envelope K∩[F_β^{(α)}] of the map F_β^{(α)}(q) given by q ↦ ‖Q_{Y′}‖_α − β‖Q_{X′}‖_α, where X′ ∼ Bernoulli(q) and Y′ is the result of passing X′ through BSC(δ), i.e., Y′ ∼ Bernoulli(δ ∗ q). In this case, we have
\[
q \mapsto F_\beta^{(\alpha)}(q) = \|q * \delta\|_\alpha - \beta\|q\|_\alpha, \tag{79}
\]
where ‖a‖_α is shorthand for ‖[1−a, a]‖_α for any a ∈ [0, 1].

We begin with L_α, for which we aim to obtain K∪[F_β^{(α)}]. A straightforward computation shows that F_β^{(α)}(q) is convex for β ≤ (1 − 2δ)² and α ≥ 2. For β > (1 − 2δ)² and α ≥ 2, it can be shown that F_β^{(α)}(q) is concave on an interval [q_β, 1 − q_β], where q_β solves \frac{\mathrm d}{\mathrm dq} F_\beta^{(\alpha)}(q) = 0. (The shape of q ↦ F_β^{(α)}(q) is similar to what was depicted in Fig. 4.) By symmetry, K∪[F_β^{(α)}] is therefore obtained by replacing F_β^{(α)}(q) on this interval by F_β^{(α)}(q_β). Hence, if p < q_β, then K∪[F_β^{(α)}] at p coincides with F_β^{(α)}(p), which results in a trivial P_{T|X} (see the proof of Theorem 11 for more details). If, on the other hand, p ≥ q_β, then K∪[F_β^{(α)}] evaluated at p is given by a convex combination of F_β^{(α)}(q_β) and F_β^{(α)}(1 − q_β). Relabeling q_β as a parameter (say, q), we can write an optimal binary T∗ via the following: P_{X|T∗=0} = Bernoulli(1 − q) and P_{X|T∗=1} = Bernoulli(q) for q ≤ p. This channel induces Ψ(Y|T∗) = ‖q ∗ δ‖_α and Φ(X|T∗) = ‖q‖_α. Hence, the graph of L_α is given by
\[
\big\{ (\|q\|_\alpha,\, \|q * \delta\|_\alpha) : 0 \leq q \leq p \big\}.
\]
Therefore,
\[
\sup_{\substack{P_{T|X}:\\ H_\alpha(X|T) \geq \zeta}} H_\alpha(Y|T) = \frac{\alpha}{1-\alpha} \log \|q * \delta\|_\alpha,
\]
where q ≤ p solves \frac{\alpha}{1-\alpha}\log\|q\|_\alpha = \zeta. Since the map q ↦ ‖q‖_α is strictly decreasing for q ∈ [0, 0.5], this equation has a unique solution.

Next, we compute U_α, or equivalently K∩[F_β^{(α)}], the upper concave envelope of F_β^{(α)} defined in (79). As mentioned earlier, q ↦ F_β^{(α)}(q) is convex for β ≤ (1 − 2δ)² and α ≥ 2. For β > (1 − 2δ)², we need to consider three cases: (1) K∩[F_β^{(α)}] is given by the convex combination of F_β^{(α)}(0) and F_β^{(α)}(1); (2) K∩[F_β^{(α)}] is given by the convex combination of F_β^{(α)}(0), F_β^{(α)}(1/2), and F_β^{(α)}(1); (3) K∩[F_β^{(α)}] is given by the convex combination of F_β^{(α)}(0) and F_β^{(α)}(q†), where q† is a point in [0, 1/2]. Without loss of generality, we can ignore the first case. The other two cases correspond to the following solutions:
• T∗ is a ternary variable given by P_{X|T∗=0} = Bernoulli(0), P_{X|T∗=1} = Bernoulli(1), and P_{X|T∗=2} = Bernoulli(1/2), with marginal P_{T∗} = [1 − p − λ/2, p − λ/2, λ] for some λ ∈ [0, 2p]. This produces
\[
\Psi(Y|T^*) = (1-\lambda)\|\delta\|_\alpha + \lambda \big\|\tfrac12\big\|_\alpha
\]
and
\[
\Phi(X|T^*) = (1-\lambda) + \lambda \big\|\tfrac12\big\|_\alpha.
\]
• T∗ is a binary variable given by P_{X|T∗=0} = Bernoulli(0) and P_{X|T∗=1} = Bernoulli(p/λ), with marginal T∗ ∼ Bernoulli(λ) for some λ ∈ [2p, 1]. This produces
\[
\Psi(Y|T^*) = (1-\lambda)\|\delta\|_\alpha + \lambda \big\|\delta * \tfrac{p}{\lambda}\big\|_\alpha
\]
and
\[
\Phi(X|T^*) = (1-\lambda) + \lambda \big\|\tfrac{p}{\lambda}\big\|_\alpha.
\]
Combining these two cases, we can write
\[
\mathsf U_\alpha(\zeta) = (1-\lambda)\|\delta\|_\alpha + \lambda \big\|\tfrac{p}{z} * \delta\big\|_\alpha,
\]
where
\[
\zeta = (1-\lambda) + \lambda \big\|\tfrac{p}{z}\big\|_\alpha,
\]
and z = max{2p, λ}. Plugging this into (59) completes the proof.
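The binary construction in the second bullet can be verified directly. With illustrative values p = 0.15, δ = 0.2, α = 3, and a λ ∈ [2p, 1], computing Σ_t P_{T∗}(t)‖P_{X|T∗=t}‖_α and Σ_t P_{T∗}(t)‖P_{Y|T∗=t}‖_α recovers the stated expressions for Φ(X|T∗) and Ψ(Y|T∗):

```python
import numpy as np

p, delta, a = 0.15, 0.2, 3.0     # X ~ Bern(p), BSC(delta), Renyi order alpha = 3
norm = lambda v: float(np.sum(np.asarray(v, float) ** a) ** (1 / a))
bern = lambda q: [1 - q, q]      # pmf of Bernoulli(q)

lam = 0.5                        # any lambda in [2p, 1]
assert 2 * p <= lam <= 1
# binary T*: P(T*=1) = lam, X|T*=0 ~ Bern(0), X|T*=1 ~ Bern(p/lam)
PxG0, PxG1 = bern(0.0), bern(p / lam)
assert abs((1 - lam) * PxG0[1] + lam * PxG1[1] - p) < 1e-12   # marginal of X is Bern(p)

BSC = np.array([[1 - delta, delta], [delta, 1 - delta]])
PyG0 = np.asarray(PxG0) @ BSC
PyG1 = np.asarray(PxG1) @ BSC
Phi = (1 - lam) * norm(PxG0) + lam * norm(PxG1)   # sum_t P(t) ||P_{X|T*=t}||_a
Psi = (1 - lam) * norm(PyG0) + lam * norm(PyG1)   # sum_t P(t) ||P_{Y|T*=t}||_a

conv = lambda q, d: q * (1 - d) + (1 - q) * d     # binary convolution q * d
# claimed: Phi = (1-lam) + lam ||p/lam||_a, Psi = (1-lam)||delta||_a + lam ||delta * (p/lam)||_a
assert abs(Phi - ((1 - lam) + lam * norm(bern(p / lam)))) < 1e-12
assert abs(Psi - ((1 - lam) * norm(bern(delta)) + lam * norm(bern(conv(p / lam, delta))))) < 1e-12
print(Phi, Psi)
```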

Proof of Lemma 9. The facts that γ ↦ H_γ(X|T) is non-increasing on [1, ∞] [99, Proposition 5] and that $\big(\sum_i |x_i|^\gamma\big)^{1/\gamma} \geq \max_i |x_i|$ for any vector x imply
\[
\frac{\gamma-1}{\gamma}\, H_\gamma(X|T) \leq H_\infty(X|T) \leq H_\gamma(X|T). \tag{80}
\]
Since I_∞(X;T) = H_∞(X) − H_∞(X|T), the above lower bound yields
\[
I_\gamma(X;T) \geq \frac{\gamma}{\gamma-1}\, I_\infty(X;T) - \frac{\gamma}{\gamma-1}\, H_\infty(X) + H_\gamma(X), \tag{81}
\]
where the last inequality follows from the fact that γ ↦ H_γ(X) is non-increasing. The upper bound in (80) (after replacing X with Y and γ with α) implies
\[
I_\alpha(Y;T) \leq I_\infty(Y;T) + H_\alpha(Y) - H_\infty(Y). \tag{82}
\]
Combining (81) and (82), we obtain the desired upper bound for PF(α,γ). The other bounds can be proved similarly by interchanging X with Y and α with γ in (81) and (82).
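The sandwich bound (80) can be checked numerically using Arimoto's conditional Rényi entropy H_γ(X|T) = (γ/(1−γ)) log Σ_t P_T(t)‖P_{X|T=t}‖_γ and H_∞(X|T) = −log Σ_t P_T(t) max_x P_{X|T}(x|t). The sketch below (randomly drawn P_T and P_{X|T}; a sanity check only) verifies both inequalities:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 5, 4                      # |X| = m, |T| = k
Pt = rng.random(k); Pt /= Pt.sum()
PxGt = rng.random((k, m)); PxGt /= PxGt.sum(axis=1, keepdims=True)
gamma = 3.0

# Arimoto conditional Renyi entropy, in nats:
# H_gamma(X|T) = (gamma/(1-gamma)) * log sum_t P(t) ||P_{X|T=t}||_gamma
S = float((Pt * (PxGt ** gamma).sum(axis=1) ** (1 / gamma)).sum())
H_gamma = gamma / (1 - gamma) * np.log(S)

# H_inf(X|T) = -log Pc(X|T) = -log sum_t P(t) max_x P(x|t)
H_inf = -np.log(float((Pt * PxGt.max(axis=1)).sum()))

# sandwich (80): ((gamma-1)/gamma) H_gamma <= H_inf <= H_gamma
assert (gamma - 1) / gamma * H_gamma <= H_inf + 1e-12
assert H_inf <= H_gamma + 1e-12
print(H_gamma, H_inf)
```

The lower inequality holds because the γ-norm dominates the max-norm, and the upper one because H_γ(X|T) is non-increasing in γ, so both assertions pass for any choice of P_T and P_{X|T}.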