
Introducing a differentiable measure of pointwise shared information

Abdullah Makkeh,∗ Aaron J. Gutknecht,† and Michael Wibral‡

Campus Institute for Dynamics of Biological Networks, Georg-August University, Goettingen, Germany (Dated: March 31, 2021)

Partial information decomposition (PID) of the multivariate mutual information describes the distinct ways in which a set of source variables contains information about a target variable. The groundbreaking work of Williams and Beer has shown that this decomposition cannot be determined from classic information theory without making additional assumptions, and several candidate measures have been proposed, often drawing on principles from related fields such as decision theory. None of these measures is differentiable with respect to the underlying probability mass function. We here present a novel measure that satisfies this property, emerges solely from information-theoretic principles, and has the form of a local mutual information. We show how the measure can be understood from the perspective of exclusions of probability mass, a principle that is foundational to the original definition of the mutual information by Fano. Since our measure is well-defined for individual realizations of the random variables, it lends itself for example to local learning in artificial neural networks. We also show that it has a meaningful Moebius inversion on a redundancy lattice and obeys a target chain rule. We give an operational interpretation of the measure based on the decisions that an agent should take if given only the shared information.

I. INTRODUCTION

What are the distinct ways in which a set of source variables may contain information about a target variable? How much information do input variables provide uniquely about the output, such that this information about the output variable cannot be obtained by any other input variable, or collections thereof? How much information is provided in a shared way, i.e., redundantly, by multiple input variables, or multiple collections of these? And how much information about the output is provided synergistically such that it can only be obtained by considering many or all input variables together? Answering questions of this nature is the scope of partial information decomposition (PID).

A solution to this problem has been long desired in studying complex systems [1–3] but seemed out of reach until the groundbreaking study of Williams and Beer [4]. This study provided first insights by establishing that information theory lacks the axioms to uniquely solve the PID problem. Such axioms have to be chosen in a way that satisfies our intuition about shared, unique, and synergistic information (at least in simple corner cases). However, further studies in [5, 6] quickly revealed that not all intuitively desirable properties, like positivity, zero redundant information for statistically independent input, a chain rule for composite output variables, etc., were compatible, and the initial measure proposed by Williams and Beer was rejected on the grounds of not fulfilling certain desiderata favored in the community. Nevertheless, the work of Williams and Beer clarified that indeed an axiomatic approach is necessary and also highlighted the possibility that the higher-order terms (or questions) that arose when considering more than two input variables could be elegantly organized into contributions on the lattice of antichains (see more below).

Approaches that do not fulfill the Williams and Beer desiderata have been suggested, e.g., [7, 8]. However, these approaches fail to quantify all the desired quantities and, therefore, answer a question different from that posed by PID.

∗ [email protected]
[email protected]; also at MEG Unit, Brain Imaging Center, Goethe University, Frankfurt, Germany
[email protected]

Subsequently, multiple PID frameworks have been proposed, and each of them has merits in the application case indicated by its operational interpretation (Bertschinger et al. [9], e.g., justify their measure of unique information in a decision-theoretic setting). However, all measures lacked the property of being well defined on individual realizations of inputs and outputs (localizability), as well as continuity and differentiability in the underlying joint probability distribution. These properties are key desiderata for the settings of interest to neuroscientists and physicists, e.g., for distributed computation, where locality is needed to unfold computations in space and time [10–13]; for learning in neural networks [14, 15], where differentiability is needed for gradient descent and localizability for learning from single samples and minibatches; for neural coding [15, 16], where localizability is important to evaluate the information value of individual inputs that are encoded by a system; and for problems from the domain of complex systems in physics as discussed in [17].

While the first two properties have very recently been provided by the pointwise partial information decomposition (PPID) of Finn and Lizier [18], differentiability is still missing, as is the extension of most measures to continuous variables. Differentiability, however, seems pivotal for exploiting PID measures for learning in neural networks, as suggested for example in [14], and also in physics problems.

Therefore, we here rework the definition of Finn and Lizier [18] in order to define a novel PID measure of shared mutual information that is localizable and also differentiable. We aim for a measure that adheres as closely as possible to the original definition of (local) mutual information, in the hope that our measure will inherit most of the operational interpretation of local mutual information. We also seek to avoid invoking assumptions or desiderata from outside the scope of information theory, e.g., we explicitly seek to avoid invoking desiderata from decision or game theory. We note that adhering as closely as possible to information-theoretic concepts should also simplify finding localizable and differentiable measures.

Our goals above suggest that we have to abandon positivity for the parts (called atoms in [4]) of the decomposition, simply because the local mutual information can already be negative [19]. With respect to a negative shared information in the PID, we aim to preserve the interpretation of negative terms as being misinformative, in the sense that obtaining negative information will make a rational agent more likely to make the wrong prediction about the value of a target variable. Our goals also strongly suggest avoiding the computation of the minimum (or maximum) of multiple information expressions anywhere in the definition of the measure. This is because taking a minimum or maximum would almost certainly collide with differentiability and also with a later extension to continuous variables.

The paper proceeds as follows. First, Section II introduces our measure of shared information i^sx_∩. Then, Section III lays out how i^sx_∩ can be understood based on the concept of shared probability mass exclusions. Section IV utilizes i^sx_∩ to obtain a full PID and establishes its differentiability. Then, Section V discusses some implications of i^sx_∩ being a local mutual information, its operational interpretation, and some key applications of i^sx_∩. Finally, Section VI concludes with several examples.

II. DEFINITION OF THE MEASURE i^sx_∩ OF POINTWISE SHARED INFORMATION

We begin by considering discrete random variables S1, . . . , Sn and T, where the Si are called the sources and T is the target. Suppose now that these random variables have taken on particular realizations s1, . . . , sn and t. Our goal is to quantify the pointwise shared information that the source realizations carry about the target realization. We will proceed in three steps: (1) we define the information shared by all source realizations about the target realization, (2) we define pointwise shared information for any subset of source realizations, and (3) we provide the complete definition of the information shared by multiple subsets of source realizations.

So how much information about the target realization t is redundantly contained in all source realizations si? We propose that this information can be quantified as the information about the target realization provided by the truth of the statement

    W_{s_1,\dots,s_n} = \big( (S_1 = s_1) \vee \dots \vee (S_n = s_n) \big),    (1)

i.e., by the inclusive OR of the statements that each source variable has taken on its specific realization. This information in turn can be understood as a regular pointwise mutual information between the target realization t and the indicator random variable [20] of the statement W_{s_1,...,s_n} assuming the value 1:

    i^{sx}_\cap(t : s_1; \dots; s_n) := \log_2 \frac{p(t \mid I_{W_{s_1,\dots,s_n}} = 1)}{p(t)}    (2)

                                      = \log_2 \frac{p(t \mid W_{s_1,\dots,s_n} = \mathrm{true})}{p(t)}.    (3)
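As a sanity check, Eqs. (2)/(3) can be evaluated by brute force on a small joint distribution. The XOR system and the realization chosen below are illustrative assumptions, not examples taken from the paper:

```python
import math
from itertools import product

# Toy joint distribution (an assumption for illustration):
# S1, S2 uniform binary, T = S1 XOR S2.
pmf = {(s1, s2, s1 ^ s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}

def i_sx_cap(pmf, s_vals, t_val):
    """Pointwise shared information of Eq. (2): log2 p(t | W = true) / p(t),
    where W is the inclusive OR of the statements S_i = s_i."""
    p_t = sum(p for (s1, s2, t), p in pmf.items() if t == t_val)
    p_w = sum(p for (s1, s2, t), p in pmf.items()
              if s1 == s_vals[0] or s2 == s_vals[1])
    p_t_and_w = sum(p for (s1, s2, t), p in pmf.items()
                    if t == t_val and (s1 == s_vals[0] or s2 == s_vals[1]))
    return math.log2((p_t_and_w / p_w) / p_t)

# For the realization s1=0, s2=0, t=0: p(t | W) = 1/3 < p(t) = 1/2,
# so the shared information is negative (misinformative).
print(i_sx_cap(pmf, (0, 0), 0))
```

For XOR the shared term comes out negative, log2(2/3), illustrating the point made below that atoms of the decomposition can be misinformative.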

The superscript “sx” stands for “shared exclusion” and will be explained in more detail in the next section. The reason for the choice of W_{s_1,...,s_n} is the following: the truth of this statement can be verified by knowing the realization of any single source variable, i.e., knowing that S_i = s_i for at least one i. Thus, whatever information can be obtained from W_{s_1,...,s_n} can also be obtained from any individual statement S_i = s_i. In other words, the statement W_{s_1,...,s_n} only contains information that is redundant to all source realizations. Conversely, whatever information can be obtained from all individual statements S_i = s_i can also be obtained from W_{s_1,...,s_n} because it implies that at least one of the statements S_i = s_i has to be true. In other words, all of the information shared by the source realizations is contained in the statement W_{s_1,...,s_n}. Accordingly, the statement W_{s_1,...,s_n} exactly captures the information redundantly contained in the source realizations. Any logically stronger or weaker statement would either contain some nonredundant information or miss out on some redundant information, respectively. For a more comprehensive and foundational version of this argument, connecting principles from mereology (the study of parthood relations) and formal logic, see [21].

Now, this definition is not entirely complete yet since it only quantifies the information shared by all source realizations s1, . . . , sn. However, a full-fledged measure of shared information also has to specify the information shared by (1) any subset of source realizations (e.g., the information shared by s1 and s3) and (2) multiple subsets of source realizations (e.g., the information shared by (s1, s2) and (s2, s3)) [4]. The definition for a subset a ⊆ {1, . . . , n} is straightforward: the information shared by the corresponding realizations (s_i | i ∈ a) is the information provided by the statement

    W_{\mathbf{a}} = \Big( \bigvee_{i \in \mathbf{a}} S_i = s_i \Big),    (4)

i.e., by the logical OR of the statements S_i = s_i where i is in the subset in question. Note that in the following we will refer to sets of source realizations by their index sets for brevity. So we will generally say “the set of source realizations a” instead of “the source realizations (s_i | i ∈ a)”. There are formal reasons why it is preferable to work with index sets that will become apparent in Section IV.

Now, how about the case of multiple subsets? Note first that the pointwise mutual information provided by a given subset a of source realizations about the target realization is the information provided by the logical AND of the corresponding statements S_i = s_i:

    i\big(t : (s_i)_{i \in \mathbf{a}}\big) = \log_2 \frac{p\big(t \mid (\bigwedge_{i \in \mathbf{a}} S_i = s_i) = \mathrm{true}\big)}{p(t)}.    (5)

Accordingly, the information shared by multiple subsets of source realizations a1, . . . , am can be quantified as the information provided by the logical OR of the associated logical AND statements, i.e., as the information provided by the statement

    W_{\mathbf{a}_1,\dots,\mathbf{a}_m} = \bigvee_{i=1}^{m} \bigwedge_{j \in \mathbf{a}_i} S_j = s_j.    (6)

The underlying reasoning is exactly as described above: whatever information can be obtained from W_{a_1,...,a_m} can also be obtained from all of the conjunctions ⋀_{j∈a_i} S_j = s_j, because as soon as the truth of one of the conjunctions is known, the truth of W_{a_1,...,a_m} is known as well. Conversely, whatever information can be obtained from all conjunctions can also be obtained from W_{a_1,...,a_m} since this statement implies that at least one conjunction must be true. This leads us to the final definition of the information shared by arbitrary subsets of source realizations a1, . . . , am:

    i^{sx}_\cap(t : \mathbf{a}_1; \dots; \mathbf{a}_m) := \log_2 \frac{p(t \mid I_{W_{\mathbf{a}_1,\dots,\mathbf{a}_m}} = 1)}{p(t)}    (7)

                                                        = \log_2 \frac{p(t \mid W_{\mathbf{a}_1,\dots,\mathbf{a}_m} = \mathrm{true})}{p(t)}.    (8)

Note that this general definition agrees with the above definition of the information shared by all source realizations or subsets thereof. We would also like to emphasize here again that i^sx_∩ has the form of a local mutual information. This feature is of particular importance in the following section, where we aim to provide further intuition for the measure by showing that it can also be motivated from the perspective of probability mass exclusions as discussed in [22].
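The general definition of Eq. (8) can likewise be evaluated directly for collections of sources. The three-source majority system, the collections, and the realization below are illustrative assumptions chosen only to exercise the formula:

```python
import math
from itertools import product

# Assumed toy system: three uniform binary sources, T = majority(S1, S2, S3).
pmf = {}
for s in product((0, 1), repeat=3):
    t = int(sum(s) >= 2)
    pmf[s + (t,)] = 1 / 8

def i_sx_cap(pmf, collections, s_vals, t_val):
    """Eq. (8): shared information of collections a_1,...,a_m (index sets),
    with W = OR over collections of (AND over i in a of S_i = s_i)."""
    def w_true(s):
        return any(all(s[i] == s_vals[i] for i in a) for a in collections)
    p_t = sum(p for (*s, t), p in pmf.items() if t == t_val)
    p_w = sum(p for (*s, t), p in pmf.items() if w_true(s))
    p_tw = sum(p for (*s, t), p in pmf.items() if t == t_val and w_true(s))
    return math.log2((p_tw / p_w) / p_t)

# Information shared by the collections {S1,S2} and {S2,S3}
# about t = 1, given the realization (1, 1, 1):
print(i_sx_cap(pmf, [{0, 1}, {1, 2}], (1, 1, 1), 1))
```

In this particular configuration the two pairs happen to share exactly one bit about the majority outcome.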

III. SHARED MUTUAL INFORMATION FROM SHARED EXCLUSIONS OF PROBABILITY MASS

Shannon information can be seen as being induced by exclusion of probability mass (e.g., [16, Sec. 2.1.3]), and the same perspective can actually be applied to the mutual information as well, as explicitly derived by Finn and Lizier [22]. In our approach to shared information, we suggest keeping intact this central information-theoretic principle that binds the exclusion of probability mass to information and mutual information. We now first review the probability exclusion perspective on local mutual information. Subsequently, we show how the measure i^sx_∩ of shared information, itself being a local mutual information, can be motivated from the same perspective as well.

A. Mutual information from exclusions of probability mass

The local mutual information [23] obtained from a realization (t, s) of two random variables T and S is

    i(t : s) = \log_2 \frac{p(t \mid s)}{p(t)}.    (9)

This means that i(t : s) compares the probability of observing t after observing s to the prior p(t). Thus, s is said to be informative (resp. misinformative) about t if the chance of t occurring increases (resp. decreases) after observing s compared to the prior probability p(t), i.e., if i(t : s) > 0 (resp. i(t : s) < 0).

The definition of i(t : s) can be understood in terms of excluding certain probability mass [22] by rewriting it as

    i(t : s) = \log_2 \frac{P(t) - P(t \cap \bar{s})}{1 - P(\bar{s})} - \log_2 P(t),    (10)

where s̄ is the set complement of the event s = {S = s}, and t = {T = t}. Looking at it in this way, pointwise mutual information can be conceptualized as follows (illustrated in FIG. 1): (i) “removing” all points from the initial sample space Ω that are incompatible with the observation of a specific s by giving them measure zero; for the event t this has the consequence that a part of it is also removed, i.e., P(t) − P(t ∩ s̄) remains; (ii) rescaling the probability measure to again have properly normalized probabilities, i.e., dividing by 1 − P(s̄); and (iii) comparing the size of t after observing s to the prior P(t) on a logarithmic scale. The remove-rescale procedure is a conceptual way of thinking about the changes to Ω (after observing s) that are reflected in the conditional measure P(· | s).
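The equivalence of the remove-rescale form (10) with the standard form (9) can be checked numerically; the joint pmf below is an arbitrary illustrative choice:

```python
import math
from itertools import product

# A small arbitrary joint pmf over (S, T); the numbers are assumptions
# for illustration only.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def i_direct(s, t):
    """Eq. (9): log2 p(t|s) / p(t)."""
    p_t = sum(p for (s_, t_), p in pmf.items() if t_ == t)
    p_s = sum(p for (s_, t_), p in pmf.items() if s_ == s)
    return math.log2((pmf[(s, t)] / p_s) / p_t)

def i_exclusion(s, t):
    """Eq. (10): remove the mass excluded by s, rescale, compare to prior."""
    p_t = sum(p for (s_, t_), p in pmf.items() if t_ == t)
    p_s_bar = sum(p for (s_, t_), p in pmf.items() if s_ != s)
    p_t_and_s_bar = sum(p for (s_, t_), p in pmf.items()
                        if t_ == t and s_ != s)
    return math.log2((p_t - p_t_and_s_bar) / (1 - p_s_bar)) - math.log2(p_t)

for s, t in product((0, 1), repeat=2):
    assert abs(i_direct(s, t) - i_exclusion(s, t)) < 1e-12
print("Eq. (9) and Eq. (10) agree on all realizations")
```

The agreement is exact because P(t) − P(t ∩ s̄) = P(t ∩ s) and 1 − P(s̄) = P(s).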

This derivation of local mutual information can be generalized to any number of sources. For instance, the joint local mutual information of s1, s2 about t is

    i(t : s_1, s_2) = \log_2 \frac{P(t) - P\big(t \cap (\bar{s}_1 \cup \bar{s}_2)\big)}{1 - P(\bar{s}_1 \cup \bar{s}_2)} - \log_2 P(t).    (11)

The two conserved key principles here are that (i) the mutual information is always induced by exclusion of the probability mass related to events that are impossible after the observation of s1, . . . , sn, i.e., s̄1, . . . , s̄n, and (ii) the probabilities are rescaled by taking into account these very same exclusions. These core information-theoretic principles can be utilized to motivate the measure i^sx_∩ as explained in the next section.

FIG. 1. Depiction of deriving the local mutual information i(t : s) by excluding the probability mass of the impossible event s̄ after observing s. (A) Two events t, t̄ partition the sample space Ω. (B) Two-event partition s, s̄ of the source variable S in the sample space Ω. The occurrence of s renders s̄ impossible (red (dark gray) stripes). (C) t may intersect with s (gray region) and s̄ (red (dark gray) hashed region). The relative size of the two intersections determines whether we obtain information or misinformation, i.e., whether t becomes relatively more likely after considering s, or not (D), considering the necessary rescaling of the probability measure (E). Note that if the gray region in (E) is larger (resp. smaller) than that in (A), then s is informative (resp. misinformative) about t, since observing s hints that t is more (resp. less) likely to occur compared to an ignorant prior. (F) shows why the misinformative exclusion P(t ∩ s̄) (intersection of red (dark gray) hashes with gray region) cannot be cleanly separated from the informative exclusion P(t̄ ∩ s̄) (dotted outline in (C)), as stated already in [22]. This is because these overlaps appear together in a sum inside the logarithm, but this logarithm in turn guarantees the additivity of information terms. Thus the additivity of (mutual) information terms is incompatible with an additive separation of informative and misinformative exclusions inside the logarithms of the information measures.

B. i^sx_∩ from shared exclusions of probability mass

The core idea is now that just as mutual information is connected to the exclusion of probability mass, shared information should be connected to shared exclusions of probability mass, i.e., to possibilities being excluded redundantly by all (joint) source realizations in question. Now, what is excluded by a given joint source realization a_j is precisely the complement of the event a_j = ⋂_{i∈a_j} {S_i = s_i}. Thus, to evaluate the information shared by the joint source realizations a_1, . . . , a_m, we need to remove and rescale by the intersection of the complement events ā_j. This intersection contains points that are excluded by all joint source realizations in question. Hence, we arrive at

    i^{sx}_\cap(t : \mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_m) := \log_2 \frac{P(t) - P\big(t \cap (\bar{\mathbf{a}}_1 \cap \bar{\mathbf{a}}_2 \cap \dots \cap \bar{\mathbf{a}}_m)\big)}{1 - P(\bar{\mathbf{a}}_1 \cap \bar{\mathbf{a}}_2 \cap \dots \cap \bar{\mathbf{a}}_m)} - \log_2 P(t).    (12)

FIG. 2. Shared exclusions in the three-source-variable case. Upper left: a sample space with three events s1, s2, s3 from three source variables (their complement events are depicted in (4)). For clarity, t is not shown but may arbitrarily intersect with any intersections/unions of the s_i. The remaining panels show the exclusions induced by different combinations of a_i. These exclusions arise by taking the corresponding unions and intersections of sets. Which unions and intersections were taken can be deduced from the shapes of the remaining, nonexcluded regions. For (1)–(3) we show the shared exclusions for combinations of singletons ((1) and (2)) and those of singletons and coalitions, such as the events of the collections (left) and the shared exclusions (right). For (4)–(7) we only show shared exclusions. The online version uses an additional, nonessential color-based mark-up of unions and intersections: an intersection exclusion is indicated by the mix of the individual colors, e.g., the 1,2 exclusion is s̄1 ∩ s̄2 and mixes red and blue to purple, and a union exclusion is indicated by a pattern of the individual colors, e.g., the 12 exclusion is s̄1 ∪ s̄2 and takes a red-blue pattern.

It is straightforward to show that this definition coincides with the one given in Section II. FIG. 2 depicts all possible exclusions in the case of three sources. This concludes our exposition of the measure of shared information i^sx_∩. In the next section, we show how this measure induces a meaningful and differentiable partial information decomposition.
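That the exclusion form (12) and the statement form (8) coincide follows from De Morgan's laws, and can be confirmed numerically; the AND target, antichain, and realization below are illustrative assumptions:

```python
import math
from itertools import product

# Toy three-source system (illustrative assumption): uniform binary
# sources, T = S1 AND S2 AND S3; realization s = (1,1,1), t = 1.
pmf = {s + (int(all(s)),): 1 / 8 for s in product((0, 1), repeat=3)}
s_vals, t_val = (1, 1, 1), 1
collections = [{0, 1}, {2}]           # the antichain {12}{3}

def a_event(a, s):                    # conjunction over one collection
    return all(s[i] == s_vals[i] for i in a)

p_t = sum(p for (*s, t), p in pmf.items() if t == t_val)

# Eq. (8): condition on the OR of the conjunctions.
p_w = sum(p for (*s, t), p in pmf.items()
          if any(a_event(a, s) for a in collections))
p_tw = sum(p for (*s, t), p in pmf.items()
           if t == t_val and any(a_event(a, s) for a in collections))
via_statement = math.log2((p_tw / p_w) / p_t)

# Eq. (12): remove the mass excluded by *all* collections, then rescale.
p_excl = sum(p for (*s, t), p in pmf.items()
             if all(not a_event(a, s) for a in collections))
p_t_excl = sum(p for (*s, t), p in pmf.items()
               if t == t_val and all(not a_event(a, s) for a in collections))
via_exclusions = math.log2((p_t - p_t_excl) / (1 - p_excl)) - math.log2(p_t)

assert abs(via_statement - via_exclusions) < 1e-12
```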

IV. LATTICE STRUCTURE AND DIFFERENTIABILITY

We now present a lattice structure that yields a pointwise partial information decomposition (PPID) when endowed with i^sx_∩ and show that all of the resulting PPID terms are differentiable. The lattice structure was originally introduced by Williams and Beer [4] on the basis of a range of axioms they placed on the concept of redundant information (see below). As we showed in [21], it can also be derived from elementary parthood relationships between the PID terms (also called PID atoms) and mutual information terms.

A. Lattice structure

Williams and Beer in their seminal work [4] showed that in order to capture all the information contributions that a set of sources has about a target, we need to look at the level of collections of sources. That is, each combination of collections of sources captures a PPID term (an information contribution / information atom). Their argument was based on an analysis of the concept of redundant information, i.e., the information shared by multiple collections of sources. In particular, they argued that any measure of shared information should satisfy certain desiderata, referred to as W&B axioms (see Axioms IV.1, IV.2, and IV.3). These axioms imply that the domain of the shared information function can be restricted to the antichain combinations, i.e., any combination of collections of sources such that none of the collections is a subset of another. The reason is the following: consider collections a, b, and c, and suppose that a ⊂ b (while a ⊄ c and c ⊄ a). Then the information shared by all three collections is simply that shared by a and c, since any information in a is automatically also contained in b. In this way the information shared by multiple collections always reduces to the information associated with an antichain combination by removing all supersets. The measure i^sx_∩ agrees with this result because the truth conditions of the statement W_{a_1,...,a_m} are unaffected by superset removal.
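The superset-removal step that reduces an arbitrary combination of collections to an antichain is a simple set operation; a sketch with a hypothetical helper name:

```python
def to_antichain(collections):
    """Drop every collection that is a proper superset of another one,
    leaving the antichain that carries the same shared information."""
    sets = [frozenset(a) for a in collections]
    return {a for a in sets
            if not any(b < a for b in sets)}   # b < a: proper subset

# With a = {1}, b = {1,2}, c = {2,3}: a is a subset of b, so b is removed
# and the combination reduces to the antichain {a, c}.
print(to_antichain([{1}, {1, 2}, {2, 3}]))
```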

Mathematically, the antichain combinations form a lattice structure, i.e., there exists an ordering of these antichain combinations such that for any pair of antichain combinations there is a unique infimum and supremum. In [4], this lattice of antichain combinations is called the redundancy lattice since it models inclusion of redundancies: redundant information terms associated with lower-level antichains are included in redundancies associated with higher-level antichains. Williams and Beer then introduced the PID terms implicitly via a Moebius inversion over the lattice (more details in Appendix A 1). We can proceed in just the same way on a pointwise level and introduce the PPID terms via a Moebius inversion of i^sx_∩, i.e., via inverting the relationship

    i^{sx}_\cap(t : \alpha) = \sum_{\beta \preceq \alpha} \pi^{sx}(t : \beta),    (13)

where α and β are antichain combinations. In this way each PPID term π^sx measures the information “increment” as we move up the lattice, i.e., the PPID term of a given node is that part of the corresponding shared information that is not already contained in any lower-level shared information.
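The Moebius inversion of Eq. (13) can be carried out bottom-up on the lattice: each atom is the node's shared information minus the atoms strictly below it. The sketch below uses the two-source redundancy lattice with made-up shared-information values (the node labels and numbers are illustrative assumptions, not results from the paper):

```python
# Moebius inversion of Eq. (13) on the two-source redundancy lattice
# {1}{2} < {1}, {2} < {12}.  'below' lists the strictly lower nodes.
below = {
    "{1}{2}": [],
    "{1}":    ["{1}{2}"],
    "{2}":    ["{1}{2}"],
    "{12}":   ["{1}{2}", "{1}", "{2}"],
}
# Illustrative shared-information values i_cap(t : alpha):
i_shared = {"{1}{2}": 0.5, "{1}": 1.0, "{2}": 0.75, "{12}": 1.5}

pi = {}
for node in ("{1}{2}", "{1}", "{2}", "{12}"):   # bottom-up order
    pi[node] = i_shared[node] - sum(pi[b] for b in below[node])

print(pi)  # each value is the "increment" of its node
```

By construction the atoms at and below any node sum back to that node's shared information, which is exactly the inversion of Eq. (13).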

It should be mentioned at this point that the measure i^sx_∩ actually violates one of the W&B axioms for shared information: it is not monotonically decreasing as more collections of source realizations are included. On first sight this appears to be a problem because one would expect, for instance, that the information shared by source realizations s1, s2, and s3 should be smaller than or equal to the information shared by s1 and s2. After all, the information shared by all three source realizations should be contained in the information shared by the first two. However, the violation of the monotonicity property has a natural interpretation in terms of informative and misinformative contributions to redundant information [18]: whereas each of these components individually should indeed satisfy the monotonicity axiom, this is not true of the total redundant information. Using the above example, the information shared by s1, s2, and s3 can actually be larger than the information shared by s1 and s2 if the extra information in the latter shared information term (i.e., the information shared by s1 and s2 but not by s3) is misinformative.

As shown in [22], it is possible to uniquely decompose the pointwise mutual information into an informative and a misinformative component. Since i^sx_∩ is itself a pointwise mutual information, the same decomposition can be applied in order to obtain an informative pointwise shared information i^sx+_∩ (15a) and a misinformative pointwise shared information i^sx−_∩ (15b). We may then show that each of these components individually satisfies the W&B axioms. The decomposition reads

    i^{sx}_\cap(t : \mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_m) = i^{sx+}_\cap(t : \mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_m) - i^{sx-}_\cap(t : \mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_m),    (14a)

    i^{sx+}_\cap(t : \mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_m) := \log_2 \frac{1}{P(\mathbf{a}_1 \cup \mathbf{a}_2 \cup \dots \cup \mathbf{a}_m)},    (15a)

    i^{sx-}_\cap(t : \mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_m) := \log_2 \frac{P(t)}{P\big(t \cap (\mathbf{a}_1 \cup \mathbf{a}_2 \cup \dots \cup \mathbf{a}_m)\big)}.    (15b)
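The decomposition (14a)–(15b) can be checked numerically on a toy system; the OR target and the realization below are illustrative assumptions:

```python
import math
from itertools import product

# Toy system (illustrative assumption): two uniform binary sources,
# T = S1 OR S2; realization s1 = 1, s2 = 1, t = 1.
pmf = {(s1, s2, s1 | s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}
s_vals, t_val = (1, 1), 1

def w_true(s):                         # W = (S1 = s1) OR (S2 = s2)
    return any(si == vi for si, vi in zip(s, s_vals))

p_t  = sum(p for (*s, t), p in pmf.items() if t == t_val)
p_w  = sum(p for (*s, t), p in pmf.items() if w_true(s))
p_tw = sum(p for (*s, t), p in pmf.items() if t == t_val and w_true(s))

i_sx       = math.log2((p_tw / p_w) / p_t)   # Eqs. (2)/(8)
i_sx_plus  = math.log2(1 / p_w)              # Eq. (15a), informative part
i_sx_minus = math.log2(p_t / p_tw)           # Eq. (15b), misinformative part

assert abs(i_sx - (i_sx_plus - i_sx_minus)) < 1e-12
```

The identity holds because log2[p(t ∩ W)/(P(W) p(t))] splits exactly into log2[1/P(W)] minus log2[p(t)/p(t ∩ W)].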

arXiv:2002.03356v3 [cs.IT] 10 Sep 2020

Here, the first term of (14a) is considered to be the informative part as it is what can be inferred from the sources (recall that the ai are indices of collections of sources), and we refer to it by isx+∩ (15a). The second term of (14a) quantifies the (misinformative) relative loss of p(t), the probability mass of the event t (which actually happened), when excluding the mass of the shared exclusion ā1 ∩ ā2 ∩ … ∩ ām (the intersection of the complements of the ai), and we refer to it by isx−∩ (15b).
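As an illustration (our own sketch, not code from the paper), the quantities in (14a)–(15b) can be evaluated directly from a joint probability mass function; the outcome encoding (t, s1, …, sn) and the helper names `p_event` and `i_sx_parts` are our own choices.

```python
import math

def p_event(pmf, pred):
    # total probability mass of the outcomes satisfying the predicate
    return sum(q for o, q in pmf.items() if pred(o))

def i_sx_parts(pmf, t, s, collections):
    """Return (i_sx, i_sx_plus, i_sx_minus) for target value t, realized
    sources s = (s_1, ..., s_n), and collections a_1..a_m given as sets of
    1-based source indices.  Outcomes are tuples (t, s_1, ..., s_n) and
    W is the union of the collection events a_1 u ... u a_m."""
    in_w = lambda o: any(all(o[j] == s[j - 1] for j in a) for a in collections)
    p_w = p_event(pmf, in_w)                              # P(a1 u ... u am)
    p_t = p_event(pmf, lambda o: o[0] == t)               # P(t)
    p_tw = p_event(pmf, lambda o: o[0] == t and in_w(o))  # P(t & (a1 u ... u am))
    plus = math.log2(1.0 / p_w)    # informative part, Eq. (15a)
    minus = math.log2(p_t / p_tw)  # misinformative part, Eq. (15b)
    return plus - minus, plus, minus                      # Eq. (14a)

# XOR with outcomes (t, s1, s2), realization (s1, s2, t) = (1, 1, 0):
xor = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25}
total, plus, minus = i_sx_parts(xor, 0, (1, 1), [{1}, {2}])
```

For this XOR realization the sketch yields isx+∩ = log2(4/3), isx−∩ = 1 bit, and hence isx∩ = log2(2/3) < 0, matching the worked example discussed later in the text.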

Now, isx±∩ should individually fulfill a pointwise version of the Williams and Beer axioms. These PPID axioms were described by Finn and Lizier [18].

Axiom IV.1 (Symmetry). i+∩ and i−∩ are invariant under any permutation σ of the collections of source events:

i+∩(t : a1; a2; …; am) = i+∩(t : σ(a1); σ(a2); …; σ(am)),

i−∩(t : a1; a2; …; am) = i−∩(t : σ(a1); σ(a2); …; σ(am)).

Axiom IV.2 (Monotonicity). i+∩ and i−∩ decrease monotonically as more source events are included,

i+∩(t : a1; …; am; am+1) ≤ i+∩(t : a1; …; am),

i−∩(t : a1; …; am; am+1) ≤ i−∩(t : a1; …; am),

with equality if there exists i ∈ [m] such that ai ⊆ am+1.

Axiom IV.3 (Self-redundancy). i+∩ and i−∩ for a single source event a equal i+ and i−, respectively:

i+∩(t : a) = h(a) = i+(t : a),

i−∩(t : a) = h(a | t) = i−(t : a).

Therefore, i∩(t : a) = i(t : a).

Note that i(t : a) = i+(t : a) − i−(t : a), which is the informative–misinformative decomposition of the pointwise mutual information derived by Finn and Lizier [22]. The following theorem states that isx±∩ result in a consistent PPID by showing that isx+∩ and isx−∩ individually fulfill the PPID axioms [18] (the proof is deferred to appendix A).

Theorem IV.1. isx+∩ and isx−∩ satisfy Axioms IV.1, IV.2, and IV.3.

In this way the violation of monotonicity of the total shared information isx∩ can be completely explained in terms of misinformative contributions. In fact, there is another form of monotonicity that should hold as well: monotonicity over the redundancy lattice. As noted above, the redundancy lattice models the inclusion of redundancies, so we would expect lower-level redundancies to be smaller than higher-level redundancies. Again, this form of monotonicity does not hold for isx∩ itself, but it does hold for its informative and misinformative components, as expressed in the following theorem:

Theorem IV.2. isx±∩ increase monotonically on the redundancy lattice.

There is another apparent problem that can be addressed using the separation into informative and misinformative components, namely the fact that both isx∩ as well as πsx can be negative. This can be interpreted in terms of misinformation as well. To this end we define misinformative and informative PPID terms πsx± via Möbius inversions of isx±∩. These informative and misinformative components of the PPID terms can be obtained recursively from isx±∩ (see appendix A). They stand in the relation πsx = πsx+ − πsx− to the PPID terms. Now, even though πsx may be negative, its components πsx+ and πsx− are non-negative.

Theorem IV.3. The atoms πsx± are non-negative.

In appendix A, we will provide the necessary tools to prove the above theorems, in particular theorem IV.3. To sum up, this section shows that isx∩ results in a consistent and meaningful PPID. The apparent problems of violating monotonicity and non-negativity can be resolved by separating misinformative and informative components and showing that these components do satisfy the desired properties (for more discussion of the idea of misinformation within local Shannon information theory see Discussion).

This concludes our discussion of the PPID induced by isx∩. The global, variable-level PID can be obtained by simply averaging the local quantities over all possible realizations of the source and target random variables. For a complete worked example of the XOR probability distribution see Figure 3, subfigure H in particular. In the next section we establish the differentiability of isx∩ and πsx, an important advantage of these measures compared to other approaches.

B. Differentiability of isx∩ and πsx±

We will discuss the differentiability of the PPID obtained by isx∩. This is a desirable property [14] that is proven to be lacking in some measures [24–26], or is evidently lacking for other measures whose definitions are based on the maximum (or minimum) of multiple information quantities.

Let A([n]) be the redundancy lattice (see section A), let (T, S1, …, Sn) be discrete and finite random variables, and let us represent their joint probability distribution as a vector in [0, 1]^(|AT|×|AS1|×···×|ASn|). Thus, the set of all joint probability distributions of (T, S1, …, Sn) forms a simplex that we denote by ∆P. Note that isx∩ and πsx± are functions of the probability distribution of (T, S1, …, Sn), and so they can be differentiable w.r.t. the probability distribution. Formally, for a given (T, S1, …, Sn), we show that isx∩ and πsx± are differentiable over the interior of ∆P.

Since log2 is continuously differentiable over the open domain R+, it follows from definitions (15a) and (15b) that isx+∩ and isx−∩ are both continuously differentiable over the interior of ∆P. Now, for α ∈ A([n]), using theorem A.1 and proposition A.3,

πsx+(t : α) = ∑_{γ ∈ P(α− \ {γ1})} (−1)^{|γ|} log2((p(γ) + d1) / p(γ)),   (16)

where α− = {γ1, γ2, …, γk} is the set of children of α, ordered increasingly w.r.t. their probability mass, i.e., α− := {β ∈ A([n]) | β ≺ α and β ⪯ γ ≺ α ⇒ β = γ}. Hence, πsx+ is continuously differentiable over the interior of ∆P, since the function x ↦ (x + d1)/x and its inverse are continuously differentiable over the open domain R+. Similarly, πsx− is continuously differentiable over the interior of ∆P.
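The claimed smoothness can also be probed numerically. The following sketch (our own, with a perturbation direction chosen by us) compares a central finite difference of isx∩ for XOR against the derivative obtained by hand from the linearity of P(t), P(W), and P(t ∩ W) in the pmf entries.

```python
import math

def p_event(pmf, pred):
    return sum(q for o, q in pmf.items() if pred(o))

def i_sx(pmf, t, s, collections):
    # i_sx(t : a1;...;am) = log2 P(t & W) / (P(t) P(W)), W = a1 u ... u am
    in_w = lambda o: any(all(o[j] == s[j - 1] for j in a) for a in collections)
    p_w = p_event(pmf, in_w)
    p_t = p_event(pmf, lambda o: o[0] == t)
    p_tw = p_event(pmf, lambda o: o[0] == t and in_w(o))
    return math.log2(p_tw / (p_t * p_w))

xor = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25}

def i_sx_on_line(eps):
    # move mass eps from outcome (1,0,1) to (0,0,0); stays inside the simplex
    q = dict(xor)
    q[(0, 0, 0)] += eps
    q[(1, 0, 1)] -= eps
    return i_sx(q, 0, (1, 1), [{1}, {2}])

h = 1e-5
num_grad = (i_sx_on_line(h) - i_sx_on_line(-h)) / (2 * h)
# Along this direction P(t) = 1/2 + eps, P(W) = 3/4 - eps, and P(t & W) = 1/4
# stays constant, so the derivative at eps = 0 is -(1/ln 2)(1/P(t) - 1/P(W)).
exact_grad = -(2.0 - 4.0 / 3.0) / math.log(2)
```

The finite-difference estimate agrees with the hand-derived slope, consistent with isx∩ being smooth in the interior of the simplex.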

V. DISCUSSION

In this section, we first present further properties of isx∩. Then, we provide an operational interpretation of isx∩ and suggest an approach to compare this operational interpretation with that of other measures. Following this, we give the intuition behind the "intrinsic dependence" of PID atoms for joint source–target distributions where the number of these atoms is larger than these distributions' alphabet size. Finally, we provide two applications for which isx∩ is particularly well suited and discuss the computational complexity of isx∩.

A. Direct consequences of isx∩ being a local mutual information

The fact that isx∩ has the form of a regular local mutualinformation has several interesting consequences.

a. Implied entropy decomposition. Since the local entropy of a realization of a set of variables can be written as a self-mutual information, our decomposition also directly implies an entropy decomposition that inherits the properties of the lattices described in section IV. We start from the local entropy h(a1, …, am) of a set of collections of realizations of variables Si = si. Note that these collections have to be considered jointly, hence the comma [27]. Thus, we can equally well write the entropy that is to be decomposed as h(si | i ∈ ⋃ aj), i.e., we can consider the si together as a joint random variable whose entropy is to be decomposed. This can be done by realizing first that

h(si | i ∈ ⋃ aj) = i(si | i ∈ ⋃ aj : si | i ∈ ⋃ aj),

and then applying our PID formalism. In this decomposition, terms of the form isx∩(si | i ∈ ⋃ aj : a1; …; am) =: hsx∩(a1; …; am) appear. In other words, on the target side of the arguments of isx∩ we will always find the joint random variable, whereas the collections appear as usual on the source side.

b. Target chain rule and average measures. Another consequence is that isx∩ satisfies a target chain rule for a composite target realization t = (t1, t2) of a target variable T = (T1, T2):

isx∩(t1, t2 : a1; a2; …; am) = isx∩(t1 : a1; a2; …; am) + isx∩(t2 : a1; a2; …; am | t1),


FIG. 3. Worked example of isx∩ for the classical Xor. Let T = XOR(S1, S2) with S1, S2 ∈ {0, 1} independent and uniformly distributed, and consider the realization (s1, s2, t) = (1, 1, 0). (A–B) The sample space Ω and the realized event (gold (gray) frame). (C) The exclusion of events induced by learning that S1 = 1, i.e., of the events with S1 = 0 (gray). (D) The same for learning S2 = 1. (E) The union of the exclusions fully determines the event (1, 1, 0) and yields 1 bit of i(t = 0 : s1 = 1, s2 = 1). (F) The shared exclusion, i.e., the intersection of the exclusions induced by s1 = 1 and s2 = 1, excludes only (0, 0, 0). This is a misinformative exclusion, as it raises the probability of events that did not happen (t = 1) relative to those that did happen (t = 0), compared to the case of complete ignorance. (G) Learning about one full variable, i.e., obtaining the statement that S1 = 1, adds additional probability mass to the exclusion (green (light gray)). The shared exclusion (red (dark gray)) and the additional unique exclusion (green (light gray)) induced by s1 together create an exclusion that is uninformative, i.e., the probabilities for t = 0 and t = 1 remain unchanged by learning s1 = 1. At the level of the πsx atoms, the shared and the unique information atoms cancel each other. (H) The lattice with isx∩ and πsx terms for this realization. Other realizations are equivalent by the symmetry of XOR; thus, the averages yield the same numbers. Note that the necessity to cancel the negative shared information twice, to obtain both i(t = 0 : s1 = 1) = 0 and i(t = 0 : s2 = 1) = 0, results in a synergy < 1 bit. Also note that while adding the shared exclusion from (F) and the unique exclusions for s1 and s2 results in the full exclusion from (E), information atoms add differently due to the nonlinear transformation of excluded probability mass into information via −log2 p(·); compare (H).

where the second term is

isx∩(t2 : a1; a2; …; am | t1) = log2 [ (P(t2 | t1) − P(t2, ā1, …, ām | t1)) / (1 − P(ā1, …, ām | t1)) ] − log2 P(t2 | t1),

with āi denoting the complement of the event ai, so that ā1, …, ām stands for the shared exclusion ā1 ∩ … ∩ ām. Moreover, by the linearity of the averaging, a corresponding target chain rule is satisfied for the average shared information Isx∩, defined by


Isx∩(T : A1; …; Am) := ∑_{t,s1,…,sn} p(t, s1, …, sn) isx∩(t : a1; …; am) = ∑_{t,s1,…,sn} p(t, s1, …, sn) i(t : Wa1,…,am = 1),   (17)

where the probabilities related to the indicator variable Wa1,…,am have to be recomputed for each possible combination of source and target realizations. Note that this indicator variable simply indicates the truth of the statement Wa1,…,am from section II. Also note that in Eq. (17) the averaging still runs over all combinations of t, s1, …, sn, and the weights are still given by p(t, s1, …, sn), not p(t, Wa1,…,am = 1). Having different variables in the averaging weights and in the local mutual information terms makes the average shared information structurally different from a mutual information [28]. One consequence of this is that in principle the average Isx∩ can be negative. This also holds for the averages of the other information atoms on the lattice (see the next section for the lattice structure). Thus, the local shared information may be expressed as a local mutual information with an auxiliary variable constructed for that purpose, but multiple such variables have to be constructed for a definition of a global shared information.
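As a sketch of Eq. (17) (our own code, not from the paper), the average Isx∩ for XOR with the realization-wise weights p(t, s1, s2) comes out negative, since by symmetry every realization contributes the same local value log2(2/3):

```python
import math

def p_event(pmf, pred):
    return sum(q for o, q in pmf.items() if pred(o))

def i_sx(pmf, t, s, collections):
    # local shared info: log2 P(t & W) / (P(t) P(W)), W = union of events a_i
    in_w = lambda o: any(all(o[j] == s[j - 1] for j in a) for a in collections)
    p_w = p_event(pmf, in_w)
    p_t = p_event(pmf, lambda o: o[0] == t)
    p_tw = p_event(pmf, lambda o: o[0] == t and in_w(o))
    return math.log2(p_tw / (p_t * p_w))

xor = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25}

# Eq. (17): weight each local term with p(t, s1, s2); the statement event W
# is rebuilt for every realization o = (t, s1, s2).
avg_shared = sum(q * i_sx(xor, o[0], o[1:], [{1}, {2}]) for o, q in xor.items())
```

Here `avg_shared` equals log2(2/3) < 0, illustrating that the average shared information, unlike a proper mutual information, can be negative.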

c. Upper bounds. First, we can assess the self-shared information of a collection of variables:

isx∩(a1; …; am : a1; …; am) := i(Wa1,…,am = 1 : Wa1,…,am = 1) = h(Wa1,…,am = 1),   (18)

where the notation a1; …; am means the event defined by the complement of the intersection of the exclusions induced by the ai, as before. This quantity is greater than or equal to zero and is an upper bound on the shared information that the source variables can have about any realization u of any target variable U, i.e.,

isx∩(a1; a2; …; am : a1; a2; …; am) ≥ isx∩(u : a1; a2; …; am)

for any u ∈ AU. This upper bound has conceptual links to the maximum extractable shared information from [29]. Moreover, this upper bound may be nonzero even for independent sources, showing how the so-called mechanistic shared information arises.

B. Operational interpretation of isx∩

Being a local mutual information, isx∩ keeps all the operational interpretations of that measure. For example, in keeping with Woodward [30], it measures the information available in the statement W for inference about the value t of the target. Specifically, a negative value of the local shared information indicates that an agent who is only in possession of the shared information is more likely to mispredict the outcome of the target (e.g., FIGs 3, 4) than without the shared information; a positive value means that the shared information makes the agent more likely to choose the correct outcome. The unsigned magnitude of the shared information informs us about how relatively certain the agent should be about their prediction.

What remains to be clarified then is the meaning of the average expression Isx∩. As detailed above, the average is taken with respect to the probabilities of the realizations of the source variables and the target variable, not with respect to the dummy variables encoding the truth value of the respective statements W, as an average mutual information would require. To understand the meaning of this particular average it is instructive to start by ruling out two false interpretations. Again, consider an agent who tries to predict the correct value of the target t. In order to do so, the agent utilizes a particular information channel.

For the first false interpretation, consider a channel that takes the realizations of sources and target and produces the statements W carrying the shared information. If the receiver of this channel used it multiple times in the case of a negative Isx∩, then this receiver would learn that the shared information received is negative on average and could modify their judgment. This leads us to a second false interpretation: the average could be understood as an average over an ensemble of agents, where each agent uses the above channel only once, thus avoiding the issue just described. Even in this scenario, however, there is a problem: if the agent knew that the information provided by W is shared by the true source realizations, then the agent could derive the truth of all sub-statements of W. Accordingly, the agents would receive more than only the shared information.

In order to obtain the appropriate interpretation of shared information we have to consider a channel that masks the metainformation that all substatements of W are true, and that also makes learning impossible. This is achieved by a channel that produces true statements V about the source variables which have the logical structure of W, but do not always carry shared information. Consider the information shared by all sources. In this case the channel would randomly produce (true) statements of the form Vs1,…,sn = ((S1 = s1) ∨ … ∨ (Sn = sn)), but where some of the substatements might be false. Then V does not always carry shared information (only in case all substatements happen to be true). The receiver knows the joint distribution of sources and target and performs inference on t in a Bayes-optimal way. Such a channel would provide non-negative average mutual information. However, for a channel of this kind, the average taken to compute Isx∩ is only over those channel uses where V actually did encode shared information. In certain cases this average can be negative (see Table I).

As already alluded to above, the setting of our operational interpretation contrasts with that of other approaches to PID that take the perspective of multiple agents having full access to individual source variables


TABLE I. V-channel for Xor. Left: probability masses for each realization. Middle: equiprobable V-statements associated with each realization, such that the respective statement carrying shared information is listed first (marked by W). Right: the target predicted from V, where ✓ refers to correct predictions and ✗ to incorrect ones. Using V, a receiver obtains positive average mutual information, but the contribution of the W statements is negative. Bottom: the sign of IV, the average information provided by all V-statements, and that of Isx∩.

Realization      | Channel output           | Inference
 p   s1 s2 t     | V-statement              | predicted t | Correct?
1/4  0  0  0     | (S1 = 0) ∨ (S2 = 0) (W)  | 1           | ✗
                 | (S1 = 0) ∨ (S2 = 1)      | 0           | ✓
                 | (S1 = 1) ∨ (S2 = 0)      | 0           | ✓
1/4  0  1  1     | (S1 = 0) ∨ (S2 = 1) (W)  | 0           | ✗
                 | (S1 = 0) ∨ (S2 = 0)      | 1           | ✓
                 | (S1 = 1) ∨ (S2 = 1)      | 1           | ✓
1/4  1  0  1     | (S1 = 1) ∨ (S2 = 0) (W)  | 0           | ✗
                 | (S1 = 1) ∨ (S2 = 1)      | 1           | ✓
                 | (S1 = 0) ∨ (S2 = 0)      | 1           | ✓
1/4  1  1  0     | (S1 = 1) ∨ (S2 = 1) (W)  | 1           | ✗
                 | (S1 = 1) ∨ (S2 = 0)      | 0           | ✓
                 | (S1 = 0) ∨ (S2 = 1)      | 0           | ✓

IV(T : S1; S2) > 0 (4 ✗ and 8 ✓) and Isx∩(T : S1; S2) < 0 (4 ✗ and 0 ✓)

(or collections thereof), and that then design measures of unique and redundant information based on actions these agents can take, or rewards they obtain, in decision- or game-theoretic settings based on their access to full source variables (e.g., in [9, 18, 26]). While certainly useful in the scenarios invoked in [9, 18, 26], we feel that these operational interpretations may almost inevitably mix inference problems (i.e., information theory proper) with decision theory. Also, they typically bring with them the use of minimization or maximization operations to satisfy the competitive settings of decision or game theory. This, in turn, renders it difficult to obtain a differentiable measure of local shared information.

In sum, we feel that the question of how to decompose the information provided by multiple source variables about a target variable may indeed not be a single question, but multiple questions in disguise. The most useful answer will therefore depend on the scenario in which the question arose. Our answer seems to be useful in communication settings and where quantitative statements about dependencies between variables are important (e.g., the field of statistical inference, where the PID enumerates all possible types of dependencies of the dependent (target) variable on the independent (source) variables).

C. Evaluation of Isx∩ on P and on optimization distributions obtained in other frameworks

Since our approach to PID relies only on the original joint distribution P, it can be applied to other PID frameworks in which distributions Q(P) are derived from the original P of the problem, e.g., via optimization procedures, as is done for example in [9, 26]. This yields some additional insights into the operational interpretation of our approach compared to others, by highlighting how the optimization from P to Q(P) shifts information between PID atoms in our framework.

D. Number of PID atoms vs alphabet size of the joint distribution

The number of lattice nodes rises very rapidly with increasing numbers of sources. Thus, the number of lattice nodes may outgrow the joint symbol count of the random variables, i.e., the number of entries in the joint probability distribution. One may ask, therefore, about the independence of the atoms on the lattice in those cases (remember that the atoms were introduced in order to have the "independent" information contributions of the respective variable configurations at the lattice nodes). As shown in Figs. 5 and 6, our framework reveals multiple additional constraints at the level of exclusions via the family of mappings from Proposition A.3. This explains mechanistically why not all atoms are independent in cases where the number of atoms is larger than the number of symbols in the joint distribution.

E. Key applications

Due to the fact that PID solves a basic information-theoretic problem, its applications seem to cover almost all fields where information theory can be applied. Here, we focus on two applications for which our measure is particularly well suited: the first application requires localizability and differentiability; the second application does not require differentiability, but requires at least continuity of the measure on the space of the underlying probability distributions.

1. Learning neural goal functions

In [14] we argued that information theory, and in particular the PID framework, lends itself to unifying various neural goal functions, e.g., infomax and others. We also showed how to apply this to learning in neural networks via the coherent infomax framework of Kay and Phillips [15]. Yet, this framework was restricted to goal functions expressible using combinations, albeit complex ones, of terms from classic information theory, due to the lack of a differentiable PID measure. Goal functions that were only expressible using PID proper could not be learned in the Kay and Phillips framework, and in those cases PID would only serve to assess the approximation loss.

Our new measure removes this obstacle, and neural networks or even individual neurons can now be devised to


learn pure PID goal functions. A possible key application is in hierarchical neural networks with a hierarchy of modules, where each module contains two populations of neurons. These two populations represent supra- and infragranular neurons and coarsely mimic their different functional roles. One population represents so-called layer 5 pyramidal cells; it serves to send the shared information between their bottom-up (e.g., sensory) inputs and their top-down (contextual) inputs downwards in the hierarchy. The other population represents layer 3 pyramidal cells and sends the synergy between the bottom-up inputs and the top-down inputs upwards in the hierarchy. For the first population the extraction of shared information between higher and lower levels in the hierarchy can be roughly equated to learning an internal model, while for the second population the extraction of synergy is akin to computing a generalized error (see [31, 32] and references therein for the neuroanatomic background of this idea). Thus, a hierarchical network of this kind can perform an elementary type of predictive coding. The full details of this application scenario are, however, the topic of another study.

2. Information modification in distributed computation in complex systems

If one desires to frame distributed computation in complex systems in terms of the elementary operations on information performed by a Turing machine, i.e., the storage, transfer, and modification of information, then information-theoretic measures for each of these component operations are required. For storage and transfer, well-established measures are available, i.e., the active information storage [10] and the transfer entropy [11–13]. For modification, in contrast, no established measures exist, but an appropriate measure of synergistic mutual information from a partial information decomposition has been proposed as a candidate measure of information modification [33]. An appropriate measure in this context has to be localizable (i.e., it must be possible to evaluate the measure for a single event) in order to serve an analysis of computation locally in space and time, and it has to be continuous in terms of the underlying probability distribution. Both of these conditions were already met by the PPID measure of Finn and Lizier [18]; our novel measure adds the possibility to differentiate the measure on the interior of the probability simplex, which makes it even more like a classic information measure. This is important for determining the input distribution that maximizes synergy in a system, i.e., the input distribution that reveals the information modification capacity of the computational mechanism in a system, as suggested in [34].

F. Computational complexity of the PID using isx∩

Real-world applications of PID will not necessarily be confined to the standard two-input-variable case, hence the importance of the organization scheme for higher-order terms that is provided by the lattice structure. For such real-world problems, the computational complexity of computing each atom on the lattice becomes important, not least because of the potentially large number of atoms (see below). This holds in particular when additional nonparametric statistical tests of PID measures obtained from data require many recomputations of the measures. We therefore discuss the computational complexity of our approach.

For each realization s = (s1, …, sn) and t, our PPID is obtained by computing the atoms πsx±(t : α) for each α ∈ A([n]). In Appendix A, we show that any πsx±(t : α) is evaluated as follows:

πsx±(t : α) = isx±∩(t : α) − ∑_{β ≺ α} πsx±(t : β)   for all α, β ∈ A([n]),

where computing any isx∩(t : α) is linear in the size of AT,S, the alphabet of the joint random variable (T, S1, …, Sn). Moreover, using isx∩ as a redundancy measure, the closed form of πsx± derived in (16) shows that the computation of our PID is trivially parallelizable over atoms and realizations, which is crucial for larger numbers of sources. The importance of parallelization is due to the rapid growth of the number M of PID terms when the number of sources gets larger, for any lattice-based PID measure. This M grows super-exponentially as the n-th Dedekind number, M = d(n) − 2. At present, even enumerating M is practically intractable beyond n > 8.
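For two sources, this recursion can be sketched explicitly (our own code, not from the paper; the node labels and hard-coded order relation describe the n = 2 lattice {1}{2} ≺ {1}, {2} ≺ {1,2}, and the same recursion applies separately to the informative and misinformative parts):

```python
import math

def p_event(pmf, pred):
    return sum(q for o, q in pmf.items() if pred(o))

def i_sx(pmf, t, s, collections):
    # redundancy at a lattice node: log2 P(t & W) / (P(t) P(W))
    in_w = lambda o: any(all(o[j] == s[j - 1] for j in a) for a in collections)
    p_w = p_event(pmf, in_w)
    p_t = p_event(pmf, lambda o: o[0] == t)
    p_tw = p_event(pmf, lambda o: o[0] == t and in_w(o))
    return math.log2(p_tw / (p_t * p_w))

# n = 2 redundancy lattice: collections per node and strict down-sets
COLLS = {"{1}{2}": [{1}, {2}], "{1}": [{1}], "{2}": [{2}], "{1,2}": [{1, 2}]}
BELOW = {"{1}{2}": [], "{1}": ["{1}{2}"], "{2}": ["{1}{2}"],
         "{1,2}": ["{1}{2}", "{1}", "{2}"]}

def atoms(pmf, t, s):
    pi = {}
    for node in ("{1}{2}", "{1}", "{2}", "{1,2}"):  # bottom-up through the lattice
        pi[node] = i_sx(pmf, t, s, COLLS[node]) - sum(pi[b] for b in BELOW[node])
    return pi

xor = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25}
pi = atoms(xor, 0, (1, 1))  # realization (s1, s2, t) = (1, 1, 0)
```

For this realization the atoms come out as in FIG. 3H: a negative shared atom log2(2/3), unique atoms of log2(3/2) each, and a synergy of log2(4/3) < 1 bit.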

VI. EXAMPLES

In this section, we present the PID provided by our isx∩ measure for some exemplary probability distributions. Most of the distributions are chosen from Finn and Lizier [18] and previous examples in the PID literature. The code for computing πsx is available in the IDTxl toolbox http://github.com/pwollstadt/IDTxl [35].

A. Probability distribution PwUnq

We start with the pointwise unique distribution (PwUnq) introduced by Finn and Lizier [18]. This distribution is constructed such that, for each realization, only one of the sources holds complete information about the target while the other holds no information. The aim was to construct a distribution in which at no point (realization) do the two sources give the same information about the target. Hence, Finn and Lizier argue that, for such a distribution, there should be no shared information. This distribution also highlights the need for a pointwise analysis of the PID problem.


TABLE II. PwUnq example. Left: probability masses for each realization. Right: the informative and misinformative pointwise partial information decomposition. Bottom: the average partial information decomposition.

                 |        πsx+          |        πsx−
 p   s1 s2 t     | {1}{2} {1} {2} {1,2} | {1}{2} {1} {2} {1,2}
1/4  0  1  1     |   1     0   1    0   |   1     0   0    0
1/4  1  0  1     |   1     1   0    0   |   1     0   0    0
1/4  0  2  2     |   1     0   1    0   |   1     0   0    0
1/4  2  0  2     |   1     1   0    0   |   1     0   0    0
Average values   |   1    1/2 1/2   0   |   1     0   0    0

Πsx(T : {1}{2}) = 0, Πsx(T : {1}) = 1/2, Πsx(T : {2}) = 1/2, Πsx(T : {1, 2}) = 0

Since in all of the realizations the shared exclusion does not alter the likelihood of any of the target events compared to the case of total ignorance, isx∩ will indeed give zero redundant information. Thus, the PID terms resulting from isx∩ are the same as those resulting from the rmin [18] and Iccs [26] measures (see table II).

Recall Assumption (∗) of Bertschinger et al. [9], which states that the unique and shared information should only depend on the marginal distributions P(S1, T) and P(S2, T). Finn and Lizier [18] showed that all measures satisfying Assumption (∗) result in no unique information, i.e., nonzero redundant information, whenever P(S1, T) is isomorphic to P(S2, T). The PwUnq distribution falls into this category, for which Imin [4], Ired [6], UI [9], and SVK [36] do not register unique information of S1 and S2. This is due to Assumption (∗) not taking into consideration the pointwise nature of information. Specifically, a measure that satisfies Assumption (∗) is agnostic to the fact that at each realization T = j is uniquely determined by S1 or S2 but never both. On the contrary, such a measure registers this as a mixture of shared and synergistic contributions, since neither S1 nor S2 can fully determine T = j on its own, but taken together they partly determine T = j.
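The PwUnq claim can be replayed numerically (our own sketch, using the outcome encoding (t, s1, s2)): for every realization the informative and misinformative parts are both exactly 1 bit, so the total shared information vanishes, matching table II:

```python
import math

def p_event(pmf, pred):
    return sum(q for o, q in pmf.items() if pred(o))

def i_sx_parts(pmf, t, s, collections):
    # return (total, informative, misinformative) per Eqs. (14a), (15a), (15b)
    in_w = lambda o: any(all(o[j] == s[j - 1] for j in a) for a in collections)
    p_w = p_event(pmf, in_w)
    p_t = p_event(pmf, lambda o: o[0] == t)
    p_tw = p_event(pmf, lambda o: o[0] == t and in_w(o))
    plus = math.log2(1.0 / p_w)
    minus = math.log2(p_t / p_tw)
    return plus - minus, plus, minus

# PwUnq with outcomes (t, s1, s2)
pwunq = {(1, 0, 1): 0.25, (1, 1, 0): 0.25, (2, 0, 2): 0.25, (2, 2, 0): 0.25}
results = [i_sx_parts(pwunq, o[0], o[1:], [{1}, {2}]) for o in pwunq]
```

Each entry of `results` is (0, 1, 1): zero shared information, composed of an informative bit exactly cancelled by a misinformative bit.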

B. Probability distribution XOR

Using our formulation of isx∩ results in negative local shared information for the classic XOR example. To see this, assume that S1 and S2 are independent, uniformly distributed random bits and T = XOR(S1, S2), and consider the realization (s1, s2, t) = (1, 1, 0). From Eq. (12) we get

isx∩(t = 0 : s1 = 1; s2 = 1) = log2 [ (1/2 − 1/4) / (1 − 1/4) ] + log2 [ 1 / (1/2) ] = log2(2/3) < 0.

We argue that this result reflects that an agent receiving the shared information is misinformed (see, e.g., [22] for the concept of misinformation) about t. To understand the source of this misinformation, consider that the agent is only provided with the shared information, i.e., the agent knows only that Ws1,…,sn is true. This means the agent is being told the following: "One of the two sources has outcome 1, and we do not know which one." This

TABLE III. RndErr example. Left: probability masses for each realization. Right: the informative and misinformative pointwise partial information decomposition. Bottom: the average partial information decomposition. We set a = log2(8/5), b = log2(8/7), c = log2(5/4), d = log2(7/4), e = log2(16/15), f = log2(16/17), and g = log2(4/3).

                 |        πsx+          |        πsx−
 p   s1 s2 t     | {1}{2} {1} {2} {1,2} | {1}{2} {1} {2} {1,2}
3/8  0  0  0     |   a     c   c    e   |   0     0   g    0
3/8  1  1  1     |   a     c   c    e   |   0     0   g    0
1/8  0  1  0     |   b     d   d    f   |   0     0   2    0
1/8  1  0  1     |   b     d   d    f   |   0     0   2    0
Average values   | 0.557 0.443 0.443 0.367 | 0   0  0.811  0

Πsx(T : {1}{2}) = 0.557, Πsx(T : {1}) = 0.443, Πsx(T : {2}) = −0.367, Πsx(T : {1, 2}) = 0.367

will let the agent predict that the joint realization is one out of three equally probable realizations: (1, 1, 0), (0, 1, 1), or (1, 0, 1) (see FIG 3). Of these three realizations, only one points to the correct target realization t = 0, while the other two point to the "wrong" t = 1, leading to odds of 1:2, whereas t = 0 and t = 1 were equally probable before the agent received the shared information from the sources. As a consequence, the local shared information becomes negative [37]. Finally, the XOR gate demonstrates an example of negative shared information; we note that in general unique (e.g., table III) and synergistic information can be negative as well.
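The odds argument can be checked directly (our own sketch): conditioning on the truth of W = (S1 = 1) ∨ (S2 = 1) leaves three equally likely joint realizations, of which only one has t = 0, so the posterior for the true target value drops from 1/2 to 1/3:

```python
import math

# XOR with outcomes (t, s1, s2); realization (s1, s2, t) = (1, 1, 0)
xor = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25, (0, 1, 1): 0.25}
in_w = lambda o: o[1] == 1 or o[2] == 1  # the statement W: (S1 = 1) or (S2 = 1)
p_w = sum(q for o, q in xor.items() if in_w(o))
# posterior for the true target value t = 0 after learning that W is true
posterior = sum(q for o, q in xor.items() if o[0] == 0 and in_w(o)) / p_w
i_sx_val = math.log2(posterior / 0.5)  # prior P(t = 0) was 1/2
```

Here `posterior` is 1/3 and `i_sx_val` is log2(2/3) < 0, reproducing the negative local shared information of the text.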

C. Probability distribution RndErr

Recall Rnd, the redundant probability distribution, where both sources are fully informative about the target and exhibit the same information. More precisely, the redundant realizations, s1 = s2 = t = 0 and s1 = s2 = t = 1, are the only realizations that occur, and they are equally likely. Derived from Rnd, RndErr is a noisy redundant distribution of two sources in which one source occasionally misinforms about the target while the other remains fully informative about the target. Moreover, if S2 is the source that occasionally misinforms about the target, then the faulty realizations, namely s2 ≠ s1 = t = 0 and s2 ≠ s1 = t = 1, are equally likely, but less likely than the redundant ones. We stick to the probability masses given in [18], 3/8 for each redundant realization and 1/8 for each faulty realization, and speculate that S2 will hold misinformative (negative) unique information about T.

For this distribution, our measure results in the following PID: misinformative unique information for S2, informative unique information for S1, informative shared information, and informative synergistic information that balances the misinformation of S2 (see table III).


D. Probability distribution XorDuplicate

In this distribution, we extend the Xor distribution by adding a third source S3 such that (i) S3 is a copy of one of the two original sources and (ii) S3 does not have an additional effect on the target, e.g., if S3 is a copy of S1 then T := Xor(S1, S2) = Xor(S2, S3). Let S1 and S2 be two independent, uniformly distributed random bits, let S3 be a copy of S1, and let T = Xor(S1, S2). This distribution of (S1, S2, S3, T) is called XorDuplicate, where the only nonzero realizations are (0, 0, 0, 0), (0, 1, 0, 1), (1, 0, 1, 1), and (1, 1, 1, 0).

The key point is that the target T in the classical Xor is specified only by (S1, S2), whereas in XorDuplicate the target is equally specified by the coalitions (S1, S2) and (S2, S3). This means that the synergy Πsx(T : {1, 2}) in Xor should be captured by the term Πsx(T : {1, 2}{2, 3}) in XorDuplicate.

The XorDuplicate distribution was suggested by Griffith and Koch [36]. The authors speculated that their definition of synergy SVK must be invariant to duplicates for this distribution, Πsx(T : {1, 2}{2, 3}) = Πsx(T : {1, 2}), since the mutual information is invariant to duplicates, I(T : S1, S2, S3) = I(T : S1, S2). Moreover, they proved that SVK is invariant to duplicates in general [36].

For the shared exclusion measure isx∩, it is evident that the invariance property will hold, since the shared information is itself a mutual information and it is easy to see that isx∩(t : s1; s2; s3) = isx∩(t : s1; s2). In fact, we show below that all the PID terms are invariant to the duplication. That is, the unique information of S2 is invariant and captured by Πsx(T : {2}). The unique information of S1 is also invariant, but it is captured by the atom Πsx(T : {1}{3}), since it is information shared by S1 and S3, S3 being a copy of S1. Finally, the synergistic information is invariant; however, it is captured by Πsx(T : {1, 2}{2, 3}), since the coalitions (S1, S2) and (S2, S3) can equally specify the target. These claims are shown below by replacing s3 by s1 and applying the monotonicity axiom IV.2 to isx+∩ and isx−∩. Note that, due to symmetry, all realizations have equal PID terms, and the difference between the informative and misinformative components is computed implicitly.

For any (t, s1, s2, s3) with nonzero probability mass, we have

isx∩ (t : s1; s2; s3) = isx∩ (t : s1; s2) = isx∩ (t : s2; s3) = −0.5849

isx∩ (t : s1; s3) = isx∩ (t : s1) = isx∩ (t : s3) = 0

implying that

πsx(t : {1}{2}) = πsx(t : {2}{3}) = 0

πsx(t : {1}{3}) = −πsx(t : {1}{2}{3}) = 0.5849.

But isx∩(t : s2; s1, s3) = isx∩(t : s2; s3) = isx∩(t : s1; s2) = −0.5849, meaning that

πsx(t : {2}{1, 3}) = 0

πsx(t : {2}) = isx∩(t : s2) − isx∩(t : s2; s1, s3) = 0.5849.

Furthermore,

isx∩ (t : s1; s2, s3) = isx∩ (t : s1; s1, s2) = isx∩ (t : s1) = 0

isx∩ (t : s3; s1, s2) = isx∩ (t : s1; s1, s2) = isx∩ (t : s1) = 0

isx∩ (t : s1, s2; s1, s3; s2, s3) = isx∩ (t : s1) = 0

and so

πsx(t : {1}{2, 3}) = πsx(t : {3}{1, 2}) = 0

πsx(t : {1, 2}{2, 3}) = 0.415

πsx(t : {1, 2}{1, 3}) = πsx(t : {1, 3}{2, 3}) = 0.

Finally, we have

isx∩ (t : s1, s2, s3) = isx∩ (t : s1, s2) = isx∩ (t : s2, s3) = 1

isx∩ (t : s1, s3) = 0

and thus it is easy to see that their corresponding atoms are zero.
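The duplicate invariance of isx∩ can be checked by brute force. The sketch below is our own illustration: it builds the XorDuplicate distribution and verifies that isx∩(t : s1; s2; s3) equals isx∩(t : s1; s2) at every realization, since the event {S3 = s3} coincides with {S1 = s1}.

```python
from math import log2

# XorDuplicate: S3 is a copy of S1 and T = Xor(S1, S2); outcomes (s1, s2, s3, t).
pmf = {(0, 0, 0, 0): 0.25, (0, 1, 0, 1): 0.25,
       (1, 0, 1, 1): 0.25, (1, 1, 1, 0): 0.25}

def i_sx(pmf, t, s, alpha):
    """Local shared information for an antichain alpha of 0-based index tuples."""
    in_union = lambda o: any(all(o[i] == s[i] for i in a) for a in alpha)
    p_a  = sum(q for o, q in pmf.items() if in_union(o))
    p_t  = sum(q for o, q in pmf.items() if o[-1] == t)
    p_ta = sum(q for o, q in pmf.items() if o[-1] == t and in_union(o))
    return log2(p_ta / (p_t * p_a))

for (s1, s2, s3, t), q in pmf.items():
    s = (s1, s2, s3)
    three = i_sx(pmf, t, s, [(0,), (1,), (2,)])   # i_sx(t : s1; s2; s3)
    two   = i_sx(pmf, t, s, [(0,), (1,)])         # i_sx(t : s1; s2)
    assert abs(three - two) < 1e-12               # invariant under duplication
    print(s, t, round(three, 4))                  # -0.585 at every realization
```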

E. Probability distribution 3-bit parity

Let S1, S2, and S3 be independent, uniformly distributed random bits, and T = Σ_{i=1}^3 Si mod 2. This distribution is the 3-bit parity, where T indicates the parity of the total number of 1-bits in (S1, S2, S3). Note that all possible realizations occur with probability 1/8 and result in the same PPID, as well as the same average PID, due to the symmetry of the variables. Table IV shows the informative and misinformative components, and their difference, for any realization. In addition, we illustrate in Figure 4 the results of πsx(t : {1, 2}{3, 4}) for the 4-bit parity distribution.
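The numbers in FIG. 4 can be reproduced directly. The sketch below is our own illustration: it computes the shared exclusion for the collections a1 = {1, 2} and a2 = {3, 4} in the 4-bit parity distribution and recovers both the rescaled target probability 3/7 and the resulting negative local shared information.

```python
from itertools import product
from math import log2

# 4-bit parity: 16 equally likely source patterns, t = parity of the bits.
pmf = {s + (sum(s) % 2,): 1 / 16 for s in product((0, 1), repeat=4)}

s, t = (0, 0, 1, 0), 1                      # the realization used in FIG. 4
a1, a2 = (0, 1), (2, 3)                     # collections {1,2} and {3,4} (0-based)
in_union = lambda o: (all(o[i] == s[i] for i in a1)
                      or all(o[i] == s[i] for i in a2))

p_union = sum(q for o, q in pmf.items() if in_union(o))                 # 7/16
p_t_and = sum(q for o, q in pmf.items() if o[-1] == t and in_union(o))  # 3/16
p_t = 0.5

print(p_t_and / p_union)                    # 3/7: target mass after rescaling
i_shared = log2(p_t_and / (p_t * p_union))  # log2(6/7), i.e. negative
print(round(i_shared, 4))
```

The local shared information itself is log2(6/7) ≈ −0.222 bit; the figure caption's value of −0.0145 bit is the atom πsx(t : {1, 2}{3, 4}) obtained after the Möbius inversion, not this raw shared term.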

Appendix A: Lattice structure: supporting proofs and further details

We show how the redundancy lattice can be endowed with isx±∩ separately to obtain consistent PID terms πsx±. Subsequently, we show that πsx± are nonnegative and thus that the PID terms are meaningful.

1. Informative and misinformative lattices

We start by explaining the redundancy lattice proposed by Williams and Beer. Then, we explain in detail how to apply isx∩ to obtain a PID.

As explained in Section IV, there is a one-to-one correspondence between the PID terms and the antichain combinations. Since isx∩ is defined locally, for every realization the antichain combinations are associated with the source events. In this way the PPID terms are computed, and their averages amount to the desired PID terms.

We use specific index sets, called antichains, to represent the antichain combinations, since antichain combinations are uniquely identified by the indices of


TABLE IV. 3-bit parity example. For each antichain (grouped by lattice level; symmetric atoms share a value): the average informative partial information decomposition Πsx+, the average misinformative partial information decomposition Πsx−, and the average partial information decomposition Πsx = Πsx+ − Πsx−.

  Atom(s)                                 Πsx+     Πsx−     Πsx
  {1,2,3}                                 0.2451   0         0.2451
  {1,2}, {1,3}, {2,3}                     0.1699   0         0.1699
  {1,2}{1,3}, {1,2}{2,3}, {1,3}{2,3}      0.0931   0         0.0931
  {1}, {2}, {3}                           0.3219   0         0.3219
  {1,2}{1,3}{2,3}                         0.0182   0.2451   -0.2268
  {1}{2,3}, {2}{1,3}, {3}{1,2}            0.0406   0.1699   -0.1293
  {1}{2}, {1}{3}, {2}{3}                  0.2224   0.4150   -0.1926
  {1}{2}{3}                               0.1926   0         0.1926

their source events. For instance, an antichain α = {a1, . . . , ak} with ai ⊂ [n], where [n] is the index set of the realization s = (s1, . . . , sn). Moreover, the ai ∈ α should be pairwise incomparable under inclusion, since antichain combinations are as such (see Section IV). E.g., {1, 2}{1, 3} represents the source event (s1 ∩ s2) ∪ (s1 ∩ s3), i.e., the combination of (s1, s2) and (s1, s3).

Let A([n]) be the set of all antichains; Crampton and Loizou [38] showed that there exists the following partial ordering over A([n]):

  α ⪯ β ⇔ ∀ b ∈ β, ∃ a ∈ α such that a ⊆ b,   for all α, β ∈ A([n]).

This partial ordering implies that any α, β ∈ A([n]) have an infimum α ∧ β ∈ A([n]) and a supremum α ∨ β ∈ A([n]), and so ⟨A([n]), ⪯⟩ is called a lattice. Now, when endowing ⟨A([n]), ⪯⟩ with a function f (say, a shared information) such that f(α) = Σ_{β⪯α} π(β), where the π(β) are desired quantities (say, PID terms) that have a one-to-one correspondence with the β ∈ A([n]), we can compute these π using f. Hence, we have reduced the problem of defining a different conceptual quantity for each antichain to defining a single conceptual quantity for each antichain, namely the shared mutual information.
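The ordering and the lattice operations are straightforward to implement. The sketch below is our own illustration (function names ours): it encodes antichains as frozensets of frozensets, implements the ordering of Crampton and Loizou, and realizes the meet α ∧ β as the inclusion-minimal elements of α ∪ β, the standard construction of the infimum for this ordering.

```python
def leq(alpha, beta):
    """alpha is below beta iff for every b in beta there is a in alpha with a ⊆ b."""
    return all(any(a <= b for a in alpha) for b in beta)

def meet(alpha, beta):
    """Infimum: inclusion-minimal elements of the union of the two antichains."""
    union = set(alpha) | set(beta)
    return frozenset(a for a in union if not any(b < a for b in union))

A = lambda *sets: frozenset(frozenset(x) for x in sets)

bottom = A({1}, {2}, {3})          # full redundancy {1}{2}{3}
top    = A({1, 2, 3})              # full synergy {1,2,3}
assert leq(bottom, top) and not leq(top, bottom)

# Meet of {1}{2,3} and {2}{1,3} is {1}{2}: {2,3} and {1,3} are not minimal.
m = meet(A({1}, {2, 3}), A({2}, {1, 3}))
print(sorted(sorted(x) for x in m))  # [[1], [2]]
```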

Williams and Beer coined this idea of endowing ⟨A([n]), ⪯⟩ with a redundancy measure I∩, hence the name "redundancy lattice." For this, they had a set of axioms that ensured (i) the one-to-one correspondence between A([n]) and the PID terms and (ii) that I∩(α) = Σ_{β⪯α} Π(β). However, their definition was not local (i.e., not for every realization), and thus Finn and Lizier [18] adapted the axioms to the local case. However, the local shared measure i∩ can take negative values, and the problem persists upon averaging. Thus, they proposed to decompose i∩ = i+∩ − i−∩, where i±∩ take only nonnegative values and can be interpreted as the informative and misinformative components of i∩. Altogether, for each realization we will endow ⟨A([n]), ⪯⟩ with isx+∩ (informative lattice) and isx−∩ (misinformative lattice) individually to obtain the πsx+ and πsx− PPID terms.

First, for any α ∈ A([n]), we define isx±∩ as follows:

  P(α) = P( ⋃_{a∈α} ⋂_{i∈a} si ),
  P(t, α) = P( ⋃_{a∈α} ⋂_{i∈a} (t ∩ si) ),

  isx∩(t : α) = log2 [1 / P(α)] − log2 [P(t) / P(t ∩ α)]
              = isx+∩(t : α) − isx−∩(t : α).

Now, to show that this endowment with isx±∩ is consistent, we prove Theorem IV.1, which shows that isx±∩ satisfy the PPID axioms.

FIG. 4. Worked example of isx∩ for a four-source-variables case. We evaluate the shared information isx∩(t : a1; a2) with a1 = {1, 2}, a2 = {3, 4}, s = (s1, s2, s3, s4) = (0, 0, 1, 0), and t = Parity(s) = 1. (A) Sample space; the relevant event is marked by the blue (gray) outline. (B) Exclusions induced by the two collections of source realization indices a1 (brown (dark gray)) and a2 (yellow (light gray)), and the shared exclusion relevant for isx∩ (gold (gray)). After removing and rescaling, the probability for the target event that was actually realized, i.e., t = 1, is reduced from 1/2 to 3/7. Hence the shared exclusion leads to negative shared information, πsx(t : {1, 2}{3, 4}) = −0.0145 bit.

proof of Theorem IV.1. By the symmetry of intersection, isx±∩ as defined in (14) satisfy the symmetry Axiom IV.1. For any collection a, using (14), the informative and misinformative shared information are

  isx+∩(t : a) = log2 [1 / p(a)] = h(a),
  isx−∩(t : a) = log2 [p(t) / p(t, a)] = h(a | t),

and so they satisfy Axiom IV.3. For Axiom IV.2, note that

  P(a1 ∪ a2 ∪ . . . ∪ am ∪ am+1) ≥ P(a1 ∪ a2 ∪ . . . ∪ am).

This implies that isx±∩ decrease monotonically if joint source realizations are added, where equality holds if there exists i ∈ [m] such that am+1 ⊇ ai (as index sets), i.e., if there exists i ∈ [m] such that the corresponding events satisfy am+1 ⊆ ai.

Then, we assume that

  isx±∩(t : α) = Σ_{β⪯α} πsx±(t : β)   for all α ∈ A([n]).   (A1)

Note that this assumption is logically sound and is discussed thoroughly in [21]. Finally, to obtain πsx±, we show that Eq. (A1) is invertible via a so-called Möbius inversion, given by the following theorem.

Theorem A.1. Let isx±∩ be measures on the redundancy lattice; then we have the following closed form for each atom πsx±:

  πsx±(t : α) = isx±∩(t : α) − Σ_{∅≠B⊆α−} (−1)^{|B|−1} isx±∩(t : ⋀B).   (A2)

The proof of the above theorem follows from that of [18,Theorem A1].
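For two sources, the closed form (A2) reduces to familiar inclusion-exclusion: the node {1} has the single child {1}{2}, so π(t : {1}) = i∩(t : {1}) − i∩(t : {1}{2}); the top node {1, 2} has children {1} and {2} with meet {1}{2}, so π(t : {1, 2}) = i∩(t : {1, 2}) − i∩(t : {1}) − i∩(t : {2}) + i∩(t : {1}{2}). The sketch below is our own illustration: it applies these two instances of (A2) to a realization of the XOR gate and recovers the atoms discussed in the main text.

```python
from math import log2

# XOR joint pmf over (s1, s2, t) and the realization (0, 0, 0).
pmf = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
s, t = (0, 0), 0

def i_sx(alpha):
    """Local shared information for an antichain of 0-based index tuples."""
    in_union = lambda o: any(all(o[i] == s[i] for i in a) for a in alpha)
    p_a  = sum(q for o, q in pmf.items() if in_union(o))
    p_t  = sum(q for o, q in pmf.items() if o[-1] == t)
    p_ta = sum(q for o, q in pmf.items() if o[-1] == t and in_union(o))
    return log2(p_ta / (p_t * p_a))

shared  = i_sx([(0,), (1,)])                  # pi(t : {1}{2}), the bottom atom
unique1 = i_sx([(0,)]) - shared               # Eq. (A2) with a single child
unique2 = i_sx([(1,)]) - shared
synergy = i_sx([(0, 1)]) - i_sx([(0,)]) - i_sx([(1,)]) + shared

print(round(shared, 4), round(unique1, 4), round(synergy, 4))
# The four atoms sum to the local joint mutual information (1 bit for XOR).
assert abs(shared + unique1 + unique2 + synergy - i_sx([(0, 1)])) < 1e-12
```

Note how the negative shared term of −0.585 bit is compensated twice by the unique atoms, leaving a synergy of 0.415 bit rather than 1 bit, exactly as described in footnote [37].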

2. Nonnegativity of πsx±

In order for our information decomposition to be interpretable, the informative and misinformative atoms, πsx±, must be nonnegative. First, we recall results from convex analysis that will come in handy later.

Theorem A.2 (Theorem 2.67 [39]). Let f : Rⁿ → R be a continuously differentiable function. Then f is convex if and only if, for all x and y,

  f(y) ≥ f(x) + ∇ᵀf(x)(y − x).

Proposition A.1. Let f : Rⁿ → R be a continuously differentiable convex function and y0 − x0 = c·1, where c ≥ 0. If f(x0) ≥ f(y0), then

  −Σ_i ∂f/∂xi(y0) ≤ −Σ_i ∂f/∂xi(x0).

Proof. For any x, y ∈ Rⁿ, using Theorem A.2 and interchanging the roles of x and y,

  −∇ᵀf(y)(y − x) ≤ f(x) − f(y) ≤ −∇ᵀf(x)(y − x).

Now consider x0, y0 ∈ Rⁿ such that y0 − x0 = c·1; then

  −c ∇ᵀf(y0)·1 ≤ −c ∇ᵀf(x0)·1,

i.e.,

  −Σ_i ∂f/∂xi(y0) ≤ −Σ_i ∂f/∂xi(x0).

We write down the proof of Theorem IV.2 and then show that isx±∩ are nonnegative.

proof of theorem IV.2. Let α, β ∈ A([n]) with α ⪯ β. Then α and β are of the form α = {a1, . . . , a_{kα}} and β = {b1, . . . , b_{kβ}}. Because α ⪯ β, there is a function f : β → α such that f(b) ⊆ b [40]. Now we have, for all b ∈ β,

  ⋂_{i∈b} si ⊆ ⋂_{i∈f(b)} si.

Hence,

  P(β) = P( ⋃_{b∈β} ⋂_{i∈b} si ) ≤ P( ⋃_{b∈β} ⋂_{i∈f(b)} si ) ≤ P( ⋃_{a∈α} ⋂_{i∈a} si ) = P(α).   (A3)

The last inequality is true because the term on its L.H.S. is the probability of a union of intersections related to collections a ∈ α (the f(b)), i.e., it is the probability of a union of events of the type ⋂_{i∈a} si. The probability of such a union can only get bigger if we take it over all events of this type. Using (A3), it immediately follows that isx+∩(t : α) ≤ isx+∩(t : β), i.e., isx+∩ is monotonically increasing. By the same argument, isx−∩ is monotonically increasing.

Proposition A.2. isx±∩ are nonnegative.

Proof. isx+∩(t : a1; a2; . . . ; am) = log2 [1 / P(a1 ∪ a2 ∪ . . . ∪ am)] ≥ 0. Similarly, the misinformative isx−∩(t : a1; a2; . . . ; am) = log2 [ P(t) / P( t ∩ [(⋂_{i∈a1} si) ∪ (⋂_{i∈a2} si) ∪ . . . ∪ (⋂_{i∈am} si)] ) ] ≥ 0.

We construct a family of mappings from P(α−), where α− is the set of children of α, to A([n]) (see FIG. 5). This family of mappings plays a key role in the desired proof of nonnegativity.

Proposition A.3. Let α ∈ A([n]) and let α− = {γ1, . . . , γk}, ordered increasingly w.r.t. the probability mass, be the set of children of α on ⟨A([n]), ⪯⟩. Then, for any 1 ≤ i ≤ k,

  fi : P(α− \ {γi}) → A([n]),   B ↦ (⋀_{β∈B} β) ∧ γi,

is a mapping such that P(fi(B)) = P(⋀_{β∈B} β) + di, where di = P(γi) − P(α) and P(α− \ {γi}) denotes the power set of α− \ {γi}.

Proof. Since γi ∈ α− and β ∈ α− for any β ∈ B, we have (⋀_{β∈B} β) ∨ γi = α. Now, for any B ∈ P(α− \ {γi}), using inclusion-exclusion together with β ∧ γi = β ∪ γi and β ∨ γi = ↑β ∩ ↑γi,

  P(fi(B)) = P( (⋀_{β∈B} β) ∧ γi ) = P(⋀_{β∈B} β) + P(γi) − P( (⋀_{β∈B} β) ∨ γi )
           = P(⋀_{β∈B} β) + P(γi) − P(α).

FIG. 5. The family of mappings introduced in Proposition A.3 that preserve the probability mass difference. Let α be the top node of A([3]). The orange (gray dotted) region is α−, the set of children of α. Each color depicts one mapping in the family, based on some γ ∈ α−. The dark red (solid line) mapping is based on γ1, the red (dash-dotted line) mapping is based on γ2, and the salmon (dotted line) mapping is based on γ3.

FIG. 6. Depiction of the set differences corresponding to the probability mass difference d1 introduced in Proposition A.3 and shown in FIG. 5, for the sets from FIG. 2.

The following lemma shows that, for any node α ∈ A([n]), the closed form of Eq. (A2) is nonnegative; this is the main step in the desired proof of nonnegativity.

Lemma A.1. Let α ∈ A([n]); then

  −log2 P(α) + Σ_{∅≠B⊆α−} (−1)^{|B|−1} log2 P(⋀B) ≥ 0.   (A4)

Proof. Suppose that |α−| = k and, w.l.o.g., that α− = {γ1, . . . , γk} is ordered increasingly w.r.t. the probability mass. The proof follows by induction over k = |α−|. We demonstrate the inequality (A4) for k = 3, 4 to establish the induction basis. For k = 3, the L.H.S. of (A4)

can be written as

  log2 [ P(γ1) P(γ2) P(γ3) P(γ1 ∧ γ2 ∧ γ3) / ( P(α) P(γ1 ∧ γ2) P(γ1 ∧ γ3) P(γ2 ∧ γ3) ) ]

  = log2 [ (P(α) + d1) / P(α) ] − log2 [ ((P(α) + d2) + d1) / (P(α) + d2) ]
    − log2 [ ((P(α) + d3) + d1) / (P(α) + d3) ] + log2 [ ((P(α) + d3 + d2) + d1) / (P(α) + d3 + d2) ]

  = [h3(P(α)) − h3(P(α) + d2)] − [h3(P(α) + d3) − h3(P(α) + d3 + d2)],

where h3(x) = log2(1 + d1/x), di := P(γi) − P(α) for i ∈ {1, 2, 3}, and d3 ≥ d2 ≥ d1 ≥ 0. Note that h3 is a continuously differentiable convex function that is monotonically decreasing. Now, take x = P(α) and y = P(α) + d3; then

  h3(P(α)) − h3(P(α) + d2)
    ≥ −d2 h3′(P(α) + d2)                     (Thm. A.2)
    ≥ −d2 h3′(P(α) + d3)                     (Prop. A.1)
    ≥ h3(P(α) + d3) − h3(P(α) + d3 + d2)     (Thm. A.2)

and so the inequality (A4) holds when k = 3. For k = 4, we have α− = {γ1, γ2, γ3, γ4}, ordered increasingly w.r.t. the probability mass. By Proposition A.3, the L.H.S. of (A4) can be written as

  [h3(P(α)) − h3(P(α) + d2) − (h3(P(α) + d3) − h3(P(α) + d3 + d2))]
  − [h3(P(α) + d4) − h3(P(α) + d4 + d2) − (h3(P(α) + d4 + d3) − h3(P(α) + d4 + d3 + d2))]

  = [h4(P(α), P(α) + d2) − h4(P(α) + d3, P(α) + d3 + d2)]
  − [h4(P(α) + d4, P(α) + d4 + d2) − h4(P(α) + d4 + d3, P(α) + d4 + d3 + d2)],

where di := P(γi) − P(α) for i ∈ {2, 3, 4}, d4 ≥ d3 ≥ d2 ≥ 0, and h4(x1, x2) = log2(1 + d1(x2 − x1)/(x1(x2 + d1))) = h3(x1) − h3(x2). Let δ ≥ 0 and x, y ∈ H4^δ := {x ∈ R²₊ | x2 = x1 + δ} with x1 ≤ y1; then h4(x) ≥ h4(y), since (A4) holds for k = 3. Moreover, h4 is convex, since for any x, y ∈ H4^δ and θ ∈ [0, 1],

  θ h4(x) + (1 − θ) h4(y) − h4(θx + (1 − θ)y)
  = θ (h3(x1) − h3(x2)) + (1 − θ)(h3(y1) − h3(y2)) − h3(θx1 + (1 − θ)y1) + h3(θx2 + (1 − θ)y2)
  = [θ h3(x1) + (1 − θ) h3(y1) − h3(θx1 + (1 − θ)y1)]
    − [θ h3(x1 + δ) + (1 − θ) h3(y1 + δ) − h3(θx1 + (1 − θ)y1 + δ)] ≥ 0.

Now, take x = (P(α), P(α) + d2) and y = (P(α) + d4, P(α) + d4 + d2); then

  h4(P(α), P(α) + d2) − h4(P(α) + d3, P(α) + d3 + d2)
    ≥ −∇ᵀh4(P(α) + d3, P(α) + d3 + d2)(d3, d3)      (Thm. A.2)
    ≥ −∇ᵀh4(P(α) + d4, P(α) + d4 + d2)(d3, d3)      (Prop. A.1)
    ≥ h4(P(α) + d4, P(α) + d4 + d2) − h4(P(α) + d4 + d3, P(α) + d4 + d3 + d2),

and so the inequality (A4) holds for k = 4. Suppose now that the inequality holds for k, and let us prove it for k + 1. Here α− = {γ1, γ2, . . . , γk+1} and, using Proposition A.3, the L.H.S. of (A4) can be written as

  [hk(a_{k−2}) − hk(a_{k−2} + d_{k−1} 1_{k−2}) − (hk(a_{k−2} + d_k 1_{k−2}) − hk(a_{k−2} + (d_k + d_{k−1}) 1_{k−2}))]
  − [hk(a_{k−2} + d_{k+1} 1_{k−2}) − hk(a_{k−2} + (d_{k+1} + d_{k−1}) 1_{k−2}) − (hk(a_{k−2} + (d_{k+1} + d_k) 1_{k−2}) − hk(a_{k−2} + (d_{k+1} + d_k + d_{k−1}) 1_{k−2}))]

  = [h_{k+1}(a_{k−2}, a_{k−2} + d_{k−1} 1_{k−2}) − h_{k+1}(a_{k−2} + d_k 1_{k−2}, a_{k−2} + (d_k + d_{k−1}) 1_{k−2})]
  − [h_{k+1}(a_{k−2} + d_{k+1} 1_{k−2}, a_{k−2} + (d_{k+1} + d_{k−1}) 1_{k−2}) − h_{k+1}(a_{k−2} + (d_{k+1} + d_k) 1_{k−2}, a_{k−2} + (d_{k+1} + d_k + d_{k−1}) 1_{k−2})],

where a_{k−2} := (P(α), . . . , P(α) + Σ_{i=2}^{k−2} di) ∈ R^{2^{k−2}}, 1_{k−2} is the all-ones vector in R^{2^{k−2}}, di := P(γi) − P(α) for i ∈ {2, . . . , k + 1}, d_{k+1} ≥ · · · ≥ d2 ≥ 0, and h_{k+1}(x1, . . . , x_{2^{k−1}}) = hk(x1, . . . , x_{2^{k−2}}) − hk(x_{2^{k−2}+1}, . . . , x_{2^{k−1}}).

Let δ ≥ 0 and x, y ∈ H^δ_{k+1} := {x ∈ R^{2^{k−1}} | xi = xj + δ for i ≡ j mod 2^{k−2}} with xi ≤ yi for all i; then h_{k+1}(x) ≥ h_{k+1}(y), because the inequality (A4) holds for k. Moreover, h_{k+1} is convex, since for any x, y ∈ H^δ_{k+1} and θ ∈ [0, 1],

  θ h_{k+1}(x1, . . . , x_{2^{k−1}}) + (1 − θ) h_{k+1}(y1, . . . , y_{2^{k−1}})
  − h_{k+1}(θx1 + (1 − θ)y1, . . . , θx_{2^{k−1}} + (1 − θ)y_{2^{k−1}})
  = [θ hk(x1, . . . , x_{2^{k−2}}) + (1 − θ) hk(y1, . . . , y_{2^{k−2}}) − hk(θx1 + (1 − θ)y1, . . . , θx_{2^{k−2}} + (1 − θ)y_{2^{k−2}})]
  − [θ hk(x1 + δ, . . . , x_{2^{k−2}} + δ) + (1 − θ) hk(y1 + δ, . . . , y_{2^{k−2}} + δ) − hk(θx1 + (1 − θ)y1 + δ, . . . , θx_{2^{k−2}} + (1 − θ)y_{2^{k−2}} + δ)]

is nonnegative. Now, take x = (a_{k−2}, a_{k−2} + d_{k−1} 1_{k−2}) and y = (a_{k−2} + d_{k+1} 1_{k−2}, a_{k−2} + (d_{k+1} + d_{k−1}) 1_{k−2}); then

  h_{k+1}(a_{k−2}, a_{k−2} + d_{k−1} 1_{k−2}) − h_{k+1}(a_{k−2} + d_k 1_{k−2}, a_{k−2} + (d_k + d_{k−1}) 1_{k−2})
    ≥ −d_k ∇ᵀh_{k+1}(a_{k−2} + d_k 1_{k−2}, a_{k−2} + (d_k + d_{k−1}) 1_{k−2}) 1_{k−1}
    ≥ −d_k ∇ᵀh_{k+1}(a_{k−2} + d_{k+1} 1_{k−2}, a_{k−2} + (d_{k+1} + d_{k−1}) 1_{k−2}) 1_{k−1}
    ≥ h_{k+1}(a_{k−2} + d_{k+1} 1_{k−2}, a_{k−2} + (d_{k+1} + d_{k−1}) 1_{k−2}) − h_{k+1}(a_{k−2} + (d_{k+1} + d_k) 1_{k−2}, a_{k−2} + (d_{k+1} + d_k + d_{k−1}) 1_{k−2}),

where the first and third inequalities hold by Theorem A.2 and the second inequality holds by Proposition A.1; and so the inequality (A4) holds for k + 1.
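Inequality (A4) can also be probed numerically. Under the parametrization established via Proposition A.3, P(⋀_{i∈B} γi) = P(α) + Σ_{i∈B} di with di = P(γi) − P(α) ≥ 0, so the left-hand side of (A4) becomes a function of P(α) and the di alone. The sketch below is our own illustration: it evaluates that function on random admissible values and checks nonnegativity.

```python
import random
from math import log2
from itertools import combinations

def lhs_A4(p_alpha, ds):
    """L.H.S. of (A4) with P(meet over B) = p_alpha + sum of d_i over B,
    the parametrization from Proposition A.3."""
    total = -log2(p_alpha)
    k = len(ds)
    for r in range(1, k + 1):
        for B in combinations(range(k), r):
            total += (-1) ** (r - 1) * log2(p_alpha + sum(ds[i] for i in B))
    return total

random.seed(0)
for _ in range(1000):
    k = random.randint(1, 5)
    p_alpha = random.uniform(0.01, 0.5)
    # nonnegative increments d_i chosen so that p_alpha + sum(ds) <= 1
    ds = sorted(random.uniform(0, (1 - p_alpha) / k) for _ in range(k))
    assert lhs_A4(p_alpha, ds) >= -1e-9
print("inequality (A4) holds on all sampled instances")
```

For k = 2 the nonnegativity reduces to (P(α)+d1)(P(α)+d2) ≥ P(α)(P(α)+d1+d2), i.e., d1·d2 ≥ 0, which makes the general pattern plausible at a glance.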

Finally, we write down the proof of Theorem IV.3 to conclude that isx∩ yields meaningful PPID terms.

proof of theorem IV.3. For any α ∈ A([n]),

  πsx+(t : α) = isx+∩(t : α) − Σ_{∅≠B⊆α−} (−1)^{|B|−1} isx+∩(t : ⋀B)
             = −log2 P(α) + Σ_{∅≠B⊆α−} (−1)^{|B|−1} log2 P(⋀B).

So, by Lemma A.1, πsx+(t : α) ≥ 0. Similarly, πsx−(t : α) ≥ 0, since intersecting with t has no effect on the nonnegativity shown in Lemma A.1.

Appendix B: Definition of isx∩ starting from a general probability space

Let (Ω, A, P) be a probability space and S1, . . . , Sn, T be discrete and finite random variables on that space, i.e.,

  Si : Ω → A_{Si},   (A, P(A_{Si}))-measurable,
  T : Ω → A_T,    (A, P(A_T))-measurable,

where A_{Si} and A_T are the finite alphabets of the corresponding random variables, and P(A_{Si}) and P(A_T) are the power sets of these alphabets. Given a subset of source realization indices a ⊆ {1, . . . , n}, the local mutual information of the source realizations (si)_{i∈a} about the target realization t is defined as

  i(t : (si)_{i∈a}) = i(t : a) = log2 [ P(t | ⋂_{i∈a} si) / P(t) ].

The local shared information of an antichain α = {a1, . . . , am} (representing a set of collections of source realizations) about the target realization t ∈ A_T is defined in terms of the original probability measure P as a function isx∩ : A_T × A(s) → R with

  isx∩(t : α) = isx∩(t : a1; . . . ; am) := log2 [ P(t | ⋃_{i=1}^m ai) / P(t) ].

A special case of this quantity is the local shared information of a complete sequence of source realizations (s1, . . . , sn) about the target realization t. This is obtained by setting ai = {i} and m = n:

  isx∩(t : 1; . . . ; n) = log2 [ P(t | ⋃_{i=1}^n si) / P(t) ].

In contrast to other shared information terms, this is an atomic quantity corresponding to the very bottom of the lattice of antichains. Rewriting isx∩ allows us to decompose it into the difference of two positive parts:

  isx∩(t : a1; . . . ; am) = log2 [ P(t ∩ ⋃_{i=1}^m ai) / ( P(t) P(⋃_{i=1}^m ai) ) ]
                          = log2 [ 1 / P(⋃_{i=1}^m ai) ] − log2 [ P(t) / P(t ∩ ⋃_{i=1}^m ai) ],

using standard rules for the logarithm. We call

  isx+∩(t : a1; . . . ; am) := log2 [ 1 / P(⋃_{i=1}^m ai) ]

the informative local shared information and

  isx−∩(t : a1; . . . ; am) := log2 [ P(t) / P(t ∩ ⋃_{i=1}^m ai) ]

the misinformative local shared information.
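This decomposition can be checked on an arbitrary joint distribution. The sketch below is our own illustration: it draws a random pmf over (s1, s2, t), computes isx+∩, isx−∩, and isx∩ at a realization with positive mass, and verifies that both parts are nonnegative and that their difference reproduces isx∩.

```python
import random
from math import log2

random.seed(1)
outcomes = [(s1, s2, t) for s1 in (0, 1) for s2 in (0, 1) for t in (0, 1)]
weights = [random.random() for _ in outcomes]
z = sum(weights)
pmf = {o: w / z for o, w in zip(outcomes, weights)}

s1, s2, t = max(pmf, key=pmf.get)        # pick a realization with positive mass
in_union = lambda o: o[0] == s1 or o[1] == s2   # event for the antichain {1}{2}

p_union   = sum(q for o, q in pmf.items() if in_union(o))
p_t       = sum(q for o, q in pmf.items() if o[2] == t)
p_t_union = sum(q for o, q in pmf.items() if o[2] == t and in_union(o))

i_plus  = log2(1 / p_union)              # informative part: -log2 P(union)
i_minus = log2(p_t / p_t_union)          # misinformative part
i_sx    = log2(p_t_union / (p_t * p_union))

assert i_plus >= 0 and i_minus >= 0      # P(union) <= 1 and P(t, union) <= P(t)
assert abs(i_sx - (i_plus - i_minus)) < 1e-12
print(i_plus, i_minus, i_sx)
```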

ACKNOWLEDGMENTS

We would like to thank Nils Bertschinger, Joe Lizier, Conor Finn, and Robin Ince for fruitful discussions on PID. We would also like to thank Patricia Wollstadt, Viola Priesemann, Raul Vicente, Johannes Zierenberg, Lucas Rudelt, and Fabian Mikulasch for their valuable comments on this paper.

MW received support from SFB Project No. 1193, Subproject No. C04, funded by the Deutsche Forschungsgemeinschaft. MW, AM, and AG are employed at the Campus Institute for Dynamics of Biological Networks (CIDBN) funded by the Volkswagen Stiftung. MW and AM received support from the Volkswagenstiftung under the program "Big Data in den Lebenswissenschaften". This work was supported by funding from the Ministry for Science and Education of Lower Saxony and the Volkswagen Foundation through the "Niedersächsisches Vorab". MW is grateful to Jürgen Jost for hosting him at his department at the Max Planck Institute for Mathematics in the Sciences in Leipzig for a research stay funded by the Max Planck Society.


[1] N. Brenner, W. Bialek, and R. d. R. Van Steveninck, Adaptive rescaling maximizes information transmission, Neuron 26, 695 (2000).

[2] P. E. Latham and S. Nirenberg, Synergy, redundancy, and independence in population codes, revisited, Journal of Neuroscience 25, 5195 (2005).

[3] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, and A. Califano, ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context, in BMC Bioinformatics, Vol. 7 (Springer, 2006) p. S7.

[4] P. L. Williams and R. D. Beer, Nonnegative decomposition of multivariate information, arXiv preprint arXiv:1004.2515 (2010).

[5] N. Bertschinger, J. Rauh, E. Olbrich, and J. Jost, Shared information – new insights and problems in decomposing information in complex systems, in Proceedings of the European Conference on Complex Systems 2012 (Springer, 2013) pp. 251–269.

[6] M. Harder, C. Salge, and D. Polani, Bivariate measure of redundant information, Physical Review E 87, 012130 (2013).

[7] R. Quax, O. Har-Shemesh, and P. Sloot, Quantifying synergistic information using intermediate stochastic variables, Entropy 19, 85 (2017).

[8] P. Perrone and N. Ay, Hierarchical quantification of synergy in channels, Frontiers in Robotics and AI 2, 35 (2016).

[9] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, Quantifying unique information, Entropy 16, 2161 (2014).

[10] J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, Local measures of information storage in complex distributed computation, Information Sciences 208, 39 (2012).

[11] T. Schreiber, Measuring information transfer, Physical Review Letters 85, 461 (2000).

[12] M. Wibral, N. Pampu, V. Priesemann, F. Siebenhühner, H. Seiwert, M. Lindner, J. T. Lizier, and R. Vicente, Measuring information-transfer delays, PLoS ONE 8, e55809 (2013).

[13] J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, Local information transfer as a spatiotemporal filter for complex systems, Physical Review E 77, 026110 (2008).

[14] M. Wibral, V. Priesemann, J. W. Kay, J. T. Lizier, and W. A. Phillips, Partial information decomposition as a unified approach to the specification of neural goal functions, Brain and Cognition 112, 25 (2017).

[15] J. W. Kay and W. Phillips, Coherent infomax as a computational goal for neural systems, Bulletin of Mathematical Biology 73, 344 (2011).

[16] M. Wibral, J. T. Lizier, and V. Priesemann, Bits from brains for biologically inspired computing, Frontiers in Robotics and AI 2, 5 (2015).

[17] G. Deco and B. Schürmann, Information Dynamics: Foundations and Applications (Springer Science & Business Media, 2012).

[18] C. Finn and J. Lizier, Pointwise partial information decomposition using the specificity and ambiguity lattices, Entropy 20, 297 (2018).

[19] This can be seen as follows: assuming that the negative local MI consists only of shared information, then this local shared information must be negative, enforcing the existence of negative local shared information. Now assuming that this shared information does not differ from realization to realization (something we should consider possible at this point) while the other contributions vary, this leads to a shared information that is also negative on average; also see [18].

[20] Note that the idea of using an auxiliary random variable (IW in our case) is not novel per se. Quax et al. [7] have defined synergy using an auxiliary random variable. However, their auxiliary random variable is conceptually different from IW, and their approach yielded a 'stand-alone' measure of synergistic information without providing any decomposition.

[21] A. J. Gutknecht, M. Wibral, and A. Makkeh, Bits and pieces: Understanding information decomposition from part-whole relationships and formal logic, arXiv preprint arXiv:2008.09535 (2020).

[22] C. Finn and J. Lizier, Probability mass exclusions and the directed components of mutual information, Entropy 20, 826 (2018).

[23] R. M. Fano, Transmission of Information: A Statistical Theory of Communications, American Journal of Physics 29, 793 (1961).

[24] A. Makkeh, D. O. Theis, and R. Vicente, Bivariate partial information decomposition: The optimization perspective, Entropy 19, 530 (2017).

[25] A. Makkeh and D. O. Theis, Optimizing bivariate partial information decomposition, arXiv preprint arXiv:1802.03947 (2018).

[26] R. Ince, Measuring multivariate redundant information with pointwise common change in surprisal, Entropy 19, 318 (2017).

[27] If the collections were considered in an OR relation, there would be no random variable on which the average entropy is defined (see the discussion of the local indicator variable w_{a1,...,am}).

[28] As was to be expected from the difficulties encountered in the past trying to define measures of shared information.

[29] J. Rauh, P. Banerjee, E. Olbrich, J. Jost, and N. Bertschinger, On extractable shared information, Entropy 19, 328 (2017).

[30] P. M. Woodward and I. L. Davies, Information theory and inverse probability in telecommunication, Proceedings of the IEE - Part III: Radio and Communication Engineering 99, 37 (1952).

[31] A. M. Bastos, W. M. Usrey, R. A. Adams, G. R. Mangun, P. Fries, and K. J. Friston, Canonical microcircuits for predictive coding, Neuron 76, 695 (2012).

[32] M. Larkum, A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex, Trends in Neurosciences 36, 141 (2013).

[33] J. T. Lizier, B. Flecker, and P. L. Williams, Towards a synergy-based approach to measuring information modification, in 2013 IEEE Symposium on Artificial Life (ALIFE) (IEEE, 2013) pp. 43–51.

[34] M. Wibral, C. Finn, P. Wollstadt, J. T. Lizier, and V. Priesemann, Quantifying information modification in developing neural networks via partial information decomposition, Entropy 19, 494 (2017).

[35] P. Wollstadt, J. T. Lizier, R. Vicente, C. Finn, M. Martínez-Zarzuela, P. Mediano, L. Novelli, and M. Wibral, IDTxl: The Information Dynamics Toolkit xl: a Python package for the efficient analysis of multivariate information dynamics in networks, arXiv preprint arXiv:1807.10459 (2018).

[36] V. Griffith and C. Koch, Quantifying synergistic mutual information, in Guided Self-Organization: Inception (Springer, 2014) pp. 159–190.

[37] Due to i(t : sj) = 0 for j = 1, 2 in the XOR example, this negative shared information is then compensated by positive unique information; however, this happens twice, i.e., once for each marginal local mutual information. As a consequence, the synergy is reduced from 1 bit to 1 bit minus once this unique information. This may seem counterintuitive when still thinking about the PID atoms as areas, in the sense of "How come, if we subtract two mutual information terms of zero bit from the joint mutual information of 1 bit, we do not get 1 bit as a result?". The key insight is that the two local mutual information terms of zero bit have a negative "overlap" with each other, making their sum positive. We simply see here again that the interpretation of PID atoms as (semi-positive) areas has to be given up in the pointwise framework, due to the fact that already the regular local mutual information can be negative.

[38] J. Crampton and G. Loizou, Embedding a poset in a lattice, Tech. Rep. BBKCS-0001 (Birkbeck College, University of London, 2000).

[39] A. P. Ruszczynski, Nonlinear Optimization, Vol. 13 (Princeton University Press, 2006).

[40] This function does not have to be surjective: suppose α = {1}{2}{3}{4} and β = {1, 2}{3, 4}. Then necessarily two sets in α will not be in the image of f. It also does not have to be injective: consider α = {1} and β = {1, 2}{1, 3}. Then both elements of β have to be mapped to the only element of α.