arXiv:2109.09234v1 [cs.CL] 19 Sep 2021

Conditional probing: measuring usable information beyond a baseline

John Hewitt  Kawin Ethayarajh  Percy Liang  Christopher D. Manning
Department of Computer Science
Stanford University
{johnhew,kawin,pliang,manning}@cs.stanford.edu

Abstract

Probing experiments investigate the extent to which neural representations make properties—like part-of-speech—predictable. One suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. Instead of using baselines as a point of comparison, we're interested in measuring information that is contained in the representation but not in the baseline. For example, current methods can detect when a representation is more useful than the word identity (a baseline) for predicting part-of-speech; however, they cannot detect when the representation is predictive of just the aspects of part-of-speech not explainable by the word identity. In this work, we extend a theory of usable information called V-information and propose conditional probing, which explicitly conditions on the information in the baseline. In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.

1 Introduction

Neural language models have become the foundation for modern NLP systems (Devlin et al., 2019; Radford et al., 2018), but what they understand about language, and how they represent that knowledge, is still poorly understood (Belinkov and Glass, 2019; Rogers et al., 2020). The probing methodology grapples with these questions by relating neural representations to well-understood properties. Probing analyzes a representation by using it as input into a supervised classifier, which is trained to predict a property, such as part-of-speech (Shi et al., 2016; Ettinger et al., 2016; Alain and Bengio, 2016; Adi et al., 2017; Belinkov, 2021).

One suggests that a representation encodes a property of interest if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. However, consider a representation that encodes only the part-of-speech tags that aren't determined by the word identity. Probing would report that this representation encodes less about part-of-speech than the non-contextual word baseline, since ambiguity is relatively rare. Yet, this representation clearly encodes interesting aspects of part-of-speech. How can we capture this?

In this work, we present a simple probing method to explicitly condition on a baseline.1 For a representation and a baseline, our method trains two probes: (1) on just the baseline, and (2) on the concatenation of the baseline and the representation. The performance of probe (1) is then subtracted from that of probe (2). We call this process conditional probing. Intuitively, the representation is not penalized for lacking aspects of the property accessible in the baseline.

We then theoretically ground our probing methodology in V-information, a theory of usable information introduced by Xu et al. (2020) that we additionally extend to multiple predictive variables. We use V-information instead of mutual information (Shannon, 1948; Pimentel et al., 2020b) because any injective deterministic transformation of the input has the same mutual information as the input. For example, a representation that maps each unique sentence to a unique integer must have the same mutual information with any property as does BERT's representation of that sentence, yet the latter is more useful. In contrast, V-information is defined with respect to a family of functions V that map one random variable to (a probability distribution over) another. V-information can be constructed by deterministic transformations that make a property more accessible to the functions in the family. We show that conditional probing

1 Our code is available at https://github.com/john-hewitt/conditional-probing.


provides an estimate of the conditional V-information IV(repr → property | baseline).

In a case study, we answer an open question posed by Hewitt and Liang (2019): how are the aspects of linguistic properties that aren't explainable by the input layer accessible across the rest of the layers of the network? We find that the part-of-speech information not attributable to the input layer remains accessible much deeper into the layers of ELMo (Peters et al., 2018a) and RoBERTa (Liu et al., 2019) than the overall property, a fact previously obscured by the gradual loss across layers of the aspects attributable to the input layer. For the other properties, conditioning on the input layer does not change the trends across layers.

2 Conditional V-information Probing

In this section, we describe probing methods and introduce conditional probing. We then review V-information and use it to ground probing.

2.1 Probing setup

We start with some notation. Let X ∈ X be a random variable taking the value of a sequence of tokens. Let φ(X) be a representation resulting from a deterministic function of X; for example, the representation of a single token from the sequence in a layer of BERT (Devlin et al., 2019). Let Y ∈ Y be a property (e.g., the part-of-speech of a particular token), and V a probe family, that is, a set of functions {fθ : θ ∈ Rp}, where fθ : z → P(Y) maps inputs z to probability distributions over the label space.2 The input z ∈ Rm may be in the space of φ(X), that is, Rd, or another space, e.g., if the probe takes the concatenation of two representations. In each experiment, a training dataset Dtr = {(xi, yi)}i is used to estimate θ, and the probe and representation are evaluated on a separate dataset Dte = {(xi, yi)}i. We refer to the result of this evaluation on some representation R as Perf(R).
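As a concrete illustration of this setup, the sketch below implements a probe family V of affine-softmax classifiers in NumPy: train_probe approximates the best fθ ∈ V on the training set by gradient descent, and perf reports mean negative cross-entropy in bits (higher is better), the Perf metric used later in Section 2.5. The function names and training details here are our own illustrative choices, not the paper's released code.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_probe(R, Y, n_classes, lr=0.5, epochs=200):
    """Approximate the best f_theta in V (affine + softmax) on a
    training set of representations R and labels Y, via full-batch
    gradient descent on the cross-entropy loss."""
    n, d = R.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    T = np.eye(n_classes)[Y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(R @ W + b)
        W -= lr * R.T @ (P - T) / n
        b -= lr * (P - T).mean(axis=0)
    return W, b

def perf(W, b, R, Y):
    """Perf(R): mean negative cross-entropy, in bits (higher is better)."""
    P = softmax(R @ W + b)
    return float(np.mean(np.log2(P[np.arange(len(Y)), Y])))
```

On held-out data, perf approaches 0 for a property the family predicts perfectly, and about −log2 |Y| bits for an unlearnable one (with uniform labels).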

2.2 Baselined probing

Let B ∈ Rd be a random variable representing a baseline (e.g., the non-contextual word embedding of a particular token). A common strategy in probing is to take the difference between probe performance on the representation and on the baseline

2 We discuss mild constraints on the form that V can take in the Appendix. Common probe families, including linear models and feed-forward networks, meet the constraints.

(Zhang and Bowman, 2018); we call this the baselined probing performance:

Perf(φ(X)) − Perf(B). (1)

This difference in performances estimates how much more accessible Y is in φ(X) than in the baseline B, under probe family V.

But what if B and φ(X) capture distinct aspects of Y? For example, consider if φ(X) captures parts-of-speech that aren't the most common label for a given word identity, while B captures parts-of-speech that are the most common for the word identity. Baselined probing will indicate that φ(X) explains less about Y than the baseline, a "negative" probing result. But clearly φ(X) captures an interesting aspect of Y; we aim to design a method that measures just what φ(X) contributes beyond B in predicting Y, not what B has and φ(X) lacks.

2.3 Our proposal: conditional probing

In our proposed method, we again train two probes; each probe's input is the concatenation of two representations of size d, so we let z ∈ R2d. The first probe takes as input [B; φ(X)], that is, the concatenation of B to the representation φ(X) that we're studying. The second probe takes as input [B; 0], that is, the concatenation of B to the 0 vector. The conditional probing method takes the difference of the two probe performances, which we call the conditional probing performance:

Perf([B; φ(X)]) − Perf([B; 0]). (2)

Including B in the probe with φ(X) means that φ(X) only needs to contribute what is missing from B. In the second probe, the 0 is used as a placeholder, representing the lack of knowledge of φ(X); its performance is subtracted so that φ(X) isn't given credit for what's explainable by B.3
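A self-contained sketch of Equation (2) follows. The helper names are hypothetical; for brevity this toy version evaluates on the training set rather than a held-out set, and uses mean negative cross-entropy in bits as Perf.

```python
import numpy as np

def _fit_and_score(Z, Y, n_classes, lr=0.5, epochs=300):
    """Train an affine-softmax probe on inputs Z, then return its
    performance: mean negative cross-entropy in bits (higher is better)."""
    n, d = Z.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    T = np.eye(n_classes)[Y]  # one-hot targets
    for _ in range(epochs):
        logits = Z @ W + b
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * Z.T @ (P - T) / n
        b -= lr * (P - T).mean(axis=0)
    logits = Z @ W + b
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return float(np.mean(np.log2(P[np.arange(n), Y])))

def conditional_probing_perf(B, phi_X, Y, n_classes):
    """Equation (2): Perf([B; phi(X)]) - Perf([B; 0])."""
    with_repr = _fit_and_score(np.hstack([B, phi_X]), Y, n_classes)
    without = _fit_and_score(np.hstack([B, np.zeros_like(phi_X)]), Y, n_classes)
    return with_repr - without
```

If φ(X) carries information about Y beyond B, the difference is positive; if φ(X) is uninformative (or redundant with B), it is near zero.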

2.4 V-information

V-information is a theory of usable information—that is, how much knowledge of random variable Y can be extracted from r.v. R when using functions in V, called a predictive family (Xu et al., 2020). Intuitively, by explicitly considering computational constraints, V-information can be constructed by computation, in particular when said computation makes a variable easier to predict. If V is the set of all functions from the space of R to the set of

3 The value 0 is arbitrary; any constant can be used, or one can train the probe on just B.


probability distributions over the space of Y, then V-information is mutual information (Xu et al., 2020). However, if the predictive family is the set of all functions, then no representation is more useful than another provided they are related by a bijection. By specifying a V, one makes a hypothesis about the functional form of the relationship between the random variables R and Y. One could let V be, for example, the set of log-linear models.

Using this predictive family V, one can define the uncertainty we have in Y after observing R as the V-entropy:

HV(Y | R) = inf_{f∈V} E[− log f[r](y)], (3)

where f[r] produces a probability distribution over the labels. Information terms like IV(R → Y) are defined analogously to Shannon information, that is, IV(R → Y) = HV(Y) − HV(Y | R). For brevity, we leave a full formal description, as well as our extension of V-information to multiple predictive variables, to the appendix.
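One consequence worth noting: for a family that can express any constant distribution (the affine-softmax probes used here can, via the bias term), the unconditional V-entropy HV(Y) reduces to the entropy of the empirical label marginal, so IV(R → Y) is estimated by subtracting a trained probe's held-out cross-entropy from that entropy. A small sketch, with a helper name of our own:

```python
import numpy as np

def v_entropy_labels(Y, n_classes):
    """H_V(Y) for a family containing all constant distributions:
    the best input-free predictor is the empirical label marginal,
    so H_V(Y) is the entropy of that marginal, in bits."""
    p = np.bincount(Y, minlength=n_classes) / len(Y)
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())
```

For balanced binary labels this gives 1.0 bit; IV(R → Y) would then subtract the mean cross-entropy (in bits) of a probe trained on R.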

2.5 Probing estimates V-information

With a particular performance metric, baselined probing estimates a difference of V-information quantities. Intuitively, probing specifies a function family V, training data is used to find the f ∈ V that best predicts Y from φ(X) (the infimum in Equation 7), and we then evaluate how well Y is predicted. If we use the negative cross-entropy loss as the Perf function, then baselined probing estimates

IV(φ(X) → Y) − IV(B → Y),

the difference of two V-information quantities. This theory provides methodological best practices as well: the form of the family V should be chosen for theory-external reasons,4 and since the probe training process is approximating the infimum in Equation 3, we're not concerned with sample efficiency.

Baselined probing appears in existing information-theoretic probing work: Pimentel et al. (2020b) define conditional mutual information quantities wherein a lossy transformation c(·) is performed on the sentence (like choosing a single word), and an estimate of the gain from

4 There are also PAC bounds (Valiant, 1984) on the estimation error for V-information (Xu et al., 2020); simpler families V with lower Rademacher complexity result in better bounds.

knowing the rest of the sentence is provided: I(φ(X); Y | c(φ(X))) = I(X; Y | c(X)).5 Methodologically, despite being a conditional information, this is identical to baselined probing, training one probe on just φ(X) and another on just c(φ(X)).6

2.6 Estimating conditional information

Inspired by the transparent connections between V-information and probes, we ask what the V-information analogue is of conditioning on a variable in a mutual information, that is, I(X; Y | B). To do this, we extend V-information to multiple predictive variables, and design conditional probing (as presented) to estimate

IV(φ(X) → Y | B) = HV(Y | B) − HV(Y | B, φ(X)),

thus having the interpretation of probing what φ(X) explains about Y apart from what's already explained by B (as can be accessed by functions in V). Methodologically, the innovation is in providing B to the probe on φ(X), so that the information accessible in B need not be accessible in φ(X).

3 Related Work

Probing—mechanically simple, but philosophically hard to interpret (Belinkov, 2021)—has led to a number of information-theoretic interpretations.

Pimentel et al. (2020b) claimed that probing should be seen as estimating the mutual information I(φ(X); Y) between representations and labels. This raises an issue, which Pimentel et al. (2020b) note: due to the data processing inequality, the MI between the representation of a sentence (from, e.g., BERT) and a label is upper-bounded by the MI between the sentence itself and the label. Both an encrypted document X and an unencrypted version φ(X) provide the same mutual information with the topic of the document Y. This is because MI allows unbounded work in using X to predict Y, including the enormous amount of work (likely) required to decrypt it without the secret key. Intuitively, we understand that φ(X) is more useful than X, and that this is because the function φ performs useful "work" for us. Likewise, BERT can perform useful work to make interesting properties

5 Equality depends on the injectivity of φ; otherwise knowing the representation φ(X) may be strictly less informative than knowing X.

6 This is because of the data processing inequality and the fact that c(φ(X)) is a deterministic function of φ(X).


more accessible. While Pimentel et al. (2020b) conclude from the data processing inequality that probing is not meaningful, we conclude that estimating mutual information is not the goal of probing.

Voita and Titov (2020) propose a new probing-like methodology, minimum description length (MDL) probing, to measure the number of bits required to transmit both the specification of the probe and the specification of labels. Intuitively, a representation that allows for more efficient communication of labels (and of the probes used to help perform that communication) has done useful "work" for us. Voita and Titov (2020) found that by using their methods, probing practitioners could pay less attention to the exact functional form of the probe. V-information and MDL probing complement each other; V-information does not measure sample efficiency of learning a mapping from φ(X) to Y, instead focusing solely on how well any function from a specific family (like linear models) allows one to predict Y from φ(X). Further, in practice, one must choose a family to optimize over even in MDL probing; the complexity penalty of communicating the member of the family is analogous to choosing V. Finally, our contribution of conditional probing is orthogonal to the choice of probing methodology; it could be used with MDL probing as well.

V-information places the functional form of the probe front-and-center as a hypothesis about how structure is encoded. This intuition is already popular in probing. For example, Hewitt and Manning (2019) proposed that syntax trees may emerge as squared Euclidean distance under a linear transformation. Further work refined this, showing that a better structural hypothesis may be hyperbolic (Chen et al., 2021), axis-aligned after scaling (Limisiewicz and Mareček, 2021), or an attention-inspired kernel space (White et al., 2021).

In this work, we intentionally avoid claims as to the "correct" functional family V to be used in conditional probing. Some work has argued for simple probe families (Hewitt and Liang, 2019; Alain and Bengio, 2016), others for complex families (Pimentel et al., 2020b; Hou and Sachan, 2021). Pimentel et al. (2020a) argue for choosing multiple points along an axis of expressivity, while Cao et al. (2021) define the family through the weights of the neural network. Other work performs structural analysis of representations without direct supervision (Saphra and Lopez, 2019; Wu et al., 2020).

Hewitt and Liang (2019) suggested that differences in ease of identifying the word identity across layers could impede comparisons between the layers; our conditional probing provides a direct solution to this issue by conditioning on the word identity. Kuncoro et al. (2018) and Shapiro et al. (2021) use control tasks, and Rosa et al. (2020) measure word-level memorization in probes. Finally, under the possible goals of probing proposed by Ivanova et al. (2021), we see V-information as most useful in discovering emergent structure, that is, parsimonious and surprisingly simple relationships between neural representations and complex properties.

4 Experiments

In our experiments, we aim for a case study in understanding how conditioning on the non-contextual embeddings changes trends in the accessibility of linguistic properties across the layers of deep networks.

4.1 Tasks, models, and data

Tasks. We train probes to predict five linguistic properties Y, roughly arranged in order from lower-level, more concrete properties to higher-level, more abstract properties: (i) upos: coarse-grained (17-tag) part-of-speech tags (Nivre et al., 2020), (ii) xpos: fine-grained English-specific part-of-speech tags, (iii) dep rel: the label on the Universal Dependencies edge that governs the word, (iv) ner: named entities, and (v) sst2: sentiment.

Data. All of our datasets are composed of English text. For all tasks except sentiment, we use the Ontonotes v5 corpus (Weischedel et al., 2013), recreating the splits used in the CoNLL 2012 shared task, as verified against the split statistics provided by Strubell et al. (2017).7,8 Since Ontonotes is annotated with constituency parses, not Universal Dependencies, we use the converter provided in CoreNLP (Schuster and Manning,

7 In order to provide word vectors for each token in the corpus, we heuristically align the subword tokenizations of RoBERTa with the corpus-specified tokens through character-level alignments, following Tenney et al. (2019).

8 Ontonotes uses the destructive Penn Treebank tokenization (like replacing brackets with -LCB- (Marcus et al., 1993)). We perform a heuristic de-tokenization process before subword tokenization to recover some naturalness of the text.
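A minimal sketch of the kind of de-tokenization footnote 8 describes. The escape table here is illustrative, not the exact heuristic used in the paper, which handles more cases.

```python
# Illustrative PTB escape table; the paper's heuristic is more thorough.
PTB_UNESCAPE = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

def detokenize(tokens):
    """Map destructive Penn Treebank escapes back to their surface
    forms before subword tokenization."""
    return [PTB_UNESCAPE.get(t, t) for t in tokens]
```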


Figure 1: Probing results on RoBERTa. Results are reported in bits of V-information; higher is better. (Four panels, one per task: upos, dep rel, ner, and sst2, each plotting baselined and conditional probing performance across RoBERTa model layers 1–12.)

2016; Manning et al., 2014). For the sentiment annotation, we use the binary GLUE version (Wang et al., 2019) of the Stanford Sentiment Treebank corpus (Socher et al., 2013). All results are reported on the development sets.

Models. We evaluate the popular RoBERTa model (Liu et al., 2019), as provided by the HuggingFace Transformers package (Wolf et al., 2020), as well as the ELMo model (Peters et al., 2018a), as provided by the AllenNLP package (Gardner et al., 2017). When multiple RoBERTa subwords are aligned to a single corpus token, we average the subword vector representations.
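The subword-to-token pooling can be sketched as follows. The alignment format is hypothetical: alignment[i] lists the subword indices belonging to corpus token i, as would be produced by the character-level aligner of footnote 7.

```python
import numpy as np

def pool_subwords(subword_vecs, alignment):
    """Average the subword vectors aligned to each corpus token,
    yielding one vector per token."""
    return np.stack([subword_vecs[idx].mean(axis=0) for idx in alignment])
```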

Probe families. For all of our experiments, we choose V to be the set of affine functions followed by a softmax.9 For word-level tasks, we have

fθ(φi(X)j) = softmax(Wφi(X)j + b) (4)

where i indexes the layer in the network and j indexes the word in the sentence. For the sentence-level sentiment task, we average over the word-level representations, as

fθ(φi(X)) = softmax(W avg(φi(X)) + b) (5)
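Equations (4) and (5) can be written directly in code. This is a sketch: W and b stand for the probe parameters θ, and layer_reps holds the word representations φi(X)j for one sentence at layer i.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, with max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def word_probe(W, b, layer_reps):
    """Equation (4): apply the affine-softmax probe to each word
    representation in the layer independently."""
    return softmax(layer_reps @ W + b)               # (n_words, n_classes)

def sentence_probe(W, b, layer_reps):
    """Equation (5): average the word representations, then apply the
    same affine-softmax form (used for sentence-level sentiment)."""
    return softmax(layer_reps.mean(axis=0) @ W + b)  # (n_classes,)
```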

4.2 Results

Results on ELMo. ELMo has a non-contextual embedding layer φ0, and two contextual layers φ1 and φ2, the output of each of two bidirectional LSTMs (Hochreiter and Schmidhuber, 1997). Previous work has found that φ1 contains more syntactic information than φ2 (Peters et al., 2018b;

9 We used the Adam optimizer (Kingma and Ba, 2014) with starting learning rate 0.001, and multiply the learning rate by 0.5 after each epoch wherein a new lowest validation loss is not achieved.

            Baselined        Conditional
            φ1     φ2        φ1     φ2
upos        0.20   0.16      0.22   0.20
xpos        0.20   0.16      0.21   0.20
dep rel     0.99   0.81      1.00   0.87
ner         0.24   0.23      0.25   0.24
sst2        0.18   0.13      0.17   0.13

Table 1: Results on ELMo, reported in bits of V-information; higher is better. φi refers to layer i.

Zhang and Bowman, 2018). Baselined probing performance, in Table 1, replicates this finding. But Hewitt and Liang (2019) conjecture that this may be due to the accessibility of information from φ0. Conditional probing shows that when only measuring information not available in φ0, there is still more syntactic information in φ1 than φ2, but the difference is much smaller.

Results on RoBERTa. RoBERTa-base is a pretrained Transformer consisting of a word-level embedding layer φ0 and twelve contextual layers φi, each the output of a Transformer encoder block (Vaswani et al., 2017). We compare baselined probing performance to conditional probing performance for each layer. In Figure 1, baselined probing indicates that part-of-speech information decays in later layers. However, conditional probing shows that information not available in φ0 is maintained into deeper layers in RoBERTa, and only the information already available in φ0 decays. In contrast, for dependency labels, we find that the difference between layers is lessened after conditioning on φ0, and for NER and sentiment, conditioning on φ0 does not change the results.

5 Conclusion

In this work, we proposed conditional probing, a simple method for conditioning on baselines in probing studies, and grounded the method theoretically in V-information. In a case study, we found that after conditioning on the input layer, usable part-of-speech information remains much deeper into the layers of ELMo and RoBERTa than previously thought, answering an open question from Hewitt and Liang (2019). Conditional probing is a tool that practitioners can easily use to gain additional insight into representations.10

10 An executable version of the experiments in this paper is on CodaLab, at this link: https://worksheets.codalab.org/worksheets/0x46190ef741004a43a2676a3b46ea0c76.


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.

Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations.

Yonatan Belinkov. 2021. Probing classifiers: Promises, shortcomings, and alternatives. CoRR, abs/2102.12452.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Steven Cao, Victor Sanh, and Alexander Rush. 2021. Low-complexity probing via finding subnetworks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 960–966, Online. Association for Computational Linguistics.

Boli Chen, Yao Fu, Guangwei Xu, Pengjun Xie, Chuanqi Tan, Mosha Chen, and Liping Jing. 2021. Probing BERT in hyperbolic spaces. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Berlin, Germany. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS).

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yifan Hou and Mrinmaya Sachan. 2021. Bird's eye: Probing for linguistic graph structures with a simple information-theoretic approach. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1844–1859, Online. Association for Computational Linguistics.

Anna A. Ivanova, John Hewitt, and Noga Zaslavsky. 2021. Probing artificial neural networks: insights from neuroscience. In Proceedings of the Brain2AI Workshop at the Ninth International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1426–1436.

Tomasz Limisiewicz and David Mareček. 2021. Introducing orthogonal constraint in structural probes. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 428–442, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.


Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajic, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Tiago Pimentel, Naomi Saphra, Adina Williams, and Ryan Cotterell. 2020a. Pareto probing: Trading off accuracy for complexity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3138–3153, Online. Association for Computational Linguistics.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020b. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622, Online. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Technical Report.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Rudolf Rosa, Tomáš Musil, and David Marecek. 2020. Measuring memorization effect in word-level neural networks probing. In Text, Speech, and Dialogue, pages 180–188, Cham. Springer International Publishing.

Naomi Saphra and Adam Lopez. 2019. Understanding learning dynamics of language models with SVCCA. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In Language Resources and Evaluation (LREC).

Claude E Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423.

Naomi Tachikawa Shapiro, Amandalynne Paullada, and Shane Steinert-Threlkeld. 2021. A multilabel approach to morphosyntactic probing. arXiv preprint arXiv:2104.08464.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2670–2680, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Leslie G Valiant. 1984. A theory of the learnable. Communications of the ACM, 27(11):1134–1142.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196, Online. Association for Computational Linguistics.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5. LDC2013T19.

Jennifer C. White, Tiago Pimentel, Naomi Saphra, and Ryan Cotterell. 2021. A non-linear structural probe. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 132–138, Online. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Computational Linguistics.

Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. 2020. A theory of usable information under computational constraints. In International Conference on Learning Representations.

Kelly W Zhang and Samuel R Bowman. 2018. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. arXiv preprint arXiv:1809.10040.

A Multivariable V-information

In this section we introduce Multivariable V-information. V-information as introduced by Xu et al. (2020) was defined in terms of a single predictive variable X, and is unwieldy to extend to multiple variables due to its use of a "null" input outside the sample space of X (Section D.3).11

11 In particular, the null input encodes not knowing the value of X; technical conditions in the definition of V-information as to the behavior of this null value increase in number exponentially with the number of predictive variables.

Our multivariable V-information removes the use of null variables and naturally captures the multivariable case. Consider an agent attempting to predict Y ∈ Y from some information sources X1, . . . , Xn, where Xi ∈ Xi. Let P(Y) be the set of all probability distributions over Y.

At a given time, the agent may only have access to a subset of the information sources. Let the known set C and the unknown set C̄ form a binary partition of {X1, . . . , Xn}. Though the agent isn't given the true value c̄ of C̄ when predicting Y, it is instead provided with a constant value a in the sample space of C̄, which does not vary with Y.12

We first specify constraints on the set of functions that the agent has at its disposal for predicting Y from the Xi:

Definition 1 (Multivariable Predictive Family). Let Ω = {f : X1 × · · · × Xn → P(Y)}. We say that V ⊆ Ω is a predictive family if, for any partition of X1, . . . , Xn into C, C̄, we have

∀(f, x1, . . . , xn) ∈ V × X1 × · · · × Xn, ∃f′ ∈ V : ∀c̄′ ∈ C̄, f(c, c̄) = f′(c, c̄′),    (6)

where we overload f(c, c̄) to equal f(x1, . . . , xn) for the values of x1, . . . , xn specified by c, c̄.

Intuitively, the constraint on V states that for any binary partition of the X1, . . . , Xn into known and unknown sets, if a function is expressible given some constant assignment to the unknown variables, the same function is expressible if the unknown variables are allowed to vary arbitrarily. In effect, this means one can assign zero weight to those variables, so their values don't matter. This constraint, which we refer to as multivariable optional ignorance in reference to Xu et al. (2020), will be used to ensure non-negativity of information; when some Xℓ is moved from C̄ to C as a new predictive variable for the agent to use, optional ignorance ensures the agent can still act as if that variable were held constant.

Example 1. Let X1, . . . , Xn ∈ Rd1 , . . . , Rdn and Y ∈ Y be random variables. Let Ω be defined as in Definition 1. Then V = {f : f(x1, . . . , xn) = softmax(W2 σ(W1 [x1; · · · ; xn] + b1) + b2)}, the set of 1-layer multi-layer perceptrons, is a predictive family. Ignorance of some xi can be achieved by setting the entries of W1 that multiply xi to zero.
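A concrete sketch of Example 1's optional ignorance (the dimensions, weights, and helper names below are invented for illustration; under the `W1 @ x` convention the ignored entries are columns of W1):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp(x1, x2, W1, b1, W2, b2):
    """A member of the family in Example 1, over the concatenation [x1; x2]."""
    h = np.maximum(0.0, W1 @ np.concatenate([x1, x2]) + b1)  # sigma = ReLU
    return softmax(W2 @ h + b2)

d1, d2, hidden, classes = 3, 2, 4, 5
W1 = rng.normal(size=(hidden, d1 + d2))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(classes, hidden))
b2 = rng.normal(size=classes)

# Optional ignorance: zero out the entries of W1 that multiply x1.
# The resulting function ignores x1 entirely, so its output cannot
# depend on x1's value.
W1_ign = W1.copy()
W1_ign[:, :d1] = 0.0

x2 = rng.normal(size=d2)
p_a = mlp(rng.normal(size=d1), x2, W1_ign, b1, W2, b2)
p_b = mlp(rng.normal(size=d1), x2, W1_ign, b1, W2, b2)
print(np.allclose(p_a, p_b))  # True: identical outputs despite different x1
```

The zeroed weights realize the f′ required by Definition 1: the function agrees with itself at the constant assignment and at every other value of the unknown variable.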

12 The exact value of a will not matter, as a result of Definition 1.


Given the predictive family of functions the agent has access to, we define the multivariable V-information analogue of entropy:

Definition 2 (Multivariable Predictive V-entropy). Let X1, . . . , Xn ∈ X1, . . . , Xn. Let C ∈ C and C̄ ∈ C̄ form a binary partition of X1, . . . , Xn. Let a ∈ C̄. Then the V-entropy of Y conditioned on C is defined as

HV(Y | C) = inf_{f ∈ V} E_{c,y}[− log f(c, a)[y]].    (7)

Note that a does not vary with y; thus it is 'informationless'. The notation f(c, a) takes the known value of C ⊆ {X1, . . . , Xn} and the constant value a, and produces a distribution over Y, and f(c, a)[y] evaluates the density at y.

If we let V = Ω, the set of all functions from the Xi to distributions over Y, then V-entropy becomes exactly Shannon entropy (Xu et al., 2020). And just like for Shannon information, the multivariable V-information from some variable Xℓ to Y is defined as the reduction in entropy when its value becomes known. In our notation, this means some Xℓ is moving from C̄ (the unknown variables) to C (the known variables), so this definition encompasses the notion of conditional mutual information if C is non-empty to start.

Definition 3 (Multivariable V-information). Let X1, . . . , Xn ∈ X1, . . . , Xn and Y ∈ Y be random variables. Let V be a multivariable predictive family. Then the conditional multivariable V-information from Xℓ to Y, where ℓ ∈ {1, . . . , n}, conditioned on prior knowledge of C ⊂ {X1, . . . , Xn}, is defined as

IV(Xℓ → Y | C) = HV(Y | C) − HV(Y | C ∪ {Xℓ}).    (8)

A.1 Properties of multivariable V-information

The crucial property of multivariable V-information as a descriptor of probing is that it can be constructed through computation. In the example of the agent attempting to predict the sentiment (Y) of an encrypted message (X), if the agent has V equal to the set of linear functions, then IV(X → Y) is small.13 A function φ that decrypts the message constructs V-information about Y, since IV(φ(X) → Y) is larger. In probing, φ is interpreted to be the contextual representation learner, which is interpreted as constructing V-information about linguistic properties.

13 Where IV(X → Y) is defined to be IV(X → Y | ∅).

V-information also has some desirable elementary properties, including preserving some of the properties of mutual information, like non-negativity. (Knowing some Xℓ should not reduce the agent's ability to predict Y.)
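A toy numeric analogue of the encryption example (the data, the XOR stand-in for "encryption", and the logistic probe family are invented for illustration; scikit-learn is assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# XOR as a stand-in for encryption: Y is one deterministic bit of X, but no
# linear function can extract it; phi plays the role of the decrypting map.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4000, 2)).astype(float)
Y = (X[:, 0] != X[:, 1]).astype(int)

def ventropy_bits(feats, y):
    """Estimate H_V(Y | feats), with V = logistic regressors, in bits."""
    probe = LogisticRegression(C=1000.0, max_iter=1000).fit(feats, y)
    return log_loss(y, probe.predict_proba(feats)) / np.log(2)

h_y = 1.0  # H(Y): one balanced bit
iv_raw = h_y - ventropy_bits(X, Y)  # I_V(X -> Y): near 0 bits
# phi "decrypts" by adding a product feature, making Y linearly separable.
phi = np.column_stack([X, X[:, 0] * X[:, 1]])
iv_phi = h_y - ventropy_bits(phi, Y)  # I_V(phi(X) -> Y): near 1 bit
print(round(iv_raw, 2), round(iv_phi, 2))
```

The linear family extracts almost no usable information from the "ciphertext", while the transformed input yields nearly the full bit, even though X and phi(X) carry identical Shannon information about Y.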

Proposition 1. Let X1, . . . , Xn ∈ X1, . . . , Xn and Y ∈ Y be random variables, and V and U be predictive families. Let C, C̄ be a binary partition of X1, . . . , Xn.

1. Independence: If Y, C are jointly independent of Xℓ, then IV(Xℓ → Y | C) = 0.

2. Monotonicity: If U ⊆ V, then HV(Y | C) ≤ HU(Y | C).

3. Non-negativity: IV(Xℓ → Y | C) ≥ 0.
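Monotonicity can be checked numerically. A sketch under assumed scikit-learn probe families: a wide-enough ReLU MLP can represent every linear-softmax model, so the linear family U is in effect a subset of the MLP family V, and V's entropy should not exceed U's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.neural_network import MLPClassifier

# XOR labels are invisible to linear probes but easy for a small MLP,
# so the entropy gap between the two families is large here.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4000, 2)).astype(float)
Y = (X[:, 0] != X[:, 1]).astype(int)

lin = LogisticRegression(max_iter=1000).fit(X, Y)
H_U = log_loss(Y, lin.predict_proba(X)) / np.log(2)  # roughly 1 bit

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, Y)
H_V = log_loss(Y, mlp.predict_proba(X)) / np.log(2)  # near 0 bits

print(H_V <= H_U)  # True: the larger family achieves lower V-entropy
```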

B Probing as Multivariable V-information Estimation

We've described the V-information framework, and discussed how it captures the intuition that usable information about linguistic properties is constructed through contextualization. In this section, we demonstrate how a small step from existing probing methodology leads to probing estimating V-information quantities.

B.1 Estimating V-entropy

In probing, gradient descent is used to pick the function in V that minimizes the cross-entropy loss,

(1/|Dtr|) Σ_{(x,y) ∈ Dtr} − log p(y | φi(x); θ),    (9)

where θ are the trainable parameters of functions in V. Recalling the definition of V-entropy, this minimization performed through gradient descent approximates the inf over V, since − log p(y | x; θ) is equal to − log fθ(x)[y]. To summarize, this states that the supervision used in probe training can be interpreted as approximating the inf in the definition of V-entropy. In traditional probing, the performance of the probe is measured on the test set Dte using the traditional metric of the task, like accuracy or F1 score. In V-information probing, we use Dte to approximate the expectation in the definition of V-entropy. Thus, the performance of a single probe on representation R, where the performance metric is cross-entropy loss, is an estimate of HV(Y | R). This brings us to our framing of a probing experiment as estimating a V-information quantity.
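A minimal sketch of this recipe (the representation, labels, probe family, and split below are all synthetic, invented for illustration): fit a probe on D_tr to approximate the inf, then average − log f(x)[y] on D_te to approximate the expectation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic "representation" R and binary property Y; V = linear-softmax probes.
rng = np.random.default_rng(1)
R = rng.normal(size=(2000, 16))  # stand-in for phi_i(x)
w_true = rng.normal(size=16)
Y = (R @ w_true + 0.5 * rng.normal(size=2000) > 0).astype(int)

R_tr, R_te, Y_tr, Y_te = train_test_split(R, Y, random_state=0)

# Training approximates the inf over V (Eq. 9); the held-out cross-entropy
# approximates the expectation in the V-entropy definition.
probe = LogisticRegression(max_iter=1000).fit(R_tr, Y_tr)
H_V = log_loss(Y_te, probe.predict_proba(R_te)) / np.log(2)
print(H_V < 1.0)  # True: below the ~1-bit marginal entropy, so R is useful for Y
```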

B.2 Baselined probing

Let baselined probing be defined as in the main paper. Then if the performance metric is defined as the negative cross-entropy loss, we have that Perf(B) estimates −HV(Y | B), Perf(φ(X)) estimates −HV(Y | φ(X)), and so baselined probing performance is an estimate of

HV(Y | B) − HV(Y | φi(X)) = IV(φi(X) → Y) − IV(B → Y).    (10)

B.3 Conditional probing

Let conditional probing be defined as in the main paper. Then if the performance metric is defined as the negative cross-entropy loss, we have that Perf([B; 0]) estimates −HV(Y | B), Perf([B; φ(X)]) estimates −HV(Y | B, φ(X)), and so conditional probing performance is an estimate of

HV(Y | B) − HV(Y | B, φi(X)) = IV(φi(X) → Y | B).    (11)

The first term is estimated with a probe on just B; under the definition of predictive family, this means providing the agent with the real values of the baseline, and some constant value like the zero vector instead of φi(X). That is, holding a constant value a in place of φi(x), and sampling b, y ∼ B, Y, the probability assigned to y is f(b, a)[y] for f ∈ V. The second term is estimated with a probe on both B and φi(X). So, sampling b, x, y ∼ B, X, Y, the probability assigned to y is f(b, φi(x))[y] for f ∈ V. Intuitively, conditional probing measures the new information in φi(X) because in both probes, the agent has access to B, so no benefit is gained from φi(X) supplying the same information.
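A self-contained sketch contrasting the two quantities (the data, features, and MLP probe family are invented for illustration; scikit-learn is assumed, and V-entropies are estimated on the training set for brevity). The baseline B resolves Y on unambiguous examples, while the "representation" phi encodes only the residual information, mirroring the motivating example from the introduction:

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 4000
Y = rng.integers(0, 2, size=n)
amb = rng.random(n) < 0.2  # 20% of examples are "ambiguous"
b = np.where(amb, rng.integers(0, 2, size=n), Y)  # baseline: Y unless ambiguous
B = np.column_stack([b, amb]).astype(float)
phi = np.where(amb, Y, 0).reshape(-1, 1).astype(float)  # only the residual info

def H_bits(feats, y):
    """Estimate H_V(Y | feats) with a 1-hidden-layer MLP probe, in bits."""
    probe = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=0).fit(feats, y)
    return log_loss(y, probe.predict_proba(feats)) / np.log(2)

zeros = np.zeros_like(phi)
H_B     = H_bits(np.column_stack([B, zeros]), Y)  # probe (1): [B; 0]
H_B_phi = H_bits(np.column_stack([B, phi]), Y)    # probe (2): [B; phi(X)]
H_phi   = H_bits(phi, Y)

conditional_estimate = H_B - H_B_phi  # Eq. 11: positive, phi adds information
baselined_estimate   = H_B - H_phi    # Eq. 10: negative, phi "worse" than B
print(conditional_estimate > 0, baselined_estimate < 0)
```

Baselined probing penalizes phi for lacking what B already supplies; conditional probing credits phi for exactly the information B lacks.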

C Proof of Proposition 1

Monotonicity If U ⊆ V, then HV(Y | C) ≤ HU(Y | C). Proof:

HU(Y | C) = inf_{f ∈ U} E_{c,y}[− log f(c, a)[y]]
          ≥ inf_{f ∈ V} E_{c,y}[− log f(c, a)[y]]    (12)
          = HV(Y | C)

This holds because the infimum is taken over the larger set: if f ∈ U then f ∈ V.

Non-negativity IV(Xℓ → Y | C) ≥ 0. Where V_C̄ ⊂ V is the subset of functions that satisfies f(c, c̄) = f(c, c̄′) ∀ c̄′ ∈ C̄, and a, a/ℓ denote the constant values of the unknown set with and without Xℓ, the proof is as follows:

HV(Y | C) = inf_{f ∈ V} E_{c,xℓ,y}[− log f(c, a)[y]]
          = inf_{f ∈ V_C̄} E_{c,xℓ,y}[− log f(c, xℓ, a/ℓ)[y]]
          ≥ inf_{f ∈ V} E_{c,xℓ,y}[− log f(c, xℓ, a/ℓ)[y]]    (13)
          = HV(Y | C ∪ {Xℓ})

By definition, IV(Xℓ → Y | C) = HV(Y | C) − HV(Y | C ∪ {Xℓ}) ≥ 0.

Independence If Y, C are jointly independent of Xℓ, then IV(Xℓ → Y | C) = 0. Proof:

HV(Y | C ∪ {Xℓ})
  = inf_{f ∈ V} E_{c,xℓ,y}[− log f(c, xℓ, a/ℓ)[y]]
  = inf_{f ∈ V} E_{xℓ} E_{c,y}[− log f(c, xℓ, a/ℓ)[y]]
  ≥ E_{xℓ}[ inf_{f ∈ V} E_{c,y}[− log f(c, xℓ, a/ℓ)[y]] ]
  = E_{xℓ}[ inf_{f ∈ V_C̄} E_{c,y}[− log f(c, xℓ, a/ℓ)[y]] ]
  = inf_{f ∈ V_C̄} E_{c,y}[− log f(c, a)[y]]    (14)
  ≥ inf_{f ∈ V} E_{c,y}[− log f(c, a)[y]]
  = HV(Y | C)

In the second line, we break down the expectation based on conditional independence. Then we apply Jensen's inequality and optional ignorance to remove the expectation w.r.t. xℓ. Since V_C̄ ⊂ V, the former's infimum is at least as large as the latter's. Then

IV(Xℓ → Y | C) = HV(Y | C) − HV(Y | C ∪ {Xℓ}) ≤ 0.

Combined with non-negativity (i.e., IV(Xℓ → Y | C) ≥ 0), we have inequality in both directions, so IV(Xℓ → Y | C) = 0.

D Equivalence of Xu et al. (2020) and our V-information

In order to define conditional probing, we needed a theory of V-information that considered arbitrarily many predictive variables X1, . . . , Xn. V-information as presented by Xu et al. (2020) considers only a single predictive variable X. It becomes extremely cumbersome, due to the use of null variables in their presentation, to expand this to more, let alone arbitrarily many variables. So, we redefined and extended V-information to more naturally capture the case with an arbitrary (finite) number of variables. In this section, we show that, in the single-predictive-variable case considered by Xu et al. (2020), our V-information definition is equivalent to theirs. For the sake of this section, we'll call the V-information of Xu et al. (2020) Xu-V-information, and ours V-information.

In particular, we show that there is a transformation from any predictive family of Xu-V-information to a predictive family for V-information under which the predictive V-entropies are the same (and the same in the opposite direction).

D.1 From Xu et al. (2020) to ours

We recreate the definition of predictive family from Xu et al. (2020) here:

Definition 4 (Xu predictive family). Let Υ = {f : X ∪ {∅} → P(Y)}. We say that U ⊆ Υ is an Xu predictive family if it satisfies

∀f ∈ U, ∀P ∈ range(f), ∃f′ ∈ U, s.t.    (15)
∀x ∈ X, f′[x] = P, f′[∅] = P.    (16)

Now, we construct one of our predictive families from the Xu predictive family. Let U ⊆ Υ be an Xu predictive family. We now construct a predictive family under our framework, V ⊆ Ω. For each f ∈ U, f : X ∪ {∅} → P(Y), construct the following two functions: first, g, which recreates the behavior of f on the domain X:

g : X → P(Y)    (17)
g : x ↦ f(x)    (18)

and second, g′, which recreates the behavior of f on ∅, given any input from X:

g′ : X → P(Y)    (19)
g′ : x ↦ f(∅)    (20)

Then we define our predictive family as the union of {g, g′} for all f ∈ U:

V = ⋃_{f ∈ U} {g, g′},    (21)

where V ⊆ Ω and Ω = {f : X → P(Y)}. Note from this construction that we've eliminated the presence of the null variable from the definition of predictive family.

We now show that V, as defined in the construction above, is in fact a predictive family under our definition. Under our definition, there are two cases: either X ∈ C or X ∈ C̄. If X ∈ C, then for all (f, x) ∈ V × X, if we take any f′ ∈ V (which is non-empty), then there is no c̄′ ∈ C̄, so the condition holds vacuously. If X ∈ C̄, then for all (f, x) ∈ V × X, we have that f was either g or g′ for some function h ∈ U in the construction of V (because all functions in V were part of some pair {g, g′}). Then we take f′ = g′, and have that for all c̄′ ∈ C̄, that is, x ∈ X, f′(c, c̄) = f′(c, c̄′) = g′(x) = h(∅), satisfying the constraint.

Finally, we show that the predictive V-entropies of V (under our definition) and U (under that of Xu et al. (2020)) are the same. Consider the Xu-predictive entropies:

HU(Y | X) = inf_{f ∈ U} E_{x,y}[− log f[x](y)]    (22)
HU(Y | ∅) = inf_{f ∈ U} E_y[− log f[∅](y)]    (23)

First we want to show HU(Y | X) = HV(Y | X). Consider the inf in Equation 22; the f ∈ U that achieves the inf corresponds to some g ∈ V by construction, and since f(x) = g(x), we have that the value of the inf for V is at least as low as for U. The same is true in the other direction; in our definition of HV(Y | X), the g that achieves the inf corresponds to some f ∈ U that produces the same probability distributions. So, HU(Y | X) = HV(Y | X).

Now we want to show HU(Y | ∅) = HV(Y). Consider the inf in Equation 23. The f ∈ U that achieves the inf corresponds to some {g, g′} in the construction of V; that g′ takes any x ∈ X and produces f[∅]; hence the value of the inf for V is at least as low as for U. The same is true in the other direction. We have that HV(Y) = inf_{f ∈ V} E_y[− log f(a)[y]] for any a ∈ X. Either a g or a g′ from the construction of V achieves this inf; if a g achieves it, then its corresponding g′ emits the same probability distributions, so WLOG we'll assume it's a g′. We know that g′(a) = f(∅) for all a ∈ X, so HU(Y | ∅) is at most HV(Y). So, HU(Y | ∅) = HV(Y).

Since the V-entropies of the predictive family from Xu et al. (2020) and ours are the same, all the information quantities are the same. This shows that the predictive family we constructed in our theory is equivalent to the predictive family from Xu et al. (2020) that we started with.

D.2 From our V-information to that of Xu et al. (2020)

Now we construct a predictive family U under the framework of Xu et al. (2020) from an arbitrary predictive family V under our framework. For each function f ∈ V, we have from the definition that there exists f′ ∈ V such that ∀x ∈ X, f′(x) = P for some P ∈ P(Y). We then define the function:

g : X ∪ {∅} → P(Y)    (24)

g(x) = { f(x)   if x ∈ X
       { f′(a)  if x = ∅    (25)

where a ∈ X is an arbitrary element of X, and the set of constant-valued functions

G = {g′ : g′(z) = P | P ∈ range(f)},    (26)

where z ∈ X ∪ {∅}, and let

U = ⋃_{f ∈ V} ({g} ∪ G).    (27)

The set U is a predictive family under Xu-V-information because for any f ∈ U, f is either a g or a g′ in our construction, and so optional ignorance is maintained by the set G that was either constructed for g or that g′ was a part of. That is, from the construction, G contains a function for each element in the range of g (or g′) that maps all x ∈ X as well as ∅ to that element, and U contains all elements of G.

Now we show that the predictive V-entropies of U (from this construction) under Xu et al. (2020) are the same as for V under our framework.

First we want to show HU(Y | X) = HV(Y | X). For the g that achieves the inf over U in Equation 22, we have that there exists f ∈ V such that g(x) = f(x) given that x ∈ X, so HV(Y | X) ≤ HU(Y | X). The same is true in the other direction; the f ∈ V that achieves the inf in V-entropy similarly corresponds to a g ∈ U, implying HU(Y | X) ≤ HV(Y | X), and thus their equality.

Now we want to show HU(Y | ∅) = HV(Y). For the g ∈ U that achieves its inf, we have by construction that there is an f′ ∈ V such that for any a ∈ X, it holds that g(∅) = f′(a). So, HV(Y) ≤ HU(Y | ∅). In the other direction, for the f ∈ V that achieves its inf given an arbitrary a ∈ X, there is the f′ ∈ V from our construction of U such that f(a) = f′(x) = g(∅) for all x ∈ X. This implies HU(Y | ∅) ≤ HV(Y), and thus their equality.

D.3 Remarks on the relationship between our V-information and that of Xu et al. (2020)

The difference between our V-information and that of Xu et al. (2020) is in how the requirement of optional ignorance is encoded into the formalism. This is an important yet technical requirement: if a predictive agent has access to the value of a random variable X, it is allowed to disregard that value if doing so would lead to a lower entropy. An example of a subset of Ω for which this doesn't hold in the multivariable case is the set of multi-layer perceptrons with a frozen (and, say, randomly sampled) first linear transformation. The information of, say, X1 and X2 is mixed by this frozen linear transformation, and so X1 cannot be ignored in favor of just looking at X2. However, if the first linear transformation is trainable, then it can simply assign 0 weights to the entries corresponding to X1 and thus ignore it.

The V-information of Xu et al. (2020) ensures this option by introducing a null variable ∅, which is used to represent the lack of knowledge about their variable X: for any probability distribution in the range of some f ∈ U under the theory, there must be some function f′ that produces the same probability distribution when given any value of X or ∅. This is somewhat unsatisfying because f should really be a function from X → P(Y), but this implementation of optional ignorance changes the domain to X ∪ {∅}. When attempting to extend this to the multivariable case, the definition of optional ignorance becomes very cumbersome. With two variables, the domain of functions in a predictive family must be (X1 ∪ {∅}) × (X2 ∪ {∅}). Because the definition of V-entropy under Xu et al. (2020) treats using X separately from using ∅, one must define optional ignorance constraints separately for each subset of variables to be ignored, the number of which grows exponentially with the number of variables.

Our re-definition of V-information gets around this issue by defining the optional ignorance constraint in a novel way, eschewing the ∅ and instead encoding it as the intuitive implementation that we described for the MLP: for any function in the family and any fixed value for some subset of the inputs (which will be the unknown subset), there is a function that behaves identically even if that subset of values is allowed to take any value (intuitively, by, e.g., having it be possible that the weights for those inputs are 0 at the first layer).

Figure 2: Probing results on RoBERTa for xpos. Results are reported in bits of V-information; higher is better.

E Full Results

In this section, we report all individual probing experiments: single-layer probes' V-entropies in Table 2, single-layer probes' task-specific metrics in Table 3, two-layer probes' V-entropies in Table 4, and two-layer probes' task-specific metrics in Table 5. In Figure 2, we report the xpos figure for RoBERTa corresponding to the other four figures in the main paper. We see that it shows roughly the same trend as the upos figure from the main paper.

Acknowledgements

The authors would like to thank the anonymous reviewers and area chair for their helpful feedback, which led to a stronger final draft, as well as Nelson Liu, Steven Cao, Abhilasha Ravichander, Rishi Bommasani, Ben Newman, Alex Tamkin, and much of the Stanford NLP Group for helpful discussions and comments on earlier drafts. JH was supported by an NSF Graduate Research Fellowship under grant number DGE-1656518, and by Two Sigma under their 2020 PhD Fellowship Program. KE was supported by an NSERC PGS-D scholarship.

RoBERTa Single-Layer V-Entropy

Layer   upos    xpos    dep     ner     sst2
0       0.336   0.344   1.468   0.391   0.643
1       0.145   0.158   0.827   0.216   0.645
2       0.119   0.139   0.676   0.188   0.630
3       0.118   0.133   0.635   0.172   0.574
4       0.117   0.132   0.627   0.167   0.545
5       0.119   0.136   0.632   0.167   0.489
6       0.121   0.139   0.645   0.167   0.484
7       0.126   0.145   0.640   0.170   0.462
8       0.129   0.144   0.633   0.168   0.467
9       0.131   0.149   0.653   0.173   0.494
10      0.138   0.156   0.677   0.177   0.508
11      0.154   0.169   0.705   0.184   0.527
12      0.161   0.182   0.746   0.191   0.583

Table 2: V-entropy results (in bits) on probes taking in one layer, for each layer of the network. Lower is better.

RoBERTa Single-Layer Metrics

Layer   upos    xpos    dep     ner     sst2
0       0.908   0.908   0.669   0.535   0.808
1       0.968   0.964   0.821   0.710   0.815
2       0.975   0.969   0.854   0.735   0.813
3       0.975   0.970   0.865   0.763   0.845
4       0.976   0.971   0.867   0.763   0.850
5       0.975   0.970   0.866   0.763   0.869
6       0.975   0.970   0.863   0.764   0.870
7       0.974   0.969   0.864   0.754   0.877
8       0.974   0.970   0.865   0.762   0.868
9       0.974   0.969   0.863   0.756   0.860
10      0.973   0.968   0.859   0.756   0.857
11      0.971   0.967   0.854   0.744   0.850
12      0.969   0.965   0.847   0.735   0.843

Table 3: Task-specific metric results on probes taking in one layer, for each layer of the network. For upos, xpos, dep, and sst2, the metric is accuracy. For NER, it is span-level F1 as computed by the Stanza library (Qi et al., 2020). For all metrics, higher is better.

RoBERTa Two-Layer V-Entropy

Layer   upos    xpos    dep     ner     sst2
0-0     0.335   0.345   1.466   0.391   0.639
0-1     0.141   0.154   0.763   0.210   0.633
0-2     0.115   0.133   0.646   0.184   0.615
0-3     0.110   0.127   0.609   0.169   0.567
0-4     0.109   0.126   0.593   0.164   0.549
0-5     0.110   0.125   0.602   0.163   0.474
0-6     0.109   0.126   0.613   0.165   0.484
0-7     0.109   0.127   0.609   0.166   0.451
0-8     0.108   0.125   0.598   0.171   0.462
0-9     0.109   0.125   0.614   0.167   0.489
0-10    0.110   0.127   0.636   0.177   0.504
0-11    0.111   0.127   0.654   0.175   0.525
0-12    0.116   0.132   0.682   0.185   0.563

Table 4: V-entropy results (in bits) on probes taking in two layers: layer 0 and each other layer of the network. Lower is better.


RoBERTa Two-Layer Metrics

Layer   upos    xpos    dep     ner     sst2
0-0     0.908   0.907   0.670   0.543   0.808
0-1     0.969   0.965   0.834   0.722   0.825
0-2     0.975   0.970   0.861   0.744   0.822
0-3     0.976   0.971   0.870   0.765   0.850
0-4     0.977   0.972   0.874   0.767   0.849
0-5     0.977   0.972   0.872   0.772   0.875
0-6     0.977   0.972   0.869   0.766   0.875
0-7     0.977   0.972   0.869   0.760   0.869
0-8     0.977   0.972   0.872   0.762   0.862
0-9     0.977   0.972   0.869   0.766   0.864
0-10    0.977   0.972   0.864   0.755   0.857
0-11    0.977   0.972   0.861   0.753   0.859
0-12    0.975   0.971   0.856   0.743   0.847

Table 5: Task-specific metric results on probes taking in two layers: layer 0 and each other layer of the network. For upos, xpos, dep, and sst2, the metric is accuracy. For NER, it is span-level F1 as computed by the Stanza library. For all metrics, higher is better.