A GENERAL THEORY FOR EVALUATING JOINT DATA
INTERACTION WHEN COMBINING DIVERSE DATA SOURCES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF GEOLOGICAL AND
ENVIRONMENTAL SCIENCES
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Evgenia I. Polyakova
April 2008
UMI Number: 3313643
Copyright 2008 by
Polyakova, Evgenia I.
All rights reserved.
INFORMATION TO USERS
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleed-through, substandard margins, and improper
alignment can adversely affect reproduction.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if unauthorized
copyright material had to be removed, a note will indicate the deletion.
UMI Microform 3313643
Copyright 2008 by ProQuest LLC.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest LLC
789 E. Eisenhower Parkway
PO Box 1346
Ann Arbor, MI 48106-1346
© Copyright by Evgenia I. Polyakova 2008
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
Dr. Andre Journel, Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
Dr. Paul Switzer
Approved for the University Committee on Graduate Studies.
Abstract
Accounting for data interaction is a necessary and critical step in any data integration algorithm. Data interaction, whether through information redundancy, compounding, or cancellation, can completely change the image provided by the mere association of individual data ignoring their interaction. Data interaction is just as important as the individual data information content and depends on both the data values and the unknown being assessed. Yet most data integration algorithms ignore data interaction, completely or partially, by assuming some form of data independence for a given question asked or, worse, for any question asked. More advanced analyses acknowledge dependence of data information content but still model it only as linear dependence (linear correlation) between any two data rather than considering all data together; moreover, this linear correlation is assumed independent of the data values and of the event or value being estimated.
In this study, the general problem of data integration is expressed as combining probability distributions conditioned to each individual datum or data event into a posterior probability for the unknown conditioned jointly to all data. The goal of this thesis is to develop a method and model of statistical analysis accounting for data interaction. Addressing this goal, we propose the nu expression [64], the sister of the previously developed tau expression. Both the nu and tau expressions provide an exact analytical solution to the problem of data integration by combining individually conditioned probabilities while accounting for interaction between data. This is achieved by separating individual data information from data interaction. The nu and tau interaction parameters are data values-dependent and, even more critically, unknown value-dependent. This data value-dependency (heteroscedasticity) allows
for a better representation of joint data interaction than do traditional regression or kriging weights, which are independent of the data values. However, the greater that heteroscedasticity, the more difficult the inference of the data interaction parameters becomes. We investigate the behavior of the nu and tau parameters versus data values. The nu parameters, being ratios of ratios of likelihood probabilities, appear more stable than the tau parameters and could be estimated starting from summary statistics of the actual data values taken altogether. Also, the tau interaction weights depend on a specific ordering of the data. While such ordering is important, in most applications it is the global (data-sequence-independent) representation of such interaction that matters. The tau expression fails to provide such a global measure. The nu model allows the derivation of a single, data-sequence-independent, interaction measure.
The nu model proposed is extensively tested using synthetic data sets. The test experiments confirmed superior features of the nu model compared with the tau model or traditional statistical approximations. The practicality of the nu expression will depend on our ability to generate proxy training data from which to borrow and export the nu parameters.
Acknowledgments
First of all I would like to thank my family who made me believe that everything is
within my reach. I am greatly thankful to my dad, Dr. Igor Polyakov, for making
many valuable comments related to my research. I am thankful to my husband, Ramil
Ahmadov, who helped me to improve the quality of the figures in this dissertation.
I am also thankful to my mom, Anna Polyakova, who helped me to keep things in
perspective.
I am deeply thankful to my adviser, Dr. Andre Journel, for his great insights, perspectives, passion, and guidance.
Also, I would like to thank Dr. Jef Caers, Dr. Paul Switzer, Dr. Robert Dunbar, Dr.
Jerry Harris, and Dr. Tarantola for agreeing to serve on my committee and for their
valuable comments during the spring reviews.
I also would like to extend sincere gratitude to the SCRF group. Special thanks to Dr. Alexander Boucher, Dr. Jianbing Wu, Dr. Scarlet Castro, and Ting Li for helping me with various aspects of my research.
Sincere gratitude is also extended to the staff of the Department of Geological and Environmental Sciences for helping with the administrative aspects of my dissertation.
Contents
Abstract v
Acknowledgments vii
1 Introduction 1
1.1 Data information and data interaction 3
1.2 The thesis proposal 5
1.3 Goals/objectives 7
1.4 A brief overview of the thesis chapters 8
2 A review of existing models 10
2.1 Conditional independence 10
2.1.1 The road to conditional independence 11
2.1.2 Heteroscedasticity 14
2.1.3 Bayesian networks 18
2.2 Probability combination algorithms 25
2.2.1 Linear pooling of probabilities 26
2.2.2 Supra Bayesian Methods 30
2.2.3 A brief overview of the tau representation 33
3 The nu representation 42
3.1 Derivation of the nu representation 43
3.1.1 The nu expression 43
3.1.2 Dictatorial property 51
3.1.3 A measure for data interaction 52
3.2 Tau or nu expression? 52
3.3 Approximations based on the nu derivation 54
3.3.1 Evaluating the conditional independence assumption 55
3.3.2 The classified ν0 approach 67
4 Application to binary data 77
4.1 An elementary case study 77
4.1.1 Equilateral configuration 77
4.1.2 Non-equilateral configuration 87
4.2 A 3D case study 95
4.2.1 The reference data set 95
4.2.2 The estimation configuration 97
4.2.3 Conditional probabilities and estimates 99
4.2.4 Ordering the data values combinations 102
4.2.5 Heteroscedasticity of the tau and nu weights 104
4.2.6 Independence-based estimates 107
4.2.7 The classified ν0 approach 109
5 Application to non-binary data 117
5.1 A single constraint 117
5.2 Large non-Gaussian ternary case study 128
5.2.1 The reference data set 129
5.2.2 The estimation configuration 130
5.2.3 Conditional probabilities and estimates 133
5.2.4 Determining the ν0(k) to ensure consistent probabilities 139
5.2.5 The classified ν0(k) approach 142
5.2.6 Inference robustness 147
6 Summary and conclusions 151
6.1 Summary of major theoretical developments 151
6.2 The nu expression: Theory 153
6.2.1 Tau expression 153
6.2.2 Nu expression 154
6.3 Approximations of the nu representation 156
6.4 Final conclusions 158
6.5 Future work 159
Bibliography 162
List of Tables
3.1 Joint distribution of indicators and their probabilities 56
3.2 Summary statistics: means and variances of 10,000 conditional probabilities P(A|B,C) and their approximations 60
3.3 Summary statistics: means and variances of 10,000 conditional probabilities P(A|B,C) and their transformed estimator 60
3.4 Eight data value combinations and their scores 69
4.1 Probability notation for the 16 joint occurrences 79
4.2 Distances between data-to-unknown and data-to-data 87
4.3 Summary statistics: means and variances of reference conditional probabilities and approximations stemming from the nu representation for 931 data value combinations 113
4.4 Means and variances of 10 independent realizations S0 113
4.5 The average means and variances of P(A = 1|D,B,C) over 90 combinations 114
5.1 Summary statistics for k = 1,2,3: spatial means and variances of reference conditional proportions and of approximations defined by the nu model and from the conditional independence assumption 142
5.2 Summary statistics: spatial means and variances of reference conditional probabilities and of estimates based on a classified nu representation 148
5.3 The average means of the 50 eroded training data sets for k = 1,2,3. For comparison, the right column shows reference means 148
5.4 Summary statistics: spatial means and variances of the 195 reference conditional probabilities and of the estimates built from a nu representation 149
5.5 Correlations of 195 reference proportion values P(A|D,B,C) with estimates based on a classified nu representation 149
List of Figures
2.1 Spatial geometry of the data for two different data values combinations 16
2.2 Graphical representation of joint dependencies between variables A, D1, and D2 19
2.3 Graphical representation under conditional independence between variables D1 and D2 given A 20
2.4 Graphical representations of bi-directional (1) and uni-directional (2) Bayesian nets 21
2.5 Graphical interpretation of the joint probability P(A,B,C) based on different sets of relationships between three variables A, B, and C 22
2.6 Graphical representation of conditional independence between variables D1 and D2 given A, and data independence between variables A and B 23
2.7 Example of dual training images depicting the interaction between two data types B and C 24
3.1 The scatterplots for the ν0 = 1 model (left) and the conditional independence estimator (right) versus reference 59
3.2 The scatterplot for the estimator of the fully conditional probability P(A = 0|B = 0, C = 0) based on transformed probabilities (y-axis) versus reference (x-axis) 61
3.3 Four training classes and their respective representative scores 70
3.4 Training image depicting the interactions between data and unknown 73
3.5 Data events definitions 74
3.6 Training image (left) summarized by the distribution of two summary scores shown on the score map (right) 75
4.1 Spatial locations of three data I1, I2, I3 and the unknown A 78
4.2 Conditional probabilities. Concordant data case: A = I1 = I2 = I3 = 1 80
4.3 Data values-dependent error associated with the ν0 = 1 model 83
4.4 The sequence-dependent νi weights for the data concordant case A = I1 = I2 = I3 = 1 84
4.5 The single sequence-independent ν0 weight 85
4.6 The averaged error associated with the data-value-dependent ν0 model and with the ν0 = 1 model 86
4.7 Non-equilateral data configuration 88
4.8 Conditional probabilities for the non-equilateral case with A = I1 = I2 = I3 = 1 89
4.9 Checking the consistency relation 90
4.10 Approximation errors for the eight data value configurations 91
4.11 Error linked to the ν0 = 1 model (non-equilateral case) 92
4.12 Error linked to the "full independence" hypothesis (non-equilateral case) 92
4.13 Error linked to conditional independence (non-equilateral case) 93
4.14 Bias (error) averaged over all data values combinations 94
4.15 Reference binary image generated by truncating a continuous Gaussian realization at its upper quartile 96
4.16 Exhaustive indicator variograms, calculated over the 35 top layers. EW is the east-west direction and NS is the north-south direction 96
4.17 Data events definition 97
4.18 (1) The reference eroded data set S0, (2) its histogram, and (3) binary reference field with the prior P(A = 1) = 0.274 100
4.19 (1) The estimate of the fully conditioned probability P(A|D,B,C) using the ν0 = 1 model, (2) its histogram, and (3) reference binary field with the prior P(A = 1) = 0.274 101
4.20 (1) Histogram of the error P*(A|D,B,C) − P(A|D,B,C) and (2) the corresponding scatterplot of P*(A|D,B,C) based on the ν0 = 1 model versus the reference P(A|D,B,C) 102
4.21 Sequence-dependent interaction parameters τ3 (red) and ν3 (blue) for data sequences (1) DBC/BDC, (2) DCB/CDB, and (3) CBD/BCD 104
4.22 Exact ν0 parameter for 931 data value combinations 105
4.23 Scatterplots of estimated probabilities P*(A|D,B,C) versus the reference P(A|D,B,C) 108
4.24 Exact ν0 values versus average sand values defined over the three data events D, B, C 111
4.25 Scattergram of the ν0 = 1 model (left) and the classified ν0 model (right) relative to the reference probability 112
4.26 The histograms of the means of the 90 reference P(A = 1|D,B,C) values (left), and their estimators based on the ν0 = 1 model (center) and the classified ν0 approach (right) 115
4.27 The histograms of the variances of the 90 reference P(A = 1|D,B,C) values (left), and their estimators based on the ν0 = 1 model (center) and the classified ν0 approach (right) 116
5.1 Reference categorical image generated using a training image generator (the representation of the two categories A = 2 and A = 3 does not reflect their proportions) 130
5.2 Exhaustive indicator variograms in the x, y and z directions, calculated over the 35 top layers for k = 1,2,3 131
5.3 Data events definition 132
5.4 (1) Spatial distribution, (2) histogram of the conditional proportions P(A(u) = 1|D,B,C) defined over the reference eroded volume S0, and (3) the reference categorical field 134
5.5 (1) Spatial distribution, (2) histogram of the conditional probabilities P*(A(u) = 1|D,B,C) estimated with the model ν0(k) = 1, and (3) categorical reference field with respective proportions 136
5.6 (1) Histogram of the error P*(A = 1|D,B,C) − P(A = 1|D,B,C) and (2) the corresponding scatterplot of the estimate P*(A = 1|D,B,C) versus the reference P(A = 1|D,B,C) 137
5.7 Histogram of the sum over k = 1,2,3 of P*(A = k|D,B,C) 138
5.8 Various scatterplots of the reference probability P(A = k|D,B,C) with the estimator based on the conditional independence assumption and the estimator based on the ν0(k) model 141
5.9 Classification of scores m(1), m(2), m(3) 145
5.10 Scattergram of the reference proportion P(A = k|D,B,C) along the x-axis versus the estimate P*(A = k|D,B,C) based on the classified ν0(k) model for k = 1 (left), k = 2 (center), k = 3 (right) along the y-axis 147
Chapter 1
Introduction
Accounting for data interaction is a necessary and critical step in any data integration algorithm. Data interaction, whether through information redundancy, compounding, or cancellation, can completely change the naive image provided by the mere association of individual data ignoring their interaction. Think of datum 1: a stockbroker advising purchase of a hot stock; then datum 2: another stockbroker who compounds that buy advice; last, datum 3: a trustworthy friend who admits knowing nothing about that stock but who alerts one that both stockbrokers receive commissions from the same dubious investment house. Datum 3 is not related directly to the question being asked, "Is that stock worth buying?", but it interacts critically with the two other data, possibly resulting in a dramatic change of their information content. If the question being asked (the unknown) changes, say into "Will it rain tomorrow?", datum 3 (the two brokers work for the same house) loses its weight, and one may safely compound the two concordant data 1 and 2 which advised one to bring an umbrella. Data interaction is just as important as individual data information content; data interaction depends on the data values and also on the unknown being assessed (buy or not buy a stock, versus will it rain tomorrow?).
Yet, most data integration algorithms ignore all or part of the concept of data interaction, either by assuming some form of data independence for a given question asked or, worse, for any question asked. In the better cases, dependence of data information content is recognized but is modeled only
1. as linear dependence (linear correlation) between any two data rather than considering all data together; and
2. this linear correlation is assumed independent of the data values and of the event or value being estimated. The correlation ρ_{D1D2} between the two data D1 and D2 remains the same whether the two data take median-type, "middle-of-the-road" values {D1 = d1, D2 = d2} or extreme values {D1 = d1' >> d1, D2 = d2' >> d2}.
Then, and with much more severe consequences, the same correlation value ρ_{D1D2} is used no matter the unknown event A to which the two data are applied. It does not matter whether the question A being asked relates to a stock buy or to the appropriateness of carrying an umbrella. The same pair of median-type data values {D1 = d1, D2 = d2} may carry valuable information content when evaluating a median-type outcome A = a, but may be of little value to evaluate an extreme outcome A = a' >> a.
The correlation matrix ρ hereafter defined is most often irrelevant to resolve the previous questions:
\[
\rho = \begin{pmatrix}
1 & \rho_{D_1 D_2} & \rho_{D_1 A} \\
\rho_{D_2 D_1} & 1 & \rho_{D_2 A} \\
\rho_{A D_1} & \rho_{A D_2} & 1
\end{pmatrix} \tag{1.1}
\]
What is needed is the trivariate probability involving jointly both data and the unknown; that trivariate statistic is data values (d1, d2) and unknown value (a) dependent:
\[
P(A = a, D_1 = d_1, D_2 = d_2) = \varphi(a, d_1, d_2) \tag{1.2}
\]
From such a joint probability one can retrieve the joint information content of the two data {D1 = d1, D2 = d2} as to the outcome A = a occurring. That information content would take the form of the conditional probability:
\[
P(A = a \mid D_1 = d_1, D_2 = d_2) = \frac{P(A = a, D_1 = d_1, D_2 = d_2)}{P(D_1 = d_1, D_2 = d_2)} = \psi(a, d_1, d_2) \tag{1.3}
\]
Both probabilities (1.2) and (1.3) are functions φ(·) and ψ(·) of the data values (d1, d2) and of the outcome being assessed (a), a situation we will call "heteroscedastic" or data values-dependent. Any approximation of probabilities (1.2) and (1.3) that calls for some form of invariance with respect to the data values, a situation we will call "homoscedastic", is not to be taken lightly.
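To make relations (1.2) and (1.3) concrete, here is a small numeric sketch for binary variables; the joint table values below are purely illustrative and are not taken from any data set in this thesis:

```python
# Illustrative trivariate probability phi(a, d1, d2) of eq. (1.2),
# for binary A, D1, D2. The numbers are made up for demonstration.
phi = {
    (0, 0, 0): 0.25, (0, 0, 1): 0.10, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.10, (1, 1, 1): 0.25,
}

def psi(a, d1, d2):
    """Conditional probability of eq. (1.3): P(A=a | D1=d1, D2=d2)."""
    marginal = sum(phi[(x, d1, d2)] for x in (0, 1))  # P(D1=d1, D2=d2)
    return phi[(a, d1, d2)] / marginal

# The conditional probability changes with the data values (d1, d2)
# and with the outcome a being assessed: the "heteroscedastic" situation.
print(psi(1, 0, 0))  # 0.05 / 0.30, about 0.167
print(psi(1, 1, 1))  # 0.25 / 0.30, about 0.833
```

The same function ψ(·) gives very different answers for different data values, which is exactly the data values-dependency that homoscedastic approximations discard.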
For example, the traditional correlation model underlying regression analysis and kriging considers only dependence between any two (and only two) variables at a time, as in the correlation matrix (2.29). In addition, that correlation matrix
• is the same whatever the data values (d1, d2) and the outcome value a: it is homoscedastic;
• characterizes only linear dependence. For example, it says nothing about the dependence between the two data D1 = d1', D2 = d2', although the data values d1', d2' are actually available and may be better correlated with the variable A;
• says nothing about the joint dependence (A, D1, D2). ρ_{D1D2} may be close to zero with high correlation values for both ρ_{D1A} and ρ_{D2A}, and yet the joint data event {D1 = d1, D2 = d2} may not be informative of the event A = a, that is:
P(A = a | D1 = d1, D2 = d2) ≈ P(A = a).
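The last point can be illustrated with the classic parity (XOR) construction — not an example from this dissertation, but a standard counterexample showing that pairwise correlations say nothing about joint dependence: here each datum taken alone is uncorrelated with A, yet the two data jointly determine A exactly (the mirror image of the bullet above, where individually informative data form an uninformative joint event):

```python
# A = D1 XOR D2, with (D1, D2) uniform over the four combinations.
outcomes = [(d1, d2, d1 ^ d2) for d1 in (0, 1) for d2 in (0, 1)]

def corr(i, j):
    """Pearson correlation between components i and j of the outcomes."""
    xs = [o[i] for o in outcomes]
    ys = [o[j] for o in outcomes]
    n = len(outcomes)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

print(corr(0, 2), corr(1, 2))  # 0.0 0.0 -- pairwise, each datum looks useless
print(all(a == (d1 ^ d2) for d1, d2, a in outcomes))  # True -- jointly exact
```

No correlation matrix of the form (1.1) can distinguish this situation from one where the data are truly useless.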
1.1 Data information and data interaction
In the case of the two data values D1 = d1 and D2 = d2 informing the occurrence of a specific outcome A = a of a third variable, what is needed is the conditional probability (1.3). In the case of n data D1 = d1, ..., Dn = dn, one would need the fully conditioned probability:
\[
P(A = a \mid D_1 = d_1, \ldots, D_n = d_n) = \psi(a; d_1, \ldots, d_n) \tag{1.4}
\]
a function ψ(·) of the (n + 1)-variate probability distribution. One would like to decompose the function ψ(·) into:
• the n elementary data contributions P(A = a | Di = di), i = 1, ..., n;
• some data values-dependent parameters θ modeling the n-data interaction in presence of the specific outcome A = a being assessed.
The determination of the exact expression (1.4) would then be divided into the two, potentially easier, tasks of
(1) determining the elementary probabilities P(A = a | Di = di), i = 1, ..., n;
(2) determining the parameters θ needed to combine the previous elementary probabilities into the fully conditioned probability (1.4):
\[
\psi(a; d_1, \ldots, d_n) = F_\theta\left[ P(A = a \mid D_i = d_i),\ i = 1, \ldots, n \right] \tag{1.5}
\]
Hereafter we will assume task (1) done: there are many calibration algorithms, including neural networks [68] and indicator/probability kriging ([49], [11]), that allow evaluating the elementary probabilities P(A|Di = di) associated with elementary data or data events Di. There remains task (2), which is the objective of this thesis.
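For later reference, the simplest (and, as this thesis argues, often unwarranted) choice of the combination function F_θ in relation (1.5) assumes conditional independence of the data given A. A hedged sketch of that baseline, with made-up input probabilities:

```python
from math import prod

def combine_cond_indep(prior, elementary):
    """Combine elementary probabilities P(A=a | D_i=d_i), i = 1..n, under
    the conditional independence assumption:
        P(A=a | d_1,...,d_n)  proportional to  P(A=a) * prod_i P(A=a|d_i)/P(A=a)
    `prior` maps each outcome a to P(A=a); `elementary` is a list of dicts,
    one per datum, each mapping a to P(A=a | d_i).
    """
    unnorm = {a: prior[a] * prod(p[a] / prior[a] for p in elementary)
              for a in prior}
    z = sum(unnorm.values())  # normalize over the outcomes a
    return {a: u / z for a, u in unnorm.items()}

# Two data, each raising P(A=1) above its prior of 0.3:
prior = {0: 0.7, 1: 0.3}
elementary = [{0: 0.4, 1: 0.6}, {0: 0.5, 1: 0.5}]
print(combine_cond_indep(prior, elementary))  # P(A=1 | both) = 7/9, about 0.78
```

This combiner needs no interaction parameters θ at all, precisely because it assumes there is no data interaction to model; the chapters that follow examine when that assumption fails.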
1.2 The thesis proposal
Combining information from different sources while accounting for their reliability is a challenging and recurrent task in many scientific fields. Statisticians view this problem, stated in relation (1.5), as combining prior and pre-posterior probabilities into a posterior probability (e.g. [31]; [13]). Most often some form of conditional data independence is assumed to obtain the posterior probabilities, which may lead to incorrect and possibly non-conservative conclusions if any datum transforms the information brought by the other data. In geology, there is rarely satisfying ground to assume data independence or conditional independence. Different data stem from events that are often associated with a common geological background. From one location to another nearby location, geological structures are related to one another, leaving us with the challenge of building posterior probabilities that do not start by assuming any form of independence.
The assumption of data independence is routinely used, for example, in linear regression theory, being the origin of the name "independent variables" [28]. The independence assumption was overcome (only to a degree) by introducing the variogram/covariance concept, which accounts for linear dependence (and only linear) between any two (and only two) data values or events, for example as observed at two different locations in space. However, complex geological patterns whose description involves multiple locations in space were beyond the reach of this traditional two-point geostatistics until the recent introduction of multiple-point geostatistics [70].
Any individual datum information can be coded into a conditional probability; this
raises the question of how to combine different probabilities. One simplistic solution
is to consider weighted linear averages of the prior probabilities to estimate the final
posterior conditional probability, as done in indicator kriging [45]. For example,
assume we are trying to find the probability P(A | B, C) conditioned to the two data
events B and C. One possible solution is:
\[
P(A \mid B, C) = \alpha_1 P(A \mid B) + \alpha_2 P(A \mid C)
\]
For such a linear combination to take a value in [0, 1], one typically restricts the weights α to be positive and sum to one, which entails convexity of the result: the combined probability is then bounded by the two prior probabilities P(A | B) and P(A | C). Such convexity is undesirable because it precludes the possibility of compounding, say, a high probability P(A | B) with the concordant information carried by P(A | C) into a combined probability P(A | B, C) higher than either.
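A minimal sketch of this convexity limitation; the probabilities and weights below are arbitrary illustrative values:

```python
def linear_pool(probs, weights):
    """Linear opinion pool: a convex combination of elementary probabilities."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-12
    return sum(w * p for w, p in zip(weights, probs))

p_b, p_c = 0.8, 0.9   # two concordant, individually high probabilities
pooled = linear_pool([p_b, p_c], [0.5, 0.5])
print(pooled)         # about 0.85: trapped between 0.8 and 0.9
# Convexity: the pool can never compound the two concordant data into a
# combined probability exceeding max(p_b, p_c).
assert min(p_b, p_c) <= pooled <= max(p_b, p_c)
```

No choice of admissible weights escapes this bound, which is why non-linear combiners such as the tau model discussed next are attractive.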
The non-linear tau model introduced by Bordley [6] and Journel [48] allowed for an efficient solution to the problem of data integration without any severe restriction on the interaction parameters θ introduced in relation (1.5). Their work, however, stopped short of providing an analytical expression for these data values-dependent parameters θ modeling the n-data interaction in presence of the specific outcome A = a. The tau model became the focal point of recent research at Stanford, with the key goal of obtaining and interpreting these interaction parameters. The major breakthrough happened in 2004, when Krishnan [50] proposed a solution to the problem of probabilistic data integration through an exact mathematical expression for the interaction parameters θ. However, and not surprisingly because they are exact, these interaction parameters θ are too complex to be used in practice as they are:
• data value- and unknown value-dependent;
• dependent on the specific ordering of the data: datum D1 being considered before datum D2.
However, Krishnan's exact derivation is an excellent starting point for building approximations of these interaction parameters. In this thesis we propose two such approximations. The first one broadens the traditional conditional independence assumption of no data interaction. The second approximation borrows the interaction parameters θ from training data depicting the multivariate relation between data and unknown, just as one would borrow a variogram or correlation matrix from an outcrop. The other key contribution of this thesis is the modification of the tau model into the so-called nu model with a single interaction parameter. That single parameter, while still data values- and unknown value-dependent, is independent of the data ordering and hence provides a measure of the global (joint) interaction between data and unknown.
1.3 Goals/objectives
Based on the above discussion and considerations, the goal of this thesis is to develop a model of statistical analysis taking into account joint data interaction. Meeting this goal involves the following tasks:
1. Provide an overview of algorithms currently used in practice, identifying the strengths and weaknesses of each approach.
2. Identify the most critical components of any data integration algorithm.
3. Define a measure of data interaction.
4. Develop a data integration algorithm that measures joint data interaction.
5. Test that data integration algorithm on cases for which the "truth" is known, and compare the results with those obtained by other statistical models.
Our study is guided by the following starting points:
• Data independence is not a valid assumption for most practical applications involving a common genetic process (e.g. geology) linking all data together.
• Data interaction should be an essential component of any data integration algorithm.
• Ratios of probabilities are more stable than the component probabilities themselves.
• An approximate model accounting for data interaction is better than an arbitrary assumption of no data interaction.
1.4 A brief overview of the thesis chapters
In a probabilistic setting, the data are assumed to inform the unknown in the form of probability distributions. These probabilities differ because they originate from different data events, because of different assumptions about the information content of each datum, and because of different theories about the physics underlying each datum event [43]. We then have to aggregate these different probabilities into a single distribution or posterior probability that can be used for decision making. This framework has been studied over the past twenty years, and there are many approaches to combining these probability distributions [43]. The easiest solution is to directly assume independence between information sources. Section 2.1 offers a critical overview of such an assumption. However, it is important to build models not stemming from an independence hypothesis, as it is rarely satisfied in practice. Unfortunately, one way or another, many mainstream models can be shown to have a link relating them to such an independence assumption. In Section 2.2 we review three major approaches to the problem of data integration. The first method reviewed is referred to as the linear opinion pool. The second approach is known as the supra-Bayesian method, where the probability distributions provided by the data are combined with the prior distribution using Bayes' rule. The third approach reviewed is based on Journel's [48] and Krishnan's [50] derivation of the tau model.
Once the background of the problem has been developed, Chapter 3 suggests a methodology to approach the problem of data combination. We build our methodology on the expression developed by Bordley [6], independently approached from a different angle by Journel [48], and further developed by Krishnan [50]. We refer to this expression as the tau model because of the tau data interaction weights used in that expression. In this work we propose another expression built on the tau model, which we refer to as the nu model. The nu model puts forward a single global data interaction parameter which does not depend on any specific ordering of the data; the tau model allows only for the derivation of sequence-dependent interaction weights. The tau and nu sequence-dependent interaction weights are shown to have a one-to-one relation, but they have different limit expressions and different behaviors versus data values. The model with less variation over data values will be easier to infer.
Both the tau and nu elementary interaction weights share the same attractive property of being dependent on the data values and, even more critically, on the unknown value being assessed. This property has its drawback in that the interaction weights are much more difficult to infer in practice. The concept of a training image is proposed for such inference. A training image is a particular representation of the joint data-to-unknown interaction, typically generated from the physics commanding that interaction. A training image can be available directly from nature, e.g. a geological outcrop, or it can be computer-simulated from an algorithm [70], [74].
In Section 3.2 we compare extensively the tau and nu models and suggest which model is more appropriate in different applications.
In Chapter 4 various synthetic data sets are used to illustrate the theory presented in Chapter 3. These examples lead to some interesting, not all trivial, conclusions and observations.
In Chapter 5 the theory of Chapter 3 and the testing of Chapter 4 are extended to non-binary data sets.
Chapter 6 concludes this thesis. Some common observations collected from the different chapters are made, and proposals for further research are suggested.
Chapter 2
A review of existing models
The problem of data integration could be considered the most fundamental one, pervasive in all modeling applications. Attempts at proposing rigorous probabilistic solutions to this problem date back to the middle of the 20th century. We present three such solutions: the linear pooling algorithm, the supra-Bayesian approach, and the tau model. The intention here is to summarize the key developments leading to this thesis proposal. More thorough reviews can be found in Jacobs [43], Abidi and Gonzalez [1], and Genest and Zidek [31]. While many solutions for data integration exist, most share a common link relating them to the traditional hypothesis of conditional independence. This assumption is fundamental to many statistical algorithms and is used extensively in practice. We respectfully argue that such an assumption is often unrealistic and should not be accepted at face value; instead it should be documented from the physics of the data. This raises the challenge of defining algorithms that do not start from any form of conditional independence. To be more specific, algorithms that take into account the joint dependence between data, and between data and unknown, are highly desirable.
2.1 Conditional independence
The hypothesis of conditional independence is widely used (whether it is stated or
not) in practice. This assumption refers to the notion of independence between data
given some unknown or given a question being asked. This assumption has been
taken for granted by many as if it was at the core of theoretical statistics: "...take a
notion of conditional independence as fundamental" [18] and "much of what appears
in the journals is yet another example of conditional independence" [58] are just a
few quotes to show the widespread usage of the conditional independence assumption. It has been acknowledged that the assumption of conditional independence is
rarely satisfied as "it is hard to find things that are truly independent" [51] yet it is
often adopted to "simplify computations and...to reduce the number and complexity
of the probability assessments" [39]. However, these critical views are often
set aside by many statisticians, who argue that at least approximately variables
can be treated as conditionally independent [18]. Many developments accept this
assumption without any sound prior justification. Indeed conditional independence
does lead to major simplification, but whether this assumption is appropriate must
be discussed. Under which physical conditions does knowing a particular outcome of the
unknown make the data independent? This question receives little coverage in the
statistical literature. As will be shown in later chapters of this thesis the assumption
of conditional independence is extremely stringent and rarely holds when applied to
spatially distributed data controlled by a common global physics such as geology.
More troublesome is the fact that, in the presence of actual data dependence, models built
on an assumption of conditional independence may lead to inconsistent probabilities,
that is, probabilities that fall outside the interval [0, 1] or otherwise violate the basic laws
of probability. Of course, standardization and other ad hoc fixes can be applied to
correct such inconsistencies.
2.1.1 The road to conditional independence
Consider the derivation of the fully conditioned probability
P(A = a | D1 = d1, ..., Dn = dn), that is, the probability of the unknown event A = a
given the n random variable (RV) values D1 = d1, ..., Dn = dn. In all rigor, all
n data events D1, ..., Dn should be used together (jointly) to model the sought-after
conditional probability P(A = a | D1 = d1, ..., Dn = dn). In practice, however, when
the data events "originate from different data sources, at possibly different scales
and resolutions" [48], it is rarely possible to directly build a model for such a fully conditioned probability. It may be possible, however, to model the elementary single
event-conditioned probabilities P(A = a | Di = di) with i = 1, ..., n. Such elementary
probabilities provide a standardized, unit-free coding of information, across all data
types, which facilitates the task of data integration. Critically, as opposed to a deterministic estimate, e.g. A* = fi(Di), the probability P(A | Di) includes both the
information content of datum Di and its uncertainty.
The goal is then to combine these elementary probabilities into an estimate or model
of the fully conditioned probability P(A = a | D1 = d1, ..., Dn = dn):

P(A = a | D1 = d1, ..., Dn = dn) = ψ(P(A = a), P(A = a | Di = di), i = 1, ..., n)    (2.1)

where P(A = a) is the prior probability, prior to considering any of the n data Di.
The easiest solution to build such a function ψ is to assume the traditional hypothesis
of data conditional independence.
Let the notation D_{i-1} represent all data D1, ..., D_{i-1} in the sequence up to Di excluded. Conditional independence between the data events Di and D_{i-1} given the
unknown A amounts to assuming that knowing a particular outcome a of the variable
A somehow erases any interaction between data D_{i-1} and datum Di, which leads to:

P(Di = di | A = a, D_{i-1} = d_{i-1}) = P(Di = di | A = a)    (2.2)

The chain rule for decomposing the (n + 1)-variate probability is written [67] as:

P(A = a, Di = di, i = 1, ..., n) = P(A = a) ∏_{i=1}^{n} P(Di = di | A = a, D_{i-1} = d_{i-1})

with D_0 = ∅.
Under the conditional independence assumption (2.2), that chain rule leads to:

P(A = a, Di = di, i = 1, ..., n) = P(A = a) ∏_{i=1}^{n} P(Di = di | A = a)

and finally to the conditional probability:

P(D1 = d1, ..., Dn = dn | A = a) = ∏_{i=1}^{n} P(Di = di | A = a)    (2.3)

That is, the joint data likelihood P(D1, ..., Dn | A) is conveniently reduced to the product of the much easier to infer elementary likelihoods P(Di | A).
The estimator of the fully conditioned probability P(A | D1, ..., Dn), using Bayes inversion and the conditional independence assumption (2.3), leads to:

P(A = a | D1 = d1, ..., Dn = dn) = P(A) ∏_{i=1}^{n} P(Di = di | A = a) / P(D1, ..., Dn)
= ∏_{i=1}^{n} [P(A | Di) P(Di)] / [P(A)^{n-1} P(D1, ..., Dn)]    (2.4)

Hence:

P(A | D1, ..., Dn) / P(A) = [1 / P(D1, ..., Dn)] ∏_{i=1}^{n} [P(A | Di) P(Di) / P(A)]    (2.5)

Even the restrictive data conditional independence hypothesis does not suffice to remove the difficult-to-get joint data probability P(D1, ..., Dn).
This can be solved by considering ratios of updated probabilities:

P(A | D1, ..., Dn) / P(nonA | D1, ..., Dn) = [P(A) / P(nonA)] ∏_{i=1}^{n} [P(Di | A) / P(Di | nonA)]    (2.6)
We should also note here that the validity of the conditional independence assumption strongly depends on the support of the unknown A. If the support of the
unknown A is larger than those of the conditioning data, the assumption of conditional independence can then be justified.
Conveniently, another approach to get the joint data probability P(D1, ..., Dn) in
equation (2.5) is to assume full data independence. Assuming that the data events
D1, ..., Dn are jointly independent leads to:

P(D1, ..., Dn) = ∏_{i=1}^{n} P(Di)    (2.7)
Hence, under both the assumption of conditional independence given A = a and
the assumption of data independence, the sought-after fully conditional probability
is written as:

P(A = a | D1, ..., Dn) / P(A = a) = ∏_{i=1}^{n} [P(A = a | Di) / P(A = a)]    (2.8)

that is, the updating ratio associated to all data is equal to the product of the elementary updating ratios.
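As a numerical illustration, the sketch below implements the combined rule (2.8) for hypothetical probability values; the function name and all numbers are illustrative assumptions, not part of this thesis. It also shows how the product rule can yield an illicit probability, echoing the inconsistency noted earlier in this section.

```python
# A minimal sketch of the combined rule (2.8): under conditional
# independence given A plus full data independence, the joint updating
# ratio is the product of the elementary updating ratios.
# All probability values below are hypothetical illustration numbers.

def combine_updating_ratios(prior, elementary):
    """Estimate P(A=a | D1,...,Dn) by equation (2.8).

    prior      : the prior probability P(A = a)
    elementary : the single-datum probabilities P(A = a | Di = di)
    """
    ratio = 1.0
    for p in elementary:
        ratio *= p / prior          # elementary updating ratio
    return prior * ratio

# Two data events, each raising the prior 0.3 to 0.6:
posterior = combine_updating_ratios(0.3, [0.6, 0.6])
print(posterior)  # 0.3 * (0.6/0.3)**2 = 1.2, an illicit probability
```

With only two moderately informative data the product rule already exits the interval [0, 1]; this is precisely the kind of inconsistency that ad hoc standardization is then called upon to fix.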
2.1.2 Heteroscedasticity
All the probabilities and conditional probabilities presented thus far depend on both
the data values and the unknown value: P(A = a | Di = di, i = 1, ..., n), written concisely
as P(A | Di, i = 1, ..., n). Critically, the spread (e.g. variance) of such probabilities,
which relates to uncertainty, is data values-dependent, a situation we will refer to as
"heteroscedastic". Conversely, independence of that spread from the data values and
from the unknown value is referred to as homoscedasticity.
The roots of the word homoscedasticity, or invariability of the error variance, come
from regression theory. It has been treated in some detail by many statistics and
economics texts ([19], [38], [42], [72]). Most often the assumption of homoscedasticity is made as a matter of pure convenience, as it considerably reduces the modeling requirements. However, as has been pointed out by Downs [19], it is the study
of heteroscedasticity which "may provide the only available evidence of interacting
variables". Such interaction between data for any given unknown may change the
naive assessment made from an association of individual data ignoring their interaction. Just like conditional independence, the homoscedasticity assumption should be
documented and made with caution, not accepted blindly as a matter of convenience. Unfortunately, the assumption of homoscedasticity is all too often taken
for granted, heteroscedasticity being seen as an illness to cure [19].
Some examples of homoscedasticity are:
• In regression theory and traditional geostatistics, the (regression) kriging weights
are homoscedastic in that they depend only on the variogram/covariance model
and the spatial geometry of the data, but are independent of the data values. Critically, the kriging (estimation) error variance is also data values-independent:
an assumption, or a result, that is contrary to what is observed in practice.

For example, consider the two identical data configurations with different data
values shown in Figure 2.1. The two geometric configurations of Figure 2.1 are
identical: in both cases the two data D1 and D2 are located at the same distance
from the unknown A. However, in Figure 2.1 (2), the very different data values
(D1 = 1%, D2 = 15%) would most likely carry a greater error potential in the
estimation of the unknown A than in Figure 2.1 (1), where the unknown A is
surrounded by two consistently small values (D1 = 1%, D2 = 1.5%).
• The in-built assumption of homoscedasticity in regression has led to much effort
Figure 2.1: Spatial geometry of the data for two different data values combinations. (1) unknown A is surrounded by two points with small data values; (2) unknown A is surrounded by the same two points but with very different data values which potentially can lead to greater error.
to justify it [19], most notably by calling on the properties of the Gaussian random function. Indeed, a characteristic property of the multivariate Gaussian
distribution is that all conditional distributions are Gaussian, fully characterized by the conditional mean, which identifies the linear regression estimate, i.e.
kriging, and the conditional variance, which is homoscedastic and identifies the
non-conditional error variance or kriging variance:
E{[A − A*]^2 | Di = di, i = 1, ..., n} = E{[A − A*]^2} = σ_K^2    (2.9)
If one accepts without question the multivariate Gaussian distribution model,
then the homoscedastic assumption need not call for any further discussion.
• The homoscedasticity of data errors.
Assume the availability of n data events D1, ..., Dn that inform the unknown
A, with the corresponding error terms R1, ..., Rn. These n data can then be
modeled as:

Di = fi(A) + Ri(A)    (2.10)
The measurement Di is seen as a physical deterministic function fi of the unknown A plus a random error or deviation Ri [48]. One can argue that the
model Di = fi(A) + Ri(A) is absolutely general as long as the distribution of the
errors Ri is accepted as dependent on the variable A. Hence for A = a, the
data remain random, Di = fi(a) + Ri(a), with the actual datum value di corresponding to a particular (unknown) realization ri of the error random variable:
di = fi(a) + ri. Then:

P(Di = di | A = a) = P(Ri = ri | A = a)  ∀i

and:

P(Di = di, i = 1, ..., n | A = a) = P(Ri = ri, i = 1, ..., n | A = a)    (2.11)
The joint data likelihood calls for the equally difficult-to-get joint likelihood
of the n error RV's. Therefore several simplifying hypotheses are made, often
without further justification. The errors Ri are assumed:

1. conditionally independent given A = a,

2. with (homoscedastic) distribution independent of A.

Under these two hypotheses, the joint data likelihood (2.11) becomes:

P(Di = di, i = 1, ..., n | A = a) = ∏_{i=1}^{n} P(Ri = ri | A = a) = ∏_{i=1}^{n} P(Ri = ri)    (2.12)
Lastly, a third hypothesis of Gaussian error distributions is commonly made.
We argue that the errors are often directly related to the unknown A: a change in the
unknown value should also be reflected in the distribution of the error term in
equation (2.10). In geostatistics, one particular form of such heteroscedasticity
is the commonly observed "proportional effect" [49], which refers to an increase
in the spatial variance in areas with a greater local mean. In such cases, Var(Ri)
is directly affected by the specific unknown value a [50].
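To make the proportional effect concrete, the sketch below simulates a hypothetical heteroscedastic error model in which the error standard deviation grows proportionally with the unknown value a. The model, its 10% relative error, and the sample sizes are illustrative assumptions, not models used in this thesis.

```python
# A sketch of a heteroscedastic error model D = f(A) + R(A): the error
# standard deviation is made proportional to the unknown value a, a
# simple stand-in for the geostatistical "proportional effect". The 10%
# relative error and sample sizes are hypothetical illustration choices.
import random

def simulate_datum(a, n=100_000, rel_error=0.10, seed=42):
    """Sample n outcomes of D = a + R(a), with sd(R(a)) = rel_error * a."""
    rng = random.Random(seed)
    return [a + rng.gauss(0.0, rel_error * a) for _ in range(n)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# The spread of the datum around f(a) = a grows with the unknown value:
print(variance(simulate_datum(a=1.0)))   # close to (0.10 * 1)^2  = 0.01
print(variance(simulate_datum(a=10.0)))  # close to (0.10 * 10)^2 = 1.0
```

Under such a model the joint error likelihood in (2.12) cannot be made independent of A: the variance of Ri carries information about the unknown value itself.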
2.1.3 Bayesian networks
Bayesian networks are often used to make a set of variables and their dependencies
visually explicit. One example of such a network is the bi-directional Bayesian network
[63], which is used to represent a joint multivariate probability distribution. For example, consider the tri-variate distribution of variables A, D1, and D2 shown in Figure
2.2. The graph of Figure 2.2 depicts all possible joint combinations of the variables, with the
dependencies between these variables represented by the bi-directional arrows. Traditionally these dependencies are modeled by covariance-related measures of similarity.
The Bayesian graph of Figure 2.2 considers not only the dependence between the
two data (nodes D1 and D2) but also that involving the two data taken jointly (node D1D2).
As seen in this figure, Bayesian nets are necessarily data values-dependent, requiring that the dependencies be remodeled for each new data value combination {a', d'1, d'2}
different from {a, d1, d2}.
To obtain the fully conditioned probability P(A = a | D1 = d1, D2 = d2) one would
need to consider all the dependencies (bi-directional arrows) between the unknown
A and the data events D1, D2 and, most critically, the joint data event D1D2. For
example, the joint probability P(A, D1, D2) is derived as:

P(A, D1, D2) = P(A) P(D1 | A) P(D2 | A, D1)    (2.13)

If one assumes conditional independence of the data D1 and D2 given the third
variable A, equation (2.13) simplifies into:

P(A, D1, D2) = P(A) P(D1 | A) P(D2 | A)    (2.14)
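The contrast between the exact chain rule (2.13) and its conditionally independent simplification (2.14) can be checked on a small example. The binary joint table below is an arbitrary hypothetical illustration; any non-negative table summing to one would do.

```python
# A sketch: the chain rule (2.13) is an identity, while its conditionally
# independent simplification (2.14) is only an approximation.
# The binary joint distribution below is a hypothetical illustration,
# chosen so that D1 and D2 interact given A = 1.
from itertools import product

joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def marg(a=None, d1=None, d2=None):
    """Marginal probability of any subset of the three variables."""
    return sum(p for (x, y, z), p in joint.items()
               if (a is None or x == a)
               and (d1 is None or y == d1)
               and (d2 is None or z == d2))

worst_ci_error = 0.0
for a, d1, d2 in product((0, 1), repeat=3):
    # Chain rule (2.13): always reproduces the joint table exactly.
    chain = marg(a=a) \
        * marg(a=a, d1=d1) / marg(a=a) \
        * marg(a=a, d1=d1, d2=d2) / marg(a=a, d1=d1)
    assert abs(chain - joint[(a, d1, d2)]) < 1e-12
    # CI simplification (2.14): generally only an approximation.
    ci = marg(a=a) * marg(a=a, d1=d1) / marg(a=a) * marg(a=a, d2=d2) / marg(a=a)
    worst_ci_error = max(worst_ci_error, abs(ci - joint[(a, d1, d2)]))

print(worst_ci_error)  # non-zero: (2.14) does not reproduce the table
```

Whether such a discrepancy is negligible is exactly the question that should be settled from the physics of the data rather than by blanket assumption.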
Figure 2.2: Graphical representation of joint dependencies between variables A, D1, and D2.
This simplification is shown in Figure 2.3, with the resulting net requiring less modeling effort than that of Figure 2.2: most arrows starting from the joint data event
D1D2 are no longer shown (not needed).
In climate studies, such a bi-directional relationship is called a "feedback". Positive
feedbacks work to enhance the effect of the original forcing. Negative feedbacks decrease
or remove the effect of the original forcing. For example, the ice-albedo feedback [7]
is the mechanism in which warming of temperatures (D1) leads to a reduction of ice
and snow coverage (D2), decreasing the albedo (i.e. the reflection coefficient of the Earth
surface) and resulting in further snow and ice retreat, more absorption of heat, and
warming of air. Thus, the temperature (D1) impacts the ice/snow cover (D2). In
return, the ice/snow cover (D2) influences the temperature (D1). Based on this polar
amplification concept, high latitudes are the areas where global warming is expected
to be most pronounced.
Figure 2.3: Graphical representation under conditional independence between variables D1 and D2 given A.
An example of a simplified bi-directional graph is shown in Figure 2.4 (1). In this
graph, the variables B and C interact, affecting each other. However, at times the
relationship between the variables can take a simpler form where only one variable B
influences the other variable C; in a Bayesian network such a form of dependence is represented by a uni-directional arrow, as in Figure 2.4 (2). For example, a change
of incoming radiation (B) may result in a change of ocean circulation (C) via a change
of its thermal structure. However, ocean circulation has no impact on incoming radiation. Hence the relationship between radiative forcing and ocean circulation may
be considered as uni-directional.
As another example, consider the joint probability P(A = a, B = b, C = c) using the
three different representations of uni-directional Bayesian nets shown in Figure
2.5. In this Figure, the leftmost graph (1) represents the situation in which data event
Figure 2.4: Graphical representations of bi-directional (1) and uni-directional (2) Bayesian nets.
B is independent of both A and C and data event C is dependent on A. The middle
graph (2) represents the uni-directional dependence of data event C on data events
B and A. Finally, the graph (3) of Figure 2.5 is more complex since the data event
B also influences A.
The joint probability P(A = a,B = b,C = c) can be written for each of the three
uni-directional graphs:
(1). P(A = a, B = b, C = c) = P(C = c | A = a) P(A = a) P(B = b)

(2). P(A = a, B = b, C = c) = P(C = c | A = a, B = b) P(A = a) P(B = b)

(3). P(A = a, B = b, C = c) = P(C = c | A = a, B = b) P(A = a | B = b) P(B = b)
Limitations of Bayesian nets
• As mentioned before, all Bayesian nets are data values-dependent. This requires
that the dependencies (i.e. the arrows) be remodeled for each new set of data
values. While such a dependence (heteroscedasticity) is often found in geological
settings, it is rarely made explicit in the Bayesian nets actually used.
• As can be seen from Figure 2.2, the complexity of these nets grows exponentially as the number of variables (and hence the number of joint data combinations) increases. In practice, conditional independence is then assumed to
Figure 2.5: Graphical interpretation of the joint probability P(A, B, C) based on different sets of relationships between the three variables A, B, and C.
simplify the computational cost associated with Bayesian nets [8], [10], and [76].
This reduces considerably the effort to simulate the required dependencies. One
such simplification is shown in Figure 2.6, where the two data events D1 and D2
are assumed to be conditionally independent relative to the third variable A.
At the same time, the variable A is assumed independent of the variable B. In
this Figure, the node A is referred to as the parent node, while the nodes D1 and
D2 represent its children. Assuming conditional independence then amounts to
ignoring an important link (arrow) between the two conditioning data children
D1 and D2. However, more often than not the data interact with each other. Such
a link can be critical in modeling the joint probabilities.
A possible way to avoid the reliance of Bayesian nets on the assumption of
conditional independence is to use a global representation (proxy image) of the
joint distribution of all variables involved. Such a global representation provides a
possible image of the joint interaction between data and unknown. In geostatistics such a representation of the joint distribution is called a "training image".
This concept was introduced back in 1992 when Guardiano and Srivastava [35]
Figure 2.6: Graphical representation of conditional independence between variables D1 and D2 given A, and data independence between variables A and B.
and Journel [47] proposed to use a training image to represent the "type of
heterogeneities that the geologists expect to be present in the actual subsurface
reservoir" [70]. Such an image can be borrowed directly from a physical outcrop
or obtained by computer simulation of the physics that govern the data
interaction and their relation with the unknown [70], [74]. For example, a training image could be obtained as an unconditional realization generated by an
object-based algorithm [36]. Geological expertise combined with massive
modern computer power allows the generation of such an image. By scanning this
image, one can retrieve directly all the required conditional probabilities as observed proportions, without any call for conditional independence.
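A minimal sketch of such training-image scanning is given below. The two tiny binary grids are synthetic stand-ins for a facies image and its forward-simulated seismic signature, and the "data event" is deliberately reduced to a single neighboring pixel (B) plus the collocated seismic pixel (C); actual multiple-point geostatistics scans much larger templates.

```python
# A minimal sketch of reading conditional probabilities directly as
# observed proportions from a pair of co-registered training images,
# with no call for conditional independence. The tiny binary grids are
# synthetic stand-ins: `facies` plays the role of image B (1 = channel
# sand), `seismic` its forward-simulated signature C.

facies = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
seismic = [  # follows the facies, with a few mismatching pixels
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
]

def scanned_probability(b, c):
    """P(A = 1 | B = b, C = c) read as a training proportion, where
    A = facies at a pixel, B = facies one pixel to the west (a nearby
    'well' datum), C = the collocated seismic indicator."""
    hits = sand = 0
    for i in range(len(facies)):
        for j in range(1, len(facies[i])):
            if facies[i][j - 1] == b and seismic[i][j] == c:
                hits += 1
                sand += facies[i][j]
    return sand / hits if hits else float("nan")

print(scanned_probability(1, 1))  # sand datum and positive seismic
print(scanned_probability(1, 0))  # the two data sources disagree
print(scanned_probability(0, 0))  # both argue against sand
```

The scanned proportions encode the joint B, C interaction given A as it exists in the image, however non-linear or heteroscedastic that interaction may be.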
As an example, consider the assessment of an unsampled event A from data
events B and C where:
- A is the presence/absence of a subsurface channel sand at an unsampled
location
- B indicates the presence/absence of sand data at nearby well locations
- C is the result of a seismic survey whose analysis leads to an indirect indication
of channel occurrence [70]

Figure 2.7: Example of dual training images depicting the interaction between two data types B and C. Left: facies map B for sand/no sand data. Right: seismic signature C for seismic data.
A binary sand/no sand training image such as that of Figure 2.7(1) would
give a concept of the spatial distribution of sand (here EW channels). Computer-based simulation of the seismic survey would provide the seismic signature of
the training image (Figure 2.7(2)). The joint availability of the two related
training images shown in Figure 2.7 allows retrieving all corresponding training
probabilities of the type P(A|B), P(A|C), P(A|B, C) and thus evaluating the
data B, C interaction given A.
Link to Markov Chain
A commonly used model to represent a discrete-time (or 1D) stochastic process
is that of a Markov chain [57]. The Bayesian net shown in Figure 2.6 can be
seen as a special case of such a chain. In a Markov process, any previous state
is assumed irrelevant for predicting the probability of subsequent states given
knowledge of the current state:
P(X_{n+1} = x | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x | X_n = x_n)    (2.15)
This property is known in the statistical literature as the memoryless property, and has
been widely accepted at face value. Unfortunately, while such a property may
be appropriate for 1D sequences of, for example, electrical events, it is already
questionable for 2D continuous electrical plates. It is most likely untrue for
geological events with no single directional origin.
In Bayesian nets such as that shown in Figure 2.6, it is assumed that the value
of a particular node is conditioned only on its parent node, leading to:

P(X_1, ..., X_n) = P(X_1) P(X_2 | X_1) ... P(X_n | X_1, ..., X_{n-1})
= ∏_{i=1}^{n} P(X_i | X_1, ..., X_{i-1}) ≈ ∏_{i=1}^{n} P(X_i | Parent(X_i))    (2.16)

which replicates the Markov chain condition (2.15).
Using the notations of Figure 2.6, the joint probability P(D1, D2, B, A) can then
be written as:

P(D1, D2, B, A) = P(D2 | A) P(D1 | A) P(A) P(B)    (2.17)
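The parent factorization (2.17) can be sketched numerically as below; all conditional probability tables and their values are hypothetical illustration choices.

```python
# A sketch of the parent factorization (2.17) for the net of Figure 2.6:
# P(D1, D2, B, A) = P(D2|A) P(D1|A) P(A) P(B).
# All probability tables below are hypothetical illustration values.

p_a = {1: 0.4, 0: 0.6}                  # P(A)
p_b = {1: 0.5, 0: 0.5}                  # P(B), independent of A
p_d1 = {1: {1: 0.8, 0: 0.2},            # P(D1 | A), keyed as [a][d1]
        0: {1: 0.3, 0: 0.7}}
p_d2 = {1: {1: 0.7, 0: 0.3},            # P(D2 | A), keyed as [a][d2]
        0: {1: 0.1, 0: 0.9}}

def joint(d1, d2, b, a):
    """Equation (2.17): each node conditioned on its parent only."""
    return p_d2[a][d2] * p_d1[a][d1] * p_a[a] * p_b[b]

# The factorized probabilities form a licit distribution: they sum to 1.
total = sum(joint(d1, d2, b, a)
            for d1 in (0, 1) for d2 in (0, 1)
            for b in (0, 1) for a in (0, 1))
print(total)  # 1.0 up to float rounding
```

Note that the factorization hard-codes both the conditional independence of D1 and D2 given A and the independence of B: no table for the joint event D1D2 is ever specified.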
2.2 Probability combination algorithms
The estimators of the fully conditional probability
P(A = a | D1 = d1, ..., Dn = dn) based on the direct assumption of conditional independence (2.5), and in addition data independence (2.8), are widely used in practice
because they help reduce the inference burden and computational cost [18]. Yet
such assumptions can hardly ever be justified from the physics of the data. Unfortunately, such limiting assumptions find their way into most areas of mathematical
statistics, including the major algorithms used for the problem of data integration.
In the next Section, we introduce three algorithms for combining individually
conditioned probabilities and demonstrate their link to the assumption of conditional
independence.
2.2.1 Linear pooling of probabilities
Because of the extensive use of linear algebra in applied mathematics, one easy solution to the combination of individual probabilities is their linear weighting:

P(A = a | Di = di, i = 1, ..., n) = Σ_{i=1}^{n} λ_i P(A = a | Di = di)    (2.18)
To ensure a licit posterior probability, the weights λ_i are typically constrained to sum
to one and be non-negative. This model was proposed by Stone [69] in 1961 and
has received wide usage due to its simplicity. However, the model has one major
limitation: the constraints on the weights do not allow the posterior probability
P(A = a | Di = di, i = 1, ..., n) to take values outside the limits defined by the pre-posterior
probabilities, i.e. that posterior probability is strictly valued within
[min_i P(A = a | Di = di), max_i P(A = a | Di = di)].
The convenience of this method is that one can choose and interpret the weights λ_i.
Also, these weights can be made data values- and/or outcome value-dependent.
One common heuristic is to give high weights to the individual data which have the
highest degree of expertise about the unknown A. While this seems like a reasonable
way to interpret λ_i, we argue that such a heuristic may be misplaced. The individual
information content is already expressed by the individual conditional probabilities
P(A = a | Di = di). The weights λ_i should instead carry the concept of data interaction.
Three algorithms have been proposed for determining the linear weights λ_i:
1. Weights seen as veridical (i.e. truthful) probabilities [43] are built around the
basic assumption that only one particular datum Di is representative of the true
distribution. The weight λ_i in equation (2.18) is considered as the probability
that datum i is the correct representation of this true distribution. The
datum Di with weight λ_i closest to one is considered the most likely veridical. Unfortunately, the literature stops short of providing a unique, rigorous
determination of the veridical weights λ_i. It is, however, intuitively appealing
to assign higher weights to good ("correct") as opposed to poor data. Unfortunately, it is not easy to quantify mathematically a subjective concept such as
good or bad. When combining diverse data sources, the level of data expertise
and the amount of information the data share with other data should somehow
be accounted for [30]. Defining the concept of information sharing is already not
trivial; mathematically quantifying it poses an even greater challenge. Before
trying to determine the weights λ_i, it is essential to clarify which aspect of data
integration these weights are to quantify and what information the elementary
conditional probabilities P(A|Di) of equation (2.18) already account for.
2. Experimental weights from minimizing error.
Bates and Granger [4] proposed to determine the pooling weights by minimizing
an error criterion.
Assume that n data were used, each individually, to determine n probability
density functions (pdf's) f_1, ..., f_n, each being an estimate of the actual pdf θ
of the unknown. These n individual pdf's are then combined linearly:

θ* = Σ_{i=1}^{n} λ_i f_i    (2.19)

In this setting, it is assumed that each probabilistic opinion f_i, ∀i, is unbiased,
i.e. E(f_i − θ) = 0.
One solution has been proposed by Dickinson [20], [21] under the assumption
that the data errors R_i = θ − f_i are normally distributed with zero means and
covariances σ_ij. The weights are then found by minimizing the objective function:

O = Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j σ_ij + α (Σ_{i=1}^{n} λ_i − 1)    (2.20)

where the first term Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j σ_ij is the variance or expected squared error
of θ*, and the second term α (Σ_{i=1}^{n} λ_i − 1) ensures that the weights sum up to 1.

A simple optimization, similar to that used for ordinary kriging, leads to the
following solution for the weights λ [21], [37]:

λ = Σ^{-1} e (e^t Σ^{-1} e)^{-1}    (2.21)

where e^t = (1, ..., 1), t denotes the transpose, and Σ = E[(R − E[R])(R − E[R])^t]
is the covariance matrix of the data errors.
Freeling [24] established that under this minimum variance criterion the weight
λ_i is larger for more accurate and less redundant (correlated) data. In other
words, data that are accurate and independent are rewarded with higher weights
λ_i.

There are several problems with this approach: first and foremost is the inference of the pdf estimation error covariance Σ. Next, at best that covariance
accounts for error linear dependence only two data at a time instead of
all data taken together. Last, the optimization proposed does not guarantee
non-negative weights λ_i ≥ 0.
3. Dual indicator weights.

In geostatistics, a link of the linear pooling expression (2.18) to dual indicator
kriging [22], [46] suggests obtaining the weights by modeling all the required two-point covariances between data and between data and unknown, i.e. Cov(A, Di)
and Cov(Di, Dj), and then solving an indicator kriging system. In the case
where the unknown A and the n data Di, i = 1, ..., n, are binary events,
the elementary conditional probabilities P(A|Di) can be written in terms of
expectations as:

P(A = 1 | Di = 1) = P(A = 1, Di = 1) / P(Di = 1) = E(A Di) / E(Di) = [Cov(A, Di) + E(A) E(Di)] / E(Di)

Substituting the above expression into equation (2.18), we obtain for all i = 1, ..., n:

P(A = 1 | Di = di, i = 1, ..., n) = E(A) + Σ_{i=1}^{n} a_i Cov(A, Di)    (2.22)

with Σ_{i=1}^{n} λ_i = 1. Equation (2.22) identifies the dual indicator kriging system
with dual weights a_i = λ_i / E(Di).
Both the minimum objective function (2.20) and the dual indicator kriging
approach (2.22) consider only the dependency between the unknown A and
one datum at a time, rather than the dependency between all data
taken together and the unknown. For example, the conditional probability
P(A = 1 | D1, ..., Dn) should be expressed as a function of the n data D1, ..., Dn
taken jointly as [47]:

P(A = 1 | D1 = d1, ..., Dn = dn) = E(A | D1 = d1, ..., Dn = dn)
= b^(0) + b^(1)_1 Cov(A, D1) + ... + b^(1)_n Cov(A, Dn)
+ b^(2)_1 Cov(A, D1, D2) + ... + b^(2)_(n choose 2) Cov(A, D_{n-1}, Dn)
+ b^(3)_1 Cov(A, D1, D2, D3) + ... + b^(3)_(n choose 3) Cov(A, D_{n-2}, D_{n-1}, Dn)
+ ...
+ b^(n) Cov(A, D1, ..., Dn)    (2.23)
where (n choose i) is the number of combinations of n values taken i at a time.

The expression (2.23) then identifies the full and exact expansion of the fully
conditioned probability P(A = 1 | D1 = d1, ..., Dn = dn), considering the information of one indicator datum at a time, then two at a time, and ending with
all indicators taken together. The dual indicator kriging expression (2.22) can
then be seen as a truncation of the full expression (2.23). The full expression
(2.23) considers all possible 2^n groupings within the joint indicator distribution of the
(n + 1) indicator variables A, D1, ..., Dn. Indicator kriging considers only the
data taken one or two at a time and ignores any joint dependency between
three or more data taken together. Such interaction may be important, as it
accounts for multiple-data (pattern) dependency.
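The minimum-variance weights (2.21) quoted above can be worked out explicitly for two pdf's, where the 2 × 2 matrix inverse is available in closed form; the error covariance values below are hypothetical.

```python
# A sketch of the minimum-variance pooling weights (2.21),
# lambda = Sigma^{-1} e / (e^t Sigma^{-1} e), worked out for two pdf's
# so that the 2 x 2 inverse stays explicit. The error covariance values
# below are hypothetical illustration numbers.

def min_variance_weights(s11, s22, s12):
    """Weights for two pooled pdf's with error covariance
    Sigma = [[s11, s12], [s12, s22]]."""
    det = s11 * s22 - s12 * s12
    # Components of Sigma^{-1} e with e = (1, 1)^t
    u1 = (s22 - s12) / det
    u2 = (s11 - s12) / det
    total = u1 + u2               # e^t Sigma^{-1} e
    return u1 / total, u2 / total

# Datum 1 has the smaller error variance, hence the larger weight:
l1, l2 = min_variance_weights(s11=0.5, s22=2.0, s12=0.2)
print(l1, l2)  # weights sum to one, with l1 > l2
```

With these numbers l1 = 6/7: the more accurate, less redundant source dominates the pool, in line with Freeling's observation. Note, however, that the weighting still sees the data only through their pairwise error covariances.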
2.2.2 Supra Bayesian Methods
The theoretical basis for Bayesian analysis of the problem of data integration dates
back to 1974 when Morris [60] laid a foundation for its wide-spread usage. This
method has been further developed by Agnew [2], French [25], Lindley [52], [55],
Winkler [75] among others. We pursue Jacobs' [43] presentation of the supra
Bayesian method, which follows closely the derivation of Lindley [52]. The goal is
once again to obtain the fully conditioned posterior probability
P(A = a | D1 = d1, ..., Dn = dn). The equation central to Lindley's development relies
on Bayes' relation:

P(A = a | D1 = d1, ..., Dn = dn) = P(A = a, D1 = d1, ..., Dn = dn) / P(D1 = d1, ..., Dn = dn)
∝ P(A = a) P(D1 = d1, ..., Dn = dn | A = a)    (2.24)
Similarly, we can define the posterior probability of the complement event nonA.
Dividing equation (2.24) for A by the similar equation for nonA eliminates
the joint data probability from equation (2.24), which results in the following odds
ratio form:

P(A | D1 = d1, ..., Dn = dn) / P(nonA | D1 = d1, ..., Dn = dn) = [P(A) / P(nonA)] · [P(D1 = d1, ..., Dn = dn | A) / P(D1 = d1, ..., Dn = dn | nonA)]    (2.25)

Taking the logs on both sides then yields:

log [P(A | D1 = d1, ..., Dn = dn) / P(nonA | D1 = d1, ..., Dn = dn)] = log [P(A) / P(nonA)] + log [P(D1 = d1, ..., Dn = dn | A) / P(D1 = d1, ..., Dn = dn | nonA)]    (2.26)
Equation (2.26) is often referred to as the posterior log-odds ratio [43]. The first
term, log [P(A) / P(nonA)], is the prior log-odds.
The second term, log [P(D1 = d1, ..., Dn = dn | A) / P(D1 = d1, ..., Dn = dn | nonA)], is the log-odds ratio of the
probabilities of the data D1, ..., Dn when A occurs to when nonA occurs. These two
probabilities are the likelihoods of the data given the unknown events A and nonA,
respectively. It has been noted that such a logarithmic ratio of likelihoods represents
the degree of joint dependence among the n data D1, ..., Dn [43].
Unfortunately, despite its theoretical appeal, the supra Bayesian approach fails to
provide a clear method for the inference of these likelihood functions. Various assumptions have been proposed to simplify the derivation of the two likelihoods
P(D1 = d1, ..., Dn = dn | A) and P(D1 = d1, ..., Dn = dn | nonA). Not surprisingly, one
popular approach is to assume some form of independence, with the conditional independence assumption dominating the literature ([12], [14], [17], [27], and [65] among
others).
If the assumption of conditional independence (both given A and nonA) is imposed in equation (2.26), the ratio of the data log-likelihoods reduces to:

\log \frac{P(D_1 = d_1, \ldots, D_n = d_n \mid A)}{P(D_1 = d_1, \ldots, D_n = d_n \mid nonA)} = \sum_{i=1}^{n} \log \frac{P(D_i \mid A)}{P(D_i \mid nonA)}    (2.27)
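The conditional independence reduction (2.27) amounts to a naive-Bayes combination of elementary likelihoods. A minimal numerical sketch, with all prior and likelihood values hypothetical:

```python
import math

# Posterior log-odds (2.26) under conditional independence (2.27).
# The prior and the elementary likelihoods below are hypothetical values.
prior_A = 0.3
lik_A    = [0.8, 0.6, 0.7]   # P(D_i = d_i | A),    i = 1, 2, 3
lik_nonA = [0.2, 0.5, 0.4]   # P(D_i = d_i | nonA), i = 1, 2, 3

log_odds = math.log(prior_A / (1 - prior_A)) \
         + sum(math.log(a / b) for a, b in zip(lik_A, lik_nonA))

posterior_A = 1 / (1 + math.exp(-log_odds))   # invert the log-odds
```

The same posterior follows from applying Bayes' relation directly to the product of the elementary likelihoods; the log-odds form merely linearizes that product.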
Another common approach for the estimation of the data likelihoods P(D_1 = d_1, \ldots, D_n = d_n \mid A) and P(D_1 = d_1, \ldots, D_n = d_n \mid nonA) consists of assuming a congenial family of distributions. A common assumption is that these likelihoods are normal with the same variance/covariance matrix [43]:

P(D_1 = d_1, \ldots, D_n = d_n \mid A) \;\; \text{is} \;\; N(\mu_A, \Sigma)
P(D_1 = d_1, \ldots, D_n = d_n \mid nonA) \;\; \text{is} \;\; N(\mu_{nonA}, \Sigma)    (2.28)

where the vectors \mu_A and \mu_{nonA} are the n-dimensional means of D = \{D_1 = d_1, \ldots, D_n = d_n\} given A and nonA respectively, and \Sigma is the n \times n covariance matrix [32] defined as:

\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\
\rho_{21}\sigma_2\sigma_1 & \sigma_2^2 & \cdots & \rho_{2n}\sigma_2\sigma_n \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{n1}\sigma_n\sigma_1 & \rho_{n2}\sigma_n\sigma_2 & \cdots & \sigma_n^2
\end{pmatrix}
Using such an assumption of normality, the posterior log-odds equation (2.26) then reduces to [53], [54]:

\log \frac{P(A \mid D_1 = d_1, \ldots, D_n = d_n)}{P(nonA \mid D_1 = d_1, \ldots, D_n = d_n)} = \log \frac{P(A)}{P(nonA)} + \left(d - \frac{\mu_A + \mu_{nonA}}{2}\right)^{T} \Sigma^{-1} \left(\mu_A - \mu_{nonA}\right)    (2.29)

where d = (d_1, \ldots, d_n)^T is the vector of observed data values.
However, this expression depends on the assumption of joint normality of the data, which is often untenable in practice. Even with such a stringent assumption, this method requires a massive modeling effort to obtain the covariance matrix \Sigma. Lastly, the covariance matrix \Sigma does not take into account heteroscedastic, non-linear, multiple-variable joint interactions between the data and the unknown.
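As a concrete illustration of the modeling effort involved, the posterior log-odds under the equal-covariance normality assumption (2.28) can be evaluated directly. The sketch below uses hypothetical means, covariance, and data for n = 2, and checks the quadratic term against the raw Gaussian log-likelihood ratio it summarizes:

```python
import math

# Posterior log-odds under the equal-covariance Gaussian assumption (2.28),
# for n = 2 data. All numerical values below are hypothetical.
mu_A, mu_nonA = (1.0, 0.5), (0.0, 0.0)        # class-conditional means
sigma = [[1.0, 0.3], [0.3, 1.0]]              # shared covariance matrix Sigma
d = (0.8, 0.2)                                # observed data values (d1, d2)
p_A = 0.4                                     # prior P(A)

# Explicit 2x2 inverse of Sigma
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
inv = [[ sigma[1][1] / det, -sigma[0][1] / det],
       [-sigma[1][0] / det,  sigma[0][0] / det]]

diff = [a - b for a, b in zip(mu_A, mu_nonA)]             # mu_A - mu_nonA
mid  = [di - (a + b) / 2 for di, a, b in zip(d, mu_A, mu_nonA)]
quad = sum(mid[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))

log_odds = math.log(p_A / (1 - p_A)) + quad
```

Even in this two-dimensional sketch all entries of \Sigma must be modeled; in higher dimensions that inference burden grows quadratically, which is the practical objection raised above.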
2.2.3 A brief overview of the tau representation
Bordley's derivation
In 1982, Bordley [6] proposed an approach to the problem of data integration that extends that of Lindley quoted in the previous Section 2.2.2. He starts by transforming the elementary probabilities P(A \mid D_i), i = 1, \ldots, n, into odds functions:

O_0 = \frac{P(A = a)}{1 - P(A = a)}, \quad O_i = \frac{P(A = a \mid D_i = d_i)}{1 - P(A = a \mid D_i = d_i)}    (2.30)
Bordley then states that, under certain "regularity" conditions, the sought-after fully conditioned odds function O can be written as:

O = F\!\left[\sum_{i=1}^{n} u_i(O_i)\right]    (2.31)

with O = \frac{P(A = a \mid D_1 = d_1, \ldots, D_n = d_n)}{1 - P(A = a \mid D_1 = d_1, \ldots, D_n = d_n)},

for an arbitrary continuous and monotonic odds function F and undetermined continuous functions u_1, \ldots, u_n, with the assumption that the data satisfy the non-interaction
property. Bordley already stresses that data non-interaction is different from data independence. He speculates that non-interaction of the data is satisfied when some data "give the same assessments of probability in two different scenarios, then we can ignore them in deciding which scenario makes the decision maker feel more sure about the event A occurring." In other words, non-interaction occurs when the decision maker's interpretation of one datum D_i does not depend upon the other datum D_j. In contrast, for data D_i, D_j to be independent, the information on which the elementary probabilities P(A \mid D_i) and P(A \mid D_j) are built must be independent [6].
Bordley's non-interaction property, combined with an axiom called a "weak likelihood ratio axiom", leads to the following sought-after odds ratio:

\frac{O}{O_0} = f(O_1/O_0) \cdot \ldots \cdot f(O_n/O_0) = \prod_{i=1}^{n} (O_i/O_0)^{\tau_i}    (2.32)

where f is a real-valued function and the combination operator is ordinary multiplication. Bordley regards the exponent weights \tau_i as the degree of reliability of each datum D_i. However, he stops short of providing a mathematical expression for these reliability weights; their determination is left to the subjective opinion of the decision maker.
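Bordley's multiplicative model (2.32) is easy to evaluate once the elementary probabilities and the reliability weights are chosen. In the sketch below all probabilities and \tau weights are hypothetical, the weights being set subjectively exactly as Bordley leaves them:

```python
import math

# Bordley's multiplicative model (2.32): combined odds as a tau-weighted
# product of elementary odds ratios. All numbers below are hypothetical.
p0  = 0.5                      # prior P(A = a), so O_0 = 1
p   = [0.7, 0.4, 0.8]          # elementary P(A = a | D_i = d_i)
tau = [1.0, 0.8, 1.2]          # subjective reliability weights tau_i

O0 = p0 / (1 - p0)
O  = O0 * math.prod(((pi / (1 - pi)) / O0) ** t for pi, t in zip(p, tau))

posterior = O / (1 + O)        # back-transform the combined odds
```

With all \tau_i = 1 the model reduces to a plain product of elementary odds ratios, i.e. to the non-interaction case discussed above.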
Journel and Krishnan's derivation
The multiplicative model (2.32) has been independently re-established by Journel [48], building on a permanence-of-ratios paradigm which observes that ratios are typically more stable than their components. Journel starts by defining distances, which are the inverses of the odds defined in (2.30), with the sought-after distance x defined as:

x = \frac{1 - P(A \mid D_1, \ldots, D_n)}{P(A \mid D_1, \ldots, D_n)} \in [0, \infty]    (2.33)

Similarly, the elementary distances x_i, i = 1, \ldots, n, and the prior distance x_0 are defined as:

x_i = \frac{1 - P(A \mid D_i)}{P(A \mid D_i)}, \quad x_0 = \frac{1 - P(A)}{P(A)} \in [0, \infty]    (2.34)
The definition of distances to the unknown A is a sensible way to re-formulate the conditional probabilities. For example, the distance x_1 is equal to 0 if P(A \mid D_1) = 1: the event A is certain to occur after observing the data event D_1. This distance is equal to infinity if P(A \mid D_1) = 0: event A is an impossible event after observing the data event D_1.
Consider only two data D_1 and D_2 that inform the unknown A. Journel then assumes that "the incremental relative contribution of datum D_2 to (increasing or decreasing) the distance to A is the same after or before knowing D_1":

\frac{x}{x_1} = \frac{x_2}{x_0}, \quad \text{that is:} \quad \frac{x}{x_0} = \frac{x_1 x_2}{x_0 x_0} = \frac{x_1}{x_0} \cdot \frac{x_2}{x_0}    (2.35)

The derivation is then extended sequentially to the n data D_1, \ldots, D_n, leading to:

\frac{x}{x_0} = \prod_{i=1}^{n} \frac{x_i}{x_0}    (2.36)
Equation (2.36) identifies with Bordley's derivation (2.32) when all \tau_i weights are equal to 1. Confirming Bordley's conjecture, this derivation calls for the incremental information contributed by datum D_2 to the knowledge of A not to be affected by the knowledge of the previously used datum D_1. Journel then admits that data usually experience complex interactions, and these interactions need to be accounted for. This led to the introduction of \tau_i weights into equation (2.36) to account for such data interactions [48]:

\frac{x}{x_0} = \prod_{i=1}^{n} \left(\frac{x_i}{x_0}\right)^{\tau_i}    (2.37)

With the tau weights, the model proposed by Journel completely identifies with Bordley's derivation, with a single but most important difference in the way the two authors interpret the \tau_i weights. For Bordley, the weights relate to the reliability of the data. Such reliability, however, is already incorporated into the distances (2.34). Journel's derivation (referred to as the tau model from now on) formulates the weights \tau_i as data interaction parameters. The author, however, stops short of providing a mathematical definition of such data interaction.
This tau model remained a heuristic approximation of the exact fully conditioned probability given all data events taken jointly, until the seminal contribution of Krishnan [50]. Krishnan realized that interaction between multiple data is much more complex than the traditional simple correlation between any two data. Indeed, the correlation measures only linear dependence between only two data taken at a time. Joint data interaction should take into account not only the joint multiple-point dependency between data but also between these data and the unknown. It is also critical to make this complex measure of dependency data-values and unknown-value dependent. With these challenges, Krishnan proposed the following expression for the \tau_i measures of data interaction in Journel's derivation. Krishnan considered first a specific ordering of the n data D_1, \ldots, D_n. There are n! such sequences. For each such sequence we can derive the n interaction weights \tau_i. Changing the sequence will lead to a different set of n interaction weights \tau_i. However, all sequences share one common feature: \tau_1 (that is, the interaction weight assigned to the first datum in the sequence) is equal to one. The value \tau_1 = 1 simply tells that the first datum does not yet interact with anything. The remaining (n − 1) interaction weights \tau_i are then derived in the following manner. Using the definition of conditional probability (2.3), one can write:

P(A \mid D_1, \ldots, D_n) = \frac{P(A, D_1, \ldots, D_n)}{P(D_1, \ldots, D_n)} = \frac{P(D_1)\, P(A \mid D_1)\, P(D_2 \mid A, D_1) \cdots P(D_n \mid A, D_1, \ldots, D_{n-1})}{P(D_1, \ldots, D_n)}    (2.38)
Note that all the probabilities in equation (2.38) should be written as data-values and unknown-value dependent, as P(A = a \mid D_1 = d_1, \ldots, D_n = d_n). However, we will use the short notation of (2.38).
Similarly, the fully conditioned probability of the event nonA is:

P(nonA \mid D_1, \ldots, D_n) = \frac{P(nonA, D_1, \ldots, D_n)}{P(D_1, \ldots, D_n)} = \frac{P(D_1)\, P(nonA \mid D_1)\, P(D_2 \mid nonA, D_1) \cdots P(D_n \mid nonA, D_1, \ldots, D_{n-1})}{P(D_1, \ldots, D_n)}    (2.39)
Dividing equation (2.39) by (2.38) leads to:

\frac{P(nonA \mid D_1, \ldots, D_n)}{P(A \mid D_1, \ldots, D_n)} = \frac{P(nonA \mid D_1)}{P(A \mid D_1)} \cdot \frac{P(D_2 \mid nonA, D_1)}{P(D_2 \mid A, D_1)} \cdots \frac{P(D_n \mid nonA, D_1, \ldots, D_{n-1})}{P(D_n \mid A, D_1, \ldots, D_{n-1})}

Using Journel's definition of distances (2.33) and (2.34), the sought-after distance x can be re-written as:

x = x_1 \cdot \frac{P(D_2 \mid nonA, D_1)}{P(D_2 \mid A, D_1)} \cdots \frac{P(D_n \mid nonA, D_1, \ldots, D_{n-1})}{P(D_n \mid A, D_1, \ldots, D_{n-1})}    (2.40)
Let:

\frac{P(D_2 \mid nonA, D_1)}{P(D_2 \mid A, D_1)} = \left(\frac{P(D_2 \mid nonA)}{P(D_2 \mid A)}\right)^{\tau_2}    (2.41)

where

\tau_2(d_1, d_2, a) = \log \frac{P(D_2 \mid nonA, D_1)}{P(D_2 \mid A, D_1)} \bigg/ \log \frac{P(D_2 \mid nonA)}{P(D_2 \mid A)} \in [-\infty, \infty]    (2.42)

The expression (2.42) is a ratio of data log-likelihoods. It is important to note that this \tau_2 interaction weight is data-value and unknown-value dependent.

The distance x of expression (2.40) can then be re-written as:

x = (x_1)^{\tau_1} \left(\frac{x_2}{x_0}\right)^{\tau_2} \cdot \frac{P(D_3 \mid nonA, D_1, D_2)}{P(D_3 \mid A, D_1, D_2)} \cdots \frac{P(D_n \mid nonA, D_1, \ldots, D_{n-1})}{P(D_n \mid A, D_1, \ldots, D_{n-1})}    (2.43)

where \tau_1 = 1.
Generalizing equation (2.41) to all (n − 1) weights leads to:

\frac{P(D_i \mid nonA, D_1, \ldots, D_{i-1})}{P(D_i \mid A, D_1, \ldots, D_{i-1})} = \left(\frac{P(D_i \mid nonA)}{P(D_i \mid A)}\right)^{\tau_i}    (2.44)

with the key result:

\tau_i(d_1, \ldots, d_n, a) = \log \frac{P(D_i = d_i \mid A = nona, D_{i-1} = d_{i-1})}{P(D_i = d_i \mid A = a, D_{i-1} = d_{i-1})} \bigg/ \log \frac{P(D_i = d_i \mid A = nona)}{P(D_i = d_i \mid A = a)} \in [-\infty, +\infty]    (2.45)

Substituting expression (2.44) into (2.43) for i = 3, \ldots, n, we get the Bordley-Journel expression:

\frac{x}{x_0} = \prod_{i=1}^{n} \left(\frac{x_i}{x_0}\right)^{\tau_i}, \quad \tau_1 = 1    (2.46)
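Because the weights (2.45) are exact, the Bordley-Journel expression (2.46) must reproduce the fully conditioned probability on any joint distribution. A minimal numerical check for n = 2, using a small hypothetical joint pmf over (A, D_1, D_2):

```python
import math

# Exact tau weight (2.45) computed from a small hypothetical joint pmf
# P(A, D1, D2); the weighted product (2.46) must recover the exact posterior.
joint = {
    (1, 0, 0): 0.10, (1, 0, 1): 0.15, (1, 1, 0): 0.05, (1, 1, 1): 0.20,
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
}

def p(pred):
    """Probability of the event defined by predicate pred(a, d1, d2)."""
    return sum(w for k, w in joint.items() if pred(*k))

d1, d2 = 1, 1                                   # observed data values
pA  = p(lambda a, x, y: a == 1)
x0  = (1 - pA) / pA                             # prior distance (2.34)
pA1 = p(lambda a, x, y: a == 1 and x == d1) / p(lambda a, x, y: x == d1)
pA2 = p(lambda a, x, y: a == 1 and y == d2) / p(lambda a, x, y: y == d2)
x1, x2 = (1 - pA1) / pA1, (1 - pA2) / pA2       # elementary distances

# tau_2 per (2.42)/(2.45): ratio of conditioned to unconditioned log ratios
num = (p(lambda a, x, y: a == 0 and x == d1 and y == d2)
       / p(lambda a, x, y: a == 0 and x == d1)) \
    / (p(lambda a, x, y: a == 1 and x == d1 and y == d2)
       / p(lambda a, x, y: a == 1 and x == d1))
den = (p(lambda a, x, y: a == 0 and y == d2) / p(lambda a, x, y: a == 0)) \
    / (p(lambda a, x, y: a == 1 and y == d2) / p(lambda a, x, y: a == 1))
tau2 = math.log(num) / math.log(den)

pA12 = p(lambda a, x, y: a == 1 and x == d1 and y == d2) \
     / p(lambda a, x, y: x == d1 and y == d2)
x_exact = (1 - pA12) / pA12
x_tau = x0 * (x1 / x0) * (x2 / x0) ** tau2      # (2.46) with tau_1 = 1
assert abs(x_exact - x_tau) < 1e-12
```

The check is exact by construction: (2.41) defines \tau_2 precisely so that the weighted elementary ratio equals the conditioned likelihood ratio.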
Interpretation of the tau expression
The denominator of the \tau-expression (2.45) measures how the datum D_i = d_i discriminates the outcome A = a from nona. The numerator measures the same but in the presence of the previous data D_{i-1} = d_{i-1} = \{D_1 = d_1, \ldots, D_{i-1} = d_{i-1}\}. Thus the ratio (of ratios) \tau_i indicates how the discrimination power of D_i = d_i is changed by knowledge of the previous data D_{i-1} = d_{i-1} taken all together. Critically, this weight is specific, as mentioned before, to the ordering of the n data events D_1, \ldots, D_n, and is data-value and unknown-value dependent.
Consider the following specific values for the tau weights:

• \tau_i = 1

This condition is satisfied when the two ratios in expression (2.45) are equal:

\frac{P(D_i = d_i \mid A = nona, D_{i-1} = d_{i-1})}{P(D_i = d_i \mid A = a, D_{i-1} = d_{i-1})} = \frac{P(D_i = d_i \mid A = nona)}{P(D_i = d_i \mid A = a)}    (2.47)

When \tau_i = 1, the ability of the datum (or data event) D_i = d_i to discriminate a from nona is unchanged by knowledge of the previous (i − 1) data events d_{i-1} = \{D_1 = d_1, \ldots, D_{i-1} = d_{i-1}\}. Relation (2.47) entails the following equality of log ratios:

\log \frac{P(D_i = d_i \mid A = nona, D_{i-1} = d_{i-1})}{P(D_i = d_i \mid A = nona)} = \log \frac{P(D_i = d_i \mid A = a, D_{i-1} = d_{i-1})}{P(D_i = d_i \mid A = a)}    (2.48)
Note that the tau model with unit tau weights is less constraining than the assumption of conditional independence. While data conditional independence given both the unknown event A and its complement nonA leads to unit tau weights, the reverse is not true: unit tau weights need not imply any data conditional independence. It suffices that the two log ratios in equation (2.48) take a common value; any common value, not only the zero value entailed by conditional independence, results in \tau_i = 1.
• \tau_i = 0

A zero tau interaction weight occurs when the numerator \log \frac{P(D_i = d_i \mid A = nona, D_{i-1} = d_{i-1})}{P(D_i = d_i \mid A = a, D_{i-1} = d_{i-1})} of expression (2.45) is equal to 0, leading to:

P(D_i = d_i \mid A = nona, D_{i-1} = d_{i-1}) = P(D_i = d_i \mid A = a, D_{i-1} = d_{i-1})    (2.49)

In the presence of the previously used data D_{i-1}, the datum D_i is non-informative in that it does not discriminate event a from nona. Note, however, that considering a different data sequence might result in a \tau_i weight different from 0. In such a case, the datum D_i does add valuable information about the unknown event A = a.
• \tau_i > 1

1. If x_i > x_0, that is, if datum D_i by itself increases the distance to event A = a occurring as compared to the prior distance x_0, then the interaction factor \tau_i > 1 makes that increase even greater.

2. Similarly, if x_i < x_0, that is, if datum D_i by itself decreases the prior distance to event A = a occurring, then the interaction factor \tau_i > 1 makes that decrease even greater: in log terms, \tau_i > 1 amplifies the elementary contribution \log(x_i/x_0), whatever its sign.

• If \tau_i < 1, the previous conclusions are reversed.
In summary, Krishnan has provided a solution to the difficult task of obtaining the exact conditional probability P(A \mid D_1, \ldots, D_n). This is done through relation (2.46) by decomposing the problem into two simpler tasks:

• obtaining the information content through the individually conditioned probabilities P(A \mid D_i), i = 1, \ldots, n;

• deriving the multiple-point joint data interaction tau parameters, whose exact expressions are known.

The tau interaction weights, in addition to being dependent on the specific ordering of the n data events D_1, \ldots, D_n, are data-values and unknown-value dependent. While such a form of dependence allows for a more comprehensive representation of the fully conditioned probability P(A \mid D_1, \ldots, D_n), it is too complex to be used in practice. This calls for approximations built from Krishnan's exact tau parameter expression. We argue that while sequence-dependent interaction weights are important in some applications, most often it is the global representation of such interaction that is desirable. Krishnan's derivation fails to provide a measure of such global data interaction.
Moreover, the \tau_i weights are likely to exhibit an unstable behavior versus data values. When the information is non-discriminating, the denominator of expression (2.45) tends toward \log 1 = 0, leading to an infinite tau weight \tau_i \to \infty, hence creating an inference problem.
Krishnan [50] did note that the inference of the interaction weights \tau_i is quite difficult and that the behavior of these tau weights was not fully understood. He pointed out the need for further analysis starting with synthetic data sets. These data sets should not only help develop a better understanding of the tau interaction parameters, but also lead to further theoretical developments.
Chapter 3
The nu representation
The overview presented in the previous Chapter pointed to the need to remain alert against any simplifying but potentially crippling hypothesis when it comes to data dependence and interaction. The need to consider data jointly rather than one or two at a time was also brought out. This is particularly critical when dealing with spatially distributed phenomena, where patterns of similar data carry valuable information beyond that carried by each datum individually.
This chapter builds the theoretical basis of the nu expression, which is a sister of the tau model proposed by Bordley [6] and Journel [48] and further developed by Krishnan [50]. In his thesis, Krishnan gave the exact expression of the tau weights and showed them to be directly related to the data interaction associated with any specific sequence of data. With these weights, the original tau model leads to an exact analytical solution to the problem of probabilistic data integration.
The major contribution of the nu expression proposed in the present thesis is the derivation of a single, data-sequence-independent, interaction parameter \nu_0. The derivation of this \nu_0 parameter and its estimation rely on the original idea of Journel's paper [48] that ratios of probabilities are more stable than the probabilities themselves, a well-proven engineering paradigm. The exact \nu_0 expression given hereafter is too complex to be practical. However, the availability of such an exact expression opens avenues for its approximation. In this chapter we propose two such approximations.
3.1 Derivation of the nu representation
3.1.1 The nu expression
Consider an unknown event A informed by n data events D_1, \ldots, D_n. These data have been evaluated for their individual information content related to the unknown event A through the elementary probabilities P(A \mid D_i)¹. The challenge is then to recombine the prior probability P(A) and the n single-event conditional probabilities P(A \mid D_i) into the posterior probability P(A \mid D_1, \ldots, D_n) while accounting for possible interaction among the data. The nu representation provides an exact expression for such recombination.

One of the well-proven paradigms for engineering approximation is the permanence of ratios: rates of increments are typically more stable than the increments themselves [48], [62]. Using this key idea, define the following distances to the unknown event A prior to and after observing any single data event:

x_0 = \frac{1 - P(A)}{P(A)} = \frac{P(\bar{A})}{P(A)} \in [0, \infty]: prior distance to A occurring, with \bar{A} = nonA

x_i = \frac{1 - P(A \mid D_i)}{P(A \mid D_i)} = \frac{P(\bar{A} \mid D_i)}{P(A \mid D_i)} \in [0, \infty]: updated distance knowing datum D_i;

x_i equals zero if P(A \mid D_i) = 1, and equals infinity if P(A \mid D_i) = 0.

¹Throughout this paper the short notation P(A \mid D_i) is used for the (a, d_i)-values-specific expression P(A = a \mid D_i = d_i).

The updated distance knowing jointly the n data is then:
x = \frac{1 - P(A \mid D_1, \ldots, D_n)}{P(A \mid D_1, \ldots, D_n)} = \frac{P(\bar{A} \mid D_1, \ldots, D_n)}{P(A \mid D_1, \ldots, D_n)} \in [0, \infty]    (3.1)

The one-to-one relation between the updated distance and the fully conditioned probability is simply:

P(A = a \mid D_i = d_i, i = 1, \ldots, n) = \frac{1}{1 + x} \in [0, 1]    (3.2)

If A is a non-binary random variable, then nonA denotes any A-outcome different from a.
Note that a distance is defined as a ratio; it is thus likely more robust than its component probabilities with regard to any factor affecting the estimation of such probabilities. Already, it can be seen that a distance has only one bound, as it must be non-negative, whereas a probability has two bounds [0, 1]. Bounds represent constraints for any estimation or approximation.

A distance is also the inverse of an odds ratio [23]; e.g., the prior odds of event A occurring is:

\frac{P(A)}{1 - P(A)} = \frac{1}{x_0} \in [0, \infty]
The distance and its inverse, the odds ratio, are exchangeable variables. They are thus Jeffrey's variables [71]. This allows for a convenient generalization of results: knowing either form, the distance or its inverse, is a sufficient condition to recover the other. Jeffrey's variables are known to be more stable than the original variables on which they are built.
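The transforms (3.1) and (3.2) are exact inverses of one another; a minimal sketch of the round trip between probabilities and distances:

```python
# Distance transform (3.1) and its inverse (3.2): x = (1 - P)/P, P = 1/(1 + x).
def to_distance(prob):
    """Distance to event A given prob = P(A | data), prob in (0, 1]."""
    return (1 - prob) / prob

def to_probability(x):
    """Recover P(A | data) from a distance x >= 0, relation (3.2)."""
    return 1 / (1 + x)
```

A certain event (P = 1) maps to distance 0, while an impossible one (P → 0) maps to an unbounded distance: the single-bound property noted above.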
Next, consider the following exact decomposition of the joint conditional probability:

P(A \mid D_1, \ldots, D_n) = \frac{P(A, D_1, \ldots, D_n)}{P(D_1, \ldots, D_n)} = \frac{P(A)\, P(D_1 \mid A)\, P(D_2 \mid A, D_1) \cdots P(D_n \mid A, D_1, \ldots, D_{n-1})}{P(D_1, \ldots, D_n)} = P(A)\, \frac{\prod_{i=1}^{n} P(D_i \mid A, D_{i-1})}{P(D_1, \ldots, D_n)}    (3.3)

where D_{i-1} = \{D_1 = d_1, \ldots, D_{i-1} = d_{i-1}\} is the set of all data events up to the (i − 1)th datum D_{i-1} = d_{i-1}. Note that D_0 = \{\emptyset\} is the empty set.

The notation D_{i-1} implies a specific sequence of the n data, with D_1 being the first datum considered and D_n being the last. There are n! such data sequences.
Next, rewrite the fully updated distance x as the product series:

x = \frac{P(\bar{A}) \prod_{i=1}^{n} P(D_i \mid \bar{A}, D_{i-1})}{P(A) \prod_{i=1}^{n} P(D_i \mid A, D_{i-1})} = x_0 \prod_{i=1}^{n} \frac{P(D_i \mid \bar{A}, D_{i-1})}{P(D_i \mid A, D_{i-1})}    (3.4)

The ratio P(D_i \mid \bar{A}, D_{i-1}) / P(D_i \mid A, D_{i-1}) measures how the likelihood of observing the datum D_i = d_i changes depending on whether A or nonA is present, once the previous (i − 1) data values D_{i-1} have been observed. This ratio is not yet a measure of interaction between the datum D_i and the previously considered data D_{i-1}. A measure of such data interaction is obtained by comparing the previous ratio to the one ignoring D_{i-1}, that is, to P(D_i \mid \bar{A}) / P(D_i \mid A).

Consider then the nu parameter defined as the ratio of ratios:

\nu_i = \frac{P(D_i \mid \bar{A}, D_{i-1}) \, / \, P(D_i \mid A, D_{i-1})}{P(D_i \mid \bar{A}) \, / \, P(D_i \mid A)} \geq 0, \quad \nu_1 = 1    (3.5)
The denominator of equation (3.5) can be re-written with Bayes inversion:

\frac{P(D_i \mid \bar{A})}{P(D_i \mid A)} = \frac{P(D_i, \bar{A})}{P(\bar{A})} \cdot \frac{P(A)}{P(D_i, A)} = \frac{P(\bar{A} \mid D_i)\, P(D_i)}{P(\bar{A})} \cdot \frac{P(A)}{P(A \mid D_i)\, P(D_i)} = \frac{x_i}{x_0}

Hence, the expression (3.4) can be rewritten as the exact nu expression:

\frac{x}{x_0} = \prod_{i=1}^{n} \nu_i \frac{x_i}{x_0} = \nu_0 \prod_{i=1}^{n} \frac{x_i}{x_0}, \quad \text{where } \nu_0 = \prod_{i=1}^{n} \nu_i \geq 0    (3.6)

Expressions (3.4) and (3.6) are two equivalent exact expressions for the conditional distance x, but equation (3.6) is written as a product of the more easily accessible elementary distances x_i, with parameters (the \nu_i and \nu_0) interpreted as measures of data interaction. It is hoped that the \nu-parameters (3.5), being ratios of ratios, are reasonably stable versus data values, that is, reasonably homoscedastic, more so than the direct likelihood ratios of expression (3.4). If that is the case, the \nu_i and \nu_0 parameters would be easier to infer from proxy training data.
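Since (3.6) is exact, it can be verified numerically: on any joint distribution of (A, D_1, D_2), the nu-weighted product of elementary distances recovers the fully conditioned distance. A sketch with a hypothetical pmf (\nu_1 = 1, so \nu_0 = \nu_2 here):

```python
# Exact nu expression (3.5)-(3.6) checked on a small hypothetical joint pmf
# P(A, D1, D2); the nu-weighted product recovers the exact posterior distance.
joint = {
    (1, 0, 0): 0.10, (1, 0, 1): 0.15, (1, 1, 0): 0.05, (1, 1, 1): 0.20,
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
}

def p(pred):
    """Probability of the event defined by predicate pred(a, d1, d2)."""
    return sum(w for k, w in joint.items() if pred(*k))

d1, d2 = 1, 1
pA  = p(lambda a, x, y: a == 1)
x0  = (1 - pA) / pA
pA1 = p(lambda a, x, y: a == 1 and x == d1) / p(lambda a, x, y: x == d1)
pA2 = p(lambda a, x, y: a == 1 and y == d2) / p(lambda a, x, y: y == d2)
x1, x2 = (1 - pA1) / pA1, (1 - pA2) / pA2

# nu_2 per (3.5): conditioned likelihood ratio over the unconditioned one
num = (p(lambda a, x, y: a == 0 and x == d1 and y == d2)
       / p(lambda a, x, y: a == 0 and x == d1)) \
    / (p(lambda a, x, y: a == 1 and x == d1 and y == d2)
       / p(lambda a, x, y: a == 1 and x == d1))
den = (p(lambda a, x, y: a == 0 and y == d2) / p(lambda a, x, y: a == 0)) \
    / (p(lambda a, x, y: a == 1 and y == d2) / p(lambda a, x, y: a == 1))
nu2 = num / den                            # nu_1 = 1, hence nu_0 = nu_2

pA12 = p(lambda a, x, y: a == 1 and x == d1 and y == d2) \
     / p(lambda a, x, y: x == d1 and y == d2)
x_exact = (1 - pA12) / pA12
x_nu = x0 * nu2 * (x1 / x0) * (x2 / x0)    # exact nu expression (3.6)
assert abs(x_exact - x_nu) < 1e-12
```

Unlike the tau weight, \nu_2 is obtained here without any logarithm, so a non-informative datum (ratio close to one) causes no division-by-zero instability.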
Remarks and Interpretation
• All probabilities in expressions (3.2) to (3.6), and therefore the (\nu_0, \nu_i) parameters, are data-values dependent (the d_i's). In addition, and most critically, they are also dependent on the specific outcome A = a being discriminated from A = nona. Note that this discrimination is between a and nona, not between any two outcomes a_k, a_{k'} of a variable A with more than two outcomes \{a_k, k = 1, \ldots, K > 2\}. The non-binary case K > 2 is treated in Chapter 5.
Data interaction
• The denominator of the \nu_i-expression (3.5) measures how the datum D_i = d_i discriminates the outcome A = a from nona. The numerator measures the same but in the presence of the previous data D_{i-1} = \{D_1 = d_1, \ldots, D_{i-1} = d_{i-1}\}. Thus the ratio (of ratios) \nu_i indicates how the discrimination power of D_i = d_i is changed by knowledge of the previous data D_{i-1} taken all together. Note that the n parameters \nu_i are data-sequence-dependent, as opposed to the single parameter \nu_0 in expression (3.6).

• The expression (3.5) of \nu_i is symmetric in D_i and D_{i-1}: one can exchange D_i for D_{i-1} and the expression of \nu_i would remain unchanged.
• If \nu_i = 1, the ability of the datum (or data event) D_i = d_i to discriminate a from nona is unchanged by knowledge of the previous (i − 1) data events D_{i-1} = \{D_1 = d_1, \ldots, D_{i-1} = d_{i-1}\}. Therefore we can consider \nu_i = 1 as the case of "non-interaction" of the two data events D_i and D_{i-1} when it comes to discriminating a from nona. The deviation |1 − \nu_i| is thus a measure of data interaction. That measure is (a, d_j, j = 1, \ldots, i)-values-dependent, although at times we will use the simplified terminology "D_i, D_{i-1} data interaction".

Note that non-interaction is not independence, if only because data independence, say between two random variables D_1 and D_2, does not involve any third variable A. The term "data redundancy" is sometimes used instead of "data interaction"; this is unfortunate because redundancy implies some overlap of the information content of the two data D_1 and D_2, with redundancy of information being a state one should avoid or, at least, correct. We prefer the more neutral term "data interaction", which means only that one datum modifies the information brought by the other whenever \nu_i \neq 1, i.e. whenever |1 − \nu_i| > 0.
• \nu_i = 1 in equation (3.5) requires that the two ratios P(D_i \mid \bar{A}, D_{i-1}) / P(D_i \mid A, D_{i-1}) and P(D_i \mid \bar{A}) / P(D_i \mid A) be equal to each other. One sufficient (but not necessary) condition for \nu_i = 1 is conditional independence of the two data D_i and D_{i-1} given both A = a and A = nona, that is:

P(D_i \mid A = a, D_{i-1}) = P(D_i \mid A = a)
and P(D_i \mid A = nona, D_{i-1}) = P(D_i \mid A = nona)

Note that the traditional conditional independence of D_i and D_{i-1} given only A = a does not suffice to ensure \nu_i = 1; see the example in Section 3.3.1. Thus data independence, data conditional independence, and data non-interaction (\nu_i = 1) are different states.
• If \nu_i > 1 in expression (3.6), the relative distance x_i/x_0 is increased.

1. If x_i > x_0, that is, if datum D_i by itself increases the distance to event A = a occurring as compared to the prior distance x_0, then the interaction factor \nu_i > 1 makes that increase even greater.

2. Similarly, if x_i < x_0, that is, if datum D_i by itself decreases the prior distance to event A = a occurring, then the interaction factor \nu_i > 1 reduces that decrease.

• If \nu_i < 1, the previous conclusions are reversed.
• The greater the data interaction as measured by the deviation |1 − \nu_i|, the more critical it is to consider the data D_{i-1} and D_i jointly, because they interact with one another in evaluating the outcome A = a. Thus, one can also read |1 − \nu_i| as a measure of the information modification brought by considering jointly the two data D_{i-1} and D_i.

Note that data interaction can go both ways, either increasing or decreasing the probability of event A = a occurring. Hence ignoring data interaction by setting \nu_i = 1 is not necessarily a conservative approximation.
• Similarly, when considering the single \nu_0 parameter, the deviation |1 − \nu_0| is a measure of global data interaction. Note that \nu_0 = 1 does not require that all \nu_i = 1. For example, \nu_1 = 1, \nu_2 = 2 and \nu_3 = 1/2 would result in \nu_0 = 1 \cdot 2 \cdot (1/2) = 1. In words, different elementary data interactions (\nu_i \neq 1) may cancel out into zero global data interaction (\nu_0 = 1). What counts in the end are not the elementary data interactions but the global one, as measured by |1 − \nu_0|. This is the main contribution of the nu expression. Again, the tau expression does not deliver such a global data interaction parameter.
Equivalence
• Although the tau derivation fails to reveal a single sequence-independent interaction parameter similar to the single parameter \nu_0 in expression (3.6), the individual, sequence-dependent, \tau_i and \nu_i parameters are related one-to-one as:

\nu_i = \left(\frac{x_i}{x_0}\right)^{\tau_i - 1}, \quad \text{or:} \quad \tau_i = 1 + \frac{\log \nu_i}{\log(x_i / x_0)}    (3.7)

Once again these expressions (the tau and the nu) are (a, d_i, i = 1, \ldots, n)-values-dependent.

The fact that the \nu_i, \tau_i relation depends on the datum D_i relative distance x_i/x_0 indicates a potential difference between the nu and the tau approaches not yet fully understood. Already their inference processes do feature different robustness characteristics: in the presence of a little-informative datum, x_i/x_0 \approx 1 and \tau_i \to \infty, which makes its inference difficult.
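Relation (3.7) guarantees that, for any one datum, the nu weighting \nu_i \cdot (x_i/x_0) and the tau weighting (x_i/x_0)^{\tau_i} coincide; a short numerical check with hypothetical values of x_0, x_i, and \nu_i:

```python
import math

# Check of relation (3.7): tau_i = 1 + log(nu_i)/log(x_i/x_0), so that
# nu_i * (x_i/x_0) equals (x_i/x_0)**tau_i. All values below are hypothetical.
x0, xi, nu_i = 1.5, 0.6, 1.4     # assumed distances and nu weight, xi/x0 != 1
tau_i = 1 + math.log(nu_i) / math.log(xi / x0)

assert abs(nu_i * (xi / x0) - (xi / x0) ** tau_i) < 1e-12
```

As x_i/x_0 → 1 (a non-informative datum) the denominator \log(x_i/x_0) vanishes and \tau_i blows up, while \nu_i stays finite: the robustness difference noted above.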
Comparing the nu (3.6) and tau (2.32) expressions, we observe:

• The nu weights lie in the interval [0, +\infty], while the tau weights are unbounded in [−\infty, +\infty].

• The interaction weight applied to the first datum in the sequence is, regardless of the expression used, always one per definition: \tau_1 = \nu_1 = 1.

• The mathematical derivation of the exact tau weights expression (refer to Krishnan's thesis [50]) is less straightforward than that for the nu weights, as shown in the beginning of this chapter.

• \tau_i = 1 for i = 1, \ldots, n implies the \nu_0 = 1 model. Similarly, \nu_i = 1 for i = 1, \ldots, n would also imply \nu_0 = 1. However, \nu_i weights different from 1 can also imply the \nu_0 = 1 model.
Associativity

Elementary data events can be regrouped into larger data events. This does not affect the final result yielded by either the tau or the nu expression. For example, consider the three elementary data events D_1, D_2, D_3 and regroup the first two into B = \{D_1, D_2\}. It can be shown that for the two data sequences D_1, D_2, D_3 and D_2, D_1, D_3, the third weight, \nu_3 or \tau_3, given to the last datum D_3 is the same as the second weight, \nu_{BD_3} or \tau_{BD_3}, given to that same datum D_3 in the data sequence B, D_3.
Symmetry

• A data interaction measure should be symmetric in the two data events involved. This property is verified by the nu parameters, not by the tau parameters. For example, consider any two data events D_1, D_2 informing the unknown A. For the first possible sequence D_1 \to D_2, consider the notation with the superscript (12) recalling that particular sequence:

\frac{x}{x_0} = \frac{x_1}{x_0} \cdot \nu_2^{(12)} \frac{x_2}{x_0}    (3.8)

For the second possible sequence D_2 \to D_1:

\frac{x}{x_0} = \frac{x_2}{x_0} \cdot \nu_1^{(21)} \frac{x_1}{x_0}    (3.9)

It comes immediately: \nu_2^{(12)} = \nu_1^{(21)}; that is, the nu parameter given to the second datum in the sequence remains the same no matter the sequence.

As for the tau parameters, using the log transform it comes:

\tau_1^{(21)} \log \frac{x_1}{x_0} = \log \frac{x_1}{x_2} + \tau_2^{(12)} \log \frac{x_2}{x_0}    (3.10)
where \tau_1^{(21)} \neq \tau_2^{(12)}, unless x_1 = x_2.

As noted before, notwithstanding the one-to-one relation (3.7), there is a difference between the nu and the tau parameters, and thus between the nu and the tau approaches to data integration. The lack of symmetry of the tau parameters represents an impediment which makes their interpretation and inference potentially more difficult.
3.1.2 Dictatorial property
If any one datum D_i is fully decisive, it imposes its result (as a dictator) on the final conditional probability, no matter the other data D_j, j \neq i, provided that none of these is equally dictatorial and contradictory.

Indeed, if x_i = 0 (\infty), or equivalently if P(A \mid D_i) = 1 (0), then the final distance x given by the exact expression (3.6) is equal to 0 (\infty). That is, P(A \mid D_i) = 1 (0) leads to a decisive statement: A = a occurs certainly (certainly not). A contradictory and equally dictatorial data event D_j would yield x_j = \infty when x_i = 0, and then equation (3.6) would not be able to resolve such total contradiction.
The limit case \nu_i = 0 corresponds to data interaction leading to such a decisive result. Indeed, first assume that none of the n data D_i is dictatorial by itself, i.e. x_i \neq 0 or \infty, \forall i = 1, \ldots, n. Then \nu_i = 0 if and only if the numerator of its expression (3.5) is zero, that is:

P(D_i = d_i \mid A = nona, D_{i-1}) = 0, \quad \text{although} \quad P(D_i = d_i \mid A = nona) > 0

Thus, the previous information D_{i-1} indicates assertively that the probability (likelihood) of observing the datum D_i = d_i in the presence of A = nona is zero; therefore \nu_i = 0, entailing \nu_0 = 0, although x_i > 0, and finally:

x = 0, that is, P(A = a \mid D_i, i = 1, \ldots, n) = 1

Similarly, the case \nu_i = \infty would entail x = \infty, hence:

P(A = a \mid D_i, i = 1, \ldots, n) = 0, even though x_i \neq \infty, \forall i.
3.1.3 A measure for data interaction
Considering the more general notation D_i, D_j = D_{i-1} and \nu_{ij} = \nu_i, the deviation |1 − \nu_{ij}| has been suggested as a measure of interaction between the two data sets D_i and D_j. In view of the similarity of the two limit dictatorial cases \nu_i = 0 and \nu_i = \infty, a more symmetric and standardized measure of data interaction (when it comes to discriminating A = a from A = nona) could be:

\kappa_{ij} = 1 - \exp(-|\log \nu_{ij}|) = \kappa_{ji} \in [0, 1]    (3.11)

where \nu_{ij} = \nu_{ji} \geq 0 is the nu parameter defined by relation (3.5), generalized by considering the notation D_j = D_{i-1}.

This interaction measure takes the limit values:

• \kappa_{ij} = 1, that is, maximum data interaction, whenever \nu_{ij} = 0 or +\infty;

• \kappa_{ij} = 0, that is, no data interaction, whenever \nu_{ij} = 1, as discussed before.
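The measure (3.11) can be sketched directly; note how reciprocal nu values, which represent interactions of the same strength in opposite directions, map to the same \kappa:

```python
import math

# Standardized interaction measure (3.11): kappa = 1 - exp(-|log nu|),
# mapping nu in [0, +inf] onto [0, 1].
def kappa(nu):
    """Data-interaction measure: 0 = no interaction, 1 = maximal interaction."""
    if nu == 0:
        return 1.0                    # dictatorial limit nu = 0
    return 1 - math.exp(-abs(math.log(nu)))
```

By construction kappa(\nu) = kappa(1/\nu), which is the symmetry the raw deviation |1 − \nu| lacks.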
3.2 Tau or nu expression?
If the \tau_i and \nu_i parameters are evaluated explicitly, e.g. from training data, and made data-values dependent, the two tau and nu expressions (2.32) and (3.6) are equivalent. One should then prefer the nu formulation, since it puts forward the single interaction parameter \nu_0(a, d_i, i = 1, \ldots, n), which is independent of the data sequence D_1, D_2, \ldots, D_n. Also, evaluation of the \tau_i parameter associated with a non-informative datum, such that P(D_i \mid A) = P(D_i \mid \bar{A}), would run into problems of instability linked to division by a log ratio close to zero, see relation (2.45).
However, if τi and νi are assumed constant, independent of the data values (a, di, i =
1, ..., n), then the tau formulation should be preferred. Indeed, consider the case of
only two data events with the two different sets of data values:

{D1 = d1, D2 = d2} and {D1 = d1', D2 = d2'}

• The nu model with the constant (homoscedastic) ν0 = ν1ν2 parameter value is
written as:

x x0 / (x1 x2) = ν0 for data set {d1, d2}

x' x0 / (x1' x2') = ν0 for data set {d1', d2'}

where x, x1, x2 are the distances corresponding to {d1, d2} and x', x1', x2'
are the distances corresponding to {d1', d2'}. Conditional distances are data
values-dependent, as opposed to the prior distance x0 = x0'. Therefore:

x / x' = (x1 x2) / (x1' x2'), for all constant ν0 values.

The parameter ν0 is seen to be ineffective on that ratio of distances.

• Conversely, the tau model (2.45) with constant τ1 and τ2 parameter values is
written as:

log (x/x0) = τ1 log (x1/x0) + τ2 log (x2/x0) for data set {d1, d2}

log (x'/x0) = τ1 log (x1'/x0) + τ2 log (x2'/x0) for data set {d1', d2'}

Thus:

log (x/x') = τ1 log (x1/x1') + τ2 log (x2/x2')

or equivalently:

x / x' = (x1/x1')^τ1 (x2/x2')^τ2
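Both behaviors can be checked numerically: the constant ν0 cancels out of the ratio x/x', while constant tau weights remain effective. The sketch below is our own illustration, with arbitrary distance values:

```python
import math

def x_nu(x0, xs, nu0):
    """Nu model with a constant nu0: x/x0 = nu0 * prod(xi/x0)."""
    prod = nu0
    for xi in xs:
        prod *= xi / x0
    return x0 * prod

def x_tau(x0, xs, taus):
    """Tau model: log(x/x0) = sum_i tau_i * log(xi/x0)."""
    s = sum(t * math.log(xi / x0) for t, xi in zip(taus, xs))
    return x0 * math.exp(s)

x0 = 1.0
d, d_prime = [0.2, 0.5], [0.4, 0.8]   # distances for the two data sets

# the ratio x/x' does not depend on the constant nu0 value:
r_a = x_nu(x0, d, 3.0) / x_nu(x0, d_prime, 3.0)
r_b = x_nu(x0, d, 0.5) / x_nu(x0, d_prime, 0.5)

# constant tau weights (2, 1) remain effective on that ratio:
r_tau = x_tau(x0, d, (2.0, 1.0)) / x_tau(x0, d_prime, (2.0, 1.0))
```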
The tau parameters, when considered data values-independent, remain effective
unless τ1 = τ2 = 1. This latter property of the tau expression, remaining
effective when the τi's are considered data values-independent, explains why
the tau expression (2.32) is a convenient heuristic to weight more certain data
events [5]. It suffices to make τi > τj > 0 to give more importance to data event
Di as compared to data event Dj, whatever the actual data values (di, dj).
However, this heuristic utilization of the tau model completely misses the main
contribution of the tau/nu expression, which is the quantification of data interaction
for any specific set of values (a, di, i = 1, ..., n).

In the nu representation, if one decides not to trust the individually conditioned
probabilities P(A = a | Di = di), we suggest simply setting these probabilities
closer to the marginal probability P(A = a).
3.3 Approximations based on the nu derivation
The exact developments presented above put forward the dependence of the fully
conditioned probability not only on the unknown value a but also on the data values di
themselves. This dependence, which we have called "heteroscedasticity", allows for an
exact representation of joint data interaction. However, full heteroscedasticity is too
complex to be accounted for in most practical situations. Thus, we need to formulate
approximations starting from the exact nu representation (3.6). In this section we
look at two such approximations and the resulting models. The first model ignores
data interaction by setting the global data interaction weight ν0 = 1. We compare this
model to that based on the traditional assumption of conditional independence. The
second model does account for joint data interaction by considering a value ν0 ≠ 1
that is a function of a few summaries of the values (a, di, i = 1, ..., n); that function is
calibrated from training data. We will refer to this second model as the "classified ν0
approach". But first it is useful to compare "no data interaction" with data conditional
independence on a large data set.
3.3.1 Evaluating the conditional independence assumption
The single ν0 parameter in expression (3.6) is noteworthy because:

• it is data sequence-independent and involves jointly all n data events Di;

• the no-data-interaction model ν0 = 1 does not require that any or all elementary
parameter values νi be equal to 1. In other words, multiple and possibly complex
elementary data interactions (νi ≠ 1) may cancel out into a "global no data
interaction" ν0 = 1. For example, consider three data that receive the following
sequence-dependent interaction weights: ν1 = 1 (per definition), ν2, and ν3 = 1/ν2.
While the weights ν2 and ν3 are not equal to one, they cancel out into no global
interaction since ν0 = ν1ν2ν3 = 1.

The approximation ν0 = 1 corresponding to no data interaction (the interaction
measure (3.11) is then κij = 0) is different from and far less restrictive than the joint data
conditional independence hypothesis written as:

P(D1, ..., Dn | A = a) = P(D1 | A = a) · · · P(Dn | A = a),
as shown by the following case study.
A binary case study
Consider an example with binary indicator variables illustrating the ν0 = 1 model
and comparing it to the estimator stemming from the assumption of conditional
independence (CI). The unknown A is an indicator variable fully defined/informed by
two indicator data B and C. For example, A could be a binary variable indicating whether
in a particular year the sea surface temperature in the North Atlantic Ocean is above
a given critical level. B could be an indicator of sea level pressure being greater than
a certain threshold, and C could indicate whether the salinity of the ocean exceeds a
certain threshold.

The total number of possible joint combinations of values for the three binary variables
(A and the two indicator data B and C) is 2^3 = 8. Each of these combinations is
assigned a probability of occurrence denoted pk, k = 1, ..., 8 (Table 3.1).
      A  B  C          A  B  C
p1    0  0  0    p5    1  0  0
p2    0  0  1    p6    1  0  1
p3    0  1  0    p7    1  1  0
p4    0  1  1    p8    1  1  1

Table 3.1: Joint distribution of indicators and their probabilities.
We generated 10,000 random realizations of these sets of eight joint probabilities
P(A, B, C). For this, for each pk, k = 1, ..., 8, we drew at random and
independently a number between 0 and 1 from a uniform distribution. Once all eight
joint probabilities pk were drawn, we standardized them to ensure the total law of
probability: Σ_{k=1}^{8} pk = 1 and pk ∈ [0, 1] for all k. The 10,000 generated sets of
probabilities can be considered as a fair, almost exhaustive, sample of all aspects of
data interaction and dependence, and hence represent a good basis for assessing the
robustness (i.e. the degree of deviation from the truth) of diverse approximations.
Once a set of the eight probabilities of joint occurrence is simulated, the corresponding
marginal and conditional probabilities can be retrieved. For example, consider
the conditional probability P(A = 0 | B = 0, C = 0). The corresponding marginal is
P(A = 0) = p1 + p2 + p3 + p4. The probability of the unknown A = 0 given that the
two indicator data B and C are also equal to 0 is:

P(A = 0 | B = 0, C = 0) = P(A = 0, B = 0, C = 0) / P(B = 0, C = 0) = p1 / (p1 + p5)    (3.12)
This exact conditional probability P(A = 0 | B = 0, C = 0) can be approximated in
two different ways.

• CI estimator: the estimator based on the traditional conditional independence
(CI) assumption given A = 0, as described in Section 2.1, starts from:

P**(A = 0 | B = 0, C = 0) = P(A = 0, B = 0, C = 0) / P(B = 0, C = 0)
                          = P(B = 0, C = 0 | A = 0) P(A = 0) / P(B = 0, C = 0)

Per data conditional independence:

P(B = 0, C = 0 | A = 0) = P(B = 0 | A = 0) P(C = 0 | A = 0)

Thus,

P**(A = 0 | B = 0, C = 0) = P(B = 0 | A = 0) P(C = 0 | A = 0) P(A = 0) / P(B = 0, C = 0)    (3.13)

wherein, in terms of the joint probabilities pk of Table 3.1:

P(B = 0 | A = 0) = (p1 + p2) / (p1 + p2 + p3 + p4)  and  P(B = 0, C = 0) = p1 + p5
• ν0 = 1 estimator: the conditional probability P(A = 0 | B = 0, C = 0) is
estimated by the nu expression (3.6) with the single global parameter ν0 set to
1 (no data interaction). First the distances to A = 0 occurring are calculated
as:

x0 = (1 − P(A = 0)) / P(A = 0);

x1 = (1 − P(A = 0 | B = 0)) / P(A = 0 | B = 0);  x2 = (1 − P(A = 0 | C = 0)) / P(A = 0 | C = 0)

wherein, in terms of the pk's of Table 3.1:

P(A = 0) = p1 + p2 + p3 + p4

P(A = 0 | B = 0) = (p1 + p2) / (p1 + p2 + p5 + p6)  and

P(A = 0 | C = 0) = (p1 + p3) / (p1 + p3 + p5 + p7)

The ν0 = 1 model gives:

x / x0 = (x1 / x0)(x2 / x0)

The conditional probability is immediately retrieved as:

P*(A = 0 | B = 0, C = 0) = 1 / (1 + x)    (3.14)
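The whole Monte Carlo comparison can be reproduced with a short script. The sketch below is our own code (using Python's standard library, not the original experiment's software); it implements the reference (3.12), the CI estimator (3.13), and the ν0 = 1 estimator (3.14):

```python
import random

def draw_joint(rng):
    """One random joint distribution p1..p8, standardized to sum to 1."""
    p = [rng.random() for _ in range(8)]
    s = sum(p)
    return [pk / s for pk in p]

def estimators(p):
    """Return (reference, CI estimate, nu0=1 estimate) of P(A=0|B=0,C=0)."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    ref = p1 / (p1 + p5)                                   # eq. (3.12)
    pA0 = p1 + p2 + p3 + p4
    # CI estimator, eq. (3.13); may exceed 1
    ci = ((p1 + p2) / pA0) * ((p1 + p3) / pA0) * pA0 / (p1 + p5)
    # nu0 = 1 estimator, eq. (3.14); always within (0, 1)
    x0 = (1 - pA0) / pA0
    pA0_B0 = (p1 + p2) / (p1 + p2 + p5 + p6)
    pA0_C0 = (p1 + p3) / (p1 + p3 + p5 + p7)
    x = x0 * ((1 - pA0_B0) / pA0_B0 / x0) * ((1 - pA0_C0) / pA0_C0 / x0)
    return ref, ci, 1.0 / (1.0 + x)

rng = random.Random(7)
n_trials = 10_000
n_illicit, err_ci, err_nu = 0, 0.0, 0.0
for _ in range(n_trials):
    ref, ci, nu = estimators(draw_joint(rng))
    n_illicit += ci > 1.0             # count illicit CI probabilities
    err_ci += abs(ci - ref)
    err_nu += abs(nu - ref)
```

A run of this sketch exhibits the same qualitative behavior reported in the text: a non-negligible fraction of the CI estimates are illicit (greater than 1), while the ν0 = 1 estimates are licit by construction and closer on average to the reference.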
Figure 3.1 gives the scatter plots of these two estimates versus the reference true
probability (3.12). The x-axis relates to the 10,000 exact P(A = 0 | B = 0, C = 0)
values and the y-axis relates to the approximations of this exact probability based on
the ν0 = 1 model (left graph) and based on conditional independence (right graph).

The approximation of conditional independence leads to 585 illicit probabilities (greater
than 1), that is, approximately 585/10000 ≈ 6% of all estimated probabilities. The most
severe case leads to the estimated probability P(A = 0 | B = 0, C = 0) = 14.5. Of
course, in any application any such violation of the laws of probability would be
corrected. In most cases, such correction would involve changing the elementary single
datum-conditioned probabilities.

The probabilities estimated under the ν0 = 1 model are always licit by definition
when working with binary data. Also, observe the high correlation coefficient (0.83)
of the ν0 = 1 model results with the reference. This robustness of the ν0 = 1 results
is attributed to the original paradigm stating that ratios of probabilities are likely to
be more stable than the probabilities themselves.

Table 3.2 gives the summary statistics of the 10,000 reference probabilities and the
two sets of 10,000 approximations. Not only does the ν0 = 1 model lead to licit probabilities
better correlated with the reference probabilities but also, critically, the ν0 = 1 model
reproduces the sample statistics of the reference much better than the model based on
the conditional independence assumption. That assumption tends to over-compound
the two data information, hence generating a positive bias (overestimation).
Figure 3.1: The scatterplots for the ν0 = 1 model (left) and the conditional independence estimator (right) versus the reference. The illicit probabilities are shown above the red line. Beware of the different y-axis scaling of these two panels.
            reference   ν0 = 1 model   conditional independence
mean        0.50        0.50           0.55
variance    0.056       0.038          0.150

Table 3.2: Summary statistics: means and variances of the 10,000 conditional probabilities P(A|B, C) and their approximations.

            reference   ν0 = 1
mean        0.50        0.50
variance    0.0560      0.0562

Table 3.3: Summary statistics: means and variances of the 10,000 conditional probabilities P(A|B, C) and their transformed estimator.
To ensure that conditional independence does hold, one might consider transforming
the eight original joint probabilities pk of Table 3.1 for each of the 10,000 realizations.
Such a transform amounts to tampering with actual observations (the pk's) to fit
convenient models and is not generally recommendable.

One such transformation, which ensures conditional independence given both A and
nonA, is:

p3^trans = p1 (p3 + p4) / (p1 + p2)

p4^trans = p2 (p3 + p4) / (p1 + p2)

p7^trans = p5 (p7 + p8) / (p5 + p6)

p8^trans = p6 (p7 + p8) / (p5 + p6)    (3.15)

This transformation leads to the two approximations given by conditional independence
(given A and nonA) and by the ν0 = 1 model being identical.
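The transform (3.15) is easy to verify numerically. The sketch below is our own code (the function name is ours); it checks that the transform preserves the total probability and the A-marginal, and that it drives the within-slice odds ratios to one, which for a 2-by-2 table is exactly conditional independence:

```python
import random

def ci_transform(p):
    """Transform (3.15): keep p1, p2, p5, p6 and replace p3, p4, p7, p8
    so that B and C become independent given A = 0 and given A = 1."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    return [p1, p2,
            p1 * (p3 + p4) / (p1 + p2),
            p2 * (p3 + p4) / (p1 + p2),
            p5, p6,
            p5 * (p7 + p8) / (p5 + p6),
            p6 * (p7 + p8) / (p5 + p6)]

rng = random.Random(3)
p = [rng.random() for _ in range(8)]
s = sum(p)
p = [pk / s for pk in p]   # standardize to a valid joint distribution
q = ci_transform(p)
```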
Table 3.3 gives the summary statistics of the reference conditional probability P(A|B, C)
and the estimator of that reference based on transformation (3.15). The mean of the
10,000 resulting approximations is equal to the mean of the exact probability (0.5).
However, as can be seen from Figure 3.2, the resulting estimator based on that
transform is poorly correlated with the reference true probability, with only a 0.41
coefficient of correlation. Note that the correlation coefficient of the ν0 = 1 model applied
to the original non-transformed data was 0.83.

Figure 3.2: The scatterplot for the estimator of the fully conditional probability P(A = 0 | B = 0, C = 0) based on transformed probabilities (y-axis) versus the reference (x-axis). The correlation coefficient between them is 0.41.
Instead of tampering with "data", i.e. changing the elementary single datum-conditioned
probabilities P(A|B) or P(A|C), Journel [48] suggested standardizing the two
conditional independence-based estimates P**(A = a | B, C) and P**(A = nonA | B, C)
as follows. Consider the conditional probability P(A|B, C), with Ā = nonA, given by:

P(A|B, C) = P(B, C|A) P(A) / [P(B, C, A) + P(B, C, Ā)] = P(B, C|A) P(A) / [P(B, C|A) P(A) + P(B, C|Ā) P(Ā)]

Assuming conditional independence given A and nonA leads to:

P(A|B, C) = P(B|A) P(C|A) P(A) / [P(B|A) P(C|A) P(A) + P(B|Ā) P(C|Ā) P(Ā)]

That is:

P(A|B, C) = S(A) / [S(A) + S(Ā)] ∈ [0, 1]    (3.16)

where S(A) = P(B|A) P(C|A) P(A)

and S(Ā) = P(B|Ā) P(C|Ā) P(Ā)

Were expression (3.16) applied to the estimation of the complement event Ā = nonA,
the following probability would be obtained:

P(Ā|B, C) = S(Ā) / [S(A) + S(Ā)]

which ensures that:

P(A|B, C) + P(Ā|B, C) = 1    (3.17)

Note that neither expression (3.16) nor (3.17) corresponds to conditional independence.
The ν0 = 1 model can be shown to identify with the standardized expression (3.16).
Indeed:

P(A|B, C) = 1 / (1 + x) = S(A) / [S(A) + S(Ā)]    (3.18)

In summary, the no-data-interaction ν0 = 1 model represents a significant contribution
beyond the traditional hypothesis of data conditional independence. Conditional
independence given A and nonA does lead to the ν0 = 1 model, but there are many
patterns of data dependence that also lead to the same ν0 = 1 model.

Notwithstanding these advantages, the ν0 = 1 model can be restrictive in some
applications. It is thus important to consider the case when the global interaction weight
ν0 is different from one, as shown by the stockbrokers example below.
The stockbrokers case and ν0 ≠ 1
Consider an uncertain decision to be made about buying a particular stock (A = 1).
The prior probability is uninformative: P(A = 1) = 0.5, hence x0 = 1.

1. Two stockbrokers (D1, D2) strongly advise to buy that stock:

P(A = 1 | D1 = 1) = P(A = 1 | D2 = 1) = 0.9, with x1 = x2 = 0.1/0.9 = 1/9 ≈ 0.11

The likelihood of having the second broker advising a buy (D2 = 1) in presence
of (A = 1, D1 = 1) is much greater than in presence of (A = 0, D1 = 1);
therefore:

P(D2 = 1 | A = 1, D1 = 1) = 9 · P(D2 = 1 | A = 0, D1 = 1) >> P(D2 = 1 | A = 0, D1 = 1)

Thus the second nu weight given to the information D2 = 1 is, according to
expression (3.5):

ν2 = [P(D2 = 1 | Ā, D1 = 1) / P(D2 = 1 | A = 1, D1 = 1)] · (x2/x0)^(−1) = (1/9) · 9 = 1

In the absence of any other information, the distance x to A = 1 conditioned
to D1 = D2 = 1 is then, with ν1 = ν2 = 1:

x = x0 (x1/x0)(x2/x0) = x1 x2 / x0 = (1/9)^2

and P(A = 1 | D1 = D2 = 1) = 1 / (1 + x) ≈ 0.99

The compounding of the two brokers' advice leads to a strong probability (0.99)
for a buy (A = 1). Note that the previous ν2 = 1 value also corresponds to
the no-interaction case:

P(D2 = 1 | A = 1, D1 = 1) / P(D2 = 1 | Ā, D1 = 1) = P(D2 = 1 | A = 1) / P(D2 = 1 | Ā)

that is, knowing the information D1 = 1 does not affect how the information
D2 = 1 discriminates A = 1 from A = nonA; i.e. there is no presumption
of any collusion between the two brokers.

2. A third adviser (D3) admits knowing nothing about that stock, hence

P(A = 1 | D3 = 1) = P(A = 1) = 0.5, with x3 = x0 = 1

However, when told that the two brokers D1, D2 did both strongly advise a
buy, the adviser D3 warns about collusion with:

P(D3 = 1 | A = 0, D1 = D2 = 1) = 100 · P(D3 = 1 | A = 1, D1 = D2 = 1) >> P(D3 = 1 | A = 1, D1 = D2 = 1)

Therefore, given the data sequence D1 = 1, D2 = 1, D3 = 1, the likelihood of
the warning D3 = 1 is much larger when the stock is a dud (A = 0) than when
the stock is actually good (A = 1).
The third nu weight given to D3 = 1 is:

ν3 = [P(D3 = 1 | Ā, D1 = D2 = 1) / P(D3 = 1 | A = 1, D1 = D2 = 1)] · (x3/x0)^(−1) = 100 · 1 = 100

with Ā = nonA.

The fully conditioned distance x is then, with ν1 = 1, ν2 = 1, ν3 = 100:

x = x0 (x1/x0)(x2/x0)(100 · x3/x0) = (0.11)^2 · 100 · 1 = 1.21,

and: P(A = 1 | D1 = D2 = 1, D3 = 1) = 1 / (1 + 1.21) ≈ 0.45 < P(A = 1) = 0.5

The advice D3 = 1 in presence of D1 = D2 = 1 leads to dropping the buy
probability from P(A = 1 | D1 = D2 = 1) ≈ 0.99 to a value 0.45 lower than the
prior P(A = 1) = 0.5.

3. Honest broker: If in presence of D1 = 1 (buy advice given A = 1), the second
broker D2 is honest and admits that his input would not discriminate further
A from Ā = nonA, then:

P(D2 = 1 | A = 1, D1 = 1) / P(D2 = 1 | Ā, D1 = 1) = 1 ⇒ ν2 = (x2/x0)^(−1) · 1 = 9

ν2 = 9 ≠ 1 indicates strong (here honest) data interaction.

Then: ν2 x2/x0 = 1, i.e. the datum D2 = 1 is ignored and the distance conditioned
to D1 = 1, D2 = 1 is:

x = x0 (x1/x0)(ν2 x2/x0) = x1 = 1/9

that is: P(A = 1 | D1 = D2 = 1) = P(A = 1 | D1 = 1) = 0.9.

In such a case (honest second broker D2), the third adviser would not have
discriminated A from Ā = nonA, with:

P(D3 = 1 | A = 1, D1 = D2 = 1) / P(D3 = 1 | Ā, D1 = D2 = 1) = 1 ⇒ ν3 = (x3/x0)^(−1) · 1 = 1

That is, the third adviser brings no further information, and:

x = x0 (x1/x0)(ν2 x2/x0)(ν3 x3/x0) = (1/9)(1)(1) ≈ 0.11

Hence, P(A = 1 | D1 = D2 = 1, D3 = 1) = P(A = 1 | D1 = 1) = 0.9.
4. A different datum D3' = 1

Consider the different sequence of information D1, D3', D2, with the adviser
datum (D3' = 1) being given prior to knowing the second broker advice (D2 = 1).
Note that this third information is actually different from that considered above
and must be denoted D3' = 1, with D3' ≠ D3.

Ignoring the potential for collusion, the adviser D3' would have been fully
non-committing, with:

P(D3' = 1 | A = 1, D1 = 1) = P(D3' = 1 | Ā, D1 = 1) and x3' = x0

hence the nu weight given to that "second in sequence" adviser information
is ν3' = 1, leading to:

x = x0 (x1/x0)(ν3' x3'/x0) = x1;

i.e. P(A = 1 | D1 = 1, D3' = 1) = P(A = 1 | D1 = 1) = 0.9.

Now comes the dishonest second broker datum D2 = 1, with, similarly to case 1:

P(D2 = 1 | A = 1, D1 = 1, D3' = 1) = 9 · P(D2 = 1 | A = 0, D1 = 1, D3' = 1)

and ν2' = (1/9) · 9 = 1

Then:

x = x0 (x1/x0)(ν3' x3'/x0)(ν2' x2/x0) = x1 x2 / x0 = (1/9)^2

and P(A = 1 | D1 = 1, D3' = 1, D2 = 1) = 1 / (1 + x) ≈ 0.99.

The adviser datum D3' = 1, different from the datum D3 = 1 of case 2 because
of his ignorance of D2 being used, cannot reveal the collusion and hence cannot amend the
strong final buy probability (0.99).
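The four cases above reduce to repeated applications of the compounding formula x/x0 = ∏ νi (xi/x0). A small sketch (our own code, with hypothetical function names) reproduces the numbers:

```python
def compound_distance(x0, contributions):
    """Nu-model compounding: x = x0 * prod(nu_i * x_i / x0)."""
    x = x0
    for nu_i, x_i in contributions:
        x *= nu_i * x_i / x0
    return x

def prob(x):
    """P(A = 1 | data) = 1 / (1 + x)."""
    return 1.0 / (1.0 + x)

x0 = 1.0          # prior P(A = 1) = 0.5
x1 = x2 = 1 / 9   # each broker alone: P(A = 1 | Di = 1) = 0.9

# Case 1: two (colluding) brokers, nu1 = nu2 = 1
p_two = prob(compound_distance(x0, [(1, x1), (1, x2)]))            # ~0.99

# Case 2: third adviser reveals the collusion, nu3 = 100 with x3 = x0
p_three = prob(compound_distance(x0, [(1, x1), (1, x2), (100, x0)]))

# Case 3: honest second broker, nu2 = (x2/x0)^-1 = 9: his datum is ignored
p_honest = prob(compound_distance(x0, [(1, x1), (9, x2)]))         # 0.9
```

Note the exact value in case 2 is 1/(1 + 100/81) ≈ 0.447; the text's 1.21 and 0.45 come from the rounded distance x1 = x2 = 0.11.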
3.3.2 The classified ν0 approach
The ν0 = 1 model is very simple yet performs significantly better than approximations
stemming from traditional data independence assumptions (see more case
studies in Chapter 4). Nevertheless, the ν0 = 1 model fails to account for interaction
between data. In practice, data originating from the same physical source most often
do interact with each other. To reinstate data interaction and data values dependence
(heteroscedasticity) into our probabilistic system, we propose to borrow the data
interaction parameters from proxy training data or images revealing the physics of that
interaction.

Consider first the example of n categorical data Di, each datum taking one of K
categories; similarly, the data Di could be continuous with the histogram of each
datum being divided into K classes. There are K^n possible data value combinations, a
daunting number if K or n is large. Let l be any one of these K^n possible data value
combinations. Each such data event d(l) = {Di = di(l), i = 1, ..., n} may be reduced
into a few summary scores s(l) = {sj(l), j = 1, ..., n'} with n' << n.

For example, if D is a multiple-point data event involving n continuous variables Di
all related to the same attribute, two classical scores are the mean and variance of
the n data values, i.e. d(l) → s(l) = {m(l), σ²(l)}; the dimension reduction is from n
to n' = 2 << n.

Of course, there are other possible summary scores than mean and variance, although
requiring n' > 2, but still n' << n [74].
A proxy experiment (or training data set), mimicking the actual data interaction,
would provide L joint outcomes {a(l); di(l), i = 1, ..., n}, l = 1, ..., L, with a(l) = a or
nona. These L joint outcomes can be classified into N classes or clusters of similar
scores s(c) = {sj(c), j = 1, ..., n'}, c = 1, ..., N, where N << K^n. We can
associate a typical (or prototype) ν0-value to each of these score classes or clusters,
since on the training data set we know both the final distance x and the n marginal
distances xi. The greater the deviation |1 − ν0| of this prototype value ν0, the more
consequential the global data interaction, and hence the more likely the classified ν0
approach is to outperform the simplistic ν0 = 1 model.

For each application, having observed the actual data set d = {Di = di, i = 1, ..., n},
we would calculate the corresponding score vector s ~ d. This score should be defined
in the same fashion as the training scores s(c). We then find the training class s(c) closest
to that experimental score vector s, and retrieve and use the ν0-prototype value of
that closest class. The distance measure needed to find that closest training class
should be defined carefully, and that distance is case-dependent. One simple measure
could be the absolute distance r between the two scores s(c) = {sj(c), j = 1, ..., n'}
and s = {sj, j = 1, ..., n'}, defined as:

r0 = min { r(c) = Σ_{j=1}^{n'} |sj(c) − sj|, c = 1, ..., N },    (3.19)

where N is the number of score classes.

The closest training class C0 will be that with the smallest distance r0.
As an example, consider a training image whereby an outcome A is evaluated by
three binary data values. This leads to a total of 2^3 = 8 data value combinations.
One training score s(l) could be the average of the three binary data values. The eight
data value combinations and their respective scores are shown in Table 3.4. These
eight data value combinations can be classified into 4 classes of equal scores, as shown
in Figure 3.3.

data value combination   000       001       010       011       100       101       110       111
score                    0/3=0.00  1/3=0.33  1/3=0.33  2/3=0.67  1/3=0.33  2/3=0.67  2/3=0.67  3/3=1.00

Table 3.4: Eight data value combinations and their scores.

Assume that the actual conditioning data event consists of the following eight binary
values: 0111 0101, with corresponding score s = 5/8 = 0.63. Note that the
actual information size is eight binary data, as opposed to the training information
which consists of only three binary data. Using then equation (3.19), we find that
the training class number 3 (with training score s(3) = 0.67) is closest to the actual
data score s. We will thus use the training ν0-value corresponding to that third training
class. That ν0-value is likely not equal to 1, reflecting the training data interaction.
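The classification and lookup just described can be sketched in a few lines. This is our own illustration (function names are ours); the single score is simply the mean of the binary values:

```python
from itertools import product

def score(values):
    """Single summary score: the mean of the binary data values."""
    return sum(values) / len(values)

# Training side: the 2^3 = 8 combinations of Table 3.4 collapse
# into 4 classes of equal (rounded) score, as in Figure 3.3.
classes = {}
for combo in product([0, 1], repeat=3):
    classes.setdefault(round(score(combo), 2), []).append(combo)

def closest_class_score(actual_values, class_scores):
    """Eq. (3.19) with a single score: pick the training class whose
    score is at the smallest absolute distance from the actual score."""
    s = score(actual_values)
    return min(class_scores, key=lambda c: abs(c - s))

actual = (0, 1, 1, 1, 0, 1, 0, 1)            # score 5/8 = 0.625
best = closest_class_score(actual, classes)  # -> class with score 0.67
```

In an application one would then retrieve the prototype ν0 of that closest class; with vector scores, the sum of absolute score differences of (3.19) replaces the scalar |c − s| used here.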
Following our decomposition paradigm presented in Section 3.1, we borrow only the
data interaction parameter ν0 from the training set or catalog, not any of the elementary
training probabilities. The νi, ν0 parameters, being ratios of ratios of conditional
probabilities (3.5), should be more stable (more homoscedastic) with regard to data
values than the training conditional probabilities. If this engineer-type conjecture of
permanence of ratios proves right, the ν0-values borrowed from the training set should
be exportable to the application field, much more so than the conditional probabilities.
For example, one would not export to an actual subsurface hydrocarbon field
direct porosity or permeability measurements taken from an analog outcrop. Instead,
one may retain the more stable permeability ratio Kv(u)/Kh(u), with Kv being the
vertical permeability at location u and Kh being the horizontal permeability at that same
location u.

Figure 3.3: Four training classes and their respective representative scores (s = 0.00: 000; s = 0.33: 001, 010, 100; s = 0.67: 011, 101, 110; s = 1.00: 111).
In any particular application, we suggest that:

1. the n elementary distances xi, i.e. the n elementary single datum-conditioned
probabilities P(A | Di = di), be evaluated directly using the actual data values
di. As mentioned before, this problem has received many solutions. For
example, numerous literature sources propose algorithms for obtaining conditional
probabilities via neural networks [3], [40], [41]. In geostatistics, one could
consider an indicator algorithm for modeling the elementary conditional
distribution functions [34]. Obtaining such elementary, single datum-conditioned,
probabilities is not in the scope of this thesis.

2. the single weight ν0 modeling the joint data interaction be borrowed from a
proxy experiment (or training set) where the relations x/x0 versus xi/x0 are known,
thus providing proxy values for ν0.

The difficulty with borrowing such a proxy weight ν0 is that this weight is (a; di, i =
1, ..., n)-values dependent. The proposed ν0 classification accounts for such data values
dependence, although approximating it through summaries (scores) of these data
values. That heteroscedasticity has its positive side: the ν0 weights measuring joint
data interaction do depend on the data values, as opposed to the homoscedastic kriging
variance and regression weights.
An example of the classified ν0 approach
As an example of the classified ν0 approach, consider obtaining the posterior conditional
probability P(A = 1 | D, B), where D and B are two data events informing the unknown
A. For example, A could be a binary variable indicating the presence of sand at the
location u of a potential reservoir site. Data event D could be the indicator of sand
facies at nearby well locations, and data event B the indicator of sand facies at
more remote wells. Assume then the availability of the following prior and pre-posterior
probabilities, and of a training image:

• P(A = 1): prior probability of A occurring. Such probability could be obtained
from historic data. Note that this prior is common to both data events D and B.

• P(A = 1 | D) and P(A = 1 | B): probabilities of A occurring given the information
provided by data events D and B taken separately. This is equivalent (through
Bayes' relation) to knowing the respective likelihood functions P(D | A = 1) and
P(B | A = 1) of observing data event D or B given the unknown A = 1. For
example:

P(A = 1 | D) = P(A = 1, D) / P(D) = P(D | A = 1) P(A = 1) / P(D)
             = P(D | A = 1) P(A = 1) / Σ_{a=0,1} P(D | A = a) P(A = a)    (3.20)
• the training image of Figure 3.4. Such a training image is a synthetic representation
of the interaction between data events D and B and between them and the
unknown A. In practice, a training image could be obtained from an outcrop or
built using a process-based simulation algorithm [70], [74]. The training image
allows retrieving the data values-dependent global interaction parameter ν0 of
equation (3.5).

Next consider the templates defining the data events D and B, as shown in
Figure 3.5 (1) and (2):

– the closest data event D comprises 4 data locations, each 10 meters
away from the unknown A(u). These 4 data are located at the corners of
a square centered at location u;

– the second data event B also comprises 4 data locations with the same
geometry, but located 15 meters away from the unknown u.

When conditioning only to either D or B alone, there are 2^4 = 16 possible
combinations of binary data values to consider. When conditioning jointly to
both the D and B data events, there are 2^8 = 256 possible combinations of binary
data values. Note that ideally a training image such as that shown in Figure
3.4 should be large enough to depict all possible 256 data value combinations.
The training image provides replicates of the (D, B) joint data event and the
corresponding A value. That training image thus provides all probabilities of
Figure 3.4: Training image depicting the interactions between data and unknown (proportions p = 0.72 / 0.28).

Figure 3.5: Data events definitions.
the type P(A|B), P(A|D), and P(A|B, D), and consequently proxy values of the
ν0 data interaction parameter, as defined in equation (3.5).

Regarding the inference issue, if the training image is not large and "rich"
enough to display enough replicates of all data events (taken jointly) found
in the actual field, one can reduce these data events through a few summary
statistics or scores. For example, the 256 (D, B) data value combinations of
this example could be summarized by two scores S1 and S2, where:

1. score S1 is the arithmetic average of the (4+4) = 8 data values;

2. score S2 could be a measure of east-west connectivity calculated on the same
8 data values, as suggested by Zhang [74].

That is, the 256 data value combinations of the data template of Figure 3.5(2)
have been summarized by only two scores S1 and S2. Figure 3.6 shows schematically
such dimension reduction, where the two scores S1 and S2 are plotted on
the x and y axes, respectively. Each pair (S1, S2) on the score map (Figure 3.6,
right) corresponds to a particular training data occurrence with the configuration
shown in Figure 3.5(2).

Figure 3.6: Training image (left) is summarized by the distribution of two summary scores shown on the score map (right).
Further, using a traditional classification technique such as cross or k-means
partitioning, the score space is divided into clusters or classes of similar score values.
For example, Figure 3.6 shows nine such classes. For each such class, we can
retrieve a prototype ν0 value by, for example, taking the average or median
of that class's training ν0 values. These nine prototype ν0 values are likely all
different from the value 1; they allow us to step away from the assumption of
no data interaction of the ν0 = 1 model.
In the application phase, it is a simple task to find the training class closest to the
actual conditioning data scores and retrieve that class's prototype ν0 value to combine
the elementary probabilities. The classified ν0 paradigm is general in that:

1. The actual conditioning data event can be quite complex. In the example of
Figure 3.5, the conditioning data events D and B comprise 4 data points each.
In an actual application, the joint conditioning data event might comprise many
more than eight data points. Particularly important are the actual data score
values retained to find the closest training class and retrieve its prototype ν0
value.

2. The actual conditioning scores need not match exactly any of the training class
scores. In other words, the actual conditioning data event does not need to have
exact replicates in the training image: it suffices to find the training class with
the closest set of score values.

Of course, the set of scores retained should be chosen so that it reflects the main
characteristics of any specific joint conditioning data set. With too many scores, the
training image available may not offer enough replicates to fill in reliably the score
space.
Chapter 4
Application to binary data
The purpose of this chapter is to illustrate the nu model with applications to binary
data sets. A binary data set consists of values coded as either zero or one. We will
sometimes refer to the category zero as mud/no sand and to the category one as sand,
following petroleum engineering convention. The reference binary data sets presented
in this work are assumed exhaustively known. Such reference data sets provide the
exact fully conditioned proportions and allow checking any approximation, including
those resulting from the ν0 = 1 model and the classified ν0 approach, against
traditional estimators based on data independence and conditional independence. Various
important parameters controlling data interaction are investigated. Particular focus
is given to the dependence of data interaction on the data values. This heteroscedastic
dependence makes more difficult the inference of an accurate ν0-model. The levels
of heteroscedasticity of the tau and nu parameters are compared; we expect the nu
weights to be more stable versus data values and hence easier to infer.
4.1 An elementary case study
4.1.1 Equilateral configuration
To investigate how the nu and tau parameters relate to data interaction, the following
simple experiment is proposed. It involves one unknown A located at the center of an
equilateral triangle and three data I1, I2, I3 located at its three apices (Figure 4.1).
Figure 4.1: Spatial locations of the three data I1, I2, I3 and of the unknown A. The distances are given in parentheses (center-to-apex 5.77; side length 10.0).
All four variables are binary (0,1) and were generated by truncation of a simulated
Gaussian field.

More precisely, 100,000 unconditional joint realizations of the four corresponding
Gaussian random variables Z(u) are generated by LU decomposition of their 4×4 covariance
matrix (program LUSIM in [11]).

The isotropic covariance model used to build the covariance matrix is:

C(h) = exp(−h/r), with practical range 3r
4.1. AN ELEMENTARY CASE STUDY 79
That range $3r$ is varied from one set of 100,000 realizations to another set of
equal size. Each set allows us to study data dependence and interaction in evaluating
the central value $A$.
All standard Gaussian realizations, denoted $z(\mathbf{u})$, are truncated at the median value
$z = 0$ to generate joint realizations of the four binary indicator variables:

$$i(\mathbf{u}_\alpha) = \begin{cases} 1 & \text{if } z(\mathbf{u}_\alpha) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\mathbf{u}_0$ is the location of the central value $A$ to be evaluated, and $\mathbf{u}_\alpha$, $\alpha = 1, 2, 3$
are the three data locations.
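A minimal numpy sketch of this simulation (our stand-in for the LUSIM program of [11]; the geometry of Figure 4.1, side 10 and data-to-unknown distance $10/\sqrt{3} \approx 5.77$, is hard-coded as an assumption):

```python
import numpy as np

def exp_covariance(dists, practical_range):
    """Isotropic exponential covariance exp(-h/r), practical range 3r."""
    r = practical_range / 3.0
    return np.exp(-dists / r)

def simulate_indicators(dists, practical_range, n_real=100_000, seed=0):
    """Joint binary realizations of (A, I1, I2, I3) by LU (Cholesky) simulation."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(exp_covariance(dists, practical_range))
    z = rng.standard_normal((n_real, 4)) @ L.T   # correlated standard Gaussians
    return (z > 0).astype(int)                   # truncation at the median z = 0

# Unknown A at the centre of an equilateral triangle of side 10:
# each datum is 10/sqrt(3) ~ 5.77 from A and 10.0 from the other data.
d_ai, d_ii = 10.0 / np.sqrt(3.0), 10.0
dists = np.array([[0.0,  d_ai, d_ai, d_ai],
                  [d_ai, 0.0,  d_ii, d_ii],
                  [d_ai, d_ii, 0.0,  d_ii],
                  [d_ai, d_ii, d_ii, 0.0]])

ind = simulate_indicators(dists, practical_range=15.0)
print(ind.mean(axis=0))   # all four marginals close to 0.5
```

Repeating the call over a grid of practical ranges reproduces the abscissa of Figure 4.2.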
The four variables being binary, there are a total of $2^4 = 16$ possible joint combinations
of their values. A joint probability of occurrence $p_k$, $k = 1, \ldots, 16$, is assigned
to each of these 16 joint combinations (Table 4.1). Note that $\sum_{k=1}^{16} p_k = 1$.
        p1  p2  p3  p4  p5  p6  p7  p8
  A      0   0   0   0   0   0   0   0
  I1     0   0   0   0   1   1   1   1
  I2     0   0   1   1   0   0   1   1
  I3     0   1   0   1   0   1   0   1

        p9  p10 p11 p12 p13 p14 p15 p16
  A      1   1   1   1   1   1   1   1
  I1     0   0   0   0   1   1   1   1
  I2     0   0   1   1   0   0   1   1
  I3     0   1   0   1   0   1   0   1

Table 4.1: Probability notation for the 16 joint occurrences.
The prior or marginal probability associated with any of the four binary variables is
$p_0 = 0.5$, corresponding to the prior distance:

$$x_0 = \frac{1 - p_0}{p_0} = 1, \quad \text{with } p_0 = P(A = 1) = 0.5.$$
Figure 4.2: Conditional probabilities. Concordant data case: $A = I_1 = I_2 = I_3 = 1$. The $\nu_0 = 1$ model outperforms the model based on the conditional independence assumption, as seen from the $\nu_0 = 1$ model values being closer to the reference probability.
The 16 probabilities $p_k$ are set equal to the corresponding 16 proportions of joint
occurrence calculated from each set of 100,000 simulated realizations of the four variables
$A$, $I_1$, $I_2$, $I_3$. From such a consistent set of probabilities of joint occurrence,
all conditional probabilities can be retrieved and plotted vs. the practical range $3r$
(Figure 4.2).
For example, the probability that $A = 1$ given that all three indicator data are 1 is
retrieved as:

$$P(A = 1 \mid I_1 = I_2 = I_3 = 1) = \frac{P(AI_1I_2I_3 = 1)}{P(I_1I_2I_3 = 1)} = \frac{p_{16}}{p_8 + p_{16}} \quad (4.2)$$
The practical range $3r$ of the underlying Gaussian random function is increased along
the abscissa axis of Figure 4.2 from zero (pure nugget effect) to 30, a large range value
three times the distance between any two of the three data $I_j$ in Figure 4.1. The
ordinate axis gives the various probabilities, all related to the event $A = 1$. All of these
probabilities are seen to increase with the correlation range $3r$, as expected for the
case of concordant data: $I_1 = I_2 = I_3 = 1$.
The probabilities plotted along the ordinate axis of Figure 4.2 are:

• the constant prior probability $P(A = 1) \approx 0.5$. The small fluctuations around
the expected value 0.5 are due to the finite sample of only 100,000 realizations.

• the three elementary conditional probabilities conditioned to one single datum:
$P(A = 1 \mid I_j = 1)$, $j = 1, 2, 3$. Had we drawn an infinite number of realizations
(instead of only 100,000), these three elementary conditional probabilities would
all be exactly equal.

• the exact fully conditioned probability $P(A = 1 \mid I_1I_2I_3 = 1)$ as calculated from
expression (4.2).
• the approximation of the previous exact fully conditioned probability using the
nu expression (3.6) with the approximation $\nu_0 = 1$:

$$P^*_{\nu_0=1}(A = 1 \mid I_1I_2I_3 = 1) = \frac{1}{1 + x^*} \quad (4.3)$$

with the distance $x^*$ approximated under the $\nu_0 = 1$ model as:

$$\frac{x^*}{x_0} = \prod_{k=1}^{3} \frac{x_k}{x_0}, \quad \text{with } x_k = \frac{1 - P(A = 1 \mid I_k = 1)}{P(A = 1 \mid I_k = 1)}$$
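Expression (4.3) is easy to put into code; the sketch below (function name ours) converts each elementary probability into a distance $x_k = (1-p)/p$, compounds the distance ratios $x_k/x_0$, and back-transforms:

```python
def nu0_one_estimate(prior, single_probs):
    """Combine single-datum probabilities P(A=1|I_k=1) under the nu0 = 1 model."""
    x0 = (1.0 - prior) / prior               # prior distance
    x_star = x0
    for p in single_probs:
        x_k = (1.0 - p) / p                  # elementary distance
        x_star *= x_k / x0                   # each datum multiplies in its ratio
    return 1.0 / (1.0 + x_star)

# Three identical elementary probabilities of 0.7 with a 0.5 prior compound
# to a much stronger statement than any single one:
print(nu0_one_estimate(0.5, [0.7, 0.7, 0.7]))   # 343/370 ~ 0.927
```

This over-compounding of concordant information is exactly the overestimation discussed below.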
• the approximation of the exact fully conditioned probability using the common
hypothesis of data conditional independence. More precisely, the exact
probability is written:

$$P(A = 1 \mid I_1, I_2, I_3 = 1) = \frac{P(I_1 = I_2 = I_3 = 1 \mid A = 1)\,P(A = 1)}{P(I_1 = I_2 = I_3 = 1)}$$
per data conditional independence:

$$P(I_1 = I_2 = I_3 = 1 \mid A = 1) = P(I_1 = 1 \mid A = 1)\,P(I_2 = 1 \mid A = 1)\,P(I_3 = 1 \mid A = 1)$$

where all conditional probabilities can be written in terms of the probabilities
$p_k$ of Table 4.1, e.g.

$$P(I_1 = 1 \mid A = 1) = \sum_{k=13}^{16} p_k \Big/ \sum_{k=9}^{16} p_k, \qquad P(I_1 = 1, I_2 = 1, I_3 = 1) = P(I_1I_2I_3 = 1) = p_8 + p_{16}$$
Thus, the approximation provided by data conditional independence is:

$$P^*_{CI}(A = 1 \mid I_1I_2I_3 = 1) = \frac{P(A = 1) \prod_{k=1}^{3} P(I_k = 1 \mid A = 1)}{P(I_1I_2I_3 = 1)} \quad (4.4)$$
The denominator in expression (4.4), although available here, is typically very difficult
to get in real practice. This is why the practice of conditional independence
considers ratios of conditional probabilities of the type [29]:

$$\frac{P^*_{CI}(A = 0 \mid I_1I_2I_3 = 1)}{P^*_{CI}(A = 1 \mid I_1I_2I_3 = 1)} = \frac{P(A = 0)}{P(A = 1)} \prod_{k=1}^{3} \frac{P(I_k = 1 \mid A = 0)}{P(I_k = 1 \mid A = 1)} \quad (4.5)$$
This ratio is none other than expression (3.4) of the distance $x$ under data conditional
independence given $A = 1$ and $A = 0$, i.e. expression (4.5) entails the $\nu_0 = 1$ model.
However, the $\nu_0 = 1$ model is not necessarily based on the two previous assumptions of
conditional independence (given $A = 1$ and $A = 0$); in that regard the $\nu_0 = 1$ model
is a less restrictive hypothesis, that of no-data-interaction.
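Both the exact probability (4.2) and the conditional-independence approximation (4.4) reduce to sums over the 16 joint proportions $p_k$ of Table 4.1. A sketch, with the bit ordering of Table 4.1 ($A$ slowest, then $I_1$, $I_2$, $I_3$) and function names of our choosing:

```python
import numpy as np
from itertools import product

# rows of `states` follow Table 4.1: (A, I1, I2, I3), k = 1..16
states = np.array(list(product([0, 1], repeat=4)))

def exact_prob(p):
    """Exact P(A=1 | I1=I2=I3=1) = p16 / (p8 + p16), expression (4.2)."""
    all_ones = (states[:, 1:] == 1).all(axis=1)
    return p[all_ones & (states[:, 0] == 1)].sum() / p[all_ones].sum()

def ci_estimate(p):
    """Conditional-independence approximation (4.4)."""
    p_a = p[states[:, 0] == 1].sum()                     # P(A = 1)
    num = p_a
    for j in (1, 2, 3):                                  # prod_j P(I_j=1 | A=1)
        num *= p[(states[:, 0] == 1) & (states[:, j] == 1)].sum() / p_a
    denom = p[(states[:, 1:] == 1).all(axis=1)].sum()    # P(I1 I2 I3 = 1)
    return num / denom

p = np.full(16, 1 / 16)                  # fully independent fair coins
print(exact_prob(p), ci_estimate(p))     # both 0.5 in the independent case
```

With $p_k$ estimated from the 100,000 truncated Gaussian realizations, the gap between the two functions reproduces the curves of Figure 4.2.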
Figure 4.3: Data values-dependent error associated with the $\nu_0 = 1$ model. The largest errors occur when all three data $I_1$, $I_2$, $I_3$ are concordant (the cases [1,1,1] and [0,0,0]), deviating most from the assumption of no-data-interaction.
From Figure 4.2, the $\nu_0 = 1$ model is seen to provide a better approximation than the
conditional independence hypothesis (given $A = 1$), increasingly better as the correlation
range $3r$ increases. The $\nu_0 = 1$ model corresponds to a hypothesis of no-data-interaction,
which becomes increasingly poorer as the correlation within data and between
data and unknown increases. In the case of concordant data ($I_1 = I_2 = I_3 = 1$) used
to evaluate the probability of $A = 1$, ignoring data interaction by assuming $\nu_0 = 1$
leads to over-compounding the three individual probabilities
$P(A = 1 \mid I_k = 1)$, and to an overestimation increasing with the correlation range.
Interestingly, the conditional independence approximation (given $A = 1$) leads to an
underestimation of the exact fully conditioned probability $P(A = 1 \mid I_1I_2I_3 = 1)$.
Dependence on Data Values
To evaluate how the $\nu_0 = 1$ approximation fares depending on the set of three data
values, Figure 4.3 plots the error
Figure 4.4: The sequence-dependent $\nu_i$ weights for the concordant data case $A = I_1 = I_2 = I_3 = 1$. The first weight $\nu_1$ is equal to 1 by definition. The third weight $\nu_3$ reflects the greatest interaction. All three weights increase with the correlation range since all data are concordant.
$$[P^*_{\nu_0=1}(A = 1 \mid i_1, i_2, i_3) - P(A = 1 \mid i_1, i_2, i_3)]$$

for the $2^3 = 8$ possible sets of data values $(I_1 = i_1, I_2 = i_2, I_3 = i_3)$.
As expected, ignoring data interaction leads to increasing errors as the correlation
range increases, with overestimation when two or more of the data are valued
1 and underestimation in the other cases. Also, the largest errors occur when the
three data values are concordant, (1, 1, 1) or (0, 0, 0), the cases which contradict most
the assumption of no-data-interaction.
Exact Nu Weights
Availability of the exhaustive set of 16 joint probabilities $p_k$ allows calculation of the
exact $(\nu_i, \nu_0)$ weights as defined by expression (3.5). Figure 4.4 shows the three
sequence-dependent $\nu_i$ weights calculated for the case $A = I_1I_2I_3 = 1$. Three data
$I_1$, $I_2$, $I_3$ produce $3! = 6$ possible data sequences. However, because of the equilateral
data configuration (Figure 4.1) associated with an isotropic correlation, the data
sequence does not matter here. The first datum in any sequence always receives a
unit weight $\nu_1 = 1$. As the correlation range increases, the interaction between the
first two data increases, leading to an increasing second nu weight $\nu_2$. The third weight
$\nu_3$ reflects the even greater interaction between the first two data and the last one.
Note that data interaction, and hence the nu weights, are data values-dependent; that
interaction is maximal here when all data are concordant, $I_1 = I_2 = I_3 = 1$, and it
increases with the correlation range.
Figure 4.5: The single sequence-independent $\nu_0$ weight: (1) with only two concordant data, $A = I_i = I_j = 1$ and $I_k = 0$ with $i \neq j \neq k$ (solid line), and (2) with all data concordant, $A = I_1 = I_2 = I_3 = 1$ (circled line). The interaction is greatest when all three data $I_1$, $I_2$, $I_3 = 1$ are concordant. This interaction increases with the correlation range.
Figure 4.5 gives the single, data sequence-independent, exact $\nu_0$ weight for the case
when all data are concordant, $A = I_1 = I_2 = I_3 = 1$ (solid curve marked by circles),
and for the case with only two concordant data, $A = I_i = I_j = 1$ with $i \neq j = 1, 2, 3$
Figure 4.6: The averaged error associated with the data values-dependent $\nu_0$ model and with the $\nu_0 = 1$ model. The data values-dependent $\nu_0$ model shows a significant improvement, reflected in smaller errors.
(solid curve). Note that in this case it does not matter which two of the three data are
concordant, because of the equilateral data configuration (Figure 4.1). In the presence of
concordant values $A = I_1 = I_2 = I_3 = 1$, the strong data interaction is expressed
through an exact $\nu_0$ value increasingly different from 1 as the data are more dependent
on one another. With only two concordant data, the $\nu_0$ weight still increases
with the range. However, as expected, this increase is less dramatic than for the case
with three concordant data.
Our inference paradigm consists of two steps. We first evaluate the single datum-conditioned
probabilities $P(A = 1 \mid I_j = 1)$, $j = 1, 2, 3$, using the actual data from the
actual field under study. We then use some training image or expert catalog to infer the data
values-dependent $\nu_0$ weight and export it to the actual field under study to combine
the previous single datum-conditioned probabilities.
For example, assume that from some prior expertise (perhaps built from experiments
on training data sets similar to those used in this study) we have access to the following
$\nu_0$ weight function:

$$\nu_0 = \begin{cases} 1, \text{ whatever the data values,} & \text{for any small range } 3r < 6 \\ \text{an increasing function of the practical range,} & \text{for } 3r \geq 6 \end{cases}$$
This function is assumed applicable only when two or more data are valued 1. The
error graphs (Figure 4.3) are re-calculated using this improved $\nu_0$-model. The results
of Figure 4.6 show a significant reduction of the error and demonstrate that the
worth and practicality of the nu/tau approach depends on the ability to go beyond the
approximation $\nu_0 = 1$.
4.1.2 Non-equilateral configuration
For this second example, a non-equilateral configuration of three data was retained to
observe the impact of data locations on data interaction. Figure 4.7 shows the data
configuration, and Table 4.2 gives the corresponding Euclidean distances.

The study built around this data configuration is similar to that done for the equilateral
case. Figure 4.8 shows the conditional probabilities associated with the case
$A = I_1 = I_2 = I_3 = 1$. Note that data values concordance represents an unfavorable
case for any independence-related approximation.

For this example we also include one more estimator for comparison with the results
based on the $\nu_0 = 1$ model. This estimator considers a hypothesis of data independence
combined with the hypothesis of conditional independence given $A = 1$.
         A       I1      I2      I3
  I1     10.63   0.00    21.40   3.61
  I2     11.18   21.40   0.00    22.83
  I3     11.66   3.61    22.83   0.00

Table 4.2: Data-to-unknown and data-to-data distances.
Figure 4.7: Non-equilateral data configuration.
We will call that combination of hypotheses "full independence". The resulting
approximation is written:

$$P^*(A \mid I_1, I_2, I_3) = \frac{P(A, I_1, I_2, I_3)}{P(I_1, I_2, I_3)} = \frac{P(A)\,P(I_1 \mid A)\,P(I_2 \mid A, I_1)\,P(I_3 \mid A, I_1, I_2)}{P(I_1, I_2, I_3)}$$
The numerator, per conditional independence given $A = 1$, is written:
Figure 4.8: Conditional probabilities for the non-equilateral case with $A = I_1 = I_2 = I_3 = 1$. The estimate based on the full independence assumption (line marked by points) leads to a large over-compounding of the concordant information. The conditional independence estimate (line marked by plus signs) gives a probability less than 0.5 for small ($< 21$) ranges. The $\nu_0 = 1$ model (dash-dotted line) provides consistently better results.
$P(A)\,P(I_1 \mid A)\,P(I_2 \mid A)\,P(I_3 \mid A)$. The denominator, per data independence, is written:
$P(I_1)\,P(I_2)\,P(I_3)$. Thus,

$$P^*(A \mid I_1, I_2, I_3) = \frac{P(A)\,P(I_1 \mid A)\,P(I_2 \mid A)\,P(I_3 \mid A)}{P(I_1)\,P(I_2)\,P(I_3)}$$

$$= \frac{P(A) \prod_{k=1}^{3} \frac{P(A \mid I_k)\,P(I_k)}{P(A)}}{P(I_1)\,P(I_2)\,P(I_3)} \quad \text{per Bayes' inversion}$$

$$= \frac{P(A \mid I_1)\,P(A \mid I_2)\,P(A \mid I_3)}{P(A)^2} \quad (4.6)$$
Figure 4.9: Checking the consistency relation, case $I_1 = I_2 = I_3 = 1$. The $\nu_0 = 1$ model produces licit probabilities. The estimates based on data independence assumptions (both conditional and full independence) do not follow the general law of probabilities, which requires the two complementary probabilities to sum to 1.
Or, equivalently:

$$\frac{P^*(A \mid I_1, I_2, I_3)}{P(A)} = \frac{P(A \mid I_1)}{P(A)} \cdot \frac{P(A \mid I_2)}{P(A)} \cdot \frac{P(A \mid I_3)}{P(A)} \quad (4.7)$$
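The consistency check of Figure 4.9 can be verified directly: under the $\nu_0 = 1$ model the two complementary estimates sum to 1 by construction, whereas the full-independence ratios (4.7) need not. A sketch with illustrative probability values (not taken from the case study):

```python
def fi_estimate(prior, single_probs):
    """Full-independence estimate (4.7): P(A) * prod_k P(A|I_k)/P(A)."""
    est = prior
    for p in single_probs:
        est *= p / prior
    return est

def nu0_one_estimate(prior, single_probs):
    """nu0 = 1 estimate (4.3) through distances x = (1-p)/p."""
    x0 = (1.0 - prior) / prior
    x = x0
    for p in single_probs:
        x *= ((1.0 - p) / p) / x0
    return 1.0 / (1.0 + x)

probs = [0.7, 0.6, 0.65]            # illustrative P(A=1 | I_k=1) values
comp = [1.0 - p for p in probs]     # complement probabilities P(A=0 | I_k=1)

print(fi_estimate(0.5, probs) + fi_estimate(0.5, comp))            # 1.26, illicit
print(nu0_one_estimate(0.5, probs) + nu0_one_estimate(0.5, comp))  # exactly 1
```

The self-consistency of the $\nu_0 = 1$ estimate follows from the complement's distance being the reciprocal of the original distance.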
For example, in terms of the $p_k$'s of Table 4.1, the probability $P(A = 1 \mid I_1 = 1)$ is
obtained as:

$$P(A = 1 \mid I_1 = 1) = \frac{\sum_{k=13}^{16} p_k}{\sum_{k=13}^{16} p_k + \sum_{k=5}^{8} p_k}, \quad \text{and } P(A = 1) = \sum_{k=9}^{16} p_k.$$
From Figure 4.8, we observe that the estimate (4.7) based on "full independence"
leads to a large over-compounding of the concordant information $I_k = 1$. Conditional
independence (4.4) gives an estimate which is less than the prior probability (0.5)
at small ranges; this represents a severe error since all three individual probabilities
$P(A = 1 \mid I_j = 1)$ are above the prior. Again, the $\nu_0 = 1$ approximation (4.3) provides
a consistently better estimate.
Figure 4.10: Approximation errors for the eight data value configurations. The conditional probability estimated through the $\nu_0 = 1$ model (solid lines) has smaller and more stable errors; the conditional independence assumption (lines marked with stars) leads to the largest errors.
Figure 4.9 shows the sum of the two estimates $P^*(A \mid I_1, I_2, I_3) + P^*(\bar{A} \mid I_1, I_2, I_3)$
for the three sets of approximations. That sum should be equal to 1. It appears
that only the estimate associated with $\nu_0 = 1$ verifies that consistency relation for all
ranges. The two independence-based estimates (4.4) and (4.7) are not self-consistent
(over $A$ and $\bar{A}$), particularly the estimate based on conditional independence. This
consistency represents a valuable built-in property of the $\nu_0 = 1$ approximation in
the presence of data dependence.
The approximation errors, defined as:

$$[P^*_{\nu_0=1}(A = 1 \mid i_1, i_2, i_3) - P(A = 1 \mid i_1, i_2, i_3)]$$
$$[P^*_{CI}(A = 1 \mid i_1, i_2, i_3) - P(A = 1 \mid i_1, i_2, i_3)]$$
$$[P^*_{FI}(A = 1 \mid i_1, i_2, i_3) - P(A = 1 \mid i_1, i_2, i_3)]$$
Figure 4.11: Error linked to $\nu_0 = 1$ (non-equilateral case). The errors attributed to the $\nu_0 = 1$ model are small and stable, attesting that this model is the best among those presented.
Figure 4.12: Error linked to the "full independence" hypothesis (non-equilateral case). The largest error is attributed to the case when all three data are equal to 1. This is the case in which the consequence of the wrong assumption of full independence is most severe.
Figure 4.13: Error linked to conditional independence (non-equilateral case). The errors are large and unstable. The positive errors, associated with overestimation of the true conditional probability ($A = 1$), are larger than the negative errors associated with underestimation.
for each of the eight data values combinations when estimating $A = 1$, are plotted in
Figure 4.10. Again, the conditional probability estimated through the $\nu_0 = 1$ assumption
has smaller and more stable errors; the conditional independence assumption leads
to the largest errors. Figures 4.11, 4.12, 4.13 give the errors specific to each estimate,
with indication of the three data values. Beware of the different ordinate axis scalings.
The errors associated with the $\nu_0 = 1$ estimate (4.3) are small and centered around
zero (Figure 4.11). That error is smallest when the two close-by data are different
($I_1 \neq I_3$), corresponding to data values less conflicting with the underlying no-data-interaction
hypothesis. The $\nu_0 = 1$ model appears to downplay the contribution of
the isolated $I_2$ datum value: in Figure 4.11 the two error curves for $I_2 = 0$ and $I_2 = 1$
are similar for any given combination of the $I_1$, $I_3$ data values. The smallest errors
for the $\nu_0 = 1$ model are related to cases of non-concordant data values, particularly
non-concordant $I_1$ and $I_3$ values, i.e. 001, 011, and 110.
Figure 4.12 shows the errors for the "full independence" estimate. The error is largest
Figure 4.14: Bias (error) averaged over all data values combinations (non-equilateral case). The full independence estimator and the $\nu_0 = 1$ model provide reasonably unbiased estimates, while conditional independence leads to severe overestimation.
for the case of data $I_1 = I_2 = I_3 = 1$ concordant with the outcome $A = 1$ being
evaluated. In such a case, the assumption of data independence is most invalid. The
most significant result is the large error associated with the conditional independence
estimate, see Figure 4.13. The errors are much larger and more unstable than for
the other two estimates. Also, the positive errors, associated with overestimation of
the true conditional probability that $A = 1$, are much higher than the negative errors
associated with underestimation, leading to an overall bias.
Figure 4.14 shows the bias, or error averaged over the eight data value combinations,
when estimating the probability that $A = 1$. On average, the $\nu_0 = 1$ model (4.3) and
the "full independence" model (4.7) provide reasonably unbiased estimates, while the
estimate based on conditional independence leads to a severe overestimation of the
reference posterior probability.
4.2 A 3D case study
The applicability of the $\nu_0$ inference paradigm is now tested using a large 3D reference
binary data set where all conditional probabilities involved in the tau and nu expressions
(2.32) and (3.6) are known, including the exact fully data-conditioned probability
$P(A = a \mid D_i = d_i,\ i = 1, \ldots, n)$. Various approximations of that reference probability
can be evaluated. The heteroscedasticity of the $\nu_0$, $\nu_i$ and $\tau_i$ weights, i.e. their level
of dependence on the data values $(d_i,\ i = 1, \ldots, n)$, can be evaluated. The greater that
heteroscedasticity, the more difficult the inference of these data interaction
parameters would be in practice.
4.2.1 The reference data set
We start by generating a reasonably large 3D non-conditional realization of a Gaussian
field using the sequential Gaussian simulation code sgsim of the GSLIB software [11].
This 3D field is of size 100x100x50, comprising 500,000 nodes. The variogram model
used is spherical with a small nugget (10%), an isotropic horizontal range equal to 50 pixel
units, and a shorter vertical range equal to 20 pixel units. This Gaussian field is then
truncated at its upper quartile value, yielding the reference binary indicator field
shown in Figure 4.15.

Denote that reference field by $S$: $\{A(\mathbf{u}) = 0 \text{ or } 1,\ \mathbf{u} \in S\}$, with $P(A(\mathbf{u}) = 1) = 0.25$.
We will sometimes refer to the binary data valued 1 as sand;
conversely, the binary data valued 0 will be referred to as non-sand or mud. We
borrow this convention from petroleum engineering, where the location of channel
sand is of great interest. Figure 4.16 gives the reference indicator variograms
in the x, y, z directions calculated from indicator data from the top 35 layers of $S$;
the reason for excluding the bottom 15 layers will become apparent shortly.
Those indicator variograms reflect the horizontal-to-vertical anisotropy of the original
Gaussian field.
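Lacking GSLIB, the construction can be approximated in a few lines of numpy/scipy; this is a rough stand-in for sgsim (anisotropic Gaussian smoothing of white noise, not sequential Gaussian simulation with a spherical variogram), kept here only to make the truncation step concrete:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(42)
noise = rng.standard_normal((100, 100, 50))

# Smooth more in x, y than in z to mimic the horizontal-to-vertical anisotropy
# (ranges 50 vs 20 pixel units in the text; the sigma values are our choice).
field = gaussian_filter(noise, sigma=(8.0, 8.0, 3.0), mode="wrap")
field = (field - field.mean()) / field.std()

threshold = np.quantile(field, 0.75)        # upper quartile
binary = (field > threshold).astype(int)    # reference indicator field
print(binary.mean())                        # ~0.25 by construction
```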
Figure 4.15: Reference binary image generated by truncating a continuous Gaussian realization at its upper quartile; $P(A = 1) = 0.25$ and $P(A = 0) = 0.75$ (shown in black).
Figure 4.16: Exhaustive indicator variograms (horizontal EW, horizontal NS, and vertical), calculated over the 35 top layers; mean $p = 0.274$, variance $p(1 - p) = 0.199$. EW is the east-west direction and NS is the north-south direction.
Figure 4.17: Data events definition: (1) conditioning to one data event D; (2) conditioning to two data events D, B; (3) conditioning to three data events D, B, C.
4.2.2 The estimation configuration
Consider the evaluation of the conditional probability of an unsampled value $A(\mathbf{u}) = 1$,
given any combination of the following three multiple-point data events (Figure 4.17
(1), (2), (3)):

• the closest data event D comprises four data locations at the level just below
that of $A(\mathbf{u})$. These four data are at the corners of a square centered on the
projection of location $\mathbf{u}$ onto their level (Figure 4.17 (1));

• the next closest data event B also comprises four data locations, with the same
geometry as for data event D, but located five levels below that of $A(\mathbf{u})$ (Figure
4.17 (2));

• the furthest-away data event C again comprises four data locations, but located
15 levels below that of $A(\mathbf{u})$ (Figure 4.17 (3)).
If the unsampled location $\mathbf{u}$ of $A(\mathbf{u})$ spans only the eroded field
$S_0 = \{x = 11, \ldots, 90;\ y = 11, \ldots, 90;\ z = 16, \ldots, 50\}$, then each value $A(\mathbf{u})$ can be
evaluated by any of the three data events D, B, C. From here on, all statistics will refer
to that "common denominator" field $S_0$ comprising 224,000 nodes. Over that central
field $S_0$, the marginal statistic for the event $A = 1$ being assessed is $P(A) = 0.274$.

The definition of an "eroded" field $S_0$ common to all data configurations entails that
the spatial averages of conditional probabilities (proportions) remain the same no
matter the conditioning data event retained. For example, if conditioning is only to
the sole D-data event:

$$P(A \mid \mathbf{D}) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = 1 \mid \mathbf{D} = \mathbf{d}(\mathbf{u})) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = 1) = P(A = 1) = 0.274$$

where the data event D can take $2^4 = 16$ possible combinations of data values. When
conditioning jointly to the two D and B data events:

$$P(A \mid \mathbf{D}, \mathbf{B}) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = 1 \mid \mathbf{D} = \mathbf{d}(\mathbf{u}),\ \mathbf{B} = \mathbf{b}(\mathbf{u})) = P(A = 1) = 0.274$$

where the data event (D, B) can take $2^8 = 256$ possible combinations of binary data
values.

When conditioning jointly to the three data events D, B and C:

$$P(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C}) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = 1 \mid \mathbf{D} = \mathbf{d}(\mathbf{u}),\ \mathbf{B} = \mathbf{b}(\mathbf{u}),\ \mathbf{C} = \mathbf{c}(\mathbf{u})) = P(A = 1) = 0.274$$

where the data event (D, B, C) can take $2^{12} = 4096$ possible combinations of data
values. Because $S_0$ is not that large ($|S_0| = 224{,}000$ nodes), not all $2^{12}$ data values
combinations are present in $S_0$; this does not affect, however, the previous equality:
$P(A = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C}) = P(A = 1) = 0.274$.
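The equalities above are just the law of total probability: averaging the event-conditioned proportions, weighted by the event frequencies, recovers the marginal. A quick check on synthetic stand-in data (not the reference field):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
d = rng.integers(0, 16, size=n)                      # one 4-bit data event per node
a = (rng.random(n) < 0.10 + 0.01 * d).astype(int)    # A depends on the event

# conditional proportions P(A=1 | D=k) and event frequencies P(D=k)
p_a_given_d = np.array([a[d == k].mean() for k in range(16)])
p_d = np.array([(d == k).mean() for k in range(16)])

print(a.mean(), float((p_a_given_d * p_d).sum()))    # the two numbers coincide
```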
Note also that the nu representation (3.6) does not restrict us to only the point
support of the unknown $A$ (as in this example). The unknown event $A$ can similarly be
defined as a data event, provided we can find enough replicates of such a data event
in our reference binary data set.
4.2.3 Conditional probabilities and estimates
As an example, Figure 4.18(1) gives the $S_0$-volume of the 224,000 exact probability
values $P(A = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$, which are valued in the interval [0, 1] with mean 0.274 and
variance 0.067. Again, the mean is equal to that of the reference binary values. The
histogram of the probability values is given in Figure 4.18(2). We will use this field
as the comparison tool for future analysis. Similar figures and statistics are available
for all the following conditional probabilities, although not all are given here:

• single data event-conditioned: $P(A(\mathbf{u}) = 1 \mid \mathbf{D})$, $P(A(\mathbf{u}) = 1 \mid \mathbf{B})$, $P(A(\mathbf{u}) = 1 \mid \mathbf{C})$

• two data events-conditioned: $P(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{B})$, $P(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{C})$,
$P(A(\mathbf{u}) = 1 \mid \mathbf{B}, \mathbf{C})$

• all three data events-conditioned: $P(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$

• the estimated probability $P^*(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$ using the $\nu_0 = 1$ model (3.6)
to combine the previous single data event-conditioned probabilities.

Under the $\nu_0 = 1$ model, at each location $\mathbf{u}$ the estimate is:

$$P^*(A(\mathbf{u}) = 1 \mid \mathbf{D} = \mathbf{d}(\mathbf{u}),\ \mathbf{B} = \mathbf{b}(\mathbf{u}),\ \mathbf{C} = \mathbf{c}(\mathbf{u})) = \frac{1}{1 + x^*(\mathbf{u})},$$

with, for the estimated distance $x^*(\mathbf{u})$:

$$\frac{x^*(\mathbf{u})}{x_0} = \frac{x_D(\mathbf{u})}{x_0} \cdot \frac{x_B(\mathbf{u})}{x_0} \cdot \frac{x_C(\mathbf{u})}{x_0}$$
Figure 4.18: (1) The exact probability values $P(A = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$ over the eroded field $S_0$ ($N = 224{,}000$, $m = 0.274$, $\sigma^2 = 0.067$), (2) their histogram, and (3) the binary reference field with prior $P(A = 1) = 0.274$. The mean over the eroded field is equal to that of the reference binary values. The size of the probability map (1) is 80x80x35 = 224,000 nodes. This probability map will be used as the comparison tool for future analysis.
where:

$$x_0 = \frac{1 - P(A = 1)}{P(A = 1)} = \frac{1 - 0.274}{0.274} = 2.65 \quad \text{is the marginal distance;}$$

$$x_D(\mathbf{u}) = \frac{1 - P(A(\mathbf{u}) = 1 \mid \mathbf{D} = \mathbf{d}(\mathbf{u}))}{P(A(\mathbf{u}) = 1 \mid \mathbf{D} = \mathbf{d}(\mathbf{u}))} \quad \text{is the distance to } A(\mathbf{u}) = 1 \text{ updated by the data event } \mathbf{D} = \mathbf{d}(\mathbf{u}).$$
The distance $x_D(\mathbf{u})$ varies from one location $\mathbf{u}$ to another. It is obtained by scanning
the reference image $S_0$ with the template definition of Figure 4.17(1) for the proportion
of D-replicates identifying the data values combination $\mathbf{d}(\mathbf{u})$ which also feature at
their upper center (one level above) a value $A(\mathbf{u}) = 1$. Note that our estimation paradigm
assumes that all elementary conditional probabilities $P(A \mid \mathbf{D})$, $P(A \mid \mathbf{B})$, $P(A \mid \mathbf{C})$ are
known. This analysis addresses only the problem of combining these elementary
probabilities into an estimate of the fully conditioned probability $P(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$
while accounting for data interaction. Similarly, from the training image one can
retrieve the other two elementary distances $x_B(\mathbf{u})$ and $x_C(\mathbf{u})$.
The $\nu_0 = 1$ model then provides an estimate of the fully conditioned probability,
$P^*(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$ (Figure 4.19(1)).
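The scanning step can be sketched as follows; the corner offsets of the D template are our assumption (the text gives the geometry but not the pixel offsets), and the run is on a small random image rather than the 500,000-node reference:

```python
import numpy as np

def scan_D_event(binary, dx=1, dy=1):
    """Tabulate, for each of the 16 D-patterns one level below a node u,
    how often the pattern occurs and how often A(u) = 1 sits above its centre."""
    nx, ny, nz = binary.shape
    counts = np.zeros(16, dtype=np.int64)
    hits = np.zeros(16, dtype=np.int64)
    for x in range(dx, nx - dx):
        for y in range(dy, ny - dy):
            for z in range(1, nz):
                # 4 corners of a square centred below (x, y, z)
                pattern = (binary[x - dx, y - dy, z - 1] * 8
                           + binary[x - dx, y + dy, z - 1] * 4
                           + binary[x + dx, y - dy, z - 1] * 2
                           + binary[x + dx, y + dy, z - 1])
                counts[pattern] += 1
                hits[pattern] += binary[x, y, z]
    return counts, hits

rng = np.random.default_rng(0)
img = (rng.random((20, 20, 10)) < 0.25).astype(int)
counts, hits = scan_D_event(img)
p_a_given_d = hits / np.maximum(counts, 1)   # P(A(u)=1 | D = d), per pattern
```

The distances $x_D(\mathbf{u})$ then follow as $(1 - p)/p$ from these proportions; the B and C events would be scanned the same way with their own vertical offsets.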
Figure 4.19: (1) The estimate of the fully conditioned probability $P(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$ using the $\nu_0 = 1$ model, (2) its histogram, and (3) the reference binary field with prior $P(A = 1) = 0.274$. The spatial mean and variance of the probabilities estimated with the $\nu_0 = 1$ model are greater than the corresponding statistics of the reference case, leaving room for improvement.
This estimate is necessarily valued in the interval [0, 1]: its spatial mean is 0.288 and
its spatial variance is 0.098. Its histogram is given in Figure 4.19(2). The histogram and
scattergram of the error, defined as

$$P^*(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C}) - P(A(\mathbf{u}) = 1 \mid \mathbf{D}, \mathbf{B}, \mathbf{C}),$$

are shown in Figure 4.20.

The spatial variance and the spatial mean of the estimated probabilities using the
$\nu_0 = 1$ model are greater than the corresponding statistics of the exact conditional
Figure 4.20: (1) Histogram of the error $P^*(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C}) - P(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$ and (2) the corresponding scatterplot of $P^*(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$ based on the $\nu_0 = 1$ model versus the reference $P(A \mid \mathbf{D}, \mathbf{B}, \mathbf{C})$. For the $\nu_0 = 1$ estimate: $m = 0.288$, $\sigma^2 = 0.098$; for the reference: $m = 0.274$, $\sigma^2 = 0.067$.
probability of Figure 4.18, leading to a positive mean error of 0.014. One would expect
smoothing (smaller spatial variance) from an estimation. Note that the $\nu_0 = 1$ model
corresponds to an approximation of no-data-interaction, which is a poor assumption
in the presence of the two well-correlated data events D and B. This ignorance of data
interaction results in over-compounding of the individual single datum event-conditioned
probabilities, leading to an overestimation of the fully conditioned probability and an
associated greater variance.
4.2.4 Ordering the data values combinations
The statistics presented in Figures 4.19 and 4.20 pool together the 224,000 estimated
conditional probabilities over $S_0$, irrespective of the actual conditioning data values.
Recall that there are $4 \times 3 = 12$ binary indicator data grouped four by four into the
three data events D, B, and C; therefore there is a total of only $2^{12} = 4096$ possible
data values combinations.
To study the heteroscedasticity of the nu and tau parameters, that is, their dependence
on data values, we should first rank or classify the 4096 possible data values combinations,
then plot the $\nu_i$, $\tau_i$, $\nu_0$ parameters vs. data values combinations and observe
their data values dependence. Note that the lesser that data dependence, particularly
of the single parameter $\nu_0$, the easier its inference would be in practice; this would
justify our paradigm of separating individual data event contributions from data interaction.
Out of the total of 4096 possible data values combinations, 96 were not found in
$S_0$, and of the remaining 4,000 only 931 combinations were found with at least 10
replicates. To ensure statistical significance we retain only the latter. These 931 data
values combinations were ranked along the abscissa axes of Figures 4.21 and 4.22
with increasing proportion of binary data valued 1, starting at abscissa 1 with all 12
binary data valued 0 (which may be interpreted as the "no sand" event) and ending
at abscissa 931 with all 12 data valued 1 (which may be interpreted as all "sand").
The combinations with the same proportion of binary data valued 1 were then ranked
by physical distance to the unknown event $A$. From the template definition (Figure
4.17(3)), the data event D is closest to the unknown event $A$, followed by data event
B, then by data event C, which is the furthest from that unknown $A$.
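The ranking of the abscissa can be sketched directly; the primary key is the number of data valued 1 (the secondary distance-based tie-break described above is omitted for brevity):

```python
from itertools import product

# all 2^12 = 4096 data value combinations (D bits, B bits, C bits)
combos = list(product([0, 1], repeat=12))
ranked = sorted(combos, key=sum)      # "no sand" first, all "sand" last

print(ranked[0], ranked[-1])          # (0,...,0) and (1,...,1)
```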
The proportion of binary data valued 1 increases toward the higher abscissa and,
since we are evaluating the probability of the event $A = 1$, we expect an increase in
data interaction; hence any hypothesis of no-data-interaction or data independence
would become worse.
The next section discusses the advantages of the nu model versus the tau model for the
above cases and for the general setting.
Figure 4.21: Sequence-dependent interaction parameters $\tau_3$ (red) and $\nu_3$ (blue) for data sequences (1) DBC/BDC ($\sigma^2(\nu) = 0.48$, $\sigma^2(\tau) = 18.6$), (2) DCB/CDB ($\sigma^2(\nu) = 2.85$, $\sigma^2(\tau) = 80.2$), and (3) CBD/BCD ($\sigma^2(\nu) = 2.63$, $\sigma^2(\tau) = 5.24$).
4.2.5 Heteroscedasticity of the tau and nu weights

It follows from expressions (3.5) and (2.45) that:

• for any data sequence: ν₁ = τ₁ = 1;

• the ν_i and τ_i parameters are data sequence-dependent. For example, the last parameter, ν₃ or τ₃, is not the same whether it applies to the sequence BDC or the sequence CDB (Figures 4.21(1), (2)). However, this last parameter remains unchanged from sequence BCD to sequence CBD (Figure 4.21(3)), or from sequence BDC to sequence DBC (Figure 4.21(1)). Indeed, the last parameter
4.2. A 3D CASE STUDY 105
[Figure 4.22 appears here; the recoverable annotations are the ordinate label "interaction parameter" and the abscissa "data value combination id".]

Figure 4.22: Exact ν₀ parameter for the 931 data value combinations. The ν₀ = 1 model is excellent for the first 600 data value combinations, as can be seen from the ν₀ values being close to 1.
(ν₃ or τ₃) measures the data interaction between the last data event (D in Figure 4.21(3)) and the undifferentiated ensemble of all previous data (BC or CB).
• the single global data interaction parameter ν₀ = ν₁ν₂ν₃ = ν₂ν₃ is data sequence-independent. The greater |1 − ν₀|, the larger the global data interaction.
Figure 4.21 gives the (τ₃, ν₃) parameter values applied to the last datum event in the data sequence, as calculated from their exact expressions (3.5) and (2.45) using the exhaustive proportions read from the reference field S₀. The following observations can be made:

• the tau parameter is more unstable than the corresponding nu parameter, as seen from the higher variability of the tau series compared to the nu series (Figure 4.21). This is due to the denominator of the tau expression (2.45) becoming close to log 1 = 0 whenever a datum is little or non-informative in discriminating A = a from A = non-a, as is the case for the furthest-away datum event C.
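That instability can be illustrated numerically. The probability values below are invented for illustration, and only the vanishing log-denominator of the tau expression is reproduced, not the full expression (2.45):

```python
import math

# Why the tau weight is numerically unstable for a weakly informative datum:
# its expression divides by log(x_i / x_0), which tends to log 1 = 0 as the
# single-datum probability P(A|C) approaches the prior P(A).  The values
# below are illustrative, not taken from the case study.

p_prior = 0.25                                 # P(A = 1)
denominators = []
for p_cond in (0.40, 0.30, 0.2501):            # P(A = 1 | C) approaching prior
    x0 = (1 - p_prior) / p_prior               # prior distance
    xi = (1 - p_cond) / p_cond                 # single-datum distance
    denominators.append(math.log(xi / x0))     # tau-expression denominator

# The denominator shrinks toward 0, so any tau-type ratio built on it blows up.
assert all(abs(a) > abs(b) for a, b in zip(denominators, denominators[1:]))
assert abs(denominators[-1]) < 1e-3
```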
• the ν₃ parameter accounting for the interaction of the last data event in the sequence is smallest when applied to the non-informative remote data event C (Figure 4.21(1)) and largest when applied to the two most informative, closer data events D or B (Figures 4.21(2) and 4.21(3)).

• the ν₃ parameter increases along the abscissa, indicating that the data interaction |1 − ν₃| is data values-dependent and that data interaction increases as more of the elementary binary indicator data are valued 1; note that the event being assessed is A = 1. This is particularly notable when ν₃ applies to D, the most informative data event (Figure 4.21(3)).
Figure 4.22 gives the ν₀ global data interaction parameter as calculated from the exact ν-expressions (3.5) with ν₀ = ν₁ · ν₂ · ν₃. This ν₀ value is seen to be data value-dependent, increasing as the three data events D, B, C become more redundant in assessing the probability of event A = 1 by displaying a greater proportion of elementary binary data valued 1 (higher abscissa values). However, for all but the last 300 data value combinations out of the 931 retained, the approximation ν₀ = 1 appears quite robust, i.e. essentially data value-independent (homoscedastic). For the last 300 data value combinations, a quadratic model of the type

ν₀ = 1 + λ(p − p_c)²,  ∀ p > p_c

would provide a good approximation of the data value dependence of that single global correction parameter ν₀, where:

• p is the proportion of sand in the two closest data events D and B pooled together;

• p_c is a threshold proportion below which the ν₀ = 1 model would be applied, above which the quadratic model would be applied;

• λ > 0 is a fitting parameter.
With the previous quadratic approximation, the dimension 3 × 4 = 12 of the data value dependence of ν₀ has been reduced to 2 (the two parameters λ and p_c). In a real application, S₀ would be a training image built to mimic actual data interaction. A study of data interaction would be developed on that training data set, resulting in some approximation of the global ν₀ parameter, say:

ν₀ = φ(s_j, j = 1, ..., n),  with n small

where φ is a function of a few easily accessible statistics s_j summarizing the possibly much larger space of variability of conditioning data events and values. That function ν₀ = φ(·) is then exported to the study of combining the various single data event-conditioned probabilities. These single data event-conditioned probabilities should not be read from the training set; only the interaction parameter ν₀, or equivalently the function φ, is to be borrowed from the training set.
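The quadratic heteroscedastic correction above can be sketched minimally as follows; the values of λ and p_c are illustrative placeholders, not fitted values from this study:

```python
# Minimal sketch of the quadratic correction described above: nu0 = 1 in the
# homoscedastic regime p <= p_c, and nu0 = 1 + lam * (p - p_c)**2 beyond the
# threshold.  lam and p_c are illustrative placeholders, not fitted values.

def nu0_quadratic(p, lam=20.0, p_c=0.6):
    """Global interaction parameter nu0 as a function of the sand proportion
    p in the two closest data events D and B pooled together."""
    if p <= p_c:
        return 1.0                        # nu0 = 1: no-data-interaction model
    return 1.0 + lam * (p - p_c) ** 2     # heteroscedastic correction

assert nu0_quadratic(0.3) == 1.0
assert abs(nu0_quadratic(0.8) - 1.8) < 1e-12    # 1 + 20 * 0.2**2
```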
4.2.6 Independence-based estimates

To comparatively evaluate the performance of the ν₀ = 1 (no-data-interaction) model, from the same S₀ reference field we calculate estimates of the fully conditioned probability P(A|D,B,C) stemming from two common approaches calling for data (conditional) independence. The expressions for these two estimators were given in Section 2.1.

The "conditional independence" (CI) estimator is written:

P*_CI(A|D,B,C) / P(A) = [P(A|D)/P(A)] · [P(A|B)/P(A)] · [P(A|C)/P(A)] · [P(D)P(B)P(C) / P(D,B,C)]     (4.8)
The "full independence" (FI) estimator is written:

P*_FI(A|D,B,C) / P(A) = [P(A|D)/P(A)] · [P(A|B)/P(A)] · [P(A|C)/P(A)]     (4.9)
The two sets of estimated probabilities P*_CI(A = 1|D,B,C) and P*_FI(A = 1|D,B,C), given by expression (4.8) for conditional independence and expression (4.9) for full independence, are retrieved from the reference set S₀. These and the ν₀ = 1 model estimated probability P*_{ν₀=1}(A = 1|D,B,C) are plotted against the S₀-exact probability P(A = 1|D,B,C) in Figure 4.23.
[Figure 4.23 appears here; the recoverable panel annotations are: (1) no order relation violations, best correlation; (2) fewer order relation violations, poorest correlation; (3) most order relation violations, second best correlation; abscissa: reference.]

Figure 4.23: Scatterplots of estimated probabilities P*(A|D,B,C) versus the reference P(A|D,B,C): (1) for the estimate based on the ν₀ = 1 model, (2) for the estimate based on the conditional independence assumption, (3) for the estimate based on the full independence assumption.
Although there is clearly data interaction (essentially between data events D, B, and A), the no-data-interaction model ν₀ = 1 (Figure 4.23(1)) gives reasonable results, with the largest correlation coefficient ρ = 0.82 and with estimated values necessarily lying in the interval [0, 1]. The full independence approximation (Figure 4.23(3)) may appear at first sight to give equivalently good results (ρ = 0.70), but expression (4.9) does not guarantee that the resulting estimate P*_FI(A|D,B,C) lies in the interval [0, 1] whenever there is actual data dependence and interaction: a large number of these probability estimates are valued above 1. Assuming independence between data might thus lead to severe violations such as probabilities greater than 1. In practice, these violations need to be corrected, for example by setting all illicit probabilities to 1. However, such artificial correction may add to the overall bias of the estimates. For
example, the conditional independence estimator (Figure 4.23(2)) has fewer order violations than the estimator based on the full independence assumption (Figure 4.23(3)), yet its correlation coefficient with the reference case (ρ = 0.36) is considerably lower than that of the ν₀ = 1 model (ρ = 0.82) or the full independence model (ρ = 0.70).
Using the no-data-interaction ν₀ = 1 model amounts to considering distances which are ratios of conditional probabilities. In the presence of departure from the data independence hypothesis, one is better off approximating ratios of probabilities (the ν₀ = 1 model), which are generally more stable than the probabilities themselves. This was the original point made by Journel [48]. However, much better results than those provided by the ν₀ = 1 model (as will be shown in the next section) can be obtained with little additional effort by modeling the heteroscedastic variability of ν₀ using a training/calibration data set mimicking the actual data interaction. No matter how approximate that training model of data interaction is, it is likely to be better than a blanket and wrong hypothesis of no data interaction, or worse, of data conditional independence. We consider this approach hereafter.
4.2.7 The classified ν₀ approach

The proposed classified ν₀ approach described in Section 3.3.2 can be summarized in two phases: a training phase and an application phase.

In the training phase we need to:

1. build a training data set mimicking (even only roughly) the actual data interaction. From that set, retrieve the training data values-dependent ν₀-values, called proxy ν₀-values;

2. reduce each set of training data values to a few summary statistics or filter scores. Based on these scores, classify the proxy ν₀ values. Each class is identified by a single (average or median) ν₀-value, called a "class ν₀-prototype".
The application phase then consists of returning to the actual study field and then:

1. Finding the class closest to the actual conditioning data scores.

2. Retrieving that class "prototype" value ν₀.

3. Using that ν₀ value to combine the elementary probabilities. These elementary probabilities must be evaluated from the actual study field, not from the training data set.
In the following example, the training data set is the reference data set (an ideal
case). Later in this section, the more realistic and less favorable case of a training set
different from the reference one will be considered.
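The two phases above can be sketched as follows. The score here is a single summary statistic per data event set, and the (score, proxy ν₀) training pairs are invented for illustration:

```python
import statistics

# Rough sketch of the two-phase classified nu0 workflow summarized above.
# Training phase: pool proxy nu0 values by score class and average them.
# Application phase: look up the prototype of the class closest to the
# actual data score.  All numeric values below are invented.

def build_prototypes(training_pairs):
    """Training phase: group proxy nu0 values by score and average them."""
    classes = {}
    for score, nu0 in training_pairs:
        classes.setdefault(score, []).append(nu0)
    return {s: statistics.mean(v) for s, v in classes.items()}

def lookup_nu0(prototypes, score):
    """Application phase: prototype nu0 of the class closest to the score."""
    closest = min(prototypes, key=lambda s: abs(s - score))
    return prototypes[closest]

protos = build_prototypes([(0.0, 1.1), (0.0, 0.9), (0.5, 2.0), (1.0, 9.0)])
assert abs(lookup_nu0(protos, 0.05) - 1.0) < 1e-9   # mean of 1.1 and 0.9
assert lookup_nu0(protos, 0.9) == 9.0
```

The retrieved prototype then replaces ν₀ = 1 when compounding the elementary probabilities, which themselves must come from the actual study field, not from the training set.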
For demonstration purposes, consider the classified ν₀ approach applied to the reference data set shown in Figure 4.15. The goal is to estimate P(A = 1|D,B,C). Each of the three conditioning data events D, B, and C comprises 4 binary data points (refer to Figure 4.17 for the geometry of the three data events). There are 931 possible data value combinations for which we can reliably estimate such probability. Consider as data summary (score) the single statistic defined as the average sand proportion (i.e. the average of the 3 × 4 binary data, where sand is defined by the binary data valued 1). That statistic can take only twelve possible values, corresponding to the 12 classes of data events. Each class prototype ν₀ value is the average of the proxy ν₀ values falling into that class. In Figure 4.24, the prototype ν₀ values are shown in red for each of the 12 classes. The mean of these 12 proxy ν₀ values is equal to 2.21, which indicates a significant deviation from the assumption of no data interaction (i.e. the ν₀ = 1 model). For each set of actual data values we look for the closest training class and use the corresponding prototype ν₀ value (instead of ν₀ = 1)
for building the fully conditioned probability P(A = 1|D,B,C). An important remark: the uncertainty of these prototype ν₀ interaction weights is different for each of the 12 classes. For example, the variance of the proxy ν₀ weights for the last class, as seen on the right of Figure 4.24, is much larger than the variance of the proxy ν₀ weights for the classes in the middle of this figure. To account for such uncertainty we can then consider evaluating the lower and upper quantiles (e.g. the 0.1-quantile and 0.9-quantile) of the proxy ν₀ weights for each class, and then evaluating the fully conditional probability P(A = 1|D,B,C) for these two quantiles.
[Figure 4.24 appears here; the recoverable annotations are mean(ν₀) = 2.21 and mean(sand) = 0.40; abscissa: average sand values.]

Figure 4.24: Exact ν₀ values versus average sand values defined over the three data events D, B, C. The average ν₀ values and their statistics are shown in red.
Comparison of the classified ν₀ approach with the ν₀ = 1 model performance is shown in Figure 4.25. The left graph shows a 0.82 coefficient of correlation between the reference true probability and the ν₀ = 1 model for the 931 data value combinations. We observe only a small increase in that correlation when using the classified ν₀ approach, with ρ = 0.85. Linear correlation, however, is not a fair measure of comparison between these two
Figure 4.25: Scattergrams of the ν₀ = 1 model (left) and the classified ν₀ model (right) relative to the reference probability. The correlation coefficient of the classified ν₀ approach with the reference case (ρ = 0.85) is somewhat improved compared to that of the ν₀ = 1 model (ρ = 0.82).
models, as it measures only linear dependence. The significant improvement brought by the classified ν₀ approach can be observed in the reproduction of the reference statistics for the 931 data value combinations retained; see Table 4.3. Because of data over-compounding, the ν₀ = 1 model overestimates the reference spatial mean and variance of the 931 exact probabilities. The classified ν₀ approach reproduces the statistics of the reference case much better. The class-dependent ν₀ model provides a significant improvement which is not fully reflected by the correlation coefficient.
            reference   ν₀ = 1   classified ν₀ model
mean          0.44       0.52         0.41
variance      0.04       0.07         0.04

Table 4.3: Summary statistics: means and variances of the reference conditional probabilities and of the approximations stemming from the nu representation, for the 931 data value combinations.
Experiments with a different training set

To further test the robustness of the previous results, consider two different data sets. The first one provides the information content (the actual data), and the other offers training data from which data interaction is borrowed. For this, ten independent Gaussian realizations were truncated at their respective upper quartiles, generating ten independent binary fields similar to that shown in Figure 4.15. The means and variances of the ten eroded realizations S₀^(i) are given in Table 4.4.
             1      2      3      4      5      6      7      8      9      10
mean      0.276  0.303  0.294  0.191  0.269  0.263  0.253  0.169  0.165  0.289
variance  0.200  0.211  0.208  0.155  0.196  0.193  0.189  0.140  0.138  0.205

Table 4.4: Means and variances of the 10 independent realizations S₀^(i).
We can now consider different combinations of these ten realizations for the retrieval of the various conditional probabilities:

• information content: we obtained the individually conditioned probabilities P(A|B), P(A|C), P(A|D) from realization i = 1, ..., n = 10;

• proxy ν₀ values: for training we then used any realization j ≠ i.

There is a total of n · (n − 1) = 90 possible combinations of the pair (actual vs. training) realizations that can be used for the approximation of P(A = 1|D,B,C) using the classified ν₀ approach. These estimates are then compared to the results of the ν₀ = 1 model.
Figure 4.26 shows the histograms of the means of the 90 reference P(A = 1|D,B,C) values and of their estimators based on the classified ν₀ approach and on the ν₀ = 1 model. The average of these 90 reference mean values is 0.399 (Figure 4.26, left). The respective averages for the classified ν₀ approach and for the ν₀ = 1 model are 0.386 and 0.458 (Figure 4.26, right and center respectively). The ν₀ = 1 model leads to significant overestimation by over-compounding the individual probabilities. The classified ν₀ approach reproduces well the mean value of the reference case. This similarity is a highly desirable property, as it indicates that the classified ν₀ approach is unbiased.

Figure 4.27 shows the histograms of the variances of the 90 reference P(A = 1|D,B,C) values and of their estimators based on the classified ν₀ approach and on the ν₀ = 1 model. The average of these 90 reference variance values is 0.041 (Figure 4.27, left). The respective averages for the classified ν₀ approach and for the ν₀ = 1 model are 0.041 and 0.076 (Figure 4.27, right and center respectively). The classified ν₀ approach reproduces almost exactly the variance of the reference case.
            reference   ν₀ = 1   classified ν₀
mean          0.399     0.458       0.386
variance      0.041     0.076       0.041

Table 4.5: The average means and variances of P(A = 1|D,B,C) over the 90 combinations.
Table 4.5 summarizes the average means and variances of the 90 reference fully conditioned probabilities P(A = 1|D,B,C) and the same statistics based on the ν₀ = 1
[Figure 4.26 appears here; the recoverable panel annotations are: (1) reference, n = 90, m(mean) = 0.399; (2) ν₀ = 1 model; (3) classified ν₀ model.]

Figure 4.26: The histograms of the means of the 90 reference P(A = 1|D,B,C) values (left), and of their estimators based on the ν₀ = 1 model (center) and the classified ν₀ approach (right).
and the classified ν₀ approaches. The ν₀ = 1 model significantly over-compounds the elementary probabilities, leading to a significant overestimation (bias). In contrast, the classified ν₀ approach accounts better for the joint data interaction, and thus decreases the over-compounding of information content and hence reduces the overall bias.
[Figure 4.27 appears here; the recoverable panel annotations are: (1) reference, n = 90, mean in [0.029, 0.050], m(mean) = 0.041; (2) ν₀ = 1 model, mean in [0.046, 0.096], m(mean) = 0.076; (3) classified ν₀ model.]

Figure 4.27: The histograms of the variances of the 90 reference P(A = 1|D,B,C) values (left), and of their estimators based on the ν₀ = 1 model (center) and the classified ν₀ approach (right).
Chapter 5
Application to non-binary data
As was shown in Chapter 4, in the presence of actual data dependence, the ν₀ = 1 model significantly outperforms the traditional estimators based on any data independence hypothesis. Estimators defined by the independence assumptions could lead to illicit probabilities, e.g. greater than one. The ν₀ = 1 model guarantees licit probabilities regardless of the level of data dependence. In this chapter, we generalize the nu model to the case of non-binary variables, with extensive testing using a ternary variable data set.
5.1 A single constraint
Consider the evaluation of the posterior probability P(A = k|D) for all k = 1, ..., K, where k is a particular outcome of the unknown A. For example, category k could indicate the presence/absence of a channel sand.

The conditioning information D is constituted of n elementary data events D_i:

D = ∩_{i=1}^{n} D_i
Using the notations of Chapter 3, the distances to event A = k are written:

x₀^(k) = P(A ≠ k) / P(A = k)

x_i^(k) = P(A ≠ k|D_i) / P(A = k|D_i),  i = 1, ..., n     (5.1)

x^(k) = P(A ≠ k|D) / P(A = k|D)

The fully conditioned posterior probability P(A = k|D) is then:

P(A = k|D) = 1 / (1 + x^(k))     (5.2)
These posterior probabilities must verify the law of total probability, whatever the data set D, i.e. for all x₀^(k), x_i^(k), and x^(k):

Σ_{k=1}^{K} P(A = k|D) = Σ_{k=1}^{K} 1/(1 + x^(k)) = 1     (5.3)
For each category k, the nu expression from Chapter 3 is written:

x^(k)/x₀^(k) = ν₀^(k) ∏_{i=1}^{n} x_i^(k)/x₀^(k)

or

y^(k) = ν₀^(k) ∏_{i=1}^{n} y_i^(k)     (5.4)

where

y^(k) = x^(k)/x₀^(k),  y_i^(k) = x_i^(k)/x₀^(k)     (5.5)

The sequence-dependent interaction parameter ν_i^(k) is written as:

ν_i^(k) = [P(D_i | A ≠ k, D₁, ..., D_{i−1}) / P(D_i | A ≠ k)] / [P(D_i | A = k, D₁, ..., D_{i−1}) / P(D_i | A = k)]
Note that the prior distances should verify the constraints:

Σ_{k=1}^{K} P(A = k) = Σ_{k=1}^{K} 1/(1 + x₀^(k)) = 1     (5.6)

which entails a single relation linking the K prior distances x₀^(k).

For example, in the degenerate case K = 1:

P(A = K) = 1  ⟹  x₀^(K) = 0  ⟹  1/(1 + x₀^(K)) = 1     (5.7)
For the case K = 2:

1/(1 + x₀^(1)) + 1/(1 + x₀^(2)) = [(1 + x₀^(2)) + (1 + x₀^(1))] / [(1 + x₀^(1))(1 + x₀^(2))]
= [2 + x₀^(1) + x₀^(2)] / [1 + x₀^(1) + x₀^(2) + x₀^(1)x₀^(2)] = 1  ⟺  x₀^(1) x₀^(2) = 1     (5.8)

When K = 3:

1/(1 + x₀^(1)) + 1/(1 + x₀^(2)) + 1/(1 + x₀^(3))
= [(1 + x₀^(2))(1 + x₀^(3)) + (1 + x₀^(1))(1 + x₀^(3)) + (1 + x₀^(1))(1 + x₀^(2))] / [(1 + x₀^(1))(1 + x₀^(2))(1 + x₀^(3))]
= [3 + 2[x₀^(1) + x₀^(2) + x₀^(3)] + [x₀^(1)x₀^(2) + x₀^(1)x₀^(3) + x₀^(2)x₀^(3)]] / [1 + [x₀^(1) + x₀^(2) + x₀^(3)] + [x₀^(1)x₀^(2) + x₀^(1)x₀^(3) + x₀^(2)x₀^(3)] + x₀^(1)x₀^(2)x₀^(3)] = 1

⟺  x₀^(1) x₀^(2) x₀^(3) = 2 + [x₀^(1) + x₀^(2) + x₀^(3)]     (5.9)
Then generalizing to any K :
K
E r 1 num
^ i + 4 f c ) den
where: K
den = 1 + Y^ Q
and Ci is the sum of all combinations of /-product of xQ defined as:
K K K
fcl=l fe2=l fc(=l
with k\ ^ k2 7 ... 7 A;/, and:
= #+(*--l)£>j( num fe=i
iC K
K (k)
T ( * i ) T ( * a ) XQ X Q +MEE
fci=ife2=i i f K K
fel=l A;2=l fc3=l
K K K
+ ... + (jf-DEE-E
_(fc l ) - ( *2)_( fc3) X Q X Q X Q
X Q X Q . . . X Q
Hence,
num = K + f > - I ) £ E - E 4fcl)4fc2)-4fci)] /=1 fei=lfc2=l fc;=l
In summary, when K ≥ 2, the constraint to ensure licit prior probabilities is

Σ_{k=1}^{K} 1/(1 + x₀^(k)) = 1, i.e. num = den     (5.10)

and finally, after cancellation of the common terms:

C_K = ∏_{k=1}^{K} x₀^(k) = (K − 1) + Σ_{l=1}^{K−2} (K − l − 1) C_l     (5.11)

For example, for the case K = 2 we have: x₀^(1) x₀^(2) = 1.

For the case K = 3 it becomes: x₀^(1) x₀^(2) x₀^(3) = 2 + C₁ = 2 + [x₀^(1) + x₀^(2) + x₀^(3)].

For the case K = 4 we may write:

x₀^(1) x₀^(2) x₀^(3) x₀^(4) = 3 + 2C₁ + C₂ = 3 + 2[x₀^(1) + x₀^(2) + x₀^(3) + x₀^(4)] + [x₀^(1)x₀^(2) + x₀^(1)x₀^(3) + x₀^(1)x₀^(4) + x₀^(2)x₀^(3) + x₀^(2)x₀^(4) + x₀^(3)x₀^(4)]

Similarly, these constraints apply to any of the n sets of elementary distances x_i^(k), i = 1, ..., n, and to the fully updated distances x^(k).
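The constraint can be checked numerically as follows; the helper names are ours, and the priors used are those of the numerical example later in this section:

```python
from itertools import combinations
from math import prod

# Numerical check of the consistency constraint (5.11): the product of the
# K prior distances x0^(k) = P(A != k) / P(A = k) must equal
# (K - 1) + sum_{l=1}^{K-2} (K - l - 1) * C_l, with C_l the sum of all
# l-fold products of distinct distances.

def distances_from_priors(priors):
    return [(1.0 - p) / p for p in priors]

def constraint_gap(x0):
    """Zero when the distances satisfy constraint (5.11)."""
    K = len(x0)
    rhs = K - 1.0
    for l in range(1, K - 1):
        C_l = sum(prod(c) for c in combinations(x0, l))
        rhs += (K - l - 1) * C_l
    return prod(x0) - rhs

# Licit priors (they sum to 1) always satisfy the constraint, e.g. the
# K = 3 priors P(A=1) = 2/3, P(A=2) = 1/4, P(A=3) = 1/12 used later:
x0 = distances_from_priors([2 / 3, 1 / 4, 1 / 12])
assert abs(constraint_gap(x0)) < 1e-9      # 0.5 * 3 * 11 = 2 + 14.5
```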
Constraint on the nu model

The distance constraint (5.11) applied to the fully updated distances x^(k) induces a single, but non-linear, constraint on the parameters ν₀^(k) of the nu model.

• K = 2:

Assuming that the prior distances are consistent (i.e. x₀^(1) x₀^(2) = 1), the constraint on the fully updated distances is:

x^(1) x^(2) = 1  ⟺  y^(1) y^(2) x₀^(1) x₀^(2) = y^(1) y^(2) = 1

where y^(k) is defined by relation (5.5).

Under the nu model (5.4) this constraint leads to:

y^(1) y^(2) = 1  iff  ν₀^(1) ν₀^(2) ∏_{i=1}^{n} y_i^(1) y_i^(2) = 1     (5.12)

Assuming that the single datum-conditioned probabilities are consistent, i.e. x_i^(1) x_i^(2) = 1, ∀i ⟹ y_i^(1) y_i^(2) = 1, ∀i, relation (5.12) leads to the following single constraint on the nu interaction parameters:

ν₀^(1) ν₀^(2) = 1     (5.13)

For example, if ν₀^(1) = 1, then ν₀^(2) = 1. If ν₀^(1) = 0.5, then ν₀^(2) = 2.
• K = 3:

The constraint (5.11) on the distances x^(k) is written:

∏_{k=1}^{3} x^(k) = 2 + Σ_{k=1}^{3} x^(k)

Under the nu model this is written:

∏_{k=1}^{3} ν₀^(k) [x₀^(k)]^{1−n} ∏_{i=1}^{n} x_i^(k) = 2 + Σ_{k=1}^{3} ν₀^(k) [x₀^(k)]^{1−n} ∏_{i=1}^{n} x_i^(k)     (5.14)

Setting:

S^(k) = [x₀^(k)]^{1−n} ∏_{i=1}^{n} x_i^(k) > 0     (5.15)

the constraint (5.14) is rewritten:

∏_{k=1}^{3} ν₀^(k) S^(k) = 2 + Σ_{k=1}^{3} ν₀^(k) S^(k)     (5.16)

with ν₀^(k) > 0 and assuming that all prior input distances x₀^(k), x_i^(k) are consistent. This is one single (non-linear) constraint on the three weights ν₀^(k), k = 1, 2, 3.
Example 1

Consider the case where S^(1) = S^(2) = 1 and S^(3) = 0.5. The constraint (5.16) is then written:

ν₀^(1) ν₀^(2) ν₀^(3)/2 = 2 + ν₀^(1) + ν₀^(2) + ν₀^(3)/2

leading to:

ν₀^(3) [ν₀^(1) ν₀^(2) − 1] = 4 + 2ν₀^(1) + 2ν₀^(2)

This relation indicates that any model with ν₀^(1) ν₀^(2) = 1 is not permissible in this situation. One possible solution satisfying the above constraint is ν₀^(1) = 1, ν₀^(2) = 1.5, and ν₀^(3) = 18.
The constraint (5.11) for four categories is written similarly to relation (5.16):

∏_{k=1}^{4} ν₀^(k) S^(k) = 3 + 2 Σ_{k=1}^{4} ν₀^(k) S^(k) + Σ_{k > k'} ν₀^(k) ν₀^(k') S^(k) S^(k')     (5.17)

Extending Example 1

Using the values above, S^(1) = S^(2) = 1 and S^(3) = 0.5, consider the case S^(4) = 1. The constraint (5.17) is written as:

ν₀^(1) ν₀^(2) ν₀^(3) ν₀^(4)/2 = 3 + 2[ν₀^(1) + ν₀^(2) + ν₀^(3)/2] + [ν₀^(1)ν₀^(2) + ν₀^(1)ν₀^(3)/2 + ν₀^(2)ν₀^(3)/2] + [2ν₀^(4) + ν₀^(1)ν₀^(4) + ν₀^(2)ν₀^(4) + ν₀^(3)ν₀^(4)/2]

Then:

ν₀^(1) ν₀^(2) ν₀^(3) ν₀^(4) = 6 + 4[ν₀^(1) + ν₀^(2) + ν₀^(3)/2] + 2[ν₀^(1)ν₀^(2) + ν₀^(1)ν₀^(3)/2 + ν₀^(2)ν₀^(3)/2] + [4ν₀^(4) + 2ν₀^(1)ν₀^(4) + 2ν₀^(2)ν₀^(4) + ν₀^(3)ν₀^(4)]
Factoring out the fourth parameter ν₀^(4), it comes:

ν₀^(4) [ν₀^(1)ν₀^(2)ν₀^(3) − 4 − 2ν₀^(1) − 2ν₀^(2) − ν₀^(3)] = 6 + 4[ν₀^(1) + ν₀^(2) + ν₀^(3)/2] + 2[ν₀^(1)ν₀^(2) + ν₀^(1)ν₀^(3)/2 + ν₀^(2)ν₀^(3)/2] > 0

Note that the previous solution ν₀^(1) = 1, ν₀^(2) = 1.5, ν₀^(3) = 18 is not permissible here, since the factor multiplying ν₀^(4) would vanish. Similarly, the joint set of values ν₀^(1) = ν₀^(2) = ν₀^(3) = 1 is not acceptable, since that factor would be negative. The values ν₀^(1) = 1 and ν₀^(2) = 1.5 require that ν₀^(3) > 18. For example, the values ν₀^(1) = 1, ν₀^(2) = 1.5, ν₀^(3) = 20 call for the solution ν₀^(4) = 109.
• General case: K > 2

The model ν₀^(k) = 1 introduced in Chapter 3 cannot be extended to all k = 1, ..., K > 2. The K nu weights ν₀^(k) must verify a single but non-linear constraint of type (5.11). Inference of the K nu weights ν₀^(k) should then be done jointly on the entire vector [P(A = k), P(A = k|D_i), i = 1, ..., n; P(A = k|D); ν₀^(k); k = 1, ..., K > 2]. Note that instead of imposing that single constraint on the nu weights ν₀^(k), one could determine (K − 1) such weights ν₀^(k), k ≠ k₀, and set the remaining probability to

P(A = k₀|D) = 1 − Σ_{k' ≠ k₀} P(A = k'|D).
Numerical example:

To demonstrate how the constraint (5.11) can be used in practice, and to show the consequences of not using it, we develop a simple numerical experiment. Consider the case of an unknown A informed by two data events D₁ and D₂ (i.e. n = 2). The unknown A can belong to any of three categories k = 1, 2, 3 (K = 3). There is a wide variety of geological analogs to this situation. For example, the first category may represent mud, the second category may indicate the presence/absence of channel sand at a particular unsampled location of a potential reservoir, and the third category may be indicative of fractures in that reservoir.
Assume the availability of the following prior distances:

• Prior distances:

x₀^(1) = 0.50, x₀^(2) = 3.00, x₀^(3) = 11.00. That is:

P(A = 1) = 1/1.5 = 0.667, P(A = 2) = 1/4 = 0.25, P(A = 3) = 1/12 = 0.0833

To verify the consistency relation (5.11), check:

∏_{k=1}^{3} x₀^(k) = 16.50 = 2 + Σ_{k=1}^{3} x₀^(k) = 2 + 14.50.

• Conditioning to data event D₁:

x₁^(1) = 0.25, x₁^(2) = x₁^(3) = 9.00.

That is: P(A = 1|D₁) = 0.80, P(A = 2|D₁) = P(A = 3|D₁) = 0.10

To verify the consistency of the probabilities conditioned to the single data event D₁, check:

∏_{k=1}^{3} x₁^(k) = 20.25 = 2 + Σ_{k=1}^{3} x₁^(k) = 2 + 18.25.

• Conditioning to data event D₂:

x₂^(1) = 0.333, x₂^(2) = x₂^(3) = 7.00.

That is: P(A = 1|D₂) = 0.75, P(A = 2|D₂) = P(A = 3|D₂) = 0.125

To verify the consistency of the probabilities conditioned to the single data event D₂, check:

∏_{k=1}^{3} x₂^(k) = 16.33 = 2 + Σ_{k=1}^{3} x₂^(k) = 2 + 14.33.

Note: D₂ compounds the D₁-information by increasing the probability of A = 1 from its prior value P(A = 1) = 0.667.

According to relation (5.5), the relative distances conditioned to data event D₁ are:

y₁^(1) = x₁^(1)/x₀^(1) = 0.5,  y₁^(2) = x₁^(2)/x₀^(2) = 3.0,  y₁^(3) = x₁^(3)/x₀^(3) = 0.818
Similarly, the relative distances conditioned to data event D₂ are:

y₂^(1) = x₂^(1)/x₀^(1) = 0.667,  y₂^(2) = x₂^(2)/x₀^(2) = 2.33,  y₂^(3) = x₂^(3)/x₀^(3) = 0.636

Recall the y-expression (5.4) of the nu model:

y^(k) = ν₀^(k) ∏_{i=1}^{2} y_i^(k),  k = 1, 2, 3

For the event A = 1 with conditioning to data D₁ and D₂, the model ν₀^(1) = 0.5 would lead to the approximation:

y^(1) = 0.5 · (0.50 · 0.667) = 0.167  and  x^(1) = 0.167 · 0.50 = 0.0835

Hence, the model ν₀^(1) = 0.5 < 1 increases the compounding of data D₁, D₂ and leads to P*(A = 1|D₁, D₂) = 0.923.
Setting the two other nu parameters to a common unknown, ν₀^(2) = ν₀^(3) = u, the constraint (5.16) is written:

∏_{k=1}^{3} ν₀^(k) S^(k) = 2 + Σ_{k=1}^{3} ν₀^(k) S^(k)

with:

S^(1) = x₁^(1) x₂^(1) / x₀^(1) = 0.167,  S^(2) = x₁^(2) x₂^(2) / x₀^(2) = 21.0,  S^(3) = x₁^(3) x₂^(3) / x₀^(3) = 5.727

The following equation needs to be solved for the unknown u = ν₀^(2) = ν₀^(3), with ν₀^(1) = 0.5:

∏_{k=1}^{3} ν₀^(k) S^(k) = 0.5 u² (20.09) = 2 + Σ_{k=1}^{3} ν₀^(k) S^(k) = 2 + 0.0835 + 26.727 u

That equation is:

10.045 u² − 26.727 u − 2.0835 = 0

with discriminant: δ = 714.33 + 83.715 = (28.25)² > 0.

The solutions are u = (26.727 ± 28.25) / 20.09. The negative solution is unacceptable, leaving u = ν₀^(2) = ν₀^(3) = 2.737.
These two nu values lead to the approximations:

y^(2) = ν₀^(2) y₁^(2) y₂^(2) = 2.737 · 3 · 2.33 = 19.13

x^(2) = x₀^(2) y^(2) = 3 · 19.13 = 57.38,  and  P*(A = 2|D₁, D₂) = 0.017.

and,

y^(3) = ν₀^(3) y₁^(3) y₂^(3) = 2.737 · 0.818 · 0.636 = 1.424

x^(3) = x₀^(3) y^(3) = 15.66,  and:  P*(A = 3|D₁, D₂) = 0.060

The probabilities should all be consistent, which means that, to avoid any order violations, the probabilities across the three categories k should sum to 1. For our example, the proposed constraint (5.11) allowed for such consistency, i.e.

Σ_{k=1}^{3} P*(A = k|D₁, D₂) = 0.923 + 0.017 + 0.060 = 1.00.

Note that the ν₀^(k) = 1, k = 1, 2, 3, model would have led to inconsistent probabilities, since the sum would then be 0.857 + 0.046 + 0.149 = 1.052 > 1.
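The worked example above can be reproduced end to end; the variable names are ours, and the tolerances absorb the rounding of the hand calculation:

```python
import math

# Reproduction sketch of the K = 3 worked example: with nu0^(1) = 0.5 fixed,
# solve constraint (5.16) for the common weight u = nu0^(2) = nu0^(3), then
# verify that the resulting posterior probabilities are licit (sum to 1).

x0 = [0.5, 3.0, 11.0]        # prior distances x0^(k)
x1 = [0.25, 9.0, 9.0]        # distances conditioned to D1
x2 = [1 / 3, 7.0, 7.0]       # distances conditioned to D2

S = [x1[k] * x2[k] / x0[k] for k in range(3)]      # S^(k) for n = 2

nu1 = 0.5                                          # chosen nu0^(1)
# Constraint (5.16): (nu1*S0) * (u*S1) * (u*S2) = 2 + nu1*S0 + u*(S1 + S2)
a = nu1 * S[0] * S[1] * S[2]
b = -(S[1] + S[2])
c = -(2.0 + nu1 * S[0])
u = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)   # positive root

nu = [nu1, u, u]
x_full = [nu[k] * S[k] for k in range(3)]          # x^(k) = nu0^(k) * S^(k)
p_full = [1.0 / (1.0 + x) for x in x_full]

assert abs(u - 2.737) < 0.01                  # matches the hand calculation
assert abs(p_full[0] - 0.923) < 0.001
assert abs(sum(p_full) - 1.0) < 1e-9          # licit posterior probabilities
```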
Summary of the numerical example:

• The ν₀^(1) = 0.5 model increases compounding, as it increases the probability that A = 1 from P(A = 1|D₁) = 0.80 to P*(A = 1|D₁, D₂) = 0.923.

The model ν₀^(2) = ν₀^(3) = 2.737 increases the two distances x^(2) and x^(3), thus decreasing the two probabilities of having A = 2 and A = 3 to P*(A = 2|D₁, D₂) = 0.017 and P*(A = 3|D₁, D₂) = 0.060 from their values P(A = 2|D₁) = P(A = 3|D₁) = 0.10.
• The model ν₀^(1) = 2 leads to x^(1) = 0.334 and P*(A = 1|D₁, D₂) = 0.75 = P(A = 1|D₂), thus erasing the compounding of data events D₁ and D₂.

For ν₀^(1) = 2, the consistency constraint is written:

2u²(20.09) = 2 + 0.334 + 26.727u,

that is: 40.18u² − 26.727u − 2.334 = 0,

leading to the positive solution u = ν₀^(2) = ν₀^(3) = 0.745.

This leads to the estimates:

P*(A = 1|D₁, D₂) = 0.75, as found above.

Similarly, the fully updated distance x^(2) for the second category k = 2 is: x^(2) = ν₀^(2) S^(2) = 0.745 · 21.0 = 15.64. Thus, P*(A = 2|D₁, D₂) = 0.060.

And the fully updated distance x^(3) for the third category k = 3 is: x^(3) = ν₀^(3) S^(3) = 0.745 · 5.727 = 4.27. Thus, P*(A = 3|D₁, D₂) = 0.190.

These estimates verify the consistency relation Σ_{k=1}^{3} P*(A = k|D₁, D₂) = 1.00. Without the proposed constraint (5.11), such consistency would not be possible, leading to order violation problems and possibly to biased estimates of the fully conditioned posterior probability P(A = k|D₁, D₂).
5.2 Large non-Gaussian ternary case study

In Chapter 3, we suggested a data combination paradigm in which the posterior probability P(A = k|D₁, ..., D_n) is obtained by completely separating the single datum information content, carried by the n elementary probabilities P(A = k|D_i), i = 1, ..., n, from the data interaction, carried by the nu interaction weights ν₀^(k), k = 1, ..., K. The elementary probabilities should be evaluated from the actual data. The interaction weights can be obtained from a training data set providing proxies, or replicates, of the data interaction. The constraint (5.11) on the K interaction weights ν₀^(k) ensures that the resulting K fully conditioned posterior probabilities P*(A = k|D₁, ..., D_n) are all licit probability estimates.

The applicability of the proposed ν₀^(k) inference paradigm is now tested using a large 3D reference ternary data set where all conditional probabilities involved in the nu expression (3.6) are known, including the exact fully data-conditioned probability P(A = k|D_i = d_i, i = 1, ..., n). Various approximations of that reference probability can then be evaluated.
5.2.1 The reference data set
We start by generating a reasonably large 3D non-Gaussian field using the training
image generator code [56] of the SGEMS software [66]. This code generates various
geological structures using a non-iterative, unconditional Boolean simulation [36].
For this data set, we generated a ternary image with three mutually exhaustive cate
gories: category 1 for mud, category 2 for channel sand, and category 3 for fractures.
This 3D field is of size 100x100x50, comprising 500,000 nodes and yielding the refer
ence categorical field shown in Figure 5.1.
Denote that reference field by $S: \{A(\mathbf{u}) \in \{1, 2, 3\},\ \mathbf{u} \in S\}$, where $\mathbf{u}$ denotes the location coordinates vector, with $P(A(\mathbf{u}) = 1) = 0.67$, $P(A(\mathbf{u}) = 2) = 0.23$, and $P(A(\mathbf{u}) = 3) = 0.10$, hence $\sum_{k=1}^{3} P(A(\mathbf{u}) = k) = 1.00$.
Figure 5.2 gives the reference indicator variograms in the x, y, z directions, calculated from indicator data over the top 35 layers of $S$ for each of the three categories $k = 1, 2, 3$; the reason for excluding the bottom 15 layers will become apparent shortly. Those indicator variograms reflect the horizontal-to-vertical anisotropy of the original categorical field.
Figure 5.1: Reference categorical image generated using a training image generator (the representation of the two categories A = 2 and A = 3 does not reflect their proportions).
5.2.2 The estimation configuration
Consider the evaluation of the conditional probability of an unsampled value $A(\mathbf{u}) = k$, given any combination of the following three multiple-point data events (Figure 5.3). As seen from Figure 5.3, the closest data event D comprises four data locations at the level just below that of $A(\mathbf{u})$. These four data are at the corners of a square centered on the projection of $A(\mathbf{u})$ onto their level. The next closest data event B also comprises four data locations, with the same geometry as data event D, but located five levels below that of $A(\mathbf{u})$. The furthest data event C again comprises four data locations, located 15 levels below that of $A(\mathbf{u})$.
If the unsampled location $\mathbf{u}$ of $A(\mathbf{u})$ spans only the eroded field $S_0 = \{x = 11, \ldots, 90;\ y = 11, \ldots, 90;\ z = 16, \ldots, 50\}$, then each value $A(\mathbf{u})$ can be evaluated by any of the three data events D, B, C. From here on, all statistics will refer to that "common denominator" field $S_0$ comprising 224,000 nodes. Over that
[Figure 5.2 panels: indicator variograms for k = 1 (mean p = 0.636), k = 2 (mean p = 0.249), and k = 3 (mean p = 0.115), each plotted in the Horizontal EW, Horizontal NS, and Vertical directions against lag.]

Figure 5.2: Exhaustive indicator variograms in x, y and z directions, calculated over the 35 top layers for k = 1, 2, 3.
Figure 5.3: Data events definition: (1) one data event D; (2) two data events D, B; (3) three data events D, B, C.
central field $S_0$, the marginal statistics (prior proportions) are: $P(A = 1) = 0.636$, $P(A = 2) = 0.249$, $P(A = 3) = 0.115$, with $\sum_{k=1}^{3} P(A = k) = 1.00$.
The definition of an "eroded" field $S_0$ common to all data configurations entails that the spatial averages of all conditional probabilities (proportions) remain the same no matter the conditioning data event retained. For example, when conditioning only to the sole D-data event, the conditional probability $P(A = k \mid D)$ is:

$$P(A = k \mid D) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = k \mid D = d(\mathbf{u})) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = k) = P(A = k); \quad k = 1, 2, 3$$
where the data event D can take $K^4 = 3^4 = 81$ possible combinations of data values, with $K = 3$. When conditioning jointly to the two data events D and B,
the conditional probability $P(A = k \mid D, B)$ is then:

$$P(A = k \mid D, B) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = k \mid D = d(\mathbf{u}), B = b(\mathbf{u})) = P(A = k); \quad k = 1, 2, 3$$

where the data event (D, B) can take $3^8 = 6561$ possible combinations of categorical data values. Similarly:

$$P(A = k \mid D, B, C) = \frac{1}{|S_0|} \sum_{\mathbf{u} \in S_0} P(A(\mathbf{u}) = k \mid D = d(\mathbf{u}), B = b(\mathbf{u}), C = c(\mathbf{u})) = P(A = k); \quad k = 1, 2, 3$$

where the data event (D, B, C) can take $3^{12} = 531{,}441$ possible combinations of data values.
Note that, because $S_0$ is not that large ($|S_0| = 224{,}000$ nodes), not all data value combinations are present in $S_0$. This does not, however, affect the previous equalities: $P(A = k \mid D, B) = P(A = k \mid D, B, C) = P(A = k),\ \forall k$.
5.2.3 Conditional probabilities and estimates
As an example, Figure 5.4(1) gives the $S_0$-volume of the 224,000 exact conditional proportion values $P(A(\mathbf{u}) = 1 \mid D, B, C)$, with mean 0.636 and variance 0.084. Their histogram is given in Figure 5.4(2). This spatial field can be considered as the reference. The mean is equal to that of the reference categorical field, as expected, since we scan the same $S_0$-volume.
Similar figures and statistics are available for all the following conditional proportions, although not all are presented here:

• single data event-conditioned: $P(A(\mathbf{u}) = k \mid D)$, $P(A(\mathbf{u}) = k \mid B)$, $P(A(\mathbf{u}) = k \mid C)$, for all $k = 1, 2, 3$.
Figure 5.4: (1) Spatial distribution, (2) histogram of the conditional proportions $P(A(\mathbf{u}) = 1 \mid D, B, C)$ defined over the reference eroded volume $S_0$, and (3) the reference categorical field with prior proportions $P(A=1) = 0.636$, $P(A=2) = 0.249$, $P(A=3) = 0.115$.
• two data events-conditioned: $P(A(\mathbf{u}) = k \mid D, B)$, $P(A(\mathbf{u}) = k \mid D, C)$, $P(A(\mathbf{u}) = k \mid B, C)$, for all $k$.

• all three data events-conditioned: $P(A(\mathbf{u}) = k \mid D, B, C)$.

• the estimated probability $P^*(A(\mathbf{u}) = k \mid D, B, C)$ using the $\nu_0^{(k)}$ model (3.6) to combine the previous single data event-conditioned probabilities.
Again, the mean of these proportions will be equal to that of the reference categorical field, as expected, since for all the above proportions we scan the same $S_0$-volume.
When the model $\nu_0^{(k)} = 1$ is used, at each location $\mathbf{u}$ the estimate of the fully conditioned posterior probability is:

$$P^*(A(\mathbf{u}) = k \mid D = d(\mathbf{u}), B = b(\mathbf{u}), C = c(\mathbf{u})) = \frac{1}{1 + x^{(k)}(\mathbf{u})} \quad (5.18)$$
with the estimated distance $x^{(k)}(\mathbf{u})$ being such that:

$$x^{(k)}(\mathbf{u}) = \frac{x_D^{(k)}(\mathbf{u})\, x_B^{(k)}(\mathbf{u})\, x_C^{(k)}(\mathbf{u})}{\left(x_0^{(k)}\right)^2}$$

where $x_0^{(k)} = \frac{1 - P(A = k)}{P(A = k)}$ is the marginal distance, with

$$x_0^{(1)} = \frac{1 - 0.6356}{0.6356} = 0.5733, \quad x_0^{(2)} = \frac{1 - 0.2490}{0.2490} = 3.0161, \quad x_0^{(3)} = \frac{1 - 0.1154}{0.1154} = 7.6655,$$

and $x_D^{(k)}(\mathbf{u}) = \frac{1 - P(A(\mathbf{u}) = k \mid D = d(\mathbf{u}))}{P(A(\mathbf{u}) = k \mid D = d(\mathbf{u}))}$ is the distance to $A(\mathbf{u}) = k$ updated by the sole data event $D = d(\mathbf{u})$.
The distance $x_D^{(k)}(\mathbf{u})$ is obtained by scanning the reference image $S_0$ with the template definition shown in Figure 5.3(1) for the proportion of D-replicates matching the data value combination $d(\mathbf{u})$. Our estimation paradigm assumes that all elementary conditional probabilities $P(A = k \mid D)$, $P(A = k \mid B)$, $P(A = k \mid C)$ are known. This study addresses only the problem of combining these elementary probabilities into an estimate of the fully conditioned probability $P(A = k \mid D, B, C)$ while accounting for data interaction.
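The replicate-scanning step described above can be sketched in a few lines; the array layout, template offsets, and function names below are illustrative assumptions, not the actual code used in this study.

```python
from itertools import product

def elementary_probability(field, offsets, d_values, category):
    """Estimate P(A(u) = category | D = d) by scanning an exhaustive 3D
    categorical image (nested lists, field[x][y][z]) for replicates of
    the data-value combination d at the given template offsets."""
    nx, ny, nz = len(field), len(field[0]), len(field[0][0])
    # restrict u to the eroded field where the whole template fits
    lo = [max(0, -min(o[i] for o in offsets)) for i in range(3)]
    hi = [(nx, ny, nz)[i] - max(0, max(o[i] for o in offsets)) for i in range(3)]
    replicates = hits = 0
    for x, y, z in product(*(range(l, h) for l, h in zip(lo, hi))):
        datum = tuple(field[x + dx][y + dy][z + dz] for dx, dy, dz in offsets)
        if datum == tuple(d_values):
            replicates += 1
            hits += (field[x][y][z] == category)
    return hits / replicates if replicates else None

def distance(p):
    """Distance to the event: x = (1 - p) / p, so that p = 1 / (1 + x)."""
    return (1.0 - p) / p

# four corners of a square one level below u, as in Figure 5.3(1)
D_template = [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, -1)]
```

The same scan with the B and C templates yields $x_B^{(k)}$ and $x_C^{(k)}$; a combination `d_values` with no replicate in the image returns `None` rather than a proportion.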
Similarly, from the training image one can retrieve the other two elementary distances $x_B^{(k)}(\mathbf{u})$ and $x_C^{(k)}(\mathbf{u})$. The $\nu_0^{(k)} = 1$ model (5.18) then provides an estimate of the fully conditioned probability, $P^*(A(\mathbf{u}) = k \mid D, B, C)$. For example, the spatial distribution and histogram of the estimates $P^*(A(\mathbf{u}) = 1 \mid D, B, C)$ using the $\nu_0^{(k)} = 1$ model are shown in Figure 5.5(1).
These estimates are necessarily valued in the interval [0, 1]: their spatial mean is 0.635 with spatial variance 0.010. Their histogram is given in Figure 5.5(2). Comparing Figure 5.5 to Figure 5.4 shows that the bias of the $\nu_0^{(k)} = 1$ estimates relative to
[Figure 5.5 panel statistics: N = 224,000, m = 0.635, $\sigma^2$ = 0.010.]

Figure 5.5: (1) Spatial distribution, (2) histogram of the conditional probabilities $P^*(A(\mathbf{u}) = 1 \mid D, B, C)$ estimated with the model $\nu_0^{(k)} = 1$, and (3) the categorical reference field with prior proportions $P(A=1) = 0.636$, $P(A=2) = 0.249$, $P(A=3) = 0.115$.
the reference probabilities is small (compare their spatial means of 0.635 and 0.636). Note, however, that the estimates based on the $\nu_0^{(k)} = 1$ model have a smaller spatial variance (0.010 < 0.084) due to the estimation smoothing effect.
The histogram and scattergram of the error, defined as

$$P^*(A(\mathbf{u}) = 1 \mid D, B, C) - P(A(\mathbf{u}) = 1 \mid D, B, C),$$

are shown in Figure 5.6. The correlation between the local probability estimate and the actual true proportion is low ($\rho = 0.34$), leaving room for finding a better data interaction parameter $\nu_0^{(k)}$ different from 1.
Note that the $\nu_0^{(k)} = 1$ model cannot be extended to all $k = 1, 2, 3$. For example, we calculated the estimates $P^*(A = 2 \mid D, B, C)$ and $P^*(A = 3 \mid D, B, C)$ using the nu parameter values $\nu_0^{(2)} = 1$ and $\nu_0^{(3)} = 1$. Figure 5.7 shows the histogram of the sum
[Figure 5.6 panel statistics: error histogram m = 0.0002, $\sigma^2$ = 0.0749; estimate $P^*$: m = 0.635, $\sigma^2$ = 0.010; reference $P$: m = 0.636, $\sigma^2$ = 0.084.]

Figure 5.6: (1) Histogram of the error $P^*(A = 1 \mid D, B, C) - P(A = 1 \mid D, B, C)$ and (2) the corresponding scatterplot of the estimate $P^*(A = 1 \mid D, B, C)$ versus the reference $P(A = 1 \mid D, B, C)$.

[Figure 5.7 histogram statistics: estimated mean 1.00034 (exact: 1), data count 224,000; maximum 1.0962, upper quartile 1.0029, median 0.9996, minimum 0.9207.]

Figure 5.7: Histogram of $\sum_{k=1}^{3} P^*(A = k \mid D, B, C)$. The $\nu_0^{(k)} = 1$ model cannot be extended to all categories $k$ since the mean of $\sum_{k=1}^{3} P^*(A = k \mid D, B, C)$ exceeds 1, which contradicts the general law of probabilities.
of the estimates $\sum_{k=1}^{3} P^*(A(\mathbf{u}) = k \mid D, B, C)$ resulting from the model $\nu_0^{(k)} = 1\ \forall k$. The spatial mean is equal to 1.00034, which is slightly greater than one. Out of the 224,000 estimated values, 104,392 (about half) fell outside the required interval [0, 1].
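This failure mode is easy to reproduce numerically. The sketch below applies the $\nu_0 = 1$ combination rule of (5.18) with made-up priors and elementary probabilities (illustrative assumptions, not values from the case study); the three resulting estimates do not sum to one.

```python
def nu_estimate(p0, p_elem, nu=1.0):
    """Combine elementary probabilities P(A=k|Di) with the nu model:
    x = nu * prod(x_i) / x0**(n-1), then P* = 1 / (1 + x)."""
    x0 = (1.0 - p0) / p0
    x = nu
    for p in p_elem:
        x *= (1.0 - p) / p
    x /= x0 ** (len(p_elem) - 1)
    return 1.0 / (1.0 + x)

# made-up priors and elementary probabilities for K = 3 categories
priors = {1: 0.636, 2: 0.249, 3: 0.115}
elem = {1: [0.80, 0.75, 0.70], 2: [0.15, 0.18, 0.20], 3: [0.05, 0.07, 0.10]}

total = sum(nu_estimate(priors[k], elem[k]) for k in (1, 2, 3))
print(total)  # about 1.008: with every nu0 set to 1, the sum is not exactly 1
```

Each individual estimate is a licit probability in [0, 1], yet their sum across the $K$ categories drifts away from one, which is exactly the inconsistency that the constraint (5.11) repairs.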
In the previous runs, the single data event-conditioned probabilities $P(A(\mathbf{u}) = k \mid D)$, $P(A(\mathbf{u}) = k \mid B)$, and $P(A(\mathbf{u}) = k \mid C)$ were set equal to the corresponding proportions over $S_0$. Each data event D, B or C can take $3^4 = 81$ possible combinations of data values. This small number of possible data value combinations makes it likely that all 81 combinations are present in the training image with a number of replicates greater than 10. This in turn ensures that the spatial estimates $P^*(A = k \mid D, B, C)$, such as the one given in Figure 5.5 based on the $\nu_0^{(k)} = 1$ model,
are statistically significant. However, the statistical significance of the fully conditioned proportion $P(A = k \mid D, B, C)$ shown in Figure 5.4 could be questioned. The statistics shown in Figures 5.5 and 5.6 pool together the 224,000 estimated conditional probabilities over $S_0$, irrespective of the actual conditioning data values and the corresponding number of replicates. Note that there are $3 \times 4 = 12$ categorical ($K = 3$) indicator data, grouped four by four into the three data events D, B, and C; therefore there is a total of $K^{12} = 3^{12} = 531{,}441$ possible data value combinations. Of these, 475,271 (89%) were not found in $S_0$, and of the remaining 56,170, only 195 combinations were found with at least 10 replicates. To ensure statistical significance we retain only these 195 data combinations for further analysis.
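The replicate-count filter can be sketched as follows; the grid layout and template names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def count_joint_replicates(field, events):
    """Count, over the eroded field, the replicates of each joint
    data-value combination for the pooled templates (e.g. D, B, C);
    field is a nested-list 3D categorical image, field[x][y][z]."""
    all_offsets = [o for ev in events for o in ev]
    dims = (len(field), len(field[0]), len(field[0][0]))
    lo = [max(0, -min(o[i] for o in all_offsets)) for i in range(3)]
    hi = [dims[i] - max(0, max(o[i] for o in all_offsets)) for i in range(3)]
    counts = Counter()
    for x, y, z in product(*(range(l, h) for l, h in zip(lo, hi))):
        combo = tuple(field[x + dx][y + dy][z + dz]
                      for dx, dy, dz in all_offsets)
        counts[combo] += 1
    return counts

# keep only the combinations with at least 10 replicates, e.g.:
# retained = {c: n for c, n in counts.items() if n >= 10}
```

Applied to the 12-point (D, B, C) template over $S_0$, such a count is what yields the 195 retained combinations out of the $3^{12}$ possible ones.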
5.2.4 Determining the $\nu_0^{(k)}$ to ensure consistent probabilities
As was shown in Figure 5.7, the model $\nu_0^{(k)} = 1$ cannot be extended to all $k = 1, 2, 3$ since it may lead to inconsistent probabilities. To ensure consistency, the $K$ weights $\nu_0^{(k)}$ must verify the single but non-linear relation (5.11). Consider the model $\nu_0^{(1)} = \nu_0^{(2)} = 1$, where the first two nu parameters are set to 1, indicating no data interaction when evaluating $A = 1$ and $A = 2$. The third weight $\nu_0^{(3)}$ is determined using the constraint (5.11):

$$\nu_0^{(3)} = \frac{2 + S^{(1)} + S^{(2)}}{S^{(1)} S^{(2)} S^{(3)} - S^{(3)}} \quad (5.19)$$

where $S^{(1)}, S^{(2)}, S^{(3)}$ are defined by relation (5.15).
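The constraint can be checked numerically. In the sketch below, the $S^{(k)}$ values are made-up stand-ins for the distance products of relation (5.15); given $\nu_0^{(1)} = \nu_0^{(2)} = 1$, the computed $\nu_0^{(3)}$ forces the three estimates $P^{*(k)} = 1/(1 + \nu_0^{(k)} S^{(k)})$ to sum to one.

```python
def nu3_for_consistency(S, nu1=1.0, nu2=1.0):
    """Solve sum_k 1 / (1 + nu_k * S_k) = 1 for nu_3 (relation (5.19)
    and its generalization (5.23)); requires S1 * S2 * nu1 * nu2 != 1."""
    S1, S2, S3 = S
    return (2.0 + S1 * nu1 + S2 * nu2) / (S1 * S2 * S3 * nu1 * nu2 - S3)

# made-up distance products for the three categories
S = (0.4, 3.5, 9.0)
nu3 = nu3_for_consistency(S)
total = 1 / (1 + S[0]) + 1 / (1 + S[1]) + 1 / (1 + nu3 * S[2])
print(total)  # 1.0 up to floating-point rounding
```

The same closed form with arbitrary $\nu_0^{(1)}, \nu_0^{(2)}$ corresponds to the later relation (5.23).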
For the evaluation of the exact fully conditioned proportion $P(A = k \mid D, B, C)$ we use only the 195 statistically significant data value combinations (those with at least 10 replicates). For each of these 195 combinations, we calculated:

$$P^*(A = 1 \mid D, B, C) \text{ under the } \nu_0^{(1)} = 1 \text{ model}$$
$$P^*(A = 2 \mid D, B, C) \text{ under the } \nu_0^{(2)} = 1 \text{ model} \quad (5.20)$$
$$P^*(A = 3 \mid D, B, C) \text{ with } \nu_0^{(3)} \text{ calculated using expression (5.19)}$$
We also calculated the estimate $P^*_{CI}(A = k \mid D, B, C)$ under the conditional independence (CI) assumption, using expression (4.8), as:

$$P^*_{CI}(A = k \mid D, B, C) = P(A = k)\ \frac{P(A = k \mid D)}{P(A = k)}\ \frac{P(A = k \mid B)}{P(A = k)}\ \frac{P(A = k \mid C)}{P(A = k)}\ \frac{P(D)\, P(B)\, P(C)}{P(D, B, C)} \quad (5.21)$$
The CI estimate (5.21) and the nu-model estimated probability $P^*(A = k \mid D, B, C)$ given by (5.20) are plotted against the $S_0$-exact proportion $P(A = k \mid D, B, C)$ in Figure 5.8 for each category $k = 1, 2, 3$.
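For reference, the CI combination (5.21) can be written as a one-line function; the inputs below, including the joint factor $P(D)P(B)P(C)/P(D,B,C)$, are made-up numbers, and the example simply shows that nothing in the formula keeps the result inside [0, 1].

```python
def ci_estimate(prior, p_d, p_b, p_c, joint_factor):
    """Conditional-independence estimate (5.21):
    P*_CI = prior * (p_d/prior) * (p_b/prior) * (p_c/prior) * joint_factor,
    where joint_factor = P(D) P(B) P(C) / P(D, B, C) must be evaluated
    from the exhaustive image."""
    return prior * (p_d / prior) * (p_b / prior) * (p_c / prior) * joint_factor

# strongly redundant data can push the "probability" above 1
print(ci_estimate(0.636, 0.9, 0.9, 0.9, 1.2))  # about 2.16, an illicit value
```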
In the presence of strong data interaction (essentially between the two close-by data events D, B and the unknown event A), the no-data-interaction model $\nu_0^{(1)} = \nu_0^{(2)} = 1$ significantly outperforms the estimator based on the conditional independence assumption. This can be seen from Figure 5.8 (bottom), which shows that, for the 195 data value combinations retained, the correlation coefficients based on $\nu_0^{(1)} = 1$ and $\nu_0^{(2)} = 1$ are both equal to 0.67, while for the CI estimator these coefficients are 0.14 and 0.34 for $k = 1$ and $k = 2$, respectively (Figure 5.8, top panels). Also, for $k = 1$, the CI estimator leads to illicit probabilities, i.e. probabilities greater than one (Figure 5.8, top left panel). This inconsistency comes from the fact that, out of the 195 data value combinations retained, category $k = 1$ is more likely to be present in the training image than the other two categories, which makes the assumption of conditional independence more likely to be invalid when evaluating $P(A = 1 \mid D, B, C)$ than for the other two categories. For $k = 3$, the constrained $\nu_0^{(3)}$ model also significantly outperforms the conditional independence estimator, with a correlation coefficient equal to 0.43 versus 0.18 (Figure 5.8, top and bottom right panels).
Table 5.1 also shows that the estimator defined by the nu model allows for a better reproduction of the spatial mean and variance of the reference across the 195 data value combinations. The conditional independence estimator tends to underestimate the spatial mean and to overestimate the spatial variance.
[Figure 5.8 panels: scatterplots for k = 1, 2, 3; the CI-based estimates (top row) show correlations $\rho$ = 0.14, 0.34 and 0.18; the $\nu_0^{(k)}$-model-based estimates (bottom row) show correlations $\rho$ = 0.67, 0.67 and 0.43.]

Figure 5.8: Top: the reference proportion $P(A = k \mid D, B, C)$ (x-axis) versus the estimated proportion based on the conditional independence assumption (y-axis) for $k = 1, 2, 3$. Bottom: the reference $P(A = k \mid D, B, C)$ (x-axis) versus the estimate based on the $\nu_0^{(k)}$ model (y-axis) for $k = 1, 2, 3$.
k = 1                        mean    variance
reference                    0.61    0.0161
$\nu_0^{(1)} = 1$            0.66    0.0078
conditional independence     0.54    0.0574

k = 2                        mean    variance
reference                    0.24    0.0117
$\nu_0^{(2)} = 1$            0.22    0.0073
conditional independence     0.18    0.0174

k = 3                        mean    variance
reference                    0.15    0.0030
constrained $\nu_0^{(3)}$    0.12    0.0002
conditional independence     0.096   0.0023

Table 5.1: Summary statistics for k = 1, 2, 3: spatial means and variances of the reference conditional proportions and of the approximations defined by the nu model and by the conditional independence assumption.
The $\nu_0^{(k)}$ model based on equation (5.20) ensures that:

$$P^*(A = 1 \mid D, B, C) + P^*(A = 2 \mid D, B, C) + P^*(A = 3 \mid D, B, C) = 1.$$

This consistency relation is not ensured by the conditional independence assumption: out of the corresponding 195 estimated probabilities, eight were greater than 1.
While the $\nu_0^{(k)}$ model based on equation (5.20) provides a significant improvement over the traditional estimator based on a conditional independence assumption, it still assumes no data interaction. We next look at the application of the classified $\nu_0^{(k)}$ approach proposed in Section 3.3.2.
5.2.5 Classified $\nu_0^{(k)}$ approach
The classified $\nu_0^{(k)}$ approach proposed in Section 3.3.2 can be summarized in two phases: the training phase and the application phase.
In the training phase we need to build a training data set mimicking (even if only roughly) the actual data interaction. From that set, we retrieve the training data values-dependent $\nu_0$-values, called proxy $\nu_0$-values. This calls for reducing each set of training data values to a few summary statistics, or filter scores. Based on these scores, we can classify the proxy $\nu_0$-values. Each class of data is associated with a single (average or median) $\nu_0$-value, called a "class $\nu_0$-prototype".
The application phase consists of returning to the actual study field, finding the training class closest to the actual conditioning data scores, retrieving that class "prototype" value, and finally using that $\nu_0$-value to combine the elementary probabilities. These elementary probabilities must be evaluated from the actual study field, not from the training data set.
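The two phases can be sketched as a nearest-prototype lookup; the score vectors, class centers, and function names below are illustrative assumptions.

```python
from math import dist
from statistics import mean

def train_prototypes(scores, proxy_nus, centers):
    """Training phase: assign each training data event's score vector to
    its nearest class center, then average the proxy nu values per class
    to obtain the class nu-prototypes."""
    labels = [min(range(len(centers)), key=lambda j: dist(s, centers[j]))
              for s in scores]
    return {lab: mean(nu for l, nu in zip(labels, proxy_nus) if l == lab)
            for lab in set(labels)}

def class_prototype(score, centers, protos):
    """Application phase: return the nu-prototype of the training class
    closest to the actual conditioning-data score vector."""
    lab = min(range(len(centers)), key=lambda j: dist(score, centers[j]))
    return protos[lab]
```

The prototype returned for the actual data scores then replaces $\nu_0 = 1$ in the combination formula (5.18).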
When evaluating the probabilities $P(A = k \mid D, B, C)$, for all $k = 1, 2, 3$, the three conditioning data events D, B, and C each comprise four categorical data points (refer to Figure 5.3 for the geometry of the three data events). There are only 195 data value combinations with at least 10 replicates for which we can reliably evaluate the corresponding exact conditional proportions.
Were the training data set identified with the reference data set (ideal training), the mean of the proxy $\nu_0^{(k)}$ values would be 1.36, 1.03 and 0.85 for $k = 1, 2, 3$, respectively. This represents a significant departure from the assumption of no data interaction (i.e. the $\nu_0^{(k)} = 1$ model). Note that in Chapter 4 of this thesis we dealt with a similar data set, except that it included only two categories. In that two-category case, for the evaluation of $P(A = 1 \mid D, B, C)$ the mean of the proxy $\nu_0$ values was equal to 2.21, a much greater deviation from 1 (hence from the no-data-interaction case) than the value 1.36 obtained for the ternary data set. By adding the third category, the actual data interaction was understated, because the interaction parameter $\nu_0^{(k)}$ has been evaluated by averaging over all data value combinations and all possible values of the unknown. Because of this severe averaging, we do not expect this particular $\nu_0$ classification to lead to results much better than the $\nu_0^{(1)} = \nu_0^{(2)} = 1$
model. Ideally, the classification of the training $\nu_0$ proxy values should differentiate the data values and also the unknown value. This would require, however, a much larger training image displaying enough replicates of all data values-dependent conditioning data events.
For this ternary data set, the interaction is greatest for the first category (i.e. $k = 1$), and we retained only 195 statistically significant data value combinations. Each conditioning data event consists of four points, and each of these points can take three values ($k = 1, 2$, or 3). When evaluating $P(A = 1 \mid D, B, C)$, we expected the data interaction to be greatest when all four points of each of the three conditioning data events D, B, C take the value 1. Similarly, when evaluating $P(A = 2 \mid D, B, C)$ we expect the interaction to be greatest when all four points of the three conditioning data events take the value 2. However, out of the 195 data value combinations retained, the case with all four points of the three conditioning data events equal to 2 was not found. The case with all four points equal to 1 was found, which explains why the average proxy $\nu_0^{(1)}$ value deviates from one more significantly than for categories $k = 2, 3$.
Despite its considerable averaging, consider the classified $\nu_0^{(k)}$ approach with a single statistic as data summary score. For each category $k$, we categorize the retained 195 data value combinations by first defining the binary indicator of each point data value as:

$$I^{(k)}(\mathbf{u}_j) = \begin{cases} 1 & \text{if } \mathbf{u}_j \in \text{category } k \\ 0 & \text{otherwise} \end{cases}$$

Then we define the score of the data events D, B, C as the average of their twelve point indicator data values; there are three such scores, one for each $k = 1, 2, 3$:

$$m^{(k)} = \frac{1}{12} \sum_{j=1}^{12} I^{(k)}(\mathbf{u}_j) \quad (5.22)$$

By mapping these 195 sets of three scores into the 3D score space we can determine
Figure 5.9: Classification of the scores $m^{(1)}, m^{(2)}, m^{(3)}$. Each axis represents one of the scores $m^{(1)}$, $m^{(2)}$, $m^{(3)}$, respectively. Note that the available 195 data value combinations cluster into only 11 points in the 3D $m^{(k)}$ space. The resulting classification of these 11 points is shown by different geometric shapes.
classes of similar scores by performing, for example, a k-means algorithm [59], which partitions the points into classes. Each training class prototype $\nu_0^{(k)}$ value is the average of the proxy $\nu_0^{(k)}$ values falling into that class. Figure 5.9 shows such a classification: each axis represents one of the scores $m^{(1)}$, $m^{(2)}$, $m^{(3)}$. The available 195 data value combinations cluster into only 11 points in the 3D $m^{(k)}$ space, and the resulting classification of these 11 points is shown by different geometric shapes. In this case, there are four classes, marked by stars, triangles, diamonds and squares, respectively.
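The score computation (5.22) and the clustering step can be sketched as follows; the tiny k-means below (fixed iteration count, first-distinct-points initialization) is a deliberate simplification of the algorithm of [59].

```python
from math import dist
from statistics import mean

def scores(data_values, K=3):
    """Score (5.22): m^(k) is the fraction of the pooled conditioning data
    points (the twelve points of D, B, C) falling in category k."""
    n = len(data_values)
    return tuple(sum(v == k for v in data_values) / n for k in range(1, K + 1))

def kmeans(points, k, iters=20):
    """Minimal k-means in score space."""
    centers = list(dict.fromkeys(points))[:k]   # first k distinct points
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        centers = [tuple(mean(c) for c in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers
```

Each resulting center defines one class; the class $\nu_0$-prototype is then the average of the proxy $\nu_0$ values of the class members, as described above.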
For each set of actual data values (i.e. of conditioning data values), we then look for the training class (out of four possible) that is closest to the actual data statistics, and use the corresponding class prototype $\nu_0$-value (instead of $\nu_0^{(k)} = 1$) for building the fully conditioned probability $P(A = k \mid D, B, C)$. To ensure consistent probabilities, i.e. that

$$P^*(A = 1 \mid D, B, C) + P^*(A = 2 \mid D, B, C) + P^*(A = 3 \mid D, B, C) = 1,$$

the consistency relation (5.11) is applied to modify the third interaction weight $\nu_0^{(3)}$. That is, the third weight $\nu_0^{(3)}$ is calculated from the $\nu_0^{(1)}$ and $\nu_0^{(2)}$ values as:

$$\nu_0^{(3)} = \frac{2 + S^{(1)} \nu_0^{(1)} + S^{(2)} \nu_0^{(2)}}{S^{(1)} S^{(2)} S^{(3)} \nu_0^{(1)} \nu_0^{(2)} - S^{(3)}} \quad (5.23)$$

where $S^{(1)}, S^{(2)}, S^{(3)}$ are defined according to relation (5.15).
Figure 5.10 shows the scattergrams of the estimates based on the classified $\nu_0^{(k)}$ approach versus the reference $P(A = k \mid D, B, C)$ for $k = 1, 2, 3$, for the 195 data value combinations retained. The correlation coefficients between the reference true proportions and the estimates based on the classified $\nu_0^{(k)}$ approach are equal to 0.70, 0.67, and 0.55 for $k = 1, 2, 3$, respectively.

The statistics for the reference proportions and for their classified $\nu_0^{(k)}$ estimates are given in Table 5.2.
The classified $\nu_0^{(k)}$ model is well correlated with the reference and, at the same time, its spatial mean and variance are close to the reference values.
Note that, without the constraint (5.23), the classified $\nu_0^{(k)}$ approach may lead to inconsistent probability estimates:

$$P^*(A = 1 \mid D, B, C) + P^*(A = 2 \mid D, B, C) + P^*(A = 3 \mid D, B, C) \neq 1.$$

Had we used the original exact training $\nu_0^{(k)}$ values, without any classification and consequent averaging, the consistency relation (5.23) would have been met exactly.
Figure 5.10: Scattergram of the reference proportion $P(A = k \mid D, B, C)$ (x-axis) versus the estimate $P^*(A = k \mid D, B, C)$ based on the classified $\nu_0^{(k)}$ model (y-axis), for $k = 1$ (left), $k = 2$ (center), $k = 3$ (right). The highest correlations are attributed to categories $k = 1$ and $k = 2$; the smallest correlation is attributed to category $k = 3$, which has less spatial structure than the other two categories.
5.2.6 Inference robustness
To test the robustness of the proposed inference paradigm, we now consider a training data set that is different from the reference data set. The reference data set provides the conditioning data and the exact conditional proportions; the training data set provides the proxy interaction parameter values $\nu_0^{(k)}$.
For this purpose we draw 50 new realizations, $S_0^{(l)}$, $l = 1, \ldots, 50$, using the same image generator code [56] previously used to generate the reference data set. These 50 training images are again of size 100x100x50. The averages of the means of the 50 eroded realizations $S_0^{(l)}$ are given in Table 5.3 for $k = 1, 2, 3$; the corresponding reference means over $S_0$ are also shown in that table.
k = 1                                  mean    variance
reference                              0.61    0.0161
classified $\nu_0^{(1)}$ model         0.60    0.0107

k = 2                                  mean    variance
reference                              0.24    0.0117
classified $\nu_0^{(2)}$ model         0.22    0.0073

k = 3                                  mean    variance
reference                              0.15    0.0030
constrained classified $\nu_0^{(3)}$   0.19    0.0011

Table 5.2: Summary statistics: spatial means and variances of the reference conditional probabilities and of the estimates based on a classified nu representation.
The single datum event-conditioned probabilities $P(A \mid B)$, $P(A \mid C)$, $P(A \mid D)$ are retrieved from the reference data set $S_0$ shown in Figure 5.4. The proxy $\nu_0^{(k)}$ values are retrieved from all 50 training realizations $S_0^{(l)}$ pooled into a single inference pool.
            k = 1   k = 2   k = 3
training    0.665   0.222   0.113
reference   0.636   0.240   0.115

Table 5.3: The average means of the 50 eroded training data sets for k = 1, 2, 3. For comparison, the bottom row shows the reference means.
In Table 5.4, we compare the spatial mean and variance of the 195 reference probability values $P(A \mid D, B, C)$ to the corresponding spatial statistics of the estimates based on the nu representation. We define the classified proxy $\nu_0^{(k)}$ model as the model based on the 50 training images, all different from the reference data set. In Table 5.5, the correlation coefficients of the 195 reference probability values $P(A \mid D, B, C)$ with the estimates based on the nu representation are given for $k = 1, 2, 3$.
k = 1                                        mean    variance
reference                                    0.61    0.0161
classified proxy $\nu_0^{(1)}$ model         0.61    0.0092

k = 2                                        mean    variance
reference                                    0.24    0.0117
classified proxy $\nu_0^{(2)}$ model         0.22    0.0072

k = 3                                        mean    variance
reference                                    0.15    0.0030
constrained classified proxy $\nu_0^{(3)}$   0.17    0.0005

Table 5.4: Summary statistics: spatial means and variances of the 195 reference conditional probabilities and of the estimates built from a nu representation.
Comparing these two tables with Figure 5.10 and Table 5.2, the classified proxy $\nu_0^{(k)}$ model appears quite robust, since the two tables provide very similar results. For example, wherever Table 5.4 overestimates (or underestimates) the spatial statistics, the same trend can be observed in Table 5.2. This allows us to conclude that, no matter how approximate the training image is, it significantly improves on the results provided by the $\nu_0^{(k)} = 1$ model, because such a training image provides insight into the data interaction present in the study field.
                                  k = 1   k = 2   k = 3
classified proxy $\nu_0^{(k)}$    0.69    0.66    0.52

Table 5.5: Correlations of the 195 reference proportion values P(A|D,B,C) with estimates based on a classified nu representation.
The key lesson learned from these case studies is that we must check the assumptions underlying any model (whether a no-data-interaction model or a model based on data independence). The applicability of each model ultimately depends on the physics of the data. For example, if conditional independence is inappropriate (as is often the case in geology-related applications), it should not be imposed for mere convenience, as it may lead to large bias and various order relation violations.
Chapter 6
Summary and conclusions
6.1 Summary of major theoretical developments
This thesis addresses the problem of integrating diverse data sources while accounting for the interaction between these data. We consider $n$ data events $D_1, \ldots, D_n$ that inform the same unknown event $A$. Each of the $n + 1$ events can be very complex, involving multiple locations in time and/or space. As an example, the unknown $A$ could be indicative of the presence of channel sand connecting two wells, data event $D_1$ the indicator of facies at these two wells, and data event $D_2$ the result of a seismic survey providing "soft" probabilities of the presence of channel at or around the same two wells.
In this thesis, we assume that each of the $n + 1$ events has been previously processed, providing the following probabilities:

1. the prior probability $P(A = a)$, available from historic data, and

2. the datum-specific conditional probabilities $P(A = a \mid D_i = d_i)$. Each of these probabilities captures the specific information about the unknown event $A$ brought by the datum event $D_i$ taken alone. This step is crucial. Many algorithms exist to process the information brought by a single individual data event into such conditioned probabilities $P(A = a \mid D_i = d_i)$, e.g. indicator kriging [34] and various regressions including neural networks ([3], [40], and [41], among others). However, the task of obtaining the probabilities $P(A = a \mid D_i = d_i)$ is out of the scope of this thesis.
The goal of this thesis is to combine the prior probability $P(A = a)$ and the $n$ individually conditioned probabilities $P(A = a \mid D_i = d_i)$ into an estimate or model of the fully conditioned probability $P(A = a \mid D_1 = d_1, \ldots, D_n = d_n)$:

$$P(A = a \mid D_1 = d_1, \ldots, D_n = d_n) = \varphi\big(P(A = a),\ P(A = a \mid D_i = d_i),\ i = 1, \ldots, n\big) \quad (6.1)$$
The exact combination function $\varphi$ depends on the data values $d_i$ and the unknown value $a$. Approximations amount to proposing functions $\varphi$ that are either independent of the data values or dependent only on a few summaries of these data values. One of our objectives was to decompose the task of determining the fully conditioned probability into two easier tasks: obtaining the $n$ individually conditioned probabilities $P(A = a \mid D_i)$ and the prior probability $P(A = a)$, and then combining these probabilities through a function of type (6.1) while accounting for data interaction.
The traditional approach to approximating the function (6.1) is built around the simplifying assumption of data conditional independence. This assumption states that the $n$ data events become independent of each other given knowledge of a specific realization/value of the unknown $A$. Let the notation $\mathbf{D}_{i-1}$ represent all data $D_1, \ldots, D_{i-1}$ in the sequence up to $D_i$ excluded. Then conditional independence between the data events $D_i$ and $\mathbf{D}_{i-1}$ given the unknown $A = a$ is written:

$$P(D_i = d_i \mid A = a, \mathbf{D}_{i-1} = \mathbf{d}_{i-1}) = P(D_i = d_i \mid A = a), \quad \forall i \quad (6.2)$$

The assumption of conditional independence between the $n$ data events $D_1, \ldots, D_n$ given the unknown $A = a$ leads to:

$$P(D_1 = d_1, \ldots, D_n = d_n \mid A = a) = \prod_{i=1}^{n} P(D_i = d_i \mid A = a) \quad (6.3)$$
However, this assumption of independence is rarely checked in practice. Geological data tend to be related through their common geological origin and should not a priori be considered as independent or conditionally independent. Typically, a datum event $D_i$ that is independent of all other $D_j$ is often also independent of $A$ and, therefore, should not be retained.

Accounting for data interaction is critical in any data integration algorithm. Such interaction between the data and the unknown may change the naive assessment made from an association of data ignoring their interaction. This interaction is typically data-values and unknown-value dependent, and requires considering all data jointly.
6.2 The nu expression: Theory
6.2.1 Tau expression
The nu expression proposed in this thesis expands on the tau model proposed by
Journel [48]. The tau model uses the well-known paradigm of permanence of ratios,
which states that ratios of probabilities are typically more stable than the probabilities
themselves.
The Bordley-Journel expression is written:

\frac{x}{x_0} = \prod_{i=1}^{n} \left( \frac{x_i}{x_0} \right)^{\tau_i}    (6.4)

where x_i, x, and x_0 are inverses of odds ratios, defined as datum-specific distances
to the unknown event A occurring. The fully conditioned distance
x = (1 − P(A | D_1, ..., D_n)) / P(A | D_1, ..., D_n) to the unknown event A is expressed in equation (6.4) as the
tau-weighted product of the elementary distances x_i = (1 − P(A | D_i)) / P(A | D_i) scaled by the prior
distance x_0 = (1 − P(A)) / P(A). All of these distances necessarily lie in the interval [0, ∞]. In his
work, Journel stopped short of providing the expression of the interaction weights τ_i.
Without such expression, or an approximation of it, the expression (6.4) can be seen only
as a heuristic model for the sought-after posterior probability P(A = a | D_1, ..., D_n).
The tau model remained a model until the contribution of Krishnan [50], who developed
the exact expression of the τ interaction weights. The central equation of
Krishnan's work is:

\tau_i(d_1, \ldots, d_n, a) = \frac{\log \left[ P(D_i = d_i \mid A = \text{nona}, D_{i-1} = d_{i-1}) \,/\, P(D_i = d_i \mid A = a, D_{i-1} = d_{i-1}) \right]}{\log \left[ P(D_i = d_i \mid A = \text{nona}) \,/\, P(D_i = d_i \mid A = a) \right]} \in [-\infty, +\infty], \quad \tau_1 = 1    (6.5)

where D_{i-1} represents all data D_1, ..., D_{i-1} in the sequence up to datum D_i excluded. Critically,
this τ_i expression is associated with a specific ordering of the n data events D_1, ..., D_n.
The consideration of a different data sequence results in different τ_i weights.
The denominator of the τ_i-expression (6.5) measures how datum D_i = d_i discriminates
the outcome A = a from nona. The numerator measures the same but in the
presence of the previous data D_{i-1} = d_{i-1} = {D_1 = d_1, ..., D_{i-1} = d_{i-1}}. Thus the
ratio (of ratios) τ_i indicates how the discrimination power of D_i = d_i is changed by
knowledge of the previous data D_{i-1} = d_{i-1} taken all together. In this regard, the τ_i
parameter can be seen as a data interaction measure.
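To make (6.5) concrete, the τ weights can be evaluated exactly whenever a full joint distribution of (A, D_1, D_2) is available. The sketch below uses a small hypothetical binary joint distribution; with n = 2 data events, τ_1 = 1 and only τ_2 needs computing:

```python
import math

# Hypothetical joint distribution P(A, D1, D2) over binary outcomes (sums to 1).
# A = 1 stands for "a", A = 0 for "nona".
p = {(1, 1, 1): 0.20, (1, 1, 0): 0.05, (1, 0, 1): 0.10, (1, 0, 0): 0.05,
     (0, 1, 1): 0.05, (0, 1, 0): 0.15, (0, 0, 1): 0.10, (0, 0, 0): 0.30}

def pr(pred):
    """Probability of the event selected by pred over (a, d1, d2) triples."""
    return sum(v for k, v in p.items() if pred(*k))

def tau2(d1, d2):
    """tau_2 per eq. (6.5): discrimination power of D2 with D1 known vs. without."""
    # numerator: log [ P(d2 | nona, d1) / P(d2 | a, d1) ]
    num = math.log((pr(lambda a, x, y: (a, x, y) == (0, d1, d2)) /
                    pr(lambda a, x, y: a == 0 and x == d1)) /
                   (pr(lambda a, x, y: (a, x, y) == (1, d1, d2)) /
                    pr(lambda a, x, y: a == 1 and x == d1)))
    # denominator: log [ P(d2 | nona) / P(d2 | a) ]
    den = math.log((pr(lambda a, x, y: a == 0 and y == d2) /
                    pr(lambda a, x, y: a == 0)) /
                   (pr(lambda a, x, y: a == 1 and y == d2) /
                    pr(lambda a, x, y: a == 1)))
    return num / den
```

For d_1 = d_2 = 1 this toy distribution gives τ_2 ≈ 1.06, slightly above 1: knowledge of D_1 here mildly strengthens the discrimination power of D_2.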
6.2.2 Nu expression
Krishnan's derivation (6.5) provided the exact expression of the conditional probability
P(A | D_1, ..., D_n) while accounting for data interaction through the weights τ_i.
These tau interaction weights, in addition to being dependent on the specific ordering
of the n data events D_1, ..., D_n, are data-values and unknown-value dependent.
While data-specific and data sequence-dependent interaction weights are important
in some applications, most often it suffices to evaluate the global or compound impact
of data interaction. Krishnan's derivation fails to provide a measure of such global
data interaction. Moreover, the τ_i weights are likely to be unstable versus data values
since, when the information is non-discriminating, the denominator of expression (6.5)
tends toward log 1 = 0, leading to an infinite tau weight (τ_i → ∞) and consequent
stability problems.
The nu expression proposed in this thesis aims to overcome these shortcomings of the
tau expression. This nu expression is written as:

\frac{x}{x_0} = \prod_{i=1}^{n} \nu_i \frac{x_i}{x_0} = \nu_0 \prod_{i=1}^{n} \frac{x_i}{x_0}, \quad \text{where } \nu_0 = \prod_{i=1}^{n} \nu_i \geq 0    (6.6)

with:

\nu_i = \frac{P(D_i \mid \bar{A}, D_{i-1}) \,/\, P(D_i \mid A, D_{i-1})}{P(D_i \mid \bar{A}) \,/\, P(D_i \mid A)}, \quad \text{with } \nu_1 = 1    (6.7)

where \bar{A} = nonA.
The expression (6.6) remains an exact representation of the posterior probability
P(A = a | D_1 = d_1, ..., D_n = d_n). However, compared to the tau expression (6.4), the
nu representation leads to a single, data sequence-independent, interaction parameter
ν_0. That single ν_0 interaction parameter is still data-values and unknown-value dependent,
as expected if the nu expression (6.6) is to be an exact representation of the
fully conditioned probability. The instability of the τ_i parameter, linked to division by
a log ratio possibly close to zero as in relation (6.5), does not affect the nu expression
(6.7).
Although the nu representation avoids some of the inference drawbacks of the tau
expression, the two representations are strictly equivalent through the parameter
relations:

\nu_i = \left( \frac{x_i}{x_0} \right)^{\tau_i - 1}, \quad i = 1, \ldots, n    (6.8)
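The exactness of the nu representation can be verified numerically on any fully specified toy joint distribution. A sketch with a hypothetical binary joint P(A, D_1, D_2); since ν_1 = 1 by construction, ν_0 = ν_2 here:

```python
# Hypothetical joint P(A, D1, D2); A = 1 is "a", A = 0 is "nona".
p = {(1, 1, 1): 0.20, (1, 1, 0): 0.05, (1, 0, 1): 0.10, (1, 0, 0): 0.05,
     (0, 1, 1): 0.05, (0, 1, 0): 0.15, (0, 0, 1): 0.10, (0, 0, 0): 0.30}

def pr(pred):
    """Probability of the event selected by pred over (a, d1, d2) triples."""
    return sum(v for k, v in p.items() if pred(*k))

def dist(p_a):
    """Distance x = (1 - P) / P associated with a probability P."""
    return (1 - p_a) / p_a

d1, d2 = 1, 1
x0 = dist(pr(lambda a, x, y: a == 1))                                   # prior distance
x1 = dist(pr(lambda a, x, y: a == 1 and x == d1) / pr(lambda a, x, y: x == d1))
x2 = dist(pr(lambda a, x, y: a == 1 and y == d2) / pr(lambda a, x, y: y == d2))
x_exact = dist(pr(lambda a, x, y: (a, x, y) == (1, d1, d2)) /
               pr(lambda a, x, y: x == d1 and y == d2))                 # full posterior

# nu_2 per eq. (6.7): ratio with D1 known over ratio without; nu_1 = 1.
num_ratio = (pr(lambda a, x, y: (a, x, y) == (0, d1, d2)) /
             pr(lambda a, x, y: a == 0 and x == d1)) / \
            (pr(lambda a, x, y: (a, x, y) == (1, d1, d2)) /
             pr(lambda a, x, y: a == 1 and x == d1))
den_ratio = (pr(lambda a, x, y: a == 0 and y == d2) / pr(lambda a, x, y: a == 0)) / \
            (pr(lambda a, x, y: a == 1 and y == d2) / pr(lambda a, x, y: a == 1))
nu2 = num_ratio / den_ratio

# Eq. (6.6): x / x0 = nu0 * (x1/x0) * (x2/x0), with nu0 = nu1 * nu2 = nu2.
x_nu = x0 * nu2 * (x1 / x0) * (x2 / x0)
```

On this example x_nu matches x_exact to machine precision, recovering the exact posterior P(A = a | d_1, d_2) = 1 / (1 + x_exact).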
Characteristic unit nu values
• If ν_i = 1, the ability of the datum (or data event) D_i = d_i to discriminate
a from nona is unchanged by knowledge of the previous (i − 1) data events
d_{i-1} = {D_1 = d_1, ..., D_{i-1} = d_{i-1}}. Therefore ν_i = 1 can be seen as the case
of "non-interaction" of the two data events D_i and D_{i-1} when it comes to
discriminating a from nona. The deviation |1 − ν_i| is thus a measure of data
interaction. That measure is (a, d_j, j = 1, ..., i) values-dependent. Similarly,
when considering the single ν_0 parameter, the deviation |1 − ν_0| is a measure of
global data interaction. Note that ν_0 = 1 does not require that all ν_i = 1.
For example, ν_1 = 1, ν_2 = 3 and ν_3 = 1/3 would result in ν_0 = 1 · 3 · (1/3) = 1. In
other words, different elementary data interactions (ν_i ≠ 1) may cancel out into
no global data interaction (ν_0 = 1). The major contribution of the nu expression
is that it accounts not only for the elementary data interactions but also for the
global interaction as measured by |1 − ν_0|.
• ν_i = 1 in equation (6.7) requires that the two ratios P(D_i | Ā, D_{i-1}) / P(D_i | A, D_{i-1}) and P(D_i | Ā) / P(D_i | A) be
equal to each other. The traditional hypothesis of conditional independence of
D_i and D_{i-1} given only A = a, as in expression (6.2), does not suffice to ensure
ν_i = 1. Conversely, ν_i = 1 does not necessarily imply conditional independence
given only A = a or A = nona. It was shown in the text that data independence,
data conditional independence and data non-interaction (ν_i = 1) are different
states.
6.3 Approximations of the nu representation
The nu representation (6.6) is an exact representation of the fully conditioned posterior
probability P(A = a | D_1 = d_1, ..., D_n = d_n). Critically, the nu parameter values
are data-values and unknown-value dependent. While such form of dependence allows
for an exact representation of the fully conditioned probability, it is impractical. This
calls for approximations built from the exact nu expression (6.7).
The ν_0 = 1 model
One straightforward approximation of the fully conditioned posterior probability P(A | D_1, ..., D_n)
is to assume global cancellation of the individual data interactions by setting ν_0 = 1. The
estimate of the posterior probability P(A | D_1, ..., D_n) under the ν_0 = 1 model is then
written:

\frac{x}{x_0} = \prod_{i=1}^{n} \frac{x_i}{x_0}
This model is remarkably simple, yet it is more comprehensive than the traditional
models stemming from various assumptions of data independence. For example, conditional
independence of D_i and D_{i-1} (as defined in equation (6.2)) given only A = a
does not suffice to ensure ν_0 = 1. The ν_0 = 1 model ignores the interactions between
data, not because they do not exist but because they are assumed to globally cancel
out.
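Operationally, the ν_0 = 1 estimate reduces to multiplying scaled elementary distances. A minimal sketch; the prior and elementary probabilities below are hypothetical inputs:

```python
import math

def combine_nu0_one(p_prior, p_elem):
    """Posterior P(A | D1, ..., Dn) under the nu_0 = 1 model:
    x / x0 = prod_i (x_i / x0), with x = (1 - P) / P for each probability."""
    x0 = (1 - p_prior) / p_prior
    x = x0 * math.prod(((1 - p) / p) / x0 for p in p_elem)
    return 1 / (1 + x)

# Two data events, each individually raising P(A) from the 0.5 prior to 0.8 and 0.7:
posterior = combine_nu0_one(0.5, [0.8, 0.7])
```

With a non-informative prior of 0.5, the elementary probabilities 0.8 and 0.7 combine into a posterior of about 0.90, more extreme than either datum alone, as expected when two data corroborate the same event.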
Approximations are made possible by the fact that, in the nu expression, data interaction
has been completely separated from individual data information; the single
parameter ν_0 suffices to measure the compound or global data interaction. That
global data interaction can be obtained from a training image where a similar interaction
is observed. Because of the previous separation, that training image can be
different from the actual set from which the actual data originate.
Classified ν_0 model
We suggest that the single global interaction parameter ν_0 be obtained (lifted) from
a training image. The training image provides proxy replicates of the interaction
between the actual data and the unknown. We stress that the training image is
used only for retrieval of the global interaction parameter ν_0. The individual
probabilities or, equivalently, the distances involved, should be obtained from the actual
data. This is very similar to a kriging application where the linear dependence model,
or covariance, between data can be obtained from an outcrop while the data used for
the estimated value originate from the real field. With a paradigm similar to that of
kriging, one distinguishes:
• the information content, which is obtained through the actual data
• the interaction parameter ν_0, which can be obtained from training images
The algorithm for obtaining the posterior probability P(A | D_1, ..., D_n) comprises two
distinct phases, the training phase and the application phase. In the training phase:
1. Find or build a training data set that approximates the actual data interaction.
From that set, retrieve the training data-values dependent ν_0-values, called
proxy ν_0-values.
2. Summarize each set of training data values into a few filter scores. Based on
these scores, classify the proxy ν_0-values. Each class is identified by a single
(average or median) ν_0-value, called a "class ν_0-prototype".
In the application phase:
1. Find the training class closest to the scores of the actual conditioning data.
2. Retrieve that class "prototype" value ν_0.
3. Use that ν_0 value to combine the elementary probabilities. Once again, these
elementary probabilities must be evaluated from the actual study field, not from
the training data set.
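The two phases can be sketched as follows; the training catalog, the filter scores, and the crude classification rule are all hypothetical stand-ins for what a real training image and filter set would provide:

```python
import math

# --- Training phase (hypothetical catalog lifted from a training image) ---
# Each record: (filter scores of the data configuration, proxy nu_0 value).
catalog = [((0.1, 0.9), 0.8), ((0.2, 0.8), 0.9),
           ((0.9, 0.1), 1.6), ((0.8, 0.2), 1.4)]

def build_prototypes(catalog):
    """Group proxy nu_0 values into score-based classes (here crudely: first
    score rounded to 0 or 1) and summarize each class by its centroid of
    scores and its mean nu_0 (the "class nu_0-prototype")."""
    classes = {}
    for scores, nu in catalog:
        classes.setdefault(round(scores[0]), []).append((scores, nu))
    prototypes = {}
    for key, members in classes.items():
        centroid = tuple(sum(s[i] for s, _ in members) / len(members)
                         for i in range(len(members[0][0])))
        prototypes[key] = (centroid, sum(nu for _, nu in members) / len(members))
    return prototypes

# --- Application phase ---
def lookup_nu0(prototypes, actual_scores):
    """Return the class nu_0-prototype nearest to the actual data scores."""
    _, nu0 = min(prototypes.values(),
                 key=lambda c: math.dist(c[0], actual_scores))
    return nu0

nu0 = lookup_nu0(build_prototypes(catalog), (0.85, 0.15))
```

The retrieved nu0 (here 1.5, the mean of the high-score class) would then scale the product of elementary distances in (6.6); those elementary distances must still be evaluated from the actual field data, not from the training set.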
6.4 Final conclusions
In this section, we summarize the key results and findings presented in this thesis.
Synthetic case studies were used to check the conclusions hereafter. The key results
are:
• An analytical expression (the nu expression) was established to account for
data interactions. This methodology is based on an effective separation of data
information content and data interaction.
• The nu expression is the sister of the previously proposed tau expression. The
tau expression also allowed for the separation of information content and
individual data interaction. The nu expression reduces these complex individual
data interactions to a single global interaction parameter ν_0 which is both
data-values and unknown-value dependent. The tau expression did not reveal
such a global parameter.
• The extremely concise and simple ν_0 = 1 model provides favorable results when
compared to estimates obtained from traditional data independence assumptions.
While the ν_0 = 1 model assumes no data interaction, that hypothesis is less
demanding than either data independence or data conditional independence.
• The classified ν_0 model handles effectively the problem of data integration by
borrowing the global data interaction (ν_0 ≠ 1) from a training image mimicking,
even only roughly, that data interaction.
6.5 Future work
This study has allowed us to draw several conclusions which may have a significant
impact on how the problem of data integration is addressed in the future.
Combining the contributions of data events from different sources is not a simple
task. Each of these data events could be quite complex, and their contributions are
generally related to one another; that is, any data event can influence the information
brought by another data event or combination of other data events.
The remarkably simple ν_0 model verifies all limit properties of probabilities and has
been shown in this thesis to consistently outperform traditional models based on various
forms of data independence. The contribution of the nu representation compared
to the tau representation is to express the impact of joint data interaction by a single
correction parameter ν_0 whose known exact expression can lead to case-dependent
approximations. We suggest an inference paradigm where the ν_0-values would be
obtained from a catalog of "similar" cases with approximately similar data configurations
and data values. The future practicality of the nu model will depend on our
ability to generate such proxy training data from which to export the ν_0-parameters.
The theoretical developments have led to the definition of a measure of data interaction
given a particular estimation goal. Joint no-data-interaction, i.e. the ν_0 = 1 model,
is a concept much richer and more useful than that of data conditional independence.
Understanding data interaction is critical to a correct utilization of diverse data originating
from a variety of sources of unequal reliability, carrying overlapping and sometimes
contradictory information. In an era of simulation made possible by high-power
computers, it is time to graduate from the hypotheses of data independence or conditional
independence which are hidden behind many traditional statistical prediction
algorithms. Covariance analysis, including principal component and kriging analysis,
would at best remove linear two-point correlation; it does not (cannot) address the
problem of joint, multiple-point, multiple data events dependence, with that joint
interaction itself dependent on the specific unknown being assessed.
A rigorous and complete solution calls for the joint, thus multivariate, distribution of
the random variables modeling the unknown and the various data events retained.
An illusion of rigor could be obtained by adopting some Gaussian-related multivariate
distribution; this is but a diversion which allows reverting back to the previous
covariance analysis with all its limitations. It is time to acknowledge fully that spatial
distributions, at least at any scale larger than molecular, are not Gaussian: earth
sciences distributions are neither Gaussian nor even symmetric, dependence is rarely
if ever linear, error variances are not homoscedastic (i.e. independent of the signal
value), data interact all together not two by two, etc. We suggest that modern computing
ability (hardware and software) allows us to contemplate generating proxy
joint realizations of the set (unknown + data), that is, in fact, a non-parametric, non-analytical
multivariate distribution model. The data interaction can be understood
and modeled from these proxy realizations (training sets) and exported to the actual
field under study, just like one would borrow a covariance model from a proxy outcrop
or proxy data set. However, the actual individual data information content still has
to be evaluated directly from the actual field under study, just like one would not
borrow local information (sample values) from an outcrop.
This study suggests decomposing the difficult task of evaluating joint data information
content into:
1. the easier task of evaluating each single data event information content;
2. recombining the previous elementary data event information into a joint data
information content using data interaction parameters (the ν-parameters) inferred
from proxy training sets chosen to mimic the actual data interaction.
We suggest that the generation of such proxy data interaction realizations, no matter
how approximate, is better than adopting unrealistic, non-physical hypotheses of
data (conditional) independence, even if hidden under the screen of Gaussian distributions.
This calls for a case-specific, comprehensive review of data interaction and
learning what aspects of interaction are important and what others could be disregarded.
Such understanding would require future research to go beyond the synthetic
case studies presented in this thesis and dive into real datasets, where more often
than not the processes are driven by a complex physical background.
One characteristic of the earth sciences is that there usually exists prior knowledge about
these physical processes, and our prediction models should build on such priors. The
concept of training image(s) and data interaction parameters lifted from such training
images allows using such priors. Of course, there is uncertainty about any prior, but then
several different priors, aka training images, can be considered. How different
training images should be built and then weighted toward the final probabilistic decision
remains a research subject.
Bibliography
[1] Abidi, M. A., and Gonzalez, R. C., 1992, Data fusion in robotics and machine intelligence, San Diego, CA, Academic Press.
[2] Agnew, C. E., 1985, Multiple probability assessments by dependent experts, Journal of the American Statistical Association, v. 80, p. 343-347.
[3] Arribas, J. I., Cid-Sueiro, J., Adali, T., and Figueiras-Vidal, A. R., 1999, Neural networks to estimate ML multi-class constrained conditional probability density functions, Neural Networks, v. 2, p. 1429-1432.
[4] Bates, J. M., and Granger, C. W. J., 1969, The combination of forecasts, Operational Research Quarterly, v. 20, p. 451-467.
[5] Benediktsson and Swain, 1992, Consensus theoretic classification methods, IEEE Trans. Systems, Man and Cybernetics, v. 22, p. 688-704.
[6] Bordley, R. F., 1982, A multiplicative formula for aggregating probability assessments, Management Science, v. 28, no. 10, p. 1137-1148.
[7] Budyko, M. I., 1969, The effect of solar radiation variations on the climate of the Earth, Tellus, v. 21, p. 611-619.
[8] Butz, C. J., and Sanscartier, M. J., 2002, Properties of weak conditional independence, 3rd International Conference on Rough Sets and Current Trends in Computing (RSCTC02), p. 349-356.
[9] Bunn, D. W., 1981, Two methodologies for the linear combination of forecasts, The Journal of the Operational Research Society, v. 32, no. 3, p. 213-222.
[10] Cheng, J., and Greiner, R., 2001, Learning Bayesian belief network classifiers: algorithms and system, Advances in Artificial Intelligence: 14th Biennial Conference of the Canadian Society for Computational Studies of Intelligence, Ottawa, Canada.
[11] Deutsch, C., and Journel, A. G., 1998, GSLIB: Geostatistical software library and user's guide, Oxford University Press, New York, p. 87.
[12] Drton, M., Andersson, S. A., and Perlman, M. D., 2006, Conditional independence models for seemingly unrelated regressions with incomplete data, Journal of Multivariate Analysis, v. 97, no. 2, p. 385-411.
[13] Caers, J., and Hoffman, T. B., 2006, The probability perturbation method: an alternative Bayesian approach for solving inverse problems, Math. Geol., v. 38, no. 1, p. 81-100.
[14] Causeur, D., and Dhorne, T., 2003, Linear regression models under conditional independence restrictions, Scandinavian Journal of Statistics, v. 30, no. 3, p. 637-650.
[15] Clemen, R. T., 1987, Combining overlapping information, Management Science, v. 33, no. 3, p. 373-380.
[16] Clemen, R. T., and Reilly, T., 1999, Correlations and copulas for decision and risk analysis, Management Science, v. 45, no. 2, p. 208-224.
[17] Chatterjee, N., and Carroll, R. J., 2005, Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies, v. 92, no. 2, p. 399-418.
[18] Dawid, A. P., 1979, Conditional independence in statistical theory, Journal of the Royal Statistical Society, Series B (Methodological), v. 41, no. 1, p. 1-31.
[19] Downs, G. W., and Rocke, D. M., 1979, Interpreting heteroscedasticity, American Journal of Political Science, v. 23, no. 4, p. 816-828.
[20] Dickinson, J. P., 1973, Some statistical results in the combination of forecasts, Operational Research Quarterly, v. 24, p. 253-256.
[21] Dickinson, J. P., 1975, Some comments on the combination of forecasts, Operational Research Quarterly, v. 26, p. 205-210.
[22] Dubrule, O., 1983, Two methods with different objectives: splines and kriging, Math. Geology, v. 15, no. 2, p. 245-257.
[23] Fleiss, J. L., 1981, Statistical methods for rates and proportions, 2nd Ed., New York, John Wiley.
[24] Freeling, A. N., 1981, Reconciliation of multiple probability assessments, Organizational Behavior and Human Performance, v. 28, p. 395-414.
[25] French, S., 1980, Updating of belief in the light of someone else's opinion, Journal of the Royal Statistical Society A, v. 143, p. 43-48.
[26] Friedman, N., Nachman, I., and Pe'er, D., 2000, Using Bayesian networks to analyze expression data, Journal of Computational Biology, v. 7 (3-4), p. 601-620.
[27] Fuchs, C., and Greenhouse, J. B., 1988, The EM algorithm for maximum likelihood estimation in the mover-stayer model, Biometrics, v. 44, no. 2, p. 605-613.
[28] Galton, F., 1894, Natural Inheritance (5th ed.), Macmillan and Company, New York.
[29] Gelfand, A. E., and Smith, A. F. M., 1990, Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association, v. 85, p. 398-409.
[30] Genest, C., and McConway, K. J., 1990, Allocating the weights in the linear opinion pool, Journal of Forecasting, v. 9, p. 53-73.
[31] Genest, C., and Zidek, J. V., 1986, Combining probability distributions: a critique and an annotated bibliography, Statistical Science, v. 1, p. 114-118.
[32] Roback, P. J., and Givens, P. J., 2001, Supra-Bayesian pooling of priors linked by deterministic simulation model, Communications in Statistics, v. 30, no. 3, p. 447-476.
[33] Goovaerts, P., 1997, Geostatistics for natural resources evaluation, Oxford Press.
[34] Goovaerts, P., 1994, Comparative performance of indicator algorithms for modeling conditional probability distribution functions, Math. Geology, v. 26, no. 3, p. 389-411.
[35] Guardiano, F., and Srivastava, M., 1992, Multivariate geostatistics: beyond bivariate moments, in A. Soares (Ed.), Geostatistics-Troia, p. 133-144, Kluwer Academic Publ., Dordrecht.
[36] Haldorsen, H. H., and Lake, L. W., 1984, A new approach to shale management in field-scale models, SPE Jour., v. 24, no. 8, p. 447-452.
[37] Halperin, M., 1961, Almost linearly optimum combination of unbiased estimators, American Statistical Association Journal, p. 36-43.
[38] Hanushek, E., and Jackson, J. E., 1977, Statistical methods for social scientists, New York, Academic Press.
[39] Harrison, J. M., 1977, Independence and calibration in decision analysis, Management Science, v. 24, no. 3, p. 320.
[40] Husmeier, D., 1999, Neural networks for conditional probability estimation: forecasting beyond point predictions (Perspectives in Neural Computing), Springer.
[41] Husmeier, D., and Taylor, J. G., 1998, Neural networks for predicting conditional probability densities: improved training scheme combining EM and RVFL, Neural Networks, v. 11, no. 1, p. 89-116.
[42] Kmenta, 1971, Elements of econometrics, New York, Macmillan.
[43] Jacobs, R. A., 1995, Methods for combining experts' probability assessments, Neural Computation, v. 7, p. 867-888.
[44] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E., 1991, Adaptive mixtures of local experts, Neural Computation, v. 3, p. 79-87.
[45] Journel, A. G., 1983, Non parametric estimation of spatial distributions, Math. Geology, v. 15, no. 3, p. 793-806.
[46] Journel, A. G., 1989, Fundamentals of geostatistics in five lessons, Short Course in Geology, American Geophysical Union, Washington, D. C., v. 8.
[47] Journel, A. G., 1992, Geostatistics: roadblocks and challenges, in A. Soares (Ed.), Geostatistics-Troia, Kluwer Academic Publ., Dordrecht, p. 133-144.
[48] Journel, A. G., 2002, Combining knowledge from diverse information sources: an alternative to Bayesian analysis, Math. Geology, v. 34, no. 5, p. 573-598.
[49] Journel, A. G., and Huijbregts, C. J., 1978, Mining Geostatistics, Academic Press, New York, p. 566.
[50] Krishnan, S., 2005, Combining diverse and partially redundant information in the earth sciences, Ph.D. thesis, Department of Geological and Environmental Sciences, Stanford University.
[51] Lindley, D. V., 1979, In discussion of Dawid's paper: conditional independence in statistical theory, v. 41, no. 1, p. 15-16.
[52] Lindley, D. V., 1982, The improvement of probability judgments, Journal of the Royal Statistical Society A, v. 145, p. 117-126.
[53] Lindley, D. V., 1985, Reconciliation of discrete probability distributions, in J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics 2, Amsterdam, North-Holland.
[54] Lindley, D. V., 1988, The use of probability statements, in C. A. Clarotti and D. V. Lindley (Eds.), Accelerated Life Testing and Experts' Opinions in Reliability, Amsterdam, North-Holland.
[55] Lindley, D. V., Tversky, A., and Brown, R. V., 1979, On the reconciliation of probability assessments, Journal of the Royal Statistical Society A, v. 142, p. 146-180.
[56] Maharaja, A., 2007, Global net-to-gross uncertainty assessment at reservoir appraisal stage, Ph.D. thesis, Department of Geological and Environmental Sciences, Stanford University.
[57] Markov, A. A., 2006, An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains, trans. David Link, Science in Context, v. 19, no. 4, p. 591-600.
[58] McLaren, A. D., 1979, In discussion of Dawid's paper: conditional independence in statistical theory, v. 41, no. 1, p. 16-17.
[59] MacQueen, J. B., 1967, Some methods for classification and analysis of multivariate observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, v. 1, p. 281-297.
[60] Morris, P. A., 1974, Decision analysis expert use, Management Science, v. 20, p. 1233-1241.
[61] Omre, H., and Tjelmeland, H., 1997, Petroleum geostatistics, in Baafi, E. Y., and Schofield, N. A. (Eds.), Geostatistics Wollongong '96, v. 1, Kluwer, Dordrecht, The Netherlands, p. 41-52.
[62] Park, N. W., and Chi, K. H., 2003, A probabilistic approach to predictive spatial data fusion for geological hazard assessment, Geoscience and Remote Sensing Symposium, IGARSS Proceedings, IEEE International, v. 4, no. 21-25, p. 2425-2427.
[63] Pearl, J., 1988, Probabilistic reasoning in intelligent systems: networks of plausible inference, San Mateo, CA, Morgan Kaufmann Publishers.
[64] Polyakova, E. I., and Journel, A. G., 2007, The Nu expression for probabilistic data integration, Math. Geology, v. 39, no. 8, p. 715-733.
[65] Rabe-Hesketh, S., Skrondal, A., and Pickles, A., 2004, Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects, Journal of Econometrics, v. 128, no. 2, p. 301-323.
[66] Remy, N., 2004, The Stanford geostatistical modeling software (S-Gems), SCRF Lab, Stanford University.
[67] Ross, S., 1998, A first course in probability, Prentice Hall, Upper Saddle River, New Jersey.
[68] Specht, D., 1990, Probabilistic neural networks, Neural Networks, v. 3, p. 109-118.
[69] Stone, M., 1961, The opinion pool, Annals of Mathematical Statistics, v. 32, p. 1339-1342.
[70] Strebelle, S., 2001, Conditional simulation of complex geological structures using multiple-point statistics, Math. Geol., v. 34, no. 1, p. 1-22.
[71] Tarantola, A., 2004, Inverse problem theory and model parameter estimation, SIAM (in press).
[72] Theil, H., 1971, Principles of econometrics, New York, Wiley.
[73] Vinnikov, K. Y., Gruza, G. V., Zakharov, V. F., Kirillov, A. A., Kovyneva, N. P., and Ran'kova, E. Y., 1980, Recent climatic changes in the Northern Hemisphere, Soviet Meteorology and Hydrology, v. 6, p. 1-10.
[74] Zhang, T., Switzer, P., and Journel, A. G., 2006, Filter-based classification of training image patterns for spatial simulation, Math. Geol., v. 38, no. 1.
[75] Winkler, R. L., 1981, Combining probability distributions from dependent information sources, Management Science, v. 27, p. 479-488.
[76] Wong, S. K. M., Butz, C. J., and Wu, D., 2000, On implication problem for probabilistic conditional independence, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, v. 30, no. 6, p. 785-805.