3D Scene interpretation by combining probability theory and logic: The tower of knowledge

Computer Vision and Image Understanding 115 (2011) 1581–1596

Contents lists available at SciVerse ScienceDirect

Computer Vision and Image Understanding

journal homepage: www.elsevier .com/ locate/cviu

3D Scene interpretation by combining probability theory and logic: The tower ofknowledge

Mai Xu, Maria Petrou ⇑Electrical and Electronic Engineering Department, Imperial College London, South Kensington, SW7 2AZ London, UK

a r t i c l e i n f o

Article history:Received 12 April 2011Accepted 7 August 2011Available online 22 August 2011

Keywords:Scene labelling systemsLogic and probabilitiesMachine learningSystem architecture

1077-3142/$ - see front matter � 2011 Elsevier Inc. Adoi:10.1016/j.cviu.2011.08.001

⇑ Corresponding author.E-mail address: [email protected] (M. P

a b s t r a c t

We explore a newly proposed system architecture, called tower of knowledge (ToK), in the context oflabelling components of building scenes. The ToK architecture allows the incorporation of statistical fea-ture distributions and logic rules concerning the definition of a component, within a probabilistic frame-work. The maximum likelihood method of label assignment is modified by being multiplied with afunction, called utility function, that expresses the information coming from the logic rules programmedto the system. The logic rules are designed to define an object/component by answering the questions‘‘why’’ and ‘‘how’’, referring to the actions in which a particular object may be observed to participateand the characteristics it should have in order to be able to participate in these actions. Two sets of mea-surements are assumed to be available: those made initially for all components routinely, and which sup-ply the initial statistically based inference of possible labels of each component, and those that are madein order to confirm or deny a particular characteristic of the component that would allow it to participatein a specific action. A recursive version of the architecture is also proposed, in which the distributions ofthe former types of measurement may be learnt in the process, having no training data at all. Multi-viewimages are used as input to the system, which uses standard techniques to build the 3D models of thebuildings. The system is tested on labelling the components of 10 3D models of buildings. The compo-nents are identified either manually, or fully automatically. The results are compared with those obtainedby expandable Bayesian networks. The recursive version of ToK proves to be able to cope very well evenwithout any training data, where it learns the characteristics of the various components by simply apply-ing the pre-programmed logic rules that connect labels, actions and attributes.

� 2011 Elsevier Inc. All rights reserved.

1. Introduction

For several years, probabilistic approaches, originating in thepattern recognition community, and logic based approaches, orig-inating in the artificial intelligence community, were used indichotomy. Recently, it has been recognised that a combinationof these approaches may prove very useful in computer vision[36,42]. In particular, it has been realised that statistical (and byextension probabilistic) reasoning on objects may best be inferredvia semantic relationships between the objects [8,41], and thatdynamic scenes with observed relations and actions in temporalsequences may help in cognitive tasks [38,15]. The Bayesian frame-work has solid theoretical foundations and its combination withlogic is expected to be a very powerful tool for building artificialcognitive systems. However, there is strong evidence that humancognition and reasoning, particularly in uncertain situations, isnot based on probabilities. This is the reason alternative ap-proaches have been developed, like fuzzy logic [59], utility theory

ll rights reserved.

etrou).

[13] and naturalistic decision making [58]. In this paper, a new sys-tem architecture is utilised, inspired by linguistic concepts whichmay be used to guide us in the way human reasoning is performed.The proposed architecture combines logic with the statistical prop-erties of object descriptions, and uses a utility function to combinethe two. It is exemplified by being applied to a 3D scene labellingproblem.

In particular, we explore a novel scheme, first proposed in Ref.[41], for organising knowledge, called the tower of knowledge(ToK). The scheme assumes separate networks of labels, function-alities (otherwise known as ‘‘affordances’’) and descriptors. Eachof these networks may be modelled, for example, by an MRF whichmay even be hierarchical. The networks of functionalities anddescriptors influence the network of labels. These influences maybe learnt from many examples of observed data or they may beprogrammed into the system in the form of a utility functionproposed in this paper. A utility function is a function that maybe defined to express personal preferences or subjective opinions,or factors that influence one’s decision without being justifiable byprobability theory. Thus, the proposed system may use conven-tional probabilistic approaches in each of its levels, conceptual

http://dx.doi.org/10.1016/j.cviu.2011.08.001

mailto:[email protected]

http://dx.doi.org/10.1016/j.cviu.2011.08.001

http://www.sciencedirect.com/science/journal/10773142

http://www.elsevier.com/locate/cviu

1582 M. Xu, M. Petrou / Computer Vision and Image Understanding 115 (2011) 1581–1596

logic models to create connections between the different layers,and a utility function with which to modulate the likelihood func-tion for a label of an object, in order to assign labels to objects. Ourapplication domain is the recognition of components in 3D modelsof buildings. This is an important research area, with applications,for example, to 3D city modelling and to navigation aids in a builtenvironment for the visually impaired.

The rest of this paper is organised as follows. In Section 2, wereview related work on scene labelling systems and on machinelearning. We then present the tower of knowledge (ToK) in Section3. Section 4 proposes the method for scene labelling using the ToKarchitecture. A recursive version of ToK is proposed in Section 5.Experimental results are presented in Section 6. Finally, we con-clude in Section 7.

2. Related work

There has been a great deal of interest in setting up scene label-ling systems using, for example, sets of aerial images. Marengoniet al. [33] devised the Ascender II system, capable of identifyingand recognising 3D objects from aerial images. In their system,Bayesian networks and utility theory were used to automate therecognition in aerial images, taking into consideration the variousuncertainties in the data and in the process.

Kim and Nevatia developed an Automatic Building Extractionand Reconstruction System (ABRES) which was used for detectingand describing compositions of buildings with flat or complex roof-tops from multiple aerial images [27]. Probabilistic reasoning, le-vel-details and expandable Bayesian networks were used torecognise the final models, given a set of multiple view images.Since only the roof details are visible from aerial images, groundview images were needed to provide the facade details for buildingreconstruction. So, an advanced system combing aerial and groundimages was proposed in Refs. [30] and [31], to label and recon-struct 3D models of buildings.

As it is not easy and rather expensive to obtain aerial images,from 2000 onwards a great number of papers were published onthe design of labelling systems from multiple ground view uncali-brated images. In 2004, a system based on Bayesian methods wasapplied by Dick et al. [9,10] for acquiring 3D architectural modelsfrom a set of ground images. Since scene labelling, such as recogn-ising windows and doors, is an important step towards 3D modelacquisition, this system processes short uncalibrated image se-quences from multiple ground views to label and model the build-ing architecture in which various low-level features have alreadybeen extracted. A Monte Carlo Markov Chain (MCMC) is used foroptimisation of structure and verification of prior knowledge onthe wall layout or parameters (such as the labels of windows anddoors) of related primitives learnt from training data sets and hu-man experts.

In 2005, Hotz and Neumann proposed the SCENIC system [21]with the aim to interpret real-life scenes. The general approach ta-ken by this system is the configuration technology, a frameworkthat makes it possible to navigate in a space of logically consistentinterpretations.

Moving from systems to actual algorithms, there is a consider-able number of publications on scene labelling. This includes anearly survey of 3D object recognition [2] and a more recent reviewon object recognition in outdoor scenes [3].

Neumann and Möller [37] developed Description Logic (DL) as apotential reasoning tool and a knowledge representation systemfor scene labelling. Ekvall and Kragic used Programming by Dem-onstration (PbD) [11] for dynamic scene labelling in their robot.Another algorithm is based on the informative local features ap-proach using a decision-tree [14]. Hudelot et al. [22] introduced

a support vector machine as another method for learning in scenelabelling. However, a large body of research focuses on the use ofneural networks or advanced neural networks for scene labelling[44,55,49]. A representative work using a neural network is evolu-tionary optimisation (ENN) [48]. In this work, a neural network iscombined with system design optimisation to label the visual ob-jects around the real world.

Recently, a growing trend of scene labelling has focused onsome graphical probabilistic models such as Bayesian networks[50,12,26,5,39,37] and Markov random fields (MRF) or Conditionalrandom fields (CRF) [4,28,19,20,29]. Bayesian approaches havebeen applied to scene labelling so far in two ways: either in theform of probabilistic relaxation (PR) [16,7] or as Bayesian networksof inference [40]. The most relevant work to ours is that of Kim andNevatia [26], who investigated expandable Bayesian networks(EBN) as a method of recognising 3D objects. In their paper, EBNis introduced as a reasoning tool of labelling, to solve the problemof uncertainty in the evaluation of a hypothesis based on evidence,since the number of images used is not fixed and some modalitiesmay not be always available.

The accuracy of the above existing methods, however, reliesheavily on the availability of enough training examples to populateadequately the feature space. The lack of such data may be com-pensated by the use of logic rule-based learning of machine learn-ing approaches. Machine learning approaches, supervised andunsupervised, play an important role in scene labelling systems[32,54,47,17]. In general, scene labelling is a process of learningto answer the ‘‘what’’ question from given images or imagedescriptions such as 3D reconstruction models. Learning is usuallydistinguished [45] into two forms in computer science: statisticallearning (learning by experimentation), which means learningthe prior probabilities or target functions of different cases, and lo-gic rule-based learning (learning by demonstration), which implieslearning logic rules, either in the form of propositional logic orfirst-order logic. De Raedt [42], reviewing the published literature,distinguishes two approaches of combining logic with probability:(i) logical probabilistic models, where a knowledge based model isconstructed and used as a template within the framework of prob-abilistic inference and learning; (ii) probabilistic logical models,where logic is annotated with probabilities, and inference andlearning are performed within the logic framework. Human learn-ing, however, may not follow any of these routes. Humans maylearn even from one or two examples only, something that shedsdoubt to the idea that statistics and probabilities play always majorrole. In fact, there are more than 50 theories of human learning putforward by cognitive psychologists [61]. Looking at human learn-ing, Wisniewski and Medin [56] noticed from experimental studiesof human learning that there is a rich interplay between meta-knowledge and observed data influencing the learning process.Combination of these two kinds of learning for computer sciencehas become a trend, including Markov logic networks [43], in themachine learning community. A clue as to the way humans learnand draw conclusions may come from the structure of our lan-guage and the way children learn: we often understand somethingbetter if we understand the why and the how. The ToK architectureexplored in this paper has been inspired by this observation, and italso offers a natural way of incorporating learning in the system, bybootstrapping the process in a recursive way.

3. The tower of knowledge

The human brain from infancy is bombarded with informationconcerning objects and actions. Very quickly, the recognition of ab-stract shapes follows: we know that small children very quicklylearn to put solids through holes of the same shape, managing to

M. Xu, M. Petrou / Computer Vision and Image Understanding 115 (2011) 1581–1596 1583

recognise the abstract shape of the matching hole for each solid.One may think that knowledge comes as gradually the brain makesconnections between actions, objects and abstract object descrip-tors that allow them to participate in different actions. In buildingan artificial learning system, one must be able to capture this inter-play of actions, that have some time dimension, appearances andidentities. Similar ideas have been expressed by people who workon robotics and human cognition, and who believe that we mayuse robots to understand how the humans learn [46,34]. The towerof knowledge may be used for this purpose. The tower of knowl-edge is designed so that it models the acquired knowledge in theform of rules that express which functionalities, appearances andidentities (labels) are connected and how strongly. It assumesthe existence of layers of networked concepts, be them nouns(indicating identities), verbs (indicating actions), or adjectives(indicating appearances), needed to answer the questions ‘‘what?’’,‘‘why?’’ and ‘‘how?’’, respectively [41]. For example, in choosingthe label ‘‘balcony’’ for a component, the following fragment ofconversation between the various modules of the scheme isenvisaged:

– ‘‘What is this?’’– ‘‘It is a balcony.’’– ‘‘Why?’’– ‘‘Because it is attached to a building and people can stand in it.’’– ‘‘How?’’– ‘‘By offering enough space for a person to stand in and by being

attached to a wall with an opening area to allow people to enterit from the building.’’

– ‘‘Is it really like that? Let me check.’’

According to the above sequence, the tower of knowledge is de-signed as shown in Fig. 1. It consists of four levels: image level,semantic level, functionality level and descriptor level. The image

Image level

Semantic level

Functionality level

Appearance level

Fig. 1. The tower of knowledge. The ellipses represent units in the various levels. The linesame level. The lines with single-headed arrows represent queries and answers betweensingle-headed arrows represent the transfer of information.

level belongs to the low-level vision. The other three levels belongto the high-level vision. Image processing modules, working at theimage level, process the components that are input to the next le-vel for labelling. The next level is that of nouns i.e. the names of ob-jects, i.e. the network of labels (e.g. ‘‘balcony’’, ‘‘window’’), whichinteract with each other in the sense that the presence of a stair-case strongly implies that the label of an adjacent region mightbe ‘‘door’’. Conventional labelling methods, either hierarchical ornot (e.g. MRFs and PR), work solely in this level, using knowledgein the form of frequency of co-occurrence of objects/labels to opti-mise the process of assigning labels to components. Although suchan approach may be combined with the approach we shall use inthis paper, at this stage, we shall concentrate only on the informa-tion coming from the other two levels of ToK, ignoring peer-to-peer contextual influences. The remaining two levels are those offunctionalities and descriptors. The verbs of the functionality levelare functionalities of the objects such as ‘‘to stand in’’ and ‘‘to lookout’’. A functionality may be fulfilled, if the object has certain struc-tural characteristics. These are captured at the descriptors level.Examples of these units are ‘‘having enough space for a person’’and ‘‘there is an opening on the wall’’. The units in the descriptorlevel can interrogate the sensors and the images to verify that a re-quired descriptor applies to an object. This way, the vertical con-nections of the scheme encode implicitly the generic models ofobjects, seen not in isolation or as static entities, but in the contextof what they are used for, in what actions they participate andwhat their generic characteristics are. These generic models effec-tively encode the logic rules of meta-knowledge that have beenlearnt from spatio-temporal activities involving the objects wewish to label.

In this paper, we show how this scheme may be used to labelthe components of 3D models of buildings. We shall also compareour results with those obtained by using Kim and Nevatia’s EBNtheory [26].

verbs/actions

measurements

sensors

wha

t?w

hy?

how

?

is it

rea

lly li

ke t

his?

nouns/labels

adjectivesdescriptors/

s with double-headed arrows represent contextual interactions between units in theunits of different levels. Solid single-headed arrows represent queries while dashed


4. A utility function for the tower of knowledge

In the 18th century, mathematicians were concerned with theexplanation of the way people make some decisions, which accord-ing to probability theory are irrational. For example, if the expectedvalue of the gain from a lottery game is infinite, why are not peoplewilling to bet any amount to enter this game? Mathematicians hadto devise a function, which they called utility, and which wasaimed at capturing the risk each person is prepared to take, the va-lue a particular amount won or lost would have for them, etc. [52].Various forms of utility functions were designed and used to solvevarious problems. Loosely speaking, the essence is that a utilityfunction may be used to model the effect of the innate knowledgein the process of a personal decision making. We consider here thatthe personal knowledge in a recognition problem is the knowledgeacquired through observation of the world around us and whichcreated connections between objects and actions. For example,observing people going in and out of buildings through doors, cre-ated a connection between the action ‘‘go out’’ and the identity‘‘door’’, while having never seen a cake being baked on a door,made no connection between the action ‘‘bake a cake’’ and theidentity ‘‘door’’. In this section we propose the definition of a utilityfunction that will capture this innate logic model we have for thevarious objects and their uses and thus allow the transfer of infor-mation between the different levels of the tower of knowledge. Letus denote by ujk the opinion we have on how important function-ality fk is for an object to have label lj. For example, if fk is ‘‘bake acake’’ and lj is ‘‘door’’, ujk = 0, but if fk is ‘‘go out’’ and lj is ‘‘door’’,ujk = 1. Let us also denote by vkl the opinion we have on how impor-tant a certain observable characteristic dl of an object is in order tobe able to fulfil functionality fk. For example, if dl is ‘‘it is needle-like’’ and fk is ‘‘seat on’’, we believe vkl = 0, but if dl is ‘‘flat horizontalsurface’’ and fk is ‘‘seat on’’, we may say that vkl = 1. If more thanone functionalities are necessary for a label to be valid, we may se-lect the values of ujk so that they sum up to 1 and thus reflect therelative importance we give to each functionality in order to justifya label. Similarly, if more than one descriptors are necessary for aparticular functionality to be possible, we may select the valuesof vkl so they sum up to 1, to reflect the relative importance we giveto the different descriptors for allowing the particular functional-ity. Finally, we may define as cil the confidence we have for a par-ticular descriptor dl to apply to object ai. The value of cil isestimated from the accuracy of the various measurements we per-form in order to check whether the particular descriptor applies tothe particular object.

With the above understanding, we define the utility function forobject ai in relation to label lj as:

Fij �X

k

ujk

Xl

vklcil ð1Þ

whereP

kujk ¼ 1 andP

lvkl ¼ 1 and 0 6 cil 6 1.We use this function to select as label of object ai the one that

maximises Fijpij, where pij is the probability of object ai having labellj, on the basis of the unary measurements we have performed on itfrom the image. pij could also be the result of PR applied at the labellevel, or, in the simplest of cases, the probability computed fromthe original measurements we have made on each object we wantto label. Note that these measurements are not specifically made toconfirm a particular label, but they are generic measurements wemake for all objects. We denote these measurements mðtÞi , wheret = 1,2, . . . ,T, with T being the total number of measurements wemake on each object in the scene. We may denote this set of mea-surements by Mi in order to distinguish from measurement fMi,which are made to confirm or reject a particular label for objectai. Assuming independence of measurements, we may say that

pij � pðai ¼ ljjmðtÞi ; t ¼ 1;2; . . . ; TÞ ¼YT

t¼1

pðljjmðtÞi Þ

/YT

t¼1

pðmðtÞi jljÞpðljÞ ð2Þ

where pðmðtÞi jai ¼ ljÞ is the distribution of the values of measure-ment mðtÞi for objects of class lj, and p(lj) is the prior probability ofclass lj to be encountered.

One practical difficulty in applying Eq. (2) is that it requires suf-ficient training data for the initial knowledge of pðmðtÞi jljÞ and p(li). Ifthere are not enough training data, we may apply a recursive ver-sion of ToK presented in the next section.

5. A recursive version of ToK

At iteration step r of the recursive ToK, the label assigned to ob-ject ai is

lðrÞai¼ arg max

lj2LFij

YT

t¼1

pðrÞðmðtÞi jljÞpðrÞðljÞ ð3Þ

where L is the set of all possible labels.Let us assume that the observed values of m(t) are represented

by a histogram h(t), with a finite number of fixed width bins.Now consider the way of computing p(r)(m(t)jlj) and p(r)(lj). Here,we consider the worst case, namely that of lacking any trainingexamples, which means that there are no training data in the data-base. Therefore, at the initial step, we assume that every label in Lis equally probable a priori (p(0)(lj) = p(0)(lk) for all lj and lk in L).Similarly, for all lj in L, we have the equal conditional probabilitiespð0ÞðmðtÞi jljÞ.

A term called ‘‘innovation’’ is introduced related to a new globaldistribution of each label at each step. It is a common practice [18]to represent the innovation of label lj for step r as a simplefunction:

Inðlj; rÞ ¼ ð1� kÞPn

i¼1dðlðr�1Þai¼ ljÞ

nð4Þ

where dðlðr�1Þai¼ ljÞ is 1 if lðr�1Þ

ai¼ lj, else it is 0. Parameter n is the total

number of components and k(2[0,1]) is a memory factor, which isused to control the impact of the probability mass functions ofthe previous steps to the current step r. The innovations of variousvalues of m(t) conditioned on lj can be represented as,

InðmðtÞjlj; rÞ ¼ ð1� kÞPn

k¼1dðmðtÞ ¼ mðtÞk ; lðr�1Þak¼ ljÞPn

k¼1dðjðr�1Þk ¼ ljÞ

ð5Þ

where dðmðtÞ ¼ mðtÞk ; lðr�1Þak¼ ljÞ is 1 if component ak has label lj in step

r � 1 and the tth measurement value mðtÞk falls into the bin of valuem(t), else it is 0.

At each step r, the innovations can be used in the followingequations,

pðrÞðljÞ ¼ kpðr�1ÞðljÞ þ ð1� kÞPn

i¼1dðlðr�1Þai¼ ljÞ

nð6Þ

pðrÞðmðtÞjljÞ ¼ kpðr�1ÞðmðtÞjljÞ þ ð1� kÞ

�Pn

k¼1dðmðtÞ ¼ mðtÞk ; lðr�1Þak¼ ljÞPn

k¼1dðlðr�1Þak¼ ljÞ

ð7Þ

In Eqs. (6) and (7), the first term on the right-hand side is used toavoid the sudden change of the distributions, which may cause con-vergence to a wrong solution. The second term, the innovation,adapts the new knowledge learnt in the previous steps to this spe-cific example.


Finally, we have lðRÞaiafter R iterations as the output of the recur-

sive ToK. Such a recursive version of ToK learns adaptively usingthe logic rule-based knowledge it already has, and can cope withthe situation of lacking any training data. The overall algorithmis summarised in Table 1.

6. Experimental work in 3D scene interpretation

In this section we exemplify the ideas discussed so far in thecontext of labelling the components of 3D models of buildings.The way we represent 3D models of buildings and their compo-nents is discussed in Section 6.1. The way we set up the logic rulesfor scene interpretation, including the values of ujk, vkl and cil, isdiscussed in Section 6.2. Finally, the experimental results are pre-sented in Section 6.3.

6.1. 3D models of buildings and their components

For evaluating ToK, we need 3D reconstructed building modelsand their components available first. However, problems of 3Dreconstruction do not concern us here since the ToK scheme isindependent of the way the 3D model is reconstructed. It is thusassumed that the 3D models of the buildings have already beenreconstructed with identified components. The labelling resultsof the manually and automatically identified components will bepresented in Section 6.3.

In order to display the 3D model, the 3D coordinate system hasto be defined first. We set its origin at the bottom left corner of thefacade wall plane of the 3D model. Then, we represent the recon-structed 3D buildings and the identified components by usingthe following notation [9].

B
Building to be labelled wk
p
Width of wall k
hkp

Height of wall k

nk
Wall k plane normal pk Wall k plane position nr Number of walls (including planes of roof)n o ak
p
set of planar parameters of wall k;akp ¼ wk
p;hkp;n

k;pk

ap
set of all wall planar parameters, ap ¼Snr
k¼1akp

nkp

Number of components that make up wall k

laik
Label of component ai of wall k
akl
Labels of all components of wall k, ak
l ¼Snk

p

i¼1laik

Mki
Measured parameter values of component ai of wall k ak
c
Measured parameters of all components of wall
k;akc ¼

Snkp

i¼1Mki.

akt

Set of component parameters belonging to wall

k;akt ¼ nk

p;akl ;a

kc

n o
at Component parameters belonging to all walls,
at ¼Snr

k¼1akt

1 Sturges’ formula is an approach [51] for determining the optimal number of bins.With Sturges’ formula, the bin number in our experiments is the ceiling oflogðminlj nlj þ 1Þ, where nlj is the number of components in the training data setwith the same category of label lj.

In the above notation, laik represents the label of component ai

of wall k. Its possible values are given in the first column of Table2. The outputs of our experiments are these labels for each compo-nent. Moreover, Mki, as the input to our experiments, defines themeasured parameters of component ai of wall k. Here, we labelthe components for each individual wall k separately, such thatlaik and Mki can be simplified to be lai

and Mi. Mi has the followingelements:

mð1Þi : the depth of the component;mð2Þi : the ratio of height to width of the component;mð3Þi : the distance of the component from the closest groundplane/roof.

In addition, another different set of measurements fMi also hasto be collected according to the requests and specifications of the

descriptors. In our experiments, fMi ¼ ~mð1Þi ; ~mð2Þ

i ; . . . ; ~mðsÞi

n oinclude:

~mð1Þi : xm, position of the component along axis X;~mð2Þi : ym, position of the component along axis Y;~mð3Þi : zm, position of the component along axis Z;~mð4Þi : wm, the width of the component;~mð5Þi : hm, the height of the component;~mð6Þi : dm, the depth of the component;~mð7Þi : gm, the mean grey value of the component.

To work out the scale of the model, we assume that the lowestidentified component is in the ground floor of the building, andthat the ground floors of all buildings are 2.5 m. Next, the scale isworked out by setting (hm + ym) of the lowest component equalto 2.5 m (see Fig. 2 as an example). Given the scale, measurementsfMi and Mi for all components can be translated into metres, whichis necessary for confirming the functionalities and descriptors ofeach component.

Next we have to encode the logic rules for our applicationdomain.

6.2. Logic object models for 3D scene interpretation

The prior probabilities for each class of component and the

conditional probabilities of (2) (p(lj) and p mðtÞi jlj

� �) have been

learnt using the histograms of distributions of measurements, witha finite number of fixed width bins, trained from 200 manuallyannotated training images of the e-TRIMS database [60]. Note thatthe number of bins were chosen to be 6 for the histograms of each

p mðtÞi jlj

� �using Sturges’ formula.1 In the training set there are 1327

windows, 566 doors, 476 balconies, 322 pillars, 189 dormers and 148chimneys.

Next, the specific meaning of the codes used in the logic rules isgiven in Tables 2 and 3. Assuming that all relative functionalitiesare equally necessary for a label to be true and all relative descrip-tors are equally important for a functionality to be fulfilled, wemay express ujk and vkl as

ujk ¼dðlj ! fkÞPp

m¼1dðlj ! fmÞð8Þ

vkl ¼dðfk ! dlÞPq

n¼1dðfk ! dnÞð9Þ

where d(lj ? fk) takes value 1 if label lj implies functionality fk;else it takes value 0. Similarly, d(fk ? dl) takes value 1 if function-ality fk can be fulfilled by descriptor dl; else it takes value 0.Arranging the values of ujk and vkl in the form of matrices, wemay write:

Table 1Summary of the recursive tower of knowledge.

1.(a) Create the histograms of the m(t), one per possible label.(b) Initialise the pmfs p(0)(lj) and p(0)(m(t)jlj) as uniform distributions, j = 0,1, . . . ,M.

2. Repeat for r = 1,2, . . . ,R(a) Repeat for i = 1,2, . . . ,n

i. Assign to each component ai label lðr�1Þai

by using Eq. (3).

(b) Repeat for j = 1, 2, . . . ,Mi. Update the prior pmf of each component as

pðrÞðljÞ ¼ kpðr�1ÞðljÞ þ ð1� kÞPn

i¼1d lðr�1Þai

¼ lj

� �n

ii. For t = 1,2, . . . ,s and for each possible measurement value m(t) of all components, update the prior conditional pmfs as

pðrÞðmðtÞjljÞ ¼ kpðr�1ÞðmðtÞjljÞ þ ð1� kÞPn

k¼1dðmðtÞ ¼ mðtÞk ; lðr�1Þak

¼ ljÞPnk¼1d lðr�1Þ

ak¼ lj

� �3. By using Eq. (3), assign to component ai label lðRÞai

as the output of the recursive ToK for i 2 {1,2, . . . ,n}.

Table 2The codes of the logic rules encoded in ToK.

Label Functionality Descriptor

l1: Window f1: Lets people walk out d1 The bottom of the component touches the ground planel2: Door f2: Lets people look out d2 The top of the component touches a flat planel3: Balcony f3: Lets people stand in d3 It is high enough for human sizel4: Pillar f4: Lets light in d4 It is glass-likel5: Dormer f5: Makes building stable d5 The depth is large enough for human sizel6: Chimney f6: Creates a usable space in the roof d6 The width is large enough for human sizel7: Other f7: Lets smoke vent from the top d7 There is some opening component next to it

d8 It is on the roofd9 It is taller than its widthd10 The depth is large enough to have a hole insided11 The width is large enough to have a hole inside


U ¼

0 1=2 0 1=2 0 0 01 0 0 0 0 0 00 0 1 0 0 0 00 0 0 0 1 0 00 1=3 0 1=3 0 1=3 00 0 0 0 0 0 1

2666666664

3777777775

V ¼

1=3 0 1=3 0 0 1=3 0 0 0 0 00 0 0 1 0 0 0 0 0 0 00 0 0 0 1=3 1=3 1=3 0 0 0 00 0 0 1 0 0 0 0 0 0 0

1=3 1=3 0 0 0 0 0 0 1=3 0 00 0 0 0 0 0 0 1 0 0 00 0 0 0 0 0 0 1=4 1=4 1=4 1=4

0BBBBBBBBBBB@

1CCCCCCCCCCCA

–Note that for label other, it is impossible to find correspondingfunctionalities and descriptors. Thus, if there are no functionalitiesand descriptors fulfilled for a component, this component will beclassified as other (l7). In our experiments, we set a thresholdtu = 0.6 for utility (1), and if the maximum utility of a componentis less than 0.6, we may label it as l7.

Finally, we must have a method that will allow the ToK to workout the values of cil of Eq. (2) using measurementsfMi. In particular,we deal with each cil as follows.

� ci1 and ci2: Does the bottom of the component touch the groundplane?/Does the top of the component touch a flat plane?

Let us denote by Dy the distance of the bottom of the compo-nent from the ground plane along the Y axis. It is clear that d1 onlydepends on Dy and is independent of all other measurements.Thus, obviously one may say ci1 / p(D yjd1). If we do not considermeasurement noise, we may say ci1 = p(Dy = 0jd1) = 1 andci1 = p(Dy – 0jd1) = 0. Now, let us consider how to calculate theconfidence when there is measurement noise. Let us denote byDym, Dye and D yt the measured value, the error and the true valueof Dy, respectively. We may assume additive noise: Dym = Dyt

+ Dye.We assume that the errors in the measurements are Gaussianly

distributed with zero mean and standard deviation rDye.

Then, since

pðDymjd1Þ ¼ pðDyt þ DyejDyt ¼ 0Þ ¼ pðDyejDyt ¼ 0Þ ð10Þ

and Dyt = 0 is obviously independent of Dye, we may say that theprobability density function of Dym is,

pðDymjd1Þ ¼ pðDyeÞ ¼1ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

2pr2Dye

q e�1

2DymrDye

� �2

ð11Þ

Because ci1 / p(Dymjd1) and cil 2 [0,1], we may model ci1 as

ci1 ðDymÞ ¼ e�1

2DymrDye

� �2

ð12Þ

In a similar way, we may compute ci2.

Table 3Encoding of the connections between each label, func-tionality and descriptor of Table 2.

Label Functionality Descriptor

l1 f2, f4 d4

l2 f1 d1, d3, d6

l3 f3 d5, d6, d7

l4 f5 d1, d2, d9

l5 f2, f4, f6 d4, d8

l6 f7 d8, d9, d10, d11

Fig. 2. An example of the 3D model of a building with the identified components, represented by the grey enclosing boxes. In this figure, (a) shows the identified componentsobtained by hand, and (b) shows components identified automatically. The values of fMi for each identified component can be extracted as an instance shown at (c). For themean grey value gm of a component, we have to project all 3D points, belonging to the identified component, to one of the 2D multiple-view images that are used toreconstruct the 3D model, and then compute the mean of their grey values in the 2D image.


� ci3, ci5, ci6, ci9, ci10 and ci11: Is it high enough for a human?/Is itdeep enough for a human?/Is it wide enough for a human?/Isit taller than its width for dispersing pollutants?/Is it deepenough to have a hole?/Is it wide enough to have a hole?

Here we use an approach similar to the one used for computingci1 and ci2. However, here we are concerned with the depth, widthand height of the components. We denote the measurement value,the error and the true value of depth as dm, de and dt, respectively.Also, we assume that the errors in the measurements are Gaussianlydistributed with zero mean and standard deviation rde . Therefore,

dm ¼ de þ dt ¼ Nðdt;rde Þ ð13Þ

where Nðdt;rde Þ stands for Gaussian distribution with mean dt andstandard deviation rde .

Assuming that a human requires at least dh = 0.3 m to stand in,we then have to ask the question: if we have a measurement dm,how confident are we that dt is greater than dh? Because descriptord5 requires dt P dh, assuming Z = dt, we can answer the above ques-tion with confidence

ci5ðdmÞ / pðdmjd5Þ / pðdm;d5Þ ¼ pðdm;dt P dhÞ

¼Z 1

dh

pðdmðZÞÞdZ ð14Þ

According to Eqs. (13) and (14), ci5 may be finally computed as:

ci5ðdmÞ ¼1ffiffiffiffiffiffiffi

2pp

rde

Z 1

dh

e�ðZ�dmÞ2

r2de dZ ð15Þ

We compute ci3, ci6, ci9, ci10 and ci11 in a similar way.

� ci4: Is it glass-like?

Glass is a perfect reflector and so, if one is directly receiving thereflected beam (situation of presence of a highlight), glass appearsvery bright. Else, it appears very dark. Therefore, the confidencethat a component is glass may be measured by how dark or brightit appears. So, p(gjd4) is assumed to consist of two Gaussian distri-butions: one Gaussian distribution with grey intensity g abovethreshold gh and the other Gaussian distribution with grey inten-sity below gh.

Since ci4 / p(gjd4) and 2[0,1], confidence ci4 may be obtained as

ci4ðgmÞ ¼e�1

2ðgm�la Þ2

r2a if gm P gh

e�1

2ðgm�lb Þ

2

r2b if gm < gh

8><>: ð16Þ

where gm is the detected mean grey value of the component. gh

should be a grey value in the middle of the range. So, we set it to120. From the mean grey values of each manually identified compo-nent of five buildings, it was deduced that mean values la and lb

are 168.4 and 89.9, and standard deviations ra and rb are 32.2and 20.3.

� ci7: Is there any opening component next to it?

To compute ci7 of component ai, we consider a component swith xm values between the xm values of ai and ym values abovethose of ai, in which xm and ym have already been extracted. Com-ponent s has at its bottom to touch component ai. If there is no suchcomponent, ci7 will be assigned value 0. Else, if the bottom of com-ponent s touches the top of component ai, ci7 is computed the sameway as ci1 and ci2, i.e. by considering the difference Dy in the ym

values of the two lines that have to touch.

� ci8: Is it located on the roof?

Given the position of roof yr along the Y axis, we assumeDyr = ym � yr. Note that ym, is the position of the component alongthe Y axis. In order to compute ci8 of component ai, we only need

Fig. 3. There are 11 multiple-view images of this building used for this experiment. Three of them are shown here.

2 The buildings selected to test in this subsection are different from the buildingsused to train for p(lj) and p mðtÞi jlj

� �.


to compute the confidence of D yr P 0. Since it has already beenknown how to compute the confidence ci5 of dm P dh via (15), wemay use here the same way of computing ci5 by using Dyr insteadof dm and 0 instead of dh. Then, (15) may be rewritten as:

ci8ðDyrÞ ¼1ffiffiffiffiffiffiffi

2pp

r

Z 1

0e�ðZ�Dyr Þ2

r2Dye dZ ð17Þ

In the above formulae, the standard deviations of the errors inthe measurements are required. These can be estimated by usingsome training data. The errors may be caused by the inaccuratepositioning of the points we match in the two images in order tocreate the 3D model. If the position of a point is at (i, j), its true po-sition maybe anywhere within a 3 � 3 window around this posi-tion, due to digitisation error and inaccuracies of the featuredetection algorithms. These errors propagate through the steps ofthe model construction process, and they may be amplified in sev-eral steps of the 3D reconstruction, such as the estimation of thefundamental matrix and the computation of the 3D points usingthe camera matrices.

To estimate their variance, we treated the 3D reconstructionprocess as a ‘‘black box’’, and we used a Monte Carlo procedurewith the following steps.

Step 1. The 2D points used for 3D reconstruction were shifted ran-domly by one pixel away from their original positions.Then, the 3D reconstruction was performed with inputthe randomly shifted points.

Step 2. After the reconstruction, the width wi, height hi, depth di

and position along the vertical axis of each component ai

in a 3D model were measured. Note that these values werescaled to be expressed in metres, using the easily esti-mated scale of our images: a 10 m building occupies 500pixels in one of the images we used, so 1 pixel is 2 cm.

Step 3. This process was repeated 50 times for 50 different ran-dom perturbations of the matched points in the two inputimages. This way, for each component, we had 50 mea-sured values for wi; hi; yi and di. From them, the standarddeviations rwi ;rhi ;ryi and rdi were computed.

The results for 26 components of the building shown in Fig. 3are shown in Fig. 4. From Fig. 4 we can say that the depth of a com-ponent is estimated with rde ’ 0:01 m, the height withrhe ’ 0:04 m, the width with rwe ’ 0:06 m and its location alongthe vertical axis with rye

’ 0:03 m. Given that we are interestedin differences in the location along the vertical axis, rDye

’ 0:06 m.

6.3. Experimental results

This section describes the experimental results for labellingcomponents of 3D models of buildings with the proposed ToKand another state-of-the-art version of scene interpretation algo-rithm, namely expandable Bayesian networks [26]. In [26], Kimand Nevatia used EBN to recognise roofs, walls etc. leading to the

detection of buildings, but we apply here EBN to label buildingcomponents. First, 233 manually identified components of 10reconstructed buildings are tested.2 These labelling results are pre-sented in Section 6.3.1. For the same buildings, an automatic compo-nent identification method, described in Section 6.3.2, identified 200components. The labelling results of these components are pre-sented in Section 6.3.3.

6.3.1. Experimental results of labelling manually identifiedcomponents

Fig. 5a shows the images of three buildings we wish to recon-struct and label. We use multiple-view images as the input tothe experiment. The components identified manually in the 3Dmodels of these buildings are shown in Fig. 5b.

We first test the proposed ToK and EBN methods for labellingcomponents of the building in the case that training data are avail-able. We obtain the labels of each component by ToK, identified bythe different colours, as shown in Fig. 5c. For each building, Tables4–6 report the labelling results of ToK and EBN in the form of con-fusion matrices. It can be seen that out of the 97 components, threeare wrongly labelled, and the accuracy of ToK for the three build-ings of this experiment is thereby 96.9%, which is marginally betterthan the accuracy of EBN (95.9%).

Then, we used ToK and recursive ToK for labelling componentsin the case of totally lacking any training data for learning p mðtÞi jlj

� �and p(lj) of Eq. (2). For this purpose, these functions are initially as-sumed to be uniform distributions. The iteration number R ofrecursive ToK was set to 50. Fig. 6 reports the performance ofrecursive ToK for interpreting the buildings of Fig. 5 as a functionof the number of iterations, when no training data are provided.Note that, in this figure, the accuracy of the first iteration corre-sponds to the labelling results of the original ToK without anytraining data. This figure shows the convergence of the accuracyto a better percentage after a few iterations and the speed of suchconvergence with respect to memory factor k, when the recursiveversion of ToK is applied. More details of the statistical labelling re-sults with no training data are shown in Tables 4–6. Then, it is clearthat the recursive ToK improves the accuracy of the original ToKwhen training data are not available for learning the priorprobabilities.

Fig. 7 shows seven more buildings used for further evaluation.The accuracy of scene interpretation of all 10 buildings, shown inFigs. 7 and 5a, is reported in Table 7 for each label class, and in Ta-ble 8 in the form of a confusion matrix. It appears that the systemwithout training data recognises chimneys better than with train-ing data. This is an unexpected result, but given that there are veryfew chimneys in the test images we have, it is possible that thisresult does not have any statistical significance. Fig. 8 further plotsthe accuracy of recursive ToK for all these 10 buildings. Withoutany training data, EBN cannot work at all since the prior

0 5 10 15 20 25 300

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Components

Stan

dard

dev

iatio

ns (m

eter

) heightwidthdepthpostion along Y axis

Fig. 4. Standard deviations for 26 components of the building of Fig. 3. The numbersalong the horizontal axis represent the components of the building.


probabilities are unknown and its results are thereby wholly ran-dom, so we do not present the results of EBN without any trainingdata. For all these buildings, the results then clearly demonstratethat ToK works better than EBN, and that recursive ToK can actas a learning mechanism that improves the accuracy of ToK for3D scene interpretation when training data are not available forlearning the prior probabilities.

Fig. 5. The labelling results of the ToK system for hand segmented components. (a) Thecomponents. (c) Labels output by the proposed ToK approach. Each colour is an indicati

6.3.2. Automatic component identificationA cognitive vision system is expected to be able to identify, label

and interpret objects in images. However, objects do not exist inisolation, and often it is not clear how to define what an object isin such a context. For example, it is easy to identify individual ob-jects on a table, as being ‘‘plate’’, ‘‘fork’’, ‘‘glass’’, etc.; it is not soeasy to identify objects in an image of the built environment: adoor is part of a building facade, which is part of a building, whilea balcony, being attached to a wall may include a door, a railing oreven a window. There is clearly a cognitive process that made hu-mans of different cultural backgrounds and mother tongues toidentify balconies, chimneys and pipes as distinct entities, and givethem separate linguistic terms for identification. It is possible thatthis process involved the functionality of a component in order toidentify it as such and assign to it a separate identifying word. It isalso possible that brain treats as component any agglomeration oflow level features in the image. If the scene were dynamic, onemight identify this assertion with the ‘‘law of common fate’’ of ge-stalt psychology. Indeed, according to this law, the brain groups to-gether low level features that move together. In static scenes, onemay use instead the law of ‘‘good shape’’ of gestalt psychology.Jahangiri and Petrou [23], inspired by this law, developed an algo-rithm that takes as input a 2D image and produces as output a setof blobs, on the basis of their texture content and convexity. In

buildings used in these experiments. (b) The 3D models with manually identifiedon of one of the seven class labels.

Table 4The results of labelling manually identified components using ToK/EBN/ToK without training data/recursive ToK without training data, for the first building of Fig. 5.

Ground truth ToK output

Window Door Balcony Pillar Dormer Chimney

Window 36/36/30/36 0/0/6/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Door 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Balcony 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Pillar 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Dormer 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Chimney 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0

Table 5The results of labelling manually identified components using ToK/EBN/ToK without training data/recursive ToK without training data, for the second building of Fig. 5.




Table 6The results of labelling manually identified components using ToK/EBN/ToK without training data/recursive ToK without training data, for the third building of Fig. 5.




0 10 20 30 40 500.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1

Iterations

Accu

racy

λ=0.9λ=0.95λ=0.99

0 10 20 30 40 500.65

0.7

0.75

0.8

0.85

0.9

Iterations

Accu

racy

λ=0.9λ=0.95λ=0.99

0 10 20 30 40 500.895

0.9

0.905

0.91

0.915

0.92

0.925

0.93

0.935

0.94

Iterations

Accu

racy

λ=0.9λ=0.95λ=0.99

Fig. 6. Performance of the recursive ToK approach for the three buildings of Fig. 5, as a function of iterations.


other words, the produced blobs are roughly convex regions whichhave high gradient values. Further, they showed [24] that these re-gions correlate in a statistically significant way with regions thatattract the attention of the viewer when he/she is viewing a scenein an investigative way. We are making use here of this algorithm,which is freely downloadable from the web [25].

Let us assume that we have J different view images, from whichwe have reconstructed the 3D model of a scene and specified the3 � 4 camera matrices that allow the projection from the worldcoordinate system of the 3D model to the coordinate systems of

each image plane. Let us call the camera matrix for image j, Pj.Let us also assume that for each 2D image, we have run the blobdetection algorithm and extracted the corresponding blob map.In a blob map, a pixel that belongs to a blob has value 1, while apixel that does not belong to a blob has value 0. We postulate thata 3D coordinate position will belong to a component, if it projectsto a blob position in several of the 2D view images. Extensiveexperiments [57] showed that using as threshold half the numberof available images rounded up to the nearest integer produced thebest results.

Fig. 7. Buildings used for our experiments, chosen from the eTRIMS and Oxford databases [60].

Table 7Summary of percentage accuracy of the EBN algorithm with training data, ToK withand without training data, and recursive ToK without training data for 3D sceneinterpretation. There are in total 233 manually identified components of the 10buildings.

With training data Without training data

EBN (%) ToK (%) ToK (%) Recursive ToK (%)

Window 97.7 98.3 77.4 89.3Door 76.9 96.2 92.3 100Balcony 63.6 100 72.7 100Pillar 100 100 100 100Dormer 22.2 77.8 55.6 22.2Chimney 0.0 62.5 100 100Overall 87.6 96.1 79.0 88.8

0 10 20 30 40 500.78

0.8

0.82

0.84

0.86

0.88

0.9

Iterations

Accu

racy

Fig. 8. Performance of recursive ToK for labelling hand segmented components, as afunction of iterations, on all 10 buildings. The maximum iteration number andmemory factor k are set to 50 and 0.95, respectively.


So, we create an accumulator array A(X,Y,Z), where (X,Y,Z) arethe world coordinate positions of the 3D model. We define the ele-ments of the array to be

AðX;Y; ZÞ ¼XJ

j¼1

BjðPjXÞ ð18Þ

where X = (X,Y,Z,1)T and Bj(�) is the pixel value of the blob map ofimage j. Note that the value of A(X,Y,Z) will be at most J for some(X,Y,Z), indicating that this point projects inside a blob in all 2Dimages that support these 3D points, and at least 0 for some others,implying that these points project outside blobs in all 2D images.

We may use a threshold tA and set all values of A(X,Y,Z) belowtA to be 0, otherwise to be 1. We may next use connected compo-nent analysis, in 3D, to identify ‘‘islands’’ of elements in A, the

values of which are equal to 1. Here we use 26-connectivity. Then,parallelepiped regions are created in A, with the minimum enclos-ing boxes, to cover each identified component.

With the help of this method, 200 components of the 10 build-ings used in the experiments of the previous section were auto-matically identified. We present in this section the performance

Table 8The overall results of labelling manually identified components for all buildings using ToK/EBN/ToK without prior training data/recursive ToK without prior training data.




Fig. 9. The labelling results of the ToK approach for the automatically identified components. (a) The selected three buildings used for these experiments. (b) The 3D modelswith automatically identified components. (c) Labels output by the proposed ToK approach. Each colour is an indication of one of the seven classes.


of ToK, in labelling these components. It must be noted that someof these components are wrongly extracted.

6.3.3. Experimental results of labelling automatically identifiedcomponents

First, consider the three buildings shown in Fig. 9a for evalu-ating the performance of ToK in detail. The automatic compo-nent identification results for these three buildings are

reported in Fig. 9b, and their labelling results with ToK areshown in Fig. 9c. For performance comparison of the differentlabelling methods, see the confusion matrices of Tables 9–11.Notice that as seen in Fig. 9, TOK made a mistake when labellingthe five windows of the ground floor of the first building asdoors, because these components were identified close to theground, a condition necessary for the functionality of letting aperson walk out.

Table 9The results of labelling automatically identified components using ToK/EBN/ToK without training data/recursive ToK without training data, for the first building of Fig. 9.


Window Door Balcony Pillar Dormer Chimney Others

Window 24/9/26/36 5/5/3/0 0/0/0/0 0/0/1/0 0/0/0/0 0/0/0/0 7/22/6/0Door 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Balcony 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Pillar 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Dormer 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Chimney 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0Others 1/0/3/3 1/1/0/0 1/2/0/0 0/0/0/0 0/0/0/0 0/0/0/0 0/0/0/0

Table 10The results of labelling automatically identified components using ToK/EBN/ToK without training data/recursive ToK without training data, for the second building of Fig. 9.




Table 11The results of labelling automatically identified components using ToK/EBN/ToK without training data/recursive ToK without training data, for the third building of Fig. 9.




0 10 20 30 40 500.740.760.780.8

0.820.840.860.880.9

0.920.94

Iterations

Accu

racy

λ=0.9λ=0.95λ=0.99

0 10 20 30 40 500.53

0.535

0.54

0.545

0.55

0.555

0.56

0.565

0.57

Iterations

Accu

racy

λ=0.9λ=0.95λ=0.99

0 10 20 30 40 500.7

0.72

0.74

0.76

0.78

0.8

0.82

0.84

Iterations

Accu

racy

λ=0.9λ=0.95λ=0.99

Fig. 10. Performance of the recursive ToK for the three buildings of Fig. 9, as a function of iterations.


Next, Fig. 10 shows the performance of recursive ToK as a func-tion of the number of iterations, when lacking training data forlearning p(lj) and p mðtÞi jlj

� �.

The performance of the methods of EBN, ToK and recursive ToKis shown in Tables 12 and 13 for the general labelling task of the200 automatically identified components of all 10 buildings shown

in Figs. 9a and 7. These results show that given automatically iden-tified components, ToK improves over the performance of EBN, andthe recursive ToK improves over the performance of the originalToK with no training data. Finally, Fig. 11 shows the improvementof the recursive ToK as a function of the number of iterations for all10 buildings.

0 10 20 30 40 500.65

0.655

0.66

0.665

0.67

0.675

0.68

0.685

0.69

0.695

0.7

Iterations

Accu

racy

Fig. 11. Accuracy of the recursive ToK for labelling automatically identifiedcomponents, as a function of iterations, on all 10 buildings. The maximum iterationnumber and memory factor k are set to 50 and 0.95, respectively.

Table 12Summary of percentage accuracy of the EBN algorithm with training data, ToK with and without training data, and recursive ToK withouttraining data for 3D scene interpretation. There are in total 200 automatically identified components of the 10 buildings.

With training data Without training data

EBN (%) ToK (%) ToK (%) Recursive ToK (%)

Window 64.0 74.3 70.6 80.9Door 78.6 78.6 42.9 42.9Balcony 62.5 87.5 62.5 62.5Pillar 50.0 50.0 50.0 50.0Dormer 33.3 50.0 66.7 0.0Chimney 0.0 75.0 100 100Others 30.0 36.7 36.7 36.7Overall 57.5 68.5 63.5 68.5

Table 13The overall results of labelling automatically identified components for all 10 buildings using ToK/EBN/ToK without prior training data/Recursive ToK without prior training data.

True ToK



Table 14The time required (in seconds) for labelling the manually identified components ofthe buildings shown in Fig. 5.

High-level processing

EBN withtraining data

ToK withtraining data

Recursive ToKwithout training data

First building 2.26 2.16 160.03Second building 2.18 1.99 144.12Third building 1.85 1.89 141.91


6.4. Computation time

The experiments presented were performed using MatlabR2007a on a computer with Pentium(R) Duo 2.8 GHz CPU and 3GB RAM. For manually identified components of the buildings asshown in Fig. 5, the computational time of the various labellingapproaches is reported in Table 14. Note that since the componentswere manually segmented, no computation time for low-levelprocessing was recorded and listed in this table. For the buildingsof Fig. 9, the computation time of automatically identifying andlabelling the components with the different approaches is shownin Table 15.

7. Conclusions

In this paper, a novel architecture was explored, namely thetower of knowledge (ToK), for organising the knowledge referring

to a particular labelling task, for scene interpretation in computervision. It was argued that the input knowledge may be distilledin sets of logic rules concerning functionality and structure neces-sary for a label to be applicable. Once these rules are utilised,experimental results demonstrated the effectiveness of theproposed approach in 3D scene interpretation, accurate enoughto compete with the case when EBN is used and training examplesare available.

A recursive version of ToK was presented to deal with theabsence of any training data. This recursive ToK uses the resultsof each iteration, decided by the inserted logic rule-based knowl-edge and training results of previous iterations, to learn andupdate the distributions of measurement values for variousclasses and the prior probabilities of various labels for the nextiteration. It is attractive for its self-learning capability byiteratively integrating probability theory and logic. Experimentalresults on several 3D building models also showed the ability ofrecursive ToK to solve the problem of lacking training data for3D scene interpretation.

Although the experimental results presented were limited tothe task of labelling the components of 3D models of buildings,they demonstrated the way in which the ToK or recursive ToKmay be used to organise prior knowledge. Ideally this knowledgeshould be combined with contextual information coming fromthe labels of near-by objects. We wish to stress that conventionalcontext-based labelling approaches, typically based on Markovrandom fields and probabilistic relaxation, operate in a single levelof the tower of knowledge, and their incorporation is possible in

Table 15The time required (in seconds) for automatically identifying and labelling the components of the buildings shown in Fig. 9.

Low-level processing High-level processing

Component identification EBN with training data ToK with training data Recursive ToK without training data

First building 1065.2 2.29 2.24 180.69Second building 683.7 2.01 1.94 153.27Third building 442.6 1.12 1.05 81.66


the proposed scheme. Hierarchical approaches, involving agglom-erates, e.g. reasoning based on a premise like ‘‘a facade consistsof doors, windows and balconies’’, may also be incorporated asthey operate at a single level of the tower of knowledge, namelythat of nouns. Knowledge represented in the verbs and action levelmay be incorporated too, from the analysis of videos that involveactions in relation to the objects that are to be recognised [8]. Thistype of information may be extracted with the help of human ac-tion recognition modules, which might be based on Nagel’s Situa-tion Graph Trees (SGTs) [35,1] or Thonnat’s scenarios [6,53]. Thus,ToK is an architecture much more general than demonstrated here,that implicitly allows one to encode generic logic models of objects(through its inter-layer connections), and generic models of sen-sors/measurement/descriptors (through the feed back loop that al-lows the higher level modules to interrogate sensors even otherthan those used when the process started), dynamic built up ofknowledge (through recursive learning), and the use of videos aswell as static images involving the scene to be interpreted. Thisway, it offers a framework for relating actions with visual appear-ances and semantic meanings and a link to human cognitive pro-cesses as expressed by the structure of natural language.

The system in its present form does not cater for re-segmenta-tion of the scene, thus giving the impression that it works onlywith perfectly segmented objects. This is partly true, but as theexperiments with automatically segmented components haveshown, the system is robust enough to tolerate a certain degreeof bad segmentation. If the system had to incorporate the optionof re-segmentation of the scene, an external loop should be incor-porated, where objects recognised with low confidence should ad-dress a module that would re-examine the segmentation result.Another drawback of the system in its present form is that it doesnot scale up for problems with many alternatives. To do that, effi-cient methods, taking into consideration the fact that most connec-tions between the different layers will have to have 0 weight,should be utilised in its implementation.

Acknowledgments

This work was partly supported by the EU project eTRIMS andpartly by a Chinese Scholarship.

References

[1] M. Arens, H.H. Nagel, Behavioural knowledge representation for theunderstanding and creation of video sequences, in: Proceedings of the 26thGerman Conference on Artificial Intelligence (KI-2003), LNAI, Springer-Verlag,Berlin, Heidelberg, New York, 2003, pp. 149–163.

[2] P. Besl, R.C. Jain, Three-dimensional object recognition, Comput. Surv. 17(1985) 75–145.

[3] J. Batllea, A. Casalsb, J. Freixeneta, J. Marti, A review on strategies forrecognising natural objects in colour images of outdoor scenes, Image VisionComput. 18 (6) (2000) 515–530.

[4] P. Carbonetto, N. Freitas, K. Barnard, A statistical model for general contextualobject recognition, in: Proceedings of ECCV, 2004, pp. 350–362.

[5] L. Cheng, T. Caelli, A. Sanchez-Azofeifa, Component optimisation for imageunderstanding: a Bayesian approach, IEEE Trans. Pattern Anal. Mach. Intell. 28(5) (2006) 684–693.

[6] N. Chleq, M. Thonnat, Real time image sequence interpretaion for videosurvellance applications, in: International Conference on Image Processing,IEEE ICIP’96, vol. 2, Lausanne, Switzerland, September 1996, pp 801–804.

[7] W.J. Christmas, J. Kittler, M. Petrou, Structural matching in computer vision,IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 749–764.

[8] D. Damen, Constraint-based scene interpretation, in: Dagstuhl Logic andProbability for Scene Interpretation Workshop, 2008. <http://kathrin.dagstuhl.de/08091/Materials2/>.

[9] A.R. Dick, P.H.S. Torr, R. Cipolla, Modelling and interpretation of architecturefrom several images, Int. J. Comput. Vision 60 (2) (2004) 111–134.

[10] A.R. Dick, P.H.S. Torr, S.J. Ruffle, R. Cipolla, Combining single view recognitionand multiple view stereo for architectural, in: Proceeding of ICCV, 2001, pp.268–274.

[11] S. Ekvall, D. Kragic, Integrating object and grasp recognition for dynamic sceneinterpretation, in: Proceeding of 12th ICAR, 2005, pp. 331–336.

[12] L. Fei-fei, P. Perona, A Bayesian hierarchical model for learning natural scenecategories, in: Proceedings of CVPR, 2005, pp. 524–531.

[13] P.C. Fishburn, Utility Theory for Decision Making, Robert E. Krieger PublishingCompany, Huntington, NY, 1970, ISBN 978-0471260608.

[14] G. Fritz, C. Seifert, L.Paletta, Urban object recognition from informative localfeatures, in: Proceedings of the 2005 ICAR, 2005, pp. 131–137.

[15] G. Guerra-Filho, Y. Aloimonos, A language for human action, IEEE Comput. 40(5) (2007) 42–51.

[16] E.R. Hancock, J. Kittler, Discrete relaxation, Pattern Recogn. 23 (7) (1990) 711–733.

[17] J. Hartz, B. Neumann, Learning a knowledge base of ontological concepts forhigh-level scene interpretation, in: International Conference on MachineLearning and Applications, 2007, pp. 350–362.

[18] S. Haykin, Adaptive Filter Theory, first ed., Princeton University Press, 1986.[19] D. Heesch, M. Petrou, Non-Gibbsian Markov random fields for object

recognition, in: Proceedings of BMVC, 2007.[20] D. Heesch, M. Petrou, Markov random fields with asymmetric interactions for

modelling spatial context in structured scene labelling, J. Sig. Proc. Syst. 10(07) (2009). online.

[21] L. Hotz, B. Neumann, Scene interpretation as a configuration task, KunstlicheIntelligenz 3 (2005) 59–65.

[22] C. Hudelot, N. Maillot, M. Thonnat, Symbol grounding for semantic imageinterpretation: from image data to semantics, in: Workshop on SemanticKnowledge in Computer Vision, in Association with ICCV05, 16 October 2005,Beijing, China, pp. 1875–1883.

[23] M. Jahangiri, M. Petrou, Fully bottom-up blob extraction in buildingfacades, in: Proceedings of the 9th International Conference on PatternRecognition and Image Analysis: New Information Technologies, PRIA-9-2008, September 14–20, Nizhni Novgorod, Russian Federation, vol. 1, pp238–241, 2008.

[24] M. Jahangiri, M. Petrou, An attention model for extracting components thatmerit identification, in: Proceedings of ICIP, 2009.

[25] M. Jahangiri, Blob detector tool, 2009. <www.commsp.ee.ic.ac.uk/jahangiri/blobextraction.zip>

[26] Z. Kim, R. Nevatia, Expandable Bayesian networks for 3D object descriptionfrom multiple views and multiple mode inputs, IEEE Trans. Pattern Anal. Mach.Intell. 25 (6) (2003) 769–774.

[27] Z. Kim, R. Nevatia, Automatic description of complex buildings from multiplefimages, Comput. Vision Image Und. 96 (1) (2004) 60–95.

[28] N. Komodakis, G. Tziritas, N. Paragios, Fast approximately optimal solutions forsingle and dynamic MRFs, in: Proceedings of CVPR, 2007, pp. 1–8.

[29] P. Kohli, J. Rihan, M. Bray, P. Torr, Simultaneous segmentation and poseestimation of humans using dynamic graph cuts, Int. J. Comput. Vision 79 (3)(2008) 59–80.

[30] S. Lee, R. Nevatia, Interactive 3D building modelling using a hierarchicalrepresentation, in: Proceedings of the First IEEE International Workshop onHigher-Level Knowledge in 3D Modelling and Motion Analysis, Los Alamitos,CA, USA, 2003, pp. 58–65.

[31] S. Lee, S. Jung, R. Nevatia, Integrating ground and aerial views for urban sitemodelling, in: Proceeding of ICPR, 2002, pp. 107–112 vol.4.

[32] I. Levner, V. Bulitko, Machine learning for adaptive image interpretation, in:Proceedings of the 16th Innovative Applications of Artificial Intelligence’04Conference, San Jose, CA, USA, 2004.

[33] M. Marengoni, A. Hanson, S. Zilberstein, E. Riseman, Decision making anduncertainty management in a 3D reconstruction system, IEEE Trans. PatternAnal. Mach. Intell. 25 (7) (2003) 852–858.

[34] G. Metta, D. Vernon, G. Sandini, The RobotCub Approach to the Development ofCognition, in: L. Berthouze, F. Kaplan, H. Kozima, H. Yano, J. Konczak, G. Metta,J. Nadel, G. Sandini, G. Stojanov, C. Balkenius (Eds.), Proceedings of the 5thInternational Workshop on Epigenic Robotics: Modelling CognitiveDevelopment in Robotic Systems, Lund University Cognitive Studies, vol.123, 2004, ISBN 91-974741-4-2.

http://kathrin.dagstuhl.de/08091/Materials2/

http://kathrin.dagstuhl.de/08091/Materials2/

http://www.commsp.ee.ic.ac.uk/jahangiri/blobextraction.zip

http://www.commsp.ee.ic.ac.uk/jahangiri/blobextraction.zip


[35] H.H. Nagel, From image sequences towards conceptual descriptions, ImageVision Comput. 6 (2) (1988) 59–74.

[36] B. Neumann, R. Moller, A.G. Cohn, D.C. Hogg, Logic and probability for sceneinterpretation, in: Dagstuhl Workshop, 2008. <http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=08091>.

[37] B. Neumann, R. Möller, On scene interpretation with description logics, ImageVision Comput. 26 (1) (2008) 82–101.

[38] A. Ogale, A. Karapurkar, Y. Aloimonos, View-invariant modeling andrecognition of human actions using grammars, in: Workshop on DynamicalVision at ICCV, 2005, pp. 115–126.

[39] B. Ommer, J.M. Buhmann, Learning the compositional nature of visual objects,in: Proceedings of CVPR, 2007, pp. 1–8.

[40] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of PlausibleInference, Morgan Kaufman Publishers Inc., 1988.

[41] M. Petrou, Learning in computer vision: some thoughts, in: Proceeding ofCIARP, Santiago, Cile, 2007.

[42] L. de Raedt, Logical and Relational Learning, Springer, 2008.[43] M. Richardson, P. Domingos, Markov logic networks, Mach. Learn. 62 (1-2)

(2006) 107–136.[44] B.D. Ripley, N.L. Hjort, Pattern Recognition and Neural Networks, 1st ed.,

Cambridge University Press, 1995.[45] S.J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, second ed.,

Prentice-Hall Series In Artificial Intelligence, 2003.[46] G. Sandini, G. Metta, D. Vernon, RobotCub: an open framework for research in

embodied cognition, Int. J. Humanoid Robotics 8 (2) (2004).[47] S. Savarese, L. Fei-fei, 3D Generic object categorisation, localisation and pose

estimation, in: Proceedings of ICCV, 2007, pp. 1–8.[48] G. Schneider, H. Wersing, B. Sendhoff, E. Korner, Evolutionary optimisation of a

hierarchical object recognition model, IEEE Trans. Syst., Man, Cybern. – Part B:Cybern. 35 (3) (2005) 426–437.

[49] C. Siagian, L. Itti, Rapid biologically-inspired scene classification using featuresshared with visual attention, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2)(2007) 300–312.

[50] J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, W.T. Freeman, Discoveringobjects and their location in images, in: Proceedings of ICCV, 2005, pp.370–377.

[51] H.A. Sturges, The choice of a class interval, J. Am. Stat. Assoc. 21 (153) (1926)65–66.

[52] J. von Neumann, O. Morgenstern, Theory of Games and Economic Behavior,second ed., Princeton University Press, 1947.

[53] V.-T. Vu, F. Bremond, M. Thonnat, Temporal constraints for videointerpretation, in: The 15th European Conference on Artificial Intelligence(ECAI’2002), Lyon, France, 21–26 July 2002.

[54] M. Weber, M. Welling, P. Perona, Unsupervised learning of models forrecognition, in: Proceedings of ECCV, 2000, pp. 18–32.

[55] L. Wiskott, C. Malsburg, A neural system for the recognition of partiallyoccluded objects in cluttered scenes, Int. J. Pattern Recogn. Artif. Intell. 7 (4)(1993) 735–748.

[56] E.J. Wisniewsk, D. Medin, The Interaction of Theory and Similarity in RuleInduction, first ed., MIT Press, 1995.

[57] M. Xu, Scene interpretation with the Tower of Knowledge, PhD thesis, ImperialCollege London, 2010.

[58] C.E. Zhambok, G. Klein, Naturalistic Decision Making, Lawrence ErlbaumAssociates, Mahwah, NJ, 1997.

[59] H. Zimmermann, Fuzzy set theory and its applications, Kluwer AcademicPublishers, Boston, ISBN 0-7923-7435-5.

[60] IST06, E-training for interpreting images of man-made scenes. <http://www.ipb.uni-bonn.de/projects/etrims/>.

[61] http://www.emtech.net/learning_theories.htm.

http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=08091

http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=08091

http://www.ipb.uni-bonn.de/projects/etrims/

http://www.ipb.uni-bonn.de/projects/etrims/

http://www.emtech.net/learning_theories.htm

Documents

3D Scene interpretation by combining probability theory and logic: The tower of knowledge