UNIVERSIDAD DE CASTILLA-LA MANCHA
ESCUELA SUPERIOR DE INGENIERÍA INFORMÁTICA
GRADO EN INGENIERÍA INFORMÁTICA
TRABAJO FIN DE GRADO
TECNOLOGÍA ESPECÍFICA DE COMPUTACIÓN
Dynamic Bayesian Networks for semantic
localization in robotics
Fernando Rubio Perona
July 2014
UNIVERSIDAD DE CASTILLA-LA MANCHA
ESCUELA SUPERIOR DE INGENIERÍA INFORMÁTICA
Departamento de Sistemas Informáticos
TRABAJO FIN DE GRADO
TECNOLOGÍA ESPECÍFICA DE COMPUTACIÓN
Dynamic Bayesian Networks for semantic
localization in robotics
Author: Fernando Rubio Perona
Supervisors: María Julia Flores Gallego
Jesús Martínez Gómez
Collaborators: Ann Nicholson
Alex Black
July 2014
Abstract
This project presents a solution based on Bayesian Artificial Intelligence for the problem of semantic localization in autonomous robots. We have developed a methodology that covers the following steps: (1) image processing and discretization for creating feature-based scene descriptors; (2) learning of static Bayesian Networks and the Naïve Bayes classifier; (3) learning of Dynamic Bayesian Networks; (4) evaluation of the models; (5) comparison. Special attention is paid to DBNs, which have proven to be a solution worth considering.
This process involves several software tools, since no single tool can cover all these fields. That implies a considerable effort, because we must first learn all the tools in order to solve the problem. Moreover, we have to implement our own techniques for tool integration, as well as a discretization process for histograms and a method for constructing DBNs.
The whole process has been tested on a real case: the KTH-IDOL2 (Image Database for rObot Localization) dataset for scene classification. Our experimental results show that BN models obtain good accuracy values.
Resumen
Este proyecto presenta una solución basada en la Inteligencia Artificial Bayesiana para el problema de la localización semántica en Robótica Autónoma. Desarrollaremos una metodología dividida en los siguientes pasos: (1) procesamiento de imágenes y discretizado para descriptores basados en la extracción de características; (2) aprendizaje de redes Bayesianas estáticas y el clasificador Naïve Bayes; (3) aprendizaje de redes Bayesianas dinámicas; (4) evaluación de modelos; (5) comparación. Deberemos prestar atención a las DBNs, las cuales han demostrado ser una solución a considerar.
Este proceso incluye el uso de diferentes herramientas, ya que no es posible cubrir todos estos campos con solo una. Esto implica un gran esfuerzo, porque debemos conocer todas ellas. Además, tenemos que implementar nuestros propios métodos para la tarea de integrar dichas herramientas, así como el proceso de discretizado para histogramas y un constructor de DBNs.
Todo este proceso ha sido probado en un caso real: la base de datos KTH-IDOL2 para la clasificación de escenarios. Los resultados de nuestros experimentos muestran que las redes Bayesianas obtienen buenas tasas de acierto.
A mis profesores, compañeros, familia y en especial a Vanessa.
Agradecimientos
Agradecer a mis directores Julia y Jesús, que tanto me han ayudado en la realización de este proyecto. Sé que sin ellos no habría podido presentar un trabajo tan completo y trabajado como este. Agradecer a ellos también la publicación de mi primer artículo.
A mi familia y compañeros el apoyo que han prestado en la realización de este proyecto y durante toda la carrera.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1. Objectives
1.2. Report structure
Introducción (en español)
1.1. Objetivos
1.2. Estructura de la memoria
2. Bayesian Artificial Intelligence
2.1. Notation
2.2. The probability in our solution
2.3. Bayes' Theorem
2.4. Bayesian Networks
2.5. Dynamic Bayesian Networks
2.6. Classification
2.7. CaMML
3. Robot Vision and Localization
3.1. Image encoding
3.2. Local Features
3.3. Global Descriptors
3.4. PHOG
3.5. Localization
4. Experimentation
4.1. Dataset: Image CLEF 2009
4.2. Tools
4.3. First step: Descriptor Generation
4.4. Second step: Learning
4.5. Third step: Network Evaluation
5. Results
5.1. Experimental Setup
5.2. Variable reduction
5.3. Dynamic Models
6. Conclusions and further work
6.1. Conclusions
6.2. Further work
Conclusiones y trabajo futuro (en español)
6.1. Conclusiones
6.2. Trabajo Futuro
REFERENCES
LIST OF FIGURES
2.1. Bayesian Network with 3 variables and the probabilities associated to each node in the form P(X | pa(X)).
2.2. Relationships between variables in a Bayesian Network.
2.3. Bayesian Network example with 5 variables.
2.4. General structure of a Dynamic Bayesian Network.
2.5. General structure for a Naïve Bayes classifier.
2.6. Example of two DAGs within the same SEC: (a) chain; (b) common cause.
3.1. Passive sensors: digital temperature and humidity sensor (left), sound sensor microphone (center), Canon VC-C4 camera (right).
3.2. Active sensors: ultrasonic sensor (left), laser range finder (right).
3.3. Most commonly used resolutions.
3.4. Example of different models for colour display.
3.5. Visualization of the SIFT descriptor computation. For each (orientation-normalized) scale-invariant region, image gradients are sampled in a regular grid and are then entered into a larger grid of local gradient orientation histograms.
3.6. Process for visual word generation from a set of images.
3.7. Histograms of frequency: number of evidences (left) and percentage (right).
3.8. Example of a spatial pyramid with depth level 2.
3.9. Example of level depth in PHOG.
4.1. Schema for the entire process, separated by steps.
4.2. Robot platforms: Dumbo (left) and Minnie (right).
4.3. Semantic labels used in the IDOL2 dataset.
4.4. Map of the IDOL2 environment.
4.5. First step schema: data extraction.
4.6. Image taken by the robot (left). HOG descriptor (center). Histogram with 360 variables from HOG (right).
4.7. HOG histogram with 360 variables (left). HOG histogram with variable reduction n = 30 (right).
4.8. Original histogram (top right) extracted from the sample image (top left). Bottom: four histograms obtained with different n values. The colours in the bottom histograms represent the generated class labels or categories: low=red, medium/low=orange, medium/high=yellow and high=blue.
4.9. Example of an arff file (left) and a csv file (right).
4.10. Example of a csv test file for DBNs.
4.11. Second step schema: learning.
4.12. Example of two static networks with the same structure, disconnected from each other.
4.13. Example of a Dynamic Bayesian Network with related but independent classes.
4.14. Example of the final Dynamic Bayesian Network obtained with our DBN constructor.
4.15. Schema for the Dynamic Naïve Bayes (DNB) classifier.
4.16. Third step schema: network evaluation.
5.1. Sequence distribution.
5.2. Example of BNs with 10 variables and the Class: CaMML (left) and Naïve Bayes (right).
5.3. Graph of learning time for BNs (s) for the Cloudy case with different numbers of variables.
5.4. Rates of tests under the same illumination conditions as the training set: CaMML (left), Naïve (right).
5.5. Rates of tests Night and Sunny under the Cloudy illumination condition: CaMML (left), Naïve (right).
5.6. Rates of tests Cloudy and Sunny under the Night illumination condition: CaMML (left), Naïve (right).
5.7. Rates of tests Cloudy and Night under the Sunny illumination condition: CaMML (left), Naïve (right).
5.8. Sequence distribution for the “All” sequence.
5.9. Rate comparison for CaMML and Naïve Bayes training and testing with the “All” sequence.
5.10. Dynamic network with 5 variables created with CaMML.
5.11. Example of a DBN created with the DBN constructor on a CaMML model with 5 variables.
5.12. Rate comparison for DBNs training and testing with the “All” sequence with different class transitions: ModelA (left), ModelB (right).
5.13. Classification rate evolution for the All case when using dynamic and static classifiers.
6.1. Grid example with size 4x4 in the IDOL2 environment.
LIST OF TABLES
2.1. Example table of probabilities.
2.2. Room probability P(Room).
2.3. Probability of variable Mirror P(Mirror) (left). Probability of variable Fridge P(Fridge) (right).
2.4. Joint probability for Mirror and Fridge variables P(Mirror, Fridge).
2.5. Joint probability for Mirror and Room variables P(Room, Mirror).
2.6. Cond. prob. of the rooms given the value for Mirror: P(Room | Mirror).
2.7. Cond. prob. of the rooms given the value for Lamp: P(Room | Lamp).
2.8. Cond. prob. of the rooms given the value for Desk: P(Room | Desk).
2.9. Cond. prob. of the rooms given Desk and TV: P(Room | Desk, TV).
2.10. Conditional probability of seeing a desk given the value for the room P(Desk | Room).
3.1. Example of label distribution in a set of images.
4.1. CPT for C1 before applying the link with C0 (left) and after (right).
4.2. Class transition table.
4.3. CPT for C1 combined with the class transition table (left) and normalized (right).
5.1. Learning time for BNs (s) for the Cloudy case with different numbers of variables.
5.2. Accuracy (%) of the CaMML model and Naïve Bayes classifier training and testing with Cloudy.
5.3. Accuracy (%) of the CaMML model and Naïve Bayes classifier training and testing with Night.
5.4. Accuracy (%) of the CaMML model and Naïve Bayes classifier training and testing with Sunny.
5.5. Accuracy of the CaMML model and Naïve Bayes classifier training with Cloudy and testing with Night and Sunny.
5.6. Accuracy of the CaMML model and Naïve Bayes classifier training with Night and testing with Cloudy and Sunny.
5.7. Accuracy of the CaMML model and Naïve Bayes classifier training with Sunny and testing with Cloudy and Night.
5.8. Accuracy of the CaMML model and Naïve Bayes classifier training and testing with “All”.
5.9. Class transitions used for the dynamic models.
5.10. Accuracy of DBNs training and testing with the “All” sequence.
Chapter 1
Introduction
Localization is one of the main problems in autonomous robots. Information about a robot's localization can be obtained with tools like a compass or a GPS device, but this information is hard for people to use. Data such as coordinates are complicated to understand, so we look for semantic solutions. A semantic solution means that the robot can identify the surrounding environment: it is able to know whether it is inside or outside, or to identify whether it is in a park or in a bedroom, for example.
In this project I provide the basis of a possible solution for this semantic localization problem based on Bayesian Artificial Intelligence (Bayesian A.I.) [Korb & Nicholson, 2010]. This solution consists in obtaining a learned network based on Bayesian concepts. To build it we need to create variables from data and discover the relationships between them. This project focuses on two main themes: the first is to understand the behaviour of the variables and their discretization in the semantic localization problem; the second is to study the possibilities of a new dynamic approach in Bayesian Networks [Russell & Norvig, 2003].
Robots are able to obtain many types of information from the environment with different sensors, called exteroceptive sensors. We will focus on a camera, which is a passive exteroceptive sensor, i.e., one that measures environmental energy. The camera obtains images of the environment, and these are the data we will use for creating the models, in our case Bayesian Networks (BNs) [Jensen & Nielsen, 2007; Pearl, 1988]. This visual information has to be processed to obtain variables for the networks. There are many techniques to obtain information from images, some more complex than others, but exploring this field is not the aim of the project, so we use a single descriptor: the Pyramid Histogram of Oriented Gradients, or PHOG. We will only use PHOG for the results because the main theme is the behaviour of the variables under the same conditions, not the characteristics obtained with different techniques.
The solutions obtained cannot be understood without an explanation of some important concepts of Bayesian Artificial Intelligence. In this project an extended approach is also presented: Dynamic Bayesian Networks (DBNs). We also explain in detail the basics of the algorithms used in the experimentation. The first of them is Naïve Bayes, one of the best-known and most widely used classifiers based on Bayes' Theorem. The second one is CaMML, which attempts to learn the best causal network, using an MML metric and an MCMC search [Korb & Nicholson, 2010; Wallace, 2005].
Once we have described all the previous concepts, we can explain the programs used in our development process and how to obtain the information that we need with them. One of the main characteristics of this project is the use and integration of different tools, each of which is responsible for various tasks. We also define the dataset that we will use and the results that we have obtained.
The last step consists in studying the results obtained throughout this project. We use different techniques for the graphical representation of data, such as charts and graphs. These display methods allow us to take a more structured look at the information.
1.1. Objectives
As far as we know, there is no way to obtain a Bayesian Network from a set of images using a simple procedure. So our first aim is to develop a process that allows us to generate a dataset from sets of images. Then, we will have to learn Bayesian Networks from this dataset and, finally, we will show and analyse the results obtained in our experiments.
In order to carry out this development process, it will be necessary to divide it into steps and to store the intermediate results, because a set of images can generate different datasets and these, at the same time, can produce different networks. It is important to store these intermediate results to avoid repeating the same operations and to prevent a waste of time; that is, we will have to devise a consistent and efficient methodology, which involves different stages and software platforms that we will interrelate.
This development needs different tools, since it is not possible to perform the entire process with only one due to the distinct nature of the domains studied here, mainly robotics (image processing), supervised classification, and Bayesian Network learning and inference. So, in order to perform this work, an initial objective is to learn how to use these tools and what their inputs and outputs are. As a result of this study, we will have to design the format of the files storing the intermediate results so that they match; otherwise the process would be useless.
In each part of the development we need to create methods that automate that specific part of the process. This allows us to launch many experiments at the same time. This is essential in this project, since we need to compare many different cases in order to interpret the results correctly.
Once the development process is over, we have two more objectives related to the results obtained. The first of them is to study how the number of variables in the Bayesian Networks behaves and how it affects our problem. This aim consists in creating networks from the datasets with different numbers of variables, evaluating them with test and validation data and, finally, comparing the results obtained and drawing conclusions from this information.
The second one is the study of DBNs for our problem. This consists in generating different dynamic networks and comparing them with the static ones. This problem has a strong temporal component, so this dynamic approach should get better results. The process is the same as for the previous target: we need to generate different networks, test them and compare the results.
As an extra objective and a direct result of this project, we want to remark that most of the work presented here has been submitted, accepted and presented – in a summarized version – in a conference paper, whose bibliographical reference is [Rubio et al., 2014] [1].
1.2. Report structure
This project is composed of six chapters organized as follows. The first two chapters deal with the state of the art. This is divided into two parts because they cover two very different issues. The first of them gives an overall introduction to Bayesian Artificial Intelligence, including the use of static and dynamic models. We also talk about probabilistic classifiers; more specifically, Naïve Bayes and the CaMML learning procedure are introduced.
The second part of the state of the art is Robot Vision. This chapter summarizes some basic concepts of artificial vision and image processing. We see the representation of images, the extraction of local invariant features, the global descriptors and the PHOG descriptor. Finally, we explain localization from two points of view: topological and semantic.
Chapter 4 explains the entire development process in detail and addresses the first aim of the project. The first section defines the dataset and the second one describes the tools used. Each of the remaining sections of this chapter represents a step in the experimentation process, specifying the input and output data and the corresponding tool used in it. We include a scheme with the different functionalities in each step in order to give a global vision of the entire process.
[1] This paper can be downloaded from http://waf2014.redaf.es/media/189.pdf
Chapter 5 is totally devoted to the results. In this chapter we try to accomplish the two objectives related to information extraction. We visualize different graphs and charts in order to obtain information about the behaviour of the number of variables and the Dynamic Bayesian Networks.
The last chapter corresponds to the conclusions and the future work we propose.
Introducción
En la robótica autónoma la localización es uno de los principales problemas. La localización topológica puede ser obtenida con herramientas como una brújula o un dispositivo GPS, pero para mucha gente esta información puede resultar difícil de manejar. Algunos datos como las coordenadas son difíciles de entender, por lo que se buscan soluciones semánticas que permiten al robot identificar el entorno que le rodea. Un ejemplo de esto puede ser reconocer si su localización es dentro de un edificio o en el exterior, o ser capaz de identificar si está en un parque o en una habitación.
En este proyecto explicamos las bases de una posible solución para el problema de la localización semántica, basada en la Inteligencia Artificial Bayesiana [Korb & Nicholson, 2010]. Esta solución consiste en la obtención de una red basada en los conceptos Bayesianos. Para ello necesitamos crear variables a partir de los datos y descubrir las relaciones entre ellas. Este proyecto se centrará en dos temas principales: el primero es conocer el comportamiento de las variables y su discretización en el problema de la localización semántica; el segundo tema trata de estudiar las posibilidades de un enfoque dinámico en las Redes Bayesianas [Russell & Norvig, 2003].
Los robots son capaces de obtener muchos tipos de información acerca del entorno con diferentes sensores, los cuales se denominan Exteroceptivos. Nosotros nos centraremos en una cámara, un sensor Exteroceptivo Pasivo capaz de medir la energía del entorno. La cámara obtiene imágenes del espacio que le rodea y esos datos son los que usaremos para crear modelos, que en nuestro caso serán Redes Bayesianas [Jensen & Nielsen, 2007; Pearl, 1988]. Esta información visual tiene que ser procesada para poder obtener las variables necesarias para las redes. Existen muchas técnicas para obtener la información de las imágenes, algunas más complejas que otras, pero el objetivo de este proyecto no es explorar este campo, así que solo expondremos un descriptor: la Pirámide de Histogramas de Gradientes Orientados o PHOG. Con este único descriptor obtendremos los resultados, ya que el tema principal no es el análisis de las características obtenidas con diferentes técnicas, sino analizar el comportamiento de las variables en las mismas condiciones.
Las soluciones obtenidas no pueden comprenderse sin una explicación de conceptos importantes sobre la Inteligencia Artificial Bayesiana. Además, en nuestro proyecto también mostraremos una extensión a estos conceptos, las Redes Bayesianas Dinámicas (DBNs). Explicaremos en detalle las bases de los modelos usados en la experimentación. El primero de ellos es Naïve Bayes o Bayes Ingenuo, uno de los clasificadores más utilizados basados en el Teorema de Bayes. El segundo es CaMML, que intenta aprender la mejor red causal, usando una métrica MML y una búsqueda MCMC [Korb & Nicholson, 2010; Wallace, 2005].
Una vez que conocemos todos los conceptos previos, definiremos los programas empleados durante todo el proceso de desarrollo realizado y cómo obtener la información que necesitamos. Una de las características principales que definen a este proyecto es el uso e integración de diferente software, donde cada una de estas herramientas es responsable de varias tareas. También definiremos el conjunto de datos que vamos a usar y el tipo de resultados que obtendremos.
El último paso consiste en estudiar los resultados obtenidos a lo largo de este proyecto. Usaremos diferentes técnicas para la representación gráfica de los datos, como tablas y gráficas. Estos métodos de visualización nos permiten realizar un resumen más detallado de la información.
1.1. Objetivos
Hasta donde sabemos, no hay ninguna forma de obtener una red Bayesiana de un conjunto de imágenes usando un procedimiento simple. Así, nuestro primer objetivo es desarrollar un proceso que nos permita generar un conjunto de datos a partir de conjuntos de imágenes. Después, tendremos que aprender redes Bayesianas de la información de este conjunto de datos y, finalmente, mostraremos y analizaremos los resultados obtenidos a través de nuestros experimentos.
Para realizar este proceso de desarrollo necesitaremos dividirlo en pasos y guardar los resultados intermedios, porque un conjunto de imágenes puede generar diferentes conjuntos de datos y estos, al mismo tiempo, pueden aprender diferentes redes. Es importante guardar estos resultados intermedios para evitar repetir las mismas operaciones y prevenir así el gasto de tiempo innecesario. Esto quiere decir que tendremos que crear una metodología consistente y eficiente que abarque las diferentes etapas del proceso y la interrelación entre las plataformas software.
Así, este desarrollo necesita diferentes herramientas, puesto que no es posible realizar el proceso entero solo con una debido a la distinta naturaleza de los dominios aquí estudiados, principalmente la robótica (procesamiento de imágenes), la clasificación supervisada y el aprendizaje e inferencia de las Redes Bayesianas. Para ello, a fin de realizar este trabajo, un objetivo inicial sería saber usar estas herramientas y cuáles son sus entradas y salidas de información. Como resultado de este estudio, tendremos que diseñar el formato de los archivos que almacenan los resultados intermedios para que coincidan; de otra forma el proceso sería inútil.
En cada parte del desarrollo necesitamos crear métodos que automaticen cada una de las partes específicas del proceso. Esto nos permite ejecutar varios experimentos al mismo tiempo. Esto es esencial en este proyecto, ya que necesitamos comparar diferentes casos para poder entender correctamente los resultados.
Una vez finalizado el proceso de desarrollo, tenemos dos objetivos más relacionados con los resultados que obtendremos. El primero de ellos es el estudio del comportamiento del número de variables en las redes Bayesianas y cómo afecta a nuestro problema. Este objetivo consiste en aprender redes a partir de los conjuntos de datos con diferentes números de variables, probarlas con datos de test y validación y, finalmente, comparar los resultados obtenidos y extraer conclusiones sobre esta información.
El segundo objetivo relacionado con los resultados es el estudio de las DBNs para nuestro problema. Esto consiste en generar diferentes redes dinámicas y compararlas con las estáticas. Este problema tiene una fuerte componente temporal, así que esta aproximación dinámica debería obtener mejores resultados. El proceso es el mismo que en el objetivo anterior: necesitamos generar diferentes redes, probarlas y comparar los resultados.
Como un objetivo extra y un resultado directo de este proyecto, queremos remarcar que la mayoría del trabajo aquí presentado ha sido enviado, aceptado y presentado –en una versión resumida– en un artículo de conferencia, cuya referencia bibliográfica es [Rubio et al., 2014] [2].
1.2. Estructura de la memoria
Este proyecto se compone de seis capítulos organizados de la siguiente forma. Los dos primeros capítulos tratan sobre el estado del arte. Esta parte está dividida en dos porque se habla de dos temas muy diferentes. El primero de ellos nos aporta una introducción general a la Inteligencia Artificial Bayesiana, incluyendo el uso de modelos estáticos y dinámicos. También hablamos sobre los clasificadores probabilísticos y, más específicamente, introduciremos los procedimientos de aprendizaje de Naïve Bayes y CaMML.
La segunda parte del estado del arte es la Visión Artificial. Este capítulo resume algunos conceptos básicos sobre la visión artificial y el procesamiento de imágenes. Veremos la representación de imágenes, la extracción de características invariantes locales, el proceso de descripción global y el descriptor PHOG antes mencionado. Por último, explicaremos brevemente la localización desde dos puntos de vista: el topológico y el semántico.
[2] Este artículo puede ser descargado de http://waf2014.redaf.es/media/189.pdf
El capítulo 4 explica en detalle todo el proceso de desarrollo y define el primer objetivo del proyecto. La primera sección define el conjunto de datos y la segunda sección se encarga de las herramientas utilizadas. Las secciones restantes de este capítulo representan cada uno de los pasos del proceso de experimentación, en los cuales se especifica la entrada y salida de información y la herramienta utilizada. Incluimos un esquema con la diferente funcionalidad en cada paso, para poder tener una visión global del proceso entero.
El capítulo 5 está dedicado totalmente a los resultados. En este capítulo intentamos alcanzar los dos objetivos relacionados con la extracción de la información. Visualizaremos las diferentes gráficas y tablas para poder obtener información acerca del comportamiento del número de variables y las DBNs.
El último capítulo corresponde a las conclusiones y el trabajo futuro que nosotros proponemos.
Chapter 2
Bayesian Artificial Intelligence
2.1. Notation
Random variables are represented by upper-case roman letters: X, Y, etc. A set of variables is written with the same letters, but in bold: X, Y, etc., and the variables of these sets are represented by the same letter as random variables, but with a subscript:

X = {X1, X2, . . . , Xn}    (2.1)

The set of values that a variable can take is written with the Greek character Ω, and its arity by |Ω|; it usually has a subscript indicating the variable it refers to: ΩX.
The values of a variable are represented by lower-case letters: a, b, etc., and these values belong to a variable X if a ∈ ΩX.
If the variable is binary, we can write one of its values with the same character as the variable in lower case, and the opposite value is represented with the same letter but with a horizontal line above it: ΩX = {x, x̄}.
Probabilities are represented by the upper-case character P. P(X) represents the probabilities of all the values that the variable X can take. For example, given a ∈ ΩX, the probability of this specific value is represented as P(X = a) or P(a).
The most common way to represent a probability distribution is a table, where each value of the variable has a probability. For example, if we have a variable X with ΩX = {a1, a2, . . . , am}, we can see in Table 2.1 how the probabilities are represented.

X     Probability
a1    P(X = a1)
a2    P(X = a2)
⋮     ⋮
am    P(X = am)

Table 2.1: Example table of probabilities
We represent the intersection of variables with a comma (,) or with ∩. This denotes the joint probability, as we will see in the corresponding section.
The conditional relationship is represented by the vertical line ( | ). The term on the left is the hypothesis we want to know about and the term on the right is the events we already know. An example of this notation is P(X | Y).
The symbol ⊥ represents independence between variables. It can also be represented by the upper-case letter I, but the expressions are different depending on the character used. We will see these differences later.
2.2. The probability in our solution
The solutions explained in this project, and most concepts and techniques of Bayesian A.I., are based on probabilistic inference. First of all, what is probabilistic inference? It is the ability to obtain some evidences, with a certain degree of probability, from the observation of other evidences in a set of variables. This means that if we see some characteristic in an image, we can obtain a probability for the place the robot is in. For example, if we are in a house and we see a fridge, we can say with high probability that the robot is in the kitchen, although it could be in another room. If we see a mirror, the robot may be in a bathroom or in a bedroom with the same probability, or maybe in another room with a lower probability.
To calculate the probability distribution for each variable we need to capture the relationships between variables. As we will see later, for this purpose we will use models able to represent those dependences and independences, namely Bayesian Networks [Jensen & Nielsen, 2007; Korb & Nicholson, 2010; Pearl, 1988]. The construction of these networks normally requires two phases: structure learning and parametric learning, since the second phase depends on the first one. A model can be obtained from an expert using knowledge engineering, but this process is difficult and slow, and needs specific techniques for knowledge elicitation. Besides, it requires the availability of an expert in the domain to be modelled. Because of these difficulties, and thanks to the increasing popularity and development of Machine Learning techniques, automatic learning of Bayesian Networks from data is also possible, and it is very often done [Cooper & Herskovits, 1992; Heckerman et al., 1995; Neapolitan, 2003]. We will mainly focus on this second approach. Then, we will usually estimate the probability distributions from previous experience; these experienced data are called training data.
In the following subsections, we introduce basic concepts about probability which are necessary to understand how Bayesian Networks work.
2.2.1. Marginal Probability
This distribution gives the probability that each value or state of the variable will appear or happen, and it is calculated differently depending on the type of variable (discrete or continuous).
Discrete variables have a probability for each value, and we calculate it by counting from data how many times the value appears and dividing by the number of training data cases. As the variables we work with are random, these values can change depending on the training data, so normally we work with frequencies.
When the variables are continuous, we calculate the probabilities with an estimation approach such as a Gaussian distribution.
An example of a discrete variable could be determining where the robot is. We have training data consisting of images labelled with five types of room, so the variable Room has five values: kt (kitchen), be (bedroom), ba (bathroom), lr (living-room) and cr (corridor). Now we obtain the probabilities by counting how many times each room appears in the labels and dividing by the total number of images. We can see in Table 2.2 that the robot is more likely to be in a big room like the living-room.
Room Probability
kt 0.20
be 0.20
ba 0.15
lr 0.30
cr 0.15
Table 2.2: Room probability P(Room)
Now we look at all the images and count how many times we see a mirror and a fridge. Each of these variables has two values: whether we see the object or not.
But the marginal probabilities are not enough to obtain results. As we can see in Table 2.3, if there is a mirror in the image we do not know where the robot is, and the same holds for the fridge. But we know that if we see the fridge, the robot has a high probability of being in the kitchen, so we have to relate the Room variable with the Fridge probability.
Mirror  Probability
m       0.70
m̄       0.30

Fridge  Probability
f       0.90
f̄       0.10
Table 2.3: Probability of variable Mirror P(Mirror) (left). Probability of variable Fridge P(Fridge) (right)
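To make the counting procedure concrete, here is a minimal Python sketch (Python is used only for illustration; it is not one of the tools employed in this project) that estimates marginal probabilities by relative frequency. The `marginal` helper and the synthetic room labels are hypothetical, chosen so that the output reproduces Table 2.2.

```python
from collections import Counter

def marginal(values):
    """Estimate P(X) from a list of observed values by relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical training labels for the variable Room (kt, be, ba, lr, cr).
rooms = ["kt"] * 20 + ["be"] * 20 + ["ba"] * 15 + ["lr"] * 30 + ["cr"] * 15
print(marginal(rooms))
# {'kt': 0.2, 'be': 0.2, 'ba': 0.15, 'lr': 0.3, 'cr': 0.15}, matching Table 2.2
```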
2.2.2. Joint Probability
The union of variables is described by the joint probability distribution. It is the probability of observing several evidences at the same time. It is a probability distribution over the observed variables, and it has as many values as there are combinations of the possible values of each variable.
In the previous example we have a mirror and a fridge; these are our variables, and each one has two values, present or not. With this, the table of joint probability looks like Table 2.4. We obtain this probability by counting how many times both objects appear in the same image, how many times only one of them appears, and how many times there is none.
     m     m̄
f    0.61  0.29
f̄    0.09  0.01
Table 2.4: Joint probability for Mirror and Fridge variables P(Mirror, Fridge)
In this case there are only two variables, but the size of the table grows exponentially with the number of variables N, as we can see in Equation 2.2:

size of the table = ∏_{k=1}^{N} |Ω_k|    (2.2)

That means that if we have 50 binary variables, the size of the table will be 2⁵⁰ ≃ 10¹⁵, which is impossible to calculate and store; this will be alleviated by the factorisation in BNs. Now we can relate the room and the object information, as we see in Table 2.5.
     kt    be    ba    lr    cr
m    0.20  0.15  0.05  0.20  0.10
m̄    0     0.05  0.10  0.10  0.05
Table 2.5: Joint probability for Mirror and Room variables P(Room, Mirror)
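Joint tables such as Table 2.4 can be estimated the same way, by counting co-occurrences. A hedged sketch with a synthetic set of 100 hypothetical image descriptions chosen to reproduce the table (`m~` and `f~` stand in for the barred values m̄ and f̄):

```python
from collections import Counter

def joint(cases, var_a, var_b):
    """Estimate P(A, B) from a list of cases (dicts) by relative frequency."""
    counts = Counter((case[var_a], case[var_b]) for case in cases)
    total = len(cases)
    return {combo: count / total for combo, count in counts.items()}

# Hypothetical cases: whether a mirror and a fridge are seen in each image.
cases = ([{"Mirror": "m", "Fridge": "f"}] * 61
         + [{"Mirror": "m~", "Fridge": "f"}] * 29
         + [{"Mirror": "m", "Fridge": "f~"}] * 9
         + [{"Mirror": "m~", "Fridge": "f~"}] * 1)
print(joint(cases, "Mirror", "Fridge"))  # reproduces Table 2.4
```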
2.2.3. Conditional Probability
We can measure the probability distribution of a variable when we know the values of other variables by means of the conditional probability. For example, what is the probability of being in the bedroom when we see a mirror? The conditional probability is calculated by Equation 2.3.

P(X | Y) = P(X, Y) / P(Y)    (2.3)

We can now answer questions like the one above with the conditional probability of being in a room given that we see (or do not see) an object, in this case a mirror. We show these probabilities in Table 2.6; they are obtained from the tables above. If we see a mirror, it is more likely that the robot is in the kitchen or in the living-room than in the other rooms.
     kt    be    ba    lr    cr
m    0.29  0.21  0.07  0.29  0.14
m̄    0     0.17  0.33  0.33  0.17
Table 2.6: Cond. prob. of the rooms given the value for Mirror: P(Room | Mirror)
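Equation 2.3 can be applied mechanically. The following tentative sketch divides the joint distribution of Table 2.5 by the marginal of Table 2.3 and recovers, up to rounding, the m row of Table 2.6; the numbers are transcribed from those tables and `m~` again stands for m̄.

```python
# Joint P(Room, Mirror) from Table 2.5 and marginal P(Mirror) from Table 2.3.
joint_room_mirror = {
    ("kt", "m"): 0.20, ("be", "m"): 0.15, ("ba", "m"): 0.05,
    ("lr", "m"): 0.20, ("cr", "m"): 0.10,
    ("kt", "m~"): 0.00, ("be", "m~"): 0.05, ("ba", "m~"): 0.10,
    ("lr", "m~"): 0.10, ("cr", "m~"): 0.05,
}
p_mirror = {"m": 0.70, "m~": 0.30}

def conditional(joint, marginal, evidence):
    """Equation 2.3: P(X | Y = evidence) = P(X, evidence) / P(evidence)."""
    return {x: p / marginal[evidence]
            for (x, y), p in joint.items() if y == evidence}

print(conditional(joint_room_mirror, p_mirror, "m"))
# ≈ {'kt': 0.29, 'be': 0.21, 'ba': 0.07, 'lr': 0.29, 'cr': 0.14}, the m row of Table 2.6
```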
2.2.4. Chain Rule
This rule allows us to calculate any joint probability distribution through conditional probabilities. If we have a set of variables X = {X1, X2, . . . , Xn}, we can link the conditional probabilities with the joint probability through Equation 2.3, which we have seen before, and we obtain:

P(Xn, . . . , X1) = P(Xn | Xn−1, . . . , X1) · P(Xn−1, . . . , X1)    (2.4)

Now we repeat the process with Xn−1:

P(Xn−1, . . . , X1) = P(Xn−1 | Xn−2, . . . , X1) · P(Xn−2, . . . , X1)    (2.5)

And we repeat this until we reach P(X1). Finally, we join all these operations and obtain the product:

P(Xn, . . . , X1) = P(X1) · ∏_{k=2}^{n} P(Xk | Xk−1, . . . , X1)    (2.6)

In a case with three variables, like Room, Fridge and Mirror, to calculate the joint probability of these three variables we can use the Chain Rule as shown in Equation 2.7. Notice that this way of ordering the variables to produce the chain is not arbitrary; it depends on the structure of dependences between variables but, as we will show in subsection 2.4.1, the fact that the graph underlying the Bayesian Network is acyclic guarantees that a possible ordering can be found.

P(Room, Fridge, Mirror) = P(Room | Fridge, Mirror) · P(Fridge | Mirror) · P(Mirror)    (2.7)
2.2.5. Independence
When we work with conditional probabilities, a basic concept is independence. A variable X is independent of another variable Y when the knowledge of Y does not affect the probability of X (Equation 2.8).

X ⊥ Y ≡ P(X | Y) = P(X)    (2.8)

This is known as marginal independence and can also be represented as I(X | ∅ | Y). An example of independence in our case would be seeing a lamp, with the probabilities of Table 2.7. The lamp does not give us information about where the robot is: the probabilities are the same whether we see a lamp or not.

     kt    be    ba    lr    cr
l    0.20  0.20  0.15  0.30  0.15
l̄    0.20  0.20  0.15  0.30  0.15
Table 2.7: Cond. prob. of the rooms given the value for Lamp: P(Room | Lamp)
It is not easy to find marginal independences between variables, so we need another kind of independence. Conditional independence occurs when observing an event Y does not affect the conditional probability P(X | Z):

X ⊥ Y | Z ≡ P(X | Y, Z) = P(X | Z)    (2.9)

As before, conditional independence can also be represented as I(X | Z | Y). In this example we see a desk and a TV, and we have Tables 2.8 and 2.9. We can see how knowing the value of TV does not affect the conditional probability of Room given Desk.

     kt    be    ba    lr    cr
d    0.28  0.07  0.25  0.21  0.19
d̄    0.09  0.37  0.02  0.42  0.10

Table 2.8: Cond. prob. of the rooms given the value for Desk: P(Room | Desk)

        kt    be    ba    lr    cr
d, tv   0.28  0.07  0.25  0.21  0.19
d, tv̄   0.28  0.07  0.25  0.21  0.19
d̄, tv   0.09  0.37  0.02  0.42  0.10
d̄, tv̄   0.09  0.37  0.02  0.42  0.10

Table 2.9: Cond. prob. of the rooms given Desk and TV: P(Room | Desk, TV)
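The independence shown in Table 2.9 can also be checked mechanically: Room is conditionally independent of TV given Desk exactly when, for each value of Desk, the rows for tv and its complement coincide. A small sketch under the same transcribed numbers (`d~` and `tv~` stand for the barred values):

```python
# P(Room | Desk, TV) transcribed from Table 2.9, keyed by (desk, tv).
cpt = {
    ("d", "tv"): {"kt": 0.28, "be": 0.07, "ba": 0.25, "lr": 0.21, "cr": 0.19},
    ("d", "tv~"): {"kt": 0.28, "be": 0.07, "ba": 0.25, "lr": 0.21, "cr": 0.19},
    ("d~", "tv"): {"kt": 0.09, "be": 0.37, "ba": 0.02, "lr": 0.42, "cr": 0.10},
    ("d~", "tv~"): {"kt": 0.09, "be": 0.37, "ba": 0.02, "lr": 0.42, "cr": 0.10},
}

def cond_independent(cpt, desk_values=("d", "d~"), tv_values=("tv", "tv~")):
    """Room ⊥ TV | Desk holds iff the rows agree for each value of Desk."""
    return all(cpt[(d, tv_values[0])] == cpt[(d, tv)]
               for d in desk_values for tv in tv_values)

print(cond_independent(cpt))  # True: knowing TV adds nothing once Desk is known
```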
2.3. Bayes' Theorem
Bayes' Theorem is one of the most important formulas in probability theory. This theorem, formulated by Reverend Thomas Bayes, is the result of the mathematical manipulation of conditional probabilities. Both P(X | Y) and P(Y | X) can be expressed in terms of the same joint probability, as we see in Equations 2.10 and 2.11.

P(X | Y) = P(X, Y) / P(Y)    (2.10)

P(Y | X) = P(X, Y) / P(X)    (2.11)

Then we equate both equations and solve for one of the conditional probabilities, obtaining the Bayes' Theorem of Equation 2.12.

P(X | Y) = P(Y | X) P(X) / P(Y)    (2.12)

It asserts that the probability of a hypothesis X conditioned on Y is equal to its likelihood P(Y | X) multiplied by the prior probability P(X); the result is then normalized by dividing by P(Y), so that the conditional probabilities sum to 1.
We will see the importance of Bayes' Theorem with the next example. Suppose that we have no access to the training data and the information we have is given by an expert. He tells us the probabilities of seeing different objects given the room the robot is in, and the probabilities the robot has of being in each of these rooms (Table 2.2). An example of this is the desk information represented in Table 2.10.

     d     d̄
kt   0.80  0.20
be   0.20  0.80
ba   0.95  0.05
lr   0.40  0.60
cr   0.70  0.30

Table 2.10: Conditional probability of seeing a desk given the value for the room P(Desk | Room)

These probabilities are much easier to obtain than those of Table 2.8. That is why Bayes' Theorem is so important: it allows us to link the probability of seeing an object in a room with the probability of being in a room when we see an object.
Another example of the importance of Bayes' Theorem is its use in the field of medicine, where it links the probability of the symptoms given the disease with the probability of having a certain disease when we know the symptoms.
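To make this inversion explicit, here is a tentative Python sketch that applies Equation 2.12 to the expert's likelihoods of Table 2.10 and the prior of Table 2.2; up to rounding, it recovers the d row of Table 2.8, confirming the consistency of the example.

```python
# Prior P(Room) from Table 2.2 and likelihood P(Desk = d | Room) from Table 2.10.
p_room = {"kt": 0.20, "be": 0.20, "ba": 0.15, "lr": 0.30, "cr": 0.15}
p_desk_given_room = {"kt": 0.80, "be": 0.20, "ba": 0.95, "lr": 0.40, "cr": 0.70}

def bayes_invert(prior, likelihood):
    """Equation 2.12: P(Room | d) = P(d | Room) P(Room) / P(d)."""
    unnormalized = {room: likelihood[room] * prior[room] for room in prior}
    p_evidence = sum(unnormalized.values())  # P(d), the normalizing constant
    return {room: value / p_evidence for room, value in unnormalized.items()}

print(bayes_invert(p_room, p_desk_given_room))
# ≈ {'kt': 0.28, 'be': 0.07, 'ba': 0.25, 'lr': 0.21, 'cr': 0.19}, the d row of Table 2.8
```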
2.4. Bayesian Networks
A Bayesian Network is a Directed Acyclic Graph (DAG) where each node has an associated conditional probability distribution.
The nodes are random variables and the directed arcs represent direct dependencies between them. If we have the link X → Y, it means that the variable X is a parent of Y. The CPTs (conditional probability tables) store the conditional probability of the variable in each node given its parents, P(X | pa(X)). In Figure 2.1 we can see an example of a Bayesian Network with the probabilities related to each node.
Figure 2.1: Bayesian Network with 3 variables and the probabilities associated to each node in the form P(X | pa(X)).
There are many methods to obtain these relationships and probability tables. The main way to get them is to analyse a set of training data with different techniques. We can also add links previously established by an expert to this process, or we can use a hybrid approach where the learning algorithm accepts expert information as priors which can be input into the algorithm. We say more about these methods in the CaMML section.
2.4.1. The Markov Property
In Figure 2.2 we see the different relationships between the variable A and the rest of the graph representing a Bayesian Network. Variables in the green area, labelled as C, are its parents, represented by pa(A). Those in the blue zone (labelled as B) are the non-descendants of the variable A other than the parents, and we identify them with nonde(A). And those in red are its descendants, noted as de(A). The Markov property says that a variable is conditionally independent of its non-descendants given its parents. We could reach this conclusion using the concept of d-separation (see [Jensen & Nielsen, 2007] – chapter 2, or [Korb & Nicholson, 2010] – chapter 2, for further detail).

A ⊥ nonde(A) | pa(A)    (2.13)
Figure 2.2: Relationships between variables in a Bayesian Network.
2.4.2. Inference
As we know, a joint probability can be expressed by conditional probabilities through the chain rule (Equation 2.6). Once we know the relationships (links/edges in the Bayesian Network) and given the Markov property, we can reduce the expression to Equation 2.14, because a variable is independent of its non-descendants given its parents.

P(Xn, . . . , X1) = ∏_{k=1}^{n} P(Xk | pa(Xk))    (2.14)

We only stated that a variable is independent of its non-descendants given its parents, yet we also remove the descendants from the equation. We can do this because Bayesian Networks are DAGs, so there is always an ordering in the decomposition into conditional probabilities that leaves no descendants on the right-hand side. We can see a Bayesian Network in Figure 2.3. In this example we will find the configuration that leaves no descendants in the right term. The best way to do this is to start the chain with the nodes that have no descendants and finish with those that have no parents.
The first step is to apply the Chain Rule, starting with the nodes without descendants and then their parents, and so on:

P(A, B, C, D, E) = P(E | D, C, B, A) · P(D | C, B, A) · P(C | B, A) · P(B | A) · P(A)    (2.15)
Figure 2.3: Bayesian Network example with 5 variables.
Then we use the Markov property, keeping only the parents on the right-hand side and removing the remaining non-descendants:

P(A, B, C, D, E) = P(E | D) · P(D | C, B) · P(C | A) · P(B | A) · P(A)    (2.16)
If we suppose that all the variables are binary, the biggest table we now have to calculate has a size of 2³, whereas before we had a table of size 2⁵. The reduction in memory cost is significant even with only 5 binary variables: from 2⁵ = 32 entries to 3 × 4 (P(E | D), P(C | A) and P(B | A), with four entries each in the CPT) plus 8 (P(D | C, B)) plus 2 (P(A)), 22 in total.
This reduction is much clearer for larger networks. Suppose a network with 50 binary variables, which can be considered small, with the following structure: 10 variables have no parents (20 entries), 10 have 1 parent (10 × 4 = 40 entries), 10 have 2 parents (10 × 8 = 80 entries), 10 have 3 parents (10 × 16 = 160 entries) and 10 have 4 parents (10 × 32 = 320 entries). In total we need 620 entries/values to store [1], which implies a huge reduction with respect to 2⁵⁰ ≃ 10¹⁵. Indeed, the latter is not manageable. This simple example shows how important and necessary a factorisation is, and how Bayesian Networks succeed in performing this factorisation using the independences the network structure is able to model.
If we recall, the conditional tables stored in a Bayesian Network correspond to each variable given its parents. If we have a case with specific values for the variables in the example above, like P(a, b, c, d, e), we only have to find the corresponding values in the tables and multiply them:

P(a, b, c, d, e) = P(e | d) · P(d | c, b) · P(c | a) · P(b | a) · P(a)    (2.17)
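As an illustration of Equation 2.17, the following speculative sketch evaluates the joint probability of one full configuration for the structure of Figure 2.3 (arcs A → B, A → C, B → D, C → D, D → E) by multiplying one CPT entry per node; all probability values are invented for the example.

```python
# P(child = True | parents), keyed by the parent configuration; invented numbers.
p_a_true = 0.3
p_b_true = {True: 0.8, False: 0.1}           # key: value of A
p_c_true = {True: 0.6, False: 0.4}           # key: value of A
p_d_true = {(True, True): 0.9, (True, False): 0.5,
            (False, True): 0.7, (False, False): 0.05}  # key: (B, C)
p_e_true = {True: 0.75, False: 0.2}          # key: value of D

def bernoulli(p_true, value):
    """Return P(X = value) given P(X = True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Equation 2.17: multiply each node's CPT entry given its parents."""
    return (bernoulli(p_a_true, a)
            * bernoulli(p_b_true[a], b)
            * bernoulli(p_c_true[a], c)
            * bernoulli(p_d_true[(b, c)], d)
            * bernoulli(p_e_true[d], e))

print(joint(True, True, False, True, True))  # P(a, b, c̄, d, e) = 0.036
```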
Remark that usually we will perform queries to Bayesian Networks, so that we won't directly ask for joint probabilities; rather, we will want to compute posterior probabilities given some evidence or observations. Besides, these queries won't normally involve all variables (for big structures), thanks to independences. For that purpose, Bayes' Theorem will be used, and the computations involved will be optimized internally using inference techniques, whose description is out of the scope of this work (see [Korb & Nicholson, 2010] – chapter 3).
[1] These computations omit that we can derive some values, as we know that some probabilities sum up to 1.0; for example, given P(x) we can get P(x̄) as 1.0 − P(x).
2.5. Dynamic Bayesian Networks
The term dynamic refers to temporal relationships between variables. In this approach we consider that variables have different states over time. In cases like robot localization or a meteorological problem, temporal information can be very significant. For example, if your robot is in a bedroom and it moves half a metre, it is probably still in the same place or, at most, in the corridor.
Bayesian Networks are not able to model temporal relationships between variables. One possible way to represent temporal links is to add a copy of the variables representing a different time moment, changing their names so that we can identify both the variable and the time instant. We have to define a new domain of interpretation for this purpose.
If our domain has n variables V = {V1, V2, ..., Vn}, each one represents a node in the static network. The current time step is denoted by t; the previous steps are denoted by t − 1, ..., t − m, where t − (i + 1) is the immediate predecessor of t − i, and the posterior steps are denoted by t + 1, ..., t + r, where t + (i + 1) is the immediate successor of t + i. Each time step is called a time-slice.
Once we have the nodes, it is the turn of the arcs. Now we have two types of relationships between nodes. First, the relationships between variables in the same time-slice; these are called intra-slice arcs, X_i^t → X_j^t. Usually the intra-slice arcs are the same in each time-slice, because the structure does not usually change over time.
The second type of relationship between variables is called inter-slice or temporal arcs. This includes the relationships between the same variable over time, X_i^t → X_i^{t+1}, and between different variables over time, X_i^t → X_j^{t+1}. In most cases, the value of a variable at one time affects the value of the same variable in the next step.
There are some rules for these temporal arcs: a variable of a posterior time-slice cannot be an antecedent of a variable in a previous time-slice, as it makes no sense for a previous state to be modified by a posterior value. The other rule is that an arc cannot span more than a single time step. This is because the state of the world at a particular time is assumed to depend only on the previous state and any action taken in it.
Figure 2.4: General structure of a Dynamic Bayesian Network.
Then, to obtain the Conditional Probability Table for a node X_i^t we can use the same method as in Bayesian Networks, but now we have two types of arcs and hence two types of parents: the inter-slice parents X_i^{t−1} and Z_1^{t−1}, ..., Z_r^{t−1}, and the intra-slice parents Y_1^t, ..., Y_m^t. The CPT is:

P(X_i^t | Y_1^t, ..., Y_m^t, X_i^{t−1}, Z_1^{t−1}, ..., Z_r^{t−1})

Once we have the relationships and the CPTs, the inference process is the same as in Bayesian Networks.
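To give a flavour of how an inter-slice arc such as C^{t−1} → C^t acts, here is a hedged sketch of a single temporal prediction step over the Class variable (the room) using a class transition table; the persistence probabilities below are invented, not those used later in Table 5.9.

```python
rooms = ["kt", "be", "ba", "lr", "cr"]

# Hypothetical class transition table P(C_t | C_{t-1}): rooms mostly persist.
transition = {r: {s: (0.8 if s == r else 0.05) for s in rooms} for r in rooms}

def forward_step(belief, transition):
    """One temporal step: P(C_t = s) = sum over c of P(C_t = s | C_{t-1} = c) P(C_{t-1} = c)."""
    return {s: sum(transition[c][s] * belief[c] for c in belief) for s in rooms}

belief = {"kt": 0.9, "be": 0.025, "ba": 0.025, "lr": 0.025, "cr": 0.025}
print(forward_step(belief, transition))  # mass stays concentrated on the kitchen
```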
2.6. Classification
This section is dedicated to statistical classification in machine learning [Mitchell, 1997]. Classification is the problem of categorizing a new input case when we have previous knowledge of the problem, based on a training set whose categories are known. If we consider that all the categories are values of a variable, the problem tries to predict the output of this variable when we know or observe the values of other variables. These other variables (input values) are called attributes, features or predictive variables. The variable to be predicted (output) is known as the Class variable; classification can be seen as a labelling task, where the possible labels are the possible values or states that the Class can take.
Algorithms that implement classification are called classifiers. This task of classification in machine learning is also known as supervised learning, a way to distinguish it from unsupervised learning, or clustering. The supervised adjective comes from the fact that classification algorithms (classifiers) learn from previously classified instances, and the possible values for the Class are known beforehand, while in unsupervised learning the algorithm has to extract a grouping of the cases that is initially unknown. Thus, classification algorithms learn from a training dataset in which we know the value of the Class. Once a classifier is trained, it will be able to give us a value/label for the Class when a new instance or case is input; this instance has observations only for the predictive attributes, and we ask about the Class variable.
In order to prove the effectiveness of the classifier we use test data, already labelled but not used for training the model, for the sake of fairness and to evaluate generalization – we have to avoid overfitting [2]. These data contain different input cases. The test process consists in obtaining a value for the Class and comparing it with the real value of the case. In this process we obtain information such as the accuracy (or hit rate) and the confusion matrix. The hit rate represents the percentage of correctly classified instances. The confusion matrix, or error matrix, gives this information in a more detailed way: each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
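As a concrete sketch of this evaluation step (with toy labels rather than the real dataset), the hit rate and confusion matrix can be computed as follows; `evaluate` is a hypothetical helper, not part of any tool used in the project.

```python
from collections import defaultdict

def evaluate(true_labels, predicted_labels):
    """Return the hit rate and a confusion matrix: rows = actual, columns = predicted."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = hits / len(true_labels)
    confusion = defaultdict(lambda: defaultdict(int))
    for t, p in zip(true_labels, predicted_labels):
        confusion[t][p] += 1
    return accuracy, {row: dict(cols) for row, cols in confusion.items()}

# Toy example: five test images with true rooms and the classifier's predictions.
true = ["kt", "kt", "ba", "lr", "cr"]
pred = ["kt", "lr", "ba", "lr", "kt"]
accuracy, confusion = evaluate(true, pred)
print(accuracy)   # 0.6
print(confusion)  # e.g. row 'kt' -> {'kt': 1, 'lr': 1}
```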
2.6.1. Probabilistic Classifier
A probabilistic classifier gives us a probability distribution for the Class variable when it receives a sample input. The basis of the classifier is a conditional distribution P(C | Y), where the input Y is the known right-hand part and C is the Class variable that we want to classify. The target of the classifier is to obtain the value of the variable C that maximizes that probability. This will be the value predicted by the classifier, as we can see in Equation 2.18.

ĉ = argmax_c P(C = c | Y)    (2.18)

Bayesian Networks can be used as probabilistic classifiers. Bayesian classifiers use Bayes' Theorem and the inference process of the networks to obtain the probabilities of the Class and give us the best value, as we can see in Equation 2.19.

ĉ = argmax_c ( P(Y | C = c) P(C = c) / P(Y) )    (2.19)

As the value of P(Y) is the same for all the values c, and it is only used for normalization, we can remove this part of the operation, leaving us the next equation:

ĉ = argmax_c P(Y | C = c) P(C = c)    (2.20)

For example, given a binary class, if P(c | Y) = 0.21 and P(c̄ | Y) = 0.79, Y will be assigned to class c̄.
2Overfitting occurs when a model begins to memorize the training data rather than learning to generalize from the underlying trend.
2.6.2. Naıve Bayes Classifier
Naıve Bayes [Domingos & Pazzani, 1997] is the simplest classifier based on Bayes' theorem, because it assumes that all variables are independent given the Class. In Naıve Bayes the Class is linked to all the variables and is, therefore, the parent of all of them. The graph structure of Naıve Bayes is the one shown in Figure 2.5. As we can see, the CPTs of the variables are very simple, since the model only involves a marginal probability P(C) for the class variable and P(X_i | C) for the rest. Hence, the number of entries of each of these tables is |Ω_{X_i}| × |Ω_C|.
Figure 2.5: General structure for a Naıve Bayes Classifier.
This classifier is very simple and easy to implement. It reduces training time and its results are good in some areas, such as spam detection [Sahami et al., 1998]. We use Naıve Bayes to compare against the results of more complex models, since it is a good baseline for testing them: if the simplest classifier obtains better results, this suggests that the networks obtained with the more complex methods are useless.
In the Bayesian Network section we explained how the probability is obtained, and that is how Naıve Bayes obtains its probabilities. If the Class variable is C and the rest are X_1, X_2, ..., X_n, then P(X_1, X_2, ..., X_n | C = c) is obtained by Equation 2.21, and replacing it in 2.20 yields Equation 2.22.

P(X_1, X_2, \ldots, X_n \mid C = c) = P(X_1 \mid C = c) \cdot P(X_2 \mid C = c) \cdots P(X_n \mid C = c) \qquad (2.21)

\hat{c} = \arg\max_{c} P(X_1 \mid C = c) \cdot P(X_2 \mid C = c) \cdots P(X_n \mid C = c) \cdot P(C = c) \qquad (2.22)
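As an illustration of Equations 2.21 and 2.22 (a minimal sketch of ours, not part of the project code), the following Python snippet computes the MAP class of a discrete Naıve Bayes model given its CPTs, working in log space to avoid numerical underflow:

import numpy as np

def naive_bayes_predict(prior, likelihoods, x):
    # MAP prediction for a discrete Naive Bayes classifier (Eq. 2.22).
    # prior: array of shape (k,), P(C = c) for each of the k class values.
    # likelihoods: one array per attribute X_i, of shape (states of X_i, k),
    #     holding P(X_i = v | C = c).
    # x: observed case, a tuple with one state index per attribute.
    log_post = np.log(prior)                  # log P(C = c)
    for cpt, v in zip(likelihoods, x):
        log_post = log_post + np.log(cpt[v])  # add log P(X_i = v | C = c)
    return int(np.argmax(log_post))           # argmax over the class values

# Toy usage: 2 classes, 2 binary attributes.
prior = np.array([0.6, 0.4])
likelihoods = [np.array([[0.8, 0.3], [0.2, 0.7]]),
               np.array([[0.5, 0.1], [0.5, 0.9]])]
print(naive_bayes_predict(prior, likelihoods, (1, 1)))  # -> 1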
When learning Naıve Bayes, the structure is already fixed, which makes the process much faster, since structural learning is a complex task on which research is still ongoing. Naıve Bayes only performs parametric learning, i.e. the estimation of the values/parameters of the CPTs, which are already simple, as indicated above.
2.7. CaMML
As already introduced, Bayesian A.I. is a powerful framework, and Bayesian Networks (BNs) allow us to make predictions, perform classification, and study the behaviour of variables and the relationships between them in a simple way. They have been broadly used from the mid-eighties until nowadays, due to their double capacity: (1) representation of knowledge and uncertainty, and (2) well-established algorithms for inference. That is why we have chosen this approach.

Once this framework is accepted as a reasonable option for intelligent systems, we have to work on the construction of a particular Bayesian Network able to represent the problem domain we aim to model. One possibility for BN construction is expert elicitation. However, experts do not usually know how to perform this modelling process, or they give us useless or even misleading information, and we may need a complex and long process of knowledge engineering in order to obtain a good model. Another possibility is to use an algorithm able to learn the model, which involves the use of Machine Learning techniques.
2.7.1. Learning Bayesian Networks
In order to learn a network (semi-)automatically from data, the first element needed is the dataset to learn from. The most common dataset format consists of a list where each row is a case. If our problem has n variables, each case in the dataset has n values that form a record, one value per variable. These values can be discrete or continuous, but the CaMML [Wallace et al., 2005] algorithm is only able to work with discrete data. In some cases a few values may be missing; we represent this with "?" or "*". Usually a BN learner is capable of dealing with such data using specific techniques. In our case, CaMML does not implement any of these techniques and does not accept cases with missing values. Notice this is not a critical problem, since there exist algorithms for imputing missing values [Farhangfar et al., 2008].

As long as a BN has a Class variable (C), it can be used as a classifier, since we can compute P(C | X) for all states of C (see Equation 2.18), where X = X_1, ..., X_n is the set of predictive variables. To construct a classifier, in this case a BN, we will use a training dataset. As indicated before, other datasets can be used to evaluate the performance of the learned classifier: test and validation data. Finally, the aim of a classifier is to predict the class value of a new case whose label is unknown, so that the model can automatically classify new instances. So, datasets in machine learning are used initially to learn the model, but this kind of information will also be used later for prediction, classification, validation, etc.
Algorithms for learning BNs have to provide techniques for learning the DAG structure as well as mechanisms for estimating the parameters of the CPTs from data. There is one key limitation when learning BNs from observational data only: there is usually no unique BN that represents the joint distribution. More formally, two BNs in the same statistical equivalence class (SEC) can be parametrized to give an identical joint probability distribution, and there is no way to distinguish between the two using only observational data (although they may be distinguished given experimental data). That is why many algorithms based on search techniques use the SEC space which, being smaller, also makes the search more efficient [Chickering, 1995].
BN structural learning algorithms can be classified into constraint-based and metric-based. Constraint-based methods (e.g., PC [Spirtes et al., 2000], RAI [Yehezkel & Lerner, 2009]) use information about conditional independences gained by performing statistical significance tests on the data. Metric-based methods (e.g., K2 [Cooper & Herskovits, 1992], CaMML [Wallace & Korb, 1999]) search for a BN that minimizes or maximizes a metric; many different metrics have been used (e.g., K2 uses the BDe metric, CaMML uses an MML metric [Korb & Nicholson, 2010, Ch 9]). Metric-based BN structural learners also vary in the search method used and in what is returned from the search; some learners (e.g., K2) return a DAG, others (e.g., GES [Chickering, 2003]) learn only the SEC.

Metric-based methods can also incorporate expert knowledge about the relationships between variables, using it in the form of structural priors that alter the "score" given to a BN. Here we use CaMML, as it provides more types of structural priors than any other BN learner, metric- or constraint-based.
2.7.2. CaMML: a tool for learning BNs
CaMML3 attempts to learn the best causal structure to account for the data, using a minimum message length (MML) metric [Wallace, 2005] with a two-phase search, simulated annealing followed by a Markov Chain Monte Carlo (MCMC) search, over the model space. Both MML and the better-known MDL are inspired by information theory and make a trade-off between prior probability (model complexity) and goodness of fit. With both, the problem becomes one of encoding both the model and the data, and the best model is then the one that minimizes the message length of that encoding.

The differences between MDL and MML are largely ideological: MDL is offered specifically as a non-Bayesian inference method, which eschews the probabilistic interpretation of its code, whereas MML is specifically a Bayesian technique.
The full details of MML encoding are not required for this project, but we can write
3This software is downloadable from https://github.com/rodneyodonnell/CaMML/ [Last accessed on 5th July 2014].
the relationship between the message length, the model and the data given the model
as:
\text{msgLen} \propto -\log P(\text{Model}) - \log P(\text{Data} \mid \text{Model}) \qquad (2.23)
The CaMML metric is the length of the message of the MML encoding of the BN, incorporating three parts: (1) the network structure, (2) the parameters given this structure, and (3) the data given the network structure and these parameters.
In contrast to other metric learners that use a uniform prior over DAGs or SECs for their search, CaMML uses a uniform prior over Totally Ordered Models (TOMs). A TOM is a DAG specified at a somewhat deeper level; it can be thought of as a DAG together with a total ordering of its variables. Just as a SEC is a set of DAGs, a DAG is a set of TOMs. In Figure 2.6 we see two different DAGs within the same SEC: the chain has only one total ordering, < A, B, C >, while the common-cause structure has two: < B, A, C > and < B, C, A >.
Figure 2.6: Example of two DAGs within the same SEC: (a) chain; (b) common cause
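To make the DAG/TOM distinction concrete, the following small Python sketch (illustrative only, not CaMML code) enumerates the total orderings compatible with a DAG, i.e. its TOMs:

from itertools import permutations

def toms(nodes, edges):
    # Return all total orderings of `nodes` consistent with the DAG
    # given by `edges` (a set of (parent, child) pairs).
    return [order for order in permutations(nodes)
            if all(order.index(p) < order.index(c) for p, c in edges)]

# Chain A -> B -> C: a single TOM.
print(toms("ABC", {("A", "B"), ("B", "C")}))  # [('A', 'B', 'C')]
# Common cause B -> A, B -> C (same SEC as the chain): two TOMs.
print(toms("ABC", {("B", "A"), ("B", "C")}))  # [('B', 'A', 'C'), ('B', 'C', 'A')]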
By applying a uniform prior over the TOM space to represent an uninformed state
of knowledge, we are following the common practice in Bayesian inference, which uses
uniform distributions at the lowest available level of description.
CaMML also differs from other learners in using Metropolis sampling to estimate a distribution over the model space. CaMML builds a hierarchy of models: it samples TOM space, moving from TOM to TOM with probabilistic pressure applied by the MML metric. Every time a TOM is visited, a visit to that TOM's DAG and SEC is recorded. CaMML also records a visit to the DAG's "clean" representative — that is, the DAG with all spurious arcs removed — and to that clean DAG's SEC. SECs are also joined in a process analogous to cleaning.
2.7.3. Learning Dynamical Bayesian Networks
However, causal discovery is not limited to the discovery of static Bayesian Networks; Dynamic Bayesian Networks readily represent time series, without spurious correlations, and they can be learned as well.

Learning Dynamic Bayesian Networks is thus also possible, even though, as the structure gets more complex, so do the learning algorithms, and the space of possibilities also increases enormously. Nowadays there is a small but growing literature on the subject. However, this issue is still under development: there are many algorithms for specific cases, and the best-known tools do not provide algorithms to learn DBNs specifically.
Recently, an extension of CaMML to learn DBNs has been developed4. Basically, this extension divides the process of learning a DBN into three steps [Black et al., 2013]:

1. Learn an order of the variables within each time slice, assuming this order is identical in all time slices.

2. Learn the intraslice arcs for the given variable order. In Figure 2.4 these are the arcs within each time slice t = i, those connecting the variables X_k^{t=i}, i.e. those inside the dotted box. DBNs assume they are the same for every slice.

3. Learn the temporal (or interslice) arcs. In Figure 2.4 these are the arcs between slices t = i (previous) and t = i + 1 (next), those connecting variables of the previous instant to those of the following one.
Where static CaMML uses the space of TOMs to search for the graphical structure of a BN, for DBNs the authors have defined DTOMs (Dynamic TOMs). A DTOM is essentially a TOM plus an N×N binary matrix specifying the presence/absence of arcs between each of the N nodes in the first time slice and the N nodes in the second time slice. Thus, a structure metric is required to specify the code length for encoding a DTOM. The first-time-slice parameters are learned directly from the data (given the intraslice network structure learned during the search). The Metropolis search has also been adapted to the dynamic setting: when learning a DBN, the original CaMML mutation operations remain, but they are used only for modifying the variable order and the intraslice arcs of the DBN. Additional mutation types were therefore required for modifying the interslice arcs: temporal arc change, double temporal arc change and cross arc change.
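As a rough illustration (our own sketch, not the actual CaMML data structures), a DTOM can be represented as a variable ordering plus two boolean matrices, one for the intraslice arcs and one for the interslice arcs:

from dataclasses import dataclass
import numpy as np

@dataclass
class DTOM:
    # Sketch of the search state used when learning a DBN: a TOM plus
    # an N x N matrix of interslice (temporal) arcs.
    order: list          # total ordering of the N variables within a slice
    intra: np.ndarray    # N x N boolean; intra[i, j] = arc i -> j inside a slice
                         # (only allowed when i precedes j in `order`)
    inter: np.ndarray    # N x N boolean; inter[i, j] = arc from node i at
                         # slice t to node j at slice t + 1

# Two variables: a chain inside each slice plus persistence arcs X_i(t) -> X_i(t+1).
d = DTOM(order=[0, 1],
         intra=np.array([[False, True], [False, False]]),
         inter=np.eye(2, dtype=bool))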
4https://github.com/AlexDBlack/CaMML [Last accessed on 28-March-2014]
Chapter 3
Robot Vision and Localization
This chapter discusses how robots obtain information through perception. In the case of humans, perception is the process by which the senses allow us to receive, process and interpret information about our environment. We are not aware of the amount and complexity of the information processed by the brain when we perform everyday actions. However, interpreting the information is not enough: we also need to coordinate it with the actions we are carrying out at the time.

Everyday tasks involve an enormous complexity in a robotic system. It is necessary to have knowledge of the sensory systems applicable to robotics, so as to know which one is best suited for the development of a particular task.
Sensors provide information about both the work environment and the internal state of the robot. Proprioceptive sensors measure values internal to the robot, like motor speed and wheel load. Sensors that retrieve information from the environment are called exteroceptive; these acquire information from the robot's environment, like light intensity or distance measurements. Exteroceptive sensors are divided into two groups:

1. Passive sensors measure ambient environmental energy entering the sensor. Temperature probes, microphones and cameras are examples of this type of sensor (Figure 3.1).

2. Active sensors emit energy into the environment and then measure the environmental reaction. Examples of active sensors include ultrasonic sensors and laser range-finders (Figure 3.2).
Vision can be considered the most powerful sense for humans, providing us with an enormous amount of information about our environment. For this reason, great efforts have been made to provide machines with sensors that mimic human vision. Regarding passive sensors, cameras are the most common, since they are cheap devices that give us a lot of information. Cameras acquire images of the environment in the
Figure 3.1: Passive sensors: Digital Temperature and Humidity sensor (left), Sound
Sensor Microphone (center), Canon VC-C4 Camera (right)
Figure 3.2: Active sensors: Ultrasonic sensor (left), Laser range finder (right)
same way as human vision: they interpret the environment from the light rays that reach the sensor and convert this information into a digital image.
3.1. Image encoding
Digital images are usually stored as a set of pixels, the smallest elements of a display device. Each pixel encodes information about its colour and its position in the image. The resolution of an image is related to the number of pixels the image has. Figure 3.3 shows some of the most commonly used resolutions, where values follow a (width x height) format, in pixels.
The colour of the pixels is usually represented with the RGB (Red, Green and Blue) colour model (Figure 3.4). This model is based on the idea that different colours can be represented as a weighted sum of red, green and blue components. It is the model most commonly used in computer graphics, but there are several alternatives:
Figure 3.3: Most commonly used resolutions.
1. CMYK: a subtractive colour model based on Cyan, Magenta, Yellow and Key (black) components, used in printing devices.

2. YUV, YIQ, YCbCr and YPbPr: these luma-chroma models are used to transmit TV signals. The Y value represents luminance and the other two represent chrominance.

3. HSV, HSL: in both models the first two values represent Hue and Saturation; in the first model the V component is the Value and in the second the L component is the Lightness. Both models pursue a more intuitive representation of colour for users.
The information stored in pixels can be used as input for different problems, such as semantic localization, but these data are difficult to handle due to their size. Besides, raw pixel information can be useless for complex tasks where rotation (for example) matters, such as object recognition. It is generally more effective to extract features from the images and use them as input. Moreover, real-time solutions require working with small descriptors.
3.2. Local Features
There is a huge variety of algorithms and techniques for image processing, due to the large amount of information that cameras capture. This set of algorithms and techniques is called computer vision, and it is responsible for extracting features from digital images, much as the human brain interprets the information that the eye captures. Some of the different approaches in computer vision are: working with colour features [Hsu et al., 2002], detecting edges [Ziou et al., 1998], using visual words [Yang et al., 2007] or using stereo information [Hirschmuller, 2005].
Figure 3.4: Example of different colour models: RGB, CMYK, HSV and YCbCr (with Y = 0.5).
In this section some computer vision techniques are reviewed, with a special focus on local invariant features. Those features allow an application to find local image structures in a repeatable fashion and to encode them in a representation that is invariant to a range of image transformations, such as translation, rotation, scaling and affine deformation. The resulting features then form the basis of current approaches for recognizing specific objects.
Local features are usually extracted following this pipeline:
1. Find a set of distinctive keypoints.
2. Define a region around each keypoint in a scale- or affine-invariant manner.
3. Extract and normalize the region content.
4. Compute a descriptor from the normalized region.
5. Match the local descriptors.
SIFT Descriptor. The Scale Invariant Feature Transform (SIFT) was originally introduced by Lowe as the combination of a Difference-of-Gaussians (DoG) interest region detector [Lowe, 1999] and a corresponding feature descriptor. The descriptor aims to achieve robustness to lighting variations and small positional shifts by encoding the image information in a localized set of gradient orientation histograms.
Figure 3.5: Visualization of the SIFT descriptor computation. For each (orientation-
normalized) scale invariant region, image gradients are sampled in a regular grid and
are then entered into a larger grid of local gradient orientation histograms.
SURF Detector/Descriptor. The Speeded-Up Robust Features (SURF) approach has been designed as an efficient alternative to SIFT. SURF combines a Hessian-Laplace region detector [Bay et al., 2006] with its own feature descriptor based on gradient orientations.
3.3. Global Descriptors
However, the number of local invariant features depends on the image used as input, so big variations can arise when using different images. This makes it complicated to compare features when their numbers differ; for this reason we use global descriptors, which allow us to contrast this information easily. In this section we explain some properties of these descriptors.

The basic concept of a global descriptor is to classify the different features into categories. In a semantic localization example, an image can be categorized with the label Kitchen if we see features like a fridge or a microwave. This reasoning is similar to that performed by humans: we usually link the elements that we find within a room with its purpose.
Some features encode information capable of being represented in frequency space using histograms. In other cases there are no predefined categories for the features; however, there are techniques such as bag-of-words, based on clustering, that assign labels or words to the different clusters (Figure 3.6).
Figure 3.6: Process for visual word generation from a set of images.
3.3.1. Histograms
A histogram is a graphical representation of the distribution of data: a representation of tabulated frequencies, shown as adjacent rectangles erected over discrete intervals (bins), each with an area proportional to the frequency of the observations in its interval. If we have a set of 1000 images labelled with the distribution of Table 3.1, we can represent this information with the histograms of Figure 3.7.
label          no. of evidences
kitchen        200
bedroom        200
bathroom       150
living-room    300
corridor       150

Table 3.1: Example of label distribution in a set of images.
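As a small illustration (our own example, using the counts of Table 3.1), the percentage histogram of Figure 3.7 is simply the count of each label normalized by the total number of images:

counts = {"kitchen": 200, "bedroom": 200, "bathroom": 150,
          "living-room": 300, "corridor": 150}
total = sum(counts.values())                       # 1000 images
percentages = {k: 100 * v / total for k, v in counts.items()}
print(percentages["living-room"], percentages["corridor"])  # 30.0 15.0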
Figure 3.7: Histograms of frequency: number of evidences (left) and percentage (right)
3.3.2. Pyramid
In some cases the images show large spaces where the features come from different objects, and these can get mixed together in the bag-of-words or the histogram; that is, we lose information in this process. A spatial pyramid [Lazebnik et al., 2006] tries to prevent this problem. It is a collection of orderless feature histograms computed over cells defined by a multi-level recursive image decomposition. At level 0, the decomposition consists of just a single cell, and the representation is equivalent to a standard bag-of-words. At level 1, the image is subdivided into four quadrants, yielding four feature histograms, and so on. Figure 3.8 shows an example of this process.
Figure 3.8: Example of a spatial pyramid with depth level 2
3.4. PHOG
The Pyramid Histogram of Oriented Gradients (PHOG) obtains the angle of highest invariance for an initial set of key-points in the image. This descriptor first encodes the image in a grey-scale format. Then some key-points are selected using edge detection [Canny, 1986].

PHOG performs the spatial pyramid process and allows us to select the depth level. Each level divides the image into 4 squares of the same size, which means that if we select depth level n we have 4^n squares. As we can see in Figure 3.9, level 0 contains the complete image, at level 1 the image is divided into 4, and at level 2 the 4 squares from level 1 are each divided into 4, giving us 16 squares.
Figure 3.9: Example of level depth in PHOG.
Then the angles of highest invariance are obtained and the descriptor is categorized with values from 0 to 360. PHOG gives us a 360-bin histogram for each square, where each entry represents the frequency of that angle in the square; the frequency is expressed as a percentage. PHOG concatenates the histograms of all the squares of a level into a single histogram, so level n has a histogram of size 4^n · 360. As we can see at the bottom of Figure 3.9, level 0 has a histogram of 360 values, the histogram for level 1 has 4 · 360 values, and so on.
When a depth level is selected in PHOG, the algorithm also computes the descriptors for the previous levels. Once it has obtained all the histograms from the different levels, it concatenates them again and returns their values. The size of the returned histogram is:
\text{size of histogram} = \sum_{k=0}^{n} 4^k \cdot 360 \qquad (3.1)
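For instance, a direct transcription of Equation 3.1 in Python gives the descriptor sizes for the first depth levels:

def phog_size(n, bins=360):
    # Total length of a PHOG descriptor of depth n (Eq. 3.1).
    return sum(4**k * bins for k in range(n + 1))

print(phog_size(0), phog_size(1), phog_size(2))  # 360 1800 7560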
3.5. Localization
The generic problem of localization consists in answering the question Where am I? [Leonard & Durrant-Whyte, 1991]. That is, a robot has to estimate its location within the environment, given a specific environment representation. The localization problem can be tackled from two points of view: metric localization and semantic localization. The first type is related to the estimation of < x, y, θ > locations and assumes the use of a map as the environment representation. On the other hand, semantic localization describes the environment using semantic labels instead of coordinates. Several types of labels can be used to define the environment, but scene categories are among the most common. Using this approach, semantic terms such as "corridor" or "kitchen" are directly associated with environment localizations.
Chapter 4
Experimentation
Our solution for the semantic localization problem consists of visual place classification in an indoor environment. So, we are interested in extracting semantic information (room categories) instead of locations (environment coordinates). To perform this task we use the PHOG descriptor, or rather the HOG descriptor, since we only use level 0 of the PHOG. For this classification task we use two different Bayesian Network models: Naıve Bayes classifiers and CaMML-learned models. A process of variable reduction and discretization is applied, and we will analyse the performance of the different networks we have learned, together with the problem parametrizations, in the next chapter.

The localization problem, as we said before, has a strong temporal component. So, we have also used and developed some techniques for obtaining different Dynamic Bayesian Networks, which we will later compare with the results given by the static approach.
This chapter presents the dataset that we have used, the KTH-IDOL2 [Luo et al., 2006a], the tools used during the development, and the process followed to obtain the results (Figure 4.1). It also describes the work methodology for the experimentation, which is divided into three main steps:

1. Descriptor Generation: first we extract the features from the images and apply variable reduction and discretization.

2. Learning: in this step we create the different networks, both static and dynamic, from the information/dataset extracted in step 1.

3. Network Evaluation: finally we obtain the accuracy and other information (confusion matrix, etc.) from all the networks created in the previous step.
Figure 4.1: Schema for the entire process, separated by steps.
4.1. Dataset: Image CLEF 2009
In this project we focus on indoor environments to reduce the number of values of the class. Therefore, our semantic localization problem can be seen as an indoor scene classification problem. We selected the RobotVision@ImageCLEF 2009 challenge, which provides us with the KTH-IDOL2 dataset [Luo et al., 2006b]. This dataset consists of 24 sequences of images acquired using two mobile robot platforms (Figure 4.2). We call them sequences because they have a chronological order, as the images were taken by the robot while it moved around the indoor environment.
Figure 4.2: Robot platforms: Dumbo (left) and Minnie (right).
The sequences were taken under three illumination conditions across a span of 6 months. These illumination conditions are cloudy, sunny and night (with the lights of the building on). The building used in the acquisition consists of five rooms: one-person office (BO), two-person office (EO), corridor (CR), kitchen (KT) and print area (PA). Some sample images for each of the five semantic categories are shown in Figure 4.3, and the map of the environment can be seen in Figure 4.4.
Figure 4.3: Semantic labels used in the IDOL2 dataset.
Figure 4.4: Map of the IDOL2 environment.
Each image is labelled with its topological and semantic localization. The label consists of the time when the image was taken, the x and y location coordinates, the orientation theta and an abbreviation of the room where the robot was. An example label is t1153756903.467512 rCR x6.39416 y9.05009 a-1.6, which encodes: time-stamp → 1153756903.467512 s, robot's pose → (x-coordinate = 6.39416 m, y-coordinate = 9.05009 m, θ angle = -1.6 radians), room → corridor.
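As an illustration, such a label can be parsed with a few lines of Python (our own sketch, assuming the exact format shown above):

import re

label = "t1153756903.467512 rCR x6.39416 y9.05009 a-1.6"
m = re.match(r"t(?P<t>[\d.]+) r(?P<room>\w+) "
             r"x(?P<x>-?[\d.]+) y(?P<y>-?[\d.]+) a(?P<theta>-?[\d.]+)", label)
timestamp = float(m.group("t"))     # 1153756903.467512 s
room = m.group("room")              # 'CR' (corridor)
pose = (float(m.group("x")), float(m.group("y")), float(m.group("theta")))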
4.2. Tools
In this section we introduce the different tools used throughout this project. For each software tool we give an introduction, presenting its main purpose and why it was selected. We also specify the file formats for inputs and outputs, because this is the basis for integrating all of them in our methodological process.
MatLab
MatLab [MATLAB, 2010] is a well-known tool, very useful for manipulating big datasets, especially those with a matrix representation. This program allows us to divide the functionality of an algorithm into different scripts and use them easily. MatLab is also used for extracting information from images. Moreover, it is capable of storing this information in files of different formats.

MatLab is proprietary software and requires a paid license for its use. Nonetheless, there is free software called Octave [Eaton et al., 2009] that implements almost the same functions as MatLab and uses the same language.
Weka
Weka is a collection of machine learning algorithms for data mining tasks [Hall et al., 2009]. It implements different types of variable discretization and the most important classifiers. Weka is capable of using different files to test its models, and it gives us different results such as the accuracy, the confusion matrix, etc. The official format for Weka is the arff file, but it can read other standard formats like csv.
GeNIe
GeNIe [DSL, 1996-2006] is a program with a graphical environment for creating probabilistic and decision models. It includes some machine learning algorithms, like one for learning a Naıve Bayes classifier. Once a model is created, it can be modified and exported to different formats like dne. GeNIe accepts csv files as input.
Bi-CaMML
Bi-CaMML has been developed at Monash University. It is a GUI version programmed in Java, based on the original CaMML but with extended features, and it is downloadable from http://bayesian-intelligence.com/software/BI-CaMML-1.2.zip. It implements machine learning Causal discovery via MML (CaMML), which learns causal BNs and DBNs from data. Bi-CaMML can read arff files, but real values must be discretized and it does not accept missing values. Bi-CaMML's output format is dne, which includes the network learned with CaMML.
Python
Python [van Rossum, 2007] is a high-level programming language. It has become one of the most important and widely used programming languages because of its simplicity and versatility. In this project it has been used for its string-handling abilities, which allow us to modify the different files to add behaviours that other programs cannot.
Netica
Netica [Norsys, 2000] is a complete program for working with belief networks and influence diagrams. It has an intuitive and smooth user interface for drawing the networks. The relationships between variables may be entered as individual probabilities, in the form of equations, or learned from data files (which may be in ordinary tab-delimited form; missing data are accepted).

The official file format for Netica networks is dne, but Netica is capable of testing networks with data files in different formats like csv.

Netica is payware, but there is a free version that is full-featured yet limited in model size. The reason why Netica was chosen is that CaMML needs it to work. Our collaboration with Monash University allowed us to use their software license.
There are other available tools which provide similar or extended capabilities to learn, edit and perform inference with Bayesian Networks and other probabilistic graphical models, such as SamIam (http://reasoning.cs.ucla.edu/samiam/), Hugin Expert (http://www.hugin.com/), JavaBayes (www.cs.cmu.edu/~javabayes/Home/) or Elvira (http://www.ia.uned.es/proyectos/elvira/index-en.html, [Consortium, 2002]). An exhaustive and interesting survey of BN software packages can be found in [Korb & Nicholson, 2010, Annex B].
4.3. First step: Descriptor Generation
In the first step of the experimentation process we extract the different features from the images and store them in several files. The images are processed with HOG (PHOG with depth 0) with different variable selections, and then we obtain the histograms. Depending on the number of variables we perform a particular discretization. Finally, with the retrieved information we create files in two different formats: csv and arff.
4.3.1. Variable reduction and Discretization
For the study of the different behaviour of the variables we need to reduce the large number of initial variables that the HOG descriptors give us. It is also necessary to perform a preliminary discretization step in order to use the HOG as input data in the CaMML learning procedure.

Figure 4.5: First step schema: data extraction.
The HOG descriptor gives us a histogram with 360 variables, where each one represents a degree and its value is the frequency with which that angle has the highest invariance. Figure 4.6 shows how the HOG descriptor extracts features from the image and then obtains the corresponding histogram.
Figure 4.6: Image taken by the robot (left). HOG descriptor (center). Histogram with
360 variables from HOG (right).
The easiest way to reduce the original 360 variables is to merge them into n groups. Each group contains the same number of variables, and its value is the sum of the frequencies of all the variables it contains. For example, if we want 4 variables in the HOG descriptor we have to divide 360 by 4: the first of the 4 resulting variables groups the first 90 variables of the original histogram, its value is the sum of those 90 values, and so on. We can see an example of variable reduction for HOG in Figure 4.7. HOG itself already performs this task: it allows us to select the value of n and returns the corresponding histogram.
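The grouping itself is straightforward; a minimal NumPy sketch (our own code, assuming n divides the number of bins, as in the n = 4 example above):

import numpy as np

def reduce_histogram(hist, n):
    # Merge a histogram into n groups by summing consecutive bins,
    # e.g. a 360-bin HOG histogram with n = 4 yields 4 values, each
    # the sum of 90 consecutive original bins.
    hist = np.asarray(hist)
    return hist.reshape(n, len(hist) // n).sum(axis=1)

hog = np.full(360, 1 / 360)          # toy uniform histogram
print(reduce_histogram(hog, 4))      # [0.25 0.25 0.25 0.25]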
Although the number of variables is now lower, these variables are still continuous, so we have to discretize them. There are many ways to discretize a variable; one of them is to use the discretization functions (filters) of a tool like Weka. But these functions are focused on optimizing the results of the algorithms used by the tool, so it is better to develop our own discretization function.

In this project a discretization function has been implemented. It consists in mapping the variables to four values. As we are working with frequencies, the sum of all the
Figure 4.7: HOG histogram with 360 variables (left). HOG histogram with variable
reduction n = 30 (right).
continuous values will be one, so the mean frequency value will be 1/n, where n is the number of variables. Using the mean frequency we create four labels, each one representing a range of frequencies:
Low value: [0,1/2n)
Medium/Low value: [1/2n,1/n)
Medium/High value: [1/n,2/n)
High value: [2/n,1]
Then, we just have to replace the continuous values by the labels according to these ranges. This function is independent of the actual values; it depends only on the number of variables.
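A minimal sketch of this labelling rule (our own reimplementation, not the project script itself):

def discretize(value, n):
    # Map a frequency to one of four labels, relative to the mean 1/n.
    mean = 1.0 / n
    if value < mean / 2:
        return "low"
    elif value < mean:
        return "medium/low"
    elif value < 2 * mean:
        return "medium/high"
    else:
        return "high"

# With n = 10 variables the mean frequency is 0.1:
print([discretize(v, 10) for v in (0.03, 0.08, 0.16, 0.25)])
# ['low', 'medium/low', 'medium/high', 'high']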
Five different values of n are evaluated in this project: 5, 10, 20, 50 and 100. The selection of n affects the generalization/specialization power of the HOG descriptors. The use of large n values produces very specific descriptors that can cause over-fitting when training the classifier. On the other hand, small values of n generate models that are too general, incapable of differentiating between classes.
Figure 4.8 shows a sample discretization for an input image and four n values. It can be observed how n = 100 obtains a fine-grained representation of the original histogram, while n = 5 simplifies it excessively. Regarding the class labels, red is used to represent bins (angle ranges) with low frequency, orange and yellow represent medium/low and medium/high frequency respectively, and blue denotes high frequency. The four processed histograms are shown using the same scale, which allows us to point out how similar values can be associated to different labels. Concretely, it can be seen that the last bin presents similar values for n = 20 (0.16) and n = 10 (0.18); however, we obtain the label high for n = 20 and medium/high for n = 10. This occurs because the high label is associated to values greater than 2/n, that is, 0.1 and 0.2 for 20 and 10 variables respectively.
Figure 4.8: Original histogram (top right) extracted from the sample image (top left). Bottom: four histograms obtained with different n values. The colours in the bottom histograms represent the generated class labels or categories: low = red, medium/low = orange, medium/high = yellow and high = blue.
4.3.2. Data Format
Once we have the discrete variables, we have to create files in formats that other programs can manipulate; in our case these formats are .arff and .csv. The two formats are very similar: in both, each line represents a case (in our example, an image) with all the extracted information, as shown in Figure 4.9.
Figure 4.9: Example of an arff file (left) and a csv file (right)
Since DBNs are built with the same variables at different time steps, each line of test data must contain the information of all the variables at these different time moments. For example, if we have 5 variables and the DBN has 3 time steps or slices, each line of the test data must have 15 values.

As the DBNs we will learn have 2 time slices and the images are ordered in time, each line of test data includes the information of the previous and the current case. So when we create a test or validation file, we have to build another one with the information of the previous and the current case in each line.

In the test files for DBNs the class is the one of the current case, so we erase the class information of the previous step in this file. We do this because in a real case we do not know the class of the previous case, only its prediction. Another change has to be made to these files: the first case has no previous data, so we mark those values as unknown; when the network tests this case it uses the prior probabilities. We see in Figure 4.10 how the variables are represented: those ending in 0 correspond to the previous case and those ending in 1 belong to the current one. We also observe that class 0 is unknown, like the previous data of the first case.
Figure 4.10: Example of a csv test file for DBNs
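A sketch of this pairing (our own helper, assuming each input row is a list of attribute values with the class as its last element):

UNKNOWN = "?"

def make_dbn_test_rows(rows):
    # Pair each case with its predecessor for a 2-slice DBN test file.
    # Returns rows [prev attrs, UNKNOWN, curr attrs, curr class]: the
    # previous class is hidden, and the first case has no previous data.
    out = []
    for i, curr in enumerate(rows):
        if i == 0:
            prev = [UNKNOWN] * len(curr)          # no predecessor: all unknown
        else:
            prev = rows[i - 1][:-1] + [UNKNOWN]   # hide the previous class
        out.append(prev + curr)
    return out

rows = [["low", "high", "CR"], ["low", "low", "CR"], ["high", "low", "KT"]]
for r in make_dbn_test_rows(rows):
    print(r)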
4.4. Second step: Learning
In this section we learn the networks using as input the image descriptors generated in the previous step. This process is divided into two parts. The first part consists in generating the networks with the different tools at hand: GeNIe and Weka for the Naıve Bayes classifier, and Bi-CaMML for the networks learned with the CaMML algorithm.

Weka generates the Naıve Bayes classifier more quickly than GeNIe; the problem is that those networks cannot be exported. For this reason we created a dne file that implements the Naıve Bayes classifier with GeNIe.

CaMML creates different networks and collapses those with only small effect differences, that is, models that are identical apart from links which are only weakly supported by the data. Using Monte Carlo sampling over this set of models, CaMML estimates posterior probabilities. With the search directed by the MML value of each DAG, the best solution is the model most visited during the search. We always select this best solution, since its values are well above the others.
Figure 4.11: Second step schema: learning.
4.4.1. DBN constructor
As we saw, CaMML is capable of creating/learning DBNs directly from data. However, the models it learned did not perform well, as we will explain in Section 5.3. That is why we designed our own method to create Dynamic BNs and Dynamic Naıve Bayes classifiers from data, as we describe next. So, this second step consists in learning static BNs and constructing DBNs: we modify the BNs to convert them into DBNs. Once we have the dne file, we can modify it with a text editor like Notepad or with a more powerful tool like Netica. But doing this by hand is too hard, especially in problems with a high number of variables. One way to solve this is to automatize the process; for this task we created a DBN constructor in Python.
The dynamic networks that our DBN constructor builds have two time slices. Each slice is a static network with the same structure, with variables at state 0 (previous) and state 1 (current), connected by an edge from C0 to C1. For the case of Naıve Bayes, when C0 (the class at moment 0)1 is known, C1 is independent of the predictive attributes at time 0; if C0 is unknown this information flows, i.e., if C0 is not observed, its prediction comes from the values of the predictive variables at moment 0, and C1 will depend on that value and on the other variables at instant 1. That is, we can use the prediction of C0 to modify the probabilities of C1. In our problem, the relationship between classes increases the probability of staying in the same room that was predicted in the previous state.

The Naıve case is the simplest one, but we need a generic method to construct Dynamic Bayesian Networks in general. In order to construct a DBN with the above property (C0 → C1) we divide the process into three parts (a sketch of the key numerical operation is given after the example tables below):
1This also applies if there are no links from time slice 0 to 1, for the other variables (Xi).
1. We first learn a static BN (or a Naıve Bayes classifier). With this static network we create two networks with the same structure, disconnected from each other, in a new single dne file (Figure 4.12). We do this task with Python.

Figure 4.12: Example of two static networks with the same structure, disconnected from each other.
2. Then we link C0 and C1 with Netica. This process modifies the CPT of C1, adding the values of C0, but it does not affect the C1 probabilities. Table 4.1 shows an example of this process on the C1 CPT, where Ω_{C0} = Ω_{C1} = {a, b, c} and C1 has a parent X1 with Ω_{X1} = {x, x̄}. In Figure 4.13 we see the same network as in Figure 4.12, but with the relation between the classes. We can observe how this link does not affect the probabilities of C1, since there is numerical independence between C0 and C1 in the new table.

X1          a     b     c
x         0.33  0.33  0.33
x̄         0.20  0.40  0.40

X1   C0     a     b     c
x    a    0.33  0.33  0.33
x    b    0.33  0.33  0.33
x    c    0.33  0.33  0.33
x̄    a    0.20  0.40  0.40
x̄    b    0.20  0.40  0.40
x̄    c    0.20  0.40  0.40
Table 4.1: CPT for C1 before applying the link with C0 (left) and after (right).
3. Finally we have to modify the probabilities of the C1 CPT, including the estimated relation between the classes given by a class transition table; this combination is based on [Vomlel, 2006]. Following the previous example, Table 4.2 shows a class transition table where the probability of predicting the same class is higher than the others, as in our problem, but not as high as in CaMML DBN learning (around 99 %, which is basically why that approach was discarded). Then we combine Table 4.1 with the class transition table and obtain the probabilities of Table 4.3. These probability values must be normalized to produce a valid CPT. We observe in Figure 4.14 that now the class predicted in the previous step affects the current class, and its prediction changes depending on the value of C0. We used Python for this step.

Figure 4.13: Example of a Dynamic Bayesian Network with linked but numerically independent classes.
       a     b     c
a    0.50  0.25  0.25
b    0.25  0.50  0.25
c    0.25  0.25  0.50
Table 4.2: Class transition table.
X1   C0      a       b       c
x    a    0.1650  0.0825  0.0825
x    b    0.0825  0.1650  0.0825
x    c    0.0825  0.0825  0.1650
x̄    a    0.10    0.10    0.10
x̄    b    0.05    0.20    0.10
x̄    c    0.05    0.10    0.20

X1   C0     a     b     c
x    a    0.50  0.25  0.25
x    b    0.25  0.50  0.25
x    c    0.25  0.25  0.50
x̄    a    0.33  0.33  0.33
x̄    b    0.14  0.57  0.29
x̄    c    0.14  0.29  0.57
Table 4.3: CPT for C1 combined with class transition table (left) and normalized
(right).
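A sketch of this combination step (our own reimplementation; the actual constructor edits dne files, but numerically the operation reduces to an element-wise product followed by row normalization):

import numpy as np

def combine_cpt(cpt, transition):
    # Fold a class transition table into the C1 CPT (step 3 above).
    # cpt: shape (|X1|, |C0|, |C1|), as produced by the link of step 2
    #      (identical rows for every C0 value).
    # transition: shape (|C0|, |C1|), the class transition table.
    combined = cpt * transition[np.newaxis, :, :]     # multiply by P(C1 | C0)
    return combined / combined.sum(axis=-1, keepdims=True)

base = np.array([[0.33, 0.33, 0.33],    # P(C1 | X1 = x)
                 [0.20, 0.40, 0.40]])   # P(C1 | X1 = x̄)
cpt = np.repeat(base[:, np.newaxis, :], 3, axis=1)    # duplicate per C0 (step 2)
transition = np.array([[0.50, 0.25, 0.25],
                       [0.25, 0.50, 0.25],
                       [0.25, 0.25, 0.50]])
print(combine_cpt(cpt, transition).round(2))          # reproduces Table 4.3 (right)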
Notice that we can create a Dynamic Naıve Bayes (DNB) classifier with the structure of Figure 4.15 using this process. Moreover, in more complex networks, where C1 has parents, the DBN constructor also works, so we can create dynamic networks with these properties from different machine learning algorithms. This includes static BNs learned using CaMML.
Figure 4.14: Example of final Dynamic Bayesian Network obtained with our DBN
constructor.
Figure 4.15: Schema for Dynamic Naıve Bayes (DNB) classifier
4.5. Third step: Network Evaluation
In this step we measure how good our networks are. By testing we obtain different information, such as the accuracy and the confusion matrix; however, in the results we will only report the accuracy. If we want to see other information, we only have to follow the previous steps to obtain the network and test it with the following method or any other.

Weka automatically tests the networks that it learns. This is useful for testing the Naıve Bayes classifiers quickly, but Weka cannot import or export dne files. As we have seen before, the problem of not being able to export the files is that we cannot modify the networks with the DBN constructor; and the problem of not being able to import the files is that we cannot test the modified networks with Weka.

Nevertheless, we test all the Naıve Bayes classifiers with Weka and obtain their accuracies with it. In order to do this, we load the training data, then we select the Naıve Bayes classifier and the data for testing.
However, we are creating dne files for one reason: Netica uses these dne files to work. Once Netica has loaded a network, we test it with the different data. In order to do this, we have to compile the network (exact inference), but this process is quite inefficient and requires too much memory in problems with a high number of links. If this step is not possible due to the size of the tables (at compilation time), there is another solution, which consists in updating by sampling. This process is more optimized than normal compilation and gives us almost the same results. Once the network is compiled, we only have to load the test and validation data to obtain the results.

Figure 4.16: Third step schema: network evaluation.
Chapter 5
Results
This chapter is devoted to the results obtained along this project. As we know, the networks we create are used as probabilistic classifiers whose purpose is to predict the room where the robot is: this is the Class of our classifiers. The test process consists in comparing the output of the classifier with the labels of the images and calculating the accuracy.
5.1. Experimental Setup
The first decision is to choose which of the two robots' datasets will be used. We do not use both robots because we want to study the different objectives from the same point of view. In this case we use the dataset from Dumbo. This can be interesting for future work, since we have a similar robot available in the SIMD1 laboratory.

As we said before, the database is divided into three different illumination conditions: sunny, cloudy and night (Figure 5.1).

Figure 5.1: Sequence distribution.

The networks are usually learned with 60 % of all the data, leaving 40 % for test, or
1http://gruposimd.uclm.es/
Figure 5.2: Example of BNs with 10 variables and the Class: CaMML (left) and Naıve
Bayes (right)
20 % for validation and 20 % for test. Taking advantage of the fact that we have 4 time series for each illumination condition, we use the first two for training, the third for validation and the last for test.
5.2. Variable reduction
The first aim is to study the behaviour of variable reduction in our problem, so the first step is to obtain networks with different numbers of variables. n represents the number of variables, and in the tests we only consider 5 different values: 5, 10, 20, 50 and 100. Figure 5.2 shows an example of BNs learned with 10 variables. As we can see, there is a big difference in complexity between the networks, and in Table 5.1 we observe that Naıve Bayes is learned immediately, because the structure is known, while CaMML learning needs more time (structural learning and then parametric learning), which grows with an increasing number of variables.
Variables CaMML Naive
5 1.11s 0.02s
10 1.8s 0.02s
20 9.6s 0s
50 323.99s 0.01s
100 4555.7s 0.01s
Table 5.1: Learning time for BNs (s) for the Cloudy case with different number of
variables.
Figure 5.3: Graph of learning time for BNs (s) for the Cloudy case with different number
of variables.
5.2.1. Similar lighting condition
In this first test battery we trained both a Naıve Bayes classifier and a Bayesian Network (via the CaMML learning procedure) using as input the two training sequences of each illumination condition. As we have 5 values for n and there are 3 illumination conditions, we obtain 15 networks for each learning model. Then we evaluate the obtained models using the test sequences (sunny4, cloudy4 and night4) against the networks trained under the same illumination condition, obtaining the accuracies shown in Tables 5.2, 5.3 and 5.4.
Training with Cloudy
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML Cloudy 56.79 62.76 56.60 61.39 56.30
Naıve Bayes Cloudy 54.45 59.43 56.70 62.07 61.58
Table 5.2: Accuracy ( %) of CaMML model and Naıve Bayes classifier training and test
with Cloudy.
Training with Night
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML Night 51.68 55.66 62.16 61.11 57.55
Naıve Bayes Night 49.48 52.20 62.16 63.84 62.16
Table 5.3: Accuracy ( %) of CaMML model and Naıve Bayes classifier training and test
with Night.
From these first results we can conclude that the Naıve Bayes classifier is a bit better than CaMML in these cases. The worst results belong to n = 5: the number of variables
Training with Sunny
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML Sunny 53.05 58.16 63.56 62.46 58.76
Naıve Bayes Sunny 53.15 58.26 64.11 62.96 63.16
Table 5.4: Accuracy ( %) of CaMML model and Naıve Bayes classifier training and test
with Sunny.
Figure 5.4: Rates of test under the same illumination conditions as the training set:
CaMML (left), Naıve (right).
is too low to classify correctly. On the other hand, when n = 100 the CaMML accuracy decreases: the model has lost generalization and is overfit due to the number of variables. We can see this information better in Figure 5.4.

If we look carefully at the graphs, we observe that Sunny is the model that works best. As we said before, CaMML loses generalization when the number of variables is high. However, we see that Naıve Bayes keeps its accuracy at similar values when the number of variables is high.
5.2.2. Different lighting condition
We have seen how our networks work under the same illumination condition. But what happens when we test them under different lighting conditions? Tables 5.5, 5.6 and 5.7 show the accuracy when we test each illumination condition against the other two.
As we expected, the results are somewhat worse than those of the previous test, since some extracted features are related to the lighting conditions. The Naıve Bayes classifier is still better than the CaMML model. We now compare the different results in Figures 5.5, 5.6 and 5.7. As we see, the Naıve Bayes classifiers keep the same behaviour under the different illumination conditions. However, the CaMML models show diverse results: Cloudy obtains good results for Sunny and Night; Night cannot classify the other conditions correctly; and Sunny obtains disastrous results on the Night test.
Training with Cloudy
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML Night 53.04 56.39 53.35 60.06 51.78
Sunny 52.85 60.86 59.86 59.96 53.15
Naıve Bayes Night 51.78 54.30 58.81 62.68 60.69
Sunny 51.25 59.46 61.36 64.66 63.96
Table 5.5: Accuracy of CaMML model and Naıve Bayes classifier training with Cloudy
and test with Night and Sunny.
Training with Night
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML Cloudy 51.91 61.09 52.39 56.89 56.60
Sunny 50.65 57.96 53.85 52.35 58.26
Naıve Bayes Cloudy 51.61 55.23 54.15 59.53 59.82
Sunny 50.75 55.76 59.36 64.46 63.66
Table 5.6: Accuracy of CaMML model and Naıve Bayes classifier training with Night
and test with Cloudy and Sunny.
Training with Sunny
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML Cloudy 51.22 58.65 59.82 62.27 58.46
Night 51.05 47.80 49.16 54.93 45.81
Naıve Bayes Cloudy 51.91 58.94 58.55 61.00 62.27
Night 51.05 54.51 56.60 61.32 61.01
Table 5.7: Accuracy of CaMML model and Naıve Bayes classifier training with Sunny
and test with Cloudy and Night.
Figure 5.5: Rates of test Night and Sunny under Cloudy illumination condition:
CaMML (left), Naıve (right).
Figure 5.6: Rates of test Cloudy and Sunny under Night illumination condition:
CaMML (left), Naıve (right).
Figure 5.7: Rates of test Cloudy and Night under Sunny illumination condition:
CaMML (left), Naıve (right).
5.2.3. Multiple sequence integration
In this last test we create a new sequence, "All", that integrates all the images from the different illumination conditions, as shown in Figure 5.8. We expect that, by increasing the number of images and mixing them, the models will capture more general features and be capable of increasing their accuracy. The results are shown in Table 5.8.
Training with All
Model Test n = 5 n = 10 n = 20 n = 50 n = 100
CaMML All 56.75 60.18 66.13 67.61 67.88
Naıve Bayes All 52.69 57.46 59.58 63.24 62.63
Table 5.8: Accuracy of CaMML model and Naıve Bayes classifier training and test with
“All” .
The CaMML models improve their results, while Naıve Bayes obtains results similar to those of the other tests. The CaMML model clearly surpasses the accuracy of the Naıve Bayes classifiers, as we can see in Figure 5.9.
Figure 5.8: Sequence distribution for “All” sequence.
Figure 5.9: Rate comparison for CaMML and Naıve Bayes, training and test with the "All" sequence.
5.2.4. Concluding observations
In this first section we have studied the behaviour of the different numbers of variables in our semantic localization problem under different lighting conditions. In these results the best values for n are 20 and 50. The networks created with CaMML obtain their best values in these cases, but present over-fitting when n = 100. On the other hand, Naıve Bayes classifiers have their best accuracy with 50 variables, and maintain the results with 100.

The networks obtained with CaMML in the single-illumination cases are not good: the accuracy values do not surpass the Naıve Bayes results, and they work poorly on the tests with other illumination conditions. However, the networks learned by CaMML on All generalize better than any other: these networks have extracted the properties of the rooms correctly, regardless of illumination. That is, they perform better when using datasets including different lighting conditions. This increase did not occur for the Naıve Bayes model, because Naıve Bayes classifiers obtain models that are too specific (overfitting), in contrast to the capacity of generalization shown by the Bayesian Networks.
5.3. Dynamic Models
The next aim is to study the dynamic models in our problem. We saw the basic scheme of the Dynamic Naıve Bayes models in Figure 4.15; however, we still have to define the relationship between the classes. In our problem the images taken by the robot are ordered in time, and we know that if the robot takes a photo in a room, the most likely possibility for the next moment is that it remains in the same place. So the relation between classes is that, once we predict where the robot is, predicting the same room for the next case is the most probable situation. These probabilities can have more or less weight, depending on the model we want to use.

Two models with distinct probability distributions are used in these tests. The first one assumes a medium probability (40 %) of remaining in the same room for two consecutive frames, while the second one uses a more conservative approach and gives that case 80 %. Notice that the remaining probability is uniformly distributed among the other states or rooms. The class transitions for both approaches are shown in Table 5.9.
Model A:
       CR    1PO   2PO   KC    PA
CR    0.40  0.15  0.15  0.15  0.15
1PO   0.15  0.40  0.15  0.15  0.15
2PO   0.15  0.15  0.40  0.15  0.15
KC    0.15  0.15  0.15  0.40  0.15
PA    0.15  0.15  0.15  0.15  0.40

Model B:
       CR    1PO   2PO   KC    PA
CR    0.80  0.05  0.05  0.05  0.05
1PO   0.05  0.80  0.05  0.05  0.05
2PO   0.05  0.05  0.80  0.05  0.05
KC    0.05  0.05  0.05  0.80  0.05
PA    0.05  0.05  0.05  0.05  0.80
Table 5.9: Class transitions used for the dynamic models.
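Both models follow the same pattern, a self-transition probability p on the diagonal and (1 − p)/(k − 1) elsewhere, so they can be generated directly (a small sketch of ours):

import numpy as np

def transition_table(k, p_stay):
    # k x k class transition matrix: p_stay on the diagonal, the
    # remaining mass spread uniformly over the other k - 1 rooms.
    off = (1 - p_stay) / (k - 1)
    return np.full((k, k), off) + (p_stay - off) * np.eye(k)

model_a = transition_table(5, 0.40)   # Table 5.9, Model A
model_b = transition_table(5, 0.80)   # Table 5.9, Model B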
As we know, CaMML learning has the option of generating a dynamic Bayesian Network with two time slices. However, the networks it builds are useless in our case, because the relationship between the classes is too strong and the other attributes end up disconnected from them. We can see an example of a dynamic network learned with CaMML in Figure 5.10, where the classes are separated from the rest.
Figure 5.10: Dynamic Network with 5 variables created with CaMML.

We then had no choice but to reject the dynamic networks created with CaMML. One solution to this problem is to apply the DBN constructor, as we did for Naïve Bayes; its overall method was explained in Section 4.4. The advantage of this approach is that we can compare the dynamic networks obtained from CaMML and Naïve Bayes with comparable link weights, since the transition models are the same.
5.3.1. Class transition probabilities comparison
The dynamic networks used here are the networks learned in the previous section, modified with the DBN constructor (Figure 5.11). We now have four different models: two dynamic CaMML models (CaMML-ModelA and CaMML-ModelB), one for each class transition table, and likewise for the Naïve Bayes classifier (Naïve-ModelA and Naïve-ModelB). In this evaluation we take the networks learned with the "All" training set (Figure 5.8), modify them with the DBN constructor, and measure their accuracy on the "All" test cases, using the same values for the number of variables n as in the previous section.
Table 5.10 shows the accuracy values. The results are somewhat better than those of the static BNs, which is logical since the problem has a dynamic component. Naïve Bayes is now outperformed by the dynamic BN. This happens because Naïve Bayes classifiers produce overly specific models; Bayesian Networks proved more suitable than Naïve Bayes for incorporating the dynamic behaviour.
Figure 5.12 shows that Model B (the conservative approach) achieves better accuracy than Model A. Giving more weight to remaining in the same room lets us extract more information from the previous prediction. However, if this weight is too high, the predicted class will simply repeat the previous one, and the transition model loses its purpose.
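The effect of the transition weight can be seen in a single forward-filtering step, sketched below. This is only an illustration of how the transition table biases consecutive predictions, assuming the static model supplies a per-frame class likelihood vector; it is not the exact inference performed by the BN tools.

import numpy as np

def filter_step(prev_belief, transition, likelihood):
    """One forward-filtering step: propagate the previous class belief
    through the transition model, weight it by the likelihood the static
    classifier assigns to the current frame, and renormalize."""
    predicted = transition.T @ prev_belief   # prior for the current frame
    posterior = predicted * likelihood       # combine with the observation
    return posterior / posterior.sum()

As p_stay grows, the predicted prior concentrates on the previous room, so the current frame's evidence has less influence on changing the prediction.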
Figure 5.11: Example of a DBN created by the DBN constructor from a CaMML model with 5 variables.
Training with All
Model             Test   n = 5   n = 10   n = 20   n = 50   n = 100
CaMML - ModelA    All    57.39   61.26    67.67    69.66    70.09
CaMML - ModelB    All    58.00   61.69    69.05    70.56    70.87
Naïve - ModelA    All    53.33   58.00    60.35    63.37    62.90
Naïve - ModelB    All    52.92   58.20    60.55    63.61    62.87
Table 5.10: Accuracy of DBNs trained and tested with the "All" sequence.
Figure 5.12: Accuracy comparison for DBNs trained and tested with the "All" sequence under different class transitions: ModelA (left), ModelB (right).
5.3.2. Static vs Dynamic
Finally, we compare the results obtained with the two approaches: BNs and DBNs. Figure 5.13 clearly shows how the dynamic networks improve on the static ones. The single time-slice structure learned by CaMML allows us to obtain better results in the dynamic approach, whereas Naïve Bayes is not able to adapt to the temporal variability.
Figure 5.13: Evolution of classification rates for the "All" case when using dynamic and static classifiers.

The structure of the Naïve Bayes classifier gave us good solutions when the problem was simple, as we saw in Subsection 5.2.1. However, as the problem grows in complexity, these classifiers keep producing results in the same range of quality. On the other hand, the Bayesian Networks learned with CaMML obtained poor results in the first set of tests; but by using a more heterogeneous set of images to make the problem more general, and later including the dynamic component, we managed to improve their accuracy considerably.
Finally, we compare our results with those obtained in the CLEF 2009 challenge. Although feature extraction techniques have evolved since then, these accuracy values let us verify whether the results obtained by our solution are really good. The winner of the CLEF 2009 competition [Martínez-Gómez et al., 2009], using the KTH-IDOL2 dataset, correctly classified 63.43 % of the test images. With our proposal, we have exceeded 70 % accuracy. That is, BNs and, even more so, DBNs have proven to be a valid solution for this problem.
Chapter 6
Conclusions and further work
In this project we have pursued two types of aims. The first covered the whole process of implementing a solution to our problem using Bayesian classifiers. The second was the analysis of the results obtained. In this last chapter we analyse whether these objectives have been accomplished. In view of the results, it would also be interesting to extend the work in different directions; this is the focus of the last section.
6.1. Conclusions
We have presented a procedure for using Bayesian classifiers to solve the problem of semantic localization. This procedure includes extracting data from images into different descriptors, together with an unsupervised discretization step optimized for histograms as input, since most of the visual features extracted from images have that structure. The project also covers Bayesian learning techniques such as CaMML, and the use of DBNs to cope with the temporal continuity of the dataset sequences.
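As a generic illustration of the kind of unsupervised discretization involved (not the histogram-specific procedure developed in this project), an equal-frequency scheme can be sketched as follows; the function names and n_bins parameter are illustrative.

import numpy as np

def equal_frequency_cutpoints(values, n_bins):
    """Choose cut points so that each interval receives roughly the same
    number of training values (a standard unsupervised scheme)."""
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(values, quantiles)

def discretize(values, cutpoints):
    # np.digitize maps each value to the index of the interval it falls in.
    return np.digitize(values, cutpoints)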
With the whole development process we can extract the features from the images and create a Bayesian classifier, so the first objective is accomplished. One advantage of the method developed is its division into stages according to the function required. This gives us, for instance, techniques that extract information from a set of images and generate a dataset in different formats. We also have the DBN constructor, which can be reused in other problems with a temporal component, such as medical or meteorological domains.
Based on the experiments, we can conclude that our proposal achieves quite reasonable results. The results obtained using a reduced number of variables give us an acceptable solution capable of working in real time, and variable reduction lets us generate models that generalize better than networks built with all the variables.
Finally, the dynamic networks have demonstrated their potential: they surpass the results obtained with the static models, and DBNs have proven an appropriate solution for integrating sequences of images. We have also shown the limits of the Naïve Bayes models, which cannot improve their results when the information and complexity of the problem increase, whereas the learned Bayesian Networks take advantage of this and improve their accuracy.
When using the dynamic version of the CaMML algorithm, it is important that the data be temporally sorted. The procedures presented in this project do not have this restriction, because we learn a static network and then apply the DBN constructor, which allows us to build the model from unordered data. However, if we want to test the networks, the data must be ordered.
6.2. Further work
This work can be extended in many different ways. The first is to study how the different PHOG levels affect the behaviour of the models. Another possibility is to include new visual features, such as dense SIFT descriptors; this would allow us to evaluate how the Bayesian approaches cope with the fusion of multiple data sources. If we have obtained 70 % accuracy with PHOG at level 0, other, more complex descriptors may yield higher accuracy.
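As an illustration of the data fusion just mentioned, combining several descriptors for a Bayesian classifier could be as simple as early fusion by concatenation; the variable names below are hypothetical.

import numpy as np

def fuse_descriptors(*histograms):
    """Early fusion: concatenate per-image feature histograms (e.g. a
    PHOG histogram plus a dense-SIFT bag-of-words histogram) into a
    single descriptor vector before discretization."""
    return np.concatenate([np.asarray(h, dtype=float) for h in histograms])

# Hypothetical usage for one image:
# fused = fuse_descriptors(phog_histogram, dense_sift_histogram)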
Figure 6.1: Grid example with size 4x4 in the IDOL2 environment.
We would also like to evaluate our proposal using topological information instead of semantic information. This could be done with the topological annotations included in the KTH-IDOL2 dataset. These values are continuous; however, one way to use this information would be to create a grid over the X and Y coordinates where each cell is assigned a value, as shown in Figure 6.1. This approach turns the continuous values into discrete ones, where the new class can take any of the values assigned to the cells; a sketch of this mapping is shown below.
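A minimal sketch of that mapping, assuming a rectangular environment bounded by (x_min, x_max) and (y_min, y_max); the function name and the 4x4 default mirror the Figure 6.1 example but are otherwise illustrative.

def cell_label(x, y, x_min, x_max, y_min, y_max, grid_size=4):
    """Map continuous (x, y) coordinates to the discrete label of the
    grid cell containing them, as in the 4x4 grid of Figure 6.1."""
    col = min(int((x - x_min) / (x_max - x_min) * grid_size), grid_size - 1)
    row = min(int((y - y_min) / (y_max - y_min) * grid_size), grid_size - 1)
    return row * grid_size + col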
We have been limited by CaMML's restriction to discrete values. Another possibility is therefore to study the behaviour of continuous variables in our problem with different machine learning techniques, which would also open up the option of combining continuous and discrete data. A comparison between our approach and the use of Support Vector Machines is also under consideration.
This project has brought us closer to dynamic networks. One way of extending this work in the future will be to study different kinds of dynamic networks and their effects on this type of problem.
REFERENCES
Bay, H., Tuytelaars, T. & Van Gool, L. (2006). SURF: Speeded Up Robust Features. In Computer Vision – ECCV 2006, 404–417, Springer.

Black, A., Korb, K.B. & Nicholson, A.E. (2013). Learning Dynamic Bayesian Networks: Algorithms and issues. Presented at the Fifth Annual Conference of the Australasian Bayesian Network Modelling Society (ABNMS2013), available at http://abnms.org/conferences/abnms2013/.

Canny, J. (1986). A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 679–698.

Chickering, D.M. (1995). A transformational characterization of equivalent Bayesian network structures. In UAI95 – Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 87–98, Morgan Kaufmann, San Francisco, CA.

Chickering, D.M. (2003). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.

Consortium, E. (2002). Elvira: An environment for creating and using probabilistic graphical models. In Probabilistic Graphical Models.

Cooper, G.F. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.

Domingos, P. & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 103–137.

DSL, P. (1996–2006). The GeNIe (Graphical Network Interface) software package. Copyright (c) by Decision Systems Laboratory, University of Pittsburgh. Available at http://genie.sis.pitt.edu/. (Accessed: 2 July 2014).

Eaton, J.W., Bateman, D. & Hauberg, S. (2009). GNU Octave version 3.0.1 manual: a high-level interactive language for numerical computations. CreateSpace Independent Publishing Platform, ISBN 1441413006.

Farhangfar, A., Kurgan, L. & Dy, J. (2008). Impact of imputation of missing values on classification error for discrete data. Pattern Recognition, 41, 3692–3705.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations Newsletter.

Heckerman, D., Geiger, D. & Chickering, D.M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.

Hirschmüller, H. (2005). Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, 807–814, IEEE.

Hsu, R.L., Abdel-Mottaleb, M. & Jain, A.K. (2002). Face detection in color images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24, 696–706.

Jensen, F.V. & Nielsen, T.D. (2007). Bayesian Networks and Decision Graphs. Springer Verlag, New York, 2nd edn.

Korb, K.B. & Nicholson, A.E. (2010). Bayesian Artificial Intelligence. Chapman & Hall/CRC, 2nd edn.

Lazebnik, S., Schmid, C. & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, 2169–2178, IEEE.

Leonard, J.J. & Durrant-Whyte, H.F. (1991). Mobile robot localization by tracking geometric beacons. Robotics and Automation, IEEE Transactions on, 7, 376–382.

Lowe, D.G. (1999). Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, 1150–1157, IEEE.

Luo, J., Pronobis, A., Caputo, B. & Jensfelt, P. (2006a). The KTH-IDOL2 Database. Tech. Rep. CVAP304, KTH Royal Institute of Technology, CVAP/CAS, Stockholm, Sweden.

Luo, J., Pronobis, A., Caputo, B. & Jensfelt, P. (2006b). The KTH-IDOL2 database. KTH, CAS/CVAP, Tech. Rep. 304.

Martínez-Gómez, J., Jiménez-Picazo, A. & García-Varea, I. (2009). A particle-filter-based self-localization method using invariant features as visual information. Working Notes of CLEF.

MATLAB (2010). version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts.

Mitchell, T.M. (1997). Machine Learning. McGraw-Hill, Inc., 1st edn.

Neapolitan, R.E. (2003). Learning Bayesian Networks. Prentice Hall.

Norsys (2000). Netica. http://www.norsys.com.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.

Rubio, F., Flores, M., Martínez-Gómez, J. & Nicholson, A. (2014). Dynamic Bayesian networks for semantic localization in robotics. In 15th Workshop of Physical Agents (WAF 14).

Russell, S.J. & Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Part IV, Uncertain Knowledge and Reasoning – Probabilistic Reasoning over Time, chap. 15. Prentice Hall.

Sahami, M., Dumais, S., Heckerman, D. & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop.

Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction and Search. MIT Press, 2nd edn.

van Rossum, G. (2007). Python programming language. In USENIX Annual Technical Conference.

Vomlel, J. (2006). Noisy-or classifier. International Journal of Intelligent Systems, 21, 381–398.

Wallace, Korb, O'Donnell, Hope & Twardy (2005). CaMML. http://www.datamining.monash.edu.au/software/camml (Accessed: 2 July 2014).

Wallace, C. & Korb, K. (1999). Learning linear causal models by MML sampling. In Causal Models and Intelligent Data Management, 89–111, Springer.

Wallace, C.S. (2005). Statistical and Inductive Inference by Minimum Message Length. Springer, Berlin, Germany.

Yang, J., Jiang, Y.G., Hauptmann, A.G. & Ngo, C.W. (2007). Evaluating bag-of-visual-words representations in scene classification. In Proceedings of the International Workshop on Multimedia Information Retrieval, 197–206, ACM.

Yehezkel, R. & Lerner, B. (2009). Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research, 10, 1527–1570.

Ziou, D., Tabbone, S. et al. (1998). Edge detection techniques – an overview. Pattern Recognition and Image Analysis, 8, 537–559.