MULTIMODAL FUSION: A THEORY AND APPLICATIONS
By
YANG PENG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2017
ACKNOWLEDGMENTS
It has been more than five years since I first arrived at the University of Florida. I spent
some of my best years being a Gator and a CISE Ph.D. student. On this journey, I must
thank the many people who have not only helped me along the way but also guided me
through the hardest times.
I thank my advisor, Prof. Daisy Zhe Wang, for her guidance, ideas and encouragement.
It has been a great honor to be her student. I have learned a lot from her, such as her
passion and her rigorous academic attitude. I am also grateful for both the financial and
moral support from her.
I am grateful to Prof. Shigang Chen, Prof. Sartaj Sahni, Prof. Sanjay Ranka and
Prof. Tan Wong for serving as my Ph.D. committee members and for their precious time
and constructive opinions.
I am also thankful to my lab mates, especially Dr. Kun Li, Dr. Christan Grant, Dr.
Morteza Shahriari Nia, Dr. Yang Chen, Xiaofeng Zhou, Sean Goldberg, Miguel Rodríguez,
Dihong Gong and Ali Sadeghian, for their help, collaboration and insightful suggestions.
Finally, it is the support of my family that allowed me to pursue the Ph.D. degree
from beginning to end. It is the company of my girlfriend that got me through
the hardest moments of the last three years. Without them, I cannot imagine
overcoming so many difficulties along this journey.
My research is partially supported by DARPA under FA8750-12-2-0348 and a
generous gift from Pivotal.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Applications
        1.1.1 Fact Extraction
        1.1.2 Word Sense Disambiguation
        1.1.3 Information Retrieval
        1.1.4 Knowledge Base Completion
    1.2 Contributions
    1.3 Outline

2 THEORY
    2.1 Related Work
    2.2 Correlative Relation
    2.3 Complementary Relation
    2.4 Discussion
        2.4.1 Why Multimodal Fusion Works
        2.4.2 How to Design Multimodal Fusion Algorithms

3 STREAMING FACT EXTRACTION
    3.1 System
        3.1.1 Entity Model
        3.1.2 Wikipedia Citation
        3.1.3 Slot Filling
        3.1.4 Constraints and Inference
    3.2 Evaluation
    3.3 Discussion

4 SCALABLE IMAGE RETRIEVAL
    4.1 Background
    4.2 Related Work
    4.3 System
        4.3.1 Overview
        4.3.2 Distributed Clustering Algorithms
            4.3.2.1 Distributed approximate K-Means
            4.3.2.2 Distributed hierarchical K-Means
    4.4 Evaluation
        4.4.1 Datasets
            4.4.1.1 Oxford
            4.4.1.2 ImageNet
        4.4.2 Performance of Mahout K-Means, d-AKM and d-HKM
        4.4.3 Performance on Large Datasets
            4.4.3.1 Different subsets
            4.4.3.2 Different cluster numbers

5 MULTIMODAL ENSEMBLE FUSION
    5.1 Related Work
        5.1.1 Word Sense Disambiguation
        5.1.2 Information Retrieval
    5.2 Model
        5.2.1 Ensemble Fusion Model
        5.2.2 Ensemble Approaches
            5.2.2.1 Linear rule
            5.2.2.2 Maximum rule
            5.2.2.3 Logistic regression
        5.2.3 Applications (Individual Approaches and Implementation)
            5.2.3.1 Disambiguation
            5.2.3.2 Retrieval
    5.3 Evaluation
        5.3.1 Datasets
            5.3.1.1 UIUC-ISD
            5.3.1.2 Google-MM
        5.3.2 Results
            5.3.2.1 Word sense disambiguation
            5.3.2.2 Information retrieval
    5.4 Discussion
        5.4.1 Correlation
        5.4.2 Complementation
        5.4.3 Early Fusion vs Ensemble Fusion

6 KNOWLEDGE BASE COMPLETION
    6.1 Related Work
        6.1.1 Knowledge Base Construction
        6.1.2 Inference and Learning
        6.1.3 Question Answering
    6.2 System Overview
        6.2.1 Ensemble Fusion
    6.3 Web-Based Question Answering
        6.3.1 WebQA Pipeline
            6.3.1.1 Question generation
            6.3.1.2 Data collection
            6.3.1.3 Answer extraction
            6.3.1.4 Answer ranking
        6.3.2 Offline Training
            6.3.2.1 Template selection
            6.3.2.2 Query-driven snippet filtering
            6.3.2.3 Feature extraction
            6.3.2.4 Classification
    6.4 Rule Inference
        6.4.1 Rules
        6.4.2 Ordinary Rule Inference
        6.4.3 Augmented Rule Inference
            6.4.3.1 Length-1 rules
            6.4.3.2 Length-2 rules
            6.4.3.3 Query-driven optimization
    6.5 Evaluation
        6.5.1 Datasets
        6.5.2 WebQA
            6.5.2.1 Question template selection
            6.5.2.2 Overall performance
            6.5.2.3 Performance with snippet filtering
            6.5.2.4 Efficiency
        6.5.3 Rule Inference
        6.5.4 Ensemble Fusion

7 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

3-1  The set of slot names.
3-2  Server specifications.
3-3  Document chunk distribution.
3-4  Sampled accuracy of the results of the extracted facts.
3-5  Recall measure: generic slot names like Affiliate had the most recall, compared to less popular slot names, e.g. DateOfDeath.
3-6  Accuracy measure: accuracy of AssociateOf was the best and Affiliate performed poorly due to the ambiguity of being an affiliate of somebody/something.
4-1  The time complexity of one iteration of Mahout K-Means (d-KM), d-AKM and d-HKM.
4-2  Dataset specifics.
5-1  The accuracy of image-only, text-only, linear rule fusion, maximum rule fusion and logistic regression fusion on the UIUC-ISD dataset for WSD.
5-2  Retrieval quality (MAP) of image-only, text-only, early fusion, linear rule fusion, maximum rule fusion and logistic regression fusion on the Google-MM dataset for IR.
5-3  The coverage, average precision (AP) and average recall (AR) of different approaches on WSD for keyword “bass”. Coverage refers to the percentage of the documents each approach can effectively disambiguate.
6-1  Example relations, templates, queries, questions and snippets.
6-2  Overall KBC performance for 8 relations with all snippets. Comparisons between our system and previous work [1] (denoted as West in the table) are also included. MAP (mean average precision) measures the KBC performance. Numbers in bold indicate the best results for individual relations.
6-3  KBC performance with snippet filtering for different numbers of snippets. The experiments are run on our system evaluated with 10 snippets, 20 snippets, 30 snippets and all snippets. Performance of previous work [1] is denoted as West. Performance is measured by MAP.
6-4  Average running time of our system using query-driven snippet filtering for relation wasBornIn with 3 questions.
6-5  KBC performance of OrdRI vs AugRI (measured by MAP).
6-6  KBC performance of individual approaches and ensemble fusion (measured by MAP). WebQA is conducted with 30 snippets.
LIST OF FIGURES

2-1  Examples selected from the UIUC-ISD dataset [2] for keyword “bass”. The left figure shows a document carrying sense “bass (fish)” and the right figure shows another document carrying sense “bass (instrument)”. Photo courtesy of Kate Saenko.
2-2  Examples selected from the UIUC-ISD dataset [2] for sense “bass (fish)”. Photo courtesy of Kate Saenko.
3-1  Streaming fact extraction system architecture.
4-1  The process of building the BoVW model. Reprinted with permission from Google Images, https://images.google.com/ (October 20, 2017).
4-2  The top-down hierarchical K-Means.
4-3  Running time of different algorithms on the Oxford dataset.
4-4  Performance comparison between AKM and HKM with larger cluster numbers. Note: k refers to a thousand in the figure.
4-5  Experiments on large datasets. Note: k refers to a thousand in the figures.
5-1  The ensemble fusion model. Photo courtesy of Kate Saenko.
5-2  IR: per-query detailed result.
6-1  The query-driven knowledge base system pipeline.
6-2  The web-based question answering system.
6-3  An example rule.
6-4  Single-literal processing.
6-5  Two-literal processing.
6-6  The KBC performance results for three relations with different numbers of questions. k is the number of selected questions. The KBC performance is measured by MAP.
6-7  The average running time of WebQA with different numbers of questions for relation wasBornIn.
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
MULTIMODAL FUSION: A THEORY AND APPLICATIONS
By
Yang Peng
December 2017
Chair: Daisy Zhe Wang
Major: Computer Science
As data grows ever larger, Big Data and Data Science have become increasingly
prominent in Computer Science. In Data Science, not only the volume of data but also
its variety has drawn a great deal of attention from researchers. In recent years, we have
seen more and more complex datasets with multiple kinds of data. For example, Wikipedia
is a huge dataset with unstructured text, semi-structured documents, structured knowledge
and images. We call a dataset with different types of data a multimodal dataset. This
dissertation focuses on employing multimodal fusion on multimodal data to improve
performance on various tasks while providing scalability and high efficiency.
In this dissertation, I first introduce the concepts of multimodal datasets and
multimodal fusion, and then the applications of multimodal fusion studied here, such as
information extraction, word sense disambiguation, information retrieval and knowledge
base completion. Multimodal fusion is the use of algorithms to combine information from
different kinds of data with the purpose of achieving better performance. Multimodal
datasets studied in this dissertation include images, unstructured text and structured facts
in knowledge bases.
I present the correlative and complementary relations between different modalities
and propose a theory on multimodal fusion based on this observation. Previous work
usually focused on exploiting the correlation between different modalities at the feature
level and ignored the complementary relation between modalities. Early fusion and late
fusion have been used as two schemes to combine multimodal data, but little has been
said about how to effectively design multimodal fusion algorithms. In
this dissertation, I discuss multimodal fusion from a deeper perspective, explain why
multimodal fusion works and analyze how to design multimodal fusion algorithms to
improve performance for tasks based on the correlative and complementary relations on
different multimodal datasets.
We then present the multimodal ensemble fusion model to combine images and text
for a few applications, including word sense disambiguation and information retrieval.
In our ensemble fusion model, text processing and image processing are conducted on
text and images separately and different fusion algorithms are employed to effectively
combine the results. The ensemble fusion model can effectively exploit the correlative and
complementary relations between images and text to improve performance. Experimental
results demonstrate that ensemble approaches outperform image-only and text-only approaches.
We build a query-driven knowledge base completion system based on multimodal
fusion with web-based question answering and rule inference to combine information from
the Web and knowledge bases. We design a novel web-based question answering system to
extract facts from the Web with multimodal features and an effective question template
selection algorithm, which can achieve better performance with far fewer questions
than previous work. We build an augmented rule inference engine to infer new facts using
logical rules learned from knowledge bases and web-based question answering. We design
different fusion algorithms to combine web-based question answering and rule inference
to achieve high performance. We use a few query-driven optimization techniques, such as
query-driven snippet filtering, to improve the efficiency of the whole system.
Scalability and efficiency are also important aspects in this dissertation. We employ
streaming processing for fact extraction, which can efficiently process terabytes of text
data in less than one hour on a single machine. We implement a scalable image retrieval
system over millions of images using distributed systems and map-reduce, which can run
much faster than previous work. For knowledge base completion, query-driven techniques
are applied to improve system efficiency.
CHAPTER 1
INTRODUCTION
The four Vs (volume, variety, velocity, and veracity) are the most important topics
in Data Science. In recent years, we have not only observed larger and larger datasets,
but also more and more complex datasets with different types of data. For example,
Wikipedia [3] is a huge dataset with unstructured text, semi-structured documents,
structured knowledge and images. We call this kind of dataset a multimodal dataset.
This dissertation focuses on employing different kinds of data to improve performance and
provide scalability and high efficiency for different tasks.
Multimodal fusion is the use of algorithms to combine information from multiple
modalities. The objective of multimodal fusion is usually to achieve better task performance
than single-modality approaches. Multimodal datasets may contain various kinds of data,
such as text, images, video, audio, articles, news, blogs and XML documents. The
challenge of multimodal fusion is how to effectively and efficiently combine data from
different sources and of different natures.
With abundant multimodal data available on the Internet since the 2000s, researchers
have developed many fusion algorithms and models to integrate data of multiple
modalities for various tasks, such as event detection in multimedia analysis. Previous
work uses two major fusion schemes, divided by the level at which fusion occurs [4]:
early fusion and late fusion. The most widely used strategy is to fuse the information
at the feature level, which is known as early fusion or feature fusion. The other approach
fuses multiple modalities at the decision level, which is known as late fusion or ensemble
fusion. Early fusion can utilize the correlation between features from different modalities
at an early stage, while the late fusion strategy is more flexible in terms of feature
representations and learning algorithms for different modalities and more scalable in
terms of the number of modalities [4].
Although previous work utilizes these two fusion schemes, it seldom discusses
why multimodal fusion works and how we can combine multimodal data to achieve
better performance. In this dissertation, we propose a theory about multimodal data
and multimodal fusion and discuss multimodal fusion from a deeper perspective in
Chapter 2. I explain why multimodal fusion works by analyzing the correlative and
complementary relations among different modalities. With correlative and complementary
relations, multimodal data can either provide additional information or emphasize the
same information, hence multimodal fusion can utilize these two relations to improve
performance for different applications. We analyze how to improve performance with
correlative and complementary relations for different applications on different multimodal
datasets.
To demonstrate our theory, we present the multimodal ensemble fusion model to
combine images and text for a few applications including word sense disambiguation
and information retrieval in Chapter 5. Our ensemble fusion model can capture the
complementary relation between image processing and text processing to improve
performance, while previous work focuses on early fusion based on feature correlation.
In our ensemble fusion model, text processing and image processing are conducted on text
and images separately and various fusion algorithms are used to combine their results.
Experimental results demonstrate that ensemble fusion approaches outperform image-only and
text-only approaches.
We present a query-driven system with multimodal fusion for knowledge base
completion in Chapter 6. The knowledge base completion system combines web-based
question answering and rule inference to utilize information from unstructured text and
structured knowledge bases. The web-based question answering component applies early
fusion to combine features extracted from both the unstructured Web and structured
knowledge bases. We design novel multimodal features and an effective question template
selection algorithm for question answering, which can achieve better
performance with far fewer questions than previous work. We build an augmented
rule inference system with pre-learned logical rules, existing facts in knowledge bases and
web-based question answering. Then late fusion approaches are employed to combine rule
inference and web-based question answering to further improve knowledge base completion
performance. Query-driven optimization techniques are employed to improve the running
time of the whole system pipeline and provide fast responses to user queries.
Scalability and efficiency are also important aspects in this dissertation. We design
and implement a streaming processing system for the fact extraction task on terabytes
of text data, which can efficiently finish in less than one hour based on two layers of
filters. And I introduce how to implement a scalable image retrieval system on top
of Hadoop to efficiently process over millions of images. We design two distributed
clustering algorithms using Hadoop and Map-Reduce, which can run much faster than
previous work. Query-driven techniques are also employed to speed up the knowledge base
completion pipeline on-the-fly.
In the following sections of this chapter, I briefly discuss the applications studied in
this dissertation and our contributions.
1.1 Applications
In this section, the applications studied in this dissertation are briefly introduced,
including fact extraction, word sense disambiguation, information retrieval and knowledge
base completion. Only general definitions and descriptions about these applications are
presented here to give readers a brief introduction to them.
1.1.1 Fact Extraction
Fact extraction is the task of extracting unknown structured facts for a knowledge base
from a source dataset, which is often an unstructured text dataset [5]. A knowledge base
(KB) is a data store with entities, attributes, relations and facts, usually stored in the
triple store format. The knowledge base we work on is the Wikipedia [3] knowledge base.
Wikipedia.org [3] is the largest online resource for free information and is maintained by
a small number of volunteer editors. The site contains over 5 million English articles.
However, these pages can easily be neglected, becoming out of date. Any news-worthy
event may require an update of several pages. To address this issue of stale articles, we
automatically extract facts from outside datasets to update Wikipedia [3].
1.1.2 Word Sense Disambiguation
Words in natural languages tend to have multiple senses, for example, the word
“crane” may refer to a type of bird or a type of machine. The problem of determining
which sense of a word is used in a sentence is called word sense disambiguation (WSD)
[6, 7]. WSD was first formulated as a distinct computational task during the early days of
machine translation in the 1940s, making it one of the oldest problems in computational
linguistics. Different kinds of methods have been introduced to solve WSD, including
knowledge-based approaches, supervised and unsupervised models, and other machine
learning techniques [6, 7].
1.1.3 Information Retrieval
Information retrieval is the activity of obtaining information relevant to a query from
a collection of documents (usually textual documents). It involves many research topics,
such as document representation models, similarity metrics, indexing, relevance feedback,
reranking, and so on. The bag-of-words model is commonly used to represent textual
documents in information retrieval and natural language processing. In this model, a
textual document or sentence is represented as a bag or a set of its words in an order-less
and grammar-free manner. The frequency vectors or occurrence vectors of words are
treated as features in this model. Image retrieval [8] is the search for desired images in
an image dataset based on queries from users. Content-based image retrieval (CBIR),
which emerged in the 1990s, is a special case of image retrieval, where the queries are images
and the search process is based on the visual content of images rather than textual captions or
image labels.
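To make the bag-of-words representation concrete, here is a minimal sketch; the whitespace tokenizer and the toy overlap score below are simplifying assumptions, not any particular system's implementation:

    from collections import Counter

    def bag_of_words(text):
        # Order-less, grammar-free term-frequency vector of the words.
        return Counter(text.lower().split())

    doc = bag_of_words("bass fishing on the lake - catch a rock sea bass")
    query = bag_of_words("sea bass fishing")

    # Toy relevance score: overlap of term frequencies between query and doc.
    overlap = sum(min(doc[t], query[t]) for t in query)
    print(doc["bass"], overlap)  # 2 3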
1.1.4 Knowledge Base Completion
Over the past few years, massive amounts of world knowledge have been accumulated
in publicly available knowledge bases, such as Freebase [9], NELL [10] and YAGO [11].
Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For
example, over 70% of people included in Freebase [9] have no known place of birth, and
99% have no known ethnicity. Knowledge base completion (KBC) is the task of filling in the
gaps in knowledge bases in a targeted way. The difference between fact extraction and
knowledge base completion is the former extracts unknown facts from outside datasets,
which may involve unknown entities, while KBC links existing entities in knowledge bases.
1.2 Contributions
We implement a streaming processing system for the fact extraction task on terabytes
of text data, which can efficiently complete the task in less than one hour on a single
machine. We design two layers of filters to efficiently eliminate unnecessary documents
for fact extraction. We implement a rule-based pattern matching algorithm to effectively
extract facts from raw text.
We implement a scalable image retrieval system on Hadoop to efficiently handle
millions of images using limited resources. We design two distributed clustering algorithms
using Map-Reduce and Hadoop to build the bag-of-visual-words model much faster than
previous work.
We propose a new theory about multimodal fusion based on the observation of the
correlative and complementary relations between different modalities. We explain why
multimodal fusion works by analyzing the correlative and complementary relations. We
discuss how to improve performance for different applications based on correlative and
complementary relations on multimodal datasets.
We present the multimodal ensemble fusion model to combine images and text for
word sense disambiguation and information retrieval. We design three ensemble fusion
approaches which can achieve better performance than single-modality approaches.
We implement a query-driven multimodal fusion system for knowledge base
completion by combining question answering and rule inference to utilize unstructured
text and structured knowledge. We design a web-based question answering system with
early fusion, which can achieve better performance than previous work using far
fewer questions. We build a rule inference system based on pre-learned logical rules, existing facts
in knowledge bases and web-based question answering. We design different approaches
to effectively combine rule inference and question answering to achieve better KBC
performance. We use query-driven optimization techniques to improve system efficiency
and provide fast responses to user queries.
1.3 Outline
First, I introduce the theory on multimodal fusion based on the correlative and
complementary relations between multiple modalities in Chapter 2. Second, the streaming
fact extraction system is presented in Chapter 3 and the scalable image retrieval system
is introduced in Chapter 4. Third, the ensemble fusion model combining images and text
for disambiguation and retrieval is presented in Chapter 5. Finally, the query-driven
pipeline with multimodal fusion of unstructured and structured data for knowledge base
completion is introduced in Chapter 6. In Chapter 7, the conclusions of my dissertation
are presented.
CHAPTER 2
THEORY
Although previous work utilizes the early fusion and late fusion schemes, it seldom
discusses why multimodal fusion works and how we can combine multimodal data to achieve
better performance. Most previous approaches are interested only in the correlative
relation among multiple modalities, while ignoring the complementary relation.
In this chapter, we propose a theory on multimodal fusion to explain the benefit
of employing multimodal fusion and how to use multimodal fusion to achieve better
performance than single-modality approaches. First, we explain the correlative and
complementary relations among multiple modalities, which have not been discovered or
studied by previous work. These two relations reveal the potential of using multimodal
fusion to achieve better performance than single-modality approaches. Second, we explain
why multimodal fusion works and how to design multimodal fusion algorithms based on
these two relations.
To simplify the scenario, we only use two modalities (images and text) and word sense
disambiguation (WSD) as an example application to explain the correlative relation and
complementary relation among multiple modalities. These two relations extend to many
other applications such as information retrieval and knowledge base completion, as shown
in later chapters.
2.1 Related Work
Researchers in the multimedia analysis community have developed many multimodal
machine learning models [4] to integrate data of multiple modalities, including text,
images, audio and video, for multimedia analysis tasks, such as event detection. There
are two major fusion schemes divided by levels of fusion: early fusion and late fusion.
The most widely used strategy is to fuse the information at the feature level, which
is also known as early fusion [4]. The other approach is late fusion or decision level
fusion, which fuses multiple modalities in the semantic space at the decision level [4].
Current approaches mostly focus on developing a unified representation model for multiple
modalities and then employ existing classification methods on the unified representation.
In the machine learning community, with deep learning gaining much popularity in
recent years, there have been efforts in exploiting deep learning for multimodal learning
[12, 13]. In [12], Ngiam et al. proposed the bimodal deep Boltzmann machine and the
bimodal deep autoencoder to fuse features of multiple modalities for multimodal fusion
and cross-modal learning. A deep Boltzmann machine was also employed in [13] to fuse
features of images and text.
For word sense disambiguation, there have been several research projects on using
images and text to improve disambiguation accuracy [14, 15]. In [14], May et al. combined
the image space and text space directly and applied a modified version of the Yarowsky
algorithm [16] on the combined space to solve WSD. But this naive combination of two
spaces might not capture the deep or complex correlations between the image space
and text space, which might lead to poor accuracy. In [15], Saenko et al. assumed the
features of one modality are independent of sense given the other modality, then used
LDA to model the probability distributions of senses given images and text separately, and
combined these two distributions using a sum rule. Although the linear rule in Chapter 5
and the sum rule in [15] may look similar, the ideologies and motivations behind them
are quite different. The goal of the sum rule in [15] is to model the joint probability
distribution of senses given both images and text under the independence assumption,
while our goal of the linear rule approach is to capture the complementary relationship
between images and text in the ensemble fusion framework, where text processing and
image processing are conducted first and then the linear rule is used to combine the results
of them to achieve higher quality.
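As a concrete illustration of this late-fusion style of combination, the sketch below mixes per-sense scores produced separately by a text model and an image model. The weight alpha, function names and toy scores are my own assumptions; the actual fusion rules are defined in Chapter 5:

    def linear_rule(text_scores, image_scores, alpha=0.5):
        # Late fusion: each modality is processed separately first, then
        # its per-sense scores are mixed with weight alpha.
        senses = set(text_scores) | set(image_scores)
        fused = {s: alpha * text_scores.get(s, 0.0)
                    + (1.0 - alpha) * image_scores.get(s, 0.0)
                 for s in senses}
        return max(fused, key=fused.get)

    text_scores = {"bass (fish)": 0.9, "bass (instrument)": 0.1}
    image_scores = {"bass (fish)": 0.6, "bass (instrument)": 0.4}
    print(linear_rule(text_scores, image_scores))  # bass (fish)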
For information retrieval, Rasiwasia et al. [17] proposed several state-of-the-art
approaches to achieve cross-modal information retrieval. The first approach was
correlation matching, which aimed to map the different feature spaces for images and
text to the same feature space based on correlation analysis of these two spaces. The
second approach was semantic matching, which represented images and text with the same
semantic concepts using multi-class logistic regression. This work motivated us to discover
the correlative relation among multiple modalities. Wu et al. [18] proposed the super
kernel fusion method to combine multimodal features optimally for image categorization.
Zhu et al. [19] preprocessed embedded text in images to get weighted distance and
combined the distance with visual cues for further image classification. Bruno et al.
[20] proposed preference-based representation to completely abstract multimedia content
for efficient processing. In [21], the proposed cross-reference-based fusion strategy for video
search used a late fusion technique, hierarchically combining ranked results from different
modalities, which can be considered as a special discrete case of the linear rule in our
model. Fusion techniques have been used in other research areas. For example, [22, 23]
proposed risk analysis approaches for chemical toxicity assessment on multiple limited
and uncertain data sources. However, their approaches are not directly applicable to our
applications since our work focuses on fusing information from deterministic multimodal
data.
Previous work did not identify the complementary and correlative relations between
modalities or explain why multimodal fusion works. They mostly focused on using the
early fusion scheme by developing unified representation models from multiple modalities
based on the correlative relation, and then using classification techniques on top of the
unified representation models to solve different tasks. However, we propose a theory on
multimodal fusion to explain the benefit of employing multimodal fusion and how to
design multimodal fusion algorithms to achieve better performance than single-modality
approaches, based on both the complementary and correlative relations, which are first
discovered and studied by us.
2.2 Correlative Relation
The data from different modalities tend to contain the same or similar semantic
information and to correlate with each other. For word sense disambiguation, the correlative
relation between text and images means that the images and textual sentences of the same
document tend to contain semantic information describing the same objects or concepts.
For example, the image and textual sentence in Figure 2-1(A) both refer to the sense
“bass (fish)”, while the image and sentence in Figure 2-1(B) both describe the sense “bass
(instrument)”.
A) “fish of florida: rock sea bass”   B) “l.a. kidwell musical instruments - product (bass 006)”
Figure 2-1. Examples selected from the UIUC-ISD dataset [2] for keyword “bass”. The left figure shows a document carrying sense “bass (fish)” and the right figure shows another document carrying sense “bass (instrument)”. Photo courtesy of Kate Saenko.
Because information from different modalities has this correlative relation, the modalities
tend to be correlated in their feature spaces as well. It is then possible to conduct correlation
analysis to construct a unified feature space across multiple modalities to represent
multimodal documents. Previous research papers [12, 13, 17] exploit the correlative
relation to develop a unified representation model for multimodal documents, although
most of them did not identify this correlative relation explicitly.
In the late fusion scheme, images and text also display a certain correlation at the
decision level. For example, some images and textual sentences are classified to the same
senses correctly in the experiments for WSD. But the late fusion scheme obviously cannot
exploit the correlation of images and text at the feature level.
2.3 Complementary Relation
Data from multiple modalities are complementary to each other by containing
different semantic information. For example, in the word sense disambiguation case,
textual sentences contain more useful and relevant information for disambiguation in some
documents, while images contain more useful information in others. In
Figure 2-2(A), the sentence “portfolio 2” contains little information to disambiguate senses
for “bass”, while the image depicts the “bass (fish)” object. In Figure 2-2(B), the image is
rather complex and shows a lot of foreground objects, including a person, a boat, a fish,
a lake and trees, while the textual sentence contains cues which can be directly used to
disambiguate, such as “fishing”, “lake” and “catch”.
A) “portfolio 2”   B) “lake fork fishing guide, bass guide - guarantees bass catch”
Figure 2-2. Examples selected from the UIUC-ISD dataset [2] for sense “bass (fish)”. Photo courtesy of Kate Saenko.
In the other direction, if we process data from multiple modalities separately, the
results of those approaches are also complementary to each other. Take image processing
and text processing as an example: for some documents text processing generates correct
results, while for others image processing does. The reasons are twofold: first, the
semantic information in images and text is complementary; second, text processing
usually has high precision but low recall, while image processing has low precision but
high recall.
In word sense disambiguation, the Yarowsky algorithm [16], which we use to disambiguate
textual sentences, has high confidence in its disambiguation results but frequently fails to
disambiguate unseen documents. On the other hand, image disambiguation using SVM
classification has lower precision but higher recall, because it can disambiguate all the
unseen documents, though with lower confidence. Text retrieval and image retrieval have
a similar complementary relationship. After inverted indexing is applied to the textual
data, text retrieval has high precision but low recall due to the sparse representation of
short textual sentences, while image retrieval has high recall but low precision due to its
dense and noisy representation of images. This observation motivated us to design our
ensemble fusion model to combine the results of text processing and image processing,
which is explained in Chapter 5.
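One simple way to exploit this precision/recall asymmetry is a back-off combiner: trust the high-precision text result when it is confident, and fall back to the high-recall image result otherwise. This is only an illustrative sketch of the complementation idea, not the dissertation's actual fusion rules (those are given in Chapter 5):

    def backoff_fusion(text_pred, text_conf, image_pred, threshold=0.8):
        # Text processing: high precision, low recall -- use it when confident.
        if text_pred is not None and text_conf >= threshold:
            return text_pred
        # Image processing: low precision, high recall -- always has an answer.
        return image_pred

    # Text processing abstained on this document, so the image result is used.
    print(backoff_fusion(None, 0.0, "bass (fish)"))  # bass (fish)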
2.4 Discussion
2.4.1 Why Multimodal Fusion Works
Complementary and correlative relations can both be leveraged in multimodal
processing tasks such as word sense disambiguation to achieve high performance. They
usually co-exist within the same dataset, though they may appear in different
documents. These two relations reveal the potential of using multimodal fusion to achieve
higher quality than single-modality approaches, since multimodal data can either provide
additional information or emphasize the same information.
2.4.2 How to Design Multimodal Fusion Algorithms
The goal of multimodal fusion is to achieve higher performance than single-modality
approaches. The advantage of using multimodal fusion, as discussed above, is the ability
to exploit the correlative and complementary relations between different modalities. To
design effective multimodal fusion algorithms, we need to first examine the relationship
25
(correlative relation, complementary relation, or both) between different modalities in
multimodal data. Then we need to determine which fusion scheme (early fusion, late
fusion or both), can effectively capture the relationship between modalities in the data.
The last step is to design specific algorithms for early fusion or late fusion.
26
CHAPTER 3STREAMING FACT EXTRACTION
Wikipedia.org is the largest online resource for free information and is maintained by
a small number of volunteer editors. The website is estimated to have nearly 365 million
readers worldwide. It contains over 5 million English articles; these pages can easily be
neglected, becoming out of date. Any news-worthy event may require an update of several
pages. To address this issue of stale articles, we create a system that reads in a stream of
diverse web documents and recommends facts to be added to specified Wikipedia pages.
We developed a three-stage streaming system that creates models of Wikipedia pages,
filters out irrelevant documents and extracts facts that are relevant to Wikipedia pages.
The system is evaluated over a 500M page web corpus and 139 Wikipedia pages. Our
results show a promising framework for fast fact extraction from arbitrary web pages for
Wikipedia.
An important part of keeping Wikipedia (WP) usable is to include new and current content.
Presently, there is considerable time lag between the publication of an event and its
citation in WP. The median time lag for a sample of about 60K web pages cited by WP
articles in the living people category is over a year and the distribution has a long and
heavy tail [5]. Such stale entries are the norm in any large reference work because the
number of humans maintaining the reference is far fewer than the number of entities.
Reducing latency keeps WP relevant and helpful to its users. Given an entity page,
such as wiki/Boris_Berezovsky_(businessman), possible citations may come from a
variety of sources. Notable news may be derived from newspapers, tweets, blogs and
other sources, including Twitter, Facebook, arXiv, etc. However, the
actual citable information is a small percentage of the total documents that appear on the
web. To help WP editors, a system is needed to parse through terabytes of documents and
select facts that can be recommended to particular WP pages.
Previous approaches are able to find relevant documents given a list of WP entities
as query nodes [24–28]. Entities of three categories (person, organization and facility) are
considered. This work involves processing large sets of documents to determine which facts
may contain references to a WP entity. This problem becomes increasingly difficult
when we look to extract relevant facts from each document. Each relevant document
must now be parsed and processed to determine if a sentence or paragraph is worth being
cited. Discovering facts across the Internet that are relevant and citable to the WP entities
is a non-trivial task. Here we produce an example sentence from a webpage: “Boris
Berezovsky, who made his fortune in Russia in the 1990s, passed away March 2013.”
After parsing the sentence, we must first note that there are two entities named ‘Boris
Berezovsky’ in WP; one a businessman and the other a pianist. Any extraction needs to
take this into account and employ a viable distinguishing policy (entity resolution). Then,
we match the sentence to find a topic such as DateOfDeath valued at March 2013. Each
of these operations is expensive, so an efficient framework is necessary to execute these
operations at web scale.
In this section, we introduce an efficient fact extraction system for given WP entities
from a time-ordered document stream. Fact extraction is defined as follows: match each
sentence to the generic sentence structure of subject — verb — adverbial/complement.
The subject represents the entity (WP entity) and the verb is the relation type (slot) we
are interested in (e.g. Table 3-1). The third component, adverbial/complement, represents
the value of the associated slot. In our example sentence, the entity of the sentence is
Boris Berezovsky and the slot we extract is DateOfDeath with a slot value of March 2013.
The resulting extraction containing an entity, slot name and slot value is a fact.
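In code form, the resulting fact is just this triple; the record below is a hypothetical representation (the field names are mine), used only to keep the definition concrete:

    from typing import NamedTuple

    class Fact(NamedTuple):
        entity: str  # the WP entity (sentence subject)
        slot: str    # the relation type, e.g. "DateOfDeath" (the verb)
        value: str   # the slot value (adverbial/complement)

    fact = Fact("Boris Berezovsky", "DateOfDeath", "March 2013")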
Our system contains three main components. First, we pre-process the data and
build models representing the WP query entities. Next, we use the models to filter a large
stream of documents so they only contain candidate citations. Lastly, we process sentences
from candidate extractions and return slot values. Overall, we contribute the following:
Table 3-1. The set of slot names.
Person                    Facility              Organization
Affiliate                 Affiliate             Affiliate
AssociateOf               Contact_Meet_Entity   TopMembers
Contact_Meet_PlaceTime                          FoundedBy
AwardsWon
DateOfDeath
Titles
FounderOf
EmployeeOf
• Introduce a method to build models of WP name variations;
• Build a system to filter large numbers of diverse documents using a natural language processing rule-based extraction system;
• Extract, infer and filter entity-slot-value triples of information to be added to the KB.
Our system extracts hundreds of thousands of facts from 5TB of multimodal text data,
including blogs, news, forum posts, tweets and Wikipedia. The multimodal text data has been
preprocessed and annotated using natural language processing tools, thus multimodal
fusion is not the major problem here. In this chapter, I focus on discussing the streaming
system handling large datasets and the pattern matching algorithm extracting missing
facts for Wikipedia.
3.1 System
In this section, I introduce the main components of the streaming fact extraction
system. Our system is built with a pipeline-style architecture, which allows each stage to
run separately and enables stream processing without blocking the data flow between
components (Figure 3-1). The three logical components are Model for entity resolution
purposes, Wikipedia Citation to annotate cite-worthy documents, and Slot Filling to
generate the actual slot values.
To discover facts for a single WP entity, the first step is to extract aliases of the
entity. We extract several name variations from the Wikipedia.org API and from the WP
entity page. Also, if the entity type is person, we can change the order of name parts to
increase coverage (e.g. ‘Boris Berezovsky’ -> ‘Berezovsky, Boris’). Next, we iterate over
documents in the stream and filter out all documents that do not explicitly contain a
string matching the list of entities. To extract relevant facts, we perform pattern matching
over each sentence that matches the entity based on a dictionary of patterns. If a sentence
activates one of the patterns in the dictionary, we emit this sentence as a candidate
contribution for the WP entity. With the candidate set, we infer new facts and clean up
the set by removing values that violate a list of constraints, such as duplicates.

Figure 3-1. Streaming fact extraction system architecture.
3.1.1 Entity Model
We use the Wikipedia.org API to retrieve aliases. The API allows us to request
pages that redirect users to an entity page. For example, if a WP user tries to access
the ‘William Henry Gates’ entry they are sent to the page for ‘Bill Gates’; we treat such
redirects as aliases. To extract more aliases, we parse the HTML source of a WP entity
page. Using regular expressions, we extract the bold phrases of the initial paragraph as
aliases. This method provides several inline aliases from the wiki page. In the WP page for
the businessman ‘Boris Berezovsky’, for example, there is a mention of ‘Boris Abramovich
Berezovsky’ given in bold, which is obtained by the regular expression extraction.
We pass the full set of person entities through rules for generating alternate name
orders. This module produces various forms of expressing entity names and titles. For
example, ‘Bill Gates’ can be written as ‘Gates, Bill’. This allows the system to capture
various notation forms of aliases that appear in text documents.
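A sketch of both alias sources is shown below. The MediaWiki query parameters and the simple name-reordering rule are my assumptions about how such a step could be implemented, not the system's exact code:

    import requests

    def redirect_aliases(title):
        # Pages that redirect to `title` are treated as aliases.
        resp = requests.get("https://en.wikipedia.org/w/api.php", params={
            "action": "query", "list": "backlinks", "bltitle": title,
            "blfilterredir": "redirects", "bllimit": "max", "format": "json"})
        return [b["title"] for b in resp.json()["query"]["backlinks"]]

    def name_orders(name):
        # Alternate name orders for person entities.
        parts = name.split()
        return [name] if len(parts) < 2 else \
               [name, "%s, %s" % (parts[-1], " ".join(parts[:-1]))]

    print(name_orders("Boris Berezovsky"))
    # ['Boris Berezovsky', 'Berezovsky, Boris']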
3.1.2 Wikipedia Citation
The goal of this step is to use the models created to discover a set of documents that
are relevant to the WP entity. As a stream of documents comes in, we first perform
string matching between the model aliases and the document text. We can use this
technique as a first filter with confidence, because previous work states that
non-mentioning documents have little chance of being citable in Wikipedia. Given our
large number of aliases, we can be confident that if no alias appears in a document, the
document does not need to be cited.
Our system streams in documents in the form of chunk files. Each chunk file contains
thousands of documents. This corpus of documents is processed by a two-layer filter
system referred to as Document Chunk Filter and Document Filter. The purpose of these
filters is to reduce I/O cost while generating slot values for various entities. Document
Chunk Filter removes the chunk files that do not contain a mention of any of the desired
entities. Each chunk file may contain thousands of documents and each document is
expensive to process. The Document Filter removes documents that do not contain a
mention of an entity. This two-level filter allows us to perform detailed slower processing
over a smaller set of documents. Not all chunk files contain mentions of the entities, so
filtering out large chunk files early saves I/O and processing costs. Document Chunk
Filter discards non-mentioning chunk files and promotes chunk files as soon as an entity
mention is found. The Document Filter additionally notes the sentences that contain entity
mentions. This data is passed to the Slot Filling system.
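The two-layer filter can be sketched as follows; the Doc record and the substring-based mention test are simplifying assumptions (the real system works on annotated, serialized chunk files):

    from typing import NamedTuple

    class Doc(NamedTuple):   # hypothetical document record
        text: str
        sentences: list

    def chunk_filter(chunks, aliases):
        # Layer 1 (Document Chunk Filter): skip a whole chunk file unless some
        # document in it mentions an alias -- this is where the I/O savings come from.
        for chunk in chunks:
            if any(a in doc.text for doc in chunk for a in aliases):
                yield chunk

    def document_filter(chunk, aliases):
        # Layer 2 (Document Filter): keep mentioning documents, noting the
        # mentioning sentences for the slot filling stage.
        for doc in chunk:
            hits = [s for s in doc.sentences if any(a in s for a in aliases)]
            if hits:
                yield doc, hits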
3.1.3 Slot Filling
Streaming Slot Filling (SSF) extracts fact values from sentences according to a list of
patterns. Table 3-1 lists the slot relationships that we look to extract. In Figure 3-1, we
refer to this task as Slot Filling. SSF reads documents filtered by the Wikipedia Citation
step and fetches and tags sentences containing WP entities. All entities are extracted
from the document using a natural language processing tool. Next, we describe how
WP entities are matched against the set of patterns; following that, we discuss our
approach to inference over the extracted facts.
A pattern is a template of a fact to be extracted and added to a WP entity. Patterns
are used to find and extract facts from text. A pattern P is represented as a five-tuple
P = (p1, p2, p3, p4, p5). The first value, p1, represents the type of entity. These entity
types are in the set {FAC, ORG, PER}, where FAC represents a facility, ORG
represents an organization and PER represents a person. p2 represents a slot name; the list
of slot names is presented in Table 3-1. The third element, p3, is the pattern content, i.e.
a string found in the sentence that identifies a slot name; the extractor looks specifically
for this pattern content. The pattern evaluator uses a direction (left or right) found in p4 to
explore the sentence. The final element, p5, represents the slot value of a pattern: the value
type may be the entity type labeled by the named entity extractor, a noun phrase
(NP) tagged by a part-of-speech tagger or a phrase described in the pattern list.
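The five-tuple and a crude evaluation step might look as follows; the field names, the example pattern and the naive value extraction are mine, standing in for the annotated, NLP-driven matching the text describes:

    from typing import NamedTuple

    class Pattern(NamedTuple):
        entity_type: str   # p1: one of {"FAC", "ORG", "PER"}
        slot_name: str     # p2: e.g. "DateOfDeath"
        content: str       # p3: trigger string to look for in the sentence
        direction: str     # p4: "left" or "right" of the trigger
        value_type: str    # p5: expected type of the slot value

    p = Pattern("PER", "DateOfDeath", "passed away", "right", "DATE")
    sentence = ("Boris Berezovsky, who made his fortune in Russia "
                "in the 1990s, passed away March 2013.")

    if p.content in sentence:
        # The real evaluator walks in p.direction and keeps a span of
        # p.value_type; taking the right-hand remainder is a stand-in.
        value = sentence.split(p.content, 1)[1].strip(" .")
        print(("Boris Berezovsky", p.slot_name, value))
        # ('Boris Berezovsky', 'DateOfDeath', 'March 2013')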
3.1.4 Constraints and Inference
Our dataset contains some duplicate webpages, webpage texts with similar content,
and some of the entity tags are incomplete. This causes some duplicates or highly similar
content in the extracted list. We implement a filter to remove duplicates and fact
extractions that match overly general patterns, which are highly susceptible to noise.
Since the data contains duplicates and incorrect extractions, we define rules that read ordered sets
of facts to sanitize the output. The input is processed in time order, in a tuple-at-a-time
fashion to allow rules to discover noisy slots that appear in close proximity. We define
two classes of rules: deduplication and inference rules.
The output contains many duplicate entries. As we read the list of extracted slots we
create rules to define ‘duplicate’. Duplicates can be present in a window of rows; we use a
window size of 2, meaning we only use adjacent rows. Two rows are duplicates if they have
the exact same extraction, if they have the same slot name and a similar slot value,
or if, for specific slot types, the extractions come from the same sentence.
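A window-of-2 deduplication pass over time-ordered (entity, slot, value, sentence) rows could look like the sketch below; the similarity test and its threshold are assumptions:

    from difflib import SequenceMatcher

    def similar(a, b, threshold=0.9):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    def dedup(rows):
        out = []
        for row in rows:                         # rows are in time order
            prev = out[-1] if out else None
            if prev and row[1] == prev[1] and (
                    row == prev or               # same exact extraction
                    similar(row[2], prev[2]) or  # similar slot value
                    row[3] == prev[3]):          # came from the same sentence
                continue                         # drop the adjacent duplicate
            out.append(row)
        return out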
New slots can be deduced from existing slots by defining inference rules. For example,
two slots for the task are FounderOf and FoundedBy. A safe assumption is that these slot
names form a biconditional logical connective between the entities and slot values. Therefore,
we can express a rule ‘X FounderOf Y <=> Y FoundedBy X’ , where X and Y are single
unique entities. Additionally, we found that the slot names Contact_Meet_PlaceTime
could be inferred as Contact_Meet_Entity, if the Entity was a FAC and the extracted
sentence contained an additional ORG/FAC tag. We also remove erroneous slots that have
extractions that are thousands of characters in length or tool small. Errors of extracting
long sentences can typically be attributed to poor sentence parsing of web documents. We
have some valid ‘small’ extractions. For example, a comma may separate a name and a
title (e.g. “John, Professor at MIT”). But such extraction rules can be particularly noisy,
so we check to see if the extracted values have good entity values.
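The FounderOf/FoundedBy biconditional translates into a one-pass inference over the extracted triples; this is a minimal sketch of that single rule, not the full rule set:

    MIRROR = {"FounderOf": "FoundedBy", "FoundedBy": "FounderOf"}

    def apply_biconditional(facts):
        # X FounderOf Y <=> Y FoundedBy X, over (entity, slot, value) triples.
        inferred = set(facts)
        for entity, slot, value in facts:
            if slot in MIRROR:
                inferred.add((value, MIRROR[slot], entity))
        return inferred

    facts = {("Bill Gates", "FounderOf", "Microsoft")}
    print(apply_biconditional(facts))
    # also contains ('Microsoft', 'FoundedBy', 'Bill Gates')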
3.2 Evaluation
We evaluate the effectiveness of extracting slot values for 139 entities. We look at the
baseline coverage for the entities and slot names in a 500M page snapshot of the
English web. We estimate the precision and recall of our extractions over several extracted
facts.
Our system was developed on a 32-core server described in Table 3-2. Each document
is annotated using named entity extraction and in-document coreference. Bundles of
documents are serialized into chunks and encrypted. The total size of the data after
compression and encryption is 4.5TB. Data is ordered into 11952 date-hour buckets ranging
from 2011-10-05-00 (5th of October 2011, 12am) until 2013-02-13-23 (13th of February
2013, 11pm). The first four months of data (October 2011 - February 2012) are used for
training purposes, and we use this portion for rule and pattern creation and tuning. The data set
contains text from several web page types as listed in Table 3-3.
We developed 172 extraction patterns covering each slot-name/entity-type combination.
Out of the 500M documents and 139 entities, we found 158,052 documents containing
query entities and 17,885 unique extracted slot values for 8 different slots. We did not get
any results for 31 entities and 4 slots.
Table 3-2. Server specifications.
Spec            Details
Processor       32-core AMD Opteron(TM) 6272
OS              CentOS release 6.4 (final)
Software stack  GCC 4.4.7, Java 1.7.0.25, Scala 2.9.2, SBT 0.12.3
RAM             64GB
Drives          2x 2.7TB disks, 6Gbps, 7200RPM
Table 3-3. Document chunk distribution.
Document type     # Documents
Arxiv             10988
Classified        34887
Forum             77674
Linking           12947
Mainstream News   141936
Memetracker       4137
News              280629
Review            6347
Social            688848
Weblog            740987
Table 3-4 reports two sampled evaluations of the results, estimating the correctness
of the extractions. The first sample addressed the overall performance measures of the
system, e.g. precision and recall. The second was performed over an enhanced version of
the system, in which we included the aliases from the WP API, the alias generation process,
and some additional patterns. We produced accuracies in the range of 54% to 55%. We
classify the errors into two sets: incorrect entities and incorrect extractions. We found 15%
and 17% incorrect entity names, and we identified 27% and 31% incorrect values extracted
across all entities and slot types. The majority of errors were due to poor slot value
extraction patterns and incomplete aliases.

Table 3-4. Sampled accuracy of the results of the extracted facts.
              Correct   Incorrect entity name   Incorrect value
Sampling #1   55%       17%                     27%
Sampling #2   54%       15%                     31%
After enhancing the system with better and more extraction patterns, we provide more
detailed statistics, displayed in Table 3-5 and Table 3-6. Table 3-5 shows the recall
for each slot name. Entities can have different coverages across the entire Web: some
are more popular ('William H. Gates'), while others are less well known ('Stevens
Cooperative School'). Similarly, slot names have various coverages; for example, Affiliate is
more probable across the entities than AwardsWon. The slot name Affiliate was extracted
the most times; AwardsWon was extracted the fewest, with 38 instances found.
Table 3-5. Recall measure: generic slot names like Affiliate had the most recall, compared to less popular slot names, e.g. DateOfDeath.
Slot name                Instances found   Entity coverage
Affiliate                108598            80
AssociateOf              25278             106
AwardsWon                38                14
Contact_Meet_Entity      191               8
Contact_Meet_PlaceTime   5974              109
DateOfDeath              87                14
EmployeeOf               75                16
FoundedBy                326               30
FounderOf                302               29
Titles                   26823             118
TopMembers               314               26
Table 3-6. Accuracy measure: accuracy of AssociateOf was the best, while Affiliate performed poorly due to the ambiguity of being an affiliate of somebody/something.
Slot name                Correct   Wrong entity   Incorrect value
Affiliate                1%        95%            5%
AssociateOf              63.6%     9.1%           27.3%
AwardsWon                10%       10%            80%
Contact_Meet_Entity      21%       42%            37%
Contact_Meet_PlaceTime   5%        20%            85%
DateOfDeath              29.6%     71%            25%
EmployeeOf               5%        30%            65%
FoundedBy                62%       17%            21%
FounderOf                50%       0%             50%
Titles                   55%       0%             45%
TopMembers               33%       17%            50%
An Affiliate relationship can be defined in three general ways [29]:

• A relationship consisting solely of two groups interacting in a specific event context is not enough evidence to constitute a religious/political affiliation;

• Former political or religious affiliations are correct responses for this slot;

• Any relation that is not of parent-child form; a suborganization is not an affiliate of its parent organization but rather a MemberOf.
Affiliate is a generic slot name; extracting affiliate relationships is difficult because the
actual relationship must be determined. Our patterns for this relationship led to noisy
results.
In contrast, less ambiguous slot names such as AssociateOf obtained higher accuracy but
lower recall. We developed patterns that explicitly expressed these relationships, but
we did not create enough patterns to express all forms of those slot names.

Table 3-6 reports the relative accuracy per slot value. AssociateOf has the
highest accuracy at 63.6%, while Affiliate, Contact_Meet_PlaceTime and EmployeeOf
have the lowest accuracies at 1%, 5% and 5% respectively.
3.3 Discussion
Table 3-5 shows the distribution of extracted slot names. The number of extractions
varies greatly between slot names. Some slots naturally have more results than others.
For example, DateOfDeath and CauseOfDeath cover some of the fewest entities, because
only a few of the entities are deceased.

Some patterns use common words as part of their pattern content, causing more
extractions. For example, Affiliate looks for common words (like and, with) as part of the
pattern content. These words are more common than dead, died or founded used in other
patterns.
Some of the entities are popular and appear at a greater frequency in the data set. For
example, a Google search for 'Corn Belt Power Cooperative' returns 86,800 documents,
while 'William H. Gates' returns 3,880,000 documents. We observed that more than half
of the entities appear in fewer than 10 documents in the data set, and a large portion
appear only once. This significant variation in coverage supports the viability of our
search and filter schemes.
The system pipeline architecture is an efficient method of processing the stream of
data. Each hour of data in the corpus contains an average of 380 MB of compressed data.
It takes an hour for the system to extract facts from 140 hours' worth of data from the
KBA corpus.
For more details about this project, please refer to our papers [30, 31].
CHAPTER 4
SCALABLE IMAGE RETRIEVAL
In this dissertation, one of the major problems we study is multimodal
information retrieval on images and text. The ensemble fusion model for multimodal
information retrieval is explained further in Chapter 5.
As the number of images on the Internet grows rapidly, the scalability of image
retrieval systems becomes a significant issue. The remaining sections in this chapter focus
on the scalability of image retrieval over millions of images, since text retrieval is usually
very efficient over millions of titles using existing technologies. In this chapter, we propose
two distributed clustering algorithms that scale up the bag-of-visual-words model to millions
of images and billions of visual features by leveraging distributed systems.
Image retrieval is the search for desired images in an image dataset according to
queries from users. Content-based image retrieval (CBIR), which emerged in the 1990s, is
a special case of image retrieval, where the queries are images and the search process is
based on the visual content of images rather than textual captions or image labels. In the
following sections, the term 'image retrieval' refers specifically to CBIR, since our focus is
to solve the image retrieval problem based on visual content on large-scale datasets.
Huge image datasets of terabytes or even petabytes have been generated on
the Internet. For example, ImageNet [32], an open image dataset for computer science
research, contains over 20 million images, and social networks, such as Facebook and
Twitter, can generate petabytes of images every day. Comparing all the images in
an existing dataset to the query images is not a scalable solution; thus indexing is a
necessary step for handling large-scale image datasets. In order to index images, they should
be represented as vectors, similar to the bag-of-words model in information retrieval.
With this motivation, the bag-of-visual-words model was designed in the computer vision
community [33, 34] to represent images as 'visual words' vectors. Existing indexing
approaches in information retrieval, such as inverted indexing, can then be directly applied
to the 'visual words' vectors.
Since the building process of the bag-of-visual-words model requires a lot of time
on large image datasets, we designed two distributed clustering algorithms that scale up
the building process of the bag-of-visual-words model by utilizing state-of-the-art
distributed systems.
4.1 Background
The bag-of-visual-words (BoVW) model first appeared in the early 2000s [33] and
has been widely used in the computer vision community for tasks such as category
classification [35] and image retrieval [34]. BoVW represents an image as a histogram
of independent visual words in vector format. Visual words are generated by applying
clustering to the local features of images. We can then use indexing approaches to index the
visual-word vectors of images. The process of building the bag-of-visual-words model on an
image dataset is described in Figure 4-1.
In the feature extraction step, local features, such as interest points or local patches,
are extracted from images. We have chosen SIFT (Scale-Invariant Feature Transform)
features [36], which are invariant to scale, rotation and illumination, making SIFT an ideal
candidate for the bag-of-visual-words model.
After feature extraction, a clustering algorithm is used to divide features into different
clusters. Researchers [33–35] have commonly used K-Means clustering for its simplicity
and rapid convergence, but previous work [34] pointed out that K-Means cannot scale up
to large numbers of clusters. Even a distributed K-Means, such as Mahout K-Means, fails
to scale to large numbers of clusters. Thus we have implemented two distributed
clustering algorithms on Hadoop to overcome this issue.
After the clustering step, clusters are treated as independent visual words, and finally
a visual vocabulary is formed from these visual words. Then, for a given image, the local
features are quantized by assigning the closest visual word to each of them, creating a
histogram of visual words. For example, the cat image is represented as (1, 3, 2, 2)^T in
Figure 4-1.

Figure 4-1. The process of building the BoVW model. Reprinted with permission from Google Images, https://images.google.com/ (October 20, 2017).
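As an illustration of the quantization step just described, the sketch below builds a visual-word histogram with NumPy, assuming features holds one image's SIFT descriptors and vocabulary holds the clustered visual-word centroids; this is a single-image sketch, not our distributed implementation.

    import numpy as np

    def bovw_histogram(features, vocabulary):
        # features: (n, 128) SIFT descriptors; vocabulary: (k, 128) centroids.
        # Squared Euclidean distance from every feature to every centroid.
        d = ((features[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
        words = d.argmin(axis=1)  # index of the closest visual word per feature
        # Count how often each visual word occurs, e.g. (1, 3, 2, 2) for the cat.
        return np.bincount(words, minlength=len(vocabulary))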
To handle millions of images and billions of features, state-of-the-art distributed
systems were employed for both scalability and stability in our algorithms. All the
time-consuming steps, such as feature extraction, vocabulary construction and image
representation, are run on Hadoop [37]. Mahout [38], an open-source scalable machine
learning library, provides a distributed K-Means implementation on top of Hadoop, which
we also utilized in our distributed hierarchical K-Means. Solr [39], an information retrieval
server based on Lucene [40], is used for indexing and searching.
4.2 Related Work
In recent years, some of the research efforts in image retrieval community have been
focusing on developing scalable algorithms for image retrieval. For example, in [41],
Perronnin et al. applied compressed Fisher kernel framework instead of the BoVW model
to obtain better retrieval quality, and the compressed Fisher kernel framework was more
40
efficient than the non-compressed version. In [42], Deng et al. proposed a hierarchical
semantic indexing to handle large-scale similarity learning for images. The proposed
learning approach was fundamentally parallelizable and as a result scales more easily
than previous work, as stated in their paper. These previous work focus on designing new
algorithms to improve retrieval quality without spending too much time for the retrieval
process, while what we did was to scale up an existing mature BoVW model.
A few projects have used Hadoop as a distributed platform to process image search in
parallel. Hadoop was used to parallelize feature extraction, indexing and searching in [43]
by Gu and Gao. In [44], Yin and Liu first built a database of image features using SURF
(Speeded Up Robust Features) algorithm and LSH (Locality-Sensitive Hashing) and then
performed the search on Hadoop in a parallel way. In [45], Premchaiswadi et al. proposed
a similarity metric between images and performed parallel similarity computation between
the query image and existing images using Map-Reduce on Hadoop. Grace et al. [46]
employed Hadoop Map-Reduce to extract features, compute similarity scores and rank
the images based on similarity scores on medical datasets. Most of the related work listed
above employed Hadoop Map-Reduce to parallelize the search process of finding similar
images, while in our projects we used Hadoop as the platform to accelerate the building
process of the BoVW model.
4.3 System
To process a large number of images at high speed, the BoVW model is built in
parallel on top of Hadoop. After encoding images with visual words, the size of the visual
words vectors is significantly smaller than the original image dataset, usually less than
0.1%. A Solr server can then be deployed to handle the indexing and searching quite
efficiently without requiring significant resources. In our experiments, the image searching
process is very fast, ususally costing less than a few seconds.
One may argue that the BoVW building process can be conducted offline, so scaling
up the building process is not necessary. However, the BoVW building process usually
needs to be run many times to tune the vocabulary size, i.e. the number of visual
words. A slow approach may take a few days to finish on large datasets with large
numbers of visual words, while a fast approach only costs a few hours in the same
scenario, as shown in our experiments.
4.3.1 Overview
Since a single-node cluster and multi-processing cannot deal with such many images,
we employed a Hadoop cluster to provide scalability and stability for our system. The
feature extraction and image representation both fit the data-parallel scheme of the
Map-Reduce paradigm, hence straight-forward to be parallelized on the Hadoop using
Map-Reduce. Lire [47] is used to extract 128-dimensional SIFT features from images.
The bottleneck of the system is the vocabulary construction step, because it involves
iterative clustering algorithms to generate visual words from large numbers of local
features. As shown in related work [33–35], K-Means was used as the default clustering
algorithm to generate visual words for its fast convergence and good performance.
However, the performance of K-Means, even the distributed Mahout K-Means, deteriorates
quickly as the number of clusters increases. Thus we have designed and implemented
distributed approximate K-Means (d-AKM) and distributed hierarchical K-Means
(d-HKM) algorithms on Hadoop to solve this problem. While both d-AKM and d-HKM
run much faster than Mahout K-Means, d-AKM has better running time than d-HKM
for smaller cluster numbers, and d-HKM works better for larger cluster numbers, as
demonstrated in our experiments.
4.3.2 Distributed Clustering Algorithms
The most time-consuming step of each iteration in these three algorithms is the
assignment step, in which each feature is assigned to its nearest cluster. Assume that each
HDFS block in Hadoop holds s features and the Hadoop cluster has sufficient resources;
then the time complexity of one iteration of Mahout K-Means (d-KM) on Hadoop is
O(s · k). The complexities of the three algorithms for one iteration are shown in Table 4-1.
Table 4-1. The time complexity of one iteration of Mahout K-Means (d-KM), d-AKM and d-HKM.
Algorithm    d-KM       d-AKM           d-HKM
Complexity   O(s · k)   O(p% · s · k)   O(s · √k)
4.3.2.1 Distributed approximate K-Means
In d-AKM, we apply an approximate search using a randomized k-d tree forest
to find the nearest cluster centroid for each feature, as introduced in [48–50]. The
d-AKM is parallelized using Map-Reduce on Hadoop. Assume d-AKM uses at
most p% · k comparisons for each feature when searching for its closest cluster centroid
among the k clusters; then the running time complexity of one iteration of d-AKM is reduced
to O(p% · s · k). The time complexity of building the k-d tree is O(k · log k) [48], which is much
smaller than O(p% · s · k), since s is usually much larger than k and log k.
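The single-node sketch below illustrates the tree-based assignment step, assuming SciPy is available; note that SciPy's k-d tree returns exact nearest neighbors, whereas the randomized k-d tree forest of [48–50] trades exactness for a bounded number of comparisons, and in our system this step runs inside Hadoop mappers.

    from scipy.spatial import cKDTree

    def assign_to_centroids(features, centroids):
        # Build a k-d tree over the k centroids: O(k log k).
        tree = cKDTree(centroids)
        # Query the tree instead of scanning all k centroids per feature.
        _, labels = tree.query(features)
        return labels  # index of the (near-)nearest centroid per feature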
4.3.2.2 Distributed hierarchical K-Means
The d-HKM is shown in Figure 4-2. At the top layer, a single Mahout K-Means is
applied to divide the feature dataset into k_t clusters in parallel on Hadoop. At the bottom
layer, for each of the k_t clusters, a single Mahout K-Means is applied to divide
that cluster into k_b clusters in parallel. All the bottom-level Mahout K-Means clustering
processes run in parallel, with the total number of clusters k = k_t × k_b.

At the top level, the running time complexity of one iteration of Mahout K-Means
is O(s · k_t). At the bottom level, for each Mahout K-Means, the time complexity
of one iteration is O(s · k_b). Assuming m bottom-level Mahout K-Means
clustering processes run at the same time, the running time complexity of one iteration of all
the bottom-level K-Means processes is O(s · k_b · k_t / m) = O(s · k / m). Thus, when k_t,
k_b and m are close to each other, the time complexity of one iteration of both the top-level
and the bottom-level clustering processes is O(s · √k).
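A single-node sketch of the two-level scheme follows, assuming scikit-learn; in d-HKM both levels are Mahout K-Means jobs on Hadoop, with the bottom-level jobs executing concurrently.

    import numpy as np
    from sklearn.cluster import KMeans

    def hierarchical_kmeans(features, kt, kb):
        # Top level: partition the features into kt coarse clusters.
        top = KMeans(n_clusters=kt).fit(features)
        centroids = []
        for c in range(kt):
            subset = features[top.labels_ == c]
            if len(subset) == 0:
                continue
            # Bottom level: split each coarse cluster into (up to) kb clusters.
            bottom = KMeans(n_clusters=min(kb, len(subset))).fit(subset)
            centroids.append(bottom.cluster_centers_)
        # The k = kt * kb centroids form the visual vocabulary.
        return np.vstack(centroids)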
Figure 4-2. The top-down hierarchical K-Means.
In addition, the number of iterations is positively related to the number of
clusters. d-AKM usually converges in a similar number of iterations to Mahout
K-Means. For d-HKM, both the top-level and bottom-level K-Means converge in
smaller numbers of iterations than Mahout K-Means. In conclusion, both d-HKM
and d-AKM should run much faster than Mahout K-Means.
4.4 Evaluation
The Oxford dataset and ImageNet dataset were used to evaluate the running
time performance of our system, especially the distributed clustering algorithms. The
experiments were run on the Pivotal Analytics Workbench (AWB) and Amazon Web
Services (AWS).
4.4.1 Datasets
4.4.1.1 Oxford
The Oxford building dataset, provided by the University of Oxford [34], contains 5,062
images of landmark buildings around the Oxford campus, collected from Flickr.
4.4.1.2 ImageNet
The two training datasets of the ImageNet Large Scale Visual Recognition Challenge
2014 (ILSVRC14) [51] were used to provide a large dataset of 185GB with over 1.7 million
images and over 230 million features.
The specifics of the two datasets are shown in Table 4-2.

Table 4-2. Dataset specifics.
Dataset    Image #     Image size   Feature #     Feature size
Oxford     5,062       2.2GB        2,734,105     3.0GB
ImageNet   1,737,734   185.0GB      230,428,057   260.6GB
4.4.2 Performance of Mahout K-Means, d-AKM and d-HKM
This section compares the performance of Mahout K-Means (denoted d-KM in
the figures), d-AKM and d-HKM with different cluster numbers. Note that 'performance'
refers to running time in this chapter. In all the experiments listed in this chapter,
the maximum number of comparisons conducted in each iteration of d-AKM is 5% of the
number of clusters, and k_t, k_b and m are roughly the same for d-HKM.
The first experiment compares the running time of Mahout K-Means, d-HKM
and d-AKM on the Oxford dataset with small cluster numbers on AWB, as shown in
Figure 4-3. The running time of d-KM increases almost linearly with the number of
clusters, while the curves for d-AKM and d-HKM remain very flat. d-AKM performs
better than d-HKM because d-HKM has a large overhead due to its two-layer setup and
multi-threading mechanism. When the cluster number increases to 10k, the running time
of d-KM increases to over 1000 minutes, demonstrating that Mahout K-Means cannot
scale up to large cluster numbers.

Figure 4-3. Running time of different algorithms on the Oxford dataset.
A second experiment comparing d-AKM and d-HKM on the Oxford dataset with
larger cluster numbers is shown in Figure 4-4. The running time of d-AKM increases
almost linearly with the number of clusters, while the running time of d-HKM stays quite
flat as the cluster number increases, since d-HKM has better running time complexity than
d-AKM for large cluster numbers.
4.4.3 Performance on Large Datasets
The ImageNet dataset was used for testing the performance of the building process
of the BoVW model on large numbers of images. Since d-HKM has better running time
complexity than d-AKM and Mahout K-Means with regard to the numbers of clusters,
d-HKM was used for vocabulary construction in all the experiments shown in this section.
Figure 4-4. Performance comparison between AKM and HKM with larger cluster numbers. Note: k refers to a thousand in the figure.
4.4.3.1 Different subsets
There are two groups of subsets generated from ImageNet: the first group with
20GB, 40GB, 60GB, 80GB and 100GB; the second group with 5GB, 10GB, 20GB,
30GB and 47GB. The experiments on the first group were run with 10,000 clusters and
300 containers using AWS, as shown in Figure 4-5(A). The experiments on the second
group were run with 2,500 clusters and 2,000 containers on Pivotal AWB, as shown in
Figure 4-5(B).
With sufficient resources, the running time of the BoVW building process grows
sublinearly with the dataset size on Hadoop, as shown in Figure 4-5(B). With limited
resources, the running time of our approach grows almost linearly with the dataset size, as
shown in Figure 4-5(A). But even with only 300 containers, our approach can still
process 100GB of image data with 10k visual words in less than 9 hours.
Figure 4-5. Experiments on Large Datasets. A) Group 1 with 10k visual words and 300 containers; B) Group 2 with 2.5k visual words and 2,000 containers; C) 20GB with 300 containers. Note: k refers to a thousand in the figures.
4.4.3.2 Different cluster numbers
The number of clusters has a significant influence on the running time of the vocabulary
construction and image representation steps. Several experiments have been conducted on
a 20GB dataset with cluster numbers from 10k to 90k using 300 containers on
AWS, as shown in Figure 4-5(C). With d-HKM for vocabulary construction, the running
time of our system is sublinear, very close to √k, in the number of clusters. It can process
20GB with 90k clusters in less than 4 hours, which is quite fast with only 300 containers.
CHAPTER 5
MULTIMODAL ENSEMBLE FUSION
In this chapter, we propose a multimodal ensemble fusion model that demonstrates
the theory explained in Chapter 2 by combining the results of text-only processing
(disambiguation or retrieval) and image-only processing (disambiguation or retrieval)
to achieve better quality than either alone. Our ensemble fusion model is designed to capture
the complementary and correlative relations between images and text. Different
ensemble approaches, including the linear rule, the maximum rule and logistic regression,
are used to combine the results from methods using single-modality data. Experimental
results on the UIUC-ISD dataset and the Google-MM dataset show that our ensemble fusion
model outperforms approaches using only a single modality for disambiguation and retrieval.
Word sense disambiguation (WSD) and information retrieval (IR) are used as tasks
to demonstrate the effectiveness of our ensemble fusion model. We employ several existing
algorithms and models from WSD and IR in our ensemble fusion model, including the
unsupervised Yarowsky algorithm [16] for text disambiguation and the inverted indexing
provided by Solr [39] for indexing and searching. For disambiguation, the results from text
disambiguation and image disambiguation are senses with confidence scores. For retrieval,
the results from text retrieval and image retrieval are similarity scores between documents
and queries.
5.1 Related Work
Related work on multimodal fusion has been introduced in Chapter 2.1. This section
focuses on explaining previous work related to information retrieval and word sense
disambiguation.
5.1.1 Word Sense Disambiguation
Words in natural languages tend to have multiple senses; for example, the word crane
may refer to a type of bird or a type of machine. The problem of determining which
sense of a word is used in a sentence is called word sense disambiguation (WSD). WSD
was first formulated as a distinct computational task during the early days of machine
translation in the 1940s, making it one of the oldest problems in computational linguistics.
Different kinds of methods [6, 7] have been introduced to solve WSD, including supervised
approaches, unsupervised approaches and knowledge-based approaches. While most
existing approaches exploit only textual information, very limited research has been
conducted on using multimodal data for word sense disambiguation [14, 15].
For supervised approaches, many supervised statistical algorithms [6, 7] have been
employed for WSD, including decision lists, decision trees, Naive Bayes, neural networks
and support vector machines. However, it is unrealistic to manually label a very large
collection of textual data, which is the major limitation of supervised approaches.
Unsupervised approaches [6, 7], on the other hand, do not require a large labeled dataset,
which enables them to overcome the knowledge acquisition bottleneck, i.e. the lack of
large data collections with manual annotations. But unsupervised approaches have a
major disadvantage: they do not exploit any knowledge inventory or dictionary of
real-world senses. Knowledge-based methods, which utilize knowledge resources (e.g.
dictionaries, ontologies, etc.), provide a better trade-off between disambiguation accuracy
and computational cost than supervised and unsupervised methods. One famous
unsupervised algorithm, the Yarowsky algorithm [16], is employed in our ensemble fusion
model for text disambiguation.
5.1.2 Information Retrieval
Information retrieval is the activity of obtaining information relevant to a query from
a collection of documents (usually textual documents). It involves many research topics,
such as document representation models, similarity metrics, indexing, relevance feedback,
reranking, and so on. The bag-of-words model is commonly used to represent textual
documents in information retrieval and natural language processing. In this model, a
textual document or sentence is represented as a bag or set of its words in an order-less
and grammar-free manner. The frequency or occurrence vectors of words are
treated as features in this model.
Image retrieval is the search for desired images in an image dataset according to
queries from users [8]. Content-based image retrieval (CBIR), which emerged in the 1990s,
is a special case of image retrieval, where the queries are images and the search process
is based on the visual content of images rather than textual captions or image labels. Image
retrieval borrows many existing algorithms and technologies from information retrieval.

For the CBIR task, the most popular approach uses the bag-of-visual-words model
[33] with local features like SIFT [36] for representing images. Similar to the
bag-of-words model, the bag-of-visual-words model represents images as
frequency or occurrence vectors of "visual words". The extracted local features from
images are quantized into a dictionary of visual words, with which each image can further
be represented as a histogram of visual words. Visual words are generated offline by
clustering the local features of images [33]. Thus techniques from information retrieval can
easily be borrowed and applied to CBIR, and the model has been proven effective and
efficient [34, 52].
The inverted indexing algorithm, one of the most important indexing algorithms,
is used for indexing and searching images and text in our model. For each word, the
inverted index stores a list of documents in which the word appears. Inverted indexing
provides fast full-text document search, hence is widely applied in the document
information retrieval community. Our implementation for image retrieval utilizes the
bag-of-visual-words model and inverted indexing.
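As a toy illustration of the data structure, the sketch below builds an inverted index over word (or visual-word) tokens and retrieves candidate documents; Solr/Lucene implement the same idea at scale, with tf-idf scoring on top.

    from collections import defaultdict

    def build_index(docs):
        # Map each word to the set of document ids in which it appears.
        index = defaultdict(set)
        for doc_id, words in docs.items():
            for w in words:
                index[w].add(doc_id)
        return index

    def candidates(index, query_words):
        # Documents sharing at least one word with the query.
        hits = set()
        for w in query_words:
            hits |= index.get(w, set())
        return hits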
5.2 Model
In our ensemble fusion model, text processing and image processing are conducted
on text and images separately, and a fusion algorithm is used to combine the results. For
disambiguation, the results from text disambiguation and image disambiguation are senses
with confidence scores. For retrieval, the results from text retrieval and image retrieval are
similarity scores between documents and queries. The details of the model and the different
ensemble approaches are explained below.
5.2.1 Ensemble Fusion Model
In the ensemble fusion model, images and text are first processed separately to
produce decision-level results. The results are then combined using different approaches,
including the linear rule, the maximum rule and logistic regression classification, to
generate the final results.

We use score to denote the results from text processing and image processing.
For disambiguation, score refers to the confidence scores (c_1, c_2, c_3, ..., c_n)^T of senses
(s_1, s_2, s_3, ..., s_n)^T. For retrieval, score refers to the similarity score of a document to the
query document. The process of our ensemble fusion model is shown in Figure 5-1.
Let's simplify the scenario for word sense disambiguation: for one keyword w with
two senses s_1 and s_2, and a document d with one image i and a textual sentence t, the
image classifier generates (s_1, c_{i1}) and (s_2, c_{i2}), and the text classifier generates (s_1, c_{t1})
and (s_2, c_{t2}), where c_{i1}, c_{i2}, c_{t1} and c_{t2} denote the confidence scores of senses s_1 and s_2
generated by image disambiguation and text disambiguation respectively. Confidence
scores are normalized into the [0, 1] interval. The sense with the higher confidence score
between s_1 and s_2 is used as the disambiguated sense annotation for the word w.

Let's also formulate the retrieval problem: for a document d with one image i and
a textual sentence t in the data collection, image retrieval generates similarity score
score_i and text retrieval returns similarity score score_t.
Our ensemble model is simple but powerful; the experimental results demonstrate
its effectiveness. In addition, the model can be viewed as a general
framework for multimodal fusion, in which new fusion approaches for combining the results
of text processing and image processing, or new text and image processing methods, can
be plugged in. It can also be extended to more modalities, such as audio and video, beyond
only images and text.
Figure 5-1. The ensemble fusion model, illustrated with the query "Largemouth bass fishing tips": text processing (Yarowsky, bag-of-words) and image processing (SVM, bag-of-visual-words) produce score_t and score_i, which probabilistic ensemble fusion (the linear rule, the max rule or logistic regression) combines into score_f. Photo courtesy of Kate Saenko.
5.2.2 Ensemble Approaches
We propose rule-based and classification-based approaches to combine the results
from image processing and text processing. There are two rule-based approaches:
linear rule fusion and maximum rule fusion. Logistic regression is employed as the
classification-based fusion approach in our model.
5.2.2.1 Linear rule
The linear rule fusion uses a weight λ to combine the scores from image processing
and text processing. For disambiguation, the fused confidence scores for s_1 and s_2 are:

c_{f1} = λ · c_{i1} + (1 − λ) · c_{t1}    (5-1)

c_{f2} = λ · c_{i2} + (1 − λ) · c_{t2}    (5-2)

λ = Accuracy_i / (Accuracy_i + Accuracy_t)    (5-3)

where λ is calculated by dividing the accuracy of image disambiguation by the sum of the
accuracies of text and image disambiguation on the validation datasets.

For retrieval, the fused similarity score for d is:

score_f = λ · score_i + (1 − λ) · score_t    (5-4)

λ = AP_i / (AP_i + AP_t)    (5-5)

where λ is calculated by dividing the AP (average precision) of image retrieval by the sum
of the APs of text and image retrieval on training queries.
5.2.2.2 Maximum rule
The maximum rule selects the highest confidence or similarity score from text
processing and image processing. For disambiguation, the maximum rule chooses the
sense s with the highest confidence score c among (s_1, c_{i1}), (s_2, c_{i2}), (s_1, c_{t1}) and (s_2, c_{t2}). For
example, with (s_1, 0.45) and (s_2, 0.55) from image classification and (s_1, 0.91) and (s_2, 0.09)
from text classification, we choose s_1 as the output sense for the document d according to
the maximum rule, because the text classifier outputs the highest confidence score, 0.91,
for sense s_1. For retrieval, the maximum rule simply chooses the larger of score_i
and score_t as the final score score_f.
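To make the two rule-based approaches concrete, here is a minimal sketch following Equations 5-1 through 5-5; the variable names are illustrative.

    def linear_fusion(score_i, score_t, ap_i, ap_t):
        # Weight each modality by its validation-set quality (Equation 5-5),
        # then mix the two scores (Equations 5-1, 5-2 and 5-4).
        lam = ap_i / (ap_i + ap_t)
        return lam * score_i + (1 - lam) * score_t

    def max_fusion(score_i, score_t):
        # Keep whichever modality is the most confident.
        return max(score_i, score_t)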
5.2.2.3 Logistic regression
For logistic regression, confidence scores and similarity scores are used as features
to train the logistic regression classifier. For disambiguation, the confidence scores from the
two modalities, c_{i1}, c_{i2}, c_{t1} and c_{t2}, are used to train the logistic regression classifier on the
validation datasets. For retrieval, the similarity scores returned by training queries are used
to train the logistic regression classifier to determine whether a document is relevant or
similar to the query. The trained classifier is then used to classify the documents, and its
confidence scores are used as the final confidence scores for WSD or the final similarity
scores for IR. Logistic regression is chosen for its non-linear transformation of the
confidence or similarity scores compared to the rule-based approaches.
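A small sketch of the classification-based fusion, assuming scikit-learn; the score and label values below are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row holds the single-modality scores [image, text] for one document;
    # labels come from the validation set (relevant or not / correct sense or not).
    scores = np.array([[0.45, 0.91], [0.55, 0.09], [0.60, 0.75], [0.10, 0.05]])
    labels = np.array([1, 0, 1, 0])

    clf = LogisticRegression(penalty="l2").fit(scores, labels)
    fused = clf.predict_proba(scores)[:, 1]  # final confidence/similarity scores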
5.2.3 Applications (Individual Approaches and Implementation)
5.2.3.1 Disambiguation
For text disambiguation, the unsupervised Yarowsky algorithm [16] is implemented.
The iterative Yarowsky algorithm starts with a small set of seed rules for disambiguating
senses and a large untagged corpus. In each iteration, the algorithm first applies the known
rules to untagged samples and then learns a set of new rules from the newly tagged samples.
This process is repeated until all training samples are tagged, and the learned rules are
arranged in descending order of confidence score, where confidence is determined by the
number of samples supporting each rule. Given an unseen testing sample, the algorithm
returns the first rule in the ordered list matching the testing sample, along with the
confidence score of the matched rule.
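The following is a compact, simplified sketch of that loop; the rule scoring and the fixed threshold are illustrative stand-ins for the log-likelihood decision list of [16].

    from collections import Counter, defaultdict

    def yarowsky(samples, seed_rules, threshold=0.8, max_iters=10):
        # samples: list of feature sets; rules: feature -> (sense, confidence).
        rules, tagged = dict(seed_rules), {}
        for _ in range(max_iters):
            # Apply the current decision list to still-untagged samples.
            for i, feats in enumerate(samples):
                if i in tagged:
                    continue
                matches = [rules[f] for f in feats if f in rules]
                if matches:
                    sense, conf = max(matches, key=lambda r: r[1])
                    if conf >= threshold:
                        tagged[i] = sense
            # Re-learn rules from the newly tagged samples.
            counts = defaultdict(Counter)
            for i, sense in tagged.items():
                for f in samples[i]:
                    counts[f][sense] += 1
            new_rules = {f: (c.most_common(1)[0][0],
                             c.most_common(1)[0][1] / sum(c.values()))
                         for f, c in counts.items()}
            if new_rules == rules:
                break
            rules = new_rules
        # Return tags and the decision list sorted by descending confidence.
        return tagged, sorted(rules.items(), key=lambda kv: -kv[1][1])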
For image disambiguation, we use SIFT [36] to extract local features and the
bag-of-visual-words model [33] to represent images. Then, an SVM (Support Vector
Machine) classifier is trained on the bag-of-visual-words vectors to classify images. The
SVM is a supervised classification model whose goal is to construct a set of hyperplanes
in the high-dimensional feature space that maximize the margins between the different
classes [53]. Both image disambiguation and text disambiguation generate sense
annotations along with confidence scores for testing samples. The image classifier and
text classifier are trained on the training datasets.
For text disambiguation, the Yarowsky algorithm [16] implementation is written in
C++, and a pseudo probability distribution is implemented over the Yarowsky classifier
using Python. For image disambiguation, OpenCV is used to extract SIFT features from
images, the K-Means implementation from Python scikit-learn is used to generate visual
words, and the multi-class SVM implementation from Python scikit-learn is used to
disambiguate images. The ensemble fusion model uses the logistic regression
implementation with L2 regularization from Python scikit-learn.
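A condensed single-machine sketch of this image pipeline, assuming OpenCV (with SIFT exposed as cv2.SIFT_create) and scikit-learn; the vocabulary size and other parameters are illustrative, not our tuned settings.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def sift_descriptors(path):
        # 128-dimensional SIFT descriptors for one image.
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = cv2.SIFT_create().detectAndCompute(img, None)
        return desc if desc is not None else np.empty((0, 128), np.float32)

    def train_image_classifier(image_paths, senses, k=500):
        descs = [sift_descriptors(p) for p in image_paths]
        vocab = KMeans(n_clusters=k).fit(np.vstack(descs))   # visual words
        hists = np.array([np.bincount(vocab.predict(d), minlength=k)
                          if len(d) else np.zeros(k, int) for d in descs])
        svm = SVC(probability=True).fit(hists, senses)  # senses w/ confidences
        return vocab, svm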
5.2.3.2 Retrieval
In our implementation, Solr [39], a web server based on Lucene, is deployed to handle
indexing and searching for textual data and image data with inverted indexing and tf-idf
weighting. The textual sentences are represented using the bag-of-words model, and the
images are represented using the bag-of-visual-words model [33]. Both text and images are
represented as vectors, which makes them straightforward to index and search with Solr.
The cosine similarity with tf-idf weighting on the word vectors or visual-word
vectors is used as the similarity metric between documents (images and textual sentences). The
cosine similarity scores between documents and a query document are used by Solr to rank
the documents. The tf-idf (term frequency-inverse document frequency) weight is a
numerical weight often used in information retrieval and text mining to evaluate how
important a word is to a document in a corpus. Given a query document, the sentence and
the image are transformed into their respective representation vectors and then searched by
Solr separately. Solr returns ranked lists of documents with similarity scores for both
text retrieval and image retrieval.
For text retrieval, the bag-of-words model is used to represent textual sentences.
For image retrieval, Lire, a Java library for image processing, is used to extract SIFT
features, and the bag-of-visual-words model is implemented in Java using our distributed
K-Means algorithms. Solr provides indexing and searching for both images and text. We use
the logistic regression implementation with ridge regularization from Weka for fusion.
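As a stand-in for Solr's scoring, the sketch below ranks documents by tf-idf cosine similarity using scikit-learn; the example documents are illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["largemouth bass fishing tips",
            "bass guitar lessons for beginners",
            "fly fishing on the river"]
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(docs)     # tf-idf document vectors
    query = vec.transform(["bass fishing"])  # tf-idf query vector
    scores = cosine_similarity(query, doc_matrix).ravel()
    ranking = scores.argsort()[::-1]         # best match first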
5.3 Evaluation
Experiments were run on the UIUC-ISD dataset [2] and the Google-MM dataset to
test the performance of the three fusion approaches used in our ensemble fusion model.
The results demonstrate that all three fusion approaches achieve higher quality than the
text-only and image-only methods.
5.3.1 Datasets
5.3.1.1 UIUC-ISD
The multimodal UIUC-ISD dataset [2] is used to test the accuracy of the text-only
disambiguation (Yarowsky algorithm), the image-only disambiguation (the SVM classifier)
and the three fusion approaches in our ensemble fusion model for WSD. There are three
keywords “bass”, “crane” and “squash” in the dataset. For each keyword, we selected two
core senses. There are 1691 documents for “bass”, 1194 documents for “crane” and 673
documents for “squash”.
We have constructed a training dataset, a validation dataset and a testing dataset for
each keyword. The training dataset is used to train the image and text classifiers. The
validation data is used to train the logistic regression classifier and select the linear weight
λ based on the accuracy of the image disambiguation and text disambiguation on the
validation dataset. The testing dataset is used to evaluate the fusion algorithms and to
demonstrate that by using multimodal fusion, we can get higher disambiguation accuracy
compared to methods using single modality.
5.3.1.2 Google-MM
The Google-MM dataset is used to evaluate the retrieval quality of image-only
retrieval, text-only retrieval and these three fusion approaches in our ensemble fusion
model for information retrieval. We have crawled 2,209 multimodal documents using
Google Images with 20 object categories (airplane, cat, dog, etc.) and 14 landmarks (Big
Ben, Eiffel Tower, The Taj Mahal, etc.). Each document is composed of one title and
one image. For each category or landmark, we have prepared one query for training and
one query for testing, with each query containing a few keywords and one image. For
each training or testing query, the ground truth results are provided for retrieval quality
evaluation.
57
5.3.2 Results
5.3.2.1 Word sense disambiguation
The experimental results for WSD on the UIUC-ISD dataset are shown in Table 5-1.
The accuracy of the three fusion methods is much higher than that of the
image-only and text-only methods on "bass" and "crane". For "bass", the ensemble
approaches improved the accuracy from 0.565 to 0.871. For "crane", the maximum rule
approach improved the accuracy from 0.642 to 0.808. For "squash", because the accuracy
of text-only disambiguation is low (0.188), little additional information is available
from the text-only disambiguation; therefore the accuracy of the three fusion approaches
for "squash" is quite similar to that of the image-only classification.
Table 5-1. The accuracy of image-only, text-only, linear rule fusion, maximum rule fusion and logistic regression fusion on UIUC-ISD dataset for WSD.
         Image   Text    Linear-Rule   Max-Rule   Log-Reg
bass     0.565   0.365   0.871         0.871      0.871
crane    0.642   0.333   0.800         0.808      0.775
squash   0.754   0.188   0.768         0.754      0.754
5.3.2.2 Information retrieval
The experimental results for information retrieval on the Google-MM dataset are
shown in Table 5-2. The retrieval quality is measured by the mean average precision
(MAP) of all 34 testing queries. From Table 5-2, all the three fusion approaches achieve
higher MAP than image-only and text-only retrieval. While image-only retrieval has
0.125 MAP and text-only retrieval has 0.761 MAP, the linear rule fusion achieves 0.802
MAP, the maximum rule fusion achieves 0.788 MAP and the logistic regression reaches
0.798 MAP. For the naive early fusion where we combine text words and visual words
directly as introduced in [14], MAP is 0.187, slightly higher than image only MAP. The
reasons why we have significantly lower image-only MAP are: 1) the most of the searched
images have noisy backgrounds, or incomplete coverage of the object; 2) we only use the
bag-of-visual-words model and cosine distance to calculate the similarity score, without
58
Table 5-2. Retrieval quality (MAP) of image-only, text-only, early fusion, linear rulefusion, maximum rule fusion and logistic regression fusion on the Google-MMdataset for IR.
Image Text Early fusion Linear-Rule Max-Rule Log-Reg0.125 0.761 0.187 0.802 0.788 0.798
Figure 5-2. IR: per-query detailed result.
complex techniques, since our focus is on the ensemble fusion part. The reasons why the
naive early fusion has very low MAP are: 1) the images and textual sentences are usually
not quite correlated; 2) the image feature space has much more dimensions than the text
feature space, which leads very low impact of the text features to the retrieval results.
Figure 5-2 shows the detailed per-query results for IR. We can see that the three
fusion models have very close performance, while the naive early fusion of text and visual
words has low MAP on all 34 queries.
By combining the results of image-only and text-only processing under an ensemble
fusion framework, we can achieve higher performance than methods using only a
single modality. In cases where both image processing and text processing are reasonably
reliable, such as "bass" and "crane" in Table 5-1, the fusion model improves
performance substantially. Even in cases where one of the single-modality methods
performs very poorly, the fusion model can still generate results as good as, or even
slightly better than, the best results from any single-modality processing method, such
as "squash" in Table 5-1. More analysis of our ensemble fusion model and fusion
approaches is presented below.
5.4 Discussion
In this section, we discuss how our ensemble fusion model captures the correlative
and complementary relations between images and text to achieve higher quality than
single-modality approaches. The differences between the early fusion and ensemble fusion
models are also discussed.
5.4.1 Correlation
Images and text display a certain correlation at the decision level. For WSD, if image
processing and text processing generate the same sense annotation for a document,
the linear rule fusion, the maximum rule fusion and the logistic regression fusion will
usually generate the same sense annotation as image processing and text processing for
this document, according to our experimental results. For IR, if image retrieval and text
retrieval generate high similarity scores for a document, the linear rule fusion, the
maximum rule fusion and the logistic regression fusion will generate high similarity
scores for this document as well, according to our experimental results. Thus, although
our ensemble fusion model cannot capture the correlation between images and text at the
feature level, it can capture the correlation at the decision level.
5.4.2 Complementation
Although our ensemble fusion model can capture the correlation between images
and text at the decision level, this is not the main reason we can improve
quality, since in that case our model merely generates results consistent with image-only and
text-only processing. Rather, it is the ability to capture the complementary relation between
image-only and text-only processing that helps our model generate more correct results
than either alone. The average precision and average recall of image-only processing,
text-only processing and the three ensemble fusion approaches on WSD for the keyword
"bass" are shown in Table 5-3, as an example illustrating the complementary relation
between image processing and text processing.
Table 5-3. The coverage, average precision (AP) and average recall (AR) of different approaches on WSD for keyword "bass". Coverage refers to the percentage of the documents each approach can effectively disambiguate.
          Image   Text    Linear-Rule   Max-Rule   Log-Reg
Coverage  1.000   0.376   1.000         1.000      1.000
AP        0.522   0.857   0.862         0.862      0.859
AR        0.636   0.297   0.884         0.884      0.893
Text processing usually has high precision but low recall, as shown in Table 5-3. For
example, the Yarowsky classifier works well when testing sentences contain patterns that
were discovered in the training datasets. It generates very high confidence scores
for the correct senses in most cases, for example, (s_1, 1.0) and (s_2, 0.0) or (s_1, 0.95) and
(s_2, 0.05), with s_1 usually being the correct sense. However, for sentences that do
not contain known patterns, the Yarowsky classifier fails to disambiguate between the two
senses and outputs (s_1, 0.0) and (s_2, 0.0). Similar to text disambiguation, text retrieval
also has high precision and low recall, because inverted indexing works well for textual
sentences that contain query keywords. Sentences that do not contain query
keywords cannot be returned as relevant results, causing recall to drop.
On the other hand, image processing has high recall but low precision, as shown
in Table 5-3. For disambiguation, the image SVM classifier can disambiguate all
images, but it is less accurate due to the noisy image data and image representation.
Hence image disambiguation generates less confident results, for example, (s_1, 0.55) and
(s_2, 0.45) or (s_1, 0.60) and (s_2, 0.40), with s_1 possibly being a wrong label. Image retrieval
also generates lower similarity scores for documents than text retrieval because of the noisy
representation of images. And since each image may contain hundreds or thousands of
local features, the image representation is denser, so image retrieval has better recall
than text retrieval.
Hence, for documents on which text processing works, the results of the three
fusion approaches in our ensemble fusion model are consistent with text processing,
since text processing outputs results with very high confidence or similarity
scores. For the other documents, on which text processing fails, the results of the three
fusion approaches in the ensemble fusion model are consistent with image
processing, because text processing returns no useful information for these documents.
Therefore, our ensemble fusion model can increase both precision and recall by taking
advantage of both text processing and image processing while avoiding their respective
drawbacks.
5.4.3 Early Fusion vs Ensemble Fusion
Images and text have correlative and complementary relations with each other,
as discussed in prior sections. Early fusion can capture the correlative relation
between images and text at the feature level, while ensemble fusion can capture the
complementary relation at the decision level. Whether to use early fusion or
ensemble fusion depends on the nature of the multimodal datasets.

In our multimodal datasets, the images and textual sentences are mostly complementary
to each other, which corroborates the fact that our ensemble fusion model can achieve
better quality than image-only and text-only approaches. On the other hand, the
correlative relation between images and text is not commonly found in the documents,
which explains why the naive early fusion fails to improve retrieval quality. Since early
fusion approaches use correlation analysis methods to fuse features from different
modalities, aiming to maximize the correlation effect between images and text in the
combined feature space, they are not expected to achieve very good results on the datasets
we used.
CHAPTER 6
KNOWLEDGE BASE COMPLETION
A knowledge base (KB) is usually a data store of structured information about
entities, relations and facts. In recent years, huge knowledge bases, such as Freebase [9],
NELL [10] and YAGO [11], have been constructed to host massive amounts of knowledge
acquired from real-world datasets. Despite their huge size, these knowledge bases are
greatly incomplete. For example, Freebase [9] contains over 112 million entities and 388
million facts, while over 70% of people included in Freebase have no known place of birth
and 99% have no known ethnicity. Therefore, knowledge base completion has drawn a lot
of attention from researchers.
Formally speaking, knowledge base completion (KBC) is the task of filling in the gaps
in knowledge bases in a targeted way. Facts inside KBs are usually represented in triple
format as <subject, relation, object>, for example <Marvin_Minsky1, wasBornIn,
New_York_City>. A knowledge base completion query can be formulated as <subject,
relation, ?>, asking: given the subject and relation, what are the corresponding object
value(s)?
Knowledge bases can be constructed by iteratively extracting new information
from large datasets (usually text corpora) [9–11]. However, this is not the ideal solution
for KBC, because creating large datasets requires a lot of processing time and human
labor, and the running time of knowledge base construction is usually too long. Inference
and learning in knowledge bases have been utilized in recent years for knowledge base
completion [54–57], but learning effective, expressive and scalable models inside knowledge
bases is challenging [56, 57].
1 Marvin Lee Minsky (August 9, 1927 – January 24, 2016) was an American cognitivescientist concerned largely with research of artificial intelligence (AI), co-founder ofthe Massachusetts Institute of Technology’s AI laboratory, and author of several textsconcerning AI and philosophy. https://en.wikipedia.org/wiki/Marvin_Minsky.
Previous approaches for knowledge base completion usually utilize either only
unstructured textual information or only the structured information inside knowledge
bases. However, structured information, such as entity types and entity-to-entity
relatedness, can help fact/knowledge extraction tasks over unstructured datasets.
On the other hand, approaches using structured information in KBs can benefit
from incorporating unstructured textual information, since knowledge bases are highly
incomplete. Furthermore, fusing two different approaches over different types of datasets can
further improve performance, because they have complementary strengths and
weaknesses [58]. Another common problem of previous work is that these are all batch-oriented
systems; they cannot provide fast real-time responses to user queries at query time.
In this chapter, we propose a query-driven knowledge base completion system
that combines rule inference and question answering and fuses unstructured text with
knowledge bases. To the best of our knowledge, our system is the first to provide
query-time knowledge base completion while leveraging both unstructured and structured
data in depth.
We employ web-based question answering (WebQA) to solve knowledge base
completion for its flexibility and effectiveness, building on the massive unstructured textual
information available on the Web. We design novel multimodal features and an effective
question template selection algorithm for WebQA, which can achieve better performance
with far fewer questions than previous work [1]. WebQA fuses unstructured textual
snippets with structured information from knowledge bases to extract features, and uses
entity type information to filter out incorrect candidate answers. Our question answering
system exploits techniques similar to those used in [1, 59], but we pursue both effectiveness
and efficiency and provide real-time responses to user queries.
Horn-clause logical rules are used in our system to infer new facts for KBC queries.
These rules are pre-learned by previous work [60] through ontological path finding.
However, using only the existing facts in knowledge bases often fails to match the premises
of rules and thus to infer new facts of interest, because knowledge bases are highly
incomplete. We employ WebQA to first extract missing premise facts from the Web and
then use rule inference to obtain answers for KBC queries. By combining WebQA and logical
rules learned from knowledge bases, our augmented rule inference system can achieve
better KBC performance than using only the existing information in knowledge bases.
We use ensemble fusion to combine augmented rule inference and web-based question
answering to further improve KBC performance. As discussed in previous work [58] by
Peng et al., approaches over different datasets display a complementary relation with
each other, and combining them can provide complementary information that improves
performance. We use several ensemble approaches to fuse the results of rule
inference and question answering, including the linear rule, maximum rule, sum rule and
logistic regression. Experiments demonstrate significant performance gains from
ensemble fusion.
We design several query-driven approaches to eliminate unnecessary computation
and reduce the running time of our system on the fly. We implement a query-driven
snippet filtering component in WebQA, which greatly reduces the number of snippets
to process and improves the running time of the WebQA pipeline. For augmented rule
inference, we invoke WebQA only for rules whose premises are missing from the knowledge
base, and we use confidence thresholds to choose the most reliable results from WebQA to
reduce running time on the fly. We also use query-driven optimization to avoid unnecessary
WebQA queries in augmented rule inference.
Our contributions are as follows:

• We propose an effective and efficient KBC system by combining rule inference and web-based question answering with the massive information available on the Web and in the knowledge bases. Our system fuses both unstructured data from the Web and structured information from knowledge bases in depth.

• We design and implement a web-based question answering (WebQA) system to extract missing facts from the unstructured Web with effective multimodal features and question template selection, which can achieve better performance with far fewer questions than previous work [1].

• We build an augmented rule inference system leveraging logical rules, existing structured facts in knowledge bases and our web-based question answering system, to infer missing facts for KBC queries.

• We apply ensemble fusion to effectively combine question answering and rule inference to achieve high KBC performance.

• To improve efficiency, we employ a set of query-driven techniques for WebQA and rule inference to reduce the running time for user queries on the fly.

• Extensive experiments have been conducted to demonstrate the effectiveness and efficiency of our system.
6.1 Related Work
Although huge knowledge bases have been constructed from large datasets, they
are far from complete as shown above. There are a few approaches to fill in missing
information from different research directions. In this section, we briefly discuss related
work on knowledge base construction, inference and learning inside knowledge bases and
question answering.
6.1.1 Knowledge Base Construction
Huge knowledge bases have been constructed since the mid-2000s [9–11]. Most knowledge
bases use iterative construction processes to learn extractors, rules and facts from large
datasets [10, 11]. Some knowledge bases employ human workers to manually add new
information [9].
TAC KBP [61] and TREC KBA [5] are the two most famous annual competitions for
knowledge base construction. Their goal is to develop and evaluate technologies
for building and populating knowledge bases from unstructured text. Most of these
methods process each document in turn, extracting as many facts as possible using
named-entity linkage and (supervised) relation extraction methods. Summaries of the
standard approaches in TAC KBP and TREC KBA are given by Ji and Grishman [28],
Weikum and Theobald [62] and Frank et al. [63].
Manual KB construction/population is not a feasible approach because of its long
response time and intense human labor cost. Iterative KB construction requires very
long processing times to learn new extractors, rules and facts on large datasets [11].
Constructing knowledge bases on new datasets is not scalable because creating large
datasets is time-consuming and involves intense human labor, and construction processes
usually take a very long time to finish; e.g. a fast streaming system by Morteza et al.
[30, 31] for TREC KBA still needs hours to process 5TB of text data in one pass. Another
disadvantage of knowledge base construction is that it cannot guarantee extraction of the
missing facts users are looking for in a targeted way, as knowledge base completion queries
require.
6.1.2 Inference and Learning
Inference and statistical learning have been utilized in recent years for knowledge base
completion [54–57]. Logical rule inference has been widely used for inferring new facts in
knowledge bases [60]. Richardson and Domingos proposed Markov Logic Networks
[54] for inference based on logical rules and graphical models. However, batch inference in
Markov Logic Networks is very time-consuming and does not scale to large knowledge bases.
Information inside knowledge bases can be structured as massive graphs of entities
and relations, with entities as nodes and relations as edges. Random walks over
knowledge graphs have been utilized for knowledge base completion because of their
scalability [55, 56]. Recent work [57, 64, 65] learns embedded representations of entities
and relations in the knowledge bases and uses these representations to infer missing facts,
but learning expressive, scalable and effective models can be challenging [56, 57].
Inference and learning in knowledge bases are restricted to the information available
inside the knowledge bases. To make the problem more difficult, information in
knowledge bases is highly incomplete. In our system, we combine web-based question
answering and logical rules to build a rule inference system that achieves better KBC
performance than using only the information inside knowledge bases.
6.1.3 Question Answering
Open-domain question answering (QA), which returns exact answers to natural
language questions posed by users, has been studied for a long time. Since 1999, a
specialized QA track has been part of the annual Text REtrieval Conference [66].
Web-based QA systems are highly scalable and were among the top-performing systems
in TREC-10 [67]. Such systems issue simple reformulations of the questions as queries
to a search engine, and rank the repeatedly occurring N-grams in the top snippets as
answers based on named entity recognition (NER) and heuristic answer type checking.
In our system, we implement web-based question answering as a subsystem for
knowledge base completion because of its scalability, flexibility and effectiveness given the
massive information available on the Web. We first formulate knowledge base completion
tasks as natural language questions, search these questions on the Web using search
engines and extract answers from the crawled data. Our main focus is not developing
better QA systems, but rather how to use and adapt such systems for knowledge base
completion. In [1], West et al. proposed question templates based on the relations of
entities and utilized existing in-house question answering systems for knowledge base
completion. We design our own question templates and a novel template selection
algorithm that greatly reduces the number of questions while maintaining high
performance. Our system uses some techniques similar to [1, 59], such as entity linking
and type filtering, but we pursue both effectiveness and efficiency in WebQA, provide
real-time responses to user queries and study multimodal fusion of unstructured text and
structured knowledge in depth.
6.2 System Overview
As stated earlier, we propose a query-driven knowledge base completion system with
multimodal fusion of unstructured text and structured knowledge. Our system combines
rule inference and web-based question answering using ensemble fusion.

Figure 6-1. The query-driven knowledge base system pipeline.

The web-based question answering system utilizes structured knowledge from KBs to help extract facts
from the textual snippets returned by the Web. The rule inference system combines logical
rules, existing facts in knowledge bases and web-based question answering to infer missing
facts.
Our system pipeline is illustrated in Figure 6-1. The same KBC query <subject,
relation, ?> is passed to two different components, rule inference and question answering,
and processed separately by each. Both components produce ranked candidate answers
with confidence scores, which are finally fused by the ensemble fusion component.
In question answering and rule inference, multimodal information from text and KBs
is also fused to achieve high performance. The WebQA system first transforms KBC
queries into natural language questions and extracts candidate answers from the textual
snippets retrieved for these questions on the Web. It then links candidate answers to
entities in KBs, utilizes entity category and relation schema information to filter out
incorrect candidate answers, and employs entity-to-entity relatedness and entity
descriptions inside KBs for feature extraction. The augmented rule inference system uses
logical rules pre-learned from the information inside KBs, existing facts inside KBs
and WebQA to infer missing facts.
We apply a multimodal ensemble fusion model similar to the one explained in [58];
the fusion approaches we tested are described next.
6.2.1 Ensemble Fusion
We apply ensemble fusion to combine rule inference and web-based question
answering. Our ensemble fusion model is similar to the model explained in Chapter 5. We
tested different ensemble fusion approaches, including the maximum rule, the linear rule,
the sum rule (simply adding together the confidence scores of identical candidate answers)
and logistic regression, similar to the approaches in Chapter 5. The empirical results
demonstrate that the sum rule performs best in most cases.
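To make the sum rule concrete, the following minimal Python sketch shows how it could combine two ranked candidate lists; the function name and the score values are illustrative, not our exact implementation.

```python
from collections import defaultdict

def sum_rule_fusion(*ranked_lists):
    """Fuse several {candidate: confidence} dictionaries by adding
    the confidence scores of identical candidate answers."""
    fused = defaultdict(float)
    for scores in ranked_lists:
        for candidate, confidence in scores.items():
            fused[candidate] += confidence
    # Rank candidates by their summed confidence scores; a fused
    # score may exceed 1.0, which is fine for ranking purposes.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for <Marvin_Minsky, wasBornIn, ?>.
webqa_scores = {"New_York_City": 0.8, "Boston": 0.4}
rule_scores = {"New_York_City": 0.5}
print(sum_rule_fusion(webqa_scores, rule_scores))
# [('New_York_City', 1.3), ('Boston', 0.4)]
```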
6.3 Web-Based Question Answering
In this section, we explain the web-based question answering (WebQA) system for
knowledge base completion, which fuses unstructured data from the Web and structured
information from knowledge bases. WebQA employs question templates to generate
multiple natural language questions for each knowledge base completion query. Textual
snippets are then crawled by searching these questions on the Web via search engines.
Unlike traditional question answering systems, we use entity linking to collect candidate
answers from the snippets. Candidate answers with incorrect entity types for the KBC
query are discarded. Various features are extracted for candidate answers by fusing
information from the unstructured snippets and the structured knowledge in KBs. We
then rank the candidate answers by the probability scores produced by classification on
these features. Our system exploits some techniques used in [1, 59], but we pursue both
effectiveness and efficiency and provide real-time responses to user queries.
Figure 6-2. The web-based question answering system.
Compared to previous work [1] using question answering for KBC, we design effective
question templates that achieve high performance with far fewer questions. We propose a
greedy algorithm to automatically learn question templates for transforming KBC queries
into natural language questions. We conduct query-driven snippet filtering to reduce the
number of snippets processed, which greatly improves the running time of WebQA.
While previous work used batch-oriented question answering systems [1, 59], WebQA
provides real-time responses to user queries on the fly. We design novel and effective
features through early fusion [4] of information from unstructured text and structured
knowledge bases. Experimental results demonstrate both the effectiveness and efficiency
of WebQA.
6.3.1 WebQA Pipeline
Throughout this section, we use <Marvin_Minsky, wasBornIn, ?> as a running example
query; its correct answer is New_York_City. Similar examples for four relations are
shown in Table 6-1.
There are four major components in the query-driven WebQA system: question
generation, data collection, answer extraction and answer ranking. The system pipeline is
illustrated in Figure 6-2. Below, we briefly explain the design and implementation of
these components.
6.3.1.1 Question generation
Structured queries are transformed into natural language questions using selected
question templates, as shown in Table 6-1. Each relation has multiple corresponding
question templates. For example, for relation wasBornIn, we use born, birth and birthplace
as its templates. Then for <Marvin_Minsky, wasBornIn, ?>, the corresponding questions
are "Marvin Minsky born", "Marvin Minsky birth" and "Marvin Minsky birthplace".
Table 6-1. Example relations, templates, queries, questions and snippets.

Relation: wasBornIn
  Templates: born, birth, birthplace, childbirth, delivered, delivery, etc.
  Question examples: <Marvin_Minsky, wasBornIn, ?>: "Marvin Minsky born", "Marvin Minsky birth", "Marvin Minsky birthplace", etc.
  Top snippets: "Marvin Lee Minsky was born in New York City, to an eye surgeon father, Henry, and to a mother, Fannie ..."; "Marvin Minsky - A.M. Turing Award Winner, BIRTH: New York City, August 9, 1927. DEATH: Boston, January 24, 2016 ..."

Relation: isMarriedTo
  Templates: married, marriage, spouse, husband, wife, love, etc.
  Question examples: <Ryan_Block, isMarriedTo, ?>: "Ryan Block married", "Ryan Block marriage", "Ryan Block spouse", etc.
  Top snippets: "Jul 15, 2014 ... Ryan Block, formerly of Engadget and now at AOL .... More famous for being married to Veronica Belmont IMHO ..."; "Spouse(s), Veronica Belmont. Ryan Block (born June 25, 1982) is a San Francisco-based technology entrepreneur. He was ..."

Relation: hasChild
  Templates: child, children, kid, son, daughter, offspring, etc.
  Question examples: <Julia_Foster, hasChild, ?>: "Julia Foster child", "Julia Foster children", "Julia Foster kid", etc.
  Top snippets: "Mother Love - Ben Fogle and his mother Julia Foster ... A shy and introverted child, he often felt overwhelmed ..."; "Children, Ben Fogle, Emily and Bill. Julia Foster (born 2 August 1943) is an English stage, screen and television actress. Born in ..."

Relation: isCitizenOf
  Templates: citizenship, nationality, country, citizen, nation, national, etc.
  Question examples: <Ruth_Dyson, isCitizenOf, ?>: "Ruth Dyson citizenship", "Ruth Dyson nationality", "Ruth Dyson country", etc.
  Top snippets: "Nationality, New Zealand. Political party, Labour Party ... Ruth Suzanne Dyson (born 11 August 1957) is a New Zealand politician ..."; "Ruth Suzanne Dyson (born 11 August 1957) is a New Zealand politician ... so Dyson's family frequently moved around the country."
For each relation, we design multiple question templates. The benefit of multiple
templates is that they increase the chance of finding true answers, since different
questions crawl more snippets, which improves KBC performance. As demonstrated by
our experiments, multiple questions provide higher KBC performance than any single
question.
Templates with multiple words tend to generate long questions, for which search
engines may return many noisy snippets without true answers. In contrast, search
engines are better at finding snippets relevant to short questions. Based on this
observation, we design question templates by selecting single words whose meanings are
close to the semantic meanings of the relations. For example, for relation wasBornIn,
born, birthplace and birth are selected as templates, and for relation isMarriedTo, single
words such as marriage and spouse are selected. More examples are listed in Table 6-1.
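To illustrate, a minimal sketch of this question generation step follows; the template lists are abbreviated examples from Table 6-1, and entity identifiers are assumed to use the underscore-separated Wikipedia form.

```python
# Single-word templates per relation (abbreviated from Table 6-1).
TEMPLATES = {
    "wasBornIn": ["born", "birth", "birthplace"],
    "isMarriedTo": ["married", "marriage", "spouse"],
}

def generate_questions(subject, relation):
    """Turn a KBC query <subject, relation, ?> into short natural
    language questions to be issued to a search engine."""
    name = subject.replace("_", " ")
    return [f"{name} {template}" for template in TEMPLATES[relation]]

print(generate_questions("Marvin_Minsky", "wasBornIn"))
# ['Marvin Minsky born', 'Marvin Minsky birth', 'Marvin Minsky birthplace']
```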
In previous work [1], West et al. utilized information from other relations about
entities to augment the questions for a given relation. For example, for query
<Frank_Zappa, mother, ?>, one question generated by their templates is "Frank Zappa
mother Baltimore", with Baltimore being the birthplace of Frank_Zappa. The major
problem with such templates is that search engines may be confused about the focus of
the question: for "Frank Zappa mother Baltimore", it is hard to determine whether the
question asks about "Frank Zappa mother" or "Frank Zappa Baltimore", so the engine
may return snippets related to Frank_Zappa and Baltimore instead of the mother of
Frank_Zappa. As shown in our experiments, our system achieves better completion
performance with far fewer questions than previous work [1].
Issuing all possible questions with all templates to search engines is problematic,
both because of its high computational cost and because KBC performance can
deteriorate with more questions; we defer the detailed discussion to Section 6.3.2.1. We
therefore propose a greedy question template selection algorithm that selects, for each
relation, a small subset of question templates achieving the highest KBC performance.
6.3.1.2 Data collection
We search the natural language questions on the Web using search engines and crawl
the returned snippets to extract missing information for KBC queries. A snippet is a
small fragment of text from a document that the search engine finds relevant to the
query; it usually contains relevant information excerpted from the original document. For
query <Marvin_Minsky, wasBornIn, ?>, a top snippet we crawled from the Web is
"Marvin Lee Minsky was born in New York City, to an eye surgeon father, Henry, and to
a mother, Fannie ...", which contains the correct answer New_York_City. Examples of
top snippets for more KBC queries are shown in Table 6-1. We also conduct query-driven
snippet filtering to effectively reduce the number of snippets processed in subsequent
steps while maintaining high KBC performance.
Snippets are chosen over whole documents for a few reasons. First, snippets are
generated by search engines with the goal of excerpting information useful for the
question from the original documents, so in most cases the answer already appears in the
snippets. Second, snippets are very short while documents are much larger, so entity
linking on snippets instead of whole documents saves a lot of processing time. Third,
crawling the original documents would launch additional HTTP connections and spend a
lot of time waiting for responses from different web servers. We crawl up to 50 snippets
for each question and hundreds of snippets for each KBC query. To reduce the time spent
waiting for responses from search engines for each relation, multithreading is employed to
parallelize the snippet crawling step across multiple questions, as sketched below.
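A sketch of this parallel crawling step is shown below; search_snippets is a hypothetical wrapper around a search engine API and is left unimplemented, since the details depend on the engine used.

```python
from concurrent.futures import ThreadPoolExecutor

def search_snippets(question, limit=50):
    """Hypothetical search engine wrapper: returns up to `limit`
    textual snippets for one natural language question."""
    raise NotImplementedError  # depends on the search engine API

def crawl_snippets(questions, limit=50):
    """Crawl snippets for all questions of one KBC query in parallel,
    hiding the latency of waiting for search engine responses."""
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        per_question = list(pool.map(lambda q: search_snippets(q, limit),
                                     questions))
    # Flatten the per-question snippet lists into a single list.
    return [snippet for snippets in per_question for snippet in snippets]
```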
However, entity linking on hundreds of snippets is still very time-consuming and far
from providing real-time responses to user queries. Moreover, not all snippets contain
information useful for answering KBC queries, and processing hundreds of snippets per
query is computationally expensive. Therefore, we implement a query-driven snippet
filtering component that automatically selects the top snippets from which candidate
answers are extracted for knowledge base completion.
6.3.1.3 Answer extraction
Noun phrases are extracted from the snippets and treated as candidate answers,
which are then linked to entities in Wikipedia [3] and Yago [11]. Entity linking is the task
of linking entity mentions in text to their corresponding entities in a knowledge base [68].
Linking candidate answers in snippets to entities in knowledge bases has several notable
advantages [59]. First, redundancy among candidate answers is automatically reduced.
Second, the types of a candidate answer can be effortlessly determined from its
corresponding entity in the knowledge bases. Third, we can develop semantic features for
candidate answer ranking by utilizing the rich semantic information about entities in
knowledge bases.
Since entity linking is beyond the scope of this work, we refer the reader to a survey
[68] for more information. We employ an open-source entity linking tool, TagMe [69, 70],
and parallelize the entity linking process with multithreading to reduce the time spent
waiting for responses from the TagMe server [70].
After linking the candidate answers discovered in the snippets to entities in
knowledge bases, candidate answers with incorrect entity types for the KBC query are
discarded. For example, the query <Marvin_Minsky, wasBornIn, ?> is looking for
candidate answers of type city rather than person. In the snippet "Marvin Lee Minsky
was born in New York City, to an eye surgeon father, Henry, and to a mother, Fannie ...",
the entity Henry_Minsky, the father of Marvin_Minsky, is discarded because of its
incorrect entity type. This type filtering step greatly reduces the number of candidate
answers for ranking and thus helps improve answer ranking quality.
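A minimal sketch of this type filtering step follows; the expected-type table and the entity_types lookup are illustrative stand-ins for the relation schema and entity type information in Yago.

```python
# Expected object type per relation (illustrative values).
EXPECTED_TYPE = {"wasBornIn": "city", "hasChild": "person"}

def filter_by_type(candidates, relation, entity_types):
    """Keep only candidates whose KB types include the expected
    answer type for the relation of the KBC query.

    entity_types maps an entity to its set of type labels, e.g.
    {"New_York_City": {"city"}, "Henry_Minsky": {"person"}}.
    """
    expected = EXPECTED_TYPE[relation]
    return [c for c in candidates if expected in entity_types.get(c, set())]

types = {"New_York_City": {"city"}, "Henry_Minsky": {"person"}}
print(filter_by_type(["New_York_City", "Henry_Minsky"], "wasBornIn", types))
# ['New_York_City']
```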
6.3.1.4 Answer ranking
After obtaining a set of eligible candidate answers with correct entity types from the
snippets, we develop features for the candidate answers based on information from both
the unstructured textual snippets and the structured knowledge bases, and apply
classification on these features for ranking. For example, both Boston and
New_York_City are extracted as candidate answers for <Marvin_Minsky, wasBornIn, ?>
from the snippet "Marvin Minsky - A.M. Turing Award Winner, BIRTH: New York City,
August 9, 1927. DEATH: Boston, January 24, 2016 ...".
For feature extraction, we design six features that combine information from the
unstructured snippets and the structured knowledge in KBs. For classification, three
algorithms, SVM, logistic regression and decision tree, have been tested, and two
approaches, resampling and cost weighting, are applied to address the imbalanced
training datasets. The probability scores from the classification results are used to rank
the candidate answers.
6.3.2 Offline Training
6.3.2.1 Template selection
Issuing all possible questions to search engines is problematic for two reasons.
First, its computational cost is too high. Processing each question consumes significant
computational resources, such as CPU time and web searches; web searches in particular
require a lot of time waiting for responses from search engines. Moreover, more questions
return more snippets, and entity linking on a large number of snippets is very
time-consuming. Second, KBC performance may deteriorate with more questions. Not all
questions are equally good: some have better KBC performance than others, so asking all
possible questions is likely to introduce more false answers, which hurts answer ranking.
Through experiments, we find that with only a few questions we can obtain performance
better than or equivalent to using all possible questions.
According to previous work [1], greedy selection is the best selection strategy. In [1],
West et al. first evaluated each question template on training datasets and then greedily
selected the top-performing templates. However, this is not the most effective approach
in our case: we observe that some top-performing templates produce mostly overlapping
results. We therefore propose a different greedy algorithm to learn the best set of
question templates, shown in Algorithm 6.1.
Algorithm 6.1. Greedy selection algorithm
Input: T = {t1, t2, ..., tn}, the set of n question templates
Q = ∅: the currently selected question templates
QS = ∅: the collection of candidate template sets
for i = 1 to n do
    Select tj from T such that Q ∪ {tj} achieves the highest performance among all remaining templates in T
    Q = Q ∪ {tj}
    QS = QS ∪ {Q}
    T = T − {tj}
end for
Select Qm from QS with the highest performance and, among ties, the smallest size
return Qm
Our greedy algorithm selects the question template tj from T that works best with
the templates already in Q, instead of simply choosing the best-performing template
among all remaining templates in T. When i = 1, the algorithm selects the template with
the highest KBC performance. When i = 2, instead of selecting the second-best template,
it selects the template that works best together with the first selected one, and so on for
subsequent iterations. After collecting a series of template sets, we choose the set that
achieves the highest KBC performance with the smallest number of templates.
The advantage of our greedy selection algorithm is that, by choosing templates
which work best together, we avoid evaluating the exponential number of template
combinations and quickly find a near-optimal set of question templates, which greatly
reduces the number of questions asked per KBC query. As shown in our experiments, our
system achieves good performance with two or three question templates compared to
using all questions.
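A compact Python rendering of Algorithm 6.1 is given below; evaluate is a hypothetical function that measures the KBC performance (e.g., MAP on a validation set) of a set of templates.

```python
def greedy_select(templates, evaluate):
    """Greedily grow a template set, then return the candidate set
    with the highest performance, breaking ties by smaller size."""
    remaining, selected, candidates = set(templates), set(), []
    while remaining:
        # Pick the template that works best WITH the current set,
        # not the best-performing template in isolation.
        best = max(remaining, key=lambda t: evaluate(selected | {t}))
        selected = selected | {best}
        remaining.remove(best)
        candidates.append(selected)
    return max(candidates, key=lambda s: (evaluate(s), -len(s)))
```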
6.3.2.2 Query-driven snippet filtering
To reduce the number of snippets processed, we propose a query-driven snippet
filtering algorithm that selects the snippets most likely to contain information relevant to
the knowledge base completion query. An important observation is that not all top
snippets ranked by search engines contain information useful for KBC queries. For
example, for the question "Marvin Minsky born", some of the top snippets returned by
search engines cover general information about Marvin_Minsky rather than his
birthplace. To address this, we rerank the snippets by classifying them on a small set of
features and select the top snippets in the reranked list for candidate answer extraction.
The features we use for classification on snippets are:
• The original rank of the snippet returned by the search engine.
• A Boolean indicator of whether the question template keyword appears in the snippet, e.g., whether born appears in snippets returned for the question "Marvin Minsky born".
• The number of words of the subject entity's name appearing in the snippet. For instance, if "Marvin" and "Minsky" both appear in the snippet for question "Marvin Minsky born", the value of this feature is 2.
These features are designed to select snippets that not only rank high in the search
engine's original ordering but also contain the question keyword and the subject entity.
A logistic regression classifier is trained on training datasets to classify snippets, and
its confidence scores are used for reranking. The original training datasets are highly
imbalanced, because the number of positive samples is much smaller than the number of
negative samples. We resolve this by resampling the biased training data to generate new
balanced datasets for training the classifiers.
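A sketch of the snippet reranking step using scikit-learn follows; the training arrays are tiny illustrative placeholders for the resampled, balanced training data described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def snippet_features(snippet, rank, keyword, subject_words):
    """The three snippet features: original rank, keyword presence,
    and the number of subject name words in the snippet."""
    text = snippet.lower()
    return [
        rank,
        int(keyword.lower() in text),
        sum(word.lower() in text for word in subject_words),
    ]

# Illustrative balanced training data; real features come from
# labeled snippets after resampling.
X_train = np.array([[1, 1, 2], [40, 0, 0], [3, 1, 1], [45, 0, 1]])
y_train = np.array([1, 0, 1, 0])
classifier = LogisticRegression().fit(X_train, y_train)

def rerank_snippets(snippets, keyword, subject_words, top_n=20):
    feats = [snippet_features(s, i + 1, keyword, subject_words)
             for i, s in enumerate(snippets)]
    scores = classifier.predict_proba(feats)[:, 1]  # P(useful snippet)
    order = np.argsort(-scores)
    return [snippets[i] for i in order[:top_n]]
```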
6.3.2.3 Feature extraction
In previous work [58, 71], Peng et al. explained that multimodal data can provide
additional information or emphasize the same information across modalities, so
multimodal fusion can usually achieve better performance than single-modality
approaches. In our system, we adopt the early fusion scheme, which combines information
from multiple modalities at the feature level [4, 58, 71]. Unstructured textual snippets
from the Web and structured knowledge from KBs are fused together to produce effective
features. For each candidate answer, we extract the six features below.
• The feature snippet count is the number of snippets in which a candidate answer appears. This feature is straightforward: correct answers are expected to appear across snippets more frequently than false candidate answers.
• The feature average rank is the average rank of the snippets in which the candidate answer appears. Search engines aim to rank the snippets most relevant to a query at the top of the retrieved list; the smaller the average rank, the more likely the candidate answer is correct.
• The feature keyword count is the number of times question keywords appear together with the candidate answer in snippets. For the question "Marvin Minsky born", born is the question keyword and Marvin_Minsky is the subject entity. We find that snippets containing question keywords are usually useful and that true answers are more likely to co-occur with question keywords than false answers, so a candidate answer that appears frequently together with question keywords is considered likely to be correct.
• The feature context distance measures the similarity between the context of the candidate answer in snippets and the Wikipedia abstract of the subject entity. Each entity has a short Wikipedia abstract describing its most important information. The context of a candidate answer is the set of words appearing in its neighborhoods in the snippets. This feature is calculated as the cosine distance between the bag-of-words vectors of the candidate answer's context and the subject entity's abstract.
• The feature abstract distance measures the similarity between the Wikipedia abstracts of the candidate answer and the subject entity, calculated as the cosine distance between the bag-of-words vectors of the two abstracts. The correct answer and the subject entity of a KBC query should be related to each other, so the context distance and abstract distance between them should be small in the bag-of-words vector space.
• The relatedness between the candidate answer and the subject entity measures the semantic relevance of the two entities based only on the structured information inside knowledge bases. The entity relatedness implementation is provided by TagMe [70].
As shown above, these six features combine information from both unstructured
textual snippets and structured knowledge bases. The major advantage of applying
multimodal fusion at the feature level is that multimodal features provide more
information than either the textual snippets or the knowledge bases alone.
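For instance, the context distance and abstract distance features can be derived from a bag-of-words cosine computation like the following sketch; the tokenization is deliberately naive and the example strings are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors; the distance
    used by the features above is simply 1 minus this value."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Candidate answer context vs. the subject entity's Wikipedia abstract.
context = "was born in New York City to an eye surgeon father"
abstract = "Marvin Minsky was an American scientist born in New York City"
print(round(cosine_similarity(context, abstract), 3))
```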
6.3.2.4 Classification
The features of candidate answers are classified using pre-trained classifiers, and the
resulting probability scores are used to rank the candidate answers. However,
classification on candidate answers is challenging because the training datasets are highly
imbalanced, an issue not addressed in previous work [1, 59]: the training datasets usually
contain 30+ times more negative samples than positive samples, making them extremely
biased.
We employ two approaches, resampling and cost weighting, to address the
imbalanced training data. Resampling samples the existing training datasets to create
new balanced datasets with equal numbers of positive and negative samples. Cost
weighting assigns higher costs to false negatives, pushing the classifiers toward higher
recall. With resampling and cost weighting, we usually obtain classifiers with 20% to 40%
larger area under the precision-recall curve than regular classifiers.
Three classification techniques have been utilized in our system: logistic regression
[1], decision tree [59] and support vector machines. In extensive experiments, logistic
regression usually performs more stably than the other two classifiers for most relations.
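In scikit-learn terms, cost weighting can be expressed through class weights, as in this sketch; the 30:1 ratio mirrors the rough imbalance mentioned above and is only illustrative.

```python
from sklearn.linear_model import LogisticRegression

# Cost weighting: penalize false negatives more heavily so the
# classifier does not collapse to the majority (negative) class.
weighted = LogisticRegression(class_weight={0: 1.0, 1: 30.0})

# Alternatively, derive weights inversely proportional to the
# class frequencies observed in the training data.
balanced = LogisticRegression(class_weight="balanced")
```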
6.4 Rule Inference
We build an augmented rule inference (AugRI) system that uses logical rules,
existing facts in knowledge bases and web-based question answering to infer new facts of
interest, and compare its performance with ordinary rule inference (OrdRI), which uses
only the existing facts in knowledge bases.
6.4.1 Rules
We choose logical rule inference to infer missing facts from the structured
information inside knowledge bases because of its expressiveness and efficiency. The rules
we use are Horn-clause rules pre-learned by previous work [60]. A Horn clause is a
disjunction of literals with at most one positive literal. An example Horn-clause rule is
shown in Figure 6-3.
In Figure 6-3, the premise isMarriedTo(x, y) ∧ hasChild(y, z) is called the body and
the conclusion hasChild(x, z) is called the head. Each rule has a confidence score
indicating its validity. Rules fall into two kinds according to the number of literals in
their bodies: length-1 rules (e.g., isMarriedTo(x, y) ⇒ isMarriedTo(y, x)) and length-2
rules (e.g., isMarriedTo(x, y) ∧ hasChild(y, z) ⇒ hasChild(x, z)).
Figure 6-3. An example rule.
6.4.2 Ordinary Rule Inference
The rules learned by previous work [60] contain noisy and incorrect ones, which we
eliminate before inference. We also discard rules with very low confidence and support,
since processing them cannot produce meaningful results. We store the facts in
knowledge bases as triples in database tables.
Ordinary rule inference uses only the facts already in the knowledge bases. It runs by
executing the corresponding SQL queries for all rules in parallel using multithreading,
which is very fast; a sketch follows. If a fact can be inferred by more than one rule, we
must combine the results from multiple rules. We tested several fusion approaches,
including the maximum rule (taking the highest score), the sum rule (adding the
corresponding scores together) and logistic regression (using features such as the average
confidence score and the total number of rules by which the fact is inferred). The sum
rule works best on validation datasets. Note that the confidence score of an inferred fact
can exceed 1.0, which is acceptable since the scores are only used to rank the candidate
answers.
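As an illustration, a length-2 rule such as isMarriedTo(x, y) ∧ hasChild(y, z) ⇒ hasChild(x, z) can be executed as a self-join over a triple table; the sqlite3 sketch below assumes a simple facts(subject, relation, object) schema, not our actual database layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (subject TEXT, relation TEXT, object TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("Alice", "isMarriedTo", "Bob"),
    ("Bob", "hasChild", "Carol"),
])

# isMarriedTo(x, y) AND hasChild(y, z) => hasChild(x, z)
rows = conn.execute("""
    SELECT m.subject AS x, c.object AS z
    FROM facts m JOIN facts c ON m.object = c.subject
    WHERE m.relation = 'isMarriedTo' AND c.relation = 'hasChild'
""").fetchall()
print(rows)  # [('Alice', 'Carol')], an inferred hasChild fact
```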
6.4.3 Augmented Rule Inference
Ordinary rule inference has low performance because knowledge bases are highly
incomplete, so many body literals of the rules are missing. To increase the performance
of rule inference, we use WebQA to find the missing literal values of rule bodies and use
them to infer missing facts. We follow the query-driven scheme to decide when to invoke
WebQA: for literals that already exist in the knowledge bases, we use them directly and
avoid running expensive WebQA queries; only for literals missing from the knowledge
bases do we use WebQA to find their values on the Web. Experimental results show that
AugRI achieves better performance than OrdRI.
6.4.3.1 Length-1 rules
An example of the inference process for a single literal is shown in Figure 6-4.
This single-literal processing handles length-1 rules, such as diedIn(x, y) ⇒
wasBornIn(x, y). For a candidate answer y, the confidence score from this rule is
score(wasBornIn(x, y)) = score(diedIn(x, y)) × score_r, where score_r is the confidence
score of the rule. If diedIn(x, y) exists in the knowledge base, its confidence score is 1;
otherwise, it is the score returned by WebQA.
Figure 6-4. Single-literal processing.
6.4.3.2 Length-2 rules
The inference process for two literals is shown in Figure 6-5. We first get a list of
candidate y values for the first literal; then, for each y value, we execute a single-literal
processing step for the second literal. All resulting z values with confidence scores are
output for final ranking. This two-literal processing handles length-2 rules.
To calculate the confidence scores of inferred facts for length-2 rules, we propose a
method called sum of products. The intuition behind it is straightforward. For a KBC
query <x, hasChild, ?>, we use the rule isMarriedTo(x, y) ∧ hasChild(y, z) ⇒
hasChild(x, z) to infer hasChild(x, z). The rule itself has a confidence score score_r. The
confidence score for hasChild(x, z) generated by this rule through a particular y is
score(isMarriedTo(x, y)) × score(hasChild(y, z)) × score_r. We then sum the confidence
scores over all available intermediate y values by which hasChild(x, z) is inferred, and use
the sum as the confidence score of this inferred fact. To fuse results from multiple rules,
we follow the same method as in ordinary rule inference.
Figure 6-5. Two-literal processing.
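A sketch of the sum-of-products scoring for a single length-2 rule follows; the score dictionaries stand in for KB lookups (confidence 1.0) and WebQA results, and all values are illustrative.

```python
def sum_of_products(candidates_y, rule_conf, first_scores, second_scores):
    """Confidence of inferred facts z for one length-2 rule
    body1(x, y) AND body2(y, z) => head(x, z).

    first_scores:  {y: confidence of body1(x, y)}, 1.0 if in the KB
    second_scores: {y: {z: confidence of body2(y, z)}}
    """
    scores = {}
    for y in candidates_y:
        for z, s2 in second_scores.get(y, {}).items():
            product = first_scores[y] * s2 * rule_conf  # product per chain
            scores[z] = scores.get(z, 0.0) + product    # summed over y
    return scores

first = {"Bob": 1.0}              # isMarriedTo(Alice, Bob) in the KB
second = {"Bob": {"Carol": 0.8}}  # hasChild(Bob, Carol) from WebQA
print(sum_of_products(["Bob"], 0.9, first, second))  # {'Carol': 0.72}
```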
6.4.3.3 Query-driven optimization
A KBC query usually has only a few true answers (fewer than 4 in most cases). For
a length-2 rule, if single-literal processing for the first literal generates m candidate
answers and the second generates n candidate answers on average for each intermediate y
value, we get m × n candidate answers in total, most of which are wrong. Two-literal
processing would issue m + 1 WebQA queries per rule, which is very time-consuming;
even though we use multithreading to parallelize the rule inference process, issuing all
available WebQA queries is inefficient. Another disadvantage is that low-confidence
results from the first step generate many incorrect candidate answers in the second step.
To improve KBC performance and system efficiency, we use two parameters to filter
the candidate answers generated by WebQA for the first literal: a confidence threshold
and the number of answers to pass to the second step of two-literal processing. If the
confidence score of a candidate answer for the first literal is below the threshold, we
discard it, and we pass at most the top k answers to the next step. We learned the best
parameters by running experiments with different parameter settings. We also terminate
WebQA queries that have waited too long for responses from web servers.
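The filtering of intermediate candidates might look like the following sketch; the threshold and k values are illustrative, since the tuned values vary per relation.

```python
def filter_intermediates(scored, threshold=0.3, top_k=3):
    """Keep at most the top-k intermediate answers whose WebQA
    confidence passes the threshold, pruning low-confidence y values
    before the second single-literal processing step."""
    kept = [(ans, s) for ans, s in scored.items() if s >= threshold]
    kept.sort(key=lambda kv: kv[1], reverse=True)
    return kept[:top_k]

print(filter_intermediates({"Bob": 0.9, "Dan": 0.2, "Eve": 0.5}))
# [('Bob', 0.9), ('Eve', 0.5)]
```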
6.5 Evaluation
In this section, we demonstrate the effectiveness and efficiency of our system through
extensive experiments. We first introduce the datasets used for training and testing. We
then evaluate the KBC performance of our system under different settings, discuss the
efficiency of the WebQA system and show that it can provide real-time responses to user
queries.
For KBC performance, we evaluate the quality of the answer rankings, using mean
average precision (MAP) as the evaluation metric. For a KBC query, the average
precision is defined as AP = ∑_{k=1}^{n} p(k) × r(k), where k is the rank in the sequence
of candidate answers, n is the number of candidate answers, p(k) is the precision at
cut-off k in the ranked list and r(k) is the change in recall from candidate answer k − 1
to k. Averaging over all queries yields the mean average precision (MAP).
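Concretely, AP and MAP can be computed as in this sketch, which treats r(k) as the recall gained when a correct answer appears at rank k; the example query is illustrative.

```python
def average_precision(ranked, relevant):
    """AP for one KBC query: precision at k times the change in
    recall at k, summed over the ranked candidate answers."""
    hits, ap = 0, 0.0
    for k, answer in enumerate(ranked, start=1):
        if answer in relevant:
            hits += 1
            precision_at_k = hits / k
            recall_change = 1 / len(relevant)  # r(k): recall gained at k
            ap += precision_at_k * recall_change
    return ap

def mean_average_precision(queries):
    """queries: list of (ranked_answers, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

print(average_precision(["New_York_City", "Boston"], {"New_York_City"}))  # 1.0
```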
6.5.1 Datasets
To evaluate our system, we extracted facts from Yago [11, 72] as training and testing
datasets. Yago is a huge semantic knowledge base derived from Wikipedia, WordNet and
GeoNames. Currently, Yago covers more than 10 million entities (persons, organizations,
cities, etc.) and contains more than 120 million facts about these entities. The whole
knowledge base can be downloaded from the Yago website².
We consider 8 relations from Yago (diedIn, graduatedFrom, hasAcademicAdvisor,
hasCapital, hasChild, isCitizenOf, isMarriedTo and wasBornIn) for testing our system.
To collect training and testing data, we make the local closed-world assumption: if Yago
has a non-empty set of objects O for a given subject-relation pair, then O contains all the
ground-truth objects for that pair. For each relation, we randomly sampled 500 subjects
with their corresponding objects from Yago for training and 100 subjects with their
corresponding objects for testing. For some relations, no reliable rules are available, so we
did not conduct rule inference for them.
6.5.2 WebQA
Logistic regression was chosen as the classification method for WebQA. To balance
the training datasets, we tested both resampling and cost weighting and selected the
better one per relation. We first evaluate the KBC performance of WebQA with different
numbers of questions selected by Algorithm 6.1. We then show the overall KBC
performance of our system on the 8 relations; these experiments used all snippets crawled
by searching the selected questions on the Web. Lastly, we examine the performance of
WebQA with the top snippets selected by query-driven snippet filtering.
² http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/
Figure 6-6. The KBC performance results for three relations with different numbers of questions. k is the number of selected questions. The KBC performance is measured by MAP.
6.5.2.1 Question template selection
As discussed earlier, issuing all possible questions for each KBC query is inefficient
and sometimes degrades answer ranking quality. We designed Algorithm 6.1, the greedy
selection algorithm, to select a few questions that achieve performance comparable to
using all possible questions. For the experiments, we designed several question templates
for each relation and ran the greedy algorithm with different numbers of questions.
Figure 6-6 presents typical results for three relations (hasChild, isCitizenOf and
isMarriedTo).
For relation isMarriedTo, the KBC performance of our system improves from 0.45
to 0.52 as the number of questions increases from 1 to 3, then decreases with more
questions. For relation isCitizenOf, performance increases from 0.35 to 0.45 as the
number of questions grows from 1 to 3, then remains unchanged with more questions.
Table 6-2. Overall KBC performance for 8 relations with all snippets. A comparison between our system and previous work [1] (denoted as West) is also shown. MAP (mean average precision) measures the KBC performance. Numbers in bold indicate the best results for individual relations.

Relation            Perf. (Ours)  Question # (Ours)  Perf. (West)  Question # (West)
wasBornIn           0.75          2                  0.67          8
hasChild            0.24          2                  0.18          8
isMarriedTo         0.52          3                  0.50          8
isCitizenOf         0.45          3                  0.93          32
diedIn              0.43          3                  N/A           N/A
hasCapital          0.52          2                  N/A           N/A
graduatedFrom       0.25          2                  N/A           N/A
hasAcademicAdvisor  0.21          2                  N/A           N/A
For relation hasChild, the best KBC performance, 0.24, was achieved with 2 questions;
adding more questions gradually decreases performance to 0.22.
As explained earlier, when k = 1 the greedy algorithm selects the single question
template with the highest performance. In Figure 6-6, multiple questions achieve higher
MAP than the best single question. Another important observation is that using all
questions does not necessarily improve KBC performance over a few questions: for
relations isMarriedTo and hasChild, performance with 6 questions is lower than with 2 or
3 questions. The reason is that with more questions, especially low-quality ones, false
answers are more likely to be ranked near the top of the candidate list. For all 8 relations
we examined, only two or three questions selected by the greedy algorithm suffice to
achieve the highest KBC performance.
The results in Figure 6-6 demonstrate that our greedy algorithm can effectively
select very few questions while matching the performance of using all possible questions.
This matters because fewer questions, and hence fewer crawled snippets, improve the
running time of the whole pipeline.
6.5.2.2 Overall performance
For each relation, we selected the smallest set of question templates that achieves
the highest KBC performance based on the experiments above, and used these sets to
evaluate WebQA on all 8 relations. These experiments used all snippets crawled from the
Web.
West et al. designed their own question templates and exploited an in-house question
answering system [1]; theirs is the only system we found that uses web-based question
answering for knowledge base completion. We compare the KBC performance of our
system with theirs for 4 relations: wasBornIn, hasChild, isMarriedTo and isCitizenOf.
The results are shown in Table 6-2.
For the three relations wasBornIn, hasChild and isMarriedTo, our system achieves
better performance than previous work [1] with far fewer questions, as shown in
Table 6-2. This has a few reasons. First, we design better templates than previous work,
as discussed above. Second, we fuse information from both unstructured text and
structured knowledge bases to design features, while previous work uses only textual
information to rank candidate answers.
Only for relation isCitizenOf does our system fail to match previous work. A possible
reason is that previous work [1] checks facts for the top 100K entities searched on Google,
while we randomly select 100 subject entities from millions of entities in Yago; the
entities we examined may therefore be mostly rare entities that do not appear on the
Web as frequently as those in previous work [1]. Indeed, we found in experiments that
the main cause of the low KBC performance for isCitizenOf in WebQA is missing
citizenship information for rare entities.
We have demonstrated through experiments that our system can achieve good
performance with far fewer questions than previous work [1]. However, some issues
remain with web-based question answering. First, popular entities have more useful
information crawled from the Web than rare entities. Second, web-based question
answering works poorly for some relations, such as graduatedFrom and hasChild.

Table 6-3. KBC performance with snippet filtering for different numbers of snippets. The experiments run our system with 10, 20, 30 and all snippets. Performance of previous work [1] is denoted as West. Performance is measured by MAP.

Relation            10 snippets  20 snippets  30 snippets  All snippets  West
wasBornIn           0.70         0.71         0.70         0.75          0.67
hasChild            0.21         0.21         0.24         0.24          0.18
isMarriedTo         0.48         0.50         0.51         0.52          0.50
isCitizenOf         0.39         0.40         0.41         0.45          0.93
diedIn              0.31         0.38         0.40         0.43          N/A
hasCapital          0.45         0.48         0.51         0.52          N/A
graduatedFrom       0.19         0.22         0.22         0.25          N/A
hasAcademicAdvisor  0.10         0.16         0.18         0.21          N/A
6.5.2.3 Performance with snippet filtering
To reduce the number of snippets processed, we apply query-driven snippet filtering
to select the snippets most likely to contain information relevant to the query. While
improving system efficiency, we want to show experimentally that selecting a subset of
the snippets via query-driven filtering does not cause a severe loss of answer ranking
quality. We therefore ran experiments with different numbers of snippets and compared
their KBC performance with experiments using all snippets and with previous work [1],
using the same sets of question templates as above. The results are shown in Table 6-3.
For all 8 relations tested, the performance of our system with 20 or 30 filtered
snippets drops very little compared to using all snippets, usually by less than 0.04 MAP;
for relation hasChild, 30 snippets achieve the same MAP as all snippets. Compared to
previous work [1], our system still performs better for relations wasBornIn, isMarriedTo
and hasChild after snippet filtering. In short, our system still achieves better performance
than previous work [1] even after snippet filtering.
6.5.2.4 Efficiency
In our WebQA pipeline, question generation and answer ranking are very fast,
usually costing only a few milliseconds. The bottleneck is data collection and answer
extraction, which involve web searches and server queries. Compared to previous work
[1, 59], WebQA has two advantages: it needs far fewer questions to achieve high KBC
performance than previous work [1], and query-driven snippet filtering selects only a
small subset of all snippets, reducing the entity linking workload.
A sequential pipeline without parallelization usually takes a few minutes to finish, so
we employ multithreading to parallelize snippet crawling and entity linking and thus
reduce the time spent waiting for responses from search engines and web servers. The
parallelized pipeline achieves about a 10x speedup in total running time over the
sequential one. However, parallelization alone cannot provide real-time responses to user
queries, because too many connections must be maintained for hundreds of snippets and
web servers simultaneously process many queries from other users. We therefore conduct
query-driven snippet filtering to effectively reduce the number of snippets while
maintaining high KBC performance.
To evaluate the running time of our pipeline, the experiments were run on a single
machine with a 3.1 GHz four-core CPU and 4 GB of memory. Since the running time
varies with environmental factors such as network congestion and server speed, we report
average running times over extensive experiments with different queries.
The number of questions has an important impact on the running time of our
pipeline, since more questions mean more snippets to crawl. Figure 6-7 shows the running
time of our system for relation wasBornIn with different numbers of questions and 30
snippets per question; the results for other relations are similar. The running time of
WebQA grows almost linearly as the number of questions increases. Since our system
needs far fewer questions than previous work [1], it is clearly more efficient under the
same circumstances.

Figure 6-7. The average running time of WebQA with different numbers of questions for relation wasBornIn.
Query-driven snippet filtering further improves the running time by reducing the
number of snippets. Results for our system using snippet filtering with different numbers
of snippets are shown in Table 6-4. With 3 questions, we crawl up to 3 × 50 = 150
snippets per query. Using snippet filtering, we can reduce the number of snippets from
150 to 20 or 30 without sacrificing much quality (Section 6.5.2.3); as shown in Table 6-4,
the running time is then about 3 seconds, less than 25% of the time when using all
snippets. In conclusion, WebQA can provide real-time responses to user queries
on-the-fly, spending only a few seconds per query.
Table 6-4. Average running time of our system using query-driven snippet filtering for relation wasBornIn with 3 questions.

Snippet number   20    30    50    150
Time (seconds)   3.1   3.2   4.1   12.5
Table 6-5. KBC performance of OrdRI vs. AugRI (measured by MAP).

Relation        OrdRI  AugRI
wasBornIn       0.12   0.49
hasChild        0.22   0.24
isCitizenOf     0.08   0.53
diedIn          0.20   0.30
graduatedFrom   0.06   0.17
isMarriedTo     0.96   0.96
Table 6-6. KBC performance of individual approaches and ensemble fusion (measured by MAP). WebQA is conducted with 30 snippets.

Relation            WebQA  AugRI  WebQA + AugRI
wasBornIn           0.70   0.49   0.82
hasChild            0.24   0.24   0.40
isCitizenOf         0.41   0.53   0.55
diedIn              0.40   0.30   0.47
graduatedFrom       0.22   0.17   0.30
isMarriedTo         0.51   0.96   0.96
hasAcademicAdvisor  0.18   N/A    N/A
hasCapital          0.51   N/A    N/A
6.5.3 Rule Inference
We conducted a series of experiments comparing the two rule inference systems,
ordinary rule inference (OrdRI) and augmented rule inference (AugRI), on 6 relations.
The results are shown in Table 6-5. OrdRI has low MAP for 5 relations; only for relation
isMarriedTo does it match AugRI. The reason is that knowledge bases are highly
incomplete, so many literals in rule bodies are missing. After using WebQA to augment
rule inference, performance improves across all relations, with especially large
improvements for graduatedFrom (200%+), diedIn (50%+), isCitizenOf (550%+) and
wasBornIn (300%+). The performance of AugRI is still not high for two reasons. First,
knowledge bases are highly incomplete. Second, WebQA is not very reliable for some
relations and hence generates low-confidence candidate answers.
6.5.4 Ensemble Fusion
We conducted experiments comparing the KBC performance of WebQA, AugRI and
their ensemble fusion (WebQA + AugRI); the results are shown in Table 6-6. The
ensemble fusion approach achieves higher performance than WebQA and rule inference
alone for most relations. For relations diedIn and graduatedFrom, ensemble fusion
improved KBC performance by over 0.07 MAP compared to WebQA and AugRI; for
relation wasBornIn, by 0.12; and for relation hasChild, ensemble fusion achieved nearly
70% higher KBC performance than either WebQA or AugRI. In conclusion, multimodal
fusion yields very high KBC performance for many relations by exploiting the
complementary relation between WebQA and AugRI.
We use multithreading to parallelize rule inference and web-based question
answering. Our whole system runs very fast on average (e.g., only about 4-5 seconds for
relation isCitizenOf), because threshold filtering effectively avoids unnecessary WebQA
operations. However, queries that issue many WebQA operations can take dozens of
seconds in the worst cases.
CHAPTER 7
CONCLUSIONS
This dissertation focuses on utilizing different kinds of data through multimodal
fusion to improve performance, and on providing scalability and high efficiency, across
different tasks. I introduce multimodal datasets and multimodal fusion for several
applications, including word sense disambiguation, information retrieval and knowledge
base completion. Multimodal fusion is the use of algorithms to combine information from
different kinds of data with the purpose of achieving better performance than
single-modality approaches. The multimodal datasets studied in this dissertation include
images, unstructured text and structured facts from knowledge bases.
Scalability and efficiency are two important themes of this dissertation. We first
present a streaming processing system for fact extraction on terabytes of text data, which
finishes in less than one hour using two layers of filters on a single machine with limited
computational resources. We then show how to implement a scalable image retrieval
system on top of Hadoop to efficiently process millions of images, designing two
distributed clustering algorithms with Hadoop and MapReduce that run much faster
than previous work. We also use query-driven optimization techniques to improve the
efficiency of the KBC system and provide fast responses to user queries.
We propose a theory of multimodal fusion based on the observation of the
correlative and complementary relations between different modalities. Through these
relations, multimodal data can either provide additional information or emphasize the
same information, so multimodal fusion can exploit both to improve task performance.
Previous work usually focuses on exploiting the correlation between modalities at the
feature level and ignores the complementary relation. In this dissertation, I discuss
multimodal fusion from a deeper perspective, explain why multimodal fusion works and
analyze how to improve performance for different tasks based on the correlative and
complementary relations in multimodal datasets.
We present the multimodal ensemble fusion model for word sense disambiguation and
information retrieval as an example of our theory. The multimodal datasets for these two
applications exhibit mostly the complementary relation between images and text, and
image processing and text processing are also complementary to each other. Our
multimodal ensemble fusion model utilizes this complementary relation to achieve better
performance than image-only and text-only approaches.
We design a query-driven system with multimodal fusion for knowledge base
completion, combining web-based question answering and rule inference to fuse
unstructured text and structured knowledge. In different phases of the pipeline,
information from multiple modalities is fused to exploit both the complementary and
correlative relations of multimodal data. The web-based question answering system
applies early fusion to combine features extracted from both the unstructured Web and
structured knowledge bases. We design novel multimodal features and an effective
question template selection algorithm for question answering, which achieves better
performance with far fewer questions than previous work. We implement a query-driven
snippet filtering algorithm, which greatly reduces the number of snippets processed and
hence the running time. We build an augmented rule inference system that combines
web-based question answering, logical rules pre-learned from knowledge bases and
existing facts in knowledge bases to infer new facts for KBC queries. Late fusion
approaches then combine rule inference and web-based question answering to further
improve knowledge base completion performance. Query-driven optimization techniques
are employed throughout the pipeline to improve the running time of the whole system
and provide fast responses to user queries.
REFERENCES
[1] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin, "Knowledge base completion via search-based question answering," in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 515–526.
[2] "UIUC-ISD dataset," http://vision.cs.uiuc.edu/isd/, accessed: 2017-04-05.
[3] "Wikipedia," https://www.wikipedia.org/, accessed: 2017-04-05.
[4] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: a survey," Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[5] J. R. Frank, M. Kleiman-Weiner, D. A. Roberts, F. Niu, C. Zhang, C. Ré, and I. Soboroff, "Building an entity-centric stream filtering test collection for TREC 2012," DTIC Document, Tech. Rep., 2012.
[6] R. Navigli, "Word sense disambiguation: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 2, p. 10, 2009.
[7] E. Agirre and P. Edmonds, Word Sense Disambiguation: Algorithms and Applications. Springer Science & Business Media, 2007, vol. 33.
[8] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys (CSUR), vol. 40, no. 2, p. 5, 2008.
[9] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008, pp. 1247–1250.
[10] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in AAAI, vol. 5, 2010, p. 3.
[11] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: a core of semantic knowledge," in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007, pp. 697–706.
[12] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.
[13] N. Srivastava and R. R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.
[14] W. May, S. Fidler, A. Fazly, S. Dickinson, and S. Stevenson, "Unsupervised disambiguation of image captions," in Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation. Association for Computational Linguistics, 2012, pp. 85–89.
[15] K. Saenko and T. Darrell, "Filtering abstract senses from image search results," in Advances in Neural Information Processing Systems, 2009, pp. 1589–1597.
[16] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 1995, pp. 189–196.
[17] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, "A new approach to cross-modal multimedia retrieval," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 251–260.
[18] Y. Wu, E. Y. Chang, K. C.-C. Chang, and J. R. Smith, "Optimal multimodal fusion for multimedia data analysis," in Proceedings of the 12th Annual ACM International Conference on Multimedia. ACM, 2004, pp. 572–579.
[19] Q. Zhu, M.-C. Yeh, and K.-T. Cheng, "Multimodal fusion using learned text concepts for image categorization," in Proceedings of the 14th ACM International Conference on Multimedia. ACM, 2006, pp. 211–220.
[20] E. Bruno, J. Kludas, and S. Marchand-Maillet, "Combining multimodal preferences for multimedia information retrieval," in Proceedings of the International Workshop on Workshop on Multimedia Information Retrieval. ACM, 2007, pp. 71–78.
[21] S. Wei, Y. Zhao, Z. Zhu, and N. Liu, "Multimodal fusion for video search reranking," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 8, pp. 1191–1199, 2010.
[22] L. Yang and D. Neagu, "Toxicity risk assessment from heterogeneous uncertain data with possibility-probability distribution," in Fuzzy Systems (FUZZ), 2013 IEEE International Conference on. IEEE, 2013, pp. 1–8.
[23] L. Yang, D. Neagu, M. T. Cronin, M. Hewitt, S. J. Enoch, J. C. Madden, and K. Przybylak, "Towards a fuzzy expert system on toxicological data quality assessment," Molecular Informatics, vol. 32, no. 1, pp. 65–78, 2013.
[24] P. McNamee, V. Stoyanov, J. Mayfield, T. Finin, T. Oates, T. Xu, D. W. Oard, and D. Lawrie, "HLTCOE participation at TAC 2012: Entity linking and cold start knowledge base construction," in TAC, 2012.
[25] J. Dalton and L. Dietz, "Bi-directional linkability from Wikipedia to documents and back again: UMass at TREC 2012 knowledge base acceleration track," DTIC Document, Tech. Rep., 2012.
[26] L. Bonnefoy, V. Bouvier, and P. Bellot, "A weakly-supervised detection of entity central documents in a stream," in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 769–772.
[27] K. Balog and H. Ramampiaro, "Cumulative citation recommendation: Classification vs. ranking," in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 941–944.
[28] H. Ji and R. Grishman, "Knowledge base population: Successful approaches and challenges," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011, pp. 1148–1158.
[29] J. Ellis, "TAC KBP 2013 slot descriptions," TAC KBP, 2013.
[30] M. S. Nia, C. E. Grant, Y. Peng, D. Z. Wang, and M. Petrovic, "Streaming fact extraction for Wikipedia entities at web-scale," in FLAIRS Conference, 2014.
[31] M. S. Nia, C. Grant, Y. Peng, D. Z. Wang, and M. Petrovic, "University of Florida knowledge base acceleration notebook," The Twenty-Second Text REtrieval Conference (TREC 2013).
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[33] J. Sivic, A. Zisserman et al., “Video google: A text retrieval approach to objectmatching in videos.” in iccv, vol. 2, no. 1470, 2003, pp. 1470–1477.
[34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval withlarge vocabularies and fast spatial matching,” in Computer Vision and PatternRecognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[35] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learning natural scenecategories,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEEComputer Society Conference on, vol. 2. IEEE, 2005, pp. 524–531.
[36] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computervision, 1999. The proceedings of the seventh IEEE international conference on, vol. 2.Ieee, 1999, pp. 1150–1157.
[37] “Apache hadoop,” https://hadoop.apache.org/, accessed: 2017-04-05.
[38] “Apache mahout,” https://mahout.apache.org/, accessed: 2017-04-05.
99
[39] “Apache solr,” http://lucene.apache.org/solr/, accessed: 2017-04-05.
[40] “Apache lucene,” http://lucene.apache.org/, accessed: 2017-04-05.
[41] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, “Large-scale image retrieval withcompressed fisher vectors,” in Computer Vision and Pattern Recognition (CVPR),2010 IEEE Conference on. IEEE, 2010, pp. 3384–3391.
[42] J. Deng, A. C. Berg, and L. Fei-Fei, “Hierarchical semantic indexing for large scaleimage retrieval,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEEConference on. IEEE, 2011, pp. 785–792.
[43] C. Gu and Y. Gao, “A content-based image retrieval system based on hadoopand lucene,” in Cloud and Green Computing (CGC), 2012 Second InternationalConference on. IEEE, 2012, pp. 684–687.
[44] D. Yin and D. Liu, “Content-based image retrial based on hadoop,” MathematicalProblems in Engineering, vol. 2013, 2013.
[45] W. Premchaiswadi, A. Tungkatsathan, S. Intarasema, and N. Premchaiswadi,“Improving performance of content-based image retrieval schemes using hadoopmapreduce,” in High Performance Computing and Simulation (HPCS), 2013 Interna-tional Conference on. IEEE, 2013, pp. 615–620.
[46] R. K. Grace, R. Manimegalai, and S. S. Kumar, “Medical image retrieval systemin grid using hadoop framework,” in Computational Science and ComputationalIntelligence (CSCI), 2014 International Conference on, vol. 1. IEEE, 2014, pp.144–148.
[47] M. Lux and S. A. Chatzichristofis, “Lire: lucene image retrieval: an extensible javacbir library,” in Proceedings of the 16th ACM international conference on Multimedia.ACM, 2008, pp. 1085–1088.
[48] C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptormatching,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEEConference on. IEEE, 2008, pp. 1–8.
[49] M. Muja and D. Lowe, “Fast approximate nearest neighbors with automaticalgorithm configuration,” in VISAPP, 2009.
[50] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensionaldata,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36,no. 11, pp. 2227–2240, 2014.
[51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognitionchallenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252,2015.
100
[52] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo, “Evaluatingbag-of-visual-words representations in scene classification,” in Proceedings of theinternational workshop on Workshop on multimedia information retrieval. ACM,2007, pp. 197–206.
[53] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3,pp. 273–297, 1995.
[54] M. Richardson and P. Domingos, “Markov logic networks,” Machine learning, vol. 62,no. 1, pp. 107–136, 2006.
[55] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and itsapplications,” 2006.
[56] N. Lao, T. Mitchell, and W. W. Cohen, “Random walk inference and learning in alarge scale knowledge base,” in Proceedings of the Conference on Empirical Methodsin Natural Language Processing. Association for Computational Linguistics, 2011,pp. 529–539.
[57] M. Nickel, V. Tresp, and H.-P. Kriegel, “A three-way model for collective learning onmulti-relational data,” in Proceedings of the 28th international conference on machinelearning (ICML-11), 2011, pp. 809–816.
[58] Y. Peng, X. Zhou, D. Z. Wang, I. Patwa, D. Gong, and C. Fang, “Multimodalensemble fusion for disambiguation and retrieval,” IEEE MultiMedia, 2016.
[59] H. Sun, H. Ma, W.-t. Yih, C.-T. Tsai, J. Liu, and M.-W. Chang, “Open domainquestion answering via semantic enrichment,” in Proceedings of the 24th InternationalConference on World Wide Web. ACM, 2015, pp. 1045–1055.
[60] Y. Chen, S. Goldberg, D. Z. Wang, and S. S. Johri, “Ontological pathfinding,” inProceedings of the 2016 International Conference on Management of Data. ACM,2016, pp. 835–846.
[61] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis, “Overview of the tac 2010knowledge base population track,” in Third Text Analysis Conference (TAC 2010),vol. 3, no. 2, 2010, pp. 3–3.
[62] G. Weikum and M. Theobald, “From information to knowledge: harvesting entitiesand relationships from web sources,” in Proceedings of the twenty-ninth ACMSIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM,2010, pp. 65–76.
[63] J. R. Frank, S. J. Bauer, M. Kleiman-Weiner, D. A. Roberts, N. Tripuraneni,C. Zhang, C. Re, E. Voorhees, and I. Soboroff, “Evaluating stream filtering for entityprofile updates for trec 2013 (kba track overview),” DTIC Document, Tech. Rep.,2013.
101
[64] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translatingembeddings for modeling multi-relational data,” in Advances in neural informationprocessing systems, 2013, pp. 2787–2795.
[65] R. Socher, D. Chen, C. D. Manning, and A. Ng, “Reasoning with neural tensornetworks for knowledge base completion,” in Advances in neural information process-ing systems, 2013, pp. 926–934.
[66] E. M. Voorhees et al., “The trec-8 question answering track report.” in Trec, vol. 99,1999, pp. 77–82.
[67] E. Brill, J. J. Lin, M. Banko, S. T. Dumais, A. Y. Ng et al., “Data-intensive questionanswering.” in TREC, vol. 56, 2001, p. 90.
[68] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: Issues,techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering,vol. 27, no. 2, pp. 443–460, 2015.
[69] P. Ferragina and U. Scaiella, “Fast and accurate annotation of short texts withwikipedia pages,” IEEE software, vol. 29, no. 1, pp. 70–75, 2012.
[70] “Tagme,” https://sobigdata.d4science.org/web/tagme/, accessed: 2017-04-05.
[71] Y. Peng, D. Z. Wang, I. Patwa, D. Gong, and C. V. Fang, “Probabilistic ensemblefusion for multimodal word sense disambiguation,” in Multimedia (ISM), 2015 IEEEInternational Symposium on. IEEE, 2015, pp. 172–177.
[72] “Yago official website,” http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/, accessed: 2017-04-05.
102
BIOGRAPHICAL SKETCH
Yang Peng received his Bachelor of Science degree in computer science from Nanjing
University, Nanjing, China, in June 2012. He has been pursuing a Ph.D. degree in
computer science at the University of Florida since Fall 2012. His research interests
include data science, big data, knowledge bases, and multimodal fusion. He has worked
on several projects in large-scale data processing, information extraction, information
retrieval, word sense disambiguation, and knowledge base completion. He has served as a
session chair for IEEE ISM, a reviewer for WWW, and an external reviewer for SIGMOD,
VLDB, the VLDB Journal, IJCAI, and ICDE, among other venues. He also worked as a
software engineering intern on the Google Photos team at Google in Fall 2016.