


Under consideration for publication in Knowledge and Information Systems

F2ConText: How to Extract Holistic Contexts of Persons of Interest for Enhancing Exploratory Analysis

Md Abdul Kader1, Arnold P. Boedihardjo2 and M. Shahriar Hossain3

1 IBM Innovation Center, Austin, TX 78758
2 Radiant Solutions, Herndon, VA 20171
3 The University of Texas at El Paso, El Paso, TX 79968

[email protected], [email protected], [email protected]

Abstract. A wide variety of publicly available heterogeneous data has provided us with an opportunity to meander through contextual snippets relevant to a particular event or persons of interest. One example of a heterogeneous source is online news articles, where both images and text descriptions may coexist in documents. Many of the images in a news article may contain faces of people, and the names of many of those faces may not appear in the text. An expert on the topic may be able to identify people in images or at least recognize the context of faces that are not widely known. However, it is difficult as well as expensive to employ topic experts to label every face of a massive news archive. In this paper, we describe an approach named F2ConText that helps analysts build contextual information, e.g., named entity context and geographical context, for facial images found within news articles. Our approach extracts facial features of the faces detected in the images of publicly available news articles and learns probabilistic mappings between the features and the contents of the articles in an unsupervised manner. Afterwards, it translates the mappings to geographical distributions and generates a contextual template for every face detected in the collection. This paper demonstrates three empirical studies, related to the construction of context-based genealogies of events, the tracking of a contextual phenomenon over time, and the creation of contextual clusters of faces, to evaluate the effectiveness of the generated contexts.

Keywords: Exploratory Analysis; Image-Text Alignment; Geographical Context; Information Genealogy

Received 14 Feb 2017. Revised 09 Aug 2018. Accepted 15 Sep 2018.

Accepted on 15 Sep 2018. To appear in Knowledge and Information Systems.



1. Introduction

Building a mental model and establishing contextual phenomena is central to much exploratory analysis work, especially for improved situational awareness (Heuer 1999). Data analysts face many challenges while fusing disparate streams of data and rapidly prototyping event scenarios for quantitative predictions to help policy makers arrive at analytical conclusions (National Research Council 2002). Publicly available imagery data and textual information are widely leveraged during rescue missions, disaster management, surveillance, and in other scenarios to gather insights for making informed decisions. Although there are existing systems to aid exploratory analysis (e.g., see (IBM Analytics for a Safer Planet 2016, Palantir Gotham 2007, IN-SPIRE Visual Document Analysis 2014)), the growing volume of public data feeds and the evolving demand for analytical capabilities necessitate further aid in situational awareness.

This paper presents a framework called F2ConText (Face to Context using Text) that helps analysts build contextual templates for persons of interest using co-occurring images and textual data. A contextual template is composed of mappings between faces, names, geographical locations, and many other entities. While developing a contextual sense about events and persons of interest is a natural process for human beings, automatic generation of context using publicly available data to aid the analytic process is still a challenge due to the massiveness and multimodal nature of the datasets. Moreover, the heterogeneous feature elements that create a meaningful context are not well defined in most publicly available datasets. For example, the Wikipedia entry for David Cameron, the former Prime Minister of the United Kingdom, contains a few faces of his supporters whose contexts are unrelated to the description of how he uses his free time on the same page.

The images in publicly available news articles do not have labeled faces as found in many social media photos. With news archives containing hundreds of thousands of articles and images, analysts cannot label all the faces in the images manually. Additionally, many of the faces in the news images are of unknown people who may appear with other known persons of interest. For example, the names of the security personnel of the prime minister of a country may never appear in any news article, but the faces of the security team may be seen with the country's prime minister in an image of a news article. Theoretically, the prime minister and a security staffer will have the same context if the face of the security staffer appears in all the images where the prime minister appears. In reality, the prime minister's face will appear in more images than the face of a particular security person. This indicates that the context of the prime minister will span more articles than the context of a security staffer, which will result in similar but ultimately different contexts. The foundation of our system for generating context for faces is driven by this concept.
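The co-occurrence intuition above can be illustrated with a minimal sketch: estimate P(entity | face) by counting the entities of the articles in which a face appears. The function and the toy data below are ours, not the paper's actual model.

```python
from collections import Counter

def entity_given_face(face_to_articles, article_to_entities, face):
    """Estimate P(entity | face) from face/entity co-occurrence in articles.

    A face that appears in fewer articles (e.g., a security staffer) yields a
    narrower distribution than a widely covered face (e.g., a prime minister),
    even when the two always co-occur. Illustrative sketch only.
    """
    counts = Counter()
    for article in face_to_articles[face]:
        counts.update(article_to_entities[article])
    total = sum(counts.values())
    return {entity: c / total for entity, c in counts.items()}

# Toy data: the staffer appears only in a subset of the PM's articles.
face_to_articles = {"pm": ["a1", "a2", "a3"], "staff": ["a1", "a2"]}
article_to_entities = {
    "a1": ["pm_name", "staff_context"],
    "a2": ["pm_name", "staff_context"],
    "a3": ["pm_name", "summit"],
}
pm_ctx = entity_given_face(face_to_articles, article_to_entities, "pm")
staff_ctx = entity_given_face(face_to_articles, article_to_entities, "staff")
```

The two contexts overlap heavily but are not identical: only the more widely covered face picks up the additional entity.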

Situational awareness requires harnessing high-quality contextual information regarding the location of people of interest. In many instances the location information of a known person is apparent from her/his presence on social media like LinkedIn, Twitter, and Facebook. Unfortunately, faces of people in the images of news articles are not name-tagged. Moreover, the location context of a person is broader than the name of the person's current location or birthplace. A person from one country may be discussed widely in another country. The geographical context can be the distribution of the degree of association of a person with all locations. This paper describes a methodology to generate a geographical context at the country level for every face detected in every image of a news corpus. The benefit of the ability to generate country-level geographical context is two-fold. First, the geographical context answers the "where" question during an analysis relevant to insurgency or any event of interest. Second, a geographical context is traceable over time, which allows analysts to study how the focus of a person of interest changes over time and over locations. The proposed solution does not rely on any coordinate data; rather, it generates a geographical context for every face image using news content and a publicly available city and country list. The experiments discussed in this paper exclusively use open-source and publicly available data.
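As a rough sketch of the idea, location-entity weights can be folded into a country-level distribution using a city-to-country list; the weighting scheme and the tiny lookup table below are assumptions for illustration, not the paper's exact procedure.

```python
def geographic_context(location_mentions, city_to_countries):
    """Convert location-entity weights into a country-level distribution.

    Ambiguous place names (the same city name in several countries) split
    their weight across all matching countries; the result is normalized so
    the values form a probability distribution. Illustrative only.
    """
    scores = {}
    for place, weight in location_mentions.items():
        countries = city_to_countries.get(place, [])
        for country in countries:
            scores[country] = scores.get(country, 0.0) + weight / len(countries)
    total = sum(scores.values()) or 1.0
    return {c: s / total for c, s in scores.items()}

city_to_countries = {
    "london": ["united kingdom", "canada"],  # ambiguous city name
    "windsor": ["united kingdom"],
}
ctx = geographic_context({"london": 0.6, "windsor": 0.4}, city_to_countries)
# "united kingdom" accumulates 0.3 + 0.4 = 0.7; "canada" gets 0.3
```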

1.1. Associated Analytic Challenges

Most existing analytic tools (e.g., Entity Workspace (Bier et al. 2006), Jigsaw (Stasko et al. 2008), NetLens (Kang et al. 2007), and Sentinel Visualizer (FMS Inc. 2017)) require the development of context in the mind of the analyst through extensive manual exploration of the data and connection-building effort. The challenges analysts grapple with in the contextual analysis of people of interest using image and text data are outlined below:

Challenge 1: Limited or no knowledge about the images. Associated images in news articles and many other public feeds do not have labeled information. The lack of labels and metadata makes querying difficult, which results in the manual effort of tagging the faces with textual comments and remembering images for future use.

Challenge 2: Lack of mappings between image and text. Intelligence analysts struggle to connect text granules with images in documents containing both text and images. While existing software tools help detect entities (e.g., person and organization) from text (OpenNLP 2017, Alias-i. 2011), detect faces in images (Viola & Jones 2001), and extract facial features from the detected faces (Turk & Pentland 1991), the task of mapping the extracted faces to text granules to provide a sense of context remains unsolved.

Challenge 3: Connecting content-driven context to location. For many decision-making tasks, analysts are required to represent the contexts discovered from the content of the documents in a different space. For example, the location entities found in a document might not always be countries; the locations can be region names, towns, cities, or even organizations that are representative of some areas. Additionally, the same region name can be present in multiple countries. How can an analyst quickly retrieve a geographical context at the country level to disambiguate locations or to down-scope the analysis to a particular part of the globe?

Challenge 4: Lack of support for contextual grouping of people of interest. Detecting groups of people with similar activities, or contexts containing suspicious entities of interest, is key to many intelligence analysis tasks in order to reveal the latest social associations (United States Government 2009). The lack of software support for detecting such contextual groupings of a potential pool hampers the generation of high-quality intelligence in a timely manner.

Challenge 5: Lack of support to study the evolving nature of context. Another limitation is the lack of algorithmic support to assist analysts in coping with the changing nature of the reasoning tasks they routinely tackle (Tecuci et al. 2010). In the space of contextual analysis, reasoning requires an extensive understanding of how each context evolves over time. The current literature lacks such a context-tracking mechanism.

Challenge 6: Complex nature of the diffusion of events. Detecting the evolution of the context of an entity provides the ability to track individual contexts. However, the study of an event is more complex than the analysis of a specific context because an event is a composition of interactions between many individuals. The rapid growth of textual and imagery data makes it quite challenging for analysts to trace the genealogy of all actors involved in an event of interest.



[Figure image omitted. Panel contents: person-context entities such as david_cameron, queen_elizabeth_ii, and samantha_cameron; location-context entities such as kingham, britain, and sarsden; a probability bar chart (0.00-0.60) over countries from Afghanistan to Vietnam.]

Figure 1. Generated contexts of a face. From left to right: the face image for which contexts are generated, the person context, the location context, the geographical context as a bar chart, and the geographical context laid on a map. The geographical context shows that the United Kingdom has the highest probability for the former British Prime Minister, David Cameron.

1.2. Contributions

The challenges outlined above motivate our context generation mechanism. The specific contributions of this paper are:

– Our framework leverages a mechanism to enhance facial features by complementing state-of-the-art techniques. These features are later connected to entities using a probabilistic model to avoid manual labeling.

– We formulate and solve the problem of holistically mapping faces to textual entities (e.g., person and location) to build a context for each face detected in the images of a news archive.

– F2ConText generates a geographical context at the country level for each face found in each of the images of a news archive. None of the state-of-the-art methods for generating geographical context has the ability to compute such a nontrivial mapping between faces and geography.

– We demonstrate that the generated contexts help identify meaningful contextual clusters of faces.

– We demonstrate that a geographical context generated by our framework is traceable over a timeline. This allows analysts to reason about how the geographical context of a certain image may evolve over time.

– We introduce a new event summarization mechanism that leverages text and images to explain the diffusion and evolution of events as chains of documents. The proposed method traverses a similarity network of news articles without materializing the network entirely, constraining consecutive documents with a cohesion threshold, a context overlap requirement, and temporal ordering.
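The chain-construction idea in the last contribution can be sketched as a greedy traversal that generates candidate successors on demand, so the similarity network is never materialized. Parameter names and the selection heuristic here are our simplifications, not the paper's algorithm.

```python
def build_story(docs, start, end, similarity, min_cohesion, overlap, min_overlap):
    """Greedily grow a chain of documents from `start` to `end`.

    Each hop must move forward in time, meet a cohesion threshold on document
    similarity, and satisfy a context-overlap requirement. Candidates are
    filtered on the fly rather than precomputing the full similarity network.
    """
    chain, current = [start], start
    while current is not end:
        candidates = [d for d in docs
                      if d["time"] > current["time"]
                      and similarity(current, d) >= min_cohesion
                      and overlap(current, d) >= min_overlap]
        if not candidates:
            return None  # no admissible continuation
        # steer toward the end document: pick the candidate most similar to it
        current = max(candidates, key=lambda d: similarity(d, end))
        chain.append(current)
    return chain

def jaccard(a, b):
    """Set-overlap similarity, standing in for both cohesion and overlap."""
    return len(a["words"] & b["words"]) / len(a["words"] | b["words"])

docs = [
    {"id": 0, "time": 0, "words": {"talks", "border"}},
    {"id": 1, "time": 1, "words": {"border", "treaty"}},
    {"id": 2, "time": 2, "words": {"treaty", "vote"}},
    {"id": 3, "time": 3, "words": {"treaty", "vote", "signing"}},
]
story = build_story(docs, docs[0], docs[3], jaccard, 0.2, jaccard, 0.2)
```

On the toy corpus the traversal bridges the start and end documents through an intermediate article, skipping the document that adds nothing toward the end.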

Figure 1 shows a sample contextual template generated by F2ConText. The figure shows the person context, the location context, and the generated geographical context of the former British Prime Minister, David Cameron.

1.3. Problem Formulation

Let the collection {A, E, I, F, R_AE, R_AI, R_IF} be a set of articles containing images I = {i_1, i_2, ..., i_|I|} and textual descriptions A = {a_1, a_2, ..., a_n}. Text descriptions may contain entities from E = E^N ∪ E^L = {e_1, e_2, ..., e_|E|}, where E^N is the set of person entities and E^L is the set of location entities. F = {f_1, f_2, ..., f_|F|} is the set of faces extracted from the images I. R_AE represents entities within articles, {{a_q, e_r} : a_q ∈ A, e_r ∈ E}. Similarly, R_AI is the set of relationships {{a_q, i_r} : a_q ∈ A, i_r ∈ I} and R_IF is the set of relationships {{i_q, f_r} : i_q ∈ I, f_r ∈ F}, representing images within articles and faces of people within images, respectively. The task of context generation in F2ConText is two-pronged. For each face f ∈ F,

1. generate a person context C^N(f) = {ψ_1(f), ψ_2(f), ..., ψ_|E^N|(f)}, where ψ_r(f) is a tuple {{f, e_r, P(e_r|f)} : f ∈ F, e_r ∈ E^N} and P(e_r|f) is the probability of a person entity e_r given a face f. In practice, C^N(f) is arranged in descending order and records only a feasible number of probabilities. Similar to the person entity context, generate the location context C^L(f) = {φ_1(f), φ_2(f), ..., φ_|E^L|(f)}, where φ_q(f) = {{f, e_q, P(e_q|f)} : f ∈ F, e_q ∈ E^L} and P(e_q|f) is the probability of a location entity e_q given a face f.

2. generate a geographical context, D(f) = {d_1(f), d_2(f), ..., d_m(f)}, as a probability distribution over m countries.
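The formulation can be mirrored by a pair of skeletal data structures. Field names are ours; this is only a container sketch for orientation, not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class Corpus:
    """The input collection: articles A, entities E = E^N ∪ E^L, images I,
    faces F, and the relations R_AE, R_AI, and R_IF."""
    articles: Set[str]                        # A
    person_entities: Set[str]                 # E^N
    location_entities: Set[str]               # E^L
    images: Set[str]                          # I
    faces: Set[str]                           # F
    article_entities: Set[Tuple[str, str]]    # R_AE: (article, entity)
    article_images: Set[Tuple[str, str]]      # R_AI: (article, image)
    image_faces: Set[Tuple[str, str]]         # R_IF: (image, face)

@dataclass
class FaceContext:
    """The two-pronged output for one face f: entity contexts C^N(f) and
    C^L(f) as (entity, P(entity | f)) pairs in descending order, and the
    geographical context D(f) as a country-level distribution."""
    person_context: List[Tuple[str, float]]   # C^N(f)
    location_context: List[Tuple[str, float]] # C^L(f)
    geographic_context: Dict[str, float]      # D(f)
```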

2. Related Research

This section outlines the literature relevant to the research and experiments presented in this paper.

Context extraction for images: Since deep learning techniques perform extraordinarily well for object detection (Szegedy et al. 2013), a new window has opened for researchers to describe the human interactions and semantic relationships among the objects of an image (Karpathy & Fei-Fei 2015, Yao et al. 2010). All these methods generate text descriptions of images based on a local contextual understanding of image fragments, commonly called objects. A few attempts that map image fragments to a limited number of words (Socher & Fei-Fei 2010, Karpathy et al. 2014) suffer from the limitation of requiring labeled training data. Dictionaries of visual words (visual elements extracted from an image) combined with text features have been used to enhance semantic representations of words (Bruni et al. 2014, 2011). While such enhancements give better word representations, these systems do not generate textual words for the visual elements of the images, in contrast to our approach.

While there have been significant efforts in describing images by summarizing relationships of image fragments, contextual face annotation (Tian et al. 2007) has received less attention. The authors in (Chen et al. 2003) proposed a semi-automated annotation technique for faces that leverages similarity-based search and relevance-feedback concepts to annotate photos of a personal collection. In (Choi et al. 2010, 2008), faces are automatically annotated and clustered based on visual and geo-temporal contexts for a personal photo collection. A more robust and flexible approach to face annotation is the use of auxiliary textual information (Feng & Lapata 2008) for automatic annotation. Image captions are used in various ways (Guillaumin et al. 2012, Feng et al. 2004) to annotate images and faces. All these approaches rely on labeled information, coordinates associated with images, and text near the images (e.g., captions). Instead of tagging faces with names, our approach provides a holistic context for each face as a probability distribution that maps a face to textual entities and geography. Moreover, our approach does not require any labeled information or metadata about the images.

In (Wang et al. 2014, Le & Satoh 2008), an unsupervised approach for automatic face annotation is introduced that retrieves a short list of weakly labeled faces based on a textual query. A limitation of this approach is that the training data is prepared based on queries composed of known names. Our approach is more robust in that it does not assume that the name of the person is present in the dataset. Our approach generates a context rather than explicitly providing a name tag. Context extracted from social networks substantially increases face recognition quality (Stone et al. 2008) for auto-tagging faces in personal photographs. The requirement of a social presence for all the people limits the capability of this approach. Our approach, on the other hand, is able to generate a context for each face from publicly available unstructured documents.

While most of the literature studies local context in images and maps faces to a few keywords or names, our approach derives its strength from providing the context of a face as a probabilistic ordering of all entities found in the entire dataset, as well as a context represented as a geographical distribution.

Face-feature extraction: We leveraged existing techniques for face-feature extraction. Our method uses the deep convolutional network-based face embedding technique FaceNet (Schroff et al. 2015), which produces state-of-the-art face embeddings in a compact Euclidean space. Other popular image feature extraction methods like Eigenface (Turk & Pentland 1991), Fisherface (Lee et al. 2001), Local Binary Patterns (Ahonen et al. 2009), and Curvelet (Rahman et al. 2012) work well for face recognition if there are enough labeled faces for many orientations of a person. There are rotation-invariant face detection techniques (Huang, Ai, Li & Lao 2007, Wu et al. 2004, Rahman et al. 2010) that provide good results in recognizing faces but do not provide high-quality descriptors of the faces that can be leveraged for mapping features to a large number of possible labels. Hassner et al. (2015) propose a face frontalization technique that produces frontal views of non-front-facing faces. It assumes a single 3D facial shape as an approximation to the shape of all faces. As opposed to the approach of Hassner et al., we generate extended features based on a frontalization technique that does not assume a single 3D template; rather, the technique uses facial key points to determine face orientations. Some researchers have also designed algorithms to generate rotation-invariant image descriptors using holistic Fourier features (Ahonen et al. 2009, Lai et al. 2001), but those methods do not outperform the recently developed deep learning-based mechanisms for face recognition (Taigman et al. 2014, Parkhi et al. 2015).

Aligning image and text: There are different approaches to generating contexts of images using textual information that co-exists with the images. The simplest approach uses the full text of a document as the context of an image (Kalva et al. 2007). Some systems leverage the text in the neighborhood of the image as the context (Fauzi et al. 2009, Yong-hong et al. 2005, Cai et al. 2003). The limitation of these approaches is the assumption that the image and its contextual information co-exist in a single document. Our application offers a holistic approach to generating contexts of faces using information from all documents, instead of limiting the context to a local view of the documents where the faces were found.

Geographical context: Geo-tags of documents and users' locations are sometimes leveraged as key features of geographical contexts in news recommendation systems (Son et al. 2013). Location information and preferences captured from hand-held devices are widely used as contexts to recommend points of interest (Li, Cong, Li, Pham & Krishnaswamy 2015, Park et al. 2007). The extraction of geographical context from textual resources has been studied in a few frameworks. These systems use different strategies, such as building a contextual dictionary for all cities using Wikipedia data (Mishra et al. 2010), computing geographic scope through ranking algorithms (Silva et al. 2006), and modeling locations using geotagged Twitter data (Kinsella et al. 2011). Our context generation is nontrivial because we compute the geographical scope of each of the faces detected in the images of a news archive as a probability distribution over all countries of the world. The geographical context generated in the form of a probability distribution of countries overlaid on a map helps analysts build a mental model of the scope of a person of interest.

Genealogy of events: The detection and summarization of events that progress over time have been studied in a variety of domains like computer systems (Koch et al. 2010), intelligence analysis (Fisichella et al. 2010), and surveillance systems (Little et al. 2013). Since the definition of an event varies largely among domains, a range of techniques has been proposed in the literature, including frequent itemset mining (Chakrabarti et al. 1998), episode mining (Laxman et al. 2005), graph search (Xu & Lu 2015), and temporal event clustering (Gung & Kalita 2012), to name a few. The graph-based model proposed by Xu and Lu (Xu & Lu 2015) to identify the most representative images has some resemblance to our work in that it provides a visual summary of an event using images from various social media. However, our method summarizes events as stories of several sub-events over time using both the textual and visual contents of the documents.

Network exploration: The core algorithm of the time-evolving summarization solves a "connecting the dots" problem, which has appeared in the literature before in a variety of guises in entity networks (Fang et al. 2011, Hossain, Butler, Boedihardjo & Ramakrishnan 2012), image collections (Heath et al. 2010), social networks (Faloutsos et al. 2004), and document collections (Das-Neves et al. 2005, Kumar et al. 2006, Shahaf & Guestrin 2010, Kader, Naim, Boedihardjo & Hossain 2016). While some of these efforts can be adapted to our problem context, our time-evolving summarization of events emphasizes amalgamating heterogeneous information (text and image features) in the summarization process, whereas the above efforts typically require a stronger connecting thread between the nodes of a homogeneous network.

3. Methodology

F2ConText uses three operational stages to generate contexts for faces: (1) feature extraction and modeling, (2) generation of entity-based contexts, namely the person context (C^N(f)) and the location context (C^L(f)), and (3) construction of the geographical context, D(f), as a probability distribution over all countries.

The following subsections describe these computational stages. For the convenience of the reader, we provide a list of the most frequently used symbols in Table 1.

3.1. Feature Extraction and Modeling

F2ConText requires the extraction of features from both the images and the text that coexist in the documents of a collection.

3.1.1. Facial Feature Generation in F2ConText:

We leverage Convolutional Neural Network (CNN) based state-of-the-art face detection approaches (Zhang et al. 2016, Li, Lin, Shen, Brandt & Hua 2015) to detect faces in the set of images I. The face detection model is a deep cascade architecture built on convolutional neural networks. We use a pre-trained model that jointly performs face detection and alignment using multi-task cascaded convolutional networks (Zhang et al. 2016).

Table 1. List of frequently used symbols

Symbol      Description
–           Collection of articles containing images and text
A           Set of textual documents
I           Set of images
F           Set of faces
E           Set of entities
C^N(f)      Person context for face f
C^L(f)      Location context for face f
D(f)        Geographical context for face f
D1(f)       Geographical context for face f based on person name entities
D3(f)       Geographical context for face f based on location entities
D^T(f)      Ground-truth geographical context for face f
D^B(f)      Geographical context of the baseline method for face f
�           User-settable parameter for adjusting the sensitivity of entity-level context
θ           Maximum allowed document distance for finding a story
τ           Maximum allowed distance for face context between documents

Popular feature extraction methods like Eigenface (Turk & Pentland 1991), Fisherface (Lee et al. 2001), and Local Binary Patterns (Ahonen et al. 2009), commonly used in face recognition, perform reasonably well when most of the faces are front-facing. Most of the images in a news corpus are not taken in a studio or laboratory environment, and the detected faces are not always front-facing. Recently, deep learning-based face embedding approaches (Schroff et al. 2015, Sun et al. 2014) have started outperforming traditional facial feature extraction methods. We use a pre-trained model of FaceNet (Schroff et al. 2015), a CNN-based face embedding framework, to extract face features in a low-dimensional Euclidean space. Since many of the faces in the images of the datasets are side-facing, we use a frontalization method to capture the positions of some key facial points in a projected plane where the side-facing photo represents a front-posing face. Our frontalization method estimates a 3D face for each of the faces, whereas the technique in (Hassner et al. 2015) assumes a single 3D face shape for all faces. Unlike (Hassner et al. 2015), we frontalize facial key points so that facial features can be extracted from the angles of the facial points and the distances between them. Frontalization of facial key points enables us to enhance the lower-dimensional CNN space by including the additional frontalization features.

Facial Key Points Detection: We detect five facial key points (two eye centers, the nose, and two mouth corners, as shown in Figure 2) using a pre-trained cascaded deep convolutional neural network (Sun et al. 2013). The model takes linear time in terms of the number of faces. It was trained using the LFW dataset (Huang, Ramesh, Berg & Learned-Miller 2007). The neural network utilizes texture information over the entire face and the geometric constraints among key points to achieve high accuracy. Since a face can be in any pose in an image, we need to frontalize the facial key points to be able to extract the actual geometric properties of a face. Even before the frontalization, it is necessary to estimate the angles by which a face deviates from a front-facing position.
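Once the five key points are available, geometric properties can be derived directly from them. As a hedged illustration, the sketch below computes pairwise distances and orientation angles; the paper's actual 30-feature recipe is described later and may differ.

```python
import math
from itertools import combinations

KEY_POINTS = ["left_eye", "right_eye", "nose", "left_mouth", "right_mouth"]

def geometric_features(points):
    """Distances and orientations between the five facial key points.

    `points` maps key-point names to (x, y) coordinates. The ten pairwise
    distances plus the ten pairwise orientation angles form a 20-dimensional
    descriptor. Illustrative only; not the paper's exact feature set.
    """
    feats = []
    for a, b in combinations(KEY_POINTS, 2):
        (x1, y1), (x2, y2) = points[a], points[b]
        feats.append(math.hypot(x2 - x1, y2 - y1))   # pairwise distance
        feats.append(math.atan2(y2 - y1, x2 - x1))   # orientation angle
    return feats

points = {"left_eye": (0.0, 0.0), "right_eye": (4.0, 0.0),
          "nose": (2.0, 2.0), "left_mouth": (1.0, 4.0),
          "right_mouth": (3.0, 4.0)}
feats = geometric_features(points)
```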



Figure 2. Detected eyes, nose, and mouth corner points using the cascaded deep convolutional neural network for a few faces.

Figure 3. Facial key-points frontalization.

Angle Prediction: We use a Generalized Linear Model (GLM) to predict the vertical and side angles of a face. We generate a synthetic dataset using six 3D-face models, created from six face images covering various ethnicities (e.g., Caucasian, Scandinavian, Japanese). Since the dimension of the faces in the knowledge base is 100 × 100 pixels, we imagine a cube of 100 × 100 × 100 for the 3D face models where the center of the head is at the origin (0, 0, 0). Five facial key points of the faces are then marked on these models. We made three assumptions for simplicity in placing the five key points in the models.

1. The facial key points for the eye pair and the mouth corners are in the same plane, which is +35 units ahead of and parallel to the xz-plane.
2. The facial key point for the nose is in a plane P parallel to the xz-plane and +15 units ahead of the plane of the mouth corners and eye pair.
3. The nose point π_nose = (0, 50, 0) is fixed for all the models, but the eye pair and mouth corner points are placed by maintaining the relative distances of those points.

To create a synthetic training dataset of faces with known facial key points and known angles, we rotate the 3D models by varying angles around the z-axis and the x-axis and project them back on a 2D xz-plane. We rotate the models from −45° to +45° at 3° intervals around the z-axis and from −15° to +15° at 5° intervals around the x-axis. This produces a training set of 1302 instances. Each instance consists of 30 features as described in the "Face Feature Generation" part later. The two class labels are the azimuth angle az around the z-axis and the elevation angle el around the x-axis. We train two GLM models using these two sets of class labels. After the training, the models can predict the vertical and side angles for any five facial key points of a face. This enables us to apply frontalization by those predicted angles.
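The rotate-and-project grid described above can be sketched in a few lines of numpy. This is an illustrative reading, not the authors' code: the six key-point models below are random placeholders, and we assume a standard right-handed rotation convention with orthographic projection onto the xz-plane.

```python
import numpy as np

def rotate(points, axis, angle_deg):
    # Rotate 3D point(s) around the x- or z-axis by angle_deg degrees.
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    if axis == 'z':
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    else:  # 'x'
        R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return np.asarray(points) @ R.T

def synthetic_angle_dataset(models):
    # Rotate each 5x3 key-point model over the stated angle grid and
    # orthographically project onto the xz-plane (drop the y coordinate).
    samples = []
    for model in models:
        for az in range(-45, 46, 3):      # 31 azimuth angles around the z-axis
            for el in range(-15, 16, 5):  # 7 elevation angles around the x-axis
                rotated = rotate(rotate(model, 'z', az), 'x', el)
                projected = rotated[:, [0, 2]]
                samples.append((projected, az, el))
    return samples

# Six placeholder 3D key-point models -> 6 * 31 * 7 = 1302 instances.
models = [np.random.uniform(-50, 50, size=(5, 3)) for _ in range(6)]
dataset = synthetic_angle_dataset(models)
```

Each sample pairs the projected 2D key points with the known (az, el) labels, matching the 1302-instance count stated in the text.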


Facial Key Points Frontalization: We predict the azimuth angle (side angle) az and the elevation angle el for each detected face by applying the GLMs to the extracted facial key points. Let ρ = {ρ1, ρ2, ..., ρ5} be the set of five facial key points extracted from a face. Algorithm 1 describes the steps for frontalizing the set of key points ρ. The algorithm uses a function Rotate that rotates a 3D point around a particular axis by a certain angle. Figure 3 shows a sample of frontalization applied on five facial key points.

Algorithm 1: Facial Key Point Frontalization

Input: ρ, π_nose, P, az, el
Output: ρ^frontalized

1: P′ ← plane P rotated by az around the z-axis, then by el around the x-axis
2: π′_nose ← Rotate(Rotate(π_nose, axis = z, az), axis = x, el)
3: for all ρ_k ∈ ρ do
4:   ρ′_k ← ρ_k + (π′_nose_x, π′_nose_z) − ρ_3
5:   L_k ← line perpendicular to the xz-plane going through ρ′_k
6:   t_k ← intersecting point between the plane P′ and the line L_k
7:   t′_k ← Rotate(Rotate(t_k, axis = x, el), axis = z, az)
8:   ρ^frontalized_k ← t′_k orthogonally projected on the xz-plane
9: end for
10: return ρ^frontalized
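Algorithm 1 can be sketched in numpy as follows. This is an illustrative reading under stated assumptions: plane P passes through the nose point and is parallel to the xz-plane before rotation, rotations follow a standard right-handed convention, and the back-rotation in step 7 is taken to undo the forward rotation (negated angles, reversed order).

```python
import numpy as np

def rotate(p, axis, angle_deg):
    # Rotate a 3D point around the x- or z-axis by angle_deg degrees.
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    if axis == 'z':
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    else:  # 'x'
        R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return R @ np.asarray(p, dtype=float)

def frontalize(rho, pi_nose, az, el):
    # rho: five detected key points as (x, z) pairs; rho[2] is the nose point.
    normal = rotate(rotate([0.0, 1.0, 0.0], 'z', az), 'x', el)  # normal of P'
    nose_r = rotate(rotate(pi_nose, 'z', az), 'x', el)          # pi'_nose
    out = []
    for (x, z) in rho:
        # Step 4: translate so the detected nose aligns with the rotated nose.
        xp = x + nose_r[0] - rho[2][0]
        zp = z + nose_r[2] - rho[2][1]
        # Steps 5-6: the vertical line through (xp, ., zp) meets P' where
        # normal . (p - nose_r) = 0; solve for the y coordinate.
        y = nose_r[1] - (normal[0] * (xp - nose_r[0])
                         + normal[2] * (zp - nose_r[2])) / normal[1]
        # Step 7: rotate back to the frontal pose.
        t = rotate(rotate(np.array([xp, y, zp]), 'x', -el), 'z', -az)
        # Step 8: orthogonal projection on the xz-plane.
        out.append((t[0], t[2]))
    return out
```

For az = el = 0 and a nose-centered face, the sketch reduces to the identity, which is the expected behavior for an already frontal face.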

Face Feature Generation: An important criterion in building a good feature set of a face is that the produced features vary little for a particular person under different lighting conditions, expressions, and occlusions. Our methods for facial key point generation and frontalization aid in generating features with such properties. We generate two sets of features from the frontalized facial key points. The first set is composed of the pairwise relative distances between all five facial key points, and the second set comprises twenty

Figure 4. Twenty angles generated from the five frontalized facial key points.


angles as shown in Figure 4. The combination of these two sets of features makes a feature vector of length thirty, where the first ten are the pairwise relative distances and the last twenty are the angles.

In addition to these thirty features, we leverage a pre-trained convolutional neural network based face embedding technique (Schroff et al. 2015) that produces a 128-dimensional embedding for each face. We did not consider other embedding dimensionalities because (Schroff et al. 2015) demonstrates that the optimal embedding dimension is 128. The thirty features extracted using the frontalized key points are concatenated with this 128-dimensional embedding, forming a vector of length 158 for each face.
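The assembly of the 158-dimensional vector can be sketched as follows. The 10 pairwise distances follow directly from the five key points; the paper fixes a specific set of 20 angles in Figure 4, which we cannot recover here, so as a stand-in the sketch takes two angles from each of the C(5,3) = 10 key-point triangles. The zero embedding below is a placeholder for a real FaceNet output.

```python
import numpy as np
from itertools import combinations

def face_feature_vector(key_points, embedding):
    # key_points: five frontalized (x, z) key points; embedding: 128-d vector.
    pts = np.asarray(key_points, dtype=float)
    # 10 pairwise distances, made relative (scale-free) via the largest one.
    dists = np.array([np.linalg.norm(pts[i] - pts[j])
                      for i, j in combinations(range(5), 2)])
    dists = dists / dists.max()
    # 20 angles: two per triangle of key points (an assumed stand-in for the
    # specific angle set of Figure 4).
    angles = []
    for i, j, k in combinations(range(5), 3):
        for a, b, c in ((i, j, k), (j, i, k)):
            u, v = pts[b] - pts[a], pts[c] - pts[a]
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    # 10 + 20 + 128 = 158 features in total.
    return np.concatenate([dists, angles, np.asarray(embedding, dtype=float)])

vec = face_feature_vector(
    [(-10.0, 10.0), (10.0, 10.0), (0.0, 0.0), (-8.0, -10.0), (8.0, -10.0)],
    np.zeros(128))
```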

3.1.2. Entity Extraction and Document Modeling:

F2ConText combines the outputs of a number of entity extractors, including LingPipe (Alias-i. 2011), OpenNLP (OpenNLP 2017), and Stanford NER (Finkel et al. 2005), to identify entities within the textual contents of the articles in A (Kader, Boedihardjo, Naim & Hossain 2016). Although we extracted all standard entity types, including person name, organization, and location, this paper scopes down the analysis to person and location entities only, especially because the images are explained using detected human faces. The weight of a person-name entity e ∈ E^N in an article a ∈ A is computed as:

W^N(e, a) = \frac{(1 + \log tf_{e,a}) \log\frac{|A|}{af_e}}{\sqrt{\sum_{e' \in E^N_a} \left( (1 + \log tf_{e',a}) \log\frac{|A|}{af_{e'}} \right)^2}}    (1)

where tf_{e,a} is the frequency of entity e in article a, af_e is the number of articles containing a connection with entity e, and E^N_a is the set of person entities that are connected to article a. Equation 1 is a variant of tf-idf modeling with cosine normalization (Hossain, Butler, Boedihardjo & Ramakrishnan 2012, Hossain, Gresock, Edmonds, Helm, Potts & Ramakrishnan 2012). The articles in A have descriptions of different sizes. In general, longer descriptions have higher term frequencies because many terms are repeated. The cosine normalization helps lessen the impact of the size of the descriptions in the modeling. The weight, W^L(e, a), of a location entity e ∈ E^L in an article a ∈ A is calculated using the same formula as Equation 1.
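A minimal sketch of the cosine-normalized tf-idf weighting of Equation 1, with hypothetical entity counts; `tf` maps each entity to its frequency in the article and `af` to its article frequency across the collection.

```python
import math

def person_entity_weights(tf, af, num_articles):
    # Cosine-normalized tf-idf weights for the person entities of one article
    # (Eq. 1 sketch). Assumes every entity in tf also appears in af.
    raw = {e: (1.0 + math.log(tf[e])) * math.log(num_articles / af[e])
           for e in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {e: w / norm for e, w in raw.items()}

# Hypothetical counts for two entities in a single article.
weights = person_entity_weights(
    tf={'obama': 4, 'putin': 1}, af={'obama': 120, 'putin': 30},
    num_articles=54371)
```

The normalization makes each article's weight vector unit length, so long and short descriptions contribute comparably, as discussed above.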

3.2. Generation of Entity based Context

F2ConText generates two separate contexts, one using person name entities and the other using location entities, for each face f ∈ F. The person context C^N(f) of a face f is expressed as a probability distribution over the set of person entities E^N. Similarly, the location context C^L(f) is expressed as a probability distribution over the set of location entities E^L. Since the mechanisms to generate the person context C^N(f) and the location context C^L(f) are similar, we present the process for computing the person context only. The probability of an entity e ∈ E^N for a given face f can be expressed as

P(e \mid f) \propto P(e) \times P(f \mid e)    (2)

where P(e) is the prior probability of e and P(f|e) is the likelihood. The representation of Equation 2 is analogous to Bayesian classification for text (Manning & Schütze 2008). Our process utilizes the same Bayesian principle but differs from text classification because the probabilistic connections between words and labels in text classification are replaced in our case by connections between facial features and entities. Let f = {f^1, f^2, ..., f^V}


be the feature representation of face f obtained by the method described in Section 3.1. V is the length of the feature vector of f. With an assumption of independence between the features, we can rewrite Equation 2 as

P(e \mid f) \propto P(e) \times \prod_{l=1}^{V} P(f^l \mid e)    (3)

A person entity e can appear in multiple articles of A. Let A_e ⊆ A be the set of articles that contain entity e. Each article a ∈ A_e may in turn contain a number of faces (as expressed by R_AI). Let F_a ⊆ F be the set of faces in article a. The relationships between the face features and the person-name entities in the articles can be computed from the entity weights W^N(e, a) and the face-feature weights in the articles. The likelihood of the lth feature of a face given an entity e can be computed as

P(f^l \mid e) = \left( \frac{\sum_{a \in A_e} P(f^l \mid a) \times W^N(e, a)}{\sum_{a \in A_e} W^N(e, a)} \right)^{f^l}    (4)

where P(f^l | a), which is calculated by Equation 5, is the probability of the face feature f^l given article a.

P(f^l \mid a) = \frac{\sum_{\phi \in F_a} \phi^l}{\sum_{\phi \in F_a} \sum_{l'=1}^{V} \phi^{l'}}    (5)

where φ^l is the lth feature of a face φ ∈ F_a. Now, replacing P(f^l | e) of Equation 3 with this expression, we obtain

P(e \mid f) \propto P(e) \times \prod_{l=1}^{V} \left( \frac{\sum_{a \in A_e} P(f^l \mid a) W^N(e, a)}{\sum_{a \in A_e} W^N(e, a)} \right)^{f^l}    (6)

So that a small value of any of the factors of the multiplication does not zero-out the overall probability, we take logarithms on both sides of Equation 6 to convert the multiplications into summations. The resulting equation is provided below.

LL(e, f) = \log P(e \mid f) \propto \log P(e) - \sum_{l=1}^{V} \left( f^l \times \log \sum_{a \in A_e} W^N(e, a) \right) + \sum_{l=1}^{V} \left( f^l \times \log \sum_{a \in A_e} P(f^l \mid a) \times W^N(e, a) \right)    (7)

We generate a context for each face using Equation 7, which produces a probability distribution over all entities for each face f. In practice, we do not record the full probability distributions; rather, we keep a record of at most L_C entities with the highest probabilities as the context of a face. A higher value of the probability P(e|f) indicates a stronger contextual connection of entity e with face f, whereas a lower value suggests otherwise.
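The per-entity score of Equation 7 can be sketched as a log-sum over the articles containing that entity. The toy data below (entity names, article ids, probabilities, and prior) are purely illustrative, and the sketch assumes every per-article feature sum is positive so the logarithms are defined; the top-L_C entities by this score would form the face's context.

```python
import math

def log_likelihood(face, entity_articles, weight, p_feature_given_article, prior):
    # LL(e, f) for one entity (Eq. 7 sketch). face: feature vector f;
    # entity_articles: articles containing e; weight[a]: W(e, a);
    # p_feature_given_article[a][l]: P(f^l | a); prior: P(e).
    total_w = sum(weight[a] for a in entity_articles)
    score = math.log(prior)
    for l, fl in enumerate(face):
        num = sum(p_feature_given_article[a][l] * weight[a]
                  for a in entity_articles)
        score += fl * (math.log(num) - math.log(total_w))
    return score

# Toy example: the face's features match entity e1's articles better than e2's.
face = [1.0, 0.2]
ll1 = log_likelihood(face, ['a'], {'a': 1.0}, {'a': [0.9, 0.1]}, prior=0.5)
ll2 = log_likelihood(face, ['b'], {'b': 1.0}, {'b': [0.1, 0.9]}, prior=0.5)
```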


Figure 5. Location ambiguities in the dataset. (a) The location context of John Doe, a hypothetical person: each prominent location entity (Waterloo, Sydney, Moscow, Auckland, Lebanon) maps to multiple countries, e.g., Waterloo exists in Canada, Belgium, and Ireland, and Lebanon is both a US city and a country. (b) Percentage of faces from the NY Times dataset whose location context contains at least M ambiguous city names, plotted against the context size K for M = 1 to 25.

3.3. Geographical Context Generation

The location context C^L(f) is generated leveraging the same method used to generate the person context C^N(f). Analysts generally have deep knowledge about the actors of relevant events, but location entities are difficult to interpret because they might contain village, city, county, or even community names. To aid an analyst with a more abstract sense of location context, our framework generates a geographical context in the form of a country distribution, which demonstrates the prominence of countries as seen in a document.

Our observation is that more than 85% of the articles in The New York Times dataset contain at least three ambiguous city names. As a result of such ambiguities, the location entity based context is not sufficient to generate a geographical context. Figure 5(a) shows that each of the prominent location entities related to John Doe, a hypothetical person, can be found in multiple countries. A closer look at the maps reveals that John Doe's geographical context is focused on North America. Figure 5(b) depicts the high number of ambiguous locations in the location contexts of the detected faces of the NY Times dataset.

To address the location ambiguity problem, our framework generates a template of a probability distribution D(f) of countries for each face f ∈ F by combining the entity level contexts C^N(f) and C^L(f). This probability distribution D(f) is the geographical context generated for each face. To identify ambiguous locations, the framework uses a publicly available database (GeoDataSource 2018) that maps city names to countries.

Generation of Geographical Context, D(f): Let Γ = {γ_1, γ_2, ..., γ_m} be the set of m countries, ρ be the set of all cities, and ρ_{γ_i} be the set of all cities in country γ_i. Ideally, m is equal to 195 because there are 195 countries in the world today. However, all country names may not appear in a text dataset. Therefore, m ≤ 195. The country distribution of a face f ∈ F using the person entity based context C^N(f) alone can be


defined as:

D_1(f) = \{d_{1,\gamma_1}(f), d_{1,\gamma_2}(f), \ldots, d_{1,\gamma_{|\Gamma|}}(f)\}, where

d_{1,\gamma_i}(f) = \sum_{e \in E^N} \left[ P(\gamma_i \mid e) \times \exp\left( -\lambda \, \frac{\max_{e' \in E^N} LL(e', f) - LL(e, f)}{\max_{e' \in E^N} LL(e', f) - \min_{e' \in E^N} LL(e', f)} \right) \right]    (8)

where λ is a user-settable parameter varying the number of top entities to consider for analytic purposes. Larger values of λ result in fewer effective entities. The probability P(γ_i|e) is computed by

P(\gamma_i \mid e) = \frac{\eta_{\gamma_i, e}}{\eta_e}    (9)

where η_{γ_i,e} is the number of articles containing both country γ_i and entity e, and η_e is the number of articles containing e. The country distribution of f using the location context is:

D_2(f) = \{d_{2,\gamma_1}(f), d_{2,\gamma_2}(f), \ldots, d_{2,\gamma_{|\Gamma|}}(f)\}    (10)

where d_{2,γ_i}(f) is computed using the same formula as d_{1,γ_i}(f), with the only exception that d_{2,γ_i}(f) uses the location entities and location contexts instead of the person entities and contexts.

In practice, the entity recognizers do not recognize location tokens with 100% accuracy. To reduce the impact of erroneous locations, we define another distribution, D_3(f), by validating the existence of the location context entities in the database of cities and countries. D_3(f) is defined by:

D_3(f) = \{d_{3,\gamma_1}(f), d_{3,\gamma_2}(f), \ldots, d_{3,\gamma_{|\Gamma|}}(f)\}, where

d_{3,\gamma_i}(f) = \sum_{e \in E^L \cap (\rho \cup \Gamma)} \left[ P(\gamma_i \mid e) \times \exp\left( -\lambda \, \frac{\max_{e' \in E^L \cap (\rho \cup \Gamma)} LL(e', f) - LL(e, f)}{\max_{e' \in E^L \cap (\rho \cup \Gamma)} LL(e', f) - \min_{e' \in E^L \cap (\rho \cup \Gamma)} LL(e', f)} \right) \right]    (11)

Notice that D_3(f) is an improved version of D_2(f) that resolves errors in location entity detection. Therefore, the final composition of the geographical context is:

D(f) = ln(D1(f)) + ln(D3(f)) (12)

Equations 8, 11, and 12 produce country probability distributions for a face f based on the person context, the location context, and a composition of Equations 8 and 11, respectively. In Section 4, we show a comparison of the effectiveness of several combinations of D_1(f), D_2(f), and D_3(f).
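The country-voting of Equation 8 and the log composition of Equation 12 can be sketched as below. The entity names and co-occurrence probabilities are purely illustrative, and the small `eps` smoothing for countries missing from one distribution is our own assumption (Equation 12 takes a logarithm, which is undefined at zero).

```python
import math

def country_scores(ll, p_country_given_entity, lam):
    # d1-style scores (Eq. 8 sketch): each entity votes for its co-occurring
    # countries with P(gamma|e), discounted by exp(-lam * normalized LL deficit).
    hi, lo = max(ll.values()), min(ll.values())
    span = (hi - lo) or 1.0
    scores = {}
    for e, l in ll.items():
        w = math.exp(-lam * (hi - l) / span)
        for country, p in p_country_given_entity.get(e, {}).items():
            scores[country] = scores.get(country, 0.0) + p * w
    return scores

def geographical_context(d1, d3, eps=1e-12):
    # Eq. 12: element-wise log composition of the person-based and validated
    # location-based country distributions (eps smooths missing countries).
    return {g: math.log(d1.get(g, 0.0) + eps) + math.log(d3.get(g, 0.0) + eps)
            for g in set(d1) | set(d3)}

# Illustrative entity log-likelihoods and country co-occurrence probabilities.
d1 = country_scores({'kim_jong_un': -1.0, 'obama': -6.0},
                    {'kim_jong_un': {'north korea': 0.8, 'china': 0.2},
                     'obama': {'united states': 1.0}}, lam=5.0)
geo = geographical_context({'us': 0.8, 'fr': 0.2}, {'us': 0.7, 'fr': 0.3})
```

With a large λ, the top-scoring entity dominates the vote, matching the remark that larger λ values effectively consider fewer entities.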


Figure 6. A story of five news documents associating the Boston bombers' involvement in the Waltham triple murder, including the trial phase:

2014/04/15 — "Wife of Bomb Suspect Got Triple-Slaying Subpoena": PROVIDENCE, R.I. — The attorney for the wife of the dead Boston Marathon bombing suspect says a federal grand jury asked her for items...
2014/10/24 — "Feds: Nothing New on Tsarnaev, 3 Slayings": BOSTON — Prosecutors said in a court filing Friday they don't have any new evidence that the brother of Boston Marathon bombing suspect Dzhokhar Tsarnaev was involved in a 2011 triple slaying...
2014/11/12 — "Tsarnaev Lawyers Want Evidence on Triple Killing": BOSTON — Lawyers for Boston Marathon bombing suspect Dzhokhar Tsarnaev urged a judge Wednesday to order federal prosecutors to turn over any evidence they have about his brother's participation in a 2011...
2015/03/02 — "Feds Want Boat Panels Brought to Court to Show Tsarnaev Note": BOSTON — Prosecutors want panels of the boat in which Boston Marathon bombing suspect Dzhokhar Tsarnaev was found hiding to be brought to court...
2015/04/10 — "Penalty Phase of Marathon Bombing Trial to Start After Race": BOSTON — The second phase of the trial of Boston Marathon bomber Dzhokhar Tsarnaev will begin on April 21, after the second anniversary of the attack...

3.4. Complexity Analysis

Generating the entity-based context using Equation 7 for all faces costs O(|F| × N_d × |E|), where N_d is the average number of documents associated with each entity. The geographical context generation (Equations 8, 11, and 12) for all faces has a time complexity of O(|F| × |Γ| × |E|), where Γ is the set of all countries in the world. Frontalization of the five key points of a face using Algorithm 1 requires constant time, given that we scale all the detected faces to the same size. Therefore, the algorithm requires linear time in the number of images.

3.5. Analytical Task Extensions

The geographical and entity based contexts have a wide variety of applications in exploratory analysis. In this section, we present three extensions that demonstrate different capabilities of the framework: (a) finding the genealogy of events, (b) tracking the geographical context of people over time, and (c) discovering contextual clusters of faces. These extensions are described below. Empirical studies on all the extensions are provided in Section 4.

3.5.1. Finding Genealogy of Events

We investigate the potency of the entity-level contexts of faces by introducing a new mechanism to summarize events that evolve over time. The purpose of time-evolving summarization of an event is to give the analyst an overview as a chain of documents with accompanying faces relevant to each document. An example of the summarization of an event is shown in Figure 6, which explains aspects of the Boston Marathon Bombing tragedy. Figure 6 is further explained in Section 4.8.

The summarization task focuses on forming a chain of {article, face-set} pairs using a set of news documents D and the face contexts generated from a knowledge base created from Wikipedia articles. Summarization requires extracting entities from D to form E^D, discovering the faces relevant to each news document based on the similarity between the document and the contexts of the images in the knowledge base, and designing a path-finding algorithm over a similarity network of news documents, where the discovered paths are constrained by text coherence, context similarity, and the progression of time in the story. In reality, news documents may contain images, but many of those images are repeated from the past to provide a visual context, and many news documents do


not contain any image. Since the knowledge base covers broad aspects of almost everything, face images can be reproduced from the images of the knowledge base. In our set of news documents D, we assume that there is no image and reproduce the relevant faces for each document of a story using the knowledge base. We order the faces for each news document d ∈ D based on the relevance between d and all faces. We compute this relevance using a context matching score between a news document d and the context C(f) of a face f:

\phi(f, d) = \sum_{e \in E^D_d \cap C(f)} V(e, d) \times (|C(f)| - R(e, f))    (13)

where V is a function similar to W defined in Equation 1; the weights of V are computed over the news documents of D. R(e, f) is the rank of e in the list of entities C(f) of face f. That is, we have two kinds of information associated with each news document: one is the weighted list of entities extracted from the text of the document, and the other is the list of most relevant faces (or contexts) from the knowledge base. Both types are leveraged in a heuristic search algorithm to build a path of {article, face-set} pairs between two documents {d_s, d_t} ∈ D. We use a variation of the A* search algorithm that uses the Soergel distance as the heuristic. The Soergel distance (Hossain, Butler, Boedihardjo & Ramakrishnan 2012) is an admissible heuristic for A* search. Since the Soergel distance is normalized between 0.0 and 1.0, which considers the Euclidean space a hyperspace with equal-length dimensions, it is more meaningful to analysts in terms of interpreting distance (Hossain et al. 2011, Hossain, Butler, Boedihardjo & Ramakrishnan 2012). The Soergel distance between two documents d and d′ is calculated using the following formula.

SrgDist(d, d') = \frac{\sum_{e \in E^D_d \cup E^D_{d'}} |V(e, d) - V(e, d')|}{\sum_{e \in E^D_d \cup E^D_{d'}} \max(V(e, d), V(e, d'))}    (14)

Equation 14 is also used to compute distance between the combined contexts of therelevant faces of two documents.
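Over sparse entity-weight vectors represented as dicts, Equation 14 can be sketched as follows; entities absent from a document are assumed to have weight zero.

```python
def soergel_distance(v, w):
    # Soergel distance (Eq. 14) between two sparse entity-weight vectors
    # represented as dicts; the result lies in [0, 1].
    num = den = 0.0
    for e in set(v) | set(w):
        a, b = v.get(e, 0.0), w.get(e, 0.0)
        num += abs(a - b)
        den += max(a, b)
    return num / den if den else 0.0
```

Identical vectors yield 0.0 and disjoint vectors yield 1.0, consistent with the normalization property noted above.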

Our heuristic algorithm maintains the following properties during exploration.

1. The complete search space, i.e., the network of documents, is not precomputed. Instead, neighboring documents during the search are generated on-the-fly by looking up a precomputed ball tree of the document dataset to compute the b-nearest neighbors.

2. Any two consecutive documents during the search must maintain a maximum allowable distance θ.

3. The face contexts of one document, combined from a certain number of the most relevant faces retrieved using Equation 13, cannot be more than τ distant from the face contexts of a neighboring document.

4. The search must progress over time, i.e., Timestamp(d_i) ≤ Timestamp(d_{i+1}) for two consecutive articles d_i and d_{i+1}.

Notice that all these constraints can be applied during the candidate evaluation phase of any heuristic search algorithm. We generate the b nearest neighbors based on the text content of the documents and rank the candidate documents for exploration based on their contexts. b is also the branching factor of the heuristic search algorithm. Empirical studies with different b, θ, and τ values are shown in Section 4.9.
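The candidate-evaluation check implied by constraints 2 to 4 can be sketched as a single predicate; the distance functions and timestamps below are placeholders for the document-level Soergel distances and article dates.

```python
def valid_successor(d_cur, d_next, text_dist, ctx_dist, timestamp, theta, tau):
    # Candidate check for the A*-style search: bounded text distance (theta),
    # bounded face-context distance (tau), and non-decreasing timestamps.
    return (text_dist(d_cur, d_next) <= theta
            and ctx_dist(d_cur, d_next) <= tau
            and timestamp[d_cur] <= timestamp[d_next])

# Hypothetical pair of documents with fixed distances and timestamps.
ts = {'d1': 1, 'd2': 2}
ok = valid_successor('d1', 'd2',
                     text_dist=lambda a, b: 0.4,
                     ctx_dist=lambda a, b: 0.3,
                     timestamp=ts, theta=0.5, tau=0.5)
```

Constraint 1 (on-the-fly neighbor generation from a ball tree) determines which candidates reach this predicate; the predicate then prunes them.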


3.5.2. Tracking Geographical Context

Traceability of faces of interest in terms of geographical context helps an analyst understand the spatial nature of an actor of an event. This includes how political campaigns spread over the world, or how an actor of one country influences the public sentiments of neighboring countries (Petz et al. 2014).

To trace the geographical context of faces, we divided the New York Times dataset into buckets of two consecutive years in such a way that each bucket has one year in common with the next bucket in the sequence. This ensures that the time series generated from the dataset do not have sudden spikes due to a discrete year-wise division of the data. After the division of the data, we copy a face image of a person of interest into all the articles containing the name of the person; the copy is an addition to all other images that already exist in the data for each bucket. Then, we generate a geographical context for each of the copies within the scope of each bucket. The divergence between the geographical context generated for a person and a uniform country distribution is recorded for each bucket. This results in a signal: a rise in the signal indicates centralization, while a fall indicates globalization, because the geographical distribution moves closer to the uniform distribution.
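The text does not name the divergence measure; assuming KL divergence from the uniform distribution, the per-bucket signal can be sketched as:

```python
import math

def centralization_signal(country_dist):
    # KL divergence of a (possibly unnormalized) country distribution from the
    # uniform distribution over the same countries; higher values indicate a
    # more centralized geographical context, lower values a more global one.
    n = len(country_dist)
    total = sum(country_dist.values())
    return sum((p / total) * math.log((p / total) * n)
               for p in country_dist.values() if p > 0)
```

A uniform distribution yields a signal of zero, and any skew toward particular countries raises it, matching the centralization/globalization reading above.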

Later, in the experimental results (Section 4.5), we demonstrate traces of faces of afew political leaders.

3.5.3. Contextual Clustering of Faces

The focus of this application is to form groups of faces with high intra-cluster contextual similarity. This application helps discover the community of a person (face) based on shared contextual similarity. We leverage a density based clustering algorithm, DBSCAN (Ester et al. 1996), in this application. Unlike k-means clustering, DBSCAN does not require prior specification of the number of clusters, and it has the ability to mark outliers and discover the intrinsic clusters. We used the Soergel distance (Hossain, Butler, Boedihardjo & Ramakrishnan 2012) to compute the dissimilarity between the contexts of two faces. Section 4.6 explains some of the contextual clusters discovered using DBSCAN.
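A minimal pure-Python DBSCAN over an arbitrary distance function illustrates the setup; this is not the authors' implementation, and in practice one would feed precomputed pairwise Soergel distances between face contexts to a library DBSCAN instead. The 1-D "contexts" below are placeholders.

```python
def dbscan(items, dist, eps, min_pts):
    # Minimal DBSCAN: returns one label per item, with -1 marking noise.
    # min_pts counts the point itself; O(n^2) distance evaluations.
    n = len(items)
    labels = [None] * n
    neighbors = lambda i: [j for j in range(n) if dist(items[i], items[j]) <= eps]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:   # j is a core point: keep expanding
                queue.extend(nj)
    return labels

# Two dense groups of 1-D "contexts" plus one outlier.
labels = dbscan([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0],
                dist=lambda a, b: abs(a - b), eps=0.5, min_pts=3)
```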

4. Experimental Results

We use New York Times news articles for our experiments. The dataset contains 54,371 articles and 86,966 images, with around 98,914 faces and 69,829 entities. All the news articles are time stamped. For all the classification based experiments, we used logistic regression, one-vs-rest when appropriate, with L2-regularization and 10-fold cross-validation, unless mentioned explicitly. In this section, we seek to answer the following questions to justify the capabilities of our F2ConText framework.¹ Our experiments and case studies highlight three different aspects of the framework: (1) contextual analysis, where we evaluate different types of contexts quantitatively and qualitatively, (2) enhanced feature generation, where justification for the selection of the face feature generation techniques is presented, and (3) runtime analysis of the context generation method.

1. Contextual analysis:

(a) How good are the contexts generated for the face images? (Section 4.1)

1 Code and data are provided here: http://dal.cs.utep.edu/projects/storyboarding/KAIS/. Password: 16context


(b) How well do the generated face-contexts complement a solution of a face recognition problem? (Section 4.2)
(c) How do person, location, and geographical contexts provide a sense about a face image? (Section 4.3)
(d) How well does the geographical context D(f) perform compared to its different compositions and a baseline? (Section 4.4)
(e) Are geographical contexts traceable for analytic purposes? (Section 4.5)
(f) Are the generated contexts suitable for computing distances to be able to cluster the faces? (Section 4.6)
(g) What is the impact of the face context on the quality of the stories? (Section 4.7)
(h) Do the generated stories provide a meaningful genealogy of events? (Section 4.8)
(i) How do the search parameters control the characteristics of the stories? (Section 4.9)

2. Enhanced feature generation:

(a) Which method is more capable of detecting facial key points for this particular data set? (Section 4.10)
(b) Does frontalization of key points result in better face recognition accuracy? (Section 4.11)
(c) Which learning mechanism should we use to attain greater performance in predicting the proper rotation angle for frontalization? (Section 4.12)

3. Runtime analysis:

(a) Does the context generation mechanism scale well with increasing data size? (Section 4.13)

4.1. Quality of Entity Level Face-Contexts

We present three different experiments in this section to evaluate the entity level contexts of persons. Each experiment evaluates a distinct aspect of the context.

In the first experiment, we evaluate our context generation method in terms of its capability of capturing the actual person name of a face within the context. The person context of a face is a list of entities in descending order of association probabilities (Eq. 7). For comparison, we use a baseline method, which creates the context of a face by combining all entities of the document where the face was found. The entities in the context of a face using the baseline method are ordered by the TF-IDF weights of the entities in the document containing the face. In Figure 7, we compare our context generation method, F2ConText, against the baseline method. The x-axis represents the number of top entities considered as the context. The y-axis represents the percentage of faces for which the context of the face contained the actual name of the person.

The line for the baseline method in Figure 7 suggests that the name and face of a person appear in the same document for only around 60 percent of the faces. The line becomes horizontal after a context size of around 30 because there was no document containing more than 30 name entities. Figure 7 shows that F2ConText is capable of producing contexts containing the actual name of the person for more faces than the baseline method when the context size is greater than 10. When the context size is greater than 30, F2ConText keeps improving its performance as opposed to the baseline method. This indicates that F2ConText can bring up the actual person name of the face even if the name appears in other documents but not in the document that contains the face. The result


Figure 7. Quality of context in terms of the appearance of the person name in the context of a face: percentage of faces whose context contains the name of the person (y-axis) versus context size (x-axis), for our method and the baseline.

is based on a random sample of 1200+ faces that a human analyst labeled with their actual names. This benchmark data is available at: http://dal.cs.utep.edu/projects/storyboarding/KAIS/LabeledFaces.zip

In the second experiment, we evaluate the quality of the context of a face by comparing the context with Google image search results. Although Google search has a different objective than ours, we are interested in verifying whether some of the entities in the context of a face can be discovered using Google image search by uploading the face image. Since the Google image search API limits the number of queries and the process is time consuming, we randomly picked up to 300 faces and computed how many of the terms of the context of each face were found in the list of titles and summaries on the first page of results returned by Google after uploading the face. We then calculated the number of entities this list has in common with our context of that face. Figure 8 shows a distribution of the percentage of faces for different numbers of overlaps between the context and the Google search results. As expected, the percentage of faces goes down when we look for more common entities. The plot shows that almost 75% of the faces have at least five entities in common with the corresponding Google search result when we pick a maximum of 80 most probable entities.

In an additional experiment to evaluate the quality of the generated contexts against ground truth information, we sample 21 faces from F and manually attach the most appropriate person entities to each of them with the help of human experts. We then compare these ground truth contexts with our automatically generated contexts. Figure 9 compares two approaches, vanilla face features and face features combined with frontalized features, in terms of the number of entities in common with the human-made context. Face features combined with frontalization provided the best context for all context sizes.


Figure 8. Resemblance between a face context and a relevant Google image search: percentage of faces (y-axis) with an overlap of at least N entities (x-axis), for context sizes 20, 35, 50, 65, and 80.

Figure 9. Comparison of context generation methods on human annotated test data: number of entities in common with the human annotations (y-axis) versus context size (x-axis). Adding frontalized features improves the quality of the context.


Figure 10. Context in face recognition: F1 score (y-axis) versus context size (x-axis) for face features alone, context features alone, and face and context features combined.

Figure 11. An example of a geographical context for Kim Jong-un: (left) probabilities of the twenty countries with the highest values, and (right) the probabilities highlighted on a map. The person context includes entities such as kim_jong_un, pak_pong-ju, kim_yo_jung, choe_ryong-hae, and ri_sol.

4.2. Context in Face Recognition

The contexts generated for faces can be used as features to complement a face recognition solution. In this study, we create three sets of features for face images: (1) features generated using FaceNet and frontalization as described in Section 3.1.1, (2) the context features generated for each face using our framework, and (3) a combination (concatenation) of the context features and the face features. Figure 10 shows that incorporating context features with face features improves the face recognition accuracy in terms of F1 score when compared with face recognition using face features alone, or face recognition


Figure 12. (left) Among all combinations of ln D1(f), ln D2(f), and ln D3(f), the combination D(f) = ln D1(f) + ln D3(f) performs best for various values of λ. (middle) Performance of D(f) against the baseline D^B(f): lower average KL-divergence using our method indicates better results. (right) D(f) performs better than D^B(f) more than 87% of the time, even for the worst choice of λ.

using context features alone. For this experiment, we randomly selected around 1,200 faces, covering around 500 people. A human expert labeled these faces. For the classification, we used a one-vs-rest logistic regression with L2-regularization and 2-fold cross validation.
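A minimal sketch of experiment (3) with scikit-learn, assuming randomly generated stand-ins for the face and context feature matrices (the real features come from FaceNet, frontalization, and our framework); the dimensions and label counts are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 100 faces with 128-d face features and
# 65-d context features (entity weights), plus integer person labels.
face_feats = rng.normal(size=(100, 128))
context_feats = rng.normal(size=(100, 65))
labels = rng.integers(0, 5, size=100)

# Experiment (3): concatenate face features and context features.
combined = np.hstack([face_feats, context_feats])

# One-vs-rest logistic regression with L2 regularization and
# 2-fold cross validation, as described in the text.
clf = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
scores = cross_val_score(clf, combined, labels, cv=2, scoring="f1_macro")
print(round(scores.mean(), 3))
```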

It is noticeable that the accuracy of face recognition varies with different context sizes. A general observation is that when the context size is smaller, the F1 score is lower; with increasing context size, the F1 score gradually becomes higher, peaking at around size 65. The F1 score decreases as the context size is increased further. This indicates that the selection of the best context size is an analytic choice for any dataset-specific face recognition task.

4.3. Person, Location, and Geographical Contexts

Figure 1 in Section 1.2 shows an example of a face, the generated person context, location context, geographical context as a bar chart, and the geographical context laid on a map. The face is of David Cameron, a British politician and the former Prime Minister of the United Kingdom. The person context captures the name of the Prime Minister as well as the names of a few other related people. Entities in the generated location context provide an idea about the areas related to the face, such as "kingham", "britain", and "medieval britain". The bar chart in Figure 1 presents the geographical context, which is the probability distribution over countries computed by Equation 12. In addition, we provide another representation in which circles with sizes proportional to these country probabilities are laid on a map.

Figure 11 shows the geographical context of a face of Kim Jong-un, the supreme leader of North Korea. The person context is set into the bar chart. Both the bar chart and the map portray that Kim Jong-un's geographical context is focused on the region of North Korea, South Korea, and China. The reason behind Switzerland's appearance in the geographical context is a controversial piece of information about Kim Jong-un's school attendance in Switzerland.

4.4. Comparison between Different Methods to Generate Geographical Context

The absence of a ground truth dataset to evaluate the generated geographical contexts for each face makes our evaluation challenging. To address this issue, we develop a ground truth set by manually labeling one hundred faces with the name of the person corresponding to each face, and then by detecting countries and cities in the documents where the labeled person name is found. This allows us to generate a ground truth country distribution, D^T(f), for each manually labeled face. Let ϱ_f be the set of person names manually identified for face f. The ground truth country distribution of f is:

$$D^T(f) = \{d^T_{\gamma_1}(f),\, d^T_{\gamma_2}(f),\, \ldots,\, d^T_{\gamma_{|\Gamma|}}(f)\}, \quad \text{where}$$

$$d^T_{\gamma_i}(f) = \sum_{\varrho \in \varrho_f} \; \sum_{a \in a_\varrho} \; \sum_{l \in E_L^a \cap \rho_{\gamma_i}} P(\gamma_i \mid l) \cdot W_L(l, a) \tag{15}$$

Here, a_ϱ is the set of articles where name ϱ appears, ρ_{γ_i} is the set of cities in country γ_i, and E_L^a is the set of location entities in article a. A geographical context D(f) of a manually labeled face f is evaluated as a high-quality context if D(f) is close to the ground truth geographical context, D^T(f), which is computed using the labels.
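A minimal sketch of Equation 15, assuming simple dictionary-based inputs; the final normalization into a probability distribution is our assumption, since the equation itself only defines the accumulated weights, and all toy values below are illustrative:

```python
from collections import defaultdict

def ground_truth_distribution(person_names, articles_by_name,
                              location_entities, p_country_given_loc,
                              cities_in_country, loc_weight):
    """Sketch of Equation 15: for every article mentioning one of the
    manually attached person names, accumulate P(country | l) * W_L(l, a)
    over the article's location entities l that are cities of that
    country, then normalize (normalization is an assumption here)."""
    d = defaultdict(float)
    for name in person_names:                       # name in the set for face f
        for article in articles_by_name[name]:      # articles mentioning the name
            for loc in location_entities[article]:  # location entities of the article
                for country, cities in cities_in_country.items():
                    if loc in cities:               # loc is a city of this country
                        d[country] += (p_country_given_loc[(country, loc)]
                                       * loc_weight[(loc, article)])
    total = sum(d.values()) or 1.0
    return {c: v / total for c, v in d.items()}

# Hypothetical toy inputs:
dist = ground_truth_distribution(
    person_names={"kim_jong_un"},
    articles_by_name={"kim_jong_un": ["a1"]},
    location_entities={"a1": ["pyongyang", "seoul"]},
    p_country_given_loc={("north korea", "pyongyang"): 1.0,
                         ("south korea", "seoul"): 1.0},
    cities_in_country={"north korea": {"pyongyang"},
                       "south korea": {"seoul"}},
    loc_weight={("pyongyang", "a1"): 2.0, ("seoul", "a1"): 1.0},
)
print(dist)
```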

A baseline geographical context D^B(f) of a face f is a country distribution generated using the cities and countries found in the same article that contains f. D^B(f) is computed using the following equation:

$$D^B(f) = \{d^B_{\gamma_1}(f),\, d^B_{\gamma_2}(f),\, \ldots,\, d^B_{\gamma_{|\Gamma|}}(f)\}, \quad \text{where} \quad d^B_{\gamma_i}(f) = \sum_{l \in E_L^{a_f}} P(\gamma_i \mid l) \tag{16}$$

where a_f is the article in which face f appears.

In this section, we demonstrate the effectiveness of our approach and of the baseline approach as compared to the ground truth. From Equation 12, D(f) is a composition of D1(f) and D3(f). In this experiment, we compare the resulting error of D(f) and of all combinations of D1(f), D2(f), and D3(f) using the ground truth data. The error is derived by computing the average KL-divergence between the distribution under consideration and the ground truth distribution D^T(f) over the one hundred labeled faces. A lower average KL-divergence indicates a better distribution because it is closer to the ground truth. Figure 12 (left) shows average KL-divergences using different combinations with varying λ. Lower values of λ allow inclusion of more entities. Figure 12 (left) shows that D(f) has the lowest KL-divergence with the ground truth for any value of λ. This indicates that D(f) performs the best among all the combinations.
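The KL-divergence based evaluation can be sketched as follows; the epsilon smoothing for zero-probability countries is an assumption, since the paper does not state its smoothing scheme:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the union of country keys of two distributions,
    represented as dicts. The epsilon guard against zero probabilities
    is an illustrative assumption."""
    countries = set(p) | set(q)
    return sum((p.get(c, 0.0) + eps)
               * math.log((p.get(c, 0.0) + eps) / (q.get(c, 0.0) + eps))
               for c in countries)

def average_kl(generated, ground_truth):
    """Average KL-divergence between each generated distribution D(f)
    and its ground truth D^T(f) over the labeled faces."""
    return sum(kl_divergence(d, t)
               for d, t in zip(generated, ground_truth)) / len(generated)

p = {"us": 0.5, "uk": 0.5}
q = {"us": 0.9, "uk": 0.1}
print(kl_divergence(p, p))      # identical distributions: 0.0
print(kl_divergence(q, p) > 0)  # → True
```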

In addition, we compute the average KL-div(D(f), D^T(f)) and KL-div(D^B(f), D^T(f)) for different λ. The baseline, D^B(f), is computed using Equation 16. Figure 12 (middle) shows that the geographical context D(f) generated by our method has lower KL-divergence than the baseline D^B(f), indicating that our approach provides results closer to the ground truth. KL-div(D^B(f), D^T(f)) is constant in Figure 12 (middle) because it does not depend on λ.

One observation is that the average KL-div(D(f), D^T(f)) is highest when λ is too small. This is because small values of λ lead to the use of a long list of entities in the context. Our observation is that the best performance is found near λ = 80, which is evident in Figure 12 (right). The plot also shows that even in the worst case, D(f) performs better than D^B(f) more than 87% of the time.

4.5. Tracking Geographical Contexts

As explained in Section 3.5.2, the geographical context of each person is a traceable distribution. In Figure 13, we outline the trends of the geographical contexts of four political


Figure 13. Geo-contextual trends of leaders over time: divergence between the geographical context and the uniform distribution, for Hillary Clinton, Barack Obama, Mitt Romney, and David Cameron. Each data point is labeled with its top two countries (e.g., "us, iraq", "uk, afgh").

leaders: Hillary Clinton, Barack Obama, Mitt Romney, and David Cameron. The top two countries are used as a label for each data point in the plot. Hillary Clinton served as the United States Secretary of State from 2009 to 2013, during which her trend line has comparatively lower values, indicating a focus on affairs around the globe. The upward movement of Hillary Clinton's trend line from 2013 indicates that her focus was centralizing toward the United States. While Mitt Romney's trend line exhibits somewhat more centralization toward the United States, President Barack Obama's trend line has some interesting patterns: gradual globalization from 2006 to 2010, centralization from 2010 to 2012, and again gradual globalization from 2012 to 2015. David Cameron's trend line is similar to President Barack Obama's but has fewer fluctuations. Such tracking capability will allow social scientists and analysts to study the dynamics of persons of interest.
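The per-snapshot quantity plotted in Figure 13 can be sketched as a divergence between a geographical context and the uniform distribution; KL-divergence is used below as an illustrative choice, since the exact divergence measure is not restated in this section, and the context snapshots are hypothetical.

```python
import math

def divergence_from_uniform(geo_context):
    """Divergence between a geographical context and the uniform
    distribution over the same countries; higher values indicate a
    context centralized on a few countries. KL-divergence is an
    illustrative stand-in for the paper's measure."""
    u = 1.0 / len(geo_context)
    return sum(p * math.log(p / u) for p in geo_context.values() if p > 0)

# Hypothetical snapshots of a leader's geographical context:
centralized = {"us": 0.85, "iraq": 0.10, "uk": 0.05}
globalized = {"us": 0.40, "iraq": 0.35, "uk": 0.25}
print(divergence_from_uniform(centralized) >
      divergence_from_uniform(globalized))  # → True
```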

4.6. Context based Clustering of Faces

Generated contexts can be used to create a feature space for the faces. For the person contexts, it is possible to create vectors using entities as features. The distribution of the geographical context can be directly leveraged as the feature space. This provides the ability to compute pairwise distances between contexts and hence opens up the opportunity to contribute to many machine learning applications. In this subsection, we provide examples and analysis of clustering outcomes using person and geographical contexts. We leveraged DBSCAN to group the faces based on context. Figure 14 shows two clusters of faces generated by DBSCAN using person context elements as features. In Cluster 1 of Figure 14, there are eleven faces of three people. One of these three people is the Turkish president, Recep Tayyip Erdogan. Cluster 1 contains five faces of President Recep Tayyip Erdogan. These five faces were detected in five different news articles of the New York Times dataset. The faces of the other two people were found in the vicinity


Figure 14. Two clusters of faces generated by DBSCAN using name context elements as the features of the faces.

Figure 15. An example of a geo-context based cluster: seven faces of the Prime Minister of the UK, along with the First Secretary of State of the UK, a Member of Parliament of the UK, the Secretary of State for Health of the UK, a UK aid worker captured by ISIS, and the BBC Chairman.

of the images where the president's face was present. All these faces are in the same cluster because their person contexts have high similarity. Cluster 2 brings together a total of thirteen faces of eight people from five documents. These faces are either connected to the bombing in London's transit system in 2005, or charged with terrorism and murder related to the September 11 hijacking of commercial airliners. In both Cluster 1 and Cluster 2, faces with similar person contexts are brought together.

Figure 15 shows a cluster generated by DBSCAN that leverages the geographical context. The cluster contains seven faces of David Cameron and five faces of five different people. All of their geographical contexts are focused on Europe, especially the United Kingdom. This example shows the potential of bringing persons of interest with similar geographical contexts into the same group.
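The clustering step can be sketched with scikit-learn's DBSCAN; the contexts, eps, and min_samples below are illustrative, not the values used in the experiments.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction import DictVectorizer

# Hypothetical person contexts: each face maps entities to weights.
contexts = [
    {"recep_tayyip_erdogan": 0.6, "turkey": 0.3},
    {"recep_tayyip_erdogan": 0.5, "ankara": 0.2},
    {"david_cameron": 0.7, "uk": 0.2},
    {"david_cameron": 0.6, "london": 0.3},
]

# Vectorize entity weights into a shared feature space.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(contexts)

# Cluster with DBSCAN; eps and min_samples are illustrative choices.
labels = DBSCAN(eps=0.8, min_samples=2, metric="euclidean").fit_predict(X)
print(labels)  # → [0 0 1 1]
```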

We use a Latent Dirichlet Allocation (LDA) (Blei et al. 2003) based technique to


Figure 16. Person context based clustering has higher quality (small weights at large numbers of topics) than the baseline. The plot shows the weight of clusters that are distributed to T topics vs. the number of topics T, with a total of 100 topics and 200 clusters.

evaluate the contexts of each face cluster generated by our framework and compare this with a baseline approach where the context of each face is generated using person entities found only in the article where the image of the face is located. We first apply LDA to generate the topics of each of the documents in the corpus. A good face cluster should bring its context from documents of the same topic. For a face cluster c_i, clustered using the contexts of the faces, we find the documents δ(c_i) from which the faces of c_i were retrieved. If the contexts of the faces of c_i are good, then the documents of δ(c_i) should come from a small number of topics. If δ(c_i) comes from too many topics, this indicates that the contexts of the faces that formed c_i are scattered over many topics and hence the faces in c_i are less contextual. The documents relevant to a baseline face cluster come from too many topics, whereas the faces in a person-context based cluster come from a low number of topics. The weight on the vertical axis of Figure 16 represents the number of clusters distributed to T topics and is computed by $\sum_{c \in C_T} \frac{|c|}{|F|}$, where C_T is the set of face clusters whose faces' documents are distributed over a total of T topics. Larger weights at low T and lower weights at larger T represent better quality of context in the clusters. Figure 16 shows that the person context based face clustering (green line) ends long before the baseline clustering (red line) on the x-axis, indicating that the clusters produced by our method are more contextual.
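The weight on the vertical axis of Figure 16 can be sketched as follows, assuming for simplicity that each document has a single dominant LDA topic; the clusters and topic assignments are illustrative.

```python
from collections import Counter

def topic_spread_weights(clusters, doc_topic, total_faces):
    """For each face cluster, count how many distinct LDA topics its
    source documents span (T), then accumulate |c| / |F| per T. The
    single dominant topic per document is a simplifying assumption."""
    weights = Counter()
    for cluster in clusters:
        topics = {doc_topic[doc] for face, doc in cluster}
        weights[len(topics)] += len(cluster) / total_faces
    return dict(weights)

# Hypothetical faces (face_id, source_document) and dominant topics.
clusters = [
    [("f1", "d1"), ("f2", "d1"), ("f3", "d2")],  # documents span 1 topic
    [("f4", "d3"), ("f5", "d4")],                # documents span 2 topics
]
doc_topic = {"d1": 7, "d2": 7, "d3": 3, "d4": 9}
print(topic_spread_weights(clusters, doc_topic, total_faces=5))
# → {1: 0.6, 2: 0.4}
```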

4.7. Impact of the use of Context in Story Generation

While the inclusion of image context in the core heuristic path finding part of the summarization process imposes an additional constraint, the outcome of using this constraint becomes evident when we compare the stories side by side. For a fair comparison we make sure that both methods have the same parameters (e.g., the same θ, branching factor b, and start-end pairs). Table 2 shows four pairs of sample stories generated with and without face contexts.

Our observation is that our method generates more coherent stories when the context is used, which is to be expected because context overlap between consecutive articles reinforces a constraint that each story must be woven around a certain theme. For all start-end pairs of documents in Table 2, the stories with context are more coherent than


Table 2. Sample stories generated with and without context. (URLs of the original news documents are provided with the article IDs in blue and can be reached by clicking on the IDs.)

With context: [NY435 → NY317 → NY175 → NY427 → NY642 → NY552 → NY525]. The story describes the request for delay and change of trial location.
Without context: [NY435 → NY201 → NY525]. The intermediate article digresses from the focus and brings the investigation in Russia into consideration.

With context: [NY178 → NY370 → NY334 → NY609]. The story connects the triple murder with current cases.
Without context: [NY178 → NY505 → NY458 → NY609]. The intermediate articles are about trial announcements.

With context: [NY265 → NY129 → NY279]. The story focuses on the trial of a friend of Tsarnaev and the jury selection for the suspected friends.
Without context: [NY265 → NY129 → NY124 → NY279]. One of the intermediate articles is off-topic and is about a citation of winning a video game at the trial of the accused Boston bomber's friend.

With context: [NY158 → NY379 → NY206 → NY280]. This story describes the former Governor's testimony for Tsarnaev's friend Robel Phillipos.
Without context: [NY158 → NY280]. The testimony of the former Governor does not show up in the story without context.

Table 3. Stories using Boston Marathon Bombing data. (URLs of the original news documents are provided with the article IDs in blue and can be reached by clicking on the IDs.)

Story: NY286 (2014/04/15) → NY370 (2014/10/24) → NY334 (2014/11/12) → NY609 (2015/03/02) → NY677 (2015/04/10). Explanation: This story associates the Boston bombers' involvement in the Waltham triple murder. The story includes the trial phase.

Story: NY158 (2014/10/16) → NY379 (2014/10/16) → NY206 (2014/10/28) → NY280 (2014/10/28). Explanation: The former governor of Massachusetts testifies for Tsarnaev's friend Robel Phillipos. Phillipos was found guilty of making false statements to authorities.

Story: NY435 (2014/05/02) → NY317 (2014/06/18) → NY340 (2014/08/14) → NY525 (2015/02/06). Explanation: Tsarnaev's lawyers urge appeals to move the trial location to Washington, delaying the trial date. The story shows that the request was denied.

Story: NY121 (2014/04/22) → NY273 (2014/07/10) → NY177 (2014/08/20) → NY338 (2014/09/27). Explanation: This story highlights the trials of three friends of Tsarnaev. All three were accused of obstructing justice by lying and destroying evidence.

the stories without context. From the table, we observe that each story without the use of context has some off-topic documents that may be relevant but break the flow of the theme. Sometimes we obtain the same story with and without the use of context, but those stories are not reported here.


4.8. Examples of Generated Stories using Boston Marathon Bombing Data

The New York Times returns 1,028 documents for the query "Boston Marathon Bombing". The summarization mechanism discovered a number of sub-events that provide a fine mental model of the branches of the Boston Marathon Bombing tragedy and the happenings afterwards. Table 3 lists a few of the stories discovered by our mechanism. The first story of Table 3 is illustrated in Figure 6 on a storyboard using term clouds and faces.

The story of Figure 6 describes a connection between the Boston bombers and the Waltham triple murder. The story moves forward to the penalty phase. The term cloud of the storyboard highlights a person named Todashev, a victim of the triple murder, along with the bombers Tamerlan and Tsarnaev. Each related face list, automatically selected from the knowledge base by the framework, captures faces of relevant people very well. For example, the first face of the second document of the story is Todashev, whose face is found repeatedly in the consecutive articles. The rest of the faces in the story are of the Boston bombers Tamerlan and Tsarnaev, police, and rescue crews.

The theme of the second story in Table 3 is the conviction of Tsarnaev's friend, Robel Phillipos, for lying to the FBI. Robel Phillipos was found guilty of making false statements to authorities.

The third sub-event of Table 3 summarizes Tsarnaev's lawyers' appeal to move the trial location to Washington, which delayed the trial date. The fourth story describes the trials of three friends of Tsarnaev who were accused of obstructing justice, lying, and destroying evidence.

4.9. Characteristics of Stories in Terms of User-settable Parameters

Two important user-settable parameters in our method are the maximum allowable distance θ and the branching factor (number of nearest neighbors) b. Figure 17 shows the impact of θ and b on the statistical significance, average length, and number of stories. To calculate the statistical significance, the p-value, we randomly pick b documents from the entire candidate pool and check whether the documents picked satisfy the distance threshold θ, iterating the test 5,000 times. We repeat this process for every junction-article of a discovered story. The overall p-value of a story is calculated by multiplying the p-values of every document of the story except the last one. Figure 17 (left) shows that the significance decreases (i.e., the p-value increases) with higher values of θ and b. This is an expected outcome since higher θ values imply less stringent overlap of content between consecutive articles. A less stringent constraint may result in stories with loose connections between consecutive articles. A similar argument explains the plot in Figure 17 (middle). Increasing the θ value and branching factor b leads to shorter stories with loosely connected neighbors. The curves for branching factors 20 and 35 are exceptions, giving even shorter stories than larger branching factors until θ = 0.75. This exception is explained by the right plot, which shows that there were not enough stories for those two branching factors until θ = 0.75. For the other branching factors, the number of stories follows a similar upward trend with increasing θ.
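The randomization test behind the p-value can be sketched as follows; the candidate pool and its distances are illustrative stand-ins.

```python
import random

def junction_p_value(candidate_distances, b, theta, trials=5000, seed=0):
    """Estimate the probability that b documents picked at random from
    the candidate pool all satisfy the distance threshold theta,
    mirroring the 5,000-iteration randomization test described above."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(candidate_distances, b)
        if all(d <= theta for d in sample):
            hits += 1
    return hits / trials

def story_p_value(junction_p_values):
    """Overall p-value of a story: the product of the p-values of every
    junction document (all documents of the story except the last)."""
    p = 1.0
    for pv in junction_p_values:
        p *= pv
    return p

# Hypothetical candidate pool distances from one junction article:
pool = [0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.5, 0.45, 0.85]
print(junction_p_value(pool, b=3, theta=0.65))
print(story_p_value([0.05, 0.1]))
```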

4.10. Facial Key Point Selection

Detection of an appropriate number of key points is challenging because of the different poses faces can have. We applied two methods, Boosted Regression with Markov Networks


Figure 17. Impact of search parameters on characteristics of stories: (left) p-value, (middle) average story length, and (right) number of stories, each vs. the maximum allowed distance θ, for branching factors b = 20, 35, 50, 65, and 80.

(BoRMaN) (Valstar et al. 2010) and a Deep Convolutional Network Cascade (CNN) (Sun et al. 2013), for facial key point generation. The former discovers twenty facial key points and the latter finds five. Our observation is that the BoRMaN method performs well with faces taken under controlled environments, e.g., photos taken in laboratories. When we used faces detected from images, the BoRMaN method was able to detect facial points properly for 70% of the faces. For this experiment, we randomly picked 300 faces from our database and manually checked whether the detected points were in the vicinity of the expected pixels. With the same 300 faces, the accuracy of the CNN method in detecting five facial key points was 99%. Figure 18 shows two examples where five key points are detected correctly by CNN but the twenty points detected by the BoRMaN method are cluttered in one region of the face.

4.11. Frontalization for Face Recognition

Although our framework targets a different problem than face recognition, we evaluate the frontalization technique using face recognition to study its impact. In most face recognition literature, ground truth faces of each person are taken in different poses under the same environment (e.g., illumination). However, faces detected from the New


Figure 18. Comparison of two facial key point detection techniques: twenty facial points from Boosted Regression with Markov Networks vs. five facial points from the Deep Convolutional Network Cascade.

York Times images are not annotated and are limited in number and poses. The experiment in this section compares face recognition accuracies with and without frontalization. We picked 5,690 faces of 401 persons from the Labeled Faces in the Wild (LFW) dataset (Huang, Ramesh, Berg & Learned-Miller 2007), which has annotated face images that are not taken in a controlled environment. We picked only those persons for whom at least five face images were available so that we could experiment with different training-test ratios. We used a one-vs-rest logistic regression with L2-regularization. Figure 19 compares face recognition accuracies at different training and test splits with and without frontalization. The figure shows that inclusion of frontalized features with facial features yields better accuracy. Figure 19 also depicts that our frontalization technique has a greater impact on face recognition when training data is limited. On average, the standard deviation of the data for each generated point of Figure 19 was less than 0.005.

4.12. Prediction of Frontalization Angles

Our frontalization technique relies on a generalized linear model based classifier to compute the azimuth and elevation angles of a face in an original image. To compare the linear model based classifier with a few other alternative mechanisms, we created a synthetic face model and rotated it at different angles to create different poses. That is, the ground truth angles are known for all the poses, and an error can be computed for the prediction of those angles. Figure 20 shows a comparison of the mean square errors of three methods used to predict rotation angles: Generalized Linear Model (GLM), Regression Tree, and Ensemble Regression Tree. Figure 20(a) is for sideways rotations and Figure 20(b) shows the errors with elevation of the faces. In both cases, GLM has lower errors than any other method at most of the angles. In this experiment, the sideways angles were varied from −45° to +45° and the elevations were varied from −15° to +15°.

Figure 19. Comparison of face recognition accuracies with and without the frontalization technique: accuracy vs. training-set ratio, for facial features alone and for facial features combined with frontalized features.

Figure 20. Mean square errors of (a) azimuth angle predictors and (b) elevation angle predictors, comparing the Generalized Linear Model, Regression Tree, and Ensemble Regression Tree.
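The regressor comparison above can be sketched with scikit-learn; the synthetic key point data below is an illustrative stand-in (a noisy linear function of the angle) rather than the paper's rotated 3D face model, and LinearRegression stands in for the generalized linear model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical stand-in for the synthetic face model: 10 key point
# coordinates generated as a noisy linear function of the azimuth
# angle (the real relationship comes from rotating a 3D face model).
angles = rng.uniform(-45, 45, size=300)
keypoints = (np.outer(angles, rng.normal(size=10))
             + rng.normal(scale=0.5, size=(300, 10)))

models = {
    "Generalized Linear Model": LinearRegression(),
    "Regression Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Ensemble Regression Tree": RandomForestRegressor(n_estimators=50,
                                                      random_state=0),
}
train, test = slice(0, 200), slice(200, 300)
mses = {}
for name, model in models.items():
    model.fit(keypoints[train], angles[train])
    mses[name] = mean_squared_error(angles[test], model.predict(keypoints[test]))
    print(f"{name}: MSE = {mses[name]:.3f}")
```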


Figure 21. Runtime to generate entity contexts for faces: time (s) vs. number of entities, for 1,000, 2,000, 3,000, and 4,000 faces.

4.13. Runtime Analysis

The plot in Figure 21 shows that as the number of entities increases for a fixed number of faces, the runtime increases almost linearly. Additionally, the runtime grows almost linearly with the number of faces. The plot indicates that the context generation mechanism is scalable to large datasets. The person context generation time for the 98,914 faces detected in the New York Times dataset was around five hours. The geographical contexts of all these detected faces were generated in 15 minutes. The textual content in the dataset contained 65,240 person and 4,589 location entities. The runtimes were obtained on a regular desktop computer with an Intel Core i7 quad-core CPU @ 3.40GHz and 24GB RAM.

5. Conclusion

This paper presents an automated system, F2ConText, that effectively retrieves holistic contextual phenomena from news articles. F2ConText fuses face features with textual entities to provide a better understanding of the contextual scope of persons of interest. The framework does not require any human supervision for mapping image features to textual snippets. Results show that our system captures meaningful contextual features that can be leveraged by other machine learning applications. In the future, we will create a knowledge base of contextual mappings using multiple datasets, such as Wikipedia and multiple news archives. Additionally, we will investigate how contextual information retrieved from news corpora can help predict future events around the globe.

Acknowledgements. This material is based upon work supported by the U.S. Army Engineering Research and Development Center under Contract No. W9132V-15-C-0006.


References

Ahonen, T., Matas, J., He, C. & Pietikainen, M. (2009), Rotation invariant image description with local binary pattern histogram Fourier features, in ‘SCIA’.

Alias-i (2011), ‘LingPipe 4.1.0’, Accessed: Sep 30, 2018. URL: http://alias-i.com/lingpipe/

Bier, E. A., Ishak, E. W. & Chi, E. (2006), Entity workspace: an evidence file that aids memory, inference, and reading, in ‘ISI’, pp. 466–472.

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003), ‘Latent Dirichlet allocation’, JMLR 3, 993–1022. URL: http://dl.acm.org/citation.cfm?id=944919.944937

Bruni, E., Tran, G. B. & Baroni, M. (2011), Distributional semantics from text and images, in ‘Proceedings of the GEMS 2011 workshop on geometrical models of natural language semantics’, Association for Computational Linguistics, pp. 22–32.

Bruni, E., Tran, N.-K. & Baroni, M. (2014), ‘Multimodal distributional semantics’, J. Artif. Intell. Res. (JAIR) 49, 1–47.

Cai, D., Yu, S., Wen, J.-R. & Ma, W.-Y. (2003), VIPS: a vision-based page segmentation algorithm, Technical Report MSR-TR-2003-79.

Chakrabarti, S., Sarawagi, S. & Dom, B. (1998), Mining surprising patterns using temporal description length, in ‘VLDB ’98’, Vol. 98, pp. 606–617.

Chen, L., Hu, B., Zhang, L., Li, M. & Zhang, H. (2003), ‘Face annotation for family photo album management’, IJIG 3(01), 81–94.

Choi, J. Y., De Neve, W., Ro, Y. M. & Plataniotis, K. N. (2010), ‘Automatic face annotation in personal photo collections using context-based unsupervised clustering and face information fusion’, IEEE Transactions on Circuits and Systems for Video Technology 20(10), 1292–1309.

Choi, J. Y., Yang, S., Ro, Y. M. & Plataniotis, K. N. (2008), Face annotation for personal photos using context-assisted face recognition, in ‘Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval’, ACM, pp. 44–51.

Manning, C. D., Raghavan, P. & Schütze, H. (2008), Introduction to Information Retrieval, Cambridge University Press. Chapter 13: Text classification and Naive Bayes.

Das-Neves, F., Fox, E. A. & Yu, X. (2005), Connecting topics in document collections with stepping stones and pathways, in ‘CIKM ’05’, pp. 91–98.

Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996), A density-based algorithm for discovering clusters in large spatial databases with noise, in ‘KDD’.

Faloutsos, C., McCurley, K. S. & Tomkins, A. (2004), Fast discovery of connection subgraphs, in ‘KDD ’04’, pp. 118–127.

Fang, L., Sarma, A. D., Yu, C. & Bohannon, P. (2011), ‘REX: explaining relationships between entity pairs’, Proc. VLDB Endow. 5(3), 241–252.

Fauzi, F., Hong, J.-L. & Belkhatir, M. (2009), Webpage segmentation for extracting images and their surrounding contextual information, in ‘MM’. URL: http://doi.acm.org/10.1145/1631272.1631379

Feng, S., Manmatha, R. & Lavrenko, V. (2004), Multiple Bernoulli relevance models for image and video annotation, in ‘Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on’, Vol. 2, IEEE, pp. II–II.

Feng, Y. & Lapata, M. (2008), Automatic image annotation using auxiliary text information, in ‘ACL’, Vol. 8, pp. 272–280.


Finkel, J. R., Grenager, T. & Manning, C. (2005), Incorporating non-local information into information extraction systems by Gibbs sampling, in ‘Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics’, Association for Computational Linguistics, pp. 363–370. URL: http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

Fisichella, M., Stewart, A., Denecke, K. & Nejdl, W. (2010), Unsupervised public health event detection for epidemic intelligence, CIKM ’10, pp. 1881–1884.

FMS Inc. (2017), ‘Sentinel Visualizer’, Accessed: Sep 30, 2018. URL: www.fmsasg.com

GeoDataSource (2018), ‘World Cities Database’, Accessed: Sep 30, 2018. URL: www.geodatasource.com/world-cities-database

Guillaumin, M., Mensink, T., Verbeek, J. & Schmid, C. (2012), ‘Face recognition from caption-based supervision’, International Journal of Computer Vision 96(1), 64.

Gung, J. & Kalita, J. (2012), Summarization of historical articles using temporal event clustering, in ‘HLT-NAACL ’12’, pp. 631–635. URL: http://dl.acm.org/citation.cfm?id=2382029.2382134

Hassner, T., Harel, S., Paz, E. & Enbar, R. (2015), Effective face frontalization in unconstrained images, in ‘Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition’, pp. 4295–4304.

Heath, K., Gelfand, N., Ovsjanikov, M., Aanjaneya, M. & Guibas, L. J. (2010), Image webs: Computing and exploiting connectivity in image collections, in ‘CVPR ’10’, pp. 3432–3439.

Heuer, R. (1999), Psychology of Intelligence Analysis, CIA ’99.

Hossain, M. S., Andrews, C., Ramakrishnan, N. & North, C. (2011), Helping intelligence analysts make connections, in ‘AAAI ’11 Workshop on Scalable Integration of Analytics and Visualization’.

Hossain, M. S., Butler, P., Boedihardjo, A. P. & Ramakrishnan, N. (2012), Storytelling in entity networks to support intelligence analysts, in ‘KDD ’12’. URL: http://doi.acm.org/10.1145/2339530.2339742

Hossain, M. S., Gresock, J., Edmonds, Y., Helm, R., Potts, M. & Ramakrishnan, N. (2012), ‘Connecting the dots between PubMed abstracts’, PLoS ONE 7(1), e29509.

Huang, C., Ai, H., Li, Y. & Lao, S. (2007), ‘High-performance rotation invariant multi-view face detection’, TPAMI 29(4), 671–686.

Huang, G. B., Ramesh, M., Berg, T. & Learned-Miller, E. (2007), Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst.

IBM Analytics for a Safer Planet (2016), Accessed: Sep 30, 2018. URL: www.ibm.com/analytics/us/en/safer-planet/

IN-SPIRE Visual Document Analysis (2014), Accessed: Sep 30, 2018. URL: http://in-spire.pnnl.gov/

Kader, M. A., Boedihardjo, A. P., Naim, S. M. & Hossain, M. S. (2016), Contextual embedding for distributed representations of entities in a text corpus, in ‘KDD BigMine 2016’, Vol. 53 of Proceedings of Machine Learning Research, PMLR, San Francisco, California, USA, pp. 35–50. URL: http://proceedings.mlr.press/v53/kader16.html

Kader, M. A., Naim, S. M., Boedihardjo, A. P. & Hossain, M. S. (2016), Connecting the dots using contextual information hidden in text and images, in ‘AAAI’.


Kalva, P., Enembreck, F. & Koerich, A. (2007), Web image classification based on the fusion of image and text classifiers, in ‘ICDAR’, Vol. 1, pp. 561–568.

Kang, H., Plaisant, C., Lee, B. & Bederson, B. B. (2007), ‘NetLens: iterative exploration of content-actor network data’, Information Visualization 6(1), 18–31.

Karpathy, A. & Fei-Fei, L. (2015), Deep visual-semantic alignments for generating image descriptions, in ‘CVPR’, pp. 3128–3137.

Karpathy, A., Joulin, A. & Fei-Fei, L. (2014), Deep fragment embeddings for bidirectional image sentence mapping, in ‘Advances in Neural Information Processing Systems’, pp. 1889–1897.

Kinsella, S., Murdock, V. & O’Hare, N. (2011), “I’m eating a sandwich in Glasgow”: Modeling locations with tweets, in ‘SMUC’, pp. 61–68.

Koch, G. G., Koldehofe, B. & Rothermel, K. (2010), Cordies: Expressive event correlation in distributed systems, DEBS ’10, pp. 26–37.

Kumar, D., Ramakrishnan, N., Helm, R. F. & Potts, M. (2006), Algorithms for storytelling, in ‘KDD ’06’.

Lai, J. H., Yuen, P. C. & Feng, G. C. (2001), ‘Face recognition using holistic Fourier invariant features’, Pattern Recognition 34(1), 95–109.

Laxman, S., Sastry, P. & Unnikrishnan, K. (2005), ‘Discovering frequent episodes and learning hidden Markov models: A formal connection’, KDE 17(11), 1505–1517.

Le, D.-D. & Satoh, S. (2008), Unsupervised face annotation by mining the web, in ‘Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference on’, IEEE, pp. 383–392.

Lee, H.-J., Lee, W.-S. & Chung, J.-H. (2001), Face recognition using Fisherface algorithm and elastic graph matching, in ‘ICIP’, pp. 998–1001.

Li, H., Lin, Z., Shen, X., Brandt, J. & Hua, G. (2015), A convolutional neural network cascade for face detection, in ‘Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition’, pp. 5325–5334.

Li, X., Cong, G., Li, X.-L., Pham, T.-A. N. & Krishnaswamy, S. (2015), Rank-GeoFM: A ranking based geographical factorization method for point of interest recommendation, in ‘SIGIR ’15’, pp. 433–442. URL: http://doi.acm.org/10.1145/2766462.2767722

Little, S., Jargalsaikhan, I., Clawson, K., Nieto, M., Li, H., Direkoglu, C., O’Connor, N. E., Smeaton, A. F., Scotney, B., Wang, H. & Liu, J. (2013), An information retrieval approach to identifying infrequent events in surveillance video, ICMR ’13, pp. 223–230. URL: http://doi.acm.org/10.1145/2461466.2461503

Mishra, A., Mishra, N. & Agrawal, A. (2010), Context-aware restricted geographical domain question answering system, in ‘CICN ’10’, pp. 548–553.

National Research Council (2002), Making the Nation Safer: The Role of Science and Technology in Countering Terrorism, The National Academies Press, Washington, DC.

OpenNLP (2017), ‘OpenNLP’, Accessed: Sep 30, 2018. URL: http://opennlp.apache.org

Palantir Gotham (2007), Accessed: Sep 30, 2018. URL: www.palantir.com/palantir-gotham/

Park, M.-H., Hong, J.-H. & Cho, S.-B. (2007), Location-based recommendation system using Bayesian user’s preference model in mobile devices, in ‘UIC’, Vol. 4611, pp. 1130–1139. URL: http://dx.doi.org/10.1007/978-3-540-73549-6_110


Parkhi, O. M., Vedaldi, A. & Zisserman, A. (2015), ‘Deep face recognition’, BMVC 1(3), 6.

Petz, G., Karpowicz, M., Fürschuß, H., Auinger, A., Stříteský, V. & Holzinger, A. (2014), ‘Computational approaches for mining user’s opinions on the Web 2.0’, Information Processing & Management 50(6), 899–908.

Rahman, S., Naim, S. M., Al Farooq, A. & Islam, M. M. (2010), Performance of MPEG-7 edge histogram descriptor in face recognition using principal component analysis, in ‘ICCIT’, pp. 476–481.

Rahman, S., Naim, S. M., Al Farooq, A. & Islam, M. M. (2012), ‘Combination of Gabor and curvelet texture features for face recognition using principal component analysis’, IACSIT 4(3), 264.

Schroff, F., Kalenichenko, D. & Philbin, J. (2015), FaceNet: A unified embedding for face recognition and clustering, in ‘Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition’, pp. 815–823.

Shahaf, D. & Guestrin, C. (2010), Connecting the dots between news articles, in ‘KDD ’10’, pp. 623–632.

Silva, M. J., Martins, B., Chaves, M., Afonso, A. P. & Cardoso, N. (2006), ‘Adding geographic scopes to web resources’, CEUS 30(4), 378–399. URL: http://www.sciencedirect.com/science/article/pii/S0198971505000608

Socher, R. & Fei-Fei, L. (2010), Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora, in ‘Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on’, IEEE, pp. 966–973.

Son, J.-W., Kim, A.-Y. & Park, S.-B. (2013), A location-based news article recommendation with explicit localized semantic analysis, in ‘SIGIR’. URL: http://doi.acm.org/10.1145/2484028.2484064

Stasko, J., Görg, C. & Liu, Z. (2008), ‘Jigsaw: supporting investigative analysis through interactive visualization’, Information Visualization 7(2), 118–132.

Stone, Z., Zickler, T. & Darrell, T. (2008), Autotagging Facebook: Social network context improves photo annotation, in ‘Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on’, IEEE, pp. 1–8.

Sun, Y., Chen, Y., Wang, X. & Tang, X. (2014), Deep learning face representation by joint identification-verification, in ‘Advances in Neural Information Processing Systems’, pp. 1988–1996.

Sun, Y., Wang, X. & Tang, X. (2013), Deep convolutional network cascade for facial point detection, in ‘CVPR ’13’, pp. 3476–3483.

Szegedy, C., Toshev, A. & Erhan, D. (2013), Deep neural networks for object detection, in ‘NIPS’.

Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. (2014), DeepFace: Closing the gap to human-level performance in face verification, in ‘CVPR’, pp. 1701–1708.

Tecuci, G., Boicu, M., Schum, D. & Marcu, D. (2010), Coping with the complexity of intelligence analysis: cognitive assistants for evidence-based reasoning, Technical report, LAC GMU.

Tian, Y., Liu, W., Xiao, R., Wen, F. & Tang, X. (2007), A face annotation framework with partial clustering and interactive labeling, in ‘CVPR ’07’, IEEE, pp. 1–8.

Turk, M. & Pentland, A. (1991), ‘Eigenfaces for recognition’, Cognitive Neuroscience 3(1), 71–86. URL: http://dx.doi.org/10.1162/jocn.1991.3.1.71


United States Government (2009), ‘A tradecraft primer: Structured analytic techniques for improving intelligence analysis’, CIA CSI.

Valstar, M., Martinez, B., Binefa, X. & Pantic, M. (2010), Facial point detection using boosted regression and graph models, in ‘CVPR ’10’, pp. 2729–2736.

Viola, P. & Jones, M. (2001), Rapid object detection using a boosted cascade of simple features, in ‘CVPR’.

Wang, D., Hoi, S. C., He, Y., Zhu, J., Mei, T. & Luo, J. (2014), ‘Retrieval-based face annotation by weak label regularized local coordinate coding’, IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3), 550–563.

Wu, B., Ai, H., Huang, C. & Lao, S. (2004), Fast rotation invariant multi-view face detection based on real AdaBoost, in ‘FG’, pp. 79–84.

Xu, J. & Lu, T.-C. (2015), Seeing the big picture from microblogs: Harnessing social signals for visual event summarization, in ‘IUI ’15’, pp. 62–66.

Yao, B. Z., Yang, X., Lin, L., Lee, M. W. & Zhu, S.-C. (2010), ‘I2T: Image parsing to text description’, Proceedings of the IEEE 98(8), 1485–1508.

Yong-hong, T., Tie-jun, H. & Wen, G. (2005), ‘Exploiting multi-context analysis in semantic image classification’, JZUS-A 6(11), 1268–1283. URL: http://dx.doi.org/10.1007/BF02841665

Zhang, K., Zhang, Z., Li, Z. & Qiao, Y. (2016), ‘Joint face detection and alignment using multitask cascaded convolutional networks’, IEEE Signal Processing Letters 23(10), 1499–1503.

Md Abdul Kader is a Cognitive Engineer at IBM. He received his Ph.D. degree in Computer Science from the University of Texas at El Paso in 2017. He earned his B.S. degree from the University of Dhaka, Bangladesh. Dr. Kader’s primary research interests are Data Mining and Machine Learning with a focus on application areas of national security. He was the winner of the WalmartLabs Machine Learning challenge 2016.

Arnold Priguna Boedihardjo is a principal R&D scientist at Radiant Solutions. He received his Ph.D. degree in Computer Science from Virginia Tech in 2010. Prior to joining Radiant Solutions, he was a senior research scientist at the U.S. Army Engineer Research and Development Center. His research interests include machine learning, geospatial data mining, big data analysis, and information retrieval.


Mahmud Shahriar Hossain has been an Assistant Professor of Computer Science at the University of Texas at El Paso since 2013. He received his Ph.D. from Virginia Tech in 2012. His research focuses on Data Mining and Machine Learning aspects of Big Data Analytics. Along with establishing theoretical foundations, Dr. Hossain’s research tackles knowledge discovery problems in many important areas of national and international interest including intelligence analysis, biomedical science, and sustainability.

Correspondence and offprint requests to: Md Abdul Kader, IBM Innovation Center, Austin, TX 78758.

Email: [email protected]