
Multimedia categorization for event identification in social media collections

Patrick Diviacco
5781205

M.Sc. Thesis Artificial Intelligence
Intelligent Systems

Faculty of Science
University of Amsterdam

July 2011


Multimedia categorization for event identification in social media collections
Patrick Diviacco © July 2011

Supervisor
Marcel Worring - UvA


Abstract

Our lives are full of events of various types, and visual and audio material allows us to document, understand, and experience them again. With the advent of social media communities dedicated to photography and video (Flickr, Google Picasa, Youtube, Vimeo, etc.), personal multimedia material from around the world has become publicly available on the web. However, due to the lack of organization, a considerable amount of uploaded multimedia content is hardly retrievable, not only because social media community data is noisy but also due to the "semantic gap" [49]. Furthermore, as a consequence of the difficulties faced by search engines in indexing and retrieving multimedia content, users find it hard to browse multimedia collections because of their disorganization. Events can play a crucial role if used as the primary means to organize media: by automatically identifying events and their associated user-contributed social media documents, we can both complement and improve retrieval techniques and enable powerful event browsing user interfaces. In this work, we propose an event identification approach for social media collections to retrieve multimedia material related to specific events and organize social media collections on an event basis. We rely on contextual meta-information associated with the documents (e.g. textual descriptions or document creation time), the documents' visual features, users' tagging behavior, and multimedia categorization techniques. Furthermore, we provide an example of an event-based browsing experience, with the design of a browsing application for tablet computers.


Contents

1 Introduction
2 Problem definition
3 Related Work
  3.1 Event identification
    3.1.1 Event identification in news corpora
    3.1.2 Event identification in personal photo collections
    3.1.3 Event identification in large-scale social media collections
  3.2 Multimedia Categorization
    3.2.1 Query-independent vs Query-dependent methods
    3.2.2 Model-based approaches
    3.2.3 Model-free approaches
4 Methodology
  4.1 Data representation
    4.1.1 Textual features
    4.1.2 Visual features
    4.1.3 Category feature
  4.2 Features for aggregated data
  4.3 Similarity metrics
  4.4 Retrieval Framework
    4.4.1 Training module
    4.4.2 Single-pass incremental clustering module
5 Experiments
  5.1 Dataset
  5.2 Retrieval framework & Classifiers
  5.3 Evaluation
  5.4 Experimental setup
6 Results & Discussion
  6.1 Results for the experiment "Finding the optimal feature set"
  6.2 Results for the experiment "Finding the optimal category set"
  6.3 Results for the experiment "Evaluation on a filtered collection"
7 Visualization
8 Conclusions


1 Introduction

Our lives are full of events of various types, and visual and audio material allows us to document, understand, and experience them again. Events can be organized into different categories, such as sports or music, and they range from widely known events, such as political elections, to smaller, community-specific events, such as local gatherings.

With the advent of social media communities dedicated to photography and video (Flickr, Google Picasa, Youtube, Vimeo, etc.), personal multimedia material from around the world has become publicly available on the web. Multimedia communities let us share pictures and videos with our friends as well as with anyone willing to view our material. In this way, the web becomes a resource of collective experiences, views and memories [33]. However, users often do not invest much effort in organizing their own multimedia documents.

Due to this lack of organization, a considerable amount of uploaded multimedia content is hardly retrievable. As a result, the web becomes a tremendous source of information that is hardly accessible to users. Content-based indexing and retrieval techniques only partially solve this problem. Retrieval is particularly difficult not only because social media community data is noisy, but also because of the "semantic gap" [49], which prevents the content of photos and videos from being easily and automatically indexed from extracted low-level visual features. Furthermore, as a consequence of the difficulties faced by search engines in indexing and retrieving multimedia content, users find it hard to browse multimedia collections. Browsing semantically heterogeneous multimedia documents is a nontrivial user task, and navigating through multimedia collections becomes an unpleasant experience.

Events can play a crucial role if used as the primary means to organize media, since many social media documents are produced during specific events. Consequently, organizing multimedia content on an event basis not only facilitates retrieval for search engines, but is also a very natural way for users to browse a collection. The advantages of event-based organization are twofold: by automatically identifying events and their associated user-contributed social media documents, we can both complement and improve retrieval techniques and enable powerful event browsing user interfaces.

In this work, we propose an event identification approach for social media collections to retrieve multimedia material related to specific events and organize social media collections on an event basis. To accomplish this goal we rely on contextual meta-information associated with the documents (e.g. textual descriptions or document creation time), the documents' visual features, users' tagging behavior, and multimedia categorization techniques.

Furthermore, we provide an example of an event-based browsing experience, with the design of a browsing application for tablet computers that makes use of our event identification approach. In the mobile context, more than in other contexts, users usually want fast access to information, with shorter and simpler interactions than on desktop computers. New event-based user interfaces are needed to provide a more enjoyable browsing experience to the end user.


2 Problem definition

An event is defined as "a specific thing happening at a specific time and place" [15], and it belongs to one of three categories: micro, meso and macro events. Micro events are personal events (e.g. our wedding, the birth of our children, or our holidays). Next, there are events that we attend, such as concerts and sports games (meso events). Finally, there are macro events happening around us in the world, which make it into the news and affect many people around the globe. Social media collections host substantial amounts of user-contributed material (e.g. photographs, videos, and textual content) for a wide variety of events. In our work, we mainly consider multimedia material from micro and meso events.

Events can be used as the primary means to organize media. In our work we aim to index large-scale multimedia collections by events. Given a stream of new documents di just uploaded to a collection, we aim to determine whether they belong to the same event and to cluster them into the same event set cj. For instance, given a set of unorganized documents and a subset of documents belonging to a specific sports event, such as the "World Cup 2010", our ultimate goal is to identify and group the documents from this subset into a cluster representing the event.
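
The clustering formulation above can be sketched as a single-pass incremental algorithm of the kind listed in the table of contents (Section 4.4.2). The similarity function and the 0.5 threshold below are illustrative placeholders, not the thesis's actual metrics or tuned parameters.

```python
def single_pass_cluster(documents, similarity, threshold=0.5):
    """Assign each incoming document to the most similar existing
    event cluster, or start a new cluster if no cluster is similar
    enough. `similarity(doc, cluster)` and `threshold` are
    illustrative placeholders, not the thesis's tuned values."""
    clusters = []  # each event cluster is a list of documents
    for doc in documents:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(doc, cluster)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(doc)       # join the most similar event
        else:
            clusters.append([doc])  # open a new event cluster
    return clusters
```

Because each document is compared only against the existing clusters, the procedure processes a stream in a single pass, which is what makes it applicable to constantly growing collections.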

The intuition behind our work is that the relatedness of documents to a generic event category is valuable information to be used together with the rich "context" associated with social media content. For this reason, we propose an "event identification" approach for social media collections that integrates "multimedia categorization". For instance, if our task is to retrieve multimedia material related to the event "Radiohead concert in May 2009", we propose to take into consideration the relatedness of the documents to its general category "music", and to integrate this information into the event identification framework. More specifically, we define similarity metrics to measure the similarity between documents di and event clusters cj. These metrics are based on both the contextual meta-information associated with the collection documents (e.g. textual descriptions or document creation time) and a measure of the documents' relatedness to generic event categories categz.
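
As a rough illustration of such metrics, a document-to-cluster similarity can be computed as a weighted combination of a textual, a temporal, and a category component. The component definitions, the one-day time scale, and the weights below are assumptions made for this sketch, not the thesis's actual formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def time_similarity(t_doc, t_cluster, scale=24 * 3600.0):
    """Exponentially decaying similarity between a document's
    creation time and a cluster's mean time; the one-day scale
    is an assumed constant."""
    return math.exp(-abs(t_doc - t_cluster) / scale)

def combined_similarity(doc, cluster, weights=(0.5, 0.3, 0.2)):
    """Weighted combination of textual, temporal and category
    similarity. `doc` and `cluster` are dicts with assumed keys
    "text", "time" and "category"; the weights are illustrative."""
    w_text, w_time, w_cat = weights
    return (w_text * cosine(doc["text"], cluster["text"])
            + w_time * time_similarity(doc["time"], cluster["time"])
            + w_cat * cosine(doc["category"], cluster["category"]))
```

Identical documents score 1.0, and each component contributes at most its weight, so the weights directly express how much each source of evidence is trusted.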

Since we are interested in detecting multimedia material associated with specific events, and we aim to detect such events in social media collections, our research focuses on event identification. This differs from the categorization problem, which classifies multimedia documents into generic categories categz. In the first case the goal is to determine whether a given document is related to, for instance, a specific Radiohead concert that occurred in May 2009 (represented by one cluster), while in the second case the goal is to determine which category categz (e.g. "music") the document belongs to. Most event-related multimedia mining techniques focus on the second task, that is, the categorization of documents into event categories or classes [8], [5], [7]. We address the first problem: although we make use of multimedia categorization techniques, our ultimate goal is not to categorize multimedia documents into generic classes but to identify documents related to the same specific event.

In summary, we propose an event identification approach for social media collections that makes use of contextual information and multimedia categorization to identify specific events, retrieve the related multimedia material, and organize social media collections on an event basis.


3 Related Work

In this section we give an overview of the state of the art in the two main fields of our research: event identification in social media collections, and multimedia categorization.

3.1 Event identification

The topic of event identification is not new; the first papers addressing it appeared as early as 1998 as part of the Topic Detection and Tracking (TDT) initiative [16], even though they dealt not with multimedia collections but with textual material only.

In [17] the authors introduce two different types of event identification methods: retrospective and online detection. The former refers to the discovery of previously unidentified events inside a collection: such events occurred in the past and can be identified by mining large multimedia collections. The latter strives to identify new events in real time from live news feeds: such events have just occurred or are currently happening and can be identified by analyzing live multimedia streams. As mentioned before, in our work we target the online detection of events in large-scale collections, which constantly evolve and grow in size. However, we build our approach using retrospective detection over a past set of collection documents.

In the next paragraphs we give an overview of the literature on event identification. We start by describing the first efforts, which dealt with news corpora without multimedia content. Then we describe approaches addressing event identification in personal media collections. Finally, we discuss work addressing our problem: event identification in social media community data. Differently from approaches dealing with personal collections, the latter operate on large-scale collections consisting of millions of documents provided by multiple users. For this reason, methods designed for personal collections cannot be directly applied: methods need to scale to huge datasets and cannot rely on the assumption that all documents are provided by the same user.

3.1.1 Event identification in news corpora

As one of the very first efforts in event detection, [17] presents a simple agglomerative clustering algorithm, called augmented Group Average Clustering, to discover events in text corpora.

Arguing that most existing research on Retrospective news Event Detection (RED) makes use of only the contents of the news articles, the authors of [27] propose to explore both content and time information and introduce a probabilistic model that incorporates both sources of information in a unified framework. Similarly, the authors of [26] also utilize both time and content information. However, in contrast to TDT, which attempts to cluster documents into events using clustering techniques, [26] focuses on detecting a set of bursty features for a bursty event. The main technique employed in the paper is a parameter-free probabilistic approach that fully utilizes the time information to determine a set of bursty features, which may occur in different time windows.

The previous works identify events in news corpus collections. Recently, new methods targeting other types of data have been proposed.


3.1.2 Event identification in personal photo collections

The first works addressing the event identification problem with multimedia content targeted personal photo collections. A key characteristic of the personal photo collection domain is the general assumption of "a single camera", which reduces event identification to a problem of temporal segmentation. Events are considered to be a single segment of time over which a single activity was taking place, providing a coherent, unifying context. Prior work on this problem has applied a number of techniques: some rely primarily on time [19], [29], others use both locations and times [20], [22] or the text annotations associated with photos [21], combine contextual data with low-level image analysis [31], [30], or model the user's picture-taking behavior [28].
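
Under the "single camera" assumption, the simplest time-based variant of these techniques splits the photo stream wherever a large temporal gap occurs. The fixed 6-hour gap below is an assumed value for illustration; the cited works use more elaborate, often adaptive criteria.

```python
def segment_by_time(timestamps, gap_hours=6.0):
    """Split a time-ordered list of photo timestamps (in seconds)
    into events wherever the gap between consecutive photos
    exceeds a threshold. The 6-hour default is an assumed value,
    not taken from the cited works."""
    if not timestamps:
        return []
    events, current = [], [timestamps[0]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev > gap_hours * 3600:
            # gap too large: close the current event, start a new one
            events.append(current)
            current = []
        current.append(t)
    events.append(current)
    return events
```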

The above-mentioned works differ from the approaches to event identification in large-scale social collections described in the next section. While the former are based on "personal collections", the latter operate on "social media collections": large-scale collections consisting of millions of documents. For this reason, methods designed for personal collections cannot be directly applied but need to scale to huge datasets. Furthermore, the "single camera" assumption no longer holds, since multiple users contribute to these collections. In this new setting, the event identification problem cannot be reduced to a temporal segmentation problem.

In our work we focus on "social media collections" only. In the next section we analyze works dealing with the event identification problem in large-scale social media collections.

3.1.3 Event identification in large-scale social media collections

Research efforts working with large-scale social media collections mainly analyze user-provided annotations associated with the documents, such as titles, tags and descriptions, together with automatically generated meta-information such as creation time and location.

To the best of our knowledge, the Flickr web service is the most analyzed online social media collection, because of its worldwide success among web users and the huge amount of public content available [32]. Flickr consists of more than 5 billion images provided by more than 40 million users. The images are usually annotated with textual descriptions, tags, creation time and location information.

In [4] the authors propose an approach for detecting Flickr photos depicting events. Given a set of Flickr photos with both user tags and other metadata, including time and location (latitude and longitude), the algorithm aims to discover a set of photo groups, where each group corresponds to an event. The method consists of three steps: (1) based on their temporal and spatial distributions, tags are identified as event-related or not; (2) the detected event-related tags are further classified into periodic-event tags (daily, monthly or yearly occurrence) or aperiodic-event tags (occurring just once); (3) finally, for each tag cluster representing an event, the set of photos corresponding to the event is retrieved. However, this approach strongly relies on geographical information, which is still nonexistent for many pictures.

In [1], similarity metrics are built considering multiple context features of the documents: textual descriptions, time and location. In their setting, clustering performance improves when the variety of social media features is combined judiciously. Their work is based on collections consisting of event-related documents only, which is not a realistic assumption, since multimedia collections such as Flickr contain documents of any kind.

The approach described in [7] aims both to classify pictures into events and to identify specific events. It relies entirely on user-generated information, experimenting both with simple types of features, such as tags, time information, photo titles and descriptions, and with different combinations of them. Based on this information, the authors construct classifiers that automatically assign pictures to their associated event categories. Since they also consider tags describing meso events, they essentially perform event identification as well.

The techniques in [7] are purely classification-based, whereas in [1] clustering techniques are employed as well. According to the authors, [7] is simpler and has broader applicability because it does not rely on geographical information. However, [1] does not necessarily rely on the documents' geo-coordinates either (they are not available for most of the collection used in their experiments) and, most importantly, the


event classes are not known beforehand and are not directly used to identify events (differently from [7]). In other terms, while [7] needs a specific tag describing a meso event to know about its existence, [1] considers a set of features (including the tags associated with the event-related documents) to determine the documents' similarity and eventually cluster them into the same event.

Other approaches exploit the social nature of online social media collections to improve document clustering. The authors of [9] propose to use different types of social links, in addition to the rich context features associated with social media documents [1]. While social links between document pairs may be too weak to capture similarity, links between clusters of documents may be more revealing.

The work in [18] describes an effort to detect events from the interactions of users with the collection. Although not all interactions provide insights relevant to events, the approach directly clusters query-document pairs without addressing the issue of noise. Moreover, such an approach is restricted to web search companies rather than general users, who do not have access to such information.

We have given an overview of the current state of the art in event identification for textual news corpora, personal multimedia collections and social media collections. The works in the last category are the most interesting for us, since we address the problem of retrieving, from social media collections, the multimedia material associated with specific events.

Since we intend to combine multimedia categorization techniques with event identification techniques, in the next section we provide an overview of the state of the art in multimedia categorization.

3.2 Multimedia Categorization

There are two major approaches in multimedia categorization:

  • Model-based approaches require models to be trained for a predefined set of visual concepts using labeled examples. The trained models are then used to tag new images according to their relevance to the concepts [46], [47], [48], [61], [64], [62], [63].

  • Model-free approaches assume that visually similar images are annotated by a similar set of tags. For a given image, tags are recommended from among those associated with its nearest neighbors by visual content similarity [12], [59], [58], [60], [6], [43], [3], [45], [2].

Model-based versus model-free is not the only distinction between multimedia categorization approaches. We can also distinguish between methods that rely on the query to categorize the documents (query-dependent) and those that do not (query-independent). Figure 1 summarizes these typologies of approaches.

Figure 1: Typologies of multimedia categorization approaches

In the next sections, the existing literature is presented by comparing query-dependent and query-independent methods and discussing model-based and model-free approaches.

However, before going through the various approaches, we introduce the main problem that multimedia categorization algorithms have to deal with: the semantic gap.


Semantic Gap The majority of current content-based image retrieval techniques are primarily based on low-level features. However, humans usually judge image similarity based on high-level concepts rather than on similarity measures between low-level features. This leads to the semantic gap problem in image database indexing and retrieval.

More precisely, "the semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation" [49].

High-level concept indexing using textual terms seems a reasonable solution to the limitations of low-level feature indexing. It has higher expressive power, since it can describe most aspects of image content. In addition, keywords relate directly to the user's vocabulary. However, it suffers from the main bottleneck in media semantics: human effort is required.

3.2.1 Query-independent vs Query-dependent methods

A distinction between categorization techniques can be made depending on whether or not they rely on the search query for the categorization task.

Query-independent methods improve tagging quality either by adding new annotations [46], [47], [48] or by removing existing noisy ones [56], [57]. They are mostly model-based, and several of them are discussed in the next paragraphs.

Given a user query, query-dependent methods try to improve search results either by re-ranking them using pseudo relevance feedback algorithms [42], [50], [51], [52], [53] or by expanding the original query [46], [54], [55].

Re-ranking algorithms are used because tag-based social image retrieval can yield search results that are inconsistent in terms of image relevance. The assumption is that the majority of search results are relevant to the query, and that relevant examples tend to share similar visual patterns such as color and texture.

For example, in [42] a relevance-based ranking scheme for social image search is proposed to automatically rank images according to their relevance to the query tag. It integrates both the visual consistency between images and the semantic correlation between tags in a unified optimization framework. Visual repetition in a large collection of social images is an important signal for inferring a common "visual theme" for the query tag. Finding such a consistent visual theme and its relative strength is the basis of the proposed image ranking scheme.
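
A much-simplified stand-in for such a scheme scores each search result by its average visual similarity to the other results, so that visually consistent images rank first. The actual method of [42] solves a unified optimization problem; this naive scoring only illustrates the underlying assumption.

```python
def rerank_by_consistency(results, visual_similarity):
    """Re-rank search results so that images visually consistent
    with the rest of the result set come first. Each result is
    scored by its mean similarity to the other results; a toy
    stand-in for the optimization framework of [42]."""
    def score(i):
        others = [j for j in range(len(results)) if j != i]
        return sum(visual_similarity(results[i], results[j])
                   for j in others) / max(len(others), 1)
    order = sorted(range(len(results)), key=score, reverse=True)
    return [results[i] for i in order]
```

An outlier that matches the query tag but not the dominant "visual theme" receives a low consistency score and sinks to the bottom of the ranking.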

In spite of the positive results, density estimation, which is often used to measure visual similarity among pictures, is known to be inaccurate when feature dimensionality is high and samples are insufficient, which is mostly the case in the re-ranking scenario. Moreover, the associated computational expense puts the utility of re-ranking methods for social image retrieval into question.

Besides re-ranking algorithms, query expansion methods augment the original query by adding relevant terms. These methods are either lexicon-driven (integrating synonyms from online dictionaries such as WordNet [66]) or corpus-driven (selecting related terms from snippets returned by search engines such as Google [65]).
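
A toy corpus-driven variant can be sketched by expanding the query with the tags that most often co-occur with it in the collection itself. The co-occurrence criterion and the function below are our own illustrative simplification of snippet-based expansion, not the methods of [65] or [66].

```python
from collections import Counter

def expand_query(query_tag, tagged_documents, k=2):
    """Corpus-driven query expansion sketch: add the k tags that
    most often co-occur with the query tag across the collection.
    `tagged_documents` is a list of tag lists, one per document."""
    co = Counter()
    for tags in tagged_documents:
        if query_tag in tags:
            co.update(t for t in tags if t != query_tag)
    return [query_tag] + [t for t, _ in co.most_common(k)]
```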

3.2.2 Model-based approaches

Multimedia categorization methods for social media collections are often model-based and rely heavily on complex machine learning algorithms. In general, these methods boil down to learning a mapping between low-level visual features and high-level semantic concepts. That is, by treating tags as visual concepts, they first train a concept classifier for each tag and then use the learnt classifiers to categorize multimedia documents.

Classifiers that take the low-level features of images and recognize/classify them into high-level conceptual categories have been designed in [46], [47], [48], [61], [64]. In addition, semi-supervised learning methods have been explored in the recent literature [62], [63].
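
A minimal instance of the per-tag classifier idea, with a toy nearest-centroid model standing in for the learned classifiers of the cited works, might look as follows. It assumes every tag has both positive and negative training examples; the feature vectors and helpers are our own illustration.

```python
def _centroid(vectors):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(x) / len(vectors) for x in zip(*vectors)]

def _dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_tag_classifiers(examples):
    """Train one binary concept "classifier" per tag as a pair of
    (positive centroid, negative centroid) over low-level feature
    vectors. `examples` is a list of (feature_vector, tags) pairs;
    each tag must have positives and negatives."""
    all_tags = {t for _, tags in examples for t in tags}
    classifiers = {}
    for tag in all_tags:
        pos = [f for f, tags in examples if tag in tags]
        neg = [f for f, tags in examples if tag not in tags]
        classifiers[tag] = (_centroid(pos), _centroid(neg))
    return classifiers

def predict_tags(features, classifiers):
    """Assign every tag whose positive centroid is closer to the
    image's features than its negative centroid."""
    return sorted(t for t, (p, n) in classifiers.items()
                  if _dist2(features, p) < _dist2(features, n))
```

Real model-based systems replace the centroids with trained discriminative models, but the tag-as-concept structure is the same.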


Visual Vocabularies In model-based approaches, the set of high-level visual concepts is usually selected manually, such that the concepts are relatively easy to model (e.g. water, lake, building), and indexed in visual vocabularies.

For example, in [44] a visual tag dictionary is introduced, in which each tag is interpreted as a distribution over visual words, analogous to conventional dictionaries that explain terms with textual words: first, a large image dataset is gathered; second, key-points are detected in each image; finally, the key-points are grouped into clusters, and each cluster is treated as a visual word, as it represents a pattern of local image patches.

Similarly, the authors of [11] propose to learn a visual vocabulary by decomposing the images in a collection collectively. A set of local features is extracted from each image or region, and k-means clustering is performed over the training features, with each cluster centroid set as a visual word.
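
The vocabulary construction described above can be sketched with a minimal Lloyd's k-means (initialized deterministically with the first k points for reproducibility) followed by bag-of-visual-words quantization. Real systems would run a robust k-means implementation over SIFT-like local descriptors; this is only a structural sketch.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Minimal Lloyd's k-means over n-D points (lists of floats),
    initialized with the first k points for determinism; each
    resulting centroid plays the role of a visual word."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            buckets[i].append(p)
        for i, b in enumerate(buckets):
            if b:
                centroids[i] = [sum(x) / len(b) for x in zip(*b)]
    return centroids

def bag_of_visual_words(features, centroids):
    """Quantize an image's local features against the vocabulary:
    histogram of nearest visual words."""
    hist = [0] * len(centroids)
    for f in features:
        i = min(range(len(centroids)), key=lambda c: dist2(f, centroids[c]))
        hist[i] += 1
    return hist
```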

Compared to the potentially unlimited vocabulary existing in social tagging, currently only a very limited number of visual concepts can be effectively modeled using small-scale datasets. Moreover, uncontrolled visual content contributed by amateurs creates a broad-domain environment with significant diversity in visual appearance even for the same concept [49], making the learned classifiers unreliable and hardly generalizable.

In conclusion, training such a large number of model-based classifiers is computationally prohibitive. Furthermore, although training can often be performed offline, larger training sets typically incur high computational costs at test time. As a result, a variety of data sampling approaches have been proposed for classifier training. Nevertheless, despite their success on small-scale image databases, model-based approaches do not scale to the massive amounts of socially tagged images, and model-free approaches have been considered for large-scale collections.

3.2.3 Model-free approaches

An emerging point of view is that "model-free" or weak learning techniques can be preferable in large-scale domains [46]. Given enough training data, simple models can often do as well as or better than more complex models.

Retrieval approaches for automated photo tagging usually retrieve the k social images that share the largest visual similarity with an image in order to tag it. To measure similarity between images, such methods design a feature representation scheme to extract salient visual features and a distance measure to effectively calculate distances between the extracted features. As an example, the authors of [6] focus their efforts on the second challenge: assuming features are represented in a vector space, their goal is to learn an optimal distance metric.
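
A compact sketch of this retrieval-based tagging scheme, with the feature representation and the distance function left as caller-supplied assumptions:

```python
from collections import Counter

def recommend_tags(image_features, collection, distance, k=3, top=2):
    """Recommend tags for an untagged image from the tags of its
    k visually nearest neighbours. `collection` is a list of
    (features, tags) pairs; `distance` is a caller-supplied
    visual distance function. A simplified sketch of the
    retrieval approaches described above."""
    neighbours = sorted(
        collection,
        key=lambda item: distance(image_features, item[0]))[:k]
    votes = Counter(t for _, tags in neighbours for t in tags)
    return [t for t, _ in votes.most_common(top)]
```

Note that learning a good `distance`, as [6] does, directly determines which neighbours vote and therefore which tags are propagated.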

Model-free methods are developed in [59], [58] to learn visual concepts from web images. The intuition relies on the assumption that there exists a well-annotated, unlimited-scale image database such that, for any unlabeled image outside the database, we can find its visual duplicate. However, due to the semantic gap problem described before, that is, the inconsistency between visual similarity and semantic similarity, irrelevant tags may also be propagated.

Methods such as those mentioned above treat tags uniformly: they do not attempt to differentiate between tags that describe the visual content of images and tags that describe other aspects. However, tags are often inconsistent and disorganized, since they provide information about location, time, subjective aspects, device settings, etc. Consequently, not all tags contributed by common users objectively describe the associated images, that is, not all tags are representative of the visual content.

Tag Relevance The relevance of a tag given the visual content is often subjective: relevance is indeed a relationship between an image and a user. Consequently, to find images relevant to a majority of users, an objective criterion of tag relevance is required: a tag is relevant to an image if it accurately describes objective aspects of the visual content, meaning the content can be easily and consistently recognized by common knowledge.

Intuitively, a visually representative tag (such as “sunset”, “sky”, “tiger”) easily suggests the scene or object an image may depict. On the other hand, tags like “2009” or “Asia” often fail to suggest anything meaningful about the visual content of the annotated image.

It is important to underline that tags are mostly used once per document. This implies that, within a document, relevant tags and irrelevant ones are not distinguishable by their occurrence frequency. Hence, given that tags are ambiguous, noisy, and limited, a fundamental problem in social image retrieval is how to reliably learn the relevance of a tag with respect to the visual content it describes.

The authors of [12] argue that automatically identifying content-related tags can enable more intelligent use of social images and tags, and thus facilitate research and applications over these resources. To this end, they propose a data-driven method to analyze the relatedness between the tags and the content of the images.

The authors of [43] aim to quantify how representative a tag is of the visual content, also known as tag visual-representativeness. Their intuition is as follows: if a set of images all share a similar visual concept, then such a set is visually coherent. Furthermore, if all users have implicitly developed consensus on a specific tag associated with all images from this set, then that tag is visually representative. Two distance metrics, namely cohesion and separation, are used in their approach to quantify the visual-representativeness of a tag by measuring (1) how well the set of tagged images presents similar visual content, and (2) how distinct the common visual content is with respect to the entire image collection.

In [3], a probabilistic approach is adopted to estimate the initial relevance score of each tag for each image individually, and then refine the relevance scores by running a random walk process over a tag graph in order to mine the correlations between the tags. In the construction of the tag graph, they combine an exemplar-based similarity using visual cues and a concurrence-based similarity based on tag co-occurrences. Their effort aims to maximize the inter-label difference and at the same time minimize the intra-label difference of the target label representations.

In [45] two types of scalability are considered: the number of training examples and the number of categories. The algorithm works as follows: first, color features representing each photo are extracted. Second, the training features are indexed using a set of spatial trees that enable efficient approximate nearest neighbor search. Third, the training images in the local neighborhood of a test image are located by searching each tree, and a corresponding set of weak annotations is calculated via distance-weighted voting. Finally, the weak annotations are combined using boosting to produce a final annotation score for each tag.

Finally, the authors of [2] propose to learn the relevance of a tag with respect to an image from the tagging behavior of the visual neighbors of that image. For a given image, its k nearest neighbors are obtained by computing visual similarity through low-level features. Tags that frequently appear among the nearest neighbors (with respect to the tags' prior distribution over all images) are considered relevant to the given image.

Differently from [45], which uses a voting method that incorporates distances in the feature space, in [2] such distances are only used to determine the visual neighbors of a given document. Indeed, the neighbor voting algorithm used in [2] does not directly use the visual similarity distances to compute tag relevance, but only considers tag co-occurrences in the neighborhood. By propagating common tags through links induced by visual similarity, each tag accumulates its relevance credit by receiving neighbor votes.

Furthermore, differently from other query-independent approaches, the work in [2] preserves the original tags and estimates tag relevance through votes from visually similar neighbors: only common tags shared by neighbors are propagated, and no new tags are introduced to an image. Such a self-validation mechanism reduces the risk of incorrectly propagating irrelevant tags due to the inconsistency between visual similarity and semantic similarity.


4 Methodology

As specified before, in our work we address the problem of identifying specific events in social media collections and retrieving the related multimedia documents. Given a collection of documents, our goal is to assign each document di of the collection to a specific event, represented by a cluster cj. The events are not known beforehand, and are discovered by exploring the collection. In order to identify all multimedia material associated with each specific event, similar documents are progressively clustered; after all data is processed, each cluster cj consists of documents associated with the same event.

We compute similarity measures between documents and clusters to determine how strongly a document is related to a specific event. The similarity measures are built by combining techniques from both “event identification” and “multimedia categorization” approaches. We selected two interesting works from the literature to design our approach: [1] and [2].

As mentioned before, on the one hand we use the contextual information associated with the documents, more precisely a set of features such as textual descriptions or document creation time: for each feature we compute an appropriate similarity metric. We adopted and adapted the similarity metrics o(di, cj) from [1], exploiting a rich family of features to determine the similarity between a document di and an event cluster cj. Similar to their approach, we take advantage of all contextual information usually available in social media collections: both user-provided data such as textual content (titles, descriptions and tags) and automatically generated content (creation time and geo-coordinates).

In addition, we compute the category feature categz, that is, the “relatedness” of each document to generic event categories such as “music”, “sport”, etc., and we build an additional similarity measure based on the categorization scores. We use a subset of the collection tags to determine the scores for each category. Indeed, each category consists of a set of conceptually similar keywords, and the scores are computed according to the document tags wi matching the keywords of any category.

Tags can be more or less relevant to the visual content of a document. We adopted the framework from [2] to compute the relevance of the tags associated with each document. As mentioned above, we consider the subset of tags matching event category keywords (e.g. the tag “concert” is also a keyword for the category “music”). Event categories are described in more detail in the next sections. The relevance of each tag wi in a document di is measured according to the tags of the visual neighbors nk of that document: for a given document, its k nearest neighbors are obtained by computing visual similarity through low-level features. Essentially, the more a tag co-occurs in the current document di and its neighbors nk, the more relevant it is.

The intuition behind our approach is that the relatedness of a document to a generic event category is valuable information, not only for the categorization task, but also for the specific event identification task. In other words, we want to verify whether the “relatedness” of a document di and a cluster cj to generic event categories is useful for associating the document with the specific event represented by the cluster. Coming back to the example from the introduction, given a set of documents related to the specific event “Radiohead concert in May 2009”, we propose to take their “relatedness” to generic event classes (e.g. “music”, “sport”) into consideration.

Finally, we employ an incremental clustering framework to partition documents into clusters (similar to [1]). Each cluster corresponds to an event and includes the social media documents associated with the event. Therefore, the event identification problem is posed as follows: given a set of social media documents di, where some documents are associated with an (unknown) event, the goal is to partition this set of documents into clusters cj such that each cluster corresponds to all documents that are associated with one event.

Before proceeding with the description of the retrieval framework, we give an overview of the data representation and list the similarity metrics for each context feature.

4.1 Data representation

Our document representation consists of three types of features: textual, visual and category features. In this section we describe how these features are computed. In the subsequent section we describe the similarity metrics built upon them. All features are computed in the feature extraction step (see figure 3).

4.1.1 Textual features

As a distinctive characteristic, social media documents include a variety of meta-information, dependent on the type of document (e.g. a “duration” feature is meaningful for videos but not for photos). However, many social media sites share a core set of features. These features, named textual features, include:

• author, with an identifier of the user who created the document (e.g. “wilshirepix”),

• title with the name of the document (e.g. “Shot with an iPhone”),

• description with a short paragraph summarizing the document contents (e.g. “Taken at a Flaming Lips concert at the Greek Theatre. Thanks for all the great comments, views and faves.”),

• tags, with a set of keywords describing the document contents (e.g. “iPhone”, “picture”, “flaminglips”),

• time/date with the time and date when the document was published (e.g. “August 17, 2009”),

• location with the location associated with the document (e.g. “longitude = -2.216792, latitude = 53.460167”).

An example of the contextual information associated with a Flickr document is shown in figure 2. Textual features are collected from the meta-information associated with the social media collection documents.

4.1.2 Visual features

Low-level visual features are extracted from all images in the collection. They are not directly used to determine how much a document is related to a specific event; instead, they are used to compute the category feature, described below.

We choose a combined 64-dimensional global feature as in [2]: for each image, we extract the 44-dimensional color correlogram in the 44-bin HSV color space, 14-dimensional color texture moments, and 6-dimensional RGB color moments. The three features are each normalized to unit length and concatenated into the final 64-d feature.
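As a minimal sketch of this step (assuming NumPy; the function and argument names are ours, not from the thesis implementation), the per-feature unit-length normalization and concatenation can be written as:

```python
import numpy as np

def combine_features(correlogram, texture_moments, color_moments):
    """Build the 64-d global feature: normalize each sub-feature to unit
    length, then concatenate (44 + 14 + 6 = 64 dimensions). Function and
    argument names are illustrative, not from the thesis code."""
    parts = []
    for v in (correlogram, texture_moments, color_moments):
        v = np.asarray(v, dtype=float)
        norm = np.linalg.norm(v)
        parts.append(v / norm if norm > 0 else v)  # guard all-zero vectors
    return np.concatenate(parts)
```

Normalizing each sub-feature separately prevents any single descriptor from dominating the Euclidean distances used later for neighbor search.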

4.1.3 Category feature

On top of the above-mentioned features, the category feature is computed for each document. The category feature describes how much a document is related to the event categories. It is computed as follows: first, event categories and associated keywords are defined; second, the visual features are used to select the k visual neighbors of each document; third, the document's neighbors vote for the relevance of its tags; finally, the category scores are retrieved with the Okapi BM25 retrieval framework (similar to [2]). Following is a description of each step.


Figure 2: A Flickr document

Category definition The event categories are defined in advance, and they categorize all types of events. Each category has a representative set of keywords describing its content.

Keywords are words conceptually close to the category definition (e.g. “concert” is a keyword of the category “music”). They are selected from the category descriptions in the original collection, and can be optimized with the help of online lexical databases. In the next sections we describe how keywords are used to define category scores given the document tags. Following is the list of the categories with the related keywords:

• music: concert, nightlife, rave

• performing : visual arts, theatre, dance, opera, exhibition, art, visual

• media: film, book, readings, reading, movie, cinema, script, playscript

• social : rally, gathering, user group, mass meeting, group meeting

• education: lecture, workshop, instruction, teaching, pedagogy, didactic, public lecture, talk

• commercial : convention, expo, exposition, flea market, market, marketplace

• festival : big event

• sport : recreation, athletics

• comedy : stand-up, improv, comic theatre, clowning, clown

• politics: rally, fundraiser, meeting, mass meeting


• family : kid, child, dad, father, mother, mum, children, parents

• conference: tradeshow, group discussion

• community : neighborhood, residential district, residential area, district, residential, vicinity, locality

• technology : engineering, applied science, science, computers

k Neighbors search For each document di, we perform a k-Nearest Neighbor search to find its k visual neighbors nk, so that the neighbor voting algorithm can determine the relevance of tags to their document. The visual dissimilarity between images is measured using the Euclidean distance between the extracted visual features. The value of k, defining the number of neighbors to consider, is found empirically.
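A brute-force version of this search can be sketched as follows (assuming NumPy, with the 64-d global features stacked one row per document; at collection scale an index structure would replace the exhaustive scan):

```python
import numpy as np

def k_nearest_neighbors(features, i, k):
    """Return the indices of the k visually closest documents to document i,
    using Euclidean distance on the global features (one row per document).
    Brute-force sketch; names are illustrative, not from the thesis code."""
    dists = np.linalg.norm(features - features[i], axis=1)
    dists[i] = np.inf                     # exclude the document itself
    return np.argsort(dists)[:k]
```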

Neighbor Voting After we determine the document's neighborhood, we compute tag relevance. For each neighbor nk, we use its tags to vote on the tags of the document di (as in [2]). We get the document tags from the document's textual features, previously collected from the collection. However, at this step, only the subset of tags matching the category keywords is considered.

The initial term frequency tf of tags for all documents in the collection is 1, since tags are added once to each document di. When a tag co-occurs in the current document di and one of its neighbors nk, its tf is increased by 1. If it co-occurs in another neighbor, it is increased again, and so on. As a consequence, its relevance grows.
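The voting rule above can be sketched as follows (the function name and data shapes are illustrative assumptions, not the thesis code):

```python
def vote_tag_frequencies(doc_tags, neighbor_tags, category_keywords):
    """Neighbor voting sketch: each category-keyword tag of the document
    starts with tf = 1; every visual neighbor that carries the same tag
    adds one vote, so frequently co-occurring tags gain relevance."""
    tf = {t: 1 for t in doc_tags if t in category_keywords}
    for tags in neighbor_tags:          # one tag set per visual neighbor
        for t in tf:
            if t in tags:
                tf[t] += 1
    return tf
```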

Category scoring Once all the collection documents have been voted upon, we adopt Okapi BM25, a well-founded ranking function for text retrieval [68], to compute the category scores. For each document, the category scores (given by the tag relevance) are computed, defining the relatedness of the document to the categories (e.g. music: 0.2343, sport: 0.8242, etc.).

Given a query cq containing all the category keywords w1, ..., wn, the relevance score of a category for a document di is computed as

score(cq, di) = Σi qtf(wi) · idf(wi) · tf(wi) · (k1 + 1) / ( tf(wi) + k1 · (1 − b + b · Ldi/Lave) ),    (1)

where qtf(wi) is the frequency of the keyword wi in cq (in our case always 1) and tf(wi) is the frequency of the tag wi in document di. tf(wi) is the key term in the formula above: before neighbor voting, each document tag wi has tf(wi) equal to 1. After neighbor voting this is no longer the case, since relevant tags have an incremented tf(wi), as explained above.

Ldi is the total number of tags in di, Lave is the average value of Ldi over the whole collection, and idf(wi) is the inverse document frequency. Finally, the variable k1 is a positive parameter regulating the effect of tag frequency, and the parameter b (0 ≤ b ≤ 1) determines the scaling by Ldi.

When all the scores score(cq, di) are computed, we have estimated all document-category relatedness values, and these comprise the category features.
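Equation (1) can be sketched as follows; the k1 and b defaults shown are common BM25 choices, not values stated in the thesis, and all names are illustrative:

```python
def bm25_term(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    """One keyword's contribution to Eq. (1), with qtf(wi) = 1 as in the
    text. k1 and b are common BM25 defaults, an assumption here."""
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def category_score(doc_tf, keywords, idf, doc_len, avg_len):
    """Relatedness of one document to one category: sum the BM25
    contributions of every category keyword that appears among the
    document's (neighbor-voted) tags."""
    return sum(bm25_term(doc_tf[w], idf.get(w, 0.0), doc_len, avg_len)
               for w in keywords if w in doc_tf)
```

Because voted tags carry tf > 1, visually confirmed tags contribute more to the category score than unconfirmed ones.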

4.2 Features for aggregated data

A representation for aggregated data is needed since similar documents are progressively clustered and, as explained later, we measure how similar clusters and documents are. The representation for aggregated data is quite similar to the representation of the documents in the collection. Given a set of documents to merge, each feature is merged differently according to its type.

• for each textual feature, such as title, description and tags, the terms in the documents are aggregated to yield the cluster terms,


• for the time/date feature, the average time of all merged documents is computed,

• for the location feature, the geographic mid-point of all merged documents is used,

• for the “event category” feature, the scores are averaged for each category.
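The aggregation rules can be sketched as follows (documents as dictionaries with illustrative keys; the geographic mid-point is omitted for brevity, since it requires spherical geometry):

```python
def merge_documents(docs):
    """Aggregate a set of documents into a cluster representation,
    following the rules above: textual terms are concatenated, the
    time and per-category scores are averaged. Keys are assumptions."""
    n = len(docs)
    n_categories = len(docs[0]["categories"])
    return {
        "tags": [t for d in docs for t in d["tags"]],
        "time": sum(d["time"] for d in docs) / n,
        "categories": [sum(d["categories"][z] for d in docs) / n
                       for z in range(n_categories)],
    }
```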

4.3 Similarity metrics

We define a similarity metric for each document feature (similarly to the approach in [1]), in a way that is appropriate for each feature's domain, to exploit the various context features. The similarity metrics rely on both the textual and category features.

In the clustering framework, the similarity metrics are used to measure the similarity between documents and clusters, as will be explained in the next sections. Following is the list of similarity metrics.

Textual features based similarity metrics

• Title, description and tags similarity metric. User-provided content is represented as a tf.idf weight vector, and the cosine similarity metric ∆w is used to measure the similarity between vectors, as defined in [35].

• Time/date similarity metric. Time/date values are represented as the number of minutes elapsed since the Unix epoch (i.e. since January 1, 1970), and the similarity of two time/date values t1 and t2 is computed as follows:

∆t = 0 if t1 and t2 are more than one year apart, ∆t = 1 − |t1 − t2| / y otherwise,    (2)

where y is the number of minutes in a year.

• Location similarity metric. Location metadata associated with social media documents is represented as latitude-longitude pairs, and the similarity of two locations L1 = (lat1, long1) and L2 = (lat2, long2) is computed as

∆l = 1 − H(L1, L2),    (3)

where H(·) is the Haversine distance [36], an accepted metric for geographical distance.

Categorization based similarity metric The categorization-based similarity metric is based on the category features: given z categories and two sets of category scores categ1z and categ2z, the similarity is defined as:

∆c = Σz ( 1 − |categ1z − categ2z| / N ),    (4)

where N is a normalization factor.
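The time/date and category similarity metrics (Equations 2 and 4) can be sketched as follows; the normalization factor N is left as a parameter, since its value is not fixed above:

```python
def time_similarity(t1, t2, minutes_per_year=365 * 24 * 60):
    """Eq. (2): times in minutes since the Unix epoch; linear decay
    over one year, zero beyond that."""
    gap = abs(t1 - t2)
    return 0.0 if gap > minutes_per_year else 1.0 - gap / minutes_per_year

def category_similarity(categ1, categ2, n_factor=1.0):
    """Eq. (4): summed per-category agreement; the value of the
    normalization factor N is an assumption, left as a parameter."""
    return sum(1.0 - abs(a - b) / n_factor for a, b in zip(categ1, categ2))
```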


Figure 3: Features extraction, training and clustering modules

4.4 Retrieval Framework

Figure 3 illustrates the retrieval framework, consisting of a module in which the document features are built, a training module to learn the classifiers, and a “single-pass incremental clustering” module to cluster the documents.

In the first module, all document features are computed. Textual features, such as the textual description and creation time, are mined from the documents' meta-information, low-level visual features are extracted from the images, and category features are computed. Finally, all features are integrated into the document representation.

Once all features have been computed, the documents are split into a training set and a test set, for the training and clustering modules respectively.

In the training module, first, documents associated with the same event are merged into centroids; second, the document-centroid similarity scores are computed; third, the similarity instances are balanced in order to have the same number of positive and negative pairs. Finally, the classifiers, used to determine whether a document belongs to a specific cluster, are learnt.

The last module performs single-pass incremental clustering over the test set. Each document is assigned to a cluster according to the classifier results, and a list of clusters is updated until all documents have been analyzed.

In the next sections we describe the training and single-pass incremental clustering modules in detail.

4.4.1 Training module

In the training module, the centroids are created in advance by merging all documents from the training set that belong to the same event. Subsequently, the similarity scores between all documents and centroids are computed.

Positive instances, that is, instances whose document and centroid belong to the same event, are balanced with respect to the negative ones. Finally, the classifiers are fed with the similarity instances and learnt for the clustering module. Following is a description of the training steps.

Building the centroids As a first step in the training module, a list of event centroids cj is computed in advance by merging all documents di from the training set associated with the same event.

As explained before, the content of textual fields such as title, description and tags is aggregated, and numerical attributes such as creation time, geo-coordinates and event category relatedness are averaged over all centroid documents.

Compute similarities Subsequently, the similarity scores o1(di, cj), ..., om(di, cj), one for each similarity metric defined in the previous section, are computed for all document-centroid pairs (di, cj).

In [1], it is noted that the similarity between a document d and a cluster c can be computed either by comparing the features of d to those of the cluster c, or by directly comparing d to the documents in the cluster c.

However, their experimental results show that representing each cluster by the centroid of its documents is both more efficient and more robust. Document-centroid similarities offer better performance, and each document is compared against all centroids, which are computed in advance, as explained before, by grouping all documents from the same event in the training set. The centroid for a cluster of documents c is defined as (1/|c|) Σd∈c d.

Sampling strategy The selection of training examples from which to learn the similarity classifiers must be balanced. Ideally, we want to predict the similarity of every document to every cluster created by merging documents from the dataset. However, creating a training example for each document-cluster instance results in a skewed label distribution, since a large majority of instances in the training dataset do not belong to the same event.

As a consequence, a classifier trained with a skewed label distribution yields poor clustering solutions, since it is much more likely to predict that two items do not belong to the same cluster, thus splitting single events across many clusters. For this reason, the best sampling strategy consists in balancing the positive and negative instances for each event in the training set.

Since most of the document-centroid pairs are negative instances, only 10% of the original pairs have been kept to train the classifiers, in order to balance the positive and negative labels.
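The undersampling step can be sketched as follows (the pair representation and function name are illustrative assumptions):

```python
import random

def balance_pairs(pairs, seed=0):
    """Undersample negative document-centroid pairs so that positive and
    negative labels are equally frequent. 'pairs' is a list of
    (similarity_scores, label) tuples with label 1 for same-event pairs;
    this representation is an assumption, not the thesis code."""
    positives = [p for p in pairs if p[1] == 1]
    negatives = [p for p in pairs if p[1] == 0]
    random.Random(seed).shuffle(negatives)  # keep a random subset of negatives
    return positives + negatives[:len(positives)]
```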

Learning the classifiers Once the training set is built, the document-centroid similarity instances are fed to the classifiers. Thus, the classifiers, with the similarity scores as features, are learnt for the subsequent clustering step.

In the experiments in [1], Logistic Regression and Support Vector Machine classifiers perform better than the others (though not significantly). For this reason we selected these classifiers for our experiments as well.

Once the classifiers are learnt, the training step is complete. In the next section the single-pass incremental clustering module is discussed.


4.4.2 Single-pass incremental clustering module

In the social media collection scenario, the clustering algorithm of choice should be scalable, to handle the large volume of data in social media sites, and should not require a priori knowledge of the number of clusters, since social media sites are constantly evolving and growing in size.

Based on these observations, we use a single-pass incremental clustering algorithm (as in [1]) that considers each document in turn and determines the suitable cluster assignment based on the similarity of the document to the existing clusters.

The clustering algorithm consists of the following steps:

• In the beginning, the cluster list is empty and k = 0. At the first iteration, the first document d1 is processed: a new cluster c1 is created with the features of d1 and k is set to 1.

• At each iteration, given the documents to cluster d2, ..., dn, the algorithm considers each document di, in order, and computes its similarity o(di, cj) against each existing cluster cj, for j = 1, ..., k. When a document is classified, it is assigned either to an existing cluster or to a new event cluster. In the former case, the cluster is updated by merging its features with the assigned document's features. In the latter case, a new cluster is created with the features of di and k is incremented.

When all documents have been assigned to a cluster, the algorithm ends and each generated cluster contains the documents associated with its specific event.
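The loop above can be sketched as follows; same_event stands in for the trained classifier applied to the document-cluster similarity scores and is an assumption of this sketch:

```python
def single_pass_cluster(documents, same_event):
    """Single-pass incremental clustering sketch. Each cluster is kept as
    a list of member documents; same_event(doc, cluster) stands in for
    the trained classifier over the similarity metrics."""
    clusters = []
    for doc in documents:
        for cluster in clusters:
            if same_event(doc, cluster):
                cluster.append(doc)   # assign and update the cluster
                break
        else:
            clusters.append([doc])    # no match: start a new event cluster
    return clusters
```

In the full pipeline each cluster would carry merged features rather than raw documents, and the assignment could pick the best-scoring cluster rather than the first accepted one.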

In the next section we discuss the experiments in detail.


5 Experiments

We evaluate our approach on a large-scale, real-world dataset of events and documents from social media collections. In the next subsections, we provide detailed information about the dataset we used for our experiments, the retrieval framework and classifiers, the evaluation metric, and the experiments.

5.1 Dataset

The Upcoming Events [13] dataset provides a list of events associated with Flickr [14] documents through special identifiers. Our dataset consists of a year of such event-related Flickr documents, kindly provided by the authors of [1].

The dataset contains 9515 unique events, with an average of 28.42 photographs per event, for a total of 270425 photographs, taken between January 1, 2006, and December 31, 2008. Only 32.2% of the photos include location information in the form of geo-coordinates.

For our experiments, the dataset has been partitioned 70:30 into a training set and a test set respectively. The former is used for training the similarity functions, the latter for the clustering of documents into events.

Assumptions A number of assumptions have been made in our experiments. These assumptions are derived from [1] and [2], since we combined techniques from these works:

• Documents from the collection can be related to an event even if produced before or after it. Such multimedia documents are still relevant because they indirectly provide information about the event (e.g. photos of the venue in which the event will occur, or photo summaries created after the event). Consequently, our database contains documents whose textual time feature is not necessarily included in the event time frame, yet they are still associated with that event.

• Each social media document corresponds to exactly one event; that is, there cannot be documents belonging to two or more events. Consequently, in our clustering framework each document can be correctly assigned to exactly one cluster.

• All documents from the collection are event-related and there are no other types of documents. We focus on event-related multimedia material only; therefore, all documents from the original Flickr collection not related to a specific event have been filtered out. Consequently, all documents are supposed to be assigned to their respective event by the clustering framework.

5.2 Retrieval framework & Classifiers

We indexed the documents and computed tf-idf vectors for a subset of the textual features (title, description and tags) with Apache Lucene [37]. For all the remaining textual features (time and location), the visual features and the category feature, we added numerical attributes to each document representation.

A Lucene implementation of the Okapi BM25 retrieval framework [67] has been used to compute the category features in the neighbor voting algorithm.


Finally, the classifiers have been implemented with the Weka toolkit [34]. For our experiments, we selected Support Vector Machine (Weka's sequential minimal optimization implementation) and logistic regression as classifiers, given that they outperformed the other classifiers in the experiments in [1].

5.3 Evaluation

We adopted NMI (Normalized Mutual Information [38], [39]) to measure the performance of the trained classifiers after the training step, and of the final clustering framework.

NMI is a quality metric, originally proposed as the objective function for cluster ensembles [39]. It measures how much information is shared between the actual “ground truth” events, each with an associated document set, and the clustering assignment. The ground truth is derived from the Upcoming event identifiers: every document in the collection is labeled with an event identifier associating it with the respective ground-truth event.

NMI balances the following clustering properties in its evaluation:

• the homogeneity of documents from the same event, within each cluster, and

• the fragmentation, that is, the number of clusters that the documents of each event are spread across.

Specifically, for a set of clusters C = {c1, ..., cj} and events E = {e1, ..., ek}, where each cj and ek is a set of documents,

NMI(C,E) = I(C,E) / ( (H(C) + H(E)) / 2 ),   (5)

where I(C,E) is the mutual information, measuring the mutual dependence of C and E, and describing how similar the clusters are to the events in terms of homogeneity and fragmentation. It is defined as:

I(C,E) = Σk Σj ( |ek ∩ cj| / n ) log( n |ek ∩ cj| / (|ek| |cj|) ),   (6)

where n is the total number of documents. H(C) and H(E) are the entropies, measuring the uncertainty associated with C and E respectively, and acting as normalization factors. They are defined as:

H(C) = −Σj ( |cj| / n ) log( |cj| / n )   (7)

and

H(E) = −Σk ( |ek| / n ) log( |ek| / n ).   (8)
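The NMI definition can be checked with a direct Python transcription (a sketch: the document ids and toy events below are made up for illustration):

```python
import math

def entropy(parts, n):
    """H = -sum_p (|p|/n) log(|p|/n), as in the entropy equations."""
    return -sum(len(p) / n * math.log(len(p) / n) for p in parts if p)

def nmi(clusters, events):
    """Normalized mutual information between a clustering and the
    ground-truth events; both are lists of sets of document ids."""
    n = sum(len(c) for c in clusters)
    i = 0.0                             # mutual information term
    for e in events:
        for c in clusters:
            o = len(e & c)
            if o:
                i += o / n * math.log(n * o / (len(e) * len(c)))
    return i / ((entropy(clusters, n) + entropy(events, n)) / 2)

events = [{1, 2}, {3, 4}]
perfect = nmi([{1, 2}, {3, 4}], events)   # identical partitions
mixed = nmi([{1, 3}, {2, 4}], events)     # every cluster split across events
```

A clustering identical to the ground truth scores 1, while one that splits every event evenly across clusters scores 0, matching the homogeneity/fragmentation trade-off described above.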

5.4 Experimental setup

We organized our experiments in three sets (see figures 4, 5 and 6 for an overview). For each experiment, we evaluate the performance in both training and clustering, and we provide the results for both the Support Vector Machine and Logistic Regression classifiers. The performance is expressed in terms of NMI scores.

In the training step, we evaluate the classifiers using the test set and event centroids created in advance (unlike in the clustering step). Subsequently, the documents are assigned to the respective centroids by the trained classifiers. Finally, documents assigned to the same centroid are grouped into clusters for the NMI evaluation.

In the clustering step, documents are clustered as a result of the clustering framework: in this case, the centroids are not created in advance, but progressively updated by the clustering algorithm as new documents are considered. Finally, the generated clusters are evaluated with NMI.
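This incremental behavior can be sketched as a single-pass clustering loop (a sketch under simplifying assumptions: one flat numeric feature vector per document and a hand-picked similarity threshold, whereas the actual framework combines per-feature similarities learned by the classifier):

```python
def single_pass_cluster(docs, similarity, threshold):
    """Single-pass incremental clustering: centroids are updated
    as documents arrive, rather than being created in advance.

    docs: list of feature vectors (plain lists of floats here).
    similarity: function(centroid, doc) -> float.
    threshold: minimum similarity to join an existing cluster.
    """
    centroids, clusters = [], []
    for idx, doc in enumerate(docs):
        best, best_sim = None, threshold
        for j, cen in enumerate(centroids):
            s = similarity(cen, doc)
            if s > best_sim:
                best, best_sim = j, s
        if best is None:                      # start a new cluster
            centroids.append(list(doc))
            clusters.append([idx])
        else:                                 # update centroid incrementally
            m = len(clusters[best])
            centroids[best] = [(c * m + x) / (m + 1)
                               for c, x in zip(centroids[best], doc)]
            clusters[best].append(idx)
    return clusters

# toy 1-D "time" feature: two well-separated events
sim = lambda a, b: -abs(a[0] - b[0])          # negative distance as similarity
clusters = single_pass_cluster([[0.0], [0.1], [5.0], [5.2]], sim, -1.0)
```

The threshold plays the role of the learned decision boundary: a document either joins its most similar existing event or opens a new one.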

To summarize the three sets of experiments:


• Finding the optimal feature set. In the first set of experiments we aim to find the optimal set of document features. We build different feature sets and compare them. Each feature set is a combination of the following document features: textual features, consisting of User Annotations (Title/Description/Tags) together with Time, and Category Scores computed using different k-Nearest Neighbor values.

• Finding the optimal category set. Once we have found the optimal feature set, we run the second set of experiments to compare the performance with different category sets: the original categories, categories with improved keywords, a merged category set derived from merging conceptually similar categories, and a combination of the last two category sets.

• Evaluation on a filtered collection. Finally, we run the third set of experiments with the same settings as the previous set, but filtering out all the documents without the category feature, that is, the documents that have not been matched to any category in the feature extraction step.

Following is a detailed explanation of all experiment sets.

Finding the optimal feature set In the first set of experiments we aim to find the set of document features that leads to the highest performance. Following is the list of feature sets used for the experiments:

• Baseline, which includes textual features only, but not the category feature (as in the original experiment in [1]),

• User Annotations + Category feature, considering only the Title & Description & Tags features from the textual features, together with the category feature,

• Time + Category feature, considering only the Time feature from the textual features, together with the category feature, and

• All features, including both textual features (User Annotations (Title & Description & Tags) and Time) and the category feature.

For each feature set we evaluate different versions of the category feature by calculating different values of k for the neighborhood voting algorithm. The value of k defines the number of visual neighbors to consider in the tag relevance voting. The larger k, the larger the visual neighborhood of the image: the challenge is to find the optimal neighborhood size offering the most accurate tag relevance. Also, the larger k, the more computationally expensive the voting algorithm. We used the following k values in our experiments: 10, 100, 500, 1000. Once we determined tag relevance, we used it to compute the category feature, as explained in the previous sections.
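The neighbor voting step can be sketched as follows (a simplified sketch of the idea in [2]: the visual neighbors are assumed to be precomputed from the visual features, and the prior-subtraction term shown is one common variant, not necessarily the exact formula used here):

```python
from collections import Counter

def tag_relevance(image_tags, neighbor_ids, all_tags, k):
    """Relevance of each tag of an image via neighbor voting:
    votes from the k visual neighbors, minus the votes the tag
    would collect by chance given its collection-wide frequency.

    image_tags: set of tags of the target image.
    neighbor_ids: ids of the visual neighbors, nearest first.
    all_tags: dict mapping every document id to its set of tags.
    """
    votes = Counter()
    for nid in neighbor_ids[:k]:
        votes.update(all_tags[nid])
    n = len(all_tags)
    return {t: votes[t] - k * sum(t in tags for tags in all_tags.values()) / n
            for t in image_tags}

all_tags = {1: {"music", "live"}, 2: {"music"}, 3: {"soccer"}}
rel = tag_relevance({"music"}, [1, 2], all_tags, k=2)
# "music" gets 2 votes; its prior frequency is 2/3, so 2 - 2 * 2/3
```

The prior subtraction is what keeps very frequent tags from dominating: a tag only scores high when its neighbors use it more often than the collection at large does.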

Finding the optimal category set In the second set of experiments we evaluate our approach with different category sets, to determine the set that optimizes the category feature performance.

First, we determine an improved keyword set by consulting Wordnet synsets. Second, we merge categories according to a set of rules, derived from semantic relationships between categories and associated keywords (suggested by Wordnet). Finally, we combine the two approaches. Following is the list of built category sets:

• Upcoming (+ Wordnet [66]) categories: the original categories, whose keywords are derived from the Upcoming [13] category descriptions and refined with Wordnet synsets (sets of synonyms).

• Categories with improved keywords: the original categories with additional keywords, selected among the most popular tags from the training set that are semantically close to the category concepts. The selection is performed manually with the help of Wordnet.

• Merged categories: categories obtained by merging conceptually related categories, according to the set of rules described below.


• Merged categories with improved keywords: categories obtained by combining the previous two settings.

We defined the following set of rules to create the “Merged category set”, derived from the semantic relationships between categories and associated keywords, as suggested by Wordnet synsets. We merge:

• categories that are conceptually similar (e.g. “music” and “festival”), such that it might be correct to assign the same document to more than one category;

• categories that are too generic and might conceptually include other categories (e.g. “performing” might include “music”);

• categories with more than one keyword in common (e.g. the keywords “stage” and “live” might belong to the category “festival”, but also to “performing”);

• categories sharing the same domain (e.g. “family” and “politics” both belong to the domain “society”).

Following are the resulting merged categories, used in the merged category set and the merged category set with improved keywords:

• Art: music, performing, festival, comedy, media

• Academic / Work: education, commercial, technology

• Sport

• Society: politics, family
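The merged sets above amount to a simple mapping from original to merged categories; a sketch follows (the `max` aggregation of scores is an assumption, since the text does not specify how per-category scores are combined after merging):

```python
# Merged category sets from the rules above, as a plain mapping
MERGED = {
    "music": "Art", "performing": "Art", "festival": "Art",
    "comedy": "Art", "media": "Art",
    "education": "Academic / Work", "commercial": "Academic / Work",
    "technology": "Academic / Work",
    "sport": "Sport",
    "politics": "Society", "family": "Society",
}

def merge_scores(category_scores):
    """Collapse per-category scores onto the merged set by taking
    the maximum score among the merged members (one possible choice)."""
    merged = {}
    for cat, score in category_scores.items():
        target = MERGED.get(cat, cat)
        merged[target] = max(merged.get(target, float("-inf")), score)
    return merged

scores = merge_scores({"music": 0.8, "festival": 0.5, "politics": 0.3})
```

Representing the merge as data rather than code makes it easy to compare alternative groupings, which is exactly what the second set of experiments does.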

Evaluation on a filtered collection The third set of experiments has the same settings as the previous one, but it runs over a filtered collection. Our purpose is to evaluate our approach on a subset of the original collection consisting of documents without missing features.

Therefore, we removed all documents without the category feature, that is, all documents whose tags have not been matched with at least one category keyword in the feature extraction step. Consequently, in this third experiment, all documents have one or more category scores, describing their similarity to one or more categories.

Evaluating the clustering framework over a collection containing only such documents helps us better determine how effective categorization is within the framework.


6 Results & Discussion

In this section we discuss the results of our work. Figures 4, 5 and 6 show the results for the experiments to find the optimal feature set, to find the optimal category set, and for the evaluation on a filtered collection, respectively.

It should be mentioned that our implementation of the baseline approach performs better than the implementation in [1]. Such an improvement might be caused by the similarity scores computed by Apache Lucene, which may differ from the scores of the Lemur Toolkit.

Figure 4: “Finding the optimal feature set” experiment results

Figure 5: “Finding the optimal category set” experiment results

Figure 6: “Evaluation on a filtered collection” experiment results

6.1 Results for the experiment “Finding the optimal feature set”

It is difficult to notice substantial differences in performance among the different feature sets. The baseline already performs near-optimally, leaving marginal room for improvement.

Surprisingly, the Time + Category feature set performs better than User Annotations (Title/Description/Tags) + Category feature, suggesting that information about the creation time of the documents is more valuable than the user annotations. However, both lead to worse performance than the baseline. The All features approach does perform slightly better than the other approaches, and we use it for all subsequent experiment sets.


Also, increasing k to enlarge the visual neighborhood does not seem to improve the category feature and, consequently, the overall performance. For this reason, we fixed k = 100 for the subsequent experiment sets, as a good trade-off between clustering performance and computational efficiency.

Besides comparing approaches with different feature sets, we also aimed to understand what prevents us from obtaining perfect results. For this reason, we inspected the generated clusters to determine which ones contain the most misclassified documents, giving insight into which categories have low-quality clusters.

Since our clustering algorithm is unsupervised, in order to identify the generated clusters we first need to determine their mapping to the ground-truth events. In other words, we need to know which event each cluster is associated with before being able to perform the evaluation. For this purpose, we adopted max-weighted bipartite matching [69] to match the cluster set and the event set, using mutual information values as weights. More specifically, the mutual information between ground-truth events and generated clusters is computed for each event-cluster pair and is defined as:

I(cj, ek) = ( |ek ∩ cj| / n ) log( n |ek ∩ cj| / (|ek| |cj|) ),   (9)

where ek is a ground-truth event, cj is a generated cluster and n is the total number of documents in the collection. Once the mapping between ground-truth events and clusters is determined, we average the mutual information values over all events belonging to the same category. We evaluated both the Baseline and the All features (k=100) approaches. Figure 7 shows the averaged mutual information per category for both approaches. The mutual information values are small because n is large (in figure 7 the decimal point of the values is shifted ten digits to the left because of the decimal exponent E−10).
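The cluster-to-event mapping can be sketched as follows (a brute-force sketch: for a real collection one would use an efficient max-weighted bipartite matching algorithm such as the Hungarian method [69]; here enumerating permutations suffices for a toy example, and we assume at least as many events as clusters):

```python
import itertools
import math

def mutual_information(e, c, n):
    """Pairwise mutual information weight, as in equation (9)."""
    o = len(e & c)
    return 0.0 if o == 0 else o / n * math.log(n * o / (len(e) * len(c)))

def match_clusters_to_events(clusters, events):
    """Max-weighted bipartite matching between clusters and events,
    with mutual information as the edge weight."""
    n = sum(len(c) for c in clusters)
    best, best_w = None, float("-inf")
    for perm in itertools.permutations(range(len(events))):
        w = sum(mutual_information(events[perm[j]], c, n)
                for j, c in enumerate(clusters))
        if w > best_w:
            best, best_w = perm, w
    return {j: best[j] for j in range(len(clusters))}  # cluster -> event

mapping = match_clusters_to_events([{1, 2}, {3, 4}], [{3, 4}, {1, 2}])
```

Choosing the matching that maximizes the total mutual information guarantees each cluster is evaluated against the event it actually represents, not against an arbitrary label assignment.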

The events from the categories “Music”, “Sport” and “Festival” have the highest averaged mutual dependence, that is, their clusters classify the documents better, while clusters belonging to “Social”, “Conference”, “Community” and “Technologies” perform worse. It is important to underline that the sizes of the categories are not homogeneous, and the mutual dependence values have been normalized by the number of events per category. Finally, the All features approach, consisting of the baseline textual features together with the category feature, improves the clustering performance for most categories.

Figure 7: Averaged mutual information per category for Baseline and All Features approaches

6.2 Results for the experiment “Finding the optimal category set”

Before commenting on the results of the second set of experiments, it should be underlined that only approximately 20% of the collection documents have the category feature. This is due to the fact that most documents do not have a tag matching any category keyword. Also, the categories do not have a similar number of documents: for example, “music” has 17704 documents, while “comedy” has only 534.

In order to measure the quality of the category feature computed in the feature extraction module, we computed category confusion matrices over all documents from the training set. The confusion matrices give an overview of the category misclassifications for all documents.
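Such a confusion matrix can be sketched by counting, for each labeled training document, the category with the highest score against the ground-truth category (a sketch; the toy scores below are made up):

```python
from collections import defaultdict

def confusion_counts(labeled_docs):
    """Rows: true category; columns: predicted category, taken as
    the category with the highest score for the document.

    labeled_docs: list of (true_category, {category: score}) pairs.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for true_cat, scores in labeled_docs:
        matrix[true_cat][max(scores, key=scores.get)] += 1
    return {t: dict(row) for t, row in matrix.items()}

cm = confusion_counts([
    ("music", {"music": 0.9, "festival": 0.4}),   # correctly classified
    ("music", {"music": 0.2, "festival": 0.7}),   # confused with "festival"
    ("sport", {"sport": 0.8}),
])
```

Off-diagonal counts (such as “music” predicted as “festival” above) are precisely the confusions the category-merging rules are designed to absorb.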


We computed category confusion matrices for all category sets: the original categories, the categories with improved keywords, the merged category set and the merged category set with improved keywords, shown in figures 8, 9, 10 and 11 respectively. The confusion matrix results are consistent across all sets and show that the misclassification percentages are reduced by the use of improved keywords, and that the merged category sets perform better than the sets based on the original categories.

Also, the confusion matrices empirically confirm the validity of the rules we used to merge the categories: as can be seen by analyzing them, most of the misclassifications are indeed resolved by the rules.

Figure 8: Confusion matrix with classification percentages for original categories

Figure 9: Confusion matrix with classification percentages for categories with improved keywords

Figure 10: Confusion matrix with classification percentages for merged categories

In figure 5 the results of training and clustering with the different category sets are compared. The approach leading to the (slightly) best results is the one based on the merged category set with improved keywords, which takes advantage of both the definition of the new merged categories and the refinement of the category keywords. Furthermore, both the merged category set and the improved-keywords category set perform slightly better than the approach based on the original categories. Finally, all approaches taking advantage of category sets perform better than the baseline.

Figure 11: Confusion matrix with classification percentages for merged categories with improved keywords

Since all results from the “Finding the optimal feature set” experiments are consistent and expected, in this set of experiments we only consider the All features set (the one that performed best), rather than comparing the other feature sets again.

6.3 Results for the experiment “Evaluation on a filtered collection”

In this last set of experiments we filter out of the collection all documents with missing features. In other words, only documents with the category feature are considered, to better understand its impact on the performance of the retrieval framework.

Overall, all results improve as expected (see figure 6) because we are dealing with a subset of the collection in which there are no missing features. Also, the results are consistent with those of the previous experiments, and the approach based on the merged category set with improved keywords performs best in this case as well.

In more detail, in the training set evaluation the baseline is already very near to optimal, while the approach combining merged categories and improved keywords gives near-perfect results. Similarly, the clustering results are consistent with the training set results, and confirm the considerations previously made for the other experiments.


7 Visualization

In this section we provide an example of an interactive visualization to browse multimedia collections. We took advantage of the results provided by our event identification framework in designing the application, showing how events can be used as the primary means to organize media.

As an example of a visualization, we designed and implemented an application for iPad [70] to browse event-related Flickr documents (see figures 12, 13 and 14). In the mobile context, more than in others, users usually want fast access to information, with shorter and simpler interactions than with desktop computers. For this reason, event-based user interfaces provide an enjoyable browsing experience by organizing documents by event and category, allowing users to easily go through multimedia collections.

Our visualization organizes documents by event clusters and displays the contextual meta-information associated with the documents in an appropriate way. Following are the functionalities provided by the visualization:

Figure 12: Flickr events app

• Users can browse through events rather than documents. Events are displayed in a scrollable grid (see Figure 12). Each event has a title label: the titles of all documents belonging to the event are extracted from the textual features, and the most common title among them is used. Also, each event has a category icon referring to an event category such as “music” or “performing”, which is the category with the highest averaged score among all event documents. Finally, a random image from the documents is used for the event cover.

Figure 13: Flickr events app

• Users can filter the events by event category. A popup with a list of category switches allows users to filter out all events included in one or more categories, customizing the content listed in the grid (see Figure 14). When a category is deselected, all clusters belonging to that category are removed from the visualization. In this way users can browse collections with, for example, “music”-related documents only, or combine multiple event categories.

• Each event’s content is organized in a pleasant magazine-layout page (see Figure 13), which is displayed when a user taps on a grid item. The event title and all event-related images are included in an interactive view, and when the user taps on the main image, a slideshow with full-screen photos of the event is shown. The longest description from the event documents is added to the event page. Furthermore, users can easily move to the next event by flipping the page with a swipe gesture: such an interaction gives the feeling of reading a real magazine.

Figure 14: Flickr events app


8 Conclusions

Social media communities have made personal multimedia material from all over the world publicly available on the web. However, users often do not invest much effort in organizing their own multimedia documents. Consequently, the web becomes a tremendous source of information that is hardly accessible to search engines and users.

This problem can be addressed by using events as the primary means to organize media. Since many social media documents have been produced during specific events, organizing multimedia content on an event basis not only facilitates retrieval for search engines, but is also a very natural way for users to browse a collection. By identifying events and their associated user-contributed social media documents, we can both complement and improve retrieval techniques and enable powerful event browsing user interfaces.

In this work, we proposed an event identification approach for social media collections to retrieve multimedia material related to specific events and to organize social media collections on an event basis. To accomplish our goal we rely on the contextual meta-information associated with the documents (in particular title, description, tags, content, document creation time and location), the documents’ visual features, users’ tagging behavior, and multimedia categorization techniques. The intuition behind our work is that the relatedness of the documents to a generic event category is valuable information to be used together with the rich “context” associated with social media content. For this reason, we propose an “event identification” approach for social media collections that integrates “multimedia categorization”.

We organized our experiments in three sets, in which we aim to find the optimal set of document features, to find the optimal category set, and to evaluate our approach on a collection containing only documents with the category feature.

For the first set of experiments, it is difficult to notice substantial differences in performance among the different feature sets. The baseline already performs near-optimally, leaving little margin for improvement. Increasing the visual neighborhood does not seem to improve the category feature. The All features approach performs slightly better than the other approaches, and we use it for all the following experiments.

In the second set of experiments, the category set leading to (slightly) better results is the merged category set with improved keywords, which takes advantage of both the definition of the new merged categories and the refinement of the category keywords from the previous sets. Furthermore, both the merged category set and the improved-keywords category set perform slightly better than the original categories. The confusion matrices for the merged category set give an intuition for this, since the misclassification percentages are lower than for the original categories. Finally, all approaches taking advantage of category sets perform better than the baseline.

In the third set of experiments, only approximately 20% of the collection was considered, since all documents without the category feature were filtered out, to better understand its impact on the performance of the retrieval framework. In the training set evaluation, the baseline is already very near to optimal, and the approach combining merged categories and improved keywords gives near-perfect results. The clustering results are consistent with the training set results, and confirm the considerations made for the previous experiments.

Last, we provided an example of an event-based browsing experience, with the design of a browsing application for tablet computers. We took advantage of the results provided by our event identification framework in designing the visualization. Our purpose was to show how event-based user interfaces provide a more enjoyable browsing experience to the end user, in particular in the mobile context, where users usually want fast access to information, with shorter and simpler interactions than with desktop computers.


References

[1] Hila Becker, Mor Naaman, Luis Gravano. “Learning Similarity Metrics for Event Identification in Social Media”, WSDM ’10: Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, New York, NY, USA, pp 291-300, 2010

[2] Xirong Li, Cees G. M. Snoek, Marcel Worring. “Learning tag relevance by neighbor voting for social image retrieval”, IEEE Transactions on Multimedia, v.11 n.7, pp 1310-1322, 2009

[3] Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, Hong-Jiang Zhang. “Tag Ranking”, International World Wide Web Conference (WWW), 2009

[4] L. Chen. “Event detection from Flickr data through wavelet-based spatial analysis”, Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp 523, 2009

[5] T. Rattenbury, N. Good. “Towards automatic extraction of event and place semantics from flickr tags”, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 103, 2007

[6] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu. “Distance metric learning from uncertain side information with application to automated photo tagging”, Proceedings of the Seventeenth ACM International Conference on Multimedia, pp 135, 2009

[7] Claudiu S. Firan, Mihai Georgescu, Wolfgang Nejdl, Raluca Paiu. “Bringing order to your photos: event-driven classification of flickr images based on social knowledge”, Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp 189, 2010

[8] D. Joshi and J. Luo. “Inferring generic activities and events from image content and bags of geo-tags”, Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, 2008

[9] Hila Becker, Bai Xiao, Mor Naaman, Luis Gravano. “Exploiting Social Links for Event Identification in Social Media”, 2011

[10] S. Overell, B. Sigurbjornsson. “Classifying Tags Using Open Content Resources”, Proceedings of the Second ACM International Conference on Web Search and Data Mining, pp 64, 2009

[11] Teng Li, Tao Mei, Shuicheng Yan, In-So Kweon, Chilwoo Lee. “Contextual decomposition of multi-label images”, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 2270, 2009

[12] Yinghai Zhao, Zheng-Jun Zha, Shanshan Li and Xiuqing Wu. “Which Tags Are Related to Visual Content?”, Advances in Multimedia Modeling, pp 669, 2010

[13] Upcoming.yahoo.com

[14] Flickr.com

[15] Topic detection and tracking evaluation. http://www.itl.nist.gov/iad/mig//tests/tdt/


[16] J. Allan, R. Papka, and V. Lavrenko. “On-line new event detection and tracking”, In SIGIR, 1998

[17] Y. Yang, T. Pierce, and J. Carbonell. “A study of retrospective and on-line event detection”, In SIGIR, 1998

[18] Q. Zhao, T.-Y. Liu, S. S. Bhowmick, and W.-Y. Ma. “Event detection from evolution of click-through data”, In KDD, pp 484-493, 2006

[19] A. Graham, H. Garcia-Molina, A. Paepcke, and T. Winograd. “Time as essence for photo browsing through personal digital libraries”, In Proc. of ACM/IEEE-CS Joint Conf. on Digital Libraries, 2002

[20] M. Naaman, Y. J. Song, A. Paepcke, and H. Garcia-Molina. “Automatic organization for digital photographs with geographic coordinates”, In Proc. JCDL, 2004

[21] A. Stent and A. Loui. “Using event segmentation to improve indexing of consumer photographs”, In Proc. SIGIR, pp 59-65, 2001

[22] A. Pigeau and M. Gelgon. “Organizing a personal image collection with statistical model-based ICL clustering on spatio-temporal camera phone meta-data”, J. of Visual Comm. and Image Rep., pp 15(3):425-445, 2004

[23] M. Dubinko, R. Kumar, J. Magnani, J. Novak, P. Raghavan, and A. Tomkins. “Visualizing tags over time”, In Proc. WWW, pp 193-202, 2006

[24] A. Jaffe, M. Naaman, T. Tassa, and M. Davis. “Generating summaries and visualization for large collections of geo-referenced photographs”, In Proc. Multimedia Information Retrieval, pp 89-98, 2006

[25] M. Naaman, A. Paepcke, and H. Garcia-Molina. “From where to what: Metadata sharing for digital photographs with geographic coordinates”, In Proc. CoopIS, 2003

[26] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. “Parameter free bursty events detection in text streams”, In VLDB, 2005

[27] Z. Li, B. Wang, M. Li, and W.-Y. Ma. “A probabilistic model for retrospective news event detection”, In SIGIR, 2005

[28] M. Das, A. C. Loui. “Detecting significant events in personal image collections”, 2009 IEEE International Conference on Semantic Computing, pp 116, 2009

[29] M. Cooper, J. Foote, A. Girgensohn, and L. Wilcox. “Temporal event clustering for digital photo collections”, In Proceedings of the Eleventh ACM International Conference on Multimedia, pp 364-373, 2003

[30] Wei Jiang, Alexander C. Loui. “Semantic event detection for consumer photo and video collections”, 2008 IEEE International Conference on Multimedia and Expo, pp 313, 2008

[31] A. C. Loui, A. Savakis. “Automated Event Clustering and Quality Screening of Consumer Pictures for Digital Albuming”, IEEE Transactions on Multimedia, vol 5, part 3, pp 390-402, 2003

[32] Flickr on Wikipedia. http://en.wikipedia.org/wiki/Flickr

[33] J. F. T. M. van Dijck. “Flickr and the Culture of Connectivity: Sharing Views, Experiences, Memories”, forthcoming in: Memory Studies 4 (4), 2011

[34] Weka toolkit. http://www.cs.waikato.ac.nz/ml/weka/

[35] G. Kumaran and J. Allan. “Text classification and named entities for new event detection”, In Proceedings of the 27th ACM International Conference on Research and Development in Information Retrieval, 2004


[36] R. W. Sinnott. “Virtues of the Haversine”, Sky and Telescope, pp 68-159, 1984

[37] lucene.apache.org

[38] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge Univ. Press, 2008

[39] A. Strehl, J. Ghosh, and C. Cardie. “Cluster ensembles - a knowledge reuse framework for combining multiple partitions”, Journal of Machine Learning Research, pp 3:583-617, 2002

[40] E. Amigo, J. Gonzalo, J. Artiles, and F. Verdejo. “A comparison of extrinsic clustering evaluation metrics based on formal constraints”, Information Retrieval, 2008

[41] P. Serdyukov, V. Murdock. “Placing Flickr Photos on a Map”, Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 484, 2009

[42] Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, Hong-Jiang Zhang. “Boost search relevance for tag-based social image retrieval”, ICME, 2009

[43] Aixin Sun, Sourav S. Bhowmick. “Quantifying Tag Representativeness of Visual Content of Social Images”, MM’10, Firenze, Italy, 2010

[44] Meng Wang, Kuiyuan Yang, Xian-Sheng Hua, Hong-Jiang Zhang. “Visual Tag Dictionary: Interpreting Tags with Visual Words”, WSMC’09, Beijing, China, 2009

[45] M. Cooper. “Image categorization combining neighborhood methods and boosting”, Proceedings of the First ACM Workshop on Large-scale Multimedia Retrieval and Mining, pp 11, 2009

[46] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan. “Matching words and pictures”, JMLR, pp 1107-1135, 2003

[47] E. Chang, G. Kingshy, G. Sychay, and G. Wu. “Cbsa: content-based soft annotation for multimodal image retrieval using bayes point machines”, TCSVT, pp 13(2):26-38, 2003

[48] J. Li, J. Z. Wang. “Real-time computerized annotation of pictures”, TPAMI, pp 30(6):985-1002, 2008

[49] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. “Content-based image retrieval at the end of the early years”, TPAMI, pp 22(12):1349-1380, 2000

[50] R. Yan, A. Hauptmann, and R. Jin. “Multimedia search with pseudo-relevance feedback”, In CIVR, pp 649-654, 2003

[51] G. Park, Y. Baek, and H.-K. Lee. “Majority based ranking approach in web image retrieval”, In CIVR, pp 499-504, 2003

[52] R. Fergus, P. Perona, A. Zisserman. “A visual category filter for google images”, In ECCV, pp 242-256, 2004

[53] W. Hsu, L. Kennedy, and S.-F. Chang. “Video search reranking via information bottleneck principle”, In ACM Multimedia, pp 35-44, 2006

[54] T.-S. Chua, S.-Y. Neo, K.-Y. Li, G. Wang, R. Shi, M. Zhao, H. Xu. “TRECVID 2004 search and feature extraction task by NUS PRIS”, In TRECVID Workshop, 2004

[55] G. Begelman, P. Keller, F. Smadja. “Automated tag clustering: Improving search and exploration in the tag space”, In WWW Collaborative Web Tagging Workshop, 2006

[56] Y. Jin, L. Khan, L. Wang, M. Awad. “Image annotations by combining multiple evidence & Wordnet”, In ACM Multimedia, pp 706-715, 2005


[57] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang. “Content-based image annotation refinement”, In CVPR, pp 1-8, 2007

[58] X.-J. Wang, L. Zhang, X. Li, and W.-Y. Ma. “Annotating images by mining image search results”, TPAMI, 2008

[59] A. Torralba, R. Fergus, and W. Freeman. “80 million tiny images: a large dataset for non-parametric object and scene recognition”, TPAMI, 2008

[60] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. “NUS-WIDE: A real-world web image database from National University of Singapore”, In Proc. CIVR, 2009

[61] G. Carneiro, A. B. Chan, P. Moreno, and N. Vasconcelos. “Supervised learning of semantic classes for image annotation and retrieval”, IEEE Trans. PAMI, pp 394-410, 2006

[62] W. Li, M. Sun. “Semi-supervised learning for image annotation based on conditional random fields”, In CIVR, pp 463-472, 2006

[63] X. He and R. S. Zemel. “Learning hybrid models for image annotation with partially labeled data”, In NIPS, pp 625-632, 2008

[64] J. Tang, S. Yan, R. Hong, G. J. Qi, T. S. Chua. “Inferring semantic concepts from community-contributed images and noisy tags”, In Proceedings of ACM Multimedia, 2009

[65] www.google.com

[66] wordnet.princeton.edu

[67] Lucene-BM25, http://nlp.uned.es/ jperezi/Lucene-BM25

[68] K. S. Jones, S. Walker, and S. E. Robertson. “A probabilistic model of information retrieval: development and comparative experiments, part 2”, Journal of Information Processing and Management, pp 36(6):809-840, 2000

[69] Maximum-Weighted Bipartite Matching, http://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum_matchings_in_bipartite_graphs

[70] Apple iPad, http://www.apple.com/ipad
