The VLDB Journal manuscript No. (will be inserted by the editor)

PicShark: Mitigating Metadata Scarcity Through Large-Scale P2P Collaboration

Philippe Cudré-Mauroux · Adriana Budura · Manfred Hauswirth · Karl Aberer

Received: date / Accepted: date

Abstract With the commoditization of digital devices, personal information and media sharing is becoming a key application on the pervasive Web. In such a context, data annotation rather than data production is the main bottleneck. Metadata scarcity represents a major obstacle preventing efficient information processing in large and heterogeneous communities. However, social communities also open the door to new possibilities for addressing local metadata scarcity by taking advantage of global collections of resources. We propose to tackle the lack of metadata in large-scale distributed systems through a collaborative process leveraging both content and metadata. We develop a community-based and self-organizing system called PicShark in which information entropy – in terms of missing metadata – is gradually alleviated through decentralized instance and schema matching. Our approach focuses on semi-structured metadata and confines computationally expensive operations to the edge of the network, while keeping distributed operations as simple as possible to ensure scalability. PicShark builds on structured Peer-to-Peer networks for distributed look-up operations, but extends the application of self-organization principles to the propagation of metadata and the creation of schema mappings. We demonstrate the practical applicability of our method in an image sharing scenario and provide experimental evidence illustrating the validity of our approach.

The work presented in this article was supported by the Swiss NSF National Competence Center in Research on Mobile Information and Communication Systems (NCCR MICS, grant number 5005-67322), by the EPFL Center for Global Computing as part of the European project NEPOMUK No FP6-027705, and by the Líon project supported by Science Foundation Ireland under Grant No. SFI/02/CE1/I131.

P. Cudré-Mauroux · A. Budura · K. Aberer
School of Computer and Communication Sciences
EPFL, 1010 Lausanne – Switzerland
Tel.: +41-21-693-6787
Fax: +41-21-693-8115
E-mail: {firstname.lastname}@epfl.ch

M. Hauswirth
Digital Enterprise Research Institute
National University of Ireland
Galway – Ireland
E-mail: [email protected]

Keywords Metadata Scarcity · Metadata Heterogeneity · Metadata Entropy · Peer-to-Peer Collaboration · Peer Data Management

1 Introduction

Until recently, the creation of digital artifacts – such as electronic documents, images, or videos – was constrained by the limited availability of devices capable of capturing and handling information in binary form. Today, the situation has radically changed with the commoditization of digital devices. Typewriters have now totally disappeared from the office space, whereas email has become one of the main communication channels. Mobile phones can handle information written as bidimensional bar-codes, while personal computers casually store and process gigabytes of personal images. In this new context, we argue that the lack of metadata, rather than the lack of data, has become the main bottleneck.

The problem became apparent a few years ago when end-users suddenly had to resort to third-party tools to find relevant pieces of information on their own computers. At that time, several projects proposed to index information based on metadata to enhance the search process. Microsoft's Stuff I've Seen [14], for instance, relies on time-stamp metadata like Last Time Modified or Last Time Opened to display search results, while Google Desktop (http://desktop.google.com/) indexes documents based on metadata extracted from the files' payload. Local search based on indexed metadata is considered a common feature nowadays and has been integrated in most operating systems.

The lack of metadata resurfaces today as a new problem in distributed settings. More and more platforms allow end-users to share their digital content in large communities: Flickr, YouTube, and MySpace are well-known examples of that trend. In distributed environments, however, automatically-generated metadata such as Last Time Opened, Filename, or Size often cannot be exploited in a meaningful way by arbitrary users searching for a specific file. In a distributed setting, users typically have never encountered the file they are searching for and are thus unaware of its technical details. Higher-level, more meaningful metadata like Description, Event, or Location are much more relevant in distributed environments, but still often require human attention, one of the scarcest resources in our digital society. As a result, the majority of the digital content on current collaborative platforms simply cannot be retrieved by third parties because of the lack of adequate metadata.

In the following, we tackle the problem of metadata scarcity in large-scale collaborative environments. We focus on semi-structured metadata formats and propose a radically new approach to foster global search capabilities from incomplete, local, and heterogeneous metadata. Our approach is based on a new metric for metadata scarcity and on Peer-to-Peer (P2P) interactions. The main contributions of our work are:

– the formalization of the problem of sharing semi-structured metadata in distributed settings, explicitly taking into account metadata incompleteness and metadata heterogeneity
– the definition of a new metric – called metadata entropy – to capture the degree of incompleteness or uncertainty related to semi-structured metadata
– the description of a bottom-up and recursive process based on instance and schema matching to infer metadata in collaborative P2P contexts
– the presentation of a system architecture supporting metadata inference in distributed environments
– the experimental evaluation of our metadata inference process on a large set of several hundred annotated images.

We start with a general description of the problems related to the sharing of semi-structured metadata in Section 2 and formalize our problem in Section 3. Our metadata sharing approach is presented in detail in Section 4. We describe the architecture of our prototype and the results of our experimental evaluation in Section 5. Finally, we give a survey of related work in Section 6 before presenting our conclusions.

2 Sharing Semi-Structured Metadata

2.1 On Semi-Structured Metadata

While the use of unstructured metadata drew considerable attention on the Web in recent years – e.g., through keyword annotation of images or HTML pages – the focus has recently shifted back to more structured metadata formats. Unstructured metadata such as tags are ambiguous by nature and lack precise semantics, making it very difficult to support structured searches à la SQL. Structured representations such as relational tables are much easier to process automatically, as they constrain the representation of data through complex data structures and schemas.

In the following, we focus on recent formats that let end-users freely define and extend their own schemas according to their needs. We qualify those formats as semi-structured formats, since they tend to blur the separation between data and schemas and to impose looser constraints on the data than the relational model. Such formats are today sprouting from various contexts and encompass a large variety of data models. The Extensible Markup Language (XML) [6], for example, relies on hierarchies of elements to organize data or metadata. Ontological metadata tie metadata to formal descriptions where classes of resources (and properties) are defined and interrelated. This class of metadata standards is currently drawing a lot of attention with the advent of the Semantic Web and its associated languages (e.g., RDF/S [24], Adobe's XMP (http://www.adobe.com/products/xmp/), or OWL [25]). Semi-structured formats are gaining momentum: they are flexible enough to allow the easy definition and extension of schemas, while being sufficiently structured to support automated processing and complex searches (e.g., through languages such as XQuery [5] or SPARQL [27]).

2.2 On the Difficulty of Sharing Semi-Structured Metadata

Our goal is to enable global search capabilities for shared resources based on semi-structured metadata in large-scale, heterogeneous, and distributed settings. Although semi-structured metadata formats are getting increasingly popular, support for meaningfully sharing semi-structured metadata outside of their original context or community of interest is often lacking. Semi-structured metadata are intrinsically difficult to share, since their values only make sense in a given context – as opposed to keyword metadata or textual tags, which supposedly convey predefined, global semantics. Hence, large-scale collaborative applications typically disregard semi-structured metadata or treat them as simple unstructured keywords, ignoring their intrinsic structure.

A straightforward solution to our problem would be to use a common language, like RDF, for all metadata. Though necessary, we argue that this syntactical alignment step only represents the tip of the iceberg in our case. Even with a global, common language, fundamental problems remain: systems would still be unable to retrieve all relevant resources given a query, as metadata can still be incomplete and unrelated to one another because of the various schemas introduced by the users.

In the end, two fundamental hurdles prevent semi-structured metadata from being shared meaningfully:

Metadata Incompleteness: Though more and more tools rely on semi-automatic annotation schemes to add metadata to resources, fully-automated solutions remain impractical. Most of the time, human attention is still required for producing high-quality, meaningful metadata. Realistically, a (potentially large) fraction of the shared resources will not be annotated by the user, leaving some (most) of the related semi-structured metadata incomplete. This incompleteness severely hampers any system relying on user-generated metadata.

Metadata Heterogeneity: Some of the vocabulary terms introduced by end-users to annotate content locally may not make sense on a larger scale. New vocabulary terms – new attributes or properties used locally by some community – should be related to equivalent vocabulary terms coming from different communities to guarantee interoperability. This is a semantic heterogeneity issue requiring a decentralized integration paradigm, as we have to deal with large and distributed communities of users, which develop without any central authority that could enforce vocabulary terms globally. A similar issue arises when a user makes an explicit reference to a local resource in the collaborative setting: the reference can be totally irrelevant to most of the other users, who are not aware of the resource in question.

These two problems are the main reasons why semi-structured metadata are scarce in distributed settings: either metadata are incomplete, or they cannot be properly interpreted and are thus simply discarded. An RDF document exhibiting concrete examples of those two problems is shown in Fig. 1.

    <rdf:RDF xmlns="http://japancastles.org/jpcastle/1.0/"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <Work rdf:about="http://japancastles.org/Himeji23.jpg">
        <dc:description> Himeji-Jo </dc:description>
        <castle_city> Himeji </castle_city>           <!-- local vocabulary term -->
        <dc:type> image </dc:type>
        <dc:creator> #Photographer23 </dc:creator>    <!-- local resource -->
        <dc:date> 2002-12-4 </dc:date>
        <dc:publisher> </dc:publisher>                <!-- incomplete metadata -->
        ...

Fig. 1 The two fundamental problems behind semi-structured metadata scarcity in distributed settings: metadata incompleteness caused by missing values, and metadata heterogeneity attributable to local vocabulary terms and local resources.

2.3 Opportunities for Reducing Metadata Scarcity Collaboratively

In the rest of this contribution, we gradually alleviate metadata scarcity by tackling the two aforementioned problems in a large-scale resource-sharing context. Hence, we focus on methods to solve both metadata incompleteness and metadata heterogeneity for semi-structured metadata attached to the shared resources. Even if sharing semi-structured metadata is intrinsically difficult, we argue that simultaneously sharing both the resources and their associated metadata in large-scale communities opens the door to new opportunities for supporting global search on the shared resources.

Assuming that we can relate shared resources semantically inside a given community of users, e.g., by a low-level analysis of their content or by a semantic analysis of their metadata, metadata can be propagated within the community to reduce metadata scarcity. By taking into account sets of resources shared in a given community, we can thus augment individual metadata by combining local metadata attached to a resource with other metadata originating from similar resources.

We call this process metadata imputation in Fig. 2. Data imputation is a field aiming at replacing missing values in a data set by some plausible values (see Farhangfar et al. [16] for a recent survey of the field).

By relating resources and metadata coming from different communities of users, we can further enhance the process by creating schema mappings between semantically related communities of users. Schema mappings associate vocabulary terms of one community with related terms coming from another community. They allow the reformulation of a query posed against a given schema into a semantically similar query written in terms of another schema. We refer to this process as query propagation in Fig. 2, where straight arrows represent mappings between the schemas of two given communities. Schema mappings can reduce semantic heterogeneity by enabling the propagation of a local query across the whole network of communities by iteratively following series of mapping links.

In this article, we additionally take advantage of schema mappings to propagate existing metadata across semantically heterogeneous communities, and thus to reduce metadata scarcity even further. Metadata imputation is this time contingent on the availability of schema mappings relating the schemas of heterogeneous communities. We refer to this process as metadata propagation in Fig. 2.

In turn, metadata that have been propagated through schema mappings can be exploited in order to infer new mappings or verify existing mappings, and to increase the accuracy of metadata propagation. This shows a clear interrelation between metadata incompleteness and metadata heterogeneity, as minimizing metadata incompleteness through metadata propagation takes advantage of the schema mappings used to minimize semantic heterogeneity, and vice-versa.

In the following, we propose a distributed process to reduce the overall scarcity and heterogeneity of the metadata in an autocatalytic process, where both metadata and mappings get reinforced recursively by putting local metadata into a global community-based context. Taking a global view on the system, we observe that the global semantics are not fixed a priori, but evolve as users interact with the system and guide the metadata sharing process by exporting new resources, by adding new metadata, or by providing positive or negative feedback based on the results retrieved for their queries. The way the semantics of the system dynamically evolve is typical of an emergent semantics system [8], where no global semantics is defined a priori and where the discovery of the proper interpretation of symbols results from a self-organizing process guided by local interactions.

[Fig. 2: three communities of interest (Oriental Culture, Hiking in Kansai, Japanese Castles), each performing internal metadata imputation, connected by metadata propagation and query propagation links; example place values include Kyoto and Nara.]

Fig. 2 Sharing semi-structured metadata: metadata incompleteness is mitigated by imputing additional metadata for sets of similar resources inside a community of interest, while metadata heterogeneity is gradually alleviated through pairwise schema mappings.

3 Formal Model

The problem we want to tackle can be formally introduced as follows: a large set of autonomous information parties we name peers p ∈ P store resources (e.g., calendar entries, pictures, or video files) r ∈ R_p locally. Peers take advantage of schemas S ∈ 𝒮 to describe their resources with semi-structured metadata.

Peers using the same schema to describe resources form a community of interest. Communities of interest develop independently of our system through social interactions or schema enforcement. They typically result from best practices or standardization efforts, or from communities of users using specialized tools imposing a custom schema (see for example the W3C Incubator Group Report [15] for recent examples of specialized schemas related to the image annotation domain).

Schemas consist of a finite set of vocabulary terms t ∈ T. We focus on vocabulary terms representing attributes (a.k.a. properties), which invariably exist in one way or another in all the semi-structured metadata formats we have encountered, but classes of resources can be taken into account by our approach as well. In the setting we consider, we assume that the number of shared resources is typically significantly higher than the number of peers, which is itself significantly higher than the number of schemas: |R| ≫ |P| ≫ |𝒮|.

Peers store the semi-structured metadata attached to their resources as attribute-value pairs. By taking into account the resource each attribute-value pair describes, semi-structured metadata can be seen as ternary relations. We call such relations metadata statements (r, t, v). A statement (r, t, v) associates a value v to a local resource r through a vocabulary term (attribute) t. Values v appearing in the statements can either represent literals l ∈ L or local resources. For instance, the dc:description statement related to the resource whose metadata are shown in Fig. 1 can be written as follows:

(http://japancastles.org/Himeji23.jpg, dc:description, Himeji-Jo).

We say that a statement evaluates to true if it exists in one of the databases of the peers, and to false otherwise. We suppose at this stage that all annotations are complete, in the sense that for each annotated resource, all vocabulary terms defined in the annotation schemas are associated with a certain value. This also constrains the number of statements attached to each resource based on the schemas.

Peers can pose queries locally in order to retrieve specific resources based on vocabulary terms, literals, and other local resources. Queries take the form of conjunctions of triple patterns [31]:

r? : (r_1, t_1, v_1), ..., (r_n, t_n, v_n)

where r_k, t_k, v_k represent variables or (respectively) a local resource, vocabulary term, or value, and r? is a distinguished resource variable appearing in at least one of the triple patterns (r_k, t_k, v_k). Note that joins can be expressed by multiple occurrences of the same variable in this notation. We say that a resource r_0 ∈ R is an answer to the query q, and write q |= r_0, if, when substituted for the distinguished variable, there exists a valuation (i.e., a value assignment) for all other variables in the conjunction of triple patterns such that the resulting statements all evaluate to true.
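For concreteness, the following minimal sketch (ours, not part of PicShark; the helper names are hypothetical) evaluates such conjunctive triple-pattern queries by brute-force enumeration of valuations over a small statement set:

    from itertools import product

    def is_var(x):
        # variables are marked with a leading '?'
        return isinstance(x, str) and x.startswith('?')

    def answers(query, statements, distinguished='?r'):
        """Bindings of the distinguished variable for which some valuation
        of all variables turns every triple pattern into a true statement."""
        variables = sorted({x for pattern in query for x in pattern if is_var(x)})
        domain = {x for s in statements for x in s}  # brute-force candidate set
        results = set()
        for values in product(domain, repeat=len(variables)):
            binding = dict(zip(variables, values))
            grounded = [tuple(binding.get(x, x) for x in p) for p in query]
            if all(g in statements for g in grounded):
                results.add(binding[distinguished])
        return results

    statements = {
        ('http://japancastles.org/Himeji23.jpg', 'dc:description', 'Himeji-Jo'),
        ('http://japancastles.org/Himeji23.jpg', 'castle_city', 'Himeji'),
    }
    # r? : (r?, castle_city, Himeji)
    print(answers([('?r', 'castle_city', 'Himeji')], statements))

A real system would of course use indexes rather than enumerating the full domain; the sketch only illustrates the q |= r_0 semantics.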

Now, let us assume that some of the statements are incomplete and that the peers have a means to export resources through some common infrastructure (e.g., the World Wide Web or a Distributed Hash Table). Our goal is to recontextualize the local statements in the common infrastructure to support global search capabilities. We create additional statements in such a way that any peer posing a query q against its local schema can retrieve a maximal number of relevant resources r | q |= r from the global set of shared resources R, while minimizing false positives and user involvement, under the following restrictions:

Metadata Incompleteness: Some values v_k appearing in the statements are replaced by null-values ⊥_k, introducing incomplete statements (r_k, t_k, ⊥_k). Null-values are formally considered as being equivalent to the values they stand for, but cannot be distinguished by the peers, which basically consider the values as unknown. For instance, supposing that the dc:description from Fig. 1 is left incomplete, the statement can be written as:

(http://Himeji23.jpg, dc:description, ⊥_dc:desc).

Metadata Heterogeneity: Each local resource and vocabulary term is assigned a set of fixed interpretations r^I from a global domain of interpretations Δ^I, with r^I ⊆ Δ^I. These interpretations are used to interrelate semantically similar resources. Arbitrary peers are not aware of such assignments, i.e., they are not aware of the global semantics of the system. We define two resources r_i and r_j as equivalent, expressed by r_i ≡ r_j, if and only if r_i^I = r_j^I. We define that a resource r_i subsumes another resource r_j, expressed by r_j ⊑ r_i, if and only if r_j^I ⊆ r_i^I. Trueness of statements is relative to the equivalence and subsumption relations, in the sense that if a statement (r, t, v) evaluates to true, then all statements (r′, t′, v′) | r′ ⊑ r, t′ ⊑ t, v′ ⊑ v also evaluate to true. For instance, the following statement:

(http://Himeji23.jpg, located_in_City, Himeji)

evaluates to true if

(http://Himeji23.jpg, castle_city, Himeji)

evaluates to true and castle_city ≡ located_in_City.

Taking advantage of those definitions, we can introduce the notions of metadata completeness and soundness. We say that a set of N statements {(r, t_1, v_1), ..., (r, t_N, v_N)} pertaining to a resource r is complete when v_i ≠ ⊥ ∀ v_i. A set of statements is sound if all the statements evaluate to true. We generally assume that our process starts with sets of statements that are sound but incomplete. Our recontextualization process then tries to complete the statements while minimizing the number of unsound statements generated.

3.1 Metadata Entropy

We introduce in the following a new metric for capturing metadata scarcity. We call this metric metadata entropy, as it is similar in nature to the notion of entropy defined in information theory. In our context, metadata entropy relates either to the incompleteness of the metadata statements or to the uncertainty of inferred statements. Keeping track of metadata entropy is important to detect the resources requiring further recontextualization (too many incomplete statements), and to propagate metadata in a meaningful way by associating uncertainty to the metadata that are inferred automatically.

We extend our model to write statements as quadruples (r, t, v, p), where v is a list of possible values v_k ∈ v for the statement and p_k ∈ p stands for the probability of the statement (r, t, v_k) evaluating to true. Using this notation, we can for example write the following:

(http://Himeji23.jpg, castle_city, (Himeji, Kyoto), (0.9, 0.1))

to express that two different cities were related to a given picture.

The entropy H(r, t, v, p) of a statement measures the degree of uncertainty related to its set of possible values v.

We wish metadata entropy H(·) to satisfy the following desirable properties:

1. Metadata entropy H(·) should be a continuous function of the various probabilities p attached to the values of a statement. It should not, however, depend on the order in which the probabilities or the values are given.
2. Metadata entropy H(·) should evaluate to zero for sound statements.
3. Metadata entropy H(·) should be maximal and evaluate to one when all values attached to a statement are equiprobable, i.e., when no value is more probable than any other.
4. H(X, Y) should be equal to H(X) + H(Y) for two independent random variables X and Y.

The only function satisfying these four properties [21] is:

H(r, t, v, p) = −Σ_{k=1}^{K} p_k log_K(p_k)

where K is the number of possible values in v. The entropy of all complete statements initially exported by the peers is zero, as they consider a single possible value with a probability of 1 of evaluating to true (i.e., we consider that all statements stored locally at the peers are correct; if some of these statements are created semi-automatically, we can alternatively start with a smaller value 0 < p < 1). Incomplete statements (r, t, ⊥) start with an entropy of one initially, representing an unknown (and potentially infinite) set of identically distributed values. Their entropy will decrease over the course of our recontextualization process as plausible values get discovered through metadata imputation and propagation.

We define the entropy of a resource as the arithmetic mean of the entropies of its N associated metadata statements:

H(r) = (1/N) Σ_{n=1}^{N} H(r, t_n, v_n, p_n).

A resource with half of its metadata statements left incomplete will thus start with an entropy of 0.5.
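As a sanity check of these definitions, the following small Python sketch (ours, not from the PicShark codebase) computes the entropy of a statement and of a resource:

    import math

    def statement_entropy(probs):
        """Normalized entropy -sum(p_k * log_K(p_k)); K equiprobable values
        yield 1, a single certain value yields 0."""
        K = len(probs)
        if K <= 1:
            return 0.0
        return -sum(p * math.log(p, K) for p in probs if p > 0)

    def resource_entropy(prob_lists):
        """Arithmetic mean over a resource's statements; an incomplete
        statement (value ⊥) is encoded as None and counts as entropy 1."""
        H = [1.0 if ps is None else statement_entropy(ps) for ps in prob_lists]
        return sum(H) / len(H)

    print(statement_entropy([0.9, 0.1]))    # ~0.47 for (Himeji, Kyoto)
    print(resource_entropy([[1.0], None]))  # 0.5: half the statements missing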

4 Recontextualizing Semi-Structured Metadata

We present below our general approach for generating additional data to recontextualize shared metadata in large-scale distributed settings. A specific implementation of this approach is described in Section 5 in the context of an image sharing scenario.

The heterogeneity, autonomy, and large number of the peers we consider preclude the use of centralized techniques. Traditional integration techniques (e.g., the mediator architecture [34]) are impractical in our context, as no global schema can be enforced in heterogeneous and decentralized communities. Classical metadata management techniques (e.g., tableaux reasoning) are not applicable either, due to the lack of shared information (resources, vocabulary terms) and the sheer size of the problem, which excludes methods scaling exponentially – or even linearly – with the size of the data [13].

Instead, we propose local, probabilistic heuristics aiming at recontextualizing metadata extracted from a specific source to a decentralized collaborative context. Following a long tradition of providing scalable application-level services on top of an existing physical network, we push the "intelligence" of the approach towards the edge of the network, i.e., we perform all complex operations locally at the peers, while only considering simple in-network operations on a shared hash-table. In the following, we suppose that all resources and peers are identified by globally unique identifiers. Our heuristics are based on decentralized data indexing, data imputation, and data integration techniques. We detail below how metadata are exported, how statements are imputed inside a community of interest, how pairwise schema mappings are created to alleviate metadata heterogeneity, how metadata are propagated across communities, and finally how user queries are handled. A high-level illustration of the recontextualization process is given in Fig. 3.

Fig. 3 The metadata recontextualization process: users start by sharing resources, which are indexed in a hash-table along with metadata and content features. Features are used to match resources and impute new metadata, while existing metadata are used to create schema mappings, which in turn are used to propagate metadata from one community to the others.

4.1 Exporting Local Metadata through Data Indexing

Our recontextualization process starts with the export and proper indexing of the shared resources and their associated metadata. We index the location p_0 of each resource r_0 a peer wants to share in a shared hash-table. We then index all metadata statements (r_0, t, v, 1) pertaining to the resource that has just been indexed. The indexing process continues recursively by indexing all resources r′ appearing as values v in the already indexed metadata, and their respective statements (r′, t′, v′, 1). Fig. 4 shows an example of the indexing process on a simple RDF graph with recursion depths limited to zero and one. All statements are exported using a common representation (e.g., XML serialization of RDF triples) and are indexed in such a way that they can be efficiently retrieved based on their resource r, term t, or value v. Higher recursion values lead to sharing more information, which can then be used in the rest of the recontextualization process to relate semantically similar resources. On the other hand, higher recursion values also impose higher network traffic and a higher load on the shared hash-table.
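A minimal sketch of this recursive indexing step might look as follows (ours; the helper names and the dictionary-backed hash-table are hypothetical stand-ins for the real DHT calls):

    from collections import defaultdict

    class DictDHT:
        """Local stand-in for the shared hash-table (P-Grid in PicShark)."""
        def __init__(self):
            self.table = defaultdict(list)
        def insert(self, key, value):
            self.table[key].append(value)

    def is_resource(v):
        # simplistic test; PicShark uses globally unique identifiers instead
        return isinstance(v, str) and v.startswith('http')

    def index_resource(r0, location, depth, dht, local_statements):
        """Index a resource's location and statements, recursing on
        resource-valued statements down to the given depth. Each statement
        is keyed three times so it can be retrieved by r, t, or v."""
        dht.insert(('location', r0), location)
        for (r, t, v) in local_statements.get(r0, []):
            quad = (r, t, v, 1.0)  # exported statements start with p = 1
            for key in (('r', r), ('t', t), ('v', v)):
                dht.insert(key, quad)
            if depth > 0 and is_resource(v):
                index_resource(v, location, depth - 1, dht, local_statements)

With depth 0, only the statements about r_0 itself are indexed; with depth 1, the statements about the resources it references are indexed as well, matching the two boxes of Fig. 4.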

4.2 Dealing with Metadata Incompleteness through Intra-Community Metadata Imputation

Our objective now turns to determining plausible values for incomplete statements based on sets of related statements. This can be regarded as a data imputation problem [16] where values are missing within a given community of interest. In our context, metadata are incomplete due to the users' unwillingness to fully annotate the resources. Hence, metadata can be incomplete irrespective of the resource they are attached to or of their actual value (values are missing completely at random [16]). We base our imputation process on K-Nearest Neighbor imputation, which has been shown to be very effective in contexts such as ours [4] and has two distinctive advantages in the present situation: i) it does not require building a predictive model for each predicate for which a value is missing and ii) it can be based on a simple index lookup in a shared hash-table [18].

[Fig. 4: RDF/S graph rooted at Himeji23.jpg (Type: Image, Creator: Photographer23), with Photographer23 described in turn (Type: Photographer, BirthLocation: Mulhouse, BirthDate: 07/01/1972) and a separate resource Eric.bmp (Type: Picture).]

Fig. 4 Indexing resources and statements from an RDF/S graph; the inner and outer boxes correspond to a recursion depth limited to respectively zero and one, starting from the Himeji23.jpg resource.

K-Nearest Neighbor imputation is based on a notion of distance between the objects it considers. Capitalizing on several decades of research on content analysis, we generate feature values for each resource we have to index. Feature values represent the content of the resource and/or its metadata. They can, for example, be based on a low-level analysis of the resource (e.g., image analysis) or on an analysis of machine-generated metadata (see the next section for concrete examples). Features should be extracted in such a way that similar resources get closely related feature values, which might or might not be verified in practice and which naturally impacts the effectiveness of our approach (see Section 5.2). Feature extractors might be different for different types of resources (e.g., pictures, text files, etc.). We index each resource r based on its feature value FV(r) in the shared hash-table to be able to retrieve resources with similar feature values. We define the distance used by the imputation process based on those values: D(r, r′) = |FV(r′) − FV(r)| (see Section 5.1 for concrete examples of features and distances).
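Neighbor retrieval then reduces to a nearest-neighbor search over feature values. The following naive sketch (ours) illustrates the selection criterion with a linear scan; PicShark instead relies on locality-sensitive hashing over the shared hash-table, and additionally filters on schema and entropy as described below:

    def distance(fv_a, fv_b):
        """Euclidean distance D(r, r') between two feature vectors."""
        return sum((a - b) ** 2 for a, b in zip(fv_a, fv_b)) ** 0.5

    def nearest_neighbors(r, candidates, FV, K, tau):
        """Up to K candidates whose feature values lie within the similarity
        radius tau of r's feature value, closest first."""
        scored = sorted((distance(FV[r], FV[c]), c) for c in candidates if c != r)
        return [c for d, c in scored if d < tau][:K]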

Algorithm 1 gives the list of operations undertaken during an imputation round. The imputation can be broken down into three main operations: neighbor selection, value inspection, and value aggregation.

Algorithm 1 Imputation Process Operations

    for all resources r to index do
        incompleteStatements = r.getIncompleteStatements()
        if incompleteStatements.count() > 0 then
            /* Neighbor Selection */
            neighbors = getNearestNeighbors(r, K, τ)
            for all incompleteStatement in incompleteStatements do
                for all neighbor in neighbors do
                    /* Value Inspection */
                    plausibleValues = getValuesFromNeighbor(neighbor, incompleteStatement)
                    likelihoods = assessLikelihoods(neighbor, incompleteStatement, r)
                    listOfValues.add(plausibleValues)
                    listOfLikelihoods.add(likelihoods)
                end for
                /* Value Aggregation */
                aggregate(incompleteStatement, listOfValues, listOfLikelihoods)
            end for
        end if
    end for

Neighbor Selection: For each resource r associated with at least one incomplete statement (r, t, ⊥), we search for K similar resources (neighbors) r′ in the hash-table, such that r and r′ are annotated using the same schema, D(r, r′) is minimal – and below a similarity threshold τ – and H(r′) is as low as possible. That is, we search for resources coming from the same community of interest that are most similar according to our feature value metric and whose statements are as sound and complete as possible. The exact value of K depends on the context and is typically determined by cross-validation [36] using a sample validation set. When fewer than K similar resources exist within the radius of the similarity threshold τ, abstract resources with incomplete statements (r_⊥, t, ⊥) and D(r, r_⊥) = τ are considered. We introduce abstract resources to explicitly preserve null values when few plausible values are available from the neighborhood of a given resource.

Value Inspection: For each incomplete statement (r, t, ⊥), we consider the I values v′_ki appearing in the corresponding statement (r′_k, t, v′_k, p′_k) attached to the k-th neighbor as possible values. We set the likelihood l′_ki of such a value being sound for the incomplete statement under consideration as being proportional to the probability p′_ki of the value being itself sound, and inversely proportional to the distance between the two resources:

l′_ki = p′_ki · D(r, r′_k)^−1.

Thus, sound statements or statements coming from very similar instances are systematically preferred.

Value Aggregation: Finally, we aggregate the values v′_ki and likelihoods l′_ki coming from the K chosen neighbors into D distinct values v_d and probabilities p_d, with

p_d = ( Σ_{∀ k,i | v′_ki = v_d} l′_ki ) / ( Σ_{k=1}^{K} Σ_{i=1}^{I} l′_ki ).

Fig. 5 shows an example of the imputation process for an incompletely annotated image file r, combining the statements of its two nearest neighbors r′_1 and r′_2.

[Fig. 5: resource r (Himeji23.jpg) with a missing Title; neighbor r′_1 (HimejiJo.gif) at D(r, r′_1) = 2 carries Title = "Himeji-Jo" (p = 1); neighbor r′_2 (OsakaJo.jpg) at D(r, r′_2) = 4 carries Title = "Osaka-Jo" (p = 0.7) and Title = "Himeji-Jo" (p = 0.3); an abstract instance r′_⊥ sits at the similarity threshold τ = 8. The aggregated statements for r are Title = "Himeji-Jo" (p = 0.66), Title = "Osaka-Jo" (p = 0.20), and Title = ⊥ (p = 0.14).]

Fig. 5 An example of data imputation: statements coming from two nearby candidate resources r′_1 and r′_2 and an abstract instance r_⊥ are combined to complete the statements attached to resource r whose Title is missing.
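The p_d values in Fig. 5 can be reproduced with a few lines of Python; the sketch below (ours, with a hypothetical aggregate helper) sums the distance-discounted likelihoods and normalizes them:

    def aggregate(neighbor_values):
        """Aggregate (value, probability, distance) contributions from the
        selected neighbors into the {value: p_d} distribution of Fig. 5.
        Each contribution's likelihood is p / D(r, r'_k)."""
        likelihoods = {}
        for value, prob, dist in neighbor_values:
            likelihoods[value] = likelihoods.get(value, 0.0) + prob / dist
        total = sum(likelihoods.values())
        return {value: l / total for value, l in likelihoods.items()}

    # Fig. 5: r'_1 at distance 2, r'_2 at distance 4, abstract instance at τ = 8
    print(aggregate([('Himeji-Jo', 1.0, 2),                        # from r'_1
                     ('Osaka-Jo', 0.7, 4), ('Himeji-Jo', 0.3, 4),  # from r'_2
                     (None, 1.0, 8)]))                             # ⊥ instance
    # -> {'Himeji-Jo': 0.66, 'Osaka-Jo': 0.20, None: 0.14} (rounded)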

Note that a similar imputation process can take place again later on, once the statements have already been recontextualized but new resources have been indexed – for example periodically, every period of time T, for resources with a high entropy. In a dynamic context and for high values of K, one should additionally avoid storing too many unlikely values by eliminating all values with low probabilities (p_d < p_min).

4.3 Dealing with Metadata Heterogeneity through Pairwise Schema Mappings

So far, metadata imputation was limited to a given community of interest. To take advantage of all statements shared by peers outside of a given community, we extend the application of self-organization principles to the creation of mappings relating similar communities of interest. We create mappings between pairs of schemas to link semantically related vocabulary terms. Mappings are used to identify equivalent terms in the data propagation process (see Section 4.4 below), and to reformulate queries iteratively as in a peer data management system [2,32].

To create schema mappings relating communities, we first have to identify equivalent terms t′ ≡ t′′ from different schemas S′ and S′′. Various methods can be used to discover those equivalences automatically. Schema matching is an active area of research [28], but it is not the focus of this work. In our large-scale, decentralized context, retrieving all data from the shared hash-table – for instance for the purpose of creating a corpus [23] – would be prohibitively expensive. Methods focusing on selective search queries and piggybacking on other operations should instead be used in order to minimize the overhead on the shared infrastructure.

Fundamentally, we base the semantics of a mapping relating two terms t′ and t′′ on the likelihood of a statement (r, t′, v) being sound, knowing that a similar statement (r, t′′, v) on t′′ is indeed sound:

P(t′ ≡ t′′) = P((r, t′, v) evaluates to true | (r, t′′, v) evaluates to true) ∀ r, v.

More pragmatically, we use a simple instance-based schema matching approach piggybacking on the imputation process to approximate this value. We create a new mapping (t′, ≡, t′′, p≡) whenever two statements (r′, t′, v′, p′) and (r′′, t′′, v′′, p′′) with equivalent values v′ ≡ v′′ on two similar resources r′ and r′′ with D(r′, r′′) < τ are discovered during the neighbor selection phase. The process of deciding whether or not two values are equivalent can be based on lexicographical and linguistic analyses (e.g., edit distance between strings, or equivalence based on synonyms appearing in a thesaurus). Note that the distance defined by the feature values is here instrumental in creating the mappings, since failing to recognize two resources as being similar invalidates the whole process.

The probability p≡ that this relation holds is derived by retrieving analogous statements (r_j, t′, v_j, p_j), (r_k, t′′, v_k, p_k) from the shared hash-table:

p≡ = ( Σ_{(r_j, t′, v_j, p_j), (r_k, t′′, v_k, p_k) | D(r_j, r_k) < τ ∧ v_j ≡ v_k} p_j ) / ( Σ_{(r_j, t′, v_j, p_j), (r_k, t′′, v_k, p_k) | D(r_j, r_k) < τ} p_j ).

The probability is thus computed by counting the number of equivalent values appearing in the instances considered as being similar for the two terms. For instance, indexing two similar images r_1 and r_2 with D(r_1, r_2) < τ with two statements sharing the same value

(r_1, located_in_City, Himeji)

and

(r_2, castle_city, Himeji)

would trigger the creation of a mapping between located_in_City and castle_city by retrieving the set of similar resources annotated with either of the two terms and by comparing the values of their statements. The creation of mappings for the other terms appearing in the schemas can be conducted simultaneously.

Incomplete statements (i.e., statements with v = ⊥) are not taken into account in these computations. Unsound statements (i.e., statements with p < 1) are taken into account by weighting their importance with their likelihood (i.e., less likely values v_i with probabilities p_i close to zero will have less impact than more probable values). Note that subsumption mappings can be exported and discovered in an identical manner by taking subsumption relations ⊑ into account in place of the equivalence relations above.

In highly dynamic environments where new statements are inserted on a continuous basis, recomputing p≡ each time a new pair of potentially related terms is discovered would be expensive. In such situations, the decision to recompute p≡ can be based on the number of times the triggering condition is observed by a particular peer, or on an analysis of the graph of schemas and mappings determining whether or not more mappings would be useful for reformulation purposes [9].
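The following sketch (ours; the equivalence test is a naive stand-in for the stemmed string comparison used in PicShark, cf. Section 5.1) approximates p≡ from pairs of statements attached to similar resources:

    def equivalent(v_j, v_k):
        # naive stand-in for PicShark's stemmed string comparison
        return v_j.strip().lower() == v_k.strip().lower()

    def mapping_probability(pairs, tau):
        """Approximate p≡ for a candidate mapping t' ≡ t''. `pairs` holds
        ((v_j, p_j), (v_k, p_k), distance) for statements on t' and t''
        attached to pairs of indexed resources."""
        num = den = 0.0
        for (v_j, p_j), (v_k, p_k), dist in pairs:
            if dist >= tau or v_j is None or v_k is None:
                continue              # incomplete statements are ignored
            den += p_j                # unsound statements weighted by p_j
            if equivalent(v_j, v_k):
                num += p_j
        return num / den if den > 0 else 0.0

    # two similar images sharing 'Himeji', one pair disagreeing:
    print(mapping_probability([(('Himeji', 1.0), ('Himeji', 1.0), 2),
                               (('Osaka', 1.0), ('Kyoto', 1.0), 3)], tau=8))
    # -> 0.5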

4.4 Dealing with Metadata Incompleteness through Inter-Community Metadata Propagation

We can now extend the imputation process by propagating metadata across different communities of interest following schema mappings. The process is similar to the imputation process confined to a single community of interest (see Section 4.2), but this time takes into consideration all neighbors annotated with equivalent schemas related through mappings.

For a resource r and given an incomplete statement (r, t, ⊥), K neighbors r′_k are considered such that D(r, r′_k) is minimal and below τ, H(r′_k) is as low as possible, and based on the existence of statements (r′_k, t′_k, v′_k, p′_k) such that t and t′_k are either identical or related through a mapping (t, ≡, t′_k, p_k≡). The likelihoods l′_ki attached to the values are then weighted with p_k≡ to account for the fact that the mapping is in itself uncertain:

l′_ki = p′_ki · D(r, r′_k)^−1 · p_k≡.

As an example, suppose that the resource r′_1 in Fig. 5 is annotated with a different schema considering an attribute legend similar to title, with (title, ≡, legend, 0.5). The uncertain mapping would then reduce the significance of r′_1's contribution, lowering the likelihood of Himeji-Jo to 0.52.

Note that a value can be propagated iteratively across series of communities of interest in that manner, and that propagated values can in turn bootstrap the creation of new schema mappings.
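Continuing the Fig. 5 example, weighting r′_1's contribution by the mapping probability indeed yields 0.52 for Himeji-Jo; a quick check reusing the aggregate sketch of Section 4.2 (ours, illustrative only):

    # r'_1 now contributes through the uncertain mapping (title, ≡, legend, 0.5):
    # its probability is weighted by p≡ before the distance-based aggregation.
    print(aggregate([('Himeji-Jo', 1.0 * 0.5, 2),   # r'_1, weighted by p≡ = 0.5
                     ('Osaka-Jo', 0.7, 4), ('Himeji-Jo', 0.3, 4),  # r'_2
                     (None, 1.0, 8)]))              # abstract ⊥ instance
    # -> {'Himeji-Jo': 0.52, 'Osaka-Jo': 0.28, None: 0.20}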


4.5 Possible Answers and User Feedback

User queries r? : (r_1, t_1, v_1), ..., (r_n, t_n, v_n) can be resolved by iterative lookups on the shared hash-table: for each triple pattern in the query, candidate triples are retrieved by looking up one of the constant terms of the triple pattern in the shared hash-table [10]. Answers to the query are then obtained by combining the candidate triples. In addition to the certain answers obtained in that way, possible answers [11] are generated by reformulating queries following (probabilistic) schema mappings to query distant communities of interest:

r′? : (r_1, t′_1, v_1), ..., (r_n, t′_n, v_n) | (∃ S_j | t′_1, ..., t′_n ∈ S_j) ∧ t′_1 ≡ t_1, ..., t′_n ≡ t_n

and by taking into account the probabilities attached to the values of the entropic statements generated by the metadata imputation and propagation processes. The resulting answers can be ranked with respect to their likelihood, to present the most likely results first to the user. Optionally, resources with a high entropy (i.e., resources with many incomplete or unsound statements) can at this stage be proposed to the user in order to take advantage of his feedback to classify those highly uncertain resources and to bootstrap a new data imputation round.
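As an illustration, a simplified reformulation step might look as follows (ours; it rewrites one term at a time and ignores the requirement that all rewritten terms belong to a single schema S_j):

    def reformulate(query, mappings):
        """Rewrite each triple pattern of a query along a set of probabilistic
        mappings {(t, t'): p}. Returns (rewritten query, confidence), where
        the confidence is the product of the mapping probabilities used."""
        rewritten, confidence = [], 1.0
        for (r, t, v) in query:
            if t in {src for src, _ in mappings}:
                # pick the most likely target term for t
                t2, p = max(((dst, p) for (src, dst), p in mappings.items()
                             if src == t), key=lambda x: x[1])
                rewritten.append((r, t2, v))
                confidence *= p
            else:
                rewritten.append((r, t, v))
        return rewritten, confidence

    mappings = {('located_in_City', 'castle_city'): 0.8}
    print(reformulate([('?r', 'located_in_City', 'Himeji')], mappings))
    # -> ([('?r', 'castle_city', 'Himeji')], 0.8)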

5 PicShark: Sharing Annotated Pictures in the Large

To demonstrate the viability of our metadata recontextualization strategies, we developed a system called PicShark. PicShark is an application built on top of a semantic overlay network allowing global searches on shared pictures annotated with incomplete, local, and semi-structured metadata. We consider PicShark as a concrete instantiation of the imputation principles described above. PicShark concentrates on the image annotation domain, but it would be straightforward to extend our application to other resources (e.g., textual documents, videos, or music files) by taking advantage of different feature domains and defining new distances.

Our approach extends the principle of data independence [19] by separating a logical layer – a semantic overlay managing structured metadata and schemas – from a networking layer consisting of a structured Peer-to-Peer overlay used for efficient routing of messages (see Fig. 6). The networking layer is used to implement various functions at the logical layer, including query resolution, information imputation, and information integration.

[Fig. 6: a three-layer stack – PicShark on top of GridVine (semantic overlay network) on top of P-Grid (P2P network) – where PicShark issues Export(pics) and Search(query) calls, GridVine exposes Insert(RDF) and SearchFor(Query) operations, and P-Grid provides Insert(key, value) and Retrieve(key) primitives.]

Fig. 6 The PicShark architecture: PicShark uses P-Grid to store shared resources and GridVine to share semi-structured metadata.

We use P-Grid [1] as a substrate for storing all shared information in a Distributed Hash-Table (DHT). The indexing of statements is handled by GridVine [10]. GridVine provides efficient mechanisms for storing triples (or quadruples in our case) in a decentralized way, and facilitates the efficient resolution of conjunctive queries in O(log(n)) messages, where n is the number of peers in the system.

On top of this architecture, PicShark takes care of fostering global semantic interoperability by recontextualizing local statements exported to the P2P network. Users can export sets of local pictures through the Export(pics) method and search for pictures by specifying conjunctive queries against their local schemas.

Fig. 7 gives an overview of the various components used in PicShark. Data indexing takes advantage of metadata extractors (see below) to syntactically align the statements before sharing them through GridVine. The creation of mappings is handled by aligners, while feature extractors are used to extract the feature vectors used to relate similar images.

5.1 Information Extraction in PicShark

PicShark uses metadata extractors to extract local metadata from the images and to syntactically align the metadata to a common representation. PicShark uses RDF/S as a common syntax, and converts all supported metadata formats to this representation. The application currently supports two very different semi-structured metadata formats: PSA, a hierarchical, proprietary format used by Photoshop Album (http://www.adobe.com/products/photoshopalbum/starter.html), and XMP, a standard based on RDF/S. Both standards are extensible and let users define new vocabulary terms to annotate their pictures. The PSA Extractor extracts semi-structured statements and vocabulary terms from the local relational database used by Photoshop Album to store all metadata, while its XMP counterpart extracts statements and vocabulary terms from the payload of the pictures. The extractors generate all missing GUIDs (for local vocabulary terms and pictures), and index statements using GridVine and images using P-Grid directly. All statement values are stored as strings. The system directly compares stemmed versions of the strings to determine whether or not two values are equivalent.

Fig. 7 The PicShark components: PicShark uses metadata extractors (for PSA and XMP) to syntactically export all metadata in a common format to the shared hash-table, aligners to align schemas on a semantic level, and feature extractors (color and textures, time and space) to extract low-level features representing the images.

Features can be extracted from the images either by a low-level analysis based on sixty texture and color moments, or by the extraction of spatial and temporal metadata from the images. With time-stamps directly embedded into most digital images and with the proliferation of GPS devices and localization services (such as ZoneTag, http://research.yahoo.com/zonetag/), we believe that the combination of temporal and spatial information represents a new and computationally inexpensive way of identifying related images (see also Section 5.2 below for a discussion on that topic). Based on the extracted features, the similarity search retrieving closely related resources during the imputation process can be implemented efficiently in our setting by using locality-sensitive hashing [18] at the networking layer. Distances for both feature spaces are defined as standard Euclidean distances (over sixty and two dimensions, respectively).


5.2 Performance Evaluation

Evaluating the performance of a system like PicShark is intrinsically difficult for several reasons. PicShark is (to the best of our knowledge) the first application taking advantage of semi-structured, heterogeneous, and incomplete metadata statements. As semi-structured statements from popular image organization applications such as Photoshop Album or Extensis Portfolio (http://www.extensis.com/) are kept in local databases and are not searchable on the Web, constituting a realistic and sufficiently large data set is currently difficult. Using tag collections from popular tagging portals such as Flickr is impossible as well, as the tags are very noisy and totally unstructured. Moreover, recontextualization is a highly recursive, distributed, and parallel process, such that getting a clear idea of the ins and outs of the operation is difficult for large data sets or numerous peers.

In the following, we describe a set of controlled experiments pertaining to a set of three hundred photos (both the photos and their semi-structured metadata are available for download at http://lsirwww.epfl.ch/PicShark/), which were manually annotated using Adobe Photoshop Album Starter Edition 3. The set of photos is divided into three subsets, each taken by a different individual during a common trip to Japan. All sets were annotated using Photoshop Album. The first two subsets use the same schema, while the third subset was annotated using a different – but semantically related – schema. Schemas were designed by the photographers and consist of about ten attributes each. Temporal information was taken directly from the time-stamp embedded by the cameras, while spatial information (i.e., GPS coordinates) was added manually to each picture.

5.2.1 Intra-Community Imputation

This first experiment focuses on analyzing results pertaining to metadata imputation in a given community of interest. We start by exporting the first two subsets of 100 images each, along with their metadata. We randomly drop each statement – except spatial and temporal information, which are always preserved – with a probability pMissing to simulate metadata scarcity. We then recontextualize the 100 images from the first subset one by one using images and statements from the second subset to simulate intra-community recontextualization (remember that both subsets use the same schema). We alternatively base the imputation process on either low-level features or spatial and temporal metadata. As our image set is pretty homogeneous, we set τ = ∞ and K = 2, i.e., we always take the two nearest neighbors to recontextualize a given picture.

Fig. 8 Normalized total entropy pertaining to the first subset of images, for metadata missing with various probabilities pMissing; at each step, we recontextualize one of the 100 images from the first subset with its two nearest neighbors from the second subset, using either low-level features (LL) or spatial and temporal information (S + T).

Fig. 8 shows the evolution of the total metadata entropy pertaining to the first subset of photos during the recontextualization process, for various values of pMissing ranging from 20% to 80%. The figure gives a normalized value of the total entropy (the absolute entropies start at 82, 166, and 329 for pMissing = 20%, 40%, and 80%, respectively). In our context, total entropy offers a finer granularity for analyzing the process than, say, a standard recall metric, which would be inadequate to capture the distribution of values attached to the propagated statements. The curves depicted in Fig. 8 represent average results obtained over 10 consecutive runs. Note that the results are quite stable: the standard deviation never exceeds 10% of the absolute value. The entropy – and thus the uncertainty on the set of images – decreases as more and more pictures get recontextualized. The imputation process based on spatial and temporal values (S + T) is slightly better than the process based on low-level features (LL) at finding images with very related statements. For high pMissing values, many values are missing and fewer metadata statements get propagated.

The impact of the nearest-neighbor search is best illustrated by Fig. 9 and Fig. 10, which respectively depict the aggregated probability of the sound and unsound metadata generated by the system. We call aggregated probability the sum of the probabilities attached to propagated metadata (propagated metadata with ⊥ values are not taken into account). Note that propagating metadata usually decreases the total entropy of the system, except when highly uncertain metadata are generated (e.g., when a ⊥ value is replaced by two generated values with 50% probability each).
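
Given this definition, splitting the aggregated probability into its sound and unsound parts reduces to comparing each propagated candidate against the withheld ground truth; a minimal sketch follows (the data layout is a hypothetical one):

```python
def aggregated_probabilities(propagated, ground_truth):
    """Sum the probabilities attached to propagated statements, split
    into sound mass (the candidate equals the withheld original value)
    and unsound mass. `propagated` maps (photo_id, attr) to a
    {candidate value: probability} dictionary; ⊥ candidates (None)
    are not taken into account, as in the text."""
    sound = unsound = 0.0
    for key, candidates in propagated.items():
        for value, probability in candidates.items():
            if value is None:
                continue  # skip ⊥ values
            if value == ground_truth.get(key):
                sound += probability
            else:
                unsound += probability
    return sound, unsound
```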

Fig. 9 Aggregated probability of the sound statements generated by the system, for metadata missing with various probabilities pMissing; at each step, we recontextualize one of the 100 images from the first subset with its two nearest neighbors from the second subset, using either low-level features (LL) or spatial and temporal information (S + T).

S + T is systematically better than LL at finding good neighbors, as it always generates more sound and fewer unsound statements than LL. This is not surprising, as finding similar photos based on color and texture moments alone is known to be a challenging problem in general. S + T generates high-quality metadata that are sound more than 80% of the time. On the other hand, S + T is often wrong when propagating metadata about the people appearing in the pictures: here, spatial and temporal information is typically not sufficient, and a combination of both S + T and LL would probably be more effective.

In absolute terms, the largest number of statements is propagated for pMissing = 40%. For pMissing = 20%, few metadata are propagated (few values are missing), while for pMissing = 80%, few values are available for propagation in the first place.

5.2.2 Inter-Community Propagation

In the second part of the experiment, we continue the recontextualization process started above and further recontextualize the 100 photos from the first subset with 100 photos from the third subset, which is annotated with a different schema. In this way, we simulate the creation of mappings and the propagation of metadata across different communities of interest.

First, the third set of images and their related metadata are exported. We do not drop metadata in the third set, thus simulating a large set of metadata encoded according to different schemas.


Fig. 10 Aggregated probability of the unsound statements generated by the system, for metadata missing with various probabilities pMissing; at each step, we recontextualize one of the 100 images from the first subset with its two nearest neighbors from the second subset, using either low-level features (LL) or spatial and temporal information (S + T).

Schema mappings are created between the two schemas using the instance-based method described in the preceding section. Once the mappings are created, we further recontextualize each of the 100 images of the first set with their two closest neighbors from the third set. We only use S + T this time, as LL systematically yielded inferior results in the intra-community recontextualization step described above. Fig. 11 gives the evolution of the normalized entropy for the first set of images. More uncertain metadata are propagated than in the previous case because of the mappings, which were generated fully automatically based on the values of the statements and are therefore uncertain. Images with a high entropy (e.g., for pMissing = 80%), however, benefit considerably from this second recontextualization round, since their statements were still largely incomplete after the first round and since all statements from the third set are complete.
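
As a rough illustration of instance-based matching (the actual method is described in the preceding section; the sketch below merely uses the overlap of observed values as a stand-in similarity, with hypothetical names and threshold), attribute correspondences between two schemas could be derived as follows:

```python
def value_overlap(values_a, values_b):
    """Jaccard overlap between the sets of values observed under two
    attributes, used here as a stand-in instance-based similarity."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def instance_based_mappings(schema_a, schema_b, threshold=0.3):
    """Map every attribute of schema A to its best-overlapping
    attribute of schema B, keeping only mappings whose score exceeds
    `threshold`. Each schema is a dict: attribute -> observed values.
    The score is kept as the (uncertain) confidence of the mapping,
    echoing the uncertainty of automatically created mappings."""
    mappings = {}
    for attr_a, values_a in schema_a.items():
        scored = [(value_overlap(values_a, values_b), attr_b)
                  for attr_b, values_b in schema_b.items()]
        if not scored:
            continue
        best_score, best_attr = max(scored)
        if best_score >= threshold:
            mappings[attr_a] = (best_attr, best_score)
    return mappings
```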

Fig. 12 shows the aggregated probability of the sound statements generated during this second round of recontextualization. Unsound statements follow a similar trend, but never represent more than 20% of the generated statements. At the end of our recontextualization process, and depending on the value of pMissing, 60% to 75% of the initial entropy of the system induced by incomplete metadata has been alleviated. Most statements now contain entropic metadata that are sound in the majority of cases (fewer than 20% of the propagated statements are unsound on average with S + T). Also, schema mappings relating the two communities of interest have been created automatically. Thus, we are now able to query the system and retrieve relevant images from both communities, which was impossible before the recontextualization process because of the heterogeneity and the lack of metadata.

Fig. 11 Normalized total entropy pertaining to the first subset of images, for metadata missing with various probabilities pMissing; at each step, we recontextualize an image from the first subset with its two nearest neighbors from the third subset, based on spatial and temporal information (S + T).

Fig. 12 Aggregated probability of the sound statements generated by the system, for metadata missing with various probabilities pMissing; at each step, we recontextualize an image from the first subset with its two nearest neighbors from the third subset, based on spatial and temporal information (S + T).


6 Related Work

To the best of our knowledge, our approach is the first to tackle metadata scarcity in distributed settings. We place our work at the intersection of decentralized data integration, tagging systems, personal information management, and data imputation techniques.

The way we propagate queries in PicShark is typical of a new type of large-scale data integration infrastructure called Peer Data Management Systems (PDMSs). PDMSs integrate data by replacing the centralized mediator with an unstructured network of pairwise schema mappings and by reformulating queries iteratively from one schema to the next.


The complexity of query reformulation in PDMSs is investigated in the context of the Piazza [32] project, while the Hyperion [2] system proposes to use mappings at both the instance and the schema levels to reformulate queries. PicShark is the first PDMS that takes probabilistic tuples into account and supports the propagation of instances from one database to another through schema mappings.

Tagging systems, which allow communities of users to add unstructured text labels to resources shared online, are very popular today. While they represent an effective method for gathering large amounts of metadata, the tags they take advantage of are unstructured and therefore difficult to process automatically. Thus, several recent research efforts concentrate on extracting additional structured information from unstructured tags. Rattenbury et al. [29] apply burst-analysis techniques to extract event and place semantics from image tags based on their usage patterns. Schmitz [30] uses a subsumption-based model to extract ontologies from Flickr tags, while the ELSABer system [22] organizes tags in hierarchies in order to enable semantic browsing. Wu et al. [35] study the emergence of semantics from the co-occurrences of tags, resources, and users. In a similar context, Aurnhammer et al. [3] proposed an emergent semantics approach to retrieve images based on collaborative tagging. All these initiatives recognize the importance of semi-structured metadata for improving search in large-scale settings, but none of them tackles the scarcity or heterogeneity of metadata.

Similar to PicShark, personal information management systems try to organize data originating from user desktops. Haystack [20] is an information management system which uses extractors and lets non-technical users teach the application how to extract Semantic Web content, generating RDF triples from various sources. Gnowsis [17] is a semantic desktop where semantic information is collected from different applications on the desktop and integrated with information coming from external tagging portals; semantic annotations are either extracted or derived from user interaction. The Semex system [12] is a platform for personal information management that reconciles heterogeneous references to the same real-world object using context information and similarity values computed from related entities. The system leverages on previous mappings provided by the users and on object and association databases to foster interoperability. Reconciliation of data was also recently revisited in the context of the ORCHESTRA [33] project.

In ORCHESTRA, participants publish their data on an ad hoc basis and simultaneously reconcile updates with those published by others. P-Tag [7] is a system which automatically generates personalized tags for annotating web pages, based on the data residing on the user's personal desktop. Closer to our work, Naaman et al. [26] add identity tags to photos in local photo collections, based on the time and location of the photographs and the co-occurrence of people. Contrary to PicShark, none of these approaches addresses data scarcity or takes advantage of communities of users to collaboratively augment the data being shared.

Data imputation, finally, denotes techniques aiming at replacing missing values in a data set by some plausible values (see Farhangfar et al. [16] for a recent survey of the field).

7 Conclusions

With the rapid emergence of socially-driven applications on the Web, self-organization principles have once again proven their practicability and scalability: through Technorati Ranking7, Flickr Interestingness8 or del.icio.us recommendations9, an ever-increasing portion of the Web self-organizes around end-user input. While most efforts concentrate on the management of unstructured metadata (i.e., keywords), we proposed in this article to tackle the problem of organizing structured, heterogeneous metadata in large-scale settings. We advocated a decentralized, community-based and imperfect (in terms of soundness and completeness) way of augmenting semi-structured metadata through self-organizing assertions. Our PicShark system aims at creating metadata automatically through intra- and inter-domain propagation of entropic statements and schema alignment based on decentralized instance-based schema matching.

PicShark represents a first proof of concept of the applicability of self-organization principles to the organization of semi-structured, heterogeneous and partially annotated content in large-scale settings. We showed in our experiments how incomplete metadata can be enhanced collaboratively using our approach. To the best of our knowledge, PicShark is currently the only system capable of exploiting incomplete and heterogeneous data sets, such as the one we used, to automatically foster global, structured search capabilities. This first implementation effort opens the door to many technical refinements.

7 http://www.technorati.com/
8 http://www.flickr.com/
9 http://del.icio.us/


As future work, we plan to improve our imputation process to include personalized and fuzzy classification rules to relate semantically similar content. Also, we intend to analyze the system churn – in terms of total entropy, user feedback, and recently indexed information – in order to determine the optimal scheduling of recontextualization and schema-matching rounds. Finally, we want to improve the deployability of our application in order to test our approach in situ on large and heterogeneous communities of real users, and are currently launching an initiative jointly with an art center in that context.

References

1. K. Aberer, P. Cudre-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt. P-Grid: A Self-Organizing Structured P2P System. ACM SIGMOD Record, 32(3), 2003.
2. M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R.J. Miller, and J. Mylopoulos. The Hyperion Project: From Data Integration to Data Coordination. ACM SIGMOD Record, Special Issue on Peer-to-Peer Data Management, 32(3), 2003.
3. M. Aurnhammer, P. Hanappe, and L. Steels. Integrating Collaborative Tagging and Emergent Semantics for Image Retrieval. In Collaborative Web Tagging Workshop, 2006.
4. G. Batista and M. C. Monard. An Analysis of Four Missing Data Treatment Methods for Supervised Learning. Applied Artificial Intelligence, 17(5-6), 2003.
5. S. Boag, D. Chamberlin, M.F. Fernandez, D. Florescu, J. Robie, and J. Simeon (Ed.). XQuery 1.0: An XML Query Language. W3C Candidate Recommendation, June 2006. http://www.w3.org/TR/xquery/.
6. T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, and F. Yergeau (Ed.). Extensible Markup Language (XML) 1.0. W3C Recommendation, February 2004. http://www.w3.org/TR/REC-xml/.
7. P.-A. Chirita, S. Costache, W. Nejdl, and S. Handschuh. P-TAG: Large Scale Automatic Generation of Personalized Annotation Tags for the Web. In International World Wide Web Conference (WWW), 2007.
8. P. Cudre-Mauroux. Emergent Semantics: Rethinking Interoperability for Large-Scale Decentralized Information Systems. PhD thesis, 3690, EPFL – Switzerland, 2006.
9. P. Cudre-Mauroux and K. Aberer. A Necessary Condition For Semantic Interoperability in the Large. In Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems (ODBASE), 2004.
10. P. Cudre-Mauroux, S. Agarwal, and K. Aberer. GridVine: An Infrastructure for Peer Information Management. IEEE Internet Computing, 11(5), 2007.
11. N. N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In International Conference on Very Large Data Bases (VLDB), 2004.
12. X. Dong and A. Y. Halevy. A Platform for Personal Information Management and Integration. In Biennial Conference on Innovative Data Systems Research (CIDR), 2005.
13. F.M. Donini. Complexity of Reasoning. In The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.
14. S. T. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I've Seen: A System for Personal Information Retrieval and Re-Use. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2003.
15. R. Troncy et al. (Ed.). Image Annotation on the Semantic Web. W3C Incubator Group Report, August 2007. http://www.w3.org/2005/Incubator/mmsem/XGR-image-annotation/.
16. A. Farhangfar, L. A. Kurgan, and W. Pedrycz. Experimental Analysis of Methods for Imputation of Missing Values in Databases. Intelligent Computing: Theory and Applications II, 5421(1), 2004.
17. C. Fluit, B. Horak, G. A. Grimnes, A. Dengel, D. Nadeem, L. Sauermann, D. Heim, and M. Kiesel. Semantic Desktop 2.0: The Gnowsis Experience. In International Semantic Web Conference (ISWC), 2006.
18. P. Haghani, S. Michel, P. Cudre-Mauroux, and K. Aberer. LSH At Large – Distributed KNN Search in High Dimensions. In International Workshop on Web and Databases (WebDB), 2008.
19. J. M. Hellerstein. Toward Network Data Independence. ACM SIGMOD Record, 32(3), 2003.
20. D. R. Karger, K. Bakshi, D. Huynh, D. Quan, and V. Sinha. Haystack: A General-Purpose Information Management Tool for End Users Based on Semistructured Data. In Biennial Conference on Innovative Data Systems Research (CIDR), 2005.
21. A.I. Khinchin. Mathematical Foundations of Information Theory. Dover Publications, Inc., New York, 1957.
22. R. Li, S. Bao, B. Fei, Z. Su, and Y. Yu. Towards Effective Browsing of Large Scale Social Annotations. In International World Wide Web Conference (WWW), 2007.
23. J. Madhavan, P. A. Bernstein, A. Doan, and A. Y. Halevy. Corpus-Based Schema Matching. In International Conference on Data Engineering (ICDE), 2005.
24. F. Manola and E. Miller (Ed.). RDF Primer. W3C Recommendation, February 2004. http://www.w3.org/TR/rdf-primer/.
25. D.L. McGuinness and F. van Harmelen (Ed.). OWL Web Ontology Language Overview. W3C Recommendation, February 2004. http://www.w3.org/TR/owl-features/.
26. M. Naaman, R.B. Yeh, H. Garcia-Molina, and A. Paepcke. Leveraging Context to Resolve Identity in Photo Albums. In ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2005.
27. E. Prud'hommeaux and A. Seaborne (Ed.). SPARQL Query Language for RDF. W3C Candidate Recommendation, April 2006. http://www.w3.org/TR/rdf-sparql-query/.
28. E. Rahm and P. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4), 2001.
29. T. Rattenbury, N. Good, and M. Naaman. Towards Automatic Extraction of Event and Place Semantics from Flickr Tags. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2007.
30. P. Schmitz. Inducing Ontology from Flickr Tags. In Collaborative Web Tagging Workshop, Edinburgh, Scotland, 2006.
31. A. Seaborne. RDQL – A Query Language for RDF. W3C Member Submission, 2004. http://www.w3.org/Submission/RDQL/.
32. I. Tatarinov, Z. Ives, J. Madhavan, A. Halevy, D. Suciu, N. Dalvi, X. Dong, Y. Kadiyska, G. Miklau, and P. Mork. The Piazza Peer Data Management Project. ACM SIGMOD Record, 32(3), 2003.
33. N. E. Taylor and Z. G. Ives. Reconciling While Tolerating Disagreement in Collaborative Data Sharing. In SIGMOD Conference, 2006.
34. G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, 25(3), 1992.
35. X. Wu, L. Zhang, and Y. Yu. Exploring Social Annotations for the Semantic Web. In International World Wide Web Conference (WWW), New York, NY, USA, 2006.
36. Y. Yang, T. Ault, and T. Pierce. Combining Multiple Learning Strategies for Effective Cross Validation. In International Conference on Machine Learning (ICML), 2000.