
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 9, NO. 10, OCTOBER 2014 1533

Image Phylogeny Forests Reconstruction

Filipe de O. Costa, Student Member, IEEE, Marina A. Oikawa, Zanoni Dias, Siome Goldenstein, Senior Member, IEEE, and Anderson de Rezende Rocha, Member, IEEE

Abstract— Today, a simple search for an image on the Web can return thousands of related images. Some results are exact copies, some are variants (or near-duplicates) of the same digital image, and others are unrelated. Although we can recognize some of these images as being semantically similar, it is not as straightforward to find which image is the original. It is not easy either to find the chain of transformations used to create each modified version. There are several approaches in the literature to identify near-duplicate images, as well as to reconstruct their relational structure. For the latter, a common representation uses the parent–child relationship, allowing us to visualize the evolution of modifications as a phylogeny tree. However, most of the approaches are restricted to the case of finding the tree of evolution of the near-duplicates, with few works dealing with sets of trees. Since one set of near-duplicates can contain n independent subsets, it is necessary to reconstruct not only one phylogeny tree, but several trees that will compose a phylogeny forest. In this paper, through the analysis of the state-of-the-art image phylogeny algorithms, we introduce a novel approach to deal with phylogeny forests, based on different combinations of these algorithms, aiming at improving their reconstruction accuracy. We analyze the effectiveness of each combination and evaluate our method with more than 40,000 testing cases, using quantitative metrics.

Index Terms— Image forensics, multimedia phylogeny, optimal branching.

I. INTRODUCTION

IN RECENT years, sharing digital content has become a trivial task due to the technological boom of hardware and software, as well as the advent of social networks. Once one image

Manuscript received December 23, 2013; revised May 7, 2014; accepted July 4, 2014. Date of publication July 17, 2014; date of current version September 4, 2014. This work was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brasilia, Brazil, in part by the National Council for Scientific and Technological Development under Grant 307018/2010-5, Grant 304352/2012-8, Grant 306730/2012-0, Grant 477692/2012-5, Grant 477662/2013-7, and Grant 483370/2013-4, in part by the Fundação de Amparo à Pesquisa do Estado de São Paulo under Grant 2010/05647-4 and Grant 2013/08293-7, in part by Microsoft Research, and in part by the European Union through the Reverse Engineering of Audio-Visual Content Data Project, which was supported by the Future and Emerging Technologies Programme within the Seventh Framework Programme for Research of the European Commission, under Grant FET-268478. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Teddy Furon.

The authors are with the RECOD Laboratory, Institute of Computing, University of Campinas, Campinas 13083-970, Brazil (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

This paper has supplementary downloadable material available at the IEEE Xplore website. The file consists of further explanations and examples regarding the methods developed in this paper. The material is 800 KB in size. Furthermore, the datasets are registered on the address http://figshare.com/articles/Image_Phylogeny_Forests_Reconstruction/1012816, and the source code and documentation are available in a public repository on http://repo.recod.ic.unicamp.br/public/projects.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2014.2340017

or video is shared by a user, depending on its contents, it can easily go viral, being republished by many other users in different channels on the Web. Additionally, with the popularization of image editing software and online editor tools, in most of the cases, not only exact duplicates will be available, but also manipulated versions of the original source. This scenario easily leads to copyright infringement, sharing of illegal content and, in some cases, can negatively affect or impersonate the public image of people or corporations.

When small changes are applied during the redistribution, usually without interfering in their semantic meaning, they are called near-duplicate objects. Several approaches for Near-Duplicate Detection and Recognition (NDDR) have been developed targeting different applications, such as photography collection arrangement [1], [2], multimedia file matching [3], copyright infringement detection [4], [5], and forgery detection [6], [7]. However, more challenging than identifying different versions of a given document is to investigate the structure of modifications applied to generate these documents and the relationship among them.

In this context, a new research field, called Multimedia Phylogeny [8], has arisen, in which researchers are not only interested in determining whether objects within a set are near-duplicates of each other, but also in finding their ancestral relationship and the original source (root or patient zero), reconstructing the order and the transformations that originally created the near-duplicate set. The main idea is inspired by the evolutionary process observed in Biology, in which an organism inherits characteristics from its ancestors, and can also go through mutations over time, producing new species [9]. If we look at the complete structure relating this population, moving from the root to the leaves of the phylogeny tree, we are able to identify the ancestral lineage and the descendants of each ancestor, and have an overall idea of their evolution timeline.

A similar structure can be considered for a document that undergoes slight modifications over time. If we are able to reconstruct the phylogeny tree of a set of near-duplicate objects, we can easily visualize their past history and ancestry information. In this paper, we focus on algorithms dealing with the Image Phylogeny problem. In this case, we are interested in identifying the relationships among a set of Near-Duplicate Images (NDI), including transformations such as rotation, translation, cropping, and brightness and contrast adjustment, among others. Those relationships are represented by an oriented tree, whose root is the least modified image (and most probably the image first used to create the others in the set), and whose leaves represent the images with more modifications.

There are several forensic applications for image phylogeny solutions. For instance, consider postings on the Internet of



images containing illegal or abusive content (e.g., a person in a bullying situation, fake and defamatory images of celebrities or politicians, etc.), which were redistributed after being modified by different users. With phylogeny algorithms, it becomes easier to understand the evolutionary process among the set of replicated images and trace their past history, allowing us to find the original image (or the least modified one), which can also give hints about the creator of these images.

Researchers in this area have mainly tried to solve this problem by considering only one tree for each set of near-duplicates [10]–[13]. However, in some cases, we may have multiple sets of NDIs, in which the source images come from different cameras or from the same camera but from a slightly different point in space and time. Therefore, we are not interested in reconstructing a single tree, but a phylogeny forest, a set of phylogeny trees, for representing their relationship.

Approaches for finding phylogeny forests are useful for tracing the original documents within a large set of semantically similar images. In other words, if there is more than one original image (the same scene, but taken from another view, for example), phylogeny forest approaches should be used for finding the correct tree of each image.

In this paper, we first make a brief review of the state-of-the-art approaches for reconstructing phylogeny trees, which follow two main strategies: using a modified version of Kruskal's minimum spanning tree algorithm [14], or optimal branching strategies [13], such as the algorithms proposed independently by Chu & Liu [15], Edmonds [16], and Bock [17] to construct minimum spanning trees on directed graphs with known roots. Usually, each approach for solving the image phylogeny problem is analyzed separately. We extend existing approaches to automatically find forests and, based on the analysis of their properties and complementarity, we also introduce a solution to combine them toward a more stable, robust, and effective algorithm. Finally, following the same methodology used in previous approaches, we evaluate the proposed methods using two different datasets comprising more than 40,000 testing cases.

II. IMAGE PHYLOGENY

The analysis of the content of pairs of multimedia objects to investigate their relationship is a challenging task, since there are several operators that can be applied on one object to change its contents. However, these operators also leave behind useful artifacts that can be explored for reconstructing the evolutionary history of these multimedia objects.

A first definition and formalization of Image Phylogeny was introduced by Dias et al. [11], aiming at identifying the causal relationships among a set of NDIs and the transformations that led from one to the other. Given a set of images known to be near-duplicates, a dissimilarity function d is used to compare each pair of images, returning small values for similar images and large values for more distinct images.

Once the dissimilarity is calculated for all pairs of images, an asymmetric dissimilarity matrix M is constructed, which can be seen as a complete, directed graph with weights on each edge. To reconstruct the image phylogeny tree relating these images, the authors proposed to use the classic Kruskal's

minimum spanning tree algorithm [14] adapted for oriented graphs, which was named the Oriented Kruskal (OK) algorithm. As a result, images closer to the root of the tree represent the least modified versions, while the leaves represent the images with more significant transformations. Experiments performed using both synthetic images and images collected from the Internet provided good results, though, for large-scale scenarios, the calculation of the entire dissimilarity matrix became an expensive operation. In [18], the authors proposed a solution to this problem by modifying the original reconstruction algorithm to allow missing values in the dissimilarity matrix and reduce the calculation time.

In the literature, there are also other approaches following a similar trend, aiming at finding the structure of the evolution of images on the Internet [10], [19]–[22]. Extensions to the original image phylogeny algorithm were also proposed for reconstructing the tree of evolution of a set of near-duplicate videos (Video Phylogeny) [12], as well as for dealing with more than one phylogeny tree [8], [23]. In addition, the phylogeny of audio clips was also investigated in [24] using the original work proposed by Dias et al. [8].

In the next sections, we give more details about the basic definitions used in phylogeny algorithms for reconstructing trees and forests, given a set of n near-duplicate images.

A. Image Phylogeny Tree (IPT)

Following the formalization described in [8], let $\mathcal{T}$ be a family of image transformations, and let $T$ be a transformation such that $T \in \mathcal{T}$. The dissimilarity function $d$ between two images $I_A$ and $I_B$ can be defined as the lowest value of $d_{I_A,I_B}$, such that

$$ d(I_A, I_B) = \min_{\vec{\beta}} \left| I_B - T_{\vec{\beta}}(I_A) \right|_{L}, \qquad (1) $$

for all possible values of the parameter $\vec{\beta}$ in $\mathcal{T}$. Equation 1 calculates the dissimilarity between $I_B$ and the best transformation mapping $I_A$ onto $I_B$, according to the family of transformations $\mathcal{T}$. The comparison between the images can then be performed by any point-wise comparison method $L$ (e.g., sum of squared differences).

The dissimilarity of each pair of images is calculated by first finding interest points in each pair of images, using SURF [25], which are used to estimate warping and cropping parameters robustly using RANSAC [26]. Pixel color normalization parameters are calculated using the color transfer technique proposed by Reinhard et al. [27]. Then, one of the images is compressed with the same compression parameters as the other. As a final step, both images are uncompressed and compared pixel-by-pixel according to the minimum squared error (MSE) metric. After calculating the dissimilarity for each pair of images, we have a dissimilarity matrix M of size n × n, where n is the number of near-duplicates and each entry of the matrix represents the dissimilarity between one pair of images.
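To make this pipeline concrete, the sketch below shows how one might compute one pairwise dissimilarity and fill the matrix M. It is a minimal approximation and not the authors' implementation: ORB keypoints stand in for SURF, the color-transfer and recompression steps described above are omitted, and the function names are hypothetical.

import cv2
import numpy as np

def dissimilarity(img_a, img_b):
    # 1. Interest points and descriptors in both images (ORB as a stand-in for SURF).
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    # 2. Match descriptors and robustly estimate the warp with RANSAC.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # 3. Map I_A onto I_B's frame and compare point-wise (MSE).
    h, w = img_b.shape[:2]
    warped_a = cv2.warpPerspective(img_a, H, (w, h))
    return float(np.mean((warped_a.astype(np.float64) - img_b.astype(np.float64)) ** 2))

def dissimilarity_matrix(images):
    # Asymmetric n x n matrix M, with M[i, j] = d(I_i, I_j).
    n = len(images)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                M[i, j] = dissimilarity(images[i], images[j])
    return M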

To reconstruct the IPT associated with the dissimilarity matrix, the authors first proposed the Oriented Kruskal algorithm. Considering a graph G that represents the dissimilarity matrix, the Oriented Kruskal algorithm starts by considering each vertex (near-duplicate) of G as one root of the phylogeny tree.


Fig. 1. Optimum branching algorithm simulation. (a) Starting from node 1, the edge with the lowest cost arriving in each node v is selected. Next, the algorithm needs to deal with the circuit in red. (b) A dummy node vd is created, and the edge weights of the nodes connected to vd are updated. (c) Finally, after processing all nodes within and outside the circuit, the algorithm recursively finds the optimal branching.

Then, the algorithm sorts all edges (i, j) of G and analyzes each one according to the sorted order, joining different trees and checking whether or not the edge can be added to the tree.

More recently, Dias et al. presented in [13] other approaches for IPT reconstruction, comparing their performance with the Oriented Kruskal algorithm. The first method is a variation of Prim's minimum spanning tree algorithm [28], adapted for oriented graphs. This Oriented Prim approach initially receives the dissimilarity matrix M and, for each of the n possible roots, it reconstructs the tree rooted at the chosen node and calculates the cost of building such a tree. After evaluating all possible roots, the final IPT is the one with the lowest reconstruction cost, which is calculated by summing the weights of the edges in the tree. However, this algorithm presented low effectiveness for the tree reconstruction, considering the ancestry relationship among the images.

The second approach presented in [13] is based on the Optimum Branching (OB) algorithm developed independently by Chu and Liu [15], Edmonds [16], and Bock [17]. Considering G = (V, E) as the input graph and r one node of the phylogeny tree, the first step in this algorithm consists of assuming the node r as a possible root, as done in the Oriented Prim algorithm. As an example, suppose that vertex 1 in Figure 1 was chosen as the initial root r. Then, for each node v ∈ V | v ≠ r, the edge arriving at v with the lowest cost is selected. If there is no circuit in the resulting graph, the algorithm returns such a branching. Otherwise, it deals with the circuit by creating a dummy node vd. For each edge connecting a node x outside the circuit and a node y in the circuit, the weight w is updated by taking into account w(x, y) plus the lowest edge weight in the circuit minus the weight of the edge arriving at y in the circuit. In this example, w(3, 7) = 6, the lowest edge weight inside the cycle is w(7, 6) = 2, and the weight of the edge that arrives at the vertex y = 7 is w(8, 7) = 3. Thus, w(3, vd) = 6 + 2 − 3 = 5. On the other hand, to update each edge connecting a node z inside the circuit and a node u outside the circuit, it is necessary to take into account w(z, u) plus the lowest edge weight in the circuit minus the weight of the edge arriving at z inside the circuit. For instance, w(vd, 4) = w(8, 4) + w(7, 6) − w(5, 8) = 11 + 2 − 6 = 7.

After processing all nodes within and outside the circuit, the algorithm recursively finds the optimal branching Ai+1 in this graph, resulting in the updated graph depicted in Figure 1(b). Next, the algorithm needs to cope with the circuit itself. This is achieved by first discarding the edge in the current optimum branching connecting a node x to the dummy node vd and updating such edge with the correct one in the original graph. In this case, the edge (3, vd) is replaced by edge (3, 7) in the final branching. In addition, the algorithm updates all edges in the optimum branching within the circuit that do not arrive in y (in the example, the edge (8, 7) is removed). Finally, the algorithm just needs to update possible edges from nodes within the circuit to outside that are part of the optimum branching. The algorithm ends with the graph in Figure 1(c), with solid lines highlighting the edges that belong to the optimum branching.

The algorithm's complexity is given by the recurrence T(V, E) = T(V₀, E₀) + O(V, E), where O(V, E) is the complexity of trying to generate one tree. As in each recursive call the number of vertices is reduced by at least one unit, |V₀| < |V|, we have that T(V, E) = O(V(V + E)). In our case, as E = Θ(V²) (complete graph), then T(V, E) = O(V³). A faster implementation provided by Tarjan [29] also exists, which runs in O(V²) for dense graphs [13].

This new approach outperformed the Oriented Kruskal algorithm, being more effective for finding the roots and the ancestry relationships, either considering complete trees or trees with missing links. This happens due to the different properties presented by each algorithm. The Oriented Kruskal algorithm is a greedy algorithm, that is, it makes a locally optimal choice (the lowest cost edge) at each stage to find the minimum branching. On the other hand, the Optimum Branching algorithm is an exact algorithm, always finding one globally optimum solution (it considers the whole dissimilarity matrix to make decisions about the phylogeny reconstruction), which leads to better results in most of the cases. More details about the optimum branching algorithm can be found in [13].
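For readers who want to experiment with this strategy, the sketch below builds the complete directed graph induced by a dissimilarity matrix and computes its minimum branching with an off-the-shelf Edmonds-based routine from the networkx library. This is only an illustrative stand-in for the OB implementation used in the paper; the function name optimum_branching and the trick of fixing a root by removing its incoming edges are our own assumptions.

import networkx as nx
import numpy as np

def optimum_branching(M, root=None):
    # Build the complete directed graph whose edge weights come from
    # the dissimilarity matrix M (M[i, j] = d(I_i, I_j)).
    n = M.shape[0]
    G = nx.DiGraph()
    for i in range(n):
        for j in range(n):
            if i != j:
                G.add_edge(i, j, weight=float(M[i, j]))
    # Optionally force a root: an arborescence rooted at `root` cannot
    # use any edge arriving at it, so those edges are simply dropped.
    if root is not None:
        G.remove_edges_from(list(G.in_edges(root)))
    # networkx implements Edmonds' algorithm for (minimum) branchings.
    return nx.minimum_spanning_arborescence(G, attr="weight")

# Usage sketch: T = optimum_branching(M); T.edges() then gives the
# (parent, child) relations of the reconstructed IPT.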

B. Image Phylogeny Forest (IPF)

In some cases in multimedia phylogeny, instead of one tree, we may find m trees representing the ancestry relationship in


a set of near-duplicate images. This happens when we have multiple images with the same semantic content, but that are not directly related to each other, for instance, images from the same scene taken with different cameras, or with the same camera but using different settings and in slightly different positions. In this case, each tree in the forest represents the structure of transformations and the evolution of one subset of near-duplicate images, while the forest comprises distinct subsets of near-duplicate images which are semantically similar [23].

A first approach for reconstructing an IPF from a set of semantically similar images extended the idea of reconstructing an IPT from a set of near-duplicate images using the Oriented Kruskal algorithm. In this approach, Dias et al. [8] considered that, to reconstruct m trees, the algorithm would stop when the structure of relationships reached n − m edges, where n is the number of images in the set. Although it provided good results, the algorithm required the user to inform the number of trees to look for in the forest. However, this information is not always available, and without knowing the number of trees to reconstruct, the performance of the algorithm decreased as the number of trees grew, especially regarding the correct number of roots and the ancestry relationship in the IPF [23].

To enable automatic IPF reconstructions, Dias et al. presented in [23] a modified version of the Oriented Kruskal algorithm originally used to reconstruct IPTs. To create the new approach, named Automatic Oriented Kruskal (AOK), three parameters are required as input: the number of semantically similar images n, an n × n dissimilarity matrix M built upon these images, and a parameter γ_AOK, calculated beforehand and defined as the number of standard deviations used to limit the number of edges to be included in the forest.

The AOK algorithm keeps track of the variance of the processed edges and only adds a new one to the forest if the weight of the current edge is lower than γ_AOK times the standard deviation of the edges processed up to that point. This parameter γ_AOK is related to a threshold point τ_AOK that selects only edges that belong to valid trees. To define its value, a study of the behavior of the dissimilarity values of valid trees and forests was performed. It was observed that a log-normal distribution can reasonably describe the data regardless of the number of trees in the forest and the type of image capture (single/multiple cameras). The threshold τ_AOK = μ_AOK + γ_AOK × σ_AOK was used, where μ_AOK represents the average and σ_AOK the standard deviation of the weights of the edges already selected. After testing different values, γ_AOK = 2 was adopted. In terms of complexity, and considering the same notation used for the analysis of the OB algorithm, the AOK algorithm has the same complexity as the original Kruskal algorithm: O(V² log V). Further details and a step-by-step algorithm can be found in [23].
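As a rough illustration of this stopping rule, the sketch below follows the Kruskal-style edge processing with the τ = μ + γ × σ cutoff. It is a simplified sketch, not the exact AOK of [23]; the function name aok and the reduced set of constraints (only the one-parent and no-cycle checks are kept) are our own assumptions.

import numpy as np

def aok(M, gamma=2.0):
    # forest[i] = parent of node i; a node is a root while forest[i] == i.
    n = M.shape[0]
    forest = list(range(n))
    comp = list(range(n))          # component labels, to avoid creating cycles

    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i

    edges = sorted((M[i, j], i, j) for i in range(n) for j in range(n) if i != j)
    accepted = []                  # weights of the edges added so far
    for w, i, j in edges:
        # Stop once the current edge exceeds tau = mu + gamma * sigma of the
        # accepted weights (the remaining edges would merge separate trees).
        if len(accepted) > 1:
            mu, sigma = float(np.mean(accepted)), float(np.std(accepted, ddof=1))
            if w > mu + gamma * sigma:
                break
        ri, rj = find(i), find(j)
        if forest[j] == j and ri != rj:   # j is still a root and no cycle is formed
            forest[j] = i
            comp[rj] = ri
            accepted.append(w)
    return forest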

III. AUTOMATIC RECONSTRUCTION OF IMAGE PHYLOGENY FORESTS

Without user intervention, the IPF reconstruction algorithm relies on the choice of a good threshold point to correctly decide the number of trees in the forest in advance.

Algorithm 1 Automatic Optimum Branching
Input: number n of semantically similar images, n × n dissimilarity matrix M, and the number of standard deviations γ_AOB used as limit (default: γ_AOB = 2)
Output: reconstructed IPF forest
 1: for i ∈ [1..n] do                              ▷ Initialization
 2:     forest[i] ← i
 3: end for
 4: n_edges ← 0                                    ▷ Number of processed edges
 5: x1 ← x2 ← 0                                    ▷ Variables to calculate σ dynamically
 6: G ← Graph(M)
 7: B ← OB_min(G)                                  ▷ Minimum branching
 8: B′ ← sort edges (i, j) of B by weight in non-decreasing order
 9: for (i, j) ∈ B′ do
10:     if (n_edges > |B′|/2) then
11:         σ_AOB ← sqrt((x1 − x2²/n_edges)/(n_edges − 1))
12:         if (w(i, j) − last > γ_AOB × σ_AOB) then
13:             break
14:         end if
15:     end if
16:     n_edges ← n_edges + 1                      ▷ Updates auxiliary variables
17:     last ← w(i, j)                             ▷ Last processed edge
18:     x1 ← x1 + w(i, j)²
19:     x2 ← x2 + w(i, j)
20:     forest[j] ← i                              ▷ Adds new edge to the forest
21: end for
22: return forest                                  ▷ Returns the final forest

In this section, we propose new methods to automatically decide the number of trees in a forest, through the analysis of the dissimilarity matrix values.

A. Automatic Optimum Branching (AOB)

Following a successful approach to automatically decide the number of trees in an IPF using the Oriented Kruskal algorithm, one might wonder what happens if we apply a similar process to the optimum branching algorithm explained in Section II-A. Thus, in this section, we extend upon this idea, proposing the Automatic Optimum Branching (AOB) algorithm, with the necessary modifications to deal with its particularities.

Algorithm 1 describes our proposed implementation, which requires the following parameters: the number n of semantically similar images, an n × n dissimilarity matrix M built upon these images, and the parameter γ_AOB, which is the number of standard deviations used to limit the number of edges to be included in the forest.

Lines 1-3 initialize the vector forest with n initial trees, each tree containing a vertex representing an image. In our implementation, we used the same tree representation introduced in [23], in which each position forest[i] denotes the parent of node i, i = {0..n − 1}. For instance, forest[0] = 3 means that edge (3, 0) exists in the forest. Lines 4-6 initialize the auxiliary variables: n_edges, which keeps track of the number of accepted edges, the variables x1 and x2, used to calculate the standard deviation of the accepted edges, and G, which represents the graph constructed using the values from the dissimilarity matrix M. In Line 7, G is used as input for the Optimum Branching (OB) algorithm, returning its minimum branching in B. Subsequently, the edges of B are sorted into

COSTA et al.: IMAGE PHYLOGENY FORESTS RECONSTRUCTION 1537

non-decreasing order in Line 8 and used to decide the number of trees in this forest in the for loop in Lines 9-21.

In Line 10, the evaluation of the standard deviation only starts when the number of processed edges is greater than half of the edges of the minimum branching. This was done to avoid removing edges that are still valid to our approach, since with fewer values used for the calculation of the standard deviation, it is not possible to obtain a stable result.

In the next step, the algorithm updates the number of accepted edges and their standard deviation, includes the new edge in the forest vector, and goes back to the beginning of the loop. Otherwise, the algorithm leaves the loop and the final result stored in the vector forest is returned in Line 22. In our approach, we use γ_AOB = 2 as the default value (for more details about how this value was obtained, please refer to Section V).
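A compact sketch of this procedure is shown below, reusing the optimum_branching() stand-in from Section II-A and the running standard deviation of Algorithm 1. The function name aob and the use of numpy for the statistics are our own assumptions.

import numpy as np

def aob(M, gamma=2.0):
    # Sketch of Algorithm 1: compute the minimum branching over the whole
    # matrix, walk its edges in non-decreasing order, and stop as soon as a
    # gap larger than gamma * sigma appears among the heaviest edges.
    n = M.shape[0]
    forest = list(range(n))                      # forest[j] = parent of j
    B = optimum_branching(M)                     # see the sketch in Section II-A
    edges = sorted(B.edges(data="weight"), key=lambda e: e[2])
    accepted, last = [], None
    for i, j, w in edges:
        if len(accepted) > len(edges) // 2 and len(accepted) > 1:
            sigma = float(np.std(accepted, ddof=1))
            if w - last > gamma * sigma:
                break                            # the remaining edges link distinct trees
        accepted.append(w)
        last = w
        forest[j] = i
    return forest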

B. Extended Automatic Optimum Branching (E-AOB)

Up to this point, it is possible to obtain a first result for the IPF using the AOB algorithm. However, after some experiments, we noticed that the IPF reconstruction could be further improved by also executing the OB algorithm on each tree belonging to this forest. One important thing to notice in this process is the fact that the AOB algorithm considers all edges to construct the minimum branching. Once we remove some edges to build the forest, we create several partitions that are independent of each other. If these partitions are analyzed separately, the OB algorithm will choose edge connections that are optimal considering only the edges that belong to the current partition. This re-execution characterizes our Extended Automatic Optimum Branching (E-AOB) algorithm. It is also important to notice that, if we apply this re-execution using the Automatic Oriented Kruskal (AOK), due to its greedy heuristic, the result will not change, with AOK selecting the same edges to generate the forest.

The implementation of E-AOB in Algorithm 2 requires as input the dissimilarity matrix M representing the relationship among the images and the initial forest obtained with the execution of the AOB algorithm. Similar to the AOB algorithm, Lines 1-3 initialize the vector finalForest with n initial trees, each tree containing a vertex representing an image, and Line 4 initializes the variable G used to represent the graph constructed using the values from the dissimilarity matrix M. Lines 5-11 use an auxiliary variable G′ to store only the weights of the edges connected in the current forest, obtained from the vector forest and representing subgraphs of graph G. In Line 12, the OB algorithm is executed again, returning the optimum branching for each of the trees in this forest. Lines 13-15 update the variable finalForest with this new forest, which is returned by the algorithm in Line 16.
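A minimal sketch of this re-execution is given below, again relying on the optimum_branching() stand-in and on the parent-vector representation of the forest; root_of and e_aob are hypothetical names, and running OB per partition (rather than on a single restricted graph) is our own simplification.

import numpy as np

def root_of(forest, i):
    # Follow parent pointers until reaching the root of node i.
    while forest[i] != i:
        i = forest[i]
    return i

def e_aob(M, forest):
    # For each tree found by AOB, keep only the dissimilarities among its
    # own nodes and re-run the optimum branching on that sub-matrix, so
    # every tree is re-optimized in isolation (cf. Algorithm 2).
    n = M.shape[0]
    final_forest = list(range(n))
    for r in {root_of(forest, v) for v in range(n)}:
        members = [v for v in range(n) if root_of(forest, v) == r]
        if len(members) < 2:
            continue                                # singleton trees stay as they are
        sub = M[np.ix_(members, members)]           # intra-tree dissimilarities only
        branching = optimum_branching(sub)          # OB re-execution on this partition
        for a, b in branching.edges():
            final_forest[members[b]] = members[a]   # map local indices back to nodes
    return final_forest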

Figure 2 illustrates the execution of our proposed algorithm for a toy example with n = 12 semantically similar images related by the dissimilarity matrix M shown in Figure 2(a). After executing the OB algorithm once, we obtain the minimum branching tree B shown in Figure 2(a). We sort its edges from the lowest to the highest weight to decide which edges should be deleted to construct the forest.

Algorithm 2 Extended Automatic Optimum Branching
Input: number n of semantically similar images, n × n dissimilarity matrix M, and the vector forest with the IPF reconstructed using AOB
Output: reconstructed forest finalForest
 1: for i ∈ [1..n] do                              ▷ Initialization
 2:     finalForest[i] ← i
 3: end for
 4: G ← Graph(M)
 5: for i ∈ [1..n] do
 6:     for j ∈ [1..n] do                          ▷ Keeps valid connections
 7:         if (Roots(forest, j) = Roots(forest, i)) then
 8:             G′(i, j) ← G(i, j)
 9:         end if
10:     end for
11: end for
12: B″ ← OB_min(G′)                                ▷ Execute OB again
13: for (i, j) ∈ B″ do
14:     finalForest[j] ← i
15: end for
16: return finalForest                             ▷ Returns the final forest

The table in Figure 2(b) shows the dynamic calculation of the standard deviation, which starts from Step 7 with edge (8, 9), since |B′|/2 = 12/2 = 6. In this case, the current standard deviation of the edges is σ_AOB ≈ 0.48. Since w(8, 9) − w(3, 0) = 2.81 − 2.71 = 0.10 and 0.10 < 2 × 0.48 ≈ 0.97, this edge can be accepted and the algorithm continues to the next edge. In Step 8, the edge (1, 7) updates σ to 0.53 and, since w(1, 7) − w(8, 9) = 2.88 − 2.81 = 0.07 and 0.07 < 2 × 0.53 ≈ 1.07, the algorithm proceeds to the next edge until it reaches edge (0, 2). For this edge, the standard deviation is σ_AOB ≈ 0.58 and w(0, 2) − w(11, 8) = 6.27 − 2.96 = 3.31, with 3.31 > 2 × 0.58 ≈ 1.16, which is above the allowed limit. Hence, this edge is discarded and the algorithm stops, returning the forest depicted in Figure 2(c).

Based on the forest reconstructed in this first part, we can apply the E-AOB method, updating the dissimilarity matrix to keep only the edges connected in the current forest and executing the OB algorithm again. The final result for the forest is depicted in Figure 2(d), where each tree is represented by the same color as its corresponding sub-matrix used for the reconstruction. With the execution of this additional step of AOB, it is possible to correct the position of nodes 0 and 3, whose parent-child relationship was reversed in the initial IPF reconstructed by AOB.

IV. EXPLORING MULTIPLE COMBINATIONS

As mentioned in the previous sections, there is basically only one method in the literature currently being explored for the reconstruction of IPFs, the Automatic Oriented Kruskal (AOK). In addition, with the two extensions we introduced in Section III, we have AOB and E-AOB, which are extensions of the Optimum Branching algorithm proposed in [13].

Considering a dissimilarity matrix M, the AOK algorithm follows a greedy heuristic, while AOB and E-AOB search for the best global solution for the phylogeny reconstruction. However, all these algorithms assume a perfect dissimilarity calculation, which is not always true. Thus, in some cases, a greedy heuristic may present better results than a global solution. In this section, aiming at exploring these different


Fig. 2. Simulation of the AOB and E-AOB algorithms to reconstruct an IPF with three trees and 12 semantically similar images, from a 12 × 12 dissimilarity matrix. (a) Optimum Branching calculation. (b) Heaviest edges elimination. (c) IPF with AOB. (d) IPF with Extended AOB.

properties and their complementarity, we propose a combination of the results given by each approach, in such a way that errors introduced by one method can be fixed by the other method(s).

A. Fusion Methodology for IPFs Reconstruction

Given n semantically similar images and the reconstructed IPFs of two or more methods to be combined, our approach is based on the idea of finding their most common elements to construct the final IPF. In the first part of our implementation, we calculate (a) the number of roots returned by each forest, (b) the number of times each node appears as a root, and (c) the number of times each edge connecting two nodes in the forest appears. In the last part, we implement the fusion scheme, constructing a new forest whose roots and edges are the ones with the highest number of votes calculated in the previous step.

However, it is known that the calculation of the dissimilarity among the images is not an exact estimation, given that one of its steps includes the matching of images, which is not perfect. In our implementation, we take advantage of this flaw in the dissimilarity calculation, and apply different amounts of perturbation through noise addition to the dissimilarity matrix M relating a set of images, generating 100 different variations of M and using them to reconstruct the IPFs. The number of variations was chosen in such a way that it is enough to analyze the consistency among the results and to achieve statistical significance in the analysis.

The noise was created by first calculating the standard deviation σ_M of all values in the dissimilarity matrix M relating each set of images. Then, for each value m_ij ∈ M, a different value in [−k × σ_M, +k × σ_M] was randomly chosen using a uniform distribution and added to the corresponding m_ij entry. In our implementation, k = 1.5% (for more details about this value, please refer to Section V-A).
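The sketch below shows how one such perturbed copy of M might be generated under this description; the function name perturb is hypothetical.

import numpy as np

def perturb(M, k=0.015, rng=None):
    # One perturbed copy of the dissimilarity matrix: each entry receives
    # uniform noise drawn from [-k * sigma_M, +k * sigma_M], where sigma_M
    # is the standard deviation of all entries of M (k = 1.5% by default).
    rng = np.random.default_rng() if rng is None else rng
    sigma_m = float(np.std(M))
    noisy = M + rng.uniform(-k * sigma_m, k * sigma_m, size=M.shape)
    np.fill_diagonal(noisy, 0.0)        # keep self-dissimilarities at zero
    return noisy

# The fusion scheme uses 100 such variations:
# variations = [perturb(M) for _ in range(100)]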

Once the number of roots and edges for all forests has been calculated, we calculate the final number of roots r by choosing the median of the number of votes received by all methods in each of the 100 executions.¹

¹There are several alternative strategies to calculate r, such as the average and the most frequent value. In our experiments, the median presented better results during training, but there is still room for further exploration.


Fig. 3. Reconstructed IPFs for the (a) AOK, (b) AOB, and (c) E-AOB methods used separately. Highlighted in red, the nodes in the wrong position with respect to the ground-truth forest, which can happen in different positions within the 100 variations. In (d), the final forest after fusion of the IPF reconstruction algorithms, with the corrected nodes highlighted in blue.

Then, to decide which nodes are the roots, we select the r nodes having the highest number of votes. In case there is a tie among the votes, we randomly choose one or some of them, depending on how many roots are yet to be decided. For the edge selection, we sum up the number of times each edge connecting two nodes appears in each forest, constructing a matrix of votes for edges. Once the roots are chosen, we fix these roots and give the matrix of votes for edges as input to a maximum branching algorithm, resulting in the sought combined forest.
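A sketch of this vote aggregation is shown below. It assumes the forests are given as parent vectors (forest[i] == i marking a root) and fixes the chosen roots by routing them through a dummy super-root before calling networkx's maximum branching routine; the function name fuse and this particular way of fixing the roots are our own assumptions, not the authors' implementation.

import numpy as np
import networkx as nx
from statistics import median

def fuse(forests, n):
    # Count root votes and edge votes over all reconstructed forests
    # (one forest per method per perturbed dissimilarity matrix).
    root_counts, root_votes = [], np.zeros(n)
    edge_votes = np.zeros((n, n))
    for f in forests:
        roots = [i for i in range(n) if f[i] == i]
        root_counts.append(len(roots))
        for i in roots:
            root_votes[i] += 1
        for j in range(n):
            if f[j] != j:
                edge_votes[f[j], j] += 1

    r = int(median(root_counts))                           # final number of roots
    chosen_roots = {int(v) for v in np.argsort(-root_votes)[:r]}   # top-voted nodes

    # Maximum branching over the edge-vote matrix, with the roots fixed:
    # the dummy super-root is the only way to reach a chosen root, so the
    # branching is forced to keep the chosen nodes as roots of their trees.
    G, dummy = nx.DiGraph(), n
    for i in range(n):
        for j in range(n):
            if i != j and j not in chosen_roots:
                G.add_edge(i, j, weight=float(edge_votes[i, j]))
    for root in chosen_roots:
        G.add_edge(dummy, root, weight=0.0)

    B = nx.maximum_spanning_arborescence(G, attr="weight")
    fused = list(range(n))
    for i, j in B.edges():
        if i != dummy:
            fused[j] = i
    return fused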

For instance, consider the example depicted in Figure 3 with n = 15 semantically similar images. The first set shown in Figure 3(a) represents all 100 possibilities of reconstructed IPFs using AOK, with the first IPF illustrating one out of the 100 possible reconstructions. The nodes highlighted in red denote nodes in the wrong position when compared to the ground-truth forest. Similarly, Figure 3(b) and (c) represent the IPFs obtained using AOB and E-AOB, respectively. Each set of IPFs is followed by the votes for the number of roots and the number of votes each root received.

Let V be a vector storing the number of roots returned by each method for each variation of the IPF. To calculate the number of roots the final IPF should have, we choose the median of V, that is, Median(V) = 3. To decide which nodes are the roots, we select the nodes with the highest number of votes, which in this case are Roots = {0, 1, 2}.

To count the votes for edges, we go through all edges (except for the ones pointing to the roots), increasing their score in a matrix of votes every time the edge (i, j) appears, and summing up the votes for the edges in all forests being combined. Figure 3(d) shows the sum of votes for roots and edges obtained with the combination AOK × AOB × E-AOB, and the result after the fusion. The nodes highlighted in blue in the final forest represent the nodes that were previously in the wrong position (highlighted in red in the forest returned by the first execution of each of the methods), and were corrected after applying our fusion algorithm. Since the forest obtained after the fusion is the same as the one used to create this example, it is possible to see that the fusion algorithm was able to correct the roots as well as the position of nodes {1, 5, 7, 8, 12, 13}.


V. EVALUATION AND RESULTS

For evaluating the reconstructed IPFs, we consider the same quantitative metrics (roots, edges, leaves, and ancestry) introduced by Dias et al. [23], considering scenarios where the ground truth is available. The following equation is used:

$$ EM(\mathrm{IPF}_1, \mathrm{IPF}_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \qquad (2) $$

where EM represents the evaluation metric, IPF1 is the reconstructed forest with elements represented by S1, and IPF2 is the forest ground truth with elements S2. This equation calculates the intersection of the result returned by IPF1 in metric EM with respect to the reference forest IPF2, and normalizes it by the union of both sets. For instance, in the example of Figure 2, the IPF reconstructed by E-AOB is the same as the Ground-Truth Forest (GTF) used to create the example. Therefore, the root metric yields Root(E-AOB, GTF) = 3/3 = 100%. On the other hand, in comparison to the result returned by the baseline AOB, we obtain Root(AOB, GTF) = 2/4 = 50%.
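As an illustration of Equation 2, the small sketch below computes the roots and edges metrics from two parent vectors; the function names are hypothetical and the forest representation follows the one used for the algorithms above.

def metric(s1, s2):
    # Equation 2: |S1 intersection S2| / |S1 union S2|.
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0

def roots_metric(forest, gtf):
    # Roots metric between a reconstructed forest and the ground truth,
    # both given as parent vectors (forest[i] == i marks a root).
    r1 = {i for i, p in enumerate(forest) if p == i}
    r2 = {i for i, p in enumerate(gtf) if p == i}
    return metric(r1, r2)

def edges_metric(forest, gtf):
    # Edges metric: compare the sets of (parent, child) pairs.
    e1 = {(p, i) for i, p in enumerate(forest) if p != i}
    e2 = {(p, i) for i, p in enumerate(gtf) if p != i}
    return metric(e1, e2)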

For our experiments in a controlled scenario, we consider images taken with a single camera (OC) and with multiple cameras (MC) having similar scene semantics (the main content of the image is the same, but with small variations in the camera parameters, such as viewpoint, zoom, etc.). Three different datasets were used in our experiments, with images in the JPEG format:
• Training Dataset: represents a small exploratory set with 2 × 3³ × 4 × 1 × 10 = 2,160 forests, containing images from the OC and MC scenarios, taken from three different cameras, three different scenes, three images per camera, four forest sizes |F| = {2..5}, one topology, and 10 random variations of parameters for creating the near-duplicate images. A topology refers to the form of the trees in a forest.

• Dataset A: comprises images from the OC and MC scenarios, three different scenes, three different cameras, three images per camera, four forest sizes |F| = {2..5}, four different tree topologies, and ten random variations of parameters for creating the near-duplicate images. Therefore, the dataset comprises 2 × 3³ × 4 × 4 × 10 = 8,640 forests. The Training Dataset as well as Dataset A are the same used in [23].

• Dataset B: comprises images randomly selected from a set of 20 different scenes, 10 different cameras, 10 images per camera, 10 different tree topologies, 10 random variations of parameters for creating the near-duplicate images, and forests with 10 trees each. For each of the cases, OC and MC, a total of 2,000 forests within this set were randomly selected with forests of size |F| = {2..10}. Therefore, this set consisted of 2 × 2,000 × 9 = 36,000 forests.

The image transformations used to create the near-duplicates are the same used in [8]: resampling, cropping, affine warping, brightness and contrast adjustment, and lossy compression using the standard lossy JPEG algorithm.

The experiments have two phases: first, we analyze the effects of choosing different values for the parameter γ_AOB (threshold calculation) and for the parameter k (amount of noise to be introduced in all variations of the dissimilarity matrix). Second, we evaluate the robustness of the approaches proposed in this paper in a controlled scenario. For the tests using the OB algorithm, we consider a black-box implementation² that follows Tarjan's description [29].

A. First Phase: Calculation of Parameters γAOB and k

The experiments described in this section were all performed using the Training Dataset previously described. To find the value of the parameter γ_AOB, we studied the behavior of AOB using Algorithm 1, varying the threshold in the interval [μ_AOB − 2σ_AOB, μ_AOB + 2σ_AOB], separated by steps of 0.1. From this experiment, we defined τ_AOB = μ_AOB + (2 × σ_AOB) and, as a consequence, γ_AOB = 2.

The parameter k used to add noise to the dissimilarity matrices was chosen according to the formulation described in Section IV-A. The parameter k was tested in the interval [1%, 10%], separated by steps of 0.5%. The best results were obtained for k = 1.5% of the σ_M value (recall that σ_M is the standard deviation of the input dissimilarity matrix).

B. Second Phase: Validation in a Controlled Scenario

In the second phase of our experiments, we analyzed the robustness of AOK, AOB, and E-AOB in two parts: (a) using each method separately and (b) using their possible combinations C = {AOK × AOB, AOK × E-AOB, AOB × E-AOB, AOK × AOB × E-AOB} for Datasets A and B.

To directly compare the algorithms, the error variation Δerror was calculated with respect to each metric (roots, edges, leaves, and ancestry), using the same equation introduced by Dias et al. [13]:

$$ \Delta error_{metric}(M_1, M_2) = \frac{1 - M_{1_{metric}}}{1 - M_{2_{metric}}} - 1 \qquad (3) $$

where M1 represents the method being evaluated in comparison to method M2. For instance, in Table I, in column Δerror(E-AOB, AOK), M1 represents E-AOB, M2 represents AOK, and each of the columns Roots, Edges, Leaves, and Ancestry represents the results for Δerror_roots, Δerror_edges, Δerror_leaves, and Δerror_ancestry, respectively.
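As a hypothetical numerical illustration of Equation 3 (the accuracies below are made up and do not come from the experiments): if a method M1 finds the correct roots in 90% of the cases (M1_roots = 0.90) and the baseline M2 in 80% (M2_roots = 0.80), then Δerror_roots(M1, M2) = (1 − 0.90)/(1 − 0.80) − 1 = 0.50 − 1 = −0.50, i.e., M1 eliminates half of the root errors made by M2.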

A Δerror < 0 value means that algorithm M1 performed better than M2, being able to reduce the error. Table I compares the methods AOB and E-AOB against AOK, with the E-AOB algorithm outperforming the other two methods in both datasets, A and B, regardless of the number of trees in the forest. On average, the E-AOB algorithm reduces the error by about 41% for the roots metric, 7% for edges, 10% for leaves, and 11% for ancestry in dataset A. For dataset B, the average error reduction is approximately 20% for the roots, leaves, and ancestry metrics, and 18% for edges.

A Wilcoxon signed-rank test was performed for all metrics in dataset B, comparing (i) AOK with AOB, and (ii) AOK with E-AOB. In Table I, results for this test are described in its last row, with the blue dots indicating that the differences among the results are statistically significant, at a 95% confidence level,

²Available at http://edmonds-alg.sourceforge.net


TABLE I
COMPARISON AMONG AOK [23] AND THE VARIATIONS OF THE AOB ALGORITHM. (a) SEMANTICALLY SIMILAR IMAGES FROM THE SCENARIO USING A SINGLE CAMERA (OC). (b) SEMANTICALLY SIMILAR IMAGES FROM THE SCENARIO USING MULTIPLE CAMERAS (MC)

in favor of AOB or E-AOB, while the red crosses represent a statistical difference in favor of the baseline AOK. In case (i), a statistical difference in favor of AOK was found for the roots metric in the OC scenario, and for the roots and ancestry metrics in the MC scenario. These results show that AOB is only able to improve the results of AOK regarding the edges and leaves metrics. On the other hand, when comparing AOK and E-AOB (case ii), all differences are statistically significant in favor of E-AOB, confirming that this method has better performance than the state-of-the-art method presented in the literature [23].

Using as baseline the results presented in Table I, better results for the fusion approach were found for the combination

(AOK × AOB × E-AOB). Although the combination using only AOK and E-AOB seems to be a better choice, this fusion presented worse results than the one using the three methods together.

In our experiments, we noticed that the AOB method is currently important in the fusion approach, mainly because the dissimilarity calculation is not perfect. For instance, in some cases, the dissimilarity between a pair of images may be exchanged: node A is the real parent of node B, but the dissimilarity d(B, A) < d(A, B). This will lead AOK to choose the wrong direction between these images, since it follows a greedy heuristic. Complementarily, by taking into account all


TABLE II
RESULTS FOR THE FUSION (AOK × AOB × E-AOB) AND THE Δerror IN COMPARISON TO AOK AND E-AOB. (a) SEMANTICALLY SIMILAR IMAGES FROM THE SCENARIO USING A SINGLE CAMERA (OC). (b) SEMANTICALLY SIMILAR IMAGES FROM THE SCENARIO USING MULTIPLE CAMERAS (MC)

dissimilarities, AOB may choose the correct direction for some of these cases. With the re-execution performed by E-AOB (on each tree separately), some edges may change their direction to obtain the optimum branching. In some cases, E-AOB chooses edges that AOK also chooses erroneously. Therefore, in the fusion (AOK × E-AOB), since we take into account only the votes of both methods, if both of them have a large number of votes for some wrong edge, it will lead to the wrong result. However, when we include the votes from AOB, combined with the votes correctly obtained by E-AOB, we may be able to keep the right edge direction. If the dissimilarity calculation can be further improved, there is a possibility that the fusion

AOK × E-AOB may present better results than the fusion of AOK × AOB × E-AOB. Improving the dissimilarity calculation is not within the scope of this paper, but it is a challenge to be addressed by future research in the community. Results for the fusions (AOK × AOB), (AOK × E-AOB), and (AOB × E-AOB) can be found in the supplementary material.

Table II shows the results for the fusion (AOK × AOB × E-AOB) and the error variation in comparison to AOK and to the current best performing algorithm, E-AOB. In this fusion, the OC scenario had lower performance only in dataset B, introducing more error for the roots and ancestry metrics when the forest has more than eight trees. However, after running


Fig. 4. Comparison among the proposed approaches for scenarios with a single camera (OC) and with multiple cameras (MC), featuring the roots and ancestry metrics. (a) Metric: Roots (OC). (b) Metric: Ancestry (OC). (c) Metric: Roots (MC). (d) Metric: Ancestry (MC).

a Wilcoxon signed-rank test for these results, no statistical difference was found among them (represented by the green dashes in the last row of the table). For the other metrics in OC, and all metrics in the MC scenario, differences are statistically significant at a 95% confidence level in favor of the fusion approach.

The graphs in Figure 4 depict a comparison among the performance of all methods described in Table I and the fusion. Only the roots and ancestry metrics are featured in these graphs because they are usually the most important ones in a forensic scenario. For instance, roots are important to find the source of the illegal activity, and with the ancestry relationship we can trace back the users involved in the chain of transformations. The graphs make clearer the improvement obtained with E-AOB and the fusion in comparison to AOK and AOB.

Additionally, Figure 5 shows a comparison of the error reduction (−Δerror) between the E-AOB method and the fusion AOK × AOB × E-AOB, in comparison to AOK, described in Table II. A closer analysis of the results shows that the MC scenario has better performance than OC. In a forensic scenario this is advantageous since, in most of the cases, more than one user may be producing and spreading some illegal content from the same scene, using different cameras.

In this case, it is important to find enough evidence to frame all users involved in the illegal activity (e.g., cases of child pornography). Furthermore, the MC case is also more common in other scenarios. Consider, for instance, that we want to find all images related to a certain topic. In this case, the images will come from different photographers and the chances are higher that they used different cameras. Therefore, the correct identification and separation of each group of images plays an important role.

Regarding the methods' running time, the biggest bottleneck corresponds to the dissimilarity calculation, as reconstructing the forest and performing the fusion is much faster. Considering a forest with 100 semantically similar images, the dissimilarity calculation step takes 38.31 minutes, on average, for all cases. On the other hand, the time to perform the fusion of three methods (including the time spent in 100 perturbations of each method, the vote counting, and the final execution of the fusion) takes, on average, 0.478 seconds. Tables III and IV detail the running time for all forest sizes. These experiments were performed on a machine with an Intel Xeon E5645 processor, 2.40 GHz, 16 GB of memory, running Ubuntu 12.04.4 LTS (GNU/Linux 3.5.0-49-generic x86_64).


Fig. 5. Comparison among the error reduction (−Δerror) of the E-AOB method and the Fusion (AOK × AOB × E-AOB) in comparison to AOK, for the roots and ancestry metrics. In (a) and (b), the graphs represent the scenario with images from a single camera (OC), while (c) and (d) show the results for images from the multiple camera (MC) scenario. (a) Metric: Roots (OC). (b) Metric: Ancestry (OC). (c) Metric: Roots (MC). (d) Metric: Ancestry (MC).

TABLE III
AVERAGE TIME (minutes) FOR THE DISSIMILARITY CALCULATION AND THE FOREST RECONSTRUCTION (SINGLE EXECUTION)

Finally, to ensure reproducibility of all experiments, all datasets used in the experiments and the entire source code will be freely available. The datasets are registered on the address http://figshare.com/articles/Image_Phylogeny_Forests_Reconstruction/1012816 under the accession number http://dx.doi.org/10.6084/m9.figshare.1012816. The source code and documentation are available in a public repository at the following address: http://repo.recod.ic.unicamp.br/public/projects.

TABLE IV
AVERAGE TIME (milliseconds) CONSIDERING 100 EXECUTIONS TO PERFORM THE FUSION (AOK × AOB × E-AOB)

VI. CONCLUSIONS

In this paper, we introduced three new approaches to deal with image phylogeny forests. First, we extended the approach developed for phylogeny trees by Dias et al. [13], using Optimum Branching and applying, for the automatic reconstruction of IPFs, an idea similar to the one the authors used for their Oriented Kruskal algorithm. By finding a threshold point after analyzing the behavior of valid forests, we made possible the use of the optimum branching algorithm to deal with forests, resulting in the AOB method. Results were further improved with the proposal of E-AOB, which adds an important step of re-executing the OB algorithm in each tree of the IPF found by AOB, further refining the initial results. Both approaches outperformed the state-of-the-art AOK method presented in the literature.
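
A minimal sketch of this refinement step (assumed function and variable names, not the authors' implementation) could re-run an off-the-shelf optimum branching solver restricted to the nodes of each tree returned by AOB:

    # Hedged sketch of the E-AOB refinement: re-run optimum branching inside
    # each tree of the forest produced by AOB. Uses networkx's Edmonds/Chu-Liu
    # implementation; the input format below is an assumption for illustration.
    import networkx as nx

    def refine_forest(dissimilarity, forest_node_sets):
        """dissimilarity: dict mapping (i, j) -> cost of 'i is the parent of j'.
        forest_node_sets: list of node sets, one per tree found by AOB."""
        refined = []
        for nodes in forest_node_sets:
            g = nx.DiGraph()
            g.add_nodes_from(nodes)
            for i in nodes:
                for j in nodes:
                    if i != j:
                        g.add_edge(i, j, weight=dissimilarity[(i, j)])
            # minimum-cost arborescence restricted to this tree's nodes
            refined.append(nx.minimum_spanning_arborescence(g))
        return refined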

We have also explored the idea of combining the best matches among the proposed methods, aiming at reducing the errors they introduce when used independently and at improving the scores of all metrics. Thus, we proposed a novel approach based on the combination of the forests reconstructed by the AOK, AOB, and E-AOB methods. Results for this new approach showed performance better than or equivalent to the best result achieved so far (the E-AOB method), making it a competitive solution for the image phylogeny problem.

Our fusion algorithm views the values of a dissimilarity matrix as random variables. Since the relative ordering between causal pairs of images can change when their difference is small enough, our method introduces robustness to statistical errors in the estimation of the image dissimilarities and reduces the importance of extra fine-tuning of the dissimilarity calculation. Our approach can be easily extended to an arbitrary number of algorithms.
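
To make the voting idea concrete, the following is a minimal sketch (not the authors' implementation; names and values are illustrative) of counting how often each directed (parent, child) edge is selected across the reconstructions produced by the different methods under perturbed dissimilarity matrices:

    # Hedged sketch: count edge votes across several reconstructed forests,
    # each represented as a set of directed (parent, child) edges.
    from collections import Counter

    def count_edge_votes(forests):
        """forests: iterable of edge sets, one per (method, perturbation) run."""
        votes = Counter()
        for edges in forests:
            votes.update(edges)   # each run casts one vote per chosen edge
        return votes

    # Hypothetical reconstructions over four images {0, 1, 2, 3}:
    runs = [
        {(0, 1), (1, 2), (0, 3)},
        {(0, 1), (0, 2), (0, 3)},
        {(0, 1), (1, 2), (0, 3)},
    ]
    votes = count_edge_votes(runs)
    print(votes[(0, 1)], votes[(1, 2)])   # 3 2

The most voted edges can then be used by a final reconstruction step to build the fused forest.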

As future work, we intend to further analyze the behavior of the phylogeny algorithms using other statistical measures for the IPF reconstruction. For instance, kurtosis is a statistical measure of extreme variation, such that higher values denote a few extreme deviations in the data, while lower values indicate more frequent, smaller deviations. Variations in the kurtosis pattern have been previously studied in image forensics to detect, for instance, noise added to the image at various stages of production, which changes the kurtosis [30], or image splicing, by identifying portions of the image with significantly different kurtosis values [31]. By exploring an analogous possibility, we aim at finding a relationship between the variation of the kurtosis of the edges chosen by the phylogeny forest algorithms while reconstructing the forest, and the choice of the number of trees this forest should have.
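
For reference (the definition is not restated in the text), the standard kurtosis of a random variable X with mean μ and standard deviation σ is Kurt[X] = E[(X − μ)^4] / σ^4; a Gaussian has kurtosis 3, and larger values indicate heavier tails, i.e., a few extreme deviations dominating the distribution.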

Extensions to our approaches also include solving the phylogeny forest problem for other media types, such as videos. We also intend to perform experiments with real cases found on the Internet (a non-controlled scenario), in which we do not have any knowledge about the history of generation of the duplicates.

REFERENCES

[1] A. Jaimes, S.-F. Chang, and A. Loui, "Duplicate detection in consumer photography and news video," in Proc. ACM Workshop Multimedia Security, 2002, pp. 423–424.

[2] F. Schaffalitzky and A. Zisserman, "Multi-view matching for unordered image sets, or 'How do I organize my holiday snaps?'," in Proc. 7th Eur. Conf. Comput. Vis., 2002, pp. 414–431.

[3] D.-Q. Zhang and S.-F. Chang, "Detecting image near-duplicate by stochastic attributed relational graph matching with learning," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp. 877–884.

[4] X. Cheng and L.-T. Chia, "Stratification-based keyframe cliques for removal of near-duplicates in video search results," in Proc. ACM Int. Conf. Multimedia Inform. Retr., 2010, pp. 313–322.

[5] Z. X. H. Ling, F. Zou, Z. Lu, and P. Li, "Robust image copy detection using multi-resolution histogram," in Proc. ACM Int. Conf. Multimedia Inform. Retr., 2010, pp. 129–136.

[6] Y. Ke, R. Sukthankar, and L. Huston, "Efficient near-duplicate detection and sub-image retrieval," in Proc. ACM Workshop Multimedia Security, 2004, pp. 869–876.

[7] J. Fridrich, D. Soukal, and J. Lukas, "Detection of copy-move forgery in digital images," in Proc. Digital Forensics Res. Conf., 2003, pp. 134–137.

[8] Z. Dias, A. Rocha, and S. Goldenstein, "Image phylogeny by minimal spanning trees," IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 774–788, Apr. 2012.

[9] B. Lewin, Genes VI. London, U.K.: Oxford Univ. Press, 1997.

[10] A. D. Rosa, F. Uccheddu, A. Costanzo, A. Piva, and M. Barni, "Exploring image dependencies: A new challenge in image forensics," Proc. SPIE, vol. 7541, no. 2, pp. 1–12, 2010.

[11] Z. Dias, A. Rocha, and S. Goldenstein, "First steps towards image phylogeny," in Proc. IEEE Int. Workshop Inform. Forensics Security, 2010, pp. 1–6.

[12] Z. Dias, A. Rocha, and S. Goldenstein, "Video phylogeny: Recovering near-duplicate video relationships," in Proc. IEEE Int. Workshop Inform. Forensics Security, Dec. 2011, pp. 1–6.

[13] Z. Dias, S. Goldenstein, and A. Rocha, "Exploring heuristic and optimum branching algorithms for image phylogeny," J. Vis. Commun. Image Represent., vol. 24, no. 7, pp. 1124–1134, Oct. 2013.

[14] J. B. Kruskal, "On the shortest spanning subtree of a graph and the traveling salesman problem," Proc. Amer. Math. Soc., vol. 7, no. 1, pp. 48–50, 1956.

[15] Y. J. Chu and T. H. Liu, "On the shortest arborescence of a directed graph," Sci. Sinica, vol. 14, no. 4, pp. 1396–1400, 1965.

[16] J. Edmonds, "Optimum branchings," J. Res. Nat. Inst. Standards Technol., vol. 71B, no. 4, pp. 233–240, 1967.

[17] F. Bock, "An algorithm to construct a minimum directed spanning tree in a directed network," Develop. Oper. Res., vol. 1, pp. 29–44, 1971.

[18] Z. Dias, S. Goldenstein, and A. Rocha, "Large-scale image phylogeny: Tracing image ancestral relationships," IEEE Multimedia, vol. 20, no. 3, pp. 58–70, Jul./Sep. 2013.

[19] L. Kennedy and S.-F. Chang, "Internet image archaeology: Automatically tracing the manipulation history of photographs on the web," in Proc. ACM 16th Int. Conf. Multimedia, 2008, pp. 349–358.

[20] Z. Fan and R. L. Queiroz, "Identification of bitmap compression history: JPEG detection and quantizer estimation," IEEE Trans. Image Process., vol. 12, no. 2, pp. 230–235, Feb. 2003.

[21] J. Mao, O. Bulan, G. Sharma, and S. Datta, "Device temporal forensics: An information theoretic approach," in Proc. 16th IEEE Int. Conf. Image Process., Nov. 2009, pp. 1501–1504.

[22] J. R. Kender, M. L. Hill, A. Natsev, J. R. Smith, and L. Xie, "Video genetics: A case study from YouTube," in Proc. Int. Conf. Multimedia, 2010, pp. 1253–1258.

[23] Z. Dias, S. Goldenstein, and A. Rocha, "Toward image phylogeny forests: Automatically recovering semantically similar image relationships," Forensic Sci. Int., vol. 231, nos. 1–3, pp. 178–189, 2013.

[24] M. Nucci, M. Tagliasacchi, and S. Tubaro, "A phylogenetic analysis of near-duplicate audio tracks," in Proc. IEEE Int. Workshop Multimedia Signal Process., Oct. 2013, pp. 99–104.

[25] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Understand., vol. 110, no. 3, pp. 346–359, 2008.

[26] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.

[27] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, "Color transfer between images," IEEE Comput. Graph. Appl., vol. 21, no. 5, pp. 34–41, Sep./Oct. 2001.

[28] R. C. Prim, "Shortest connection networks and some generalizations," Bell Syst. Tech. J., vol. 36, no. 6, pp. 1389–1401, 1957.

[29] R. E. Tarjan, "Finding optimum branchings," Networks, vol. 7, no. 1, pp. 25–35, 1977.

[30] D. Zoran and Y. Weiss, "Scale invariance and noise in natural images," in Proc. IEEE 12th Int. Conf. Comput. Vis., Oct. 2009, pp. 2209–2216.

[31] X. Pan, X. Zhang, and S. Lyu, "Exposing image splicing with inconsistent local noise variances," in Proc. IEEE Int. Conf. Comput. Photogr., Apr. 2012, pp. 1–10.


Filipe de O. Costa received the B.Sc. degree in computer science from the Federal University of Alfenas, Alfenas, Brazil, in 2010, and the master's degree in computer science from the University of Campinas, Campinas, Brazil, in 2012, where he is currently pursuing the Ph.D. degree with the Institute of Computing. His main research interests include digital forensics, computer vision, image processing, and machine learning.

Marina A. Oikawa received the B.Sc. degree in computer science from the Federal University of Pará, Belém, Brazil, in 2006, and the master's and Ph.D. degrees in engineering from the Nara Institute of Science and Technology, Ikoma, Japan, in 2010 and 2013, respectively. She is currently a Post-Doctoral Researcher with the Institute of Computing, University of Campinas, Campinas, Brazil. Her main research interests include digital forensics, computer vision, and computer graphics.

Zanoni Dias received the B.Sc. and Ph.D. degrees in computer science from the University of Campinas (Unicamp), Campinas, Brazil, in 1997 and 2002, respectively. Since 2003, he has been an Assistant Professor with the Institute of Computing at Unicamp. His main interests include theoretical computer science, bioinformatics, and computational molecular biology.

Siome Goldenstein received the degree in electronic engineering from the Federal University of Rio de Janeiro, Rio de Janeiro, Brazil, in 1995, the M.Sc. degree in computer science from Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, in 1997, and the Ph.D. degree in computer and information science from the University of Pennsylvania, Philadelphia, PA, USA, in 2002. He is an Associate Professor with the Institute of Computing, University of Campinas, Campinas, Brazil. His research interests lie in computer vision, computer graphics, forensics, and machine learning. He is an Area Editor of three journals: Computer Vision and Image Understanding, Graphics Models, and the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. He has been a Program Committee Member of multiple conferences and workshops, including the local organization of the 2007 IEEE International Conference on Computer Vision in Rio de Janeiro.

Anderson de Rezende Rocha received the B.Sc. degree from the Federal University of Lavras, Lavras, Brazil, in 2003, and the M.S. and Ph.D. degrees from the University of Campinas (Unicamp), Campinas, Brazil, in 2006 and 2009, respectively, all in computer science. He is currently an Assistant Professor with the Institute of Computing at Unicamp. His main interests include digital forensics, reasoning for complex data, and machine intelligence. He has been a Program Committee Member in several important Computer Vision, Pattern Recognition, and Digital Forensics events. He is an Associate Editor of the Elsevier Journal of Visual Communication and Image Representation and a Leading Guest Editor of the EURASIP/Springer Journal on Image and Video Processing. He is an Affiliate Member of the Brazilian Academy of Sciences and the Brazilian Academy of Forensic Sciences. He is also a member of the IEEE Information Forensics and Security Technical Committee and a Microsoft Research Faculty Fellow.