Similarity Estimation Techniques from Rounding Algorithms


Moses S. Charikar
Dept. of Computer Science, Princeton University
35 Olden Street, Princeton, NJ 08544
[email protected]

    ABSTRACT

A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y,

Pr_{h∈F}[h(x) = h(y)] = sim(x, y),

where sim(x, y) ∈ [0, 1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A, B) = |A ∩ B| / |A ∪ B|.

We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:

1. A collection of vectors with the distance between u and v measured by θ(u, v)/π, where θ(u, v) is the angle between u and v. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.

2. A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions P and Q,

EMD(P, Q) ≤ E_{h∈F}[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).


1. INTRODUCTION

The current information explosion has resulted in an increasing number of applications that need to deal with large volumes of data. While traditional algorithm analysis assumes that the data fits in main memory, it is unreasonable to make such assumptions when dealing with massive data sets such as data from phone calls collected by phone companies, multimedia data, web page repositories and so on. This new setting has resulted in an increased interest in algorithms that process the input data in restricted ways, including sampling a few data points, making only a few passes over the data, and constructing a succinct sketch of the input which can then be efficiently processed.

There has been a lot of recent work on streaming algorithms, i.e. algorithms that produce an output by making one pass (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, Alon et al [2] considered the problem of estimating frequency moments and Guha et al [25] considered the problem of clustering points in a streaming fashion. Many of these streaming algorithms need to represent important aspects of the data they have seen so far in a small amount of space; in other words they maintain a compact sketch of the data that encapsulates the relevant properties of the data set. Indeed, some of these techniques lead to sketching algorithms: algorithms that produce a compact sketch of a data set so that various measurements on the original data set can be estimated by efficient computations on the compact sketches. Building on the ideas of [2], Alon et al [1] give algorithms for estimating join sizes. Gibbons and Matias [18] give sketching algorithms producing so called synopsis data structures for various problems including maintaining approximate histograms, hot lists and so on. Gilbert et al [19] give algorithms to compute sketches for data streams so as to estimate any linear projection of the data and use this to get individual point and range estimates. Recently, Gilbert et al [21] gave efficient algorithms for the dynamic maintenance of histograms. Their algorithm processes a stream of updates and maintains a small sketch of the data from which the optimal histogram representation can be approximated very quickly.

In this work, we focus on sketching algorithms for estimating similarity, i.e. the construction of functions that produce succinct sketches of objects in a collection, such that the similarity of objects can be estimated efficiently from their sketches. Here, similarity sim(x, y) is a function that maps pairs of objects x, y to a number in [0, 1], measuring the degree of similarity between x and y. sim(x, y) = 1 corresponds to objects x, y that are identical while sim(x, y) = 0 corresponds to objects that are very different.

Broder et al [8, 5, 7, 6] introduced the notion of min-wise independent permutations, a technique for constructing such sketching functions for a collection of sets. The similarity measure considered there was

sim(A, B) = |A ∩ B| / |A ∪ B|.

We note that this is exactly the Jaccard coefficient of similarity used in information retrieval.

The min-wise independent permutation scheme allows the construction of a distribution on hash functions h : 2^U → U such that

Pr_{h∈F}[h(A) = h(B)] = sim(A, B).

Here F denotes the family of hash functions (with an associated probability distribution) operating on subsets of the universe U. By choosing say t hash functions h1, ..., ht from this family, a set S could be represented by the hash vector (h1(S), ..., ht(S)). Now, the similarity between two sets can be estimated by counting the number of matching coordinates in their corresponding hash vectors.¹
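As an illustration, here is a minimal Python sketch (ours, not from the paper) of this estimation procedure. The names `minhash_sketch` and `universe_size` are our own; for simplicity each permutation is an explicit seeded shuffle, whereas a practical implementation would use the compact approximately min-wise independent families of Indyk [28]. Both sets must be sketched with the same seed so that they share the same permutations.

```python
import random

def minhash_sketch(s, t, universe_size, seed=0):
    """Sketch a non-empty set s of integers in range(universe_size) by its
    minimum image under t shared random permutations of the universe."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(t):
        perm = list(range(universe_size))
        rng.shuffle(perm)                      # simulate a random permutation
        sketch.append(min(perm[x] for x in s))
    return sketch

def estimate_set_similarity(sketch_a, sketch_b):
    """Fraction of matching coordinates estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```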

The work of Broder et al was originally motivated by the application of eliminating near-duplicate documents in the Altavista index. Representing documents as sets of features with similarity between sets determined as above, the hashing technique provided a simple method for estimating similarity of documents, thus allowing the original documents to be discarded and reducing the input size significantly.

In fact, the minwise independent permutations hashing scheme is a particular instance of a locality sensitive hashing scheme introduced by Indyk and Motwani [31] in their work on nearest neighbor search in high dimensions.

Definition 1. A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y,

Pr_{h∈F}[h(x) = h(y)] = sim(x, y).    (1)

Here sim(x, y) is some similarity function defined on the collection of objects.

Given a hash function family F that satisfies (1), we will say that F is a locality sensitive hash function family corresponding to similarity function sim(x, y). Indyk and Motwani showed that such a hashing scheme facilitates the construction of efficient data structures for answering approximate nearest-neighbor queries on the collection of objects.

In particular, using the hashing scheme given by minwise independent permutations results in efficient data structures for set similarity queries and leads to efficient clustering algorithms. This was exploited later in several experimental papers: Cohen et al [14] for association-rule mining, Haveliwala et al [27] for clustering web documents, Chen et al [13] for selectivity estimation of boolean queries, Chen et al [12] for twig queries, and Gionis et al [22] for indexing set valued attributes. All of this work used the hashing technique for set similarity together with ideas from [31].

¹ One question left open in [7] was the issue of compact representation of hash functions in this family; this was settled by Indyk [28], who gave a construction of a small family of minwise independent permutations.

We note that the definition of locality sensitive hashing used by [31] is slightly different, although in the same spirit as our definition. Their definition involves parameters r1 > r2 and p1 > p2. A family F is said to be (r1, r2, p1, p2)-sensitive for a similarity measure sim(x, y) if Pr_{h∈F}[h(x) = h(y)] ≥ p1 when sim(x, y) ≥ r1, and Pr_{h∈F}[h(x) = h(y)] ≤ p2 when sim(x, y) ≤ r2. Despite the difference in the precise definition, we chose to retain the name locality sensitive hashing in this work since the two notions are essentially the same. Hash functions with closely related properties were investigated earlier by Linial and Sasson [34] and Indyk et al [32].

1.1 Our Results

In this paper, we explore constructions of locality sensitive hash functions for various other interesting similarity functions. The utility of such hash function schemes (for nearest neighbor queries and clustering) crucially depends on the fact that the similarity estimation is based on a test of equality of the hash function values. We make an interesting connection between constructions of similarity preserving hash functions and rounding procedures used in the design of approximation algorithms. We show that procedures used for rounding fractional solutions from linear programs and vector solutions to semidefinite programs can be used to derive similarity preserving hash functions for interesting classes of similarity functions.

In Section 2, we prove some necessary conditions on similarity measures sim(x, y) for the existence of locality sensitive hash functions satisfying (1). Using this, we show that such locality sensitive hash functions do not exist for certain commonly used similarity measures in information retrieval, the Dice coefficient and the Overlap coefficient.

In seminal work, Goemans and Williamson [24] introduced semidefinite programming relaxations as a tool for approximation algorithms. They used the random hyperplane rounding technique to round vector solutions for the MAX-CUT problem. We will see in Section 3 that the random hyperplane technique naturally gives a family of hash functions F for vectors such that

Pr_{h∈F}[h(u) = h(v)] = 1 - θ(u, v)/π.

Here θ(u, v) refers to the angle between vectors u and v. Note that the function 1 - θ/π is closely related to the function cos(θ). (In fact it is always within a factor 0.878 from it. Moreover, cos(θ) can be estimated from an estimate of θ.) Thus this similarity function is very closely related to the cosine similarity measure, commonly used in information retrieval. (In fact, Indyk and Motwani [31] describe how the set similarity measure can be adapted to measure dot product between binary vectors in d-dimensional Hamming space. Their approach breaks up the data set into O(log d) groups, each consisting of approximately the same weight. Our approach, based on estimating the angle between vectors, is more direct and is also more general since it applies to general vectors.) We also note that the cosine between vectors can be estimated from known techniques based on random projections [2, 1, 20]. However, the advantage of a locality sensitive hashing based scheme is that this directly yields techniques for nearest neighbor search for the cosine similarity measure.


An attractive feature of the hash functions obtained from the random hyperplane method is that the output is a single bit; thus the output of t hash functions can be concatenated very easily to produce a t-bit vector.² Estimating similarity between vectors amounts to measuring the Hamming distance between the corresponding t-bit hash vectors. We can represent sets by their characteristic vectors and use this locality sensitive hashing scheme for measuring similarity between sets. This yields a slightly different similarity measure for sets, one that is linearly proportional to the angle between their characteristic vectors.
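To make the estimation step concrete, the following small sketch (ours) assumes the two t-bit vectors were produced by the random hyperplane hashes of Section 3; the fraction of disagreeing bits estimates θ/π, from which an estimate of cos(θ) follows.

```python
import numpy as np

def estimate_angle_and_cosine(bits_u, bits_v):
    """Given t-bit hyperplane sketches of u and v, estimate theta(u, v) and cos(theta)."""
    bits_u, bits_v = np.asarray(bits_u), np.asarray(bits_v)
    t = len(bits_u)
    hamming = int(np.count_nonzero(bits_u != bits_v))  # disagreeing bits
    theta = np.pi * hamming / t                        # hamming/t estimates theta/pi
    return theta, np.cos(theta)
```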

In Section 4, we present a locality sensitive hashing scheme for a certain metric on distributions on points, called the Earth Mover Distance. We are given a set of points L = {l1, ..., ln}, with a distance function d(i, j) defined on them. A probability distribution P(X) (or distribution for short) is a set of weights p1, ..., pn on the points such that pi ≥ 0 and Σ_i pi = 1. (We will often refer to distribution P(X) as simply P, implicitly referring to an underlying set X of points.) The Earth Mover Distance EMD(P, Q) between two distributions P and Q is defined to be the cost of the min cost matching that transforms one distribution to the other. (Imagine each distribution as placing a certain amount of earth on each point. EMD(P, Q) measures the minimum amount of work that must be done in transforming one distribution to the other.) This is a popular metric for images and is used for image similarity, navigating image databases and so on [37, 38, 39, 40, 36, 15, 16, 41, 42]. The idea is to represent an image as a distribution on features with an underlying distance metric on features (e.g. colors in a color spectrum). Since the earth mover distance is expensive to compute (requiring a solution to a minimum transportation problem), applications typically use an approximation of the earth mover distance (e.g. representing distributions by their centroids).

We construct a hash function family for estimating the earth mover distance. Our family is based on rounding algorithms for LP relaxations for the problem of classification with pairwise relationships studied by Kleinberg and Tardos [33], and further studied by Calinescu et al [10] and Chekuri et al [11]. Combining a new LP formulation described by Chekuri et al together with a rounding technique of Kleinberg and Tardos, we show a construction of a hash function family which approximates the earth mover distance to a factor of O(log n log log n). Each hash function in this family maps a distribution on points L = {l1, ..., ln} to some point li in the set. For two distributions P(X) and Q(X) on the set of points, our family of hash functions F satisfies the property that:

EMD(P, Q) ≤ E_{h∈F}[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).

We also show an interesting fact about a rounding algorithm in Kleinberg and Tardos [33] applying to the case where the underlying metric on points is a uniform metric. In this case, we show that their rounding algorithm can be viewed as a generalization of min-wise independent permutations extended to a continuous setting. Their rounding procedure yields a locality sensitive hash function for vectors whose coordinates are all non-negative. Given two vectors a = (a1, ..., an) and b = (b1, ..., bn), the similarity function is

sim(a, b) = Σ_i min(a_i, b_i) / Σ_i max(a_i, b_i).

(Note that when a and b are the characteristic vectors for sets A and B, this expression reduces to the set similarity measure for min-wise independent permutations.)

² In Section 2, we will show that we can convert any locality sensitive hashing scheme to one that maps objects to {0, 1} with a slight change in similarity measure. However, the modified hash functions convey less information, e.g. the collision probability for the modified hash function family is at least 1/2 even for a pair of objects with original similarity 0.

Applications of locality sensitive hash functions to solving nearest neighbor queries typically reduce the problem to the Hamming space. Indyk and Motwani [31] give a data structure that solves the approximate nearest neighbor problem on the Hamming space. Their construction is a reduction to the so called PLEB (Point Location in Equal Balls) problem, followed by a hashing technique concatenating the values of several locality sensitive hash functions. We give a simple technique that achieves the same performance as the Indyk Motwani result in Section 5. The basic idea is as follows: Given bit vectors consisting of d bits each, we choose a number of random permutations of the bits. For each random permutation σ, we maintain a sorted order of the bit vectors, in lexicographic order of the bits permuted by σ. To find a nearest neighbor for a query bit vector q we do the following: For each permutation σ, we perform a binary search on the sorted order corresponding to σ to locate the bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). Further, we search in each of the sorted orders proceeding upwards and downwards from the location of q, according to a certain rule. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to the query vector. The performance bounds we can prove for this simple scheme are identical to those proved by Indyk and Motwani for their scheme.

2. EXISTENCE OF LOCALITY SENSITIVE HASH FUNCTIONS

In this section, we discuss certain necessary properties for the existence of locality sensitive hash function families for given similarity measures.

Lemma 1. For any similarity function sim(x, y) that admits a locality sensitive hash function family as defined in (1), the distance function 1 - sim(x, y) satisfies the triangle inequality.

Proof. Suppose there exists a locality sensitive hash function family such that

Pr_{h∈F}[h(x) = h(y)] = sim(x, y).

Then,

1 - sim(x, y) = Pr_{h∈F}[h(x) ≠ h(y)].

Let Δ_h(x, y) be an indicator variable for the event h(x) ≠ h(y). We claim that Δ_h(x, y) satisfies the triangle inequality, i.e.

Δ_h(x, y) + Δ_h(y, z) ≥ Δ_h(x, z).


Since Δ_h(·) takes values in the set {0, 1}, the only case when the above inequality could be violated would be when Δ_h(x, y) = Δ_h(y, z) = 0. But in this case h(x) = h(y) and h(y) = h(z). Thus, h(x) = h(z), implying that Δ_h(x, z) = 0 and the inequality is satisfied. This proves the claim. Now,

1 - sim(x, y) = E_{h∈F}[Δ_h(x, y)].

Since Δ_h(x, y) satisfies the triangle inequality, E_{h∈F}[Δ_h(x, y)] must also satisfy the triangle inequality. This proves the lemma.

This gives a very simple proof of the fact that for the set similarity measure sim(A, B) = |A ∩ B| / |A ∪ B|, 1 - sim(A, B) satisfies the triangle inequality. This follows from Lemma 1 and the fact that this set similarity measure admits a locality sensitive hash function family, namely that given by minwise independent permutations.

One could ask the question whether locality sensitive hash functions satisfying the definition (1) exist for other commonly used set similarity measures in information retrieval. For example, Dice's coefficient is defined as

sim_Dice(A, B) = |A ∩ B| / ( (|A| + |B|) / 2 ).

The Overlap coefficient is defined as

sim_Ovl(A, B) = |A ∩ B| / min(|A|, |B|).

We can use Lemma 1 to show that there is no such locality sensitive hash function family for Dice's coefficient and the Overlap measure by showing that the corresponding distance function does not satisfy the triangle inequality.

Consider the sets A = {a}, B = {b}, C = {a, b}. Then,

sim_Dice(A, C) = 2/3,  sim_Dice(C, B) = 2/3,  sim_Dice(A, B) = 0,

(1 - sim_Dice(A, C)) + (1 - sim_Dice(C, B)) < 1 - sim_Dice(A, B).

Similarly, the values for the Overlap measure are as follows:

sim_Ovl(A, C) = 1,  sim_Ovl(C, B) = 1,  sim_Ovl(A, B) = 0,

(1 - sim_Ovl(A, C)) + (1 - sim_Ovl(C, B)) < 1 - sim_Ovl(A, B).

This shows that there is no locality sensitive hash function family corresponding to Dice's coefficient and the Overlap measure.
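The counterexample is easy to check mechanically; a minimal Python verification (ours, not part of the paper) is:

```python
def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

A, B, C = {"a"}, {"b"}, {"a", "b"}

# Going through C is "shorter" than going directly from A to B,
# so 1 - sim violates the triangle inequality for both measures.
assert (1 - dice(A, C)) + (1 - dice(C, B)) < 1 - dice(A, B)           # 1/3 + 1/3 < 1
assert (1 - overlap(A, C)) + (1 - overlap(C, B)) < 1 - overlap(A, B)  # 0 + 0 < 1
```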

It is often convenient to have a hash function family that maps objects to {0, 1}. In that case, the output of t different hash functions can simply be concatenated to obtain a t-bit hash value for an object. In fact, we can always obtain such a binary hash function family with a slight change in the similarity measure. A similar result was used and proved by Gionis et al [22]. We include a proof for completeness.

Lemma 2. Given a locality sensitive hash function family F corresponding to a similarity function sim(x, y), we can obtain a locality sensitive hash function family F' that maps objects to {0, 1} and corresponds to the similarity function (1 + sim(x, y))/2.

Proof. Suppose we have a hash function family such that

Pr_{h∈F}[h(x) = h(y)] = sim(x, y).

Let B be a pairwise independent family of hash functions that operate on the domain of the functions in F and map elements in the domain to {0, 1}. Then Pr_{b∈B}[b(u) = b(v)] = 1/2 if u ≠ v and Pr_{b∈B}[b(u) = b(v)] = 1 if u = v. Consider the hash function family obtained by composing a hash function from F with one from B. This maps objects to {0, 1} and we claim that it has the required properties.

Pr_{h∈F, b∈B}[b(h(x)) = b(h(y))] = (1 + sim(x, y))/2.

With probability sim(x, y), h(x) = h(y) and hence b(h(x)) = b(h(y)). With probability 1 - sim(x, y), h(x) ≠ h(y) and in this case, Pr_{b∈B}[b(h(x)) = b(h(y))] = 1/2. Thus,

Pr[b(h(x)) = b(h(y))] = sim(x, y) + (1 - sim(x, y))/2 = (1 + sim(x, y))/2.

This can be used to show a stronger condition for the existence of a locality sensitive hash function family.

Lemma 3. For any similarity function sim(x, y) that admits a locality sensitive hash function family as defined in (1), the distance function 1 - sim(x, y) is isometrically embeddable in the Hamming cube.

Proof. Firstly, we apply Lemma 2 to construct a binary locality sensitive hash function family corresponding to the similarity function sim'(x, y) = (1 + sim(x, y))/2. Note that such a binary hash function family gives an embedding of objects into the Hamming cube (obtained by concatenating the values of all the hash functions in the family). For object x, let v(x) be the element in the Hamming cube x is mapped to. 1 - sim'(x, y) is simply the fraction of bits that do not agree in v(x) and v(y), which is proportional to the Hamming distance between v(x) and v(y). Thus this embedding is an isometric embedding of the distance function 1 - sim'(x, y) in the Hamming cube. But

1 - sim'(x, y) = 1 - (1 + sim(x, y))/2 = (1 - sim(x, y))/2.

This implies that 1 - sim(x, y) can be isometrically embedded in the Hamming cube.

We note that Lemma 3 has a weak converse, i.e. for a similarity measure sim(x, y), any isometric embedding of the distance function 1 - sim(x, y) in the Hamming cube yields a locality sensitive hash function family corresponding to the similarity measure (α + sim(x, y))/(α + 1) for some α > 0.

3. RANDOM HYPERPLANE BASED HASH FUNCTIONS FOR VECTORS

Given a collection of vectors in R^d, we consider the family of hash functions defined as follows: We choose a random vector r from the d-dimensional Gaussian distribution (i.e. each coordinate is drawn from the 1-dimensional Gaussian distribution). Corresponding to this vector r, we define a hash function h_r as follows:

h_r(u) = 1 if r · u ≥ 0,  and  h_r(u) = 0 if r · u < 0.

Then for vectors u and v,

Pr[h_r(u) = h_r(v)] = 1 - θ(u, v)/π.

This was used by Goemans and Williamson [24] in their rounding scheme for the semidefinite programming relaxation of MAX-CUT.

Picking a random hyperplane amounts to choosing a normally distributed random variable for each dimension. Thus even representing a hash function in this family could require a large number of random bits. However, for n vectors, the hash functions can be chosen by picking O(log² n) random bits, i.e. we can restrict the random hyperplanes to be in a family of size 2^{O(log² n)}. This follows from the techniques in Indyk [30] and Engebretsen et al [17], which in turn use Nisan's pseudorandom number generator for space bounded computations [35]. We omit the details since they are similar to those in [30, 17].

Using this random hyperplane based hash function, we obtain a hash function family for set similarity, for a slightly different measure of similarity of sets. Suppose sets are represented by their characteristic vectors. Then, applying the above scheme gives a locality sensitive hashing scheme where

Pr[h(A) = h(B)] = 1 - θ/π,  where  θ = cos^{-1}( |A ∩ B| / √(|A| · |B|) ).

Also, this hash function family facilitates easy incorporation of element weights in the similarity calculation, since the values of the coordinates of the characteristic vectors could be real valued element weights. Later, in Section 4.1, we will present another technique to define and estimate similarity of weighted sets.

4. THE EARTH MOVER DISTANCE

Consider a set of points L = {l1, ..., ln} with a distance function d(i, j) (assumed to be a metric). A distribution P(L) on L is a collection of non-negative weights (p1, ..., pn) for points in X such that Σ_i pi = 1. The distance between two distributions P(L) and Q(L) is defined to be the optimal cost of the following minimum transportation problem:

min Σ_{i,j} f_{i,j} · d(i, j)    (2)

subject to

Σ_j f_{i,j} = p_i  for all i    (3)

Σ_i f_{i,j} = q_j  for all j    (4)

f_{i,j} ≥ 0  for all i, j    (5)

Note that we define a somewhat restricted form of the Earth Mover Distance. The general definition does not assume that the sum of the weights is identical for distributions P(L) and Q(L). This is useful, for example, in matching a small image to a portion of a larger image.
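For small n, the LP (2)-(5) can be solved directly; the following sketch is meant only to make the formulation concrete. It assumes SciPy's `linprog` is available, and the function name `emd` is ours.

```python
import numpy as np
from scipy.optimize import linprog  # assumption: SciPy is installed

def emd(p, q, dist):
    """Solve the transportation LP (2)-(5) exactly.

    p, q: length-n nonnegative weight vectors, each summing to 1.
    dist: n x n matrix of ground distances d(i, j).
    """
    n = len(p)
    c = np.asarray(dist, dtype=float).reshape(n * n)  # objective: sum_ij f_ij d(i, j)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0              # (3): sum_j f_ij = p_i
    for j in range(n):
        A_eq[n + j, j::n] = 1.0                       # (4): sum_i f_ij = q_j
    b_eq = np.concatenate([np.asarray(p, float), np.asarray(q, float)])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```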

We will construct a hash function family for estimating the Earth Mover Distance based on rounding algorithms for the problem of classification with pairwise relationships, introduced by Kleinberg and Tardos [33]. (A closely related problem was also studied by Broder et al [9].) In designing hash functions to estimate the Earth Mover Distance, we will relax the definition of locality sensitive hashing (1) in three ways.

1. Firstly, the quantity we are trying to estimate is a distance measure, not a similarity measure in [0, 1].

2. Secondly, we will allow the hash functions to map objects to points in a metric space and measure E[d(h(x), h(y))]. (A locality sensitive hash function for a similarity measure sim(x, y) can be viewed as a scheme to estimate the distance 1 - sim(x, y) by Pr_{h∈F}[h(x) ≠ h(y)]. This is equivalent to having a uniform metric on the hash values.)

3. Thirdly, our estimator for the Earth Mover Distance will not be an unbiased estimator, i.e. our estimate will approximate the Earth Mover Distance to within a small factor.

We now describe the problem of classification with pairwise relationships. Given a collection of objects V and labels L = {l1, ..., ln}, the goal is to assign labels to objects. The cost of assigning label l to object u ∈ V is c(u, l). Certain pairs of objects (u, v) are related; such pairs form the edges of a graph over V. Each edge e = (u, v) is associated with a non-negative weight w_e. For edge e = (u, v), if u is assigned label h(u) and v is assigned label h(v), then the cost paid is w_e · d(h(u), h(v)).

The problem is to come up with an assignment of labels h : V → L, so as to minimize the cost of the labeling h given by

Σ_{v∈V} c(v, h(v)) + Σ_{e=(u,v)∈E} w_e · d(h(u), h(v)).

The approximation algorithms for this problem use an LP to assign, for every u ∈ V, a probability distribution over labels in L (i.e. a set of non-negative weights that sum up to 1). Given a distribution P over labels in L, the rounding algorithm of Kleinberg and Tardos gave a randomized procedure for assigning label h(P) to P with the following properties:

1. Given distribution P(L) = (p1, ..., pn),

Pr[h(P) = l_i] = p_i.    (6)

2. Suppose P and Q are probability distributions over L. Then

E[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).    (7)

We note that the second property (7) is not immediately obvious from [33], since they do not describe LP relaxations for general metrics. Their LP relaxations are defined for Hierarchically well Separated Trees (HSTs). They convert a general metric to such an HST using Bartal's results [3, 4] on probabilistic approximation of metric spaces via tree metrics. However, it follows from combining ideas in [33] with those in Chekuri et al [11]. Chekuri et al do in fact give an LP relaxation for general metrics. The LP relaxation does indeed produce distributions over labels for every object u ∈ V. The fractional distance between two labelings is expressed as the min cost transshipment between P and Q, which is identical to the Earth Mover Distance EMD(P, Q). Now, this fractional solution can be used in the rounding algorithm developed by Kleinberg and Tardos to obtain the second property (7) claimed above. In fact, Chekuri et al use this fact to claim that the gap of their LP relaxation is at most O(log n log log n) (Theorem 5.1 in [11]).

We elaborate some more on why property (7) holds. Kleinberg and Tardos first (probabilistically) approximate the metric on L by an HST using [3, 4]. This is a tree with all vertices in the original metric at the leaves. The pairwise distance between any two vertices does not decrease, and all pairwise distances are increased by a factor of at most O(log n log log n) (in expectation). For this tree metric, they use an LP formulation which can be described as follows. Suppose we have a rooted tree. For subtree T, let ℓ_T denote the length of the edge that T hangs off of, i.e. the first edge on the path from T to the root. Further, for distribution P on the vertices of the original metric, let P(T) denote the total probability mass that P assigns to leaves in T; Q(T) is similarly defined. The distance between distributions P and Q is measured by Σ_T ℓ_T |P(T) - Q(T)|, where the summation is computed over all subtrees T. The Kleinberg Tardos rounding scheme ensures that E[d(h(P), h(Q))] is within a constant factor of Σ_T ℓ_T |P(T) - Q(T)|.

Suppose instead, we measured the distance between distributions by EMD(P, Q), defined on the original metric. By probabilistically approximating the original metric by a tree metric T, the expected value of the distance EMD_T(P, Q) (on the tree metric T) is at most a factor of O(log n log log n) times EMD(P, Q). This follows since all distances increase by O(log n log log n) in expectation. Now note that the tree distance measure used by Kleinberg and Tardos, Σ_T ℓ_T |P(T) - Q(T)|, is a lower bound on (and in fact exactly equal to) EMD_T(P, Q). To see that this is a lower bound, note that in the min cost transportation between P and Q on T, the flow on the edge leading upwards from subtree T must be at least |P(T) - Q(T)|. Since the rounding scheme ensures that E[d(h(P), h(Q))] is within a constant factor of Σ_T ℓ_T |P(T) - Q(T)|, we have that

E[d(h(P), h(Q))] ≤ O(1) · EMD_T(P, Q) ≤ O(log n log log n) · EMD(P, Q),

where the expectation is over the random choice of the HST and the random choices made by the rounding procedure.

Theorem 1. The Kleinberg Tardos rounding scheme yields a locality sensitive hashing scheme such that

EMD(P, Q) ≤ E[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).

Proof. The upper bound on E[d(h(P), h(Q))] follows directly from the second property (7) of the rounding scheme stated above.

We show that the lower bound follows from the first property (6). Let y_{i,j} be the joint probability that h(P) = l_i and h(Q) = l_j. Note that Σ_j y_{i,j} = p_i, since this is simply the probability that h(P) = l_i. Similarly Σ_i y_{i,j} = q_j, since this is simply the probability that h(Q) = l_j. Now, if h(P) = l_i and h(Q) = l_j, then d(h(P), h(Q)) = d(i, j). Hence E[d(h(P), h(Q))] = Σ_{i,j} y_{i,j} · d(i, j). Let us write down the expected cost and the constraints on y_{i,j}:

E[d(h(P), h(Q))] = Σ_{i,j} y_{i,j} · d(i, j)

Σ_j y_{i,j} = p_i  for all i

Σ_i y_{i,j} = q_j  for all j

y_{i,j} ≥ 0  for all i, j

Comparing this with the LP for EMD(P, Q), we see that setting f_{i,j} = y_{i,j} gives a feasible solution to the LP (2) to (5), and E[d(h(P), h(Q))] is exactly the value of this solution. Since EMD(P, Q) is the minimum value of a feasible solution, it follows that EMD(P, Q) ≤ E[d(h(P), h(Q))].

Calinescu et al [10] study a variant of the classification problem with pairwise relationships called the 0-extension problem. This is the version without assignment costs where some objects are assigned labels a priori and this labeling must be extended to the other objects (a generalization of multiway cut). For this problem, they design a rounding scheme to get an O(log n) approximation. Again, their technique does not explicitly use an LP that gives probability distributions on labels. However, in hindsight, their rounding scheme can be interpreted as a randomized procedure for assigning labels to distributions such that

E[d(h(P), h(Q))] ≤ O(log n) · EMD(P, Q).

Thus their rounding scheme gives a tighter guarantee than (7). However, they do not ensure (6). Thus the previous proof showing that EMD(P, Q) ≤ E[d(h(P), h(Q))] does not apply. In fact, one can construct examples such that EMD(P, Q) > 0, yet E[d(h(P), h(Q))] = 0. Hence, the resulting hash function family provides an upper bound on EMD(P, Q) within a factor O(log n) but does not provide a good lower bound.

We mention that the hashing scheme described provides an approximation to the Earth Mover Distance where the quality of the approximation is exactly the factor by which the underlying metric can be probabilistically approximated by HSTs. In particular, if the underlying metric itself is an HST, this yields an estimate within a constant factor. This could have applications in compactly representing distributions over hierarchical classes. For example, documents can be assigned a probability distribution over classes in the Open Directory Project (ODP) hierarchy. This hierarchy could be thought of as an HST and documents can be mapped to distributions over this HST. The distance between two distributions can be measured by the Earth Mover Distance. In this case, the hashing scheme described gives a way to estimate this distance measure to within a constant factor.

4.1 Weighted Sets

We show that the Kleinberg Tardos [33] rounding scheme for the case of the uniform metric actually is an extension of min-wise independent permutations to the weighted case.

First we recall the hashing scheme given by min-wise independent permutations. Given a universe U, consider a random permutation π of U. Assume that the elements of U are totally ordered. Given a subset A ⊆ U, the hash function h_π is defined as follows:

h_π(A) = min{π(A)}.

Then the property satisfied by this hash function family is that

Pr[h_π(A) = h_π(B)] = |A ∩ B| / |A ∪ B|.

We now review the Kleinberg Tardos rounding scheme for the uniform metric: Firstly, imagine that we pick an infinite sequence {(i_t, α_t)}_{t=1}^∞, where for each t, i_t is picked uniformly at random in {1, ..., n} and α_t is picked uniformly at random in [0, 1]. Given a distribution P = (p1, ..., pn), the assignment of labels is done in phases. In the t-th phase, we check whether α_t ≤ p_{i_t}. If this is the case and P has not been assigned a label yet, it is assigned label i_t.
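A minimal sketch of this rounding step (ours; label indices start at 0, and the shared random sequence is fixed by a seed so that two distributions are rounded against the same sequence {(i_t, α_t)}):

```python
import random

def kt_uniform_round(p, seed):
    """Kleinberg-Tardos rounding for the uniform metric (illustrative sketch).

    p: distribution (p_1, ..., p_n) over labels 0..n-1.
    Hashing two distributions P and Q with the same seed makes
    Pr[h(P) = h(Q)] at least sum_i min(p_i, q_i) / sum_i max(p_i, q_i),
    per the bounds derived after (8) below.
    """
    rng = random.Random(seed)
    n = len(p)
    while True:                   # phases t = 1, 2, ...
        i = rng.randrange(n)      # i_t uniform in {0, ..., n-1}
        alpha = rng.random()      # alpha_t uniform in [0, 1]
        if alpha <= p[i]:         # first phase that "hits" P assigns its label
            return i
```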

Now, we can think of these distributions as sets in R² (see Figure 1).


    Figure 1: Viewing a distribution as a continuous set.

The set S(P) corresponding to distribution P consists of the union of the rectangles [i - 1, i] × [0, p_i]. The elements of the universe are of the form [i - 1, i] × γ; the element [i - 1, i] × γ belongs to S(P) iff γ ≤ p_i. The notion of cardinality of union and intersection of sets is replaced by the area of the intersection and union of two such sets in R². Note that the Kleinberg Tardos rounding scheme can be interpreted as constructing a permutation of the universe and assigning to a distribution P the value i such that (i, γ) is the minimum in the permutation amongst all elements contained in S(P). Suppose instead, we assign to P the element (i, γ) which is the minimum in the permutation of S(P). Let h be a hash function derived from this scheme (a slight modification of the one in [33]). Then,

Pr[h(P) = h(Q)] = |S(P) ∩ S(Q)| / |S(P) ∪ S(Q)| = Σ_i min(p_i, q_i) / Σ_i max(p_i, q_i).    (8)

For the Kleinberg Tardos rounding scheme, the probability of collision is at least the probability of collision for the modified scheme (since two objects hashed to (i, γ1) and (i, γ2) respectively in the modified scheme would both be mapped to i in the original scheme). Hence

Pr_KT[h(P) = h(Q)] ≥ Σ_i min(p_i, q_i) / Σ_i max(p_i, q_i)

Pr_KT[h(P) ≠ h(Q)] ≤ 1 - Σ_i min(p_i, q_i) / Σ_i max(p_i, q_i) = Σ_i |p_i - q_i| / Σ_i max(p_i, q_i) ≤ Σ_i |p_i - q_i|.

The last inequality follows from the fact that Σ_i p_i = Σ_i q_i = 1 in the Kleinberg Tardos setting. This was exactly the property used in [33] to obtain a 2-approximation for the uniform metric case.

Note that the hashing scheme given by (8) is a generalization of min-wise independent permutations to the weighted setting where elements in sets are associated with weights in [0, 1]. Min-wise independent permutations are a special case of this scheme when the weights are in {0, 1}. This scheme could be useful in a setting where a weighted set similarity notion is desired. We note that the original min-wise independent permutations can be used in the setting of integer weights by simply duplicating elements according to their weight. The present scheme would work for any non-negative real weights.

5. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

Applications of locality sensitive hash functions to solving nearest neighbor queries typically reduce the problem to the Hamming space. Indyk and Motwani [31] give a data structure that solves the approximate nearest neighbor problem on the Hamming space H^d. Their construction is a reduction to the so called PLEB (Point Location in Equal Balls) problem, followed by a hashing technique concatenating the values of several locality sensitive hash functions.

Theorem 2 ([31]). For any ε > 0, there exists an algorithm for ε-PLEB in H^d using O(dn + n^{1+1/(1+ε)}) space and O(n^{1/(1+ε)}) hash function evaluations for each query.

We give a simple technique that achieves the same performance as the Indyk Motwani result:

Given bit vectors consisting of d bits each, we choose N = O(n^{1/(1+ε)}) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q. This can be done by maintaining two pointers for each sorted order O_σ (one moves up and the other down). At each step we move one of the pointers up or down corresponding to the element with the longest matching prefix. (Here the length of the longest matching prefix in O_σ is computed relative to q with its bits permuted by σ.) We examine 2N = O(n^{1/(1+ε)}) bit vectors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to q. The performance bounds we can prove for this simple scheme are identical to those proved by Indyk and Motwani for their scheme. An advantage of this scheme is that we do not need a reduction to many instances of PLEB for different values of the radius r, i.e. we solve the nearest neighbor problem simultaneously for all values of radius r using a single data structure.

We outline the main ideas of the analysis. In fact, the proof follows along similar lines to the proofs of Theorem 5 and Corollary 3 in [31]. Suppose the nearest neighbor of q is at a Hamming distance of r from q. Set p1 = 1 - r/d, p2 = 1 - r(1+ε)/d and k = log_{1/p2} n. Let ρ = ln(1/p1) / ln(1/p2). Then n^ρ = O(n^{1/(1+ε)}). We can show that with constant probability, from amongst N = O(n^{1/(1+ε)}) permutations, there exists a permutation σ such that the nearest neighbor agrees with q on the first k coordinates in σ. Further, over all N permutations, the number of bit vectors that are at Hamming distance of more than r(1+ε) from q and agree on the first k coordinates is at most 2N with constant probability. This implies that for this permutation σ, one of the 2N bit vectors near q in the ordering O_σ and examined by the algorithm will be a (1+ε)-approximate nearest neighbor. The probability calculations are similar to those in [31], and we only sketch the main ideas.

For any point q' at distance at least r(1+ε) from q, the probability that a random coordinate agrees with q is at most p2. Thus the probability that the first k coordinates agree is at most p2^k = 1/n. For the N permutations, the expected number of such points that agree in the first k coordinates is at most N. The probability that this number is at most 2N is > 1/2. Further, for a random permutation σ, the probability that the nearest neighbor agrees in k coordinates is p1^k = n^{-ρ}. Hence the probability that there exists one permutation amongst the N = n^ρ permutations where the nearest neighbor agrees in k coordinates is at least 1 - (1 - n^{-ρ})^{n^ρ} > 1/2. This establishes the correctness of the procedure.

As we stated earlier, a nice property of this data structure is that it automatically adjusts to the correct distance r to the nearest neighbor, i.e. we do not need to maintain separate data structures for different values of r.

6. CONCLUSIONS

We have demonstrated an interesting relationship between rounding algorithms used for rounding fractional solutions of LPs and vector solutions of SDPs on the one hand, and the construction of locality sensitive hash functions for interesting classes of objects on the other.

Rounding algorithms yield new constructions of locality sensitive hash functions that were not known previously. Conversely (at least in hindsight), locality sensitive hash functions lead to rounding algorithms (as in the case of min-wise independent permutations and the uniform metric case in Kleinberg and Tardos [33]).

An interesting direction to pursue would be to investigate the construction of sketching functions that allow one to estimate information theoretic measures of distance between distributions, such as the KL-divergence, commonly used in statistical learning theory. Since the KL-divergence is neither symmetric nor satisfies the triangle inequality, new ideas would be required in order to design a sketch function to approximate it. Such a sketch function, if one exists, would be a very valuable tool in compactly representing complex distributions.

7. REFERENCES

[1] N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. Proc. 18th PODS, pp. 10-20, 1999.

[2] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS 58(1): 137-147, 1999.

[3] Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic application. Proc. 37th FOCS, pp. 184-193, 1996.

[4] Y. Bartal. On approximating arbitrary metrics by tree metrics. Proc. 30th STOC, pp. 161-168, 1998.

[5] A. Z. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, pp. 21-29, IEEE Computer Society, 1997.

[6] A. Z. Broder. Filtering near-duplicate documents. Proc. FUN '98, 1998.

[7] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Proc. 30th STOC, pp. 327-336, 1998.

[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Proc. 6th Intl. World Wide Web Conference, pp. 391-404, 1997.

[9] A. Z. Broder, R. Krauthgamer, and M. Mitzenmacher. Improved classification via connectivity information. Proc. 11th SODA, pp. 576-585, 2000.

[10] G. Calinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. Proc. 11th SODA, pp. 8-16, 2000.

[11] C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for the metric labeling problem via a new linear programming formulation. Proc. 12th SODA, pp. 109-118, 2001.

[12] Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting Twig Matches in a Tree. Proc. 17th ICDE, pp. 595-604, 2001.

[13] Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity Estimation for Boolean Queries. Proc. 19th PODS, pp. 216-225, 2000.

[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Proc. 16th ICDE, pp. 489-499, 2000.

[15] S. Cohen and L. Guibas. The Earth Mover's Distance under Transformation Sets. Proc. 7th IEEE Intl. Conf. Computer Vision, 1999.

[16] S. Cohen and L. Guibas. The Earth Mover's Distance: Lower Bounds and Invariance under Translation. Tech. Report STAN-CS-TR-97-1597, Dept. of Computer Science, Stanford University, 1997.

[17] L. Engebretsen, P. Indyk, and R. O'Donnell. Derandomized dimensionality reduction with applications. To appear in Proc. 13th SODA, 2002.

[18] P. B. Gibbons and Y. Matias. Synopsis Data Structures for Massive Data Sets. Proc. 10th SODA, pp. 909-910, 1999.

[19] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. 27th VLDB, pp. 79-88, 2001.

[20] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.

[21] A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance. These proceedings.

[22] A. Gionis, D. Gunopulos, and N. Koudas. Efficient and Tunable Similar Set Retrieval. Proc. SIGMOD Conference, 2001.

[23] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB, pp. 518-529, 1999.

[24] M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6): 1115-1145, 1995.

[25] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. Proc. 41st FOCS, pp. 359-366, 2000.

[26] A. Gupta and É. Tardos. A constant factor approximation algorithm for a class of classification problems. Proc. 32nd STOC, pp. 652-658, 2000.

[27] T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129-134, 2000.

[28] P. Indyk. A small approximately min-wise independent family of hash functions. Proc. 10th SODA, pp. 454-456, 1999.

[29] P. Indyk. On approximate nearest neighbors in non-Euclidean spaces. Proc. 40th FOCS, pp. 148-155, 1999.

[30] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Proc. 41st FOCS, pp. 189-197, 2000.

[31] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th STOC, pp. 604-613, 1998.

[32] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-Preserving Hashing in Multidimensional Spaces. Proc. 29th STOC, pp. 618-625, 1997.

[33] J. M. Kleinberg and É. Tardos. Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields. Proc. 40th FOCS, pp. 14-23, 1999.

[34] N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121-132, 1998.

[35] N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449-461, 1992.

[36] Y. Rubner. Perceptual Metrics for Image Database Navigation. PhD Thesis, Stanford University, May 1999.

[37] Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661-668, 1997.

[38] Y. Rubner and C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, pp. 4601-4607, 1998.

[39] Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Applications to Image Databases. Proc. IEEE Int. Conf. on Computer Vision, pp. 59-66, 1998.

[40] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. Tech. Report STAN-CS-TN-98-86, Dept. of Computer Science, Stanford University, 1998.

[41] M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160-166, 1999.

[42] M. Ruzon and C. Tomasi. Corner detection in textured color images. Proc. IEEE Int. Conf. Computer Vision, 2:1039-1045, 1999.
