

On Embedding of Finite Metric Spaces into Hilbert Space

Ittai Abraham∗ Yair Bartal† Ofer Neiman‡

Hebrew University

Abstract

Metric embedding plays an important role in a vast range of application areas such as computer vision, computational biology, machine learning, networking, statistics, and mathematical psychology, to name a few. The main criterion for the quality of an embedding is its average distortion over all pairs.

A celebrated theorem of Bourgain states that every finite metric space on n points embeds in Euclidean space with O(log n) distortion.

Bourgain's result is best possible when considering the worst case distortion over all pairs of points in the metric space. Yet, is this the case for the average distortion?

Our main result is a strengthening of Bourgain's theorem providing an embedding with constant average distortion for arbitrary metric spaces. In fact, our embedding possesses a much stronger property. We define the ℓq-distortion of an embedding as the ℓq-norm of the distortion of a uniformly distributed pair of points. Our embedding achieves the best possible ℓq-distortion for all 1 ≤ q ≤ ∞ simultaneously.

1 Introduction

The theory of embeddings of finite metric spaces has attracted much attention in recent decades by several communities: mathematicians, researchers in Theoretical Computer Science, as well as researchers in the networking community and other applied fields of Computer Science.

The main objective of the field is to find embeddings of metric spaces into other, simpler and more structured spaces that have low distortion.

Given two metric spaces (X, d_X) and (Y, d_Y), an injective mapping f : X → Y is called an embedding of X into Y. An embedding is non-contractive if for any u ≠ v ∈ X: d_Y(f(u), f(v)) ≥ d_X(u, v). For a non-contractive embedding the distortion of f is defined as $\mathrm{dist}(f) = \sup_{u \neq v \in X} \mathrm{dist}_f(u, v)$, where $\mathrm{dist}_f(u, v) = \frac{d_Y(f(u), f(v))}{d_X(u, v)}$.

In Computer Science, embeddings of finite metric spaces have played an important role, in recent years, in the development of algorithms. More general practical use of embeddings can be found in a vast range of application areas including computer vision, computational biology, machine learning, networking, statistics, and mathematical psychology, to name a few.

From a mathematical perspective, embeddings of finite metric spaces into normed spaces are considered natural non-linear analogues to the local theory of Banach spaces. The most classic fundamental question is that of embedding metric spaces into Hilbert space. The main cornerstone of the field has been the following theorem by Bourgain [13]:

Theorem 1 (Bourgain). For every n-point metric space there exists an embedding into Euclidean space with distortion O(log n).

Bourgain also showed that this bound is nearly tight, and later Linial, London and Rabinovich [33] proved that embedding the metrics of constant-degree expander graphs into Euclidean space requires Ω(log n) distortion.

Yet, this lower bound on the distortion is a worst case bound, i.e., it means that there exists a pair of points whose distortion is large. However, the average case is often more significant in terms of evaluating the quality of the embedding, in particular in relation to practical applications.

Formally, the average distortion of an embedding f is defined as:

$$\mathrm{avgdist}(f) = \binom{n}{2}^{-1} \sum_{u \neq v \in X} \mathrm{dist}_f(u, v).$$

Indeed, in all real-world applications of metric embeddings, average distortion and similar notions are used for evaluating the embedding's performance in practice; for example see [23, 24, 4, 22, 41, 42].

∗ email: [email protected].
† email: [email protected]. Supported in part by a grant from the Israeli Science Foundation (195/02).
‡ email: [email protected]. Supported in part by a grant from the Israeli Science Foundation (195/02).


Moreover, in some cases it is desired that the average distortion be small while the worst case distortion remains reasonably bounded as well. While these papers provide some indication that such embeddings are possible in practice, the classic theory of metric embedding fails to address this natural question.

In particular, applying Bourgain's embedding to the metric of a constant-degree expander graph results in Ω(log n) distortion for a constant fraction of the pairs.¹

In this paper we prove the following theorem, which provides a qualitative strengthening of Bourgain's theorem:

Theorem 2 (Average Distortion). For every n-point metric space there exists an embedding into Euclidean space with distortion O(log n) and average distortion O(1).

In fact our results are even stronger. For 1 ≤ q ≤ ∞, define the ℓq-distortion of an embedding f as:

$$\mathrm{dist}_q(f) = \|\mathrm{dist}_f(u, v)\|_q^{(U)} = \mathbb{E}_U[\mathrm{dist}_f(u, v)^q]^{1/q},$$

where the expectation is taken according to the uniform distribution U over $\binom{X}{2}$. The classic notion of distortion is expressed by the ℓ∞-distortion and the average distortion is expressed by the ℓ1-distortion. Theorem 2 is a corollary of the following theorem:

Theorem 3 (ℓq-Distortion). For every n-point metric space (X, d) there exists an embedding f of X into Euclidean space such that for any 1 ≤ q ≤ ∞, $\mathrm{dist}_q(f) = O(\min\{q, \log n\})$.

Another variant of average distortion that is natural is what we call distortion of average:

$$\mathrm{distavg}(f) = \frac{\sum_{u \neq v \in X} d_Y(f(u), f(v))}{\sum_{u \neq v \in X} d(u, v)},$$

which can be naturally extended to its ℓq-normed extension, termed distortion of ℓq-norm. Theorems 2 and 3 extend to those notions as well.

Besides q = ∞ and q = 1, the case of q = 2 provides a particularly natural measure. It is closely related to the notion of stress, which is a standard measure in multidimensional scaling methods, invented by Kruskal [28] and later studied in many models and variants. Multidimensional scaling methods (see [29, 23]) are based on embedding of a metric representing the relations between entities into low dimensional space to allow feature extraction, and are often used for indexing, clustering, nearest neighbor searching and visualization in many application areas including psychology and computational biology [24].
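To make these notions concrete, the following small sketch (ours, not from the paper) evaluates the worst case distortion, average distortion, ℓq-distortion and distortion of average of a non-contractive embedding over the uniform distribution; `dist_X` and `dist_Y` are hypothetical callables returning the original and embedded distances:

```python
from itertools import combinations

def distortion_measures(points, dist_X, dist_Y, q=2.0):
    """Evaluate dist(f), avgdist(f), dist_q(f) and distavg(f) for a
    non-contractive embedding f over the uniform distribution on pairs."""
    pairs = list(combinations(points, 2))
    ratios = [dist_Y(u, v) / dist_X(u, v) for u, v in pairs]
    worst = max(ratios)                              # classic distortion (q = infinity)
    avg = sum(ratios) / len(ratios)                  # average distortion (q = 1)
    lq = (sum(r ** q for r in ratios) / len(ratios)) ** (1.0 / q)  # l_q-distortion
    distavg = (sum(dist_Y(u, v) for u, v in pairs)
               / sum(dist_X(u, v) for u, v in pairs))  # distortion of average
    return worst, avg, lq, distavg
```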

Previous work on average distortion. Related notions to the ones studied in this paper have been considered before in several theoretical papers. Most notably, Yuri Rabinovich [37] studied the notion of distortion of average,² motivated by its application to the Sparsest Cut problem. This however places the restriction that the embedding is Lipschitz or non-expansive. Other recent papers have addressed this version of distortion of average and its extension to weighted average. In particular, it has been recently shown (see for instance [18]) that the work of Arora, Rao and Vazirani on Sparsest Cut [3] can be rephrased as an embedding theorem using these notions.

In his paper, Rabinovich observes that for Lipschitz embeddings the lower bound of Ω(log n) still holds. It is therefore crucial in our theorems that the embeddings are co-Lipschitz³ (a notion defined by Gromov [21]) (and w.l.o.g. non-contractive).

To the best of our knowledge the only paper addressing such embeddings prior to this work is by Lee, Mendel and Naor [30], where they seek to bound the average distortion of embedding n-point L1 metrics into Euclidean space. However, even for this special case they do not give a constant bound on the average distortion.⁴

Network embedding. Our work is largely motivated by a surge of interest in the networking community in performing passive distance estimation (see e.g. [19, 35, 32, 15, 41, 14]): assigning nodes short labels in such a way that the network latency between nodes can be approximated efficiently by extracting information from the labels, without the need to incur active network overhead. The motivation for such labelling schemes are the many emerging large-scale decentralized applications that require locality awareness, the ability to know the relative distance between nodes. For example, in peer-to-peer networks, finding the nearest copy of a file may significantly reduce network load, and finding the nearest server in a distributed replicated application may improve response time. One promising approach for distance labelling is network embedding (see [15]). In this approach nodes are assigned coordinates in a low dimensional Euclidean space. The node coordinates form simple and efficient distance labels. Instead of repeatedly measuring the distance between nodes, these labels allow one to extract an approximate measure of the latency between nodes. Hence these network coordinates can be used as an efficient building block for locality aware networks that significantly reduce network load.

¹ Similar statements hold for the more recent metric embeddings of [39, 27] as well.
² Usually this notion was called average distortion, but the name is somewhat confusing.
³ This notion is used here somewhat differently than its original purpose.
⁴ The bound given in [30] is O(√log n), which applies to a somewhat weaker notion.


As mentioned above, the natural measure of efficiency in networking research is how the embedding performs on average, where the notion of average distortion comes in several variations that can be phrased in terms of the definitions given above. The phenomenon observed in measurements of network distances is that the average distortion of network embeddings is bounded by a small constant. Our work gives the first full theoretical explanation for this intriguing phenomenon.

Embedding with relaxed guarantees. The theoretical study of such phenomena was initiated by the work of Kleinberg, Slivkins and Wexler [26]. They mainly focus on the fact reported in the networking papers that the distortion of almost all pairwise distances is bounded by some small constant. In an attempt to provide theoretical justification for such phenomena, [26] define the notion of a (1 − ε)-partial embedding⁵ where the distortion is bounded for at least some (1 − ε) fraction of the pairwise distances. They obtained some initial results for metrics which have constant doubling dimension [26]. In Abraham et al. [1] it was shown that any finite metric space has a (1 − ε)-partial embedding into Euclidean space with O(log 1/ε) distortion.

While this result is very appealing, it has the disadvantage of lacking any promise for some fraction of the pairwise distances. This may be critical for applications; that is, we really desire an embedding which in a sense does "as well as possible" for all distances. To address the question of whether such an embedding exists, [26] define a stronger notion of scaling distortion.⁶ An embedding has scaling distortion α(ε) if it provides this bound on the distortion of a (1 − ε) fraction of the pairwise distances, for any ε. In [26], such embeddings with α(ε) = O(log 1/ε) were shown for metrics of bounded growth dimension; this was extended in [1] to metrics of bounded doubling dimension. In addition, [1] give a rather simple probabilistic embedding with scaling distortion, implying an embedding into (high-dimensional) L1.

The most important question arising from the work of [26, 1] is whether embeddings with small scaling distortion into Euclidean space exist. We give the following theorem,⁷ which lies at the heart of the proof of Theorem 3:

Theorem 4. For every finite metric space (X, d), there exists an embedding of X into Euclidean space with scaling distortion O(log 1/ε).

Novel Techniques. While [1] certainly uses the state-of-the-art methods in finite metric embedding, it appears all these techniques break when attempting to prove Theorem 4. Indeed, to prove the theorem and its generalizations we present novel embedding techniques.

Our embeddings are based on probabilistic partitions of metric spaces [6], originally defined in the context of probabilistic embedding of metric spaces and later used in the context of metric embedding in [7, 17, 8, 39, 27].

We make use of novel probabilistic partitions [9] with refined properties which allow stronger and more general results on embedding of finite metric spaces that cannot be achieved using any of the previous methods. These constructions, as well as the embedding techniques based on them, were developed in conjunction with this paper. Moreover, here we make a far more sophisticated use of these partitions.

Our embeddings into Lp are based on uniformly padded hierarchical probabilistic partitions. The properties of these partitions allow us to define sophisticated embeddings in a natural way. We believe that the application of these techniques as presented here demonstrates their vast versatility, and we expect that more applications will be found in the near future.

We stress that despite the similarity in approach and techniques to [9], the proof of the main result in this paper is considerably more involved. This is not surprising given that here we desire to obtain distortions which depend solely on ε rather than on n. This requires clever ways of defining the embeddings so that the contribution would be limited as a function of ε. In particular, in addition to the decomposition based embedding, a second component of our embedding uses a similar approach to that of Bourgain's original embedding. However, using it in a straightforward manner is impossible. It is here that we crucially rely on the hierarchical structure of our decompositions in order to do this in a way that will allow us to bound the contribution appropriately.

Additional Results and Applications. In addition to our main result, our paper contains several other important contributions: we extend the results on average distortion to weighted averages. We show the bound is O(log Φ̂), where Φ̂ is the effective aspect ratio of the weight distribution. We also obtain average distortion results for embeddings into ultrametrics. In addition we present a solution for another open problem from [26, 1] regarding partial embedding into trees.

Finally, we demonstrate some basic algorithmic applications of our theorems, mostly due to their extensions to general weighted averages. Among others is an application to uncapacitated quadratic assignment [36, 25]. We also extend our concepts to analyze the distance oracles of Thorup and Zwick [44], providing results with strong relation to the questions addressed by [26].

⁵ Called "embeddings with ε-slack" in [26].
⁶ Called "gracefully degrading distortion" in [26].
⁷ In fact, in this theorem the definition of scaling distortion is even stronger. This is explained in detail in the appropriate section.


We feel, however, that our current applications do not make full use of the strength of our theorems and techniques, and it remains to be seen if such applications will arise.

The rest of the introduction provides a detailed description of the new concepts and results. We then provide the proof of Theorem 6, which contains the main technical contribution of this paper. The rest of the results are given in detail in the appendix.

1.1 ℓq-Distortion and the Main Theorem

Given two metric spaces (X, d_X) and (Y, d_Y), an injective mapping f : X → Y is called an embedding of X into Y. In what follows we define novel notions of distortion. In order to do that we start with the definition of the classic notion.

An embedding f is called c-co-Lipschitz [21] if for any u ≠ v ∈ X: d_Y(f(u), f(v)) ≥ c · d_X(u, v), and non-contractive if c = 1. In the context of this paper we restrict attention to co-Lipschitz embeddings, which due to scaling may be further restricted to non-contractive embeddings. This makes no difference for the classic notion of distortion, but has a crucial role for the results presented in this paper. We will elaborate more on this issue in the sequel.

For a non-contractive embedding define the distortion function of f, $\mathrm{dist}_f : \binom{X}{2} \to \mathbb{R}^+$, where for u ≠ v ∈ X: $\mathrm{dist}_f(u, v) = \frac{d_Y(f(u), f(v))}{d_X(u, v)}$. The distortion of f is defined as $\mathrm{dist}(f) = \sup_{u \neq v \in X} \mathrm{dist}_f(u, v)$.

Definition 1 (ℓq-Distortion). Given a distribution Π over $\binom{X}{2}$, define for 1 ≤ q ≤ ∞ the ℓq-distortion of f with respect to Π:

$$\mathrm{dist}_q^{(\Pi)}(f) = \|\mathrm{dist}_f(u, v)\|_q^{(\Pi)} = \mathbb{E}_\Pi[\mathrm{dist}_f(u, v)^q]^{1/q},$$

where $\|\cdot\|_q^{(\Pi)}$ denotes the normalized ℓq norm over the distribution Π, defined as in the equation above. Let U denote the uniform distribution over $\binom{X}{2}$. The ℓq-distortion of f is defined as $\mathrm{dist}_q(f) = \mathrm{dist}_q^{(U)}(f)$.

In particular the classic distortion may be viewed as the ℓ∞-distortion: $\mathrm{dist}(f) = \mathrm{dist}_\infty(f)$. An important special case of ℓq-distortion is when q = 1:

Definition 2 (Average Distortion). Given a distribution Π over $\binom{X}{2}$, the average distortion of f with respect to Π is defined as $\mathrm{avgdist}^{(\Pi)}(f) = \mathrm{dist}_1^{(\Pi)}(f)$, and the average distortion of f is given by $\mathrm{avgdist}(f) = \mathrm{dist}_1(f)$.

Another natural notion is the following:

Definition 3 (Distortion of ℓq-Norm). Given a distribution Π over $\binom{X}{2}$, define the distortion of ℓq-norm of f with respect to Π:

$$\mathrm{distnorm}_q^{(\Pi)}(f) = \frac{\mathbb{E}_\Pi[d_Y(f(u), f(v))^q]^{1/q}}{\mathbb{E}_\Pi[d_X(u, v)^q]^{1/q}},$$

and let $\mathrm{distnorm}_q(f) = \mathrm{distnorm}_q^{(U)}(f)$.

Again, an important special case of distortion of ℓq-norm is when q = 1:

Definition 4 (Distortion of Average). Given a distribution Π over $\binom{X}{2}$, define the distortion of average of f with respect to Π as $\mathrm{distavg}^{(\Pi)}(f) = \mathrm{distnorm}_1^{(\Pi)}(f)$, and the distortion of average of f is given by $\mathrm{distavg}(f) = \mathrm{distnorm}_1(f)$.

For simplicity of the presentation of our main results we use the following notation: $\mathrm{dist}_q^{*(\Pi)}(f) = \max\{\mathrm{dist}_q^{(\Pi)}(f), \mathrm{distnorm}_q^{(\Pi)}(f)\}$, $\mathrm{dist}_q^*(f) = \max\{\mathrm{dist}_q(f), \mathrm{distnorm}_q(f)\}$, and $\mathrm{avgdist}^*(f) = \max\{\mathrm{avgdist}(f), \mathrm{distavg}(f)\}$.

Definition 5. A probability distribution Π over $\binom{X}{2}$, with probability function $\pi : \binom{X}{2} \to [0, 1]$, is called non-degenerate if for every u ≠ v ∈ X: π(u, v) > 0. The aspect ratio of a non-degenerate probability distribution Π is defined as:

$$\Phi(\Pi) = \frac{\max_{u \neq v \in X} \pi(u, v)}{\min_{u \neq v \in X} \pi(u, v)}.$$

In particular Φ(U) = 1. If Π is not non-degenerate then Φ(Π) = ∞.


For an arbitrary probability distribution Π over $\binom{X}{2}$, define its effective aspect ratio as:⁸ $\hat\Phi(\Pi) = 2\min\{\Phi(\Pi), \binom{n}{2}\}$.

Theorem 5 (Embedding into Lp). Let (X, d) be an n-point metric space, and let 1 ≤ p ≤ ∞. There exists an embedding f of X into Lp in dimension $e^{O(p)} \log n$, such that for every 1 ≤ q ≤ ∞ and any distribution Π over $\binom{X}{2}$: $\mathrm{dist}_q^{*(\Pi)}(f) = O(\min\{q, \log n\}/p + \log \hat\Phi(\Pi))$. In particular, $\mathrm{avgdist}^{*(\Pi)}(f) = O(\log \hat\Phi(\Pi))$. Also: $\mathrm{dist}(f) = O(\lceil \log n / p \rceil)$, $\mathrm{dist}_q^*(f) = O(\lceil q/p \rceil)$ and $\mathrm{avgdist}^*(f) = O(1)$.

We show that all the bounds in the theorem above are tight.

In the full paper we also give a stronger version of the bounds for decomposable metrics. Recall that metric spaces (X, d) can be characterized by their decomposability parameter τ_X, where it is known that τ_X = O(log λ_X), where λ_X is the doubling constant of X, and that τ_X = O(s²) for metrics of K_{s,s}-excluded minor graphs. For metrics with a bounded decomposability parameter we extend Theorem 5 by showing an embedding with $\mathrm{dist}_q^{*(\Pi)}(f) = O(\min\{q, (\log \lambda_X)^{1-\frac{1}{p}} (\log n)^{1/p}\} + \log \hat\Phi(\Pi))$.

The proof of Theorem 5 follows directly from results on embedding with scaling distortion, discussed in the next paragraph.

1.2 Partial Embedding, Scaling Distortion and Additional Results

Following [26] we define:

Definition 6 (Partial Embedding). Given two metric spaces (X, d_X) and (Y, d_Y), a partial embedding is a pair (f, G), where f is a non-contractive embedding of X into Y, and $G \subseteq \binom{X}{2}$. The distortion of (f, G) is defined as:

$$\mathrm{dist}(f, G) = \sup_{\{u, v\} \in G} \mathrm{dist}_f(u, v).$$

For ε ∈ [0, 1), a (1 − ε)-partial embedding is a partial embedding such that $|G| \geq (1 - \varepsilon)\binom{n}{2}$.⁹

Next, we would like to define a special type of (1 − ε)-partial embeddings. For this aim we need a few more definitions. Let r_ε(x) denote the minimal radius r such that |B(x, r)|/n ≥ ε. Let

$$\hat G(\varepsilon) = \left\{ \{x, y\} \in \tbinom{X}{2} \;\middle|\; d(x, y) \geq \max\{r_{\varepsilon/2}(x), r_{\varepsilon/2}(y)\} \right\}.$$

A coarsely (1 − ε)-partial embedding is a pair (f, Ĝ(ε)), where f is an embedding.¹⁰

Definition 7 (Scaling Distortion). Given two metric spaces (X, d_X) and (Y, d_Y) and a function α : [0, 1) → R⁺, we say that an embedding f : X → Y has scaling distortion α if for any ε ∈ [0, 1) there is some set G(ε) such that (f, G(ε)) is a (1 − ε)-partial embedding with distortion at most α(ε). We say that f has coarsely scaling distortion if for every ε, G(ε) = Ĝ(ε).

We can extend the notions of partial embedding and scaling distortion to probabilistic embeddings. For simplicity we will restrict to coarsely partial embeddings.¹¹

Definition 8 (Partial/Scaling Probabilistic Embedding). Given (X, d_X) and a set of metric spaces S, for ε ∈ [0, 1), a coarsely (1 − ε)-partial probabilistic embedding consists of a distribution $\mathcal{F}$ over a set F of coarsely (1 − ε)-partial embeddings from X into Y ∈ S. The distortion of $\mathcal{F}$ is defined as:

$$\mathrm{dist}(\mathcal{F}) = \sup_{\{u, v\} \in \hat G(\varepsilon)} \mathbb{E}_{(f, \hat G(\varepsilon)) \sim \mathcal{F}}[\mathrm{dist}_f(u, v)].$$

The notion of scaling distortion is extended to probabilistic embedding in the obvious way.

We observe the following relation between partial embedding, scaling distortion and the ℓq-distortion.

Lemma 1 (Scaling Distortion vs. ℓq-Distortion). Given an n-point metric space (X, d_X) and a metric space (Y, d_Y), if there exists an embedding f : X → Y with scaling distortion α, then for any distribution Π over $\binom{X}{2}$:¹²

$$\mathrm{dist}_q^{(\Pi)}(f) \leq \left( 2 \int_{\frac{1}{2}\binom{n}{2}^{-1}}^{1} \hat\Phi(\Pi)\, \alpha\big(x \hat\Phi(\Pi)^{-1}\big)^q\, dx \right)^{1/q} + \alpha\big(\hat\Phi(\Pi)^{-1}\big).$$

In the case of coarsely scaling distortion this bound holds for $\mathrm{dist}_q^{*(\Pi)}(f)$.

⁸ The factor of 2 in the definition is placed solely for the sake of technical convenience.
⁹ Note that the embedding is strictly partial only if $\varepsilon \geq 1/\binom{n}{2}$.
¹⁰ It is elementary to verify that indeed this defines a (1 − ε)-partial embedding. We also note that in most of the proofs we can use a min rather than a max in the definition of Ĝ(ε). However, this definition seems more natural and of more general applicability.
¹¹ Our upper bounds use this definition, while our lower bounds hold also for the non-coarsely case.
¹² Assuming the integral is defined. We note that the lemma is stated using the integral for presentation reasons.


Combined with the following theorem we obtain Theorem 5. We note that when applying the lemma we use α(ε) = O(log 1/ε), and the bounds in the theorem mentioned above follow from bounding the corresponding integral.
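To see how the integral yields the claimed bounds, here is a short unpacking of our own (for the uniform distribution, where $\hat\Phi(U) = 2$, and extending the integral down to 0, which only increases it):

```latex
% With \alpha(\varepsilon) = C\log(2/\varepsilon), using (a+b)^q \le 2^{q-1}(a^q + b^q) and
% \int_0^1 (\ln(1/x))^q\,dx = \Gamma(q+1) (substitute x = e^{-t}):
\int_0^1 \alpha(x)^q\,dx \;\le\; C^q \int_0^1 \big(\ln 2 + \ln(1/x)\big)^q\,dx
  \;\le\; (2C)^q\,\Gamma(q+1).
% Taking q-th roots, \Gamma(q+1)^{1/q} = O(q) by Stirling, so dist_q(f) = O(q);
% in particular q = 1 gives constant average distortion.
```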

Theorem 6 (Scaling Distortion Theorem into Lp). Let 1 ≤ p ≤ ∞. For any n-point metric space (X, d) there exists an embedding f : X → Lp with coarsely scaling distortion $O(\lceil (\log \frac{1}{\varepsilon})/p \rceil)$ and dimension $e^{O(p)} \log n$.

For metrics with a decomposability parameter τ_X the distortion improves to $O(\min\{\tau_X^{1-1/p} (\log \frac{1}{\varepsilon})^{1/p},\ \log \frac{1}{\varepsilon}\})$.

Applying the lemma to the probabilistic embedding into ultrametrics with scaling distortion O(log 1/ε) of [1] we obtain:

Theorem 7 (Probabilistic Embedding into Ultrametrics). Let (X, d) be an n-point metric space. There exists a probabilistic embedding $\mathcal{F}$ of X into ultrametrics, such that for every 1 ≤ q ≤ ∞ and any distribution Π over $\binom{X}{2}$:

$$\mathrm{dist}_q^{*(\Pi)}(\mathcal{F}) = O(\min\{q, \log n\} + \log \hat\Phi(\Pi)).$$

For q = 1 and a given fixed distribution, Theorem 7 can be given a classic (deterministic) embedding version:

Theorem 8 (Embedding into Ultrametrics). Given an arbitrary fixed distribution Π over $\binom{X}{2}$, for any finite metric space (X, d) there exist embeddings f, f′ into ultrametrics such that $\mathrm{avgdist}^{(\Pi)}(f) = O(\log \hat\Phi(\Pi))$ and $\mathrm{distavg}^{(\Pi)}(f') = O(\log \hat\Phi(\Pi))$.

The results in [1] leave as an open question the distortion of partial embedding into an ultrametric. While the upper bound which follows from [1] by applying [6, 12] is O(1/ε), the lower bound which follows from [1] by applying the Ω(n) lower bound on embedding into trees of [38] is only Ω(1/√ε).

We show that the correct answer to this question is the latter bound, which is achievable by embedding into ultrametrics. In fact we can obtain embeddings into low-degree k-HSTs [6]. This result is related both in spirit and in techniques to the recently developed metric Ramsey theorems [10, 12], where embeddings into ultrametrics and k-HSTs play a central role.

Theorem 9 (Partial Embedding into Ultrametrics). For every n-point metric space (X, d) and any ε ∈ (0, 1) there exists a (1 − ε)-partial embedding into an ultrametric with distortion O(1/√ε).

1.3 Algorithmic Applications

We demonstrate some basic applications of our main theorems. We must stress however that our current applications do not use the full strength of these theorems. Most of our applications are based on the bound given on the distortion of average for general distributions, of embeddings f into Lp and into ultrametrics with $\mathrm{distavg}^{(\Pi)}(f) = O(\log \hat\Phi(\Pi))$. In some of these applications it is crucial that the result holds for all such distributions Π. This is useful for problems which are defined with respect to weights c(u, v) in a graph or in a metric space, where the solution involves minimizing the sum over distances weighted according to c. This is common for many optimization problems, either as part of the objective function, or alternatively it may come up in the linear programming relaxation of the problem. These weights can be normalized to define the distribution Π. Using this paradigm we obtain $O(\log \hat\Phi(c))$ approximation algorithms, improving on the general bound, which depends on n, in the case that $\hat\Phi(c)$ is small. This is the first result of this nature.
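The normalization step in this paradigm is straightforward; here is a minimal sketch of our own, where `c` is a hypothetical dictionary of nonnegative pair weights:

```python
def weights_to_distribution(c):
    """Normalize nonnegative pair weights c[(u, v)] into a probability
    distribution pi over pairs, and compute its aspect ratio Phi(pi)
    (infinite when pi is degenerate, i.e. some pair has zero weight)."""
    total = float(sum(c.values()))
    pi = {pair: w / total for pair, w in c.items()}
    probs = list(pi.values())
    aspect = max(probs) / min(probs) if min(probs) > 0 else float("inf")
    return pi, aspect
```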

We are able to obtain such results for the following group of problems: general sparsest cut [31, 5, 33, 3, 2], multi cut [20], minimum linear arrangement [16, 40], embedding in d-dimensional meshes [16, 8], multiple sequence alignment [45] and uncapacitated quadratic assignment [36, 25].

We would like to emphasize that the notion of bounded weights is particularly natural in the last application mentioned above. The problem of uncapacitated quadratic assignment is one of the most basic problems in operations research (see the survey [36]) and has been one of the main motivations for the work of Kleinberg and Tardos on metric labelling [25].

We also present a different use of our results for the problem of min-sum k-clustering [11].

1.4 Distance Oracles

Thorup and Zwick [44] study the problem of creating distance oracles for a given metric space. A distance oracle is a space efficient data structure which allows efficient queries for the approximate distance between pairs of points.

They give a distance oracle of space $O(kn^{1+1/k})$, query time O(k) and worst case distortion (also called stretch) of 2k − 1. They also show that this is nearly best possible in terms of the space-distortion tradeoff.


We extend the new notions of distortion to the context of distance oracles. In particular, we can define the ℓq-distortion of a distance oracle. Of particular interest are the average distortion and distortion of average notions. We also define partial distance oracles and distance oracles with scaling distortion. We present distance oracle analogues to our theorems on embeddings.

2 Proof of Main Result

In this section we prove Theorem 6. We make use of a new type of probabilistic partition, described below.

2.1 Probabilistic Hierarchical Partitions

Definition 9 (Partition). Let (X, d) be a finite metric space. A partition P of X is a collection of disjoint sets C(P) = {C₁, C₂, . . . , C_t} such that X = ∪_j C_j. The sets C_j ⊆ X are called clusters. For x ∈ X we denote by P(x) the cluster containing x. Given Δ > 0, a partition is Δ-bounded if for all 1 ≤ j ≤ t, diam(C_j) ≤ Δ.

Definition 10 (Uniform Function). Given a partition P of a metric space (X, d), a function f defined on X is called uniform with respect to P if for any x, y ∈ X such that P(x) = P(y) we have f(x) = f(y).

Definition 11 (Hierarchical Partition). Fix some integer L > 0. Let I = {i ∈ Z | 0 ≤ i ≤ L}. A hierarchical partition P of a finite metric space (X, d) is a hierarchical collection of partitions {P_i}_{i∈I}, where P₀ consists of a single cluster equal to X, and for any 0 < i ∈ I and x ∈ X, P_i(x) ⊆ P_{i−1}(x). Given k > 1, let L = ⌈log_k(diam(X))⌉, set Δ₀ = diam(X), and for each 0 < i ∈ I, Δ_i = Δ_{i−1}/k. We say that P is k-hierarchical if for each i ∈ I, P_i ∈ P, P_i is Δ_i-bounded.

Definition 12 (Probabilistic Hierarchical Partition). A probabilistic k-hierarchical partition Ĥ of a finite metric space (X, d) consists of a probability distribution over a set H of k-hierarchical partitions.

A collection of functions defined on X, $\bar f = \{f_{P,i} \mid P \in H, i \in I\}$, is uniform with respect to Ĥ if for every P ∈ H and i ∈ I, $f_{P,i}$ is uniform with respect to P_i.

Definition 13 (Uniformly Padded Probabilistic Hierarchical Partition). Let Ĥ be a probabilistic k-hierarchical partition. Given a collection of functions $\eta = \{\eta_{P,i} : X \to [0, 1] \mid i \in I, P_i \in P, P \in H\}$ and δ ∈ (0, 1], Ĥ is called (η, δ)-padded if the following condition holds for all i ∈ I and for any x ∈ X:

$$\Pr[B(x, \eta_{P,i}(x)\Delta_i) \subseteq P_i(x)] \geq \delta.$$

We say Ĥ is uniformly padded if η is uniform with respect to Ĥ.

To state the main lemma we require the following additional notion:

Definition 14. The local growth rate of x ∈ X at radius r > 0 for a given scale γ > 0 is defined as ρ(x, r, γ) = |B(x, rγ)|/|B(x, r/γ)|. Given a subspace Z ⊆ X, the minimum local growth rate of Z at radius r > 0 and scale γ > 0 is defined as $\rho(Z, r, \gamma) = \min_{x \in Z} \rho(x, r, \gamma)$. The minimum local growth rate of x ∈ X at radius r > 0 and scale γ > 0 is defined as $\bar\rho(x, r, \gamma) = \rho(B(x, r), r, \gamma)$.
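In code these are just ratios of ball sizes; a brute-force sketch of our own, where `d` is a hypothetical distance function:

```python
def local_growth_rate(points, d, x, r, gamma):
    """rho(x, r, gamma) = |B(x, r*gamma)| / |B(x, r/gamma)| (Definition 14)."""
    ball_size = lambda center, radius: sum(1 for y in points if d(center, y) <= radius)
    return ball_size(x, r * gamma) / ball_size(x, r / gamma)

def min_local_growth_rate(points, d, x, r, gamma):
    """rho_bar(x, r, gamma): minimum of rho(z, r, gamma) over z in B(x, r)."""
    return min(local_growth_rate(points, d, z, r, gamma)
               for z in points if d(x, z) <= r)
```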

We now present the main lemma on the existence of hierarchical partitions, which are the main building block of our embedding. A variant of this lemma is given in [9].

Lemma 2 (Hierarchical Uniform Padding Lemma*). Let Γ = 64 and let δ ∈ (0, 1/2]. Given a finite metric space (X, d), there exists a probabilistic 4-hierarchical partition Ĥ of (X, d) and a uniform collection of functions $\xi = \{\xi_{P,i} : X \to \{0, 1\} \mid P \in H, i \in I\}$, such that for the collection of functions η, defined below, we have that Ĥ is (η, δ)-uniformly padded, and the following properties hold for any P ∈ H, 0 < i ∈ I, P_i ∈ P:

• $\sum_{j \leq i} \xi_{P,j}(x)\, \eta_{P,j}(x)^{-1} \leq 2^{11} \ln\!\left(\frac{n}{|B(x, \Delta_{i+4})|}\right) / \ln(1/\delta)$.

• If $\xi_{P,i}(x) = 1$ then $\eta_{P,i}(x) \leq 2^{-8}$.

• If $\xi_{P,i}(x) = 0$ then $\eta_{P,i}(x) \geq 2^{-8}$ and $\bar\rho(x, \Delta_{i-1}, \Gamma) < 1/\delta$.


2.2 The Proof

In this section we prove the following generalization of Theorem 6. The proof relies on Lemma 2.

Theorem 10. Let 1 ≤ p ≤ ∞ and let 1 ≤ κ ≤ p. For any n-point metric space (X, d) there exists an embedding f : X → Lp with coarsely scaling distortion $O(\lceil (\log \frac{1}{\varepsilon})/\kappa \rceil)$ and dimension $e^{O(\kappa)} \log n$.

Let 1 ≤ κ ≤ p. Let $s = e^\kappa$ and let $D = e^{\Theta(\kappa)} \ln n$. We will define an embedding $f : X \to \ell_p^D$ by defining, for each 1 ≤ t ≤ D, functions $f^{(t)}, \psi^{(t)}, \mu^{(t)} : X \to \mathbb{R}^+$, and letting $f^{(t)} = \psi^{(t)} + \mu^{(t)}$ and $f = D^{-1/p} \bigoplus_{1 \leq t \leq D} f^{(t)}$.

Fix t, 1 ≤ t ≤ D. In what follows we define $\psi^{(t)}$. We construct a uniformly (η, 1/s)-padded probabilistic 4-hierarchical partition Ĥ as in Lemma 2, and let ξ be as defined in the lemma. Now fix a hierarchical partition P ∈ H. We define the embedding by defining the coordinates for each x ∈ X. Define for x ∈ X and 0 < i ∈ I, $\phi_i^{(t)} : X \to \mathbb{R}^+$, by $\phi_i^{(t)}(x) = \xi_{P,i}(x)\, \eta_{P,i}(x)^{-1}$.

Claim 3. For any x, y ∈ X and i ∈ I, if $P_i(x) = P_i(y)$ then $\phi_i^{(t)}(x) = \phi_i^{(t)}(y)$.

For each 0 < i ∈ I we define a function $\psi_i^{(t)} : X \to \mathbb{R}^+$, and for x ∈ X let $\psi^{(t)}(x) = \sum_{i \in I} \psi_i^{(t)}(x)$.

Let $\{\sigma_i^{(t)}(C) \mid C \in P_i, 0 < i \in I\}$ be i.i.d. symmetric {0, 1}-valued Bernoulli random variables. For each x ∈ X and each 0 < i ∈ I, let $\psi_i^{(t)}(x) = \sigma_i^{(t)}(P_i(x)) \cdot g_i^{(t)}(x)$, where $g_i^{(t)} : X \to \mathbb{R}^+$ is defined as $g_i^{(t)}(x) = \min\{\phi_i^{(t)}(x) \cdot d(x, X \setminus P_i(x)),\ \Delta_i\}$. Define $\hat g_i^{(t)} : X \times X \to \mathbb{R}^+$ as follows: $\hat g_i^{(t)}(x, y) = \min\{\phi_i^{(t)}(x) \cdot d(x, y),\ \Delta_i\}$ (note that $\hat g_i^{(t)}$ is nonsymmetric).

Claim 4. For any 0 < i ∈ I and x, y ∈ X: $\psi_i^{(t)}(x) - \psi_i^{(t)}(y) \leq \hat g_i^{(t)}(x, y)$.

Proof. We have two cases. In Case 1, assume $P_i(x) = P_i(y)$. It follows that $\psi_i^{(t)}(x) - \psi_i^{(t)}(y) = \sigma_i^{(t)}(P_i(x)) \cdot (g_i^{(t)}(x) - g_i^{(t)}(y))$. We will show that $g_i^{(t)}(x) - g_i^{(t)}(y) \leq \hat g_i^{(t)}(x, y)$. The bound $g_i^{(t)}(x) - g_i^{(t)}(y) \leq \Delta_i$ is immediate. To prove $g_i^{(t)}(x) - g_i^{(t)}(y) \leq \phi_i^{(t)}(x) \cdot d(x, y)$, consider the value of $g_i^{(t)}(y)$. Assume first $g_i^{(t)}(y) = \phi_i^{(t)}(y) \cdot d(y, X \setminus P_i(x))$. By Claim 3, $\phi_i^{(t)}(y) = \phi_i^{(t)}(x)$ and therefore

$$g_i^{(t)}(x) - g_i^{(t)}(y) \leq \phi_i^{(t)}(x) \cdot \big(d(x, X \setminus P_i(x)) - d(y, X \setminus P_i(x))\big) \leq \phi_i^{(t)}(x) \cdot d(x, y).$$

In the second case $g_i^{(t)}(y) = \Delta_i$, and therefore $g_i^{(t)}(x) - g_i^{(t)}(y) \leq \Delta_i - \Delta_i = 0$, proving the claim in this case.

Next, consider Case 2, where $P_i(x) \neq P_i(y)$. In this case we have that $d(x, X \setminus P_i(x)) \leq d(x, y)$, which implies that $\psi_i^{(t)}(x) \leq g_i^{(t)}(x) \leq \hat g_i^{(t)}(x, y)$.

Next, we define the function $\mu^{(t)}$, based on the embedding technique of Bourgain [13] and its generalization by Matousek [34]. Let $T' = \lceil \log_s n \rceil$ and $K = \{k \in \mathbb{N} \mid 1 \leq k \leq T'\}$. For each k ∈ K define a randomly chosen subset $A_k^{(t)} \subseteq X$, with each point of X included in $A_k^{(t)}$ independently with probability $s^{-k}$. For each k ∈ K and x ∈ X, define: $I_k(x) = \{i \in I \mid \forall u \in P_i(x),\ s^{k-2} < |B(u, 16\Delta_{i-1})| \leq s^k\}$.
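The random sets are sampled exactly as in Bourgain-style embeddings; a minimal sketch of our own:

```python
import math
import random

def sample_bourgain_sets(points, s):
    """For k = 1..ceil(log_s n), include each point in A_k independently
    with probability s^(-k), as in the definition of mu^(t) above."""
    n = len(points)
    T = max(1, math.ceil(math.log(n, s)))
    return {k: {x for x in points if random.random() < s ** (-k)}
            for k in range(1, T + 1)}
```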

We make the following two simple observations:

Claim 5. For every i ∈ I: (1) For any x ∈ X, $|\{k \mid i \in I_k(x)\}| \leq 2$. (2) For every k ∈ K, the indicator function of the event $i \in I_k(x)$ is uniform with respect to $P_i$.

We define $i_k : X \to I$, where $i_k(x) = \min\{i \mid i \in I_k(x)\}$. For each k ∈ K we define a function $\mu_k^{(t)} : X \to \mathbb{R}^+$, and for x ∈ X let $\mu^{(t)}(x) = \sum_{k \in K} \mu_k^{(t)}(x)$. Let $\Phi_0 = 2^8$. The function $\mu_k^{(t)}$ is defined as follows: for each x ∈ X and each k ∈ K, let $\mu_k^{(t)}(x) = \min\{\frac{1}{4} d(x, A_k^{(t)}),\ h_{i_k(x)}^{(t)}(x)\}$, where $h_i^{(t)} : X \to \mathbb{R}^+$ is defined as $h_i^{(t)}(x) = \min\{\Phi_0 \cdot d(x, X \setminus P_i(x)),\ \Delta_i\}$. Define $\hat h_i^{(t)} : X \times X \to \mathbb{R}^+$ as follows: $\hat h_i^{(t)}(x, y) = \min\{\Phi_0 \cdot d(x, y),\ \Delta_i\}$ (note that $\hat h_i^{(t)}$ is nonsymmetric). We have the following analogue of Claim 4:

Claim 6. For any k ∈ K and x, y ∈ X: $\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \hat h_{i_k(x)}^{(t)}(x, y)$.


Proof. Let $i = i_k(x)$. We have two cases. In Case 1, assume $P_i(x) = P_i(y)$. Let $i' = i_k(y)$. By Claim 5 we have that $i \in I_k(y)$, implying $i' \leq i$. Since P is a hierarchical partition we have that $P_{i'}(x) = P_{i'}(y)$. Hence Claim 5 implies that $i' \in I_k(x)$, so that $i \leq i'$, which implies $i' = i$.

Since $h_i^{(t)}(x) \leq \Delta_i$ we have that $\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \Delta_i$. To prove $\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \Phi_0 \cdot d(x, y)$, consider the value of $\mu_k^{(t)}(y)$. If $\mu_k^{(t)}(y) = \frac{1}{4} d(y, A_k^{(t)})$ then $\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \frac{1}{4}\big(d(x, A_k^{(t)}) - d(y, A_k^{(t)})\big) \leq \frac{1}{4}\, d(x, y) \leq \Phi_0 \cdot d(x, y)$. Otherwise, if $\mu_k^{(t)}(y) = \Phi_0 \cdot d(y, X \setminus P_i(x))$ then

$$\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \Phi_0 \cdot \big(d(x, X \setminus P_i(x)) - d(y, X \setminus P_i(x))\big) \leq \Phi_0 \cdot d(x, y).$$

Finally, if $\mu_k^{(t)}(y) = \Delta_i$ then $\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \Delta_i - \Delta_i = 0$.

Next, consider Case 2, where $P_i(x) \neq P_i(y)$. In this case we have that $d(x, X \setminus P_i(x)) \leq d(x, y)$, which implies that

$$\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \leq \mu_k^{(t)}(x) \leq h_i^{(t)}(x) \leq \hat h_i^{(t)}(x, y).$$

Lemma 7. There exists a universal constant $C_1 > 0$ such that for any ε > 0 and any $(x, y) \in \hat G(\varepsilon)$:

$$|f^{(t)}(x) - f^{(t)}(y)| \leq C_1 \left( \ln(1/\varepsilon)/\kappa + 1 \right) \cdot d(x, y).$$

Proof. From Claim 4 we get

$$\sum_{0 < i \in I} \big(\psi_i^{(t)}(x) - \psi_i^{(t)}(y)\big) \leq \sum_{0 < i \in I} \hat g_i^{(t)}(x, y).$$

Define ℓ to be the largest index such that $\Delta_{\ell+4} \geq d(x, y) \geq \max\{r_{\varepsilon/2}(x), r_{\varepsilon/2}(y)\}$. If no such ℓ exists then let ℓ = 0. By Lemma 2 we have

$$\sum_{0 < i \leq \ell} \hat g_i^{(t)}(x, y) \leq \sum_{0 < i \leq \ell} \phi_i^{(t)}(x) \cdot d(x, y) \leq 2^{11} \ln\!\left(\frac{n}{|B(x, \Delta_{\ell+4})|}\right)\!/\kappa \cdot d(x, y) \leq \big(2^{11} \ln(2/\varepsilon)/\kappa\big) \cdot d(x, y).$$

We also have that $\sum_{\ell < i \in I} \hat g_i^{(t)}(x, y) \leq \sum_{\ell < i \in I} \Delta_i \leq \Delta_\ell \leq 4^5\, d(x, y)$. It follows that

$$|\psi^{(t)}(x) - \psi^{(t)}(y)| = \Big|\sum_{0 < i \in I} \big(\psi_i^{(t)}(x) - \psi_i^{(t)}(y)\big)\Big| \leq \big(2^{11} \ln(2/\varepsilon)/\kappa + 4^5\big) \cdot d(x, y).$$

From Claim 6 we get $\sum_{k \in K} \big(\mu_k^{(t)}(x) - \mu_k^{(t)}(y)\big) \leq \sum_{k \in K} \hat h_{i_k(x)}^{(t)}(x, y)$.

Let k′ be the largest index such that $s^{k'} \leq \varepsilon n/2$. We have

$$\sum_{k' < k \in K} \hat h_{i_k(x)}^{(t)}(x, y) \leq \sum_{k' < k \in K} \Phi_0 \cdot d(x, y) = 2^8 \cdot \big(\lceil \log_s n \rceil - \lfloor \log_s(\varepsilon n/2) \rfloor\big) \cdot d(x, y) \leq 2^8 \cdot \big(\ln(2/\varepsilon)/\kappa + 2\big) \cdot d(x, y).$$

Now, if k ≤ k′ and $i \in I_k(x)$, then for any $u \in P_i(x)$ we have $|B(x, 16\Delta_i)| \leq |B(u, 16\Delta_{i-1})| \leq s^k \leq \varepsilon n/2$. It follows that $d(x, y) \geq r_{\varepsilon/2}(x) \geq 16\Delta_i$. Denote by ℓ′ the smallest $i \in I$ satisfying the last inequality. Using Claim 5 we get

$$\sum_{k' \geq k \in K} \hat h_{i_k(x)}^{(t)}(x, y) = \sum_{k' \geq k \in K} \Delta_{i_k(x)} \leq \sum_{\ell' \leq i \in I} \; \sum_{k \in K,\, i \in I_k(x)} \Delta_i \leq \sum_{\ell' \leq i \in I} 2\Delta_i \leq 4\Delta_{\ell'} \leq d(x, y)/4.$$

It follows that

$$|\mu^{(t)}(x) - \mu^{(t)}(y)| = \Big|\sum_{k \in K} \big(\mu_k^{(t)}(x) - \mu_k^{(t)}(y)\big)\Big| \leq 2^8 \big(\ln(2/\varepsilon)/\kappa + 3\big) \cdot d(x, y).$$

Therefore

$$|f^{(t)}(x) - f^{(t)}(y)| = |\psi^{(t)}(x) + \mu^{(t)}(x) - \psi^{(t)}(y) - \mu^{(t)}(y)| \leq 2^{12} \big(\ln(2/\varepsilon)/\kappa + 1\big) \cdot d(x, y).$$


Lemma 8. There exists a universal constant $C_2 > 0$ such that for any x, y ∈ X, with probability at least $e^{-5\kappa}/4$:

$$|f^{(t)}(x) - f^{(t)}(y)| \geq C_2 \cdot d(x, y).$$

Proof. Let 0 < ℓ ∈ I be such that $4\Delta_{\ell-1} \leq d(x, y) \leq 16\Delta_{\ell-1}$. We distinguish between the following two cases:

• Case 1: Either $\xi_{P,\ell}(x) = 1$ or $\xi_{P,\ell}(y) = 1$.

Assume w.l.o.g. that $\xi_{P,\ell}(x) = 1$. It follows that $\phi_\ell^{(t)}(x) = \eta_{P,\ell}(x)^{-1}$. As Ĥ is (η, δ)-padded we have the following bound: $\Pr[B(x, \eta_{P,\ell}(x)\Delta_\ell) \subseteq P_\ell(x)] \geq 1/s$. Therefore with probability at least 1/s:

$$\phi_\ell^{(t)}(x) \cdot d(x, X \setminus P_\ell(x)) \geq \phi_\ell^{(t)}(x) \cdot \eta_{P,\ell}(x)\Delta_\ell \geq \Delta_\ell.$$

Assume that this event occurs. We distinguish between two cases:

– $|f^{(t)}(x) - f^{(t)}(y) - (\psi_\ell^{(t)}(x) - \psi_\ell^{(t)}(y))| \geq \frac{1}{2}\Delta_\ell$. In this case there is probability at least 1/4 that $\sigma_\ell^{(t)}(P_\ell(x)) = \sigma_\ell^{(t)}(P_\ell(y)) = 0$, so that $\psi_\ell^{(t)}(x) = \psi_\ell^{(t)}(y) = 0$.

– $|f^{(t)}(x) - f^{(t)}(y) - (\psi_\ell^{(t)}(x) - \psi_\ell^{(t)}(y))| \leq \frac{1}{2}\Delta_\ell$. Since $\mathrm{diam}(P_\ell(x)) \leq \Delta_\ell < d(x, y)$ we have that $P_\ell(y) \neq P_\ell(x)$. We get that there is probability 1/4 that $\sigma_\ell^{(t)}(P_\ell(x)) = 1$ and $\sigma_\ell^{(t)}(P_\ell(y)) = 0$, so that $\psi_\ell^{(t)}(x) - \psi_\ell^{(t)}(y) \geq \Delta_\ell$.

We conclude that with probability at least 1/(4s): $|f^{(t)}(x) - f^{(t)}(y)| \geq \frac{1}{2}\Delta_\ell$.

• Case 2: $\xi_{P,\ell}(x) = \xi_{P,\ell}(y) = 0$.

It follows from Lemma 2 that $\max\{\bar\rho(x, \Delta_{\ell-1}, \Gamma), \bar\rho(y, \Delta_{\ell-1}, \Gamma)\} < s$. Let $x' \in B(x, \Delta_{\ell-1})$ and $y' \in B(y, \Delta_{\ell-1})$ be such that $\rho(x', \Delta_{\ell-1}, \Gamma) = \bar\rho(x, \Delta_{\ell-1}, \Gamma)$ and $\rho(y', \Delta_{\ell-1}, \Gamma) = \bar\rho(y, \Delta_{\ell-1}, \Gamma)$. For $z \in \{x', y'\}$ we have:

$$s > \frac{|B(z, \Gamma\Delta_{\ell-1})|}{|B(z, \Delta_{\ell-1}/\Gamma)|} \geq \frac{|B(x, 32\Delta_{\ell-1})|}{|B(z, \Delta_{\ell-1}/\Gamma)|},$$

using that $d(x, x') \leq \Delta_{\ell-1}$ and $d(x, y') \leq d(x, y) + d(y, y') \leq 17\Delta_{\ell-1}$, and Γ = 64, so that $B(x, 32\Delta_{\ell-1}) \subseteq B(z, \Gamma\Delta_{\ell-1})$. Let k ∈ K be such that $s^{k-1} < |B(x, 32\Delta_{\ell-1})| \leq s^k$. We deduce that for $z \in \{x', y'\}$, $|B(z, \Delta_{\ell-1}/\Gamma)| > s^{k-2}$. Consider an arbitrary point $u \in P_\ell(x)$; as $d(x, u) \leq \Delta_\ell < \Delta_{\ell-1}$ it follows that $s^{k-2} < |B(u, 16\Delta_{\ell-1})| \leq s^k$. This implies that $\ell \in I_k(x)$ and therefore $i_k(x) \leq \ell$. As Ĥ is (η, δ)-padded we have the following bound:

$$\Pr[B(x, \eta_{P,\ell}(x)\Delta_\ell) \subseteq P_\ell(x)] \geq 1/s.$$

Assume that this event occurs. Since P is hierarchical we get that for every $i \leq \ell$, $B(x, \eta_{P,\ell}(x)\Delta_\ell) \subseteq P_\ell(x) \subseteq P_i(x)$, and in particular this holds for $i = i_k(x)$. As $\xi_{P,\ell}(x) = 0$ we have that $\eta_{P,\ell}(x) \geq 2^{-8} = 1/\Phi_0$. Hence,

$$\Phi_0 \cdot d(x, X \setminus P_i(x)) \geq \Phi_0 \cdot \eta_{P,\ell}(x)\Delta_\ell \geq \Delta_\ell,$$

implying $\mu_k^{(t)}(x) = \min\{\frac{1}{4}d(x, A_k^{(t)}),\ \Phi_0 \cdot d(x, X \setminus P_i(x)),\ \Delta_i\} \geq \min\{\frac{1}{4}d(x, A_k^{(t)}),\ \Delta_\ell\}$.

The following is a variant on the original argument in [13, 34]. Define the events $A_1 = \{B(y', \Delta_{\ell-1}/\Gamma) \cap A_k^{(t)} \neq \emptyset\}$, $A_2 = \{B(x', \Delta_{\ell-1}/\Gamma) \cap A_k^{(t)} \neq \emptyset\}$ and $A_2' = \{[B(x, 32\Delta_{\ell-1}) \setminus B(y', \Delta_{\ell-1}/\Gamma)] \cap A_k^{(t)} = \emptyset\}$. Then for m ∈ {1, 2}:

$$\Pr[A_m] \geq 1 - \big(1 - s^{-k}\big)^{s^{k-2}} \geq 1 - e^{-s^{-k} \cdot s^{k-2}} \geq 1 - e^{-s^{-2}} \geq s^{-2}/2,$$

$$\Pr[A_2'] \geq \big(1 - s^{-k}\big)^{s^k} \geq 1/4,$$

using s ≥ 2. Observe that $d(x', y') \geq d(x, y) - 2\Delta_{\ell-1} \geq 2\Delta_{\ell-1} > 2\Delta_{\ell-1}/\Gamma$, implying $B(y', \Delta_{\ell-1}/\Gamma) \cap B(x', \Delta_{\ell-1}/\Gamma) = \emptyset$. It follows that event $A_1$ is independent of either event $A_2$ or $A_2'$.

Assume event $A_1$ occurs. It follows that $d(y, A_k^{(t)}) \leq d(y, y') + \Delta_{\ell-1}/\Gamma \leq (1 + 1/\Gamma)\Delta_{\ell-1}$. We distinguish between two cases:


– $|f^{(t)}(x) - f^{(t)}(y) - (\mu_k^{(t)}(x) - \mu_k^{(t)}(y))| \geq \frac{3}{2}\Delta_\ell$. In this case there is probability at least $s^{-2}/2$ that event $A_2$ occurs, so that $|\mu_k^{(t)}(x) - \mu_k^{(t)}(y)| \leq \frac{1}{4}\max\{d(x, A_k^{(t)}), d(y, A_k^{(t)})\} \leq \frac{1}{4}(1 + 1/\Gamma)\Delta_{\ell-1} = (1 + 1/\Gamma)\Delta_\ell$, by bounding $d(x, A_k^{(t)})$ in the same way as done above for $d(y, A_k^{(t)})$. We therefore get with probability at least $s^{-2}/2$ that $|f^{(t)}(x) - f^{(t)}(y)| \geq \frac{3}{2}\Delta_\ell - (1 + 1/\Gamma)\Delta_\ell \geq \Delta_\ell/4$.

– $|f^{(t)}(x) - f^{(t)}(y) - (\mu_k^{(t)}(x) - \mu_k^{(t)}(y))| \leq \frac{3}{2}\Delta_\ell$. In this case there is probability at least 1/4 that event $A_2'$ occurs. Observe that:

$$d(x, B(y', \Delta_{\ell-1}/\Gamma)) \leq d(x, y) + d(y, y') + \Delta_{\ell-1}/\Gamma \leq 16\Delta_{\ell-1} + \Delta_{\ell-1} + \Delta_{\ell-1}/\Gamma \leq (17 + 1/\Gamma)\Delta_{\ell-1} < 32\Delta_{\ell-1},$$

$$d(x, B(y', \Delta_{\ell-1}/\Gamma)) \geq d(x, y) - d(y, y') - \Delta_{\ell-1}/\Gamma \geq 4\Delta_{\ell-1} - \Delta_{\ell-1} - \Delta_{\ell-1}/\Gamma \geq (3 - 1/\Gamma)\Delta_{\ell-1},$$

implying that $d(x, A_k^{(t)}) \geq d(x, B(y', \Delta_{\ell-1}/\Gamma)) \geq (3 - 1/\Gamma)\Delta_{\ell-1}$ and therefore $\mu_k^{(t)}(x) \geq \min\{\frac{1}{4}(3 - 1/\Gamma)\Delta_{\ell-1},\ \Delta_{\ell-1}\} = (3 - 1/\Gamma)\Delta_\ell$. Since $\mu_k^{(t)}(y) \leq \frac{1}{4}d(y, A_k^{(t)}) \leq \frac{1}{4}(1 + 1/\Gamma)\Delta_{\ell-1} = (1 + 1/\Gamma)\Delta_\ell$, we obtain that $\mu_k^{(t)}(x) - \mu_k^{(t)}(y) \geq (3 - 1/\Gamma)\Delta_\ell - (1 + 1/\Gamma)\Delta_\ell \geq (2 - 2/\Gamma)\Delta_\ell$. We therefore get with probability at least 1/4 that $|f^{(t)}(x) - f^{(t)}(y)| \geq (2 - 2/\Gamma)\Delta_\ell - \frac{3}{2}\Delta_\ell \geq \Delta_\ell/4$.

We conclude that given $A_1$, with probability at least $s^{-2}/2$: $|f^{(t)}(x) - f^{(t)}(y)| \geq \Delta_\ell/4$.

It follows that with probability at least $s^{-5}/4$: $|f^{(t)}(x) - f^{(t)}(y)| \geq \frac{1}{4}\Delta_\ell \geq \frac{1}{4} \cdot 4^{-3}\, d(x, y) = 2^{-8}\, d(x, y)$.

Lemma 9. There exist universal constants $C_1', C_2' > 0$ such that w.h.p. for any ε > 0 and any $(x, y) \in \hat G(\varepsilon)$:

$$C_2' \cdot d(x, y) \leq \|f(x) - f(y)\|_p \leq C_1' \big(\ln(1/\varepsilon)/\kappa + 1\big) \cdot d(x, y).$$

Proof. By definition $\|f(x) - f(y)\|_p^p = D^{-1} \sum_{1 \leq t \leq D} |f^{(t)}(x) - f^{(t)}(y)|^p$.

Lemma 7 implies that $\|f(x) - f(y)\|_p^p \leq \big(C_1(\ln(1/\varepsilon)/\kappa + 1)\big)^p\, d(x, y)^p$.

Using Lemma 8 and applying Chernoff bounds we get w.h.p. for any x, y ∈ X:

$$\|f(x) - f(y)\|_p^p \geq \frac{1}{8} e^{-5\kappa} \big(C_2\, d(x, y)\big)^p \geq \frac{1}{8}\big(e^{-5} \cdot C_2\, d(x, y)\big)^p.$$

References

[1] I. Abraham, Y. Bartal, H. Chan, K. Dhamdhere, A. Gupta, J. Kleinberg, O. Neiman, and A. Slivkins. Metric embedding with relaxed guarantees. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2005.

[2] S. Arora, J. R. Lee, and A. Naor. Euclidean distortion and the sparsest cut. In STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 2005.

[3] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. In STOC '04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 222–231. ACM Press, 2004.

[4] Vassilis Athitsos and Stan Sclaroff. Database indexing methods for 3d hand pose estimation. In Gesture Workshop, pages 288–299, 2003.

[5] Yonatan Aumann and Yuval Rabani. An O(log k) approximate min-cut max-flow theorem and approximation algorithm. SIAM J. Comput., 27(1):291–301, 1998.

[6] Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In 37th Annual Symposium on Foundations of Computer Science (Burlington, VT, 1996), pages 184–193. IEEE Comput. Soc. Press, Los Alamitos, CA, 1996.

[7] Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 183–193, 1998.

[8] Y. Bartal. Graph decomposition lemmas and their role in metric embedding methods. In 12th Annual European Symposium on Algorithms, pages 89–97, 2004.

[9] Y. Bartal. On embedding finite metric spaces in low-dimensional normed spaces, 2005. Submitted.

[10] Y. Bartal, B. Bollobás, and M. Mendel. Ramsey-type theorems for metric spaces with applications to online problems, 2002. To appear in a special issue of Journal of Computer and System Sciences.

[11] Yair Bartal, Moses Charikar, and Danny Raz. Approximating min-sum k-clustering in metric spaces. In STOC '01: Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 11–20. ACM Press, 2001.

[12] Yair Bartal, Nathan Linial, Manor Mendel, and Assaf Naor. On metric Ramsey-type phenomena. Annals of Mathematics, 2003. To appear.

[13] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52(1-2):46–52, 1985.

[14] Manuel Costa, Miguel Castro, Antony I. T. Rowstron, and Peter B. Key. PIC: Practical Internet coordinates for distance estimation. In 24th International Conference on Distributed Computing Systems, pages 178–187, 2004.

[15] Russ Cox, Frank Dabek, M. Frans Kaashoek, Jinyang Li, and Robert Morris. Practical, distributed network coordinates. ACM SIGCOMM Computer Communication Review, 34(1):113–118, 2004.

[16] Guy Even, Joseph Seffi Naor, Satish Rao, and Baruch Schieber. Divide-and-conquer approximation algorithms via spreading metrics. J. ACM, 47(4):585–616, 2000.

[17] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In STOC '03: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 448–455. ACM Press, 2003.

[18] U. Feige, M. T. Hajiaghayi, and J. R. Lee. Improved approximation algorithms for minimum-weight vertex separators. In Annual ACM Symposium on Theory of Computing, pages 563–572, 2005.

[19] Paul Francis, Sugih Jamin, Cheng Jin, Yixin Jin, Danny Raz, Yuval Shavitt, and Lixia Zhang. IDMaps: a global internet host distance estimation service. IEEE/ACM Trans. Netw., 9(5):525–540, 2001.

[20] Naveen Garg, Vijay V. Vazirani, and Mihalis Yannakakis. Approximate max-flow min-(multi)cut theorems and their applications. In ACM Symposium on Theory of Computing, pages 698–707, 1993.

[21] Mikhael Gromov. Filling Riemannian manifolds. J. Differential Geom., 18(1):1–147, 1983.

[22] Eran Halperin, Jeremy Buhler, Richard M. Karp, Robert Krauthgamer, and B. Westover. Detecting protein sequence conservation via metric embeddings. In ISMB (Supplement of Bioinformatics), pages 122–129, 2003.

[23] G. Hjaltason and H. Samet. Contractive embedding methods for similarity searching in metric spaces, 2000.

[24] Gabriela Hristescu and Martin Farach-Colton. COFE: A scalable method for feature extraction from complex objects. In DaWaK, pages 358–371, 2000.

[25] Jon Kleinberg and Eva Tardos. Approximation algorithms for classification problems with pairwise relationships: metric labeling and Markov random fields. J. ACM, 49(5):616–639, 2002.

[26] Jon M. Kleinberg, Aleksandrs Slivkins, and Tom Wexler. Triangulation and embedding using small sets of beacons. In FOCS, pages 444–453, 2004.

[27] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: A new embedding method for finite metrics. In 45th Annual IEEE Symposium on Foundations of Computer Science, pages 434–443. IEEE, October 2004.

[28] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[29] Joseph B. Kruskal and Myron Wish. Multidimensional Scaling. Sage Publications, CA, 1978.

[30] James R. Lee, Manor Mendel, and Assaf Naor. Metric structures in L1: Dimension, snowflakes, and average distortion. In LATIN, pages 401–412, 2004.

[31] Frank Thomson Leighton and Satish Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. J. ACM, 46(6):787–832, 1999.

[32] Hyuk Lim, Jennifer C. Hou, and Chong-Ho Choi. Constructing internet coordinate system based on delay measurement. In 3rd ACM SIGCOMM Conference on Internet Measurement, pages 129–142, 2003.

[33] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.

[34] J. Matousek. Note on bi-Lipschitz embeddings into low-dimensional Euclidean spaces. Comment. Math. Univ. Carolinae, 31:589–600, 1990.

[35] T. S. Eugene Ng and Hui Zhang. Predicting internet network distance with coordinates-based approaches. In 21st Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), pages 178–187, 2002.

[36] P. Pardalos, F. Rendl, and H. Wolkowicz. The quadratic assignment problem: a survey and recent developments. In P. Pardalos and H. Wolkowicz, editors, Quadratic assignment and related problems (New Brunswick, NJ, 1993), pages 1–42. Amer. Math. Soc., Providence, RI, 1994.

[37] Yuri Rabinovich. On average distortion of embedding metrics into the line and into L1. In STOC '03: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 456–462. ACM Press, 2003.

[38] Yuri Rabinovich and Ran Raz. Lower bounds on the distortion of embedding finite metric spaces in graphs. Discrete & Computational Geometry, 19(1):79–94, 1998.

[39] S. Rao. Small distortion and volume preserving embeddings for planar and Euclidean metrics. In Proceedings of the Fifteenth Annual Symposium on Computational Geometry, pages 300–306, New York, 1999. ACM.

[40] S. Rao and A. Richa. New approximation techniques for some ordering problems. In SODA, pages 211–219, 1998.

[41] Yuval Shavitt and Tomer Tankel. Big-bang simulation for embedding network distances in Euclidean space. IEEE/ACM Trans. Netw., 12(6):993–1006, 2004.

[42] Liying Tang and Mark Crovella. Geometric exploration of the landmark selection problem. In PAM, pages 63–72, 2004.

[43] M. Thorup and U. Zwick. Compact routing schemes. In 13th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–10. ACM Press, July 2001.

[44] Mikkel Thorup and Uri Zwick. Approximate distance oracles. J. ACM, 52(1):1–24, 2005.

[45] Bang Ye Wu, Giuseppe Lancia, Vineet Bafna, Kun-Mao Chao, R. Ravi, and Chuan Yi Tang. A polynomial time approximation scheme for minimum routing cost spanning trees. In SODA '98: Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pages 21–32. Society for Industrial and Applied Mathematics, 1998.


Contents

1 Introduction
  1.1 ℓq-Distortion and the Main Theorem
  1.2 Partial Embedding, Scaling Distortion and Additional Results
  1.3 Algorithmic Applications
  1.4 Distance Oracles

2 Proof of Main Result
  2.1 Probabilistic Hierarchical Partitions
  2.2 The Proof

A Partial Embedding, Scaling Distortion and the ℓq-Distortion
  A.1 Distortion of ℓq-Norm for Fixed q

B Embedding into Ultrametrics

C Applications
  C.1 Sparsest cut
  C.2 Multi cut
  C.3 Minimum Linear Arrangement
  C.4 Multiple sequence alignment
  C.5 Uncapacitated quadratic assignment
  C.6 Min-sum k-clustering

D Distance Oracles
  D.1 Distance oracles with scaling distortion
  D.2 Partial distance oracles

E Constructing Hierarchical Probabilistic Partitions

A Partial Embedding, Scaling Distortion and the ℓq-Distortion

The following lemma states that lower bounds on the ℓq-distortion follow from lower bounds on (1 − ε)-partial embeddings. Applying this to the lower bound results from [1] we obtain the tightness of our bounds.

Lemma 10 (Partial Embedding vs. ℓq-Distortion). Let Y be a target metric space and let $\mathcal{X}$ be a family of metric spaces. If for any ε ∈ [0, 1) there is a lower bound of α(ε) on the distortion of (1 − ε)-partial embedding of metric spaces in $\mathcal{X}$ into Y, then for any 1 ≤ q ≤ ∞ there is a lower bound of $\frac{1}{2}\alpha(2^{-q})$ on the ℓq-distortion of embedding metric spaces in $\mathcal{X}$ into Y.

Proof. For any 1 ≤ q ≤ ∞ set $\varepsilon = 2^{-q}$ and let $X \in \mathcal{X}$ be a metric space such that any (1 − ε)-partial embedding into Y has distortion at least α(ε). Now, let f be an embedding of X into Y. It follows that there are at least $\varepsilon\binom{n}{2}$ pairs $\{u, v\} \in \binom{X}{2}$ such that $\mathrm{dist}_f(u, v) \geq \alpha(\varepsilon)$. Therefore:

$$\big(\mathbb{E}[\mathrm{dist}_f(u, v)^q]\big)^{1/q} \geq \big(\varepsilon\, \alpha(\varepsilon)^q\big)^{1/q} \geq \big(2^{-q}\alpha(2^{-q})^q\big)^{1/q} = \frac{1}{2}\alpha(2^{-q}).$$

The following lemma in particular implies constant average distortion for any scaling embedding with distortion O(log(1/ε)). Moreover, the argument extends to the weighted average case to provide O(log Φ̂) distortion, where Φ̂ is the effective aspect ratio of the weight function.

Lemma 1 (Scaling Distortion vs. ℓq-Distortion). Given an n-point metric space (X, d_X) and a metric space (Y, d_Y), if there exists an embedding f : X → Y with scaling distortion α, then for any distribution Π over $\binom{X}{2}$:¹³

$$\mathrm{dist}_q^{(\Pi)}(f) \leq \left( 2 \int_{\frac{1}{2}\binom{n}{2}^{-1}}^{1} \hat\Phi(\Pi)\, \alpha\big(x \hat\Phi(\Pi)^{-1}\big)^q\, dx \right)^{1/q} + \alpha\big(\hat\Phi(\Pi)^{-1}\big).$$

Proof. We may restrict to the case $\Phi(\Pi) \leq \binom{n}{2}$. Otherwise $\Phi(\Pi) > \binom{n}{2}$ and therefore $\mathrm{dist}_q^{(\Pi)}(f) \leq \mathrm{dist}(f) \leq \alpha(\hat\Phi(\Pi)^{-1})$. Recall that

$$\mathrm{dist}_q^{(\Pi)}(f) = \|\mathrm{dist}_f(u, v)\|_q^{(\Pi)} = \mathbb{E}_\Pi[\mathrm{dist}_f(u, v)^q]^{1/q}.$$

Define for each ε ∈ [0, 1) the set G(ε) of the $(1 - \varepsilon)\binom{n}{2}$ pairs {u, v} of smallest distortion $\mathrm{dist}_f(u, v)$ over all pairs in $\binom{X}{2}$. Since f is a (1 − ε)-partial embedding for any ε ∈ [0, 1), we have that for each {u, v} ∈ G(ε), $\mathrm{dist}_f(u, v) \leq \alpha(\varepsilon)$. Let $G_i = G(2^{-i}\hat\Phi(\Pi)^{-1}) \setminus G(2^{-(i-1)}\hat\Phi(\Pi)^{-1})$. Since α is a monotonic non-increasing function, it follows that

$$\begin{aligned}
\mathbb{E}_\Pi[\mathrm{dist}_f(u, v)^q] &= \sum_{u \neq v \in X} \pi(u, v)\, \mathrm{dist}_f(u, v)^q \\
&\leq \sum_{\{u,v\} \in G(\hat\Phi(\Pi)^{-1})} \pi(u, v)\, \alpha(\hat\Phi(\Pi)^{-1})^q + \sum_{i=1}^{\lfloor \log(\binom{n}{2}\hat\Phi(\Pi)^{-1}) \rfloor} \sum_{\{u,v\} \in G_i} \pi(u, v)\, \alpha(2^{-i}\hat\Phi(\Pi)^{-1})^q \\
&\leq \sum_{u \neq v \in X} \pi(u, v) \cdot \alpha(\hat\Phi(\Pi)^{-1})^q + \sum_{i=1}^{\lfloor \log(\binom{n}{2}\hat\Phi(\Pi)^{-1}) \rfloor} |G_i| \cdot \frac{\hat\Phi(\Pi)}{\binom{n}{2}} \Big( \sum_{u \neq v \in X} \pi(u, v) \Big) \cdot \alpha(2^{-i}\hat\Phi(\Pi)^{-1})^q \\
&\leq \alpha(\hat\Phi(\Pi)^{-1})^q + \sum_{i=1}^{\lfloor \log(\binom{n}{2}\hat\Phi(\Pi)^{-1}) \rfloor} 2^{-i} \cdot \alpha(2^{-i}\hat\Phi(\Pi)^{-1})^q \\
&\leq \alpha(\hat\Phi(\Pi)^{-1})^q + 2 \int_{\frac{1}{2}\binom{n}{2}^{-1}}^{1} \hat\Phi(\Pi)\, \alpha(x\hat\Phi(\Pi)^{-1})^q\, dx.
\end{aligned}$$

¹³ Assuming the integral is defined. We note that the lemma is stated using the integral for presentation reasons.


The following, somewhat more sophisticated lemma in particular implies constant distortion of average for any scaling embedding with distortion O(log(1/ε)). Again, the argument extends to the weighted average case to provide O(log Φ̂) distortion.

Lemma 2 (Coarsely Scaling Distortion vs. Distortion of ℓq-Norm). Given an n-point metric space (X, d_X) and a metric space (Y, d_Y), if there exists an embedding f : X → Y with coarsely scaling distortion α, then for any distribution Π over $\binom{X}{2}$:¹⁴

$$\mathrm{distnorm}_q^{(\Pi)}(f) \leq \left( 2 \int_{\frac{1}{2}\binom{n}{2}^{-1}}^{1} \hat\Phi(\Pi)\, \alpha\big(x \hat\Phi(\Pi)^{-1}\big)^q\, dx \right)^{1/q} + \alpha\big(\hat\Phi(\Pi)^{-1}\big).$$

Proof. We may restrict to the case $\Phi(\Pi) \leq \binom{n}{2}$. Otherwise $\Phi(\Pi) > \binom{n}{2}$ and therefore $\mathrm{distnorm}_q^{(\Pi)}(f) \leq \mathrm{dist}(f) \leq \alpha(\hat\Phi(\Pi)^{-1})$. Recall that

$$\mathrm{distnorm}_q^{(\Pi)}(f) = \frac{\mathbb{E}_\Pi[d_Y(f(u), f(v))^q]^{1/q}}{\mathbb{E}_\Pi[d_X(u, v)^q]^{1/q}}.$$

For ε ∈ [0, 1) recall that $\hat G(\varepsilon) = \{\{x, y\} \in \binom{X}{2} \mid d(x, y) \geq \max\{r_{\varepsilon/2}(x), r_{\varepsilon/2}(y)\}\}$. Since (f, Ĝ(ε)) is a (1 − ε)-partial embedding for any ε ∈ [0, 1), we have that for each {u, v} ∈ Ĝ(ε), $\mathrm{dist}_f(u, v) \leq \alpha(\varepsilon)$. Let $G_i = \hat G(2^{-i}\hat\Phi(\Pi)^{-1}) \setminus \hat G(2^{-(i-1)}\hat\Phi(\Pi)^{-1})$. We first need to prove the following property:

$$\sum_{\{u,v\} \in G_i} d_X(u, v)^q \leq 2^{-i}\hat\Phi(\Pi)^{-1} \sum_{u \neq v \in X} d_X(u, v)^q.$$

To prove this, fix some u ∈ X. Let $S = \{v \mid \{u, v\} \notin \hat G(2^{-(i-1)}\hat\Phi(\Pi)^{-1})\}$. Then $S \subseteq B(u, r_{2^{-i}\hat\Phi(\Pi)^{-1}}(u))$. Thus $|S|/n \leq 2^{-i}\hat\Phi(\Pi)^{-1}$, and for each $v \in S$, $v' \notin S$ we have $d(u, v) \leq d(u, v')$. It follows that:

$$\sum_{v:\, u \neq v \in X} d_X(u, v)^q = \sum_{v \in S} d_X(u, v)^q + \sum_{v \notin S} d_X(u, v)^q \geq |S| \cdot \frac{\sum_{v \in S} d_X(u, v)^q}{|S|} + |\bar S| \cdot \frac{\sum_{v \in S} d_X(u, v)^q}{|S|} = \frac{n - 1}{|S|} \sum_{v \in S} d_X(u, v)^q.$$

14Assuming the integral is defined.


Since α is a monotonic non-increasing function, it follows that

  E_Π[d_Y(f(u), f(v))^q] = ∑_{u≠v∈X} π(u, v)·d_Y(f(u), f(v))^q
  = ∑_{u≠v∈X} π(u, v)·d_X(u, v)^q·dist_f(u, v)^q
  ≤ ∑_{u,v∈G(Φ(Π)^{−1})} π(u, v)·d_X(u, v)^q·α(Φ(Π)^{−1})^q + ∑_{i=1}^{⌊log(\binom{n}{2}Φ(Π)^{−1})⌋} ∑_{u,v∈G_i} π(u, v)·d_X(u, v)^q·α(2^{−i}Φ(Π)^{−1})^q
  ≤ ∑_{u≠v∈X} π(u, v)·d_X(u, v)^q·α(Φ(Π)^{−1})^q + ∑_{i=1}^{⌊log(\binom{n}{2}Φ(Π)^{−1})⌋} ∑_{u,v∈G_i} d_X(u, v)^q·Φ(Π)·min_{w≠z∈X} π(w, z)·α(2^{−i}Φ(Π)^{−1})^q
  ≤ ∑_{u≠v∈X} π(u, v)·d_X(u, v)^q·α(Φ(Π)^{−1})^q + ∑_{i=1}^{⌊log(\binom{n}{2}Φ(Π)^{−1})⌋} ∑_{u≠v∈X} 2^{−i}·d_X(u, v)^q·min_{w≠z∈X} π(w, z)·α(2^{−i}Φ(Π)^{−1})^q
  ≤ ∑_{u≠v∈X} π(u, v)·d_X(u, v)^q·α(Φ(Π)^{−1})^q + ∑_{i=1}^{⌊log(\binom{n}{2}Φ(Π)^{−1})⌋} ∑_{u≠v∈X} π(u, v)·d_X(u, v)^q·2^{−i}·α(2^{−i}Φ(Π)^{−1})^q
  ≤ E_Π[d_X(u, v)^q]·[ α(Φ(Π)^{−1})^q + 2 ∫_{(1/2)\binom{n}{2}^{−1}Φ(Π)}^{1} α(x·Φ(Π)^{−1})^q dx ].

Dividing by E_Π[d_X(u, v)^q], taking q-th roots, and using (a + b)^{1/q} ≤ a^{1/q} + b^{1/q} gives the claimed bound.

A.1 Distortion of ℓ_q-Norm for Fixed q

Lemma 1. Let 1 ≤ q ≤ ∞. For any finite metric space (X, d), there exists an embedding f from X into a star metric such that for any non-degenerate distribution Π: distnorm_q^{(Π)}(f) ≤ 2^{1/q}·(2^q − 1)^{1/q}·Φ(Π)^{1/q}. In particular:

  distnorm_q(f) ≤ 2^{1/q}·(2^q − 1)^{1/q} ≤ √6.

Proof. Let w ∈ X be the point that minimizes ( ∑_{x∈X} d(w, x)^q )^{1/q}. Let Y = X ∪ {r}. Define a star metric (Y, d′) where r is the center and for every x ∈ X: d′(r, x) = d(w, x). Thus d′(x, y) = d(w, x) + d(w, y). Then

  E_Π[d′(u, v)^q] = ∑_{u≠v∈X} π(u, v)·d′(u, v)^q = ∑_{u≠v∈X} π(u, v)·(d(u, w) + d(w, v))^q
  ≤ (2^q − 1) ∑_{u≠v∈X} π(u, v)·( d(u, w)^q + d(w, v)^q )
  ≤ (2^q − 1) ∑_{u≠v∈X} ( Φ(Π)·min_{s≠t∈X} π(s, t) )·( d(u, w)^q + d(w, v)^q )
  = (2^q − 1)·Φ(Π)·min_{s≠t∈X} π(s, t)·((n − 1)/2)·( ∑_{u∈X} d(u, w)^q + ∑_{v∈X} d(w, v)^q )
  ≤ (2^q − 1)·Φ(Π)·(n − 1)·(1/n) ∑_{z∈X} ∑_{u∈X} min_{s≠t∈X} π(s, t)·d(u, z)^q
  ≤ 2(2^q − 1)·Φ(Π)· ∑_{u≠v∈X} π(u, v)·d(u, v)^q
  = 2(2^q − 1)·Φ(Π)·E_Π[d(u, v)^q].

Since d′(u, v) ≥ d(u, v) by the triangle inequality, the embedding is non-contractive, and taking q-th roots yields distnorm_q^{(Π)}(f) ≤ 2^{1/q}·(2^q − 1)^{1/q}·Φ(Π)^{1/q}.
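The construction in this proof is directly implementable. The following Python sketch (assuming numpy; the function name is illustrative) builds the star distances d′ from a distance matrix:

import numpy as np

def star_embedding(D, q=1.0):
    # D: n x n distance matrix of (X, d).
    # Choose the center w minimizing (sum_x d(w, x)^q)^(1/q), as in the proof.
    w = int(np.argmin((D ** q).sum(axis=1)))
    Dw = D[w]
    D_star = Dw[:, None] + Dw[None, :]   # d'(x, y) = d(w, x) + d(w, y)
    np.fill_diagonal(D_star, 0.0)
    return w, D_star

By the triangle inequality D_star dominates D entrywise, so the resulting embedding is non-contractive.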

B Embedding into Ultrametrics

In this section we show a (1 − ε)-partial embedding into ultrametrics with distortion O(1/√ε).

Definition 15. An ultrametric U (or 1-HST) is a metric space (U, d_U) whose elements are the leaves of a rooted tree T. Each v ∈ T is associated with a label Δ(v) ≥ 0 such that if u ∈ T is a descendant of v then Δ(u) ≤ Δ(v), and Δ(u) = 0 iff u ∈ U is a leaf. The distance between leaves x, y ∈ U is defined as d_U(x, y) = Δ(lca(x, y)), where lca(x, y) is the least common ancestor of x and y in T.

Theorem 9. For every n-point metric space (X, d) and any ε ∈ (0, 1) there exists a (1 − ε)-partial embedding into an ultrametric with distortion O(1/√ε).

Proof. The proof is by induction on the size of X (the base case is |X| = 1 and is trivial). Assume the claim is true for any metric space with fewer than |X| points. Denote Λ = diam(X). Let u, v ∈ X be such that d_X(u, v) = Λ, and assume |B(u, Λ/3)| ≤ n/2 (otherwise switch the roles of u and v).

Let I = {1, 2, 3, …, ⌈√(5/ε)⌉}, and for every i ∈ I let A_i = B(u, i·√(ε/5)·Λ/3), α_i = |A_i|/n, and S_i = A_{i+1} \ A_i.

There are two cases to consider.

Case 1: α_1 ≤ ε. In this case, label the root of U with Λ; it will have two children: one is a leaf formed by the singleton {u}, and the other is the tree formed recursively from X \ {u}.

Let v ∈ X \ A_1; then d_X(u, v) ≥ √(ε/5)·Λ/3. Since d_U(u, v) = Λ, the distortion is O(1/√ε). Hence, only pairs (u, w) where w ∈ A_1 may need to be discarded, and using the induction hypothesis, the total number of distorted distances is at most

  ε(n − 1) + ε·\binom{n−1}{2} = ε·(n² − n)/2 = ε·\binom{n}{2},

as required.

Case 2: α_1 > ε. Notice that for all i ∈ I we have α_i ≤ 1/2, because |B(u, Λ/3)| ≤ n/2.

Claim 1. There exists i ∈ I such that |S_i|² ≤ ε·α_i·n².

Proof. Seeking a contradiction, assume that for all i ∈ I, |S_i|² = (α_{i+1} − α_i)²·n² > ε·α_i·n², and hence α_{i+1} − α_i > √(ε·α_i). Under this assumption, α_i ≥ (1/9)·ε·i² for all i ∈ I; the proof is by induction. For the base case, α_1 > ε > (1/9)·ε. For the induction step, assume α_i ≥ (1/9)·ε·i²; then

  α_{i+1} ≥ α_i + √(ε·α_i) ≥ (1/9)·ε·i² + √((1/9)·ε²·i²) = (1/9)·ε·(i² + 3i) ≥ (1/9)·ε·(i + 1)².

Therefore α_{⌈√(5/ε)⌉} ≥ (1/9)·ε·(√(5/ε))² = 5/9 > 1/2, and a contradiction follows. This completes the proof of Claim 1.

Let i ∈ I be an index such that |S_i|² ≤ ε·α_i·n². We partition X into X_1 = B(u, (i + 1/2)·√(ε/5)·Λ/3) and X_2 = X \ X_1 (we divide the shell S_i in half). Label the root of U with Λ and create two children: one is the tree formed recursively from X_1 and the other is the tree formed recursively from X_2.

Since |X_1| ≥ α_i·n and |X_2| ≥ n/2, we get

  \binom{|S_i|}{2} ≤ |S_i|²/2 ≤ ε·|X_1|·|X_2|.

Let C = { (u, v) | u ∈ X_1, v ∈ X_2 } and D = { (u, v) | u ∈ X_1 ∩ S_i, v ∈ X_2 ∩ S_i }. For any (u, v) ∈ C \ D, d(u, v) ≥ min{ d_X(X_1, X_2 \ S_i), d_X(X_1 \ S_i, X_2) } ≥ √(ε/5)·Λ/6. Hence, only distances (u, v) ∈ D may be distorted by more than O(1/√ε), and there are at most \binom{|S_i|}{2} such distances.

Using the induction hypothesis, we conclude that the total number of distorted distances is at most

  ε·\binom{|X_1|}{2} + ε·\binom{|X_2|}{2} + \binom{|S_i|}{2}
  ≤ (ε/2)·( |X_1|² − |X_1| + |X_2|² − |X_2| + 2·|X_1|·|X_2| )
  = (ε/2)·(|X_1| + |X_2|)·(|X_1| + |X_2| − 1)
  = ε·\binom{n}{2}.
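The recursive construction above translates into a short procedure. The following Python sketch (assuming numpy; constants and tie-breaking are handled loosely, and the shell selection is a heuristic stand-in for Claim 1) returns a tree in the format of Definition 15:

import numpy as np

def partial_ultrametric(D, points, eps):
    # D: full distance matrix; points: list of indices of the current subspace.
    # Returns ('leaf', x) or ('node', Delta, children), as in Definition 15.
    if len(points) == 1:
        return ('leaf', points[0])
    sub = D[np.ix_(points, points)]
    diam = sub.max()
    if diam == 0:  # all points coincide; label 0 is correct for every pair
        return ('node', 0.0, [('leaf', p) for p in points])
    i, j = np.unravel_index(np.argmax(sub), sub.shape)
    u, v = points[i], points[j]
    if sum(D[u, p] <= diam / 3 for p in points) > len(points) / 2:
        u = v  # ensure |B(u, diam/3)| <= n/2
    step = np.sqrt(eps / 5) * diam / 3
    ball = lambda r: [p for p in points if D[u, p] <= r]
    if len(ball(step)) <= eps * len(points):
        # Case 1: cut off the singleton {u}
        rest = [p for p in points if p != u]
        return ('node', diam, [('leaf', u), partial_ultrametric(D, rest, eps)])
    # Case 2: pick a shell S_i with small |S_i|^2 / alpha_i and split through its middle
    I = range(1, int(np.ceil(np.sqrt(5 / eps))) + 1)
    best = min(I, key=lambda t: (len(ball((t + 1) * step)) - len(ball(t * step))) ** 2
                                 / max(len(ball(t * step)), 1))
    X1 = set(ball((best + 0.5) * step))
    X2 = [p for p in points if p not in X1]
    return ('node', diam, [partial_ultrametric(D, sorted(X1), eps),
                           partial_ultrametric(D, X2, eps)])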

C Applications

Consider an optimization problem defined with respect to weights c(u, v) in a graph or in a metric space, where the solution involves minimizing a sum of distances weighted according to c: ∑_{u,v} c(u, v)·d(u, v). It is common for many optimization problems that such a term appears either in the objective function, or that it comes up in the linear programming relaxation of the problem.

These weights can then be normalized to define the distribution Π, where π(u, v) = c(u, v)/∑_{x,y} c(x, y), so that the goal translates into minimizing the expected distance according to the distribution Π. We can now use our results to construct embeddings with small distortion of average, as provided in Theorem 5, Theorem 7 and Theorem 8. Thus we get embeddings f into L_p and into ultrametrics with distavg^{(Π)}(f) = O(log Φ(Π)). In some of these applications it is crucial that the result holds for all such distributions Π (Theorems 5 and 7).

Define Φ(c) = Φ(Π). Note that if c(u, v) > 0 for all u ≠ v, then Φ(c) = max_{u,v} c(u, v)/min_{u,v} c(u, v).
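This normalization step is mechanical; as a minimal Python sketch (the function name and dictionary representation are illustrative):

def normalize_weights(c):
    # c: dict mapping pairs (u, v) to positive weights c(u, v).
    total = sum(c.values())
    pi = {e: w / total for e, w in c.items()}   # the distribution Pi
    phi = max(c.values()) / min(c.values())     # Phi(c), when all weights are positive
    return pi, phi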

Using this paradigm, we obtain O(log Φ(c)) approximation algorithms; since the effective aspect ratio never exceeds \binom{n}{2}, this is also O(min{log Φ(c), log n}).

The lemma below summarizes the specific propositions which will be useful in most of the applications in the sequel:

Lemma 2. Let X be a metric space with a weight function on the pairs c : \binom{X}{2} → R^+. Then:

1. There exists an embedding f : X → L_p such that for any weight function c:

  ∑_{u,v∈\binom{X}{2}} c(u, v)·‖f(u) − f(v)‖_p ≤ O(log Φ(c)) ∑_{u,v∈\binom{X}{2}} c(u, v)·d_X(u, v).

2. There is a set of ultrametrics S and a probabilistic embedding F of X into S such that for any weight function c:

  E_{f∼F}[ ∑_{u,v∈\binom{X}{2}} c(u, v)·d_Y(f(u), f(v)) ] ≤ O(log Φ(c)) ∑_{u,v∈\binom{X}{2}} c(u, v)·d_X(u, v).

3. For any given weight function c, there exists an ultrametric (Y, d_Y) and an embedding f : X → Y such that

  ∑_{u,v∈\binom{X}{2}} c(u, v)·d_Y(f(u), f(v)) ≤ O(log Φ(c)) ∑_{u,v∈\binom{X}{2}} c(u, v)·d_X(u, v).

C.1 Sparsest cut

We show an approximation algorithm for the sparsest cut problem on complete weighted graphs, i.e., for the following problem: given a complete graph G(V, E) with capacities c(u, v) : E → R^+ and demands D(u, v) : E → R^+, define the weight of a cut (S, S̄) as

  ( ∑_{u∈S, v∈S̄} c(u, v) ) / ( ∑_{u∈S, v∈S̄} D(u, v) ).

We seek a subset S ⊆ V minimizing the weight of the cut. The uniform demand case of the problem was first given an O(log n) approximation algorithm by Leighton and Rao [31]. For the general case, O(log k) approximation algorithms were given by Aumann and Rabani [5] and by Linial, London and Rabinovich [33] (where k is the number of demands), via Bourgain's embedding into L_1. Recently, Arora, Rao and Vazirani improved the uniform case bound to O(√(log n)), and subsequently Arora, Lee and Naor gave an O(√(log n)·log log n) approximation for the general demand case, based on embeddings of negative-type metrics into L_1.

We show an O(log Φ(c)) approximation. We apply the method of [33]: build the following linear program:

  min_τ ∑_{u,v} c(u, v)·τ(u, v)
  subject to: ∑_{u,v} D(u, v)·τ(u, v) ≥ 1
              τ satisfies the triangle inequality
              τ ≥ 0

If the solution were a cut metric, it would be the optimal solution. We solve the relaxed program over all metrics, obtaining a metric (V, τ), and then embed (V, τ) into ℓ_1 using f of Lemma 2. Since the embedding is non-contractive, τ(u, v) ≤ ‖f(u) − f(v)‖_1, hence

  ( ∑_{u,v} c(u, v)·‖f(u) − f(v)‖_1 ) / ( ∑_{u,v} D(u, v)·‖f(u) − f(v)‖_1 ) ≤ O(log Φ(c))·( ∑_{u,v} c(u, v)·τ(u, v) ) / ( ∑_{u,v} D(u, v)·τ(u, v) ).

Following [33], we can then obtain a cut that provides an O(log Φ(c)) approximation.
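The linear program itself is easy to state in code. Below is a sketch assuming the cvxpy package; the rounding through the ℓ_1-embedding of Lemma 2 and the cut extraction of [33] are not included:

import itertools
import cvxpy as cp

def sparsest_cut_lp(C, Dem):
    # C, Dem: symmetric n x n capacity and demand matrices (numpy arrays).
    n = C.shape[0]
    tau = cp.Variable((n, n), symmetric=True)
    cons = [cp.diag(tau) == 0, tau >= 0,
            cp.sum(cp.multiply(Dem, tau)) >= 1]          # demand constraint
    for u, v, w in itertools.permutations(range(n), 3):  # triangle inequalities
        cons.append(tau[u, v] <= tau[u, w] + tau[w, v])
    prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(C, tau))), cons)
    prob.solve()
    return tau.value                                     # the metric (V, tau)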

C.2 Multi cut

The multicut problem is the following: given a complete graph G(V, E) with weights c(u, v) : E → R^+ and k pairs (s_i, t_i) ∈ V × V, i = 1, …, k, find a minimal-weight subset E′ ⊆ E such that removing every edge in E′ disconnects every pair (s_i, t_i).

The best approximation algorithm for this problem, due to Garg, Vazirani and Yannakakis [20], has performance O(log k).


We show an O(log Φ(c)) approximation. We slightly change the method of [20] and create a linear program:

  min_τ ∑_{(u,v)∈\binom{V}{2}} c(u, v)·τ(u, v)
  subject to: ∀i, j: ∑_{(u,v)∈p_i^j} τ(u, v) ≥ 1
              τ satisfies the triangle inequality
              τ ≥ 0

where p_i^j is the j-th path from s_i to t_i. Now solve the relaxed version, obtaining a metric space (V, τ). Using (3.) of Lemma 2, we get an embedding f : V → Y into an HST (Y, d_Y). We use this metric to partition the graph, instead of the region-growing method introduced by [20].

Hence, ∑_{(u,v)∈\binom{V}{2}} c(u, v)·d_Y(u, v) ≤ O(log Φ(c)) ∑_{(u,v)∈\binom{V}{2}} c(u, v)·τ(u, v). We build a multicut E′: for every pair (s_i, t_i), find r_i = lca(s_i, t_i) and create the clusters containing all the vertices under each child of r_i; insert into E′ all the edges between the points in each such subtree and the rest of the graph. Since we have the constraint ∑_{(u,v)∈p_i^j} τ(u, v) ≥ 1, we get from the fact that f is non-contractive that Δ(r_i) = d_Y(s_i, t_i) ≥ 1. It follows that if an edge (u, v) ∈ E′ then d_Y(u, v) ≥ 1, and therefore

  ∑_{(u,v)∈E′} c(u, v) ≤ ∑_{(u,v)∈\binom{V}{2}} c(u, v)·d_Y(u, v) ≤ O(log Φ(c))·OPT.
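The cut-extraction step admits a compact sketch. Here the HST is represented as a nested tuple, as produced by the construction in Appendix B (helper names are illustrative); E′ then consists of all edges leaving the returned clusters:

def hst_multicut_clusters(root, pairs):
    # root: ('leaf', x) or ('node', Delta, children); pairs: terminal pairs (s_i, t_i).
    def leaves(t):
        return {t[1]} if t[0] == 'leaf' else {x for c in t[2] for x in leaves(c)}

    def lca(t, s1, s2):
        while t[0] == 'node':
            child = next((c for c in t[2] if {s1, s2} <= leaves(c)), None)
            if child is None:
                return t
            t = child
        return t

    clusters = []
    for s, t in pairs:
        r = lca(root, s, t)
        # the children subtrees of the lca separate s from t
        clusters.extend(leaves(c) for c in r[2])
    return clusters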

C.3 Minimum Linear Arrangement

The same idea can be used for the minimum linear arrangement problem: given an undirected graph G(V, E) with capacities c(e) for every e ∈ E, we wish to find a one-to-one arrangement of the vertices h : V → {1, …, |V|} minimizing the total edge length ∑_{(u,v)∈E} c(u, v)·|h(u) − h(v)|.

This problem was first given an O(log n·log log n) approximation by Even, Naor, Rao and Schieber [16], which was subsequently improved by Rao and Richa [40] to O(log n).

As shown in [16], this can be done using the following LP:

  min ∑_{u≠v∈V} c(u, v)·d(u, v)
  s.t. ∀U ⊆ V, ∀v ∈ U: ∑_{u∈U} d(u, v) ≥ (1/4)·(|U|² − 1)
       ∀(u, v): d(u, v) ≥ 0

which is proven there to be a lower bound on the optimal solution. Even et al. [16] use this LP formulation to define a spreading metric, which they use to recursively solve the problem in a divide-and-conquer approach. Their method can in fact be viewed as an embedding into an ultrametric (HST) (the argument is similar to the one given for the special case of the multicut problem), and so by using assertion (3.) of Lemma 2 we obtain an O(log Φ(c)) approximation.

The problem of embedding into d-dimensional meshes is basically an extension of h to d dimensions, and can be solved in the same manner.

C.4 Multiple sequence alignment

Multiple sequence alignments are important tools for highlighting similar patterns in a set of genetic or molecular sequences.

Given n strings over a small character set, the goal is to insert gaps in each string so as to minimize the total number of differing characters between all pairs of strings, where the cost of a gap is considered 0.

In their paper, [45] showed an approximation algorithm for the generalized version, where each pair of strings has an importance parameter c(u, v). They phrased the problem as finding a minimum communication cost spanning tree, i.e., finding a tree that minimizes ∑_{u,v} c(u, v)·d(u, v), where d is the edit distance. They apply probabilistic embedding into trees to bound the cost of such a tree, which gives an approximation ratio of O(log n). Using Lemma 2, we get an O(log Φ(c)) approximation.


C.5 Uncapacitated quadratic assignment

The uncapacitated quadratic assignment problem is one of the most studied problems in operations research (see the survey [36]) and is one of the main applications of metric labelling [25]. We are given three n × n non-negative input matrices C, D, F, where C is symmetric with 0 on the diagonal and D is a metric. The objective is to minimize

  min_{σ∈S_n} ∑_{i,j} C(i, j)·D(σ(i), σ(j)) + ∑_i F(i, σ(i)),

where S_n is the set of all permutations over n elements. One of the major applications of uncapacitated quadratic assignment is in location theory: C(i, j) is the material flow from facility i to facility j, D(σ(i), σ(j)) is their distance after locating them, and F(i, σ(i)) is the cost of positioning facility i at location σ(i).

Unlike the previous applications, here C is not a fixed weight function on the metric D; the actual weights depend on σ, which is determined by the algorithm. Hence we require the probabilistic result (2.) of Lemma 2, which is oblivious to the weight function C. Kleinberg and Tardos [25] gave an approximation algorithm based on probabilistic embedding into ultrametrics. They give an O(1) approximation algorithm for an ultrametric (they in fact use a 3-HST). This implies an O(log k) approximation for general metrics, where k is the number of labels.

As uncapacitated quadratic assignment is a special case of metric labelling, it can be solved in the same manner, yielding an O(log Φ(C)) approximation ratio by applying result (2.) of Lemma 2 together with the O(1) approximation for ultrametrics of [25].

C.6 Min-sum k-clustering

Recall the min-sum k-clustering problem, where one has to partition a graph H into k clusters C_1, …, C_k so as to minimize

  ∑_{i=1}^{k} ∑_{u,v∈C_i} d_H(u, v).

[11] showed that using probabilistic embedding into HSTs and solving the problem there by dynamic programming yields a constant approximation factor on the HST, and therefore a factor of O((1/ε)·(log n)^{1+ε}) in H, with running time n^{O(1/ε)}. Let Φ = Φ(d).

Lemma 3. For a graph H equipped with the shortest path metric, if Φ ≤ 2^{log n/log log n}, then there is a polynomial time O(log(kΦ)) approximation for the min-sum k-clustering problem.

Denote by OPT the optimum solution for the problem, with clusters C_i^{OPT}; by OPT_T the optimum solution on a tree T from the family of HSTs T, with clusters C_i^{OPT_T}; and by ALG the result of the algorithm of [11], with clusters C_i^{ALG_T}. We know that by using probabilistic (1 − ε)-partial embedding of H into T, edges e ∈ G are expanded by O(log(1/ε)), and for e ∉ G the maximum expansion is Φ (no distance is contracted). Therefore, choosing ε = 1/(kΦ) yields:

  E[ALG] = ∑_{T∈T} Pr[T] ∑_{i=1}^{k} ∑_{u,v∈C_i^{ALG_T}} d_H(u, v) ≤ ∑_{T∈T} Pr[T] ∑_{i=1}^{k} ∑_{u,v∈C_i^{ALG_T}} d_T(u, v)
  ≤ O(1) ∑_{T∈T} Pr[T] ∑_{i=1}^{k} ∑_{u,v∈C_i^{OPT_T}} d_T(u, v) ≤ O(1) ∑_{T∈T} Pr[T] ∑_{i=1}^{k} ∑_{u,v∈C_i^{OPT}} d_T(u, v)
  ≤ O(1) [ ∑_{i=1}^{k} ∑_{u,v∈C_i^{OPT}∩G} ∑_{T∈T} Pr[T]·d_T(u, v) + ∑_{i=1}^{k} ∑_{u,v∈C_i^{OPT}\G} ∑_{T∈T} Pr[T]·d_T(u, v) ]
  ≤ O(1) [ ∑_{i=1}^{k} ∑_{u,v∈C_i^{OPT}∩G} O(log(1/ε))·d_H(u, v) + ∑_{i=1}^{k} ∑_{u,v∈C_i^{OPT}\G} Φ ]
  ≤ O(log(1/ε))·OPT + ε·n²·Φ = O(log(kΦ))·OPT + n²/k = O(log(kΦ))·OPT,

where the last equality follows from the fact that n²/k ≤ OPT (assuming we scaled the distances such that min_{u≠v∈H} d_H(u, v) ≥ 1).

Let the clusters be of sizes a_1, …, a_k; naturally ∑_{i=1}^{k} a_i = n, and there are ∑_{i=1}^{k} a_i² edges inside the clusters. Let a = (a_1, …, a_k) and b = (1, 1, …, 1) ∈ R^k. From Cauchy-Schwarz we get

  ( ∑_{i=1}^{k} a_i )² = (a·b)² ≤ ‖a‖²·‖b‖² = ( ∑_i a_i² )·k,

and therefore ∑_i a_i² ≥ n²/k, meaning OPT ≥ n²/k.

The running time of the algorithm is O(log n)^{O(log Φ)} = n^{O(1)}, since the number of levels in the HST is relatively small, O(log Φ) (see [11] for details).

D Distance Oracles

A distance oracle for a metric space (X, d), |X| = n, is a data structure that, given any pair of points, returns an estimate of their distance. In this section we study distance oracles with scaling distortion and partial distance oracles.

D.1 Distance oracles with scaling distortion

For a distance oracle of size O(n^{1+1/k}), the worst-case stretch can indeed be 2k − 1 for some pairs in some graphs. However, we prove the existence of distance oracles with a scaling stretch property: for these distance oracles, the average stretch over all pairs is only O(1).

We repeat the preprocessing and distance query algorithms of Thorup and Zwick [44], with sampling probability 3n^{−1/k}·ln n for the first set and n^{−1/k} thereafter.

  Given (X, d) and parameter k:
  A_0 := X; A_k := ∅;
  for i = 1 to k − 1
      let A_i contain each element of A_{i−1}, independently with probability
          3n^{−1/k}·ln n if i = 1, and n^{−1/k} if i > 1;
  for every x ∈ X
      for i = 0 to k − 1
          let p_i(x) be the nearest node in A_i, so d(x, A_i) = d(x, p_i(x));
          let B_i(x) := { y ∈ A_i \ A_{i+1} | d(x, y) < d(x, A_{i+1}) };

Figure 1: Preprocessing algorithm.

  Given x, y ∈ X:
  z := x; i := 0;
  while z ∉ B_i(y)
      i := i + 1;
      (x, y) := (y, x);
      z := p_i(x);
  return d(x, z) + d(z, y);

Figure 2: Distance query algorithm.
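For reference, a direct Python transcription of Figures 1 and 2 follows. This is a sketch: distances are recomputed on the fly rather than stored in hash tables, the per-level bunches B_i(x) are merged into one bunch per point, and degenerate levels (an empty A_i) are not handled:

import math
import random

def preprocess(X, d, k):
    n = len(X)
    A = [list(X)]                       # A_0 = X
    for i in range(1, k):
        q = 3 * n ** (-1 / k) * math.log(n) if i == 1 else n ** (-1 / k)
        A.append([x for x in A[i - 1] if random.random() < q])
    A.append([])                        # A_k = empty set
    p = [dict() for _ in range(k)]
    B = {x: set() for x in X}           # bunch of x: union of the B_i(x)
    for x in X:
        for i in range(k):
            nxt = {d(x, y) for y in A[i + 1]}
            bound = min(nxt) if nxt else float('inf')    # d(x, A_{i+1})
            p[i][x] = min(A[i], key=lambda y: d(x, y))   # pivot p_i(x)
            B[x] |= {y for y in A[i] if y not in set(A[i + 1]) and d(x, y) < bound}
    return p, B

def query(d, p, B, x, y):
    z, i = x, 0
    while z not in B[y]:
        i += 1
        x, y = y, x
        z = p[i][x]
    return d(x, z) + d(z, y)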

Theorem 11. Let (X, d) be a finite metric space and let k = O(ln n) be a parameter. The metric space can be preprocessed in polynomial time, producing a data structure of size O(n^{1+1/k}·log n), such that distance queries can be answered in O(k) time. The distance oracle has coarsely scaling distortion bounded by 2⌈log(2/ε)·k/log n⌉ + 1.

Proof. Fix ε ∈ (0, 1) and x, y ∈ G(ε). Let j be the integer such that n^{j/k} ≤ εn/2 < n^{(j+1)/k}. We prove by induction that at the end of the ℓ-th iteration of the while loop of the distance query algorithm:

1. d(x, z) ≤ d(x, y)·max{1, ℓ − j};
2. d(z, y) ≤ d(x, y)·max{2, ℓ − j + 1}.

Observe that

  Pr[ B(x, r_{n^{(i−k)/k}}(x)) ∩ A_i = ∅ ] ≤ (1 − 3n^{−i/k}·ln n)^{n^{i/k}} ≤ n^{−3}

for all x ∈ X and i ∈ {0, 1, 2, …, k − 1}. Hence, with high probability, (1.) holds for any ℓ < j, since d(x, p_{ℓ+1}(x)) ≤ r_{ε/2}(x) ≤ d(x, y), and (2.) follows from (1.) and the triangle inequality. For ℓ ≥ j, from the induction hypothesis, at the beginning of the ℓ-th iteration d(z′, y) ≤ d(x, y)·max{1, ℓ − j}, where z′ = p_ℓ(x), z′ ∈ A_ℓ. Since z′ ∉ B_ℓ(y), after the swap (the line (x, y) := (y, x)) we have

  d(x, z) = d(x, p_{ℓ+1}(x)) ≤ d(x, y)·max{1, ℓ − j},

and d(z, y) ≤ d(x, y)·max{2, ℓ − j + 1} follows from the triangle inequality. This completes the inductive argument. Since p_{k−1}(x) ∈ A_{k−1} = B_{k−1}(y), we have ℓ ≤ k − 1, and therefore the stretch of the response is bounded by

  2(k − j) − 1 ≤ 2⌈log(2/ε)·k/log n⌉ + 1.
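Heuristically, plugging this scaling stretch into the integral bound of Lemma 1 (q = 1 and uniform Π, so Φ(Π) = 1; the oracle's guarantee is coarsely scaling, for which Lemma 2 gives the analogous bound) indicates where the O(1) average stretch comes from: with α(ε) = 2⌈log(2/ε)·k/log n⌉ + 1 ≤ (2k/log n)·log(2/ε) + 3 and the lower limit of the integral extended to 0,

  avgdist(f) ≤ 2 ∫_0^1 [ (2k/log n)·log(2/x) + 3 ] dx + O(1) = O(k/log n) + O(1) = O(1)

for k = O(log n).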

We note that a similar argument showing scaling stretch can be given for a variation of Thorup and Zwick's compact routing scheme [43].

D.2 Partial distance oracles

We construct a distance oracle with linear memory that guarantees bounded stretch for a 1 − ε fraction of the pairs.

Theorem 11. Let (X, d) be a finite metric space, let 0 < ε < 1 be a parameter, and let k ≤ O(log(1/ε)). The metric space can be preprocessed in polynomial time, producing a data structure such that distance queries can be answered in O(k) time, with either one of the following properties:

1. Either with O( n·log(1/ε) + k·((log(1/ε))/ε)^{1+1/k} ) size, and stretch 6k − 1 for some set G ⊆ \binom{X}{2}, |G| ≥ (1 − ε)·\binom{n}{2}.

2. Or with O( n·log n·log(1/ε) + k·log n·((log(1/ε))/ε)^{1+1/k} ) size, and stretch 6k − 1 for the set G(ε).

Proof. We begin with the proof of (1.). Let b = ⌈(8/ε)·ln(16/ε)⌉ and let B be a set of b beacons chosen uniformly at random. Construct the distance oracle of [44] on the subspace (B, d) with parameter k ≤ log b, yielding stretch 2k − 1 and using O(k·b^{1+1/k}) storage. For every x ∈ X let p(x) be the closest node in B. The resulting data structure's size is O(n·log b) + O(k·b^{1+1/k}) = O(n·log b + k·b^{1+1/k}). Queries are processed as follows: given two nodes x, y ∈ X, let r be the response of the distance oracle on the beacons p(x), p(y); then return d(x, p(x)) + r + d(p(y), y).

Observe that, by the triangle inequality, the response is at least d(x, y). For any x ∈ X, let E_x be the event

  E_x = { d(x, B) > r_{ε/8}(x) }.

Then Pr[E_x] ≤ (1 − ε/8)^b ≤ ε/16, and so by Markov's inequality, Pr[ |{x ∈ X | E_x}| ≤ εn/8 ] ≥ 1/2. In such a case, let

  G = { (x, y) ∈ \binom{X}{2} | ¬E_x ∧ ¬E_y ∧ d(x, y) ≥ max{r_{ε/8}(x), r_{ε/8}(y)} }.

We bound the size of G: at most εn/8 nodes are removed due to E_x occurring, and from each remaining node at most εn/8 distances are discarded, so |G| ≥ \binom{n}{2} − εn²/4 ≥ (1 − ε)·\binom{n}{2}. For (x, y) ∈ G we have d(p(x), p(y)) ≤ 3d(x, y), so from the distance oracle r ≤ (6k − 3)·d(x, y); in addition, max{d(x, p(x)), d(y, p(y))} ≤ d(x, y), so the stretch is bounded by 6k − 1.

The proof of (2.) is a slight modification of the above procedure. Let m = ⌈3·ln n⌉ and let B_1, …, B_m be sets, each containing b beacons chosen uniformly at random. Let DO_i be the distance oracle on (B_i, d). For every x ∈ X we store p_1(x), …, p_m(x), where p_i(x) is the closest node in B_i. The resulting data structure's size is O(n·log b·ln n) + O(k·b^{1+1/k}·ln n) = O(n·log b·ln n + k·b^{1+1/k}·ln n). Queries are processed as follows: given two nodes x, y ∈ X, let r_i be the response of the distance oracle DO_i on the beacons p_i(x), p_i(y); then return min_{1≤i≤m} { d(x, p_i(x)) + r_i + d(p_i(y), y) }.

For every (x, y) ∈ \binom{X}{2} and 1 ≤ i ≤ m, define the event E^i_{x,y} = { d(x, B_i) > r_{ε/8}(x) ∨ d(y, B_i) > r_{ε/8}(y) }. Since the beacons are chosen independently, for every (x, y) ∈ G(ε/4), with high probability there exists some 1 ≤ i ≤ m such that ¬E^i_{x,y}. Hence, by an argument similar to that of (1.) above, the stretch of d(x, p_i(x)) + r_i + d(p_i(y), y) is at most 6k − 1.
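A sketch of the construction in part (1.) follows, simplified to use an exact distance table on the beacons in place of the oracle of [44]; this preserves the stretch argument but not the stated space bound, and the helper names are illustrative:

import math
import random

def build_partial_oracle(X, d, eps):
    b = math.ceil((8 / eps) * math.log(16 / eps))
    B = random.sample(list(X), min(b, len(X)))           # beacons
    table = {(u, v): d(u, v) for u in B for v in B}      # beacon-beacon distances
    p = {x: min(B, key=lambda u: d(x, u)) for x in X}    # closest beacon p(x)
    dp = {x: d(x, p[x]) for x in X}
    def query(x, y):
        # the response is at least d(x, y) by the triangle inequality
        return dp[x] + table[(p[x], p[y])] + dp[y]
    return query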

E Constructing Hierarchical Probabilistic Partitions

In this section we provide details on the proof of Lemma 2. The main building block in the proof of the lemma is a lemma about uniformly padded probabilistic partitions.

Definition 16 (Probabilistic Partition). A probabilistic partition P̂ of a finite metric space (X, d) is a distribution over a set P of partitions of X. Given Δ > 0, P̂ is Δ-bounded if each P ∈ P is Δ-bounded.

Definition 17 (Uniformly Padded Probabilistic Partition). Given Δ > 0, let P̂ be a Δ-bounded probabilistic partition of (X, d). Given a collection of functions η = {η_P : X → [0, 1] | P ∈ P} and δ ∈ (0, 1], P̂ is called (η, δ)-padded if the following condition holds for any x ∈ X:

  Pr[ B(x, η_P(x)·Δ) ⊆ P(x) ] ≥ δ.

We say P̂ is uniformly padded if for any P ∈ P the function η_P is uniform with respect to P.

Lemma 4 (Uniform Padding Lemma). Let (X, d) be a finite metric space and let Z ⊆ X. Let Δ̂ be such that diam(Z) ≤ Δ̂, let Δ be such that Δ ≤ Δ̂/4, and let Γ be such that Γ ≥ 4Δ̂/Δ. Let δ̂ ∈ (0, 1/2]. There exists a Δ-bounded probabilistic partition P̂ of (Z, d) and collections of uniform functions {ξ_P : X → {0, 1} | P ∈ P} and {η_P : X → [0, 1/ln(1/δ̂)] | P ∈ P}, such that for any δ̂ ≤ δ ≤ 1 and η^{(δ)} defined by η^{(δ)}_P(x) = η_P(x)·ln(1/δ), the probabilistic partition P̂ is (η^{(δ)}, δ)-uniformly padded, and the following conditions hold for any P ∈ P and any x ∈ Z:

• If ξ_P(x) = 1 then: 2^{−7}/ln ρ(x, Δ, Γ) ≤ η_P(x) ≤ 2^{−7}/ln(1/δ̂).
• If ξ_P(x) = 0 then: η_P(x) = 2^{−7}/ln(1/δ̂) and ρ(x, Δ, Γ) < 1/δ̂.

The proof of Lemma 4 is a non-trivial generalization of arguments of [6, 8] and is given in [9]. Using this lemma, we can prove the following lemma on uniformly padded hierarchical probabilistic partitions,^{15} from which Lemma 2 is derived.

from which Lemma2 is derived.

Lemma 5 (Hierarchical Uniform Padding Lemma). Let Γ = 64 and let δ̂ ∈ (0, 1/2]. Given a finite metric space (X, d), there exists a probabilistic 4-hierarchical partition Ĥ of (X, d) and uniform collections of functions ξ = {ξ_{P,i} : X → {0, 1} | P ∈ H, i ∈ I} and η = {η_{P,i} : X → [0, 1/ln(1/δ̂)] | P ∈ H, i ∈ I}, such that for any δ̂ ≤ δ ≤ 1 and η^{(δ)} defined by η^{(δ)}(x) = η(x)·ln(1/δ), we have that Ĥ is (η^{(δ)}, δ)-uniformly padded, and the following properties hold:

• ∑_{j≤i} ξ_{P,j}(x)·η^{(δ)}_{P,j}(x)^{−1} ≤ 2^{11}·ln( n/|B(x, Δ_{i+4})| )/ln(1/δ),

and for any P ∈ H, 0 < i ∈ I, P_i ∈ P:

• If ξ_{P,i}(x) = 1 then: η_{P,i}(x) ≤ 2^{−8}/ln(1/δ̂).
• If ξ_{P,i}(x) = 0 then: η_{P,i}(x) ≥ 2^{−8}/ln(1/δ̂) and ρ(x, Δ_{i−1}, Γ) < 1/δ̂.

^{15} A variant of this lemma appears also in [9].


Proof. We create a random hierarchical partition P. By definition, P_0 consists of a single cluster equal to X. Set, for all x ∈ X, Δ_0 = diam(X), η_{P,0}(x) = 1/ln(1/δ̂) and ξ_{P,0}(x) = 0. For each i ∈ Z we set Δ_i = 4^{−i}·Δ_0. The rest of the levels of the partition are created by invoking Lemma 4 iteratively. For 0 < i ∈ I, assume we have created the clusters in P_{i−1} and set Δ̂ = Δ_{i−1}. Now, for each cluster S ∈ P_{i−1}, invoke Lemma 4 to create a Δ_i-bounded probabilistic partition Q̂[S] of (S, d). Let Q[S] be the generated partition and set P_i = ∪_{S∈P_{i−1}} Q[S]. Let ξ′_{Q[S]}, η′_{Q[S]} be the uniform functions defined in Lemma 4. Recall that for δ′ ≥ δ̂ we have that Q̂[S] is (η′^{(δ′)}, δ′)-uniformly padded, where η′^{(δ′)}_{Q[S]}(x) = η′_{Q[S]}(x)·ln(1/δ′). Define η_{P,i}(x) = min{ (1/2)·η′_{Q[S]}(x), 2·η_{P,i−1}(x) } and let η^{(δ)}_{P,i}(x) = η_{P,i}(x)·ln(1/δ). If it is the case that η_{P,i}(x) = (1/2)·η′_{Q[S]}(x) and also ξ′_{Q[S]}(x) = 0, then set ξ_{P,i}(x) = 0; otherwise ξ_{P,i}(x) = 1.

Setting δ′ = δ^{1/2} ≥ δ̂, observe that by definition: η^{(δ)}_{P,i}(x) = min{ η′^{(δ′)}_{Q[S]}(x), 2·η^{(δ)}_{P,i−1}(x) }.

Note that for i ∈ I and x, y ∈ X such that P_i(x) = P_i(y), it follows by induction that η_{P,i}(x) = η_{P,i}(y) (and hence η^{(δ)}_{P,i}(x) = η^{(δ)}_{P,i}(y)) and ξ_{P,i}(x) = ξ_{P,i}(y), by using the fact that η′ and ξ′ are uniform functions with respect to Q[S], where S = P_{i−1}(x) = P_{i−1}(y).

We prove by induction on i that the uniform padding property of the lemma holds for all δ ≥ δ̂. Assume it holds for i − 1; we will prove it for i. Now fix some appropriate value of δ and let B_i = B(x, η^{(δ)}_{P,i}(x)·Δ_i). We have:

  Pr[B_i ⊆ P_i(x)] = Pr[B_i ⊆ P_{i−1}(x)] · Pr[B_i ⊆ P_i(x) | B_i ⊆ P_{i−1}(x)].   (1)

As η^{(δ)}_{P,i}(x) ≤ η′^{(δ′)}_{Q[P_{i−1}(x)]}(x), we have B_i ⊆ B(x, η′^{(δ′)}_{Q[P_{i−1}(x)]}(x)·Δ_i). It follows that if B_i ⊆ P_{i−1}(x) then B_i ⊆ B_{P_{i−1}}(x, η′^{(δ′)}_{Q[P_{i−1}(x)]}(x)·Δ_i). Using Lemma 4, we have Pr[B_i ⊆ P_i(x) | B_i ⊆ P_{i−1}(x)] ≥ δ′.

Next, observe that by definition η^{(δ)}_{P,i}(x) ≤ 2·η^{(δ)}_{P,i−1}(x). A simple induction on i shows that η^{(δ)}_{P,i−1}(x) ≤ 2·η^{(δ^{1/2})}_{P,i−1}(x) for any δ ≥ δ̂. Since Δ_i = Δ_{i−1}/4, we get that η^{(δ)}_{P,i}(x)·Δ_i ≤ η^{(δ′)}_{P,i−1}(x)·Δ_{i−1} for δ′ = δ^{1/2}. We therefore obtain that B_i ⊆ B(x, η^{(δ′)}_{P,i−1}(x)·Δ_{i−1}). Using the induction hypothesis, we get Pr[B_i ⊆ P_{i−1}(x)] ≥ δ′. We conclude from (1) above that the inductive claim holds: Pr[B_i ⊆ P_i(x)] ≥ δ′·δ′ = δ. This completes the proof that Ĥ is (η^{(δ)}, δ)-uniformly padded.

We now turn to prove the properties stated in the lemma. Consider some i ∈ I and x ∈ X. The second property holds as η_{P,i}(x) ≤ (1/2)·η′_{Q[P_{i−1}(x)]}(x) ≤ 2^{−8}/ln(1/δ̂), using Lemma 4. Let us prove the third property. By definition, if ξ_{P,i}(x) = 0 then η_{P,i}(x) = (1/2)·η′_{Q[P_{i−1}(x)]}(x) and ξ′_{Q[P_{i−1}(x)]}(x) = 0. Using Lemma 4, we have that η_{P,i}(x) = 2^{−8}/ln(1/δ̂) and that ρ(x, Δ_{i−1}, Γ) < 1/δ̂.

It remains to prove the first property of the lemma. Define ψ_{P,i}(x) = 2^{−8}·ξ_{P,i}(x)·η_{P,i}(x)^{−1}. We claim that the following recursion holds: ψ_{P,i}(x) ≤ max{ ln ρ(x, Δ_{i−1}, Γ), ψ_{P,i−1}(x)/2 }. By definition, if ξ_{P,i}(x) = 1 then one of the two following cases is possible. In the first case, η_{P,i}(x) = 2·η_{P,i−1}(x) ≤ (1/2)·η′_{Q[P_{i−1}(x)]}(x). Since η_{P,i}(x) ≤ 2^{−8}/ln(1/δ̂) by the second property, it follows that η_{P,i−1}(x) = η_{P,i}(x)/2 ≤ (1/2)·2^{−8}/ln(1/δ̂) = 2^{−9}/ln(1/δ̂). By the third property just proved above, we deduce that ξ_{P,i−1}(x) = 1. Therefore ψ_{P,i}(x) = 2^{−8}·ξ_{P,i}(x)·η_{P,i}(x)^{−1} = 2^{−8}·ξ_{P,i−1}(x)·η_{P,i−1}(x)^{−1}/2 = ψ_{P,i−1}(x)/2. In the second case, η_{P,i}(x) = (1/2)·η′_{Q[P_{i−1}(x)]}(x) and ξ′_{Q[P_{i−1}(x)]}(x) = 1. Now, using Lemma 4, we get that η_{P,i}(x) ≥ 2^{−8}/ln ρ(x, Δ_{i−1}, Γ), and hence ψ_{P,i}(x) ≤ ln ρ(x, Δ_{i−1}, Γ).

We conclude that the following recursion holds: ψ_{P,i}(x) ≤ ln ρ(x, Δ_{i−1}, Γ) + ψ_{P,i−1}(x)/2. A simple induction on t shows that for any 0 ≤ t < i: ∑_{t<j≤i} ψ_{P,j}(x) ≤ 2·∑_{t<j≤i} ln ρ(x, Δ_{j−1}, Γ) + (1 − 2^{t−i})·ψ_{P,t}(x). Now observe that, as Γ = 64, for any j ∈ I:

  ln ρ(x, Δ_j, Γ) = ln( |B(x, Δ_j·Γ)|/|B(x, Δ_j/Γ)| ) = ∑_{h=−4}^{3} ln( |B(x, 4Δ_{j+h})|/|B(x, Δ_{j+h})| ).

It follows that

  ∑_{0<j≤i} ψ_{P,j}(x) ≤ 2·∑_{0<j≤i} ln ρ(x, Δ_{j−1}, Γ) = 2·∑_{0<j≤i} ∑_{h=−4}^{3} ln( |B(x, 4Δ_{j+h})|/|B(x, Δ_{j+h})| )
  = 2·∑_{h=−4}^{3} ∑_{0<j≤i} ln( |B(x, 4Δ_{j+h})|/|B(x, Δ_{j+h})| ) ≤ 8·ln( n/|B(x, Δ_{i+4})| ),

where the last step uses the fact that for each fixed h the inner sum telescopes. This completes the proof of the first property of the lemma.
