



The University of Texas at Austin
Department of Electrical and Computer Engineering

EE381V: Large Scale Learning — Spring 2013

Assignment 1

Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013.

(Problems 1 and 2 have been adapted from Jure Leskovec’s course on ‘Mining Massive Datasets’)

1. Locality Sensitive Hashing

The first week of class discussed locality sensitive hashing. For some standard similarity functions, like the Jaccard similarity, we showed that there is a corresponding locality sensitive hashing scheme. As it turns out, not all similarity functions have a locality sensitive hashing scheme. This problem explores this issue.

Recall that a locality sensitive hashing scheme is a set F of hash functions that operate on a set S of objects, such that for two objects x, y ∈ S,

Pr_{h∈F}[h(x) = h(y)] = sim(x, y)

where sim(·) : S × S → [0, 1] is a pairwise function (the similarity function).

• (5 points) Let d(x, y) = 1 − sim(x, y). Prove that for sim(·) to have a locality sensitive hashing scheme, d(x, y) should satisfy the triangle inequality

d(x, y) + d(y, z) ≥ d(x, z)

for all x, y, z ∈ S.

• (5 points) Consider the following two similarity functions: the so-called Overlap similarity function,

sim_Over(A, B) = |A ∩ B| / min(|A|, |B|),

and the Dice similarity function,

sim_Dice(A, B) = 2|A ∩ B| / (|A| + |B|),

where A and B are two sets.

Is there a locality sensitive hashing scheme for either? Prove, or disprove by giving a counterexample.

Solution: Let x, y, z ∈ S. Then,



P(h(x) ≠ h(y)) = P(h(x) ≠ h(y), h(x) = h(z)) + P(h(x) ≠ h(y), h(x) ≠ h(z))
              ≤ P(h(y) ≠ h(z)) + P(h(x) ≠ h(z)).

Since P(h(x) ≠ h(y)) = 1 − sim(x, y) = d(x, y) (and similarly for the other pairs), this gives

1 − sim(x, y) ≤ (1 − sim(y, z)) + (1 − sim(x, z)),

i.e.,

d(x, y) ≤ d(y, z) + d(z, x).

Now let A, B, C ⊆ {1, 2, . . . , 10}, with A = {1, 2, 3, 4, 5}, B = {1, 2, 6, 7, 8}, C = {1, 2, 3, 4, 6, 8}. Then sim_Over(A, B) = 2/5, sim_Over(B, C) = 4/5, sim_Over(A, C) = 4/5. Therefore,

d(A, B) = 1 − 2/5 = 3/5 > 2/5 = (1 − 4/5) + (1 − 4/5) = d(A, C) + d(B, C),

so d(·) violates the triangle inequality, and hence sim_Over(·) cannot have an LSH scheme. The same example works for the Dice similarity: sim_Dice(A, B) = 2/5 while sim_Dice(A, C) = sim_Dice(B, C) = 8/11, so d(A, B) = 3/5 > 6/11 = d(A, C) + d(B, C). Hence sim_Dice(·) cannot have an LSH scheme either.
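As a quick sanity check of this counterexample (not part of the original solution), the two similarity functions and the triangle-inequality test can be evaluated directly; a minimal Python sketch:

# Check the counterexample numerically (illustrative only).
def sim_over(A, B):
    return len(A & B) / min(len(A), len(B))

def sim_dice(A, B):
    return 2 * len(A & B) / (len(A) + len(B))

A = {1, 2, 3, 4, 5}
B = {1, 2, 6, 7, 8}
C = {1, 2, 3, 4, 6, 8}

for name, sim in [("Overlap", sim_over), ("Dice", sim_dice)]:
    d = lambda X, Y: 1 - sim(X, Y)
    # The triangle inequality would require d(A,B) <= d(A,C) + d(B,C); both print False.
    print(name, d(A, B), "<=", d(A, C) + d(B, C), "?", d(A, B) <= d(A, C) + d(B, C))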

2. Approximate Near Neighbor Search using LSH

LSH has been used for nearest neighbor search in numerous applications. This problem explores this.

Given a data set A, along with a distance function d(·), the Nearest Neighbor problem says the following: given a query point, z, return its nearest neighbors, w.r.t. d(·). The approximate nearest neighbor problem is an approximation in two respects: it only requires us to know the immediate neighborhood of any given point, and also, we need only return approximate nearest neighbors, up to a dilation factor λ. More precisely, the (c, λ)-Approximate Near Neighbor (ANN) problem is as follows: Given a query point, z, for which (we assume) there exists a point x ∈ A with d(x, z) ≤ λ, return a point x′ from the dataset with d(x′, z) ≤ cλ.

We outline an approximate nearest neighbor algorithm, and then, through the parts of this problem, show that with large probability, it outputs a (c, λ)-approximate near neighbor, as explained above.

Let A be a dataset with n points from a metric space with distance measure d(·). Let H be a (λ, cλ, p_1, p_2) locality sensitive family of hash functions for the distance measure d(·). Let G = H^k = {g = (h_1, . . . , h_k) | h_i ∈ H}, where k = log_{1/p_2}(n), be an amplified family. Choose L = n^ρ random members g_1, . . . , g_L ∈ G, where ρ = log(1/p_1)/log(1/p_2). We then do the following: (a) hash all the data points and the query point z using all g_i (1 ≤ i ≤ L); (b) retrieve at most 3L data points from the buckets g_j(z) (1 ≤ j ≤ L); and (c) report the closest one as a (c, λ)-ANN.
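To make steps (a)-(c) concrete, here is a minimal, illustrative Python sketch of the lookup procedure. The helper hash_family() and the function name ann_query are hypothetical placeholders (this is not the course-provided MATLAB code); the sketch only shows the bucketing-and-retrieval logic.

from collections import defaultdict

def ann_query(data, z, hash_family, d, k, L):
    """Illustrative (c, lambda)-ANN lookup.
    data: list of points; z: query point; d: distance function;
    hash_family(): draws one hash function h from H (hypothetical helper);
    k: hash-key length; L: number of hash tables."""
    # Amplified hash functions g = (h_1, ..., h_k), L of them.
    gs = [[hash_family() for _ in range(k)] for _ in range(L)]

    # (a) Hash all data points into L tables, keyed by the k-tuple of hash values.
    tables = [defaultdict(list) for _ in range(L)]
    for x in data:
        for j, g in enumerate(gs):
            tables[j][tuple(h(x) for h in g)].append(x)

    # (b) Retrieve at most 3L data points from the buckets g_j(z).
    candidates = []
    for j, g in enumerate(gs):
        candidates.extend(tables[j].get(tuple(h(z) for h in g), []))
        if len(candidates) >= 3 * L:
            candidates = candidates[:3 * L]
            break

    # (c) Report the closest retrieved point; the analysis below bounds the
    # probability that it fails to be a (c, lambda)-ANN.
    return min(candidates, key=lambda x: d(x, z)) if candidates else None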

• (5 points) Define W_j = {x ∈ A | g_j(x) = g_j(z)} (1 ≤ j ≤ L) as the random set of data points x hashed to the same bucket as the query point z by the hash function g_j. Let T = {x ∈ A | d(x, z) > cλ}. Prove:

Pr[ Σ_{j=1}^{L} |T ∩ W_j| > 3L ] < 1/3.



• (5 points) Let x* ∈ A be a data point such that d(x*, z) ≤ λ. Prove:

Pr[ g_j(x*) ≠ g_j(z), ∀ 1 ≤ j ≤ L ] < 1/e.

• (5 points) Find a bound on δ, the probability that the reported point is an actual (c, λ)-ANN.

Solution: Let |T| = m ≤ n. Suppose x ∈ T. Then,

P(g(x) = g(z)) = P(h(x) = h(z))^k ≤ p_2^k = p_2^{log_{1/p_2} n} = 1/n.

Hence E(|T ∩ W_j|) = Σ_{x∈T} P(g(x) = g(z)) ≤ m/n < 1 (note m < n, since the point x* with d(x*, z) ≤ λ, assumed to exist, does not belong to T). Now, using Markov's inequality,

P[ Σ_{j=1}^{L} |T ∩ W_j| > 3L ] ≤ E[ Σ_{j=1}^{L} |T ∩ W_j| ] / (3L) = L · E(|T ∩ W_j|) / (3L) < 1/3.

Now,

P[ g_j(x*) ≠ g_j(z), ∀ 1 ≤ j ≤ L ] = ( P(g_1(x*) ≠ g_1(z)) )^L   (the g_j are drawn independently)
                                   = ( 1 − P(g_1(x*) = g_1(z)) )^L
                                   ≤ ( 1 − p_1^k )^L
                                   = ( 1 − p_1^{log_{1/p_2} n} )^L = ( 1 − n^{−ρ} )^L
                                   ≤ e^{−L n^{−ρ}} = e^{−1}   (using 1 − x ≤ e^{−x} and L = n^ρ).

Let E be the error event, and suppose x* ∈ A with d(x*, z) ≤ λ. An error may occur if (a) there are more than 3L points from the set T in the L buckets g_j(z), so that all 3L sampled points may come from T (call this event E_1), or (b) the point x* is not hashed into any of the L buckets g_j(z) (call this event E_2). Note that an error cannot happen if the event (E_1 ∪ E_2)^c occurs. The probability of error can be bounded as follows:

P(E) = P(E, E_1 ∪ E_2) ≤ P(E, E_1) + P(E, E_2) ≤ P(E_1) + P(E_2)
     = P[ Σ_{j=1}^{L} |T ∩ W_j| > 3L ] + P[ g_j(x*) ≠ g_j(z), ∀ 1 ≤ j ≤ L ]
     ≤ 1/3 + 1/e = ε (say).



Hence we can guarantee that the reported point is an actual (c, λ)-ANN with probability greater than δ = 1 − ε = 1 − 1/3 − 1/e ≈ 0.30.

3. This problem tests empirically how nearest-neighbor search using LSH compares to linear search. For this, we provide a dataset of images. Each column in the dataset is a vectorized 20×20 image patch. Download the image set and MATLAB code here: http://users.ece.utexas.edu/~cmcaram/LSL2013/lsh.zip [1], and see the ReadMe.txt file for instructions on using the code, in particular the functions lsh and lshlookup.

In this problem we use the ℓ1 distance measure. The LSH function is run with L = 10, k = 24, where L is the number of hash tables generated and k is the length (number of bits) of the hash key.

• (10 points) Consider the image patches z_j, of column 100×j, for j ∈ {1, 2, . . . , 10}. Find the top 3 nearest neighbors for these image patches (excluding the original patch) using both LSH and linear search.

Compare the average search time for LSH and linear search. (If the bucket contains fewer than 3 points, rehash until you get enough neighboring points.)

• (10 points) For each z_j (1 ≤ j ≤ 10), let {x_ij}, i = 1, 2, 3, denote the approximate near neighbors of z_j found using LSH, and {x*_ij}, i = 1, 2, 3, the actual top 3 near neighbors of z_j found using linear search. Compute the following error measure (a small computational sketch is given after this list):

error = (1/10) Σ_{j=1}^{10} [ Σ_{i=1}^{3} d(x_ij, z_j) / Σ_{i=1}^{3} d(x*_ij, z_j) ]

Plot the error value as a function of L (for L = 10, 12, 14, . . . , 20, with k = 24). Then plot the error values as a function of k (for k = 16, 18, 20, 22, 24, with L = 10).

• (5 points) Plot the top 10 near neighbors found using the two methods (using the default L = 10, k = 24) for the image patch in column 100, together with the image patch itself. Use the functions reshape() and mat2gray() to convert the matrices to images, then use the functions imshow() and subplot() to display the images. Compare them visually.
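For concreteness, a minimal Python/NumPy sketch of the error measure defined above (the course code itself is in MATLAB; the arrays lsh_nn and lin_nn holding the three retrieved neighbors per query are assumed to be precomputed and are illustrative names):

import numpy as np

def l1(a, b):
    # l1 distance between two vectorized patches
    return np.abs(a - b).sum()

def lsh_error(queries, lsh_nn, lin_nn):
    """queries: list of 10 query patches z_j;
    lsh_nn[j], lin_nn[j]: lists of the 3 neighbors of z_j found by
    LSH and by linear search, respectively (assumed precomputed)."""
    ratios = []
    for z, approx, exact in zip(queries, lsh_nn, lin_nn):
        num = sum(l1(x, z) for x in approx)
        den = sum(l1(x, z) for x in exact)
        ratios.append(num / den)
    return np.mean(ratios)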

Solution: The relevant plots are shown in Figures 1, 2 and 3.

4. K-Means vs. EM for Gaussian Mixture Models

Consider samples being generated from a Gaussian mixture model with the following pdf

p(x | z = k) = (1 / (2π |Σ_k|^{1/2})) exp[ −(1/2) (x − μ_k)^T Σ_k^{−1} (x − μ_k) ],   k ∈ {1, . . . , K}

p(x) = (1/K) Σ_{k=1}^{K} p(x | z = k)

where x, μ_k ∈ R², z ∈ {1, . . . , K}, K = 3, and Σ_k ∈ R^{2×2} are the covariance matrices.

Let μ_1 = [5, 5]^T, μ_2 = [10, 20]^T, μ_3 = [16, 8]^T and

[1] This is the same as that provided from the problem's source, Jure Leskovec's course on Mining Massive Datasets. The dataset and code are adapted from Brown University's Greg Shakhnarovich.



Figure 1: Runtime comparison between linear search and LSH-based search of the top 3 nearest neighbors. The LSH-based search is much faster than the linear search.

(a) Error vs. L. (b) Error vs. K.

Figure 2: Error variation with L and K. We see that the error approximately decreases with increasing L and increases with increasing K, as expected. This is because increasing the number of hash tables L increases the chance of the actual nearest neighbors falling in the same bucket as the query point, hence the error decreases. Whereas with increasing key length K the total number of buckets increases, hence the chance of true neighbors falling in the same bucket as the query point decreases, increasing the error.



(a) Actual top 10 nearest neighbors (b) Top 10 ANN returned by LSH based search

Figure 3: Visual comparison between the actual top 10 nearest neighbors and those obtained by LSH.

Σ_1 = σ² [1 0; 0 1],   Σ_2 = σ² [1.2 0; 0 1.2],   Σ_3 = σ² [0.28 0.42; 0.42 1.4]
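For concreteness, a Python/NumPy sketch of sampling from this mixture, using the parameter values stated above (and, e.g., σ² = 2 and n = 300, as used in the first part below); the assignment itself uses MATLAB, so this is only illustrative:

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0
mus = np.array([[5.0, 5.0], [10.0, 20.0], [16.0, 8.0]])
Sigmas = sigma2 * np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.2, 0.0], [0.0, 1.2]],
    [[0.28, 0.42], [0.42, 1.4]],
])

n, K = 300, 3
# Mixture weights are uniform (1/K), as in the pdf above.
z = rng.integers(0, K, size=n)                      # latent component labels
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])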

• (10 points) We want to cluster the points generated from the distribution p(x) using the EM algorithm. Generate a sample set S = {x_i}, i = 1, . . . , n, of size n = 300 from the distribution p(x), using σ² = 2. Starting with initial estimates of the means μ_1^(0) = [0, 0]^T, μ_2^(0) = [10, 15]^T, μ_3^(0) = [16, 0]^T and covariances Σ_k^(0) = I, the 2×2 identity matrix, find the final estimates of the means μ_k and covariances Σ_k, k ∈ {1, 2, 3}, of the three components of p(x) using the EM algorithm. Now, using the MAP estimates z_i = arg max_k P(z_i = k | S) obtained by the EM algorithm, cluster the sample set S into 3 clusters. Plot the clusters using the MATLAB function scatter(), using separate colors for each cluster (or plot each cluster in separate graphs).

• (10 points) Define the set E_π = {x_i : π(z_i) ≠ z_i}, for a particular permutation π of the labels of the clusters. Define the error set E = E_{π_0}, where π_0 = arg min_π |E_π|. Hence the set E contains the sample points that fall in the wrong cluster. The error fraction is given by e = |E| / n (a small sketch of this computation is given after this list). Define the probability of error P_A(E, σ²) for a particular clustering algorithm A as the average of e over several sample sets S drawn from the same distribution p(x). Now generate several sample sets, varying σ² between 1 and 30 (with many sample sets for each σ²), and cluster using both the EM and K-means algorithms, starting with the same set of initial mean estimates as given in part (a). Plot the probability of error P_A(E, σ²) as a function of σ² for both algorithms. Also plot the average run-time of both algorithms for different σ².

• (5 points) Define algorithm B as follows. First run the K-means algorithm to obtain mean estimates μ′_k. Then run the EM algorithm using μ′_k as the initial mean estimates. Plot the probability of error P_B(E, σ²) vs. σ² for algorithm B.
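For reference, a minimal Python sketch of the error fraction e = |E|/n defined above, minimizing over label permutations (the helper name error_fraction is just illustrative; the assignment itself uses MATLAB):

from itertools import permutations
import numpy as np

def error_fraction(true_labels, est_labels, K=3):
    """|E|/n: fraction of misclustered points, minimized over permutations pi of the K labels."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    best = len(true_labels)
    for perm in permutations(range(K)):
        mapped = np.array([perm[z] for z in est_labels])
        best = min(best, int((mapped != true_labels).sum()))
    return best / len(true_labels)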



Solution: Let z_{i,k} = P(Z_i = k | x_i), and let p_k = P(Z_i = k) be the priors (which can be initialized to 1/K). Then the update equations for EM are as follows (a small implementation sketch follows the update equations).

• E step: Calculate the posterior membership probabilities.

z_{i,k} = [ p_k |Σ_k|^{−1/2} exp( −(1/2) (x_i − μ_k)^T Σ_k^{−1} (x_i − μ_k) ) ] / [ Σ_{j=1}^{K} p_j |Σ_j|^{−1/2} exp( −(1/2) (x_i − μ_j)^T Σ_j^{−1} (x_i − μ_j) ) ]

• M step: Estimate the parameters.

n_k = Σ_{i=1}^{n} z_{i,k}

μ_k = (1/n_k) Σ_{i=1}^{n} z_{i,k} x_i

Σ_k = (1/n_k) Σ_{i=1}^{n} z_{i,k} (x_i − μ_k)(x_i − μ_k)^T

p_k = n_k / n
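A minimal Python/NumPy sketch of these updates (illustrative only; the assignment uses MATLAB). It assumes a sample array X of shape (n, 2), e.g., as generated in the sampling sketch above, and the initial means from the problem statement; the function name em_gmm is a placeholder:

import numpy as np

def em_gmm(X, mus, n_iter=100):
    """EM for a K-component Gaussian mixture; mus: initial (K, d) means."""
    n, d = X.shape
    K = mus.shape[0]
    mus = mus.astype(float).copy()
    Sigmas = np.array([np.eye(d) for _ in range(K)])  # Sigma_k^(0) = I
    p = np.full(K, 1.0 / K)                           # priors, initialized to 1/K

    for _ in range(n_iter):
        # E step: posterior membership probabilities z_{i,k} (2*pi factors cancel).
        Z = np.zeros((n, K))
        for k in range(K):
            diff = X - mus[k]
            quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigmas[k]), diff)
            Z[:, k] = p[k] * np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(Sigmas[k]))
        Z /= Z.sum(axis=1, keepdims=True)

        # M step: update n_k, mu_k, Sigma_k, p_k.
        nk = Z.sum(axis=0)
        mus = (Z.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (Z[:, k, None] * diff).T @ diff / nk[k]
        p = nk / n

    labels = Z.argmax(axis=1)   # MAP cluster assignments
    return mus, Sigmas, p, labels

# Example usage with the initial means from the problem statement:
mu0 = np.array([[0.0, 0.0], [10.0, 15.0], [16.0, 0.0]])
# mus_hat, Sigmas_hat, p_hat, labels = em_gmm(X, mu0)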

For σ² = 2 the estimated means and covariances were

μ_1 = [4.9759, 4.8925]^T,   μ_2 = [9.8509, 19.8908]^T,   μ_3 = [16.0278, 7.8480]^T

Σ_1 = [1.7987 −0.0990; −0.0990 1.6604],   Σ_2 = [2.0729 0.1839; 0.1839 2.0478],   Σ_3 = [0.5635 0.9197; 0.9197 2.9577]

Typical plots are shown in Figures 4 and 5.

5. A few details about spectral clustering.

• (5 points) Suppose that {u_1, . . . , u_k} and {ũ_1, . . . , ũ_k} are any two orthonormal bases for the same k-dimensional null space of a matrix L. Let U and Ũ denote the n×k matrices whose columns are the respective orthonormal bases. Show that there is an orthonormal k×k matrix Q for which Ũ = UQ. Conversely, show that if U is an n×k orthonormal matrix, and Q a k×k orthonormal matrix, then Ũ = UQ is also orthonormal.

• (10 points) Recall that if we have a graph with k connected components, then the Laplacian has a k-dimensional null space, spanned by k vectors, {u_1, . . . , u_k}, where u_i has support only on the elements corresponding to the nodes in the ith connected component. Hence, if we let U be the matrix with these vectors as columns, and then let {y_1, . . . , y_n} be the rows of U, then if we normalize each y_i, each point maps to the standard basis element e_{c(i)}, with c(i) corresponding to the index of the connected component of i.

Show that if instead of {u_1, . . . , u_k} we have any other orthonormal basis of the null space, {ũ_1, . . . , ũ_k}, then if we form the matrix Ũ in the same way, and let ỹ_i be the rows of Ũ, then again, x_i = ỹ_i/‖ỹ_i‖ will be one of k distinct, orthonormal vectors.



(a) Clusters obtained using EM. (b) Error performance of EM and K-means with increasing σ².

Figure 4: EM vs K-Means for Gaussian mixture model

(a) Runtime comparison (b) Error performance of Algorithm B

Figure 5: EM vs K-Means for Gaussian mixture model



Solution: U = [u_1 . . . u_k] and Ũ = [ũ_1 . . . ũ_k] are bases of the null space N of L, hence both span N. Therefore each vector ũ_i can be expressed as a linear combination of the vectors u_1, . . . , u_k:

ũ_i = Σ_{j=1}^{k} q_{ji} u_j,   ∀ i = 1, . . . , k,

for some q_{ji} ∈ R. In matrix notation we can simply write Ũ = UQ. This proves the existence of Q. Now, since U and Ũ are orthonormal, Ũ^T Ũ = U^T U = I. We can write

I = Ũ^T Ũ = Q^T U^T U Q = Q^T I Q = Q^T Q.

Hence Q is an orthonormal matrix.

Conversely, if U and Q are orthonormal matrices and Ũ = UQ, then

Ũ^T Ũ = Q^T U^T U Q = Q^T I Q = Q^T Q = I.

Hence Ũ is an orthonormal matrix.

Now, for a graph with k connected components, let U be the matrix

U = [u_1 . . . u_k],   with rows y_1^T, . . . , y_n^T,

where each u_j has support only on the elements l such that node l belongs to the jth cluster. U need not be orthonormal here, since the u_i are any orthogonal vectors spanning the null space. Hence u_j^T u_j = λ_j > 0, and u_i^T u_j = 0 for i ≠ j, as the clusters are non-overlapping. Clearly y_j/||y_j|| = e_{c(j)}, where c(j) is the cluster to which node j belongs.

For any other orthonormal matrix Ũ = [ũ_1 . . . ũ_k] spanning the same null space, we can write U = ŨQ for some (invertible) matrix Q = [q_1 . . . q_k]. Note that

U^T U = Λ = diag(λ_1, . . . , λ_k) = Q^T Ũ^T Ũ Q = Q^T I Q = Q^T Q.

Hence,

Q^{−1} = Λ^{−1} Q^T,   with rows λ_1^{−1} q_1^T, . . . , λ_k^{−1} q_k^T.

Now Ũ = U Q^{−1}. Hence the jth row of Ũ is simply

ỹ_j^T = y_j^T Q^{−1} = ||y_j|| e_{c(j)}^T Q^{−1} = ||y_j|| λ_{c(j)}^{−1} q_{c(j)}^T.



Again, if we normalize the rows of Ũ, x_i = ỹ_i/||ỹ_i||, we see that for any i, j with c(i) ≠ c(j),

x_i^T x_j = ỹ_i^T ỹ_j / (||ỹ_i|| ||ỹ_j||) = ||y_i|| ||y_j|| λ_{c(i)}^{−1} λ_{c(j)}^{−1} q_{c(i)}^T q_{c(j)} / (||ỹ_i|| ||ỹ_j||) = 0,

since q_{c(i)}^T q_{c(j)} = 0 (Q^T Q = Λ is diagonal), while for c(i) = c(j) the two rows normalize to the same unit vector. Hence the x_i also form k distinct orthonormal vectors.
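As a quick numerical illustration of this argument (not part of the original solution), one can build the Laplacian of a small graph with k = 2 components, rotate an orthonormal null-space basis by a random orthonormal Q, and check that the normalized rows collapse to k distinct orthonormal vectors; a Python/NumPy sketch:

import numpy as np

# Adjacency of a graph with two connected components: {0,1,2} and {3,4}.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian

# Orthonormal basis of the 2-dimensional null space, mixed by a random
# orthonormal Q so it is *not* the indicator basis.
w, V = np.linalg.eigh(L)
U = V[:, np.isclose(w, 0)]               # n x k, orthonormal columns
Q, _ = np.linalg.qr(np.random.randn(2, 2))
U_tilde = U @ Q

# Normalize the rows; nodes in the same component map to the same unit vector,
# and vectors from different components are orthogonal.
X = U_tilde / np.linalg.norm(U_tilde, axis=1, keepdims=True)
print(np.round(X @ X.T, 6))              # 1s within components, 0s across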

6. (10 points) Recall the example we did in class via direct calculation:

A = (0 0; 0 ε),   ∆ = (0 β; β 0).

Let E_0 denote the eigenspace corresponding to the smallest eigenvalue of A, and F_0 the eigenspace corresponding to the smallest eigenvalue of (A + ∆). Use the sin-theta theorem to bound d_p(E_0, F_0).

Solution: We have ||∆||_2 = β. The minimum eigenvalue of A is 0. The eigenvalues of A + ∆ are the roots of the equation det(xI − (A + ∆)) = 0; hence the maximum eigenvalue is λ_1 = (ε/2)(1 + √(1 + 4β²/ε²)).

To get a tight bound from the sin-theta theorem, we can choose δ arbitrarily close to, but less than, λ_1, say δ = λ_1 − ν (ν > 0 small). Therefore d_p(E_0, F_0) can be bounded by

d_p(E_0, F_0) ≤ ||∆||_2 / δ = β / (λ_1 − ν) ≤ β / ε   (letting ν → 0 and using λ_1 ≥ ε).

In order to get the exact bound we need to calculate the value of sin Θ. Let h = β/ε and γ = √(1 + 4h²) (written γ here to avoid a clash with the perturbation matrix ∆). We have

e_0 = (1, 0)

f_0 = (1 + γ, −2h) / √(2γ + 2γ²)

⟨e_0, f_0⟩ = (1 + γ) / √(2γ + 2γ²) = 1 − β²/(2ε²) + O(h⁴)

Therefore,

d_p(E_0, F_0) = ||sin Θ||_2 = √(1 − ⟨e_0, f_0⟩²) = √(β²/ε² − O(h⁴)) ≤ β/ε.
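As a numerical sanity check of this bound (not part of the original solution), one can compute the eigenvectors directly for example values such as ε = 1 and β = 0.3; a Python/NumPy sketch:

import numpy as np

eps, beta = 1.0, 0.3
A = np.array([[0.0, 0.0], [0.0, eps]])
Delta = np.array([[0.0, beta], [beta, 0.0]])

# Eigenvectors for the smallest eigenvalue of A and of A + Delta.
_, VA = np.linalg.eigh(A)
_, VB = np.linalg.eigh(A + Delta)
e0, f0 = VA[:, 0], VB[:, 0]          # eigh sorts eigenvalues in ascending order

# Projection distance d_p(E0, F0) = ||sin Theta|| for one-dimensional subspaces.
sin_theta = np.sqrt(1.0 - np.dot(e0, f0) ** 2)
print(sin_theta, "<=", beta / eps)   # e.g. ~0.267 <= 0.3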
