Scalable Recommendation Algorithms for Massive Data
Maruf Aytekin
PhD Candidate
Computer Engineering Department Bahcesehir University
Outline
- Introduction
- Collaborative Filtering (CF) and the Scalability Problem
- Locality Sensitive Hashing (LSH) for Recommendation
- Improvements to LSH Methods
- Preliminary Results
- Work Plan
Recommender Systems
Recommender systems are applied to various domains:
- Book/movie/news recommendations
- Contextual advertising
- Search engine personalization
- Matchmaking

Two types of problems:
- Preference elicitation (rating prediction)
- Set-based recommendations (top-N)
Recommender Systems
- Content-based filtering
- Collaborative filtering (CF)
  - Model-based
  - Neighborhood-based
Neighborhood-based Methods
The idea: similar users behave in similar ways.
- User-based: rely on the opinions of like-minded users to predict a rating.
- Item-based: look at the ratings given to similar items.
Both require computing similarity weights to select trusted neighbors whose ratings are used in the prediction.
Neighborhood-based Methods
Problem: comparing all users/items to find trusted neighbors (k-nearest-neighbors) does not scale with data size (number of users/items).
Computational Complexity

             Space    Model Build  Query
User-based   O(m^2)   O(m^2 n)     O(m)
Item-based   O(n^2)   O(n^2 m)     O(n)

m : number of users   n : number of items
Various Methods
Model-based recommendation techniques:
- Dimensionality reduction (SVD, PCA, random projections)
- Classification (like/dislike), e.g. neural network classifiers
- Clustering (approximate nearest neighbor, ANN)
- Bayesian inference techniques

Distributed computation:
- Map-reduce
- Distributed CF algorithms
Locality Sensitive Hashing (LSH)
- An approximate nearest neighbor (ANN) search method.
- Eliminates the need to search all of the data to find the nearest neighbors.
- Finds nearest neighbors quickly for basic neighborhood-based methods.
Locality Sensitive Hashing (LSH)
General approach:
- Hash items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.
- Pairs hashed to the same bucket become candidate pairs.
- Check only the candidate pairs for similarity.
Locality-Sensitive Functions
The function h hashes items, and the decision is based on whether or not the results are equal:
- h(x) = h(y): make x and y a candidate pair.
- h(x) ≠ h(y): do not make x and y a candidate pair.
Functions can be combined, e.g.
  g = h1 AND h2 AND h3
or
  g = h1 OR h2 OR h3
A collection of functions of this form is called a family of functions.
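The AND and OR combinations above can be sketched directly; a minimal illustration, using a toy (hypothetical) hash family over integers rather than a true locality-sensitive one:

```python
# AND/OR constructions over a family of hash functions.
# The modulo hashes below are a toy stand-in, not a real LSH family.

def and_construction(hashes, x, y):
    # x and y are candidates only if ALL component hashes agree
    return all(h(x) == h(y) for h in hashes)

def or_construction(hashes, x, y):
    # x and y are candidates if ANY component hash agrees
    return any(h(x) == h(y) for h in hashes)

hashes = [lambda v: v % 2, lambda v: v % 3, lambda v: v % 5]

print(and_construction(hashes, 30, 60))  # True: agree mod 2, 3, and 5
print(or_construction(hashes, 4, 9))     # True: 4 and 9 agree mod 5
print(and_construction(hashes, 4, 9))    # False: they differ mod 2 and mod 3
```

AND-constructions make candidacy stricter (fewer false positives); OR-constructions make it looser (fewer false negatives).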
LSH for Cosine
Charikar defines a family of functions for cosine similarity as follows. Let u and v be rating vectors, and let r be a randomly generated vector whose components are +1 and -1. The family of hash functions H is generated as

  h_r(u) = 1 if r · u >= 0, and 0 otherwise.

Then

  P[ h_r(u) = h_r(v) ] = 1 - θ(u, v) / π

gives the probability of u and v being declared a candidate pair, where θ(u, v) is the angle between u and v.
LSH for Cosine
Example:
  r1 = [-1, 1, 1,-1,-1]
  r2 = [ 1, 1, 1,-1,-1]
  r3 = [-1,-1, 1,-1, 1]
  r4 = [-1, 1,-1, 1,-1]

  u1 = [5, 4, 0, 4, 1]
  u2 = [2, 1, 1, 1, 4]
  u3 = [4, 3, 0, 5, 2]

  h1(u1) = u1 · r1 =  -6 => 0
  h2(u1) = u1 · r2 =   4 => 1
  h3(u1) = u1 · r3 = -12 => 0
  h4(u1) = u1 · r4 =   2 => 1

AND-construction (g = h1 AND h2 AND h3 AND h4):
  g(u1) = 0101
  g(u2) = 0010
  g(u3) = 0101

At most 2^4 = 16 buckets.
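The example above can be reproduced with a few lines of NumPy; a minimal sketch, using the slide's random vectors as rows of a matrix and the convention that a dot product >= 0 maps to bit 1:

```python
import numpy as np

# Random-hyperplane LSH for cosine similarity (Charikar):
# h_r(u) = 1 if r . u >= 0 else 0.
R = np.array([[-1,  1,  1, -1, -1],   # r1
              [ 1,  1,  1, -1, -1],   # r2
              [-1, -1,  1, -1,  1],   # r3
              [-1,  1, -1,  1, -1]])  # r4

def signature(u):
    # Concatenate the K = 4 hash bits into a bucket key (AND-construction)
    return ''.join('1' if d >= 0 else '0' for d in R @ u)

u1 = np.array([5, 4, 0, 4, 1])
u2 = np.array([2, 1, 1, 1, 4])
u3 = np.array([4, 3, 0, 5, 2])

print(signature(u1))  # 0101
print(signature(u2))  # 0010
print(signature(u3))  # 0101 -> u1 and u3 land in the same bucket
```

Note that u1 and u3, which have visibly similar rating patterns, share the bucket key 0101, while u2 does not.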
LSH Model Build
[Diagram: K = 4 hash functions h1..h4, each producing a bit in [0,1], combined by AND-construction into a 4-bit bucket key. Users U1..Um are hashed into buckets, e.g. bucket 1 (key 0101), bucket 2 (key 1110), bucket 3 (key 1101), bucket 4 (key 1001).]
Hash Tables (Bands)
[Diagram: L = 2 hash tables, each built from K = 4 hash functions. The candidate set for U5, C(U5), is the union of U5's buckets across the hash tables.]
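The model build above can be sketched as follows; a minimal illustration with L = 2 tables and K = 4 random hyperplanes each (function and variable names are illustrative, not from the thesis):

```python
import numpy as np
from collections import defaultdict

def make_planes(L, K, dim, rng):
    # One (K x dim) matrix of +/-1 hyperplanes per hash table
    return [rng.choice([-1, 1], size=(K, dim)) for _ in range(L)]

def bucket_key(planes_t, u):
    # K-bit signature = AND-construction of K hash bits
    return ''.join('1' if d >= 0 else '0' for d in planes_t @ u)

def build_tables(users, planes):
    tables = [defaultdict(list) for _ in planes]
    for uid, u in users.items():
        for table, p in zip(tables, planes):
            table[bucket_key(p, u)].append(uid)
    return tables

def candidate_set(u, planes, tables):
    # Union of u's buckets across the L tables
    return set().union(*(set(t[bucket_key(p, u)])
                         for p, t in zip(planes, tables)))

rng = np.random.default_rng(7)
users = {'u1': np.array([5, 4, 0, 4, 1]),
         'u2': np.array([2, 1, 1, 1, 4]),
         'u3': np.array([4, 3, 0, 5, 2])}
planes = make_planes(L=2, K=4, dim=5, rng=rng)
tables = build_tables(users, planes)
print(candidate_set(users['u1'], planes, tables))  # always contains 'u1'
```

Increasing K shrinks the buckets (fewer candidates, faster queries); increasing L unions more tables (more candidates, better recall).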
LSH Methods
Clustering based:
- UB-KNN-LSH: User-based CF prediction with LSH
- IB-KNN-LSH: Item-based CF prediction with LSH
Frequency based:
- UB-LSH1: User-based prediction with LSH
- IB-LSH1: Item-based prediction with LSH
LSH Methods for Prediction

UB-KNN-LSH:
1. Find the candidate set, C, for target user u with LSH.
2. Find the k-nearest-neighbors to u from C that have rated item i.
3. Use the k-nearest-neighbors to generate a prediction for u on i.

IB-KNN-LSH:
1. Find the candidate set, C, for target item i with LSH.
2. Find the k-nearest-neighbors to i from C that user u has rated.
3. Use the k-nearest-neighbors to generate a prediction for u on item i.
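The UB-KNN-LSH steps above can be sketched as below; a minimal illustration in which the candidate set C is assumed to come from the LSH buckets and is passed in directly (the data layout and helper names are illustrative):

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b) / (na * nb) if na and nb else 0.0

def ub_knn_lsh_predict(u_vec, item, C, vectors, ratings, k=2):
    # Keep only candidates that rated the item, scored by similarity to u
    scored = [(cosine(u_vec, vectors[c]), ratings[c][item])
              for c in C if item in ratings[c]]
    if not scored:
        return None  # no candidate rated the item: a coverage miss
    top = sorted(scored, reverse=True)[:k]      # k nearest neighbors
    norm = sum(abs(w) for w, _ in top)
    return sum(w * r for w, r in top) / norm if norm else None

vectors = {'a': np.array([5.0, 4.0, 0.0]), 'b': np.array([4.0, 5.0, 1.0])}
ratings = {'a': {'i1': 5}, 'b': {'i1': 3}}
u = np.array([5.0, 4.0, 0.5])
print(ub_knn_lsh_predict(u, 'i1', {'a', 'b'}, vectors, ratings))
```

IB-KNN-LSH is symmetric: swap the roles of users and items and compare item rating vectors instead.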
LSH Methods for Prediction

UB-LSH1:
1. Find the candidate user list, Cl, for u with LSH.
2. Calculate the frequency of each user in Cl who rated item i.
3. Sort the candidate users by frequency and take the top k users.
4. Use frequency as the weight to predict the rating for u on i with user-based prediction.

IB-LSH1:
1. Find the candidate item list, Cl, for i with LSH.
2. Calculate the frequency of the items in Cl that u has rated.
3. Sort the candidate items by frequency and take the top k items.
4. Use frequency as the weight to predict the rating for u on i with item-based prediction.
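The UB-LSH1 steps can be sketched as follows; a minimal illustration assuming the candidate list Cl keeps duplicates across the L hash tables, so a user's frequency in Cl serves as the similarity weight (names are illustrative):

```python
from collections import Counter

def ub_lsh1_predict(Cl, item, ratings, k=3):
    # Count how often each rater of `item` appears in the candidate list
    freq = Counter(c for c in Cl if item in ratings[c])
    top = freq.most_common(k)          # sort by frequency, take top k
    if not top:
        return None
    # Frequency-weighted average of the top-k users' ratings
    return sum(f * ratings[c][item] for c, f in top) / sum(f for _, f in top)

ratings = {'a': {'i1': 5}, 'b': {'i1': 3}, 'c': {}}
Cl = ['a', 'a', 'b', 'c']              # 'a' collided with u in two buckets
print(ub_lsh1_predict(Cl, 'i1', ratings))  # (2*5 + 1*3) / 3 ≈ 4.33
```

The intuition: the more tables in which a user collides with u, the more hyperplane sides they share, so frequency is a cheap proxy for cosine similarity.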
LSH Methods for Prediction: Improvement

UB-LSH2:
1. Find the candidate user list, Cl, of users who rated i, for u with LSH.
2. Select k users from Cl at random.
3. Predict the rating for u on i with user-based prediction as the average of the k users' ratings.

IB-LSH2:
1. Find the candidate item list, Cl, for i with LSH.
2. Select k items rated by u from Cl at random.
3. Predict the rating for u on i with item-based prediction as the average of the k items' ratings.

- Eliminates the frequency calculation and sorting.
- Frequent users or items in Cl have a higher chance of being selected randomly.
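The UB-LSH2 improvement can be sketched as below; a minimal illustration (names are illustrative) in which sampling positions from the duplicate-bearing candidate list naturally favors users that appear in more buckets, with no counting or sorting:

```python
import random

def ub_lsh2_predict(Cl, item, ratings, k=3, rng=random):
    # Restrict to candidates that rated the item, duplicates included
    raters = [c for c in Cl if item in ratings[c]]
    if not raters:
        return None
    # Sample k list positions; users occurring more often in Cl are
    # proportionally more likely to be drawn
    chosen = rng.sample(raters, min(k, len(raters)))
    return sum(ratings[c][item] for c in chosen) / len(chosen)

ratings = {'a': {'i1': 4}, 'b': {'i1': 4}}
print(ub_lsh2_predict(['a', 'a', 'b'], 'i1', ratings))  # 4.0
```

This trades the O(|Cl| lg |Cl|) sort of LSH1 for an O(k) sample, which is what drives the O(L + 2k) prediction cost in the complexity table.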
Complexity: Prediction

             Space   Model Build  Prediction
User-based   O(m)    O(m^2)       O(mn)
Item-based   O(n)    O(n^2)       O(mn)
UB-KNN-LSH   O(mL)   O(mLKt)      O(L + |C|n + k)
IB-KNN-LSH   O(nL)   O(nLKt)      O(L + |C|m + k)
UB-LSH1      O(mL)   O(mLKt)      O(L + |Cl| + |Cl| lg|Cl| + k)
IB-LSH1      O(nL)   O(nLKt)      O(L + |Cl| + |Cl| lg|Cl| + k)
UB-LSH2      O(mL)   O(mLKt)      O(L + 2k)
IB-LSH2      O(nL)   O(nLKt)      O(L + 2k)

m : number of users          n : number of items
L : number of hash tables    K : number of hash functions
t : time to evaluate a hash function
C  : candidate user (or item) set  ( |C| ≈ Lm / 2^K or |C| ≈ Ln / 2^K )
Cl : candidate user (or item) list ( |Cl| ≈ Lm / 2^K or |Cl| ≈ Ln / 2^K )
Candidate List (Cl) Size: Prediction
[Plot: number of candidate users, |Cl| ≈ Lm / 2^K, vs. number of hash functions K (1-10), for L = 5 and m = 16,042.]
[Plot: number of candidate items, |Cl| ≈ Ln / 2^K, vs. number of hash functions K (1-10), for L = 5 and n = 17,454.]
Results Model Build
Results: Prediction
[Plots: MAE vs. number of hash functions K (4-13) on Movie Lens 1M (MAE 0.8-1.8) and Amazon Movies (MAE 0.7-1.3), comparing UB-KNN, IB-KNN, UB-KNN-LSH, IB-KNN-LSH, UB-LSH1, UB-LSH2, IB-LSH1, IB-LSH2.]
[Plots: prediction run time (ms) vs. number of hash functions K (4-13) on Movie Lens 1M (0-14 ms) and Amazon Movies (0-1.4 ms), comparing UB-KNN-LSH, IB-KNN-LSH, UB-LSH1, UB-LSH2, IB-LSH1, IB-LSH2.]
Results: Prediction
[Plots: run time (ms) vs. number of hash functions K (4-13) on Movie Lens 1M and Amazon Movies (0-0.8 ms), comparing UB-LSH1, UB-LSH2, IB-LSH1, IB-LSH2.]
Results: Prediction
[Plots: prediction coverage (0-1) vs. number of hash functions K (4-13) on Movie Lens 1M and Amazon Movies, comparing UB-KNN-LSH, IB-KNN-LSH, UB-LSH1, UB-LSH2, IB-LSH1, IB-LSH2.]
Results: Prediction
[Plot: prediction coverage (0.4-1, higher is better) vs. run time (0-0.7, lower is better).]