Clustering Technique for Collaborative Filtering Recommendation and Application to Venue Recommendation

Lehrstuhl Informatik 5 (Informationssysteme)

Prof. Dr. Matthias Jarke
I5-PYR-0830-1

Manh Cuong Pham, Yiwei Cao

Ralf Klamma

TeLLNet

Clustering Techniques for Collaborative Filtering and the Application to Venue Recommendation

Manh Cuong Pham, Yiwei Cao, Ralf Klamma
Information Systems and Database Technology

RWTH Aachen, Germany

Graz, Austria, September 01, 2010

I-KNOW 2010





Agenda

- Introduction
- Clustering techniques for collaborative filtering
- Case study: venue recommendation
  - Data sets: DBLP and CiteSeerX
  - User-based
  - Item-based
- Conclusions and Outlook





Introduction

Recommender systems: help users deal with information overload

Components of a recommender system [Burke 2002]

- Set of users, set of items (products)
- Implicit/explicit user ratings on items
- Additional information: trust, collaboration, etc.
- Algorithms for generating recommendations

Recommendation techniques [Adomavicius and Tuzhilin 2005]

- Collaborative Filtering (CF) [Breese et al. 1998]
  - Memory-based algorithms: user-based, item-based [Sarwar 2001]
  - Model-based algorithms: Bayesian network [Breese 1998]; Clustering [Ungar 1998]; Rule-based [Sarwar 2000]; Machine learning on graphs [Zhou 2005, 2008]; PLSA [Hofmann 1999]; Matrix factorization [Koren 2009]
- Content-based recommendation [Sarwar et al. 2001]
- Hybrid approaches [Burke 2002]





Clustering and Collaborative Filtering

[Figure: user-item rating matrix; user clustering and item clustering split it into Cluster 1 and Cluster 2, each labeled "item-based CF"]

Problems: large-scale data; sparse rating matrix; diversity of users and items

Previous approaches: clustering based on ratings

- K-means, Metis, etc. [Rashid 2006, Xue 2005, O’Connor 2001]

Our approach

- Clustering based on additional information: relationships between users, items
- Improvement in both efficiency and accuracy
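The idea of restricting CF to a cluster can be illustrated with a minimal sketch. It assumes a user-based neighbourhood scheme with cosine similarity and a precomputed cluster assignment; the names `recommend`, `ratings`, and `cluster_of`, as well as the toy data, are hypothetical and not taken from the talk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, ratings, cluster_of, top_n=3):
    """Score unseen items using only neighbours from the user's own cluster."""
    peers = [u for u in ratings
             if u != user and cluster_of[u] == cluster_of[user]]
    scores = {}
    for peer in peers:
        sim = cosine(ratings[user], ratings[peer])
        if sim <= 0:
            continue
        for item, rating in ratings[peer].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Toy data (assumed): two users in cluster 0, one in cluster 1.
ratings = {
    "user_a": {"VLDB": 1.0, "ICDE": 0.5},
    "user_b": {"VLDB": 0.8, "SIGMOD": 0.6},
    "user_c": {"VLDB": 0.2, "CHI": 1.0, "UIST": 0.9},
}
cluster_of = {"user_a": 0, "user_b": 0, "user_c": 1}
suggestions = recommend("user_a", ratings, cluster_of)
```

Because similarities are computed only against users in the same cluster, the neighbour search shrinks from all users to the cluster size, which is where the efficiency gain would come from.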





Evaluation: Venue Recommendation

Recommend venues (conferences, journals, workshops) to researchers

User-based CF

- Populate user-item matrix using venue participation history
- Ratings: normalized venue publication counts
- User clustering: co-authorship network

Item-based CF

- Similarity between venues based on citations
- Similarity measure: cosine
- Venue clustering: similarity network
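How the user-item matrix for the user-based setup might be populated can be sketched as follows. The slides do not state the normalization scheme; this sketch assumes each author's venue counts are divided by that author's total number of publications. The function name `build_ratings` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def build_ratings(publications):
    """Turn (author, venue) publication records into normalized ratings.

    Assumption: rating = publications of the author at the venue
    divided by the author's total number of publications.
    """
    counts = defaultdict(Counter)
    for author, venue in publications:
        counts[author][venue] += 1
    ratings = {}
    for author, venue_counts in counts.items():
        total = sum(venue_counts.values())
        ratings[author] = {v: c / total for v, c in venue_counts.items()}
    return ratings


# Toy participation history (assumed data).
publications = [
    ("author_a", "I-KNOW"),
    ("author_a", "I-KNOW"),
    ("author_a", "EC-TEL"),
    ("author_b", "EC-TEL"),
]
ratings = build_ratings(publications)
```

Other normalizations (for example dividing by the author's maximum count) would fit the same interface.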





Data Sets

DBLP (http://www.informatik.uni-trier.de/~ley/db/)

- 788,259 author names
- 1,226,412 publications
- 3,490 venues (conferences, workshops, journals)

CiteSeerX (http://citeseerx.ist.psu.edu/)

- 7,385,652 publications (including publications in reference lists)
- 22,735,240 citations
- Over 4 million author names

Combination

- Canopy clustering [McCallum 2000]
- Result: 864,097 matched pairs
- On average: venues cite 2,306 times and are cited 2,037 times
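The canopy clustering step used for combining the two data sets can be illustrated with a minimal sketch. The slides only name the technique; the cheap distance (Jaccard distance on title tokens), the two thresholds, and the sample titles below are assumptions chosen for illustration.

```python
import re


def tokens(title):
    """Lowercased alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on token sets; 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def canopy_clustering(points, t_loose, t_tight, distance):
    """Group points into possibly overlapping canopies.

    A point within t_loose of a center joins that canopy; a point within
    t_tight is also removed from the list of candidate centers.
    """
    if t_loose < t_tight:
        raise ValueError("t_loose must be >= t_tight")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        canopies.append([i for i in range(len(points))
                         if distance(points[center], points[i]) <= t_loose])
        remaining = [i for i in remaining
                     if distance(points[center], points[i]) > t_tight]
    return canopies


# Assumed sample titles: two spellings of one record plus an unrelated one.
titles = [
    "Item-based collaborative filtering recommendation algorithms",
    "Item based Collaborative Filtering Recommendation Algorithms.",
    "Efficient clustering of high-dimensional data sets",
]
canopies = canopy_clustering([tokens(t) for t in titles], 0.6, 0.3,
                             jaccard_distance)
```

Only records that share a canopy would then need to be compared with a more expensive matching function, which keeps the pairwise comparison between DBLP and CiteSeerX tractable.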





User-based CF: Author Clustering

Data: DBLP

Perform 2 test cases for the years 2005 and 2006

- Clustering of co-authorship networks
- 2005 network: 478,108 nodes; 1,427,196 edges
- 2006 network: 544,601 nodes; 1,686,867 edges
- Prediction of venue participation

Clustering algorithm

- Density-based algorithm [Clauset 2004]
- Obtained modularity: 0.829 and 0.82

Cluster size distribution follows a power law
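Modularity compares the share of edges that fall inside clusters with the share expected if edges were placed at random. A minimal sketch of how this score can be computed for a given partition, assuming an unweighted, undirected graph without self-loops; it does not reproduce the clustering algorithm of [Clauset 2004] itself, and the function name and toy graph are illustrative.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Q = sum over communities c of (L_c / m) - (d_c / (2m))^2.

    L_c: number of edges inside c, d_c: total degree of nodes in c,
    m: total number of edges.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy co-authorship graph (assumed): two triangles joined by one edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Putting all nodes into one community always yields Q = 0, so values above 0.8 as reported here indicate a strongly clustered network.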





User-based CF: Performance

Precision for 1,000 randomly chosen authors

Precision computed at the 11 standard recall levels 0%, 10%, ..., 100%

Results

- Clustering performs better
- Improvement not significant
- Better efficiency

Further improvement

- Different networks: citation
- Overlapping clustering
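Precision at the 11 standard recall levels is commonly computed as interpolated precision: at each level, take the highest precision reached at any recall greater than or equal to that level. The sketch below assumes this standard interpolation, which the slides do not spell out; the function name and the sample ranking are hypothetical.

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant item is required")
    points = []  # (recall, precision) after each relevant item found
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Small tolerance guards against floating-point noise in the recalls.
    return [max((p for r, p in points if r >= level - 1e-12), default=0.0)
            for level in levels]


# Assumed example: ranked venue list for one author; the author actually
# published at two of the five recommended venues.
ranked = ["ICDE", "VLDB", "SIGMOD", "CHI", "KDD"]
attended = {"ICDE", "SIGMOD"}
curve = eleven_point_precision(ranked, attended)
```

Averaging such curves over all sampled authors gives one precision-recall curve per method, which allows the clustered and non-clustered variants to be compared.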





Item-based CF: Venue Network Creation and Clustering

Knowledge network

- Aggregate bibliographic coupling counts at venue level
- Undirected graph G(V, E), where V: venues, E: edges weighted by cosine similarity, with $B_i$ the vector of bibliographic coupling counts of venue $i$:

$$C_{i,j} = \frac{B_i \cdot B_j}{\lVert B_i \rVert \, \lVert B_j \rVert} = \frac{\sum_{k=1}^{n} B_{i,k} B_{j,k}}{\sqrt{\sum_{k=1}^{n} B_{i,k}^{2}} \, \sqrt{\sum_{k=1}^{n} B_{j,k}^{2}}}$$

- Threshold: $C_{i,j} \geq 0.1$
- Clustering: density-based algorithm [Newman 2004, Clauset 2004]
- Network visualization: force-directed paradigm [Fruchterman 1991]

Knowledge flow network (for venue ranking, see Pham & Klamma 2010)

- Aggregate bibliographic coupling counts at venue level
- Threshold: citation counts >= 50

Domains from Microsoft Academic Search (http://academic.research.microsoft.com/)
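The construction of the knowledge network, cosine-weighted edges between venues that are kept only above a threshold, can be sketched as follows. The input layout (one vector of coupling counts per venue), the function name `cosine_edges`, and the toy vectors are assumptions; the threshold is a parameter, with 0.1 used as the default.

```python
import math


def cosine_edges(coupling, threshold=0.1):
    """Return {(venue_i, venue_j): cosine} for pairs at or above threshold.

    coupling: dict mapping each venue to its vector of coupling counts;
    all vectors are assumed to have the same length.
    """
    venues = sorted(coupling)
    norms = {v: math.sqrt(sum(x * x for x in coupling[v])) for v in venues}
    edges = {}
    for idx, i in enumerate(venues):
        for j in venues[idx + 1:]:
            if norms[i] == 0 or norms[j] == 0:
                continue  # cosine undefined for an all-zero vector
            dot = sum(x * y for x, y in zip(coupling[i], coupling[j]))
            similarity = dot / (norms[i] * norms[j])
            if similarity >= threshold:
                edges[(i, j)] = similarity
    return edges


# Toy coupling vectors (assumed): A and B point in the same direction,
# C shares nothing with either.
coupling = {
    "A": [2, 1, 0],
    "B": [4, 2, 0],
    "C": [0, 0, 5],
}
edges = cosine_edges(coupling)
```

The resulting weighted edge list is the kind of input a community detection or force-directed layout step would consume.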





Knowledge Network: the Visualization





Knowledge Network: Clustering





Interdisciplinary Venues: Top Betweenness Centrality





High Prestige Series: Top PageRank





Conclusions and Future Research

Clustering and recommender systems

- Advantage of using additional information for clustering
- Application of clustering for both user-based and item-based CF
- Key issue: impact of the communities (clusters) on the quality of recommendations; non-overlapping communities vs. overlapping communities

Outlook

- Further evaluation: trust network clustering, paper and potential collaborator recommendation
- Datasets: Epinions, Last.fm, etc.
- Digital libraries in Web 2.0: Mendeley, ResearchGate, etc.
