Lehrstuhl Informatik 5 (Informationssysteme)
Prof. Dr. Matthias Jarke
I5-PYR-0830-1
Manh Cuong Pham, Yiwei Cao, Ralf Klamma
TeLLNet
Clustering Techniques for Collaborative Filtering and the Application to Venue Recommendation
Manh Cuong Pham, Yiwei Cao, Ralf Klamma
Information Systems and Database Technology
RWTH Aachen, Germany
Graz, Austria, September 01, 2010
I-KNOW 2010
Agenda
Introduction
Clustering techniques for collaborative filtering
Case study: venue recommendation
- Data sets: DBLP and CiteSeerX
- User-based
- Item-based
Conclusions and Outlook
Introduction
Recommender systems: help users deal with information overload
Components of a recommender system [Burke 2002]
- Set of users, set of items (products)
- Implicit/explicit user ratings on items
- Additional information: trust, collaboration, etc.
- Algorithms for generating recommendations
Recommendation techniques [Adomavicius and Tuzhilin 2005]
- Collaborative Filtering (CF) [Breese et al. 1998]
  - Memory-based algorithms: user-based, item-based [Sarwar 2001]
  - Model-based algorithms: Bayesian network [Breese 1998]; clustering [Ungar 1998]; rule-based [Sarwar 2000]; machine learning on graphs [Zhou 2005, 2008]; PLSA [Hofmann 1999]; matrix factorization [Koren 2009]
- Content-based recommendation [Sarwar et al. 2001]
- Hybrid approaches [Burke 2002]
Clustering and Collaborative Filtering
[Figure: a user-item rating matrix (x = rating) partitioned by user clustering and by item clustering into Cluster 1 and Cluster 2; item-based CF is applied to the resulting matrices]
Problems: large-scale data; sparse rating matrix; diversity of users and items
Previous approaches: clustering based on ratings
- K-means, Metis, etc. [Rashid 2006, Xue 2005, O’Connor 2001]
Our approach
- Clustering based on additional information: relationships between users, items
- Improvement in both efficiency and accuracy
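To make the idea of cluster-restricted CF concrete, here is a minimal sketch of an item-based prediction that only consults items from the same cluster as the target item. The function names, the dict-based data layout, and the similarity-weighted average are illustrative assumptions, not the authors' implementation; the cluster assignment is taken as given, from whatever clustering was run beforehand.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(ratings, item_cluster, user, target):
    """Predict the rating of `user` for `target` from same-cluster items only.

    ratings: {user: {item: rating}}
    item_cluster: {item: cluster id} (hypothetical, precomputed elsewhere)
    """
    # Transpose the ratings into item vectors: item -> {user: rating}.
    item_vecs = defaultdict(dict)
    for u, row in ratings.items():
        for i, r in row.items():
            item_vecs[i][u] = r

    num = den = 0.0
    for i, r in ratings[user].items():
        # Skip the target itself and every item outside the target's cluster.
        if i == target or item_cluster.get(i) != item_cluster.get(target):
            continue
        s = cosine(item_vecs[target], item_vecs[i])
        num += s * r
        den += abs(s)
    return num / den if den else 0.0
```

Restricting the loop to one cluster is where the efficiency gain would come from: fewer similarity computations per prediction, at the risk of ignoring useful neighbours that landed in another cluster.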
Evaluation: Venue Recommendation
Recommend venues (conferences, journals, workshops) to researchers
User-based CF
- Populate user-item matrix using venue participation history
- Ratings: normalized venue publication counts
- User clustering: co-authorship network
Item-based CF
- Similarity between venues based on citation
- Similarity measure: cosine
- Venue clustering: similarity network
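The following sketch shows one way the user-item matrix could be populated from venue participation history. The slide does not say how the publication counts are normalized; dividing by each author's total number of papers is an assumption made here for illustration, and `build_ratings` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def build_ratings(publications):
    """Turn (author, venue) pairs, one per paper, into implicit ratings.

    Returns {author: {venue: share of the author's papers at that venue}}.
    Normalizing by the author's total is an assumption, not the slide's
    stated scheme.
    """
    counts = defaultdict(Counter)
    for author, venue in publications:
        counts[author][venue] += 1

    ratings = {}
    for author, venue_counts in counts.items():
        total = sum(venue_counts.values())
        ratings[author] = {v: c / total for v, c in venue_counts.items()}
    return ratings
```

With this normalization every author's ratings sum to 1, so prolific and occasional authors become comparable; other schemes (for example dividing by the author's maximum count) would serve the same purpose.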
Data Sets
DBLP (http://www.informatik.uni-trier.de/~ley/db/)
- 788,259 author names
- 1,226,412 publications
- 3,490 venues (conferences, workshops, journals)
CiteSeerX (http://citeseerx.ist.psu.edu/)
- 7,385,652 publications (including publications in reference lists)
- 22,735,240 citations
- Over 4 million author names
Combination
- Canopy clustering [McCallum 2000]
- Result: 864,097 matched pairs
- On average, a venue cites 2,306 times and is cited 2,037 times
User-based CF: Author Clustering
Data: DBLP
Two test cases, for the years 2005 and 2006
- Clustering of co-authorship networks
- 2005 network: 478,108 nodes; 1,427,196 edges
- 2006 network: 544,601 nodes; 1,686,867 edges
- Prediction of venue participation
Clustering algorithm
- Density-based algorithm [Clauset 2004]
- Obtained modularity: 0.829 and 0.82
Cluster size distribution follows a power law
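For readers unfamiliar with the modularity score reported above, the sketch below computes it for a given partition of an unweighted, undirected graph using the standard definition Q = sum over communities of (e_c / m) - (d_c / 2m)^2, where e_c is the number of edges inside community c, d_c the total degree of its nodes, and m the number of edges. It only evaluates a partition; it does not reproduce the clustering algorithm itself. The function name and input layout are assumptions, and the graph is assumed to have no self-loops or duplicate edges.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected simple graph.

    edges: list of (u, v) pairs, each undirected edge listed once
    community: {node: community id}
    """
    m = len(edges)
    if m == 0:
        return 0.0

    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # summed node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1

    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

Values close to 1 indicate that most edges fall inside communities rather than between them, which is how the reported 0.829 and 0.82 can be read.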
User-based CF: Performance
Precision for 1,000 randomly chosen authors
Precision computed at the 11 standard recall levels 0%, 10%, ..., 100%
Results
- Clustering performs better
- The improvement is not significant
- Better efficiency
Further improvement
- Different networks: citation
- Overlapping clustering
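As an illustration of the evaluation measure, the sketch below computes precision at the 11 standard recall levels for one ranked recommendation list. It assumes the usual information-retrieval convention of interpolated precision (the highest precision at any recall at or above the level); the slide does not state whether interpolation was used, and the function name is hypothetical.

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked: recommended items, best first
    relevant: items the user actually chose (e.g. venues attended)
    """
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11

    # Record (recall, precision) each time a relevant item is retrieved.
    points = []
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    levels = [i / 10 for i in range(11)]
    # A small tolerance guards against floating-point noise in the levels.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]
```

Averaging these 11-value curves over all sampled authors would give one precision-recall curve per method, which is the form in which the two approaches can be compared.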
Item-based CF: Venue Network Creation and Clustering
Knowledge network
- Aggregate bibliographic coupling counts at venue level
- Undirected graph G(V, E), where V: venues, E: edges weighted by cosine similarity

$$C_{i,j} = \frac{B_i \cdot B_j}{\lVert B_i \rVert \, \lVert B_j \rVert} = \frac{\sum_{k=1}^{n} B_{i,k} B_{j,k}}{\sqrt{\sum_{k=1}^{n} B_{i,k}^{2}} \, \sqrt{\sum_{k=1}^{n} B_{j,k}^{2}}}$$

- Threshold on $C_{i,j}$: 0.1
- Clustering: density-based algorithm [Newman 2004, Clauset 2004]
- Network visualization: force-directed paradigm [Fruchterman 1991]
Knowledge flow network (for venue ranking, see Pham & Klamma 2010)
- Aggregate bibliographic coupling counts at venue level
- Threshold: citation counts >= 50
Domains from Microsoft Academic Search (http://academic.research.microsoft.com/)
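A minimal sketch of the knowledge network construction: compute the cosine similarity between the coupling vectors of every pair of venues and keep an edge when the similarity reaches the threshold. Two points are assumptions rather than facts from the slide: each venue i is represented by a vector whose k-th entry is its aggregated bibliographic coupling count with venue k, and the threshold is applied as "greater than or equal to 0.1". The helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(b_i, b_j):
    """Cosine similarity between two equal-length coupling vectors."""
    dot = sum(x * y for x, y in zip(b_i, b_j))
    norm = sqrt(sum(x * x for x in b_i)) * sqrt(sum(y * y for y in b_j))
    return dot / norm if norm else 0.0


def build_network(coupling, threshold=0.1):
    """Return {(venue_i, venue_j): similarity} for pairs at or above threshold.

    coupling: {venue: [coupling count with venue 1, ..., venue n]}
    The >= comparison is an assumption; the slide gives only the value 0.1.
    """
    edges = {}
    for i, j in combinations(sorted(coupling), 2):
        c = cosine(coupling[i], coupling[j])
        if c >= threshold:
            edges[(i, j)] = c
    return edges
```

This all-pairs loop is quadratic in the number of venues, which is manageable for a few thousand venues but would need a sparse or blocked formulation at larger scale.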
Knowledge Network: the Visualization
Knowledge Network: Clustering
Interdisciplinary Venues: Top Betweenness Centrality
High Prestige Series: Top PageRank
Conclusions and Future Research
Clustering and recommender systems
- Advantage of using additional information for clustering
- Application of clustering to both user-based and item-based CF
- Key issue: impact of the communities (clusters) on the quality of recommendations; non-overlapping vs. overlapping communities
Outlook
- Further evaluation: trust network clustering, paper and potential collaborator recommendation
- Datasets: Epinions, Last.fm, etc.
- Digital libraries in Web 2.0: Mendeley, ResearchGate, etc.