A Review of
Information Filtering
Part II: Collaborative Filtering
Chengxiang Zhai
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Outline
• A Conceptual Framework for Collaborative
Filtering (CF)
• Rating-based Methods (Breese et al. 98)
– Memory-based methods
– Model-based methods
• Preference-based Methods (Cohen et al. 99 & Freund et al. 99)
• Summary & Research Directions
What is Collaborative Filtering (CF)?
• Making filtering decisions for an individual user
based on the judgments of other users
• Inferring individual’s interest/preferences from
that of other similar users
• General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences of
u1, …, um
CF: Applications
• Recommender Systems: books, CDs, Videos,
Movies, potentially anything!
• Can be combined with content-based filtering
• Example (commercial) systems
– GroupLens (Resnick et al. 94): usenet news rating
– Amazon: book recommendation
– Firefly (purchased by Microsoft?): music
recommendation
– Alexa: web page recommendation
CF: Assumptions
• Users with a common interest will have similar
preferences
• Users with similar preferences probably share
the same interest
• Examples
– “interest is IR” => “read SIGIR papers”
– “read SIGIR papers” => “interest is IR”
• Sufficiently large number of user preferences
are available
CF: Intuitions
• User similarity
– If Jamie liked the paper, I'll like the paper
– ? If Jamie liked the movie, I'll like the movie
– Suppose Jamie and I viewed similar movies in the past six months …
• Item similarity
– Since 90% of those who liked Star Wars also liked
Independence Day, and you liked Star Wars
– You may also like Independence Day
Collaborative Filtering vs. Content-based Filtering
• Basic filtering question: Will user U like item
X?
• Two different ways of answering it
– Look at what U likes
– Look at who likes X
• Can be combined
=> characterize X => content-based filtering
=> characterize U => collaborative filtering
Rating-based vs. Preference-based
• Rating-based: User’s preferences are encoded
using numerical ratings on items
– Complete ordering
– Absolute values can be meaningful
– But, values must be normalized to combine
• Preferences: User’s preferences are represented
by partial ordering of items
– Partial ordering
– Easier to exploit implicit preferences
A Formal Framework for Rating
[Figure: an m × n rating matrix. Rows are users U = {u1, u2, …, ui, …, um};
columns are objects O = {o1, o2, …, oj, …, on}. A few cells hold observed
ratings (e.g., 3, 1.5, 2, 1); the remaining cells, marked "?", are the
unknown values Xij = f(ui, oj) to be predicted.]
The task
Unknown function f: U × O → R
• Assume known f values for some (u,o)’s
• Predict f values for other (u,o)’s
• Essentially function approximation, like
other learning problems
Where are the intuitions?
• Similar users have similar preferences
– If u ≈ u', then for all o's, f(u,o) ≈ f(u',o)
• Similar objects have similar user preferences
– If o ≈ o', then for all u's, f(u,o) ≈ f(u,o')
• In general, f is "locally constant"
– If u ≈ u' and o ≈ o', then f(u,o) ≈ f(u',o')
– “Local smoothness” makes it possible to predict unknown
values by interpolation or extrapolation
• What does “local” mean?
Two Groups of Approaches
• Memory-based approaches
– f(u,o) = g(u)(o) ≈ g(u')(o) if u ≈ u'
– Find “neighbors” of u and combine g(u’)(o)’s
• Model-based approaches
– Assume structures/model: object cluster, user cluster, f’
defined on clusters
– f(u,o) = f’(cu, co)
– Estimation & Probabilistic inference
Memory-based Approaches (Breese et al. 98)
• General ideas:
– Xij: rating of object j by user i
– ni: average rating of all objects by user i
– Normalized ratings: Vij = Xij - ni
– Memory-based prediction:

$$\hat{v}_{aj} = K \sum_{i=1}^{m} w(a,i)\, v_{ij}, \qquad \hat{x}_{aj} = n_a + \hat{v}_{aj}, \qquad K = 1 \Big/ \sum_{i=1}^{m} |w(a,i)|$$

• Specific approaches differ in w(a,i) -- the
distance/similarity between user a and i
User Similarity Measures
• Pearson correlation coefficient (sum over commonly rated items):

$$w_p(a,i) = \frac{\sum_{j} (x_{aj} - n_a)(x_{ij} - n_i)}{\sqrt{\sum_{j} (x_{aj} - n_a)^2 \sum_{j} (x_{ij} - n_i)^2}}$$

• Cosine measure:

$$w_c(a,i) = \frac{\sum_{j=1}^{n} x_{aj}\, x_{ij}}{\sqrt{\sum_{j=1}^{n} x_{aj}^2}\, \sqrt{\sum_{j=1}^{n} x_{ij}^2}}$$

• Many other possibilities!
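A hedged Python sketch of the two weights; the dictionary-of-ratings representation is an assumption, and the toy data in the test is invented.

```python
import math

def pearson(a, i, ratings):
    """w_p(a,i): correlation of the two users' deviations from their means,
    summed over the items both have rated."""
    common = set(ratings[a]) & set(ratings[i])
    n_a = sum(ratings[a].values()) / len(ratings[a])
    n_i = sum(ratings[i].values()) / len(ratings[i])
    num = sum((ratings[a][j] - n_a) * (ratings[i][j] - n_i) for j in common)
    da = math.sqrt(sum((ratings[a][j] - n_a) ** 2 for j in common))
    di = math.sqrt(sum((ratings[i][j] - n_i) ** 2 for j in common))
    return 0.0 if da * di == 0 else num / (da * di)

def cosine(a, i, ratings):
    """w_c(a,i): cosine of the angle between the two raw rating vectors
    (missing entries treated as zero)."""
    common = set(ratings[a]) & set(ratings[i])
    num = sum(ratings[a][j] * ratings[i][j] for j in common)
    da = math.sqrt(sum(x ** 2 for x in ratings[a].values()))
    di = math.sqrt(sum(x ** 2 for x in ratings[i].values()))
    return 0.0 if da * di == 0 else num / (da * di)
```

The Pearson weight mean-centers each user before comparing, while the cosine weight works on raw ratings; this is exactly the normalization difference the evaluation results below turn on.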
Improving User Similarity Measures (Breese et al. 98)
• Dealing with missing values: default
ratings
• Inverse User Frequency (IUF): similar to
IDF
• Case Amplification: use w(a,i)^p, e.g., p = 2.5
Model-based Approaches (Breese et al. 98)
• General ideas
– Assume that data/ratings are explained by a
probabilistic model with parameter θ
– Estimate/learn model parameter θ based on data
– Predict unknown rating using E [xk+1 | x1, …, xk], which
is computed using the estimated model
• Specific methods differ in the model used and how
the model is estimated
$$E[x_{k+1} \mid x_1,\ldots,x_k] = \sum_{r} r \cdot p(x_{k+1}=r \mid x_1,\ldots,x_k)$$
Probabilistic Clustering
• Clustering users based on their ratings
– Assume ratings are observations of a
multinomial mixture model with parameters
p(C), p(xi|C)
– Model estimated using standard EM
• Predict ratings using E[xk+1 | x1, …, xk]
$$E[x_{k+1} \mid x_1,\ldots,x_k] = \sum_{r} r \cdot p(x_{k+1}=r \mid x_1,\ldots,x_k)$$

$$p(x_{k+1}=r \mid x_1,\ldots,x_k) = \sum_{c} p(C=c \mid x_1,\ldots,x_k)\; p(x_{k+1}=r \mid C=c)$$
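To make the decomposition concrete, here is a hedged Python sketch of the prediction step given an already-estimated mixture (the EM estimation itself is omitted); the cluster priors and conditional rating probabilities used in the test are invented toy numbers.

```python
def posterior(p_c, p_x, observed):
    """p(C=c | observed ratings), by Bayes' rule over the mixture.

    p_c: dict cluster -> p(C=c)
    p_x: dict cluster -> {item: {rating: p(x_item = rating | C=c)}}
    observed: dict item -> rating (the user's known ratings x_1..x_k)
    """
    post = {c: p_c[c] for c in p_c}
    for c in p_c:
        for item, r in observed.items():
            post[c] *= p_x[c][item][r]   # likelihood of each observed rating
    z = sum(post.values())
    return {c: w / z for c, w in post.items()}

def expected_rating(p_c, p_x, observed, item, values=(1, 2)):
    """E[x_item | observed] = sum_r r * sum_c p(C=c|obs) * p(x_item=r|C=c)."""
    post = posterior(p_c, p_x, observed)
    return sum(r * sum(post[c] * p_x[c][item][r] for c in p_c)
               for r in values)
```

The user's known ratings first pick out which clusters the user plausibly belongs to; the unseen rating is then averaged over those clusters' rating distributions.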
Bayesian Network
• Use BN to capture object/item dependency
– Each item/object is a node
– (Dependency) structure is learned from all data
– Model parameters: p(xk+1 |pa(xk+1)) where
pa(xk+1) is the parents/predictors of xk+1
(represented as a decision tree)
• Predict ratings using E[xk+1 | x1, …, xk]
$$E[x_{k+1} \mid x_1,\ldots,x_k] = \sum_{r} r \cdot p(x_{k+1}=r \mid x_1,\ldots,x_k)$$

where $p(x_{k+1}=r \mid x_1,\ldots,x_k)$ is given by the decision tree at node $x_{k+1}$.
Three-way Aspect Model (Popescul et al. 2001)
• CF + content-based
• Generative model
• (u,d,w) as observations
• z as hidden variable
• Standard EM
• Essentially clustering the joint data
• Evaluation on ResearchIndex data
• Found it’s better to treat (u,w) as observations
Evaluation Criteria (Breese et al. 98)
• Rating accuracy
– Average absolute deviation
– Pa = set of items predicted
• Ranking accuracy
– Expected utility
– Exponentially decaying viewing probability
– α (halflife) = the rank where the viewing probability = 0.5
– d = neutral rating
$$S_a = \frac{1}{|P_a|} \sum_{j \in P_a} |x_{aj} - \hat{x}_{aj}|$$

$$R_a = \sum_{j} \frac{\max(x_{aj} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}$$
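The two criteria can be sketched in Python as follows; the item names, ratings, and the d and alpha settings in the test are illustrative, not from the paper.

```python
def avg_abs_deviation(actual, predicted):
    """S_a: average absolute deviation |x_aj - xhat_aj| over the
    predicted items P_a."""
    return (sum(abs(actual[j] - predicted[j]) for j in predicted)
            / len(predicted))

def expected_utility(ranked, actual, d=3.0, alpha=5.0):
    """R_a: sum, over items in rank order, of max(x_aj - d, 0) discounted
    by the exponentially decaying viewing probability with halflife alpha."""
    return sum(max(actual[j] - d, 0.0) / 2 ** ((rank - 1) / (alpha - 1))
               for rank, j in enumerate(ranked, start=1))
```

The utility metric only rewards ratings above the neutral point d, and an item at rank alpha contributes half of what it would at rank 1.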
Datasets
Results
– BN & CR+ are generally better than VSIM & BC
– BN is best with more training data
– VSIM is better with little training data
– Inverse User Freq. is effective
– Case amplification is mostly effective
Summary of Rating-based Methods
• Effectiveness
– Both memory-based and model-based methods can
be effective
– The correlation method appears to be robust
– Bayesian network works well with plenty of training
data, but not very well with little training data
– The cosine similarity method works well with little
training data
Summary of Rating-based Methods (cont.)
• Efficiency
– Memory based methods are slower than model-
based methods in predicting
– Learning can be extremely slow for model-based
methods
Preference-based Methods (Cohen et al. 99, Freund et al. 99)
• Motivation
– Explicit ratings are not always available, but
implicit orderings/preferences might be available
– Only relative ratings are meaningful, even when
ratings are available
– Combining preferences has other applications, e.g.,
• Merging results from different search engines
A Formal Model of Preferences
• Instances: O={o1,…, on}
• Ranking function R: (U ×) O × O → [0,1]
– R(u,v)=1 means u is strongly preferred to v
– R(u,v)=0 means v is strongly preferred to u
– R(u,v)=0.5 means no preference
• Feedback: F = {(u,v)}, u is preferred to v
• Minimize loss over the hypothesis space H:

$$L(R,F) = \frac{1}{|F|} \sum_{(u,v) \in F} \bigl(1 - R(u,v)\bigr), \qquad \hat{R} = \arg\min_{R \in H} L(R,F)$$
The Hypothesis Space H
• Without constraints on H, the loss is
minimized by any R that agrees with F
• Appropriate constraints for collaborative filtering:

$$R_a(u,v) = \sum_{i \in U \setminus \{a\}} w_i\, R_i(u,v), \qquad w_i \ge 0, \quad \sum_i w_i = 1$$

• Compare this with the memory-based prediction rule:

$$\hat{v}_{aj} = K \sum_{i=1}^{m} w(a,i)\, v_{ij}, \qquad \hat{x}_{aj} = n_a + \hat{v}_{aj}$$
The Hedge Algorithm for Combining Preferences
• Iterative updating of w1, w2, …, wn
• Initialization: wi is uniform
• Updating, with β ∈ [0,1]:

$$w_i^{t+1} = \frac{w_i^t\, \beta^{L(R_i,\, F^t)}}{Z_t}$$

• L = 0 => weight stays
• L is large => weight is decreased
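A minimal Python sketch of this multiplicative update; the per-round expert losses in the example are invented toy numbers.

```python
def hedge_update(weights, losses, beta=0.5):
    """Hedge step: w_i <- w_i * beta**L_i, renormalized by Z_t.
    beta in [0, 1]; zero loss keeps the weight, large loss shrinks it."""
    raw = [w * beta ** L for w, L in zip(weights, losses)]
    z = sum(raw)                      # Z_t, the normalizer
    return [w / z for w in raw]

w = [0.25] * 4                        # uniform initialization over 4 experts
w = hedge_update(w, [0.0, 1.0, 0.5, 1.0])
# the zero-loss expert ends up with the largest share of the weight
```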
Some Theoretical Results
• The cumulative loss of Ra will not be much
worse than that of the best ranking
expert/feature
• Preferences Ra => ordering ρ => R
– L(R,F) <= DISAGREE(ρ, Ra)/|F| + L(Ra, F)
• Need to find ρ that minimizes disagreement
• General case: NP-complete
A Greedy Ordering Algorithm
• Use weighted graph to represent preferences R
• For each node, compute the potential value, i.e., outgoing_weights - ingoing_weights
• Rank the node with the highest potential value above all others
• Remove this node and its edges, repeat
• At least half of the optimal agreement is guaranteed
$$\pi(v) = \sum_{u \in O} R(v,u) \;-\; \sum_{u \in O} R(u,v)$$
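A small Python sketch of this procedure; the weighted preference graph in the example is invented.

```python
def greedy_order(nodes, R):
    """Greedy ordering: repeatedly emit the node with the highest potential
    pi(v) = total outgoing weight - total incoming weight, then remove it.
    Its edges drop out because potentials are recomputed over what's left."""
    remaining = set(nodes)
    order = []
    while remaining:
        def potential(v):
            return (sum(R.get((v, u), 0.0) for u in remaining)
                    - sum(R.get((u, v), 0.0) for u in remaining))
        best = max(remaining, key=potential)
        order.append(best)
        remaining.remove(best)
    return order

# Toy graph: edge (u, v) with weight R(u, v) means u is preferred to v
R = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 0.8}
order = greedy_order(["a", "b", "c"], R)
```

Here "a" has the largest out-minus-in weight, so it is ranked first, then the potentials are recomputed over the remaining nodes.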
Improvement
• Identify all the strongly connected
components
• Rank the components consistently with the
edges between them
• Rank the nodes within a component using
the basic greedy algorithm
Evaluation of Ordering Algorithms
• Measure: “weight coverage”
• Datasets = randomly generated small graphs
• Observations
– The basic greedy algorithm works better than a
random permutation baseline
– Improved version is generally better, but the
improvement is insignificant for large graphs
Metasearch Experiments
• Task: Known item search
– Search for an ML researcher's homepage
– Search for a university homepage
• Search expert = variant of query
• Learn to merge results of all search experts
• Feedback
– Complete : known item preferred to all others
– Click data : known item preferred to all above it
• Leave-one-out testing
Metasearch Results
• Measures: compare combined preferences with individual ranking function
– sign test: to see which system tends to rank the known relevant article higher.
– #queries with the known relevant item ranked above k.
– average rank of the known relevant item
• Learned system better than individual experts by all measures (not surprising, why?)
Metasearch Results (cont.)
Direct Learning of an Ordering Function
• Each expert is treated as a ranking feature fi: O → R ∪ {⊥}
(⊥ allows partial ranking)
• Given preference feedback Φ: X × X → R
• Goal: to learn H that minimizes the loss
• D(x0,x1): a distribution over X × X (actually a uniform
dist. over pairs with feedback order): D(x0,x1) = c · max{0, Φ(x0,x1)}
$$\mathrm{rloss}_D(H) = \sum_{x_0,\, x_1} D(x_0,x_1)\, [\![H(x_1) \le H(x_0)]\!] = \Pr_{(x_0,x_1) \sim D}\,[H(x_1) \le H(x_0)]$$
The RankBoost Algorithm
• Iterative updating of D(x0,x1)
• Initialization: D1= D
• For t = 1, …, T:
– Train weak learner using Dt
– Get weak hypothesis ht: X → R
– Choose αt > 0
– Update:

$$D_{t+1}(x_0,x_1) = \frac{D_t(x_0,x_1)\, e^{\alpha_t (h_t(x_0) - h_t(x_1))}}{Z_t}$$

• Final hypothesis:

$$H(x) = \sum_{t=1}^{T} \alpha_t\, h_t(x)$$
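As an illustration, a minimal Python sketch of one round of the distribution update above; the pairs, the weak ranker, and the choice of α are invented, and weak-learner training and α selection are omitted.

```python
import math

def rankboost_update(D, h, alpha):
    """One RankBoost round: reweight each pair (x0, x1), where x1 should
    outrank x0, up-weighting pairs the current weak ranker h misorders."""
    raw = {(x0, x1): d * math.exp(alpha * (h(x0) - h(x1)))
           for (x0, x1), d in D.items()}
    z = sum(raw.values())             # Z_t, the normalizer
    return {pair: d / z for pair, d in raw.items()}

# Toy pairs: in each (x0, x1), x1 should rank above x0
D = {(0, 1): 0.5, (2, 3): 0.5}
scores = [0.1, 0.9, 0.5, 0.4]         # a hypothetical weak ranker h_t
D = rankboost_update(D, lambda x: scores[x], alpha=1.0)
# the misordered pair (2, 3) now carries more weight than (0, 1)
```

Pairs the weak ranker already orders correctly shrink in weight, so later rounds concentrate on the pairs that are still misordered, exactly as in classification boosting.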
How to Choose αt and Design ht?
• Bound on the ranking loss:

$$\mathrm{rloss}_D(H) \le \prod_{t=1}^{T} Z_t$$

• Thus, we should choose αt that minimizes the
bound
• Three approaches:
– Numerical search
– Special case: h is either 0 or 1
– Approximation of Z, then find analytic solution
Efficient RankBoost for Bipartite Feedback
• Bipartite feedback: the pairs come from two disjoint sets X0 and X1,
with every x1 in X1 preferred to every x0 in X0; essentially binary
classification
• The distribution then factorizes as D_t(x0,x1) = v_t(x0) v_t(x1), so the
update can be maintained per instance:

$$v_{t+1}(x_0) = \frac{v_t(x_0)\, e^{\alpha_t h_t(x_0)}}{Z_t^0}, \qquad v_{t+1}(x_1) = \frac{v_t(x_1)\, e^{-\alpha_t h_t(x_1)}}{Z_t^1}$$

$$D_{t+1}(x_0,x_1) = v_{t+1}(x_0)\, v_{t+1}(x_1), \qquad Z_t = Z_t^0\, Z_t^1$$

• Complexity at each round: reduced from O(|X0|·|X1|) to O(|X0|+|X1|)
Evaluation of RankBoost
• Meta-search: same as in (Cohen et al. 99)
• Perfect feedback
• 4-fold cross validation
EachMovie Evaluation
[Table: EachMovie dataset statistics: # users, # movies/user, # feedback movies]
Performance Comparison: Cohen et al. 99 vs. Freund et al. 99
Summary
• CF is “easy”
– The user’s expectation is low
– Any recommendation is better than none
– Making it practically useful
• CF is “hard”
– Data sparseness
– Scalability
– Domain-dependent
Summary (cont.)
• CF as a Learning Task
– Rating-based formulation
• Learn f: U x O -> R
• Algorithms
– Instance-based/memory-based (k-nearest neighbors)
– Model-based (probabilistic clustering)
– Preference-based formulation
• Learn PREF: U x O x O -> R
• Algorithms
– General preference combination (Hedge), greedy ordering
– Efficient restricted preference combination (RankBoost)
Summary (cont.)
• Evaluation
– Rating-based methods
• Simple methods seem to be reasonably effective
• Advantage of sophisticated methods seems to be limited
– Preference-based methods
• More effective than rating-based methods according to
one evaluation
• Evaluation on meta-search is weak
Research Directions
• Exploiting complete information
– CF + content-based filtering + domain knowledge +
user model …
• More “localized” kernels for instance-based
methods
– Predicting movies need different “neighbor users”
than predicting books
– Suggestion: use items similar to the target item as
features to find neighbors
Research Directions (cont.)
• Modeling time
– There might be sequential patterns on the items a
user purchased (e.g., bread machine -> bread
machine mix)
• Probabilistic model of preferences
– Making the preference function a probability function,
e.g., P(A>B|U)
– Clustering items and users
– Minimizing preference disagreements
References
• Cohen, W.W., Schapire, R.E., and Singer, Y. (1999) "Learning to Order Things",
Journal of AI Research, Volume 10, pages 243-270.
• Freund, Y., Iyer, R.,Schapire, R.E., & Singer, Y. (1999). An efficient boosting
algorithm for combining preferences. Machine Learning Journal. 1999.
• Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empirical Analysis of Predictive
Algorithms for Collaborative Filtering. In Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, pp. 43-52.
• Alexandrin Popescul and Lyle H. Ungar, Probabilistic Models for Unified
Collaborative and Content-Based Recommendation in Sparse-Data Environments,
UAI 2001.
• N. Good, J.B. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl.
"Combining Collaborative Filtering with Personal Agents for Better
Recommendations." Proceedings AAAI-99. pp 439-446. 1999.