
  • Collaborative Filtering

    CF deals with automating subjective preferences. Which film is good? Would I like to watch a given DVD? Is that copy of the TTS exam reliable? Show me only messages I would want to read.

    Documents themselves may not be a good source of information about user preference.

    1

    Example

    An example matrix:

    Message  Ken  Lee  Meg  Nan
       1      1    4    2    2
       2      5    2    4    4
       3      -    -    3    -
       4      2    5    -    5
       5      4    1    -    1
       6      7    2    5    7
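    For concreteness, the same ratings can be written as a message-by-user matrix with a marker for unrated entries. A minimal sketch in Python/NumPy (the variable names are illustrative and not part of the original slides):

```python
import numpy as np

# Rows are messages 1-6; columns are the users Ken, Lee, Meg, Nan.
# np.nan marks an unrated entry (shown as "-" above).
users = ["Ken", "Lee", "Meg", "Nan"]
ratings = np.array([
    [1,      4,      2,      2],
    [5,      2,      4,      4],
    [np.nan, np.nan, 3,      np.nan],
    [2,      5,      np.nan, 5],
    [4,      1,      np.nan, 1],
    [7,      2,      5,      7],
])
```

    The later sketches assume this items-by-users layout.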

    2


    Relationship to Data Mining

    CF is similar to dealing with missing or censored data.

    Typical DM approaches assume data is missing at random (MAR): unseen values can then be determined from the seen values.

    One approach: use the mean of related values (a small sketch follows below).

    Problems: value assignment is non-uniform, and there may be a reason for values being missing, in which case the MAR assumption is violated.
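    As a small illustration of the mean-of-related-values idea (a sketch only, and one that leans on the MAR assumption just criticised; the variable names are not from the slides), each missing rating can be filled with the mean of the observed ratings for that item:

```python
import numpy as np

# Message-by-user matrix from the earlier example; np.nan marks a missing rating.
ratings = np.array([
    [1.0,    4.0,    2.0,    2.0],
    [5.0,    2.0,    4.0,    4.0],
    [np.nan, np.nan, 3.0,    np.nan],
    [2.0,    5.0,    np.nan, 5.0],
    [4.0,    1.0,    np.nan, 1.0],
    [7.0,    2.0,    5.0,    7.0],
])

# Fill each missing entry with the mean rating that item received from the
# users who did rate it (row-wise mean of the observed values).
item_means = np.nanmean(ratings, axis=1, keepdims=True)
imputed = np.where(np.isnan(ratings), item_means, ratings)
```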

    4

  • CF: main aspects

    Instead of trying to find a single expert on some task (and using their judgement): use the collective, average opinion, or use opinions which are close to your own.

    Collaborative decision making is likely to be more robust than individual decision making.

    Using collective decision making divorces recommendations from surface characteristics of the document.

    5

    CF: main aspects

    There are two main modelling tasks: predict missing values, or rank examples and present the top n most highly ranked examples as being preferred.

    Missing value prediction is the harder of the two.

    6

    GroupLens

    GroupLens is an example CF system:

    The task is collaborative filtering of Usenet news articles: high volume, low signal-to-noise ratio (most articles are irrelevant or worse).

    Users rate the articles they read (on some rating scale). Each document is then a row in a table.

    Columns denote users.

    Basic task: predict the missing entry of that matrix based upon the rest of the matrix.

    7

    GroupLens

    GroupLens uses the heuristic: people who agreed in the past are likely to agree in the future.

    To judge agreement between users (e.g. Ken and Lee), statistical correlation tests are applied.

    Only documents that both people rated are considered.

    A person's predicted score for a document is then a combination of their average document score and other people's scores for the document in question.

    The second term (other people's scores) is weighted by user agreement.

    Unclear how well this does.
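    A minimal sketch of this kind of correlation-weighted prediction (the classic Resnick-style rule: the user's mean rating plus agreement-weighted deviations of other users' ratings; this is a simplified illustration, not GroupLens's actual code):

```python
import numpy as np

def pearson(a, b, ratings):
    """Correlation between users a and b over the items both have rated."""
    both = ~np.isnan(ratings[:, a]) & ~np.isnan(ratings[:, b])
    if both.sum() < 2:
        return 0.0
    ra, rb = ratings[both, a], ratings[both, b]
    if ra.std() == 0 or rb.std() == 0:
        return 0.0
    return float(np.mean((ra - ra.mean()) * (rb - rb.mean())) / (ra.std() * rb.std()))

def predict(user, item, ratings):
    """User's mean rating plus correlation-weighted deviations of other users."""
    means = np.nanmean(ratings, axis=0)          # each user's mean rating
    num = den = 0.0
    for other in range(ratings.shape[1]):
        if other == user or np.isnan(ratings[item, other]):
            continue
        w = pearson(user, other, ratings)
        num += w * (ratings[item, other] - means[other])
        den += abs(w)
    return means[user] + (num / den if den > 0 else 0.0)
```

    Applied to the example matrix, this would predict Meg's missing ratings from how well her past ratings agree with Ken's, Lee's and Nan's.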

    8

  • Method comparison

    The prediction of document ratings in GroupLens is arguably ad-hoc.

    How well do other methods do?

    Breese et al implemented a number of CF approaches:

    A correlation-based method (CR).

    A vector-space model (VS): items are terms, users are documents (see the sketch after this list).

    A baseline model (POP): show people the most highly rated items.

    Bayes networks (BN).
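    To illustrate the vector-space analogy (a sketch under the assumption of plain cosine weighting; Breese et al's exact formulation may differ in its details), each user is treated as a "document" vector over the items, and other users' ratings are combined with cosine-similarity weights:

```python
import numpy as np

def cosine(a, b, ratings):
    """Cosine similarity between two user columns, with missing ratings as 0."""
    va, vb = np.nan_to_num(ratings[:, a]), np.nan_to_num(ratings[:, b])
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom > 0 else 0.0

def predict_vs(user, item, ratings):
    """Cosine-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other in range(ratings.shape[1]):
        if other == user or np.isnan(ratings[item, other]):
            continue
        w = cosine(user, other, ratings)
        num += w * ratings[item, other]
        den += abs(w)
    return num / den if den > 0 else np.nan
```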

    9

    CF evaluation

    Breese et al evaluated CF as follows. Two protocols were considered:

    1. For each user, withhold a single value for an item and attempt to predict it (AllBut1).

    2. For each user, randomly use 2, 5 or 10 values and predict the rest (Given2, Given5, etc.).

    The first protocol considers using as much data as possible; the other protocols consider situations with sparse counts (a small sketch of both splits follows below).

    Scoring is based upon how close a predicted value is to the real value.
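    A minimal sketch of the two withholding protocols (the function and variable names are assumptions for illustration; the scoring metric is left abstract, since the slides only say predictions are scored by closeness to the true value):

```python
import random

def allbut1_split(user_ratings):
    """Hold out one rated item per user; the rest is given to the predictor."""
    held = random.choice(list(user_ratings))
    given = {i: r for i, r in user_ratings.items() if i != held}
    return given, {held: user_ratings[held]}

def givenN_split(user_ratings, n):
    """Give the predictor n randomly chosen ratings; hold out all the others."""
    chosen = set(random.sample(list(user_ratings), min(n, len(user_ratings))))
    given = {i: r for i, r in user_ratings.items() if i in chosen}
    held = {i: r for i, r in user_ratings.items() if i not in chosen}
    return given, held
```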

    10

    CF evaluation

    Three datasets were used:

    1. MS Web: visits to various areas of the Microsoft web site. A value records whether a user visited an area.
    2. TV: which programmes people watched.
    3. Movie: film preferences.

                              MS Web   TV     Movie
    Users                     3453     1463   4119
    Items                     294      203    1623
    Mean choices per user     3.95     9.55   46.4

    11

    CF results: MS Web

    Alg.   Given2   Given5   Given10   AllBut1
    BN       60       60       54        67
    CR       61       58       51        64
    VS       59       56       49        62
    POP      49       47       41        50

    Higher is better

    12

  • CF results: TV

    Alg.   Given2   Given5   Given10   AllBut1
    BN       35       42       47        45
    CR       39       43       43        39
    VS       39       41       39        36
    POP      20       20       19        14

    13

    CF results: Film

    Alg.   Given2   Given5   Given10   AllBut1
    BN       29       31       33        23
    CR       42       42       41        23
    VS       42       42       40        22
    POP      31       29       28        14

    14

    CF results: comments

    The results are very counter-intuitive! Training on more labelled data should yield the best results, yet here it is sometimes better to use less labelled data.

    The various algorithms usually outperform the baseline. Correlation is typically the best approach.

    The Web results are best and Film recommendations are worst: Web page selection is probably less influenced by peer pressure.

    Yu et al showed the expected pattern of results on the Film task.

    15

    Preferences

    Where do the preferences come from in the first place? People who rate items are likely to rate items they have a strong view on. If large numbers of ratings are needed before a CF system is useful, people may not have the patience to keep training it until it does become useful.

    One strategy might be to derive preferences from user actions. This is similar to relevance feedback, in terms of which documents are actually selected.

    Another idea might be to treat user preferences as hidden data and use EM or some variant. Bootstrapping systems with little labelled data is a hot area.

    16

  • Scaling

    A significant problem is scaling.

    CF methods which focus upon locating similar users do not scale well: typically, the number of users grows at a faster rate than the number of items.

    Each user's set of actively selected items may be very small.

    An interesting direction is instead to focus upon item-item similarity (see the sketch below): consider just previously rated items and measure the similarity of these items to the target item. Sarwar et al suggest that dramatic computational savings can be achieved over user-specific approaches.
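    A hedged sketch of item-based prediction in the spirit of Sarwar et al (plain cosine similarity between item rows; the actual paper also considers adjusted cosine and correlation, and precomputes similarities offline):

```python
import numpy as np

def item_similarity(i, j, ratings):
    """Cosine similarity between two item rows, treating missing ratings as 0."""
    vi, vj = np.nan_to_num(ratings[i]), np.nan_to_num(ratings[j])
    denom = np.linalg.norm(vi) * np.linalg.norm(vj)
    return float(vi @ vj / denom) if denom > 0 else 0.0

def predict_item_based(user, target, ratings, k=2):
    """Predict from the k items most similar to the target that the user has rated."""
    rated = [i for i in range(ratings.shape[0])
             if i != target and not np.isnan(ratings[i, user])]
    sims = sorted(((item_similarity(target, i, ratings), i) for i in rated),
                  reverse=True)[:k]
    num = sum(s * ratings[i, user] for s, i in sims)
    den = sum(abs(s) for s, _ in sims)
    return num / den if den > 0 else np.nan
```

    Because item-item similarities depend only on the (usually more stable) item set, they can be computed ahead of time, which is where the savings over searching for similar users come from.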

    17

    Summary

    CF can allow subjective rating of documents.

    Simple correlation-based methods are competitive with more elaborate machine learning approaches.

    Processing large volumes of transactions, with many users, can be challenging.

    CF research seems to be on the wane... post dot-com bust?

    18

    Further reading

    Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom and John Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proc. of the ACM 1994 Conference on Computer Supported Cooperative Work, 1994.

    John Breese, David Heckerman and Carl Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. 14th Conference on Uncertainty in Artificial Intelligence, 1998.

    Badrul Sarwar, George Karypis, Joseph Konstan and John Riedl. Item-Based Collaborative Filtering Recommendation Algorithms. WWW10, 2001.

    Kai Yu, Zhong Wen, Xiaowei Xu and Martin Ester. Feature Weighting and Instance Selection for Collaborative Filtering. 2nd International Workshop on Management of Information on the Web, 2001.

    19