Filtering and Recommendation INST 734 Module 9 Doug Oard

Page 1: Filtering and Recommendation INST 734 Module 9 Doug Oard

Filtering and Recommendation

INST 734

Module 9

Doug Oard

Page 2: Filtering and Recommendation INST 734 Module 9 Doug Oard

Agenda

Filtering

• Recommender systems

• Classification

Page 3: Filtering and Recommendation INST 734 Module 9 Doug Oard

Information Filtering

• An abstract task in which:
– The information need is stable
– A stream of documents is arriving
– The system must decide which ones to present

• Introduced by Luhn in 1958
– As Selective Dissemination of Information (SDI)
– Named “Filtering” by Denning in 1975
– After 1983, SDI came to have another meaning

Page 4: Filtering and Recommendation INST 734 Module 9 Doug Oard

Information Filtering

[Diagram components: New Documents, User Profile, Matching, Recommendation, Rating]

Page 5: Filtering and Recommendation INST 734 Module 9 Doug Oard

Information Access Problems

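Each problem is characterized by whether the collection and the information need are stable or different each time:

• Data Mining: stable collection, stable information need

• Retrieval: stable collection, information need different each time

• Filtering: collection different each time, stable information need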

Page 6: Filtering and Recommendation INST 734 Module 9 Doug Oard

Information Filtering Examples

• Email spam filtering

• Personalized newspaper

• Children’s Internet Protection Act (CIPA)

Page 7: Filtering and Recommendation INST 734 Module 9 Doug Oard

Standing Queries

• Have the user specify a “standing query”
– This is the initial “profile”

• Allow updates based on relevance feedback
– Track changing interests
– Learn new terms

• Match each arriving document to the profile
– On arrival, on a schedule, or when the app is opened
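The standing-query cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `StandingQuery` class, the cosine score, the `threshold` of 0.2, and the feedback `rate` are all hypothetical choices, not something the slides specify.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class StandingQuery:
    """Hypothetical profile: weighted terms plus a score threshold."""

    def __init__(self, query, threshold=0.2):
        # The user's standing query is the initial profile.
        self.weights = Counter(tokenize(query))
        self.threshold = threshold

    def score(self, document):
        """Cosine similarity between the profile and one document."""
        doc = Counter(tokenize(document))
        dot = sum(w * doc[t] for t, w in self.weights.items())
        profile_norm = math.sqrt(sum(w * w for w in self.weights.values()))
        doc_norm = math.sqrt(sum(c * c for c in doc.values()))
        norm = profile_norm * doc_norm
        return dot / norm if norm else 0.0

    def matches(self, document):
        """Decide whether an arriving document should be presented."""
        return self.score(document) >= self.threshold

    def feedback(self, document, relevant, rate=0.5):
        """Relevance feedback: move term weights toward (or away from)
        the judged document, which also lets the profile learn new terms."""
        sign = 1 if relevant else -1
        for term, count in Counter(tokenize(document)).items():
            self.weights[term] += sign * rate * count
            if self.weights[term] <= 0:
                del self.weights[term]
```

Calling `matches` once per arriving document covers the "on arrival" case; running it over a batch covers the scheduled and app-open cases.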

Page 8: Filtering and Recommendation INST 734 Module 9 Doug Oard

Profile Indexing

• Build an inverted file of profiles
– Postings are profiles that contain each term

• RAM can hold ~5 million profiles/GB
– And several machines could run in parallel

• Challenges:
– New terms (would) have infinite IDF!
– No obvious a priori way to do threshold selection
– Privacy (with a centralized profile index)
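A minimal sketch of an inverted file of profiles follows, assuming each profile is just a set of terms. The `ProfileIndex` class and the `min_overlap` cutoff are hypothetical; the cutoff stands in for the threshold that, as noted above, has no obvious a priori setting. As a rough check on the memory figure, 5 million profiles per GB leaves about 200 bytes per profile.

```python
from collections import defaultdict


class ProfileIndex:
    """Hypothetical inverted file: each term maps to the profiles containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of profile ids

    def add_profile(self, profile_id, terms):
        """Index one profile under every term it contains."""
        for term in terms:
            self.postings[term.lower()].add(profile_id)

    def match(self, document, min_overlap=1):
        """Return {profile_id: number of shared terms} for one arriving document.

        Only the postings for terms that occur in the document are visited,
        so the cost depends on the document, not on the number of profiles.
        """
        hits = defaultdict(int)
        for term in set(document.lower().split()):
            for profile_id in self.postings.get(term, ()):
                hits[profile_id] += 1
        return {pid: n for pid, n in hits.items() if n >= min_overlap}
```

Note that the roles are reversed relative to ordinary retrieval: the arriving document acts as the query and the stored profiles act as the indexed collection.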

Page 9: Filtering and Recommendation INST 734 Module 9 Doug Oard

Content-Based Filtering: “Fast Data Finder”

• Boolean filtering using custom hardware
– Up to 10,000 documents per second (in 1996!)

• Words pass through a pipeline architecture
– Each element looks for one word

[Diagram: pipeline elements for the words “good”, “party”, “great”, and “aid”, combined with OR, AND, and NOT operators]
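The pipeline idea can be imitated in software, as a rough sketch rather than a model of the actual hardware. Every token is passed to every element, each element watches for a single word, and a Boolean expression combines the resulting flags. The exact wiring of the slide's diagram is not recoverable, so `example_query` below is an assumed arrangement of the four words and three operators; `WordElement` and `boolean_filter` are likewise hypothetical names.

```python
class WordElement:
    """One pipeline stage: watches the passing word stream for a single word."""

    def __init__(self, word):
        self.word = word
        self.seen = False

    def feed(self, token):
        if token == self.word:
            self.seen = True


def boolean_filter(document, words, combine):
    """Stream the document's tokens past one element per word,
    then apply the Boolean combination to the per-word flags."""
    elements = {word: WordElement(word) for word in words}
    for token in document.lower().split():
        for element in elements.values():
            element.feed(token)
    flags = {word: element.seen for word, element in elements.items()}
    return combine(flags)


def example_query(flags):
    """Assumed query: (good OR great) AND party AND NOT aid."""
    return (flags["good"] or flags["great"]) and flags["party"] and not flags["aid"]
```

In hardware the elements would inspect each word concurrently as it moves down the pipeline; the nested loop here only reproduces the result, not the throughput.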

Page 10: Filtering and Recommendation INST 734 Module 9 Doug Oard

Spam Filtering

• Adversarial IR yields an adaptation cycle

• Content signatures
– Compression-based techniques

• Source signatures
– Blacklists and whitelists
– DomainKeys Identified Mail (DKIM)

• Behavioral signatures
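One way a compression-based content signature can work is sketched below, as an assumption rather than the specific method the slide has in mind. A message is assigned to whichever class's example text lets it compress into fewer additional bytes. The function names and the use of `zlib` are illustrative choices, and real systems would use much larger training corpora.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, message):
    """Additional compressed bytes needed to append the message to the corpus.

    A message that shares long substrings with the corpus costs little,
    because the compressor can encode it as back-references.
    """
    return compressed_size(corpus + " " + message) - compressed_size(corpus)


def classify(message, spam_corpus, ham_corpus):
    """Label the message by the class whose corpus compresses it better."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"
```

Because zlib only looks back over a 32 KB window, this sketch effectively compares the message with the tail of each corpus.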

Page 11: Filtering and Recommendation INST 734 Module 9 Doug Oard

Agenda

• Filtering

Recommender systems

• Classification