Level Search Filtering for IR Model Reduction


Michael W. Berry, Xiaoyan (Kathy) Zhang, Padma Raghavan
Department of Computer Science, University of Tennessee


IMA Hot Topics Workshop: Text Mining, Apr 17, 2000


Computational Models for IR

1. Need a framework for designing concept-based IR models.

2. Can we draw upon backgrounds and experiences of computer scientists and mathematicians?

3. Effective indexing should address issues of scale and accuracy.


The Vector Space Model

Represent terms and documents as vectors in k-dimensional space

Similarity computed by measures such as cosine or Euclidean distance

Early prototype: the SMART system, developed by Salton et al. in the 1970s and '80s
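To make the matching step concrete, a minimal sketch in Python (the matrix, term labels, and weights are toy values, not from the slides):

```python
import numpy as np

# Toy term-by-document matrix: rows are terms, columns are documents.
A = np.array([
    [1.0, 0.0, 2.0],   # hypothetical term "image"
    [0.0, 1.0, 1.0],   # hypothetical term "apple"
    [3.0, 1.0, 0.0],   # hypothetical term "culture"
])

q = np.array([1.0, 0.0, 1.0])  # query vector over the same terms

# Cosine similarity between the query and each document column.
sims = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
print(sims)  # a higher cosine means a more similar document
```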


Motivation for LSI

Two fundamental query-matching problems:

- Synonymy (image, likeness, portrait, facsimile, icon)
- Polysemy (Adam's apple, patient's discharge, culture)


Motivation for LSI

Approach: Treat word-to-document association data as an unreliable estimate of a larger set of applicable words.

Goal: Cluster similar documents, which may share no terms, in a low-dimensional subspace (improve recall).


LSI Approach

Preprocessing: Compute a low-rank approximation to the original (sparse) term-by-document matrix.

Vector Space Model: Encode terms and documents using factors derived from the SVD (or ULV, SDD).

Postprocessing: Rank similarity of terms and documents to the query via Euclidean distances or cosines.


SVD Encoding

A_k is the best rank-k approximation to the term-by-document matrix A:

$A_k = U_k \Sigma_k V_k^T$

[Figure: the factorization drawn in block form; rows of A are terms and columns are documents, with U_k labeled "Term Vectors" and V_k labeled "Doc Vectors".]
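A minimal sketch of this encoding and query-matching step (illustrative only; the matrix sizes, density, and k are assumptions, and scipy's svds stands in for whatever sparse SVD code was actually used):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Toy sparse term-by-document matrix (rows: terms, columns: documents).
A = sparse_random(2000, 500, density=0.01, format="csc", random_state=0)

k = 50                          # number of LSI factors (collection-dependent)
Uk, sk, VkT = svds(A, k=k)      # A_k = Uk @ np.diag(sk) @ VkT

# Document vectors: rows of Vk scaled by the singular values.
doc_vectors = VkT.T * sk

# Project a query (a term-space vector) into the k-dim space: q_hat = q @ Uk.
q = np.zeros(A.shape[0])
q[[10, 42, 777]] = 1.0          # hypothetical query term indices
q_hat = q @ Uk

# Rank documents by cosine similarity to the projected query.
sims = (doc_vectors @ q_hat) / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_hat) + 1e-12)
top10 = np.argsort(-sims)[:10]
```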


Vector Space Dimension

Want the minimum number of factors (k) that discriminates most concepts.

In practice, k ranges between 100 and 300 but could be much larger.

Choosing optimal k for different collections is challenging.
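One common heuristic, not prescribed by the slides: keep the smallest k whose leading singular values capture a fixed share of the spectrum's energy. A sketch:

```python
import numpy as np

def choose_k(singular_values, energy=0.9):
    """Smallest k whose leading singular values capture `energy` of the
    total squared spectral mass (one heuristic among many)."""
    s = np.sort(np.asarray(singular_values))[::-1]
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# e.g., choose_k(sk, 0.9) after computing sk with scipy's svds
```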


Strengths of LSI

- Completely automatic: no stemming required, allows misspellings
- Multilanguage search capability: Landauer (Colorado), Littman (Duke)
- Conceptual IR capability (recall): retrieves relevant documents that do not contain any search terms


Changing the LSI Model

Updating
- Folding-in new terms or documents [Deerwester et al. '90] (see the sketch below)
- SVD-updating [O'Brien '94], [Simon & Zha '97]

Downdating
- Modify the SVD w.r.t. term or document deletions [Berry & Witter '98]
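For folding-in, a minimal sketch following the standard LSI folding-in formula (variable names are mine): a new document d, expressed over the original terms, is projected as d_hat = d^T U_k Σ_k^{-1} and appended without recomputing the SVD.

```python
import numpy as np

def fold_in_document(d, Uk, sk):
    """Project a new document (a term-space vector) into the existing
    k-dimensional LSI space: d_hat = d^T Uk diag(sk)^{-1}.
    Cheap, but the factorization slowly drifts from a true SVD."""
    return (d @ Uk) / sk

# doc_vectors = np.vstack([doc_vectors, fold_in_document(d_new, Uk, sk)])
```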


Recent LSI-based Research

- Implementation of kd-trees to reduce query-matching complexity (Hughey & Berry '00, Info. Retrieval)
- Unsupervised learning model for data mining electronic commerce data (J. Jiang et al. '99, IDA)


Recent LSI-based Research (cont.)

- Nonlinear SVD approach for constraint-based feedback (E. Jiang & Berry '00, Lin. Alg. & Applications)
- Future incorporation of up- and downdating into LSI-based client/servers


Information Filtering

Concept: Reduce a large document collection to a reasonably sized set of potentially retrievable documents.

Goal: Produce a relatively small subset containing a high proportion of relevant documents.


Approach: Level Search

Reduce the sparse SVD computation cost by selecting a small submatrix of the original term-by-document matrix.

Use an undirected graph model (a code sketch follows the figure below):
- Term or document: vertex
- Term weight: edge weight
- Term in document, or document containing term: edge


Level Search

[Figure: level-search expansion from a query. Level 1: the query's terms (Terms 1-3); Level 2: documents containing them (Docs 1-4); Level 3: new terms occurring in those documents (Terms 5-8); Level 4: further documents containing those terms (Docs 5-8).]
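A reconstruction of the level-search expansion in Python (my sketch, not the authors' code): starting from the query's terms, alternately gather the documents containing the frontier terms and the terms occurring in the frontier documents, for a fixed number of levels as in the figure above.

```python
from scipy.sparse import random as sparse_random

def level_search(A_csr, A_csc, query_terms, num_levels=4):
    """Breadth-first expansion on the term-document bipartite graph.

    A_csr: term-by-document matrix in CSR form (fast term -> docs lookup)
    A_csc: the same matrix in CSC form (fast doc -> terms lookup)
    Returns the term and document indices of the selected submatrix.
    """
    terms = set(query_terms)            # Level 1: the query's own terms
    docs = set()
    frontier = set(query_terms)
    for level in range(2, num_levels + 1):
        if level % 2 == 0:              # even level: docs containing frontier terms
            frontier = {d for t in frontier for d in A_csr[t].indices} - docs
            docs |= frontier
        else:                           # odd level: terms occurring in frontier docs
            frontier = {t for d in frontier for t in A_csc[:, d].indices} - terms
            terms |= frontier
    return sorted(terms), sorted(docs)

# Hypothetical usage on a toy sparse matrix:
A = sparse_random(1000, 200, density=0.02, format="csr", random_state=0)
terms, docs = level_search(A, A.tocsc(), query_terms=[3, 17, 250])
```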


Similarity Measures

Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents.

Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved.
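For concreteness, a small sketch with hypothetical document ids:

```python
def recall(relevant, retrieved):
    """Fraction of all relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

relevant = {1, 2, 3, 4, 5}       # hypothetical judged-relevant doc ids
retrieved = {2, 3, 5, 8, 13}     # hypothetical retrieved doc ids
print(recall(relevant, retrieved), precision(relevant, retrieved))  # 0.6 0.6
```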


Test Collections

Collection    Docs     Terms    Nonzeros
MEDLINE      1,033     5,831      52,009
TIME           425    10,804      68,240
CISI         1,469     5,609      83,602
FBIS         4,974    42,500   1,573,306


Average Recall & Submatrix Sizes for Level Search

Collection   Avg Recall (%)   %Docs   %Terms   %Nonzeros
MEDLINE           85.7         24.8     63.2      27.8
TIME              69.4         15.3     61.9      22.7
CISI              55.1         21.4     64.1      25.2
FBIS              82.1         28.5     55.0      52.9
Mean              67.8         18.2     53.4      27.0


Results for MEDLINE

[Figure: precision vs. recall for MEDLINE (5,831 terms, 1,033 docs), comparing LSI Only with Level Search Plus LSI.]


Results for CISI

[Figure: precision vs. recall for CISI (5,609 terms, 1,469 docs), comparing LSI Only with Level Search Plus LSI.]


Results for TIME

[Figure: precision vs. recall for TIME (10,804 terms, 425 docs), comparing LSI Only with Level Search Plus LSI.]


Results for FBIS (TREC-5)

[Figure: precision vs. recall for FBIS (42,500 terms, 4,974 docs), comparing LSI Only with Level Search Plus LSI.]


Level Search with Pruning

[Figure: the level-search graph as before (query terms at Level 1 through Docs 5-8 at Level 4), with singleton terms deleted at Level 3.]

Prune terms to further reduce the submatrix while maintaining recall; pruning has no effect on the documents.
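A sketch of one plausible pruning rule (the slides only say singleton terms are deleted; the exact criterion here is my assumption): drop any selected term that occurs in just one of the selected documents, keeping query terms regardless.

```python
def prune_singleton_terms(A_csr, terms, docs, query_terms):
    """Drop selected terms that occur in only one selected document.

    Documents are untouched, so the filtered document set (and hence
    recall over that set) is preserved.
    """
    doc_set, keep_always = set(docs), set(query_terms)
    kept = []
    for t in terms:
        df = sum(1 for d in A_csr[t].indices if d in doc_set)
        if df > 1 or t in keep_always:
            kept.append(t)
    return kept
```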


Effects of Pruning

[Figure: LSI input-matrix density (% nonzeros) comparisons after level-search filtering (L) and pruning (P), for MEDLINE, CISI, TIME, FBIS, and LATIMES. Bars: LSI, LSI&L, LSI&LP. 17,903 terms, 1,086 docs (TREC-5).]


Effects of Pruning

[Figure: LSI average precision (%) comparisons with/without level search (L) and/or pruning (P), for MEDLINE, CISI, TIME, FBIS, and LATIMES. Bars: LSI, LSI&L, LSI&LP. 230 terms/doc, 29 terms/query.]


Impact

Level Search is a simple and cost-effective filtering method for LSI, supporting scalable IR.

It may reduce the effective term-by-document matrix size by 75% with no significant loss of LSI precision (less than 5%).


Some Future Challenges for LSI

- Agent-based software for indexing remote/distributed collections
- Effective updating with global weighting
- Incorporate phrases and proximity
- Expand cosine matching to incorporate other similarity-based data (e.g., images)
- Optimal number of dimensions


LSI Web Site

- Investigators
- Papers
- Demos
- Software

http://www.cs.utk.edu/~lsi


SIAM Book (June '99)

- Document File Preparation
- Vector Space Models
- Matrix Decompositions
- Query Management
- Ranking & Relevance Feedback
- User Interfaces
- A Course Project
- Further Reading


CIR00 Workshop
http://www.cs.utk.edu/cir00
10-22-00, Raleigh, NC

Invited Speakers:
- I. Dhillon (Texas)
- C. Ding (NERSC)
- K. Gallivan (FSU)
- D. Martin (UTK)
- H. Park (Minnesota)
- B. Pottenger (Lehigh)
- P. Raghavan (UTK)
- J. Wu (Boeing)
