Approaches to Teaching & Learning Information Retrieval
Sponsored by: SIGs ED, HCI, DL
Efthimis N. Efthimiadis, Moderator
Associate Professor, The Information School, University of Washington, Suite 370
Mary Gates Hall, Box 352840, Seattle, WA 98195-2840, USA, Phone: (off.)
206-616-6077, (sch.) 206-685-9937; Fax: 206-616-3152
Jamie Callan
Associate Professor, Language Technologies Institute (School of Computer
Science) and Heinz School of Public Policy and Management, Carnegie Mellon
University, 5000 Forbes Avenue, 4502 Newell Simon Hall, LTI, Pittsburgh, PA
15213-8213, Phone: 412-268-4525; Fax: 412-268-6298 [email protected]
Ray R. Larson
Associate Professor, School of Information, University of California, Berkeley,
Berkeley, California 94720-4600, Phone: 510-642-6046 [email protected]
Summary
The explosion of the web has made search an integral part of our daily lives. We search for almost any conceivable topic. Web search engines have made search easily approachable to almost everyone. Yet, for information professionals it is more important than ever before to know “how search works” in order to be more effective in their work. Search Engines or Information Retrieval systems often appear to searchers as “black boxes.” There is some sort of magic that happens between typing some keywords in a query box and getting back results. This approach contributes to the development of inadequate conceptual models of search.
The panel brings LIS and CS educators involved in teaching “information retrieval” to discuss experiential learning approaches to teaching IR. Efthimiadis (UW) will be presenting the IR-Toolbox, an interactive system developed for teaching IR processes to Information School students. Ray Larson (UCB) will be discussing his approach of using open source IR engines to create a mini-TREC competition environment in class. Jamie Callan (CMU) will be talking about the Lemur Toolkit system and its use in teaching undergraduate and graduate students.
The panel presentation of the experiential teaching methods will be followed by an interactive session with the audience. The discussion session will focus on the audience’s needs and experiences while learning or teaching information retrieval.
IR-Toolbox (Efthimiadis)
The IR-Toolbox is an experiential teaching tool for learning about information retrieval (IR)
systems. Through hands-on interaction, the IR-Toolbox helps students develop their conceptual
model of search engines by exploring, visualizing, and understanding IR processes and
algorithms without needing to program. In a sequential fashion, the IR-Toolbox presents the
following processing steps: a) Document analysis (e.g., tokenizers [letter, white-space,
grammar], stemmers [Porter, Krovetz], and a variety of stop lists), b) Indexing (e.g., ability to
browse the inverted file and extract statistics), c) Searching (e.g., ability to enter queries and
select weighting algorithms such as IDF, TF-IDF, OKAPI/BM25), d) Evaluation (e.g., evaluate
results using the TREC evaluation software (trec-eval) and associated TREC collections,
presenting recall-precision tables and graphs). The IR-Toolbox uses Lucene as its underlying
search engine. Students can interact with the IR-Toolbox at different levels of complexity on
individual or group exercises that help them understand the different IR processes and build a
more detailed conceptual model of search engines.
<http://irtoolbox.ischool.washington.edu>
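The first three processing steps (document analysis, indexing, searching) can be sketched in a few lines of Python. This is a minimal illustration only, not the IR-Toolbox or Lucene code: the function names, the tiny stop list, the sample documents, and the plain tf * log(N/df) weighting are all assumptions chosen for brevity, and stemming is left out.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def analyze(text):
    """Document analysis: letter tokenizer, lowercasing, stop-word removal."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Indexing: build an inverted file mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyze(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, num_docs):
    """Searching: rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in analyze(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document collection.
DOCS = {
    "d1": "The cat sat in the hat.",
    "d2": "The dog chased the cat.",
    "d3": "Search engines index the web.",
}
INDEX = build_index(DOCS)
RESULTS = search("cat hat", INDEX, len(DOCS))
```

Browsing `INDEX` corresponds to inspecting the inverted file, and swapping the weighting inside `search` is the kind of experiment the searching step describes.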
Lemur (Callan)
The Lemur Toolkit has become a popular platform for doing a wide range of information
retrieval teaching and research. Lemur is an open-source toolkit, written in C++, that supports
several approaches to document indexing, most of the standard retrieval models, and a set of
applications that includes retrieval, clustering, cross-lingual IR, federated search (distributed
IR), and summarization. Indri, the Lemur project's search engine, provides a structured query language which is used for
searching large indexes and allows Web server integration for users that just want a
high-quality search engine.
The talk will discuss the use of Lemur for class homework, and Lemur’s new educational
facilities to support undergraduate and graduate IR classes at institutions with minimal local
resources. Lemurproject.org provides web-based access to search engine indices, which allows
students to immediately begin writing Java-based programs that implement basic ranking
functions. Their programs parse a query, use HTTP Web page requests to retrieve inverted lists
from the CMU indexes, rank documents, and upload results to the CMU web-based trec-eval to
evaluate their results. Providing web-based access to search engine indices enables students
to begin doing interesting homework assignments in the third week of classes, and also
provides access to standard corpora (e.g., TREC corpora) that students might not have locally.
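The homework workflow described above (parse a query, retrieve inverted lists, rank documents) might look roughly like the following sketch. It is written in Python rather than Java, and it replaces the HTTP requests to the CMU indexes with a hypothetical in-memory lookup, `fetch_inverted_list`; the postings, document lengths, and the particular BM25 variant and parameter values are assumptions for illustration, not the Lemur service's actual interface.

```python
import math
from collections import defaultdict

# Hypothetical stand-in for inverted lists that would be fetched over HTTP:
# term -> list of (doc_id, term frequency) postings.
INVERTED_LISTS = {
    "information": [("d1", 2), ("d2", 1), ("d3", 1)],
    "retrieval": [("d1", 1), ("d3", 3)],
}
# Hypothetical document lengths (in tokens) for a four-document collection.
DOC_LENGTHS = {"d1": 10, "d2": 20, "d3": 12, "d4": 8}


def fetch_inverted_list(term):
    """Return the postings for a term (assumed helper; no network access here)."""
    return INVERTED_LISTS.get(term, [])


def bm25_rank(query, doc_lengths, k1=1.2, b=0.75):
    """Term-at-a-time ranking with one common variant of BM25."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():  # naive whitespace query parsing
        postings = fetch_inverted_list(term)
        df = len(postings)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings:
            # Length normalization: longer documents need more occurrences.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


RANKING = bm25_rank("information retrieval", DOC_LENGTHS)
```

Processing one inverted list at a time keeps the program simple, which suits an early assignment: only the accumulator dictionary has to be held in memory between terms.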
<http://www.cs.cmu.edu/~lemur/3.0/overview.html>
<http://www.lemurproject.org/>
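The evaluation step that both the IR-Toolbox and the Lemur homework hand to trec-eval rests on a few simple measures. The sketch below is not the trec-eval implementation; it only shows, under assumed names and made-up relevance judgments, how recall-precision points and average precision can be computed for a single ranked list.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document; 0.0 if none exist."""
    if not relevant:
        return 0.0
    points = precision_recall_points(ranking, relevant)
    # Relevant documents that were never retrieved contribute zero precision.
    return sum(precision for _, precision in points) / len(relevant)


# Hypothetical system output and relevance judgments for one query.
RANKING = ["d3", "d1", "d2", "d4"]
RELEVANT = {"d3", "d2"}
POINTS = precision_recall_points(RANKING, RELEVANT)
AP = average_precision(RANKING, RELEVANT)
```

Here the relevant documents appear at ranks 1 and 3, so the recall-precision points are (0.5, 1.0) and (1.0, 2/3), and the average precision is their mean precision, 5/6.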
Cheshire II and Cheshire3 (Larson)
The Cheshire systems are experimental IR systems that have also been used in production
environments. They provide support for development of Digital Library (DL) retrieval
applications including support for MARC processing and the Z39.50 IR protocol. They also
support Boolean searching as well as a variety of ranking algorithms (including Logistic
Regression-Based Probabilistic ranking, Vector Space ranking, OKAPI BM-25, and a number of
other techniques). In addition, the systems support data fusion approaches to combining the
results of using the different algorithms into a single ranked set. Cheshire has been used by
students for "mini-TREC" evaluation competitions in IR courses at Berkeley for many years. It
is also used extensively in the UK for production DL systems.
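As an illustration of data fusion, the sketch below combines two score lists with CombSUM after min-max normalization. The abstract does not say which fusion methods Cheshire implements, so CombSUM is an assumption chosen because it is a simple, widely taught technique; the run names and scores are invented.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one run's scores to [0, 1] so that runs become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def comb_sum(*runs):
    """CombSUM: sum each document's normalized scores across all runs."""
    fused = defaultdict(float)
    for run in runs:
        if not run:
            continue  # an empty run contributes nothing
        for doc_id, score in min_max_normalize(run).items():
            fused[doc_id] += score
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two different ranking algorithms for one query.
BM25_RUN = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
LOGISTIC_RUN = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
FUSED = comb_sum(BM25_RUN, LOGISTIC_RUN)
```

Document d2 is not ranked first by the first run, yet it comes out on top of the fused list because both algorithms score it well, which is the effect fusion is meant to exploit.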