Organizational issues Course overview
Web Technologies (Technologien für das Internet I)
Foundations of Information Retrieval
http://www2.kbs.uni-hannover.de/internet1.html
Introduction
Prof. Wolfgang Nejdl, Elena Demidova
Institut für Verteilte Systeme
L3S Research Center
Leibniz Universität Hannover
13 October 2011
Prof. Wolfgang Nejdl, Elena Demidova: Introduction (with slides of Hinrich Schütze)
Plan for today
Organizational issues
Course overview
Outline
1 Organizational issues
2 Course overview
Information for the ITIS students
Transmission will be available to the Internet Technologies and Information Systems (ITIS) students located outside Hannover upon request at https://webconf.vc.dfn.de/foundations_of_ir/. If this applies to you, please ask for a password via email from Elena Demidova, demidova (at) L3S.de
Lecture and exam dates
We meet every Thursday 11:30 - (ca.) 14:00. The exercise sessions follow the lectures (do not be late!).
The lectures start next week (October 20).
The exercise sessions start the week after (October 27).
StudIP: please register for the exercise sessions (mailing list).
Exam: 90-minute written exam on March 5, 2012.
Prerequisites for the ITIS students will be clarified later.
Literature
Selected chapters of the IIR book: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://www-nlp.stanford.edu/IR-book/
Outlook for the lectures
We will look at the algorithms and technologies used in modern search engines to satisfy users' information needs from within large document collections (usually stored on computers).
IIR 01: Boolean retrieval
Design and data structures of a simple information retrieval system
Queries are Boolean expressions, e.g., Caesar and Brutus
The search engine returns all documents that satisfy the Boolean expression.
Inverted index (dictionary term −→ postings list):
Brutus −→ 1 2 4 11 31 45 173 174
Caesar −→ 1 2 4 5 6 16 57 132 ...
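
As a minimal sketch of how a Boolean query such as Caesar and Brutus could be answered from an inverted index, the two sorted postings lists can be merged in a single pass. The function name `intersect` and the dictionary layout are illustrative assumptions, and only the doc IDs visible on the slide are used (the Caesar list continues beyond what is shown).

```python
def intersect(p1, p2):
    """Return the doc IDs that occur in both sorted postings lists."""
    answer = []
    i, j = 0, 0
    # Walk both lists in parallel; advance the pointer at the smaller doc ID.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Dictionary term -> postings list (sorted doc IDs), copied from the slide.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],  # truncated on the slide ("...")
}

result = intersect(index["Brutus"], index["Caesar"])
print(result)  # [1, 2, 4] for the doc IDs shown
```

Keeping postings sorted by doc ID is what makes this merge linear in the combined length of the two lists.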
IIR 02: The term vocabulary and postings lists
Phrase queries: Stanford University
Proximity: find Gates near Microsoft
We need an index that captures position information for phrase queries and proximity queries.
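
One way to picture such an index is a positional index that stores, for every term and document, the token positions at which the term occurs. The sketch below is an assumption-laden toy: the documents, the whitespace tokenizer, and the function names `build_positional_index` and `within` are all hypothetical, and the position check is a simple nested loop rather than an efficient merge.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def within(index, t1, t2, k, ordered=False):
    """Doc IDs in which t2 occurs within k positions of t1.

    With ordered=True, t2 must come after t1 (k=1 gives a two-word phrase).
    """
    hits = []
    common = set(index.get(t1, {})) & set(index.get(t2, {}))
    for doc_id in sorted(common):
        for p in index[t1][doc_id]:
            gaps = [q - p for q in index[t2][doc_id]]
            if ordered:
                ok = any(0 < g <= k for g in gaps)
            else:
                ok = any(0 < abs(g) <= k for g in gaps)
            if ok:
                hits.append(doc_id)
                break
    return hits


# Hypothetical toy collection.
docs = {
    1: "Stanford University is in California",
    2: "the university near Stanford town",
    3: "Gates founded Microsoft",
    4: "Microsoft was led for years by Gates",
}
index = build_positional_index(docs)

phrase_hits = within(index, "stanford", "university", k=1, ordered=True)
near_hits = within(index, "gates", "microsoft", k=3)
print(phrase_hits)  # [1]
print(near_hits)    # [3]
```

Document 2 contains both words of the phrase but not adjacent and in order, which is exactly the case a non-positional index cannot rule out.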
IIR 03: Dictionaries and tolerant retrieval
Bigram index (bigram −→ terms containing it):
rd −→ aboard, ardent, boardroom, border
or −→ border, lord, morbid, sordid
bo −→ aboard, about, boardroom, border
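
The bigram rows above can be reproduced with a small k-gram index, which maps each character k-gram to the vocabulary terms containing it. This is only a sketch: the query string `bord`, the threshold of two shared bigrams, and the function names are assumptions chosen for illustration, and no word-boundary markers are used.

```python
from collections import Counter, defaultdict


def kgrams(term, k=2):
    """Set of character k-grams of a term (no boundary markers)."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def candidates(index, query, k=2, min_shared=2):
    """Terms sharing at least min_shared k-grams with the query string."""
    counts = Counter()
    for gram in kgrams(query, k):
        for term in index.get(gram, ()):
            counts[term] += 1
    return sorted(t for t, c in counts.items() if c >= min_shared)


# Vocabulary taken from the terms listed on the slide.
vocabulary = ["aboard", "about", "ardent", "boardroom",
              "border", "lord", "morbid", "sordid"]
index = build_kgram_index(vocabulary)

print(sorted(index["bo"]))        # ['aboard', 'about', 'boardroom', 'border']
print(candidates(index, "bord"))  # ['aboard', 'boardroom', 'border', 'lord', 'sordid']
```

Counting shared k-grams is only a cheap filter; a tolerant retrieval system would typically rank or prune these candidates further, for example by edit distance.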
IIR 06: Scoring, term weighting and the vector space model
Ranking search results
Boolean queries only give inclusion or exclusion of documents.
For ranked retrieval, we measure the proximity between the query and each document.
One formalism for doing this: the vector space model
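
A rough sketch of the vector space idea: represent the query and each document as a term-weight vector and rank documents by the cosine of the angle between them. Raw term frequencies are used as weights here purely as a simplifying assumption; the toy documents and the names `tf_vector`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words vector with raw term frequencies as weights."""
    return Counter(text.lower().split())


def cosine(v1, v2):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(w * v2[t] for t, w in v1.items())  # Counter yields 0 if absent
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def rank(query, docs):
    """Return (score, doc_id) pairs, best match first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id)
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical toy collection.
docs = {
    1: "caesar and brutus",
    2: "caesar caesar caesar",
    3: "brutus killed caesar in rome",
    4: "rome is a city",
}
ranking = rank("caesar brutus", docs)
for score, doc_id in ranking:
    print(doc_id, round(score, 3))
```

Unlike a Boolean query, every document receives a score, so document 4 is not excluded but simply ranked last with similarity 0.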
IIR 08: Evaluation and dynamic summaries
Benchmarks (e.g. TREC = Text Retrieval Conference)
Measures (Precision, Recall, etc.)
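
As a small worked example of the two measures named above: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The doc IDs below are made up for illustration, and the function name is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(retrieved=[1, 2, 4, 7], relevant=[1, 2, 3, 4, 5])
print(p, r)  # 0.75 0.6
```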
IIR 09: Relevance feedback and query expansion
IIR 10: XML retrieval
Semi-structured / structured documents vs. unstructured documents
Can we utilize the structure of the data in IR systems?
Databases support search for numerical range and exact match, e.g., salary < 60,000 and manager=Smith.
If your data is structured and you only need precise queries like this (numerical, exact match, etc.), do not use an IR system.
IIR 13: Text classification and Naive Bayes
Text classification = assigning documents automatically to predefined classes
Examples:
a. Language (English vs. French)
b. Location
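
To make the language example concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. The four training sentences, the whitespace tokenizer, and the function names are assumptions for illustration only; tokens never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, label). Returns the model as a tuple."""
    docs_per_class = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        docs_per_class[label] += 1
        for token in text.lower().split():
            term_counts[label][token] += 1
            vocabulary.add(token)
    total_docs = sum(docs_per_class.values())
    log_prior = {c: math.log(n / total_docs)
                 for c, n in docs_per_class.items()}
    return log_prior, term_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log posterior score."""
    log_prior, term_counts, vocabulary = model
    scores = {}
    for label, prior in log_prior.items():
        class_total = sum(term_counts[label].values())
        score = prior
        for token in text.lower().split():
            if token in vocabulary:  # skip tokens unseen in training
                # Add-one (Laplace) smoothed conditional probability.
                score += math.log((term_counts[label][token] + 1)
                                  / (class_total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data for the English vs. French example.
training = [
    ("the house is big", "English"),
    ("the cat is small", "English"),
    ("la maison est grande", "French"),
    ("le chat est petit", "French"),
]
model = train_naive_bayes(training)
print(classify(model, "the cat is big"))      # English
print(classify(model, "le chat est grande"))  # French
```

Working with sums of log probabilities instead of products avoids floating-point underflow on longer documents.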
IIR 16: Flat clustering
IIR 19: Web search
Unusual and diverse documents
Unusual and diverse users and information needs
Beyond terms and text: exploit link analysis, user data
How do web search engines work?
How can we make them better?
IIR 20: Crawling
[Figure: crawler architecture. Components: www, DNS, fetch, parse, content seen?, doc FPs, robots templates, URL filter, dup URL elim, URL set, URL frontier]
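
The components named in the crawler diagram can be read as one loop: take a URL from the frontier, fetch and parse the page, skip it if its content fingerprint was already seen, then filter the extracted links and drop duplicate URLs before adding them to the frontier. The sketch below is one plausible ordering under strong assumptions: it runs on a hypothetical in-memory toy web instead of real HTTP and DNS, fetching and parsing are folded into a single `fetch` callback, and `url_allowed` stands in for the robots and URL filter.

```python
import hashlib
from collections import deque


def crawl(seed_urls, fetch, url_allowed, max_pages=100):
    """Breadth-first crawl; returns the URLs whose content was kept."""
    frontier = deque(seed_urls)   # URL frontier
    seen_urls = set(seed_urls)    # URL set for duplicate URL elimination
    doc_fingerprints = set()      # doc FPs for the "content seen?" test
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        page = fetch(url)         # fetch + parse (simulated)
        if page is None:
            continue
        content, links = page
        fingerprint = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if fingerprint in doc_fingerprints:
            continue              # same content already seen elsewhere
        doc_fingerprints.add(fingerprint)
        crawled.append(url)
        for link in links:
            if not url_allowed(link):   # URL filter
                continue
            if link in seen_urls:       # dup URL elim
                continue
            seen_urls.add(link)
            frontier.append(link)
    return crawled


# Hypothetical toy web: url -> (content, outgoing links).
TOY_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a", "d"]),
    "c": ("page A", ["d"]),            # duplicate content of "a"
    "d": ("page D", ["private/x"]),
    "private/x": ("secret", []),
}

order = crawl(["a"], fetch=TOY_WEB.get,
              url_allowed=lambda u: not u.startswith("private/"))
print(order)  # ['a', 'b', 'd']
```

A real crawler would add politeness delays, robots.txt handling, URL normalization, and distribution across machines, none of which this toy attempts.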
IIR 21: Link analysis / PageRank
Questions?
Thank you for your attention!