Information Retrieval intro TMM

Text Mining lecture
Information Retrieval
Prof.dr.ir. Arjen P. de Vries

Nijmegen, October 18th, 2017

A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc

Core Research Questions
How to represent information?
- The information need and search requests
- The objects to be shown in response to an information request

How to match information representations?

The information objects to be retrieved are not necessarily textual!

Van Rijsbergen, 1979

Two views on ‘search’

DB
- Business applications
- Deductive reasoning
- Precise and efficient query processing
- Users with technical skills (SQL) and precise information needs
- Selection: Books where category=‘CS’

IR
- Digital libraries, patent collections, etc.
- Inductive reasoning
- Best-effort processing
- Untrained users with imprecise information needs
- Ranking: Books about CS

Note: SemWeb more DB than IR!!!
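
The contrast between selection and ranking can be made concrete with a minimal Python sketch. The `books` records, the `select` and `rank` helpers, and the term-overlap score are all assumptions made for illustration; they are not taken from the lecture, and real IR systems use far richer scoring.

```python
# Toy collection; titles and categories are invented for illustration.
books = [
    {"title": "Introduction to Algorithms", "category": "CS"},
    {"title": "A History of Computer Science", "category": "History"},
    {"title": "Gardening for Beginners", "category": "Hobby"},
]

def select(records, category):
    """DB view: an exact predicate; a record either matches or it does not."""
    return [r["title"] for r in records if r["category"] == category]

def rank(records, query):
    """IR view: score each record by query-term overlap with its title, best first."""
    q = set(query.lower().split())
    scored = [(len(q & set(r["title"].lower().split())), r["title"]) for r in records]
    # sorted() is stable, so records with equal scores keep their original order.
    return [title for score, title in sorted(scored, key=lambda p: -p[0]) if score > 0]

selected = select(books, "CS")
ranked = rank(books, "computer science algorithms")
```

In this sketch the selection returns only the record whose category is exactly ‘CS’, while the ranking also surfaces the history book, because its title shares terms with the request.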

Symbolic Connectionist

Search Flow Chart

IR vs. AI
Many related topics in AI:
- Computational Linguistics
- Natural Language Processing
- Question Answering
- Information Extraction
- Machine Translation
- Computer vision / Multimedia

vs.

Information Retrieval?

IR vs. AI (Artificial Intelligence)

“In some sense, of course, classic IR is superhuman: there was no pre-existing human skill, as there was with seeing, talking or even chess playing that corresponded to the search through millions of words of text on the basis of indices. But if one took the view, by contrast, that theologians, lawyers and, later, literary scholars were able, albeit slowly, to search vast libraries of sources for relevant material, then on that view IR is just the optimisation of a human skill and not a superhuman activity. If one takes that view, IR is a proper part of AI, as traditionally conceived.”

Yorick Wilks, Unhappy bedfellows: the relationship of AI and IR. An “Essay in honour of Karen Spärck Jones”, 2006

Relevance
Inherently dependent on user, context and task
Different “relevance criteria”
- Topicality: is the document about the information request?
- Readability: can I understand the text?
- Authoritativeness: can I trust the text?
- Child-suitability: is the text appropriate for children?
- Etc.

“Computational Relevance”

“Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.”

Van Rijsbergen, 1976

Retrieval Model

‘Computational Relevance’
How to combine different indicators of relevance?
- E.g., topicality, child-suitability, polarity, …
Apply ‘copulas’ (a technique from econometrics) to model non-linear dependencies (SIGIR 2013, CIKM 2014)
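
As a rough illustration of the idea, the sketch below combines two relevance indicators with a Gumbel copula. This is an assumption-laden toy, not the method of the cited papers: the choice of the Gumbel family, the `theta` value, the rank-based `empirical_cdf` helper, and the scores are all made up for the example.

```python
import math

def empirical_cdf(values):
    """Map each raw score to a rank-based value in (0, 1); ties are broken by position."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

def gumbel_copula(u, v, theta=2.0):
    """Gumbel copula C(u, v) for u, v in (0, 1); theta = 1 reduces to the product u * v."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))

# Hypothetical per-document indicator scores (one entry per document).
topicality = [2.1, 0.3, 1.7, 0.9]
suitability = [0.8, 0.9, 0.2, 0.5]

u = empirical_cdf(topicality)
v = empirical_cdf(suitability)
combined = [gumbel_copula(a, b) for a, b in zip(u, v)]
ranking = sorted(range(len(combined)), key=lambda i: -combined[i])
```

With `theta = 1` the indicators are treated as independent and the combination is a plain product; larger values of `theta` let the copula express dependence between the indicators, which a weighted sum or product cannot capture.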

Relevance
Various aspects of understanding this notion of relevance position information retrieval between computer science and information science

Examples of questions that traditionally do not even presume involvement of a computer:
- What makes an information object relevant?
- What stages constitute a search process?
- How does relevance evolve during this search process?
- How do users learn from the search process?
- Why do users issue short queries even if we know that long ones are more effective?
- Etc.

NLP in IR
Stemming & Stopping
- De facto default setting
N-grams (bi-grams)
- SDM (Sequential Dependence Model)
Entity tagging
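
A minimal sketch of stopping, stemming and bigram extraction follows. The stopword list and the `crude_stem` suffix stripper are toy assumptions that stand in for a real stopword list and a real stemmer (such as Porter's), and the bigrams only hint at the term pairs that a dependence model would score; none of this is the SDM itself.

```python
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list

def crude_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, split on whitespace, drop stopwords, then stem what remains."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def bigrams(tokens):
    """Adjacent term pairs, in order of occurrence."""
    return list(zip(tokens, tokens[1:]))

terms = preprocess("Searching the indexed documents of the library")
pairs = bigrams(terms)
```

Here `terms` becomes `["search", "index", "document", "library"]`, so a query for “search” and a document containing “searching” end up with the same index term.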

Footnote in Victor Lavrenko’s PhD thesis

“It is my personal observation that almost every mathematically inclined graduate student in Information Retrieval attempts to formulate some sort of a non-independent model of IR within the first two or three years of his studies. The vast majority of these attempts yield no improvements and remain unpublished.”

Take words as they stand!

The Secret
The user can simply reformulate their information need in response to insufficiently relevant results retrieved by the system!

Why Search Remains Difficult to Get Right
Heterogeneous data sources
- WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
Varying result types
- “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
Multiple dimensions of relevance
- Topicality, recency, reading level, …

Actual information needs often require a mix within and across dimensions. E.g., “recent news and patents from our top competitors”

System’s internal information representation
- Linguistic annotations
  - Named entities, sentiment, dependencies, …
- Knowledge resources
  - Wikipedia, Freebase, ICD9, IPTC, …
- Links to related documents
  - Citations, URLs
Anchors that describe the URI
- Anchor text
Queries that lead to clicks on the URI
- Session, user, dwell-time, …
Tweets that mention the URI
- Time, location, user, …
Other social media that describe the URI
- User, rating
- Tag, organisation of ‘folksonomy’

+ UNCERTAINTY ALL OVER!

Science
