Text Mining lecture: Information Retrieval. Prof.dr.ir. Arjen P. de Vries, [email protected]. Nijmegen, October 18th, 2017

Information Retrieval intro TMM


Page 1: Information Retrieval intro TMM

Text Mining lecture
Information Retrieval
Prof.dr.ir. Arjen P. de Vries

[email protected]

Nijmegen, October 18th, 2017

Page 2: Information Retrieval intro TMM

A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc

Page 3: Information Retrieval intro TMM

Core Research Questions

How to represent information?
- The information need and search requests
- The objects to be shown in response to an information request

How to match information representations?

The information objects to be retrieved are not necessarily textual!

Van Rijsbergen, 1979

Page 4: Information Retrieval intro TMM

Two views on ‘search’

DB
- Business applications
- Deductive reasoning
- Precise and efficient query processing
- Users with technical skills (SQL) and precise information needs
- Selection: Books where category=‘CS’

IR
- Digital libraries, patent collections, etc.
- Inductive reasoning
- Best-effort processing
- Untrained users with imprecise information needs
- Ranking: Books about CS

Note: SemWeb more DB than IR!!!

Symbolic Connectionist
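
The contrast between selection and ranking can be made concrete with a minimal sketch. The `books` records, the `select` and `rank` helpers, and the term-overlap score are all assumptions made for illustration; they are not taken from the lecture.

```python
# Toy collection (hypothetical data, for illustration only).
books = [
    {"title": "Introduction to Algorithms", "category": "CS"},
    {"title": "Computer Science Illuminated", "category": "Education"},
    {"title": "The Science of Cooking", "category": "Food"},
]


def select(records, category):
    """DB-style selection: an exact predicate, each record is in or out."""
    return [r["title"] for r in records if r["category"] == category]


def rank(records, query):
    """IR-style ranking: order records by how many query terms hit the title."""
    terms = query.lower().split()
    scored = []
    for r in records:
        words = r["title"].lower().split()
        score = sum(words.count(t) for t in terms)
        scored.append((score, r["title"]))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: -pair[0])
    return [title for score, title in scored if score > 0]


selected = select(books, "CS")
ranked = rank(books, "computer science")
```

Selection returns only the record whose category field equals 'CS', while ranking returns a best-effort ordering that includes a book never labelled 'CS' at all.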

Page 5: Information Retrieval intro TMM

Search Flow Chart

[Figure: search flow chart, from A Tutorial on Models of Information Seeking, Searching & Retrieval by @leifos & @guidozuc]

Page 6: Information Retrieval intro TMM

IR vs. AI

Many related topics in AI:
- Computational Linguistics
- Natural Language Processing
- Question Answering
- Information Extraction
- Machine Translation
- Computer vision / Multimedia

vs.

Information Retrieval?

Page 7: Information Retrieval intro TMM

IR vs. AI (Artificial Intelligence)

“In some sense, of course, classic IR is superhuman: there was no pre-existing human skill, as there was with seeing, talking or even chess playing that corresponded to the search through millions of words of text on the basis of indices. But if one took the view, by contrast, that theologians, lawyers and, later, literary scholars were able, albeit slowly, to search vast libraries of sources for relevant material, then on that view IR is just the optimisation of a human skill and not a superhuman activity. If one takes that view, IR is a proper part of AI, as traditionally conceived.”

Yorick Wilks, “Unhappy bedfellows: the relationship of AI and IR”, an essay in honour of Karen Spärck Jones, 2006

Page 10: Information Retrieval intro TMM

Relevance

Inherently dependent on user, context and task

Different “relevance criteria”
- Topicality: is the document about the information request?
- Readability: can I understand the text?
- Authoritativeness: can I trust the text?
- Child-suitability: is the text appropriate for children?
- Etc.

Page 11: Information Retrieval intro TMM

“Computational Relevance”

“Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.”

Van Rijsbergen, 1976

Retrieval Model
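
One way to see what "a model within which relevance decisions can be quantified" might look like is a basic TF-IDF score. This is a minimal sketch of one classic weighting scheme, not the model the lecture has in mind; the `tokenize` and `tfidf_scores` helpers and the example documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_scores(query, docs):
    """Score every document against the query with a plain TF-IDF sum."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Rare terms get a higher weight than common ones.
                idf = math.log(n_docs / df[term])
                score += tf[term] * idf
        scores.append(score)
    return scores


docs = [
    "information retrieval models quantify relevance",
    "relational databases answer precise queries",
    "retrieval of images and video",
]
scores = tfidf_scores("information retrieval", docs)
```

The first document matches both query terms and scores highest, the third matches only the more common term "retrieval", and the second matches nothing and scores zero.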

Page 12: Information Retrieval intro TMM

‘Computational Relevance’

How to combine different indicators of relevance?
- E.g., topicality, child-suitability, polarity, …

Apply ‘copulas’ (a technique from econometrics) to model non-linear dependencies

(SIGIR 2013, CIKM 2014)
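
To give a feel for the idea, here is a minimal sketch that combines two relevance indicators with a Clayton copula. The choice of the Clayton family, the rank-based normalisation, the value of `theta`, and the example scores are all assumptions made for illustration; this is not necessarily the formulation used in the cited papers.

```python
def empirical_cdf(values):
    """Map each raw score to the fraction of scores that are <= it, in (0, 1]."""
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]


def clayton_copula(u, v, theta=2.0):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def combine(first, second, theta=2.0):
    """Combine two lists of indicator scores into one joint score per document."""
    u = empirical_cdf(first)
    v = empirical_cdf(second)
    return [clayton_copula(a, b, theta) for a, b in zip(u, v)]


# Hypothetical per-document indicator scores.
topicality = [0.9, 0.4, 0.7, 0.1]
suitability = [0.2, 0.8, 0.6, 0.3]
combined = combine(topicality, suitability)
best = combined.index(max(combined))
```

With these numbers the third document wins: it is not the best on either indicator alone, but it is reasonably good on both, which is the kind of dependence a plain product or weighted sum treats differently.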

Page 13: Information Retrieval intro TMM

Relevance

Various aspects of understanding this notion of relevance position information retrieval between computer science and information science.

Examples of questions that traditionally do not even presume involvement of a computer:
- What makes an information object relevant?
- What stages constitute a search process?
- How does relevance evolve during this search process?
- How do users learn from the search process?
- Why do users issue short queries even if we know that long ones are more effective?
- Etc.

Page 14: Information Retrieval intro TMM

NLP in IR

Stemming & Stopping
- De facto default setting

N-grams (bi-grams)
- SDM (Sequential Dependence Model)

Entity tagging
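
The first two steps can be sketched in a few lines. The stopword list and the `toy_stem` suffix stripper below are toy stand-ins chosen for illustration (a real system would typically use an established stemmer such as Porter's), and `bigrams` simply produces the adjacent term pairs that dependence models score alongside single terms.

```python
# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for"}


def toy_stem(word):
    """Strip a few common English suffixes. A toy stand-in, not the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Stopping followed by stemming, on whitespace-separated tokens."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


def bigrams(tokens):
    """Adjacent term pairs, in order."""
    return list(zip(tokens, tokens[1:]))


terms = preprocess("Searching the collections of patents")
pairs = bigrams(terms)
```

Here `terms` is `["search", "collection", "patent"]`, so a query for "patent search" and a document about "searching patents" end up sharing index terms.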

Page 15: Information Retrieval intro TMM

Footnote in Victor Lavrenko’s PhD thesis

“It is my personal observation that almost every mathematically inclined graduate student in Information Retrieval attempts to formulate some sort of a non-independent model of IR within the first two or three years of his studies. The vast majority of these attempts yield no improvements and remain unpublished.”

Page 16: Information Retrieval intro TMM

Take words as they stand!

Page 17: Information Retrieval intro TMM
Page 18: Information Retrieval intro TMM

The Secret

The user can simply reformulate their information need in response to insufficiently relevant results retrieved by the system!

Page 19: Information Retrieval intro TMM

Why Search Remains Difficult to Get Right

Heterogeneous data sources
- WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …

Varying result types
- “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …

Multiple dimensions of relevance
- Topicality, recency, reading level, …

Actual information needs often require a mix within and across dimensions. E.g., “recent news and patents from our top competitors”

Page 20: Information Retrieval intro TMM

System’s internal information representation
- Linguistic annotations
  - Named entities, sentiment, dependencies, …
- Knowledge resources
  - Wikipedia, Freebase, ICD9, IPTC, …
- Links to related documents
  - Citations, URLs

Anchors that describe the URI
- Anchor text

Queries that lead to clicks on the URI
- Session, user, dwell-time, …

Tweets that mention the URI
- Time, location, user, …

Other social media that describe the URI
- User, rating
- Tag, organisation of ‘folksonomy’

+ UNCERTAINTY ALL OVER!
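
As a rough illustration, the evidence listed above could be held in a single per-URI structure where every piece carries its own confidence. The `Evidence` and `DocumentRepresentation` classes, their field names, and the example values are hypothetical; this is a sketch of one possible data structure, not a description of any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str        # e.g. "named_entity", "anchor_text", "query_click", "tweet"
    value: str         # the annotation, anchor text, query, tweet text, ...
    confidence: float  # in [0, 1]; every piece of evidence is uncertain


@dataclass
class DocumentRepresentation:
    uri: str
    evidence: list = field(default_factory=list)

    def add(self, source, value, confidence):
        """Attach one piece of evidence about this URI."""
        self.evidence.append(Evidence(source, value, confidence))

    def by_source(self, source, min_confidence=0.0):
        """Values from one evidence source, filtered by confidence."""
        return [
            e.value
            for e in self.evidence
            if e.source == source and e.confidence >= min_confidence
        ]


doc = DocumentRepresentation("http://example.org/page")
doc.add("anchor_text", "intro to information retrieval", 0.8)
doc.add("anchor_text", "click here", 0.1)
doc.add("named_entity", "Karen Spärck Jones", 0.9)
trusted_anchors = doc.by_source("anchor_text", min_confidence=0.5)
```

Keeping the confidence next to each value lets a retrieval model decide per query how much weight uncertain evidence such as a low-quality anchor text should get.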