A Tutorial on Question Answering Systems, by Saeedeh Shekarpour, postdoc researcher in the EIS research group, Bonn University. 31 August 2015.

Tutorial on Question Answering Systems


Page 1: Tutorial on Question Answering Systems

EIS research group - Bonn University


A Tutorial on Question Answering Systems

Saeedeh Shekarpour

Postdoc researcher from EIS research group

31 August 2015

Page 2: Tutorial on Question Answering Systems


Outline

• Introduction
• Associations for evaluating QA systems
• Preliminary Concepts
• Data Web
• Emerging Concepts
• Deeper view on SINA Project

Page 3: Tutorial on Question Answering Systems

Question Answering System

What is a Question Answering (QA) system?

A system that automatically answers questions posed by humans in natural language.

[Figure: Query → Question Answering System → Answer]

Natural language is the common way of sharing knowledge.

Page 4: Tutorial on Question Answering Systems


Question answering is a multidisciplinary field

• Information Retrieval
• Natural Language Processing
• Artificial Intelligence
• Knowledge Base
• Software Engineering
• Linguistics
• Semantic Web
• Linked Data

Page 5: Tutorial on Question Answering Systems


Difference to Information Retrieval

Information Retrieval is a query-driven approach to accessing information.
• The system returns a list of documents.
• It is the user's responsibility to navigate the retrieved documents and find the needed information.

Question Answering is an answer-driven approach to accessing information.
• The user asks a question in natural language (a full sentence, a phrase, or even keywords).
• The system returns a list of short answers.
• The functionality is more complex.

Page 6: Tutorial on Question Answering Systems

Search engines are moving towards QA

Page 7: Tutorial on Question Answering Systems

Search engines still lack the ability to answer more complex queries

Page 8: Tutorial on Question Answering Systems

Natural language queries are classified into different categories

• Factoid queries: WH questions such as when, who, where.
• Yes/No queries: Is Berlin the capital of Germany?
• Definition queries: What is leukemia?
• Cause/consequence queries: how, why, what. For example: What are the consequences of the Iraq war?

Page 9: Tutorial on Question Answering Systems

Natural language queries are classified into different categories

• Procedural queries: What are the steps for getting a Master's degree?
• Comparative queries: What are the differences between model A and model B?
• Queries with examples: a list of hard disks similar to hard disk X.
• Opinion queries: What is the opinion of the majority of Americans about the Iraq war?

Page 10: Tutorial on Question Answering Systems

Corpus Type

• Structured data (relational databases, RDF knowledge bases)
• Semi-structured data (XML databases)
• Free text
• Multimodal data: image, voice, video

Page 11: Tutorial on Question Answering Systems

Types of QA systems

Open-domain: domain-independent QA systems can answer any query from any corpus.
+ Covers a wide range of queries
- Low accuracy

Closed-domain: domain-specific QA systems are limited to specific domains.
+ High accuracy
- Limited coverage of the possible queries
- Needs a domain expert

Page 12: Tutorial on Question Answering Systems

Is a QA system a need for users?

Search engine query log analysis shows that:

Type of query     Query log analysis
Informational     48%
Navigational      20%
Transactional     30%

Real informational queries in Google:
• Who first invented rock and roll music?
• When was the mobile phone invented?
• Where was the hamburger invented?
• How to lose weight?

Page 13: Tutorial on Question Answering Systems


• Introduction
• Associations for evaluating QA systems
• Preliminary Concepts
• Data Web
• Emerging Concepts
• Deeper view on SINA Project

Page 14: Tutorial on Question Answering Systems

Text REtrieval Conference (TREC)

• In 1999, TREC initiated a QA track.
• From 1999 to 2002, participant systems were allowed to return ranked answer snippets. These snippets of text contain the actual answer.
• From 2002, participant systems were allowed to return only the exact answer, with a confidence rate.
• The evaluation metric was mean reciprocal rank (MRR).
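As a rough illustration of the metric named above, MRR averages, over all questions, the reciprocal of the rank at which the first correct answer appears (0 if none is returned). The function names and the toy runs below are assumptions made for this sketch, not part of any TREC tooling.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_answers: Sequence[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranked answers, correct answers) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


# Toy data (hypothetical system output for three questions).
example_runs = [
    (["Paris", "Berlin"], {"Berlin"}),   # correct answer at rank 2
    (["Hamburg"], {"Hamburg"}),          # correct answer at rank 1
    (["Munich"], {"Bonn"}),              # no correct answer returned
]
example_mrr = mean_reciprocal_rank(example_runs)
```

Here the three reciprocal ranks are 1/2, 1 and 0, so the mean is 0.5.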

Page 15: Tutorial on Question Answering Systems

Typical pipeline for the participating systems in Trec

[Figure: pipeline from Query to Answer with the stages Query Expansion, IR Approach, Passage Retrieval, and Answer Selection, plus a step for predicting the type of answer.]

Page 16: Tutorial on Question Answering Systems


QA at CLEF

• The Cross-Language Evaluation Forum (CLEF) initiative provides an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts.
• Since 2003, CLEF has included a QA track.

Page 17: Tutorial on Question Answering Systems

Outline

• Introduction
• Associations for evaluating QA systems
• Preliminary Concepts
• Data Web
• Emerging Concepts
• Deeper view on SINA Project

Page 18: Tutorial on Question Answering Systems

Core of a QA system

[Figure: core components of a QA system, with the following labels]

Processing Data: Annotation, Indexing, Parsing, Lexicon Construction

Processing Query: Type of answer, Type of query, Syntactic/Semantic Parsing, Query Expansion, Query Translation, Query Reformulation

Query Matching

Answer: Extraction, Ranking, Validating, Representation

Page 19: Tutorial on Question Answering Systems


Syntactic Parsing: Part-of-speech Tagging

I like sweet cakes

I/PRP like/VBP sweet/JJ cakes/NNS

[Figure: parse tree with like/VBP at the root, I/PRP and cakes/NNS as its children, and sweet/JJ attached to cakes/NNS]

Page 20: Tutorial on Question Answering Systems


Type of Answer

Where was the hamburger invented?

White Castle traces the origin of the hamburger to Hamburg, Germany with its invention by Otto Kuase.

Expected answer type: Place
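A very small sketch of answer-type prediction, assuming a purely rule-based lookup on the leading question word. The mapping table and the function name are hypothetical; real systems typically use richer features such as the parse of the question.

```python
# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPE_BY_QUESTION_WORD = {
    "where": "Place",
    "who": "Person",
    "when": "Date",
}


def predict_answer_type(question: str) -> str:
    """Guess the expected answer type from the first word of the question."""
    tokens = question.lower().strip().split()
    first_word = tokens[0] if tokens else ""
    return ANSWER_TYPE_BY_QUESTION_WORD.get(first_word, "Unknown")


hamburger_type = predict_answer_type("Where was the hamburger invented?")
```

For the slide's question the lookup yields "Place"; questions such as "How to lose weight?" fall through to "Unknown", which shows the limits of this heuristic.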

Page 21: Tutorial on Question Answering Systems


Named Entity Recognition on Query

where was Franklin Roosevelt born?


Named Entity: Person

Page 22: Tutorial on Question Answering Systems


Relation Extraction

Barack Hussein Obama is the 44th and current President of the United States. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School.

Named Entity: Person

Named Entity: Place

Relation: President of

Page 23: Tutorial on Question Answering Systems


Watson Project

• Watson is a computer system capable of answering questions posed in natural language.
• The questions come from the quiz show Jeopardy!
• The software of this project is called DeepQA.
• In 2011, Watson defeated former winners of the quiz show Jeopardy!

Page 24: Tutorial on Question Answering Systems

Watson Description

Hardware: the Watson system has 2,880 POWER processor threads and 16 terabytes of RAM.

Data: encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies such as DBpedia, WordNet, and YAGO.

Software: DeepQA software and the Apache UIMA framework. The system was written in various languages, including Java, C++, and Prolog, and runs on the SUSE Linux Enterprise Server 11 operating system, using the Apache Hadoop framework to provide distributed computing.

Page 25: Tutorial on Question Answering Systems

The high level architecture of DeepQA

Page 26: Tutorial on Question Answering Systems

Decomposition example

Query: Of the four countries in the world that the United States does not have diplomatic relations with, the one that’s farthest north.

Sub-answer (the four countries): Bhutan, Cuba, Iran, North Korea

Final answer: North Korea

Page 27: Tutorial on Question Answering Systems

Outline

• Introduction
• Associations for evaluating QA systems
• Preliminary Concepts
• Data Web
• Emerging Concepts
• Deeper view on SINA Project

Page 28: Tutorial on Question Answering Systems

Evolution of Web

[Figure: evolution of the Web, showing the Web of Documents, the Semantic Web, and the Web of Data]

Page 29: Tutorial on Question Answering Systems


The growth of Linked Open Data

May 2007: 12 datasets

August 2014: 570 datasets, more than 74 billion triples

Page 30: Tutorial on Question Answering Systems


How to retrieve data from Linked Data?

Linked Data characteristics:
• Wide range of topical domains
• Variety in vocabularies
• Interlinked data

SPARQL queries:
• Knowledge about the ontology
• Proficiency in formulating formal queries
• Explicit and unambiguous semantics

Text queries (either keyword or natural language):
• Simple retrieval approach
• Implicit and ambiguous semantics
• Popular

Page 31: Tutorial on Question Answering Systems


RDF Model

RDF is a standard for describing Web resources.

The RDF data model expresses statements about Web resources in the form of subject-predicate-object triples.

The statement “Jack knows Alice” is represented as:

Jack --knowing--> Alice
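To make the triple structure concrete, here is a minimal in-memory sketch that stores subject-predicate-object statements and matches them against a pattern. The `ex:` identifiers, the `Triple` type and the `match` helper are assumptions for illustration only; a real application would use an RDF library and full IRIs.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical identifiers; "Jack knows Alice" as one triple.
graph = [
    Triple("ex:Jack", "ex:knows", "ex:Alice"),
    Triple("ex:Alice", "ex:knows", "ex:Bob"),
]
known_by_jack = match(graph, subject="ex:Jack", predicate="ex:knows")
```

Leaving a position as `None` plays the role of a variable, which is the same idea a SPARQL triple pattern expresses with `?x`.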

Page 32: Tutorial on Question Answering Systems


Outline

• Introduction
• Associations for evaluating QA systems
• Preliminary Concepts
• Data Web
• Emerging Concepts
• Deeper view on SINA Project

Page 33: Tutorial on Question Answering Systems

Semantic Indexing

Documents: Terms/Documents Index

Entities: Terms/Concepts Index

Page 34: Tutorial on Question Answering Systems


Semantic Annotation

Named Entity Recognition

Where is the capital of Germany?

Named Entity: Place

Semantic Annotation

Where is the capital of Germany?

Entity: http://dbpedia.org/resource/Germany

Page 35: Tutorial on Question Answering Systems


Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89.


Page 36: Tutorial on Question Answering Systems


Relation extraction leveraging Data Web: BOA Library

Page 37: Tutorial on Question Answering Systems

HAWK: Hybrid Question Answering over Linked Data

Page 38: Tutorial on Question Answering Systems

TBSL: Template-based SPARQL generator

Page 39: Tutorial on Question Answering Systems

Outline

• Introduction
• Associations for evaluating QA systems
• Preliminary Concepts
• Data Web
• Emerging Concepts
• Deeper view on SINA Project

Page 40: Tutorial on Question Answering Systems

SINA Architecture

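[Figure: SINA architecture, with the following labels]

Client: sends keywords to the server through an HTTP client and receives the query result.

Server components: Query Preprocessing, Segment Validation, Query Expansion, Resource Retrieval, Disambiguation, Query Construction, Representation.

Data passed between components: keywords, valid segments, reformulated query, mapped resources, tuple of resources, SPARQL queries, query result.

Libraries: OWL API, Stanford CoreNLP.

Underlying Interlinked Knowledge Bases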

Page 41: Tutorial on Question Answering Systems


Comparison of search approaches

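[Figure: search approaches compared by whether they are data-semantic unaware or data-semantic aware, and by whether they accept keyword-based queries. Approaches placed in the figure: Information Retrieval Systems, Question Answering Systems, and our approach, SINA.]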

Page 42: Tutorial on Question Answering Systems


Objective: transformation from textual query to formal query

Which television shows were created by Walt Disney?

SELECT * WHERE {
  ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney .
}

Page 43: Tutorial on Question Answering Systems


The addressed challenges in SINA

Challenges
• Query Segmentation
• Resource Disambiguation
• Query Expansion
• Formal Query Construction
• Data Fusion on Linked Data

Page 44: Tutorial on Question Answering Systems


Query Segmentation

Definition: query segmentation is the process of identifying the right segments of data items that occur in keyword queries.

Input query: What are the side effects of drugs used for Tuberculosis?

Sequence of keywords: (side, effect, drug, Tuberculosis)

Two segmentations:
• side effect | drug | Tuberculosis
• side effect drug | Tuberculosis
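The segmentation idea can be sketched by enumerating every way of cutting the keyword sequence into contiguous segments and keeping only those whose segments all match a known label. The `KNOWN_LABELS` set is a made-up stand-in for a lookup in the knowledge base; it is chosen here so that the two segmentations from the slide survive.

```python
from typing import List, Set


def segmentations(keywords: List[str]) -> List[List[str]]:
    """Enumerate all splits of the keyword sequence into contiguous segments."""
    if not keywords:
        return [[]]
    result = []
    for i in range(1, len(keywords) + 1):
        head = " ".join(keywords[:i])
        for rest in segmentations(keywords[i:]):
            result.append([head] + rest)
    return result


def valid_segmentations(keywords: List[str], known_labels: Set[str]) -> List[List[str]]:
    """Keep segmentations in which every segment matches a known label."""
    return [
        segs for segs in segmentations(keywords)
        if all(seg.lower() in known_labels for seg in segs)
    ]


# Hypothetical label set standing in for the knowledge base lookup.
KNOWN_LABELS = {"side effect", "drug", "tuberculosis", "side effect drug"}
keywords = ["side", "effect", "drug", "Tuberculosis"]
valid = valid_segmentations(keywords, KNOWN_LABELS)
```

A sequence of n keywords has 2^(n-1) candidate segmentations (8 here), which is why validating segments against the data early keeps the search space small.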

Page 45: Tutorial on Question Answering Systems


Resource Disambiguation

Definition: resource disambiguation is the process of recognizing the suitable resources in the underlying knowledge base.

Input query: Who produced films starring Natalie Portman?
Ambiguous resources:
• dbpedia/ontology/film
• dbpedia/property/film

Input query: What are the side effects of drugs used for Tuberculosis?
Ambiguous resources:
• diseasome:Tuberculosis
• sider:Tuberculosis

Page 46: Tutorial on Question Answering Systems


Concurrent Approach


Query Segmentation

Resource Disambiguation

Page 47: Tutorial on Question Answering Systems


Modeling using hidden Markov model

[Figure: hidden Markov model. Hidden states: Start, nine numbered states (1 to 9), and an Unknown Entity state. Observations: Keyword 1, Keyword 2, Keyword 3, Keyword 4.]
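One common way to decode a hidden Markov model like the one pictured is the Viterbi algorithm, which finds the most probable sequence of hidden states for a sequence of observations. The sketch below is a generic Viterbi implementation; treating resources as states and query segments as observations follows the slides, but every probability in the toy model is invented for illustration and is not taken from SINA.

```python
from typing import Dict, List, Tuple


def viterbi(observations: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[List[str], float]:
    """Return the most probable state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                ((best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                 for p in states),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, probability


# Toy model: all numbers below are made up for this example.
states = ["dbo:TelevisionShow", "dbo:creator", "dbr:Walt_Disney"]
start_p = {s: 1.0 / 3 for s in states}
trans_p = {s: {t: 0.5 for t in states if t != s} for s in states}
emit_p = {
    "dbo:TelevisionShow": {"television show": 0.9},
    "dbo:creator": {"creat": 0.8},
    "dbr:Walt_Disney": {"Walt Disney": 0.9},
}
path, probability = viterbi(["television show", "creat", "Walt Disney"],
                            states, start_p, trans_p, emit_p)
```

With these toy numbers the decoded path assigns one resource to each segment, which is the kind of output a combined segmentation and disambiguation step needs.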

Page 48: Tutorial on Question Answering Systems


Bootstrapping the model parameters

1. Emission probability is defined based on the similarity of the label of each state with a segment. This similarity is computed based on string similarity and Jaccard similarity.

2. Semantic relatedness is the basis for the transition probability and the initial probability. Intuitively, it is based on two values: distance and connectivity degree. We transform these two values into hub and authority values using the weighted HITS algorithm.

3. The HITS algorithm is a link analysis algorithm that was originally developed for ranking Web pages. It assigns a hub and an authority value to each Web page.

4. Initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
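Two of the building blocks named above can be sketched briefly: a word-level Jaccard similarity between a state label and a segment, and the plain (unweighted) HITS iteration. Note that the slides refer to a weighted HITS variant; the version below is the basic algorithm, and the tiny link graph is invented for the example.

```python
from typing import Dict, List, Tuple


def jaccard(label: str, segment: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(label.lower().split()), set(segment.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hits(edges: List[Tuple[str, str]],
         iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Plain HITS on a directed graph; returns (hub, authority) scores."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing at n.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the nodes n points at.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


similarity = jaccard("television show", "show")
hub_scores, authority_scores = hits([("a", "b"), ("a", "c"), ("b", "c")])
```

In the toy graph, node "c" is pointed at by both other nodes and gets the highest authority score, while "a" points at both and gets the highest hub score.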

Page 49: Tutorial on Question Answering Systems

Sequence of keywords: (television show creat Walt Disney)

Output of the model (paths with their probabilities):

0.0023      dbo:TelevisionShow  dbo:creator  dbr:Walt_Disney
0.0014      dbo:TelevisionShow  dbo:creator  dbr:Category:Walt_Disney
0.000589    dbr:TelevisionShow  dbo:creator  dbr:Walt_Disney
0.000353    dbr:TelevisionShow  dbo:creator  dbr:Category:Walt_Disney
0.0000376   dbp:television  dbp:show  dbo:creator  dbr:Category:Walt_Disney

Page 50: Tutorial on Question Answering Systems


Query Expansion

Definition: query expansion is a way of reformulating the input query in order to overcome the vocabulary mismatch problem.

Input query: Wife of Barack Obama

Reformulated query: Spouse of Barack Obama

Page 51: Tutorial on Question Answering Systems


Query Expansion

• Analysis of linguistic features vs. semantic features
• A method for automatic query expansion

Page 52: Tutorial on Question Answering Systems


Linguistic features

WordNet is a popular data source for expansion. Linguistic features extracted from WordNet are:

1. Synonyms: words having a similar meaning to the input keyword.

2. Hyponyms: words representing a specialization of the input keyword.

3. Hypernyms: words representing a generalization of the input keyword.

Page 53: Tutorial on Question Answering Systems

Semantic features from Linked Data

1. SameAs: deriving resources using owl:sameAs.

2. SeeAlso: deriving resources using rdfs:seeAlso.

3. Equivalence class/property: deriving classes or properties using owl:equivalentClass and owl:equivalentProperty.

4. Super class/property: deriving all super classes/properties of the input resource by following the rdfs:subClassOf or rdfs:subPropertyOf property paths.

5. Sub class/property: deriving resources by following the rdfs:subClassOf or rdfs:subPropertyOf property paths ending with the input resource.

Page 54: Tutorial on Question Answering Systems


Semantic features from Linked Data

6. Broader concepts: deriving concepts using the SKOS vocabulary properties skos:broader and skos:broadMatch.

7. Narrower concepts: deriving concepts using skos:narrower and skos:narrowMatch.

8. Related concepts: deriving concepts using skos:closeMatch, skos:mappingRelation and skos:exactMatch.
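The super class and sub class features in the list above amount to following a property transitively in one direction or the other. A minimal sketch over an in-memory list of triples is shown below; the `ex:` class names are hypothetical, and a real system would query the Linked Data sources instead.

```python
from typing import List, Set, Tuple

TripleT = Tuple[str, str, str]


def super_resources(resource: str, triples: List[TripleT],
                    predicate: str = "rdfs:subClassOf") -> Set[str]:
    """Follow `predicate` paths starting at `resource` (transitive closure)."""
    found: Set[str] = set()
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == predicate and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def sub_resources(resource: str, triples: List[TripleT],
                  predicate: str = "rdfs:subClassOf") -> Set[str]:
    """Follow `predicate` paths that end at `resource` (transitive closure)."""
    found: Set[str] = set()
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if o == current and p == predicate and s not in found:
                found.add(s)
                frontier.append(s)
    return found


# Toy class hierarchy with hypothetical identifiers.
toy_triples = [
    ("ex:TelevisionShow", "rdfs:subClassOf", "ex:Work"),
    ("ex:Film", "rdfs:subClassOf", "ex:Work"),
    ("ex:Work", "rdfs:subClassOf", "ex:Thing"),
]
supers = super_resources("ex:TelevisionShow", toy_triples)
subs = sub_resources("ex:Work", toy_triples)
```

Passing `predicate="rdfs:subPropertyOf"` or `"skos:broader"` reuses the same traversal for the property and SKOS features.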

Page 55: Tutorial on Question Answering Systems

Exemplary expansion graph of the word movie

[Figure: expansion graph centered on the word "movie". Nodes: home movie, production, film, motion picture, show, video, telefilm. Edge labels: synonym, sameAs, equivalence, hypernym, hyponym, super resource.]

Page 56: Tutorial on Question Answering Systems


Statistics over the number of the derived words from WordNet and Linked Data

Feature                 #derived words
synonym                 503
hyponym                 2703
hypernym                657
sameAs                  2332
seeAlso                 49
equivalence             2
super class/property    267
sub class/property      2166

Page 57: Tutorial on Question Answering Systems


Automatic query expansion


Input query

External Data source

Data extraction and preparation

Heuristic method

Reformulated query

Page 58: Tutorial on Question Answering Systems


Expansion set for each segment

Expansion Set
• Original segment
• Lemmatized segment
• Derived words from WordNet: Synonym, Hyponym, Hypernym

Page 59: Tutorial on Question Answering Systems


Reformulating query using hidden Markov model

Input query: wife of Barack Obama

[Figure: hidden Markov model for query reformulation. States: Start, Barack, Obama, Barack Obama, wife, Obama wife, Barack Obama wife, spouse, first lady, woman.]

Page 60: Tutorial on Question Answering Systems


Formal Query Construction

Definition: Once the resources are detected, a connected subgraph of the knowledge base graph, called the query graph, has to be determined which fully covers the set of mapped resources.

Disambiguated resources:
• sider:sideEffect
• diseasome:possibleDrug
• diseasome:1154

SPARQL query:

SELECT ?v3 WHERE {
  diseasome:1154 diseasome:possibleDrug ?v1 .
  ?v1 owl:sameAs ?v2 .
  ?v2 sider:sideEffect ?v3 .
}
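Once the query graph is known as a list of triple patterns, writing it out as SPARQL text is mechanical. The helper below is a hypothetical serializer written for this example; it omits the PREFIX declarations that a real endpoint would need for `diseasome:`, `sider:` and `owl:`.

```python
from typing import List, Tuple

Pattern = Tuple[str, str, str]


def build_select(variable: str, patterns: List[Pattern]) -> str:
    """Serialize triple patterns into a SPARQL SELECT query string."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in patterns)
    return f"SELECT {variable} WHERE {{\n{body}\n}}"


# Query graph from the slide, written as triple patterns.
side_effect_patterns = [
    ("diseasome:1154", "diseasome:possibleDrug", "?v1"),
    ("?v1", "owl:sameAs", "?v2"),
    ("?v2", "sider:sideEffect", "?v3"),
]
side_effect_query = build_select("?v3", side_effect_patterns)
print(side_effect_query)
```

Keeping the query graph as data until the last step makes it easy to try alternative graphs for the same set of resources before sending anything to an endpoint.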

Page 61: Tutorial on Question Answering Systems


Data Fusion on Linked Data

The answer to a question may be spread among different datasets employing heterogeneous schemas.

Constructing a federated query needs to exploit links between the different datasets on the schema and instance levels.

Page 62: Tutorial on Question Answering Systems

Two different approaches

Formal query construction based on:
• Template-based query construction
• Forward chaining based query construction

Page 63: Tutorial on Question Answering Systems

Federated query construction using forward chaining

Query: What are the side effects of drugs used for Tuberculosis?

1. Set of resources
• diseasome:1154 (type instance)
• diseasome:possibleDrug (type property)
• sider:sideEffect (type property)

2. Incomplete query graph
Graph 1: 1154 --possibleDrug--> ?v0
Graph 2: ?v1 --sideEffect--> ?v2

3. Query graph
Template 1: 1154 --possibleDrug--> ?v0
Template 2: ?v1 --sideEffect--> ?v2
Connected through owl:sameAs: 1154 --possibleDrug--> ?v0 --owl:sameAs--> ?v1 --sideEffect--> ?v2
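The final linking step in the figure, where two incomplete graphs from different datasets are joined, can be illustrated with a small sketch. This covers only that one step under a strong simplifying assumption (each graph is a single triple pattern, and the dataset is identified by the predicate's prefix); it is not the forward chaining procedure itself, and the helper names are invented.

```python
from typing import List, Optional, Tuple

Pattern = Tuple[str, str, str]


def dataset_prefix(term: str) -> Optional[str]:
    """Return the prefix of a prefixed name, or None for variables."""
    if term.startswith("?") or ":" not in term:
        return None
    return term.split(":", 1)[0]


def connect(graph1: Pattern, graph2: Pattern,
            link: str = "owl:sameAs") -> List[Pattern]:
    """Join two single-pattern graphs into one connected query graph."""
    _, p1, o1 = graph1
    s2, p2, o2 = graph2
    if dataset_prefix(p1) == dataset_prefix(p2):
        # Same dataset: reuse the object of graph1 as the subject of graph2.
        return [graph1, (o1, p2, o2)]
    # Different datasets: bridge the two variables with an interlinking property.
    return [graph1, (o1, link, s2), graph2]


graph_1 = ("diseasome:1154", "diseasome:possibleDrug", "?v0")
graph_2 = ("?v1", "sider:sideEffect", "?v2")
query_graph = connect(graph_1, graph_2)
```

For the slide's example the two patterns come from Diseasome and Sider, so the sketch inserts an owl:sameAs pattern between `?v0` and `?v1`, matching the connected query graph shown above.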

Page 64: Tutorial on Question Answering Systems


Any questions?