QuASI: Question Answering using Statistics, Semantics, and Inference Marti Hearst, Jerry Feldman, Chris Manning, Srini Narayanan Univ. of California-Berkeley / ICSI / Stanford University

TREC Task 1: Overview

Search 525,938 MEDLINE records
• Titles, abstracts, MeSH category terms, citation information

Topics: taken from the GeneRIF portion of the LocusLink database
• We are supplied with gene names

Definition of a GeneRIF:
• For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states.

TREC Task 1: Sample Query

3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)
3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
3 2120 Homo sapiens ALIAS_SYMBOL TEL
3 2120 Homo sapiens PREFERRED_PRODUCT ets variant gene 6
3 2120 Homo sapiens PRODUCT ets variant gene 6
3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene

• The first column is the official topic number (1-50).
• The second column contains the LocusLink ID for the gene.
• The third column contains the name of the organism.
• The fourth column contains the gene name type.
• The fifth column contains the gene name.
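The five-column layout described above can be read into records with a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the columns are tab-separated (the organism name contains a space, so plain whitespace splitting would not work), and the names `GeneNameRecord`, `parse_query_line` and `group_by_topic` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class GeneNameRecord(NamedTuple):
    topic: int         # official topic number (1-50)
    locuslink_id: int  # LocusLink ID for the gene
    organism: str      # e.g. "Homo sapiens"
    name_type: str     # e.g. "OFFICIAL_SYMBOL"
    name: str          # the gene name itself


def parse_query_line(line: str) -> GeneNameRecord:
    # Assumption: the five columns are separated by tabs.
    topic, locus_id, organism, name_type, name = line.rstrip("\n").split("\t", 4)
    return GeneNameRecord(int(topic), int(locus_id), organism, name_type, name)


def group_by_topic(lines: Iterable[str]) -> Dict[int, List[GeneNameRecord]]:
    """Collect all name records that belong to the same topic number."""
    topics: Dict[int, List[GeneNameRecord]] = defaultdict(list)
    for line in lines:
        if line.strip():
            record = parse_query_line(line)
            topics[record.topic].append(record)
    return dict(topics)


sample = [
    "3\t2120\tHomo sapiens\tOFFICIAL_SYMBOL\tETV6",
    "3\t2120\tHomo sapiens\tALIAS_SYMBOL\tTEL",
]
topics = group_by_topic(sample)
```

Grouping by topic keeps every name variant for a gene together, which is the natural input for the retrieval step.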

TREC Task 1: Approach

Two main components:

Retrieve relevant docs
• May miss many because of variation in how gene names are expressed

Rank order them

TREC Task 1: Approach

Retrieval

Normalization of query terms
• Special characters are replaced with spaces in both queries and documents.

Term expansion
• A set of pattern-based rules is applied to the original list of query terms, to expand the original set and increase recall.
• Some rules with lower confidence get a lower weight in the ranking step.

Stop word removal

Organism identification
• Gene names are often shared across different organisms
• Developed a method to automatically determine which MeSH terms correspond to LocusLink Organism terms
• Retrieved MEDLINE docs indicated by LocusLink links corresponding to a given organism
• Organism terms were the most frequent MeSH categories among the selected docs
• Used these terms to identify the organism term in MEDLINE
• An example of playing two databases off each other.

MeSH concepts
• When an exact match is found between one of the query terms and a MeSH term assigned to a document, the document is retrieved.
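The normalization, stop word removal and MeSH exact-match steps above can be sketched as follows. This is an illustration under stated assumptions, not the system's code: the stop word list is hypothetical, lowercasing is an added assumption (the slide only mentions replacing special characters), and the function names are invented for the example.

```python
import re
from typing import Iterable

# Hypothetical stop word list; the slides do not give the actual one.
STOP_WORDS = {"of", "the", "and", "a", "in"}


def normalize(text: str) -> str:
    """Replace special characters with spaces, then lowercase and collapse spaces.

    Lowercasing is an assumption made for this sketch.
    """
    cleaned = re.sub(r"[^0-9A-Za-z]+", " ", text)
    return " ".join(cleaned.lower().split())


def remove_stop_words(term: str) -> str:
    """Drop stop words from an already normalized term."""
    return " ".join(word for word in term.split() if word not in STOP_WORDS)


def mesh_exact_match(query_terms: Iterable[str], doc_mesh_terms: Iterable[str]) -> bool:
    """True if any query term equals a MeSH term of the document after normalization."""
    normalized_mesh = {normalize(term) for term in doc_mesh_terms}
    return any(normalize(query) in normalized_mesh for query in query_terms)
```

Applying the same `normalize` function to queries and documents is what makes a term such as "TEL-1" match text written as "TEL 1".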

Gene Name Expansion
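The actual expansion rules are not listed in this deck, so the sketch below only illustrates the idea of pattern-based expansion with per-rule confidence. Both rules and both weight values are hypothetical examples, chosen to show how a low-confidence rule can carry a lower weight into the ranking step.

```python
import re
from typing import Dict

HIGH_CONFIDENCE = 1.0  # hypothetical weight for reliable rules
LOW_CONFIDENCE = 0.5   # hypothetical weight for riskier rules


def expand_gene_name(name: str) -> Dict[str, float]:
    """Return {variant: weight}; the original name always keeps weight 1.0."""
    variants = {name: 1.0}

    # Hypothetical rule 1 (high confidence): vary the separator between a
    # run of letters and a trailing number, e.g. ETV6 / ETV 6 / ETV-6.
    match = re.fullmatch(r"([A-Za-z]+)[- ]?(\d+)", name)
    if match:
        letters, digits = match.groups()
        for separator in ("", " ", "-"):
            variants.setdefault(f"{letters}{separator}{digits}", HIGH_CONFIDENCE)

    # Hypothetical rule 2 (low confidence): drop a trailing parenthetical,
    # e.g. "ets variant gene 6 (TEL oncogene)" -> "ets variant gene 6".
    stripped = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    if stripped and stripped != name:
        variants.setdefault(stripped, LOW_CONFIDENCE)

    return variants
```

Returning a weight with each variant lets the ranking step treat matches from low-confidence rules as weaker evidence instead of discarding them.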

Organism Filtering
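The organism identification idea (take the documents that LocusLink links to for one organism, then pick the MeSH category that occurs most often among them) can be sketched as below. The data structures, function names and toy document IDs are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def organism_mesh_term(
    linked_doc_ids: Iterable[int], mesh_by_doc: Dict[int, List[str]]
) -> Optional[str]:
    """Most frequent MeSH term among docs linked from LocusLink for one organism."""
    counts: Counter = Counter()
    for doc_id in linked_doc_ids:
        # Count each term once per document.
        counts.update(set(mesh_by_doc.get(doc_id, ())))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def filter_by_organism(
    doc_ids: Iterable[int], mesh_by_doc: Dict[int, List[str]], organism_term: str
) -> List[int]:
    """Keep only documents whose MeSH terms include the organism term."""
    return [d for d in doc_ids if organism_term in mesh_by_doc.get(d, ())]


# Toy data with made-up document IDs.
mesh_by_doc = {
    1: ["Humans", "Apoptosis"],
    2: ["Humans", "Translocation, Genetic"],
    3: ["Humans", "Apoptosis"],
    4: ["Mice"],
}
human_term = organism_mesh_term([1, 2, 3], mesh_by_doc)
kept = filter_by_organism([1, 4], mesh_by_doc, human_term)
```

In this toy example the LocusLink organism "Homo sapiens" maps to the MeSH term "Humans", and the mouse document is filtered out.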

TREC Task 1: Approach

Relevance ranking

IBM’s DB2 Net Search Extender was used as the text search engine.

Scoring:
• Each query is a union of 5 different sub-queries:
  • titles,
  • abstracts,
  • titles using low confidence expansion rules,
  • abstracts using low confidence expansion rules, and
  • MeSH concepts.
• Each sub-query returns a set of documents with a relevance score from the text search engine (or a fixed value for MeSH matches).
• The aggregated score is the SUM of the individual scores, with an optional weight applied to each sub-query score.
• SUM performs better than MAX, since it gives higher confidence to documents found in multiple sub-queries.
• Scores are normalized to the (0, 1] range by dividing each score by the highest aggregated score achieved for the query.
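The scoring scheme above (weighted SUM over sub-queries, then division by the top score for the query) can be sketched as follows. The sub-query names, scores and default weight of 1.0 are hypothetical; the slides do not give the actual values.

```python
from collections import defaultdict
from typing import Dict, Optional


def aggregate_scores(
    sub_results: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Weighted SUM of sub-query scores per document, normalized by the top score.

    sub_results maps a sub-query name to {doc_id: score}.
    """
    weights = weights or {}
    totals: Dict[str, float] = defaultdict(float)
    for name, scores in sub_results.items():
        weight = weights.get(name, 1.0)  # assumption: unweighted sub-queries count fully
        for doc_id, score in scores.items():
            totals[doc_id] += weight * score
    if not totals:
        return {}
    top = max(totals.values())
    if top <= 0:
        return dict(totals)
    return {doc_id: total / top for doc_id, total in totals.items()}


# Toy scores; "mesh" uses a fixed value for every exact MeSH match.
sub_results = {
    "title": {"d1": 0.5, "d2": 0.25},
    "abstract": {"d1": 0.5},
    "mesh": {"d3": 0.5},
}
ranked = aggregate_scores(sub_results)
```

Here `d1` is found by two sub-queries, so SUM places it above `d3` even though each individual score is the same; MAX would have tied them.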

TREC Task 1: Approach

GeneRIF classification
• A Naïve Bayes model is used to assign to each document the probability that it is a GeneRIF.
• MeSH terms are used as features.

Combination of text retrieval score and GeneRIF classification score
• We tried both an additive and a multiplicative approach. Both behave similarly, with slightly better performance achieved by the additive one.
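A small Naïve Bayes classifier over MeSH terms, plus the two ways of combining its output with the retrieval score, might look like the sketch below. The add-one smoothing, the class and function names, the mixing parameter `alpha` and the toy training data are all assumptions; the slides do not specify the model variant or the exact combination formula.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


class MeshNaiveBayes:
    """Naive Bayes over sets of MeSH terms with add-one smoothing (an assumption)."""

    def fit(self, docs: List[Set[str]], labels: List[bool]) -> "MeshNaiveBayes":
        self.vocab = set().union(*docs)
        self.class_counts = Counter(labels)
        self.term_counts = {True: Counter(), False: Counter()}
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(set(terms))
        return self

    def predict_proba(self, terms: Iterable[str]) -> float:
        """Probability that a document with these MeSH terms is a GeneRIF."""
        n_docs = sum(self.class_counts.values())
        log_post = {}
        for label in (True, False):
            log_p = math.log((self.class_counts[label] + 1) / (n_docs + 2))
            total = sum(self.term_counts[label].values())
            for term in set(terms) & self.vocab:  # ignore unseen terms
                count = self.term_counts[label][term]
                log_p += math.log((count + 1) / (total + len(self.vocab)))
            log_post[label] = log_p
        peak = max(log_post.values())
        pos = math.exp(log_post[True] - peak)
        neg = math.exp(log_post[False] - peak)
        return pos / (pos + neg)


def combine(retrieval: float, generif_prob: float,
            mode: str = "additive", alpha: float = 0.5) -> float:
    """Combine the two scores; alpha is a hypothetical mixing weight."""
    if mode == "additive":
        return alpha * retrieval + (1 - alpha) * generif_prob
    if mode == "multiplicative":
        return retrieval * generif_prob
    raise ValueError(f"unknown mode: {mode}")


# Toy training data with made-up labels.
docs = [{"Apoptosis", "Humans"}, {"Apoptosis", "Mutation"},
        {"Humans", "Questionnaires"}, {"Questionnaires"}]
labels = [True, True, False, False]
model = MeshNaiveBayes().fit(docs, labels)
prob = model.predict_proba({"Apoptosis"})
```

One practical difference between the two modes: the multiplicative form pushes a document to the bottom whenever either score is near zero, while the additive form lets a strong retrieval score compensate for a weak classifier score.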

TREC Task 1: Results

Performance is measured using the standard trec_eval program.

On training data:
• Best published result: 0.4125
• With GeneRIF classifier: 0.5101
• Without GeneRIF classifier: 0.5028

On testing data (turned in 8/4/03):
• With GeneRIF classifier: 0.3933
• Without GeneRIF classifier: 0.3768

TREC Task 2

Problem Definition:

Given GeneRIFs formatted as:
• 1    355    12107169    J Biol Chem 2002 Sep 13;277(37):34343-8.    the death effector domain of FADD is involved in interaction with Fas.
• 2    355    12177303    Nucleic Acids Res 2002 Aug 15;30(16):3609-14.    In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w

… reproduce the GeneRIF from the MEDLINE record.
