1

Understanding Text Meaning in Information Applications

Ido Dagan, Bar-Ilan University, Israel

2

Outline – a Vision

• Why do we need “Text Understanding”?

• Capture understanding by Textual Entailment

– Does one text entail another?

• Major challenge – knowledge acquisition

• Initial applications

• Looking 5 years ahead

3

Text Understanding

• Vision for improving information access

• Common search engines

– Still: text processing mostly matches query keywords

• Deeper understanding:

– Consider the meanings of words and the relationships between them

• Relevant for applications

– Question answering, information extraction, semantic search, summarization

4

5

6

Towards text understanding: Question Answering

7

8

Information Extraction (IE)

• Identify information of pre-determined structure

– Automatic filling of “forms”

• Example - extract product information:

Company | Product Type | Product Name
Hyundai | Car | Accent
Hyundai | Car | Elantra
Suzuki | Motorcycle | R-350

9

Search may benefit from understanding

• Query: AIDS treatment

• Irrelevant document:

Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. In 1984, after that was discovered, manufacturers began heating factor VIII to kill the virus. The strategy greatly reduced the problem but was not foolproof. However, many experts believe that adding detergents and other refinements to the purification process has made natural factor VIII virtually free of AIDS.

(AP890118-0146, TIPSTER Vol. 1)

• Many irrelevant documents mention AIDS and treatments for other diseases

10

Relevant Document

• Query: AIDS treatment

Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No. 1 killer of AIDS victims. The Food and Drug Administration approved the drug, aerosol pentamidine, on Thursday. The announcement came as the Centers for Disease Control issued greatly expanded treatment guidelines recommending wider use of the drug in people infected with the AIDS virus but who may show no symptoms.

(AP890616-0048, TIPSTER Vol. 1)

• Relevant documents may mention specific types of treatments for AIDS

11

12

13

14

Why is it difficult?

Meaning

Language

Ambiguity

Variability

15

Variability of Semantic Expression

Dow ends up

Dow climbs 255

The Dow Jones Industrial Average closed up 255

Stock market hits a record high

Dow gains 255 points

16

How to capture “understanding”?

Overture’s acquisition by Yahoo …

Yahoo bought Overture

Question: Who bought Overture? >> Expected answer form: X bought Overture

Key task is recognizing that one text entails another

• IE – extract buying events: Y’s acquisition by X ⇨ X buy Y

• Search: Find Acquisitions by Yahoo

• Summarization (multi-document) – identify redundancy

text ⇨ hypothesized answer (the text entails the hypothesized answer)

17

Textual Entailment ≈ Human Reading Comprehension

• From a children’s English learning book (Sela and Greenberg):

• Reference Text: “…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …”

• Hypothesis (True/False?): The Bermuda Triangle is near the United States

???

18

PASCAL Recognizing Textual Entailment (RTE) Challenges

FP-6 Funded PASCAL NOE 2004-7

Bar-Ilan University; ITC-irst and CELCT, Trento; MITRE; Microsoft Research

19

Some Examples

# | TEXT | HYPOTHESIS | TASK | ENTAILMENT
1 | Regan attended a ceremony in Washington to commemorate the landings in Normandy. | Washington is located in Normandy. | IE | False
2 | Google files for its long awaited IPO. | Google goes public. | IR | True
3 | …: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others. | Cardinal Juan Jesus Posadas Ocampo died in 1993. | QA | True
4 | The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%. | The SPD is defeated by the opposition parties. | IE | True

20

Participation and Impact

• Very successful challenges, world wide:

– RTE-1 – 17 groups

– RTE-2 – 23 groups

• ~150 downloads!

– RTE-3 – 25 groups

– RTE-4 (2008) – moved to NIST (TREC organizers)

• High interest in the research community

– Papers, conference keywords, sessions and areas, PhDs, influence on funded projects

– Special issue of the Journal of Natural Language Engineering

21

Results

First Author (Group) | Accuracy | Average Precision
Hickl (LCC) | 75.4% | 80.8%
Tatu (LCC) | 73.8% | 71.3%
Zanzotto (Milan & Rome) | 63.9% | 64.4%
Adams (Dallas) | 62.6% | 62.8%
Bos (Rome & Leeds) | 61.6% | 66.9%
11 groups | 58.1%-60.5% |
7 groups | 52.9%-55.6% |

Average: 60%; Median: 59%

22

What is the main obstacle?

• System reports point at:

– Lack of knowledge

• rules, paraphrases, lexical relations, etc.

• It seems that systems that coped better with these issues performed best

23

Research Directions at Bar-Ilan

Knowledge Acquisition

Inference

Applications

Oren Glickman, Idan Szpektor, Roy Bar Haim, Maayan Geffet, Moshe Koppel (Bar-Ilan University)

Shachar Mirkin (Hebrew University, Israel)

Hristo Tanev, Bernardo Magnini, Alberto Lavelli, Lorenza Romano (ITC-irst, Italy)

Bonaventura Coppola, Milen Kouylekov (University of Trento and ITC-irst, Italy)

24

Distributional Word Similarity

“Similar words appear in similar contexts” (Harris, 1968)

Similar Word Meanings ⇔ Similar Contexts

Distributional Similarity Model:

Similar Word Meanings ⇔ Similar Context Features

25

Measuring Context Similarity

Country | State
Industry (genitive) | Neighboring (modifier)
Neighboring (modifier) | …
… | Governor (modifier)
Visit (obj) | Parliament (genitive)
… | Industry (genitive)
Population (genitive) | …
Governor (modifier) | Visit (obj)
Parliament (genitive) | President (genitive)
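
As a rough illustration of comparing two words by their context features, the sketch below scores "country" against "state" using only the features visible on this slide. Jaccard overlap is a stand-in chosen for simplicity; the similarity measure actually used in this work is not stated here, and the function and variable names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Share of features common to both sets, out of all features seen."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Context features visible on the slide, as (word, syntactic relation) pairs.
# Entries elided with "…" on the slide are not filled in.
country = {
    ("industry", "genitive"), ("neighboring", "modifier"), ("visit", "obj"),
    ("population", "genitive"), ("governor", "modifier"), ("parliament", "genitive"),
}
state = {
    ("neighboring", "modifier"), ("governor", "modifier"), ("parliament", "genitive"),
    ("industry", "genitive"), ("visit", "obj"), ("president", "genitive"),
}

shared = sorted(country & state)
score = jaccard(country, state)
print(f"{len(shared)} shared features, Jaccard = {score:.2f}")
```

A weighted measure over ranked features would likely be more faithful to the slide, since the features appear in a different order for each word; the set-based version only shows the basic idea of feature overlap.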

26

Incorporate Indicative Patterns

27

Acquisition Example

• Top-ranked entailments for “company”:

firm, bank, group, subsidiary, unit, business, supplier, carrier, agency, airline, division, giant, entity, financial institution, manufacturer, corporation, commercial bank, joint venture, maker, producer, factory …

28

Learning Entailment Rules

Text: Aspirin prevents Heart Attacks

Q: What reduces the risk of Heart Attacks?

Entailment Rule: X prevent Y ⇨ X reduce risk of Y (each side of the rule is a template)

Hypothesis: Aspirin reduces the risk of Heart Attacks

Need a large knowledge base of entailment rules
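
A minimal sketch of how an entailment rule could turn a text into a hypothesis, using the aspirin example from this slide. It assumes surface-form templates matched with regular expressions ("X prevents Y" rather than the lemmatized "X prevent Y"); real systems would match over parsed, lemmatized text. The helper names `template_to_regex` and `apply_rule` are hypothetical.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template such as 'X prevents Y' into a regex with named slots."""
    parts = []
    for token in template.split():
        if token in ("X", "Y"):
            parts.append(rf"(?P<{token}>.+?)")  # variable slot
        else:
            parts.append(re.escape(token))      # literal word
    return re.compile(r"^" + r"\s+".join(parts) + r"$", re.IGNORECASE)


def apply_rule(text: str, lhs: str, rhs: str):
    """If text matches the left-hand template, instantiate the right-hand one."""
    match = template_to_regex(lhs).match(text.strip())
    if match is None:
        return None
    return " ".join(
        match.group(token) if token in ("X", "Y") else token
        for token in rhs.split()
    )


hypothesis = apply_rule(
    "Aspirin prevents Heart Attacks",
    "X prevents Y",
    "X reduces the risk of Y",
)
print(hypothesis)  # Aspirin reduces the risk of Heart Attacks
```

With the hypothesis generated, answering "What reduces the risk of Heart Attacks?" reduces to reading off the X slot.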

29

TEASE – Algorithm Flow

Sources: WEB, Lexicon

Input template: X subj-accuse-obj Y

TEASE iterates between two phases:

Anchor Set Extraction (ASE)

– Sample corpus for input template: Paula Jones accused Clinton… Sanhedrin accused St.Paul… …

– Anchor sets: {Paula Jones subj; Clinton obj} {Sanhedrin subj; St.Paul obj} …

Template Extraction (TE)

– Sample corpus for anchor sets: Paula Jones called Clinton indictable… St.Paul defended before the Sanhedrin …

– Templates: X call Y indictable; Y defend before X; …
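
The two phases can be sketched on a toy scale as follows. This is a heavily simplified illustration, not the TEASE implementation: a small in-memory list stands in for the Web, templates are surface strings rather than dependency-parse fragments, and there is no ranking or filtering. The corpus sentences are taken from this slide plus one invented distractor; all function names are hypothetical.

```python
import re

# Toy stand-in for the Web; the last sentence is an invented distractor.
CORPUS = [
    "Paula Jones accused Clinton",
    "Sanhedrin accused St.Paul",
    "Paula Jones called Clinton indictable",
    "St.Paul defended before the Sanhedrin",
    "Clinton visited Paris",
]


def template_to_regex(template: str) -> re.Pattern:
    """Turn 'X accused Y' into a regex with named slots X and Y."""
    parts = [
        rf"(?P<{tok}>.+?)" if tok in ("X", "Y") else re.escape(tok)
        for tok in template.split()
    ]
    return re.compile(r"^" + r"\s+".join(parts) + r"$")


def extract_anchor_sets(template: str, corpus: list) -> list:
    """ASE phase: collect (X, Y) fillers from sentences matching the template."""
    regex = template_to_regex(template)
    anchors = []
    for sentence in corpus:
        match = regex.match(sentence)
        if match:
            anchors.append((match.group("X"), match.group("Y")))
    return anchors


def extract_templates(template: str, anchors: list, corpus: list) -> list:
    """TE phase: generalize other sentences that mention both anchors."""
    regex = template_to_regex(template)
    found = set()
    for x, y in anchors:
        for sentence in corpus:
            if x in sentence and y in sentence and not regex.match(sentence):
                found.add(sentence.replace(x, "X").replace(y, "Y"))
    return sorted(found)


anchor_sets = extract_anchor_sets("X accused Y", CORPUS)
new_templates = extract_templates("X accused Y", anchor_sets, CORPUS)
print(anchor_sets)
print(new_templates)
```

To iterate, each newly found template could be fed back in as an input template for another ASE pass.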

30

Sample of Extracted Anchor-Sets for X prevent Y

X=‘sunscreens’, Y=‘sunburn’

X=‘sunscreens’, Y=‘skin cancer’

X=‘vitamin e’, Y=‘heart disease’

X=‘aspirin’, Y=‘heart attack’

X=‘vaccine candidate’, Y=‘infection’

X=‘universal precautions’, Y=‘HIV’

X=‘safety device’, Y=‘fatal injuries’

X=‘hepa filtration’, Y=‘contaminants’

X=‘low cloud cover’, Y=‘measurements’

X=‘gene therapy’, Y=‘blindness’

X=‘cooperation’, Y=‘terrorism’

X=‘safety valve’, Y=‘leakage’

X=‘safe sex’, Y=‘cervical cancer’

X=‘safety belts’, Y=‘fatalities’

X=‘security fencing’, Y=‘intruders’

X=‘soy protein’, Y=‘bone loss’

X=‘MWI’, Y=‘pollution’

X=‘vitamin C’, Y=‘colds’

31

Sample of Extracted Templates for X prevent Y

X reduce Y

X protect against Y

X eliminate Y

X stop Y

X avoid Y

X for prevention of Y

X provide protection against Y

X combat Y

X ward Y

X lower risk of Y

X be barrier against Y

X fight Y

X reduce Y risk

X decrease the risk of Y

relationship between X and Y

X guard against Y

X be cure for Y

X treat Y

X in war on Y

X in the struggle against Y

X a day keeps Y away

X eliminate the possibility of Y

X cut risk Y

X inhibit Y

32

Experiment and Evaluation

• 48 randomly chosen input verbs

• 1392 templates extracted; human judgments

Encouraging Results:

Average Yield per verb | 29 correct templates per verb
Average Precision per verb | 45.3%

• Future work: improve precision

33

Syntactic Variability Phenomena

Template: X activate Y

Phenomenon | Example
Passive form | Y is activated by X
Apposition | X activates its companion, Y
Conjunction | X activates Z and Y
Set | X activates two proteins: Y and Z
Relative clause | X, which activates Y
Coordination | X binds and activates Y
Transparent head | X activates a fragment of Y
Co-reference | X is a kinase, though it activates Y
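
To make the matching problem concrete, here is a small sketch that recognizes the template "X activate Y" in just two of these surface forms, active and passive, and returns the same (X, Y) pair for both. It assumes single-word arguments and plain regular expressions, which is far cruder than matching over syntactic parses; the protein names and the `find_activation` helper are hypothetical.

```python
import re

# One pattern per surface form; both bind the same slots X and Y.
PATTERNS = [
    re.compile(r"(?P<X>\w+) activates (?P<Y>\w+)"),        # active form
    re.compile(r"(?P<Y>\w+) is activated by (?P<X>\w+)"),  # passive form
]


def find_activation(sentence: str):
    """Return (X, Y) if the sentence is a known variant of 'X activate Y'."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(sentence.strip())
        if match:
            return match.group("X"), match.group("Y")
    return None


print(find_activation("ProteinA activates ProteinB"))
print(find_activation("ProteinB is activated by ProteinA"))
```

Each further phenomenon in the table (apposition, coordination, transparent heads, and so on) would need its own handling, which is why per-variant string patterns do not scale and syntactic analysis is attractive.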

34

Takeaway

• Promising potential for creating huge entailment knowledge bases

– Millions of rules

• Speculation: is it possible to have a public effort for knowledge acquisition?

– Human Genome Project analogy

– Community effort

35

Initial Applications:

Relation Extraction

Semantic Search

36

Dataset

• Recognizing interactions between annotated protein pairs (Bunescu 2005)

– 200 Medline abstracts

– Gold standard dataset of protein pairs

• Input template: X interact with Y

37

Manual Analysis - Results

• 93% of interacting protein pairs can be identified with lexical-syntactic templates

Number of templates vs. recall (within 93%):

R(%) | # templates | R(%) | # templates
10 | 2 | 60 | 39
20 | 4 | 70 | 73
30 | 6 | 80 | 107
40 | 11 | 90 | 141
50 | 21 | 100 | 175

38

TEASE Output for X interact with Y

A sample of correct templates learned:

X bind to Y | X binding to Y
X activate Y | X Y interaction
X stimulate Y | X attach to Y
X couple to Y | X interaction with Y
interaction between X and Y | X trap Y
X become trapped in Y | X recruit Y
X Y complex | X associate with Y
X recognize Y | X be linked to Y
X block Y | X target Y

39

TEASE algorithm - Potential Recall on Training Set

• Iterative - taking the top 5 ranked templates as input

• Morph - recognizing morphological derivations

Experiment | Recall
input | 39%
input + iterative | 49%
input + iterative + morph | 63%

40

41

42

43

Integrating IE and Search (with IBM Research Haifa)

44

Optimistic Conclusions

• Good prospects for better levels of text understanding

– Enabling more sophisticated information access

• Textual entailment is an appealing framework

– Boosts research on text understanding

– Potential for vast knowledge acquisition

Thank you!