2
Outline – a Vision
• Why do we need “Text Understanding”?
• Capture understanding by Textual Entailment
– Does one text entail another?
• Major challenge – knowledge acquisition
• Initial applications
• Looking 5 years ahead
3
Text Understanding
• Vision for improving information access
• Common search engines
– Still: text processing mostly matches query keywords
• Deeper understanding:
– Consider the meanings of words and the relationships between them
• Relevant for applications
– Question answering, information extraction, semantic search, summarization
8
Information Extraction (IE)
• Identify information of pre-determined structure
– Automatic filling of “forms”
• Example - extract product information:
Company    Product Type    Product Name
Hyundai    Car             Accent
Hyundai    Car             Elantra
Suzuki     Motorcycle      R-350
9
Search may benefit from understanding
• Query: AIDS treatment
• Irrelevant document:
Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. In 1984, after that was discovered, manufacturers began heating factor VIII to kill the virus. The strategy greatly reduced the problem but was not foolproof. However, many experts believe that adding detergents and other refinements to the purification process has made natural factor VIII virtually free of AIDS.
(AP890118-0146, TIPSTER Vol. 1)
• Many irrelevant documents mention AIDS and treatments for other diseases
10
Relevant Document
• Query: AIDS treatment
Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No.1 killer of AIDS victims. The Food and Drug Administration approved the drug, aerosol pentamidine, on Thursday. The announcement came as the Centers for Disease Control issued greatly expanded treatment guidelines recommending wider use of the drug in people infected with the AIDS virus but who may show no symptoms.
(AP890616-0048, TIPSTER Vol. 1)
• Relevant documents may mention specific types of treatments for AIDS
15
Variability of Semantic Expression
Dow ends up
Dow climbs 255
The Dow Jones Industrial Average closed up 255
Stock market hits a record high
Dow gains 255 points
16
How to capture “understanding”?
Question: Who bought Overture? >> Expected answer form: X bought Overture
Text: Overture’s acquisition by Yahoo …
entails
Hypothesized answer: Yahoo bought Overture
Key task is recognizing that one text entails another
• IE – extract buying events: Y’s acquisition by X ⇨ X buy Y
• Search: Find Acquisitions by Yahoo
• Summarization (multi-document) – identify redundancy
17
Textual Entailment ≈ Human Reading Comprehension
• From a children’s English learning book (Sela and Greenberg):
• Reference Text: “…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …”
• Hypothesis (True/False?): The Bermuda Triangle is near the United States
???
18
PASCAL Recognizing Textual Entailment (RTE) Challenges
FP-6 Funded PASCAL NOE 2004-7
Bar-Ilan University; ITC-irst and CELCT, Trento; MITRE; Microsoft Research
19
Some Examples
Columns: TEXT, HYPOTHESIS, TASK, ENTAILMENT
1
TEXT: Regan attended a ceremony in Washington to commemorate the landings in Normandy.
HYPOTHESIS: Washington is located in Normandy.
TASK: IE
ENTAILMENT: False
2
TEXT: Google files for its long awaited IPO.
HYPOTHESIS: Google goes public.
TASK: IR
ENTAILMENT: True
3
TEXT: …: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
HYPOTHESIS: Cardinal Juan Jesus Posadas Ocampo died in 1993.
TASK: QA
ENTAILMENT: True
4
TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
HYPOTHESIS: The SPD is defeated by the opposition parties.
TASK: IE
ENTAILMENT: True
20
Participation and Impact
• Very successful challenges, world wide:
– RTE-1 – 17 groups
– RTE-2 – 23 groups (~150 downloads!)
– RTE-3 – 25 groups
– RTE-4 (2008) – moved to NIST (TREC organizers)
• High interest in the research community
– Papers, conference keywords, sessions and areas, PhDs, influence on funded projects
– Special issue of the Journal of Natural Language Engineering
21
Results
First Author (Group)       Accuracy       Average Precision
Hickl (LCC)                75.4%          80.8%
Tatu (LCC)                 73.8%          71.3%
Zanzotto (Milan & Rome)    63.9%          64.4%
Adams (Dallas)             62.6%          62.8%
Bos (Rome & Leeds)         61.6%          66.9%
11 groups                  58.1%-60.5%
7 groups                   52.9%-55.6%
Average: 60%
Median: 59%
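The slide reports two scores without defining them. The sketch below shows one common way to compute them, assuming accuracy is the fraction of pairs judged correctly and average precision is the standard ranked-retrieval measure over pairs sorted by system confidence. The function names and the toy labels are illustrative only and are not taken from the challenge data.

```python
def accuracy(gold, predicted):
    """Fraction of text-hypothesis pairs whose True/False judgment matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def average_precision(gold_ranked):
    """gold_ranked: gold labels (True = entailment), sorted by decreasing system confidence.

    Averages the precision measured at the rank of each true entailment pair.
    """
    hits = 0
    total = 0.0
    for rank, is_positive in enumerate(gold_ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy labels, invented for illustration.
acc = accuracy([True, False, True, True], [True, True, True, False])
ap = average_precision([True, False, True, True, False])
```

A system can score higher on average precision than on accuracy when its most confident judgments are mostly right, which is one possible reading of rows where the two columns differ.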
22
What is the main obstacle?
• System reports point at:
– Lack of knowledge: rules, paraphrases, lexical relations, etc.
• It seems that systems that coped better with these issues performed best
23
Research Directions at Bar-Ilan
Knowledge Acquisition, Inference, Applications
Oren Glickman, Idan Szpektor, Roy Bar Haim, Maayan Geffet, Moshe Koppel (Bar-Ilan University)
Shachar Mirkin (Hebrew University, Israel)
Hristo Tanev, Bernardo Magnini, Alberto Lavelli, Lorenza Romano (ITC-irst, Italy)
Bonaventura Coppola, Milen Kouylekov (University of Trento and ITC-irst, Italy)
24
Distributional Word Similarity
“Similar words appear in similar contexts” (Harris, 1968)
Similar Word Meanings ⇔ Similar Contexts
Distributional Similarity Model:
Similar Word Meanings ⇔ Similar Context Features
25
Measuring Context Similarity
Country                    State
Industry (genitive)        Neighboring (modifier)
Neighboring (modifier)     …
…                          Governor (modifier)
Visit (obj)                Parliament (genitive)
…                          Industry (genitive)
Population (genitive)      …
Governor (modifier)        Visit (obj)
Parliament (genitive)      President (genitive)
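The two feature lists above can be compared with a simple set-overlap score. The sketch below uses unweighted Jaccard overlap, which is an assumption made for illustration: the slides do not say which similarity measure or feature weighting is used. The feature sets contain only the entries visible on the slide, with the elided “…” rows left out.

```python
# Context features (word, syntactic relation) copied from the slide; "…" rows are omitted.
country = {
    ("industry", "genitive"), ("neighboring", "modifier"), ("visit", "obj"),
    ("population", "genitive"), ("governor", "modifier"), ("parliament", "genitive"),
}
state = {
    ("neighboring", "modifier"), ("governor", "modifier"), ("parliament", "genitive"),
    ("industry", "genitive"), ("visit", "obj"), ("president", "genitive"),
}


def jaccard(a, b):
    """Share of context features common to both words (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


shared = country & state              # features that both words occur with
similarity = jaccard(country, state)  # 5 shared features out of 7 distinct ones
```

A weighted measure would rank rare, informative features above frequent ones; the unweighted version is kept here only because it is the shortest correct illustration of comparing context features.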
27
Acquisition Example
• Top-ranked entailments for “company”:
firm, bank, group, subsidiary, unit, business, supplier, carrier, agency, airline, division, giant, entity, financial institution, manufacturer, corporation, commercial bank, joint venture, maker, producer, factory …
28
Learning Entailment Rules
Text: Aspirin prevents Heart Attacks
Q: What reduces the risk of Heart Attacks?
Entailment Rule (template ⇨ template): X prevent Y ⇨ X reduce risk of Y
Hypothesis: Aspirin reduces the risk of Heart Attacks
Need a large knowledge base of entailment rules
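A minimal sketch of applying the rule above to the slide's example. It assumes templates are matched with hand-written surface patterns; a real system would more plausibly match over parsed, lemmatized text. The `RULES` list and the function name are hypothetical.

```python
import re

# One rule from the slide: X prevent Y => X reduce risk of Y.
# The left side is a regular expression, the right side a string to fill in.
RULES = [
    (re.compile(r"^(?P<X>.+?) prevents? (?P<Y>.+?)\.?$", re.IGNORECASE),
     "{X} reduces the risk of {Y}"),
]


def entailed_hypotheses(text):
    """Return every hypothesis obtained by applying a rule whose left side matches the text."""
    results = []
    for pattern, right_side in RULES:
        match = pattern.match(text.strip())
        if match:
            results.append(right_side.format(X=match.group("X"), Y=match.group("Y")))
    return results


hypotheses = entailed_hypotheses("Aspirin prevents Heart Attacks")
```

The derived hypothesis has the form the question expects (X reduces the risk of Heart Attacks), so X = Aspirin can be returned as the answer.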
29
TEASE – Algorithm Flow
Resources: WEB, Lexicon
Input template: X subj-accuse-obj Y
Anchor Set Extraction (ASE)
– Sample corpus for input template: Paula Jones accused Clinton… Sanhedrin accused St.Paul… …
– Anchor sets: {Paula Jones (subj); Clinton (obj)} {Sanhedrin (subj); St.Paul (obj)} …
Template Extraction (TE)
– Sample corpus for anchor sets: Paula Jones called Clinton indictable… St.Paul defended before the Sanhedrin … …
– Templates: X call Y indictable; Y defend before X; …
Iterate (ASE, then TE)
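The flow above can be sketched on a toy scale. This is not the TEASE implementation: it assumes a four-sentence in-memory list in place of the Web, surface strings in place of parsed and lemmatized templates, and no ranking or filtering. The sentences are the fragments quoted on the slide, used as if they were complete, and all function names are hypothetical.

```python
import re

# Toy stand-in for the Web: the sentence fragments shown on the slide.
CORPUS = [
    "Paula Jones accused Clinton",
    "Sanhedrin accused St.Paul",
    "Paula Jones called Clinton indictable",
    "St.Paul defended before the Sanhedrin",
]


def template_to_regex(template):
    """Turn a template such as 'X accused Y' into a regex with named groups X and Y."""
    parts = []
    for token in template.split():
        if token in ("X", "Y"):
            parts.append(f"(?P<{token}>.+?)")
        else:
            parts.append(re.escape(token))
    return re.compile("^" + r"\s+".join(parts) + "$")


def extract_anchor_sets(templates, corpus):
    """ASE: collect the (X, Y) pairs that instantiate any known template."""
    anchors = set()
    for template in templates:
        pattern = template_to_regex(template)
        for sentence in corpus:
            match = pattern.match(sentence)
            if match:
                anchors.add((match.group("X"), match.group("Y")))
    return anchors


def extract_templates(anchors, corpus):
    """TE: find sentences containing both anchors and abstract the anchors into slots."""
    found = set()
    for x, y in anchors:
        for sentence in corpus:
            if x in sentence and y in sentence:
                found.add(sentence.replace(x, "X").replace(y, "Y"))
    return found


def tease_sketch(input_template, corpus, iterations=1):
    """Alternate ASE and TE, returning the templates learned beyond the input."""
    templates = {input_template}
    for _ in range(iterations):
        anchors = extract_anchor_sets(templates, corpus)
        templates |= extract_templates(anchors, corpus)
    return templates - {input_template}


learned = tease_sketch("X accused Y", CORPUS)
```

On this toy corpus the loop learns the surface counterparts of the two templates on the slide, "X called Y indictable" and "Y defended before the X". Running more iterations adds nothing here because the learned templates only re-discover the same anchor sets.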
30
Sample of Extracted Anchor-Sets for X prevent Y
X=‘sunscreens’, Y=‘sunburn’
X=‘sunscreens’, Y=‘skin cancer’
X=‘vitamin e’, Y=‘heart disease’
X=‘aspirin’, Y=‘heart attack’
X=‘vaccine candidate’, Y=‘infection’
X=‘universal precautions’, Y=‘HIV’
X=‘safety device’, Y=‘fatal injuries’
X=‘hepa filtration’, Y=‘contaminants’
X=‘low cloud cover’, Y=‘measurements’
X=‘gene therapy’, Y=‘blindness’
X=‘cooperation’, Y=‘terrorism’
X=‘safety valve’, Y=‘leakage’
X=‘safe sex’, Y=‘cervical cancer’
X=‘safety belts’, Y=‘fatalities’
X=‘security fencing’, Y=‘intruders’
X=‘soy protein’, Y=‘bone loss’
X=‘MWI’, Y=‘pollution’
X=‘vitamin C’, Y=‘colds’
31
Sample of Extracted Templates for X prevent Y
X reduce Y
X protect against Y
X eliminate Y
X stop Y
X avoid Y
X for prevention of Y
X provide protection against Y
X combat Y
X ward Y
X lower risk of Y
X be barrier against Y
X fight Y
X reduce Y risk
X decrease the risk of Y
relationship between X and Y
X guard against Y
X be cure for Y
X treat Y
X in war on Y
X in the struggle against Y
X a day keeps Y away
X eliminate the possibility of Y
X cut risk Y
X inhibit Y
32
Experiment and Evaluation
• 48 randomly chosen input verbs
• 1392 templates extracted; human judgments
Encouraging Results:
Average Yield per verb: 29 correct templates per verb
Average Precision per verb: 45.3%
• Future work: improve precision
33
Syntactic Variability Phenomena
Template: X activate Y
Phenomenon          Example
Passive form        Y is activated by X
Apposition          X activates its companion, Y
Conjunction         X activates Z and Y
Set                 X activates two proteins: Y and Z
Relative clause     X, which activates Y
Coordination        X binds and activates Y
Transparent head    X activates a fragment of Y
Co-reference        X is a kinase, though it activates Y
34
Takeaway
• Promising potential for creating huge entailment knowledge bases
– Millions of rules
• Speculation: is it possible to have a public effort for knowledge acquisition?
– Human Genome Project analogy
– Community effort
36
Dataset
• Recognizing interactions between annotated protein pairs (Bunescu 2005)
– 200 Medline abstracts
– Gold standard dataset of protein pairs
• Input template: X interact with Y
37
Manual Analysis - Results
• 93% of interacting protein pairs can be identified with lexical syntactic templates
Number of templates vs. recall (within 93%):
R(%)    # templates    R(%)    # templates
10      2              60      39
20      4              70      73
30      6              80      107
40      11             90      141
50      21             100     175
38
TEASE Output for X interact with Y
A sample of correct templates learned:
X bind to Y                     X binding to Y
X activate Y                    X Y interaction
X stimulate Y                   X attach to Y
X couple to Y                   X interaction with Y
interaction between X and Y     X trap Y
X become trapped in Y           X recruit Y
X Y complex                     X associate with Y
X recognize Y                   X be linked to Y
X block Y                       X target Y
39
TEASE algorithm - Potential Recall on Training Set
• Iterative - taking the top 5 ranked templates as input
• Morph - recognizing morphological derivations
Experiment                   Recall
input                        39%
input + iterative            49%
input + iterative + morph    63%