Robust Semantic Processing for Information Extraction
Ann Copestake, Computer Laboratory, University of [email protected]
Outline
• Information Extraction
• Combining deep and shallow processing
• RMRS
• MRS
• basic ideas of RMRS
• RASP-RMRS
• RMRS and IE in Deep Thought
• SciBorg project
Acknowledgements
• Deep Thought (EU funded, 2002-2004)
  • Computer Lab: Ann Copestake, Anna Ritchie, Ben Waldron
  • Sussex, Saarland, DFKI, Xtramind, CELI, NTNU
• SciBorg (EPSRC, 2005-2009)
  • Computer Lab: Ann Copestake, Simone Teufel, CJ Rupp, Advaith Siddharthan
  • Chemistry: Peter Murray-Rust, Peter Corbett
  • CeSC: Mark Hayes, Andy Parker
• DELPH-IN (informal ongoing collaboration)
  • Boeing funding to Computer Lab: Ben Waldron
  • especially Dan Flickinger, Alex Lascarides, Stephan Oepen, John Carroll, Anette Frank
Information extraction
• Classic IE: MUC-style template filling, gene/protein interactions
• IE in general: acquiring specific types of knowledge from text via language processing, e.g.,
  • organic chemistry syntheses
  • ontological relationships
  • relationships between texts (for search)
• IR, QA, I2E
IE from Chemistry texts
• To a solution of aldimine1 (1.5mmol) in THF (5mL) was added LDA (1mL, 1.6 M in THF) at 0 °C under argon, the resulting mixture was stirred for 2h, then was cooled to -78 °C ...
  -> recipe expressed in CML
• ... alkaloids and other complex polycyclic azacycles ...
  -> <owl:Class rdf:ID="Alkaloid"> <rdfs:subClassOf rdf:resource="#Azacycle" />
• Enamines have been used widely ... (citation Y), however, ... did not provide the desired products.
  -> X cites Y (contrast)
Standard IE architecture
1. Preprocessing of markup etc (specific to text type)
2. Tokenisation (not domain-specific)
3. Named Entity Recognition (domain-specific ontologies, domain-specific patterns)
4. Chunking: detection of noun and verb groups (not domain-specific)
5. Anaphora resolution (domain-specific ontologies)
6. Relationship detection via patterns over chunks (domain- and task-specific)
7. DB instantiation (task-specific)
State of the art in IE
• Several options for whole IE systems and individual components, especially for English
• Increasing integration of ontologies
• Commercial systems for some applications
• But, many IE-style tasks still done manually:
  • IE performance (especially when high precision required)
  • IE robustness to different text types
  • IE porting requirement (especially NER and relation patterns)
• Performance of standard architecture may be reaching a plateau
• More advanced IE tasks are not generally attempted
  • e.g., organic synthesis example could be done with adaptation of standard architecture, but would take substantial effort by highly trained people.
  • Skill set: substantial domain skills plus substantial NLP
Objectives
• Integrate and adapt tools for language processing in general
• Eventual use by non-NLP people: black box for language processing
• Incorporate deeper processing (DELPH-IN technology): aim to get above plateau
• Integration with XML, semantic web
• Methodology:
  • Combine statistical and symbolic processing, machine learning and hand-crafting
  • Open Source where possible, collaborative development
  • No toy systems, no artificial evaluations
  • Multilingual via collaboration
Deep processing in IE
• Some early IE systems attempted to use deep processing: SRI (and also NYU)
• FASTUS was originally shallow preprocessor for TACITUS but TACITUS was dropped: much too slow, not sufficiently robust
• Often claimed: deep processing failed for IE, but:
  • only two serious attempts(?), both under time pressure, limited types of IE task
  • deep processing has improved since early 1990s:
    • speed
    • empirical coverage (note that hand-built deep grammars do scale, unlike traditional AI knowledge bases)
    • integration of statistical techniques into deep processing
  • if existing IE architecture is approaching a plateau, we have to try something else – i.e., combined deep and shallow processing (DFKI Whiteboard project)
Integrating processing
• No single system can do everything: deep and shallow processing have inherent strengths and weaknesses
  • shallow: speed and robustness: e.g., POS tagging, chunking
  • deep: detail, precision, potential for bidirectional processing: e.g., HPSG-based parsers and generators (DELPH-IN technology)
  • also intermediate: RASP (Robust accurate statistical parser): relatively detailed but no lexicon.
• Domain-dependent and domain-independent processing must be linked
• Desirable to have a common representation language for processing above sentence level (e.g., anaphora)
• Long-term solutions ...
Compositional semantics for component integration
• Need a common representation language for systems: pairwise compatibility between systems is too limiting
• Syntax is theory-specific and unnecessarily language-specific
• Eventual goal of sentence analysis should be semantics
• Core idea: shallow processing gives underspecified semantic representation, so deep and shallow systems can be integrated
• Full interlingua / common lexical semantics is too difficult (certainly currently), but can link predicates to ontologies, etc.
Integration via underspecified semantics
• Integrated parsing:
  • shallow parsed phrases incorporated into deep parsed structures
  • deep parsing invoked incrementally in response to information needs
• Knowledge sources expressed via semantics can be used by multiple components: e.g., NER, IE templates, anaphora resolution
• Advantages over ad-hoc representation approaches:
  • Ability to link with detailed lexical semantics as it becomes available
  • Language generation from semantic representation
  • Explicit logic: formal properties clearer, representations more generally usable
  • Deep semantics taken as normative: extensibility
Robust Minimal Recursion Semantics
• Minimal Recursion Semantics: MRS. Compositional semantics for deep processing:
  • Copestake, Flickinger, Sag and Pollard (1999, in press)
  • adopted for DELPH-IN and other HPSG work
  • also compatible with LFG etc
• logically well-defined
• flat semantics (easier to process, allows information to be ignored)
• underspecification of quantifier scope (avoid ambiguity)
• novel approach to composition (monostratal)
• Robust MRS: adaptation of MRS allowing processing without a subcategorization lexicon
RMRS: Extreme underspecification
• Goal is to split up semantic representation into minimal components (cf Verbmobil VITs)
  • Scope underspecification (MRS)
  • Splitting up predicate argument structure
  • Explicit equalities
  • Hierarchies for predicates and sorts
• Compatibility with deep grammars:
  • Sorts and (some) closed class word information in SEM-I (API for grammar, more later)
  • No lexicon for shallow processing (apart from POS tags and possibly closed class words)
Semantics from POS tagging
every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1
_every_q(x1), _cat_n(x2sg), _chase_v(epast), _some_q(x3), _dog_n(x4sg)
Tag lexicon:
AT1  _lemma_q(x)
NN1  _lemma_n(xsg)
VVD  _lemma_v(epast)
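The tag lexicon lookup above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `pos_to_rmrs`, the tuple layout of `TAG_LEXICON`, and the variable-numbering scheme (instance variables numbered, the event variable not) are chosen here to reproduce the slide's output, not taken from any RMRS tool.

```python
# Hypothetical tag lexicon mirroring the slide:
# POS tag -> (predicate suffix, variable type, sort on the variable)
TAG_LEXICON = {
    "AT1": ("q", "x", ""),      # singular article -> _lemma_q(x)
    "NN1": ("n", "x", "sg"),    # singular noun    -> _lemma_n(xsg)
    "VVD": ("v", "e", "past"),  # past-tense verb  -> _lemma_v(epast)
}

def pos_to_rmrs(tagged_text):
    """Turn 'lemma_TAG' tokens into underspecified predications."""
    predications = []
    x_count = 0
    for token in tagged_text.split():
        lemma, tag = token.rsplit("_", 1)
        suffix, var, sort = TAG_LEXICON[tag]  # KeyError for tags not listed
        if var == "x":
            # Each nominal or quantifier token gets a fresh instance variable.
            x_count += 1
            variable = f"x{x_count}{sort}"
        else:
            variable = f"{var}{sort}"
        predications.append(f"_{lemma}_{suffix}({variable})")
    return predications

print(", ".join(pos_to_rmrs("every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1")))
```

Nothing links the quantifier variables to the noun variables at this stage; that information only arrives from deeper processing.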
Deep parser output
Conventional semantic representation
Every cat chased some dog
every(xsg,cat(xsg),some(ysg,dog1(ysg),chase(esp,xsg,ysg)))
some(ysg,dog1(ysg),every(xsg,cat(xsg),chase(esp,xsg,ysg)))
• Compositional: reflects morphology and syntax
• Scope ambiguity is explicit
• May be awkward to process if you don’t care about quantifier scope
Modifying syntax of deep grammar semantics: overview
1. Underspecification of quantifier scope: Minimal Recursion Semantics (MRS) – next 6 slides ...
2. Robust MRS
  • Separating arguments
  • Explicit equalities
  • Conventions for predicate names and sense distinctions
  • Hierarchy of sorts on variables
PC trees
Every cat chased some dog
[Figure: the two scoped readings drawn as predicate calculus trees]
every(x, cat(x), some(y, dog1(y), chase(e,x,y)))
some(y, dog1(y), every(x, cat(x), chase(e,x,y)))
PC trees share structure
[Figure: the same two trees; the fragments headed by every (with cat(x)), by some (with dog1(y)) and chase(e,x,y) occur in both]
Bits of trees
[Figure: three tree fragments: every over x and cat(x), some over y and dog1(y), and chase(e,x,y)]
Reconstruction conditions:
• tree-ness
• variable binding
Label nodes and holes
[Figure: labelled fragments lb1:every over x, lb2:cat(x) and hole h6; lb4:some over y, lb5:dog1(y) and hole h7; lb3:chase(e,x,y); top hole h0]
h0 – hole corresponding to the top of the tree
Valid solutions: equate holes and labels
Maximize splitting
[Figure: fully split fragments lb1:every over x with holes h9 and h6; lb2:cat(x); lb4:some over y with holes h8 and h7; lb5:dog1(y); lb3:chase(e,x,y); top hole h0]
Constraints: h8=lb5, h9=lb2
MRS: flat representation
• elementary predications: lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y)
• scope constraints: h9=lb2, h8=lb5 (actually qeqs)
• easy to ignore quantification when not relevant for application: cat(x), dog1(y), chase(e,x,y)
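As an illustration of how scoped readings can be recovered from the flat representation, the sketch below tries every way of equating the remaining holes (h0, h6, h7) with the remaining labels (lb1, lb4, lb3) and keeps the assignments that satisfy the two reconstruction conditions, tree-ness and variable binding. It is a brute-force toy under stated assumptions: the scope constraints are treated as plain equalities rather than qeqs, and all names (`EPS`, `build`, `scoped_readings`) are hypothetical.

```python
from itertools import permutations

# label -> (predicate, variables, holes); an EP with holes is a quantifier
# that binds its first variable
EPS = {
    "lb1": ("every", ["x"], ["h9", "h6"]),
    "lb2": ("cat", ["x"], []),
    "lb4": ("some", ["y"], ["h8", "h7"]),
    "lb5": ("dog1", ["y"], []),
    "lb3": ("chase", ["e", "x", "y"], []),
}
FIXED = {"h9": "lb2", "h8": "lb5"}  # constraints, simplified to equalities
FREE_HOLES = ["h0", "h6", "h7"]
FREE_LABELS = ["lb1", "lb4", "lb3"]
QUANTIFIED = {"x", "y"}

def build(label, assign, bound, seen):
    """Return the formula rooted at label, or None if it is ill-formed."""
    if label in seen:  # a label reached twice means this is not a tree
        return None
    seen.add(label)
    pred, variables, holes = EPS[label]
    if holes:
        bound = bound | {variables[0]}
    elif any(v in QUANTIFIED and v not in bound for v in variables):
        return None  # variable used outside the scope of its quantifier
    parts = list(variables)
    for hole in holes:
        sub = build(assign[hole], assign, bound, seen)
        if sub is None:
            return None
        parts.append(sub)
    return f"{pred}({','.join(parts)})"

def scoped_readings():
    readings = []
    for perm in permutations(FREE_LABELS):
        assign = dict(FIXED, **dict(zip(FREE_HOLES, perm)))
        seen = set()
        tree = build(assign["h0"], assign, frozenset(), seen)
        if tree is not None and seen == set(EPS):  # every EP must be used
            readings.append(tree)
    return readings

for reading in scoped_readings():
    print(reading)
```

Of the six candidate assignments only two survive, matching the two trees for "Every cat chased some dog".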
RMRS: Separating arguments
lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y), h9=lb2, h8=lb5
goes to:
lb1:every(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase(e), ARG1(lb3,x), ARG2(lb3,y), h9=lb2, h8=lb5
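The splitting step can be sketched as a small conversion function. This is a minimal illustration, not the actual MRS <-> RMRS converter: the tuple encoding of elementary predications, the `quantifier` flag, and the function names are assumptions made for the example.

```python
def split_ep(label, pred, args, quantifier=False):
    """Split one MRS elementary predication into RMRS pieces.

    The first argument stays with the predicate; every further argument
    becomes a separate relation anchored on the label.
    """
    pieces = [f"{label}:{pred}({args[0]})"]
    if quantifier:
        roles = ["RSTR", "BODY"]
    else:
        roles = [f"ARG{i}" for i in range(1, len(args))]
    for role, value in zip(roles, args[1:]):
        pieces.append(f"{role}({label},{value})")
    return pieces

def to_rmrs(mrs):
    """Convert a list of (label, predicate, args, is_quantifier) tuples."""
    out = []
    for label, pred, args, quantifier in mrs:
        out.extend(split_ep(label, pred, args, quantifier))
    return out

MRS = [
    ("lb1", "every", ["x", "h9", "h6"], True),
    ("lb2", "cat", ["x"], False),
    ("lb5", "dog1", ["y"], False),
    ("lb4", "some", ["y", "h8", "h7"], True),
    ("lb3", "chase", ["e", "x", "y"], False),
]

print(", ".join(to_rmrs(MRS)))
```

Because each argument is now its own relation, a shallow processor that cannot identify ARG2 can simply leave that piece out.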
Naming conventions: predicate names without a lexicon
lb1:_every_q(x1sg), RSTR(lb1,h9), BODY(lb1,h6),
lb2:_cat_n(x2sg),
lb5:_dog_n_1(x4sg),
lb4:_some_q(x3sg), RSTR(lb4,h8), BODY(lb4,h7),
lb3:_chase_v(esp), ARG1(lb3,x2sg), ARG2(lb3,x4sg),
h9=lb2, h8=lb5, x1sg=x2sg, x3sg=x4sg
note also explicit equalities
POS output as underspecification
DEEP –
lb1:_every_q(x1sg), RSTR(lb1,h9), BODY(lb1,h6), lb2:_cat_n(x2sg), lb5:_dog_n_1(x4sg), lb4:_some_q(x3sg), RSTR(lb4,h8), BODY(lb4,h7), lb3:_chase_v(esp), ARG1(lb3,x2sg), ARG2(lb3,x4sg), h9=lb2, h8=lb5, x1sg=x2sg, x3sg=x4sg
POS –
lb1:_every_q(x1), lb2:_cat_n(x2sg), lb3:_chase_v(epast), lb4:_some_q(x3), lb5:_dog_n(x4sg)
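One way to make "POS output is an underspecified form of the deep output" concrete is a subsumption check on predicate names. The sketch below covers only the naming convention (a name without a sense suffix, such as _dog_n, subsumes _dog_n_1); it ignores variable sorts, arguments and equalities, assumes the two analyses already share labels as on the slide, and would need adjusting for lemmas that themselves contain underscores. All function names are hypothetical.

```python
def subsumes(general, specific):
    """True if `general` equals `specific` or omits trailing detail from it."""
    g = general.split("_")
    s = specific.split("_")
    return s[:len(g)] == g

def underspecifies(shallow, deep):
    """True if every shallow predicate subsumes the deep one on the same label."""
    return all(
        label in deep and subsumes(pred, deep[label])
        for label, pred in shallow.items()
    )

POS = {"lb1": "_every_q", "lb2": "_cat_n", "lb3": "_chase_v",
       "lb4": "_some_q", "lb5": "_dog_n"}
DEEP = {"lb1": "_every_q", "lb2": "_cat_n", "lb3": "_chase_v",
        "lb4": "_some_q", "lb5": "_dog_n_1"}

print(underspecifies(POS, DEEP))  # POS output is compatible with the deep output
print(underspecifies(DEEP, POS))  # the reverse fails: _dog_n_1 is more specific
```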
RMRS principles
• Split up information content as much as possible
• Accumulate information monotonically by simple operations
• Don’t represent what you don’t know but preserve everything you do know
• Use a flat representation to allow pieces to be accessed individually
Semantics from RASP
• RASP: robust, domain-independent, statistical parsing (Briscoe and Carroll)
• can’t produce conventional semantics because no subcategorization
• can often identify arguments:
  • S -> NP VP
  • NP supplies ARG1 for V
• potential for partial identification:
  • VP -> V NP
  • S -> NP S
  • NP might be ARG2 or ARG3
RMRS construction
• deep grammars: MRS <-> RMRS converter.
• POS-RMRS: tag lexicon.
• RASP-RMRS: tag lexicon plus semantic rules associated with RASP rules.
  • no lexical subcategorization, so rely on grammar rules to provide the ARGs
  • output aims to match deep grammar (ERG)
  • developed on basis of ERG semantic test suite
  • default composition principles when no rule RMRS specified
Composition algebra:
• MRS composition assumes a lexicalized approach: algebra defined in Copestake, Lascarides and Flickinger (2001)
• RMRS with non-lexicalised grammars has similar basic algebra
• All approaches have common composition principles, so there is compatibility at a phrasal level.
Some cat sleeps (in RASP)
sleeps: [h3,e], <h3>, {h3:_sleep(e)}
some cat: [h,x], <h1>, {h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
S->NP VP: Head=VP, ARG1(<VP anchor>,<NP hook.index>)
some cat sleeps: [h3,e], <h3>, {h3:_sleep(e), ARG1(h3,x), h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
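The S->NP VP rule above can be mimicked with a toy composition function. The dictionary encoding is an assumption for illustration: `hook` stands for the bracketed [label, index] pair, `anchor` for the angle-bracketed element, and `rels` for the braces; reading <h3> as the anchor follows the rule's wording but is not confirmed elsewhere in the slides.

```python
def compose_s_np_vp(np, vp):
    """S -> NP VP: the VP is the head; the NP index fills ARG1 of the VP anchor."""
    np_index = np["hook"][1]
    link = f"ARG1({vp['anchor']},{np_index})"
    return {
        "hook": vp["hook"],      # the head's hook is passed up
        "anchor": vp["anchor"],
        "rels": vp["rels"] + [link] + np["rels"],  # relations only accumulate
    }

sleeps = {"hook": ("h3", "e"), "anchor": "h3", "rels": ["h3:_sleep(e)"]}
some_cat = {
    "hook": ("h", "x"),
    "anchor": "h1",
    "rels": ["h1:_some(x)", "RSTR(h1,h2)", "h2:_cat(x)"],
}

sentence = compose_s_np_vp(some_cat, sleeps)
print(sentence["hook"], sentence["rels"])
```

The rule supplies ARG1 without consulting any lexical entry for the verb, which is what allows RASP-RMRS to work without subcategorization information.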
ERG-RMRS / RASP-RMRS
Inchoative
Infinitival subject (unbound in RASP-RMRS)
Mismatch: Expletive it
SEM-I: semantic interface
• Meta-level: manually specified 'grammar' relations (constructions and closed-class)
• Object-level: linked to lexical database for deep grammars
• Object-level SEM-I auto-generated from expanded lexical entries in deep grammars (because type can contribute relations)
• Validation of other lexicons
• Need closed class items for RMRS construction from shallow processing
Alignment and XML
• Comparing RMRSs for same text efficiently requires 'characterization'
  • labels RMRSs according to their source in the text
  • currently characters, but also XPath plus characters
• RMRS-XML
• RMRS seen as levels of mark-up: standoff annotation
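A rough sketch of why characterization makes comparison efficient: if each predication carries the character span of the text it came from, two analyses can be aligned by dictionary lookup on spans instead of by structural search. The data layout and the predicate names in the example (including the sense-suffixed ones) are hypothetical and do not reflect the RMRS-XML format.

```python
import re

def token_spans(text):
    """Character offsets (start, end) of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def characterize(preds, spans):
    """Index predications by the character span they were derived from."""
    return dict(zip(spans, preds))

def align(first, second):
    """Pair up predications from two analyses that cover the same characters."""
    return [(first[span], second[span]) for span in sorted(first) if span in second]

TEXT = "Some cat sleeps"
SPANS = token_spans(TEXT)

# Shallow analysis covers every token; the other analysis is deliberately partial.
shallow = characterize(["_some_q", "_cat_n", "_sleep_v"], SPANS)
partial_deep = characterize(["_cat_n_1", "_sleep_v_1"], SPANS[1:])

print(align(shallow, partial_deep))
```

The text itself is never modified; each analysis only points into it, which is the standoff idea.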
RMRS approach: current and planned applications
• Question answering:
  • Cambridge CSTIT: deep parse questions, shallow parse answers
  • QA from structured knowledge: Frank et al (QUETAL project)
• Information extraction:
  • emails (Deep Thought)
  • Chemistry texts (SciBorg)
• Dictionary definition parsing for Japanese and English (Bond and Flickinger)
• Rhetorical structure, multi-document summarization ...
• also LOGON: semantic transfer. MRSs from LFG used in HPSG generator.
RMRS in Deep Thought
• Different systems integrated via the HoG:
  • Invoke shallow or deep parsing, full or partial results, all expressed in RMRS.
  • Also shallow parsing as precursor to deep parsing: NER, unknown words.
• Preliminary test on email response application (Xtramind Mailminder):
  • email categorized, then category-specific templates built from RMRS
  • increase in precision of automatically instantiated templates (up to 29%) with the addition of deep parser to the system
IE architecture using deeper processing and RMRS
1. Preprocessing of markup etc
2. Tokenisation
3. Named Entity Recognition: delivers RMRS
4. Shallow processing (including chunking): delivers RMRS
5. Deep parsing: uses shallow processing and NER, delivers RMRS
6. Word sense disambiguation: uses RMRS from best available source, further instantiates RMRS according to ontology
7. Anaphora resolution: uses RMRS from best available source, further instantiates RMRS
8. Relationship detection via patterns over deepest possible RMRSs
9. DB instantiation
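The phrase "RMRS from best available source" can be read as a simple fallback policy: later components take the deepest analysis that actually produced a result for a sentence. The sketch below assumes a ranking of deep parser over RASP over POS tagging and a dictionary keyed by hypothetical source names; neither is specified in the slides.

```python
# Assumed ranking of analysis sources, deepest first (hypothetical names).
DEPTH_ORDER = ["deep", "rasp", "pos"]

def best_available(analyses):
    """Return (source, rmrs) for the deepest source that produced an RMRS."""
    for source in DEPTH_ORDER:
        rmrs = analyses.get(source)
        if rmrs:  # skips missing sources, None and empty results
            return source, rmrs
    return None, []

# The deep parser failed on this sentence, so the RASP analysis is used.
sentence_analyses = {
    "deep": None,
    "rasp": ["h3:_sleep(e)", "ARG1(h3,x)", "h1:_some(x)", "RSTR(h1,h2)", "h2:_cat(x)"],
    "pos": ["_some_q(x1)", "_cat_n(x2sg)", "_sleep_v(e)"],
}

source, rmrs = best_available(sentence_analyses)
print(source, len(rmrs))
```

Since every source delivers the same representation language, downstream components such as anaphora resolution do not need to know which parser supplied the input.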
SciBorg: Chemistry texts
• eScience project started in October at Cambridge
  • Computer Laboratory, Chemistry, CeSC
• Partners: Nature Publishing, Royal Society of Chemistry, International Union of Crystallography (supplying papers and publishing expertise)
• Aims:
  1. Develop an NL markup language which will act as a platform for extraction of information. Link to semantic web languages.
  2. Develop IE technology and core ontologies for use by publishers, researchers, readers, vendors and regulatory organisations.
  3. Model scientific argumentation and citation purpose in order to support novel modes of information access.
  4. Demonstrate the applicability of this infrastructure in a real-world eScience environment.
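Outline architecture
[Figure: architecture diagram. Inputs: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf), brought into a base XML. Components: sentence splitting, POS tagging, NER, RASP, ERG/PET, RMRS merge, WSD, anaphora, rhetorical analysis, tasks. Shared layer: standoff annotation.]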
Research markup
• Chemistry: The primary aims of the present study are (i) the synthesis of an amino acid derivative that can be incorporated into proteins via standard solid-phase synthesis methods, and (ii) a test of the ability of the derivative to function as a photoswitch in a biological environment.
• Computational Linguistics: The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model.
RMRS and research markup
• Specify cues in RMRS: e.g., l1:objective(x), ARG1(l1,y), l2:research(y)
• The concept objective generalises the predicates for aim, goal etc and research generalises study, work etc. Ontology for rhetorical structure.
• Deep process possible cue phrases to get RMRSs:
  • feasible because domain-independent
  • more general and reliable than shallow techniques
  • allows for complex interrelationships e.g., our goal is not to ... but to ...
• Use zones for advanced citation maps (e.g., X cites Y (contrast)) and other enhancements to repositories
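The cue pattern l1:objective(x), ARG1(l1,y), l2:research(y) can be matched against an RMRS once predicates are mapped to the concepts that generalise them. The sketch below uses a four-entry stand-in for the ontology and hand-written RMRS fragments; the predicate names, the tuple encoding and the function name are all assumptions for illustration.

```python
# Hypothetical ontology fragment: predicate -> concept that generalises it
ONTOLOGY = {
    "_aim_n": "objective",
    "_goal_n": "objective",
    "_study_n": "research",
    "_work_n": "research",
}

def find_objective_cues(preds, args):
    """Match l1:objective(x), ARG1(l1,y), l2:research(y).

    preds: list of (label, predicate, variable)
    args:  list of (role, label, variable)
    Returns (label of the objective predicate, variable of the research noun).
    """
    concept = {label: ONTOLOGY.get(pred) for label, pred, _ in preds}
    research_vars = {var for _, pred, var in preds
                     if ONTOLOGY.get(pred) == "research"}
    hits = []
    for role, label, var in args:
        if role == "ARG1" and concept.get(label) == "objective" and var in research_vars:
            hits.append((label, var))
    return hits

# Hand-written fragment for "The goal of the work reported here is ..."
preds = [("l1", "_goal_n", "x"), ("l2", "_work_n", "y"), ("l3", "_method_n", "z")]
args = [("ARG1", "l1", "y")]

print(find_objective_cues(preds, args))
```

The same pattern would fire on "the primary aims of the present study" if the parser delivers _aim_n and _study_n, so one cue specification covers both example sentences.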
Conclusions
• Information Extraction is more than company mergers or gene-protein interactions!
• Combined deep-shallow processing techniques have potential for IE
• RMRS is a representation language that allows for deep-shallow compatibility via extreme underspecification
  • various systems adapted to output RMRS and further work ongoing
  • RMRS offers detailed compatibility at a phrasal level
  • RMRS processing can be integrated with ontologies in various ways
  • RMRS tools are distributed as Open Source via DELPH-IN
• SciBorg will further develop this approach for eScience applications using a generic standoff architecture
Further work on RASP-RMRS
• Fast enough (time not significant compared to RASP processing time because no ambiguity)
• Too many RASP rules! Need to generalise over classes.
• Requires SEM-I: i.e., API for MRS/RMRS from deep grammar
• RASP and ERG may change:
  • compatible test suites – semi-automatic rule update?
  • alternative technique for composition?
• Parse selection – need to generalise over RMRSs
  • weighted intersections of RMRSs (cf RASP grammatical relations)