Robust Semantic Processing for Information Extraction Ann Copestake Computer Laboratory, University of Cambridge [email protected]


Outline
• Information Extraction
• Combining deep and shallow processing
• RMRS
  • MRS
  • basic ideas of RMRS
  • RASP-RMRS
• RMRS and IE in Deep Thought
• SciBorg project

Acknowledgements
• Deep Thought (EU funded, 2002-2004)
  • Computer Lab: Ann Copestake, Anna Ritchie, Ben Waldron
  • Sussex, Saarland, DFKI, Xtramind, CELI, NTNU
• SciBorg (EPSRC, 2005-2009)
  • Computer Lab: Ann Copestake, Simone Teufel, CJ Rupp, Advaith Siddharthan
  • Chemistry: Peter Murray-Rust, Peter Corbett
  • CeSC: Mark Hayes, Andy Parker
• DELPH-IN (informal ongoing collaboration)
  • Boeing funding to Computer Lab: Ben Waldron
  • especially Dan Flickinger, Alex Lascarides, Stephan Oepen, John Carroll, Anette Frank

Information extraction
• Classic IE: MUC-style template filling, gene/protein interactions
• IE in general: acquiring specific types of knowledge from text via language processing, e.g.:
  • organic chemistry syntheses
  • ontological relationships
  • relationships between texts (for search)
• IR, QA, I2E

IE from Chemistry texts
• Synthesis text to a recipe expressed in CML:
  To a solution of aldimine1 (1.5mmol) in THF (5mL) was added LDA (1mL, 1.6 M in THF) at 0 °C under argon, the resulting mixture was stirred for 2h, then was cooled to -78 °C ...
• Terminology to ontological relationships:
  ... alkaloids and other complex polycyclic azacycles ...
  <owl:Class rdf:ID="Alkaloid"> <rdfs:subClassOf rdf:resource="#Azacycle" />
• Citation context to relationships between texts:
  Enamines have been used widely ... (citation Y), however, ... did not provide the desired products.
  X cites Y (contrast)

Standard IE architecture
1. Preprocessing of markup etc (specific to text type)
2. Tokenisation (not domain-specific)
3. Named Entity Recognition (domain-specific ontologies, domain-specific patterns)
4. Chunking: detection of noun and verb groups (not domain-specific)
5. Anaphora resolution (domain-specific ontologies)
6. Relationship detection via patterns over chunks (domain- and task-specific)
7. DB instantiation (task-specific)

State of the art in IE
• Several options for whole IE systems and individual components, especially for English
• Increasing integration of ontologies
• Commercial systems for some applications
• But, many IE-style tasks still done manually:
  • IE performance (especially when high precision required)
  • IE robustness to different text types
  • IE porting requirement (especially NER and relation patterns)
• Performance of standard architecture may be reaching a plateau
• More advanced IE tasks are not generally attempted
  • e.g., organic synthesis example could be done with adaptation of standard architecture, but would take substantial effort by highly trained people.
  • Skill set: substantial domain skills plus substantial NLP

Objectives
• Integrate and adapt tools for language processing in general
• Eventual use by non-NLP people: black box for language processing
• Incorporate deeper processing (DELPH-IN technology): aim to get above plateau
• Integration with XML, semantic web
• Methodology:
  • Combine statistical and symbolic processing, machine learning and hand-crafting
  • Open Source where possible, collaborative development
  • No toy systems, no artificial evaluations
  • Multilingual via collaboration

Deep processing in IE
• Some early IE systems attempted to use deep processing: SRI (and also NYU)
• FASTUS was originally a shallow preprocessor for TACITUS but TACITUS was dropped: much too slow, not sufficiently robust
• Often claimed: deep processing failed for IE, but:
  • only two serious attempts(?), both under time pressure, limited types of IE task
  • deep processing has improved since early 1990s:
    • speed
    • empirical coverage (note that hand-built deep grammars do scale, unlike traditional AI knowledge bases)
    • integration of statistical techniques into deep processing
  • if existing IE architecture is approaching a plateau, we have to try something else, i.e., combined deep and shallow processing (DFKI Whiteboard project)

Integrating processing
• No single system can do everything: deep and shallow processing have inherent strengths and weaknesses
  • shallow: speed and robustness: e.g., POS tagging, chunking
  • deep: detail, precision, potential for bidirectional processing: e.g., HPSG-based parsers and generators (DELPH-IN technology)
  • also intermediate: RASP (Robust Accurate Statistical Parser): relatively detailed but no lexicon.
• Domain-dependent and domain-independent processing must be linked
• Desirable to have a common representation language for processing above sentence level (e.g., anaphora)
• Long-term solutions ...

Compositional semantics for component integration
• Need a common representation language for systems: pairwise compatibility between systems is too limiting
• Syntax is theory-specific and unnecessarily language-specific
• Eventual goal of sentence analysis should be semantics
• Core idea: shallow processing gives underspecified semantic representation, so deep and shallow systems can be integrated
• Full interlingua / common lexical semantics is too difficult (certainly currently), but can link predicates to ontologies, etc.

Integration via underspecified semantics
• Integrated parsing:
  • shallow parsed phrases incorporated into deep parsed structures
  • deep parsing invoked incrementally in response to information needs
• Knowledge sources expressed via semantics can be used by multiple components: e.g., NER, IE templates, anaphora resolution
• Advantages over ad-hoc representation approaches:
  • Ability to link with detailed lexical semantics as it becomes available
  • Language generation from semantic representation
  • Explicit logic: formal properties clearer, representations more generally usable
  • Deep semantics taken as normative: extensibility

Robust Minimal Recursion Semantics
• Minimal Recursion Semantics: MRS. Compositional semantics for deep processing:
  • Copestake, Flickinger, Sag and Pollard (1999, in press)
  • adopted for DELPH-IN and other HPSG work
  • also compatible with LFG etc
  • logically well-defined
  • flat semantics (easier to process, allows information to be ignored)
  • underspecification of quantifier scope (avoid ambiguity)
  • novel approach to composition (monostratal)
• Robust MRS: adaptation of MRS allowing processing without a subcategorization lexicon

RMRS: Extreme underspecification
• Goal is to split up semantic representation into minimal components (cf Verbmobil VITs)
  • Scope underspecification (MRS)
  • Splitting up predicate argument structure
  • Explicit equalities
  • Hierarchies for predicates and sorts
• Compatibility with deep grammars:
  • Sorts and (some) closed class word information in SEM-I (API for grammar, more later)
  • No lexicon for shallow processing (apart from POS tags and possibly closed class words)

Semantics from POS tagging
Tagged input: every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1
Resulting semantics: _every_q(x1), _cat_n(x2sg), _chase_v(epast), _some_q(x3), _dog_n(x4sg)
Tag lexicon:
  AT1  _lemma_q(x)
  NN1  _lemma_n(xsg)
  VVD  _lemma_v(epast)
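The tag-lexicon lookup above can be sketched in a few lines of Python. This is only an illustration, not the POS-RMRS tool itself: the function name `pos_to_rmrs` and the tuple layout of `TAG_LEXICON` are assumptions, the lexicon holds just the three tags shown on the slide, and the input is assumed to be already lemmatised, as in the slide's example.

```python
# Tag lexicon from the slide: tag -> (predicate suffix, variable type, sort).
# The tuple layout is an illustrative choice, not the real tool's format.
TAG_LEXICON = {
    "AT1": ("q", "x", ""),      # _lemma_q(x)
    "NN1": ("n", "x", "sg"),    # _lemma_n(xsg)
    "VVD": ("v", "e", "past"),  # _lemma_v(epast)
}


def pos_to_rmrs(tagged_text):
    """Map 'lemma_TAG' tokens to underspecified elementary predications."""
    eps = []
    x_count = 0  # running index for x variables, numbered as on the slide
    for token in tagged_text.split():
        lemma, tag = token.rsplit("_", 1)
        suffix, var, sort = TAG_LEXICON[tag]
        if var == "x":
            x_count += 1
            var = f"x{x_count}"
        eps.append(f"_{lemma}_{suffix}({var}{sort})")
    return eps


print(", ".join(pos_to_rmrs("every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1")))
# prints: _every_q(x1), _cat_n(x2sg), _chase_v(epast), _some_q(x3), _dog_n(x4sg)
```

No argument relations are produced: a tagger alone gives no evidence for them, which is the sense in which this output is underspecified.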

Deep parser output
Conventional semantic representation for "Every cat chased some dog":
every(xsg,cat(xsg),some(ysg,dog1(ysg),chase(esp,xsg,ysg)))
some(ysg,dog1(ysg),every(xsg,cat(xsg),chase(esp,xsg,ysg)))
• Compositional: reflects morphology and syntax
• Scope ambiguity is explicit
• May be awkward to process if you don't care about quantifier scope

Modifying syntax of deep grammar semantics: overview
1. Underspecification of quantifier scope: Minimal Recursion Semantics (MRS) (next 6 slides ...)
2. Robust MRS
  • Separating arguments
  • Explicit equalities
  • Conventions for predicate names and sense distinctions
  • Hierarchy of sorts on variables

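PC trees
[Figure: the two predicate calculus trees for "Every cat chased some dog". In one, every (x, cat x) outscopes some (y, dog1 y); in the other, some outscopes every. In both, chase (e, x, y) is the innermost node.]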

PC trees share structure
[Figure: the same two trees, showing that they share the subtrees for every (x, cat x), some (y, dog1 y) and chase (e, x, y).]

Bits of trees
[Figure: the separate tree fragments every (x, cat x), some (y, dog1 y) and chase (e, x, y).]
Reconstruction conditions: tree-ness, variable binding

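Label nodes and holes
[Figure: tree fragments with labelled nodes and holes: lb1:every (x, lb2:cat x, hole h6), lb4:some (y, lb5:dog1 y, hole h7), lb3:chase (e, x, y), and the top hole h0.]
h0: hole corresponding to the top of the tree
Valid solutions: equate holes and labels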

Maximize splitting
[Figure: fragments split as far as possible: lb1:every (x, h9, h6), lb2:cat (x), lb4:some (y, h8, h7), lb5:dog1 (y), lb3:chase (e, x, y), and the top hole h0.]
Constraints: h8=lb5, h9=lb2

MRS: flat representation
• elementary predications: lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y)
• scope constraints: h9=lb2, h8=lb5 (actually qeqs)
• easy to ignore quantification when not relevant for application: cat(x), dog1(y), chase(e,x,y)
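How the two scoped readings fall out of the flat representation can be sketched by brute force: try every way of equating the open holes h0, h6, h7 with the remaining labels, and keep the pluggings that form a tree in which every variable is bound. This is a simplified illustration under stated assumptions, not the actual MRS resolution algorithm: the scope constraints are treated as plain equalities (the slide notes they are really qeqs), and the names `EPS`, `render` and `readings` are hypothetical.

```python
from itertools import permutations

# label -> (predicate, variable it binds or None, printed arguments, variables used)
EPS = {
    "lb1": ("every", "x", ["x", "h9", "h6"], []),
    "lb2": ("cat", None, ["x"], ["x"]),
    "lb3": ("chase", None, ["e", "x", "y"], ["x", "y"]),
    "lb4": ("some", "y", ["y", "h8", "h7"], []),
    "lb5": ("dog1", None, ["y"], ["y"]),
}
FIXED = {"h9": "lb2", "h8": "lb5"}  # scope constraints, read here as equalities
OPEN_HOLES = ["h0", "h6", "h7"]
FREE_LABELS = ["lb1", "lb3", "lb4"]


def render(hole, plug, bound=frozenset(), seen=None):
    """Formula under `hole`, or None if a variable is unbound or a label repeats."""
    seen = set() if seen is None else seen
    label = plug[hole]
    if label in seen:
        return None
    seen.add(label)
    pred, binds, args, uses = EPS[label]
    if binds:
        bound = bound | {binds}
    if any(v not in bound for v in uses):
        return None  # variable binding condition violated
    parts = []
    for arg in args:
        if arg in plug:  # the argument is a hole: recurse into its plug
            sub = render(arg, plug, bound, seen)
            if sub is None:
                return None
            parts.append(sub)
        else:
            parts.append(arg)
    return f"{pred}({','.join(parts)})"


def readings():
    """All pluggings of the open holes that give a well-formed tree."""
    out = []
    for labels in permutations(FREE_LABELS):
        plug = dict(FIXED, **dict(zip(OPEN_HOLES, labels)))
        seen = set()
        formula = render("h0", plug, seen=seen)
        if formula is not None and seen == set(EPS):  # tree-ness: all EPs used
            out.append(formula)
    return out


for reading in readings():
    print(reading)
# prints:
# every(x,cat(x),some(y,dog1(y),chase(e,x,y)))
# some(y,dog1(y),every(x,cat(x),chase(e,x,y)))
```

Of the six candidate pluggings only these two survive, matching the two conventional representations of the sentence. An application that does not care about scope can skip this step entirely and read the predications directly.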

RMRS: Separating arguments
lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y), h9=lb2, h8=lb5
goes to:
lb1:every(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase(e), ARG1(lb3,x), ARG2(lb3,y), h9=lb2, h8=lb5
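The rewriting step above is mechanical, and a minimal Python sketch makes that concrete. The function name `split_arguments` and the input layout are illustrative assumptions, as is the rule of thumb used to choose role names (RSTR/BODY for the two quantifiers in the example, ARG1..ARGn otherwise); this is not the actual MRS <-> RMRS converter.

```python
QUANTIFIER_ROLES = ["RSTR", "BODY"]


def split_arguments(eps, quantifiers=("every", "some")):
    """Split each EP into a unary predication plus separate argument relations.

    eps: list of (label, predicate, [arguments]); the first argument stays
    with the predicate, the rest become relations anchored on the label.
    """
    out = []
    for label, pred, args in eps:
        out.append(f"{label}:{pred}({args[0]})")
        rest = args[1:]
        if pred in quantifiers:
            roles = QUANTIFIER_ROLES
        else:
            roles = [f"ARG{i}" for i in range(1, len(rest) + 1)]
        out.extend(f"{role}({label},{arg})" for role, arg in zip(roles, rest))
    return out


MRS = [
    ("lb1", "every", ["x", "h9", "h6"]),
    ("lb2", "cat", ["x"]),
    ("lb5", "dog1", ["y"]),
    ("lb4", "some", ["y", "h8", "h7"]),
    ("lb3", "chase", ["e", "x", "y"]),
]
print(", ".join(split_arguments(MRS)))
# prints: lb1:every(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat(x), lb5:dog1(y),
#         lb4:some(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase(e), ARG1(lb3,x), ARG2(lb3,y)
#         (on a single line)
```

Once the arguments are separate items, a shallow system that can only supply some of them simply leaves the others out.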

Naming conventions: predicate names without a lexicon
lb1:_every_q(x1sg), RSTR(lb1,h9), BODY(lb1,h6),
lb2:_cat_n(x2sg),
lb5:_dog_n_1(x4sg),
lb4:_some_q(x3sg), RSTR(lb4,h8), BODY(lb4,h7),
lb3:_chase_v(esp), ARG1(lb3,x2sg), ARG2(lb3,x4sg),
h9=lb2, h8=lb5, x1sg=x2sg, x3sg=x4sg
note also explicit equalities

POS output as underspecification
DEEP:
lb1:_every_q(x1sg), RSTR(lb1,h9), BODY(lb1,h6), lb2:_cat_n(x2sg), lb5:_dog_n_1(x4sg), lb4:_some_q(x3sg), RSTR(lb4,h8), BODY(lb4,h7), lb3:_chase_v(esp), ARG1(lb3,x2sg), ARG2(lb3,x4sg), h9=lb2, h8=lb5, x1sg=x2sg, x3sg=x4sg
POS:
lb1:_every_q(x1), lb2:_cat_n(x2sg), lb3:_chase_v(epast), lb4:_some_q(x3), lb5:_dog_n(x4sg)
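The claim that the POS output underspecifies the deep output can be checked piece by piece: each shallow predication should be matched by a deep one whose predicate and sort are the same or more specific. The sketch below makes several assumptions that are not in the slides: the labels are taken as already aligned (as they are in the example), a sense-marked predicate such as `_dog_n_1` is treated as a specialisation of `_dog_n` by simple prefix matching, and the sort hierarchy entry placing `sp` under `past` is a guess made for illustration only.

```python
# label -> (predicate, sort); sorts are the suffixes on the slide's variables.
DEEP = {
    "lb1": ("_every_q", "sg"),
    "lb2": ("_cat_n", "sg"),
    "lb3": ("_chase_v", "sp"),
    "lb4": ("_some_q", "sg"),
    "lb5": ("_dog_n_1", "sg"),
}
POS = {
    "lb1": ("_every_q", None),
    "lb2": ("_cat_n", "sg"),
    "lb3": ("_chase_v", "past"),
    "lb4": ("_some_q", None),
    "lb5": ("_dog_n", "sg"),
}
# ASSUMPTION: 'sp' is a subsort of 'past'. The real hierarchy is not given here.
SORT_PARENT = {"sp": "past"}


def pred_subsumes(general, specific):
    """_dog_n subsumes _dog_n_1: a sense suffix only adds information."""
    return specific == general or specific.startswith(general + "_")


def sort_subsumes(general, specific):
    """An unsorted variable (None) subsumes anything; otherwise walk upwards."""
    if general is None:
        return True
    while specific is not None:
        if specific == general:
            return True
        specific = SORT_PARENT.get(specific)
    return False


def underspecifies(shallow, deep):
    """True if every shallow EP is matched by an equal or more specific deep EP."""
    return all(
        label in deep
        and pred_subsumes(pred, deep[label][0])
        and sort_subsumes(sort, deep[label][1])
        for label, (pred, sort) in shallow.items()
    )


print(underspecifies(POS, DEEP))  # True
print(underspecifies(DEEP, POS))  # False: the deep output carries more information
```

The argument relations and equalities in the deep output have no counterpart in the POS output, which is consistent with the same picture: the shallow result omits information rather than contradicting it.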


RMRS principles
• Split up information content as much as possible
• Accumulate information monotonically by simple operations
• Don't represent what you don't know but preserve everything you do know
• Use a flat representation to allow pieces to be accessed individually

Semantics from RASP
• RASP: robust, domain-independent, statistical parsing (Briscoe and Carroll)
• can't produce conventional semantics because no subcategorization
• can often identify arguments:
  • S -> NP VP: NP supplies ARG1 for V
• potential for partial identification:
  • VP -> V NP
  • S -> NP S
  • NP might be ARG2 or ARG3

RMRS construction
• deep grammars: MRS <-> RMRS converter.
• POS-RMRS: tag lexicon.
• RASP-RMRS: tag lexicon plus semantic rules associated with RASP rules.
  • no lexical subcategorization, so rely on grammar rules to provide the ARGs
  • output aims to match deep grammar (ERG)
  • developed on basis of ERG semantic test suite
  • default composition principles when no rule RMRS specified

Composition algebra:
• MRS composition assumes a lexicalized approach: algebra defined in Copestake, Lascarides and Flickinger (2001)
• RMRS with non-lexicalised grammars has similar basic algebra
• All approaches have common composition principles, so there is compatibility at a phrasal level.

Some cat sleeps (in RASP)
sleeps: [h3,e], <h3>, {h3:_sleep(e)}
some cat: [h,x], <h1>, {h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
S->NP VP: Head=VP, ARG1(<VP anchor>,<NP hook.index>)
some cat sleeps: [h3,e], <h3>, {h3:_sleep(e), ARG1(h3,x), h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
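The S -> NP VP rule above can be sketched as a small function over a record with the three parts shown on the slide: hook [label, index], anchor, and a bag of predications. The `Rmrs` class and `compose_s_np_vp` are hypothetical names for illustration; the real RASP-RMRS rules are attached to RASP grammar rules and cover far more than this single case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rmrs:
    hook: tuple   # (label, index), written [label,index] on the slide
    anchor: str   # written <anchor> on the slide
    eps: tuple    # elementary predications and argument relations, as strings


def compose_s_np_vp(np, vp):
    """S -> NP VP: the VP is the head; add ARG1(<VP anchor>, <NP hook.index>)."""
    arg1 = f"ARG1({vp.anchor},{np.hook[1]})"
    # Hook and anchor come from the head; the predications are accumulated.
    return Rmrs(hook=vp.hook, anchor=vp.anchor, eps=vp.eps + (arg1,) + np.eps)


sleeps = Rmrs(("h3", "e"), "h3", ("h3:_sleep(e)",))
some_cat = Rmrs(("h", "x"), "h1", ("h1:_some(x)", "RSTR(h1,h2)", "h2:_cat(x)"))

s = compose_s_np_vp(some_cat, sleeps)
print(list(s.hook), s.anchor, ", ".join(s.eps))
# prints: ['h3', 'e'] h3 h3:_sleep(e), ARG1(h3,x), h1:_some(x), RSTR(h1,h2), h2:_cat(x)
```

Nothing is ever removed or rewritten during composition, only added, which is what makes the accumulation monotonic.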

ERG-RMRS / RASP-RMRS
[Figure: comparison of ERG-RMRS and RASP-RMRS output, with annotations: Inchoative; Infinitival subject (unbound in RASP-RMRS); Mismatch: Expletive it]

SEM-I: semantic interface
• Meta-level: manually specified 'grammar' relations (constructions and closed-class)
• Object-level: linked to lexical database for deep grammars
• Object-level SEM-I auto-generated from expanded lexical entries in deep grammars (because type can contribute relations)
• Validation of other lexicons
• Need closed class items for RMRS construction from shallow processing

Alignment and XML
• Comparing RMRSs for same text efficiently requires 'characterization'
  • labels RMRSs according to their source in the text
  • currently characters, but also XPath plus characters
• RMRS-XML
• RMRS seen as levels of mark-up: standoff annotation

RMRS approach: current and planned applications
• Question answering:
  • Cambridge CSTIT: deep parse questions, shallow parse answers
  • QA from structured knowledge: Frank et al (QUETAL project)
• Information extraction:
  • emails (Deep Thought)
  • Chemistry texts (SciBorg)
• Dictionary definition parsing for Japanese and English (Bond and Flickinger)
• Rhetorical structure, multi-document summarization ...
• also LOGON: semantic transfer. MRSs from LFG used in HPSG generator.

RMRS in Deep Thought
• Different systems integrated via the HoG:
  • Invoke shallow or deep parsing, full or partial results, all expressed in RMRS.
  • Also shallow parsing as precursor to deep parsing: NER, unknown words.
• Preliminary test on email response application (Xtramind Mailminder):
  • email categorized, then category-specific templates built from RMRS
  • increase in precision of automatically instantiated templates (up to 29%) with the addition of deep parser to the system

IE architecture using deeper processing and RMRS
1. Preprocessing of markup etc
2. Tokenisation
3. Named Entity Recognition: delivers RMRS
4. Shallow processing (including chunking): delivers RMRS
5. Deep parsing: uses shallow processing and NER, delivers RMRS
6. Word sense disambiguation: uses RMRS from best available source, further instantiates RMRS according to ontology
7. Anaphora resolution: uses RMRS from best available source, further instantiates RMRS
8. Relationship detection via patterns over deepest possible RMRSs
9. DB instantiation

SciBorg: Chemistry texts
• eScience project started in October at Cambridge Computer Laboratory, Chemistry, CeSC
• Partners: Nature Publishing, Royal Society of Chemistry, International Union of Crystallography (supplying papers and publishing expertise)
• Aims:
1. Develop an NL markup language which will act as a platform for extraction of information. Link to semantic web languages.
2. Develop IE technology and core ontologies for use by publishers, researchers, readers, vendors and regulatory organisations.
3. Model scientific argumentation and citation purpose in order to support novel modes of information access.
4. Demonstrate the applicability of this infrastructure in a real-world eScience environment.

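Outline architecture
[Figure: outline architecture. Inputs: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf). Shared format: base XML. Components: sentence splitting, POS tagging, NER, RASP, ERG/PET, WSD, anaphora, rhetorical analysis, RMRS merge, tasks. Results held as standoff annotation.]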

Research markup
• Chemistry: The primary aims of the present study are (i) the synthesis of an amino acid derivative that can be incorporated into proteins via standard solid-phase synthesis methods, and (ii) a test of the ability of the derivative to function as a photoswitch in a biological environment.
• Computational Linguistics: The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model.

RMRS and research markup
• Specify cues in RMRS: e.g., l1:objective(x), ARG1(l1,y), l2:research(y)
• The concept objective generalises the predicates for aim, goal etc and research generalises study, work etc. Ontology for rhetorical structure.
• Deep process possible cue phrases to get RMRSs:
  • feasible because domain-independent
  • more general and reliable than shallow techniques
  • allows for complex interrelationships e.g., our goal is not to ... but to ...
• Use zones for advanced citation maps (e.g., X cites Y (contrast)) and other enhancements to repositories
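Matching the cue pattern l1:objective(x), ARG1(l1,y), l2:research(y) against an RMRS can be sketched as a small search. Everything specific here is an assumption made for illustration: the two-entry `GENERALISES` table stands in for the ontology of rhetorical structure, the predicate names follow the naming convention shown earlier but are not taken from a real grammar, and the input fragment is a hand-written approximation of what a parser might deliver for "the goal of the work ...".

```python
# Stand-in for the rhetorical ontology: concept -> predicates it generalises.
GENERALISES = {
    "objective": {"_aim_n", "_goal_n"},
    "research": {"_study_n", "_work_n"},
}


def find_objective_cues(eps, args):
    """Find (l1, l2) with l1:objective(x), ARG1(l1,y), l2:research(y).

    eps:  label -> (predicate, variable)
    args: list of (role, label, variable) argument relations
    """
    hits = []
    for l1, (pred1, _x) in eps.items():
        if pred1 not in GENERALISES["objective"]:
            continue
        for role, label, y in args:
            if role != "ARG1" or label != l1:
                continue
            for l2, (pred2, var2) in eps.items():
                if var2 == y and pred2 in GENERALISES["research"]:
                    hits.append((l1, l2))
    return hits


# Hypothetical RMRS fragment for "the goal of the work ... a method ..."
eps = {
    "l1": ("_goal_n", "x1"),
    "l2": ("_work_n", "x2"),
    "l3": ("_method_n", "x3"),
}
args = [("ARG1", "l1", "x2")]

print(find_objective_cues(eps, args))  # [('l1', 'l2')]
```

Because the pattern is stated over predicates and argument relations rather than word strings, the same cue would also match "the primary aims of the present study", given a parse that supplies the ARG1 link.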

Conclusions
• Information Extraction is more than company mergers or gene-protein interactions!
• Combined deep-shallow processing techniques have potential for IE
• RMRS is a representation language that allows for deep-shallow compatibility via extreme underspecification
  • various systems adapted to output RMRS and further work ongoing
  • RMRS offers detailed compatibility at a phrasal level
  • RMRS processing can be integrated with ontologies in various ways
  • RMRS tools are distributed as Open Source via DELPH-IN
• SciBorg will further develop this approach for eScience applications using a generic standoff architecture

Further work on RASP-RMRS
• Fast enough (time not significant compared to RASP processing time because no ambiguity)
• Too many RASP rules! Need to generalise over classes.
• Requires SEM-I: i.e., API for MRS/RMRS from deep grammar
• RASP and ERG may change:
  • compatible test suites: semi-automatic rule update?
  • alternative technique for composition?
• Parse selection: need to generalise over RMRSs
  • weighted intersections of RMRSs (cf RASP grammatical relations)
