Jakub Piskorski, Warszawa, 6.10.2003
SProUT Shallow Processing with Unification
and Typed Feature Structures
Jakub Piskorski, Language Technology Lab
DFKI GmbH
Information Extraction
Munich, February 18, 1997, Siemens AG and The General Electric Company (GEC), London, have merged their UK private communication systems and networks activities to form a new company, Siemens GEC Communication Systems Limited.
JOINT-VENTURE FOUNDATION EVENT
VENTURE: Siemens GEC Communication Systems Limited
PARTNERS: Siemens AG, The General Electric
TIME: February 18 1997
LOCATION: Munich
PRODUCT/SERVICE: communication systems, networks activities
Finite-State based approaches
SPPC - pure finite-state based STP, small number of basic predicates
SMES – predicates inspect arbitrary properties of the input tokens/fragments
FASTUS – uses CPSL (Common Pattern Specification Language)
GATE – uses JAPE (Java Annotation Patterns Engine)
Motivation for SProUT
One System for Multilingual and Domain Adaptive Shallow Text Processing
Trade-off between efficiency and expressiveness
Modularity
Flexible integration of different processing modules
Portability
Industrial standards
Credits
SProUT is joint work by:
Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu
SProUT Architecture
[Figure: architecture diagram with two areas, grammar development environment and online processing. Labelled components: finite-state machine toolkit, XTDL interpreter, regular compiler, XTDL grammar, extended optimized finite-state network, lexical resources, linguistic processing resources, JTFS, input data, stream of text items, structured output data.]
Core Components – FSM Toolkit
Finite-state Machine Toolkit for building, combining, and optimizing finite-state devices
Finite-state Machine model: FSA, WFSA, FST, WFST
Arbitrary real-valued semirings
Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
Various memory models
Functionality similar to AT&T tools
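The incremental construction of minimal deterministic FSAs mentioned above can be illustrated with a minimal Python sketch. It assumes the input words arrive in lexicographic order and follows the well-known register-based scheme for acyclic automata; it is not the toolkit's own implementation, and the names `State`, `build_minimal_dfa`, `accepts`, and `count_states` are hypothetical.

```python
class State:
    """A DFA state with a finality flag and labelled outgoing edges."""

    def __init__(self):
        self.final = False
        self.edges = {}  # label -> State

    def signature(self):
        # States are equivalent when they agree on finality and on their
        # (label, target) pairs; targets are already canonical objects.
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.edges.items())))


def build_minimal_dfa(sorted_words):
    """Build a minimal acyclic DFA from words given in sorted order."""
    root = State()
    register = {}    # signature -> canonical state
    unchecked = []   # (parent, label, child) along the previous word's path
    previous = ""

    def minimize(down_to):
        # Merge or register the not-yet-minimized suffix states, deepest first.
        while len(unchecked) > down_to:
            parent, label, child = unchecked.pop()
            sig = child.signature()
            if sig in register:
                parent.edges[label] = register[sig]
            else:
                register[sig] = child

    for word in sorted_words:
        if word < previous:
            raise ValueError("words must be sorted")
        common = 0
        while (common < min(len(word), len(previous))
               and word[common] == previous[common]):
            common += 1
        minimize(common)
        node = unchecked[-1][2] if unchecked else root
        for ch in word[common:]:
            child = State()
            node.edges[ch] = child
            unchecked.append((node, ch, child))
            node = child
        node.final = True
        previous = word
    minimize(0)
    return root


def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)


dfa = build_minimal_dfa(["tap", "taps", "top", "tops"])
```

In this example the branches for "tap(s)" and "top(s)" end up sharing their suffix states, so the automaton never grows to the size of the full trie before being minimized.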
Core Components – Regular Compiler
Definition and configuration via XML
Unicode compatible
Extensible set of circa 20 operations
Scanner definitions vs. general regular expressions
Biasing optimization process
Various ways of handling ambiguities
Direct database connection for flexible pattern-based transformation of linguistic resources into optimized FS representation
Regular expressions over TFSs (SProUT) with restrictions
Core Components – Typed Feature Structure Package
Java implementation of TFSs
Efficient unification operations
Dynamic extension of the type hierarchy
Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
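To make unification and subsumption checking concrete, here is a small Python sketch. It is not the JTFS API: the tuple representation `(type, features)`, the `PARENT` table, and the functions `subsumes`, `glb`, and `unify` are assumptions made for illustration, the type hierarchy is a hypothetical single-inheritance tree, and coreferences are left out.

```python
# Hypothetical single-inheritance type hierarchy: child -> parent.
PARENT = {
    "atom": "*top*", "*avm*": "*top*",
    "sign": "*avm*", "infl": "*avm*",
    "morph": "sign", "token": "sign",
    "Prep": "atom", "Noun": "atom", "dat": "atom", "nom": "atom",
}


def subsumes(general, specific):
    """True if `general` is `specific` or one of its ancestors."""
    t = specific
    while t is not None:
        if t == general:
            return True
        t = PARENT.get(t)
    return False


def glb(t1, t2):
    """Greatest lower bound of two types in a tree-shaped hierarchy."""
    if subsumes(t1, t2):
        return t2
    if subsumes(t2, t1):
        return t1
    return None


def unify(a, b):
    """Non-destructive unification of two (type, features) structures.

    Returns the unified structure, or None if types or values clash.
    """
    t = glb(a[0], b[0])
    if t is None:
        return None
    feats = dict(a[1])
    for name, value in b[1].items():
        if name in feats:
            merged = unify(feats[name], value)
            if merged is None:
                return None
            feats[name] = merged
        else:
            feats[name] = value
    return (t, feats)


pattern = ("morph", {"POS": ("Prep", {}),
                     "INFL": ("infl", {"CASE": ("atom", {})})})
item = ("morph", {"STEM": ("im", {}),
                  "POS": ("Prep", {}),
                  "INFL": ("infl", {"CASE": ("dat", {})})})
result = unify(pattern, item)
clash = unify(("morph", {"POS": ("Prep", {})}),
              ("morph", {"POS": ("Noun", {})}))
```

Here the underspecified CASE value of the pattern is narrowed to `dat` by the input item, while the Prep/Noun pair fails because neither type subsumes the other.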
XTDL Formalism
Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
XTDL grammar rules – production part on LHS, and output description on RHS
TDL used for establishment of a type hierarchy of linguistic entities
morph := sign & [POS atom,
STEM atom,
INFL infl]
[Figure: type hierarchy rooted at *top*, drawn in rows. Row 2: atom, *avm*, *rule*. Row 3: tense, sign, infl, index-avm. Row 4: present, token, morph, lang, tokentype. Row 5: de, en, separator, url.]
XTDL Formalism
A set of standard regular operators:
concatenation
optionality ?
disjunction |
Kleene star *
Kleene plus +
n-fold repetition {n}
m-n span repetition {m,n}
Unidirectional coreference under Kleene star (and restricted iteration)
[POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]
XTDL Formalism
loc-pp :>
morph & [POS Prep & #preposition,
INFL [CASE #1, NUMBER #2, GENDER #3]]
morph & [POS Determiner,
INFL [CASE #1, NUMBER #2, GENDER #3]] ?
morph & [POS Adjective,
INFL [CASE #1, NUMBER #2, GENDER #3]] *
gazetteer & [TYPE general-location,
SURFACE #location]
-> [CAT location-pp,
PREP #preposition,
LOCATION #location].
XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS Pattern instance creation
3. Unification of the rule instance and matched input
Longest match strategy
Ambiguities allowed
Interpreter generates TFSs as output (cascaded architecture)
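The longest-match strategy with ambiguities allowed can be sketched as follows. This is a simplified illustration, not the interpreter itself: items and constraints are flat Python dictionaries instead of typed feature structures, patterns are fixed-length sequences without regular operators, and the names `unifiable`, `match_end`, `longest_matches`, `KIND`, `RULES`, and `ITEMS` are hypothetical.

```python
def unifiable(constraint, item):
    """A flat constraint is unifiable with an item if no feature clashes."""
    return all(item.get(feature, value) == value
               for feature, value in constraint.items())


def match_end(pattern, items, start):
    """Return the end position if the pattern matches at `start`, else None."""
    end = start + len(pattern)
    if not pattern or end > len(items):
        return None
    ok = all(unifiable(c, it) for c, it in zip(pattern, items[start:end]))
    return end if ok else None


def longest_matches(rules, items):
    """Scan left to right; at each position keep every rule that yields the
    longest match (ambiguities are kept), then continue after that match."""
    matches, pos = [], 0
    while pos < len(items):
        ends = {}
        for name, pattern in rules.items():
            end = match_end(pattern, items, pos)
            if end is not None:
                ends[name] = end
        if not ends:
            pos += 1
            continue
        best = max(ends.values())
        matches.extend((name, pos, best)
                       for name, end in ends.items() if end == best)
        pos = best
    return matches


RULES = {
    "loc-pp": [{"KIND": "morph", "POS": "Prep"},
               {"KIND": "morph", "POS": "Adjective"},
               {"KIND": "gazetteer", "TYPE": "general-location"}],
    "prep": [{"KIND": "morph", "POS": "Prep"}],
}
ITEMS = [
    {"KIND": "morph", "SURFACE": "im", "POS": "Prep"},
    {"KIND": "morph", "SURFACE": "sonnigen", "POS": "Adjective"},
    {"KIND": "gazetteer", "SURFACE": "Rom", "TYPE": "general-location"},
]
found = longest_matches(RULES, ITEMS)
```

For "im sonnigen Rom" both rules match at position 0, but only the three-item `loc-pp` match survives because it is the longest.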
XTDL Interpreter
Matched input sequence “im sonnigen Rom” (in sunny Rome):
rule & [IN < morph & [SURFACE im,
                      STEM im,
                      POS Prep,
                      INFL infl & [CASE nom, NUMBER plural, GENDER fem]],
             morph & [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl & [CASE case, NUMBER number, GENDER gender]],
             gazetteer & [SURFACE Rom,
                          TYPE location-general] >]
XTDL Interpreter
Rule with an instantiated pattern on the LHS:
rule & [IN < morph & [POS Prep & #5,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             morph & [POS Adjective,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             gazetteer & [SURFACE #4,
                          TYPE location-general] >,
        OUT phrase & [CAT np-location,
                      PREP #5,
                      LOCATION #4]]
XTDL formalism
Unified result:
rule & [IN < morph & [SURFACE im,
                      STEM im,
                      POS Prep & #5,
                      INFL infl & [CASE dat & #1, NUMBER sing & #2, GENDER neut & #3]],
             morph & [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             gazetteer & [SURFACE Rom & #4,
                          TYPE location-general] >,
        OUT phrase & [CAT np-location,
                      PREP #5,
                      LOCATION #4]]
Linguistic Processing Resources
Tokenization
Gazetteer
Extended Gazetteer
Morphology
Sentence Splitter
Reference Matcher
Tokenization
Text segmentation into tokens
Fine-grained token classification (ca. 30 types)
complex_compound_first_capital : AT&T-Chief
Token postsegmentation: ‘<a,b>’ → ‘<’ ‘a’ ‘,’ ‘b’ ‘>’
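Token Subclassification Information
contains_position_sufix: AT&T-Chief
[Figure: token feature structure with attributes :START 25, :END 34, :MAIN, and :SUB; :SUB lists entries with attributes :LANG, :DOM, and :SPEC. Recovered value fragments: first capital (main type); german, any, has noun ending; english, any, has_noun_ending.]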
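The postsegmentation step shown above (‘<a,b>’ split into five tokens) can be approximated with a short Python sketch. The function name `postsegment` and the regular expression are assumptions for illustration; the actual tokenizer decides per token class whether a token is split, so a compound such as AT&T-Chief would not be passed through this step.

```python
import re

# Runs of word characters stay together; every other non-space
# character becomes a token of its own.
_PIECE = re.compile(r"\w+|[^\w\s]")


def postsegment(token):
    """Split a raw token into word pieces and single punctuation marks."""
    return _PIECE.findall(token)


pieces = postsegment("<a,b>")
```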
Gazetteer/Extended Gazetteer
Stores static named entities (e.g., locations) and keywords (e.g., company designators, month names)
Extended Gazetteer allows for associating entries with a list of arbitrary attribute-value pairs (and uses path compression)
...
Warsaw | gaz_type:city | concept:Warsaw
Warszawa | gaz_type:city | concept:Warsaw
Varsovie | gaz_type:city | concept:Warsaw
...
Case-sensitive/insensitive mode
Unicode compatibility
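A rough Python sketch of an extended gazetteer follows. It parses entries in the `surface | attribute:value | ...` layout shown above and supports a case-insensitive mode. The class `Gazetteer` and the function `parse_entry` are hypothetical, and a plain dictionary stands in for the compressed finite-state representation used by the real component.

```python
def parse_entry(line):
    """Parse 'surface | attr:value | attr:value' into (surface, attributes)."""
    surface, *pairs = [part.strip() for part in line.split("|")]
    attributes = dict(pair.split(":", 1) for pair in pairs)
    return surface, attributes


class Gazetteer:
    """Maps surface forms to lists of attribute-value dictionaries."""

    def __init__(self, case_sensitive=True):
        self.case_sensitive = case_sensitive
        self.entries = {}

    def _key(self, surface):
        # casefold() gives Unicode-aware case-insensitive keys.
        return surface if self.case_sensitive else surface.casefold()

    def add(self, surface, attributes):
        self.entries.setdefault(self._key(surface), []).append(dict(attributes))

    def lookup(self, surface):
        return self.entries.get(self._key(surface), [])


gazetteer = Gazetteer(case_sensitive=False)
for line in ["Warsaw | gaz_type:city | concept:Warsaw",
             "Warszawa | gaz_type:city | concept:Warsaw",
             "Varsovie | gaz_type:city | concept:Warsaw"]:
    gazetteer.add(*parse_entry(line))
```

All three surface variants resolve to the same `concept` value, which is what lets later grammar rules treat them uniformly.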
Morphology
Full-form lexica obtained from ‘compactified’ MMORPH:
English: 200,000 entries
German: 830,000 entries + Shallow Compound Recognition
French: 225,000 entries
Spanish: 570,000 entries
Italian: 330,000 entries
Dutch: ? entries (under development)
Asian Languages:
Chinese – Shanxi
Japanese – Chasen
Other:
Czech – 600,000 entries + HMM-based Part-of-Speech Tagging
Polish – 120,000 lexemes (Morfeusz)
Lithuanian – Lemouklis
Russian – under acquisition
Compactification of available full-form lexica
External components implemented as servers
Morphology
Compound Recognition & Segmentation for German
“Biergartenfest”: [Bier [garten fest]] vs. [[Bier garten] fest]
“Wein” + “sorten” (wine types) vs. “Wein” + “s” + “orten” (wine places)
„Autoradiozubehör“ (car radio equipment)
Next: adaptation for processing Dutch compounds
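The segmentation ambiguity in the “Wein” example can be reproduced with a small Python sketch that enumerates every way of splitting a word into lexicon stems, optionally joined by a linking element. The function `segmentations`, the toy lexicon, and the single linking element "s" are assumptions for illustration; the sketch yields flat segmentations only and does not compute the bracketing shown for “Biergartenfest”.

```python
def segmentations(word, lexicon, linkers=("s",)):
    """Enumerate splits of `word` into lexicon stems, where a linking
    element may appear between two stems (never at the end)."""
    word = word.lower()
    results = []

    def extend(start, parts):
        if start == len(word):
            results.append(parts)
            return
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece not in lexicon:
                continue
            # Continue directly with the next stem ...
            extend(end, parts + [piece])
            # ... or consume a linking element first, if a stem can follow.
            for linker in linkers:
                after = end + len(linker)
                if word.startswith(linker, end) and after < len(word):
                    extend(after, parts + [piece, linker])

    extend(0, [])
    return results


LEXICON = {"wein", "sorten", "orten"}
readings = segmentations("Weinsorten", LEXICON)
```

Both readings are returned, so a later component (or a preference heuristic) has to choose between "wine types" and "wine places".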
System Description Language
Construction of a concrete system instance via definition of a regular expression of module specifications
M1 M2: output of M1 serves as the input to M2
M*: fixpoint computation
Quasi-parallel computation of independent modules M1 and M2
All linguistic modules must implement a specific Java interface
Automatic compilation of system description into a single Java class
System Description Language
(M1 M2)(input)
  M1.clearState();
  M1.setInput(input);
  M1.setOutput(M1.computeOutput(M1.getInput()));
  M2.clearState();
  M2.setInput(mediateSeq(M1,M2));
  M2.setOutput(M2.computeOutput(M2.getInput()));
  return M2.getOutput();
(M*)(input)
  M.clearState();
  M.setInput(input);
  M.setOutput(mediateFix(M));
  return M.getOutput();
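The behaviour of the three combinators can be sketched in Python by treating each module as a plain function from input to output. This is an illustration of the idea only: the names `seq`, `fix`, and `par` are hypothetical, the stateful module interface (`clearState`, `setInput`, and so on) is omitted, and returning a tuple from the parallel combinator is an assumption about how independent outputs might be collected.

```python
def seq(m1, m2):
    """Sequence: the output of m1 serves as the input to m2."""
    return lambda data: m2(m1(data))


def fix(module, max_iterations=1000):
    """Fixpoint: feed the output back in until it no longer changes."""
    def run(data):
        for _ in range(max_iterations):
            result = module(data)
            if result == data:
                return result
            data = result
        raise RuntimeError("no fixpoint reached")
    return run


def par(m1, m2):
    """Quasi-parallel: run two independent modules on the same input."""
    return lambda data: (m1(data), m2(data))


def collapse(text):
    """Toy module: replace one level of doubled blanks."""
    return text.replace("  ", " ")


normalize = seq(str.strip, fix(collapse))
cleaned = normalize("  Siemens    AG  ")
```

Composing combinators this way mirrors how a system description is compiled into a single class: the whole pipeline becomes one callable.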
Optimization of Grammar processing
Problem: TFSs treated as symbolic values by FSM Toolkit
Sorting outgoing transitions from selected states (transition hierarchy under subsumption)
- flat trees for bad-style grammars
Extending transition hierarchy via additional nodes
[Figure: transition hierarchy with root [ TOP ] and child nodes [TOKEN], [MORPH stem: ‘Prof.’], [GAZETTEER type: X]]
Optimization of Grammar processing
Input text consisting of 32 520 words, 157 080 characters, 22 pages + English Grammar for NE (circa 700 transitions from the initial state)
Run-time behaviour with Tokenizer/Gazetteer/Morphology:
before: overall 17.7 seconds, candidate pattern selection 11.6 seconds
now: overall 13.2 seconds, candidate pattern selection 6.9 seconds
Optimization of Grammar processing
Using restrictions during compilation of XTDL grammars into FS-format
’Determinization under subsumption’ -> Approximation
’Expansion’ techniques for highly recursive grammars
Adapting SProUT to processing Polish
Tokenization – trivial
Morphology – integration of Morfeusz (Marcin Woliński)
Part-of-speech Disambiguation - ?
Gazetteer - several strategies:
- list all inflectional variants with additional morphological information
- interplay between gazetteer and morphology
- component for guessing morphological information of unknown words
Grammar Adaptation
- provide additional information to control inflection by using STEM attribute instead of SURFACE
Future Work
Further work concerning optimization of grammar processing
Various search strategies
Additional linguistic processing resources
Adapting to the processing of new languages
Real data testing: large grammars and real-world texts
Utilization in research and industrial projects
Examples – Simple grammar for person names
;; dummy rule for title
title :/ gazetteer & [SURFACE #title, GTYPE gaz_title] -> #title.
;; dummy rule for position
position :/ gazetteer & [SURFACE #position, GTYPE gaz_position] -> #position.
;; dummy rule for complex position, e.g. Direktor und CEO
complex_position :/
(gazetteer & [GTYPE gaz_position, SURFACE #pos1]
token & [SURFACE "und"]
gazetteer & [GTYPE gaz_position, SURFACE #pos2])
-> #position, where #position = Append(#pos1," ","und"," ",#pos2).
Examples – Simple grammar for person names
;; dummy rule for given name
given_name :/ gazetteer & [SURFACE #name, GTYPE gaz_given_name] -> #name.
;; dummy rule for name-suffix such as "Jr."
name_suffix :/
(token & [ SURFACE ","] ?)
token & [ SURFACE "Jr" & #suffix ] | token & [ SURFACE "jr" & #suffix ]
(token & [ SURFACE "." ] ?)
-> #suffix.
;; dummy rule for initial "M." and middle name
initial :/
(gazetteer & [GTYPE gaz_initial, SURFACE #initial]
token & [SURFACE "."] ?)
-> #middle, where #middle = Append(#initial, ".").
Examples – Simple grammar for person names
;; dummy rule for infix like "van", "van der"
infix :/ gazetteer & [GTYPE gaz_name_infix, SURFACE #infix] -> #infix.
;; dummy rule for last name
last_name :/
token & [TYPE first_capital_word, SURFACE #name]
| token & [TYPE mixed_word_first_capital, SURFACE #name]
| token & [TYPE word_with_hyphen_first_capital, SURFACE #name]
| token & [TYPE word_with_apostrophee_first_capital, SURFACE #name]
-> #name.
;; dummy rule for last name with infix
last_name_with_infix :/
@seek(infix) & #infix
@seek(last_name) & #last_name
-> #last, where #last=Append(#infix," ",#last_name).
Examples – Simple grammar for person names
;; rule for person names, example: Direktor und CTO Prof. Dr. hab. Witold P. van der Berg, Jr.
person :>
((@seek(position) & #pos | @seek(complex_position) & #pos) token & [TYPE comma] ?)?
@seek(title) & #title ?
(@seek(given_name) & #given_name (@seek(given_name) & #given_name_extra ?)
| (@seek(initial) & #given_name))
@seek(initial) & #middle1 ?
@seek(initial) & #middle2 ?
(@seek(last_name) & #last_name | @seek(last_name_with_infix) & #last_name)
@seek(name_suffix) & #suffix ?
-> ne-person & [GIVEN_NAME #first_name,
TITLE #title,
SURNAME #last_name,
P-POSITION #pos,
NAME-SUFFIX #suffix],
where #first_name = ConcWithBlanks(#given_name,#given_name_extra,#middle1,#middle2).
Examples – Embedding rules
simple_noun_phrase :> .................
-> phrase & [CAT np,
SURFACE #info,
AGR [N #n,
C #c,
G #g]], where #info=..........
simple_event :> @seek(person) & #person
morph & [POS verb, STEM #action]
@seek(simple_noun_phrase) & [SURFACE #info]
-> [PERSON #person, ACTION #action, OBJECT #info].