Warszawa, 6.10.2003 Jakub Piskorski SProUT Shallow Processing with Unification and Typed Feature Structures Jakub Piskorski Language Technology Lab DFKI

SProUT: Shallow Processing with Unification and Typed Feature Structures

Jakub Piskorski

Language Technology Lab, DFKI GmbH

Warszawa, 6.10.2003


Information Extraction

Munich, February 18, 1997, Siemens AG and The General Electric Company (GEC), London, have merged their UK private communication systems and networks activities to form a new company, Siemens GEC Communication Systems Limited.

JOINT-VENTURE FOUNDATION EVENT

VENTURE: Siemens GEC Communication Systems Limited

PARTNERS: Siemens AG, The General Electric

PRODUCT/SERVICE: communication systems, networks activities

TIME: February 18 1997

LOCATION: Munich


Finite-State based approaches

SPPC – purely finite-state based STP (shallow text processing), small number of basic predicates

SMES – predicates inspect arbitrary properties of the input tokens/fragments

FASTUS – uses CPSL (Common Pattern Specification Language)

GATE – uses JAPE (Java Annotation Patterns Engine)


Motivation for SProUT

One System for Multilingual and Domain Adaptive Shallow Text Processing

Trade-off between efficiency and expressiveness

Modularity

Flexible integration of different processing modules

Portability

Industrial standards


Credits

SProUT is joint work by:

Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu


SProUT Architecture

[Figure: SProUT architecture diagram. Labelled components: Finite-State Machine Toolkit, JTFS, Regular Compiler, XTDL Grammar, Extended Optimized Finite-State Network, XTDL Interpreter, Lexical Resources, Linguistic Processing Resources. The diagram is divided into a Grammar Development Environment and Online Processing; input data enters as a stream of text items and leaves as structured output data.]


Core Components – FSM Toolkit

Finite-state Machine Toolkit for building, combining, and optimizing finite-state devices

Finite-state Machine model: FSA, WFSA, FST, WFST

Arbitrary real-valued semirings

Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)

Various memory models

Functionality similar to AT&T tools


Core Components – Regular Compiler

Definition and configuration via XML

Unicode compatible

Extendible set of circa 20 operations

Scanner definitions vs. general regular expressions

Biasing optimization process

Various ways of handling ambiguities

Direct database connection for flexible pattern-based transformation of linguistic resources into optimized FS representation

Regular expressions over TFSs (SProUT) with restrictions


Core Components – Typed Feature Structure Package

JAVA implementation of TFSs

Efficient unification operations

Dynamic extension of the type hierarchy

Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
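A minimal sketch of what unification and subsumption checking over typed feature structures could look like. This is not the JTFS API: the class `TfsSketch`, its methods, the single-inheritance type table and the absence of coreferences are all simplifying assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative typed feature structures; hypothetical names, not the JTFS API. */
final class TfsSketch {
    /** child type -> parent type; single inheritance only (an assumption). */
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("atom", "*top*");
        PARENT.put("case", "atom");
        PARENT.put("dat", "case");
        PARENT.put("nom", "case");
        PARENT.put("sign", "*top*");
        PARENT.put("morph", "sign");
    }

    /** True if type a is equal to, or more general than, type b. */
    static boolean subsumes(String a, String b) {
        for (String t = b; t != null; t = PARENT.get(t)) {
            if (t.equals(a)) return true;
        }
        return false;
    }

    /** Greatest lower bound of two types, or null if they are incompatible. */
    static String glb(String a, String b) {
        if (subsumes(a, b)) return b;
        if (subsumes(b, a)) return a;
        return null;
    }

    static final class Tfs {
        final String type;
        final Map<String, Tfs> features = new TreeMap<>();
        Tfs(String type) { this.type = type; }
        Tfs with(String feature, Tfs value) { features.put(feature, value); return this; }
        @Override public String toString() {
            return features.isEmpty() ? type : type + features;
        }
    }

    /** Non-destructive unification; returns null when the structures clash. */
    static Tfs unify(Tfs x, Tfs y) {
        String type = glb(x.type, y.type);
        if (type == null) return null;
        Tfs result = new Tfs(type);
        result.features.putAll(x.features); // sub-structures are shared, never mutated
        for (Map.Entry<String, Tfs> e : y.features.entrySet()) {
            Tfs mine = result.features.get(e.getKey());
            Tfs merged = (mine == null) ? e.getValue() : unify(mine, e.getValue());
            if (merged == null) return null;
            result.features.put(e.getKey(), merged);
        }
        return result;
    }
}
```

Unifying `sign [CASE case]` with `morph [CASE dat, POS atom]` yields a `morph` structure whose CASE is narrowed to `dat`, while `nom` and `dat` fail to unify because neither type subsumes the other.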


XTDL Formalism

Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application

XTDL grammar rules – a regular pattern over TFSs on the LHS and an output description on the RHS

TDL used for establishment of a type hierarchy of linguistic entities

    morph := sign & [POS atom,
                     STEM atom,
                     INFL infl]

[Figure: type hierarchy rooted at *top*, with the nodes atom, *avm*, *rule*, tense, sign, infl, index-avm, present, token, morph, lang, tokentype, de, en, separator, url]


XTDL Formalism

A handful of standard regular operators:

concatenation

optionality ?

disjunction |

Kleene star *

Kleene plus +

n-fold repetition {n}

m-n span repetition {m,n}

Unidirectional coreference under Kleene star (and restricted iteration)

[POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]


XTDL Formalism

    loc-pp :>
        morph & [POS Prep & #preposition,
                 INFL [CASE #1, NUMBER #2, GENDER #3]]
        morph & [POS Determiner,
                 INFL [CASE #1, NUMBER #2, GENDER #3]] ?
        morph & [POS Adjective,
                 INFL [CASE #1, NUMBER #2, GENDER #3]] *
        gazetteer & [TYPE general-location,
                     SURFACE #location]
        -> [CAT location-pp,
            PREP #preposition,
            LOCATION #location].


XTDL Interpreter

1. Matching of regular patterns using unifiability (LHS)

2. LHS Pattern instance creation

3. Unification of the rule instance and matched input

Longest match strategy

Ambiguities allowed

Interpreter generates TFSs as output (cascaded architecture)
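The longest-match strategy with ambiguities allowed can be sketched roughly as follows. The sketch works on plain part-of-speech tags instead of TFSs and uses string equality instead of unifiability; `LongestMatchSketch`, `Rule`, `sequence` and `scan` are hypothetical names, and skipping one token when no rule matches is an assumption, not documented interpreter behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative longest-match scan; hypothetical names, not the SProUT interpreter. */
final class LongestMatchSketch {
    /** A rule reports how many tokens it matches from a start position (0 = no match). */
    interface Rule {
        int matchLength(List<String> tags, int start);
    }

    /** Hypothetical helper: a rule that matches one fixed tag sequence. */
    static Rule sequence(String... expected) {
        return (tags, start) -> {
            if (start + expected.length > tags.size()) return 0;
            for (int i = 0; i < expected.length; i++) {
                if (!expected[i].equals(tags.get(start + i))) return 0;
            }
            return expected.length;
        };
    }

    /** Scans left to right; at each position keeps every rule reaching the longest match. */
    static List<String> scan(Map<String, Rule> rules, List<String> tags) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < tags.size()) {
            int best = 0;
            List<String> winners = new ArrayList<>();
            for (Map.Entry<String, Rule> e : rules.entrySet()) {
                int len = e.getValue().matchLength(tags, pos);
                if (len > best) {          // strictly longer: discard shorter candidates
                    best = len;
                    winners.clear();
                }
                if (len == best && len > 0) {
                    winners.add(e.getKey()); // ties are kept: ambiguities allowed
                }
            }
            for (String w : winners) out.add(w + "@" + pos + "+" + best);
            pos += Math.max(best, 1);      // assumption: skip one token if nothing matched
        }
        return out;
    }
}
```

Each reported match has the form `rule@start+length`. In a cascaded setup the matches would be turned into output TFSs and fed to the next grammar level.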


XTDL Interpreter

Matched input sequence “im sonnigen Rom” (in sunny Rome)

    rule [IN < morph [SURFACE im,
                      STEM im,
                      POS Prep,
                      INFL infl [CASE nom, NUMBER plural, GENDER fem]],
               morph [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl [CASE case, NUMBER number, GENDER gender]],
               gazetteer [SURFACE Rom,
                          TYPE location-general] >]


XTDL Interpreter

Rule with an instantiated pattern on the LHS

    rule [IN < morph [POS #5 Prep,
                      INFL infl [CASE #1, NUMBER #2, GENDER #3]],
               morph [POS Adjective,
                      INFL infl [CASE #1, NUMBER #2, GENDER #3]],
               gazetteer [SURFACE #4] >,
          OUT phrase [CAT np-location,
                      PREP #5,
                      LOCATION #4]]


XTDL Formalism

Unified result

    rule [IN < morph [SURFACE im,
                      STEM im,
                      POS #5 Prep,
                      INFL infl [CASE #1 dat, NUMBER #2 sing, GENDER #3 neut]],
               morph [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl [CASE #1, NUMBER #2, GENDER #3]],
               gazetteer [SURFACE #4 Rom] >,
          OUT phrase [CAT np-location,
                      PREP #5,
                      LOCATION #4]]


Linguistic Processing Resources

Tokenization

Gazetteer

Extended Gazetteer

Morphology

Sentence Splitter

Reference Matcher


Tokenization

Text segmentation into tokens

Fine-grained token classification (ca. 30 types)

complex_compound_first_capital : AT&T-Chief

Token postsegmentation

‘<a,b>’ → ‘<’ ‘a’ ‘,’ ‘b’ ‘>’
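Token postsegmentation can be pictured with a small sketch. The class `PostSegmentationSketch` and the rule it applies (every non-alphanumeric character becomes a token of its own) are assumptions for illustration; the actual tokenizer decides per token class which tokens are postsegmented, so a token such as AT&T-Chief would not be passed through this step.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative postsegmentation; hypothetical, not the SProUT tokenizer. */
final class PostSegmentationSketch {
    /** Splits a raw token so that each punctuation character becomes its own token. */
    static List<String> postSegment(String raw) {
        List<String> parts = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                word.append(c);                 // keep building the current word
            } else {
                if (word.length() > 0) {        // flush the word collected so far
                    parts.add(word.toString());
                    word.setLength(0);
                }
                parts.add(String.valueOf(c));   // punctuation stands alone
            }
        }
        if (word.length() > 0) parts.add(word.toString());
        return parts;
    }
}
```

Applied to the example above, `postSegment("<a,b>")` returns the five tokens `<`, `a`, `,`, `b`, `>`.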

Token subclassification information

contains_position_sufix: AT&T-Chief

    token [START 25,
           END 34,
           MAIN first_capital,
           SUB < [LANG german, DOM any, SPEC has_noun_ending],
                 [LANG english, DOM any, SPEC has_noun_ending] >]


Gazetteer/Extended Gazetteer

For storing static named entities (e.g., locations) or keywords (e.g., company designators, month names, etc.)

Extended Gazetteer allows for associating entries with a list of arbitrary attribute-value pairs (and uses path compression)

    ...
    Warsaw | gaz_type:city | concept:Warsaw
    Warszawa | gaz_type:city | concept:Warsaw
    Varsovie | gaz_type:city | concept:Warsaw
    ...

Case-sensitive/insensitive mode

Unicode compatibility
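A rough sketch of an extended-gazetteer lookup over entries in the layout shown above. It stores entries in a plain hash map, which stands in for the finite-state representation with path compression; `GazetteerSketch`, `add` and `lookup` are hypothetical names, not the SProUT API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative extended gazetteer backed by a hash map (an assumption). */
final class GazetteerSketch {
    private final Map<String, Map<String, String>> entries = new HashMap<>();
    private final boolean caseSensitive;

    GazetteerSketch(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    /** Normalises the lookup key in case-insensitive mode. */
    private String key(String surface) {
        return caseSensitive ? surface : surface.toLowerCase(Locale.ROOT);
    }

    /** Adds one entry written as "surface | attr:value | attr:value". */
    void add(String line) {
        String[] fields = line.split("\\|");
        Map<String, String> attributes = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":", 2);
            if (pair.length < 2) {
                throw new IllegalArgumentException("expected attr:value in: " + fields[i]);
            }
            attributes.put(pair[0].trim(), pair[1].trim());
        }
        entries.put(key(fields[0].trim()), attributes);
    }

    /** Returns the attribute-value pairs of an entry, or an empty map if unknown. */
    Map<String, String> lookup(String surface) {
        return entries.getOrDefault(key(surface), Collections.emptyMap());
    }
}
```

With the three entries above loaded, the surface forms Warsaw, Warszawa and Varsovie all map to `concept:Warsaw`, which lets a grammar refer to the concept rather than to a language-specific spelling.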


Morphology

Full-form lexica obtained from ‘compactified’ MMORPH:

English – 200,000 entries

German – 830,000 entries + Shallow Compound Recognition

French – 225,000 entries

Spanish – 570,000 entries

Italian – 330,000 entries

Dutch – ? entries (under development)

Asian Languages:

Chinese – Shanxi

Japanese – Chasen

Other:

Czech – 600,000 entries + HMM-based Part-of-Speech Tagging

Polish – 120,000 lexemes (Morfeusz)

Lithuanian – Lemuoklis

Russian – under acquisition

compactification of available full-form lexica

external components implemented as server


Morphology – Compound Recognition & Segmentation for German

“Biergartenfest”: [Bier [garten fest]] vs. [[Bier garten] fest]

“Weinsorten”: “Wein” + “sorten” (wine types) vs. “Wein” + “s” + “orten” (wine places)

“Autoradiozubehör” (car radio equipment)

Next: adaptation for processing Dutch compounds
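The segmentation ambiguity of “Weinsorten” can be reproduced with a small exhaustive search. This is only a sketch under stated assumptions: the class `CompoundSketch`, the toy lexicon and the single linking element "s" are illustrative, the search produces flat segmentations only (no bracketing as in the “Biergartenfest” example), and it is not the shallow compound recognition component itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative compound segmentation by exhaustive search; hypothetical names. */
final class CompoundSketch {
    /** Returns every split of word into lexicon entries, optionally joined by a linking "s". */
    static List<String> segment(String word, Set<String> lexicon) {
        List<String> results = new ArrayList<>();
        split(word.toLowerCase(Locale.ROOT), 0, lexicon, new ArrayDeque<>(), results);
        return results;
    }

    private static void split(String w, int from, Set<String> lexicon,
                              Deque<String> parts, List<String> results) {
        if (from == w.length()) {               // whole word consumed: record the split
            results.add(String.join("+", parts));
            return;
        }
        for (int end = from + 1; end <= w.length(); end++) {
            String piece = w.substring(from, end);
            if (!lexicon.contains(piece)) continue;
            parts.addLast(piece);
            split(w, end, lexicon, parts, results);
            // Try a linking "s" after this part, but only if more material follows it.
            if (end < w.length() - 1 && w.charAt(end) == 's') {
                parts.addLast("s");
                split(w, end + 1, lexicon, parts, results);
                parts.removeLast();
            }
            parts.removeLast();
        }
    }
}
```

With a toy lexicon containing wein, sorten and orten, the call `segment("Weinsorten", lexicon)` finds both readings from the slide: wein+sorten and wein+s+orten. Choosing between them needs extra knowledge that this sketch does not model.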


System Description Language

Construction of a concrete system instance via definition of a regular expression of module specifications

M1 M2: the output of M1 serves as the input to M2

Parallel combination of M1 and M2: quasi-parallel computation of independent modules

M*: fixpoint computation

All linguistic modules must implement a specific JAVA interface

Automatic compilation of system description into a single JAVA class


System Description Language

(M1 M2)(input)

    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1,M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();

(M*)(input)

    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
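The sequence and fixpoint operators can be sketched as ordinary function combinators. Here a module is modelled as a `UnaryOperator`, which is an assumption and much simpler than the module interface used in the generated code above; `SdlSketch`, `seq`, `fix` and the `maxRounds` guard are hypothetical and only loosely mirror `mediateSeq` and `mediateFix`.

```java
import java.util.function.UnaryOperator;

/** Illustrative module combinators; hypothetical names, not the generated SProUT code. */
final class SdlSketch {
    /** Sequence (M1 M2): the output of the first module is the input of the second. */
    static <T> UnaryOperator<T> seq(UnaryOperator<T> first, UnaryOperator<T> second) {
        return input -> second.apply(first.apply(input));
    }

    /** Fixpoint (M*): re-apply the module until its output no longer changes. */
    static <T> UnaryOperator<T> fix(UnaryOperator<T> module, int maxRounds) {
        return input -> {
            T current = input;
            for (int i = 0; i < maxRounds; i++) {
                T next = module.apply(current);
                if (next.equals(current)) return next; // fixpoint reached
                current = next;
            }
            // Guard added for the sketch: give up instead of looping forever.
            throw new IllegalStateException("no fixpoint within " + maxRounds + " rounds");
        };
    }
}
```

For example, `seq(trim, fix(collapse, 10))` first trims a string and then applies a space-collapsing module repeatedly until the text is stable. Composing combinators this way corresponds to compiling a system description into one callable unit.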


Optimization of Grammar processing

Problem: TFSs treated as symbolic values by FSM Toolkit

Sorting outgoing transitions from selected states (transition hierarchy under subsumption)

- flat trees for bad-style grammars

Extending transition hierarchy via additional nodes

[ TOP ]

[TOKEN][MORPH stem: ‘Prof.’] [GAZETTEER type: X]



Input text consisting of 32 520 words (157 080 characters, 22 pages), processed with an English grammar for NE (circa 700 transitions from the initial state)

Run-time behaviour with Tokenizer/Gazetteer/Morphology:

before: overall 17.7 seconds, candidate pattern selection 11.6 seconds

now: overall 13.2 seconds, candidate pattern selection 6.9 seconds



Using restrictions during compilation of XTDL grammars into FS-format

’Determinization under subsumption’ -> Approximation

’Expansion’ techniques for highly recursive grammars


Adapting SProUT to processing Polish

Tokenization – trivial

Morphology – integration of Morfeusz (Marcin Woliński)

Part-of-speech Disambiguation - ?

Gazetteer - several strategies:

- list all inflectional variants with additional morphological information

- interplay between gazetteer and morphology

- component for guessing morphological information of unknown words

Grammar Adaptation

- provide additional information to control inflection by using STEM attribute instead of SURFACE


Future Work

Further work concerning optimization of grammar processing

Various search strategies

Additional linguistic processing resources

Adapting to processing new languages

Real data testing: large grammars and real-world texts

Utilization in research and industrial projects


Examples – Simple grammar for person names

;; dummy rule for title

title :/ gazetteer & [SURFACE #title, GTYPE gaz_title] -> #title.

;; dummy rule for position

position :/ gazetteer & [SURFACE #position, GTYPE gaz_position] -> #position.

;; dummy rule for complex position, e.g. Direktor und CEO

    complex_position :/
        (gazetteer & [GTYPE gaz_position, SURFACE #pos1]
         token & [SURFACE "und"]
         gazetteer & [GTYPE gaz_position, SURFACE #pos2])
        -> #position, where #position = Append(#pos1," ","und"," ",#pos2).



;; dummy rule for given name

given_name :/ gazetteer & [SURFACE #name, GTYPE gaz_given_name] -> #name.

;; dummy rule for name-suffix such as "Jr."

    name_suffix :/
        (token & [ SURFACE ","] ?)
        token & [ SURFACE "Jr" & #suffix ] | token & [ SURFACE "jr" & #suffix ]
        (token & [ SURFACE "." ] ?)
        -> #suffix.

;; dummy rule for initial "M." and middle name

    initial :/
        (gazetteer & [GTYPE gaz_initial, SURFACE #initial]
         token & [SURFACE "."] ?)
        -> #middle, where #middle = Append(#initial, ".").



;; dummy rule for infix like "van", "van der"

infix :/ gazetteer & [GTYPE gaz_name_infix, SURFACE #infix] -> #infix.

;; dummy rule for last name

    last_name :/
        token & [TYPE first_capital_word, SURFACE #name]
        | token & [TYPE mixed_word_first_capital, SURFACE #name]
        | token & [TYPE word_with_hyphen_first_capital, SURFACE #name]
        | token & [TYPE word_with_apostrophee_first_capital, SURFACE #name]
        -> #name.

;; dummy rule for last name with infix

    last_name_with_infix :/
        @seek(infix) & #infix
        @seek(last_name) & #last_name
        -> #last, where #last=Append(#infix," ",#last_name).



;; rule for person names, example: Direktor und CTO Prof. Dr. hab. Witold P. van der Berg, Jr.

    person :>
        ((@seek(position) & #pos | @seek(complex_position) & #pos) token & [TYPE comma] ?)?
        @seek(title) & #title ?
        (@seek(given_name) & #given_name (@seek(given_name) & #given_name_extra ?)
         | (@seek(initial) & #given_name))
        @seek(initial) & #middle1 ?
        @seek(initial) & #middle2 ?
        (@seek(last_name) & #last_name | @seek(last_name_with_infix) & #last_name)
        @seek(name_suffix) & #suffix ?
        -> ne-person & [GIVEN_NAME #first_name,
                        TITLE #title,
                        SURNAME #last_name,
                        P-POSITION #pos,
                        NAME-SUFFIX #suffix],
           where #first_name = ConcWithBlanks(#given_name,#given_name_extra,#middle1,#middle2).


Examples – Embedding rules

    simple_noun_phrase :> .................
        -> phrase & [CAT np,
                     SURFACE #info,
                     AGR [N #n,
                          C #c,
                          G #g]], where #info=..........

    simple_event :> @seek(person) & #person
        morph & [POS verb, STEM #action]
        @seek(simple_noun_phrase) & [SURFACE #info]
        -> [PERSON #person, ACTION #action, OBJECT #info].
