Information Extraction Grammars



ECIR 2015, Vienna, March 30th

Mónica Marrero, National Supercomputing Center, Spain

Julián Urbano, Universitat Pompeu Fabra, Spain

Problem: Grammar-based Named Entity (NE) Recognition Patterns

Features

Part of speech

Case

Gazetteers

Stem

[etc.]

(Semi-)automatic Learning Method

Questions when choosing a model:

• More than one feature?
• Natural/markup language expressiveness?
• Avoid extra ambiguity?

Candidate models: regular expressions, cascade grammars, context-free grammars

Human-readable and based on standards

NE types: Person, Time, Location

Information Extraction systems should be capable of adapting to different entities and domains.

How can we decide which model is best for a Named Entity Recognition system?

Proposal: Information Extraction Grammars for Named Entity Recognition

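Formally, $IEG = (\mathcal{V}, S, \Sigma, \mathcal{P}, \mathcal{C})$, where:

• $\mathcal{V}$: set of non-terminals
• $S \in \mathcal{V}$: initial symbol
• $\Sigma$: input alphabet
• $\mathcal{P}$: set of production rules
• $\mathcal{C}$: set of condition sets assigned to non-terminals, expressed as function-value pairs $(f, y)$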

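All derivations must meet:

$$A \Rightarrow^{*}_{IEG} \omega \;:=\; A \Rightarrow^{*}_{G} \omega \ \text{ and } \ \forall (f, y) \in \mathcal{C}_A : f(\omega) = y$$

where $G$ is the context-free grammar.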
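The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `IEG`, the function `derives`, and the toy grammar are hypothetical, terminals are treated as literal tokens, and the naive recognizer assumes a grammar without epsilon rules or unit-production cycles. A non-terminal derives a token sequence only if the plain context-free derivation exists and every pair (f, y) in its condition set satisfies f(ω) = y.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A condition is a function-value pair (f, y); it holds for omega when f(omega) == y.
Condition = Tuple[Callable[[str], object], object]


@dataclass
class IEG:
    """Hypothetical container for the tuple (V, S, Sigma, P, C)."""
    nonterminals: set
    start: str
    alphabet: set
    productions: Dict[str, List[Tuple[str, ...]]]
    conditions: Dict[str, List[Condition]] = field(default_factory=dict)


def derives(g: IEG, symbol: str, tokens: List[str]) -> bool:
    """True if `symbol` derives `tokens` and all conditions on the way hold.

    Naive exponential search; assumes no epsilon rules and no unit cycles.
    """
    if symbol not in g.nonterminals:
        # Assumption: terminals are literal tokens.
        return len(tokens) == 1 and tokens[0] == symbol
    omega = " ".join(tokens)
    if not all(f(omega) == y for f, y in g.conditions.get(symbol, [])):
        return False  # the context-free derivation may exist, but a condition fails
    return any(_split(g, rhs, tokens) for rhs in g.productions.get(symbol, []))


def _split(g: IEG, rhs: Tuple[str, ...], tokens: List[str]) -> bool:
    """Try every way of dividing `tokens` among the symbols of `rhs`."""
    if not rhs:
        return not tokens
    head, rest = rhs[0], rhs[1:]
    for i in range(1, len(tokens) - len(rest) + 1):
        if derives(g, head, tokens[:i]) and _split(g, rest, tokens[i:]):
            return True
    return False


# Toy grammar: S -> N, N -> "Lisa" | "lisa", with the condition (istitle, True) on N.
toy = IEG(
    nonterminals={"S", "N"},
    start="S",
    alphabet={"Lisa", "lisa"},
    productions={"S": [("N",)], "N": [("Lisa",), ("lisa",)]},
    conditions={"N": [(str.istitle, True)]},
)
plain_cfg = IEG(toy.nonterminals, toy.start, toy.alphabet, toy.productions)

accepts_title = derives(toy, "S", ["Lisa"])        # True
accepts_lower = derives(toy, "S", ["lisa"])        # False: condition on N fails
cfg_accepts_lower = derives(plain_cfg, "S", ["lisa"])  # True: no conditions
```

With an empty condition set the check reduces to ordinary context-free derivation, which is why `plain_cfg` accepts both spellings while `toy` accepts only the capitalized one.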

IEG for the recognition of full person names using First/Last name gazetteers

$S \to F\ L\ L$
$S \to F\ L$
$S \to F$
$F \to T$
$L \to T$
$T \to$ [a-zA-Z0-9]+

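$\mathcal{C}_F = \{(\mathit{FirstGaz}, \mathit{true}), (\mathit{Case}, \mathit{upper}), (\mathit{POS}, \mathit{NP})\}$

$\mathcal{C}_L = \{(\mathit{LastGaz}, \mathit{true}), (\mathit{Case}, \mathit{upper}), (\mathit{POS}, \mathit{NP})\}$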

Lisa Brown Smith will present at 4 pm in Foyer room

Similar to synthesized attributes in S-attributed grammars, but here the attribute values are given upfront and are used to constrain the parsing.
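As a rough illustration of how the person-name grammar might be applied to the example sentence, the sketch below scans the tokens and tries the three S rules from longest to shortest. Everything beyond the grammar itself is an assumption made for the example: the gazetteer contents, the `POS` lookup that stands in for a real tagger, the reading of "Case, upper" as "starts with an uppercase letter", and the use of a last-name gazetteer for L (as the slide title suggests). The names `find_names` and `matches` are hypothetical.

```python
import re

# Hypothetical resources; a real system would load gazetteers and run a POS tagger.
FIRST_GAZ = {"Lisa"}
LAST_GAZ = {"Brown", "Smith"}
POS = {"Lisa": "NP", "Brown": "NP", "Smith": "NP", "Foyer": "NP"}

TOKEN = re.compile(r"[a-zA-Z0-9]+")  # T -> [a-zA-Z0-9]+


def case(tok):
    # Assumption: "upper" means the token starts with an uppercase letter.
    return "upper" if tok[0].isupper() else "lower"


# Condition sets as (function, expected value) pairs.
CONDITIONS = {
    "F": [(lambda t: t in FIRST_GAZ, True), (case, "upper"), (POS.get, "NP")],
    "L": [(lambda t: t in LAST_GAZ, True), (case, "upper"), (POS.get, "NP")],
}

# S -> F L L | F L | F, tried longest first.
S_RULES = [("F", "L", "L"), ("F", "L"), ("F",)]


def matches(symbol, tok):
    """F -> T and L -> T: the token must match T and meet the symbol's conditions."""
    if TOKEN.fullmatch(tok) is None:
        return False
    return all(f(tok) == y for f, y in CONDITIONS[symbol])


def find_names(text):
    tokens = text.split()
    names, i = [], 0
    while i < len(tokens):
        for rhs in S_RULES:
            span = tokens[i:i + len(rhs)]
            if len(span) == len(rhs) and all(matches(s, t) for s, t in zip(rhs, span)):
                names.append(" ".join(span))
                i += len(rhs)
                break
        else:
            i += 1  # no rule applies at this position
    return names


names = find_names("Lisa Brown Smith will present at 4 pm in Foyer room")
# names == ["Lisa Brown Smith"]
```

"Foyer" is capitalized and tagged NP in the toy lookup, yet it is rejected because it is not in the first-name gazetteer. The conditions are conjunctive, so every pair must hold for a token to be accepted.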

Computational Complexity

Model                   Complexity
Regular Expression      $O(ns^2)$
Cascade Grammar         $O(mns^2)$
IEG                     $O(n(tm + s^2))$

Context-Free Grammar    $O(n^3)$
IEG                     $O(n^3)$

Sizes: $n$ input, $m$ features, $s$ states in the automata, $t$ non-terminals with associated conditions
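To get a feel for how these bounds compare, the sketch below simply evaluates the expressions inside each O(·) for one set of sizes. The sizes are invented for illustration, the function name `bound_terms` is hypothetical, and asymptotic bounds hide constant factors, so the numbers say nothing about actual running times. It also assumes the first IEG row refers to automaton-based recognition, since it depends on the number of states s.

```python
def bound_terms(n, m, s, t):
    """Evaluate the expressions inside each O(.) for the given sizes."""
    return {
        "regular_expression": n * s**2,
        "cascade_grammar": m * n * s**2,
        "ieg_automaton": n * (t * m + s**2),  # assumed: the automaton-based IEG row
        "context_free_or_ieg": n**3,
    }


# Illustrative sizes only: 1000 input tokens, 3 features, 20 states,
# 2 non-terminals with conditions.
terms = bound_terms(n=1000, m=3, s=20, t=2)
# regular_expression  ->     400000
# cascade_grammar     ->    1200000
# ieg_automaton       ->     406000
# context_free_or_ieg -> 1000000000
```

The point of the comparison is structural: in the cascade bound the number of features m multiplies the whole term, whereas in the IEG bound it only enters through the additive tm term.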

Summary and Future Work

• Information Extraction Grammars
  - Based on standards
  - Expressiveness of context-free grammars
  - Support for custom features
  - Competitive complexity using standard recognition methods

• Contributes to the flexibility of Information Extraction tools that can work independently of the kind of features and the expressiveness of the language to recognize

• Future work: optimization of the recognition methods and use of probabilities in the conditions
