
Information Extraction Grammars



ECIR 2015 Vienna, March 30th

Mónica Marrero National Supercomputing Center, Spain

Julián Urbano Universitat Pompeu Fabra, Spain

Problem: Grammar-based Named Entity (NE) Recognition Patterns

Features: part of speech, case, gazetteers, stem, etc.

(Semi-)automatic Learning Method

Questions for comparing Regular Expressions, Cascade Grammars and Context-Free Grammars:

• More than one feature?
• Natural/markup language expressiveness?
• Avoid extra ambiguity?

Human-readable and based on standards

NE: Person NE: Time NE: Location

Information Extraction systems should be capable of adapting to different entities and domains.

How can we decide what is the best model for a Named Entity Recognition system?

Proposal: Information Extraction Grammars for Named Entity Recognition

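Formally, $IEG = (\mathcal{V}, S, \Sigma, \mathcal{P}, \mathcal{C})$, where:

• $\mathcal{V}$: set of non-terminals
• $S \in \mathcal{V}$: initial symbol
• $\Sigma$: input alphabet
• $\mathcal{P}$: set of production rules
• $\mathcal{C}$: set of condition sets assigned to non-terminals, expressed as function-value pairs $(f, y)$

All derivations must meet:

$$A \Rightarrow^{*}_{IEG} \omega \;:=\; A \Rightarrow^{*}_{G} \omega \ \text{ and } \ \forall (f, y) \in \mathcal{C}_A : f(\omega) = y$$

where $G = (\mathcal{V}, S, \Sigma, \mathcal{P})$ is the underlying context-free grammar.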

IEG for the recognition of full person names using First/Last name gazetteers

$S \to F\,L\,L$
$S \to F\,L$
$S \to F$
$F \to T$
$L \to T$
$T \to \texttt{[a-zA-Z0-9]+}$

$\mathcal{C}_F = \{(FirstGaz, true), (Case, upper), (POS, NP)\}$
$\mathcal{C}_L = \{(FirstGaz, true), (Case, upper), (POS, NP)\}$

Lisa Brown Smith will present at 4 pm in Foyer room

The conditions are similar to synthesized attributes in S-attributed grammars, but here the attribute values are given upfront and are used to constrain the parsing.
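The person-name example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' recognizer. The gazetteer contents are made up. The POS condition is omitted because it would need a tagger. The source lists FirstGaz for both condition sets, but the sketch assumes a separate last-name feature (`LastGaz`) for L, in line with the example's title. A greedy left-to-right scan stands in for a general context-free parser, which is enough for these three flat rules for S.

```python
import re

# Hypothetical gazetteers, invented for this illustration.
FIRST_NAMES = {"Lisa", "John"}
LAST_NAMES = {"Brown", "Smith"}

# Feature functions f(w) evaluated on a token; simplified stand-ins.
FEATURES = {
    "FirstGaz": lambda w: w in FIRST_NAMES,
    "LastGaz": lambda w: w in LAST_NAMES,  # assumption: not named in the source
    "Case": lambda w: "upper" if w[:1].isupper() else "lower",
}

# Condition sets C_A: (function name, required value) pairs per non-terminal.
# The POS condition from the example is left out (no tagger in this sketch).
CONDITIONS = {
    "F": [("FirstGaz", True), ("Case", "upper")],
    "L": [("LastGaz", True), ("Case", "upper")],
}

# S -> F L L | F L | F, listed longest first for greedy matching.
S_RULES = [["F", "L", "L"], ["F", "L"], ["F"]]

# T -> [a-zA-Z0-9]+
TOKEN = re.compile(r"[a-zA-Z0-9]+")


def derives(nonterminal, token):
    """True if the token matches T and every (f, y) in C_A gives f(token) == y."""
    if TOKEN.fullmatch(token) is None:
        return False
    return all(FEATURES[f](token) == y for f, y in CONDITIONS[nonterminal])


def find_names(tokens):
    """Return (start, end) token spans derivable from S, scanning left to right."""
    spans = []
    i = 0
    while i < len(tokens):
        for rule in S_RULES:
            window = tokens[i:i + len(rule)]
            if len(window) == len(rule) and all(
                derives(a, w) for a, w in zip(rule, window)
            ):
                spans.append((i, i + len(rule)))
                i += len(rule)
                break
        else:
            i += 1  # no rule for S applies at this position
    return spans


sentence = "Lisa Brown Smith will present at 4 pm in Foyer room".split()
names = [" ".join(sentence[a:b]) for a, b in find_names(sentence)]
print(names)  # ['Lisa Brown Smith']
```

"Foyer" is capitalized and matches T, but it is rejected because the gazetteer condition on F fails. This shows how the conditions constrain derivations that the plain grammar would accept.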

Computational Complexity

Regular Expression: $O(ns^2)$
Cascade Grammar: $O(mns^2)$
IEG: $O(n(tm + s^2))$

Context-Free Grammar: $O(n^3)$
IEG: $O(n^3)$

Here n is the size of the input, m the number of features, s the number of states in the automata, and t the number of non-terminals with associated conditions.
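To get a feel for how these bounds compare, the snippet below evaluates them for one set of sizes. The function names and the sample values of n, m, s and t are hypothetical, and constant factors are ignored, so the numbers indicate relative growth only and are not measured running times.

```python
def regex_bound(n, s):
    """Regular expression: n * s^2."""
    return n * s ** 2


def cascade_bound(n, m, s):
    """Cascade grammar: m * n * s^2."""
    return m * n * s ** 2


def ieg_automaton_bound(n, m, s, t):
    """IEG bound expressed in automaton states: n * (t*m + s^2)."""
    return n * (t * m + s ** 2)


def cubic_bound(n):
    """Context-free grammar and the second IEG bound: n^3."""
    return n ** 3


# Hypothetical sizes chosen only for illustration.
n, m, s, t = 100, 3, 10, 2

bounds = {
    "Regular Expression": regex_bound(n, s),
    "Cascade Grammar": cascade_bound(n, m, s),
    "IEG (automaton bound)": ieg_automaton_bound(n, m, s, t),
    "Context-Free Grammar / IEG (cubic bound)": cubic_bound(n),
}
for model, value in bounds.items():
    print(f"{model}: {value}")
```

With these sample sizes, the feature term t·m adds little to the IEG bound compared with s², whereas the cascade bound multiplies the whole regular-expression cost by m.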

Summary and Future Work

• Information Extraction Grammars
  - Based on standards
  - Expressiveness of context-free grammars
  - Support for custom features
  - Competitive complexity using standard recognition methods

• Contributes to the flexibility of Information Extraction tools that can work independently of the kind of features and the expressiveness of the language to recognize

• Future work: optimization of the recognition methods and use of probabilities in the conditions