What’s “NEXT”?
Navigating through Dense Annotation Spaces

Branimir K. Boguraev
Mary S. Neff

Language Engineering for Content Analysis
IBM T.J. Watson Research Center
Yorktown Heights, NY


Dense Annotation Spaces

Service Reps can read customer name, in order to contact the customer.

{np} {nps} {md} {vb} {nn} {nn} {in} {nn} {to} {vb} {dt} {nn}
[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
[SC]
[SENT]

Annotation ‘trees’

Service Reps can read customer name, in order to contact the customer.

{np} {nps} {md} {vb} {nn} {nn} {in} {nn} {to} {vb} {dt} {nn}
[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
[SC]
[SENT]

Annotation lattice

Service Reps can read customer name, in order to contact the customer.

{np} {nps} {md} {vb} {nn} {nn} {in} {nn} {to} {vb} {dt} {nn}
[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
[SC]
[SENT]

Navigational Challenges

[PNAME]
[Title] [Name]
[First] [Middle] [Last]

What is visible to the lattice traversal engine?

Annotation-Based Finite State Transducer (AFst)

UIMA-based
A finite state calculus over typed feature structures
Cf. “grep” over a sequence of annotations, specified as types and features

np = <E>/[NP .
     Token[pos=~"DT"] | <E> .
     Token[pos=~"JJ"]* .
     ( Token[pos=~"NN"] | Token[pos=~"NNS"] ) .
     <E>/]NP ;
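
The np rule reads as: an optional determiner, any number of adjectives, then a singular or plural noun, with the matched span wrapped in a new NP annotation. A minimal Python sketch of that reading, not the AFst engine itself. Assumptions: tokens are plain (text, pos) pairs, `=~` is approximated by exact tag equality, and `match_np` / `find_nps` are hypothetical names.

```python
def match_np(tokens, start):
    """Return the exclusive end index of an NP starting at `start`, or None.

    Mirrors the rule: DT? JJ* (NN | NNS).
    """
    i = start
    # Optional determiner: Token[pos=~"DT"] | <E>
    if i < len(tokens) and tokens[i][1] == "DT":
        i += 1
    # Zero or more adjectives: Token[pos=~"JJ"]*
    while i < len(tokens) and tokens[i][1] == "JJ":
        i += 1
    # Obligatory nominal head: NN or NNS
    if i < len(tokens) and tokens[i][1] in ("NN", "NNS"):
        return i + 1
    return None


def find_nps(tokens):
    """Scan left to right and collect (begin, end) token spans of NPs."""
    spans = []
    i = 0
    while i < len(tokens):
        end = match_np(tokens, i)
        if end is None:
            i += 1          # no NP here, try the next token
        else:
            spans.append((i, end))
            i = end         # resume after the match
    return spans


tokens = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("lazy", "JJ"), ("dogs", "NNS")]
np_spans = find_nps(tokens)
```

The real transducer works over UIMA annotations and typed feature structures; the tuple list here only stands in for the token layer.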

Pitching the Iterator: support for navigational control

Service Reps can read customer name, in order to contact the customer.

{np} {nps} {md} {vb} {nn} {nn} {in} {nn} {to} {vb} {dt} {nn}
[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
[SC]
[SENT]

AFst Traversal Regime

Defining a particular path through the annotation space requires a lattice traversal engine that can focus, simultaneously, on:

o Sequential constraints ~ pattern matching
  Horizontal: prenominal modifier and nominal head
o Structural constraints
  Vertical: iterate over NPs with a specific configurational relationship, e.g. not sentence-initial, not in a PP
o Configurational constraints
  Type prioritization

Linearizing the Lattice: what’s “next”?

Unambiguous Typeset iterator, inferred from grammar:
  … [SUB] . [VG] . [OBJ] . [PP] …

UIMA natural annotation sort order:
o Start position ascending
o Length descending
o Type priority, defined in UIMA descriptors

[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
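
The three-part sort order can be written as a single sort key. A minimal sketch in Python, not the UIMA API: the `Annotation` class, the `TYPE_PRIORITY` list, and `linearize` are hypothetical, and the priority list stands in for what would be declared in UIMA descriptors. The optional `typeset` argument imitates restricting the iterator to the types a grammar mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int


# Assumed priority order (earlier = higher priority); illustrative only.
TYPE_PRIORITY = ["SENT", "SC", "SUB", "OBJ", "PP", "NP", "VG", "Token"]


def natural_order_key(a):
    # 1. start position ascending
    # 2. length descending (negated so longer spans sort first)
    # 3. type priority as a tie-breaker
    return (a.begin, -(a.end - a.begin), TYPE_PRIORITY.index(a.type))


def linearize(annotations, typeset=None):
    """Return annotations in natural order, optionally filtered by type."""
    selected = [a for a in annotations
                if typeset is None or a.type in typeset]
    return sorted(selected, key=natural_order_key)


# Character offsets for the start of "Service Reps can read ...".
annos = [
    Annotation("Token", 0, 7), Annotation("NP", 0, 12),
    Annotation("SUB", 0, 12), Annotation("SENT", 0, 70),
    Annotation("VG", 13, 21), Annotation("Token", 8, 12),
]
ordered_types = [a.type for a in linearize(annos)]
```

With the filter set to `{"SUB", "VG"}`, only one annotation is visible at each start position, which is the sense in which the restricted iterator is unambiguous in this sketch.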

Linearizing the Lattice: what’s “next”?

Grammar-wide declarations

boundary % Sentence[];
honour % Address[] ;

month = Token[lemma=~"January"] |
        Token[lemma=~"February"] |
        … ;

date = <E>/[Year .
       :month | <E> .
       Token[string=~:^[12]\d{3}$:]
       <E>/]Year;
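
One way to read the boundary declaration together with the date rule: matching restarts inside each Sentence and never runs across a sentence edge. A minimal Python sketch under that assumption. Sentences are lists of token strings, the lemma test is simplified to string membership, `MONTHS` holds only the two months visible in the rule (the rest are elided in the source), and `find_years` is a hypothetical name.

```python
import re

# Only the months shown in the rule; the remainder is elided in the source.
MONTHS = {"January", "February"}

# Four-digit token starting with 1 or 2, as in the rule's regular expression.
YEAR_RE = re.compile(r"^[12]\d{3}$")


def find_years(sentences):
    """Return (sentence_index, begin, end) token spans: optional month + year.

    Each sentence is scanned on its own, so a match cannot cross a boundary.
    """
    hits = []
    for s_idx, tokens in enumerate(sentences):
        i = 0
        while i < len(tokens):
            # Optional month: :month | <E>
            j = i + 1 if tokens[i] in MONTHS else i
            if j < len(tokens) and YEAR_RE.match(tokens[j]):
                hits.append((s_idx, i, j + 1))
                i = j + 1
            else:
                i += 1
    return hits


sentences = [
    ["Hired", "in", "January", "2004", "."],
    ["Left", "1999", "for", "February"],
    ["2005", "was", "good"],
]
year_hits = find_years(sentences)
```

In the sample data, "February" ends one sentence and "2005" opens the next, so they are not combined into one match.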

Focus: Selecting Nested Boundary Annotations

<nameValuePair>
  <name>Focus</name>
  <value><array>
    <string>Section[label~=:Education:]</string>
    <string>Sentence[number==1]</string>
  </array></value>
</nameValuePair>
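
A possible reading of the nested Focus setting: keep only those Sentence annotations with number 1 that lie inside a Section whose label matches "Education". The Python sketch below illustrates that reading only. The `Section` and `Sentence` classes, the `focus_spans` helper, and the assumption that containment is what "nested" means are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    label: str
    begin: int
    end: int


@dataclass(frozen=True)
class Sentence:
    number: int
    begin: int
    end: int


def focus_spans(sections, sentences, label_pattern, sentence_number):
    """Select sentences by number, restricted to sections matching the label."""
    outer = [s for s in sections if re.search(label_pattern, s.label)]
    return [
        sent for sent in sentences
        if sent.number == sentence_number
        and any(sec.begin <= sent.begin and sent.end <= sec.end
                for sec in outer)
    ]


sections = [Section("Education", 0, 100), Section("Experience", 100, 200)]
sentences = [Sentence(1, 0, 40), Sentence(2, 40, 100), Sentence(1, 100, 150)]
focus = focus_spans(sections, sentences, r"Education", 1)
```

Pattern matching would then run only inside the returned spans.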

Linearizing the Lattice: what’s “next”?

Grammar-wide declarations

match % first, last, longest, shortest, all
advance % skip, step
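
The slide names the options without defining them, so the Python sketch below rests on guesses. Assumptions: `match` chooses among the candidate matches found from one start position, `skip` resumes after the end of the chosen match, and `step` moves one position forward regardless. The names `select`, `next_position`, `scan`, and `nn_runs` are hypothetical.

```python
def select(candidates, policy):
    """Pick matches from candidate (begin, end) spans, in discovery order."""
    if not candidates:
        return []
    if policy == "first":
        return [candidates[0]]
    if policy == "last":
        return [candidates[-1]]
    if policy == "longest":
        return [max(candidates, key=lambda s: s[1] - s[0])]
    if policy == "shortest":
        return [min(candidates, key=lambda s: s[1] - s[0])]
    if policy == "all":
        return list(candidates)
    raise ValueError(f"unknown match policy: {policy}")


def next_position(position, chosen, advance):
    """Assumed semantics: 'skip' jumps past the match, 'step' moves by one."""
    if advance == "skip" and chosen:
        return max(end for _, end in chosen)
    return position + 1


def scan(length, matcher, policy="longest", advance="skip"):
    """Run `matcher` from successive positions and collect chosen matches."""
    results, pos = [], 0
    while pos < length:
        chosen = select(matcher(pos), policy)
        results.extend(chosen)
        pos = next_position(pos, chosen, advance)
    return results


def nn_runs(tags):
    """Toy matcher for the pattern NN+: every non-empty run of NN from pos."""
    def matcher(pos):
        spans, end = [], pos
        while end < len(tags) and tags[end] == "NN":
            end += 1
            spans.append((pos, end))
        return spans
    return matcher


tags = ["NN", "NN", "VB", "NN"]
longest_skip = scan(len(tags), nn_runs(tags), "longest", "skip")
```

Switching to `"shortest"` with `"step"` over the same tags yields overlapping one-token matches, which shows how the two declarations interact.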

What’s “next”?: Switching Levels, Mixed Iterator

Refocus the iterator to examine inner contour: @descend, @ascend

findDrSmith =
    <E>/PName[@descend] .
    Title[string=~"Dr."] .
    <E>/Name[@descend] .
    First[] | <E> .
    Last[string=="Smith"] .
    <E>/Name[@ascend] .
    <E>/PName[@ascend] ;
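
The effect of @descend and @ascend can be pictured as a cursor that keeps a stack of levels. A minimal Python sketch of findDrSmith under stated assumptions: annotations are modelled as a tree of `Node` objects with explicit children, string tests are exact comparisons, and `LevelIterator` and `is_dr_smith` are hypothetical names rather than AFst internals.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    text: str = ""
    children: list = field(default_factory=list)


class LevelIterator:
    """Cursor over one level of nested annotations, with a level stack."""

    def __init__(self, top_level):
        self._stack = []            # saved (siblings, index) of outer levels
        self._siblings = top_level
        self._index = 0

    def current(self):
        if self._index < len(self._siblings):
            return self._siblings[self._index]
        return None

    def advance(self):
        self._index += 1

    def descend(self):
        """Refocus on the children of the annotation under the cursor."""
        node = self.current()
        self._stack.append((self._siblings, self._index))
        self._siblings, self._index = node.children, 0

    def ascend(self):
        """Return to the outer level, just past the annotation entered."""
        self._siblings, self._index = self._stack.pop()
        self._index += 1


def is_dr_smith(it):
    """Check the annotation under the cursor against the findDrSmith rule."""
    node = it.current()
    if node is None or node.type != "PName":
        return False
    it.descend()                                    # PName[@descend]
    title = it.current()
    ok = title is not None and title.type == "Title" and title.text == "Dr."
    if ok:
        it.advance()
        name = it.current()
        ok = name is not None and name.type == "Name"
    if ok:
        it.descend()                                # Name[@descend]
        first = it.current()
        if first is not None and first.type == "First":
            it.advance()                            # First[] | <E>
        last = it.current()
        ok = last is not None and last.type == "Last" and last.text == "Smith"
        it.ascend()                                 # Name[@ascend]
    it.ascend()                                     # PName[@ascend]
    return ok


dr = Node("PName", children=[
    Node("Title", "Dr."),
    Node("Name", children=[Node("First", "John"), Node("Last", "Smith")]),
])
mr = Node("PName", children=[
    Node("Title", "Mr."),
    Node("Name", children=[Node("Last", "Smith")]),
])
cursor = LevelIterator([dr, mr])
first_result = is_dr_smith(cursor)
second_result = is_dr_smith(cursor)
```

After each check the cursor sits on the next top-level annotation, so descending and ascending leave the outer traversal undisturbed.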

Alternate Multiple Level Access

Upper/lower context without switching levels

Token[_costarts=~Sentence[number==1]];
Subject[_covers=~PName[]];
PName[_costarts=~NP[],_coends=~NP[]];
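
The `_costarts`, `_coends`, and `_covers` tests can be read as relations between spans. A minimal Python sketch under that assumption. The `Span` class and `select_related` helper are hypothetical, and nested feature constraints such as `number==1` are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    type: str
    begin: int
    end: int


def costarts(a, b):
    """a and b begin at the same offset."""
    return a.begin == b.begin


def coends(a, b):
    """a and b end at the same offset."""
    return a.end == b.end


def covers(a, b):
    """a spans at least the whole of b."""
    return a.begin <= b.begin and b.end <= a.end


def select_related(annotations, type_, relation, other_type):
    """Annotations of `type_` standing in `relation` to some `other_type`."""
    return [
        a for a in annotations
        if a.type == type_
        and any(relation(a, b) for b in annotations if b.type == other_type)
    ]


# Offsets for "Dr. Smith called."
annos = [
    Span("Sentence", 0, 17), Span("Subject", 0, 9), Span("PName", 0, 9),
    Span("NP", 0, 9), Span("Token", 0, 3), Span("Token", 4, 9),
    Span("Token", 10, 16),
]
sentence_initial = select_related(annos, "Token", costarts, "Sentence")
```

The iterator stays on one level here (Token, Subject, or PName) and only consults the other layer as context.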

Grammar cascading

From simpler to more complex analyses
Lower levels of output feed as inputs into higher levels

Small noun phrases & verb groups
Prepositional, possessive & adjectival phrases
More complex noun phrases
Variety of clause types
Grammatical relations (subject, object)
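
The cascade idea, where each level consumes what lower levels produced, can be sketched as a list of stages applied in order. This is an illustration in Python, not the AFst grammar cascade itself. Annotations are assumed to be (type, begin, end) tuples over token positions, and `run_cascade`, `small_np`, and `prep_phrase` are hypothetical toy stages.

```python
def run_cascade(annotations, stages):
    """Apply stages in order; each stage sees everything produced so far."""
    for stage in stages:
        annotations = annotations + stage(annotations)
    return annotations


def small_np(annos):
    """Stage 1 (toy): a DT immediately followed by an NN becomes an NP."""
    nouns = {b: e for t, b, e in annos if t == "NN"}
    return [("NP", b, nouns[e]) for t, b, e in annos
            if t == "DT" and e in nouns]


def prep_phrase(annos):
    """Stage 2 (toy): an IN immediately followed by an NP becomes a PP."""
    nps = {b: e for t, b, e in annos if t == "NP"}
    return [("PP", b, nps[e]) for t, b, e in annos
            if t == "IN" and e in nps]


# Token positions for "to the customer", tagged IN DT NN for illustration.
tokens = [("IN", 0, 1), ("DT", 1, 2), ("NN", 2, 3)]
cascaded = run_cascade(tokens, [small_np, prep_phrase])
```

Running the stages in the opposite order produces no PP, because the NP it depends on does not exist yet.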

Implementations

Shallow Parsing
Named Entity Detection interleaved with shallow parsing
Terminology identification in new domains
Temporal expression parsing
Privacy policy rules
Information extraction from resumes
Information extraction from contact center telephone calls

Future work list

Alternate (semi-ambiguous) iterator, useful for “disambiguator” grammars
  Actor[] Director[]
Tree-walk iterator for tree representations where children are explicitly referenced in features

Performance Notes

Performance is a function of
  How grammar is written
  Optimisation of fst graph (grammar compiler)
  Optimisation of symbol compiler
  Optimisation of executor

However … for the benefit of the curious … IBM Software Group (Dublin) optimised the last two, and …

IBM LanguageWare (Dublin) text analysis performance results

The analysis:
- AFST rules and FST dictionary
- 26 rules, 7 dictionaries (things like first names, indicators like Corp. etc.)
- creating Person and Company annotations

The test:
- test set: Enron
- 924 files (4.5Mb)

The results:
Precision for Company annotations only: 0.81
Recall for Company annotations only: 0.67
Precision for Person annotations only: 0.93
Recall for Person annotations only: 0.91
Processing time: 3.4 seconds

Processing is 10 times faster than the best-of-breed internal reference annotators.
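
The slide reports precision and recall only. For readers who want a single figure per annotation type, the standard F1 score (harmonic mean of precision and recall) can be derived from them. The short Python sketch below does that arithmetic; the F1 values are computed here and are not part of the reported results.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
reported = {
    "Company": (0.81, 0.67),
    "Person": (0.93, 0.91),
}

f1_scores = {name: f1(p, r) for name, (p, r) in reported.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.2f}")
```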

Perpetrators … er… Responsible parties

Bran Boguraev
Mary Neff
Bran Lambov
D.J. McCloskey

Thilo Goetz
Thomas Hampp
Oliver Suhre

Roy Byrd
Herb Chong
Albert Eskenazi
Paul Kaye
Son Bao Pham
Lokesh Shresta
Max Silberztein

For more on AFst and tools --

Tomorrow, 12:25 in Fez 1:

A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Environment

Youssef Drissi, Branimir Boguraev, David Ferrucci, Paul Keyser, and Anthony Levas