Sonoma State University Computer Science Colloquium 03/06/2008 © 2008 IBM Corporation
Declarative Information Extraction
The Avatar Group IBM Almaden Research Center
Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu
© 2008 IBM Corporation2
Motivation: Where is the party?
Hmmm…I don’t know. Let me
check my email.
John and Jane are going to a salsa party tonight! But …
© 2008 IBM Corporation3
Where is the party?
Hi guys,
We are planning a salsa party tonight starting at
10:00pm for our class at Miami Beach Club,
175 San Pedro Square
San Jose, CA 95109
Whoever who is interested, please let me know
so we can organize some car-pooling.
-Juan
PS: you can call me at 408.123.4567 if needed.
Search for “salsa address”: 0 emails found
Search for “salsa”: 100 emails found
Search for “address”: 0 emails found
The address of the party!
But the email itself does not contain the word “address”!
© 2008 IBM Corporation4
Information Extraction
Distill structured data from unstructured and semi-structured text
– E.g., extracting phone numbers from emails, extracting person names from the web
Hi guys,
We are planning a salsa party tonight starting at 10:00pm for our salsa class at Miami Beach Club,
175 San Pedro Square San Jose, CA 95109
Whoever who is interested, please let me know so we can organize some car-pooling.
-Juan
PS: you can call me at 408.123.4567 if needed.
Event          Address
salsa party    175 San Pedro Square …
…              …
Query: Select Address From EVENTS Where event = ‘salsa party’
Result: 175 San Pedro Square …
Exploit the extracted data in your applications
– E.g., for search, for advertisement
© 2008 IBM Corporation5
Revisit: Where is the Party?
salsa address
San Jose, CA 95109
Lotus Notes 8.01 Live Text
© 2008 IBM Corporation6
Other Commercial Applications
© 2008 IBM Corporation7
And many others
Literature Citations / Research Communities
– DBLife
– Google Scholar
Terminology Extraction
Document Summarization
Life Science
– E.g., Gene Sequence Extraction, Protein Interaction Extraction
…
As the amount of data in text explodes, information extraction is becoming increasingly important!
© 2008 IBM Corporation8
Basic Terminology
Annotator: program used to extract structured data
Annotations: structured data extracted by annotators
[Diagram: documents are processed by several annotators; the resulting annotations are stored in a data repository that serves higher-level applications]
© 2008 IBM Corporation9
Background: Avatar
Working on information extraction (IE) since 2003
Main goals:
– Extract structured information from text
– Build a system that can scale IE to real enterprise apps
– Build new enterprise applications that leverage IE
© 2008 IBM Corporation10
Evolution of the Avatar IE System
[Timeline diagram, 2004 to 2008]
– Custom Code
– RAP (CPSL-style cascading grammar system)
– RAP++ (RAP + extensions outside the scope of grammars)
– System T (algebraic information extraction system)
Evolutionary triggers: large number of annotators; performance and expressivity; diverse data sets and complex extraction tasks
The Custom Code Era
© 2008 IBM Corporation12
Extracting Information with Custom Code
“It’s just pattern matching”
– Use scripts and regular expressions
Then reality sets in…
– Dozens of rules, even for simple concepts
– Many special cases
– Convoluted logic
– Painfully slow code
The Age of Cascading Grammars
© 2008 IBM Corporation14
Historical Perspective
MUC (Message Understanding Conference): 1987 to 1997
– Competition-style conferences organized by DARPA
– Shared data sets and performance metrics
• News articles, radio transcripts, military telegraphic messages
Classical IE Tasks
– Entity and Relationship/Link extraction
– Event detection, sentiment mining, etc.
– Entity resolution/matching
Several IE systems were built
– FRUMP [DeJong82], CIRCUS/AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
© 2008 IBM Corporation15
Cascading Finite-state Grammars
Most IE systems share a common formalism
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical features of these tokens
Several levels of processing: Cascading Grammars
CPSL
– A standard language for specifying cascading grammars
– Created in 1998
Several known implementations
– TextPro: reference implementation of CPSL by Doug Appelt
– JAPE (Java Annotation Pattern Engine)
• Part of the GATE NLP framework
• Under active consideration for commercial use by several companies
© 2008 IBM Corporation16
Cascading Grammars By Example
Level 0 (Tokenize)
Level 1
– Token[~ “John | Smith | …”]+ → Name
– Token[~ “[1-9]\d{2}-\d{4}”] → Phone
Level 2
– Name Token[~ “at”] Phone → PersonPhone
[Figure: the same passage of text at each level; “… John Smith at 555-1212 …” becomes “… <Name> at <Phone> …” after Level 1 and “… <PersonPhone> …” after Level 2]
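The three levels of the example can be sketched in a few lines of Python. This is only an illustration of the cascading idea, not CPSL, JAPE or RAP: the tokenizer, the two-word name dictionary and the function names are assumptions made for the sketch.

```python
import re

NAME_WORDS = {"John", "Smith"}            # assumed stand-in for the name dictionary
PHONE = re.compile(r"[1-9]\d{2}-\d{4}")   # phone pattern from the slide

def tokenize(text):
    """Level 0: split text into word-like tokens and punctuation."""
    return re.findall(r"[\w-]+|[^\w\s]", text)

def level1(tokens):
    """Level 1: label runs of name words as Name and phone-shaped tokens as Phone."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NAME_WORDS:
            j = i
            while j < len(tokens) and tokens[j] in NAME_WORDS:
                j += 1
            out.append(("Name", " ".join(tokens[i:j])))
            i = j
        elif PHONE.fullmatch(tokens[i]):
            out.append(("Phone", tokens[i]))
            i += 1
        else:
            out.append(("Token", tokens[i]))
            i += 1
    return out

def level2(annots):
    """Level 2: rewrite the sequence Name, 'at', Phone into one PersonPhone."""
    out, i = [], 0
    while i < len(annots):
        window = annots[i:i + 3]
        if (len(window) == 3 and window[0][0] == "Name"
                and window[1] == ("Token", "at") and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(text for _, text in window)))
            i += 3
        else:
            out.append(annots[i])
            i += 1
    return out

def run_grammar(text):
    # Each level makes its own full pass over the output of the previous one.
    return level2(level1(tokenize(text)))
```

Calling `run_grammar("sed John Smith at 555-1212 hendrerit")` yields a `PersonPhone` annotation for "John Smith at 555-1212" surrounded by plain tokens. Every level walks the whole sequence again, which is the cost the talk comes back to later.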
© 2008 IBM Corporation17
Experiences with Cascading Grammars
Benefits
– Big step forward from custom code
– Can express many simple concepts
Drawbacks
– Expressiveness
• Dealing with overlap
• Building complex structures
– Performance
© 2008 IBM Corporation18
Sequencing Overlapping Input Annotations
Example rule from the Band Review: <ProperNoun> <within 30 characters> <Instrument>
– ProperNoun: regular expression match, <[A-Z]\w+(\s[A-Z]\w+)?>
– Instrument: dictionary match, <d1|d2|…dn>
Overlapping input annotations
– “John Pipe plays the guitar”: ProperNoun “John Pipe”; Instrument “Pipe” and “guitar”
– “Marco Doe on the Hammond organ”: ProperNoun “Marco Doe” and “Hammond”; Instrument “Hammond organ”
© 2008 IBM Corporation19
Sequencing Overlapping Input Annotations
Possible options
– Pre-specified disambiguation rules (e.g., pick earlier annotation)
– Supply tie-breaking rules for every possible overlap scenario
– Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ...)
Which of the two should we pick?
“John Pipe plays the guitar”
– ProperNoun Token Token Instrument
– Token Instrument Token Token Instrument
“Marco Doe on the Hammond organ”
– ProperNoun Token Token Instrument
– ProperNoun Token Token ProperNoun Token
Prefer ProperNoun over Instrument
Over 4.5M blog entries, a choice one way or another on a single rule would change the number of annotations by +/- 25%.
There is no magic!
© 2008 IBM Corporation20
Complex Structures
Example: Signature Annotator
Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs
Annotations: Person, Organization, Phone, URL
Constraints
– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL}
– End with one of these
– Within 50 tokens
© 2008 IBM Corporation21
Complex Structures: Existing Solutions
Approximate using regular expressions
Example: Signature
– Rule: (Person Token{,25} Phone (Token{,25} Contact)+) | (Person (Token{,25} Contact)+ Token{,25} Phone (Token{,25} Contact)*)
– Problems:
• Need to enumerate all possible orders of sub-annotations
– What if you want at least one phone and one email?
• Does not restrict total token count
© 2008 IBM Corporation22
Performance
Level 0: tokenize
Level 1: Token[~ “John | Smith | …”]+ → Name; Token[~ “[1-9]\d{2}-\d{4}”] → Phone
Level 2: Name Token[~ “at”] Phone → PersonPhone
[Figure: the same document text is rewritten at every level, from “John Smith at 555-1212” to “<Name> at <Phone>” to “<PersonPhone>”]
Each level in a cascading grammar looks at each character in each document
Dawn of Declarative Information Extraction
© 2008 IBM Corporation24
System-T Architecture
– AQL Language: specify annotator semantics declaratively
– Optimizer: choose an efficient execution plan that implements semantics
– Operator Runtime
– Annotation Algebra
© 2008 IBM Corporation25
Declarative Information Extraction: AQL
SQL-like language for defining annotators
Declarative
– Define basic patterns and the relationships between them
– Let the system worry about the order of operations
© 2008 IBM Corporation26
AQL Example
    select CombineSpans(name.match, instrument.match) as annot
    from Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
         Dictionary("instr.dict", DocScan.text) instrument
    where Follows(0, 30, name.match, instrument.match);
Rule: <ProperNoun> <within 30 characters> <Instrument>
– ProperNoun: regular expression match
– Instrument: dictionary match
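To make the meaning of the query concrete, here is a rough Python emulation. It is a sketch of what the query asks for, not of how System T evaluates it: the helper names mirror the AQL functions but are written from scratch, the three-entry list stands in for the contents of instr.dict (which the slides do not show), and case-insensitive whole-word dictionary matching is an assumption.

```python
import re
from typing import NamedTuple

class Span(NamedTuple):
    begin: int  # character offset of the first character
    end: int    # character offset one past the last character

NAME_RE = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")   # regex from the AQL example
INSTRUMENTS = ["pipe", "guitar", "Hammond organ"]  # assumed dictionary entries

def regex_spans(pattern, text):
    """Stand-in for Regex(...): one span per regular expression match."""
    return [Span(m.start(), m.end()) for m in pattern.finditer(text)]

def dictionary_spans(entries, text):
    """Stand-in for Dictionary(...): whole-word, case-insensitive matches."""
    spans = []
    for entry in entries:
        pattern = r"\b" + re.escape(entry) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            spans.append(Span(m.start(), m.end()))
    return sorted(spans)

def follows(min_gap, max_gap, first, second):
    """True if `second` starts between min_gap and max_gap characters after `first` ends."""
    return min_gap <= second.begin - first.end <= max_gap

def combine_spans(first, second):
    """Smallest span covering both inputs, assuming `first` comes first."""
    return Span(first.begin, second.end)

def band_review(text):
    names = regex_spans(NAME_RE, text)
    instruments = dictionary_spans(INSTRUMENTS, text)
    # The join keeps every (name, instrument) pair that satisfies the predicate.
    return [combine_spans(n, i) for n in names for i in instruments
            if follows(0, 30, n, i)]
```

For "John Pipe plays the guitar" the dictionary also matches "Pipe", but that span overlaps the name instead of following it, so the only result covers the whole phrase. No tie-breaking rule is needed: both overlapping annotations exist side by side, and the join predicate decides which pairs survive.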
© 2008 IBM Corporation27
Annotation Algebra
Each operator in the algebra…
– …operates on one or more tuples of annotations
– …produces tuples of annotations
“Document at a time” execution model
– Algebra expression is defined over
• the current document d
• annotations defined over d
Algebra expression is evaluated over each document in the corpus individually
© 2008 IBM Corporation28
Basic Single-Argument Operator
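[Diagram: an operator, configured by parameters, takes an input tuple containing the Document and produces output tuples; Output Tuple 1 holds the Document and Annotation 1, Output Tuple 2 holds the Document and Annotation 2]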
© 2008 IBM Corporation29
Comparison with Cascading Grammars
Grammar
– Input: …John Smith at 555-1212…
– Apply Name Rule, Apply Phone Rule: …<Name> at <Phone>…
– Apply PersonPhone: …<PersonPhone>…
Algebra
– Input: …John Smith at 555-1212…
– Dictionary: John, Smith; Block: John Smith
– Regex: 555-1212
– Join: John Smith at 555-1212
Fewer passes over the documents
© 2008 IBM Corporation30
Revisit Problem of Sequencing Annotations
– “John Pipe plays the guitar”: ProperNoun “John Pipe”; Instrument “Pipe” and “guitar”
– “Marco Benevento on the Hammond organ”: ProperNoun “Marco Benevento” and “Hammond”; Instrument “Hammond organ”
© 2008 IBM Corporation31
Rule: <ProperNoun> <within 30 characters> <Instrument> (ProperNoun: regular expression match; Instrument: dictionary match)
Algebra expression for the rule from Band Review (Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008)
– Regular expression → ProperNoun
– Dictionary → Instrument
– Join: ProperNoun followed within 30 characters by Instrument
© 2008 IBM Corporation32
Input: “John Pipe plays the guitar” and “Marco Benevento on the Hammond organ”
– Regex output (ProperNoun): John Pipe, Marco Benevento, Hammond
– Dictionary output (Instrument): Pipe, guitar, Hammond organ
– Join output (ProperNoun <0-30 chars> Instrument): (John Pipe, guitar), (Marco Benevento, Hammond organ)
Each tuple also carries the document (doc).
© 2008 IBM Corporation33
How is aggregation handled?
Signature example: Laura Haas, PhD / Distinguished Engineer and Director, Computer Science / Almaden Research Center / 408-927-1700 / http://www.almaden.ibm.com/cs
Annotations: Person, Organization, Phone, URL
Constraints: start with Person; at least 1 Phone; at least 2 of {Phone, Organization, URL}; end with one of these; within 50 tokens
© 2008 IBM Corporation34
Back to signature
[Plan diagram: Union of the Organization, Phone and URL annotations → Block → Join with Person → Signature]
Cleaner and potentially faster
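The signature constraints can be checked directly on annotation tuples instead of being unrolled into a regular expression. The Python sketch below is an assumption-laden illustration, not the System T plan: annotations are given as token offsets, "at least 2 of {Phone, Organization, URL}" is read as at least two contact annotations, and the function and class names are invented for the example.

```python
from typing import NamedTuple

class Annot(NamedTuple):
    kind: str   # "Person", "Organization", "Phone" or "URL"
    begin: int  # offset of the first token
    end: int    # offset one past the last token

CONTACT_KINDS = {"Organization", "Phone", "URL"}

def find_signatures(annots, max_tokens=50):
    persons = [a for a in annots if a.kind == "Person"]
    # Union: treat Organization, Phone and URL as one stream of contact annotations.
    contacts = sorted((a for a in annots if a.kind in CONTACT_KINDS),
                      key=lambda a: a.begin)
    signatures = []
    for person in persons:
        # Keep contacts that follow the person and fit in the token window.
        block = [c for c in contacts
                 if c.begin >= person.end and c.end - person.begin <= max_tokens]
        has_phone = any(c.kind == "Phone" for c in block)
        if len(block) >= 2 and has_phone:
            # The signature starts with the person and ends with the last contact.
            signatures.append((person.begin, max(c.end for c in block)))
    return signatures
```

The order of the contact annotations never has to be enumerated, and the token limit is a single comparison, which addresses both problems listed for the regular expression approximation.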
© 2008 IBM Corporation35
Performance
Performance issues with grammars
– Complete pass through tokens for each rule
– Many of these passes are wasted work
Dominant approach: Make each pass go faster
– Doesn’t solve root problem!
Algebraic approach: Build a query optimizer!
© 2008 IBM Corporation36
Optimizations
Query optimization is a familiar topic in databases
What’s different in text?
– Operations over sequences and texts
– Document boundaries
– Costs concentrated in extraction operators (dictionary, regular expression)
Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements
© 2008 IBM Corporation37
Optimization Example
Rule: <ProperNoun> <within 30 characters> <Instrument>
[Figure: in the text “… John Pipe played the guitar. …”, the regex match “John Pipe” and the dictionary match “guitar” are 0-30 characters apart]
© 2008 IBM Corporation38
Classic Query Optimization
Rule: <ProperNoun> <Instrument> (followed within 30 characters)
– Plan A: Join of <ProperNoun> and <Instrument>
– Plan B: for each <ProperNoun>, find <Instrument> within 30 characters (consider text to the right)
– Plan C: for each <Instrument>, find <ProperNoun> within 30 characters (consider text to the left)
© 2008 IBM Corporation39
Example of Text-Specific Optimization: Conditional Evaluation (CE)
– Leverage document-at-a-time processing
– Don’t evaluate the inner operand of a join if the outer has no results
– Costing plans is challenging
Example: “…John Smith at 555-1212…”; Dictionary → John Smith; Regex → 555-1212; CEJoin → John Smith at 555-1212
Don’t evaluate this Regex when there are no dictionary matches.
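A minimal Python sketch of the idea follows, assuming a one-entry name dictionary as the outer operand and the phone regex as the inner. The `stats` counter, the function names and the 10-character gap are illustrative choices, not part of System T.

```python
import re

NAMES = ["John Smith"]                    # assumed dictionary (outer operand)
PHONE = re.compile(r"[1-9]\d{2}-\d{4}")   # regex (inner operand)
stats = {"regex_runs": 0}                 # counts how often the inner is evaluated

def dictionary_matches(text):
    """Outer operand: (begin, end) of every dictionary entry found in text."""
    spans = []
    for name in NAMES:
        pos = text.find(name)
        while pos != -1:
            spans.append((pos, pos + len(name)))
            pos = text.find(name, pos + 1)
    return spans

def regex_matches(text):
    """Inner operand: (begin, end) of every phone number found in text."""
    stats["regex_runs"] += 1
    return [m.span() for m in PHONE.finditer(text)]

def ce_join(text, max_gap=10):
    outer = dictionary_matches(text)
    if not outer:
        return []  # conditional evaluation: the regex never runs on this document
    inner = regex_matches(text)
    return [text[n[0]:p[1]] for n in outer for p in inner
            if 0 <= p[0] - n[1] <= max_gap]
```

Because execution is one document at a time, an empty outer result for a document means the join is empty for that document, so skipping the inner operand cannot change the answer.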
© 2008 IBM Corporation40
Experimental Results (Band Review Annotator)
[Bar chart: Annotator Running Time; y-axis: running time (sec), 0 to 30000; bars: GRAMMAR, ALGEBRA (Baseline), ALGEBRA (Optimized); callouts: classical query optimization, text-specific optimizations]
© 2008 IBM Corporation41
IOPES: Extracting Relationships and Composite Entities
IOPES = IBM Omnifind Personal Email Search
– Extract entities such as email address, URL
– Associations such as name ↔ phone number
– Complex entities like conference schedules, directions, signature blocks
© 2008 IBM Corporation42
Thank you!
For more information…
– Try out IOPES
• http://www.alphaworks.ibm.com/tech/emailsearch
– Avatar Project home page
• http://almaden.ibm.com/cs/projects/avatar/
– Contact me
• yunyaoli@us.ibm.com
Backup Slides
© 2008 IBM Corporation44
Block Operator ()
[Diagram: several Input annotations in a passage of text are grouped into one Block]
– Constraint on distance between inputs
– Constraint on number of inputs
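One way to read the two constraints is sketched below in Python. The exact semantics of the System T Block operator are not given on the slide, so this version makes assumptions: inputs are non-overlapping (begin, end) spans, a block is a maximal run of spans whose consecutive gaps are at most `max_gap`, and only runs with at least `min_count` members are reported.

```python
def block(spans, max_gap, min_count):
    """Group (begin, end) spans into maximal runs of nearby spans.

    A run continues while the gap to the previous span is at most max_gap;
    a run is reported as one (begin, end) block if it has min_count or more members.
    """
    blocks, run = [], []
    for span in sorted(spans):
        if run and span[0] - run[-1][1] > max_gap:
            # Gap too large: close the current run and start a new one.
            if len(run) >= min_count:
                blocks.append((run[0][0], run[-1][1]))
            run = []
        run.append(span)
    if len(run) >= min_count:
        blocks.append((run[0][0], run[-1][1]))
    return blocks
```

For example, `block([(0, 5), (8, 12), (15, 20), (100, 105)], max_gap=5, min_count=2)` groups the first three spans into one block and drops the isolated fourth span.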
© 2008 IBM Corporation45
Conditional Evaluation (CE)
Leverage document-at-a-time processing
Don’t evaluate the inner operand of a join if the outer has no results
Costing plans is challenging
Example: “…John Smith at 555-1212…”; Dictionary → John Smith; Regex → 555-1212; CEJoin → John Smith at 555-1212
Don’t evaluate this Regex when there are no dictionary matches.
© 2008 IBM Corporation46
Restricted Span Evaluation
– Leverage the sequential nature of text
– Only evaluate the inner on the relevant portions of the document
– Limited applicability (compared with CE)
• Only certain operands and predicates
Example: “…John Smith at 555-1212…”; Regex → 555-1212; Dictionary → John Smith; RSEJoin → John Smith at 555-1212
Only look for dictionary matches in the vicinity of a phone number.
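The following Python sketch shows the shape of such a join, with the phone regex as the outer operand and a name dictionary as the inner. The dictionary entries, the 10-character gap and the function name are assumptions for illustration; the real operator works inside the System T runtime.

```python
import re

NAMES = ["John Smith", "Laura Haas"]      # assumed dictionary (inner operand)
PHONE = re.compile(r"[1-9]\d{2}-\d{4}")   # regex (outer operand)
LONGEST_NAME = max(len(name) for name in NAMES)

def rse_join(text, max_gap=10):
    """Pair each phone number with names that end at most max_gap characters before it."""
    pairs = []
    for phone in PHONE.finditer(text):
        # Only this window can contain a name that satisfies the join predicate.
        lo = max(0, phone.start() - max_gap - LONGEST_NAME)
        window = text[lo:phone.start()]
        for name in NAMES:
            pos = window.find(name)
            while pos != -1:
                end = lo + pos + len(name)
                if phone.start() - end <= max_gap:
                    pairs.append((name, phone.group()))
                pos = window.find(name, pos + 1)
    return pairs
```

The dictionary is never run over the full document, only over a short window to the left of each phone number, so documents with few phone numbers cost very little.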
© 2008 IBM Corporation47
Implementing Restricted Span Evaluation (RSE)
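– RSE join operator
– RSE extraction operator
– Pass join bindings down to the inner of a join
– Requires special physical operators at edges of plan
[Diagram: an RSEJoin with predicate p(s1, s2) over an outer input R1 (producing s1) and an inner Dict(D, s2); the join passes each s1 binding down to the RSE Dictionary Operator (dictionary D, predicate p), which returns the s2’s that satisfy p(binding, s2)]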
© 2008 IBM Corporation48
RSE Dictionary Operator
RSE version of an operator must produce the exact same answer
– Ongoing work: RSE Regular Expression operator
[Figure: to find dictionary matches that end in a given range of the text, the operator needs to examine that range extended to the left by the length of the longest dictionary entry]
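The "exact same answer" requirement can be checked on a small example. In this Python sketch the dictionary entries and function names are assumptions; `rse_dict_matches` scans only the requested range plus a left margin as long as the longest entry, and should agree with filtering a full scan.

```python
DICTIONARY = ["guitar", "Hammond organ", "organ"]  # assumed entries
LONGEST = max(len(entry) for entry in DICTIONARY)

def dict_matches(text, offset=0):
    """All (begin, end) matches of any entry in text, shifted by offset."""
    spans = set()
    for entry in DICTIONARY:
        pos = text.find(entry)
        while pos != -1:
            spans.add((offset + pos, offset + pos + len(entry)))
            pos = text.find(entry, pos + 1)
    return spans

def rse_dict_matches(text, lo, hi):
    """Matches whose end offset lies in [lo, hi], scanning only a slice of text."""
    # A match ending at or after lo cannot begin before lo - LONGEST.
    start = max(0, lo - LONGEST)
    found = dict_matches(text[start:hi], offset=start)
    return {span for span in found if lo <= span[1] <= hi}
```

Comparing `rse_dict_matches(text, lo, hi)` against `{s for s in dict_matches(text) if lo <= s[1] <= hi}` for every range of a test string is a cheap way to confirm that the restricted version loses no matches.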
© 2008 IBM Corporation49
Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007)
[Diagram labels: Regular Expressions and Custom Code; Cascading Grammars; CPSL, AFST; UIMA, GATE; Workflows; System T; DBLife]
DBLife: in the context of Project Cimple. Search for “cimple wisc”
© 2008 IBM Corporation50
Delving deeper into System T versus DBLife
[Diagram comparing System T and DBLife; techniques shown: Restricted Span Evaluation, Shared Dictionary Matching, Conditional Evaluation, Pushing Down Text Properties, Scoping Extractions, Pattern Matching]
© 2008 IBM Corporation51
Cascading Grammar Reality
Set of simple grammar rules for person name recognition
Pre-processing step: Tokenization of the document text
– Tokenize(Document Text) → Sequence of <Token>
Level 1: Rules that look for patterns in each token to produce corresponding annotations
– Token[~ “Mr. | Mrs. | Dr. | …”] → Salutation
– Token[~ “Ph.D | MBA | …”] → Qualification
– Token[~ “[A-Z][a-z]*”] → CapsWord
– Token[~ “Michael | Richard | Smith | …”] → PersonDict
Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons
– PersonDict PersonDict → Person (e.g., Richard Smith)
– Salutation CapsWord CapsWord → Person (e.g., Dr. Laura Haas)
– CapsWord CapsWord Token[~“,”]? Qualification → Person (e.g., Laura Haas, Ph.D)
© 2008 IBM Corporation52
IOPES: Extracting Relationships and Composite Entities
IOPES = IBM Omnifind Personal Email Search
– Entities like addresses, person names
– Relationships like name ↔ phone number
– Complex entities like conference schedules, directions, signature blocks
© 2008 IBM Corporation53
Extracting Entities in Notes 8.01 Live Text
Leverages Information Extraction Techniques
Names, addresses, phone numbers…
Ships with Lotus Notes 8.01