Upload
truong
View
24
Download
0
Embed Size (px)
DESCRIPTION
Towards Declarative Information Extraction The Almaden Story Shivakumar Vaithyanathan IBM Almaden Acknowledgements to: Frederick Reiss, Rajasekar Krishnamurthy, Sriram Raghavan and HuaiyuZhu. Lots of Text, Many Applications!. Free-text, semi-structured, streaming … - PowerPoint PPT Presentation
Citation preview
© 2006 IBM Corporation
Towards Declarative Information ExtractionThe Almaden Story
Shivakumar VaithyanathanIBM Almaden
Acknowledgements to:Frederick Reiss, Rajasekar Krishnamurthy, Sriram Raghavan and
HuaiyuZhu
© 2007 IBM Corporation2
Lots of Text, Many Applications!
Free-text, semi-structured, streaming …– Web pages, emails, news articles, call-center records,
business reports, spreadsheets, research papers, blogs, wikis, tags, instant messages, …
High-impact applications– Business intelligence, personal information
management, enterprise search, Web communities, Web search and advertising, scientific data management, e-government, medical records management, …
Growing rapidly– Just look at your inbox!
(Adapted from SIGMOD ’06 tutorialby Ramakrishnan, Doan, and Vaithyanathan)
© 2007 IBM Corporation3
Information Extraction Distill structured data from unstructured and semi-structured text
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Name Title OrganizationBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman Founder Free Soft..
(from Cohen’s IE tutorial, 2003)
Select Name From PEOPLE Where Organization = ‘Microsoft’
Bill GatesBill Veghte
Exploit the extracted data in your applications
© 2007 IBM Corporation4
Historical Perspective MUC (Message Understanding Conference) – 1987 to 1997
– Competition-style conferences organized by DARPA– Shared data sets and performance metrics
• News articles, Radio transcripts, Military telegraphic messages
Classical IE Tasks– Entity and Relationship/Link extraction– Event detection, sentiment mining etc.– Entity resolution/matching
Several IE systems were built during this period– FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS
[Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
Not Focus of this talk
© 2007 IBM Corporation5
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)2007
2004
2005
2006
Large number of annotators
Diverse data sets, Complex extraction tasks
Performance, Expressivity
Project Avatar -Evolution of IE: The Almaden StoryEvolutionary triggers
© 2007 IBM Corporation6
From: Michael D. Baselice <[email protected]>
…. Enron is in search of support with the electric deregulation fight out there. Dick Woodward or Dave Fogarty can be reached at 650-340-0470. If you have any questions, please call. .…
From: Tom Briggs <[email protected]>
call Jeff Dasovich at 415-782-7822. He is an excellent source and has a long, sordid histroy in california's Energy market.
<Person> “(\s+at\s+)|((\w+\s+){1,3}reached\s+at\s+)” <PhoneNumber>
Circa 2004: Custom Code
Extract Person’s Phone Number
Example emails from the Enron collection
© 2007 IBM Corporation7
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)
Large number of annotators
Circa 2005: Moving to RAP
CPSL: Nice abstraction for specifying rules over annotations and text.
RAP: Almaden CPSL execution engine
As the number of annotators increased, custom code was not an option. We
needed a cleaner abstraction
RAP(CPSL-style cascading
grammar system)
© 2007 IBM Corporation8
Multiple levels of grammar rules to perform extraction
At each level, rules are written in the formalism of finite-state grammars– Input text viewed as a sequence of tokens– Rules expressed as regular expression patterns over the lexical features of
these tokens Common Pattern Specification Language (Appelt & Onyshkevych, 1998)
– A language specification to abstract rules and matching semantics from specific implementations
– Several known implementations• TextPro: reference implementation of CPSL by Doug Appelt• JAPE (Java Annotation Pattern Engine)
– Part of the GATE NLP framework– Under active consideration for commercial use by several companies
Cascading Grammars
© 2007 IBM Corporation9
Cascading Grammar Reality Set of simple grammar rules for person name recognition
PersonDict PersonDict Person
Salutation CapsWord CapsWord Person
CapsWord CapsWord Token[~“,”]? Qualification Person
Level 1: Rules that look for patterns in each token to produce corresponding annotations
Tokenize(Document Text) Sequence of <Token>
Token[~ “Mr. | Mrs. | Dr. | …”] Salutation
Token[~ “Ph.D | MBA | …”] Qualification
Token[~ “[A-Z][a-z]*”] CapsWord
Token[~ “Michael | Richard | Smith| …”] PersonDict
Richard Smith
Dr. Laura Haas
Laura Haas, Ph.D
Pre-processing step: Tokenization of the document text
Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons
© 2007 IBM Corporation10
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)
Applying RAP to more complex tasks
Complex tasks– Extracting signature from
emails
© 2007 IBM Corporation11
RAP : Problems encountered
1. Lack of Support for complex operations Aggregation, Dictionary evaluation, Combining
annotations with character-level regular expressions 2. Handling overlapping output annotations
Multiple rules for a concept may generate overlapping annotations
Sometimes, final results may overlap
3. Overlapping input annotations : Input to a grammar is a linear sequence of annotations
4. Performance Annotators for complex tasks are very expensive to execute
© 2007 IBM Corporation12
Complex operations
Laura Haas, PhDDistinguished Engineer and Director, Computer
ScienceAlmaden Research Center408-927-1700http://www.almaden.ibm.com/cs
Person
OrganizationPhone
URL
Person Organizati
onPhone
URL
At least 1 Phone
At least 2 of {Phone, Organization, URL, Email, Address}
End with one of these.
Start with Person
Within 250 characters
© 2007 IBM Corporation13
Approximate Signature Extraction using a Grammar
Macro: Contact = Phone|Organization|URL| Email|Address
Rule:
Person (.{,125} Contact){2,?} Signature
Problems:– Cannot guarantee at least 1 Phone.
• Leads to false positives– Cannot restrict limit on total character count of 250.
• Leads to both false positives and false negatives.
Expressing aggregations is hard in grammars
© 2007 IBM Corporation14
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)
Circa 2006: Moved to RAP++
Modules for aggregation (e.g.,
Count)
© 2007 IBM Corporation15
Grammar-based systems : Problems encountered1. Lack of Support for complex operations
Aggregation, Dictionary evaluation, Combining annotations with character-level regular expressions
2. Handling overlapping output annotations All overlapping annotations need to be retained Overlapping annotations need to be consolidated
3. Overlapping input annotations : Input to a CPSL grammar is a linear sequence of annotations
4. Performance Annotators for complex tasks are expensive to execute Takes 9 hours to execute Band Review annotator over 4.5 million blogs
© 2007 IBM Corporation16
Issue 2 – All overlapping annotations need to be retained
Valid results may overlap Dick Woodward or Dave Fogarty can be reached at 650-340-0470.
Work-around: Multiple rules
The number of rules can explode !
© 2007 IBM Corporation17
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)
Circa 2006: Moved to RAP++
Modules for aggregation (e.g.,
Count)
Workarounds to retain all
overlapping annotations
© 2007 IBM Corporation18
Overlapping annotations may need to be consolidated
… Dr. John Smith …
Consolidation cannot be expressed in Grammar
Two spans need to be merged
Salutation PersonDict
PersonDict PersonDict
© 2007 IBM Corporation19
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)
Circa 2006: Moved to RAP++
Modules for aggregation (e.g.,
Count)
Consolidation modules
Workarounds to generate
overlapping output
© 2007 IBM Corporation20
Apply RAP++ to diverse data-sets
Two severe problems surfaced– Due to complex workflows results largely dictated
by low-level decisions– Performance
© 2007 IBM Corporation21
…….I went to see the OTIS concert last night. T’ was SO MUCH FUN I really had a blast …
….there were a bunch of other bands …. I loved STAB (….). they were a really weird ska band and people were running around and …
Weblogs: Identify Bands and Reviews
© 2007 IBM Corporation22
Concert
Review
review pattern review pattern
review pattern
review patternreview pattern
Block
Concert instance pattern
© 2007 IBM Corporation23
BandReview
went to the Switchfoot concert at the Roxy. It was pretty fun,… The lead singer/guitarist was really good, and even though there was another guitarist (an Asian guy), he ended up playing most of the guitar parts, which was really impressive. The biggest surprise though is that I actually liked the opening bands. …I especially liked the first band
ReviewInstance
ReviewGroup
“lead singer/guitarist was really good”
“Liked the opening bands”
“Liked the first band”
“Kurt Ralske played guitar”
“Lead singer/guitarist was really good, and even … I actually liked the opening bands. … Well they were none of those. I especially liked the first band”
ConcertInstance
“went to the Switchfoot concert at the Roxy”
“went to AJCO n Band concert”
“performance by local funk band Saaraba”
Outline of the BandReview Extraction Task
© 2007 IBM Corporation24
Example low-level decision
ProperNoun Instrument ProperNoun
John Pipe plays the guitar Marco Benevento on the Hammond organ
Instrument
Instrument ProperNoun
<ProperNoun> <within 30 characters> <Instrument> Regular Expression Dictionary Match Match
<[A-Z]\w+(\s[A-Z]\w+)?> <d1|d2|…dn>
Example rule from the Band Review
© 2007 IBM Corporation25
Sequencing Overlapping Annotations
Possible options– Pre-specified disambiguation rules (e.g., pick earlier annotation)– Supply tie-breaking rules for every possible overlap scenario– Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..)
John Pipe plays the guitar
ProperNoun Instrument
Instrument John Pipe plays the guitarProperNoun Token Token Instrument
John Pipe plays the guitarToken Instrument Token Token Instrument
Which of the two should we pick?
Over 4.5M blog entries a choice one way or another on a single rule would change the number of annotations by +/- 25%.
© 2007 IBM Corporation26
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at <Phone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, John Smith at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti sociosqu ad litora
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <PersonPhone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti
Performance
Name Token[~ “at”] Phone PersonPhone
Token[~ “John | Smith| …”]+ Name
Token[~ “[1-9]\d{2}-\d{4}”] Phone
Each level in a cascading grammar looks at each character in each document
© 2007 IBM Corporation27
Custom Code
RAP(CPSL-style cascading
grammar system)
RAP++(RAP + Extensions outside
the scope of grammars)
System T(algebraic information
extraction system)
2007: System T
Scalable algebraic information extraction
system
© 2007 IBM Corporation28
Brief introduction to algebra for Intra-document IE
Each Operator in the algebra…– …operates on one or more tuples of annotations – …produces tuples of annotations
“Document at a time” execution model– Algebra expression is defined over
• the current document d • annotations defined over d
Algebra expression is evaluated over each document in the corpus individually
© 2007 IBM Corporation29
Basic Single-Argument Operator
Annotation 1
Operator
Output Tuple 1
Parameters
DocumentInput Tuple
Document
Annotation 2Output Tuple 2 Document
© 2007 IBM Corporation30
<ProperNoun> <within 30 characters> <Instrument> Regular Expression Dictionary Match Match
Algebra expression for the Rule from Band Review(Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008)
ProperNoun Instrument
(followed within 30 characters)
DictionaryRegular
expression
Join
© 2007 IBM Corporation31
Revisit Problem of Sequencing Annotations
ProperNoun Instrument ProperNoun
John Pipe plays the guitar Marco Benevento on the Hammond organ
Instrument
InstrumentProperNoun
© 2007 IBM Corporation32
DictionaryRegex
Join
John PipedocMarco Beneventodoc
Hammonddoc
docdoc
Pipeguitar
doc Hammond organ
ProperNoun Instrument ProperNoun
John Pipe plays the guitar Marco Benevento on the Hammond organ
Instrument
InstrumentProperNoun
John PipedocMarco Beneventodoc
guitarHammond organ
ProperNoun Instrument
ProperNoun <0-30 chars> Instrument
© 2007 IBM Corporation33
Custom code of RAP++ are also addressed in the Algebra
Interaction between custom modules and grammar hard to maintain
In the algebra world they are addressed more cleanly
– Aggregation– Consolidation
© 2007 IBM Corporation34
How is aggregation handled
Laura Haas, PhDDistinguished Engineer and Director, Computer
ScienceAlmaden Research Center408-927-1700http://www.almaden.ibm.com/cs
Person
OrganizationPhone
URL
Person Organizati
onPhone
URL
At least 1 Phone
At least 2 of {Phone, Organization, URL, Email, Address}
End with one of these.
Start with Person
Within 250 characters
© 2007 IBM Corporation35
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. In augue mi, scelerisque non, dictum non, vestibulum congue, erat. Donec non felis. Maecenas urna nunc, pulvinar et, fringilla a, porta at, diam. In iaculis dignissim erat. Quisque pharetra. Suspendisse cursus viverra urna. Aliquam erat volutpat. Donec quis sapien et metus molestie eleifend. Maecenas sit amet metus eleifend nibh semper fringilla. Pellentesque habitant morbi tristique senectus et netus et malesuada
Block Operator ()
Input InputInput
Input
© 2007 IBM Corporation36
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. In augue mi, scelerisque non, dictum non, vestibulum congue, erat. Donec non felis. Maecenas urna nunc, pulvinar et, fringilla a, porta at, diam. In iaculis dignissim erat. Quisque pharetra. Suspendisse cursus viverra urna. Aliquam erat volutpat. Donec quis sapien et metus molestie eleifend. Maecenas sit amet metus eleifend nibh semper fringilla. Pellentesque habitant morbi tristique senectus et netus et malesuada
Block Operator ()
Input InputInput
Input
Constraint on distance between inputs
Constraint on number of inputs
Blo
ck
© 2007 IBM Corporation37
Back to signature
Org Phone URL
Person
Join
Union
Organization PhoneURL
Organization PhoneURL
Person
Block Organization
PhoneURLPerson
Signature
Cleaner and potentially faster
© 2007 IBM Corporation38
Cost Modeling
Our experience: Extraction operators dominate– 90+ percent of execution time for most plans
Most of the costs that matter in databases don’t matter in information extraction
– I/O costs– Join costs– Sorting costs– Aggregation costs
© 2007 IBM Corporation39
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at <Phone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, John Smith at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti sociosqu ad litora
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <PersonPhone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti
And finally performance
Name Token[~ “at”] Phone PersonPhone
Token[~ “John | Smith| …”]+ Name
Token[~ “[1-9]\d{2}-\d{4}”]+ Phone
© 2007 IBM Corporation40
Why performance should improve
Apply Name Rule
Apply Phone Rule
Apply PersonPhone
…John Smith at 555-1212…
…<Name> at 555-1212…
…<Name> <Name> at <Phone>…
…<PersonPhone>…
…John Smith at 555-1212…
SmithJohn555-1212
John Smith at 555-1212
RAP
Dictionary Regex
Join
AlgebraSmaller number of passes over the data
© 2007 IBM Corporation41
More Optimization
The algebraic approach allows further performance improvements via query optimization
– Common technique in relational databases We extend the formalism to the text domain
– Text-specific algebraic transformations, cost model, and selectivity estimation
© 2007 IBM Corporation42
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elentum non ante. John Pipe played the guitar. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, arcu augue rutrum ve
Optimization Example
Regex match Dictionary match
0-30 characters
<ProperNoun> <within 30 characters> <Instrument>
© 2007 IBM Corporation43
<ProperNoun> <Instrument>
(Followed within 30 characters)
<ProperNoun>
Find <Instrument> within 30 characters
<Instrument>
Find <ProperNoun> within 30 characters
Consider text to the rightConsider text to the left
Plan B Plan C
Plan A
Join
© 2007 IBM Corporation44
Optimization Example
020406080
100120140160180
Running Time (sec)
Plan A Plan C
© 2007 IBM Corporation45
Improvements even on named-entity annotators
0
100
200
300
400
500
600
Running Time (sec)
RAP System-T
© 2007 IBM Corporation46
Full Band Review Annotator
Annotator Running Time
0
5000
10000
15000
20000
25000
30000
GRAMMAR ALGEBRA (Baseline) ALGEBRA (Optimized)
Run
ning
Tim
e (s
ec)
© 2007 IBM Corporation47
Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007)
Regular Expressions and
Custom Code
Cascading Grammars
CPSL, AFST UIMA, GATE
Workflows
System T DBLifeIn the context of Project Cimple.
Search for “cimple wisc”
© 2007 IBM Corporation48
Delving deeper into System T versus DBLife
Restricted Span
Evaluation
Shared Dictionary Matching
Conditional Evaluation
Pushing Down Text
Properties
Scoping Extractions
Pattern MatchingSy
stem
TD
BLife
© 2007 IBM Corporation49
We have started: AQL
SQL-like language for defining annotators Declarative
– Define basic patterns and the relationships between them
– Let the system worry about the order of operations
© 2007 IBM Corporation50
AQL Example
select CombineSpans(name.match, instrument.match) as annot, name.match as name, instrument.match as instrfrom Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name, Dictionary(“instr.dict”, DocScan.text) instrumentwhere Follows(0, 30, name.match, instrument.match);
<ProperNoun> <within 30 characters> <Instrument> Regular Expression Dictionary Match Match
© 2007 IBM Corporation51
AQL System Architecture
AQL Language
Optimizer
OperatorRuntime
Specify annotator semantics declaratively
Choose an efficient execution plan that implements semantics
© 2007 IBM Corporation52
Where we are ?
Preliminary indications that at least two (seemingly) different directions are coming together
How do applications interact with such a system ? What are the next steps ?
© 2006 IBM Corporation
Backup Slides