Ontology-Based Information Extraction and Structuring

Stephen W. Liddle†

School of Accountancy and Information Systems

Brigham Young University

Douglas M. Campbell, David W. Embley,‡ and Randy D. Smith

Research funded in part by †Faneuil Research and ‡Novell, Inc.

Copyright 1998

Motivation

Database-style queries are effective
  – Find red cars, 1993 or newer, < $5,000
    • Select * From Car Where Color=“red” And Year >= 1993 And Price < 5000

The Web is not a database
  – Uses keyword search
  – Retrieves documents, not records
  – Assuming we have a range operator:
    • “red” and (1993 to 1998) and (1 to 5000)

Solutions

Web query languages
Wait for XML to emerge
  – Interoperation/Standards?
  – XML query language?
Wrappers
  – Hand-written or semi-automatically generated parsers
  – Specific to source site, subject to change

Our Approach

Automatic wrapper generation
Based on application ontology
  – Augmented conceptual model
  – Defines constants, keywords, their relationships
Best for:
  – Narrow ontological breadth
  – Data-rich documents

Car-Ad Ontology

Object-Relationship Model + Data Frames

Graphical

[Figure: object-relationship diagram of the car-ad ontology. Object sets: Car, Year, Make, Model, Mileage, Price, PhoneNr, Extension, Feature. Relationship sets are labeled "has" and "is for"; participation constraints on the edges are 0..1, 0..*, and 1..*.]

Textual

Car [0:1] has Year [1:*];
Year {regexp[2]: “\d{2} : \b’\d{2}\b”, … };
Car [0:1] has Make [1:*];
Make {regexp[10]: “\bchev\b”, “\bchevy\b”, … };
Car [0:1] has Model [1:*];
Model {…};
Car [0:1] has Mileage [1:*];
Mileage {regexp[8]: “\b[1-9]\d{1,2}k”, “[1-9]\d?,\d{3} : [^\$\d][1-9]\d?,\d{3}[^\d]” }
        {context: “\bmiles\b”, “\bmi\.”, “\bmi\b”};
Car [0:*] has Feature [1:*];
Feature {regexp[20]:
    -- Colors
    “\baqua\s+metallic\b”, “\bbeige\b”, …
    -- Transmission
    “(5|6)\s*spd\b”, “auto : \bauto(\.|,)”,
    -- Accessories
    “\broof\s+rack\b”, “\bspoiler\b”, …
...

(See Figures 2 & 3 of Paper)
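Some data-frame entries in the textual ontology have the form “X : Y”. One plausible reading, assumed here and not confirmed by the slides, is "extract X wherever the surrounding text matches context Y". The sketch below applies that reading to the second Mileage pattern; the function name `extract_in_context` is hypothetical, and positions are reported 1-based and inclusive to match the data-record tables.

```python
import re

def extract_in_context(extract_pat, context_pat, text):
    """Return (string, start, end) for each value found inside a context match.

    Positions are 1-based and inclusive.
    """
    hits = []
    for ctx in re.finditer(context_pat, text):
        inner = re.search(extract_pat, ctx.group(0))
        if inner:
            start = ctx.start() + inner.start() + 1
            end = ctx.start() + inner.end()
            hits.append((inner.group(0), start, end))
    return hits

# Patterns copied from the Mileage data frame ("extract : context" reading assumed).
MILEAGE_EXTRACT = r"[1-9]\d?,\d{3}"
MILEAGE_CONTEXT = r"[^\$\d][1-9]\d?,\d{3}[^\d]"

ad = "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Asking only $11,995."
hits = extract_in_context(MILEAGE_EXTRACT, MILEAGE_CONTEXT, ad)
print(hits)
```

Under this reading the context pattern rejects $11,995 because of the leading dollar sign, so only 7,000 is proposed as a mileage by this particular pattern.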

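Fixed Processes

[Figure: system architecture. The Ontology Parser reads the Application Ontology and produces the Constant/Keyword Matching Rules, the List of Objects, Relationships, and Constraints, and the Database Scheme. The Constant/Keyword Recognizer applies the matching rules to the Unstructured Document and produces the Data-Record Table. The Database-Instance Generator takes the Data-Record Table, the List of Objects, Relationships, and Constraints, and the Database Scheme, and produces the Populated Database.]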

(See Figure 1 of Paper)

Ontology Parser

[Figure: architecture diagram with the Ontology Parser stage highlighted. Input: Application Ontology. Outputs: Constant/Keyword Matching Rules; List of Objects, Relationships, and Constraints; Database Scheme.]

Constant/Keyword Matching Rules:

Make : \bchev\b
…
KEYWORD(Mileage) : \bmiles\b
KEYWORD(Mileage) : \bmi\.
...

Database Scheme:

create table Car ( Car integer, Year varchar(2), … );
create table CarFeature ( Car integer, Feature varchar(10));
...

List of Objects, Relationships, and Constraints:

Object: Car;
...
Car: Year [0:1];
Car: Make [0:1];
…
CarFeature: Car [0:*] has Feature [1:*];
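The sample output suggests that attributes Car participates with at most once ([0:1]) become columns of the Car table, while a many-valued attribute such as Feature ([0:*]) gets its own table keyed by Car. A minimal sketch of that mapping follows, assuming this reading is right; `build_scheme`, the dictionary encoding of constraints, and the Make column width are all hypothetical.

```python
def build_scheme(entity, max_participation, widths):
    """Generate create-table statements from maximum participation constraints.

    max_participation maps attribute -> "1" (functional) or "*" (nonfunctional).
    """
    functional = [a for a, mx in max_participation.items() if mx == "1"]
    nonfunctional = [a for a, mx in max_participation.items() if mx == "*"]
    columns = [f"{entity} integer"] + [
        f"{a} varchar({widths[a]})" for a in functional
    ]
    statements = [f"create table {entity} ( {', '.join(columns)} );"]
    # Each many-valued attribute gets a separate table keyed by the entity.
    for a in nonfunctional:
        statements.append(
            f"create table {entity}{a} ( {entity} integer, {a} varchar({widths[a]}) );"
        )
    return statements

constraints = {"Year": "1", "Make": "1", "Feature": "*"}
widths = {"Year": 2, "Make": 10, "Feature": 10}  # Make width is an assumption
for stmt in build_scheme("Car", constraints, widths):
    print(stmt)
```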

Constant/Keyword Recognizer

[Figure: architecture diagram with the Constant/Keyword Recognizer stage highlighted. Inputs: Unstructured Document; Constant/Keyword Matching Rules. Output: Data-Record Table.]

Unstructured Document:

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800

Data-Record Table:

Descriptor/String/Position(start/end)
Year|97|1|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|108|114
PhoneNr|566-3800|146|153
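The recognizer step can be sketched as a loop over matching rules that records each hit with its descriptor and position. This is only an illustration under stated assumptions: the Make, "5 spd", and KEYWORD(Mileage) patterns echo the slides, while the Model, color, and simplified Mileage patterns, as well as the name `recognize`, are hypothetical. Positions are 1-based and inclusive, as in the table.

```python
import re

# (descriptor, regex) pairs; entries marked "hypothetical" are not from the slides.
RULES = [
    ("Make", r"\bchev\b"),
    ("Model", r"\bcavalier\b"),            # hypothetical
    ("Feature", r"\bred\b"),               # hypothetical
    ("Feature", r"(5|6)\s*spd\b"),
    ("Mileage", r"\b[1-9]\d?,\d{3}\b"),    # simplified, hypothetical
    ("KEYWORD(Mileage)", r"\bmiles\b"),
]

def recognize(text, rules=RULES):
    """Return (descriptor, string, start, end) rows ordered by position."""
    rows = []
    for descriptor, pattern in rules:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            rows.append((descriptor, m.group(0), m.start() + 1, m.end()))
    rows.sort(key=lambda r: (r[2], r[3]))
    return rows

ad = "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her."
rows = recognize(ad)
for row in rows:
    print("|".join(str(field) for field in row))
```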

Database-Instance Generator

[Figure: architecture diagram with the Database-Instance Generator stage highlighted. Inputs: Data-Record Table; List of Objects, Relationships, and Constraints; Database Scheme. Output: Populated Database.]

insert into Car values(1001, “97”, “CHEV”, “Cavalier”, “7,000”, “11,995”, “566-3800”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)
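A rough sketch of the final step, using Python's built-in `sqlite3` module as a stand-in database. The column order follows the sample insert statement; the column widths beyond Year and Feature, the in-memory database, and the helper name `populate` are assumptions. The input is taken to be descriptor/string pairs that the heuristics have already filtered.

```python
import sqlite3

def populate(conn, car_id, pairs):
    """Insert one car record and its features from (descriptor, string) pairs."""
    single = {}
    features = []
    for descriptor, string in pairs:
        if descriptor == "Feature":
            features.append(string)
        else:
            single[descriptor] = string
    columns = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]
    conn.execute(
        "insert into Car values (?, ?, ?, ?, ?, ?, ?)",
        [car_id] + [single.get(c) for c in columns],
    )
    conn.executemany(
        "insert into CarFeature values (?, ?)",
        [(car_id, f) for f in features],
    )

conn = sqlite3.connect(":memory:")
# Column widths other than Year and Feature are assumptions.
conn.execute(
    "create table Car (Car integer, Year varchar(2), Make varchar(10), "
    "Model varchar(10), Mileage varchar(10), Price varchar(10), PhoneNr varchar(10))"
)
conn.execute("create table CarFeature (Car integer, Feature varchar(10))")

pairs = [
    ("Year", "97"), ("Make", "CHEV"), ("Model", "Cavalier"),
    ("Feature", "Red"), ("Feature", "5 spd"), ("Mileage", "7,000"),
    ("Price", "11,995"), ("PhoneNr", "566-3800"),
]
populate(conn, 1001, pairs)
print(conn.execute("select * from Car").fetchall())
print(conn.execute("select * from CarFeature order by rowid").fetchall())
```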

Heuristics

Keyword proximity
Subsumed and overlapping constants
Functional relationships
Nonfunctional relationships
First occurrence without constraint violation

Keyword Proximity

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
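In this table both 7,000 and 11,995 are candidate Mileage values, and the keyword "miles" sits right after 7,000. A minimal sketch of a proximity rule, assuming the heuristic prefers the candidate closest to a keyword for the same object set; the function name and the gap measure are hypothetical.

```python
def pick_by_keyword(candidates, keywords):
    """Pick the candidate span nearest to any keyword span.

    candidates and keywords are lists of (string, start, end) tuples.
    """
    def gap(c, k):
        # Characters separating the two spans; 0 when they touch or overlap.
        return max(k[1] - c[2], c[1] - k[2], 0)

    return min(candidates, key=lambda c: min(gap(c, k) for k in keywords))

mileage_candidates = [("7,000", 37, 41), ("11,995", 101, 106)]
mileage_keywords = [("miles", 43, 47)]
chosen = pick_by_keyword(mileage_candidates, mileage_keywords)
print(chosen)
```

With 7,000 claimed as the Mileage, 11,995 remains available as the Price.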

Subsumed/Overlapping Constants

Make|CHEV|5|8
Make|CHEVROLET|5|13
Model|Cavalier|15|22
Feature|Red|25|27
Feature|5 spd|30|34
Mileage|7,000|42|46
KEYWORD(Mileage)|miles|48|52
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEVROLET Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
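Here CHEV (5 to 8) lies entirely inside CHEVROLET (5 to 13). One way to handle this, sketched below as an assumption and not as the authors' exact rule, is to drop any row whose span is contained in a strictly longer span. Rows with identical spans, such as the Price and Mileage readings of 11,995, are left for the other heuristics.

```python
def drop_subsumed(rows):
    """Remove rows whose span lies inside a strictly longer span.

    rows are (descriptor, string, start, end) tuples.
    """
    kept = []
    for r in rows:
        subsumed = any(
            o is not r
            and o[2] <= r[2] and r[3] <= o[3]      # o's span contains r's span
            and (o[3] - o[2]) > (r[3] - r[2])      # and o is strictly longer
            for o in rows
        )
        if not subsumed:
            kept.append(r)
    return kept

rows = [
    ("Make", "CHEV", 5, 8),
    ("Make", "CHEVROLET", 5, 13),
    ("Model", "Cavalier", 15, 22),
    ("Price", "11,995", 101, 106),
    ("Mileage", "11,995", 101, 106),
]
kept = drop_subsumed(rows)
print(kept)
```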

Functional Relationships

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800

Nonfunctional Relationships

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800

First Occurrence without Constraint Violation

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
PhoneNr|566-3802|149|156

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800, 566-3802
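The functional, nonfunctional, and first-occurrence heuristics can be illustrated together. The sketch assumes that an object set Car participates with at most once keeps only its first value in document order, while a many-valued set such as Feature keeps every value. The constraint table is a hypothetical encoding: Year, Make, Model, Mileage, and Feature follow the textual ontology, and treating Price and PhoneNr as single-valued is inferred from the sample insert statement.

```python
# Maximum participation of Car per object set: "1" = functional, "*" = nonfunctional.
MAX_PER_CAR = {
    "Year": "1", "Make": "1", "Model": "1", "Mileage": "1",
    "Price": "1", "PhoneNr": "1",   # assumed from the sample insert statement
    "Feature": "*",
}

def select_values(rows, max_per_car=MAX_PER_CAR):
    """Map each descriptor to the values kept for one car.

    rows are (descriptor, string, start, end) tuples.
    """
    chosen = {}
    for descriptor, string, start, end in sorted(rows, key=lambda r: r[2]):
        values = chosen.setdefault(descriptor, [])
        if max_per_car.get(descriptor) == "*" or not values:
            values.append(string)
        # Otherwise a second value would violate the at-most-one constraint.
    return chosen

rows = [
    ("Feature", "Red", 20, 22),
    ("Feature", "5 spd", 25, 29),
    ("PhoneNr", "566-3800", 140, 147),
    ("PhoneNr", "566-3802", 149, 156),
]
chosen = select_values(rows)
print(chosen)
```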

Recall & Precision

$$\text{Recall} = \frac{C}{N}$$

(of facts available, how many did we find?)

$$\text{Precision} = \frac{C}{C + I}$$

(of facts retrieved, how many were relevant?)

N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
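As a quick check of the two ratios, here is a small sketch using the definitions of N, C, and I given on this slide. The counts in the example are made up for illustration and are not taken from the experiments.

```python
def recall(c, n):
    """Fraction of the facts in the source that were declared correctly."""
    return c / n

def precision(c, i):
    """Fraction of the declared facts that were correct."""
    return c / (c + i)

# Hypothetical counts: 100 facts in the source, 91 found correctly, 1 found incorrectly.
N, C, I = 100, 91, 1
print(f"recall = {recall(C, N):.2%}, precision = {precision(C, I):.2%}")
```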

Experimental Results
Salt Lake Tribune

Tuning set: 100
Test set: 116

             Recall %   Precision %
Year            100         100
Make             97         100
Model            82         100
Mileage          90         100
Price           100         100
PhoneNr          94         100
Extension        50         100
Feature          91          99

(See Table 1 of Paper)

Trouble Spots

Unbounded sets
  – missed: MERC, Town Car, 98 Royale
  – could use lexicon of makes and models
Unspecified variation in lexical patterns
  – missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – could adjust lexical patterns
Misidentification of attributes
  – classified AUTO in AUTO SALES as automatic transmission
  – could adjust exceptions in lexical patterns
Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – could look for spelling variations and common typos
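One possible way to "look for spelling variations" is fuzzy matching against a lexicon of makes. The sketch below uses `difflib.get_close_matches` from the Python standard library; the lexicon, the 0.8 cutoff, and the helper name `normalize_make` are assumptions, and this is not something the slides say was implemented.

```python
import difflib

# Hypothetical lexicon of makes; a real one would be much larger.
MAKES = ["chevrolet", "chrysler", "dodge", "ford"]

def normalize_make(token, lexicon=MAKES, cutoff=0.8):
    """Return the closest lexicon entry for a possibly misspelled make, or None."""
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_make("Chrystler"))
print(normalize_make("Cavalier"))
```

A typo such as “Chrystler” maps back to a known make, while a token that is not close to any entry is left unmatched. Run-together errors like “DODG ENeon” would need a different repair step.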

Contributions

Fully automatic technique for wrapper generation

Uses syntactic, not semantic constant-recognition techniques

Adapts readily to different unstructured document formats

Good precision & recall ratios

Implemented (Perl, C++, Lex/Yacc, Java)

Limitations

Works best for data-rich documents, narrow ontological domains

Ontology creation is still manual
  – Domain expert
  – Trained in our conceptual model & tools

Future Work

Graphical ontology editor
Improve automatic record-boundary recognition
  – Make suitable for broader domains (obituaries, university catalog, etc.)
Improve heuristics
  – Use a declarative language
  – Employ more of OSM’s rich constraints

Future Work (cont.)

Add operations to data frames
  – General constraints
  – Canonical representations
  – Inferred information
Develop ontology libraries
Finish porting to 100% Java
Incorporate learning/feedback
Ontology-enabled agents

Our Web Site

I have a demo on my laptop
Can download from our Web site
BYU Data Extraction Group

http://osm7.cs.byu.edu/deg

(See Reference 13 of Paper)

Other Domains

Job Listings
Obituaries
University Course Catalogs

Job Listings Results
Los Angeles Times

Tuning set: 50
Test set: 50

             Recall %   Precision %
Degree          100         100
Skill            74         100
Contact         100         100
Email            91          83
Fax              91         100
Voice            79          92

(See Table 2 of Paper)

Obituaries Results
Salt Lake Tribune

Tuning set: ~40
Test set: 38

                     Recall %   Precision %
Deceased Name           100         100
Age                      91          95
Birth Date              100          97
Death Date               94         100
Funeral Date             92         100
Funeral Address          96          96
Funeral Time             97         100
Interment Address       100         100
Viewing                  93          96
Viewing Date             70         100
Viewing Address          76         100
Beginning Time           88         100
Ending Time              90         100
Relationship             81          93
Relative Name            88          71

(See our forthcoming ER’98 paper for details.)

Obituaries Results
Arizona Daily Star

Tuning set: ~40
Test set: 90

                     Recall %   Precision %
Deceased Name           100         100
Age                      86          98
Birth Date               96          96
Death Date               84          99
Funeral Date             96          93
Funeral Address          82          82
Funeral Time             92          87
Interment Address       100         100
Viewing                  97         100
Viewing Date            100         100
Viewing Address          95         100
Beginning Time           93          96
Ending Time              95         100
Relationship             92          97
Relative Name            95          74

(See our forthcoming ER’98 paper for details.)
