Automatic Creation and Simplified Querying of Semantic Web Content
An Approach Based on Information-Extraction Ontologies
Yihong Ding, David W. Embley, and Stephen W. Liddle
Brigham Young University
Fundamental Problems
Lack of semantic web content
Difficulty of content creation
Inability to use semantic web content easily
Proposed Solutions
Automatically annotate data-rich web pages (turning them into semantic web pages)
Provide for free-form, textual queries of semantic web content
A Show-Case Vision
Find me the price and mileage of red Nissans – I want a 1990 or newer.
Demo I: Data Extraction
Demo II: Semantic Annotation
Demo III: Free-Form Query
Explanation: How it Works
Extraction Ontologies
Semantic Annotation
Free-Form Query Interpretation
Extraction Ontologies
Object sets
Relationship sets
Participation constraints
Lexical
Non-lexical
Primary object set
Aggregation
Generalization/Specialization
Formalism & Extraction Ontologies
Fully formalized in predicate calculus
  Object set ~ 1-place predicate
  N-ary relationship set ~ n-place predicate
  Constraint ~ closed predicate-calculus formula
As a description logic ~ ALCN (Attributive Language with Complement and Numeric Restrictions)
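As an illustration of the mapping above, a fragment of a car-ads ontology might be written as follows. The predicate names Car, Price, and CarHasPrice are hypothetical, and the second constraint (each car has at most one price) is an assumed participation constraint, not one taken from the authors' ontology.

```latex
\begin{align*}
% Object sets as 1-place predicates
&\mathit{Car}(x), \qquad \mathit{Price}(y) \\
% A binary relationship set as a 2-place predicate
&\mathit{CarHasPrice}(x, y) \\
% Constraints as closed formulas: typing of the relationship set
&\forall x \,\forall y \,\bigl(\mathit{CarHasPrice}(x, y) \rightarrow \mathit{Car}(x) \wedge \mathit{Price}(y)\bigr) \\
% Assumed participation constraint: a car has at most one price
&\forall x \,\forall y \,\forall z \,\bigl(\mathit{CarHasPrice}(x, y) \wedge \mathit{CarHasPrice}(x, z) \rightarrow y = z\bigr)
\end{align*}
```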
(a quick side note)
Extraction Ontologies
Data Frame:
  Internal Representation: float
  Values
    External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
    Left Context: $
  Key Word Phrase
    Key Words: ([Pp]rice)|([Cc]ost)| …
  Operators
    Operator: >
    Key Words: (more\s*than)|(more\s*costly)|…
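A minimal sketch of how such a data frame could drive recognition. The three regular expressions are copied from the slide, keeping only the alternatives that are actually shown; the dictionary layout and the function names extract_prices and find_operators are assumptions made for illustration, not the authors' implementation.

```python
import re

# Patterns copied from the slide. Alternatives elided there ("…") are not
# filled in. The dictionary layout itself is hypothetical.
PRICE_FRAME = {
    "internal_representation": float,
    "value": re.compile(r"\s*[$]\s*(\d{1,3})*(\.\d{2})?"),
    "keywords": re.compile(r"([Pp]rice)|([Cc]ost)"),
    "operators": {">": re.compile(r"(more\s*than)|(more\s*costly)")},
}


def extract_prices(text):
    """Return every price-like value in text, converted to the internal type."""
    convert = PRICE_FRAME["internal_representation"]
    prices = []
    for match in PRICE_FRAME["value"].finditer(text):
        digits = match.group(0).replace("$", "").strip()
        if digits:  # skip a bare "$" with no number after it
            prices.append(convert(digits))
    return prices


def find_operators(text):
    """Return the comparison operators whose keyword phrases occur in text."""
    return [op for op, pattern in PRICE_FRAME["operators"].items()
            if pattern.search(text)]


print(extract_prices("Nissan Sentra, red, price $4500.00 obo"))  # [4500.0]
print(find_operators("anything more than $3000"))                # ['>']
```

Keeping the external representation (the regular expression) separate from the internal representation (float) is what lets one declaration serve both extraction from pages and comparison in queries.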
Data-Extraction Results: Car Ads
Training set for tuning ontology: 100
Test set: 116
Salt Lake Tribune
Attribute   Recall %   Precision %
Year        100        100
Make        97         100
Model       82         100
Mileage     90         100
Price       100        100
PhoneNr     94         100
Feature     91         99
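For readers unfamiliar with the two measures, the sketch below uses their standard definitions: recall is the share of expected values that were extracted, and precision is the share of extracted values that are correct. The function name and the sample values are hypothetical and are not taken from the experiment.

```python
def recall_precision(expected, extracted):
    """Compute (recall, precision) for one attribute.

    expected  -- values a human marked as present in the ads
    extracted -- values the extraction ontology actually returned
    """
    expected, extracted = set(expected), set(extracted)
    correct = expected & extracted
    recall = len(correct) / len(expected) if expected else 1.0
    precision = len(correct) / len(extracted) if extracted else 1.0
    return recall, precision


# Hypothetical example: one model is missed and nothing wrong is extracted,
# so recall drops while precision stays perfect.
expected_models = {"Sentra", "Civic", "Neon", "Taurus", "Town Car"}
extracted_models = {"Sentra", "Civic", "Neon", "Taurus"}
print(recall_precision(expected_models, extracted_models))  # (0.8, 1.0)
```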
Car Ads: Comments
Dynamic sets
  Missed: MERC, Town Car, 98 Royale
  Could use lexicon of makes and models
Unspecified variation in lexical patterns
  Missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  Could adjust lexical patterns
Misidentification of attributes
  Classified AUTO in AUTO SALES as automatic transmission
  Could adjust exceptions in lexical patterns
Typographical errors
  “Chrystler”, “DODG ENeon”, “I-15566-2441”
  Could look for spelling variations and common typos
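The remedies suggested above can be sketched with ordinary regular expressions and a small lexicon. All patterns, the MAKES list, and the function name normalize_make are hypothetical illustrations; difflib is used as a stand-in for whatever spelling-variation technique the authors might choose.

```python
import difflib
import re

# Exception in a lexical pattern: "auto" counts as automatic transmission
# unless it is followed by "sales" (as in "AUTO SALES").
AUTOMATIC = re.compile(r"\bauto(?:matic)?\b(?!\s+sales)", re.IGNORECASE)

# Adjusted lexical patterns that accept the variations that were missed.
FIVE_SPEED = re.compile(r"\b5\s*(?:spd|speed)\b", re.IGNORECASE)
POWER_LOCKS = re.compile(r"\bp\.l\.?", re.IGNORECASE)

# Hypothetical lexicon of makes, used to absorb common typos.
MAKES = ["Chrysler", "Dodge", "Nissan", "Mercury"]


def normalize_make(token):
    """Map a possibly misspelled make to the closest lexicon entry, or None."""
    matches = difflib.get_close_matches(token, MAKES, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(AUTOMATIC.search("AUTO SALES, call today"))      # None
print(bool(AUTOMATIC.search("97 Sentra, auto, a/c")))  # True
print(normalize_make("Chrystler"))                     # Chrysler
```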
General Extraction Results
~ 20 Domains (cars, obituaries, cameras, jobs, games, prescription drugs, …)
Simple, unified domains: nearly 100% recall and precision
Complex, loosely defined domains (e.g. obituaries: 82% recall and 74% precision)
Typical: 80%+ recall and precision
Generality & Resiliency of Extraction Ontologies
Assumptions about web pages (generality)
  Data rich
  Narrow domain
  Document types
    Simple multiple-record documents (easiest)
    Single-record documents (harder)
    Records with scattered components (even harder)
Declarative (resiliency)
  Still works when web pages change
  Works for new, unseen pages in the same domain
  Scalable, but takes work to declare the extraction ontology
(another quick side note)
Semantic Annotation
Free-Form Query Interpretation
Parse Free-Form Query (with data extraction ontology)
Select Ontology
Formulate Query Expression
Run Query Over Semantically Annotated Data
Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized terms:
  price
  mileage
  red
  Nissan
  1996
  or newer (>= Operator)
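The parsing step can be sketched as running each recognizer of the extraction ontology over the query text. The recognizers below are hypothetical and far simpler than real data frames; the function name parse_query and the three-part return value are assumptions made for illustration.

```python
import re

# Hypothetical recognizers for a toy car-ads ontology.
OBJECT_SET_KEYWORDS = {
    "Price": re.compile(r"\b(price|cost)\b", re.IGNORECASE),
    "Mileage": re.compile(r"\b(mileage|miles)\b", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "Color": re.compile(r"\b(red|blue|white|black)\b", re.IGNORECASE),
    "Make": re.compile(r"\b(Nissan|Ford|Dodge)s?\b", re.IGNORECASE),
    "Year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
}
OPERATOR_KEYWORDS = {">=": re.compile(r"\bor\s+newer\b", re.IGNORECASE)}


def parse_query(query):
    """Return (mentioned object sets, recognized values, operators)."""
    mentioned = [name for name, pattern in OBJECT_SET_KEYWORDS.items()
                 if pattern.search(query)]
    values = {}
    for name, pattern in VALUE_PATTERNS.items():
        match = pattern.search(query)
        if match:
            values[name] = match.group(1)
    operators = [op for op, pattern in OPERATOR_KEYWORDS.items()
                 if pattern.search(query)]
    return mentioned, values, operators


QUERY = ("Find me the price and mileage of all red Nissans "
         "– I want a 1996 or newer")
print(parse_query(QUERY))
# (['Price', 'Mileage'], {'Color': 'red', 'Make': 'Nissan', 'Year': '1996'}, ['>='])
```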
Select Ontology
Similarity value: 5
Similarity value: 2
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
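The slide does not say how the similarity values are computed. One plausible reading, assumed here, is that each candidate ontology is scored by how many of its recognizers match the query, and the highest-scoring ontology is selected. The ontology names, object sets, and patterns below are hypothetical.

```python
import re

# Hypothetical ontologies: object-set name -> recognizer pattern.
ONTOLOGIES = {
    "CarAds": {
        "Price": r"\bprice\b",
        "Mileage": r"\bmileage\b",
        "Color": r"\bred\b",
        "Make": r"\bNissans?\b",
        "Year": r"\b(19|20)\d{2}\b",
    },
    "RealEstate": {
        "Price": r"\bprice\b",
        "YearBuilt": r"\b(19|20)\d{2}\b",
        "Bedrooms": r"\bbedrooms?\b",
    },
}


def similarity(query, recognizers):
    """Assumed measure: the number of recognizers that match the query."""
    return sum(1 for pattern in recognizers.values()
               if re.search(pattern, query, re.IGNORECASE))


def select_ontology(query):
    """Return the best-matching ontology name and all similarity scores."""
    scores = {name: similarity(query, recognizers)
              for name, recognizers in ONTOLOGIES.items()}
    return max(scores, key=scores.get), scores


QUERY = ("Find me the price and mileage of all red Nissans "
         "– I want a 1996 or newer")
print(select_ontology(QUERY))
# ('CarAds', {'CarAds': 5, 'RealEstate': 2})
```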
Conjunctive queries and aggregate queries
Mentioned object sets are all of interest in the result.
Values and operator keywords determine conditions.
  Color = “red”
  Make = “Nissan”
  Year >= 1996
>= Operator
Formulate Query Expression
For
Let
Where
Return
Formulate Query Expression
Run Query Over Semantically Annotated Data
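A minimal sketch of running the conjunctive query, assuming the annotated data has already been loaded into plain dictionaries. The records, the run_query function, and the choice to return only Price and Mileage are hypothetical. The three conditions are the ones derived from the example query.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, ">": operator.gt,
       "<=": operator.le, "<": operator.lt}

# Hypothetical annotated records, one dictionary per car ad.
ADS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500.0, "Mileage": 82000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 2100.0, "Mileage": 140000},
    {"Make": "Ford", "Color": "red", "Year": 2001, "Price": 6900.0, "Mileage": 60000},
]

# Conditions from the example query: Color = "red", Make = "Nissan", Year >= 1996.
CONDITIONS = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
RESULT_FIELDS = ["Price", "Mileage"]


def run_query(records, conditions, fields):
    """Keep records satisfying every condition; project the requested fields."""
    results = []
    for record in records:
        if all(name in record and OPS[op](record[name], value)
               for name, op, value in conditions):
            results.append({field: record.get(field) for field in fields})
    return results


print(run_query(ADS, CONDITIONS, RESULT_FIELDS))
# [{'Price': 4500.0, 'Mileage': 82000}]
```

Because the conditions are only ever combined with "and", this form cannot express disjunction or negation.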
Query Interpretation Results: Pilot Experiment with Car Ads
15 car-ads free-form queries from 3 volunteer CS students
Results
  Recognizing object sets of interest
    Recall: 85%
    Precision: 90%
  Recognizing constraints
    Recall: 61%
    Precision: 79%
Problems
  Regular expressions not tuned up and lexicons incomplete
  Ambiguities: “Are there any Ford mustangs, 2002, that are red?” (Is 2002 a year, mileage, or price?)
Caveats
  No disjunction
  No negation
General Query Interpretation Results
AskOntos (Pilot Experiment on 5 domains: cars, real estate, countries, movies, diamonds)
Object sets of interest recognized
  Recall: 90%
  Precision: 90%
Conditions recognized
  Recall: 71%
  Precision: 88%
Pragmatics
Technical problems
  Extraction and query-interpretation accuracy
  Execution speed
  Harvesting
    Crawling?!
    Information behind forms on the hidden web
Social problems
  Cooperation from web site developers
  End-user concerns
    Motivation
    Trust
All is not rosy …
Conclusions
Automatically create semantic-web content
  Do data extraction over an ordinary web page
  Create semantic-web page
    Cache page
    Store external semantic annotation wrt an ontology
Query semantic web pages
  Free-form queries
  Return results
    Table
    Link to original web page (scrolled and highlighted)
Pragmatic considerations
www.deg.byu.edu