1
Cui Tao
PhD Dissertation Defense
Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages
2
Motivation
Birth date of my great-grandpa
Price and mileage of red Nissans, 1990 or newer
Protein and amino-acid information for gene cdk-4
US states with property crime rates above 1%
5
Query for Data
• The Hidden Web:
– Hidden behind forms
– Hard to query
Find the protein and amino-acid information for gene “cdk-4”
6
A Web of Pages → A Web of Knowledge
• Web of Knowledge
– Machine-“understandable”
– Publicly accessible
– Queryable by standard query languages
• Semantic annotation
– Domain ontologies
– Populated conceptual model
• Problems to resolve
– How do we create ontologies?
– How do we annotate pages for ontologies?
Contributions of Dissertation Work
• Web of Pages → Web of Knowledge
– Knowledge & meta-knowledge extraction
– Reformulation as machine-“understandable” knowledge
• Automatic & semi-automatic solutions via:
– Sibling tables (TISP/TISP++)
– User-created forms (FOCIH)
7
8
Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
10
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
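The association above pairs a column-label path and a row path with a cell value. A minimal sketch of how such keys could be built, assuming the nested table has already been located and split into header labels and value rows. The `associate` helper, the "Gene model" column, and the placeholder cells are hypothetical illustrations, not TISP's actual data model; only the "Protein" label and the value WP:CE28918 come from the slide.

```python
def associate(path, headers, rows):
    """Map (column-label path, row path) -> value for a column-wise nested table."""
    prefix = ".".join(path)
    associations = {}
    for row_number, row in enumerate(rows, start=1):
        for header, value in zip(headers, row):
            key = (f"{prefix}.{header}", f"{prefix}.{row_number}")
            associations[key] = value
    return associations


path = ["Identification", "Gene model(s)"]
headers = ["Gene model", "Protein"]           # "Gene model" is an assumed column
rows = [
    ["<row 1 model>", "<row 1 protein>"],     # placeholder cells
    ["<row 2 model>", "WP:CE28918"],
]

assoc = associate(path, headers, rows)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(key, "=", assoc[key])
```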
12
15
Technique Details
• Unnest tables
• Match tables in sibling pages
– “Perfect” match (layout table: discard)
– “Reasonable” match (sibling table)
• Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
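The matching step can be sketched as a cell-by-cell comparison of two tables taken from sibling pages. This is a simplified stand-in that assumes both tables are already unnested into rows of cell text with the same shape; the `compare_siblings` function, the 0.3 threshold, and the sample cells are assumptions for illustration, not the matching procedure used by TISP.

```python
def compare_siblings(table_a, table_b, threshold=0.3):
    """Classify a pair of tables and return the cells they share (candidate labels)."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return "no match", []
    same = [a == b for a, b in zip(cells_a, cells_b)]
    ratio = sum(same) / len(same)
    if ratio == 1.0:
        # Identical on both pages: treated as a layout table and discarded.
        return "perfect", []
    if ratio >= threshold:
        # Cells that stay fixed are candidate labels; cells that vary are values.
        return "reasonable", [a for a, s in zip(cells_a, same) if s]
    return "no match", []


page_1 = [["Make", "Nissan"], ["Year", "1990"]]   # hypothetical car-ad cells
page_2 = [["Make", "Toyota"], ["Year", "1995"]]
menu = [["Home", "About"]]

print(compare_siblings(page_1, page_2))
print(compare_siblings(menu, menu))
```

Cells that are identical across sibling pages behave like labels, while cells that change behave like values, which is what makes sibling pages useful for interpretation.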
17
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …
Pattern combinations are also possible.
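The two patterns listed for this slide can be checked with ordinary regular expressions once every cell is tagged as a label (L) or a value (V). A minimal sketch, assuming the tagging has already been done; the signature encoding ("R" for a row start) and the `classify` helper are hypothetical simplifications of the HTML-level patterns.

```python
import re


def signature(rows):
    """Encode a table as a string: 'R' opens each row, then one 'L' or 'V' per cell."""
    return "".join("R" + "".join(row) for row in rows)


# (<tr><td|th> {L} <td|th> {V})^n : one label/value pair per row
ROW_WISE = re.compile(r"(?:RLV)+")
# <tr>(<td|th> {L})^n (<tr>(<td|th> {V})^n)+ : a header row, then rows of values
COLUMN_WISE = re.compile(r"RL+(?:RV+)+")


def classify(rows):
    sig = signature(rows)
    if ROW_WISE.fullmatch(sig):
        return "row-wise"
    # A regex cannot enforce equal n per row, so row widths are checked separately.
    if COLUMN_WISE.fullmatch(sig) and len({len(row) for row in rows}) == 1:
        return "column-wise"
    return "unknown"


print(classify([["L", "V"], ["L", "V"]]))
print(classify([["L", "L", "L"], ["V", "V", "V"], ["V", "V", "V"]]))
print(classify([["L", "L"], ["V"]]))
```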
22
Ontology Generation – OSM
• Object set: table labels
– Lexical: labels that associate with actual values
– Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation
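The mapping above can be sketched over an interpreted table held as a nested dictionary: a label whose content is a string becomes a lexical object set, a label whose content is another table becomes a non-lexical object set, and each nesting step yields a relationship set. The `generate_osm` function, the root name "Gene", and the dictionary representation are assumptions for illustration; constraints are left out of the sketch.

```python
def generate_osm(table, root):
    """Derive object sets and relationship sets from a nested label/value table."""
    lexical, nonlexical, relationships = set(), {root}, set()

    def walk(node, owner):
        for label, content in node.items():
            relationships.add((owner, label))    # nesting -> relationship set
            if isinstance(content, dict):
                nonlexical.add(label)            # label leads to another table
                walk(content, label)
            else:
                lexical.add(label)               # label leads to an actual value

    walk(table, root)
    return lexical, nonlexical, relationships


interpreted = {"Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}}}
lexical, nonlexical, relationships = generate_osm(interpreted, root="Gene")
print(sorted(lexical))
print(sorted(nonlexical))
print(sorted(relationships))
```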
23
Ontology Generation – OWL
• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
– OWL datatype property
– Different annotation properties to keep track of the provenance
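As an illustration of this mapping, the sketch below writes Turtle declarations by plain string formatting. The `ex:` namespace, the `_has_` naming scheme, and the helper names are hypothetical choices; provenance annotation properties are omitted, and the ontologies actually generated by TISP++ may be structured differently.

```python
import re


def local_name(label):
    """Turn a table label into a safe local name, e.g. 'Gene model(s)' -> 'Gene_model_s'."""
    return re.sub(r"\W+", "_", label).strip("_")


def to_turtle(nonlexical, lexical, relationships):
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix ex: <http://example.org/onto#> .",   # assumed namespace
    ]
    for label in sorted(nonlexical):
        lines.append(f"ex:{local_name(label)} a owl:Class .")
    for owner, label in sorted(relationships):
        # A relationship ending in a lexical object set becomes a datatype property.
        kind = "DatatypeProperty" if label in lexical else "ObjectProperty"
        lines.append(f"ex:{local_name(owner)}_has_{local_name(label)} a owl:{kind} .")
    return "\n".join(lines)


turtle = to_turtle(
    nonlexical={"Identification", "Gene model(s)"},
    lexical={"Protein"},
    relationships={("Identification", "Gene model(s)"), ("Gene model(s)", "Protein")},
)
print(turtle)
```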
28
TISP Evaluation
• Applications
– Commercial: car ads
– Scientific: molecular biology
– Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
– Initial two sibling pages
  • Correct separation of data tables from layout tables?
  • Correct pattern recognition?
– Remaining tables in site
  • Information properly extracted?
  • Able to detect and adjust for pattern variations?
29
Experimental Results
• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
30
TISP++ Performance
• Performance depends on TISP
• TISP test set
– Generates all ontologies correctly
– Annotates all information in tables correctly
31
Form-based Ontology Creation and Information Harvesting (FOCIH)
• Personalized ontology creation by form
– General familiarity
– Reasonable conceptual framework
– Appropriate correspondence
• Transformable to ontological descriptions
• Capable of accepting source data
• Automated ontology creation
• Automated information harvesting
39
Almost Ready to Harvest
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
– Pattern recognition
– Instance recognition
45
Pattern & Instance Recognition
List pattern: the delimiter is a regular expression for percentage numbers and a comma
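A list pattern of this kind can be sketched with a regular-expression split. The delimiter expression and the sample string below are assumptions for illustration; the slides do not give the exact expression FOCIH derives.

```python
import re

# Assumed delimiter: a percentage number, optionally followed by a comma.
DELIMITER = re.compile(r"\s*\d+(?:\.\d+)?%\s*,?\s*")


def split_list(text):
    """Split a cell into list items, dropping the delimiters and empty pieces."""
    return [item for item in DELIMITER.split(text) if item]


sample = "Alpha 60%, Beta 30.5%, Gamma 9.5%"   # hypothetical cell content
print(split_list(sample))
```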
56
FOCIH Performance
• Ontology creation
• Semantic annotation
– Depends on TISP performance
– Depends on pattern and instance recognition performance
57
FOCIH Performance
• Pattern and instance recognition:
– Works with highly regular data
– Tested 71 mappings
– 25 full-string values (25/25 correct)
– 38 substring values (29/38 correct)
– 8 list patterns (6/8 correct)
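The per-type counts can be cross-checked against the 71 tested mappings and turned into rates with a few lines of arithmetic. The overall figure is derived here from the reported counts; it is not stated in the slides.

```python
# (correct, tested) counts per mapping type, as reported on the slide
results = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

total_correct = sum(correct for correct, _ in results.values())
total_tested = sum(tested for _, tested in results.values())

for name, (correct, tested) in results.items():
    print(f"{name}: {correct}/{tested} = {correct / tested:.1%}")
print(f"overall: {total_correct}/{total_tested} = {total_correct / total_tested:.1%}")
```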
65
Contributions
• TISP: automatic sibling table interpretation
• TISP++:
– Automatic ontology generation based on interpreted tables
– Automatic semantic annotation for interpreted tables
• FOCIH:
– Semi-automatic personalized ontology creation
– Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge