Semiautomatic Generation of
Resilient Data Extraction Ontologies
Yihong Ding
Data Extraction Group
Brigham Young University
Sponsored by NSF
Data Extraction Ontology
Goal: extract data from web pages
Components
  concepts
  relations between the concepts
  participation constraints
Resilient
Difficulty: manual ontology generation is costly
Generation Procedure
Knowledge Sources
Data-Extraction Ontology
Knowledge Selection Processing
Extraction Processing
Database
Train Test
Knowledge Collection
Assumptions about knowledge base
  general
  contains meaningful relationships
  pre-existing
  XML or easy to transfer to XML
Current input
  Mikrokosmos ontology [Mik]
  auxiliary data frame library
Selection of Concepts
PROCEDURE ConceptSelection(Tdoc, Kbase)
  SourceDoc = Parse(Tdoc);
  PrimarySelectedConceptsList = MikroSelection(M-Ontology);
  SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
  ConflictHandling();
  SelectedSubgraphGeneration();
MANY ISSUES: selection strategies, conflict resolution, …
Basic Selection Strategy
Select from Mikrokosmos Ontology
Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar
Mazar-e-Sharif Konduz Terrain: Landlocked;
mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population:17.7 million. Agriculture: Wheat, corn,
barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
Afghanistan smaller than Texas. Area<GeographicalArea>:
648,000 sq. km. Capital<CapitalCity><FinancialCapital>--Kabul,
Other cities--Kandahar Mazar-e-Sharif
Konduz Terrain: Landlocked; mostly mountains
and desert. Climate: Dry, with cold winters and hot
summers.
Population<Population>:17.7 million.
Agriculture:Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
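The name-and-synonym matching step shown on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LEXICON` entries and the helper names `build_index` and `tag_concept_names` are hypothetical and are not taken from the Mikrokosmos ontology or from the system described here.

```python
import re

# Hypothetical lexicon: concept name -> surface forms (names and synonyms).
# The entries are illustrative only, not real Mikrokosmos content.
LEXICON = {
    "GeographicalArea": ["area"],
    "CapitalCity": ["capital"],
    "FinancialCapital": ["capital"],
    "Population": ["population"],
}

def build_index(lexicon):
    """Invert the lexicon: lowercase surface form -> list of concept names."""
    index = {}
    for concept, forms in lexicon.items():
        for form in forms:
            index.setdefault(form.lower(), []).append(concept)
    return index

def tag_concept_names(text, lexicon):
    """Append a <Concept> tag after every word matching a name or synonym."""
    index = build_index(lexicon)

    def annotate(match):
        word = match.group(0)
        concepts = index.get(word.lower(), [])
        return word + "".join(f"<{c}>" for c in concepts)

    # Only alphabetic words are candidates; numbers are left untouched.
    return re.sub(r"[A-Za-z]+", annotate, text)

sample = "Area: 648,000 sq. km. Capital--Kabul. Population:17.7 million."
tagged = tag_concept_names(sample, LEXICON)
print(tagged)
```

A word that appears under more than one concept, such as "Capital" here, receives several tags, which is the kind of ambiguity the later conflict-handling step has to resolve.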
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
  concept values and their synonyms
Afghanistan<Nation> smaller than
Texas<USState>. Area<GeographicalArea>:
648,000 sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>,
Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>:17.7 million.
Agriculture:Wheat<FoodStuff><AgriculturalProduct>, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
  concept values and their synonyms
Select from Data Frame Libraries
Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar
Mazar-e-Sharif Konduz Terrain: Landlocked;
mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population:17.7 million. Agriculture: Wheat, corn,
barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
  concept values and their synonyms
Select from Data Frame Libraries
  extract result based on the data frames
Afghanistan smaller than Texas. Area:
648,000<Area><Mileage> sq. km.
Capital--Kabul, Other cities--Kandahar
Mazar-e-Sharif Konduz Terrain: Landlocked;
mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population:17.7<Time> million<Population><Price>.
Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
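Selection from a data frame library can be pictured as running value recognizers over the text. The sketch below assumes that each data frame reduces to a single regular expression. The patterns in `DATA_FRAMES` and the function `data_frame_matches` are hypothetical stand-ins, not the actual library used in this work.

```python
import re

# Hypothetical data frames: concept name -> regular expression for its values.
# Real data frames are richer; these patterns are illustrative only.
DATA_FRAMES = {
    "Area": r"\b\d{1,3}(?:,\d{3})+\b",
    "Mileage": r"\b\d{1,3}(?:,\d{3})+\b",
    "Population": r"\bmillion\b",
}

def data_frame_matches(text, frames):
    """Return [(matched string, [concepts])] ordered by position in the text."""
    hits = {}  # (start, end) span -> list of concepts whose pattern matched
    for concept, pattern in frames.items():
        for m in re.finditer(pattern, text):
            hits.setdefault(m.span(), []).append(concept)
    return [(text[s:e], concepts) for (s, e), concepts in sorted(hits.items())]

sample = "Area: 648,000 sq. km. Population:17.7 million."
matches = data_frame_matches(sample, DATA_FRAMES)
print(matches)
```

Because several recognizers may accept the same string, one span can carry more than one concept, as with 648,000<Area><Mileage> in the slide.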
Document-Level Conflict
Afghanistan smaller than Texas. Area: 648,000<Area><Mileage> sq. km. Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population:17.7<Time> million<Population><Price>. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts,
karakul pelts, wool, mutton.
Concept-Level Conflict
Afghanistan smaller than Texas. Area<GeographicalArea>: 648,000<Area> sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population<Population>: 17.7 million<Population>. Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn,
barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Relation Retrieval
Theoretical solution
  all paths in the subgraph
  too expensive: NP-Complete
Heuristic solution
  find the shortest path between any two nodes
  set a threshold distance
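The heuristic can be sketched as a breadth-first search that stops expanding once the threshold distance is reached. The sketch assumes an unweighted graph. The `GRAPH` adjacency data and the names `shortest_path` and `retrieve_relations` are hypothetical, and the real system may measure distance differently.

```python
from collections import deque
from itertools import combinations

# Hypothetical ontology fragment as an undirected adjacency list.
GRAPH = {
    "CapitalCity": ["City"],
    "City": ["CapitalCity", "Nation"],
    "Nation": ["City", "GeographicalArea"],
    "GeographicalArea": ["Nation"],
}

def shortest_path(graph, source, target, max_dist):
    """Breadth-first search; returns a shortest path as a node list, or None
    if target is more than max_dist edges away from source."""
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_dist:
            continue  # do not expand beyond the threshold
        for nxt in graph.get(node, ()):
            if nxt in parent:
                continue  # already reached by a path at least as short
            parent[nxt] = node
            if nxt == target:
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append((nxt, dist + 1))
    return None

def retrieve_relations(graph, selected, threshold):
    """Keep, for each pair of selected concepts, a shortest path within threshold."""
    relations = {}
    for a, b in combinations(selected, 2):
        path = shortest_path(graph, a, b, threshold)
        if path is not None:
            relations[(a, b)] = path
    return relations

relations = retrieve_relations(
    GRAPH, ["CapitalCity", "Nation", "GeographicalArea"], threshold=2
)
print(relations)
```

With a threshold of 2, CapitalCity and Nation are linked through City, while CapitalCity and GeographicalArea (three edges apart) are not related. Raising the threshold admits more relations at the cost of a larger search.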
Participation Constraints
Afghanistan<Nation> smaller than Texas. Area: 648,000 sq. km. Capital--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts,
karakul pelts, wool, mutton.
CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]
Participation Constraints (cont.)
Afghanistan<Nation> smaller than Texas. Area: 648,000 sq. km. Capital--Kabul<City>, Other cities<City>--Kandahar<City> Mazar-e-Sharif<City>
Konduz<City> Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts,
karakul pelts, wool, mutton.
City [1:1] PartOf Nation [1:*]
Performance Evaluation
Speed of generation
Precision and recall of the generation process
Precision and recall of the generated ontology
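For reference, precision and recall over a set of generated items can be computed as in the sketch below. It assumes the comparison is between a generated set of concepts and a hand-built reference set. The sample sets and the function name `precision_recall` are made up for illustration and are not results from this work.

```python
def precision_recall(generated, expected):
    """Precision = correct / generated; recall = correct / expected."""
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Made-up example sets, not measurements from the slides.
generated = {"Nation", "CapitalCity", "Population", "Price"}
expected = {"Nation", "CapitalCity", "Population", "GeographicalArea", "City"}

precision, recall = precision_recall(generated, expected)
print(precision, recall)
```

Three of the four generated concepts are in the reference set, and three of the five reference concepts were found.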
[Figure: Generation Time with Distance Threshold. x-axis: Distance Threshold (0 to 8); y-axis: Time (min:sec), 0:00 to 14:24.]
[Figure: P&R of Generation Process. x-axis: Distance Threshold (0 to 8); y-axis: P & R, 0.6 to 1; series: Recall, Precision.]
Conclusion
Data Extraction Ontology generated
Knowledge sources exploited
Many issues addressed
Many more to explore