Semiautomatic Generation of Resilient Data Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF

Data Extraction Ontology
  Goal: extract data from web pages
  Components
    concepts
    relations between the concepts
    participation constraints
  Resilient
  Difficulty: manual ontology generation is costly
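The three components listed above can be pictured as a small data structure. The following is a minimal sketch, not the presenter's implementation: the class names `Relation` and `ExtractionOntology` and their fields are hypothetical, and the City/Nation relation is borrowed from the participation-constraint slides later in the deck.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A named link between two concepts, with a participation
    constraint written as 'min:max' on each side (hypothetical layout)."""
    source: str
    name: str
    target: str
    source_card: str
    target_card: str


@dataclass
class ExtractionOntology:
    """Concepts plus the relations (and constraints) between them."""
    concepts: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def add_relation(self, rel: Relation) -> None:
        # Adding a relation also registers both endpoint concepts.
        self.concepts.update({rel.source, rel.target})
        self.relations.append(rel)


onto = ExtractionOntology()
onto.add_relation(Relation("City", "PartOf", "Nation", "1:1", "1:*"))
```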

Generation Procedure
[Diagram. Labels: Knowledge Sources; Knowledge Selection Processing; Data-Extraction Ontology; Extraction Processing; Database; Train; Test]

Knowledge Collection
  Assumptions about knowledge base
    general
    contains meaningful relationships
    pre-existing
    XML or easy to transfer to XML
  Current input
    Mikrokosmos ontology [Mik]
    auxiliary data frame library

Selection of Concepts

PROCEDURE ConceptSelection(Tdoc, Kbase)
  SourceDoc = Parse(Tdoc);
  PrimarySelectedConceptsList = MikroSelection(M-Ontology);
  SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
  ConflictHandling();
  SelectedSubgraphGeneration();

MANY ISSUES: selection strategies, conflict resolution, …

Basic Selection Strategy
  Select from Mikrokosmos Ontology

Afghanistan smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

Basic Selection Strategy
  Select from Mikrokosmos Ontology
    concept names and their synonyms

Afghanistan smaller than Texas.
Area<GeographicalArea>: 648,000 sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

Basic Selection Strategy
  Select from Mikrokosmos Ontology
    concept names and their synonyms
    concept values and their synonyms

Afghanistan<Nation> smaller than Texas<USState>.
Area<GeographicalArea>: 648,000 sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>: 17.7 million.
Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
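Selection by concept names and by concept values can both be read as lexicon lookups over the words of the training document. The sketch below assumes that reading. The two lexicons are tiny hypothetical stand-ins for what the Mikrokosmos ontology would supply, filled only with the tags shown on the slide, and `select_concepts` is an illustrative name rather than a function from the original system.

```python
import re

# Hypothetical lexicons: lower-cased surface form -> candidate concepts.
# NAME_LEXICON stands in for concept names and their synonyms,
# VALUE_LEXICON for concept values and their synonyms.
NAME_LEXICON = {
    "area": ["GeographicalArea"],
    "capital": ["CapitalCity", "FinancialCapital"],
    "population": ["Population"],
}
VALUE_LEXICON = {
    "afghanistan": ["Nation"],
    "texas": ["USState"],
    "kabul": ["CapitalCity"],
    "wheat": ["FoodStuff", "AgriculturalProduct"],
}


def select_concepts(text, *lexicons):
    """Return every concept whose name or value occurs as a word in text."""
    selected = set()
    for word in re.findall(r"[A-Za-z]+", text):
        for lexicon in lexicons:
            selected.update(lexicon.get(word.lower(), []))
    return selected


sample = "Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul"
by_name = select_concepts(sample, NAME_LEXICON)
by_name_and_value = select_concepts(sample, NAME_LEXICON, VALUE_LEXICON)
```

Matching on single words keeps the sketch short; multi-word names such as "Other cities" would need phrase matching.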

Basic Selection Strategy
  Select from Mikrokosmos Ontology
    concept names and their synonyms
    concept values and their synonyms
  Select from Data Frame Libraries

Afghanistan smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

Basic Selection Strategy
  Select from Mikrokosmos Ontology
    concept names and their synonyms
    concept values and their synonyms
  Select from Data Frame Libraries
    extract result based on the data frames

Afghanistan smaller than Texas.
Area: 648,000<Area><Mileage> sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7<Time> million<Population><Price>.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
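A data frame can be thought of as a recognizer for the values of one concept. As a rough sketch, assume each data frame is reduced to a single regular expression. The patterns in `DATA_FRAMES` are invented for illustration and chosen only so that they reproduce the tags on the slide; the real data frame library is not shown in the deck.

```python
import re

# Hypothetical data frame library: concept -> value-recognizing pattern.
DATA_FRAMES = {
    "Area": r"\b\d{1,3}(?:,\d{3})+\b",
    "Mileage": r"\b\d{1,3}(?:,\d{3})+\b",
    "Time": r"\b\d{1,2}[.:]\d{1,2}\b",
    "Population": r"\bmillion\b",
    "Price": r"\bmillion\b",
}


def apply_data_frames(text, frames=DATA_FRAMES):
    """Map each recognized string to the list of concepts that claim it."""
    hits = {}
    for concept, pattern in frames.items():
        for match in re.finditer(pattern, text):
            hits.setdefault(match.group(), []).append(concept)
    return hits


hits = apply_data_frames("Area: 648,000 sq. km. Population:17.7 million.")
```

Several recognizers can claim the same string, which is how a value such as 648,000 ends up tagged as both Area and Mileage.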

Document-Level Conflict

Afghanistan smaller than Texas.
Area: 648,000<Area><Mileage> sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7<Time> million<Population><Price>.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
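The slide does not spell out how a document-level conflict is defined or resolved. One plausible reading, assumed here, is that a conflict arises wherever a single string in the document carries more than one candidate concept. The sketch below only detects such strings in the slide's `<Tag>` notation; the function name and the regular expression are illustrative, and no resolution strategy is implied.

```python
import re

# A token followed directly by one or more <Concept> tags.
TAGGED_TOKEN = re.compile(r"([\w.,]+)((?:<\w+>)+)")


def find_conflicts(tagged_text):
    """List (token, concepts) pairs where a token has several candidate tags."""
    conflicts = []
    for match in TAGGED_TOKEN.finditer(tagged_text):
        concepts = re.findall(r"<(\w+)>", match.group(2))
        if len(concepts) > 1:
            conflicts.append((match.group(1), concepts))
    return conflicts


tagged = (
    "Area: 648,000<Area><Mileage> sq. km. "
    "Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, "
    "Population:17.7<Time> million<Population><Price>."
)
conflicts = find_conflicts(tagged)
```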

Concept-Level Conflict

Afghanistan smaller than Texas.
Area<GeographicalArea>: 648,000<Area> sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>: 17.7 million<Population>.
Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

Relation Retrieval
  Theoretical solution
    all paths in the subgraph
    too expensive: NP-Complete
  Heuristic solution
    find the shortest path between any two nodes
    set a threshold distance
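The heuristic above can be sketched with a breadth-first search that stops expanding once a path reaches the threshold length. This is an illustration under assumptions, not the presenter's code: the graph is treated as unweighted and undirected, and the edges in `graph` are hypothetical (only the CapitalCity, City, Nation chain is suggested by the participation-constraint slides).

```python
from collections import deque


def shortest_distance(graph, start, goal, threshold):
    """Edge count of the shortest path, or None if it exceeds threshold."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == threshold:
            continue  # paths through this node would be too long
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def related_pairs(graph, selected, threshold):
    """Keep each pair of selected concepts joined by a short enough path."""
    nodes = sorted(selected)
    pairs = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dist = shortest_distance(graph, a, b, threshold)
            if dist is not None:
                pairs[(a, b)] = dist
    return pairs


# Hypothetical fragment of the knowledge-base graph (adjacency lists).
graph = {
    "CapitalCity": ["City"],
    "City": ["CapitalCity", "Nation"],
    "Nation": ["City", "Population"],
    "Population": ["Nation"],
}
selected = {"CapitalCity", "Nation", "Population"}
pairs_at_2 = related_pairs(graph, selected, threshold=2)
pairs_at_3 = related_pairs(graph, selected, threshold=3)
```

Raising the threshold admits more distant pairs, which is the trade-off the later evaluation slides plot against generation time, precision, and recall.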

Participation Constraints

Afghanistan<Nation> smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]

Participation Constraints (cont.)

Afghanistan<Nation> smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul<City>, Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City>
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

City [1:1] PartOf Nation [1:*]
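The slides show the resulting constraints but not the rule that produces them. One way to read the two examples, assumed here and not confirmed by the deck, is that the constraint summarizes how many related instances were observed per instance in the training document, with any count above one generalized to `*`. The function name `cardinality` is hypothetical.

```python
def cardinality(counts):
    """Summarize observed related-instance counts as a 'min:max' string.

    Assumption of this sketch: the minimum is capped at 1 and any
    maximum above 1 is generalized to '*'.
    """
    low = min(1, min(counts))
    high = max(counts)
    return f"{low}:{'*' if high > 1 else high}"


# One Nation observed with four cities (Kabul, Kandahar, Mazar-e-Sharif, Konduz).
nation_side = cardinality([4])
# Each of the four cities observed with exactly one Nation.
city_side = cardinality([1, 1, 1, 1])
# One Nation observed with one CapitalCity.
capital_side = cardinality([1])
```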

Performance Evaluation
  Speed of generation
  Precision and recall of the generation process
  Precision and recall of the generated ontology
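Precision and recall here follow the standard definitions: the fraction of generated items that are correct, and the fraction of expected items that were generated. The sketch below applies them to sets of concept names. The example sets are made up for illustration and are not results from the presentation.

```python
def precision_recall(generated, reference):
    """Return (precision, recall) of generated items against a reference set."""
    generated, reference = set(generated), set(reference)
    correct = generated & reference
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Illustrative data only: concepts a generator produced vs. an expert's list.
generated = {"Nation", "CapitalCity", "Population", "Price"}
reference = {"Nation", "CapitalCity", "Population", "GeographicalArea", "City"}
precision, recall = precision_recall(generated, reference)
```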

Generation Time with Distance Threshold
[Plot: generation time (min:sec), axis from 0:00 to 14:24, against distance threshold, axis from 0 to 8]

P&R of Generation Process
[Plot: recall and precision, axis from 0.6 to 1, against distance threshold, axis from 0 to 8]

Conclusion
  Data Extraction Ontology generated
  Knowledge sources exploited
  Many issues addressed
  Many more to explore