1 Cui Tao PhD Dissertation Defense Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages


1

Cui Tao
PhD Dissertation Defense

Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages

2

Motivation

Birth date of my great-grandpa

Price and mileage of red Nissans, 1990 or newer

Protein and amino acids information of gene cdk-4?

US states with property crime rates above 1%

3

Search by Search Engine

4

Search the Hidden Web

• The Hidden Web:
  – Hidden behind forms
  – Hard to query

“cdk-4”

5

Query for Data

• The Hidden Web:
  – Hidden behind forms
  – Hard to query

Find the protein and the amino-acid information for gene “cdk-4”

6

A Web of Pages → A Web of Knowledge

• Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages

• Semantic annotation
  – Domain ontologies
  – Populated conceptual model

• Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?

Contributions of Dissertation Work

• Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge

• Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)

7

8

Automatic Annotation with TISP (Table Interpretation with Sibling Pages)

• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations

9

Recognize Tables

Data Table

Layout Tables (discard)

Nested Data Tables

10

Find Label/Value Associations

Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
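The example above keys a value by a pair: the full path of its column label and the full path of its row index. A minimal Python sketch of that idea follows. The function name `associate` and the placeholder cells are hypothetical; only `WP:CE28918` and the label path come from the slide.

```python
def associate(table_path, labels, rows):
    """Key every cell by (column-label path, row-index path)."""
    associations = {}
    for row_number, row in enumerate(rows, start=1):
        for label, value in zip(labels, row):
            key = (f"{table_path}.{label}", f"{table_path}.{row_number}")
            associations[key] = value
    return associations


# Only WP:CE28918 comes from the slide; the other cells are placeholders.
labels = ["Gene model", "Protein"]
rows = [
    ["<gene-model-1>", "<protein-1>"],
    ["<gene-model-2>", "WP:CE28918"],
]
associations = associate("Identification.Gene model(s)", labels, rows)
value = associations[
    ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
]
```

Here `value` holds the protein in row 2 of the nested "Gene model(s)" table.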

12

11

Interpretation Technique: Sibling Page Comparison

12

Interpretation Technique: Sibling Page Comparison

Same

13

Interpretation Technique: Sibling Page Comparison

Almost Same

14

Interpretation Technique: Sibling Page Comparison

Different

Same

15

Technique Details

• Unnest tables

• Match tables in sibling pages
  – “Perfect” match (table for layout → discard)
  – “Reasonable” match (sibling table)

• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
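The matching step can be illustrated with a rough Python sketch. It assumes tables are lists of rows of cell text, and it uses a simple positional overlap ratio as a stand-in for whatever similarity measure TISP actually applies. The function name `classify_match` and the `threshold` value are hypothetical.

```python
def classify_match(table_a, table_b, threshold=0.3):
    """Classify two tables from sibling pages by how many cells coincide."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    if len(cells_a) != len(cells_b) or not cells_a:
        return "no match"
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    ratio = same / len(cells_a)
    if ratio == 1.0:
        # Identical on both pages: nothing varies, so treat it as layout.
        return "layout table (discard)"
    if ratio >= threshold:
        # Labels coincide while values differ: a sibling data table.
        return "sibling table"
    return "no match"


page_1 = [["Make", "Nissan"], ["Year", "1995"]]
page_2 = [["Make", "Honda"], ["Year", "2001"]]
kind = classify_match(page_1, page_2)
```

The design choice is that cells shared by both pages behave like labels and cells that differ behave like values, which is why a perfect match signals a table with no data.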

16

Table Unnesting

17

Table Structure Patterns

Regularity Expectations:

• (<tr><(td|th)> {L} <(td|th)> {V})^n

• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

• …

Pattern combinations are also possible.
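The two patterns listed for this slide can be checked mechanically once each cell is marked as a label (L) or a value (V). The Python sketch below assumes a cell is a label when it is identical in two sibling tables and a value otherwise. That marking rule, the pattern names, and all function names are illustrative assumptions, not the TISP implementation.

```python
import re


def mark_cells(table_a, table_b):
    """Mark a cell 'L' if it is identical in both sibling tables, else 'V'."""
    return [
        ["L" if a == b else "V" for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


def encode(marks):
    """Flatten the marks into a string such as '<tr>LV<tr>LV'."""
    return "".join("<tr>" + "".join(row) for row in marks)


ROW_PAIRS = re.compile(r"(<tr>LV)+")               # (<tr> L V)^n
HEADER_THEN_ROWS = re.compile(r"<tr>(L+)(<tr>V+)+")  # <tr> L^n (<tr> V^n)+


def detect_pattern(marks):
    encoded = encode(marks)
    if ROW_PAIRS.fullmatch(encoded):
        return "label-value rows"
    match = HEADER_THEN_ROWS.fullmatch(encoded)
    # Every value row must have as many cells as the header row.
    if match and all(len(row) == len(match.group(1)) for row in marks):
        return "header row + value rows"
    return "unknown"


marks = mark_cells(
    [["Make", "Nissan"], ["Year", "1995"]],
    [["Make", "Honda"], ["Year", "2001"]],
)
pattern = detect_pattern(marks)
```

A value that happens to be the same on both pages would be mislabeled by this rule, which is one reason a real system needs the dynamic adjustment step.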

18

Table Structure Patterns

<tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

19

Pattern Usage

20

Dynamic Pattern Adjustment

21

TISP++

• Automatic ontology generation

• Automatic information annotation

22

Ontology Generation – OSM

• Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables

• Relationship set: table nesting
• Constraints: updates based on observation
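As a sketch of the mapping above, assume an interpreted table is a nested dictionary in which a label maps either to a value or to another table. The Python below derives lexical object sets, non-lexical object sets, and relationship sets from that nesting. The function name, the `root` parameter, and the sample table are hypothetical, and constraints are left out.

```python
def derive_object_sets(table, root="Gene"):
    """Split labels into lexical / non-lexical sets and record nesting."""
    lexical, nonlexical, relationships = set(), set(), set()

    def walk(node, parent):
        for label, content in node.items():
            relationships.add((parent, label))  # nesting gives a relationship
            if isinstance(content, dict):
                nonlexical.add(label)           # label leads to another table
                walk(content, label)
            else:
                lexical.add(label)              # label leads to an actual value

    walk(table, root)
    return lexical, nonlexical, relationships


# Hypothetical interpreted table; only the labels and WP:CE28918 echo the slides.
table = {"Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}}}
lexical, nonlexical, relationships = derive_object_sets(table)
```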

23

Ontology Generation – OWL

• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
  – OWL data type property
  – Different annotation properties to keep track of the provenance
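One possible reading of this mapping, written as a Python sketch that emits Turtle declarations: non-lexical object sets become `owl:Class`, relationship sets between them become `owl:ObjectProperty`, and lexical object sets become `owl:DatatypeProperty`. The `http://example.org/tisp#` namespace, the `_has_` naming rule, and the function names are assumptions, and the provenance annotation properties are omitted.

```python
import re


def local_name(label):
    """Turn a table label into a safe local name (hypothetical naming rule)."""
    return re.sub(r"\W+", "_", label).strip("_")


def to_turtle(nonlexical, lexical, relationships):
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix : <http://example.org/tisp#> .",
    ]
    for label in sorted(nonlexical):
        lines.append(f":{local_name(label)} a owl:Class .")
    for parent, child in sorted(relationships):
        if child in nonlexical:  # relationship between two object sets
            name = f"{local_name(parent)}_has_{local_name(child)}"
            lines.append(f":{name} a owl:ObjectProperty .")
    for label in sorted(lexical):
        lines.append(f":{local_name(label)} a owl:DatatypeProperty .")
    return "\n".join(lines)


turtle = to_turtle(
    nonlexical={"Identification", "Gene model(s)"},
    lexical={"Protein"},
    relationships={
        ("Identification", "Gene model(s)"),
        ("Gene model(s)", "Protein"),
    },
)
```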

Generated Ontology

Generated Ontology

26

RDF Graph

27

Query the Data

Find the protein and the amino-acid information for gene “cdk-4”

28

TISP Evaluation

• Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries

• Data: > 2,000 tables in 35 sites

• Evaluation
  – Initial two sibling pages
    • Correct separation of data tables from layout tables?
    • Correct pattern recognition?
  – Remaining tables in site
    • Information properly extracted?
    • Able to detect and adjust for pattern variations?

29

Experimental Results

• Table recognition: correctly discarded 157 of 158 layout tables

• Pattern recognition: correctly found 69 of 72 structure patterns

• Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct

30

TISP++ Performance

• Performance depends on TISP
• TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly

31

Form-based Ontology Creation and Information Harvesting (FOCIH)

• Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence

• Transformable to ontological descriptions
• Capable of accepting source data

• Automated ontology creation
• Automated information harvesting

32

Form Creation

33

Created Sample Form

34

Generated Ontology View

35

Source-to-Form Mapping

36

Source-to-Form Mapping

37

Source-to-Form Mapping

38

Source-to-Form Mapping

39

Almost Ready to Harvest

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition
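A reading path can be pictured as the chain of DOM elements leading to a value the user mapped to a form field. The Python sketch below uses the standard-library `html.parser` and assumes well-formed HTML with every element explicitly closed. The class name, the `tag[index]` path notation, and the sample markup are illustrative assumptions, not FOCIH's actual representation.

```python
from html.parser import HTMLParser


class PathFinder(HTMLParser):
    """Record the element path to the first text node equal to `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []     # open elements, e.g. ["table[1]", "tr[2]"]
        self.counts = [{}]  # per-depth counters of child tags seen so far
        self.path = None

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        if self.path is None and data.strip() == self.target:
            self.path = "/".join(self.stack)


def find_path(html, target):
    finder = PathFinder(target)
    finder.feed(html)
    finder.close()
    return finder.path


# Hypothetical car-ad fragment.
html = (
    "<table><tr><td>Price</td><td>100</td></tr>"
    "<tr><td>Mileage</td><td>5000</td></tr></table>"
)
path = find_path(html, "5000")
```

Once such a path is known for one page, the same path can be followed on sibling pages to locate the corresponding value.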

40

Reading Path

41

Pattern & Instance Recognition

42

Pattern & Instance Recognition

43

Pattern & Instance Recognition

regular expression for decimal number

left context

right context
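The three callouts suggest how a substring value is located: a value expression anchored by the text on either side of it. A minimal Python sketch under that assumption follows. The `DECIMAL` expression, the function name, and the sample string are hypothetical.

```python
import re

# Assumed expression for a decimal number, e.g. "45" or "45.5".
DECIMAL = r"\d+(?:\.\d+)?"


def extract_between(text, left_context, right_context, value_pattern=DECIMAL):
    """Return the value found between the two literal contexts, or None."""
    pattern = (
        re.escape(left_context) + "(" + value_pattern + ")" + re.escape(right_context)
    )
    match = re.search(pattern, text)
    return match.group(1) if match else None


value = extract_between("Mileage: 45.5 thousand miles", "Mileage: ", " thousand")
```

Escaping the contexts keeps characters such as `(` or `.` in the source text from being read as regular-expression syntax.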

44

Pattern & Instance Recognition

list pattern, delimiter is “,”

45

Pattern & Instance Recognition

list pattern, delimiter is regular expression for percentage numbers and a comma

46

Pattern & Instance Recognition

list pattern, delimiter is regular expression for percentage numbers and a comma
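Both list patterns shown on these slides, a plain comma delimiter and a delimiter made of a percentage number plus a comma, can be sketched with `re.split`. The expressions, the function name, and the sample strings below are assumptions for illustration.

```python
import re

COMMA_DELIM = r"\s*,\s*"
# A percentage number, optionally followed by a comma, acts as the delimiter.
PERCENT_DELIM = r"\s*\d+(?:\.\d+)?%\s*,?\s*"


def split_list(text, delimiter):
    """Split a list-valued string and drop empty pieces."""
    return [item.strip() for item in re.split(delimiter, text) if item.strip()]


colors = split_list("red, green, blue", COMMA_DELIM)
groups = split_list("Protestant 52%, Roman Catholic 24%, Mormon 2%", PERCENT_DELIM)
```

Treating the percentages as part of the delimiter leaves only the item names, which is the effect the slide's second list pattern describes.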

47

Can Now Harvest

48

Can Now Harvest

49

Can Now Harvest

50

Semantic Annotation

51

Semantic Annotation

52

Semantic Annotation

53

Semantic Annotation

54

Semantic Annotation

55

Semantic Query

56

FOCIH Performance

• Ontology creation
• Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance

57

FOCIH Performance

• Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
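The counts above can be cross-checked with a few lines of Python. The overall rate is derived here from the per-category figures and is not a number stated on the slide.

```python
# (correct, tested) per mapping category, as reported on the slide.
results = {
    "full-string": (25, 25),
    "substring": (29, 38),
    "list pattern": (6, 8),
}

total_correct = sum(correct for correct, _ in results.values())
total_tested = sum(tested for _, tested in results.values())
rates = {
    name: round(100 * correct / tested, 1)
    for name, (correct, tested) in results.items()
}
overall = round(100 * total_correct / total_tested, 1)
```

The category sizes sum to the 71 tested mappings, and the three categories together give 60 correct mappings.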

58

FOCIH Difficulties

59

FOCIH Difficulties

60

FOCIH Difficulties

No selection

61

WoK via TISP

62

WoK via TISP

63

WoK via FOCIH

64

WoK via FOCIH

65

Contributions

• TISP: automatic sibling table interpretation
• TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
• FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge

66

Future Work

• Sibling pages in addition to sibling tables

• Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies.