Transcript
Page 1: Cui Tao PhD Dissertation Defense

1

Cui Tao
PhD Dissertation Defense

Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages

Page 2: Cui Tao PhD Dissertation Defense

2

Motivation

• Birth date of my great-grandpa
• Price and mileage of red Nissans, 1990 or newer
• Protein and amino-acid information for gene cdk-4
• US states with property crime rates above 1%

Page 3: Cui Tao PhD Dissertation Defense

3

Search by Search Engine

Page 4: Cui Tao PhD Dissertation Defense

4

Search the Hidden Web

• The Hidden Web:
  – Hidden behind forms
  – Hard to query

“cdk-4”

Page 5: Cui Tao PhD Dissertation Defense

5

Query for Data

• The Hidden Web:
  – Hidden behind forms
  – Hard to query

Find the protein and the amino-acid information for gene “cdk-4”

Page 6: Cui Tao PhD Dissertation Defense

6

A Web of Pages → A Web of Knowledge

• Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages
• Semantic annotation
  – Domain ontologies
  – Populated conceptual model
• Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?

Page 7: Cui Tao PhD Dissertation Defense

Contributions of Dissertation Work

• Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge
• Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)

7

Page 8: Cui Tao PhD Dissertation Defense

8

Automatic Annotation with TISP (Table Interpretation with Sibling Pages)

• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations

Page 9: Cui Tao PhD Dissertation Defense

9

Recognize Tables

Data Table

Layout Tables (discard)

Nested Data Tables

Page 10: Cui Tao PhD Dissertation Defense

10

Find Label/Value Associations

Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918

12

Page 11: Cui Tao PhD Dissertation Defense

11

Interpretation Technique: Sibling Page Comparison

Page 12: Cui Tao PhD Dissertation Defense

12

Interpretation Technique: Sibling Page Comparison

Same

Page 13: Cui Tao PhD Dissertation Defense

13

Interpretation Technique: Sibling Page Comparison

Almost Same

Page 14: Cui Tao PhD Dissertation Defense

14

Interpretation Technique: Sibling Page Comparison

Different

Same

Page 15: Cui Tao PhD Dissertation Defense

15

Technique Details

• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (layout table: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
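The matching step above can be sketched roughly as follows. This is only an illustration, not the TISP implementation: the positional cell comparison, the `match_score` and `classify_pair` names, and the 0.3 "reasonable" threshold are all assumptions made for the example.

```python
def match_score(table_a, table_b):
    """Fraction of cell positions holding identical text in both tables."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    longest = max(len(cells_a), len(cells_b))
    if longest == 0:
        return 1.0  # two empty tables are trivially identical
    same = sum(1 for x, y in zip(cells_a, cells_b) if x == y)
    return same / longest


def classify_pair(table_a, table_b, reasonable=0.3):
    """Classify a pair of tables taken from two sibling pages.

    The threshold is a hypothetical value chosen for this sketch.
    """
    score = match_score(table_a, table_b)
    if score == 1.0:
        return "layout"      # "perfect" match: nothing varies, so discard
    if score >= reasonable:
        return "sibling"     # "reasonable" match: labels agree, values differ
    return "unmatched"


# Hypothetical sibling pages: same navigation table, different car ads.
nav_1 = [["Home", "Search", "Contact"]]
nav_2 = [["Home", "Search", "Contact"]]
ad_1 = [["Make", "Nissan"], ["Year", "1990"], ["Color", "red"]]
ad_2 = [["Make", "Honda"], ["Year", "1995"], ["Color", "blue"]]

print(classify_pair(nav_1, nav_2))
print(classify_pair(ad_1, ad_2))
```

In the sibling case, the cells that stay the same across pages are label candidates and the cells that change are value candidates.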

Page 16: Cui Tao PhD Dissertation Defense

16

Table Unnesting

Page 17: Cui Tao PhD Dissertation Defense

17

Table Structure Patterns

Regularity Expectations:

• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …

Pattern combinations are also possible.
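To make the two listed patterns concrete, here is a small sketch of how a table could be read once its pattern is known. The function names and the car-ad cell values are hypothetical, and the tables are assumed to be already parsed into lists of cell strings.

```python
def interpret_label_value_rows(rows):
    """First pattern: every row holds one label cell followed by one value cell."""
    return {label: value for label, value in rows}


def interpret_header_then_values(rows):
    """Second pattern: one row of n labels, then one or more rows of n values."""
    labels, *value_rows = rows
    return [dict(zip(labels, values)) for values in value_rows]


# Hypothetical tables, already reduced to cell text.
vertical = [["Make", "Nissan"], ["Year", "1990"]]
horizontal = [["Make", "Year"], ["Nissan", "1990"], ["Honda", "1995"]]

print(interpret_label_value_rows(vertical))
print(interpret_header_then_values(horizontal))
```

Both functions return label/value associations, so later steps do not need to know which pattern a table followed.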

Page 18: Cui Tao PhD Dissertation Defense

18

Table Structure Patterns

<tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

Page 19: Cui Tao PhD Dissertation Defense

19

Pattern Usage

Page 20: Cui Tao PhD Dissertation Defense

20

Dynamic Pattern Adjustment

Page 21: Cui Tao PhD Dissertation Defense

21

TISP++

• Automatic ontology generation

• Automatic information annotation

Page 22: Cui Tao PhD Dissertation Defense

22

Ontology Generation – OSM

• Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation
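A minimal sketch of the mapping above, assuming an interpreted table is available as a nested dictionary (a nested table becomes a nested dictionary). The `derive_osm` name, the "Gene" root, and the simplified nesting of the earlier label path are assumptions for illustration; constraints are not modeled here.

```python
def derive_osm(root, table):
    """Collect lexical/non-lexical object sets and relationship sets.

    `root` names the object set the whole page describes (an assumption here).
    """
    lexical, nonlexical, relationships = set(), {root}, set()

    def walk(parent, node):
        for label, value in node.items():
            relationships.add((parent, label))   # nesting gives a relationship
            if isinstance(value, dict):
                nonlexical.add(label)            # label leads to another table
                walk(label, value)
            else:
                lexical.add(label)               # label leads to an actual value

    walk(root, table)
    return lexical, nonlexical, relationships


# Simplified from the label path shown earlier in the talk.
interpreted = {"Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}}}
lexical, nonlexical, relationships = derive_osm("Gene", interpreted)
print(sorted(lexical))
print(sorted(nonlexical))
print(sorted(relationships))
```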

Page 23: Cui Tao PhD Dissertation Defense

23

Ontology Generation – OWL

• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
  – OWL data type property
  – Different annotation properties to keep track of the provenance
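The correspondence above could be emitted as Turtle-style statements along these lines. This is a sketch only: the naming scheme (`local_name`, properties named parent_child) is an assumption, prefix declarations are left out, and the provenance annotation properties are not shown.

```python
import re


def local_name(label):
    """Turn a table label into a simple identifier (naming scheme is an assumption)."""
    return re.sub(r"\W+", "_", label).strip("_")


def to_turtle(nonlexical, lexical, relationships):
    """Emit one statement per OWL class and per property."""
    lines = [f":{local_name(name)} a owl:Class ." for name in sorted(nonlexical)]
    for parent, child in sorted(relationships):
        # A relationship ending in a lexical object set becomes a datatype property.
        kind = "owl:DatatypeProperty" if child in lexical else "owl:ObjectProperty"
        prop = f":{local_name(parent)}_{local_name(child)}"
        lines.append(f"{prop} a {kind} ; rdfs:domain :{local_name(parent)} .")
    return lines


nonlexical = {"Identification", "Gene model(s)"}
lexical = {"Protein"}
relationships = {("Identification", "Gene model(s)"), ("Gene model(s)", "Protein")}

for line in to_turtle(nonlexical, lexical, relationships):
    print(line)
```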

Page 24: Cui Tao PhD Dissertation Defense

Generated Ontology

Page 25: Cui Tao PhD Dissertation Defense

Generated Ontology

Page 26: Cui Tao PhD Dissertation Defense

26

RDF Graph

Page 27: Cui Tao PhD Dissertation Defense

27

Query the Data

Find the protein and the amino-acid information for gene “cdk-4”
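Once the data is annotated as triples, a query like this one reduces to pattern matching over the graph. The sketch below uses plain Python tuples instead of a real RDF store or query language; the subject identifiers, predicate names, and bracketed values are placeholders, not data from the talk.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching subject/predicate/object (None is a wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def describe_gene(triples, gene_name):
    """Look up the protein and amino-acid values attached to a named gene."""
    result = {}
    for subject, _, _ in match(triples, p="name", o=gene_name):
        for wanted in ("protein", "aminoAcids"):
            result[wanted] = [o for _, _, o in match(triples, s=subject, p=wanted)]
    return result


# Placeholder graph: identifiers, predicates and values are hypothetical.
triples = [
    ("gene:1", "name", "cdk-4"),
    ("gene:1", "protein", "<protein value>"),
    ("gene:1", "aminoAcids", "<amino-acid value>"),
    ("gene:2", "name", "<another gene>"),
]

print(describe_gene(triples, "cdk-4"))
```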

Page 28: Cui Tao PhD Dissertation Defense

28

TISP Evaluation

• Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
  – Initial two sibling pages
    • Correct separation of data tables from layout tables?
    • Correct pattern recognition?
  – Remaining tables in site
    • Information properly extracted?
    • Able to detect and adjust for pattern variations?

Page 29: Cui Tao PhD Dissertation Defense

29

Experimental Results

• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
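The counts above can be turned into percentages with a few lines of arithmetic. The figures come from the slide; the `percent` helper and the one-decimal rounding are choices made for this sketch.

```python
def percent(correct, total):
    """Share of correct cases, as a percentage rounded to one decimal."""
    return round(100 * correct / total, 1)


results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
    "path and label adjustments": (5 + 34, 5 + 34),
}

for name, (correct, total) in results.items():
    print(f"{name}: {correct}/{total} = {percent(correct, total)}%")
```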

Page 30: Cui Tao PhD Dissertation Defense

30

TISP++ Performance

• Performance depends on TISP
• TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly

Page 31: Cui Tao PhD Dissertation Defense

31

Form-based Ontology Creation and Information Harvesting (FOCIH)

• Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
• Transformable to ontological descriptions
• Capable of accepting source data
• Automated ontology creation
• Automated information harvesting

Page 32: Cui Tao PhD Dissertation Defense

32

Form Creation

Page 33: Cui Tao PhD Dissertation Defense

33

Created Sample Form

Page 34: Cui Tao PhD Dissertation Defense

34

Generated Ontology View

Page 35: Cui Tao PhD Dissertation Defense

35

Source-to-Form Mapping

Page 36: Cui Tao PhD Dissertation Defense

36

Source-to-Form Mapping

Page 37: Cui Tao PhD Dissertation Defense

37

Source-to-Form Mapping

Page 38: Cui Tao PhD Dissertation Defense

38

Source-to-Form Mapping

Page 39: Cui Tao PhD Dissertation Defense

39

Almost Ready to Harvest

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition
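One way to picture a reading path is as the sequence of DOM steps from the page root down to the node holding a user-selected value, which can then be replayed on other pages of the same site. The sketch below assumes well-formed markup that the standard-library XML parser accepts; the `reading_path` name, the path notation, and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def reading_path(root, target_text):
    """Return the list of DOM steps leading to the element whose text matches."""
    def walk(node, path):
        if (node.text or "").strip() == target_text:
            return path
        counts = {}
        for child in node:
            # Number siblings per tag so the step identifies one element.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, path + [f"{child.tag}[{counts[child.tag]}]"])
            if found is not None:
                return found
        return None

    return walk(root, [root.tag])


# Hypothetical source page, reduced to a well-formed fragment.
page = "<html><body><table><tr><td>Price</td><td>4500</td></tr></table></body></html>"
path = reading_path(ET.fromstring(page), "4500")
print("/".join(path))
```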

Page 40: Cui Tao PhD Dissertation Defense

40

Reading Path

Page 41: Cui Tao PhD Dissertation Defense

41

Pattern & Instance Recognition

Page 42: Cui Tao PhD Dissertation Defense

42

Pattern & Instance Recognition

Page 43: Cui Tao PhD Dissertation Defense

43

Pattern & Instance Recognition

regular expression for decimal number
left context
right context
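The three pieces named on this slide combine naturally into a single expression: literal left context, a value pattern, literal right context. A minimal sketch, where the `build_extractor` name, the decimal pattern, and the sample string are assumptions for illustration:

```python
import re

# A simple decimal-number pattern (one possible choice, not the only one).
DECIMAL = r"\d+(?:\.\d+)?"


def build_extractor(left, right, value_pattern=DECIMAL):
    """Compile a regex that captures a value between fixed left/right context."""
    return re.compile(re.escape(left) + "(" + value_pattern + ")" + re.escape(right))


# Hypothetical cell text where only a substring is the wanted value.
extractor = build_extractor("Population: ", " million")
found = extractor.search("Population: 5.3 million (est.)")
print(found.group(1))
```

The contexts are escaped so that punctuation in the source text is matched literally, while the value pattern stays a real regular expression.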

Page 44: Cui Tao PhD Dissertation Defense

44

Pattern & Instance Recognition

list pattern, delimiter is “,”

Page 45: Cui Tao PhD Dissertation Defense

45

Pattern & Instance Recognition

list pattern, delimiter is regular expression for percentage numbers and a comma

Page 46: Cui Tao PhD Dissertation Defense

46

Pattern & Instance Recognition

list pattern, delimiter is regular expression for percentage numbers and a comma
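Both list patterns from these slides can be sketched as a split on a delimiter expression: a plain comma in one case, a percentage number followed by a comma in the other. The `split_list` helper, the two delimiter expressions, and the sample strings are hypothetical.

```python
import re

COMMA = r"\s*,\s*"
# A percentage number, optionally followed by a comma, acting as the delimiter.
PERCENT_THEN_COMMA = r"\s*\d+(?:\.\d+)?%(?:\s*,\s*)?"


def split_list(text, delimiter):
    """Split a cell into list items, dropping empty pieces left by the split."""
    return [item for item in re.split(delimiter, text) if item]


print(split_list("red, green,blue", COMMA))
print(split_list("A 60%, B 30%, C 10%", PERCENT_THEN_COMMA))
```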

Page 47: Cui Tao PhD Dissertation Defense

47

Can Now Harvest

Page 48: Cui Tao PhD Dissertation Defense

48

Can Now Harvest

Page 49: Cui Tao PhD Dissertation Defense

49

Can Now Harvest

Page 50: Cui Tao PhD Dissertation Defense

50

Semantic Annotation

Page 51: Cui Tao PhD Dissertation Defense

51

Semantic Annotation

Page 52: Cui Tao PhD Dissertation Defense

52

Semantic Annotation

Page 53: Cui Tao PhD Dissertation Defense

53

Semantic Annotation

Page 54: Cui Tao PhD Dissertation Defense

54

Semantic Annotation

Page 55: Cui Tao PhD Dissertation Defense

55

Semantic Query

Page 56: Cui Tao PhD Dissertation Defense

56

FOCIH Performance

• Ontology creation
• Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance

Page 57: Cui Tao PhD Dissertation Defense

57

FOCIH Performance

• Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
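As a quick consistency check on these figures, the three categories should add up to the 71 tested mappings, and an overall accuracy follows from the per-category counts. The overall percentage is derived here, not stated on the slide; the `accuracy` helper is an assumption of this sketch.

```python
def accuracy(correct, total):
    """Percentage of correct mappings, rounded to one decimal."""
    return round(100 * correct / total, 1)


# (correct, tested) per mapping kind, as reported on the slide.
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

tested = sum(total for _, total in mappings.values())
correct = sum(ok for ok, _ in mappings.values())

for kind, (ok, total) in mappings.items():
    print(f"{kind}: {accuracy(ok, total)}%")
print(f"overall: {correct}/{tested} = {accuracy(correct, tested)}%")
```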

Page 58: Cui Tao PhD Dissertation Defense

58

FOCIH Difficulties

Page 59: Cui Tao PhD Dissertation Defense

59

FOCIH Difficulties

Page 60: Cui Tao PhD Dissertation Defense

60

FOCIH Difficulties

No selection

Page 61: Cui Tao PhD Dissertation Defense

61

WoK via TISP

Page 62: Cui Tao PhD Dissertation Defense

62

WoK via TISP

Page 63: Cui Tao PhD Dissertation Defense

63

WoK via FOCIH

Page 64: Cui Tao PhD Dissertation Defense

64

WoK via FOCIH

Page 65: Cui Tao PhD Dissertation Defense

65

Contributions

• TISP: automatic sibling table interpretation
• TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
• FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge

Page 66: Cui Tao PhD Dissertation Defense

66

Future Work

• Sibling pages in addition to sibling tables

• Reverse engineer from ontologies to forms as a basis for information harvesting for already-defined ontologies

