Cui Tao PhD Dissertation Defense: Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages


Page 1: Cui Tao PhD Dissertation Defense

1

Cui Tao
PhD Dissertation Defense

Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages

Page 2: Cui Tao PhD Dissertation Defense

2

Motivation

• Birth date of my great grandpa
• Price and mileage of red Nissans, 1990 or newer
• Protein and amino-acid information for gene cdk-4
• US states with property crime rates above 1%

Page 3: Cui Tao PhD Dissertation Defense

3

Search by Search Engine

Page 4: Cui Tao PhD Dissertation Defense

4

Search the Hidden Web

• The Hidden Web:
  – Hidden behind forms
  – Hard to query

“cdk-4”

Page 5: Cui Tao PhD Dissertation Defense

5

Query for Data

• The Hidden Web:
  – Hidden behind forms
  – Hard to query

Find the protein and the amino-acid information for gene “cdk-4”

Page 6: Cui Tao PhD Dissertation Defense

6

A Web of Pages → A Web of Knowledge

• Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages

• Semantic annotation
  – Domain ontologies
  – Populated conceptual model

• Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?

Page 7: Cui Tao PhD Dissertation Defense

Contributions of Dissertation Work

• Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge

• Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)

7

Page 8: Cui Tao PhD Dissertation Defense

8

Automatic Annotation with TISP (Table Interpretation with Sibling Pages)

• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations

Page 9: Cui Tao PhD Dissertation Defense

9

Recognize Tables

Data Table

Layout Tables (discard)

Nested Data Tables

Page 10: Cui Tao PhD Dissertation Defense

10

Find Label/Value Associations

Example:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918


Page 11: Cui Tao PhD Dissertation Defense

11

Interpretation Technique:Sibling Page Comparison

Page 12: Cui Tao PhD Dissertation Defense

12

Interpretation Technique:Sibling Page Comparison

Same

Page 13: Cui Tao PhD Dissertation Defense

13

Interpretation Technique:Sibling Page Comparison

Almost Same

Page 14: Cui Tao PhD Dissertation Defense

14

Interpretation Technique:Sibling Page Comparison

Different

Same

Page 15: Cui Tao PhD Dissertation Defense

15

Technique Details

• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (table for layout: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
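
The matching step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TISP implementation: each table is reduced to a flat list of cell texts, similarity is measured with the standard library's `difflib.SequenceMatcher`, and the 0.5 cut-off for a "reasonable" match is a made-up value. The function name and the sample cell texts are hypothetical.

```python
from difflib import SequenceMatcher


def classify_table_pair(cells_a, cells_b, reasonable=0.5):
    """Compare the cell texts of two tables taken from sibling pages.

    Returns "layout" for a perfect match (identical on both pages, so the
    table carries no page-specific data), "sibling" for a reasonable match,
    and "unmatched" otherwise. The threshold is an assumed value.
    """
    ratio = SequenceMatcher(None, cells_a, cells_b).ratio()
    if ratio == 1.0:
        return "layout"
    if ratio >= reasonable:
        return "sibling"
    return "unmatched"


# Hypothetical cell texts from two sibling pages.
nav_a = ["Home", "Search", "Help"]
nav_b = ["Home", "Search", "Help"]
data_a = ["Gene", "cdk-4", "Status", "confirmed"]
data_b = ["Gene", "other-gene", "Status", "confirmed"]

print(classify_table_pair(nav_a, nav_b))    # identical navigation table
print(classify_table_pair(data_a, data_b))  # same labels, different values
```

A sequence-based ratio is just one possible similarity measure; a tree comparison over the DOM would be another option.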

Page 16: Cui Tao PhD Dissertation Defense

16

Table Unnesting

Page 17: Cui Tao PhD Dissertation Defense

17

Table Structure Patterns

Regularity Expectations:

• (<tr><(td|th)> {L} <(td|th)> {V})^n

• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

• …

Pattern combinations are also possible.
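
As a rough illustration of how the two patterns could be told apart, the sketch below marks each cell as a label (L) when its text is identical in two sibling tables and as a value (V) when it differs, then checks the resulting grid against each pattern. It assumes both tables have the same shape and that values never coincide across pages, which real data will not always satisfy. All function names and sample data are hypothetical.

```python
def label_value_grid(table_a, table_b):
    """Mark each cell 'L' if its text is the same in both sibling tables, else 'V'."""
    return [
        ["L" if a == b else "V" for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


def detect_pattern(grid):
    """Return the name of the structure pattern the L/V grid fits, or None."""
    # Pattern 1: every row is a label cell followed by a value cell.
    if grid and all(row == ["L", "V"] for row in grid):
        return "label-value rows"
    # Pattern 2: one row of n labels, then one or more rows of n values.
    if len(grid) >= 2:
        n = len(grid[0])
        header_ok = n > 0 and all(cell == "L" for cell in grid[0])
        body_ok = all(
            len(row) == n and all(cell == "V" for cell in row)
            for row in grid[1:]
        )
        if header_ok and body_ok:
            return "header row + value rows"
    return None


# Hypothetical sibling tables from two car-ad pages.
rows_a = [["Make", "Nissan"], ["Year", "1995"]]
rows_b = [["Make", "Honda"], ["Year", "2001"]]
cols_a = [["Make", "Year"], ["Nissan", "1995"]]
cols_b = [["Make", "Year"], ["Honda", "2001"]]

print(detect_pattern(label_value_grid(rows_a, rows_b)))
print(detect_pattern(label_value_grid(cols_a, cols_b)))
```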

Page 18: Cui Tao PhD Dissertation Defense

18

Table Structure Patterns

<tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+

Page 19: Cui Tao PhD Dissertation Defense

19

Pattern Usage

Page 20: Cui Tao PhD Dissertation Defense

20

Dynamic Pattern Adjustment

Page 21: Cui Tao PhD Dissertation Defense

21

TISP++

• Automatic ontology generation

• Automatic information annotation

Page 22: Cui Tao PhD Dissertation Defense

22

Ontology Generation – OSM

• Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation

Page 23: Cui Tao PhD Dissertation Defense

23

Ontology Generation – OWL

• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
  – OWL datatype property
  – Different annotation properties to keep track of the provenance
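
To make the mapping concrete, here is a minimal sketch that emits Turtle declarations following the correspondence listed above: non-lexical object sets become OWL classes, nesting becomes object properties, and lexical object sets become datatype properties. The base IRI, the `has...` property naming, and the function itself are assumptions made for illustration, and the provenance annotation properties are left out.

```python
def to_turtle(nonlexical, lexical, nesting, base="http://example.org/onto#"):
    """Build Turtle text for a small ontology.

    nonlexical: names of non-lexical object sets (labels that lead to other tables)
    lexical:    (owner, label) pairs for labels that hold actual values
    nesting:    (parent, child) pairs taken from table nesting
    The base IRI and the property naming scheme are assumed, not from the slides.
    """
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix : <{base}> .",
        "",
    ]
    for name in nonlexical:
        lines.append(f":{name} a owl:Class .")
    for parent, child in nesting:
        lines.append(
            f":has{child} a owl:ObjectProperty ; "
            f"rdfs:domain :{parent} ; rdfs:range :{child} ."
        )
    for owner, label in lexical:
        lines.append(f":{label} a owl:DatatypeProperty ; rdfs:domain :{owner} .")
    return "\n".join(lines)


# Labels borrowed from the Identification / Gene model(s) / Protein example,
# with names simplified so they are valid Turtle local names.
ttl = to_turtle(
    nonlexical=["Identification", "GeneModel"],
    lexical=[("GeneModel", "Protein")],
    nesting=[("Identification", "GeneModel")],
)
print(ttl)
```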

Page 24: Cui Tao PhD Dissertation Defense

Generated Ontology

Page 25: Cui Tao PhD Dissertation Defense

Generated Ontology

Page 26: Cui Tao PhD Dissertation Defense

26

RDF Graph

Page 27: Cui Tao PhD Dissertation Defense

27

Query the Data

Find the protein and the amino-acid information for gene “cdk-4”

Page 28: Cui Tao PhD Dissertation Defense

28

TISP Evaluation

• Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
  – Initial two sibling pages
    • Correct separation of data tables from layout tables?
    • Correct pattern recognition?
  – Remaining tables in site
    • Information properly extracted?
    • Able to detect and adjust for pattern variations?

Page 29: Cui Tao PhD Dissertation Defense

29

Experimental Results

• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
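
For readers who prefer percentages, the counts above convert as follows. The snippet only restates the reported numbers; the dictionary keys are informal names chosen here.

```python
# (correct, total) counts as reported on the slide.
results = {
    "table recognition": (157, 158),
    "pattern recognition": (69, 72),
    "adjustments": (5 + 34, 5 + 34),  # path + label adjustments, all correct
}

rates = {name: round(100 * ok / total, 1) for name, (ok, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```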

Page 30: Cui Tao PhD Dissertation Defense

30

TISP++ Performance

• Performance depends on TISP
• TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly

Page 31: Cui Tao PhD Dissertation Defense

31

Form-based Ontology Creation and Information Harvesting (FOCIH)

• Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
    • Transformable to ontological descriptions
    • Capable of accepting source data
• Automated ontology creation
• Automated information harvesting

Page 32: Cui Tao PhD Dissertation Defense

32

Form Creation

Page 33: Cui Tao PhD Dissertation Defense

33

Created Sample Form

Page 34: Cui Tao PhD Dissertation Defense

34

Generated Ontology View

Page 35: Cui Tao PhD Dissertation Defense

35

Source-to-Form Mapping

Page 36: Cui Tao PhD Dissertation Defense

36

Source-to-Form Mapping

Page 37: Cui Tao PhD Dissertation Defense

37

Source-to-Form Mapping

Page 38: Cui Tao PhD Dissertation Defense

38

Source-to-Form Mapping

Page 39: Cui Tao PhD Dissertation Defense

39

Almost Ready to Harvest

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition

Page 40: Cui Tao PhD Dissertation Defense

40

Reading Path

Page 41: Cui Tao PhD Dissertation Defense

41

Pattern & Instance Recognition

Page 42: Cui Tao PhD Dissertation Defense

42

Pattern & Instance Recognition

Page 43: Cui Tao PhD Dissertation Defense

43

Pattern & Instance Recognition

regular expression for decimal number
left context
right context
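
One way to picture the three parts named on this slide is as a single regular expression: a literal left context, a decimal-number expression, and a literal right context. The sketch below is an assumption about how such a pattern might be assembled, not FOCIH's actual recognizer. The helper name and the sample text are invented for illustration.

```python
import re


def build_extractor(left_context, right_context):
    """Compile a pattern that captures a decimal number between two literal contexts."""
    decimal = r"\d+(?:\.\d+)?"  # integer part with an optional fractional part
    return re.compile(
        re.escape(left_context) + "(" + decimal + ")" + re.escape(right_context)
    )


# Hypothetical source text; the contexts would come from a user's highlighted value.
extractor = build_extractor("Crime rate: ", "%")
match = extractor.search("Property Crime rate: 3.4% (2005)")
value = match.group(1) if match else None
print(value)
```

Escaping the contexts with `re.escape` keeps characters such as `(` or `.` in the surrounding text from being read as regex syntax.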

Page 44: Cui Tao PhD Dissertation Defense

44

Pattern & Instance Recognition

list pattern, delimiter is “,”

Page 45: Cui Tao PhD Dissertation Defense

45

Pattern & Instance Recognition

list pattern, delimiter is regular expression for percentage numbers and a comma

Page 46: Cui Tao PhD Dissertation Defense

46

Pattern & Instance Recognition

list pattern, delimiter is regular expression for percentage numbers and a comma
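
A list pattern of this kind can be approximated by splitting on the delimiter, whether it is a plain comma or a regular expression. The following is a sketch under assumptions: the delimiter expression for "a percentage number and a comma" is a guess at what such a pattern could look like, and the sample strings are invented.

```python
import re

# Assumed delimiter: a percentage number followed by an optional comma.
PERCENT_AND_COMMA = re.compile(r"\s*\d+(?:\.\d+)?%\s*,?\s*")


def split_list(text, delimiter):
    """Split text on a delimiter (string pattern or compiled regex), dropping empty items."""
    return [item.strip() for item in re.split(delimiter, text) if item.strip()]


# Hypothetical list values.
colors = split_list("red, green, blue", ",")
groups = split_list("White 83.8%, Asian 1.7%, Other 14.5%", PERCENT_AND_COMMA)
print(colors)
print(groups)
```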

Page 47: Cui Tao PhD Dissertation Defense

47

Can Now Harvest

Page 48: Cui Tao PhD Dissertation Defense

48

Can Now Harvest

Page 49: Cui Tao PhD Dissertation Defense

49

Can Now Harvest

Page 50: Cui Tao PhD Dissertation Defense

50

Semantic Annotation

Page 51: Cui Tao PhD Dissertation Defense

51

Semantic Annotation

Page 52: Cui Tao PhD Dissertation Defense

52

Semantic Annotation

Page 53: Cui Tao PhD Dissertation Defense

53

Semantic Annotation

Page 54: Cui Tao PhD Dissertation Defense

54

Semantic Annotation

Page 55: Cui Tao PhD Dissertation Defense

55

Semantic Query

Page 56: Cui Tao PhD Dissertation Defense

56

FOCIH Performance

• Ontology creation
• Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance

Page 57: Cui Tao PhD Dissertation Defense

57

FOCIH Performance

• Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
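
The per-category counts can be cross-checked against the stated total of 71 mappings and combined into an overall rate. The snippet uses only the numbers reported above; the variable names are chosen here.

```python
# (correct, tested) per mapping category, as reported on the slide.
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

tested = sum(total for _, total in mappings.values())
correct = sum(ok for ok, _ in mappings.values())
overall = round(100 * correct / tested, 1)

print(f"{correct}/{tested} mappings correct ({overall}%)")
for name, (ok, total) in mappings.items():
    print(f"{name}: {round(100 * ok / total, 1)}%")
```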

Page 58: Cui Tao PhD Dissertation Defense

58

FOCIH Difficulties

Page 59: Cui Tao PhD Dissertation Defense

59

FOCIH Difficulties

Page 60: Cui Tao PhD Dissertation Defense

60

FOCIH Difficulties

No selection

Page 61: Cui Tao PhD Dissertation Defense

61

WoK via TISP

Page 62: Cui Tao PhD Dissertation Defense

62

WoK via TISP

Page 63: Cui Tao PhD Dissertation Defense

63

WoK via FOCIH

Page 64: Cui Tao PhD Dissertation Defense

64

WoK via FOCIH

Page 65: Cui Tao PhD Dissertation Defense

65

Contributions

• TISP: automatic sibling table interpretation
• TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
• FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge

Page 66: Cui Tao PhD Dissertation Defense

66

Future Work

• Sibling pages in addition to sibling tables

• Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies.