WeST – Web Science & Technologies University of Koblenz ▪ Landau, Germany Building and Using Knowledge Bases Steffen Staab Saqib Mir – European Bioinformatics Institute Ermelinda d‘Oro, Massimo Ruffolo – Univ. Calabria, Italy & WeST Team

Slide 2

Semantic Web

Web Retrieval

Social Web

Multimedia Web

Software Web

Institut WeST – Web Science & Technologies

GESIS

Slide 3

PhD thesis trauma 17 years ago

„Nach dem Auspacken der LPS 105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“

“Having unwrapped the LPS 105, the onlooker is presented with a sturdy disk drive whose outer dimensions are just as small as those of the Maxtor.”

Slide 4

GENERAL MOTIVATION

The general motivation is not information extraction, but solving tasks!

Slide 5

General objective: Extracting to LOD

[Figure: LOD graph with edge labels hasLivedIn and useAsExample]

Crucial to know: Ontologies nowadays reflect this structure

Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)

Ontologies – LEGO style

Slide 6

Most famous applications

Steve Macbeth (Microsoft), in a discussion on Schema.org: “about 7% of pages we crawl have mark-up” (http://www.w3.org/2012/06/06-schema-minutes.html)

LOD Cloud

Google Knowledge Graph

Bing gets its own knowledge graph

http://searchengineland.com/bing-britannica-partnership-123930

Slide 7

Example ontology-based application 1:

ANALYSIS OF URBAN PARAMETERS

Slide 8

General objective: Analysing LOD

[Figure: LOD graph with edge labels hasLivedIn and useAsExample]

Slide 9

http://lisa.west.uni-koblenz.de/lisa-demo/

Family's analysis of Koblenz LOD + Open Street Map data

Slide 10

http://lisa.west.uni-koblenz.de/lisa-demo/

Entrepreneur's analysis of Koblenz LOD + Open Street Map data

1st Prize, German Linked Open Gov Data Competition 2012

Slide 11

Example ontology-based application 2:

FACETED MULTIMEDIA EXPLORATION

Slide 12

Making Web 2.0 More Accessible

Links Location

Persons

Knowledge Tags

low- to midlevel features

GeoNames

[Schenk et al; JoWS 2009]

Slide 13

Choosing between Koblenz – and Koblenz

Video at: http://vimeo.com/2057249

Slide 14

Contextual Information

Slide 15

Tag-based refinement

Slide 16

A tag view of „Koblenz“ & „Castle“

Slide 17

Semantic Identity – Festung Ehrenbreitstein

Slide 18

Persons – Celebrities, FOAFers & Flickr Users

Billion Triples Challenge, 1st Prize 2008

[Schenk et al; JoWS 2009]

Slide 19

Now on to information extraction:

OBSERVATIONS ON INFORMATION EXTRACTION

Slide 20

Challenges & Opportunities for IE

Not all web pages are created equal

Slide 21

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding type instances

Slide 22

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding relation instances

Slide 23

Challenges & Opportunities for IE

Some contain concepts and their descriptions, some don't

No types here, few relation types

Slide 24

Challenges & Opportunities for IE

Knowing that they are instances and of which type

Textual indication

Positional indication

Slide 25

Challenges & Opportunities for IE

To some extent, positional and layout indications work across languages and sites

Slide 26

Challenges & Opportunities for IE

owl:sameAs

We should not only think about Web pages, but about Web sites

Slide 27

Challenges & Opportunities for IE

owl:sameAs

We should not only think about Web pages, but about Web sites

Slide 28

Comparing related work to our objectives

Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text

Our objectives
• IE on Web sites
• Acquiring items
• Classifying items into
  • Instances
  • Concepts
  • Relation instances
  • Relationships
• IE also based on spatial position

There is overlap, and of course there are exceptions in related work

Slide 29

Outline

The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation

[Oro et al; VLDB 2010]

Slide 30

Presentation-oriented documents

Slide 31

Presentation-oriented documents

• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)

Slide 32

Related Work

Web query languages
• XPath 1.0 and XQuery 1.0
  • Established
  • Too difficult to use for scraping from intricate DOM structures

Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction
• exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath

Slide 33

Outline

The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation

Slide 34

Representing Spatial Relations between DOM Nodes

Slide 35

Idea: Use Spatial Relations among DOM Nodes

Slide 36

Spatial DOM (SDOM)

Slide 37

SXPath System Architecture

Slide 38

Querying for Relations Among Nodes

Rectangular Cardinal Relations (RCR)

Topological Relations

r1 E:NE r2

Spatial models allow for expressing disjunctive relations among regions
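
A minimal sketch of how a rectangular cardinal relation such as "r1 E:NE r2" could be computed from two bounding boxes. It assumes screen coordinates (y grows downward), treats r2 as the reference rectangle whose edges split the plane into nine tiles, and reports every tile that r1 overlaps. The names `Rect` and `cardinal_relation` are illustrative and not taken from the SXPath implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box of a rendered node (screen coordinates, y grows downward)."""
    x1: float
    y1: float
    x2: float
    y2: float


def _bands(lo, hi, ref_lo, ref_hi, names):
    """Return which of the three bands (before, within, after the reference
    interval) the interval (lo, hi) overlaps with positive length."""
    before, within, after = names
    hit = []
    if lo < ref_lo:
        hit.append(before)
    if lo < ref_hi and hi > ref_lo:
        hit.append(within)
    if hi > ref_hi:
        hit.append(after)
    return hit


def cardinal_relation(a, b):
    """Set of tiles of reference rectangle b that rectangle a overlaps.
    "B" denotes the tile occupied by b's own bounding box."""
    cols = _bands(a.x1, a.x2, b.x1, b.x2, ("W", "", "E"))
    rows = _bands(a.y1, a.y2, b.y1, b.y2, ("N", "", "S"))
    return {(row + col) or "B" for row in rows for col in cols}


# Hypothetical boxes: r1 lies to the right of r2 and extends above its top edge.
r1 = Rect(120, 20, 160, 80)
r2 = Rect(0, 50, 100, 100)
print(":".join(sorted(cardinal_relation(r1, r2))))  # E:NE
```

Returning a set of tiles is what makes a multi-tile relation like E:NE expressible: a region that straddles a tile boundary is related to the reference by more than one basic direction.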

Slide 39

XPath Example

Slide 40

SXPath Example

Slide 41

Slide 42

From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
• adopts intuitive path notation: axis::nodetest [pred]*
• adds to XPath
  • spatial axes
  • spatial position functions
• natural semantics for spatial querying

Slide 43

SXPath System Architecture

Slide 44

Complexity Results

Formal model defined in the paper [Oro et al; VLDB 2010]

Slide 45

Outline

The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation

Slide 46

SXPath System

Slide 47

Summative User Study

Slide 48

Summative User Study

Slide 49

Summative User Study

Slide 50

Outline

The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  • Page-level wrapper induction
  • Site-wide wrapper generation
  • Error Correction by Mutual Reinforcement
• Conclusions and Future Directions

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation

Slide 51

>1000 Life Science DBs, number growing quickly

Slide 52

Biochemical Web Sites: Observations - 1

Labeled Data

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16

Table 1: Data fields across 20 Biochemical Web sites

Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)

Slide 53

Biochemical Web Sites: Observations - 2

Dynamic Web Pages

Slide 54

Biochemical Web Sites: Observations - 3

Rich Site Structure

Slide 55

Biochemical Web Sites: Observations - 4

Semantics is often only in the report, not in the underlying relational database

Web Services Survey: 11 of 100 databases [1] provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description

[1] Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey was available at http://sabiork.villa-bosch.de/index.html/survey.html

Slide 56

Biochemical Web Sites: Extraction Tasks

Induce Wrapper

[Mir et al; DILS 2009] [Mir et al; ESWC 2010]

Slide 57

Contributions

Unsupervised Page-Level Wrapper Induction

Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)

(Acquiring the Schema/Ontology)

Automatic Error Detection and Correction by Mutual Reinforcement

Slide 58

Page-Level Wrapper Induction – 1

D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}

D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}

//*[text()]
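
One reading of this slide is that the text entries selected by //*[text()] on two sample pages are compared: entries unique to a page are taken as data candidates (D), entries shared by both pages as everything else (O). The sketch below illustrates that reading only; the function name is hypothetical, and the entry lists are the abbreviated values from the slide in an assumed order.

```python
def split_text_entries(page, other_page):
    """Partition the text entries of `page` into (data, other).

    An entry counts as a data candidate if it does not occur on the
    other sample page; shared entries go to `other`."""
    shared = set(other_page)
    data = [t for t in page if t not in shared]
    other = [t for t in page if t in shared]
    return data, other


# Abbreviated text entries of two compound pages (values from the slide,
# ordering assumed for illustration).
page1 = ["Entry", "C00221", "Name", "beta-D-Glucose", "Reaction",
         "R00026", "R01520", "Enzyme", "1.1.1.47", "3.2.1.21"]
page2 = ["Entry", "C00185", "Name", "Cellobiose", "Reaction",
         "R00026", "R00306", "Enzyme", "1.1.99.18", "3.2.1.21"]

d1, o1 = split_text_entries(page1, page2)
print(d1)
print(o1)
```

Under this reading, values such as R00026 and 3.2.1.21 initially land in O because they happen to occur on both pages, which is the kind of misclassification a later reclassification step would have to repair.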

Slide 59

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions

Slide 60

Page-Level Wrapper Induction - 3

D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21, …}
O1´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}

D2´ = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21, …}
O2´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}

Slide 61

Page-Level Wrapper Induction - 4

Selecting Labels for Data

html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)

html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)

html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)
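
The three paths suggest that a data entry is assigned the label whose DOM path shares the longest prefix with its own: "1.1.1.47" and "Enzyme" both sit under tr[8], whereas "Reaction" sits under tr[6]. The sketch below illustrates that heuristic under this assumption. It uses shortened paths without the elided steps, and the function names are hypothetical.

```python
def common_prefix_len(path_a, path_b):
    """Number of leading location steps two slash-separated paths share."""
    count = 0
    for step_a, step_b in zip(path_a.strip("/").split("/"),
                              path_b.strip("/").split("/")):
        if step_a != step_b:
            break
        count += 1
    return count


def select_label(data_path, label_paths):
    """Pick the label whose path shares the longest prefix with the data path."""
    return max(label_paths,
               key=lambda name: common_prefix_len(data_path, label_paths[name]))


# Shortened, illustrative versions of the paths on the slide.
labels = {
    "Reaction": "html/table[1]/tr[6]/th[1]/code[1]",
    "Enzyme": "html/table[1]/tr[8]/th[1]/code[1]",
}
print(select_label("html/table[1]/tr[8]/td[1]/code[1]/a[1]", labels))  # Enzyme
```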

Slide 62

Page-Level Wrapper Induction - 5

Anchor the Path

Enzyme - html/table[1]/tr[8]/th[1]/code[1]/

html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]

//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()

Pivot, Generalize, Relative
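
A minimal sketch of the three ideas named on this slide, assuming they mean the following: pivot on the label's text, express the data path relative to the label node, and generalize the final step so that sibling entries match too. The function name and the way the last step is generalized (dropping its positional predicate) are assumptions for illustration; the slide's own expression uses a position() predicate instead.

```python
import re


def anchored_xpath(label_text, label_path, data_path, generalize=True):
    """Build an XPath that starts at the node containing `label_text`
    (pivot), climbs to the deepest ancestor shared with the data node,
    and descends to the data (relative path)."""
    label_steps = label_path.strip("/").split("/")
    data_steps = data_path.strip("/").split("/")

    # Length of the shared prefix of the two absolute paths.
    shared = 0
    for a, b in zip(label_steps, data_steps):
        if a != b:
            break
        shared += 1

    ups = [".."] * (len(label_steps) - shared)
    downs = data_steps[shared:]
    if generalize and downs:
        # Drop the positional predicate of the last step, e.g. a[1] -> a.
        downs[-1] = re.sub(r"\[\d+\]$", "", downs[-1])
    return f"//*[text()='{label_text}']/" + "/".join(ups + downs) + "/text()"


print(anchored_xpath("Enzyme",
                     "html/table[1]/tr[8]/th[1]/code[1]/",
                     "html/table[1]/tr[8]/td[1]/code[1]/a[1]"))
# //*[text()='Enzyme']/../../td[1]/code[1]/a/text()
```

Anchoring on the label text rather than on an absolute path keeps the expression valid when rows are added or omitted above the label, which matters because some concepts are missing on some pages of a class.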

Slide 63

Selected Sources

KEGG, ChEBI, MSDChem
• Basic qualitative data
• Popular
• Overlapping/complementary data

Slide 64

Wrapper Induction - Evaluation

SOURCE          #L   #D    #S   TP    FN    FP   P      R
KEGG Compound   10   762   3    411   351   46   89.9   53.9
                           15   759   3     0    100    99.6
KEGG Reaction   10   205   3    173   32    0    100    84.4
                           15   205   0     0    100    100
ChEBI           22   831   3    595   236   41   93.5   71.6
                           15   829   2     0    100    99.7
MSDChem         30   600   3    600   0     20   96.7   100
                           15   600   0     20   96.7   100
Average (based on final wrappers for each source)  99.1   99.8

Sources:
KEGG Compound: http://www.genome.jp/kegg/compound/
KEGG Reaction: http://www.genome.jp/kegg/reaction/
ChEBI: http://www.ebi.ac.uk/chebi
MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/

~9 samples – ~99% P, ~98% R

Table 2: Page-level wrapper induction results, 20 test pages (L=Labels, D=Data entries, S=Training pages)
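
The P and R columns appear to follow the standard definitions of precision, TP / (TP + FP), and recall, TP / (TP + FN), in percent. As a quick check under that assumption, the sketch below recomputes the first KEGG Compound row (3 training pages) from its TP, FN and FP counts.

```python
def precision(tp, fp):
    """Share of extracted entries that are correct, in percent."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Share of true entries that were extracted, in percent."""
    return 100.0 * tp / (tp + fn)


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46.
p = precision(411, 46)
r = recall(411, 351)
print(f"P = {p:.1f}%, R = {r:.1f}%")
```

Note that TP + FN equals #D (411 + 351 = 762), so recall can equivalently be read as TP divided by the number of data entries.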

Slide 65

Site-Wide Wrapper Induction: Observations

Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
• An efficient approach should ignore these pages
• We don't need to learn the entire site structure

Slide 66

Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive pages of the same class.

Slide 67

Site-Wide Wrapper Induction: Observations - 3

Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same

Slide 68

Site-Wide Wrapper Induction

1. Start with C0

2. Follow all classified link-collections

3. Generate wrappers for each set of target pages

4. Determine if new class is formed

5. Add navigation step

6. Repeat 2 – 5 for each new class formed in 4

[Figure: start class C0 with link-collections L1, L2, L3 leading to candidate classes C1, C2, C3]

S = {C0}

If C0 != Ci (i>0): S = S + Ci

Navigation steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
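
The six steps can be sketched as a breadth-first traversal over page classes. This is an illustrative reading, not the authors' implementation: the site is modelled as a plain dictionary, each candidate class is represented by a hypothetical wrapper (here just a set of labels), and two candidates count as the same class when their wrappers are equal.

```python
from collections import deque


def discover_site_structure(start, links, wrappers):
    """links:    class -> {link-collection: candidate target class}
    wrappers: class -> hashable wrapper induced for its pages
    Returns the discovered classes S and the navigation steps W."""
    known = {wrappers[start]: start}       # wrapper -> canonical class (step 1)
    steps = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for link, target in links.get(current, {}).items():   # step 2
            wrapper = wrappers[target]                         # step 3
            if wrapper not in known:                           # step 4
                known[wrapper] = target
                queue.append(target)                           # step 6
            steps.append((current, link, known[wrapper]))      # step 5
    return list(known.values()), steps


# Hypothetical site matching the figure: L1 leads to pages of the same
# class as C0, while L2 and L3 lead to new classes.
links = {"C0": {"L1": "C1", "L2": "C2", "L3": "C3"}}
wrappers = {
    "C0": frozenset({"Entry", "Name"}),
    "C1": frozenset({"Entry", "Name"}),
    "C2": frozenset({"Reaction", "Equation"}),
    "C3": frozenset({"Enzyme", "Class"}),
}
S, W = discover_site_structure("C0", links, wrappers)
print(S)  # ['C0', 'C2', 'C3']
print(W)  # [('C0', 'L1', 'C0'), ('C0', 'L2', 'C2'), ('C0', 'L3', 'C3')]
```

Only classes reached through classified link-collections are ever visited, which reflects the earlier observation that the entire site structure does not need to be learned.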

Slide 69

Site-Wide Wrapper Induction – Evaluation

SOURCE    #C   #C'   #D     TP     FN     FP    P      R
MSDChem   1    1     N/A    N/A    N/A    N/A   N/A    N/A
ChEBI     3    1     1711   1195   516    0     100    69.8
KEGG      10   7     6223   5044   1179   188   97     81.1
Average                                         98.5   75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class (C=Classes, C'=Classes discovered, D=Data entries)

Slide 70

Error Detection and Correction: Mutual Reinforcement

Observation: Certain data reappear on more than one class of pages

Slide 71

Error Detection and Correction: Mutual Reinforcement

Reinforcement if reappearing data is correctly classified as Data

Otherwise it points to misclassification
• Label-Data Mismatch
  • Correction: Introduce more samples
• Label-Label Mismatch
  • Cannot be detected
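
A minimal sketch of how a Label-Data mismatch could be detected, assuming each page class comes with a mapping from extracted text values to their classification. The class names and the function are hypothetical; the value 3.2.1.21 is reused from the earlier slides, where it is first treated as non-data on compound pages.

```python
def find_label_data_mismatches(classified):
    """classified maps page class -> {text value: "data" or "label"}.
    Returns the values that are classified differently across classes."""
    seen = {}
    for page_class, values in classified.items():
        for value, kind in values.items():
            seen.setdefault(value, {})[page_class] = kind
    return {value: by_class for value, by_class in seen.items()
            if len(set(by_class.values())) > 1}


# Illustrative input: 3.2.1.21 was taken for a label on compound pages
# but is recognised as data on (hypothetical) enzyme pages.
classified = {
    "compound": {"Enzyme": "label", "3.2.1.21": "label", "C00221": "data"},
    "enzyme": {"3.2.1.21": "data"},
}
print(find_label_data_mismatches(classified))
# {'3.2.1.21': {'compound': 'label', 'enzyme': 'data'}}
```

The same check also shows why a Label-Label mismatch stays invisible: if a value is wrongly classified as a label in every class where it appears, the classifications agree and nothing is flagged.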

Slide 72

Where to go next?

Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model: (not treated at all by us so far)
4. Layout model: spatial positioning

Capture this generative model using machine learning
• Relational learning
  • Markov logic programmes?
  • …?

Slide 73

Bibliography

Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.

S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.

Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.

Thank you for your attention!