A Graph-Based Approach to Learn Semantic Descriptions of Data Sources


Mohsen Taheriyan

Craig Knoblock

Pedro Szekely

Jose Luis Ambite

Problem: How to learn semantic descriptions?

First, what is a semantic description?

Semantic Description

Describing the source in terms of the concepts and relationships defined by the domain ontology

[Figure: Domain Ontology. Classes: Person, Organization, Place, City, State, Event. Properties shown: name, birthdate, phone, postalCode, startDate, endDate, title, bornIn, livesIn, worksFor, ceo, location, organizer, nearby, state, isPartOf. Legend: object property, data property, subClassOf.]

Source

Column 1 | Column 2 | Column 3 | Column 4 | Column 5
Bill Gates | Oct 1955 | Microsoft | Seattle | WA
Mark Zuckerberg | May 1984 | Facebook | White Plains | NY
Larry Page | Mar 1973 | Google | East Lansing | MI

Semantic Types

Column 1 | Column 2 | Column 3 | Column 4 | Column 5
Person.name | Person.birthdate | Organization.name | City.name | State.name
Bill Gates | Oct 1955 | Microsoft | Seattle | WA
Mark Zuckerberg | May 1984 | Facebook | White Plains | NY
Larry Page | Mar 1973 | Google | East Lansing | MI

Relationships

[Figure: semantic model of the source. Person (name, birthdate) bornIn City (name); Person worksFor Organization (name); City state State (name). The data properties correspond to Columns 1 through 5 of the source table.]

This semantic model is converted to a semantic description in R2RML
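As a rough illustration of what such a semantic model contains before any conversion to R2RML, the sketch below stores the semantic types and relationships from the slide as plain Python data. The names `Link`, `SEMANTIC_TYPES`, `RELATIONSHIPS`, `classes_in_model`, and `columns_of` are hypothetical and chosen only for this example; they are not part of Karma.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """An object-property link between two ontology classes."""
    source: str
    label: str
    target: str


# Semantic types: each column maps to a (class, data property) pair.
SEMANTIC_TYPES = {
    "Column 1": ("Person", "name"),
    "Column 2": ("Person", "birthdate"),
    "Column 3": ("Organization", "name"),
    "Column 4": ("City", "name"),
    "Column 5": ("State", "name"),
}

# Relationships: the links between classes shown on the slide.
RELATIONSHIPS = [
    Link("Person", "bornIn", "City"),
    Link("Person", "worksFor", "Organization"),
    Link("City", "state", "State"),
]


def classes_in_model():
    """Return every class that appears in the semantic model."""
    classes = {cls for cls, _ in SEMANTIC_TYPES.values()}
    for link in RELATIONSHIPS:
        classes.update((link.source, link.target))
    return classes


def columns_of(cls):
    """Return the columns whose semantic type belongs to the given class."""
    return sorted(col for col, (c, _) in SEMANTIC_TYPES.items() if c == cls)
```

Keeping semantic types and relationships separate mirrors the two slides above: the first assigns a type to each column, the second links the resulting classes.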

Previous approach to learn semantic descriptions

Karma

• Inputs: Domain Ontology, Sample Data
• Learn Semantic Types (CRF)
• Extract Relationships (Steiner Tree)
• Output: Semantic Model

http://www.isi.edu/integration/karma @KarmaSemWeb

Refining The Model

[Figures: the initial model, followed by the refined model.]

• Previous work does not learn the changes the user makes to the relationships

• The user has to go through the refinement process each time

Our new approach to learn semantic descriptions


Key Idea

• Sources in the same domain often have similar data

• Exploit knowledge of existing source models

• Leverage relationships in known source models to hypothesize relationships for new sources

Approach

• Inputs: Domain Ontology, Known Source Models (S1, S2, …, Sn), New Source
• Construct Graph G
• Learn Semantic Types (CRF)
• Generate Candidate Models
• Rank Results

Example

Domain Ontology

Known Source Models

• S1 = personalInfo(name, city, birthdate, state, workplace)
– Person (name, birthdate) bornIn City (name)
– Person worksFor Organization (name)
– City state State (name)

• S2 = getCities(state, city)
– City (name) state State (name)

• S3 = businessInfo(company, city, ceo, state)
– Organization (name) ceo Person (name)
– Organization location City (name)
– City isPartOf State (name)

New Source

• S4 = postalCodeLookup(zipcode, city, state)

Build a Graph from Known Models

[Figure, step 1: S1 = personalInfo is added to G as Component 1. Its nodes are Person, Organization, City, State, Person.name, Person.birthdate, Org.name, City.name, and State.name; every link is annotated {s1}.]

[Figure, step 2: S2 = getCities is a subgraph of Component 1, so no new component is created. The three links it shares with Component 1 are now annotated {s1,s2}.]

[Figure, step 3: S3 = businessInfo is added as Component 2, with links name, ceo, location, and isPartOf; every link is annotated {s3}.]

• Create a component in G for each known source model
– Only add if the model is not a subgraph of an existing component

• Annotate links with list of supporting models
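The two rules above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each model is a set of (source, label, target) triples, nodes are identified by name, and "subgraph" is approximated as a subset test on triples, so the result depends on the order in which models arrive. The function name `build_graph` and the triple encoding are hypothetical.

```python
# Known source models from the example, as sets of (source, label, target) links.
S1 = {
    ("Person", "name", "Person.name"),
    ("Person", "birthdate", "Person.birthdate"),
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("Organization", "name", "Org.name"),
    ("City", "name", "City.name"),
    ("City", "state", "State"),
    ("State", "name", "State.name"),
}
S2 = {
    ("City", "name", "City.name"),
    ("City", "state", "State"),
    ("State", "name", "State.name"),
}
S3 = {
    ("Organization", "name", "Org.name"),
    ("Organization", "ceo", "Person"),
    ("Organization", "location", "City"),
    ("Person", "name", "Person.name"),
    ("City", "name", "City.name"),
    ("City", "isPartOf", "State"),
    ("State", "name", "State.name"),
}


def build_graph(models):
    """Group known models into components and annotate links with their supporters.

    models: dict mapping a model name to its set of links.
    Returns a list of components; each component maps a link to the set of
    model names that support it.
    """
    components = []
    for name, links in models.items():
        # Simplified subgraph test: all links already present in one component.
        host = next((c for c in components if links <= set(c)), None)
        if host is None:
            components.append({link: {name} for link in links})
        else:
            for link in links:
                host[link].add(name)
    return components


components = build_graph({"s1": S1, "s2": S2, "s3": S3})
```

With the three example models this yields two components: S2 is absorbed into the component created for S1, and S3 starts a second one.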

• Connect graph components using all paths inferred from the ontology

[Figure: Components 1 and 2 joined by ontology-inferred links. The class nodes Event and Place are added; the inferred links are labeled location, organizer, ceo, worksFor, and isPartOf.]

• Assign low weight = ε to links within a component (black links)

• Weight the other links (green links) according to how many links in the known source models match them

[Figure: graph G with the within-component links in black and the ontology-inferred links in green.]

$M$ = known source models

$W_{max}$ = number of links in $M$ ($\geq |E_G|$) = 18

$c_1(e)$ = number of links in $M$ whose <label, source, target> match $e$

$c_2(e)$ = number of links in $M$ whose <label> matches $e$

$w_e = \min\left(W_{max} - c_1(e),\ W_{max} - \frac{c_2(e)}{W_{max}}\right)$
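The weight definition can be checked with a short calculation. The sketch below counts matches against the 18 links of the three example models and applies the formula as written on the slide; the function name `link_weight`, the triple encoding, and the specific Event links used in the usage lines are assumptions made for illustration.

```python
# All links of the known models S1, S2, and S3 (18 in total, duplicates kept).
KNOWN_LINKS = [
    # S1 = personalInfo
    ("Person", "name", "Person.name"),
    ("Person", "birthdate", "Person.birthdate"),
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("Organization", "name", "Org.name"),
    ("City", "name", "City.name"),
    ("City", "state", "State"),
    ("State", "name", "State.name"),
    # S2 = getCities
    ("City", "name", "City.name"),
    ("City", "state", "State"),
    ("State", "name", "State.name"),
    # S3 = businessInfo
    ("Organization", "name", "Org.name"),
    ("Organization", "ceo", "Person"),
    ("Organization", "location", "City"),
    ("Person", "name", "Person.name"),
    ("City", "name", "City.name"),
    ("City", "isPartOf", "State"),
    ("State", "name", "State.name"),
]


def link_weight(link, known_links):
    """Weight of an ontology-inferred link e = (source, label, target)."""
    w_max = len(known_links)
    # c1: known links that match on label, source, and target.
    c1 = sum(1 for known in known_links if known == link)
    # c2: known links that match on label only.
    c2 = sum(1 for known in known_links if known[1] == link[1])
    return min(w_max - c1, w_max - c2 / w_max)


exact_match = link_weight(("City", "isPartOf", "State"), KNOWN_LINKS)
label_match = link_weight(("Event", "location", "Place"), KNOWN_LINKS)
no_match = link_weight(("Event", "organizer", "Person"), KNOWN_LINKS)
```

A link that matches a known link exactly is discounted by a full unit per match, while a link that only shares a label gets a much smaller discount of $1/W_{max}$ per match, so exact matches are always preferred.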

[Figure labels: the resulting weights on the green links are 18 ($c_1 = c_2 = 0$), 17 ($c_1 = 1$), and 17.94 ($c_1 = 0$, $c_2 = 1$, i.e. $18 - 1/18$).]

Learn Semantic Types (Previous Work)

• A CRF-based model to assign a Semantic Type to each column from its data

• Semantic Type
– Ontology Class
– Data Property + Domain

[Figure: Domain Ontology]

S4 = postalCodeLookup(zipcode, city, state)

Learned semantic types: Place.postalCode, City.name, State.name

Generate Candidate Models

[Figure: graph G with its weighted ontology-inferred links, and the new source S4 = postalCodeLookup(zipcode, city, state) with semantic types Place.postalCode, City.name, State.name.]

• Map learned semantic types to nodes in graph G
– There might be multiple mappings

• Compute Steiner tree (minimal tree) for each mapping
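To see why several mappings arise, note that City.name and State.name each occur once per component, while Place.postalCode has a single node. A minimal sketch of the enumeration, assuming hypothetical node identifiers of the form `Type@component` and a hypothetical helper `enumerate_mappings`:

```python
from itertools import product

# Nodes of G that carry each learned semantic type (identifiers are illustrative).
CANDIDATE_NODES = {
    "Place.postalCode": ["Place.postalCode"],
    "City.name": ["City.name@1", "City.name@2"],
    "State.name": ["State.name@1", "State.name@2"],
}


def enumerate_mappings(semantic_types, candidate_nodes):
    """Return every assignment of one graph node to each semantic type."""
    options = [candidate_nodes[t] for t in semantic_types]
    return [dict(zip(semantic_types, combo)) for combo in product(*options)]


mappings = enumerate_mappings(
    ["Place.postalCode", "City.name", "State.name"], CANDIDATE_NODES
)
```

For S4 this gives 1 × 2 × 2 = 4 mappings, one Steiner tree computation per mapping.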

[Figures: graph G extended with a postalCode link from Place to the node Place.postalCode. Four mappings of the semantic types of S4 to nodes of G are shown in turn (Mapping 1 through Mapping 4), each followed by the Steiner tree computed for it. Weights shown on the ontology-inferred links: 18, 17, and 17.94.]
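The slides do not say which Steiner tree algorithm is used. As one possible illustration, the sketch below applies a common shortest-path heuristic: start from one terminal and repeatedly attach the nearest remaining terminal along its cheapest path. It treats G as undirected, uses a hypothetical value for ε, and runs on a reduced, hypothetical version of the example graph, so it is an approximation under stated assumptions rather than the method used in Karma.

```python
import heapq

EPS = 0.01  # hypothetical value for the low within-component weight


def steiner_tree(weights, terminals):
    """Approximate a minimum-weight tree that connects all terminals.

    weights: dict mapping frozenset({u, v}) to a non-negative link weight.
    terminals: list of node names the tree must contain.
    Returns the set of chosen links (frozensets).
    """
    adjacency = {}
    for link, w in weights.items():
        u, v = tuple(link)
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    in_tree = {terminals[0]}
    chosen = set()
    remaining = set(terminals[1:])
    while remaining:
        # Dijkstra started from every node already in the tree.
        dist = {n: 0.0 for n in in_tree}
        parent = {}
        heap = [(0.0, n) for n in in_tree]
        heapq.heapify(heap)
        reached = None
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            if u in remaining:
                reached = u
                break
            for v, w in adjacency.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    parent[v] = u
                    heapq.heappush(heap, (d + w, v))
        if reached is None:
            raise ValueError("terminals are not connected")
        # Add the path from the reached terminal back to the tree.
        node = reached
        while node not in in_tree:
            chosen.add(frozenset((parent[node], node)))
            in_tree.add(node)
            node = parent[node]
        remaining.discard(reached)
    return chosen


def link(u, v):
    return frozenset((u, v))


# Reduced, hypothetical version of graph G ("@1"/"@2" mark the component).
WEIGHTS = {
    link("Place", "Place.postalCode"): 18,
    link("City@1", "Place"): 17.94,
    link("City@2", "Place"): 17.94,
    link("City@1", "State@2"): 17,
    link("City@1", "City.name@1"): EPS,
    link("City@1", "State@1"): EPS,
    link("State@1", "State.name@1"): EPS,
    link("City@2", "City.name@2"): EPS,
    link("City@2", "State@2"): EPS,
    link("State@2", "State.name@2"): EPS,
}

tree = steiner_tree(WEIGHTS, ["Place.postalCode", "City.name@1", "State.name@1"])
cost = sum(WEIGHTS[l] for l in tree)
```

Because within-component links are nearly free, the tree stays inside Component 1 once it reaches City@1, which is the behavior the low ε weight is meant to encourage.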

Rank Source Models

• Rank the candidates based on:
– Cost: sum of the weights
– Coherence: prefer the models with a higher number of supporting models

[Figure: the four candidate models for S4. Each connects Place (postalCode), City (name), and State (name); the candidates differ in the links chosen between the class nodes (isPartOf or state) and in the supporting models annotated on their links ({s1,s2} or {s3}).]

Rank 1: Candidate 1
Rank 2: Candidate 4
Rank 3: Candidate 2
Rank 3: Candidate 3
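The slide names the two ranking criteria but not how they are combined. The sketch below assumes cost is compared first and coherence breaks ties, and it defines coherence as the number of known models that support every annotated link of a candidate. Both choices, the ε value, the candidate costs, and the assignment of support sets to candidate numbers are assumptions made for illustration; under them the ordering happens to agree with the ranks listed on the slide.

```python
EPS = 0.01  # hypothetical value for the low within-component weight

# Illustrative candidates: cost is the sum of link weights in the tree,
# support lists the supporting-model sets annotated on the candidate's links.
CANDIDATES = [
    {"name": "Candidate 1", "cost": 18 + 17.94 + 3 * EPS,
     "support": [{"s1", "s2"}, {"s1", "s2"}, {"s1", "s2"}]},
    {"name": "Candidate 2", "cost": 18 + 17.94 + 17 + 2 * EPS,
     "support": [{"s1", "s2"}, {"s3"}]},
    {"name": "Candidate 3", "cost": 18 + 17.94 + 17 + 2 * EPS,
     "support": [{"s3"}, {"s1", "s2"}]},
    {"name": "Candidate 4", "cost": 18 + 17.94 + 3 * EPS,
     "support": [{"s3"}, {"s3"}, {"s3"}]},
]


def coherence(candidate):
    """Number of known models that support all annotated links (assumed definition)."""
    return len(set.intersection(*candidate["support"]))


def rank(candidates):
    """Order by cost, then by higher coherence; equal keys share a rank."""
    def key(c):
        return (round(c["cost"], 2), -coherence(c))

    ranks = {}
    current_rank, previous_key = 0, None
    for c in sorted(candidates, key=key):
        if key(c) != previous_key:
            current_rank += 1
            previous_key = key(c)
        ranks[c["name"]] = current_rank
    return ranks


ranks = rank(CANDIDATES)
```

Candidates whose trees mix the two components pay for an extra ontology-inferred link, which is why they fall behind the two candidates that stay within a single component.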

Evaluation

• Dataset 1
– 17 data sources containing overlapping data
– Semantic descriptions created manually using DBpedia, FOAF, GeoNames, and WGS84 ontologies

• Dataset 2
– 6 museum sources
– Semantic descriptions created by domain experts using EDM, SKOS, and FOAF ontologies

• Learned the model of each source using the other source models as input

• Computed the Graph Edit Distance (GED) between the learned model and the correct one
– Operations: node insertion, node deletion, edge insertion, edge deletion, edge relabeling

• Compared the results with our previous work in Karma
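For intuition about the metric, the sketch below counts the five listed operations between two small models. It is a simplified stand-in for a general graph edit distance: it assumes unit costs, nodes identified by their names, and at most one edge per ordered pair of nodes. The function name `simple_ged` and the two toy models are hypothetical and are not taken from the evaluation data.

```python
def simple_ged(learned, correct):
    """Count edit operations between two models.

    Each model is a pair (nodes, edges): nodes is a set of names and edges
    maps (source, target) to a link label.
    """
    nodes_a, edges_a = learned
    nodes_b, edges_b = correct
    # Node deletions plus node insertions.
    node_ops = len(nodes_a ^ nodes_b)
    # Edge deletions plus edge insertions.
    pairs_a, pairs_b = set(edges_a), set(edges_b)
    edge_ops = len(pairs_a ^ pairs_b)
    # Edges present in both models but with different labels.
    relabels = sum(1 for p in pairs_a & pairs_b if edges_a[p] != edges_b[p])
    return node_ops + edge_ops + relabels


# Toy models: the learned model uses the wrong link label between City and State.
learned_model = ({"City", "State"}, {("City", "State"): "isPartOf"})
correct_model = ({"City", "State"}, {("City", "State"): "state"})
distance_relabel = simple_ged(learned_model, correct_model)

# Toy models: the learned model is missing the Place node and its link.
smaller_model = ({"City", "State"}, {("City", "State"): "state"})
larger_model = (
    {"Place", "City", "State"},
    {("City", "State"): "state", ("Place", "City"): "isPartOf"},
)
distance_missing = simple_ged(smaller_model, larger_model)
```

A distance of 0 means the learned model matches the correct one exactly; each wrong label, missing node, or missing link adds one.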

Results - Dataset 1

Source Signature | #Attributes | GED, Previous work | GED, New Approach (Rank 1)
nearestCity(lat, lng, city, state, country) | 5 | 6 | 1
findRestaurant(zipcode, restaurantName, phone, address) | 4 | 1 | 0
zipcodesInCity(city, state, postalCode) | 3 | 3 | 1
parseAddress(address, city, state, zipcode, country) | 5 | 6 | 1
citiesOfState(state, city) | 2 | 1 | 0
ocean(lat, lng, name) | 3 | 2 | 1
postalCodeLookup(zipCode, city, state, country) | 4 | 6 | 1
country(lat, lng, code, name) | 4 | 2 | 0
companyCEO(company, name) | 2 | 1 | 0
personalInfo(firstname, lastname, birthdate, birthCity, birthCountry) | 5 | 4 | 1
businessInfo(company, phone, homepage, city, country, name) | 6 | 10 | 8
restaurantChef(restaurant, firstname, lastname) | 3 | 2 | 1
findSchool(city, state, name, code, homepage, ranking, dean) | 7 | 8 | 6
employees(organization, firstname, lastname, birthdate) | 4 | 1 | 2
education(person, hometown, homecountry, school, city, country) | 6 | 9 | 4
administrativeDistrict(city, province, country) | 3 | 4 | 1
capital(country, city) | 2 | 2 | 1
TOTAL | 68 | 68 | 29

57% improvement

Results - Dataset 2

Source Signature | #Attributes | GED, Previous work | GED, New Approach (Rank 1)
S1(Attribution, BeginDate, EndDate, Title, Dated, Medium, Dimensions) | 7 | 1 | 0
S2(ObjectID, ObjectTitle, ObjectWorkType, ArtistName, ArtistBirthDate, ArtistDeathDate, ObjectEarliestDate, ObjectRights, ObjectFacetValue1) | 8 | 2 | 3
S3(death, birth, name) | 3 | 0 | 0
S4(accessionNumber, artist, creditLine, dimensions, imageURL, materials, relatedArtworksURL, creationDate, provenance, keywordValues) | 10 | 9 | 6
S5(AccessionNumber, Classification, CreditLine, Date, Description, DimensionsOrphan, WhatValues, Who, image, relatedArtworksValues) | 10 | 9 | 5
S6(Artist, ArtistBornDate, ArtistDiedDate, Classification, Copyright, CreditLine, Image, KeywordValues, Ref, SitterValues) | 10 | 8 | 6
TOTAL | 48 | 29 | 20

31% improvement

Related Work

• Writing semantic descriptions by hand
– R2RML, SWRL
– Tedious and time-consuming task
– Requires expertise in Semantic Web technologies

• Semantic annotation of Web services and Web tables
– Very limited in learning the relationships

• Learning Semantic Definitions of Online Information Sources [Carman, Knoblock, 2007]
– Learns LAV rules from known sources
– Can only learn descriptions that overlap known sources

Discussion

• Automatically build rich semantic descriptions of data sources

• Exploit the background knowledge from (i) the domain ontology, and (ii) the known source models

• Semantic descriptions are the key ingredients to automate many tasks, e.g.,
– Source Discovery
– Data Integration
– Service Composition

Future Work

• Investigate how to create a more compact graph
– Consolidate the overlapping segments of the known semantic models

• Relax the problem by removing the constraint that the correct semantic type of each attribute is known
– CRF part returns a set of candidate semantic types along with their confidence values

• Use the data available in the Linked Open Data (LOD) cloud to learn more accurate models

• Put the user in the loop
– Integrate the new approach into Karma
– The user refines one of the suggested models
– The new model will be added to the graph as a new pattern