Semantic Network Analysis 11.07.05

Page 1: Semantic Network Analysis        11.07.05

The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities

Semantic Network Analysis 11.07.05

Analyzing Semantic Interoperability in Bioinformatic Database Networks

Philippe Cudré-Mauroux, EPFL

Joint work with: Julien Gaugaz, Adriana Budura and Karl Aberer

Page 2: Semantic Network Analysis        11.07.05

Overview

1. Peer Data Management Systems (PDMS)
2. Semantic Interoperability in the Large
   • Generatingfunctionologic framework
3. The Sequence Retrieval System
   • Degree distribution
   • Analysis of giant component
   • Weighted analysis
4. Conclusions

Page 3: Semantic Network Analysis        11.07.05

Beyond Keyword Search

Searching semantically richer objects in large-scale heterogeneous networks

date?

<xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>
<xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>

<es:DofCreation> 05/08/2004 </es:DofCreation>

<myRDF:Date> Jan 1, 2005 </myRDF:Date>

[Figure: a "date?" query posed against the heterogeneous sources above]

Page 4: Semantic Network Analysis        11.07.05

Decentralized Data Integration

• Large Scale Information Systems (e.g., WWW)
  – Number of sources > 100
  – Unreliable data
    • Autonomy
  – Semi-structured data
    • E.g., XML/RDF
  – No integrity constraints
  – No transactions
  – Simple SP queries
    • E.g., triple patterns, ranking
  – Schemata created by end users
  – Network churn

VS

• Distributed Databases
  – Number of sources < 100
  – Consistent data
    • Coordination
  – Structured data
    • E.g., Relational data model
  – Integrity constraints
  – Transactions
  – Powerful queries
    • E.g., SQL, aggregation
  – Schemas created by administrators
  – Relatively fixed topology

Page 5: Semantic Network Analysis        11.07.05

Data Integration: LAV/GAV

• Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources

• Not applicable to our context
  – Scale (upper ontologies?)
  – Churn
  – Autonomy

• How can we foster semantic interoperability in decentralized settings?

[Figure: a central schema element Date mapped to the local elements myDate and yourDate: m(Date) = myDate, m(Date) = yourDate]

Page 6: Semantic Network Analysis        11.07.05

Semantic Interoperability

Q1 = <GUID>$p/GUID</GUID>
     FOR $p IN /Photoshop_Image
     WHERE $p/Creator LIKE "%Robi%"

Photoshop (own schema)

<Photoshop_Image>
  <GUID>178A8CD8865</GUID>
  <Creator>Robinson</Creator>
  <Subject>
    <Bag>
      <Item>Tunbridge Wells</Item>
      <Item>Royal Council</Item>
    </Bag>
  </Subject>
  …
</Photoshop_Image>

WinFS (known schema)

<WinFSImage>
  <GUID>178A8CD8866</GUID>
  <Author>
    <DisplayName>Henry Peach Robinson</DisplayName>
    <Role>Photographer</Role>
  </Author>
  <Keyword>Tunbridge</Keyword>
  <Keyword>Council</Keyword>
  …
</WinFSImage>

T12 = <Photoshop_Image>
        <GUID>$fs/GUID</GUID>
        <Creator>$fs/Author/DisplayName</Creator>
      </Photoshop_Image>
      FOR $fs IN /WinFSImage

Q2 = <GUID>$p/GUID</GUID>
     FOR $p IN T12
     WHERE $p/Creator LIKE "%Robi%"

Extending semantic interoperability techniques to decentralized settings
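
The rewriting on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `q1`, `t12` and `q2` mirror the slide's labels, the use of `xml.etree.ElementTree` is a choice made here, the sample documents are trimmed versions of the ones above, and the `LIKE "%Robi%"` predicate is approximated by a substring test.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two sample documents from the slide.
PHOTOSHOP_DOC = """
<Photoshop_Image>
  <GUID>178A8CD8865</GUID>
  <Creator>Robinson</Creator>
</Photoshop_Image>
"""

WINFS_DOC = """
<WinFSImage>
  <GUID>178A8CD8866</GUID>
  <Author>
    <DisplayName>Henry Peach Robinson</DisplayName>
    <Role>Photographer</Role>
  </Author>
</WinFSImage>
"""


def q1(photoshop_images, pattern="Robi"):
    """Q1: GUID of every Photoshop_Image whose Creator contains `pattern`."""
    return [
        img.findtext("GUID")
        for img in photoshop_images
        if pattern in (img.findtext("Creator") or "")
    ]


def t12(winfs_images):
    """T12: present WinFSImage elements as Photoshop_Image elements."""
    views = []
    for fs in winfs_images:
        img = ET.Element("Photoshop_Image")
        ET.SubElement(img, "GUID").text = fs.findtext("GUID")
        ET.SubElement(img, "Creator").text = fs.findtext("Author/DisplayName")
        views.append(img)
    return views


def q2(winfs_images, pattern="Robi"):
    """Q2: Q1 evaluated over the view T12 instead of over local data."""
    return q1(t12(winfs_images), pattern)


local_hits = q1([ET.fromstring(PHOTOSHOP_DOC.strip())])
remote_hits = q2([ET.fromstring(WINFS_DOC.strip())])
print(local_hits, remote_hits)
```

The point of the sketch is that the query itself is unchanged: only its input is replaced by the translated view, which is what lets a peer answer a query posed against a schema it does not use.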

Page 7: Semantic Network Analysis        11.07.05

1. Peer Data Management Systems

• Pairwise mappings
  – Peer Data Management Systems (PDMS)
• Local mappings overcome global heterogeneity
  – Iterative query rewriting

[Figure: a "date?" query rewritten hop by hop across peers. Sources shown: <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> and <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>; <es:cDate> 05/08/2004 </es:cDate>; <myRDF:Date> Jan 1, 2005 </myRDF:Date>. Labels: article, weather, es:cDate, xap:CreateDate, myRDF:Date, xap:ModifyDate]
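
Iterative query rewriting can be sketched as a traversal in which a query is translated at each hop through a local mapping. The mapping table below is hypothetical: it reuses attribute names that appear on this slide, but which attribute maps to which is an assumption made for illustration, as are the schema identifiers `es`, `xap` and `myRDF` and the function name `rewrite_everywhere`.

```python
from collections import deque

# Hypothetical pairwise mappings:
# (source schema, target schema) -> {source attribute: target attribute}
MAPPINGS = {
    ("es", "xap"): {"es:cDate": "xap:CreateDate"},
    ("es", "myRDF"): {"es:cDate": "myRDF:Date"},
    ("myRDF", "xap"): {"myRDF:Date": "xap:ModifyDate"},
}


def rewrite_everywhere(start_schema, attribute):
    """Propagate a one-attribute query hop by hop, rewriting at each mapping.

    Returns {schema: set of attribute names the query was rewritten to}.
    """
    reached = {start_schema: {attribute}}
    queue = deque([(start_schema, attribute)])
    while queue:
        schema, attr = queue.popleft()
        for (src, dst), attr_map in MAPPINGS.items():
            if src != schema or attr not in attr_map:
                continue  # this mapping cannot translate the current query
            new_attr = attr_map[attr]
            if new_attr not in reached.setdefault(dst, set()):
                reached[dst].add(new_attr)
                queue.append((dst, new_attr))  # keep forwarding from dst
    return reached


result = rewrite_everywhere("es", "es:cDate")
print(result)
```

In this toy network the query reaches `xap` along two paths and arrives as two different attributes, which is the kind of divergence that makes mapping quality matter later in the talk.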

Page 8: Semantic Network Analysis        11.07.05

Semantic Mediation Layer

[Figure: three stacked layers: "Physical" layer, Overlay Layer, Semantic Mediation Layer. Each pair of adjacent layers is labeled Correlated / Uncorrelated]

Page 9: Semantic Network Analysis        11.07.05

Schema-to-Schema Graph

Inter-organization of the different schemas used by the peers
- Logical model
- Directed
- Weighted
- Redundant

Page 10: Semantic Network Analysis        11.07.05

The Semantic Connectivity Graph

• Definition (Semantic Interoperability)
  Two peers are said to be semantically interoperable if they can forward queries to each other in the Schema-to-Schema graph, potentially through a series of semantic translation links
• Idea
  – As for physical network analyses, create a connectivity layer to account for semantic interoperability
• The Semantic Connectivity Graph S
  – Unweighted, irreflexive and non-redundant version of the Schema-to-Schema graph
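
As a minimal sketch of that last definition, the snippet below collapses a weighted, redundant list of schema-to-schema links into an unweighted, irreflexive, non-redundant edge set. The `(source, target, weight)` triple layout and the function name `connectivity_graph` are assumptions made for illustration.

```python
def connectivity_graph(schema_links):
    """Collapse weighted, possibly duplicated links into a set of directed edges."""
    edges = set()
    for source, target, _weight in schema_links:  # unweighted: weight is ignored
        if source != target:                      # irreflexive: drop self-mappings
            edges.add((source, target))           # a set removes redundant links
    return edges


# Toy schema-to-schema graph with a duplicate link and a self-mapping.
links = [
    ("A", "B", 0.9),
    ("A", "B", 0.4),
    ("B", "A", 0.7),
    ("B", "B", 1.0),
    ("B", "C", 0.2),
]
S = connectivity_graph(links)
print(sorted(S))
```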

Page 11: Semantic Network Analysis        11.07.05

Observations

• Theorem
  Peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s \equiv \{ s \mid \exists p \in P_s,\ p \leftrightarrow s \}$
• Observation 1
  A set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$
• Observation 2
  A set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s|(|V_s|-1) - (|V_s|-1)$
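
The theorem and the two observations can be checked mechanically on small graphs. The sketch below is an illustration, not the authors' code: `is_strongly_connected` tests the theorem's condition with one forward and one backward traversal, and `edge_count_verdict` applies the two edge-count bounds. Both names are hypothetical, and the edge-count check assumes at least two schemas and a graph without self-loops or duplicate edges.

```python
def is_strongly_connected(vertices, edges):
    """True if every vertex reaches every other vertex along directed edges."""
    vertices = set(vertices)
    if len(vertices) <= 1:
        return True

    def reachable(start, adjacency):
        seen, stack = {start}, [start]
        while stack:
            for nxt in adjacency.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, set()).add(v)
        backward.setdefault(v, set()).add(u)

    # Strongly connected iff one root reaches everyone and everyone reaches it.
    root = next(iter(vertices))
    return (reachable(root, forward) == vertices
            and reachable(root, backward) == vertices)


def edge_count_verdict(num_vertices, num_edges):
    """Apply Observations 1 and 2; return False, True, or None if undecided."""
    if num_edges < num_vertices:
        return False  # Observation 1: too few edges
    if num_edges > num_vertices * (num_vertices - 1) - (num_vertices - 1):
        return True   # Observation 2: too few missing edges to isolate a vertex
    return None       # in between, the graph itself has to be inspected


ring = {("a", "b"), ("b", "c"), ("c", "a")}
print(is_strongly_connected("abc", ring), edge_count_verdict(3, len(ring)))
```

The ring shows why the bounds are only partial: with three edges on three vertices neither observation decides, yet the traversal confirms strong connectivity.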

Page 12: Semantic Network Analysis        11.07.05

2. Semantic Interoperability in the Large

• Question
  – How can we analyze semantic interoperability in large-scale PDMS?
• Idea: use percolation theory to detect the emergence of a strongly connected component in S
  – Necessary condition for vertex-strong connectivity
  – Necessary condition for semantic interoperability

Page 13: Semantic Network Analysis        11.07.05

The Model

• Adaptation of a recent graph-theoretic framework
  – Newman, Strogatz, Watts 2001
• Large-scale semantic graphs as random graphs with arbitrary degree distribution
  – Exponentially distributed, small-world, scale-free… graphs
• Specificities of our model
  – Strong clustering (clustering coefficient cc)
  – Bidirectionality (bidirectionality coefficient bc) (for directed networks)

• Based on generatingfunctionology

• Percolation: ci > 0

Page 14: Semantic Network Analysis        11.07.05

Size of the giant component

With u the smallest non-negative solution of

And G1 the distribution of edges from first to second-order neighbors:
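
The slide's own equations did not survive extraction and are not reconstructed here. For orientation only, the sketch below uses the standard result for undirected random graphs with an arbitrary degree distribution and no clustering, from the Newman, Strogatz, Watts framework that the model adapts: the giant component fraction is S = 1 − G0(u), where u is the smallest non-negative solution of u = G1(u), G0(x) = Σ p_k x^k and G1(x) = G0'(x) / G0'(1). The authors' model adds clustering and bidirectionality terms that this sketch omits. The function name and the fixed-point iteration are choices made here.

```python
def giant_component_fraction(degree_probs, tol=1e-12, max_iter=10_000):
    """Fraction of nodes in the giant component of a random graph.

    degree_probs maps a degree k to its probability p_k (summing to 1).
    Standard unclustered, undirected case only.
    """
    mean_degree = sum(k * p for k, p in degree_probs.items())
    if mean_degree == 0:
        return 0.0  # no edges at all

    def g0(x):
        # Generating function of the degree distribution.
        return sum(p * x ** k for k, p in degree_probs.items())

    def g1(x):
        # Generating function of the edges leaving a first-order neighbor.
        return sum(k * p * x ** (k - 1)
                   for k, p in degree_probs.items() if k > 0) / mean_degree

    # Iterating from 0 converges upward to the smallest non-negative fixed point.
    u = 0.0
    for _ in range(max_iter):
        nxt = g1(u)
        if abs(nxt - u) < tol:
            u = nxt
            break
        u = nxt
    return 1.0 - g0(u)


# Half the nodes have degree 1, half have degree 3.
print(giant_component_fraction({1: 0.5, 3: 0.5}))
```

For the example distribution, G1(x) = 0.25 + 0.75 x², so u = 1/3 and S = 1 − G0(1/3) = 22/27. A distribution in which every node has degree 1 gives u = 1 and no giant component.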

Page 15: Semantic Network Analysis        11.07.05

3. The Sequence Retrieval System (SRS)

• Commercial information indexing and retrieval system

• Bioinformatic libraries
  – EMBL
  – SwissProt
  – Prosite
  – Etc.

• Schemas described in a custom language (Icarus)

• Mappings (links) from one database to others

Page 16: Semantic Network Analysis        11.07.05

Why is SRS interesting?

• Applying our heuristics on a real large-scale corpus of interconnected databases
  – More than 380 databanks
  – More than 500 (undirected) links
  – Data used by professionals on a daily basis

Page 17: Semantic Network Analysis        11.07.05

Crawling the SRS schema-to-schema graph

• Custom crawler
• As of May 2005 (EBI repository)
  – 388 nodes
  – 518 edges
  – Giant connected component: 187 nodes
  – Power-law distribution of node degrees
  – Clustering coefficient = 0.32
  – Diameter = 9
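
Statistics of this kind can be computed from a crawled edge list with a few traversals. The sketch below runs on a toy undirected graph, not on the SRS data, and its definitions are assumptions: the slide does not say which clustering coefficient was used (the average of local coefficients is shown here), and the diameter is measured inside the largest component. Nodes without any link do not appear in an edge list and are therefore not counted.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) pairs, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def largest_component(adj):
    """Node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            for nxt in adj[queue.popleft()]:
                if nxt not in comp:
                    comp.add(nxt)
                    queue.append(nxt)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def diameter(adj, component):
    """Longest shortest path, in hops, inside one connected component."""
    longest = 0
    for source in component:
        dist, queue = {source: 0}, deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest


def average_clustering(adj):
    """Mean over nodes of (links among neighbors) / (possible such links)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a in nbrs for b in adj[a] if b in nbrs) / 2
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]
adj = build_adjacency(toy_edges)
giant = largest_component(adj)
print(len(giant), diameter(adj, giant), average_clustering(adj))
```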

Page 18: Semantic Network Analysis        11.07.05

Results

• Connectivity indicator ci = 25.4
  – Super-critical state
• Size of the giant component
  – 0.47 (derived)
  – 0.48 (observed)

Page 19: Semantic Network Analysis        11.07.05

Graphs with the same power-law degree distribution

• Varying number of edges

Page 20: Semantic Network Analysis        11.07.05

10x Bigger Graph

Page 21: Semantic Network Analysis        11.07.05

Analyzing weighted networks

• Do we have a sufficient number of good mappings?
• Introducing quality measures from the mappings
  – Weights
  – Attribute / schema level
  – Cf. Chatty Web (WWW03)
• Semantic query forwarding
  – Per-hop forwarding behaviors
  – Only forward if $w_i \geq \tau$
    • $\tau = 0$: flooding
    • $\tau = 1$: exact answers
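
The per-hop rule can be sketched as a traversal that follows a mapping only when its weight reaches a threshold. The name `tau` for the threshold, the `(source, target, weight)` triple layout and the function name `reachable_schemas` are assumptions made for illustration, and the weights are toy values.

```python
def reachable_schemas(weighted_links, origin, tau):
    """Schemas reached from origin when a link is followed only if weight >= tau."""
    adjacency = {}
    for source, target, weight in weighted_links:
        if weight >= tau:  # per-hop forwarding decision
            adjacency.setdefault(source, []).append(target)

    seen, stack = {origin}, [origin]
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


links = [("A", "B", 1.0), ("B", "C", 0.6), ("C", "D", 0.2)]
for tau in (0.0, 0.5, 1.0):
    print(tau, sorted(reachable_schemas(links, "A", tau)))
```

With a threshold of 0 every link is followed (flooding), and with a threshold of 1 only perfect mappings are used, so the set of reachable schemas shrinks as the threshold grows.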

Page 22: Semantic Network Analysis        11.07.05

Weighted Results

• Same degree distribution (388 nodes)
• Uniformly distributed weights between 0 and 1

Page 23: Semantic Network Analysis        11.07.05

4. Conclusions

• Analyzing a real network of bioinformatic databases
  – Accurate results (even for relatively small networks)
  – Weighted / unweighted
• Current work
  – Compositions of weights along a path
  – Semantic random walkers
  – Public domain simulator
• Future work
  – Analyzing other forwarding behaviors
  – Implementation in a real PDMS (self-organizing mappings)
    • GridVine

Page 24: Semantic Network Analysis        11.07.05

References

A Necessary Condition for Semantic Interoperability in the Large
Philippe Cudré-Mauroux and Karl Aberer
ODBASE 2004

GridVine: Building Internet-Scale Semantic Overlay Networks
Karl Aberer, Philippe Cudré-Mauroux and Tim van Pelt
ISWC 2004

Semantic Overlay Networks (Tutorial)
Karl Aberer and Philippe Cudré-Mauroux
VLDB 2005

… complete reference list at http://lsirpeople.epfl.ch/pcudre/

Page 25: Semantic Network Analysis        11.07.05

Thank you for your attention

Questions?