Distributed Graph Databases and the Emerging Web of Data

Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory
http://markorodriguez.com

April 16, 2009


Abstract

The World Wide Web is the de facto medium for publicly exposing a corpus of interrelated documents. In its current form, the World Wide Web is the Web of Documents. The next generation of the World Wide Web will support the Web of Data. The Web of Data utilizes the same Uniform Resource Identifier (URI) address space as the Web of Documents, but instead of exposing a graph of documents, the Web of Data exposes a graph of data. Given that the URI address space of the Web is distributed and infinite, the Web of Data provides a single unified space by which the world's data can be publicly exposed and interrelated. The Web of Data is supported by both graph databases (which structure the data) and distributed computing mechanisms (which process the data). This presentation will discuss the Web of Data, graph databases, and models of computing in this emerging space.

Computer Science Department Colloquium – University of New Mexico – April 16, 2009


Outline

• The Relational Database vs. the Graph Database

• The Web of Documents vs. the Web of Data

• Local Computing vs. Distributed Computing

• Multi-Relational Network Analysis with Grammar Walkers


The Relational Database vs. the Graph Database

• A relational database’s (e.g. MySQL, PostgreSQL, Oracle) data model is a collection of interlinked tables.

• A graph database’s (e.g. OpenSesame, AllegroGraph, Neo4j) data model is a multi-relational graph.

[Figure: a relational database at 127.0.0.1 and a graph database at 127.0.0.2.]

Types of Graphs

• Undirected single-relational graph: homogeneous set of symmetric links.

• Directed single-relational graph: homogeneous set of links.

• Directed multi-relational graph: heterogeneous set of links.
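
The three graph types listed above can be sketched as plain Python data. This is only an illustration: the vertex names x, y, z follow the slide's figure, while the edge labels `friend` and `type` and the helper `labels` are assumptions made for the example.

```python
# Three graph types as plain Python data (edge labels are illustrative).

# Undirected single-relational: one kind of link, each link is symmetric,
# so an edge is an unordered pair.
undirected = {frozenset({"x", "z"})}

# Directed single-relational: one kind of link, each edge is an ordered pair.
directed = {("x", "z")}

# Directed multi-relational: edges carry a label, so two vertices can be
# connected by several different kinds of link.
multi_relational = {("x", "friend", "z"), ("x", "type", "y")}


def labels(edges):
    """Return the set of edge labels used in a multi-relational edge set."""
    return {label for (_tail, label, _head) in edges}


print(frozenset({"z", "x"}) in undirected)   # True: direction is irrelevant
print(("z", "x") in directed)                # False: direction matters
print(sorted(labels(multi_relational)))      # ['friend', 'type']
```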

[Figure: three example graphs over vertices x, y, and z: an undirected single-relational graph, a directed single-relational graph, and a directed multi-relational graph.]

Our Make Believe World - Phase 1

• Marko is a human and Fluffy is a dog.


Our World Modeled in a Relational Database - Phase 1

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Our World Modeled in a Graph Database - Phase 1

[Figure: the same data as a graph, with one labeled edge per statement.]

0001 --name--> Marko
0001 --type--> Human
0001 --legs--> 2
0001 --fur--> false
0002 --name--> Fluffy
0002 --type--> Dog
0002 --legs--> 4
0002 --fur--> true

Our Make Believe World - Phase 2

• Marko is a human and Fluffy is a dog.

• Marko and Fluffy are good friends.


Our World Modeled in a Relational Database - Phase 2

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Our World Modeled in a Graph Database - Phase 2

[Figure: the Phase 1 graph with two friend edges added.]

0001 --friend--> 0002
0002 --friend--> 0001

Our Make Believe World - Phase 3

• Marko is a human and Fluffy is a dog.

• Marko and Fluffy are good friends.

• Human and dog are subclasses of mammal.

Our World Modeled in a Relational Database - Phase 3

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal

Our World Modeled in a Graph Database - Phase 3

[Figure: the Phase 2 graph with the class Mammal added.]

Human --subclassof--> Mammal
Dog --subclassof--> Mammal

Our Make Believe World - Phase 4

• Marko is a human and Fluffy is a dog.

• Marko and Fluffy are good friends.

• Human and dog are subclasses of mammal.

• Fluffy peed on the carpet.


Our World Modeled in a Relational Database - Phase 4

Object_Table

ID    Name    Type    Legs  Fur
0001  Marko   Human   2     false
0002  Fluffy  Dog     4     true
0003  My_Rug  Carpet  N/A   N/A

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal

Pee_Table

ID1   ID2
0002  0003

Our World Modeled in a Graph Database - Phase 4

[Figure: the Phase 3 graph with the carpet added.]

0002 --peedOn--> 0003
0003 --type--> Carpet
0003 --name--> My_Rug

Our Make Believe World - Phase 5

• Marko is a human and Fluffy is a dog.

• Marko and Fluffy are good friends.

• Human and dog are subclasses of mammal.

• Fluffy peed on the carpet.

• Marko and Fluffy are both mammals.


Our World Modeled in a Relational Database - Phase 5

Object_Table

ID    Name    Type    Legs  Fur
0001  Marko   Human   2     false
0002  Fluffy  Dog     4     true
0003  My_Rug  Carpet  N/A   N/A

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal

Pee_Table

ID1   ID2
0002  0003

Type_Table

ID    Type
0001  Human
0002  Dog
0003  Carpet
0001  Mammal
0002  Mammal

Our World Modeled in a Graph Database - Phase 5

[Figure: the Phase 4 graph with two type edges added.]

0001 --type--> Mammal
0002 --type--> Mammal

The Graph as the Natural World Model

• The world is inherently (or perceived as) object-oriented.

• The world is filled with objects and relations among them.

• The multi-relational graph is a very natural representation of the world.


The Graph as the Natural Programming Model

• High-level computer languages are object-oriented.

• Almost no impedance mismatch between the multi-relational graph and the programming object.

• It is easy to go from graph database to in-memory object.

Human marko = new Human();
marko.name = "Marko";
marko.addFriend(fluffy);
marko.setHasFur(false);
marko.setLegs(2);

SQL vs. SPARQL

SQL:

SELECT OTY.Name
FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
WHERE OTX.Name = 'Marko'
  AND Friendship_Table.ID1 = OTY.ID
  AND Friendship_Table.ID2 = OTX.ID;

SPARQL:

SELECT ?z WHERE {
  ?x name "Marko" .
  ?y friend ?x .
  ?y name ?z
}

E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF, WWW Consortium, http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/, 2004.
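
To make the comparison concrete, the sketch below runs the SQL query against the Phase 2 tables in an in-memory SQLite database, then answers the same question by matching triple patterns. The pattern matcher `match` is a hypothetical stand-in for a SPARQL engine, not a real one, and storing Fur as 0/1 is an assumption made for SQLite.

```python
import sqlite3

# Relational side: the Phase 2 tables, built in an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Object_Table (ID TEXT, Name TEXT, Type TEXT, Legs INTEGER, Fur INTEGER);
    CREATE TABLE Friendship_Table (ID1 TEXT, ID2 TEXT);
    INSERT INTO Object_Table VALUES ('0001', 'Marko', 'Human', 2, 0);
    INSERT INTO Object_Table VALUES ('0002', 'Fluffy', 'Dog', 4, 1);
    INSERT INTO Friendship_Table VALUES ('0001', '0002');
    INSERT INTO Friendship_Table VALUES ('0002', '0001');
""")
sql_names = [row[0] for row in db.execute("""
    SELECT OTY.Name
    FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
    WHERE OTX.Name = 'Marko'
      AND Friendship_Table.ID1 = OTY.ID
      AND Friendship_Table.ID2 = OTX.ID
""")]

# Graph side: the same facts as (subject, predicate, object) triples.
triples = {
    ("0001", "name", "Marko"), ("0002", "name", "Fluffy"),
    ("0001", "friend", "0002"), ("0002", "friend", "0001"),
}


def match(s, p, o):
    """Yield triples matching a pattern; None acts as a variable."""
    for t in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), t)):
            yield t


# ?x name "Marko" . ?y friend ?x . ?y name ?z
graph_names = [
    z
    for (x, _, _) in match(None, "name", "Marko")
    for (y, _, _) in match(None, "friend", x)
    for (_, _, z) in match(y, "name", None)
]

print(sql_names, graph_names)   # ['Fluffy'] ['Fluffy']
```

The SQL version needs a self-join through a dedicated friendship table, whereas the graph version chains three patterns over a single set of triples.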


Internet Address Spaces

• The Uniform Resource Identifier (URI) is the superclass of the Uniform Resource Locator (URL) and Uniform Resource Name (URN).

The Uniform Resource Locator

• The set of all URLs is the address space of all resources that can be located and retrieved on the Web. URLs denote where a resource is.

  ⋆ http://markorodriguez.com/index.html
    ∗ Domain name server (DNS): markorodriguez.com → 216.251.43.6
    ∗ http:// means an HTTP GET at port 80 (the default HTTP port),
    ∗ /index.html means the resource to get at that Internet location.
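
The decomposition above can be sketched with Python's standard `urllib.parse`. The `DEFAULT_PORTS` table is an assumption added for the example, and no DNS lookup or network request is actually performed.

```python
from urllib.parse import urlsplit

url = urlsplit("http://markorodriguez.com/index.html")

# Default ports for the schemes used on these slides (assumed table).
DEFAULT_PORTS = {"http": 80, "ftp": 21}

host = url.hostname                              # name to resolve through DNS
port = url.port or DEFAULT_PORTS[url.scheme]     # no explicit port, so use the default
path = url.path                                  # resource to request at that location

print(url.scheme, host, port, path)   # http markorodriguez.com 80 /index.html

# An HTTP/1.1 client would then send this request line and Host header:
request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
```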

[Figure: the name markorodriguez.com resolves to 216.251.43.6, where a Web server serves index.html.]

The Uniform Resource Name

• The set of all URNs is the address space of all resources within the urn: namespace.

  ⋆ urn:uuid:bd93def0-8026-11dd-842be54955baa12
  ⋆ urn:issn:0892-3310
  ⋆ urn:doi:10.1016/j.knosys.2008.03.030

• Named resources need not be retrievable through the Web.

• URNs denote what a resource is.


The Uniform Resource Identifier

• The URI address space is an infinite space for all Internet resources.

  ⋆ urn:issn:0892-3310
  ⋆ ftp://markorodriguez.com/private/markos_secrets.txt
  ⋆ http://www.lanl.gov#fluffy

• Important: URIs can denote concepts, instances, and data.

[Figure: two example URIs, lanl:fluffy and lanl:fluffy_legs.]

lanl is a namespace prefix which expands to http://www.lanl.gov#.
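
Prefix expansion is plain string substitution. A minimal sketch, assuming a prefix table that holds only the lanl prefix from the slide; `expand` and `abbreviate` are hypothetical helper names.

```python
PREFIXES = {"lanl": "http://www.lanl.gov#"}   # the only prefix defined on the slide


def expand(curie, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI; leave unknown prefixes untouched."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


def abbreviate(uri, prefixes=PREFIXES):
    """Inverse of expand: shorten a full URI when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


print(expand("lanl:fluffy"))                           # http://www.lanl.gov#fluffy
print(abbreviate("http://www.lanl.gov#fluffy_legs"))   # lanl:fluffy_legs
print(expand("urn:issn:0892-3310"))                    # unchanged: 'urn' is not a declared prefix
```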


The Web of Documents

• The Web of Documents is primarily concerned with the HyperText Transfer Protocol (HTTP) and with retrievable resources in the URL address space.

• These retrievable resources are files: HTML documents, images, audio, etc. The “web” is created when HTML documents contain URLs.

[Figure: index.html at http://markorodriguez.com/ links by href to Home.html, Research.html, and Resume.html.]

The Web of Data

• The Web of Data is primarily concerned with URIs.

• The Resource Description Framework (RDF) is the standard for representing the relationship between URIs and literals (e.g. float, string, date time, etc.).

[Figure: three RDF statements, each a (subject, predicate, object) triple.]

lanl:marko   foaf:knows  lanl:fluffy
lanl:marko   foaf:name   "Marko A. Rodriguez"^^xsd:string
lanl:fluffy  foaf:name   "Fluffy P. Everywhere"^^xsd:string

C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked Data on the Web, International World Wide Web Conference, 2008.


Our Make Believe World in RDF

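[Figure: the make believe world as an RDF graph.]

lanl:marko   rdf:type         lanl:Human
lanl:marko   rdf:type         lanl:Mammal
lanl:marko   foaf:name        "Marko A. Rodriguez"^^xsd:string
lanl:marko   lanl:legs        "2"^^xsd:integer
lanl:marko   lanl:fur         "false"^^xsd:boolean
lanl:marko   lanl:friend      lanl:fluffy
lanl:fluffy  rdf:type         lanl:Dog
lanl:fluffy  rdf:type         lanl:Mammal
lanl:fluffy  foaf:name        "Fluffy P. Everywhere"^^xsd:string
lanl:fluffy  lanl:legs        "4"^^xsd:integer
lanl:fluffy  lanl:fur         "true"^^xsd:boolean
lanl:fluffy  lanl:friend      lanl:marko
lanl:Human   rdfs:subClassOf  lanl:Mammal
lanl:Dog     rdfs:subClassOf  lanl:Mammal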
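
The figure states that both individuals are of type lanl:Mammal. Under RDFS semantics those two statements also follow from the rdfs:subClassOf edges, which the sketch below derives with a small fixed-point loop. The function `infer_types` is a hypothetical illustration rather than a full RDFS reasoner, and only a subset of the figure's triples is loaded.

```python
# A subset of the slide's graph as (subject, predicate, object) triples.
triples = {
    ("lanl:marko", "rdf:type", "lanl:Human"),
    ("lanl:fluffy", "rdf:type", "lanl:Dog"),
    ("lanl:Human", "rdfs:subClassOf", "lanl:Mammal"),
    ("lanl:Dog", "rdfs:subClassOf", "lanl:Mammal"),
    ("lanl:marko", "lanl:friend", "lanl:fluffy"),
    ("lanl:fluffy", "lanl:friend", "lanl:marko"),
}


def infer_types(graph):
    """Add (x, rdf:type, C2) whenever (x, rdf:type, C1) and (C1, rdfs:subClassOf, C2)."""
    inferred = set(graph)
    changed = True
    while changed:                      # repeat until no new triple appears
        changed = False
        for (x, p1, c1) in list(inferred):
            if p1 != "rdf:type":
                continue
            for (c, p2, c2) in list(inferred):
                if p2 == "rdfs:subClassOf" and c == c1:
                    new = (x, "rdf:type", c2)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred


mammals = sorted(s for (s, p, o) in infer_types(triples)
                 if p == "rdf:type" and o == "lanl:Mammal")
print(mammals)   # ['lanl:fluffy', 'lanl:marko']
```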


The Web of Data is a Distributed Database

• The URI address space is distributed.

• URIs can denote data.

• RDF denotes the relationships between URIs.

• The Web of Data’s foundational standard is RDF.

• Therefore, the Web of Data is a distributed database.


The Web of Documents vs. the Web of Data

[Figure: in the Web of Documents, an HTML document on the Web server at 127.0.0.1 links by href to an HTML document on the Web server at 127.0.0.2. In the Web of Data, a resource in the graph database at 127.0.0.1 links by lanl:friend to a resource in the graph database at 127.0.0.2.]

The Current Web of Data - March 2009

[Figure: the Linked Data cloud as of March 2009, drawn as a graph whose vertices are data sets (dbpedia, freebase, geonames, uniprot, musicbrainz, and others) and whose edges are links between data sets.]

M.A. Rodriguez. A Graph Analysis of the Linked Data Cloud, in review, http://arxiv.org/abs/0903.0194, 2009.

The Current Web of Data - March 2009

data set          domain      data set          domain      data set            domain
audioscrobbler    music       govtrack          government  pubguide            books
bbclatertotp      music       homologene        biology     qdos                social
bbcplaycountdata  music       ibm               computer    rae2001             computer
bbcprogrammes     media       ieee              computer    rdfbookmashup       books
budapestbme       computer    interpro          biology     rdfohloh            social
chebi             biology     jamendo           music       resex               computer
crunchbase        business    laascnrs          computer    riese               government
dailymed          medical     libris            books       semanticweborg      computer
dblpberlin        computer    lingvoj           reference   semwebcentral       social
dblphannover      computer    linkedct          medical     siocsites           social
dblprkbexplorer   computer    linkedmdb         movie       surgeradio          music
dbpedia           general     magnatune         music       swconferencecorpus  computer
doapspace         social      musicbrainz       music       taxonomy            reference
drugbank          medical     myspacewrapper    social      umbel               general
eurecom           computer    opencalais        reference   uniref              biology
eurostat          government  opencyc           general     unists              biology
flickrexporter    images      openguides        reference   uscensusdata        government
flickrwrappr      images      pdb               biology     virtuososponger     reference
foafprofiles      social      pfam              biology     w3cwordnet          reference
freebase          general     pisa              computer    wikicompany         business
geneid            biology     prodom            biology     worldfactbook       government
geneontology      biology     projectgutenberg  books       yago                general
geonames          geographic  prosite           biology     . . .

Cultural Differences that are Leading to Web-Based Data Management - Part 1

• Relational databases tend to not maintain public access points.

• Relational database users tend to not publish their schemas.

• Web of Data graph databases maintain public access points called SPARQL end-points or Linked Data URLs.

• Web of Data graph database users tend to reuse and extend public schemas called ontologies.

Cultural Differences that are Leading to Web-Based Data Management - Part 2

[Figure: two models side by side. Conventional Model: Applications 1, 2, and 3 each structure and process their own data at 127.0.0.1, 127.0.0.2, and 127.0.0.3. Web of Data Model: Applications 1, 2, and 3 at 127.0.0.4, 127.0.0.5, and 127.0.0.6 structure and process data held in a shared Web of Data spanning 127.0.0.1, 127.0.0.2, and 127.0.0.3.]


SPARQLing a Data Provider - Local Computing

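[Figure: the client at 127.0.0.1 sends a query to the SPARQL end-point of the graph database at 127.0.0.2 and receives the bound resources.]

Query: SELECT ?x WHERE { lanl:marko lanl:friend ?x }

Result: { lanl:fluffy }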

• The 127.0.0.1 client is querying the 127.0.0.2 server.

• The query is any read-based SPARQL query.

• The results are those resources that bind to the query variables.
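
On the wire, a read-only SPARQL query is typically sent as an HTTP GET with the query text in a URL-encoded `query` parameter. The sketch below only builds that request URL. The end-point path `/sparql` is an assumption, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://127.0.0.2/sparql"   # hypothetical end-point path on the slide's server

query = """
PREFIX lanl: <http://www.lanl.gov#>
SELECT ?x WHERE { lanl:marko lanl:friend ?x }
"""

# The SPARQL protocol carries the query text in a URL-encoded 'query' parameter.
request_url = ENDPOINT + "?" + urlencode({"query": query.strip()})
print(request_url)

# Round trip: decoding the URL recovers the original query text.
decoded = parse_qs(urlsplit(request_url).query)["query"][0]
```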


GETing Linked Data as RDF - Local Computing

[Figure: the client at 127.0.0.1 issues an HTTP GET on http://www.lanl.gov#marko and receives the RDF subgraph around lanl:marko from the Web of Data (edges lanl:friend to lanl:fluffy and lanl:wrote to vub:1010). A second HTTP GET on http://www.vub.edu#1010 returns the subgraph around vub:1010 (edge lanl:cites to ieee:2020).]
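
A minimal sketch of how a client might prepare such a GET, assuming the server supports content negotiation for RDF/XML (the slides do not say which serialization is returned). The request is built but deliberately not sent.

```python
from urllib.parse import urldefrag
from urllib.request import Request

resource = "http://www.lanl.gov#marko"

# The fragment (#marko) is handled by the client and is not sent to the server,
# so the document actually requested is http://www.lanl.gov.
document_url, fragment = urldefrag(resource)

# Ask for an RDF serialization rather than HTML via content negotiation.
request = Request(document_url, headers={"Accept": "application/rdf+xml"})

print(request.get_method(), request.full_url, fragment)   # GET http://www.lanl.gov marko

# Sending it would be urllib.request.urlopen(request); omitted here because
# the slide's URIs are examples and may not serve RDF.
```

Each GET returns only the subgraph around one resource, so following an edge such as lanl:wrote to vub:1010 costs another round trip.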


Problem with the Current Web of Data Infrastructure

• The only interfaces are SPARQL end-points and HTTP GETs of RDF subgraphs.

• For human-based document retrieval, this is fine. For machine-based data processing, this does not scale.

M.A. Rodriguez. A Distributed Process Infrastructure for a Distributed Data Structure. Semantic Web and Information Systems Bulletin, AIS Special Interest Group on Semantic Web and Information Systems, http://arxiv.org/abs/0807.3908, 2008.


Problem with the Current Web of Data Infrastructure

• We cannot rely on the “download and index” philosophy of the World Wide Web.

  ⋆ As of March 2009, the Web of Data maintains 4.5 billion triples.

• The Web of Data cannot rely on a single service provider.

  ⋆ too much data.
  ⋆ too many types of algorithms that can utilize this data.
  ⋆ too many clock cycles to locally process this data.

The Open Virtual Machine Farm

[Figure: the graph databases at 127.0.0.1 and 127.0.0.2, linked by lanl:friend, each sit beside a virtual machine farm; code and machines migrate between the farms.]

• Distributed computing through code/machine migration between farms.

• Move the process to the data, not the data to the process.

M.A. Rodriguez. General Purpose Computing on a Semantic Network Substrate, in Emergent Web Intelligence, eds. R. Chbeir, A. Hassanien, A. Abraham and Y. Badr, Springer-Verlag, http://arxiv.org/abs/0704.3395, 2009.

M.A. Rodriguez. The RDF Virtual Machine, in review, LA-UR-08-03925, 2009.


Neno RDF Programming Language - Code Serialization

[Figure: the RDF serialization of the method below, a graph of urn:uuid-identified vertices typed Method, Block, Branch, Equals, LocalDirect, PushValue, and Return, connected by hasMethod, hasMethodName, hasBlock, hasLeft, hasRight, hasValue, hasURI, nextInst, trueInst, and falseInst edges.]

xsd:int example(xsd:string a) {
  if(a == "marko")
    return 1;
  else
    return 2;
}

The Fhat RDF Virtual Machine - Machine Serialization

[Figure: class diagram of the Fhat RDF virtual machine (RVM). Its components include Instruction, Block, Frame, FrameVariable, OperandStack, ReturnStack, and BlockStack, related by properties such as halt, programLocation, hasFrame, currentFrame, forFrame, fromBlock, operandTop, returnTop, blockTop, methodReuse, hasSymbol, and hasValue; the stacks are chained with rdf:first and rdf:rest.]

A Collection of Interlinked Graph Databases - Currently

[Figure: a collection of interlinked graph databases at 127.0.0.2 through 127.0.0.11.]

A Collection of Interlinked Graph Databases and Processors - Future

[Figure: the same hosts, 127.0.0.2 through 127.0.0.11, now hosting both graph databases and processors.]

The Future of Web-Based Distributed Computing

• The HTTP GET approach to the Web of Data does not scale.

• The Neno/Fhat (or any general-purpose computing) environment is unsafe.

• The Web of Data needs an open, safe, flexible, and easy-to-adopt computing infrastructure.

What Type of Processing?

• Object-oriented programming: Web of Data as an object repository.

• Logic: Web of Data as a knowledge-base.

• Graph/network analysis: Web of Data as a multi-relational graph.

• The future computing environment should support at least these popular processing models.

• We will focus on graph/network analysis for the remainder of this presentation.

Introduction to Random Walkers

• Random walkers can be used in single-relational networks to calculate:

  ⋆ stationary probability distribution: primary eigenvector calculation
  ⋆ spreading activation: search by means of diffusion

• There is a continuous and a discrete form of the general random walk method.

Random Walks in a Single-Relational Network

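• Suppose a single-relational network $G$, where

$$G = (V, E \subseteq (V \times V)).$$

• Let’s represent that network as a row stochastic adjacency matrix $A \in [0,1]^{|V| \times |V|}$, where

$$A_{i,j} = \begin{cases} \frac{1}{\Gamma(i)} & \text{if } (i,j) \in E \\ 0 & \text{otherwise.} \end{cases}$$

Here $\Gamma(i)$ is the number of edges leaving vertex $i$, so every row with at least one outgoing edge sums to 1.

• Finally, assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.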
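
The definition above can be sketched in a few lines of Python. The function name `row_stochastic` and the four-vertex edge set are assumptions made for illustration; the edge set was chosen because it is consistent with the example values shown on the following slides.

```python
def row_stochastic(vertices, edges):
    """Return A with A[i][j] = 1/outdegree(i) if (i, j) is an edge, else 0."""
    index = {v: n for n, v in enumerate(vertices)}
    A = [[0.0] * len(vertices) for _ in vertices]
    for tail, head in edges:
        A[index[tail]][index[head]] = 1.0
    for row in A:
        out_degree = sum(row)
        if out_degree:                       # leave all-zero rows (sinks) as they are
            row[:] = [x / out_degree for x in row]
    return A


# Assumed example graph over vertices a, b, c, d.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "b")]
A = row_stochastic(V, E)
for v, row in zip(V, A):
    print(v, row)
```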


Random Walks in a Single-Relational Network

[Figure: the network $G$ over vertices a, b, c, d with edges a→b, a→d, b→c, c→a, c→d, and d→b, shown beside its matrix $A$ and the energy vector $\pi$. Rows and columns are ordered a, b, c, d.]

$$A = \begin{pmatrix} 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \\ 0.5 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \qquad \pi = (1, 0, 0, 0)$$

• $\pi A$ can be interpreted as the continuous form of propagating random walkers over $G$.

Stationary Probability Distribution in a Single-Relational Network

[Figure: the energy vector $\pi$ repeatedly multiplied by $A$ over time, starting from $\pi_1 = (1, 0, 0, 0)$.]

time   a     b     c     d
π_1    1     0     0     0
π_2    0     0.5   0     0.5
π_3    0     0.5   0.5   0
π_4    0.25  0     0.5   0.25
π_5    0.25  0.38  0     0.38
π_6    0     0.5   0.38  0.13
...
π_∞    0.15  0.31  0.31  0.23

Stationary Probability Distribution in a Single-Relational Network

• If $G$ is strongly connected and aperiodic then there exists a $\pi$ such that $\pi = \pi A$.

• This stationary $\pi_\infty$ is the primary (left) eigenvector of $A$.

• PageRank computes the stationary $\pi$ by forcing $G$ (the Web citation graph) to be strongly connected and aperiodic.
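
A minimal power-iteration sketch of the fixed point $\pi = \pi A$. The matrix below is an assumed four-vertex example (edges a→b, a→d, b→c, c→a, c→d, d→b), chosen because it is strongly connected, aperiodic, and consistent with the values on these slides. The helper names `step` and `stationary` are hypothetical, and this is not PageRank itself, which additionally modifies the graph.

```python
# Row-stochastic matrix over vertices a, b, c, d (assumed example graph).
A = [
    [0.0, 0.5, 0.0, 0.5],   # a -> b, a -> d
    [0.0, 0.0, 1.0, 0.0],   # b -> c
    [0.5, 0.0, 0.0, 0.5],   # c -> a, c -> d
    [0.0, 1.0, 0.0, 0.0],   # d -> b
]


def step(pi, A):
    """One propagation step: the row vector pi times the matrix A."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]


def stationary(A, pi, tolerance=1e-12, max_steps=10_000):
    """Iterate pi <- pi A until the vector stops changing."""
    for _ in range(max_steps):
        nxt = step(pi, A)
        if max(abs(x - y) for x, y in zip(pi, nxt)) < tolerance:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; is the graph strongly connected and aperiodic?")


pi = stationary(A, [1.0, 0.0, 0.0, 0.0])
print([round(x, 2) for x in pi])   # [0.15, 0.31, 0.31, 0.23]
```

For this matrix the exact fixed point is (2/13, 4/13, 4/13, 3/13), which can be checked by substituting it into $\pi = \pi A$.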


Spreading Activation in a Single-Relational Network

• Spreading activation can be thought of as a “local rank” algorithm, while calculating the stationary probability provides you a “global rank”.

• With spreading activation, you iterate for only a certain number of timesteps.

• Also, you record how much energy has flowed through each vertex.

• Let’s demonstrate using a single discrete walker...


Spreading Activation in a Single-Relational Network

• The walker moves from vertex to vertex with choice dependent on the probability distribution of $A$.

• At every step, if the walker is at vertex $i$ then $\pi_i = \pi_i + 1$.

[Figure: a single walker on $G$ visits a, b, c, and then a again over four timesteps.]

time   a  b  c  d
π_1    1  0  0  0
π_2    1  1  0  0
π_3    1  1  1  0
π_4    2  1  1  0
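
The discrete walker can be sketched as follows. The out-neighbour table is an assumed example graph (edges a→b, a→d, b→c, c→a, c→d, d→b), and `spread` is a hypothetical helper name. Because the walk is random, the counts depend on the seed; only their total is fixed.

```python
import random

# Out-neighbours of each vertex (assumed example graph).
NEIGHBOURS = {"a": ["b", "d"], "b": ["c"], "c": ["a", "d"], "d": ["b"]}


def spread(start, steps, rng):
    """Walk for a fixed number of steps, counting visits per vertex."""
    pi = {v: 0 for v in NEIGHBOURS}
    current = start
    for _ in range(steps):
        pi[current] += 1                             # pi_i = pi_i + 1
        current = rng.choice(NEIGHBOURS[current])    # uniform over out-edges, as in A
    return pi


pi = spread("a", steps=4, rng=random.Random(0))
print(pi)   # visit counts after 4 timesteps; they always sum to 4
```

Vertices near the start accumulate energy first, which is why a short run gives a "local" ranking around the source vertex.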


Random Walks in a Multi-Relational Network

• Suppose a multi-relational network $M$, where

$$M = (V, E = \{E_0, E_1, \ldots, E_k \subseteq (V \times V)\}).$$

• Represent as a $\{0,1\}$-adjacency tensor $A \in \{0,1\}^{|V| \times |V| \times |E|}$, where

$$A^m_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in E_m : 1 \leq m \leq k \\ 0 & \text{otherwise.} \end{cases}$$

• Then assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.

M.A. Rodriguez and J. Shinavier. Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, in review, http://arxiv.org/abs/0806.2274, 2009.
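
A sketch of the adjacency tensor as nested Python lists, one $|V| \times |V|$ slice per edge label. The vertex names, the example edges, and the function `adjacency_tensor` are assumptions for illustration. Note that Python indexes the slices from 0, whereas the definition above indexes $m$ from 1.

```python
def adjacency_tensor(vertices, labelled_edges, labels):
    """Return T with T[m][i][j] = 1 if edge (i, j) carries the m-th label, else 0."""
    v = {name: n for n, name in enumerate(vertices)}
    T = [[[0] * len(vertices) for _ in vertices] for _ in labels]
    for tail, label, head in labelled_edges:
        T[labels.index(label)][v[tail]][v[head]] = 1
    return T


# Illustrative data (not taken from the slides).
V = ["marko", "paper1", "paper2", "journal"]
LABELS = ["authored", "cites", "contains"]
E = [
    ("marko", "authored", "paper1"),
    ("paper1", "cites", "paper2"),
    ("journal", "contains", "paper2"),
]

T = adjacency_tensor(V, E, LABELS)
authored, cites, contains = T     # one |V| x |V| slice per edge label
print(authored[0][1], cites[1][2], contains[3][2])   # 1 1 1
```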


Random Walks in a Multi-Relational Network

[Figure: a multi-relational network $M$ over vertices a, b, c, d with edges labeled authored, cites, and contains, shown beside its adjacency tensor $A$ (one {0,1} matrix slice per edge label) and the energy vector $\pi = (1, 0, 0, 0)$.]

The Operations of the Multi-Relational Path Algebra

• $A \cdot B$: ordinary matrix multiplication determines the number of $(A,B)$-paths between vertices.

• $A^{\top}$: matrix transpose inverts path directionality.

• $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths.

• $n(A)$: not generates the complement of a $\{0,1\}^{n \times n}$ matrix.

• $c(A)$: clip generates a $\{0,1\}^{n \times n}$ matrix from a $\mathbb{R}^{n \times n}_{+}$ matrix.

• $v^{\pm}(A)$: vertex generates a $\{0,1\}^{n \times n}$ matrix from a $\mathbb{R}^{n \times n}_{+}$ matrix, where only certain rows or columns contain non-zero values.

• $\lambda A$: scalar multiplication weights the entries of a matrix.

• $A + B$: matrix addition merges paths.
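Most of the operations listed above fit in a few lines each. The sketch below uses plain Python lists of lists as matrices so that it has no dependencies; the function names are chosen for this example only, and the vertex filter is left out because the slide does not spell out its rule.

```python
def matmul(A, B):
    """A · B: entry (i, j) counts the (A, B)-paths from i to j."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    """Invert path directionality."""
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Entry-wise product: keep a path count only where the filter allows it."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def not_(A):
    """Complement of a {0,1} matrix."""
    return [[1 if a == 0 else 0 for a in row] for row in A]


def clip(A):
    """Map a non-negative real matrix to {0,1}: 1 wherever an entry is positive."""
    return [[1 if a > 0 else 0 for a in row] for row in A]


def scale(lam, A):
    """Scalar multiplication: weight every entry."""
    return [[lam * a for a in row] for row in A]


def add(A, B):
    """Matrix addition: merge two sets of paths."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Tiny made-up matrices to exercise the operations.
A = [[0, 1], [0, 0]]
B = [[0, 0], [1, 0]]
AB = matmul(A, B)
```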


Page 62: Distributed Graph Databases and the Emerging Web of Data

The Traverse Operation

• An interesting aspect of the single-relational adjacency matrix $A \in \{0,1\}^{n \times n}$ is that when it is raised to the $k$th power, the entry $A^{(k)}_{i,j}$ is equal to the number of paths of length $k$ that connect vertex $i$ to vertex $j$.

• Given, by definition, that $A^{(1)}_{i,j}$ (i.e. $A_{i,j}$) represents the number of paths that go from $i$ to $j$ of length 1 (i.e. a single edge) and by the rules of ordinary matrix multiplication,

\[ A^{(k)}_{i,j} = \sum_{l \in V} A^{(k-1)}_{i,l} \cdot A_{l,j} : k \ge 2. \]

[Figure: a 3 × 3 adjacency matrix over vertices a, b, c multiplied by itself; the product has a non-zero entry at (a, c), i.e. there is a path of length 2 from a to c.]
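The recurrence above can be checked on a three-vertex chain. The edge set a → b → c is an assumption made for this sketch (the slide's own matrix entries are not recoverable here), and `matmul` and `power` are illustrative helper names.

```python
def matmul(A, B):
    """Ordinary matrix multiplication on lists of lists."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def power(A, k):
    """A raised to the kth power (k >= 1): entry (i, j) counts paths of length k."""
    result = A
    for _ in range(k - 1):
        result = matmul(result, A)
    return result


# Assumed edges: a -> b and b -> c (rows and columns ordered a, b, c).
A = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
A2 = power(A, 2)   # the only non-zero entry is (a, c)
A3 = power(A, 3)   # no path of length 3 exists in this chain
```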


Page 63: Distributed Graph Databases and the Emerging Web of Data

The Traverse Operation

⟨$A^{1}$ : authored⟩ ⟨$A^{2}$ : cites⟩ ⟨$A^{3}$ : contains⟩

\[ Z = A^{1} \cdot A^{2} \cdot A^{1\top} \]

$Z_{i,j}$ defines the number of paths from vertex $i$ to vertex $j$ such that a path goes from author $i$ to one of the articles he or she has authored, from that article to one of the articles it cites, and finally, from that cited article to its author $j$. Semantically, $Z$ is an author-citation single-relational path matrix.

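[Figure: starting from the source vertex lanl:marko, the authored matrix follows lanl:authored to an article, the cites matrix follows lanl:cites to a cited article, and the transposed authored matrix follows lanl:authored backwards to the author vub:fheyligh. Z is the resulting lanl:author-citation edge. Article vertices shown: vub:1010 and ieee:2020.]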

* NOTE: All diagrams are with respect to a “source” vertex (the blue vertex) in order to preserve clarity. In reality, the operations operate on all vertices in parallel.
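A small worked instance of the author-citation traversal follows. The four vertices and their edges are hypothetical stand-ins (two authors and two articles), and the helper names are chosen for this sketch.

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def empty(n):
    return [[0] * n for _ in range(n)]


# Hypothetical vertices: 0 = author_x, 1 = article_1, 2 = article_2, 3 = author_y.
n = 4
authored = empty(n)
authored[0][1] = 1   # author_x authored article_1
authored[3][2] = 1   # author_y authored article_2

cites = empty(n)
cites[1][2] = 1      # article_1 cites article_2

# author -> article -> cited article -> author of the cited article
Z = matmul(matmul(authored, cites), transpose(authored))
```

Here `Z[0][3]` is 1: there is exactly one path by which author_x cites author_y, and every other entry of `Z` is 0.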


Page 64: Distributed Graph Databases and the Emerging Web of Data

The Filter Operation

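Various path filters can be defined and applied using the entry-wise Hadamard matrix product, denoted $\circ$, where

\[ A \circ B = \begin{bmatrix} A_{1,1} \cdot B_{1,1} & \cdots & A_{1,m} \cdot B_{1,m} \\ \vdots & \ddots & \vdots \\ A_{n,1} \cdot B_{n,1} & \cdots & A_{n,m} \cdot B_{n,m} \end{bmatrix}. \]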

[Figure: Path Matrix ◦ Path Filter = Filtered Path Matrix. Entries of the path matrix survive only where the {0,1} path filter holds a 1.]


Page 65: Distributed Graph Databases and the Emerging Web of Data

The Filter Operation

• $A \circ 1 = A$

• $A \circ 0 = 0$

• $A \circ B = B \circ A$

• $A \circ (B + C) = (A \circ B) + (A \circ C)$

• $A^{\top} \circ B^{\top} = (A \circ B)^{\top}$.
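The identities above can be spot-checked numerically. The matrices below are made-up 2 × 2 examples, so this is a sanity check on one instance rather than a proof.

```python
def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transpose(A):
    return [list(col) for col in zip(*A)]


path_matrix = [[0, 72], [1, 23]]   # made-up path counts
path_filter = [[0, 1], [1, 0]]     # made-up {0,1} filter
C = [[1, 1], [0, 0]]
ones = [[1, 1], [1, 1]]
zeros = [[0, 0], [0, 0]]

filtered = hadamard(path_matrix, path_filter)

checks = {
    "identity": hadamard(path_matrix, ones) == path_matrix,
    "annihilator": hadamard(path_matrix, zeros) == zeros,
    "commutative": filtered == hadamard(path_filter, path_matrix),
    "distributive": hadamard(path_matrix, add(path_filter, C))
    == add(hadamard(path_matrix, path_filter), hadamard(path_matrix, C)),
    "transpose": hadamard(transpose(path_matrix), transpose(path_filter))
    == transpose(filtered),
}
```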


Page 66: Distributed Graph Databases and the Emerging Web of Data

The Not Filter

The not filter is useful for excluding a set of paths to or from a vertex.

\[ n : \{0,1\}^{n \times n} \rightarrow \{0,1\}^{n \times n} \]

with a function rule of

\[ n(A)_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} = 0 \\ 0 & \text{otherwise.} \end{cases} \]

[Figure: n applied to a {0,1} matrix; every 0 entry becomes 1 and every 1 entry becomes 0.]


Page 67: Distributed Graph Databases and the Emerging Web of Data

The Not Filter

If $A \in \{0,1\}^{n \times n}$, then

• $n(n(A)) = A$

• $A \circ n(A) = 0$

• $n(A) \circ n(A) = n(A)$.
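A quick check of the three properties on one made-up {0,1} matrix; `not_` is simply the name this sketch gives to the not filter.

```python
def not_(A):
    """The not filter: 1 where the entry is 0, and 0 otherwise."""
    return [[1 if a == 0 else 0 for a in row] for row in A]


def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[0, 1, 1], [1, 0, 0], [0, 0, 1]]   # arbitrary {0,1} example
zero = [[0] * 3 for _ in range(3)]

double_negation = not_(not_(A)) == A
disjoint = hadamard(A, not_(A)) == zero
idempotent = hadamard(not_(A), not_(A)) == not_(A)
```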


Page 68: Distributed Graph Databases and the Emerging Web of Data

The Not Filter

⟨$A^{1}$ : authored⟩ ⟨$A^{2}$ : cites⟩ ⟨$A^{3}$ : contains⟩

A coauthorship path matrix is

\[ Z = A^{1} \cdot A^{1\top} \circ n(I) \]

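[Figure: lanl:marko and lanl:jbollen are both connected by lanl:authored to the article acm:0505. Following the authored matrix and then its transpose yields a lanl:coauthor edge between the two authors, while n(I) removes the lanl:coauthor edge that would lead from the source vertex back to itself.]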
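The coauthorship matrix can be computed directly for the three vertices named in the figure. The edge set is this sketch's reading of the figure (both authors authored acm:0505), and the product is taken before the Hadamard filter, i.e. (authored · authoredᵀ) ◦ n(I).

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def not_(A):
    return [[1 if a == 0 else 0 for a in row] for row in A]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# Vertex order: 0 = lanl:marko, 1 = acm:0505, 2 = lanl:jbollen.
authored = [
    [0, 1, 0],   # lanl:marko authored acm:0505
    [0, 0, 0],
    [0, 1, 0],   # lanl:jbollen authored acm:0505
]

shared = matmul(authored, transpose(authored))   # includes self-pairs on the diagonal
Z = hadamard(shared, not_(identity(3)))          # n(I) removes the diagonal
```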


Page 69: Distributed Graph Databases and the Emerging Web of Data

The Clip Filter

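The general purpose of clip is to take a path matrix and “clip”, or normalize, it to a $\{0,1\}^{n \times n}$ matrix.

\[ c : \mathbb{R}^{n \times n}_{+} \rightarrow \{0,1\}^{n \times n} \]

\[ c(Z)_{i,j} = \begin{cases} 1 & \text{if } Z_{i,j} > 0 \\ 0 & \text{otherwise.} \end{cases} \]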

[Figure: c applied to a path matrix with non-negative real entries; every positive entry becomes 1 and every 0 entry stays 0.]


Page 70: Distributed Graph Databases and the Emerging Web of Data

The Clip Filter

If $A, B \in \{0,1\}^{n \times n}$ and $Y, Z \in \mathbb{R}^{n \times n}_{+}$, then

• $c(A) = A$

• $c(n(A)) = n(c(A)) = n(A)$

• $c(Y \circ Z) = c(Y) \circ c(Z)$

• $n(A \circ B) = c(n(A) + n(B))$

• $n(A + B) = n(A) \circ n(B)$
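These identities, including the two De Morgan-style rules, can be spot-checked on small matrices. The values are invented, and A and B are chosen so that all four combinations of 0 and 1 occur across their entries.

```python
def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def not_(A):
    return [[1 if a == 0 else 0 for a in row] for row in A]


def clip(A):
    return [[1 if a > 0 else 0 for a in row] for row in A]


A = [[1, 1], [0, 0]]
B = [[1, 0], [1, 0]]
Y = [[0, 72], [1, 4]]   # non-negative path counts
Z = [[3, 0], [2, 0]]

checks = [
    clip(A) == A,
    clip(not_(A)) == not_(clip(A)) == not_(A),
    clip(hadamard(Y, Z)) == hadamard(clip(Y), clip(Z)),
    not_(hadamard(A, B)) == clip(add(not_(A), not_(B))),
    not_(add(A, B)) == hadamard(not_(A), not_(B)),
]
```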


Page 71: Distributed Graph Databases and the Emerging Web of Data

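The Clip Filter

⟨$A^{1}$ : authored⟩ ⟨$A^{2}$ : cites⟩ ⟨$A^{3}$ : contains⟩

Suppose we want to create an author citation path matrix that does not allow self citation or coauthor citations.

\[ Z = \underbrace{\left(A^{1} \cdot A^{2} \cdot A^{1\top}\right)}_{\text{cites}} \circ \underbrace{n\left(c\left(A^{1} \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}} \]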

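[Figure: an author-citation path from the source vertex lanl:marko through lanl:authored, lanl:cites, and lanl:authored (reversed), producing the lanl:author-citation edge Z. Vertices shown: lanl:marko, lanl:jbollen, odu:nelson, lanl:3030, lanl:4040. The lanl:coauthor edge is excluded by the no-coauthors filter and the self edge is excluded by n(I).]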


Page 72: Distributed Graph Databases and the Emerging Web of Data

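The Clip Filter

⟨$A^{1}$ : authored⟩ ⟨$A^{2}$ : cites⟩ ⟨$A^{3}$ : contains⟩

However, using various theorems of the path algebra and abstract algebra in general,

\[ Z = \underbrace{\left(A^{1} \cdot A^{2} \cdot A^{1\top}\right)}_{\text{cites}} \circ \underbrace{n\left(c\left(A^{1} \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}} \]

becomes

\[ Z = \left(A^{1} \cdot A^{2} \cdot A^{1\top}\right) \circ n\left(c\left(A^{1} \cdot A^{1\top}\right)\right) \circ n(I). \]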
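The simplification can be checked on a small example by computing both forms and comparing them. The six vertices and their edges are invented for this sketch (three authors and three hypothetical articles p, q, r); agreement on one instance is a sanity check, not a proof of the theorem.

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def not_(A):
    return [[1 if a == 0 else 0 for a in row] for row in A]


def clip(A):
    return [[1 if a > 0 else 0 for a in row] for row in A]


def from_edges(n, edges):
    M = [[0] * n for _ in range(n)]
    for i, j in edges:
        M[i][j] = 1
    return M


# Hypothetical vertices: 0 marko, 1 jbollen, 2 nelson, 3 article p, 4 article q, 5 article r.
n = 6
authored = from_edges(n, [(0, 3), (1, 3), (2, 4), (0, 5)])
cites = from_edges(n, [(3, 4), (3, 5), (4, 3)])
no_self = not_(from_edges(n, [(i, i) for i in range(n)]))   # n(I)

author_cites = matmul(matmul(authored, cites), transpose(authored))
shared = matmul(authored, transpose(authored))

# Original form: the coauthor filter is itself stripped of self-pairs first.
Z_long = hadamard(hadamard(author_cites, not_(clip(hadamard(shared, no_self)))), no_self)
# Simplified form: the inner n(I) is dropped, since the outer n(I) already zeroes the diagonal.
Z_short = hadamard(hadamard(author_cites, not_(clip(shared))), no_self)
```

In this example marko and jbollen are coauthors of article p, so the citation from p to marko's article r is filtered out for both of them, while their citations of nelson (and nelson's citations of them) remain.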


Page 73: Distributed Graph Databases and the Emerging Web of Data

Other Filters and Operations...

• Please refer to the article for more information on these filters and operations.


Page 74: Distributed Graph Databases and the Emerging Web of Data

Problems with the Path Algebra

• As a matrix algebra, it is impossible (computationally speaking) to compute matrix operations over the entire Web of Data.

• However, it is possible to approximate these calculations using “random” walkers.


Page 75: Distributed Graph Databases and the Emerging Web of Data

Mapping Paths to Grammar-Based Random Walkers

• A grammar-based random walker is a walker that obeys a path description.

• Able to compute “semantically rich” spreading activation and stationary probability distributions in a multi-relational network.

• Able to approximate through the convergence properties of these operations.

• Provides a convenient application to the Web of Data and linked graph databases.

M.A. Rodriguez. Grammar-Based Random Walkers in Semantic Networks. Knowledge-Based Systems, 21(7), 727–739, 2008.
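To make the idea of a walker that obeys a path description concrete, the sketch below follows the coauthorship description (authored, then authored in reverse, discarding walks that return to the source) by sampling instead of multiplying matrices. It is a loose illustration under stated assumptions, not the algorithm of the cited article: the function `coauthor_walks`, the data, and the uniform edge choice are all hypothetical.

```python
import random


def coauthor_walks(authored, source, walks, seed=0):
    """Sample two-step walks from `source` and count where they end."""
    rng = random.Random(seed)             # seeded so that runs are repeatable
    # Invert the authored relation once: article -> list of its authors.
    authors_of = {}
    for author, articles in authored.items():
        for article in articles:
            authors_of.setdefault(article, []).append(author)

    counts = {}
    for _ in range(walks):
        if not authored.get(source):
            break                                   # the source authored nothing
        article = rng.choice(authored[source])      # step 1: follow authored
        end = rng.choice(authors_of[article])       # step 2: follow authored in reverse
        if end == source:
            continue                                # n(I): drop walks back to the source
        counts[end] = counts.get(end, 0) + 1
    return counts


# Hypothetical data: author -> list of authored articles.
authored = {
    "lanl:marko": ["article_p", "article_r"],
    "lanl:jbollen": ["article_p"],
    "odu:nelson": ["article_q"],
}
counts = coauthor_walks(authored, "lanl:marko", walks=500)
```

The walker only ever touches the vertices it visits, which is the property that matters when the full matrices cannot be materialized. Only lanl:jbollen can appear in `counts` here, because that is the only other author reachable under the path description.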


Page 76: Distributed Graph Databases and the Emerging Web of Data

A Grammar Walker

[Figure: a grammar walker carrying the path description $A^{1} \cdot A^{1\top} \circ n(I)$ moves at t=1, t=2, and t=3 across the structures hosted at 127.0.0.4, 127.0.0.5, and 127.0.0.6 on the Web of Data.]


Page 77: Distributed Graph Databases and the Emerging Web of Data

Grammar Walking the Web of Data

[Figure: a grammar walker taking steps 1 through 7 across data sets hosted at the addresses 127.0.0.1 through 127.0.0.11.]


Page 78: Distributed Graph Databases and the Emerging Web of Data

Conclusion

• Graph databases will increasingly support the Web of Data.

• The Web of Data is about open, global-scale data management.

• Distributed computing is required for global-scale data processing.

• Grammar walkers can be used for distributed network analysis on the Web of Data.


Page 79: Distributed Graph Databases and the Emerging Web of Data

Thank You For Your Time

• My homepage: http://markorodriguez.com

• Neno/Fhat: http://neno.lanl.gov

• Collective Decision Making Systems: http://cdms.lanl.gov

• Faith in the Algorithm: http://faithinthealgorithm.net

• MESUR: http://www.mesur.org
