Distributed Graph Databases and the Emerging Web of Data

Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory

http://markorodriguez.com

April 16, 2009

Abstract

The World Wide Web is the de facto medium for publicly exposing a corpus of interrelated documents. In its current form, the World Wide Web is the Web of Documents. The next generation of the World Wide Web will support the Web of Data. The Web of Data utilizes the same Uniform Resource Identifier (URI) address space as the Web of Documents, but instead of exposing a graph of documents, the Web of Data exposes a graph of data. Given that the URI address space of the Web is distributed and infinite, the Web of Data provides a single unified space by which the world's data can be publicly exposed and interrelated. The Web of Data is supported by both graph databases (which structure the data) and distributed computing mechanisms (which process the data). This presentation will discuss the Web of Data, graph databases, and models of computing in this emerging space.

Computer Science Department Colloquium – University of New Mexico – April 16, 2009

Outline

• The Relational Database vs. the Graph Database

• The Web of Documents vs. the Web of Data

• Local Computing vs. Distributed Computing

• Multi-Relational Network Analysis with Grammar Walkers


The Relational Database vs. the Graph Database

• A relational database’s (e.g. MySQL, PostgreSQL, Oracle) data model is a collection of interlinked tables.

• A graph database’s (e.g. OpenSesame, AllegroGraph, Neo4j) data model is a multi-relational graph.

[Figure: a relational database at 127.0.0.1 beside a graph database at 127.0.0.2 whose vertices are labeled a, b, c, and d.]

Types of Graphs

• Undirected single-relational graph: homogeneous set of symmetric links.

• Directed single-relational graph: homogeneous set of links.

• Directed multi-relational graph: heterogeneous set of links.

[Figure: three panels, each linking a vertex x to a vertex z: an undirected single-relational graph, a directed single-relational graph, and a directed multi-relational graph whose edge carries the label y.]


Our Make Believe World - Phase 1

• Marko is a human and Fluffy is a dog.


Our World Modeled in a Relational Database - Phase 1

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true


Our World Modeled in a Graph Database - Phase 1

[Figure: vertex 0001 has the edges name → Marko, type → Human, legs → 2, and fur → false; vertex 0002 has the edges name → Fluffy, type → Dog, legs → 4, and fur → true.]




Our Make Believe World - Phase 2

• Marko and Fluffy are good friends.



Our World Modeled in a Relational Database - Phase 2

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table

ID1   ID2
0001  0002
0002  0001



Our World Modeled in a Graph Database - Phase 2

[Figure: the Phase 1 graph with friend edges added in both directions between vertices 0001 and 0002.]





Our Make Believe World - Phase 3

• Human and dog are subclasses of mammal.



Our World Modeled in a Relational Database - Phase 3

Object_Table

ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal



Our World Modeled in a Graph Database - Phase 3

[Figure: the Phase 2 graph with a new Mammal vertex and subclassof edges from Human to Mammal and from Dog to Mammal.]






Our Make Believe World - Phase 4

• Fluffy peed on the carpet.



Our World Modeled in a Relational Database - Phase 4

Object_Table

ID    Name    Type    Legs  Fur
0001  Marko   Human   2     false
0002  Fluffy  Dog     4     true
0003  My_Rug  Carpet  N/A   N/A

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal

Pee_Table

ID1   ID2
0002  0003



Our World Modeled in a Graph Database - Phase 4

[Figure: the Phase 3 graph with a new vertex 0003 that has the edges name → My_Rug and type → Carpet, and a peedOn edge from 0002 to 0003.]






Our Make Believe World - Phase 5

• Fluffy peed on the carpet.

• Marko and Fluffy are both mammals.



Our World Modeled in a Relational Database - Phase 5

Object_Table

ID    Name    Type    Legs  Fur
0001  Marko   Human   2     false
0002  Fluffy  Dog     4     true
0003  My_Rug  Carpet  N/A   N/A

Friendship_Table

ID1   ID2
0001  0002
0002  0001

Subclass_Table

Type1  Type2
Human  Mammal
Dog    Mammal

Pee_Table

ID1   ID2
0002  0003

Type_Table

ID    Type
0001  Human
0002  Dog
0003  Carpet
0001  Mammal
0002  Mammal



Our World Modeled in a Graph Database - Phase 5

[Figure: the Phase 4 graph with type edges added from 0001 to Mammal and from 0002 to Mammal.]


The Graph as the Natural World Model

• The world is inherently (or perceived as) object-oriented.

• The world is filled with objects and relations among them.

• The multi-relational graph is a very natural representation of the world.


The Graph as the Natural Programming Model

• High-level computer languages are object-oriented.

• Nearly no impedance mismatch between the multi-relational graph and the programming object.

• It is easy to go from graph database to in-memory object.

Human marko = new Human();
marko.name = "Marko";
marko.addFriend(fluffy);
marko.setHasFur(false);
marko.setLegs(2);


SQL vs. SPARQL

SQL:

SELECT OTY.Name
FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
WHERE OTX.Name = "Marko"
  AND Friendship_Table.ID1 = OTY.ID
  AND Friendship_Table.ID2 = OTX.ID;

SPARQL:

SELECT ?z WHERE {
  ?x name "Marko" .
  ?y friend ?x .
  ?y name ?z
}

E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF, WWW Consortium, http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/, 2004.
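To make the SPARQL side of the comparison concrete, the following is a minimal sketch of how a basic graph pattern can be matched against a set of triples. It is not how any particular graph database evaluates SPARQL; the `TRIPLES` list, `match`, and `query` are illustrative names, and the data is the Phase 2 make-believe world.

```python
# Toy triple store for the make-believe world (illustrative data and names).
TRIPLES = [
    ("0001", "name", "Marko"),
    ("0002", "name", "Fluffy"),
    ("0001", "friend", "0002"),
    ("0002", "friend", "0001"),
]


def match(pattern, binding):
    """Yield every extension of `binding` under which `pattern` equals a stored triple."""
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must agree with any value it is already bound to.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def query(patterns):
    """Join the triple patterns left to right, starting from one empty binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings


# SELECT ?z WHERE { ?x name "Marko" . ?y friend ?x . ?y name ?z }
results = query([
    ("?x", "name", "Marko"),
    ("?y", "friend", "?x"),
    ("?y", "name", "?z"),
])
friend_names = [b["?z"] for b in results]
print(friend_names)
```

The three patterns share variables, so the join that SQL spells out with table aliases and ID comparisons falls out of consistent variable binding.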


Internet Address Spaces

• The Uniform Resource Identifier (URI) is the superclass of the Uniform Resource Locator (URL) and Uniform Resource Name (URN).


The Uniform Resource Locator

• The set of all URLs is the address space of all resources that can be located and retrieved on the Web. URLs denote where a resource is.

  ⋆ http://markorodriguez.com/index.html
    ∗ Domain name server (DNS): markorodriguez.com → 216.251.43.6
    ∗ http:// means GET at port 80,
    ∗ /index.html means the resource to get at that Internet location.

[Figure: the web server at markorodriguez.com (216.251.43.6) serving index.html.]

The Uniform Resource Name

• The set of all URNs is the address space of all resources within the urn: namespace.

  ⋆ urn:uuid:bd93def0-8026-11dd-842be54955baa12
  ⋆ urn:issn:0892-3310
  ⋆ urn:doi:10.1016/j.knosys.2008.03.030

• Named resources need not be retrievable through the Web.

• URNs denote what a resource is.

The Uniform Resource Identifier

• The URI address space is an infinite space for all Internet resources.

  ⋆ urn:issn:0892-3310
  ⋆ ftp://markorodriguez.com/private/markos_secrets.txt
  ⋆ http://www.lanl.gov#fluffy

• Important: URIs can denote concepts, instances, and data.

[Figure: two example URIs, lanl:fluffy and lanl:fluffy_legs.]

lanl is a namespace prefix that expands to http://www.lanl.gov#.
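Prefix expansion is plain string substitution. As a small sketch (the `PREFIXES` table and the `expand` helper are illustrative names, not part of any RDF library):

```python
# Prefix table; only the lanl mapping is taken from the slide.
PREFIXES = {"lanl": "http://www.lanl.gov#"}


def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name to a full URI; leave unknown schemes untouched."""
    prefix, sep, local = qname.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return qname


print(expand("lanl:fluffy"))
print(expand("urn:issn:0892-3310"))
```

A name such as urn:issn:0892-3310 is returned unchanged because urn is a URI scheme, not a declared prefix.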



The Web of Documents

• The Web of Documents is primarily concerned with the Hypertext Transfer Protocol (HTTP) and with retrievable resources in the URL address space.

• These retrievable resources are files: HTML documents, images, audio, etc. The “web” is created when HTML documents contain URLs.

[Figure: at http://markorodriguez.com/, index.html links by href to Home.html, Research.html, and Resume.html.]


The Web of Data

• The Web of Data is primarily concerned with URIs.

• The Resource Description Framework (RDF) is the standard for representing the relationship between URIs and literals (e.g. float, string, date time, etc.).

[Figure: three triples, each read as subject, predicate, object:]

lanl:marko   foaf:knows  lanl:fluffy
lanl:marko   foaf:name   "Marko A. Rodriguez"^^xsd:string
lanl:fluffy  foaf:name   "Fluffy P. Everywhere"^^xsd:string

C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked Data on the Web, International World Wide Web Conference, 2008.


Our Make Believe World in RDF

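[Figure, rendered as subject, predicate, object triples:]

lanl:marko   rdf:type         lanl:Human
lanl:marko   rdf:type         lanl:Mammal
lanl:marko   foaf:name        "Marko A. Rodriguez"^^xsd:string
lanl:marko   lanl:legs        "2"^^xsd:integer
lanl:marko   lanl:fur         "false"^^xsd:boolean
lanl:marko   lanl:friend      lanl:fluffy
lanl:fluffy  rdf:type         lanl:Dog
lanl:fluffy  rdf:type         lanl:Mammal
lanl:fluffy  foaf:name        "Fluffy P. Everywhere"^^xsd:string
lanl:fluffy  lanl:legs        "4"^^xsd:integer
lanl:fluffy  lanl:fur         "true"^^xsd:boolean
lanl:fluffy  lanl:friend      lanl:marko
lanl:Human   rdfs:subClassOf  lanl:Mammal
lanl:Dog     rdfs:subClassOf  lanl:Mammal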


The Web of Data is a Distributed Database

• The URI address space is distributed.

• URIs can denote data.

• RDF denotes the relationships between URIs.

• The Web of Data’s foundational standard is RDF.

• Therefore, the Web of Data is a distributed database.


The Web of Documents vs. the Web of Data

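[Figure: in the Web of Documents, HTML on the web server at 127.0.0.1 links by href to HTML on the web server at 127.0.0.2; in the Web of Data, a vertex in the graph database at 127.0.0.1 links by lanl:friend to a vertex in the graph database at 127.0.0.2.]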


The Current Web of Data - March 2009

[Figure: the Linked Data cloud as of March 2009, drawn as a graph whose vertices are data sets (e.g. dbpedia, freebase, geonames, uniprot, musicbrainz) and whose edges are the links between them.]

M.A. Rodriguez. A Graph Analysis of the Linked Data Cloud, in review, http://arxiv.org/abs/0903.0194, 2009.


The Current Web of Data - March 2009

data set           domain      data set          domain      data set            domain
audioscrobbler     music       govtrack          government  pubguide            books
bbclatertotp       music       homologene        biology     qdos                social
bbcplaycountdata   music       ibm               computer    rae2001             computer
bbcprogrammes      media       ieee              computer    rdfbookmashup       books
budapestbme        computer    interpro          biology     rdfohloh            social
chebi              biology     jamendo           music       resex               computer
crunchbase         business    laascnrs          computer    riese               government
dailymed           medical     libris            books       semanticweborg      computer
dblpberlin         computer    lingvoj           reference   semwebcentral       social
dblphannover       computer    linkedct          medical     siocsites           social
dblprkbexplorer    computer    linkedmdb         movie       surgeradio          music
dbpedia            general     magnatune         music       swconferencecorpus  computer
doapspace          social      musicbrainz       music       taxonomy            reference
drugbank           medical     myspacewrapper    social      umbel               general
eurecom            computer    opencalais        reference   uniref              biology
eurostat           government  opencyc           general     unists              biology
flickrexporter     images      openguides        reference   uscensusdata        government
flickrwrappr       images      pdb               biology     virtuososponger     reference
foafprofiles       social      pfam              biology     w3cwordnet          reference
freebase           general     pisa              computer    wikicompany         business
geneid             biology     prodom            biology     worldfactbook       government
geneontology       biology     projectgutenberg  books       yago                general
geonames           geographic  prosite           biology     . . .


Cultural Differences that are Leading to Web-Based Data Management - Part 1

• Relational databases tend to not maintain public access points.

• Relational database users tend to not publish their schemas.

• Web of Data graph databases maintain public access points called SPARQL end-points or Linked Data URLs.

• Web of Data graph database users tend to reuse and extend public schemas called ontologies.


Cultural Differences that are Leading to Web-Based Data Management - Part 2

[Figure: the Conventional Model, in which Applications 1 to 3 each structure and process their own data, beside the Web of Data Model, in which the applications process data that is structured on the Web of Data.]


SPARQLing a Data Provider - Local Computing

[Figure: the client at 127.0.0.1 sends a query to the SPARQL end-point of the graph database at 127.0.0.2 and receives the result.]

SELECT ?x WHERE { lanl:marko lanl:friend ?x }

{ lanl:fluffy }

• The 127.0.0.1 client is querying the 127.0.0.2 server.

• The query is any read-based SPARQL query.

• The results are those resources that bind to the query variables.
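As a sketch of what the client side of this exchange can look like, the snippet below builds the GET request URL for a read query, using the `query` parameter defined by the SPARQL protocol. The end-point path `/sparql` on 127.0.0.2 is an assumption made for illustration, as is the explicit `PREFIX` declaration; no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical end-point location; the slide only gives the host 127.0.0.2.
ENDPOINT = "http://127.0.0.2/sparql"

# The slide's query, with the lanl prefix declared so it is self-contained.
QUERY = """PREFIX lanl: <http://www.lanl.gov#>
SELECT ?x WHERE { lanl:marko lanl:friend ?x }"""


def build_request_url(endpoint, query):
    """Return the URL a client would GET to run `query` at `endpoint`."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)

# Round-trip check: the server side can recover the exact query text.
recovered = parse_qs(urlsplit(url).query)["query"][0]
```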


GETing Linked Data as RDF - Local Computing

[Figure: the client at 127.0.0.1 issues HTTP GET requests to the Web of Data for http://www.lanl.gov#marko and http://www.vub.edu#1010 and receives RDF subgraphs containing the resources lanl:marko, lanl:fluffy, vub:1010, and ieee:2020 and the predicates lanl:friend, lanl:wrote, and lanl:cites.]


Problem with the Current Web of Data Infrastructure

• The only interfaces are SPARQL end-points and HTTP GETs of RDF subgraphs.

• For human-based document retrieval, this is fine. For machine-based data processing, this does not scale.

M.A. Rodriguez. A Distributed Process Infrastructure for a Distributed Data Structure. Semantic Web and Information Systems Bulletin, AIS Special Interest Group on Semantic Web and Information Systems, http://arxiv.org/abs/0807.3908, 2008.


Problem with the Current Web of Data Infrastructure

• We cannot rely on the “download and index” philosophy of the World Wide Web.

  ⋆ As of March 2009, the Web of Data maintains 4.5 billion triples.

• The Web of Data cannot rely on a single service provider.

  ⋆ too much data.
  ⋆ too many types of algorithms that can utilize this data.
  ⋆ too many clock cycles to locally process this data.


The Open Virtual Machine Farm

[Figure: the graph databases at 127.0.0.1 and 127.0.0.2, linked by a lanl:friend edge, each paired with a virtual machine farm; code/machine migrates between the farms.]

• Distributed computing through code/machine migration between farms.

• Move the process to the data, not the data to the process.

M.A. Rodriguez. General Purpose Computing on a Semantic Network Substrate. in Emergent Web Intelligence, eds. R. Chbeir, A. Hassanien, A. Abraham and Y. Badr, Springer-Verlag, http://arxiv.org/abs/0704.3395, 2009.

M.A. Rodriguez. The RDF Virtual Machine, in review, LA-UR-08-03925, 2009.


Neno RDF Programming Language - Code Serialization

[Figure: the method below serialized as an RDF graph. Vertices identified by urn:uuid URIs are typed as Method, Block, Equals, Branch, PushValue, Return, and LocalDirect, and are linked by edges such as hasMethod, hasMethodName, hasBlock, hasLeft, hasRight, hasValue, hasURI, nextInst, trueInst, and falseInst.]

xsd:int example(xsd:string a) {
  if (a == "marko")
    return 1;
  else
    return 2;
}


The Fhat RDF Virtual Machine - Machine Serialization

[Figure: class diagram of the Fhat RVM serialization, including Fhat, RVM, Instruction, Frame, FrameVariable, OperandStack, ReturnStack, BlockStack, and Block, linked by properties such as programLocation, hasFrame, currentFrame, operandTop, returnTop, blockTop, hasSymbol, hasValue, halt, and methodReuse.]


A Collection of Interlinked Graph Databases - Currently

[Figure: interlinked graph databases at 127.0.0.2 through 127.0.0.11.]


A Collection of Interlinked Graph Databases and Processors - Future

[Figure: interlinked graph databases and processors at 127.0.0.2 through 127.0.0.11.]


The Future of Web-Based Distributed Computing

• The HTTP GET approach to the Web of Data does not scale.

• The Neno/Fhat (or any general-purpose computing) environment is unsafe.

• The Web of Data needs an open, safe, flexible, and easy-to-adopt computing infrastructure.


What Type of Processing?

• Object-oriented programming: Web of Data as an object repository.

• Logic: Web of Data as a knowledge-base.

• Graph/network analysis: Web of Data as a multi-relational graph.

• The future computing environment should support at least these popular processing models.

• We will focus on graph/network analysis for the remainder of this presentation.


Introduction to Random Walkers

• Random walkers can be used in single-relational networks to calculate:

  ⋆ stationary probability distribution: primary eigenvector calculation
  ⋆ spreading activation: search by means of diffusion

• There is a continuous and a discrete form of the general random walk method.


Random Walks in a Single-Relational Network

• Suppose a single-relational network $G$, where

\[ G = (V, E \subseteq (V \times V)). \]

• Let’s represent that network as a row-stochastic adjacency matrix $A \in [0,1]^{|V| \times |V|}$, where

\[ A_{i,j} = \begin{cases} \frac{1}{\Gamma(i)} & \text{if } (i,j) \in E \\ 0 & \text{otherwise.} \end{cases} \]

• Finally, assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.


Random Walks in a Single-Relational Network

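[Figure: the graph G over the vertices a, b, c, d with the edges a→b, a→d, b→c, c→a, c→d, and d→b, its row-stochastic adjacency matrix A, and the initial energy vector π.]

\[ A = \begin{pmatrix} 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \\ 0.5 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \qquad \pi = (1, 0, 0, 0) \]

(rows and columns are ordered a, b, c, d)

• πA can be interpreted as the continuous form of propagating random walkers over G.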


Stationary Probability Distribution in a Single-Relational Network

[Figure: the energy vector π, starting at (1, 0, 0, 0), is repeatedly multiplied by A; time runs downward.]

      a     b     c     d
π1    1     0     0     0
π2    0     0.5   0     0.5
π3    0     0.5   0.5   0
π4    0.25  0     0.5   0.25
π5    0.25  0.38  0     0.38
π6    0     0.5   0.38  0.13
...
π∞    0.15  0.31  0.31  0.23


Stationary Probability Distribution in a Single-Relational Network

• If G is strongly connected and aperiodic, then there exists a π such that π = πA.

• This stationary π∞ is the primary eigenvector of A.

• PageRank computes the stationary π by forcing G (the Web citation graph) to be strongly connected and aperiodic.
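The fixed point π = πA can be approximated by repeated multiplication (power iteration). The sketch below assumes the four-vertex example graph with edges a→b, a→d, b→c, c→a, c→d, d→b, which is the edge set consistent with the iteration values shown on the earlier slide, and takes each row of A to spread its mass evenly over the out-edges. The function names and the tolerance are illustrative choices.

```python
# Row-stochastic adjacency matrix; rows and columns are ordered a, b, c, d.
A = [
    [0.0, 0.5, 0.0, 0.5],  # a -> b, a -> d
    [0.0, 0.0, 1.0, 0.0],  # b -> c
    [0.5, 0.0, 0.0, 0.5],  # c -> a, c -> d
    [0.0, 1.0, 0.0, 0.0],  # d -> b
]


def step(pi, A):
    """One propagation step: the row vector pi times the matrix A."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]


def stationary(pi, A, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi A until successive vectors differ by less than tol."""
    for _ in range(max_iter):
        nxt = step(pi, A)
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


pi_2 = step([1.0, 0.0, 0.0, 0.0], A)       # energy after one step from a
pi_inf = stationary([1.0, 0.0, 0.0, 0.0], A)
print([round(x, 2) for x in pi_inf])
```

Solving π = πA by hand for this matrix gives π∞ = (2/13, 4/13, 4/13, 3/13), which the iteration approaches because the graph contains cycles of length 3 and 4 and is therefore aperiodic.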


Spreading Activation in a Single-Relational Network

• Spreading activation can be thought of as a “local rank” algorithm, while calculating the stationary probability provides you a “global rank”.

• With spreading activation, you iterate for only a certain number of timesteps.

• Also, you record how much energy has flowed through each vertex.

• Let’s demonstrate using a single discrete walker...


Spreading Activation in a Single-Relational Network

• The walker moves from vertex to vertex, with each choice drawn from the probability distribution given by A.

• At every step, if the walker is at vertex $i$, then $\pi_i = \pi_i + 1$.

[Figure: a single discrete walker traversing G over four time steps; the counter vector π is shown at each step, from π1 to π4.]
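The discrete form can be sketched as follows: one walker, a fixed number of steps, and a counter per vertex. The matrix is the assumed four-vertex example (edges a→b, a→d, b→c, c→a, c→d, d→b), and `spread` and the seed are illustrative choices.

```python
import random

# Row-stochastic adjacency matrix; rows and columns are ordered a, b, c, d.
A = [
    [0.0, 0.5, 0.0, 0.5],  # a -> b, a -> d
    [0.0, 0.0, 1.0, 0.0],  # b -> c
    [0.5, 0.0, 0.0, 0.5],  # c -> a, c -> d
    [0.0, 1.0, 0.0, 0.0],  # d -> b
]


def spread(start, steps, A, rng):
    """Walk `steps` edges from `start`, counting how often each vertex is visited."""
    n = len(A)
    pi = [0] * n
    current = start
    pi[current] += 1  # the walker is counted at its starting vertex
    for _ in range(steps):
        # Choose the next vertex according to the current row of A.
        current = rng.choices(range(n), weights=A[current])[0]
        pi[current] += 1
    return pi


rng = random.Random(42)
counts = spread(0, 3, A, rng)  # three steps starting at vertex a
print(counts)
```

Because only a few steps are taken, the counts stay concentrated near the start vertex, which is what makes this a “local rank”.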


Random Walks in a Multi-Relational Network

• Suppose a multi-relational network $M$, where

\[ M = (V, E = \{E_0, E_1, \ldots, E_k \subseteq (V \times V)\}). \]

• Represent it as a $\{0,1\}$-adjacency tensor $A \in \{0,1\}^{|V| \times |V| \times |E|}$, where

\[ A^m_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in E_m : 1 \leq m \leq k \\ 0 & \text{otherwise.} \end{cases} \]

• Then assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.

M.A. Rodriguez and J. Shinavier. Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, in review, http://arxiv.org/abs/0806.2274, 2009.


Random Walks in a Multi-Relational Network

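[Figure: a multi-relational network M over the vertices a, b, c, d with the edge labels authored, cites, and contains; its adjacency tensor A, with one {0, 1} slice per edge label; and the energy vector π = (1, 0, 0, 0).]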


The Operations of the Multi-Relational Path Algebra

• $A \cdot B$: ordinary matrix multiplication determines the number of $(A,B)$-paths between vertices.

• $A^\top$: matrix transpose inverts path directionality.

• $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths.

• $n(A)$: not generates the complement of a $\{0,1\}^{n \times n}$ matrix.

• $c(A)$: clip generates a $\{0,1\}^{n \times n}$ matrix from an $\mathbb{R}_+^{n \times n}$ matrix.

• $v^\pm(A)$: vertex generates a $\{0,1\}^{n \times n}$ matrix from an $\mathbb{R}_+^{n \times n}$ matrix, where only certain rows or columns contain non-zero values.

• $\lambda A$: scalar multiplication weights the entries of a matrix.

• $A + B$: matrix addition merges paths.


The Traverse Operation

• An interesting aspect of the single-relational adjacency matrix $A \in \{0,1\}^{n \times n}$ is that when it is raised to the $k$th power, the entry $A^{(k)}_{i,j}$ is equal to the number of paths of length $k$ that connect vertex $i$ to vertex $j$.

• Given, by definition, that $A^{(1)}_{i,j}$ (i.e. $A_{i,j}$) represents the number of paths of length 1 (i.e. a single edge) that go from $i$ to $j$, and by the rules of ordinary matrix multiplication,

\[ A^{(k)}_{i,j} = \sum_{l \in V} A^{(k-1)}_{i,l} \cdot A_{l,j} : k \geq 2. \]

[Figure: a 3 × 3 adjacency matrix over the vertices a, b, c multiplied by itself; the product has a non-zero (a, c) entry, i.e. there is a path of length 2 from a to c.]
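The recurrence can be checked on a small case. The sketch below assumes a three-vertex chain a→b→c, chosen to be consistent with the figure's caption rather than copied from it; `matmul` and `matpow` are illustrative helper names.

```python
def matmul(X, Y):
    """Ordinary matrix product of two list-of-lists matrices."""
    return [
        [sum(X[i][l] * Y[l][j] for l in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matpow(A, k):
    """A raised to the k-th power for k >= 1, via A^(k) = A^(k-1) . A."""
    result = A
    for _ in range(k - 1):
        result = matmul(result, A)
    return result


# Assumed example graph: a -> b and b -> c (rows and columns ordered a, b, c).
A = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

A2 = matpow(A, 2)
print(A2)  # entry [0][2] counts the length-2 paths from a to c
```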


The Traverse Operation

$A^1$: authored, $A^2$: cites, $A^3$: contains

\[ Z = A^1 \cdot A^2 \cdot A^{1\top} \]

$Z_{i,j}$ defines the number of paths from vertex $i$ to vertex $j$ such that a path goes from author $i$ to one of the articles he or she has authored, from that article to one of the articles it cites, and finally, from that cited article to its author $j$. Semantically, $Z$ is an author-citation single-relational path matrix.

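[Figure: a path that starts at lanl:marko and follows lanl:authored ($A^1$), lanl:cites ($A^2$), and inverted lanl:authored ($A^{1\top}$) edges, involving the articles vub:1010 and ieee:2020 and the author vub:fheyligh; Z summarizes the path as a single lanl:author-citation edge.]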

* NOTE: All diagrams are with respect to a “source” vertex (the blue vertex) in order to preserve clarity. In reality, the operations operate on all vertices in parallel.
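A small worked instance of the traverse operation follows. The vertex names come from the figure, but which author wrote which article is an assumption made for illustration (here lanl:marko authored ieee:2020, which cites vub:1010, authored by vub:fheyligh). The helpers `relation`, `matmul`, and `transpose` are illustrative.

```python
V = ["lanl:marko", "vub:fheyligh", "ieee:2020", "vub:1010"]
IDX = {v: i for i, v in enumerate(V)}


def relation(pairs):
    """Build the {0,1} adjacency matrix of one edge label from (source, target) pairs."""
    M = [[0] * len(V) for _ in V]
    for s, t in pairs:
        M[IDX[s]][IDX[t]] = 1
    return M


def matmul(X, Y):
    return [
        [sum(X[i][l] * Y[l][j] for l in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def transpose(X):
    return [list(col) for col in zip(*X)]


# Assumed example data (hypothetical authorship assignment).
A1 = relation([("lanl:marko", "ieee:2020"), ("vub:fheyligh", "vub:1010")])  # authored
A2 = relation([("ieee:2020", "vub:1010")])                                   # cites

# Z = A1 . A2 . A1^T: author -> article -> cited article -> its author.
Z = matmul(matmul(A1, A2), transpose(A1))
print(Z[IDX["lanl:marko"]][IDX["vub:fheyligh"]])
```

The transpose is what turns the article-to-author leg into a traversal against the direction of the authored edge.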


The Filter Operation

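Various path filters can be defined and applied using the entry-wise Hadamard matrix product denoted $\circ$, where

\[ A \circ B = \begin{pmatrix} A_{1,1} \cdot B_{1,1} & \cdots & A_{1,m} \cdot B_{1,m} \\ \vdots & \ddots & \vdots \\ A_{n,1} \cdot B_{n,1} & \cdots & A_{n,m} \cdot B_{n,m} \end{pmatrix}. \]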

[Figure: Path Matrix ∘ Path Filter = Filtered Path Matrix. Only the path-matrix entries that coincide with a 1 in the {0, 1} path filter survive in the filtered path matrix.]


The Filter Operation

• $A \circ 1 = A$

• $A \circ 0 = 0$

• $A \circ B = B \circ A$

• $A \circ (B + C) = (A \circ B) + (A \circ C)$

• $A^\top \circ B^\top = (A \circ B)^\top$


The Not Filter

The not filter is useful for excluding a set of paths to or from a vertex.

\[ n : \{0,1\}^{n \times n} \to \{0,1\}^{n \times n} \]

with a function rule of

\[ n(A)_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} = 0 \\ 0 & \text{otherwise.} \end{cases} \]

[Figure: the not filter n applied to a {0, 1} matrix; every 0 becomes 1 and every 1 becomes 0.]


The Not Filter

If $A \in \{0,1\}^{n \times n}$, then

• $n(n(A)) = A$

• $A \circ n(A) = 0$

• $n(A) \circ n(A) = n(A)$


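The Not Filter

$A^1$: authored, $A^2$: cites, $A^3$: contains

A coauthorship path matrix is

\[ Z = A^1 \cdot A^{1\top} \circ n(I) \]

[Figure: lanl:marko and lanl:jbollen are both linked by lanl:authored to acm:0505, which yields lanl:coauthor paths in Z; the n(I) filter removes the path from an author back to himself or herself.]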


The Clip Filter

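The general purpose of clip is to take a path matrix and “clip”, or normalize, it to a $\{0,1\}^{n \times n}$ matrix.

\[ c : \mathbb{R}_+^{n \times n} \to \{0,1\}^{n \times n} \]

\[ c(Z)_{i,j} = \begin{cases} 1 & \text{if } Z_{i,j} > 0 \\ 0 & \text{otherwise.} \end{cases} \]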

[Figure: the clip filter c applied to a path matrix with entries such as 72, 15.3, and 23; every positive entry becomes 1 and every 0 stays 0.]


The Clip Filter

If $A, B \in \{0,1\}^{n \times n}$ and $Y, Z \in \mathbb{R}_+^{n \times n}$, then

• $c(A) = A$

• $c(n(A)) = n(c(A)) = n(A)$

• $c(Y \circ Z) = c(Y) \circ c(Z)$

• $n(A \circ B) = c(n(A) + n(B))$

• $n(A + B) = n(A) \circ n(B)$
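These identities are easy to spot-check numerically. The sketch below evaluates the not and clip identities on small example matrices; the matrices and helper names are illustrative, and a check on one example is of course not a proof. The `n` helper maps every non-zero entry to 0, which lets it accept the sum A + B as an argument.

```python
def hadamard(X, Y):
    """Entry-wise product."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def add(X, Y):
    """Entry-wise sum."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def n(X):
    """Not filter: 1 where the entry is 0, otherwise 0."""
    return [[1 if x == 0 else 0 for x in row] for row in X]


def c(X):
    """Clip filter: 1 where the entry is positive, otherwise 0."""
    return [[1 if x > 0 else 0 for x in row] for row in X]


# Example matrices: A and B are {0,1}-valued, Y and Z are non-negative reals.
A = [[1, 0], [0, 1]]
B = [[1, 1], [0, 0]]
Y = [[0, 72], [15.3, 0]]
Z = [[4, 0], [23, 1]]
ZERO = [[0, 0], [0, 0]]

checks = {
    "n(n(A)) = A": n(n(A)) == A,
    "A o n(A) = 0": hadamard(A, n(A)) == ZERO,
    "n(A) o n(A) = n(A)": hadamard(n(A), n(A)) == n(A),
    "c(A) = A": c(A) == A,
    "c(n(A)) = n(c(A)) = n(A)": c(n(A)) == n(c(A)) == n(A),
    "c(Y o Z) = c(Y) o c(Z)": c(hadamard(Y, Z)) == hadamard(c(Y), c(Z)),
    "n(A o B) = c(n(A) + n(B))": n(hadamard(A, B)) == c(add(n(A), n(B))),
    "n(A + B) = n(A) o n(B)": n(add(A, B)) == hadamard(n(A), n(B)),
}
for name, holds in checks.items():
    print(name, holds)
```

The last two identities are De Morgan's laws written in matrix form, with ∘ playing the role of “and” and clipped addition the role of “or”.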


The Clip Filter

$A^1$: authored, $A^2$: cites, $A^3$: contains

Suppose we want to create an author-citation path matrix that does not allow self-citation or coauthor citations.

\[ Z = \underbrace{\left(A^1 \cdot A^2 \cdot A^{1\top}\right)}_{\text{cites}} \circ \underbrace{n\left(c\left(A^1 \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}} \]

[Figure: the author-citation path starting at lanl:marko, involving the articles lanl:3030 and lanl:4040 and the authors lanl:jbollen and odu:nelson; lanl:coauthor paths are excluded by $n(c(A^1 \cdot A^{1\top} \circ n(I)))$ and self paths are excluded by $n(I)$.]


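The Clip Filter

$A^1$: authored, $A^2$: cites, $A^3$: contains

However, using various theorems of the path algebra and abstract algebra in general,

\[ Z = \underbrace{\left(A^1 \cdot A^2 \cdot A^{1\top}\right)}_{\text{cites}} \circ \underbrace{n\left(c\left(A^1 \cdot A^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}} \]

becomes

\[ Z = \left(A^1 \cdot A^2 \cdot A^{1\top}\right) \circ n\left(c\left(A^1 \cdot A^{1\top}\right)\right) \circ n(I). \]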
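The simplification can be sanity-checked on a small data set. The intuition is that the inner n(I) only affects diagonal entries, and the outer n(I) zeroes the diagonal anyway. In the sketch below the authorship and citation edges are hypothetical, the product $A^1 \cdot A^{1\top}$ is assumed to bind tighter than ∘, and all helper names are illustrative.

```python
V = ["lanl:marko", "lanl:jbollen", "odu:nelson", "lanl:3030", "lanl:4040"]
IDX = {v: i for i, v in enumerate(V)}
N = len(V)


def relation(pairs):
    M = [[0] * N for _ in range(N)]
    for s, t in pairs:
        M[IDX[s]][IDX[t]] = 1
    return M


def matmul(X, Y):
    return [[sum(X[i][l] * Y[l][j] for l in range(N)) for j in range(N)] for i in range(N)]


def transpose(X):
    return [list(col) for col in zip(*X)]


def hadamard(X, Y):
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def n(X):  # not filter
    return [[1 if x == 0 else 0 for x in row] for row in X]


def c(X):  # clip filter
    return [[1 if x > 0 else 0 for x in row] for row in X]


I = [[1 if i == j else 0 for j in range(N)] for i in range(N)]

# Hypothetical data: marko and jbollen coauthored 3030; jbollen and nelson
# coauthored 4040; 3030 cites 4040.
A1 = relation([("lanl:marko", "lanl:3030"), ("lanl:jbollen", "lanl:3030"),
               ("lanl:jbollen", "lanl:4040"), ("odu:nelson", "lanl:4040")])
A2 = relation([("lanl:3030", "lanl:4040")])

cites = matmul(matmul(A1, A2), transpose(A1))
coauthor = matmul(A1, transpose(A1))

Z_long = hadamard(hadamard(cites, n(c(hadamard(coauthor, n(I))))), n(I))
Z_short = hadamard(hadamard(cites, n(c(coauthor))), n(I))
print(Z_long == Z_short)
```

With this data the only surviving author-citation is from lanl:marko to odu:nelson: the citation of lanl:jbollen is dropped as a coauthor citation, and the self-citation by lanl:jbollen is dropped by n(I).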


Other Filters and Operations...

• Please refer to the article for more information on these filters and operations.


Problems with the Path Algebra

• As a matrix algebra, it is computationally infeasible to carry out matrix operations over the entire Web of Data.

• However, it is possible to approximate these calculations using “random” walkers.


Mapping Paths to Grammar-Based Random Walkers

• A grammar-based random walker is a walker that obeys a path description.

• Able to compute “semantically rich” spreading activation and stationary probability distributions in a multi-relational network.

• Able to approximate through the convergence properties of these operations.

• Provides a convenient application to the Web of Data and linked graph databases.

M.A. Rodriguez. Grammar-Based Random Walkers in Semantic Networks. Knowledge-Based Systems, 21(7), 727–739, 2008.
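As a rough illustration of the idea, and not the algorithm of the cited article, the sketch below sends many walkers along a fixed path description (follow an authored edge, then an authored edge in reverse, and reject walks that return to the start), which samples the coauthorship relation instead of computing a matrix product. The data, the `walk` and `coauthor_activation` helpers, and the seed are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical authored edges: author -> articles.
AUTHORED = {
    "lanl:marko": ["acm:0505", "lanl:3030"],
    "lanl:jbollen": ["acm:0505"],
    "odu:nelson": ["lanl:3030"],
}

# Inverted index, so the walker can traverse authored edges backwards.
AUTHORED_BY = {}
for author, articles in AUTHORED.items():
    for article in articles:
        AUTHORED_BY.setdefault(article, []).append(author)


def walk(start, rng):
    """One pass over the path description: authored, inverted authored, not self."""
    article = rng.choice(AUTHORED[start])
    end = rng.choice(AUTHORED_BY[article])
    return end if end != start else None  # the walk is discarded on a self path


def coauthor_activation(start, walkers, rng):
    """Count where `walkers` independent grammar walkers from `start` end up."""
    counts = Counter()
    for _ in range(walkers):
        end = walk(start, rng)
        if end is not None:
            counts[end] += 1
    return counts


rng = random.Random(0)
activation = coauthor_activation("lanl:marko", 1000, rng)
print(dict(activation))
```

Each walker only ever touches the edges adjacent to its current vertex, which is why this style of computation suits data spread over many linked graph databases.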


A Grammar Walker

[Figure: a grammar walker carrying the path description $A^1 \cdot A^{1\top} \circ n(I)$ moves across the structures at 127.0.0.4, 127.0.0.5, and 127.0.0.6 on the Web of Data at times t=1, t=2, and t=3.]


Grammar Walking the Web of Data

[Figure: a grammar walker takes steps 1 through 7 across the interlinked graph databases at 127.0.0.1 through 127.0.0.11.]


Conclusion

• Graph databases will increasingly support the Web of Data.

• The Web of Data is about open, global-scale data management.

• Distributed computing is required for global-scale data processing.

• Grammar walkers can be used for distributed network analysis on the Web of Data.


Thank You For Your Time

  ⋆ My homepage: http://markorodriguez.com
  ⋆ Neno/Fhat: http://neno.lanl.gov
  ⋆ Collective Decision Making Systems: http://cdms.lanl.gov
  ⋆ Faith in the Algorithm: http://faithinthealgorithm.net
  ⋆ MESUR: http://www.mesur.org

