Distributed Graph Databases and the
Emerging Web of Data
Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory
http://markorodriguez.com
April 16, 2009
Abstract
The World Wide Web is the de facto medium for publicly exposing a corpus of interrelated documents. In its current form, the World Wide Web is the Web of Documents. The next generation of the World Wide Web will support the Web of Data. The Web of Data utilizes the same Uniform Resource Identifier (URI) address space as the Web of Documents, but instead of exposing a graph of documents, the Web of Data exposes a graph of data. Given that the URI address space of the Web is distributed and infinite, the Web of Data provides a single unified space by which the world's data can be publicly exposed and interrelated. The Web of Data is supported by both graph databases (which structure the data) and distributed computing mechanisms (which process the data). This presentation will discuss the Web of Data, graph databases, and models of computing in this emerging space.
Computer Science Department Colloquium – University of New Mexico – April 16, 2009
Outline
• The Relational Database vs. the Graph Database
• The Web of Documents vs. the Web of Data
• Local Computing vs. Distributed Computing
• Multi-Relational Network Analysis with Grammar Walkers
The Relational Database vs. the Graph Database
• A relational database’s (e.g. MySQL, PostgreSQL, Oracle) data model is a collection of interlinked tables.
• A graph database’s (e.g. OpenSesame, AllegroGraph, Neo4j) data model is a multi-relational graph.
[Figure: a relational database (127.0.0.1) beside a graph database (127.0.0.2); the graph's vertices are labeled a, b, c, d.]
Types of Graphs
• Undirected single-relational graph: homogeneous set of symmetric links.
• Directed single-relational graph: homogeneous set of links.
• Directed multi-relational graph: heterogeneous set of links.
[Figure: one example per graph type, each drawn between the vertices x and z; the directed multi-relational example additionally shows the symbol y.]
Our Make Believe World - Phase 1
• Marko is a human and Fluffy is a dog.
Our World Modeled in a Relational Database - Phase 1
Object_Table
ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true
Our World Modeled in a Graph Database - Phase 1
[Figure: the same world as a graph.]
0001 --name--> Marko     0002 --name--> Fluffy
0001 --type--> Human     0002 --type--> Dog
0001 --legs--> 2         0002 --legs--> 4
0001 --fur-->  false     0002 --fur-->  true
Our Make Believe World - Phase 2
• Marko is a human and Fluffy is a dog.
• Marko and Fluffy are good friends.
Our World Modeled in a Relational Database - Phase 2
Object_Table
ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table
ID1   ID2
0001  0002
0002  0001
Our World Modeled in a Graph Database - Phase 2
[Figure: the Phase 1 graph extended with friendship edges.]
0001 --name--> Marko     0002 --name--> Fluffy
0001 --type--> Human     0002 --type--> Dog
0001 --legs--> 2         0002 --legs--> 4
0001 --fur-->  false     0002 --fur-->  true
0001 --friend--> 0002    0002 --friend--> 0001
Our Make Believe World - Phase 3
• Marko is a human and Fluffy is a dog.
• Marko and Fluffy are good friends.
• Human and dog are subclasses of mammal.
Our World Modeled in a Relational Database - Phase 3
Object_Table
ID    Name    Type   Legs  Fur
0001  Marko   Human  2     false
0002  Fluffy  Dog    4     true

Friendship_Table
ID1   ID2
0001  0002
0002  0001

Subclass_Table
Type1  Type2
Human  Mammal
Dog    Mammal
Our World Modeled in a Graph Database - Phase 3
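[Figure: the Phase 2 graph extended with subclass edges.]
0001 --name--> Marko     0002 --name--> Fluffy
0001 --type--> Human     0002 --type--> Dog
0001 --legs--> 2         0002 --legs--> 4
0001 --fur-->  false     0002 --fur-->  true
0001 --friend--> 0002    0002 --friend--> 0001
Human --subclassof--> Mammal
Dog --subclassof--> Mammal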
Our Make Believe World - Phase 4
• Marko is a human and Fluffy is a dog.
• Marko and Fluffy are good friends.
• Human and dog are subclasses of mammal.
• Fluffy peed on the carpet.
Our World Modeled in a Relational Database - Phase 4
Object_Table
ID    Name    Type    Legs  Fur
0001  Marko   Human   2     false
0002  Fluffy  Dog     4     true
0003  My_Rug  Carpet  N/A   N/A

Friendship_Table
ID1   ID2
0001  0002
0002  0001

Subclass_Table
Type1  Type2
Human  Mammal
Dog    Mammal

Pee_Table
ID1   ID2
0002  0003
Our World Modeled in a Graph Database - Phase 4
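[Figure: the Phase 3 graph extended with the carpet and the peedOn edge.]
0001 --name--> Marko     0002 --name--> Fluffy
0001 --type--> Human     0002 --type--> Dog
0001 --legs--> 2         0002 --legs--> 4
0001 --fur-->  false     0002 --fur-->  true
0001 --friend--> 0002    0002 --friend--> 0001
Human --subclassof--> Mammal
Dog --subclassof--> Mammal
0002 --peedOn--> 0003
0003 --name--> My_Rug
0003 --type--> Carpet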
Our Make Believe World - Phase 5
• Marko is a human and Fluffy is a dog.
• Marko and Fluffy are good friends.
• Human and dog are subclasses of mammal.
• Fluffy peed on the carpet.
• Marko and Fluffy are both mammals.
Our World Modeled in a Relational Database - Phase 5
Object_Table
ID    Name    Type    Legs  Fur
0001  Marko   Human   2     false
0002  Fluffy  Dog     4     true
0003  My_Rug  Carpet  N/A   N/A

Friendship_Table
ID1   ID2
0001  0002
0002  0001

Subclass_Table
Type1  Type2
Human  Mammal
Dog    Mammal

Pee_Table
ID1   ID2
0002  0003

Type_Table
ID    Type
0001  Human
0002  Dog
0003  Carpet
0001  Mammal
0002  Mammal
Our World Modeled in a Graph Database - Phase 5
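[Figure: the Phase 4 graph extended with direct type edges to Mammal.]
0001 --name--> Marko     0002 --name--> Fluffy
0001 --type--> Human     0002 --type--> Dog
0001 --type--> Mammal    0002 --type--> Mammal
0001 --legs--> 2         0002 --legs--> 4
0001 --fur-->  false     0002 --fur-->  true
0001 --friend--> 0002    0002 --friend--> 0001
Human --subclassof--> Mammal
Dog --subclassof--> Mammal
0002 --peedOn--> 0003
0003 --name--> My_Rug
0003 --type--> Carpet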
The Graph as the Natural World Model
• The world is inherently (or perceived as) object-oriented.
• The world is filled with objects and relations among them.
• The multi-relational graph is a very natural representation of the world.
The Graph as the Natural Programming Model
• High-level computer languages are object-oriented.
• Nearly no impedance mismatch between the multi-relational graph and the programming object.
• It is easy to go from graph database to in-memory object.
Human marko = new Human();
marko.name = "Marko";
marko.addFriend(fluffy);
marko.setHasFur(false);
marko.setLegs(2);
SQL vs. SPARQL
SQL:
SELECT OTY.Name
FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
WHERE OTX.Name = "Marko"
  AND Friendship_Table.ID1 = OTY.ID
  AND Friendship_Table.ID2 = OTX.ID;

SPARQL:
SELECT ?z WHERE {
  ?x name "Marko" .
  ?y friend ?x .
  ?y name ?z
}
E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF, WWW Consortium,
http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/, 2004.
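The contrast between the two queries can be tried out with a minimal sketch in Python. It assumes the Phase 2 tables loaded into an in-memory SQLite database and, for the graph side, a plain list of (subject, predicate, object) tuples. The helper `friends_of` is a hypothetical hand-written pattern match standing in for a SPARQL engine; none of this is the tooling used in the talk.

```python
import sqlite3

# Relational side: the Phase 2 tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Object_Table (ID TEXT, Name TEXT, Type TEXT, Legs INTEGER, Fur INTEGER);
    CREATE TABLE Friendship_Table (ID1 TEXT, ID2 TEXT);
    INSERT INTO Object_Table VALUES ('0001', 'Marko', 'Human', 2, 0);
    INSERT INTO Object_Table VALUES ('0002', 'Fluffy', 'Dog', 4, 1);
    INSERT INTO Friendship_Table VALUES ('0001', '0002');
    INSERT INTO Friendship_Table VALUES ('0002', '0001');
""")
sql_friends = [row[0] for row in conn.execute("""
    SELECT OTY.Name
    FROM Object_Table AS OTX, Object_Table AS OTY, Friendship_Table
    WHERE OTX.Name = 'Marko'
      AND Friendship_Table.ID1 = OTY.ID
      AND Friendship_Table.ID2 = OTX.ID
""")]

# Graph side: the same facts as (subject, predicate, object) triples.
triples = [
    ("0001", "name", "Marko"),
    ("0002", "name", "Fluffy"),
    ("0001", "friend", "0002"),
    ("0002", "friend", "0001"),
]

def friends_of(name):
    """Mimic: SELECT ?z WHERE { ?x name NAME . ?y friend ?x . ?y name ?z }"""
    xs = {s for s, p, o in triples if p == "name" and o == name}
    ys = {s for s, p, o in triples if p == "friend" and o in xs}
    return sorted(o for s, p, o in triples if p == "name" and s in ys)

graph_friends = friends_of("Marko")
print(sql_friends, graph_friends)
```

The SQL version needs a self-join through a dedicated friendship table, while the graph version chains three edge patterns over one uniform structure.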
Internet Address Spaces
• The Uniform Resource Identifier (URI) is the superclass of the Uniform Resource Locator (URL) and Uniform Resource Name (URN).
The Uniform Resource Locator
• The set of all URLs is the address space of all resources that can be located and retrieved on the Web. URLs denote where a resource is.
  ⋆ http://markorodriguez.com/index.html
    ∗ Domain name server (DNS): markorodriguez.com → 216.251.43.6
    ∗ http:// means GET at port 80,
    ∗ /index.html means the resource to get at that Internet location.
[Figure: the Web server for markorodriguez.com (216.251.43.6) serving index.html.]
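The decomposition of a URL into scheme, host, and path can be sketched with Python's standard `urllib.parse` module. The `DEFAULT_PORTS` table and the `describe` helper are illustrative assumptions, and no DNS lookup or network request is made.

```python
from urllib.parse import urlsplit

# Conventional default ports per scheme (an illustrative subset).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def describe(url):
    """Split a URL into the parts discussed above."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,   # how to retrieve the resource
        "host": parts.hostname,   # resolved to an IP address by DNS
        "port": port,
        "path": parts.path,       # which resource to get at that location
    }

info = describe("http://markorodriguez.com/index.html")
print(info)
```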
The Uniform Resource Name
• The set of all URNs is the address space of all resources within the urn: namespace.
  ⋆ urn:uuid:bd93def0-8026-11dd-842be54955baa12
  ⋆ urn:issn:0892-3310
  ⋆ urn:doi:10.1016/j.knosys.2008.03.030
• Named resources need not be retrievable through the Web.
• URNs denote what a resource is.
The Uniform Resource Identifier
• The URI address space is an infinite space for all Internet resources.
  ⋆ urn:issn:0892-3310
  ⋆ ftp://markorodriguez.com/private/markos_secrets.txt
  ⋆ http://www.lanl.gov#fluffy
• Important: URIs can denote concepts, instances, and data.
[Figure: two example URIs, lanl:fluffy and lanl:fluffy_legs.]
lanl is a namespace prefix that expands to http://www.lanl.gov#.
The Web of Documents
• The Web of Documents is primarily concerned with the Hypertext Transfer Protocol (HTTP) and with retrievable resources in the URL address space.
• These retrievable resources are files: HTML documents, images, audio, etc. The “web” is created when HTML documents contain URLs.
[Figure: under http://markorodriguez.com/, the documents index.html, Home.html, Research.html, and Resume.html are linked to one another by href links.]
The Web of Data
• The Web of Data is primarily concerned with URIs.
• The Resource Description Framework (RDF) is the standard for representing the relationship between URIs and literals (e.g. float, string, date time, etc.).
[Figure: RDF statements as subject, predicate, object triples.]
lanl:marko   foaf:knows  lanl:fluffy
lanl:marko   foaf:name   "Marko A. Rodriguez"^^xsd:string
lanl:fluffy  foaf:name   "Fluffy P. Everywhere"^^xsd:string
C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked Data on the Web, International World Wide Web Conference, 2008.
Our Make Believe World in RDF
[Figure: the make-believe world as an RDF graph.]
lanl:marko   foaf:name        "Marko A. Rodriguez"^^xsd:string
lanl:marko   rdf:type         lanl:Human
lanl:marko   rdf:type         lanl:Mammal
lanl:marko   lanl:legs        "2"^^xsd:integer
lanl:marko   lanl:fur         "false"^^xsd:boolean
lanl:marko   lanl:friend      lanl:fluffy
lanl:fluffy  foaf:name        "Fluffy P. Everywhere"^^xsd:string
lanl:fluffy  rdf:type         lanl:Dog
lanl:fluffy  rdf:type         lanl:Mammal
lanl:fluffy  lanl:legs        "4"^^xsd:integer
lanl:fluffy  lanl:fur         "true"^^xsd:boolean
lanl:fluffy  lanl:friend      lanl:marko
lanl:Human   rdfs:subClassOf  lanl:Mammal
lanl:Dog     rdfs:subClassOf  lanl:Mammal
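As a rough illustration of how such a graph can be held and queried in memory, the sketch below stores a subset of the triples as Python tuples. The `match` and `infer_types` helpers are hypothetical names introduced here, not part of any RDF library. `infer_types` derives the "both mammals" statements from the subclass edges, in the spirit of RDFS subclass entailment.

```python
# A subset of the make-believe world, using the prefixes from the slides.
TRIPLES = {
    ("lanl:marko", "rdf:type", "lanl:Human"),
    ("lanl:fluffy", "rdf:type", "lanl:Dog"),
    ("lanl:Human", "rdfs:subClassOf", "lanl:Mammal"),
    ("lanl:Dog", "rdfs:subClassOf", "lanl:Mammal"),
    ("lanl:marko", "lanl:friend", "lanl:fluffy"),
    ("lanl:fluffy", "lanl:friend", "lanl:marko"),
}

def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

def infer_types(triples):
    """Add (x rdf:type C2) whenever (x rdf:type C1) and (C1 rdfs:subClassOf C2)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for x, _, c1 in match(p="rdf:type", triples=inferred):
            for _, _, c2 in match(s=c1, p="rdfs:subClassOf", triples=inferred):
                if (x, "rdf:type", c2) not in inferred:
                    inferred.add((x, "rdf:type", c2))
                    changed = True
    return inferred

closed = infer_types(TRIPLES)
mammals = sorted(s for s, _, _ in match(p="rdf:type", o="lanl:Mammal", triples=closed))
print(mammals)  # ['lanl:fluffy', 'lanl:marko']
```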
The Web of Data is a Distributed Database
• The URI address space is distributed.
• URIs can denote data.
• RDF denotes the relationships between URIs.
• The Web of Data’s foundational standard is RDF.
• Therefore, the Web of Data is a distributed database.
The Web of Documents vs. the Web of Data
[Figure: in the Web of Documents, an HTML document on the Web server at 127.0.0.1 links by href to an HTML document on the Web server at 127.0.0.2; in the Web of Data, a vertex in the graph database at 127.0.0.1 links by lanl:friend to a vertex in the graph database at 127.0.0.2.]
The Current Web of Data - March 2009
[Figure: the Linked Data cloud as of March 2009, drawn as a network of interlinked data sets: geospecies, freebase, dbpedia, libris, geneid, interpro, hgnc, symbol, pubmed, mgi, geneontology, uniprot, pubchem, unists, omim, homologene, pfam, pdb, reactome, chebi, uniparc, kegg, cas, uniref, prodom, prosite, taxonomy, dailymed, linkedct, acm, dblprkbexplorer, laascnrs, newcastle, eprints, ecssouthampton, irittoulouse, citeseer, pisa, resex, ibm, ieee, rae2001, budapestbme, eurecom, dblphannover, diseasome, drugbank, geonames, yago, opencyc, w3cwordnet, umbel, linkedmdb, rdfbookmashup, flickrwrappr, surgeradio, musicbrainz, myspacewrapper, bbcplaycountdata, bbcprogrammes, semanticweborg, revyu, swconferencecorpus, lingvoj, pubguide, crunchbase, foafprofiles, riese, qdos, audioscrobbler, flickrexporter, bbcjohnpeel, wikicompany, govtrack, uscensusdata, openguides, doapspace, bbclatertotp, eurostat, semwebcentral, dblpberlin, siocsites, jamendo, magnatune, worldfactbook, projectgutenberg, opencalais, rdfohloh, virtuososponger.]
M.A. Rodriguez. A Graph Analysis of the Linked Data Cloud, in review, http://arxiv.org/abs/0903.0194, 2009.
The Current Web of Data - March 2009
data set          domain      data set          domain      data set            domain
audioscrobbler    music       govtrack          government  pubguide            books
bbclatertotp      music       homologene        biology     qdos                social
bbcplaycountdata  music       ibm               computer    rae2001             computer
bbcprogrammes     media       ieee              computer    rdfbookmashup       books
budapestbme       computer    interpro          biology     rdfohloh            social
chebi             biology     jamendo           music       resex               computer
crunchbase        business    laascnrs          computer    riese               government
dailymed          medical     libris            books       semanticweborg      computer
dblpberlin        computer    lingvoj           reference   semwebcentral       social
dblphannover      computer    linkedct          medical     siocsites           social
dblprkbexplorer   computer    linkedmdb         movie       surgeradio          music
dbpedia           general     magnatune         music       swconferencecorpus  computer
doapspace         social      musicbrainz       music       taxonomy            reference
drugbank          medical     myspacewrapper    social      umbel               general
eurecom           computer    opencalais        reference   uniref              biology
eurostat          government  opencyc           general     unists              biology
flickrexporter    images      openguides        reference   uscensusdata        government
flickrwrappr      images      pdb               biology     virtuososponger     reference
foafprofiles      social      pfam              biology     w3cwordnet          reference
freebase          general     pisa              computer    wikicompany         business
geneid            biology     prodom            biology     worldfactbook       government
geneontology      biology     projectgutenberg  books       yago                general
geonames          geographic  prosite           biology     . . .
Cultural Differences that are Leading to Web-Based Data Management - Part 1
• Relational databases tend not to maintain public access points.
• Relational database users tend not to publish their schemas.
• Web of Data graph databases maintain public access points called SPARQL end-points or Linked Data URLs.
• Web of Data graph database users tend to reuse and extend public schemas called ontologies.
Cultural Differences that are Leading to Web-Based Data Management - Part 2
[Figure: two models side by side. Conventional Model: Application 1, Application 2, and Application 3 (127.0.0.1, 127.0.0.2, 127.0.0.3) each structure and process their own data. Web of Data Model: the data is structured on the Web of Data (127.0.0.1, 127.0.0.2, 127.0.0.3), and Application 1, Application 2, and Application 3 (127.0.0.4, 127.0.0.5, 127.0.0.6) process it.]
SPARQLing a Data Provider - Local Computing
[Figure: a client at 127.0.0.1 sends a query to the SPARQL end-point of the graph database at 127.0.0.2 and receives the result set.]
Query:  SELECT ?x WHERE { lanl:marko lanl:friend ?x }
Result: { lanl:fluffy }
• The 127.0.0.1 client is querying the 127.0.0.2 server.
• The query is any read-based SPARQL query.
• The results are those resources that bind to the query variables.
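A read-only query of this kind is typically sent to an end-point as an HTTP GET with the query text in a `query` parameter, following the SPARQL protocol convention. The sketch below only builds and decodes such a URL; the end-point address `http://127.0.0.2/sparql` is a hypothetical placeholder, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://127.0.0.2/sparql"  # hypothetical end-point address

QUERY = """PREFIX lanl: <http://www.lanl.gov#>
SELECT ?x WHERE { lanl:marko lanl:friend ?x }"""

def build_request_url(endpoint, query):
    """Encode a read-only SPARQL query as an HTTP GET URL."""
    return endpoint + "?" + urlencode({"query": query})

url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: the server side would decode the same query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```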
GETing Linked Data as RDF - Local Computing
[Figure: a client at 127.0.0.1 issues an HTTP GET for http://www.lanl.gov#marko and receives the RDF subgraph around lanl:marko, then issues an HTTP GET for http://www.vub.edu#1010 and receives the RDF subgraph around vub:1010. The underlying Web of Data contains the triples below.]
lanl:marko  lanl:friend  lanl:fluffy
lanl:marko  lanl:wrote   vub:1010
vub:1010    lanl:cites   ieee:2020
Problem with the Current Web of Data Infrastructure
• The only interfaces are SPARQL end-points and HTTP GETs of RDF subgraphs.
• For human-based document retrieval, this is fine. For machine-based data processing, this does not scale.
M.A. Rodriguez. A Distributed Process Infrastructure for a Distributed Data Structure. Semantic Web and Information Systems Bulletin, AIS Special Interest Group on Semantic Web and Information Systems, http://arxiv.org/abs/0807.3908, 2008.
Problem with the Current Web of Data Infrastructure
• We cannot rely on the “download and index” philosophy of the World Wide Web.
  ⋆ As of March 2009, the Web of Data maintains 4.5 billion triples.
• The Web of Data cannot rely on a single service provider.
  ⋆ too much data.
  ⋆ too many types of algorithms that can utilize this data.
  ⋆ too many clock cycles to locally process this data.
The Open Virtual Machine Farm
[Figure: the graph databases at 127.0.0.1 and 127.0.0.2, linked by lanl:friend, each sit beside a virtual machine farm; code/machine migrates between the two farms.]
• Distributed computing through code/machine migration between farms.
• Move the process to the data, not the data to the process.
M.A. Rodriguez. General Purpose Computing on a Semantic Network Substrate, in Emergent Web Intelligence, eds. R. Chbeir, A. Hassanien, A. Abraham and Y. Badr, Springer-Verlag, http://arxiv.org/abs/0704.3395, 2009.
M.A. Rodriguez. The RDF Virtual Machine, in review, LA-UR-08-03925, 2009.
Neno RDF Programming Language - Code Serialization
[Figure: the method below serialized as an RDF graph. A Method vertex (hasMethodName "example", with an argument named "a") is attached by hasMethod to a resource of rdf:type demo:Human. It points by hasBlock to a Block of instruction vertices (Branch, Equals, LocalDirect, PushValue, Return), each identified by a urn:uuid URI and connected by hasLeft, hasRight, hasValue, hasURI, nextInst, trueInst, and falseInst edges. The literals "marko"^^xsd:string, "1"^^xsd:int, and "2"^^xsd:int appear as values.]
xsd:int example(xsd:string a) {
  if(a == "marko")
    return 1;
  else
    return 2;
}
The Fhat RDF Virtual Machine - Machine Serialization
[Figure: the RDF model of a Fhat machine. Fhat, a subclass of RVM, carries a halt flag (xsd:boolean), a methodReuse flag (xsd:boolean), a programLocation (Instruction), a currentFrame, and hasFrame links to Frame objects. Frames hold FrameVariables (hasSymbol xsd:string, hasValue rdfs:Resource) and refer to the Block they came from (fromBlock). Three stacks are modeled as rdf:first/rdf:rest lists: the OperandStack (operandTop, holding rdfs:Resource), the ReturnStack (returnTop, holding Instruction, with forFrame), and the BlockStack (blockTop, holding Block). Edges are annotated with cardinalities such as [0..1], [1], and [0..*].]
A Collection of Interlinked Graph Databases - Currently
[Figure: a network of interlinked graph databases at 127.0.0.2 through 127.0.0.11.]
A Collection of Interlinked Graph Databases and Processors - Future
[Figure: the same network of interlinked graph databases at 127.0.0.2 through 127.0.0.11, each now paired with a processor.]
The Future of Web-Based Distributed Computing
• The HTTP GET approach to the Web of Data does not scale.
• The Neno/Fhat (or any general-purpose computing) environment is unsafe.
• The Web of Data needs an open, safe, flexible, and easy-to-adopt computing infrastructure.
What Type of Processing?
• Object-oriented programming: Web of Data as an object repository.
• Logic: Web of Data as a knowledge-base.
• Graph/network analysis: Web of Data as a multi-relational graph.
• The future computing environment should support at least these popular processing models.
• We will focus on graph/network analysis for the remainder of this presentation.
Introduction to Random Walkers
• Random walkers can be used in single-relational networks to calculate:
  ⋆ stationary probability distribution: primary eigenvector calculation
  ⋆ spreading activation: search by means of diffusion
• There is a continuous and a discrete form of the general random walk method.
Random Walks in a Single-Relational Network
• Suppose a single-relational network G, where
$$G = (V, E \subseteq (V \times V)).$$
• Let’s represent that network as a row-stochastic adjacency matrix $A \in [0,1]^{|V| \times |V|}$, where
$$A_{i,j} = \begin{cases} \frac{1}{\Gamma(i)} & \text{if } (i,j) \in E \\ 0 & \text{otherwise,} \end{cases}$$
with Γ(i) understood here as the number of outgoing edges of vertex i, so that each row of A sums to 1.
• Finally, assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.
Random Walks in a Single-Relational Network
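[Figure: the example graph G over the vertices a, b, c, d, with edges a→b, a→d, b→c, c→a, c→d, and d→b, shown beside its row-stochastic adjacency matrix A and the energy vector π. Rows and columns are ordered a, b, c, d; the entry positions follow from the iterates listed for this example.]
$$A = \begin{pmatrix} 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \\ 0.5 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \qquad \pi = (1, 0, 0, 0)$$
• πA can be interpreted as the continuous form of propagating random walkers over G.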
Stationary Probability Distribution in a Single-Relational Network
[Figure: the energy vector π propagated through A over time, starting with all energy at vertex a.]
time  a     b     c     d
π1    1     0     0     0
π2    0     0.5   0     0.5
π3    0     0.5   0.5   0
π4    0.25  0     0.5   0.25
π5    0.25  0.38  0     0.38
π6    0     0.5   0.38  0.13
...
π∞    0.15  0.31  0.31  0.23
Stationary Probability Distribution in a Single-Relational Network
• If G is strongly connected and aperiodic, then there exists a π such that π = πA.
• This stationary π∞ is the primary eigenvector of A.
• PageRank computes the stationary π by forcing G (the Web citation graph) to be strongly connected and aperiodic.
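The repeated propagation π ← πA can be sketched as a plain power iteration. The matrix below is an assumption: it is the row-stochastic matrix over a, b, c, d that is consistent with the iterates shown for this example (edges a→b, a→d, b→c, c→a, c→d, d→b). The function names and the tolerance are illustrative choices.

```python
# Row-stochastic adjacency matrix of the example graph (vertex order a, b, c, d).
A = [
    [0.0, 0.5, 0.0, 0.5],  # a -> b, d
    [0.0, 0.0, 1.0, 0.0],  # b -> c
    [0.5, 0.0, 0.0, 0.5],  # c -> a, d
    [0.0, 1.0, 0.0, 0.0],  # d -> b
]

def step(pi, A):
    """One propagation step: the row vector pi times the matrix A."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

def stationary(pi, A, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi A until successive vectors differ by less than tol."""
    for _ in range(max_iter):
        nxt = step(pi, A)
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

pi_2 = step([1.0, 0.0, 0.0, 0.0], A)
pi_inf = stationary([1.0, 0.0, 0.0, 0.0], A)
print(pi_2)                           # [0.0, 0.5, 0.0, 0.5]
print([round(x, 2) for x in pi_inf])  # [0.15, 0.31, 0.31, 0.23]
```

The exact stationary vector for this matrix is (2/13, 4/13, 4/13, 3/13), which rounds to the values in the last row of the iteration.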
Spreading Activation in a Single-Relational Network
• Spreading activation can be thought of as a “local rank” algorithm, while calculating the stationary probability provides you a “global rank”.
• With spreading activation, you iterate for only a certain number of timesteps.
• Also, you record how much energy has flowed through each vertex.
• Let’s demonstrate using a single discrete walker...
Spreading Activation in a Single-Relational Network
• The walker moves from vertex to vertex, with each choice drawn from the probability distribution of A.
• At every step, if the walker is at vertex i, then π_i = π_i + 1.
[Figure: a single discrete walker on G visiting a, b, c, and then a again over four time steps, with the recorded energy vector after each step.]
time  a  b  c  d
π1    1  0  0  0
π2    1  1  0  0
π3    1  1  1  0
π4    2  1  1  0
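A single discrete walker of this kind can be simulated in a few lines. The sketch assumes the same example graph as an adjacency dictionary with transition probabilities. The `spread` function and the seeded random generator are illustrative choices, so the particular path taken depends on the seed.

```python
import random

# Example graph G: vertex -> {neighbor: transition probability}.
A = {
    "a": {"b": 0.5, "d": 0.5},
    "b": {"c": 1.0},
    "c": {"a": 0.5, "d": 0.5},
    "d": {"b": 1.0},
}

def spread(start, steps, rng):
    """Walk for a fixed number of steps, adding 1 to the energy of each visited vertex."""
    energy = {v: 0 for v in A}
    current = start
    for _ in range(steps):
        energy[current] += 1  # pi_i = pi_i + 1 at the walker's vertex
        neighbors = list(A[current])
        weights = [A[current][n] for n in neighbors]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
    return energy

energy = spread("a", steps=4, rng=random.Random(0))
print(energy)
```

Because the walk stops after a fixed number of steps, the recorded energy stays concentrated near the start vertex, which is what makes this a "local rank".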
Random Walks in a Multi-Relational Network
• Suppose a multi-relational network M, where
$$M = (V, \mathbb{E} = \{E_0, E_1, \ldots, E_k \subseteq (V \times V)\}).$$
• Represent it as a {0, 1}-adjacency tensor $\mathcal{A} \in \{0,1\}^{|V| \times |V| \times |\mathbb{E}|}$, where
$$\mathcal{A}^m_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in E_m : 1 \le m \le k \\ 0 & \text{otherwise.} \end{cases}$$
• Then assume an “energy vector” $\pi \in \mathbb{R}^{|V|}$.
M.A. Rodriguez and J. Shinavier. Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, in review, http://arxiv.org/abs/0806.2274, 2009.
Random Walks in a Multi-Relational Network
[Figure: a multi-relational network M over the vertices a, b, c, d with edges labeled authored, cites, and contains, shown beside its adjacency tensor A (one {0, 1} slice per edge label) and the energy vector π = (1, 0, 0, 0).]
The Operations of the Multi-Relational Path Algebra
• $A \cdot B$: ordinary matrix multiplication determines the number of (A, B)-paths between vertices.
• $A^\top$: matrix transpose inverts path directionality.
• $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths.
• $n(A)$: not generates the complement of a $\{0,1\}^{n \times n}$ matrix.
• $c(A)$: clip generates a $\{0,1\}^{n \times n}$ matrix from an $\mathbb{R}_+^{n \times n}$ matrix.
• $v^{\pm}(A)$: vertex generates a $\{0,1\}^{n \times n}$ matrix from an $\mathbb{R}_+^{n \times n}$ matrix, where only certain rows or columns contain non-zero values.
• $\lambda A$: scalar multiplication weights the entries of a matrix.
• $A + B$: matrix addition merges paths.
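Most of these operations are short enough to write out directly. The sketch below uses nested Python lists for matrices. The function names are illustrative, the vertex filter is omitted, and the small matrix `P` is made-up data rather than anything from the slides.

```python
def matmul(A, B):
    """A . B: ordinary matrix multiplication."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Matrix transpose: inverts path directionality."""
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Entry-wise product: applies B as a filter on A."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def not_(A):
    """n(A): complement of a {0,1} matrix."""
    return [[1 if a == 0 else 0 for a in row] for row in A]

def clip(A):
    """c(A): maps every positive entry to 1 and everything else to 0."""
    return [[1 if a > 0 else 0 for a in row] for row in A]

def scale(lam, A):
    """Scalar multiplication: weights the entries of a matrix."""
    return [[lam * a for a in row] for row in A]

def add(A, B):
    """A + B: merges paths."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Made-up 2x2 path matrix with non-negative weights.
P = [[0, 2.5], [3, 0]]
C = clip(P)
print(C)                      # [[0, 1], [1, 0]]
print(not_(C))                # [[1, 0], [0, 1]]
print(hadamard(C, not_(C)))   # [[0, 0], [0, 0]]
```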
The Traverse Operation
• An interesting aspect of the single-relational adjacency matrix $A \in \{0,1\}^{n \times n}$ is that when it is raised to the kth power, the entry $A^{(k)}_{i,j}$ is equal to the number of paths of length k that connect vertex i to vertex j.
• Given, by definition, that $A^{(1)}_{i,j}$ (i.e. $A_{i,j}$) represents the number of paths of length 1 (i.e. a single edge) that go from i to j, and by the rules of ordinary matrix multiplication,
$$A^{(k)}_{i,j} = \sum_{l \in V} A^{(k-1)}_{i,l} \cdot A_{l,j} : k \ge 2.$$
[Figure: a three-vertex example with rows and columns ordered a, b, c; there is a path of length 2 from a to c.]
$$\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
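The path-counting property can be checked on the smallest interesting case. The sketch below assumes a three-vertex chain a→b→c and raises its adjacency matrix to the second power; `matmul` and `matrix_power` are illustrative helper names.

```python
def matmul(A, B):
    """Ordinary matrix multiplication on nested lists."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matrix_power(A, k):
    """A raised to the kth power for k >= 1, by repeated multiplication."""
    result = A
    for _ in range(k - 1):
        result = matmul(result, A)
    return result

# Vertex order a, b, c with edges a -> b and b -> c.
A = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

A2 = matrix_power(A, 2)
print(A2)  # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

The single 1 in row a, column c is the one path of length 2 from a to c.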
The Traverse Operation
⟨$\mathcal{A}^1$ : authored⟩ ⟨$\mathcal{A}^2$ : cites⟩ ⟨$\mathcal{A}^3$ : contains⟩
$$Z = \mathcal{A}^1 \cdot \mathcal{A}^2 \cdot \mathcal{A}^{1\top}$$
$Z_{i,j}$ defines the number of paths from vertex i to vertex j such that a path goes from author i to one of the articles he or she has authored, from that article to one of the articles it cites, and finally, from that cited article to its author j. Semantically, Z is an author-citation single-relational path matrix.
[Figure: starting from lanl:marko, a lanl:authored edge ($\mathcal{A}^1$) leads to an article, a lanl:cites edge ($\mathcal{A}^2$) leads to a cited article, and an inverted lanl:authored edge ($\mathcal{A}^{1\top}$) leads to its author. The vertices shown are lanl:marko, vub:1010, ieee:2020, and vub:fheyligh; the resulting Z edge is labeled lanl:author-citation.]
* NOTE: All diagrams are with respect to a “source” vertex (the blue vertex) in order to preserve clarity. In reality, the operations operate on all vertices in parallel.
The Filter Operation
Various path filters can be defined and applied using the entry-wise Hadamard matrix product denoted ◦, where
$$A \circ B = \begin{pmatrix} A_{1,1} \cdot B_{1,1} & \cdots & A_{1,m} \cdot B_{1,m} \\ \vdots & \ddots & \vdots \\ A_{n,1} \cdot B_{n,1} & \cdots & A_{n,m} \cdot B_{n,m} \end{pmatrix}.$$
[Figure: Path Matrix ◦ Path Filter = Filtered Path Matrix. A 5×5 path matrix with non-zero entries 72, 1, 15.3, 23, 24, 4, and 12 is multiplied entry-wise by a {0, 1} path filter containing three 1s; the filtered path matrix retains only the entries 72, 1, and 23.]
The Filter Operation
• $A \circ 1 = A$
• $A \circ 0 = 0$
• $A \circ B = B \circ A$
• $A \circ (B + C) = (A \circ B) + (A \circ C)$
• $A^\top \circ B^\top = (A \circ B)^\top$
The Not Filter
The not filter is useful for excluding a set of paths to or from a vertex.
$$n : \{0,1\}^{n \times n} \to \{0,1\}^{n \times n}$$
with a function rule of
$$n(A)_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} = 0 \\ 0 & \text{otherwise.} \end{cases}$$
[Figure: n applied to a 5×5 {0, 1} matrix yields its complement, with every 0 replaced by 1 and every 1 replaced by 0.]
The Not Filter
If $A \in \{0,1\}^{n \times n}$, then
• $n(n(A)) = A$
• $A \circ n(A) = 0$
• $n(A) \circ n(A) = n(A)$
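The Not Filter
⟨$\mathcal{A}^1$ : authored⟩ ⟨$\mathcal{A}^2$ : cites⟩ ⟨$\mathcal{A}^3$ : contains⟩
A coauthorship path matrix is
$$Z = \mathcal{A}^1 \cdot \mathcal{A}^{1\top} \circ n(I)$$
[Figure: starting from lanl:marko, a lanl:authored edge ($\mathcal{A}^1$) leads to acm:0505 and an inverted lanl:authored edge ($\mathcal{A}^{1\top}$) leads to lanl:jbollen, yielding a lanl:coauthor edge in Z. The lanl:coauthor path from lanl:marko back to itself is removed by n(I).]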
The Clip Filter
The general purpose of clip is to take a path matrix and “clip”, or normalize, it to a $\{0,1\}^{n \times n}$ matrix.
$$c : \mathbb{R}_+^{n \times n} \to \{0,1\}^{n \times n}$$
$$c(Z)_{i,j} = \begin{cases} 1 & \text{if } Z_{i,j} > 0 \\ 0 & \text{otherwise.} \end{cases}$$
[Figure: c applied to a 5×5 path matrix with non-zero entries 72, 1, 15.3, 23, 24, 4, and 12 yields a {0, 1} matrix with a 1 in each of those seven positions.]
The Clip Filter
If $A, B \in \{0,1\}^{n \times n}$ and $Y, Z \in \mathbb{R}_+^{n \times n}$, then
• $c(A) = A$
• $c(n(A)) = n(c(A)) = n(A)$
• $c(Y \circ Z) = c(Y) \circ c(Z)$
• $n(A \circ B) = c(n(A) + n(B))$
• $n(A + B) = n(A) \circ n(B)$
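The Clip Filter
⟨$\mathcal{A}^1$ : authored⟩ ⟨$\mathcal{A}^2$ : cites⟩ ⟨$\mathcal{A}^3$ : contains⟩
Suppose we want to create an author-citation path matrix that does not allow self citation or coauthor citations.
$$Z = \underbrace{\left(\mathcal{A}^1 \cdot \mathcal{A}^2 \cdot \mathcal{A}^{1\top}\right)}_{\text{cites}} \circ \underbrace{n\left(c\left(\mathcal{A}^1 \cdot \mathcal{A}^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}}$$
[Figure: the author-citation path from lanl:marko through lanl:authored ($\mathcal{A}^1$), lanl:cites ($\mathcal{A}^2$), and inverted lanl:authored ($\mathcal{A}^{1\top}$), drawn over the vertices lanl:marko, lanl:3030, lanl:4040, lanl:jbollen, and odu:nelson. The lanl:coauthor filter $n(c(\mathcal{A}^1 \cdot \mathcal{A}^{1\top} \circ n(I)))$ and the self filter n(I) mark which lanl:author-citation edges of Z are excluded.]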
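The Clip Filter
⟨$\mathcal{A}^1$ : authored⟩ ⟨$\mathcal{A}^2$ : cites⟩ ⟨$\mathcal{A}^3$ : contains⟩
However, using various theorems of the path algebra and abstract algebra in general,
$$Z = \underbrace{\left(\mathcal{A}^1 \cdot \mathcal{A}^2 \cdot \mathcal{A}^{1\top}\right)}_{\text{cites}} \circ \underbrace{n\left(c\left(\mathcal{A}^1 \cdot \mathcal{A}^{1\top} \circ n(I)\right)\right)}_{\text{no coauthors}} \circ \underbrace{n(I)}_{\text{no self}}$$
becomes
$$Z = \left(\mathcal{A}^1 \cdot \mathcal{A}^2 \cdot \mathcal{A}^{1\top}\right) \circ n\left(c\left(\mathcal{A}^1 \cdot \mathcal{A}^{1\top}\right)\right) \circ n(I).$$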
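The equivalence of the two expressions can be checked numerically on a toy network. The sketch below is only a spot check under stated assumptions, not a proof: the seven vertices and their edges are made-up data (three authors and four articles), and the helper functions are illustrative names for the path algebra operations.

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def not_(A):
    return [[1 if a == 0 else 0 for a in row] for row in A]

def clip(A):
    return [[1 if a > 0 else 0 for a in row] for row in A]

def from_edges(edges, n):
    """Build an n x n {0,1} matrix from a list of (i, j) edges."""
    M = [[0] * n for _ in range(n)]
    for i, j in edges:
        M[i][j] = 1
    return M

# Hypothetical vertices: 0 marko, 1 jbollen, 2 nelson, 3 to 6 articles.
N = 7
A1 = from_edges([(0, 3), (1, 3), (1, 4), (2, 5), (0, 6)], N)  # authored
A2 = from_edges([(3, 4), (3, 5), (3, 6)], N)                  # cites
I = from_edges([(i, i) for i in range(N)], N)

cites = matmul(matmul(A1, A2), transpose(A1))  # author-citation paths
coauthor = matmul(A1, transpose(A1))           # shared-article paths

# Long form: strip the diagonal from the coauthor matrix before clipping.
Z_long = hadamard(hadamard(cites, not_(clip(hadamard(coauthor, not_(I))))), not_(I))
# Short form: rely on the final n(I) to remove the diagonal.
Z_short = hadamard(hadamard(cites, not_(clip(coauthor))), not_(I))

print(Z_long == Z_short)  # True
print(Z_long[0][:3])      # [0, 0, 1]
```

In this toy data, marko's only surviving citation is to nelson: the citation of jbollen is dropped as a coauthor citation and the citation of his own article is dropped as a self citation.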
Other Filters and Operations...
• Please refer to the article for more information on these filters and operations.
Problems with the Path Algebra
• As a matrix algebra, it is computationally infeasible to compute matrix operations over the entire Web of Data.
• However, it is possible to approximate these calculations using “random” walkers.
Mapping Paths to Grammar-Based Random Walkers
• A grammar-based random walker is a walker that obeys a path description.
• Able to compute “semantically rich” spreading activation and stationary probability distributions in a multi-relational network.
• Able to approximate through the convergence properties of these operations.
• Provides a convenient application to the Web of Data and linked graph databases.
M.A. Rodriguez. Grammar-Based Random Walkers in Semantic Networks. Knowledge-Based Systems, 21(7), 727–739, 2008.
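The idea of a walker that obeys a path description can be sketched with a deliberately simplified simulation. It is not the grammar formalism of the cited article: the walker below is hard-coded to follow an authored edge, then an inverted authored edge, and to reject paths that end at its source, which mirrors the coauthorship description. The data set and function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical data: author -> articles authored.
AUTHORED = {
    "lanl:marko": ["acm:0505", "lanl:3030"],
    "lanl:jbollen": ["acm:0505", "lanl:3030"],
    "odu:nelson": ["lanl:3030"],
}

def authors_of(article):
    """Follow the authored edge backwards: article -> its authors."""
    return [a for a, articles in AUTHORED.items() if article in articles]

def coauthor_walk(source, walkers, rng):
    """Release walkers from source and count the coauthors at which they end."""
    counts = Counter()
    for _ in range(walkers):
        article = rng.choice(AUTHORED[source])  # step 1: authored
        candidates = [a for a in authors_of(article) if a != source]  # step 2: inverse, not self
        if candidates:
            counts[rng.choice(candidates)] += 1
    return counts

counts = coauthor_walk("lanl:marko", walkers=1000, rng=random.Random(42))
print(counts.most_common())
```

Each walker only touches the edges incident to the vertices it visits, so no global matrix is ever built. Note that the visit frequencies reflect uniformly random choices at every step, so they approximate a step-wise normalized version of the coauthorship path matrix rather than its raw path counts.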
A Grammar Walker
[Figure: a grammar walker carrying the path description $\mathcal{A}^1 \cdot \mathcal{A}^{1\top} \circ n(I)$ moves across the data structures exposed on the Web of Data at 127.0.0.4, 127.0.0.5, and 127.0.0.6 over the time steps t=1, t=2, and t=3.]
Grammar Walking the Web of Data
[Figure: a grammar walker leaves 127.0.0.1 and traverses the interlinked graph databases at 127.0.0.2 through 127.0.0.11 in seven numbered steps.]
Conclusion
• Graph databases will increasingly support the Web of Data.
• The Web of Data is about open, global-scale data management.
• Distributed computing is required for global-scale data processing.
• Grammar walkers can be used for distributed network analysis on the Web of Data.
Thank You For Your Time
  ⋆ My homepage: http://markorodriguez.com
  ⋆ Neno/Fhat: http://neno.lanl.gov
  ⋆ Collective Decision Making Systems: http://cdms.lanl.gov
  ⋆ Faith in the Algorithm: http://faithinthealgorithm.net
  ⋆ MESUR: http://www.mesur.org