Graphs and paths: Finding connections and patterns in data

Peter Wood

Department of Computer Science and Information Systems

Birkbeck, University of London

[email protected]

Graphs and paths: Finding connections and patterns in data 1

Contents

Introduction

The complexity of querying graphs (1987–1995)

Optimising queries on trees (1999–2007)

Adding flexibility to graph queries (2006–)

Ongoing and future work

Introduction

What is a graph and what are they used for?

What is a query language?

What are some of the problems we study?

What is a graph?

[Figure: a small graph with four nodes labelled A, B, C and D]

A graph is an abstract representation of some real-world data, where the data consists of connections between entities:

the entities are denoted by nodes

the connections are denoted by edges

Graphs have been studied since the 18th century

Bridges of Königsberg

The city of Königsberg straddled a river containing 2 islands.

The islands and mainland were connected by 7 bridges.

Was there a walk that crossed each bridge once and only once?

In 1736 Leonhard Euler proved that such a walk did not exist.

Bridges of Königsberg as a graph

[Figure: the bridges of Königsberg drawn as a graph with four nodes A, B, C and D, one per land mass, and one edge per bridge]
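Euler's argument can be checked mechanically. The sketch below stores the seven bridges as a list of edges and counts how many bridges touch each land mass. A walk that crosses every edge exactly once can exist in a connected graph only if zero or two nodes have an odd number of edges. Which of the labels A to D stands for which land mass is an assumption made here for illustration; the function names are also illustrative.

```python
from collections import Counter

# The seven bridges as a multigraph: each pair is one bridge.
# The assignment of labels to land masses is an assumption for illustration.
BRIDGES = [
    ("A", "B"), ("A", "B"),   # two bridges between bank A and island B
    ("B", "C"), ("B", "C"),   # two bridges between island B and bank C
    ("A", "D"), ("C", "D"),   # one bridge from each bank to island D
    ("B", "D"),               # one bridge between the two islands
]

def degrees(edges):
    """Count how many bridge ends touch each land mass."""
    count = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return count

def has_euler_walk(edges):
    """For a connected multigraph: a walk using every edge exactly once
    exists only when zero or two nodes have odd degree."""
    odd = [node for node, d in degrees(edges).items() if d % 2 == 1]
    return len(odd) in (0, 2)

print(dict(degrees(BRIDGES)))
print(has_euler_walk(BRIDGES))
```

All four land masses have an odd number of bridges, so the check fails, in line with Euler's result.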

Airline flights

Airline flights as a graph

A graph of airports and flight distances (in hundreds of miles):

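[Figure: graph whose nodes are the airports SEA, YVR, LHR, CDG, IAH, MAD, LIM and SCL; the edges carry the distance labels 250, 47, 8, 1, 19, 31, 59, 15 and 72, but which label belongs to which edge cannot be recovered from this text]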

Airlines connecting airports

A graph of airports and airlines flying between them:

[Figure: the same airports (SEA, YVR, LHR, CDG, IAH, MAD, LIM and SCL), with each edge labelled by the code of an airline flying that route: BA, AA, LA, AF, AC or IB]

Large graphs

the previous examples were of “toy” graphs

real-life graphs can be very large

examples include

Facebook: 1.65 billion users (nodes) and 280 billion connections (edges)

Linked Open Data cloud: 295 datasets, with 31 billion facts

Linked data graph

Linked Datasets as of August 2014

[Figure: the Linking Open Data cloud diagram, showing interlinked datasets (for example DBpedia, GeoNames, Freebase, Bio2RDF, RKBExplorer, GovUK and StatusNet) grouped by category: Linguistics, Social Networking, Life Sciences, Cross-Domain, Government, User-Generated Content, Publications, Geographic and Media]

Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak.

http://lod-cloud.net/

Linked Open Data

Linked Open Data stores facts such as

“Hilary Mantel won the Man Booker Prize”

as triples of the form subject — predicate (or property) — object

e.g. (Hilary Mantel, hasWonPrize, Man Booker), or drawn as an edge:

Hilary Mantel --hasWonPrize--> Man Booker

This language is called RDF (Resource Description Framework)
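A minimal sketch of the triple idea, using plain Python tuples rather than a real RDF library. The `match` helper and its wildcard convention are illustrative assumptions, not part of RDF itself.

```python
# Each fact is a (subject, predicate, object) triple.
TRIPLES = {
    ("Hilary Mantel", "hasWonPrize", "Man Booker"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Who has won the Man Booker?
print([s for s, _, _ in match(TRIPLES, p="hasWonPrize", o="Man Booker")])
```

Because every fact has the same three-part shape, a set of triples is exactly a graph with labelled edges: subjects and objects are nodes, predicates are edge labels.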

What are graphs used for?

social networks

knowledge representation (e.g. semantic web)

transportation and other networks (e.g. WWW)

geographical information

(hyper)document structure

semantic associations in criminal investigations

bibliographic citation analysis

pathways in biological processes

computer program analysis

workflow systems

data provenance

. . .

Query languages

A query language is a declarative language in which to express an information request from a data set.

“declarative” means that we shouldn’t need to specify precisely how the information is to be retrieved

that is left up to the (database) system to work out

e.g. What is the length of the shortest route (path) from LHR to LIM?

Shortest path algorithm

Function Dijkstra(Graph, source)
    dist[source] ← 0
    create node set Q
    foreach node v in Graph do
        if v ≠ source then dist[v] ← INFINITY
        Q.add_with_priority(v, dist[v])
    while Q is not empty do
        u ← Q.extract_min()
        foreach neighbour v of u do
            alt ← dist[u] + length(u, v)
            if alt < dist[v] then
                dist[v] ← alt
                Q.decrease_priority(v, alt)
    return dist[]
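The pseudocode above can be sketched in runnable form as follows. This version uses Python's `heapq` module, which has no decrease-priority operation, so it pushes a new entry whenever a shorter distance is found and skips stale entries when they are popped. The distances in `FLIGHTS` are hypothetical values chosen for illustration, not the ones on the slide.

```python
import heapq

def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph maps each node to a dict {neighbour: edge length}.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter route to u was already found
        for v, length in graph[u].items():
            alt = d + length
            if alt < dist.get(v, float("inf")):
                dist[v] = alt
                heapq.heappush(queue, (alt, v))
    return dist

# Hypothetical distances in hundreds of miles, NOT the values from the slide.
FLIGHTS = {
    "LHR": {"MAD": 8, "IAH": 48},
    "MAD": {"LHR": 8, "LIM": 59},
    "IAH": {"LHR": 48, "LIM": 31},
    "LIM": {"MAD": 59, "IAH": 31},
}

print(dijkstra(FLIGHTS, "LHR")["LIM"])
```

The point of the contrast is that here the programmer spells out how to find the answer, whereas a query language only asks what the answer is.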

Shortest path as a query

[Figure: query graph with an edge from LHR to LIM labelled ( )+(D)]

shortestPath(min(sum(D)))

variable D collects up the distance values

sum adds up the distances along a path

min asks for the minimum summed distance

Problems to study

3 main topics of investigation:

what are appropriate query language constructs?

how efficiently can queries be evaluated?

how can query evaluation be optimised?

Measures of efficiency

we can run queries to see how fast they execute

we can study their computational complexity

polynomial-time algorithms

NP-complete (or worse) problems — intractable

we can try to find polynomial-time subclasses (occurring in practice)

we can try to find approximation algorithms or heuristics

we can claim that the problem size is small enough

The complexity of querying graphs (1987–1995)

Many years ago . . .

my PhD was about “Queries on Graphs”

one contribution was the use of regular expressions to specify the paths of interest

also contributions in terms of complexity of query evaluation

25 years later

regular expressions were added to SPARQL 1.1, the standard query language for RDF

Regular expressions

Used to express patterns of interest, using 3 (or more) operators:

“choice” (|): a parent is a father or mother

(father | mother)

“sequence” (·): a grandparent is a parent of a parent

(parent · parent)

“repetition” (∗): an ancestor is a parent of a parent of a . . .

(parent)∗

These can be combined: so (mother | father)∗ specifies any sequence of

mother and/or father relationships.
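The three operators can be tried out with Python's `re` module by writing the sequence of edge labels along a path as a string. This is only a sketch: the one-letter labels m and f and the helper `satisfies` are assumptions made for illustration.

```python
import re

# Edge labels along a path, one letter per edge: m = mother, f = father.
PARENT = "(m|f)"                  # choice
GRANDPARENT = PARENT + PARENT     # sequence
ANCESTOR = PARENT + "*"           # repetition (zero or more steps)

def satisfies(labels, expression):
    """True if the whole label sequence matches the expression."""
    return re.fullmatch(expression, "".join(labels)) is not None

print(satisfies(["m", "f"], GRANDPARENT))     # mother's father
print(satisfies(["m"], GRANDPARENT))          # only one step
print(satisfies(["f", "f", "m"], ANCESTOR))   # three steps up
```

Note that repetition also allows zero steps, so under `ANCESTOR` the empty path matches as well.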

Example of a query

given a graph

nodes representing people

edges labelled with m (for motherOf) or f (for fatherOf)

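the following query first defines parents, then finds pairs of people who have a common ancestor

[Figure: two query graphs. In the first, an edge labelled m|f between x and y defines an edge labelled p. In the second, paths labelled p* connecting a node z to x and to y define an edge labelled a between x and y.]

x, y and z are variables, for matching nodes in the graph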
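One way to read the two steps of this query is sketched below. The family data is hypothetical, and the direction of the edges (from parent to child) is an assumption, since the slide's figure does not survive in this text.

```python
from itertools import combinations

# Hypothetical family: (u, label, v) means u is the mother (m) or father (f) of v.
EDGES = [("Ann", "m", "Bob"), ("Ann", "m", "Cat"), ("Bob", "f", "Dan")]

def parent_edges(edges):
    """Step 1: p is the union of the m and f edges."""
    p = {}
    for u, label, v in edges:
        if label in ("m", "f"):
            p.setdefault(u, set()).add(v)
    return p

def reachable(p, z):
    """All nodes reachable from z by p* (zero or more p edges)."""
    seen, stack = {z}, [z]
    while stack:
        for v in p.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def common_ancestor_pairs(edges):
    """Step 2: pairs (x, y) such that some z reaches both x and y by p*."""
    p = parent_edges(edges)
    nodes = {n for u, _, v in edges for n in (u, v)}
    pairs = set()
    for z in nodes:
        pairs.update(combinations(sorted(reachable(p, z)), 2))
    return pairs

print(sorted(common_ancestor_pairs(EDGES)))
```

Because p* allows zero steps, a person counts as their own ancestor here, so a parent and child also form a pair.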

Regular simple path queries

a path p is simple if no node is repeated on p

Regular Simple Path Problem

Given a graph, a pair of nodes x and y and a regular expression r, is there a simple path from x to y satisfying r?

Regular Simple Path Problem is NP-complete, even for fixed expressions

the complexity is in the size of the graph
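To make the problem concrete, here is a brute-force sketch that tries every simple path from x to y and tests its labels against the expression. It is an illustration only, with a hypothetical graph and single-character labels, and it may explore exponentially many paths, which is consistent with the hardness result.

```python
import re

def regular_simple_path(graph, x, y, expression):
    """Is there a simple path from x to y whose edge labels match?

    graph maps a node to a list of (label, neighbour) pairs, where each
    label is a single character.
    """
    pattern = re.compile(expression)

    def search(node, visited, word):
        if node == y:
            # A simple path to y must end here: y cannot be revisited.
            return pattern.fullmatch(word) is not None
        for label, nxt in graph.get(node, ()):
            if nxt not in visited and search(nxt, visited | {nxt}, word + label):
                return True
        return False

    return search(x, {x}, "")

# Hypothetical graph: a cycle x -> u -> v -> x plus a direct edge x -> y.
GRAPH = {
    "x": [("a", "u"), ("a", "y")],
    "u": [("a", "v")],
    "v": [("a", "x")],
}

print(regular_simple_path(GRAPH, "x", "y", "(aa)*"))
print(regular_simple_path(GRAPH, "x", "y", "a"))
```

In this graph the walk x, u, v, x, y has four edges and so matches (aa)*, but it repeats x. The only simple path from x to y has one edge, so the first call reports no simple path.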

Regular simple path problem

solvable in polynomial time when restricted to regular expressions closed under abbreviations (or deletions)

if we take a sequence satisfying regular expression r and delete some labels from the sequence, we get another sequence satisfying r

many commonly used regular expressions, such as p* and (m|f)*, are closed under abbreviations
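The closure property can be explored with a bounded brute-force check, sketched below. It enumerates all words up to a given length and is therefore evidence rather than proof; the function name and parameters are illustrative assumptions.

```python
import re
from itertools import combinations, product

def closed_under_abbreviations(expression, alphabet, max_length):
    """Bounded check: whenever a word up to max_length matches, every
    word obtained from it by deleting labels must match as well."""
    pattern = re.compile(expression)
    for n in range(max_length + 1):
        for word in product(alphabet, repeat=n):
            if not pattern.fullmatch("".join(word)):
                continue
            for k in range(n):
                # combinations keeps order, so each result is a subsequence
                for kept in combinations(word, k):
                    if not pattern.fullmatch("".join(kept)):
                        return False
    return True

print(closed_under_abbreviations("(m|f)*", "mf", 4))
print(closed_under_abbreviations("p*", "p", 4))
print(closed_under_abbreviations("(mm)*", "m", 4))
```

The last expression fails the check: mm matches, but deleting one label leaves m, which does not.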

Optimising queries on trees (1999–2007)

Trees

trees are commonly used for representing data

one language for trees is called XML (eXtensible Markup Language)

XML is used for data exchange between systems/applications

e.g., HESA (Higher Education Statistics Agency) requires student

data to be submitted in XML

Calendar data in XML

<diary>
  <event>
    <date>3/12/88</date> <description>anniversary</description>
    <repeat><frequency>yearly</frequency></repeat>
  </event>
  <event>
    <date>6/6/16</date> <description>inaugural</description>
    <time>17:00</time> <duration>1:00</duration>
  </event>
  <event>
    <date>5/1/2016</date> <description>lecture</description>
    <time>18:00</time> <duration>3:00</duration>
    <repeat><until>19/3/2016</until><frequency>weekly</frequency></repeat>
  </event>
</diary>

Calendar data as a tree

diary
    event
        date: 3/12/88
        desc: anniversary
        repeat
            frequency: yearly
    event
        date: 6/6/16
        desc: inaugural
        time: 17:00
        duration: 1:00
    event
        date: 5/1/16
        desc: lecture
        time: 18:00
        duration: 3:00
        repeat
            until: 19/3/16
            frequency: weekly

A query language for XML

XPath (XML Path language) is a query language for XML

it is part of many other languages for processing XML

it makes sense to study optimisation of XPath queries

particularly in the presence of schema information

a schema is a set of constraints that restricts the permissible relationships between data items

one way to define schemas for XML is to use

Document Type Definitions (DTDs)
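As a small illustration of an XPath query, the sketch below runs one over part of the diary data using Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The choice of query is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

DIARY = """
<diary>
  <event>
    <date>3/12/88</date> <description>anniversary</description>
    <repeat><frequency>yearly</frequency></repeat>
  </event>
  <event>
    <date>6/6/16</date> <description>inaugural</description>
    <time>17:00</time> <duration>1:00</duration>
  </event>
</diary>
"""

root = ET.fromstring(DIARY.strip())

# Select the description of every event that has a repeat child.
repeating = [e.text for e in root.findall("./event[repeat]/description")]
print(repeating)

# Select the time of every event; events without a time contribute nothing.
times = [e.text for e in root.findall("./event/time")]
print(times)
```

The path expression navigates the tree from the root, and the bracketed predicate filters events by the presence of a child element.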

DTD example

Consider an XML DTD for our diary application:

diary → (event)*

event → (date · description · (time · duration)? · repeat?)

repeat → ((until | occurrences)? · frequency)

note the use of regular expressions above

? means that what precedes it is optional

some constraints specified by the DTD are:

– every event must have a date

– every event with a time must have a duration

– every event has at most one description
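Since the DTD rules are regular expressions over child element names, they can be checked with an ordinary regular expression engine. The sketch below encodes each rule as a Python pattern over space-terminated names; this encoding and the helper `valid_children` are assumptions made for illustration, not how XML validators are implemented.

```python
import re

# Content models from the DTD, written as Python regular expressions
# over child names, each name followed by a single space.
CONTENT_MODELS = {
    "diary": r"(event )*",
    "event": r"date description (time duration )?(repeat )?",
    "repeat": r"((until|occurrences) )?frequency ",
}

def valid_children(parent, children):
    """True if the sequence of child element names is allowed by the DTD."""
    word = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], word) is not None

print(valid_children("event", ["date", "description"]))
print(valid_children("event", ["date", "description", "time"]))
print(valid_children("event", ["description"]))
```

The second call fails because a time without a duration is not allowed, and the third because every event must have a date.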

Query satisfiability

an XPath query may be unsatisfiable (with respect to a DTD)

this means that the answer to the query will always be empty

e.g., asking for diary events which repeat both until some date and for some number of occurrences

checking satisfiability first can yield savings in overall query processing time
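The example can be checked by hand against the rule for repeat in the diary DTD, which allows only three child sequences. The sketch below lists them and tests whether any contains all the children a query requires. It covers only this one simple kind of query, and the names are illustrative assumptions.

```python
# All child sequences allowed by: repeat → ((until | occurrences)? · frequency)
REPEAT_CHILD_SEQUENCES = [
    ["frequency"],
    ["until", "frequency"],
    ["occurrences", "frequency"],
]

def satisfiable(required_children, allowed_sequences):
    """A query requiring all the given children can have answers only if
    some allowed child sequence contains every one of them."""
    required = set(required_children)
    return any(required <= set(seq) for seq in allowed_sequences)

print(satisfiable(["until", "occurrences"], REPEAT_CHILD_SEQUENCES))
print(satisfiable(["until", "frequency"], REPEAT_CHILD_SEQUENCES))
```

A system that detects the first case up front can return the empty answer without looking at any data.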

XPath satisfiability results

(joint work with PhD student Manizheh Montazerian)

we showed that checking satisfiability, even for simple XPath

fragments, is NP-complete

we also examined several real-world DTDs and discovered a new property, called covering, which most of them satisfied

we showed that query satisfiability for a fragment of XPath can be

tested in polynomial time when the underlying DTD is covering

Adding flexibility to graph queries (2006–)

Flexible querying

(joint work with Alex Poulovassilis, Andrea Calì, Carlos Hurtado, Petra Selmer, Riccardo Frosini)

with LOD, many heterogeneous data sets are available

the standard query language is called SPARQL

regular expressions (property paths in SPARQL 1.1 (2013)) already

allow for some flexibility

but when users don’t know the vocabulary or structure, they may

need help

in particular, they may pose unsatisfiable queries

Example LOD graph

Graph of authors, prizes they have won, and countries where they were born:

[Figure: graph with prize nodes Nobel and Booker, author nodes Neruda, Gordimer, Coetzee and Carey, and country nodes Chile, South Africa and Australia; edges from authors to prizes are labelled hasWonPrize and edges from authors to countries are labelled wasBornIn]

Example SPARQL query

Which authors born in South Africa have won both the

Nobel Prize in Literature and the Man Booker prize?

[Figure: query graph in which node x has hasWonPrize edges to Booker and to Nobel, and a wasBornIn edge to South Africa]

x is a variable

Matching answers

Two matchings for variable x: Gordimer and Coetzee

[Figure: the example graph, with the two matches for x, Gordimer and then Coetzee, highlighted in turn]
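The matching can be sketched over a set of triples as follows. The individual edges are an assumption, chosen to agree with the node labels in the example graph and with the two answers reported; the representation of the query as a list of (predicate, object) patterns is also illustrative and is not SPARQL syntax.

```python
# Edges of the example graph (assumed; see the note above).
TRIPLES = {
    ("Neruda", "hasWonPrize", "Nobel"),
    ("Gordimer", "hasWonPrize", "Nobel"),
    ("Gordimer", "hasWonPrize", "Booker"),
    ("Coetzee", "hasWonPrize", "Nobel"),
    ("Coetzee", "hasWonPrize", "Booker"),
    ("Carey", "hasWonPrize", "Booker"),
    ("Neruda", "wasBornIn", "Chile"),
    ("Gordimer", "wasBornIn", "South Africa"),
    ("Coetzee", "wasBornIn", "South Africa"),
    ("Carey", "wasBornIn", "Australia"),
}

# The query: three patterns that all share the variable x as subject.
PATTERNS = [
    ("hasWonPrize", "Nobel"),
    ("hasWonPrize", "Booker"),
    ("wasBornIn", "South Africa"),
]

def answers(triples, patterns):
    """Return every x such that (x, p, o) is in the graph for all patterns."""
    subjects = {s for s, _, _ in triples}
    return sorted(x for x in subjects
                  if all((x, p, o) in triples for p, o in patterns))

print(answers(TRIPLES, PATTERNS))
```

Every pattern must match at once, which is why Neruda (no Booker) and Carey (no Nobel, born in Australia) are excluded.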

in the real data set, wasBornIn connects a person to a city, not a country

so the previous query will return no answers

a city is connected to a country by an edge labelled with isLocatedIn

so the correct regular expression to connect x to South Africa is:

wasBornIn · isLocatedIn

the query system can automatically make changes to a query so as to help the user find relevant information
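
The difference between the two path expressions can be sketched as follows. The city names (Springs, Cape Town) are illustrative assumptions added for this example, as are the helpers `follow_path` and `born_in`; the point is only that following wasBornIn alone never reaches a country, whereas wasBornIn · isLocatedIn does.

```python
# Illustrative fragment of the "real" data: birthplaces are cities,
# and cities are linked to countries. The city names are assumptions.
EDGES = {
    ("Gordimer", "wasBornIn", "Springs"),
    ("Coetzee", "wasBornIn", "Cape Town"),
    ("Springs", "isLocatedIn", "South Africa"),
    ("Cape Town", "isLocatedIn", "South Africa"),
}


def follow_path(edges, start, labels):
    """Nodes reachable from `start` by following `labels` in sequence."""
    frontier = {start}
    for label in labels:
        frontier = {o for (s, l, o) in edges if s in frontier and l == label}
    return frontier


def born_in(edges, country, labels):
    """Nodes whose path over `labels` ends at `country`."""
    starts = {s for (s, _, _) in edges}
    return {s for s in starts if country in follow_path(edges, s, labels)}


# The original query uses wasBornIn alone and finds nothing.
print(born_in(EDGES, "South Africa", ["wasBornIn"]))  # set()
# The repaired expression wasBornIn · isLocatedIn finds both authors.
print(sorted(born_in(EDGES, "South Africa", ["wasBornIn", "isLocatedIn"])))
```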

Approximate matching

the user’s original query is modified by applying edit operations to the regular expressions, e.g., insertions, deletions and substitutions

so isLocatedIn would be automatically inserted after wasBornIn

such edit operations have also been used for DNA sequence alignment

each operation may have a different cost associated with it

answers to queries are returned in ranked order, in increasing total cost from the original query
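
A rough sketch of the idea, under simplifying assumptions: the query is treated as a plain sequence of edge labels rather than a full regular expression, the per-operation costs in `COSTS` are hypothetical, and the function names are invented for illustration. A uniform-cost search enumerates every edited sequence within a cost bound and returns them cheapest first, which is the order in which their answers would be ranked.

```python
import heapq

# Hypothetical per-operation costs; the slides only say costs may differ.
COSTS = {"insert": 1, "delete": 1, "substitute": 1}


def one_edit(seq, alphabet):
    """Yield (operation, new sequence) for every single edit of `seq`."""
    for i in range(len(seq) + 1):
        for label in alphabet:
            yield "insert", seq[:i] + (label,) + seq[i:]
    for i in range(len(seq)):
        yield "delete", seq[:i] + seq[i + 1:]
        for label in alphabet:
            if label != seq[i]:
                yield "substitute", seq[:i] + (label,) + seq[i + 1:]


def ranked_variants(query, alphabet, max_cost, costs=COSTS):
    """Return (cost, sequence) pairs within `max_cost`, cheapest first."""
    best = {tuple(query): 0}
    heap = [(0, tuple(query))]
    while heap:
        cost, seq = heapq.heappop(heap)
        if cost > best[seq]:
            continue  # stale entry; a cheaper derivation was already found
        for op, new_seq in one_edit(seq, alphabet):
            new_cost = cost + costs[op]
            if new_cost <= max_cost and new_cost < best.get(new_seq, max_cost + 1):
                best[new_seq] = new_cost
                heapq.heappush(heap, (new_cost, new_seq))
    return sorted((c, s) for s, c in best.items())


variants = ranked_variants(["wasBornIn"], ["wasBornIn", "isLocatedIn"], max_cost=1)
for cost, seq in variants:
    print(cost, " · ".join(seq) or "(empty path)")
```

With a bound of 1 the original query comes first at cost 0, followed by five single-edit variants, one of which is wasBornIn · isLocatedIn.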

Optimisation

applying edit operations generates many additional queries

so we have been investigating ways to speed up query execution

one is to construct an approximate summary schema from the data

this allows the system to detect and ignore unsatisfiable queries

Schema summary optimisation

basic idea is to record all the path sequences up to a certain maximum length that actually appear in the data

choose a small maximum length, e.g. 2

only an approximation: if we discover two paths a · b and b · c in the data, the summary records that a · b · c is a possible path, even if it doesn’t actually appear in the data

nevertheless, if a path does not appear in the summary, then it does not appear in the data

this allows the system to check for unsatisfiable queries and save time by not executing them
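
The following is a minimal sketch of such a summary for maximum length 2, not the data structure used in the actual system. The names `build_summary` and `may_be_satisfiable` and the toy data are assumptions made for illustration. It records the labels that occur and the pairs of labels that occur consecutively, then accepts a label sequence only if every adjacent pair was seen.

```python
from collections import defaultdict


def build_summary(triples):
    """Record every label, and every pair of labels that occurs consecutively."""
    out_labels = defaultdict(set)  # node -> labels on its outgoing edges
    for s, label, _ in triples:
        out_labels[s].add(label)
    labels = {label for _, label, _ in triples}
    pairs = {(l1, l2) for _, l1, o in triples for l2 in out_labels.get(o, ())}
    return labels, pairs


def may_be_satisfiable(path, summary):
    """False means the path certainly has no match; True is only 'maybe'."""
    labels, pairs = summary
    if any(label not in labels for label in path):
        return False
    return all(pair in pairs for pair in zip(path, path[1:]))


# a·b occurs (n1 to n3) and b·c occurs (n4 to n6), but a·b·c never does.
DATA = {
    ("n1", "a", "n2"), ("n2", "b", "n3"),
    ("n4", "b", "n5"), ("n5", "c", "n6"),
}
summary = build_summary(DATA)
print(may_be_satisfiable(["a", "b", "c"], summary))  # True (a false positive)
print(may_be_satisfiable(["a", "c"], summary))       # False, so skip the query
```

The check errs only on the safe side: a `True` result may still yield no answers when the query is run, but a `False` result guarantees the query can be discarded without touching the data.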

Sample schema summary results

for one query with a maximum edit cost set to 3, 112 additional queries were generated via edit operations

using a schema summary on a data set of 6.7M facts, 59 of these queries were unsatisfiable

without the summary, the queries did not complete execution within 8 hours

with a schema summary of size 2, they completed in under 4 seconds

Ongoing and future work

further investigation of suitable query constructs, particularly for social network applications

returning recommended paths to users, rather than simply nodes

providing explanations to users about how answers to queries were derived

other methods to improve query execution time

using mappings between LOD data sets to allow for automatic rewriting of queries in distributed scenarios
