
Outline - SEMANTiCS 2020 US...Constrained vs. Unconstrained GPQs 23 • Stardog has limited support for path constraints. • Neo4j has predicate func*ons (all, any, exists, none,


Outline

•  Problem Statement and Motivation
   –  Generalized Path Queries
   –  Algebraic Problem Interpretation
•  Background
   –  Algebraic Path Problem Solving
•  Approach
   –  Integrating graph pattern matching with algebraic path problem solving
•  Evaluation
•  Conclusion


Motivating Example

Query: Find relationships between passengers on any flights to Washington DC between June 30 and July 6 who purchased one-way tickets by cash, and countries on the CIA watchlist where there is at least one financial link through any bank.

1.   Relationships are paths that must be extracted, not merely matched!
     •  e.g. different from path expression / property path queries
2.   Participating entities are not explicitly stated
     •  Pattern p1: Passengers on flights to Washington DC between June 30 and July 6 who purchased one-way tickets by cash
     •  Pattern p2: Countries on the CIA watchlist


Existing Problem Interpretation Approaches

•  For graph data models like RDF there are multiple ways to approach the problem

[Figure: taxonomy of existing approaches, with the labels "Holistic" and "Decomposable"]
•  Graph Theoretic Interpretation
•  Algebraic Interpretation
•  Relational Algebra (Database Approach)
•  Hybrid Approach (partially relational + graph)
•  Total Algebraic Approach (Relational Algebra* + Path Algebra)


Comparison of Approaches

Graph Theoretic
•  Different problems translate to different graph problems, i.e. different algorithms
   –  The example is a subgraph homeomorphism problem (NP-hard).
   –  Others: Subgraph Isomorphism, Shortest Paths, Disjoint Paths
•  Often difficult to parallelize or optimize
•  Can't deal with declarative specifications

Algebraic
•  Decompose into smaller operators
   –  Reuse is possible
   –  Adapt optimization and composition of smaller operators
•  Relational algebra not ideal
   –  Arbitrary path length requires iteration and fixpoint semantics
•  Hybrid approaches exist
   –  Pattern matching (algebraic), traversal (graph theoretic)
•  Emphasis on shortest paths for traversals
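The remark that arbitrary path length needs iteration and fixpoint semantics in relational algebra can be illustrated with a minimal Python sketch. The function name `reachable_pairs` and the tiny edge relation are hypothetical and are not taken from the talk. The loop repeats one join-and-project step until no new pairs appear.

```python
def reachable_pairs(edges):
    """Transitive closure of a binary edge relation by naive fixpoint iteration."""
    closure = set(edges)
    while True:
        # One relational step: join closure(x, y) with edges(y, z), project (x, z).
        step = {(x, z) for (x, y) in closure for (y2, z) in edges if y == y2}
        new_pairs = step - closure
        if not new_pairs:
            # Fixpoint reached: another join cannot add anything.
            return closure
        closure |= new_pairs


# Hypothetical three-edge chain 1 -> 2 -> 3 -> 4.
edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(reachable_pairs(edges)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The number of iterations depends on the data (the longest path), which is why a fixed relational algebra expression cannot express it without a fixpoint operator.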


Query: Find relationships between passengers on any flights to Washington DC between June 30 and July 6 who purchased one-way tickets by cash, and countries on the CIA watchlist where there is at least one financial link through any bank.

We interpret the problem as a Generalized Path Query (GPQ) with

1.  Two entity sets declaratively specified as patterns – P1, P2
2.  A connection set linking P1, P2 instances
3.  A connection constraint on the connection set (edge type membership)

Our Proposal: A Total Algebraic Approach

•  1: Solvable with algebraic graph pattern matching
•  2, 3: Solvable with algebraic path finding (plus extension)


Algebraic Path Problem Solving Framework

•  An efficient algebraic path problem solving approach introduced by Tarjan [23, 24].
•  Basic Definitions:
   •  An edge $e$ in a directed labeled graph $G = (V, E)$ is denoted as $e = (v_1, v_2)$ with label $\lambda(e) = l_e$, where $v_1, v_2 \in V$ and $e \in E$.
   •  A path $p$ in such a graph $G = (V, E)$ is an alternating sequence of nodes and labeled edges terminating in a node, $p = \{v_1, l_{e_1}, v_2, l_{e_2}, \ldots, v_n, l_{e_n}, v_{n+1}\}$, where $v_1, v_2, \ldots, v_n, v_{n+1} \in V$ and $e_1, e_2, \ldots, e_n \in E$.
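The path definition above can be checked mechanically. The following is a minimal sketch, assuming edges are stored as `(source, label, target)` triples. The helper name `is_path` is hypothetical, and the four edges are the ones that appear in the deck's PE(2,7) example.

```python
def is_path(edges, seq):
    """Check that seq = [v1, l1, v2, ..., ln, vn+1] alternates nodes and edge labels.

    edges is a set of (source, label, target) triples.
    """
    if len(seq) % 2 == 0:
        # A path starts and ends with a node, so its length is odd.
        return False
    return all(
        (seq[i], seq[i + 1], seq[i + 2]) in edges
        for i in range(0, len(seq) - 2, 2)
    )


# Edges b: 2->3, c: 3->4, f: 4->7, i: 2->4 from the running example.
edges = {(2, "b", 3), (3, "c", 4), (4, "f", 7), (2, "i", 4)}
print(is_path(edges, [2, "b", 3, "c", 4, "f", 7]))  # True
print(is_path(edges, [2, "b", 4]))                  # False
```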


Path Encoding as Path Expressions

•  A Path Expression [23, 24] of type (s, d), PE(s, d), is a 3-tuple ⟨s, d, R⟩, where
   §  R is a regular expression over the set of edges, defined using union (∪), concatenation (•) and closure (*).
   §  The language L(R) of R represents paths from s to d, where the graph contains nodes s and d.

[Figure: example directed labeled graph with nodes 1–8 and edge labels a–k]

•  PE(2,7) = ⟨2, 7, ((b•c•f) ∪ (i•f))⟩
•  PE(1,4) = ⟨1, 4, (k)⟩
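To make the language L(R) of a path expression concrete, here is a minimal Python sketch. The class names `Edge`, `Union`, `Concat`, `Star` and the function `language` are hypothetical illustrations, not part of the framework cited in the talk. Closure is unrolled only up to `max_repeat` times, which is an assumption made to keep the result finite.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Edge:
    label: str


@dataclass(frozen=True)
class Union:
    left: object
    right: object


@dataclass(frozen=True)
class Concat:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    inner: object


def language(r, max_repeat=2):
    """Return the label sequences denoted by r (closure bounded by max_repeat)."""
    if isinstance(r, Edge):
        return {(r.label,)}
    if isinstance(r, Union):
        return language(r.left, max_repeat) | language(r.right, max_repeat)
    if isinstance(r, Concat):
        lefts = language(r.left, max_repeat)
        rights = language(r.right, max_repeat)
        return {a + b for a, b in product(lefts, rights)}
    if isinstance(r, Star):
        inner = language(r.inner, max_repeat)
        result = {()}       # zero repetitions: the empty path
        current = {()}
        for _ in range(max_repeat):
            current = {a + b for a, b in product(current, inner)}
            result |= current
        return result
    raise TypeError(f"unsupported expression: {r!r}")


# R for PE(2,7) = <2, 7, ((b.c.f) U (i.f))>
r_2_7 = Union(
    Concat(Edge("b"), Concat(Edge("c"), Edge("f"))),
    Concat(Edge("i"), Edge("f")),
)
print(sorted(language(r_2_7)))  # [('b', 'c', 'f'), ('i', 'f')]
```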


Graph Path Information Representation: A Sequence of Path Expressions

•  A Path Sequence (PS) [23, 25] is a unique ordering of path expressions that represents all path information in a graph, such that for any path p,
   §  there is a unique partition of p into nonempty subpaths, and
   §  a unique sequence of indices I of PS, such that
   §  the i-th subpath of p is represented by the path expression in PS at the i-th index in I.

[Figure: PS: PE1 PE2 PE3 PE4 ... PE100 ... PE155 ... PE390 ... PE1000; path p partitioned into subpaths p1 p2 p3 p4 p5, with I = {2, 3, 100, 390, 1000}]

•  The path sequence representation of a graph allows path problems to be solved by a single forward scan.


Path Computation using Path Sequence

•  A simple propagation SOLVE algorithm [23, 25] can be used to solve path problems.
•  The SOLVE algorithm assembles path information while scanning the path sequence from left to right.
•  At every iteration of the SOLVE algorithm the following step is performed:

$$PE(s, w_i) \cup \big(PE(s, v_i) \bullet PE(v_i, w_i)\big) \rightarrow SA[w_i]$$
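The propagation step can be sketched in Python as a single forward scan. This is a simplified illustration, not the cited implementation: each path expression is assumed to be a finite set of label tuples (no closure, and no entries with v = w), and the names `concat` and `solve` are hypothetical. The ten-entry sequence is the one shown on the deck's final example slide.

```python
from itertools import product


def concat(left, right):
    """Concatenate every label sequence in left with every one in right."""
    return {a + b for a, b in product(left, right)}


def solve(path_sequence, source):
    """Scan the path sequence once and fill SA[w] with label paths source -> w."""
    sa = {source: {()}}                  # PE(s, s) = lambda, the empty path
    for v, w, pe in path_sequence:
        reach_v = sa.get(v, set())       # PE(s, v); the empty set if unreached
        if reach_v:
            # SA[w] <- PE(s, w) U (PE(s, v) . PE(v, w))
            sa[w] = sa.get(w, set()) | concat(reach_v, pe)
    return sa


path_sequence = [
    (1, 4, {("k",)}),
    (2, 3, {("b",)}),
    (2, 4, {("i",)}),
    (3, 4, {("a", "k"), ("c",)}),
    (4, 5, {("d",)}),
    (4, 7, {("f",)}),
    (5, 6, {("e",), ("j",)}),
    (5, 7, {("h",)}),
    (7, 8, {("g",)}),
    (3, 1, {("a",)}),
]

sa = solve(path_sequence, source=1)
print(sorted(sa[6]))  # [('k', 'd', 'e'), ('k', 'd', 'j')]
```

For source 1 this reproduces the values worked out on the example slide: SA[4] = k, SA[5] = k•d and SA[6] = (k•d•e) ∪ (k•d•j).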


Overview of Approach (1)

•  We built a prototype by integrating
   §  SemStorm [31]:
      o  a Hadoop-based file organization storage system
      o  supports efficient graph pattern matching query execution using an algebraic query evaluation technique
      o  uses Apache Tez as the execution environment.
   §  Serpent [25, 27]:
      o  platform for finding all paths between a set of sources and destinations.
      o  builds on the path algebraic technique using path sequences.

[Figure: pipeline of four components: Pattern Matching, Pattern Filter, Path Finder, Path Filter]
•  Pattern Matching: P1: passengers on flights to Washington DC who purchased one-way tickets by cash; P2: countries on the CIA watchlist
•  Pattern Filter: P1: between June 30 and July 6
•  Path Finder: finding paths between P1 and P2
•  Path Filter: with at least one financial link
•  SemStorm performs the pattern components; Serpent performs the path components.


Overview of Approach (2)

1.   Query Expression
     •  Goal: Minimize disruption to existing infrastructure, e.g. parser.
     •  Solution: Use syntactic sugar to represent the path operator.
2.   Query Planning
     •  Goal: Integrate graph pattern matching with path problem solving.
     •  Solution: Modify the graph pattern matching query plan by adding algebraic path querying operators to produce a GPQ query plan.
3.   Query Execution
     •  Goal: Execute GPQs.
     •  Solution: Translate the GPQ logical query plan into a physical query plan by introducing the required physical query operators.


Introduction of Syntactic Sugar – ?pathVar as a Special Property Variable

•  Avoids change to SPARQL's query syntax.
•  For triple pattern ⟨?s ?pathVar ?d⟩,
   •  ?pathVar is actually a path variable
   •  it is removed and handled specially while the rest of the query is compiled normally as a graph pattern query.

SELECT * WHERE {
  ?s1 rdf:type akt:Affiliated-Person .
  ?s1 akt:full-name "Wendy E. Mackay" .
  ?s akt:has-author ?s1 .
  ?s2 akt:full-name "Irene Greif" .
  ?s2 akt:has-affiliation ?d .
  ?s ?pathVar ?d .
}
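The special handling of the ?pathVar triple pattern can be sketched as a pre-processing step that separates it from the ordinary graph pattern. This is only an illustration under stated assumptions: triple patterns are plain Python tuples, and `split_path_patterns` is a hypothetical helper, not SemStorm's parser or compiler API.

```python
def split_path_patterns(triples, path_var="?pathVar"):
    """Separate path triple patterns from ordinary graph pattern triples.

    Returns (bgp, path_endpoints), where path_endpoints holds the
    (source variable, destination variable) pair of each path pattern.
    """
    bgp, path_endpoints = [], []
    for s, p, o in triples:
        if p == path_var:
            path_endpoints.append((s, o))   # handled by the path operator
        else:
            bgp.append((s, p, o))           # compiled as a normal graph pattern
    return bgp, path_endpoints


query = [
    ("?s1", "rdf:type", "akt:Affiliated-Person"),
    ("?s1", "akt:full-name", '"Wendy E. Mackay"'),
    ("?s", "akt:has-author", "?s1"),
    ("?s2", "akt:full-name", '"Irene Greif"'),
    ("?s2", "akt:has-affiliation", "?d"),
    ("?s", "?pathVar", "?d"),
]
bgp, endpoints = split_path_patterns(query)
print(endpoints)  # [('?s', '?d')]
print(len(bgp))   # 5
```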


Introduction of Syntactic Sugar – ?pathVar as a Property Variable

[Figure: the example query from the previous slide drawn as a graph pattern over ?s1, ?s, ?s2 and ?d, with edge labels :type, :fn and :ha and the literals "Wendy E. Mackay" and "Irene Greif"]

•  Challenge: Need to track the path source and destination variables in the graph pattern.
•  Solution: Use SemStorm's data structures for tracking variables.


Logical Plan Example as a Tree

Apache Jena Query Plan (SPARQL Syntax Expression – SSE), https://jena.apache.org/

(prefix ((akt: <http://www.aktors.org/ontology/portal#>)
         (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>))
  (project (?s1 ?s ?d)
    (product
      (join
        (BGP [triple ?s1 rdf:type akt:Affiliated-Person]
             [triple ?s1 akt:full-name "Wendy E. Mackay"])
        (BGP [triple ?s akt:has-author ?s1]))
      (BGP [triple ?s2 akt:full-name "Irene Greif"]
           [triple ?s2 akt:has-affiliation ?d]))))

[Figure: logical query plan as a tree: project over product; the product combines the join (on ?s1) of BGP ?s1 and BGP ?s with BGP ?d]

•  Disconnected sub-graph patterns lead to a cross-product.


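Logical Plan Transformation Example

•  Remove cross-product and projection operators and introduce path query operators.

[Figure: before: project over product, where the product combines the join (on ?s1) of BGP ?s1 and BGP ?s with BGP ?d. After: path filter over path operator, applied to the same join (on ?s1) and BGP ?d]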
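The plan transformation can be sketched as a small tree rewrite. This is an illustration only: plans are modelled as nested tuples, `to_gpq_plan` is a hypothetical name, and placing the path filter directly above the path operator is an assumption read off the figure labels rather than a confirmed detail of the prototype.

```python
def to_gpq_plan(plan):
    """Rewrite a logical plan: drop project, turn product into path operators."""
    op = plan[0]
    if op == "project":
        # ("project", variables, child): the projection is removed.
        return to_gpq_plan(plan[2])
    if op == "product":
        # The cross-product of disconnected patterns becomes a path operator
        # between them, wrapped in a path filter for the connection constraint.
        left, right = to_gpq_plan(plan[1]), to_gpq_plan(plan[2])
        return ("pathfilter", ("pathoperator", left, right))
    if op == "join":
        return ("join", to_gpq_plan(plan[1]), to_gpq_plan(plan[2]))
    return plan  # leaves such as ("bgp", "?s1") are kept unchanged


plan = (
    "project", ("?s1", "?s", "?d"),
    ("product",
     ("join", ("bgp", "?s1"), ("bgp", "?s")),
     ("bgp", "?d")),
)
print(to_gpq_plan(plan))
# ('pathfilter', ('pathoperator', ('join', ('bgp', '?s1'), ('bgp', '?s')), ('bgp', '?d')))
```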


Implementation Strategy

•  Physical query operators for our platform are Tez vertices.
•  The physical query plan is represented by the Tez DAG.
•  The following physical query operators were introduced:
   §  Annotator Vertex for source, destination and constraint variables: identifies the source, destination or constraint variables, allowing only the bindings for that variable to pass through.
   §  Path Computer Vertex is the path operator which performs the final path computation.
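The role of the annotator can be illustrated with a plain Python generator. This is a sketch under assumptions, not the Tez vertex itself: solutions are modelled as dictionaries from variable names to values, and `annotate`, the role strings and the sample bindings are all hypothetical.

```python
def annotate(bindings, variable, role):
    """Pass through only the bindings of one variable, tagged with its role."""
    for row in bindings:
        if variable in row:
            # All other variables in the solution are dropped here.
            yield (role, row[variable])


# Hypothetical solutions produced by graph pattern matching.
rows = [
    {"?s": "paper1", "?s1": "author1"},
    {"?s": "paper2", "?s1": "author1"},
]
print(list(annotate(rows, "?s", "source")))
# [('source', 'paper1'), ('source', 'paper2')]
```

A path computer stage could then consume the tagged source and destination values without seeing the rest of each solution.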


Example Query & DAG

PREFIX akt: <http://www.aktors.org/ontology/portal#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  ?s1 rdf:type akt:Affiliated-Person .
  ?s1 akt:full-name "Wendy E. Mackay" .
  ?s akt:has-author ?s1 .
  ?s2 akt:full-name "Irene Greif" .
  ?s2 akt:has-affiliation ?d .
  ?s ?pathVar ?d .
}

Example GPQ physical query plan

[Figure: Tez DAG with the vertices TypeScanner::?s, TypeScanner::?s1, TypeScanner::?s2, Packager::?s,?s1, Annotator(?s1):0, Annotator(?s1):1, Annotator-Source:[[0,-1]], Annotator-Destination:[[2,1]] and PathComputer::Src:?s,Dst:?d]


Evaluation

Test Setup:

•  We compared our integrated system with an existing platform on
   1.  Expressiveness.
   2.  Query compilation time with and without the path operator.
   3.  Performance.
   4.  Completeness of results.
•  Evaluation was conducted on a single-node server
   §  running HDFS
   §  with a Xeon octa-core x86_64 CPU (2.33 GHz),
   §  40 GB RAM,
   §  two HDDs (3.6 TB and 445 GB).


Issues with Existing Platforms

Neo4j:
•  The fast BFS algorithm is only for finding the shortest path.
•  For finding all paths the slower exhaustive DFS is used.
•  GPQs could not be run on Neo4j as it was running out of resources and crashing.

Stardog:
•  Uses algebraic operators but applies the path filter first and then joins with the graph pattern.
•  This is efficient only if the path filter is highly restrictive.
•  Limited support for path constraints.
•  Hence, we could not compare constrained queries.
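To show why exhaustive all-paths search is costly, here is a generic depth-first enumeration in Python. It is not Neo4j's implementation. The function name `all_simple_paths` is hypothetical, paths are restricted to simple paths (no repeated node) as an assumption so that the enumeration terminates, and the adjacency list follows the edges of the deck's example graph.

```python
def all_simple_paths(adj, source, target):
    """Yield every simple path from source to target as a list of nodes."""
    path = [source]

    def dfs(node):
        if node == target:
            yield list(path)
            return
        for nxt in adj.get(node, ()):
            if nxt not in path:          # never revisit a node on this path
                path.append(nxt)
                yield from dfs(nxt)
                path.pop()               # backtrack

    yield from dfs(source)


# Adjacency of the running example (edge labels omitted).
adj = {2: [3, 4], 3: [1, 4], 1: [4], 4: [5, 7], 5: [6, 7], 7: [8]}
paths = list(all_simple_paths(adj, 2, 7))
print(len(paths))  # 6
```

Even this eight-node graph has six distinct paths from node 2 to node 7, while a shortest-path search would stop at the first two-hop route. The number of paths can grow exponentially with graph size.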


Dataset and Queries

•  Our queries were evaluated on the BTC500M dataset (size 0.5 GB, 2.5 million triples).
•  The queries were formulated to find paths that are at least three hops long.
•  The queries vary from small sets of sources and destinations to very large sets of sources and destinations.

Number of sources and destinations

Query            Sources   Destinations     Query            Sources   Destinations
Small Query 1         25              2     Large Query 1      13641            907
Small Query 2          4              6     Large Query 2      29974          32583
Small Query 3          4              3     Large Query 3      11793              6
Small Query 4         29              7     Large Query 4      29974           2290
Small Query 5         26             31     Large Query 5       2290          32582


Constrained vs. Unconstrained GPQs

[Figure: Small Constrained Query Execution Time Comparison; execution time (seconds, axis 0–15) for SQ1–SQ5, constrained vs. unconstrained query]
[Figure: Large Constrained Query Execution Time Comparison; execution time (seconds, axis 0–1000) for LQ1–LQ5, constrained vs. unconstrained query]

•  Stardog has limited support for path constraints.
•  Neo4j has predicate functions (all, any, exists, none, single) similar to constraints.
•  The constrained queries took longer to complete since they include an extra filtering step.


Query Compilation Time Comparison

•  The path operator does not have much effect on the query compilation time.
•  In most cases the compilation time increased by less than one second.

[Figure: Query compilation time comparison; time (seconds, axis 0–3) for SQ1–SQ5 and LQ1–LQ5, with path operator vs. without path operator]


Performance Evaluation

•  Stardog performs better in terms of absolute time taken.
•  However, for most queries the number of paths found by Stardog is much smaller.
•  Hence, we plotted the time taken per path identified.

[Figure: Small Queries time per path; time per path (milliseconds) for SQ1 (axis 0–1200) and for SQ2–SQ5 (axis 0–140), Sem-Ser vs. Stardog]
[Figure: Large Queries time per path; time per path (milliseconds) for LQ1, LQ3, LQ4 (axis 0–12) and for LQ2*, LQ5 (axis 0–0.45), Sem-Ser vs. Stardog]

•  Algebraic path evaluation is also more amenable to multi-query optimization (MQO).


Completeness of Results

•  Stardog produced incomplete results.
•  BTC has a lot of self-loops: triples like ⟨acm:58567 akt:has-publication-reference acm:58567⟩.
•  Stardog does not consider these triples in its paths.
•  Stardog results also contain duplicate paths.

[Figure: Small Queries Number of paths Comparison; number of paths for SQ2 (axis 0–3000) and for SQ1, SQ3, SQ4, SQ5 (axis 0–350), Sem-Ser vs. Stardog]
[Figure: Large Queries Number of paths Comparison; number of paths for LQ1, LQ3 (axis 0–10000) and for LQ2*, LQ4, LQ5 (axis 0–900000), Sem-Ser vs. Stardog]


Conclusion

•  An algebraic query evaluation strategy for generalized path queries with declaratively defined sources, destinations and constraints.
•  A general framework to integrate any graph pattern matching platform with a path computation platform.
•  An example implementation of an integrated platform.
•  Performance comparison of the integrated platform with a popular existing platform.
•  The work presented here is partially funded by NSF grants IIS-1218277 and CNS-1526113.


Thank You!


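Example Path Sequence and Solve Algorithm

[Figure: example directed labeled graph with nodes 1–8 and edge labels a–k]

Path sequence:
 1: (1, 4, k)          2: (2, 3, b)
 3: (2, 4, i)          4: (3, 4, a•k ∪ c)
 5: (4, 5, d)          6: (4, 7, f)
 7: (5, 6, e ∪ j)      8: (5, 7, h)
 9: (7, 8, g)         10: (3, 1, a)

Solving (s = 1, d):

Initialize: $PE(s, s) = \lambda \rightarrow SA[s]$ and $PE(s, d) = \emptyset \rightarrow SA[d]$ for $d \neq s$.

Step i (iteration i):
$$PE(s, w_i) \cup \big(PE(s, v_i) \bullet PE(v_i, w_i)\big) \rightarrow SA[w_i]$$

Step 1 ($s = 1$, $v_1 = 1$, $w_1 = 4$):
$$PE(1,4) \cup \big(PE(1,1) \bullet PE(1,4)\big) = SA[4] \cup \big(SA[1] \bullet PE(1,4)\big) = \emptyset \cup (\lambda \bullet k) = k \rightarrow SA[4]$$

……

Step 5 ($s = 1$, $v_5 = 4$, $w_5 = 5$):
$$PE(1,5) \cup \big(PE(1,4) \bullet PE(4,5)\big) = SA[5] \cup \big(SA[4] \bullet PE(4,5)\big) = \emptyset \cup (k \bullet d) = k \bullet d \rightarrow SA[5]$$

……

Step 7 ($s = 1$, $v_7 = 5$, $w_7 = 6$):
$$PE(1,6) \cup \big(PE(1,5) \bullet PE(5,6)\big) = SA[6] \cup \big(SA[5] \bullet PE(5,6)\big) = \emptyset \cup \big((k \bullet d) \bullet (e \cup j)\big) = (k \bullet d \bullet e) \cup (k \bullet d \bullet j) \rightarrow SA[6]$$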