Outline
• Problem Statement and Motivation
  – Generalized Path Queries
  – Algebraic Problem Interpretation
• Background
  – Algebraic Path Problem Solving
• Approach
  – Integrating graph pattern matching with algebraic path problem solving
• Evaluation
• Conclusion
Motivating Example
Query: Find relationships between passengers on any flights to Washington DC between June 30 and July 6 who purchased one-way tickets by cash, and countries on the CIA watchlist where there is at least one financial link through any bank.
1. Relationships are paths that must be extracted, not merely matched!
• e.g. different from path expression / property path queries
2. Participating entities are not explicitly stated.
Pattern p1: Passengers on flights to Washington DC between June 30 and July 6 who purchased one-way tickets by cash
Pattern p2: Countries on the CIA watchlist
[Figure: patterns p1 and p2 connected by an unknown relationship, marked "?"]
Existing Problem Interpretation Approaches
• For graph data models like RDF there are multiple ways to approach the problem.
Existing Approaches
• Graph Theoretic Interpretation (holistic)
• Algebraic Interpretation (decomposable)
  – Relational Algebra (Database Approach)
  – Hybrid Approach (Partially relational + graph)
  – Total Algebraic Approach (Relational Algebra* + Path Algebra)
Comparison of Approaches
Graph Theoretic
• Different problems translate to different graph problems, i.e. different algorithms
  – The example is a subgraph homeomorphism problem (NP-hard).
  – Others: Subgraph Isomorphism, Shortest Paths, Disjoint Paths
• Often difficult to parallelize or optimize
• Cannot deal with declarative specifications
Algebraic
• Decompose into smaller operators
  – Reuse is possible
  – Adapt optimization and composition of smaller operators
• Relational algebra not ideal
  – Arbitrary path length requires iteration and fixpoint semantics
• Hybrid approaches exist
  – Pattern matching (algebraic), traversal (graph theoretic)
  – Emphasis on shortest paths for traversals
Query: Find relationships between passengers on any flights to Washington DC between June 30 and July 6 who purchased one-way tickets by cash, and countries on the CIA watchlist where there is at least one financial link through any bank.
The problem is interpreted as a Generalized Path Query (gpq) with
1. Two entity sets declaratively specified as patterns: P1, P2
2. A connection set linking P1, P2 instances
3. A connection constraint on the connection set (edge type membership)
Our Proposal: A Total Algebraic Approach
1: Solvable with algebraic graph pattern matching
2, 3: Solvable with algebraic path finding (plus extension)
Algebraic Path Problem Solving Framework
• An efficient algebraic path problem solving approach was introduced by Tarjan [23, 24].
• Basic Definitions:
• An edge e in a directed labeled graph G = (V, E) is denoted as e = (v1, v2) with label λ(e) = le, where v1, v2 ∈ V and e ∈ E.
• A path p in such a graph G = (V, E) is an alternating sequence of nodes and labeled edges terminating in a node: p = {v1, le1, v2, le2, ..., vn, len, vn+1}, where v1, v2, ..., vn, vn+1 ∈ V and e1, e2, ..., en ∈ E.
• A Path Expression [23, 24] of type (s, d), PE(s, d), is a 3-tuple 〈s, d, R〉, where
  § R is a regular expression over the set of edges defined using union (∪), concatenation (•) and closure (*).
  § The language L(R) of R represents paths from s to d, where the graph contains nodes s and d.
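The definition of a path expression can be made concrete with a small sketch. The class names (Label, Concat, Union, Star, PathExpression) and the language() helper are hypothetical names chosen for illustration and do not come from [23, 24]. Because closure can generate infinitely many paths, the sketch enumerates L(R) only up to a bounded length.

```python
from dataclasses import dataclass

# Hypothetical AST for the regular expression R of a path expression <s, d, R>.
@dataclass(frozen=True)
class Label:
    name: str

@dataclass(frozen=True)
class Concat:
    left: object
    right: object

@dataclass(frozen=True)
class Union:
    left: object
    right: object

@dataclass(frozen=True)
class Star:
    inner: object

@dataclass(frozen=True)
class PathExpression:
    source: int
    dest: int
    regex: object

def language(r, max_len):
    """Return all label sequences in L(r) with at most max_len labels."""
    if isinstance(r, Label):
        return {(r.name,)} if max_len >= 1 else set()
    if isinstance(r, Union):
        return language(r.left, max_len) | language(r.right, max_len)
    if isinstance(r, Concat):
        return {p + q
                for p in language(r.left, max_len)
                for q in language(r.right, max_len)
                if len(p) + len(q) <= max_len}
    if isinstance(r, Star):
        base = language(r.inner, max_len)
        result = {()}          # zero repetitions
        frontier = {()}
        while frontier:        # extend until no new bounded sequence appears
            frontier = {p + q for p in frontier for q in base
                        if q and len(p) + len(q) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"unknown regex node: {r!r}")

# PE(2,7) = <2, 7, ((b.c.f) U (i.f))>, as given on the path encoding slide.
pe_2_7 = PathExpression(
    2, 7,
    Union(Concat(Concat(Label("b"), Label("c")), Label("f")),
          Concat(Label("i"), Label("f"))))

paths_2_7 = language(pe_2_7.regex, max_len=5)
```

Here paths_2_7 holds the two label sequences (b, c, f) and (i, f), one per branch of the union.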
Path Encoding as Path Expressions
PE(2,7) = 〈2, 7, ((b•c•f) ∪ (i•f))〉
PE(1,4) = 〈1, 4, (k)〉
[Figure: directed labeled graph with nodes 1 to 8 and edge labels a to k]
Graph Path Information Representation: A Sequence of Path Expressions
• A Path Sequence (PS) [23, 25] is a unique ordering of path expressions that represents all path information in a graph, such that for any path p,
  § there is a unique partition of p into nonempty subpaths, and
  § a unique sequence of indices I of PS, such that
  § the i-th subpath of p is represented by the path expression in PS at the i-th index in I.
PS: PE1 PE2 PE3 PE4 ... PE100 ... PE155 ... PE390 ... PE1000
Path p: p1 p2 p3 p4 p5
I = {2, 3, 100, 390, 1000}
• The path sequence representation of a graph allows path problems to be solved by a single forward scan.
Path Computation using Path Sequence
• A simple propagation algorithm, SOLVE [23, 25], can be used to solve path problems.
• The SOLVE algorithm assembles path information while scanning the path sequence from left to right.
• At every iteration of the SOLVE algorithm the following step is performed:
PE(s, wi) ∪ (PE(s, vi) • PE(vi, wi)) → SA[wi]
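The propagation step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [23, 25]: it assumes every path expression is closure-free, so each one can be stored as a finite set of label tuples, and the names solve, concat and PATH_SEQUENCE are hypothetical. The input is the ten-entry path sequence from the backup slide "Example Path Sequence and Solve Algorithm".

```python
from collections import defaultdict

def concat(left, right):
    """Concatenate every path in `left` with every path in `right`."""
    return {p + q for p in left for q in right}

def solve(path_sequence, source):
    """Single left-to-right scan: SA[w] <- SA[w] U (SA[v] . PE(v, w))."""
    sa = defaultdict(set)   # SA[d] starts as the empty set for every d != s
    sa[source] = {()}       # PE(s, s) = lambda, the empty path
    for v, w, paths in path_sequence:
        sa[w] |= concat(sa[v], paths)
    return dict(sa)

# Entries (v, w, R) of the example path sequence, with each closure-free R
# expanded into its set of label tuples (assumption of this sketch).
PATH_SEQUENCE = [
    (1, 4, {("k",)}),
    (2, 3, {("b",)}),
    (2, 4, {("i",)}),
    (3, 4, {("a", "k"), ("c",)}),
    (4, 5, {("d",)}),
    (4, 7, {("f",)}),
    (5, 6, {("e",), ("j",)}),
    (5, 7, {("h",)}),
    (7, 8, {("g",)}),
    (3, 1, {("a",)}),
]

sa = solve(PATH_SEQUENCE, source=1)
```

For s = 1 this reproduces the values worked out on that slide: SA[4] = k, SA[5] = k•d and SA[6] = (k•d•e) ∪ (k•d•j).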
Overview of Approach (1)
• We built a prototype by integrating
  § SemStorm [31]:
    o a Hadoop-based file organization storage system
    o supports efficient graph pattern matching query execution using an algebraic query evaluation technique
    o uses Apache Tez as the execution environment.
  § Serpent [25, 27]:
    o platform for finding all paths between a set of sources and destinations.
    o builds on the path algebraic technique using path sequences.
[Figure: components of the motivating query and the system that performs each]
SemStorm performs these components:
• Pattern Matching: P1 (passengers on flights to Washington DC who purchased one-way tickets by cash), P2 (countries on the CIA watchlist)
• Pattern Filter: P1 (between June 30 and July 6)
Serpent performs these components:
• Path Finder: finding paths between P1 and P2
• Path Filter: with at least one financial link
Overview of Approach (2)
1. Query Expression
• Goal: Minimize disruption to existing infrastructure, e.g. parser.
• Solution: Use syntactic sugar to represent the path operator.
2. Query Planning
• Goal: Integrate graph pattern matching with path problem solving.
• Solution: Modify the graph pattern matching query plan by adding algebraic path querying operators to produce a gpq query plan.
3. Query Execution
• Goal: Execute gpqs.
• Solution: Translate the gpq logical query plan into a physical query plan by introducing the required physical query operators.
Introduction of Syntactic Sugar – ?pathVar as a Special Property Variable
• Avoids change to SPARQL's query syntax.
• For triple pattern 〈?s ?pathVar ?d〉,
  § ?pathVar is actually a path variable
  § it is removed and handled specially while the rest of the query is compiled normally as a graph pattern query.
SELECT * WHERE {
  ?s1 rdf:type akt:Affiliated-Person .
  ?s1 akt:full-name "Wendy E. Mackay" .
  ?s akt:has-author ?s1 .
  ?s2 akt:full-name "Irene Greif" .
  ?s2 akt:has-affiliation ?d .
  ?s ?pathVar ?d .
}
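The special handling of ?pathVar can be pictured as a preprocessing step that separates the path triple from the ordinary graph pattern. The sketch below is only an illustration on plain Python tuples; the function name split_path_triples is hypothetical, and the actual prototype works inside the SemStorm query compiler rather than on tuples like these.

```python
def split_path_triples(triples, path_var="?pathVar"):
    """Separate ordinary triple patterns from <?s ?pathVar ?d> path triples.

    Returns (graph_pattern, path_endpoints), where path_endpoints lists the
    (source variable, destination variable) pair of each path triple.
    """
    graph_pattern = [t for t in triples if t[1] != path_var]
    path_endpoints = [(s, o) for s, p, o in triples if p == path_var]
    return graph_pattern, path_endpoints

# Triple patterns of the example query, written as (subject, predicate, object).
QUERY_TRIPLES = [
    ("?s1", "rdf:type", "akt:Affiliated-Person"),
    ("?s1", "akt:full-name", '"Wendy E. Mackay"'),
    ("?s", "akt:has-author", "?s1"),
    ("?s2", "akt:full-name", '"Irene Greif"'),
    ("?s2", "akt:has-affiliation", "?d"),
    ("?s", "?pathVar", "?d"),
]

graph_pattern, path_endpoints = split_path_triples(QUERY_TRIPLES)
```

The five remaining triples can then be compiled as a normal graph pattern query, while the pair (?s, ?d) records which variables supply the path sources and destinations.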
Introduction of Syntactic Sugar – ?pathVar as a Property Variable
[Figure: graph pattern of the query above, with nodes ?s1, ?s, ?s2, ?d, "Wendy E. Mackay" and "Irene Greif", and edges labeled :type, :fn and :ha]
• Challenge: Need to track the path source and destination variables in the graph pattern.
• Solution: Use SemStorm's data structures for tracking variables.
Logical Plan Example as a Tree
Apache Jena Query Plan (SPARQL Syntax Expression – SSE), https://jena.apache.org/
(prefix ((akt:http://www.aktors.org/ontology/portal#)
         (rdf:http://www.w3.org/1999/02/22-rdf-syntax-ns#))
  (project (?s1 ?s ?d)
    (product
      (join
        (BGP [triple ?s1 rdf:type akt:Affiliated-Person]
             [triple ?s1 akt:full-name "Wendy E. Mackay"])
        (BGP [triple ?s akt:has-author ?s1]))
      (BGP [triple ?s2 akt:full-name "Irene Greif"]
           [triple ?s2 akt:has-affiliation ?d]))))
Logical Query Plan as a Tree
[Figure: project → product → { join on ?s1 of (BGP ?s1, BGP ?s), BGP ?d }]
• Disconnected sub-graph patterns lead to a cross-product.
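Logical Plan Transformation Example
• Remove the cross-product and projection operators and introduce path query operators.
[Figure: the plan before transformation (project → product → { join on ?s1 of (BGP ?s1, BGP ?s), BGP ?d }) and after transformation, where the path operator and path filter take the place of the product and project operators]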
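The transformation can be sketched as a small tree rewrite. The nested-tuple plan encoding, the function name to_gpq_plan, and the exact placement of the path filter above the path operator are assumptions made for illustration; the slides show only that the product and project operators are replaced by path query operators.

```python
def to_gpq_plan(plan, source, destination):
    """Rewrite a logical plan: drop project, replace product by path operators.

    Plans are nested tuples such as ("project", vars, child),
    ("product", left, right), ("join", left, right) and ("bgp", name).
    """
    op = plan[0]
    if op == "project":
        # The projection is removed; continue with its child.
        return to_gpq_plan(plan[2], source, destination)
    if op == "product":
        # The cross-product of the two disconnected sub-patterns becomes a
        # path operator between them, wrapped in a path filter (assumed order).
        return ("pathfilter",
                ("pathoperator", source, destination, plan[1], plan[2]))
    return plan

# Logical plan of the example query, encoded as nested tuples.
original_plan = (
    "project", ("?s1", "?s", "?d"),
    ("product",
     ("join", ("bgp", "?s1"), ("bgp", "?s")),
     ("bgp", "?d")),
)

gpq_plan = to_gpq_plan(original_plan, source="?s", destination="?d")
```

The join of the connected sub-patterns is left untouched, so the existing graph pattern matching machinery can still evaluate it.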
Implementation Strategy
• Physical query operators for our platform are Tez vertices.
• The physical query plan is represented by the Tez DAG.
• The following physical query operators were introduced:
  § Annotator Vertices for source, destination and constraint variables identify the source, destination or constraint variables, allowing only the bindings for that variable to pass through.
  § Path Computer Vertex is the path operator, which performs the final path computation.
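The filtering role of an annotator can be illustrated without Tez. In the sketch below, bindings are plain dictionaries from variable names to values, and annotator is a hypothetical generator; the real operators are Tez vertices, and none of the names or sample values here come from the prototype.

```python
def annotator(bindings, variable, role):
    """Pass through only the bindings of `variable`, tagged with its role."""
    for row in bindings:
        if variable in row:
            yield (role, row[variable])

# Made-up solution rows produced by graph pattern matching.
rows = [
    {"?s": "paper1", "?s1": "person1", "?d": "org1"},
    {"?s": "paper2", "?s1": "person1", "?d": "org1"},
]

sources = list(annotator(rows, "?s", "source"))
destinations = sorted(set(annotator(rows, "?d", "destination")))
```

A path computer would then receive only the tagged source and destination values rather than full solution rows.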
Example Query & DAG
PREFIX akt: <http://www.aktors.org/ontology/portal#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  ?s1 rdf:type akt:Affiliated-Person .
  ?s1 akt:full-name "Wendy E. Mackay" .
  ?s akt:has-author ?s1 .
  ?s2 akt:full-name "Irene Greif" .
  ?s2 akt:has-affiliation ?d .
  ?s ?pathVar ?d .
}
Example gpq physical query plan
[Figure: Tez DAG with vertices TypeScanner::?s, TypeScanner::?s1, TypeScanner::?s2, Packager::?s,?s1, Annotator(?s1):0, Annotator(?s1):1, Annotator-Source:[[0,-1]], Annotator-Destination:[[2,1]] and PathComputer::Src:?s,Dst:?d]
Evaluation
Test Setup:
• We compared our integrated system with an existing platform on
  1. Expressiveness.
  2. Query compilation time, with and without the path operator.
  3. Performance.
  4. Completeness of results.
• Evaluation was conducted on a single-node server
  § running HDFS
  § with a Xeon octa-core x86_64 CPU (2.33 GHz),
  § 40 GB RAM,
  § two HDDs (3.6 TB and 445 GB).
Issues with Existing Platforms
Neo4j:
• The fast BFS algorithm is only for finding shortest paths.
• For finding all paths, the slower exhaustive DFS is used.
• Gpqs could not be run on Neo4j, as it was running out of resources and crashing.
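The gap between the two traversal styles can be seen on a small graph. The sketch below is generic textbook BFS and DFS, not Neo4j's implementation; the function names are hypothetical, and the adjacency list is the node-level structure of the eight-node example graph used earlier in the deck.

```python
from collections import deque

def shortest_path_bfs(graph, src, dst):
    """Breadth-first search: stops at the first (shortest) path found."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def all_simple_paths_dfs(graph, src, dst, path=None):
    """Exhaustive depth-first search: enumerates every simple path."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, ()):
        if nxt not in path:   # keep paths simple (no repeated nodes)
            yield from all_simple_paths_dfs(graph, nxt, dst, path)

# Node-level adjacency of the example graph (edge labels omitted).
GRAPH = {1: [4], 2: [3, 4], 3: [4, 1], 4: [5, 7], 5: [6, 7], 7: [8]}

shortest = shortest_path_bfs(GRAPH, 2, 8)
all_paths = list(all_simple_paths_dfs(GRAPH, 2, 8))
```

BFS returns one path after visiting each node at most once, while the exhaustive search must walk every branch; the number of simple paths can grow exponentially with graph size, which is consistent with the resource exhaustion noted above.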
Stardog:
• Uses algebraic operators, but applies the path filter first and then joins with the graph pattern.
• This is efficient only if the path filter is highly restrictive.
• Limited support for path constraints.
• Hence, we could not compare constrained queries.
Dataset and Queries
• Our queries were evaluated on the BTC500M dataset (size 0.5 GB, 2.5 million triples).
• The queries were formulated to find paths that are at least three hops long.
• The queries vary from small sets of sources and destinations to very large sets of sources and destinations.
Number of Sources and Destinations
Query          Sources  Destinations   Query          Sources  Destinations
Small Query 1  25       2              Large Query 1  13641    907
Small Query 2  4        6              Large Query 2  29974    32583
Small Query 3  4        3              Large Query 3  11793    6
Small Query 4  29       7              Large Query 4  29974    2290
Small Query 5  26       31             Large Query 5  2290     32582
Constrained vs. Unconstrained GPQs
• Stardog has limited support for path constraints.
• Neo4j has predicate functions (all, any, exists, none, single) similar to constraints.
• The constrained queries took longer to complete, since they include an extra filtering step.
[Figure: Large Constrained Query Execution Time Comparison; constrained vs. unconstrained query for LQ1 to LQ5; axis: execution time (seconds), 0 to 1000]
[Figure: Small Constrained Query Execution Time Comparison; constrained vs. unconstrained query for SQ1 to SQ5; axis: execution time (seconds), 0 to 15]
Query Compilation Time Comparison
• The path operator does not have much effect on the query compilation time.
• In most cases the compilation time increased by less than one second.
[Figure: Query compilation time comparison, with and without the path operator, for SQ1 to SQ5 and LQ1 to LQ5; axis: time (seconds), 0 to 3]
Performance Evaluation
• Stardog performs better in terms of absolute time taken.
• However, for most queries the number of paths found by Stardog is much smaller.
• Hence, we plotted the time taken per path identified.
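The normalization is a simple ratio of execution time to the number of paths returned. The sketch below uses made-up numbers purely to show why the metric can reverse a ranking based on absolute time; none of the values are measurements from this evaluation, and time_per_path_ms is a hypothetical helper.

```python
def time_per_path_ms(execution_time_s, num_paths):
    """Execution time divided by the number of paths found, in milliseconds."""
    if num_paths == 0:
        return float("inf")   # no paths found: the metric is undefined
    return execution_time_s * 1000.0 / num_paths

# Hypothetical figures for illustration only.
system_a = time_per_path_ms(execution_time_s=10, num_paths=5000)  # slower overall
system_b = time_per_path_ms(execution_time_s=4, num_paths=400)    # faster overall
```

With these illustrative inputs, system A takes 2.0 ms per path and system B takes 10.0 ms per path, even though B finishes sooner in absolute terms.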
[Figure: Small queries, time per path (milliseconds), Sem-Ser vs. Stardog; panels: SQ1 (axis 0 to 1200) and SQ2 to SQ5 (axis 0 to 140)]
[Figure: Large queries, time per path (milliseconds), Sem-Ser vs. Stardog; panels: LQ1, LQ3, LQ4 (axis 0 to 12) and LQ2*, LQ5 (axis 0 to 0.45)]
• Algebraic path evaluation is also more amenable to multi-query optimization (MQO).
Completeness of Results
• Stardog produced incomplete results.
• BTC has a lot of self-loops: triples like 〈acm:58567 akt:has-publication-reference acm:58567〉.
• Stardog does not consider these triples in its paths.
• Stardog results also contain duplicate paths.
[Figure: Small queries, number of paths comparison, Sem-Ser vs. Stardog; panels: SQ2 (axis 0 to 3000) and SQ1, SQ3, SQ4, SQ5 (axis 0 to 350)]
[Figure: Large queries, number of paths comparison, Sem-Ser vs. Stardog; panels: LQ1, LQ3 (axis 0 to 10000) and LQ2*, LQ4, LQ5 (axis 0 to 900000)]
Conclusion
• An algebraic query evaluation strategy for generalized path queries with declaratively defined sources, destinations and constraints.
• A general framework to integrate any graph pattern matching platform with a path computation platform.
• An example implementation of an integrated platform.
• Performance comparison of the integrated platform with a popular existing platform.
• The work presented here is partially funded by NSF grants IIS-1218277 and CNS-1526113.
Thank You!
Example Path Sequence and SOLVE Algorithm
[Figure: directed labeled graph with nodes 1 to 8 and edge labels a to k]
Path sequence:
1: (1,4,k)      2: (2,3,b)
3: (2,4,i)      4: (3,4,a•k∪c)
5: (4,5,d)      6: (4,7,f)
7: (5,6,e∪j)    8: (5,7,h)
9: (7,8,g)      10: (3,1,a)
Solving (s=1, d):
Initialize: PE(s,s) = λ → SA[s]; PE(s,d) = ∅ → SA[d] for d ≠ s
Step i (iteration i), for entry PE(vi,wi): PE(s,wi) ∪ (PE(s,vi) • PE(vi,wi)) → SA[wi]
Step 1 (s=1, v1=1, w1=4): PE(1,4) ∪ (PE(1,1) • PE(1,4)) = SA[4] ∪ (SA[1] • PE(1,4)) = ∅ ∪ (λ • k) = k → SA[4]
……
Step 5 (s=1, v5=4, w5=5): PE(1,5) ∪ (PE(1,4) • PE(4,5)) = SA[5] ∪ (SA[4] • PE(4,5)) = ∅ ∪ (k • d) = k•d → SA[5]
……
Step 7 (s=1, v7=5, w7=6): PE(1,6) ∪ (PE(1,5) • PE(5,6)) = SA[6] ∪ (SA[5] • PE(5,6)) = ∅ ∪ ((k•d) • (e∪j)) = (k•d•e) ∪ (k•d•j) → SA[6]