SPARQLeR: Extended Sparql for Semantic Association Discovery Krzysztof Kochut and Maciej Janik Work supported by the National Science Foundation Grant No. IIS-0325464, entitled “SemDIS: Discovering Complex Relationships in the Semantic Web”. ESWC 2007, Innsbruck, Austria June 4, 2007




Computer Science Department, University of Georgia

Paths in RDF

Directed path

Undirected path

Undirected path, but with specific properties and directionality


Why are paths interesting?

• A path describes how entities are related.
  – Relationships on the path define the meaning of this connection.
  – Entities on the path specify the content.
• Do you have migraine? Try taking magnesium!
  – Path discovered by Dr. D. R. Swanson from partial information available in PubMed publications
    • stress can lead to loss of magnesium in the human body
    • migraine patients seem to be experiencing stress
    … that’s why …
    • migraine could lead to a loss of magnesium, so … take magnesium to fight migraine!

Swanson, D.R. Migraine and Magnesium: Eleven Neglected Connections. Perspectives in Biology and Medicine, 31 (4). 526-557.


Formally, what is a simple path?

• Simple directed path between resources $r_0$ and $r_n$ in a description base $R$:
  – a sequence $r_0\, p_1\, r_1\, p_2\, r_2, \ldots, p_{n-1}\, r_{n-1}\, p_n\, r_n$ $(n > 0)$
  – $r_0\, p_1\, r_1$, $r_1\, p_2\, r_2$, …, $r_{n-2}\, p_{n-1}\, r_{n-1}$, $r_{n-1}\, p_n\, r_n$ $(n > 0)$ are triples in $R$
  – all of the resources $r_i$ $(0 \le i \le n)$ in the path are distinct
• Simple undirected path between resources $r_0$ and $r_n$ in $R$:
  – a sequence $r_0\, p_1\, r_1\, p_2\, r_2, \ldots, p_{n-1}\, r_{n-1}\, p_n\, r_n$ $(n > 0)$
  – for each $r_{i-1}\, p_i\, r_i$ $(0 < i \le n)$ in the path, either $r_{i-1}\, p_i\, r_i$ or $r_i\, p_i\, r_{i-1}$ is a triple in $R$
  – all of the resources $r_i$ $(0 \le i \le n)$ in the path are distinct
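The two definitions can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_simple_path`, the tuple representation of triples, and the sample description base `R` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def is_simple_path(seq: List[str], triples: Set[Triple], directed: bool = True) -> bool:
    """Check whether seq = [r0, p1, r1, ..., pn, rn] is a simple path in `triples`."""
    # A path needs at least one property (n > 0) and an odd number of elements.
    if len(seq) < 3 or len(seq) % 2 == 0:
        return False
    resources = seq[0::2]
    # All resources r0..rn must be distinct.
    if len(set(resources)) != len(resources):
        return False
    for i in range(1, len(seq), 2):
        prev, prop, nxt = seq[i - 1], seq[i], seq[i + 1]
        forward = (prev, prop, nxt) in triples
        backward = (nxt, prop, prev) in triples
        # Directed: the triple must point forward. Undirected: either way is fine.
        if not (forward or (not directed and backward)):
            return False
    return True


# Hypothetical description base R with three triples.
R = {("a", "p", "b"), ("c", "q", "b"), ("c", "p", "d")}

print(is_simple_path(["a", "p", "b"], R))
print(is_simple_path(["a", "p", "b", "q", "c"], R))
print(is_simple_path(["a", "p", "b", "q", "c"], R, directed=False))
```

In the sample data the triple with property `q` points from `c` to `b`, so the sequence a p b q c qualifies only as an undirected path.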


Paths and SPARQL

• A SPARQL query can express only static graph patterns.
  – Some flexibility is introduced by an OPTIONAL part, but it does not solve path problems.
• No support for flexible-length path expressions.
  – A glycan biosynthesis pathway in biology has a specific pattern (properties), but its length may be unknown.
  – In path discovery, both the length and the pattern of the path may be unknown, as in Dr. Swanson’s example.


What do we need to discover paths?

• Knowledge discovery needs more flexible patterns.
  – Patterns may be partially known or even unknown (unrestricted path).
  – Properties on the path, their order and directionality create a specific meaning.
  – Entities on the path provide content.
  – Relationships to entities outside of the path give an additional context.


Proposed extensions

• A path may have a flexible length
  – For computational reasons, length is limited.
• Constraints on properties
  – Specific properties must appear in the path.
  – Their order and directionality are meaningful.
  – They can form a repeating pattern.
• Constraints on resources
  – Specific resources must be on the path.
  – They can be anywhere on the path or at specific positions.


SPARQLeR

• Extension of SPARQL for semantic association discovery.

• Seamlessly integrated into the SPARQL syntax.

• Graph patterns incorporating simple paths with constraints.

• Constraints are based on regular expressions over properties.


What is a path in SPARQLeR?

• Path is a meta-property that connects two resources.
  – Defined as a sequence of interleaving properties and resources.
  – Starts and ends with properties (endpoint resources are not included).
  – A path of length 1 is a sequence with just one property.

<rdfs:Class rdf:about="http://meta.org/rdf-meta-schema#Path">
  <rdfs:isDefinedBy rdf:resource="http://meta.org/rdf-meta-schema#"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq"/>
  <rdfs:label>Path</rdfs:label>
  <rdfs:comment>The class of RDFMS paths.</rdfs:comment>
</rdfs:Class>


Path patterns in SPARQLeR

• Meta-property – similar concept to a property
  – Resource –[property] Resource
  – Resource –[path] Resource
• Path as a Sequence
  – Test if a resource is in the path:
    • rdfs:member
  – Test if a resource is at a specific position in the path:
    • rdf:_2, rdf:_4, ...
• SPARQLeR-specific path properties
  – Test all resources or all properties in the path:
    • rdfms:entityResource and rdfms:propertyResource

Example: all resources on a path must be of type foo:Person


Path pattern anatomy

[Figure: a path of length 4 (7 elements, numbered 1 to 7) built from properties p1, p2 and p3, annotated with the path patterns that can match a path variable: rdfs:member, rdf:_3, rdf:_6, rdfms:entityResource and rdfms:propertyResource.]
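The positional view of a path can be sketched with an ordinary list. This is an illustrative model only, not the SPARQLeR implementation: the helper names and the sample element names are hypothetical, and it assumes that rdfs:member may match any element of the sequence, that odd positions hold properties, and that even positions hold the intermediate resources.

```python
from typing import List

# A hypothetical path of length 4: four properties interleaved with three
# intermediate resources (the endpoint resources are not part of the path).
path: List[str] = ["p1", "r1", "p2", "r2", "p3", "r3", "p4"]


def length(path: List[str]) -> int:
    """Number of properties on the path (7 elements -> length 4)."""
    return (len(path) + 1) // 2


def member(path: List[str], element: str) -> bool:
    """Rough counterpart of `%path rdfs:member <e>`."""
    return element in path


def at(path: List[str], n: int) -> str:
    """Rough counterpart of `%path rdf:_n ?x`; positions are 1-based."""
    return path[n - 1]


def property_resources(path: List[str]) -> List[str]:
    """Elements at odd positions, as tested by rdfms:propertyResource."""
    return path[0::2]


def entity_resources(path: List[str]) -> List[str]:
    """Elements at even positions, as tested by rdfms:entityResource."""
    return path[1::2]


print(length(path), len(path))
print(at(path, 3), at(path, 6))
print(all(r.startswith("r") for r in entity_resources(path)))
```

The last line mirrors the "all resources on a path must be of type foo:Person" example: a condition attached to rdfms:entityResource has to hold for every resource on the path, not just one.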


Path types in SPARQLeR

• Directionality of relationships in the path defines its specific semantics.
• SPARQLeR allows definition of the following path types:
  – As defined in graph theory
    • Directed
    • Undirected
  – SPARQLeR-specific extension
    • Defined directionality path (includes directed path)


Directionality of properties in path

• Defined directionality paths:
  – Neither directed nor undirected
  – Each property in a path has a specified directionality
• Example: simple graph with p relationship
  (a) X p* Y, directed path
  (b) X p* Y, undirected path
  (c) X ( p $p^{-1}$ )* Y, directional path

[Figure: the three path types (a), (b) and (c) between X and Y, drawn over chains of p edges.]
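The difference between the three path types can be illustrated by encoding each hop as a single character and testing the result with an ordinary regular expression. This is a sketch under assumptions: the encoding ('f' for a p edge followed forward, 'b' for one followed backward), the helper names and the sample graph are hypothetical, and Python's `re` module stands in for the SPARQLeR matcher.

```python
import re
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (subject, object); every edge is labelled p


def steps(nodes: List[str], edges: Set[Edge]) -> str:
    """Encode a walk as 'f' (edge followed forward) or 'b' (followed backward)."""
    out = []
    for u, v in zip(nodes, nodes[1:]):
        if (u, v) in edges:      # forward wins if both directions exist
            out.append("f")
        elif (v, u) in edges:
            out.append("b")
        else:
            raise ValueError(f"no p edge between {u} and {v}")
    return "".join(out)


# One pattern per path type from the example above.
PATTERNS = {
    "directed p*": r"f*",
    "undirected p*": r"[fb]*",
    "directional (p p^-1)*": r"(fb)*",
}


def matching_types(encoded: str) -> List[str]:
    """Names of the path types whose pattern accepts the encoded walk."""
    return [name for name, pat in PATTERNS.items() if re.fullmatch(pat, encoded)]


# Hypothetical graph: X -> a <- b -> c <- Y
edges = {("X", "a"), ("b", "a"), ("b", "c"), ("Y", "c")}
encoded = steps(["X", "a", "b", "c", "Y"], edges)
print(encoded, matching_types(encoded))
```

For this graph no directed path connects X and Y, while the undirected and the defined-directionality patterns both accept the walk; only the latter also records that the edges alternate in direction.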


Inverse property operator

• In standard SPARQL there is no need for an inverse property operator
  – Pattern syntax is based on individual statements, so it is easy to reverse direction.
• Defining path constraints requires the inverse operator
  – A path expression defines constraints on properties, not on individual statements.
  – Without the inverse property operator some path constraints would be impossible to express (as shown in the previous example).


RegExp in path constraints

• Path constraints on properties are based on regular expressions
  – Uses syntax similar to lex
  – Easy for grep users
• Examples:
    a c* d
    a+ (b|c) a
    [abc] c? d
    ( b $a^{-1}$ )+ c


Path constraints in SPARQLeR

• Defined as regular path expressions
  – Can specify patterns of properties in the path
  – Directionality requirement needs the inverse operator (‘-’ minus): -p
• Supported regular expressions

    p                    single property
    -p                   the inverse of p
    [p1 p2 ... pn]       class of properties
    -[p1 p2 ... pn]      class of inverse properties
    [^p1 p2 ... pn]      complement of properties
    -[^p1 p2 ... pn]     inverse of complement of properties
    .                    wildcard
    x | y                alternative
    xy                   concatenation
    x*                   Kleene star
    x+                   one or more repetitions
    (x)                  match a path matched by x


Path constraints (cont’d)

• Class of properties and inverse operator
  – Complement operator can be applied only to defined properties, not their inverses
  – Inverse operator
    • Not allowed inside a class of properties
    • The set of inverses is created from the defined properties
  – Example, with defined properties q r s t:

      [^rt]               matches q, s
      -[^qr]              matches $t^{-1}$, $s^{-1}$ (inverses)
      ([^st] | -[^t])     matches q, r, $q^{-1}$, $r^{-1}$, $s^{-1}$
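The example can be reproduced with plain set operations: complement first, relative to the defined properties, then inversion of the resulting set. This is an illustrative sketch only; the function `prop_class`, its keyword arguments and the `(name, inverse)` pair representation are assumptions, not SPARQLeR syntax or API.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Step = Tuple[str, bool]  # (property name, True if the inverse is meant)

# The defined properties from the example.
PROPERTIES: FrozenSet[str] = frozenset({"q", "r", "s", "t"})


def prop_class(names: Iterable[str], *, complement: bool = False,
               inverse: bool = False) -> Set[Step]:
    """Expand [..], [^..], -[..] or -[^..] into the set of steps it accepts."""
    chosen = set(names)
    if complement:
        # The complement is taken over the defined properties only.
        chosen = set(PROPERTIES) - chosen
    # The inverse operator applies to the whole class, after complementing.
    return {(name, inverse) for name in chosen}


def show(steps: Set[Step]) -> str:
    """Render a set of steps, writing inverses as name^-1."""
    return " ".join(sorted(f"{n}^-1" if inv else n for n, inv in steps))


a = prop_class(["r", "t"], complement=True)                  # [^rt]
b = prop_class(["q", "r"], complement=True, inverse=True)    # -[^qr]
c = (prop_class(["s", "t"], complement=True)                 # ([^st] | -[^t])
     | prop_class(["t"], complement=True, inverse=True))

for steps in (a, b, c):
    print(show(steps))
```

Applying the complement before the inverse is what keeps `-[^qr]` inside the inverses of the defined properties, which matches the rule that the complement operator is never applied to inverses directly.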


Integrating paths into SPARQL

• Path variable binds a path
  – Name begins with ‘%’ instead of ‘?’

• Simple patterns – path between two resources

SELECT ?prop WHERE {<r> ?prop <s>}

SELECT %path WHERE {<r> %path <s>}

• Single source path

SELECT %path, ?res WHERE {<r> %path ?res}


Integrating paths into SPARQL

• Resources on the path

SELECT %path WHERE {<r> %path <s> . %path rdfs:member <e>}

SELECT %path WHERE {<r> %path <s> . %path rdf:_1 <p>}

• Listing path elements – list operator

SELECT list(%path) WHERE {<r> %path <s>}


Expressing path constraints

• Bounded path length – only constants allowed

FILTER(length(%path)<5)

FILTER(length(%path)>3 && length(%path)<7)


Expressing path constraints

• Constraints added as a regular expression filter (existing syntax in SPARQL)

regex( pathvariable, pathexpr, pathflags )

FILTER(regex(%path,".*foo:prop.*","uis"))

– Flags:
    i (instances)
    s (schema)
    l (literals)
    h (match using hierarchy)
    d (set directionality)
    u (undirected)
– Default flags: d i


Some examples

SELECT list(%path), ?res
WHERE { <r> %path ?res .
        %path rdfs:member ?x .
        ?x foo:locatedIn wiki:Europe
        FILTER(regex(%path, "foo:prop+")) }

SELECT list(%path)
WHERE { <r> %path <s> .
        %path rdfms:entityResource ?x .
        ?x rdf:type foo:Person
        FILTER(regex(%path, "(foo:prop|foo:rel)+", "u")) }

SELECT list(%path)
WHERE { <r> %path <s>
        FILTER(length(%path) <= 6 && length(%path) >= 4 &&
               regex(%path, "(foo:prop -foo:rel)+")) }


SPARQLeR Prototype Implementation

• Prototype implementation is based on BRAHMS – RDF/S main memory storage.

• Path search based on a bi-directional BFS for simple paths.

• Checking of path constraints in regex is implemented as a simulation of DFAs.

Janik, M. and Kochut, K., BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery. ISWC 2005


Implementation details

• Each path expression (FILTER regex) is translated into a DFA.
  – For a path between two resources, partial constraints are checked while building the search trie from both endpoints, using forward and reverse DFAs.
  – When a path is connected, the forward DFA is used to check the full (path) constraint.
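The idea of checking constraints while the search frontier grows can be sketched as follows. This is a deliberately simplified illustration and not the BRAHMS-based prototype: it searches from a single source only (the prototype is described as bi-directional), the DFAs are written by hand rather than compiled from a regex, and every name and the sample citation data are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]        # (subject, property, object)
Symbol = Tuple[str, bool]            # (property, True if traversed as an inverse)
DFA = Dict[Tuple[int, Symbol], int]  # a missing key means "dead state"


def search(triples: List[Triple], source: str, dfa: DFA, start: int,
           accepting: Set[int], max_len: int) -> List[List[str]]:
    """Breadth-first search for simple paths accepted by `dfa`, up to max_len hops."""
    # Index every triple in both directions so inverse steps can be followed.
    neighbours: Dict[str, List[Tuple[Symbol, str]]] = {}
    for s, p, o in triples:
        neighbours.setdefault(s, []).append(((p, False), o))
        neighbours.setdefault(o, []).append(((p, True), s))

    results: List[List[str]] = []
    queue = deque([([source], start)])  # ([r0, p1, r1, ...], current DFA state)
    while queue:
        seq, state = queue.popleft()
        if len(seq) > 1 and state in accepting:
            results.append(seq)
        if len(seq) // 2 >= max_len:    # len(seq) // 2 is the number of hops
            continue
        visited = set(seq[0::2])        # resources already on this path
        for symbol, nxt in neighbours.get(seq[-1], []):
            nxt_state = dfa.get((state, symbol))
            # Prune as soon as the DFA rejects the step or a resource repeats.
            if nxt_state is None or nxt in visited:
                continue
            label = ("-" if symbol[1] else "") + symbol[0]
            queue.append((seq + [label, nxt], nxt_state))
    return results


# Hypothetical citation data.
triples = [("A", "cites", "B"), ("B", "cites", "C"),
           ("C", "cites", "A"), ("D", "cites", "B")]

# Hand-built DFA for (cites)*: one state that loops on a forward cites step.
transitive = search(triples, "A", {(0, ("cites", False)): 0}, 0, {0}, max_len=3)

# Hand-built DFA for "cites -cites": forward step, then an inverse step.
co_citing = search(triples, "A",
                   {(0, ("cites", False)): 1, (1, ("cites", True)): 2},
                   0, {2}, max_len=2)
print(transitive)
print(co_citing)
```

Because a step is dropped the moment the DFA has no transition for it, branches that can never satisfy the expression are not expanded, which is consistent with the later observation that performance depends on how selective the expression is.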


Experiments: biology pathway

• Biosynthesis paths in biology (glycomics)
• How is a specific glyco peptide created from a basic structure?
  – Find pathway between dolichol phosphate and glyco peptide G00009
• Path has 15 reactions (30 hops, as each reaction is represented by its substrates and products)
• Only an undirected path connects the endpoint resources, but a specific directionality pattern is present

[Figure: RDF representation of sample reactions in the path]


Experiments: biology pathway

• Functionality test - proof of concept

N-glycan biosynthesis pathway

SELECT list(%path)
WHERE { glyco:dolichol_phosphate %path glyco:glyco_peptide_G00009 .
        %path rdfs:member enzyo:R05969
        FILTER ( length(%path) <= 30 &&
                 regex(%path, "((-glyco:has_acceptor_substrate | -glyco:has_reactant) glyco:has_product)*" ) ) }

Ontology: GlycO
Length: 30 hops
Consists of: 15 reactions
Search time: milliseconds (less than 1 tick)...

courtesy of Dr. Alison Vandersall-Nairn, University of Georgia


Experiments

• Scalability
  – Modified DBLP datasets in RDF (added random citations)
  – Test on increasing dataset (adding older years of publications)
  – Search for cited publications (transitive)

PREFIX opus: <http://lsdis.cs.uga.edu/projects/semdis/opus#>

SELECT ?end_publication
WHERE { <http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06> %path ?end_publication
        FILTER ( length(%path) <= 26 &&
                 regex(%path, "(opus:cites_publication)*" ) ) }

B. Aleman-Meza et al. Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. (WWW2006)


Experiments – dataset characteristics


Experiments – results: single source paths

Search paths up to length 26


Experiments – results: two endpoint paths


More complex uses of path expressions

• Discover connecting paths with a shared node
  – Path between A and B, length up to 4
  – Path between C and D, length up to 4
  – Both paths have a shared resource

[Figure: a path from A to B and a path from C to D intersecting at a shared resource ?x]

    A %path_1 B    length(%path_1) <= 4
    C %path_2 D    length(%path_2) <= 4
    %path_1 rdfs:member ?x
    %path_2 rdfs:member ?x

Potential subgraph discovery


SPARQLeR summary

• Path expressions
  – use of regular expressions over properties
• Flexible path specification
  – Undirected
  – Defined directionality paths
    • Directed
  – Length restricted
• Complex path patterns
  – Test of resources and properties on the path
  – Intersecting paths


Conclusion and future work

• SPARQLeR extension fits seamlessly into the current SPARQL syntax.

• Performance of path queries is acceptable (if defined expression is highly selective).

• Optimization of path queries, complex expressions and multiple paths in query.

• Inclusion of context.


SPARQLeR Krys Kochut, Maciej Janik

Thank you