CASS-MT Review: 6-Apr-2011
Task 3: Semantic Databases on the XMT
PNNL: David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn
Cray: David Mizell
SNL: Eric Goodman, Edward Jimenez, Greg Mackey
Recap from August Review
- We built a simple automatic query translator
  - Much of the work was done by hand
- Lessons learned from experiments:
  - Query optimization must happen early and often
  - An efficient semantic search engine will almost certainly need data-driven and query-driven optimization
- Now that BTC is largely behind us, we are continuing forward with these goals
Query Optimization Research Agenda
Prerequisites for query optimization research:
- Build out an end-to-end query engine
  - Enables: validation, measurement, profiling
- Build a simple research compiler
  - Enables: rapid prototyping, attribute aggregation
Not to be construed as standing up a product
- Glue code is not engineered to be robust
- Compiler is a first pass at using intermediate forms
A Modular Query Engine
Progress: Data Ingest
Portability important
- Using MTGL on multiple systems:
  - Cray XMT Threadstorm nodes
  - Cray XMT service node (Linux)
  - Cray CX-1
- Endian-ness is an issue for storing/retrieving binary triples
  - Ensure the first triple has a small (< 2^32) “Subject”
  - Upon reading triples, if the first integer is >= 2^32, swap bytes on all integers read in. Do this on all systems.

Swap 100,000,000 uint64_t (identical code compiled and run on each platform):

Platform       Time
Cray CX-1      3.59s
XMT Login      5.06s
XMT 16 Proc    5.07s
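The byte-order check described on this slide can be sketched as follows. This is a minimal illustration in Python, not the MTGL-based code that was timed above; the function names `byteswap64` and `read_triples` and the in-memory buffer layout (a flat run of unsigned 64-bit integers, three per triple) are assumptions made for the example.

```python
import struct
import sys

LIMIT = 1 << 32  # the first subject is assumed to be below 2^32


def byteswap64(value):
    """Reverse the byte order of one unsigned 64-bit integer."""
    return int.from_bytes(value.to_bytes(8, "little"), "big")


def read_triples(raw):
    """Decode a buffer of uint64 triples, swapping bytes when needed.

    Hypothetical helper: assumes len(raw) is a multiple of 24 bytes.
    """
    count = len(raw) // 8
    # "=" reads the integers in the byte order of the machine running the code.
    values = list(struct.unpack(f"={count}Q", raw))
    # A first integer >= 2^32 suggests the file was written with the opposite
    # byte order, so every integer read in is swapped.
    if values and values[0] >= LIMIT:
        values = [byteswap64(v) for v in values]
    return [tuple(values[i:i + 3]) for i in range(0, count, 3)]
```

The check relies on the first subject being non-zero as well as small: a non-zero value below 2^32 has its low four bytes moved into the high four bytes when read with the wrong byte order, which pushes it to 2^32 or above, whereas zero looks the same in either order.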
Progress: Graph Representation
Work in progress: use MTGL and sample code from SPEED-MT to build out components:
- Build a compressed_sparse_row<BidirectionalS> representation of the RDF graph
- Focus on an ability to memory map graph data structures for fast reloading (rememd).
- Adapt Search-Space Recursive Descent code (described in August, 2010 review) to the MTGL-based data structure.
- Redesign Dictionary Encoding storage on disk:
  - Use only one file that supports an “Endian-ness hack”
  - Avoid the need to parse strings from char-array to rebuild in-memory data dictionary.
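To make the compressed sparse row idea concrete, here is a small Python sketch of a bidirectional CSR index over integer-encoded triples. It is not MTGL's compressed_sparse_row class; the function names and the choice to store predicates as per-edge labels are assumptions made for illustration.

```python
from itertools import accumulate


def build_csr(num_vertices, edges):
    """Build a CSR index from (source, label, target) edges.

    Returns (offsets, targets, labels): the edges leaving vertex v occupy
    positions offsets[v] .. offsets[v + 1] - 1 of targets and labels.
    """
    degree = [0] * num_vertices
    for src, _, _ in edges:
        degree[src] += 1
    offsets = [0] + list(accumulate(degree))
    cursor = offsets[:-1].copy()  # next free slot for each vertex
    targets = [0] * len(edges)
    labels = [0] * len(edges)
    for src, label, dst in edges:
        slot = cursor[src]
        targets[slot] = dst
        labels[slot] = label
        cursor[src] += 1
    return offsets, targets, labels


def build_bidirectional_csr(num_vertices, triples):
    """Index (subject, predicate, object) triples in both directions."""
    forward = build_csr(num_vertices, triples)
    reverse = build_csr(num_vertices, [(o, p, s) for s, p, o in triples])
    return forward, reverse


def neighbors(csr, vertex):
    """Return the (label, neighbor) pairs stored for one vertex."""
    offsets, targets, labels = csr
    span = range(offsets[vertex], offsets[vertex + 1])
    return [(labels[i], targets[i]) for i in span]
```

Because the whole index is three flat integer arrays per direction, a layout of this kind lends itself to being written to disk and memory mapped for fast reloading, which is the property the slide emphasizes.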
A Transformation-oriented Query Compiler
Progress: Parser
Query Parsing
- SPARQL 1.0 implemented as an ANTLR LL(*) grammar
- Tested using the SPARQL Performance Benchmark (SP2Bench) and Data Access Working Group (DAWG) tests
  - Currently passes 175 of 214 tests (81.8%)
- We are not currently working to improve the coverage
  - SPARQL parsing is not a priority
  - We needed enough coverage to get interesting properties (OPTIONAL, UNION, FILTER, blank nodes, etc.)
Progress: Intermediate Representation
Query language is not amenable to optimization
- So we lower into a more comfortable form
GPIR: Graph Pattern Intermediate Representation
- Query is represented as a graph
- Entity references are unified (all ?x refer to the same thing)
- Entities are tagged with language attributes
  - e.g., all triples from a UNION statement are tagged with a union ID and a common union group ID
EPIR: Execution Plan Intermediate Representation
- Still very much a work-in-progress
Progress: Intermediate Representation
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?predicate
WHERE {
  { ?person rdf:type foaf:Person . ?subject ?predicate ?person }
  UNION
  { ?person rdf:type foaf:Person . ?person ?predicate ?object }
}

Parses to:

# Query Graph <21148736> in GPIR
9:0-5-8
3:0-1-2
6:4-5-0
7:0-1-2
%
0:variable:T
0:label:"?person"
1:label:"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
2:label:"http://xmlns.com/foaf/0.1/Person"
3:union_group:1
3:optional:F
3:union_id:0
4:variable:T
4:label:"?subject"
5:variable:T
5:select:T
5:label:"?predicate"
6:union_group:1
6:optional:F
6:union_id:0
7:union_group:1
7:optional:F
7:union_id:1
8:variable:T
8:label:"?object"
9:union_group:1
9:optional:F
9:union_id:1
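One way to read the GPIR listing for this query is that nodes and edges share a single ID space, IDs are handed out in order of first appearance, and repeated terms such as ?person reuse their existing ID. The Python sketch below reproduces the edge numbering of the listing under that reading. It is an illustration only, not the research compiler's code: the function name `build_query_graph`, the prefixed-name labels, and the fixed union group ID of 1 are assumptions, and the select attribute is omitted.

```python
def build_query_graph(branches):
    """Turn UNION branches of triple patterns into numbered nodes and edges.

    branches: list of branches, each a list of (subject, predicate, object)
    strings. Returns (edges, attrs): edges maps an edge ID to a tuple of three
    node IDs, and attrs maps any ID to its attribute dict.
    """
    ids = {}    # term text -> node ID, so repeated terms are unified
    edges = {}
    attrs = {}
    next_id = 0
    for union_id, patterns in enumerate(branches):
        for triple in patterns:
            nodes = []
            for term in triple:
                if term not in ids:
                    ids[term] = next_id
                    attrs[next_id] = {"label": term}
                    if term.startswith("?"):
                        attrs[next_id]["variable"] = True
                    next_id += 1
                nodes.append(ids[term])
            # Each triple pattern becomes an edge tagged with its union branch.
            edges[next_id] = tuple(nodes)
            attrs[next_id] = {
                "union_group": 1,  # assumed: a single UNION in the query
                "union_id": union_id,
                "optional": False,
            }
            next_id += 1
    return edges, attrs
```

Applied to the two branches of the query, the sketch yields edges 3, 6, 7 and 9 connecting the same node IDs as in the listing, with ?person unified as node 0 across both branches.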
Progress: Transforms
Transforms operate on an IR
- Input and output are the same format, so they can be chained
Example transform: xform_rem_uni
- Removes union group attributes which only have one member
- Think of this as algebraic simplification on math expressions (A+0 => A), except for SPARQL UNION statements
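A transform in the spirit of xform_rem_uni can be sketched in Python as below. This is not the compiler's implementation: the attribute-dict representation and the function name are assumptions, and "one member" is interpreted here as a union group with only one distinct union ID (a single branch).

```python
from collections import defaultdict


def remove_single_member_unions(attrs):
    """Return a copy of attrs without union tags for one-branch union groups.

    attrs maps an entity ID to a dict of attributes; entities that belong to a
    UNION carry "union_group" and "union_id" keys (assumed representation).
    """
    branches = defaultdict(set)
    for entity in attrs.values():
        if "union_group" in entity:
            branches[entity["union_group"]].add(entity["union_id"])
    # A group with a single branch is a trivial union, like A + 0.
    trivial = {group for group, ids in branches.items() if len(ids) == 1}
    result = {}
    for key, entity in attrs.items():
        if entity.get("union_group") in trivial:
            entity = {k: v for k, v in entity.items()
                      if k not in ("union_group", "union_id")}
        result[key] = dict(entity)
    return result
```

Since the output has the same shape as the input, the function can be applied repeatedly or composed with other transforms of the same signature, which is the chaining property noted on the slide.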
Potential Optimizing Transforms
Longer term, we are considering several different types of transforms. Here are some examples:
- Impossible query identification: a triple pattern, constraint, or inferred interaction does not exist in the data
- Deterministic bind: if a property is known to be unique (e.g., rdf:type is usually unique), a traversal can avoid nondeterminism while satisfying that constraint
- Selectivity-based strategy detection: if the pattern does not include complex interactions (or can be reduced to one that does not), a simpler execution strategy can be chosen on-the-fly.
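As a rough illustration of the first idea, impossible query identification, the Python sketch below flags a purely conjunctive query as unsatisfiable when one of its constant terms never appears in the data in the position where the pattern uses it. The function name and the string-based triple representation are assumptions, and the check ignores UNION, OPTIONAL, and FILTER, so it covers only the simplest case named on the slide.

```python
def is_impossible(patterns, triples):
    """Return True if a constant in some pattern never occurs in the data
    in the same position (subject, predicate, or object).

    patterns and triples are lists of 3-tuples of strings; terms starting
    with "?" are variables. Only valid when all patterns must match together.
    """
    seen = [set(), set(), set()]  # subjects, predicates, objects in the data
    for triple in triples:
        for position, term in enumerate(triple):
            seen[position].add(term)
    for pattern in patterns:
        for position, term in enumerate(pattern):
            if not term.startswith("?") and term not in seen[position]:
                return True
    return False
```

A result of False does not mean the query has answers; it only means this inexpensive test found no reason to rule it out before execution.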
Future Work: specific directions
- Continue working with Larry Holder (WSU) to find common ground on frequent subgraph mining and semantic database query
- Work with Bill Howe on query language and hybrid search strategies
- Expand our collaboration with Task 1
- Support Task 16 (Mayo)
- Engage with the bioinformatics domain to find/build an interestingly large and complex bio dataset (i.e., more complex than UniProt)
- Find collections of complex queries
- Continue work on search engine comparison:
  - Array-based
  - Subgraph-isomorphism (MTGL)
  - Sprinkle-SPARQL
  - Query-optimization infused with pattern matching
- Extend study of larger path types (n=4,5) and/or non-linear motifs