Kendall Clark, CEO, Clark & Parsia, LLC. Thursday, March 17, 2011

Stardog talk-dc-march-17


DESCRIPTION

Stardog is a fast, scalable, lightweight RDF database for complex SPARQL queries. It features OWL 2 reasoning, transactions, a robust security layer, integrity constraint validation via Pellet 3, and world-class support.


Page 1: Stardog talk-dc-march-17

Kendall Clark, CEO, Clark & Parsia, LLC

Thursday, March 17, 2011

Page 2: Stardog talk-dc-march-17

About C&P

• We build semantic technology infrastructure and enterprise solutions

• Pellet, the leading OWL reasoner

• POPS Expertise Location system

• Bootstrapped since 2005

• Offices in DC and Cambridge, MA

• Government & enterprise customers

• First talk ever was at LOC in 2005 :)



Page 4: Stardog talk-dc-march-17

TLDR?

• Java RDF database (“quad store”), no native code

• Freemium model:

• enterprise & community editions

• OEM

• Performance for complex SPARQL queries

• Best available reasoning support


Page 5: Stardog talk-dc-march-17

NoSQL and SemWeb

• SemWeb is schemaless and schema-rich

• As agile as NoSQL stores

• More expressive than SQL

• Standards based

• Graph DBs are all ad hoc

• Query Language and, you know, joins

• Do you really want to write map-reduce programs...only?! We sure don’t...!


Page 6: Stardog talk-dc-march-17

Why another RDF DB?

• We’re scratching our itch for fast query for integration & decision support apps

• aimed at db-reasoner “tweener” space

• operationally agile

• There’s a hole in the market; or: markets are normal distributions (probably)

• Gives us a complete semantic application platform


Page 7: Stardog talk-dc-march-17

Commercial Market

• 6 products

• Technically homogeneous:

• Sagan-like scale obsession

• Mostly ad hoc reasoning

• Weak perf on complex queries

• Ho-hum feature sets & integrations

• See http://bit.ly/92P8eN for more


Page 8: Stardog talk-dc-march-17

Stardog 1.0: Overview

• Fast

• Lightweight

• Rich API support

• Logical & statistical inference

• Transactions

• Full-text search

• Graph algorithms and path language

• awesome mascot!


Page 9: Stardog talk-dc-march-17

Fast? No, Really Fast!

• The first design goal in Stardog is performance of complex SPARQL query evaluation on a single machine in the default configuration

• Next, total queries per second

• In-memory mode available, when needed

• Early testing is promising: fastest RDF DB on SP2B benchmark. Often several times faster.


Page 10: Stardog talk-dc-march-17

Performance

• Do yr own testing; the only queries that matter are yours; don’t trust, test.

• It’s not ready till it’s very, very fast.

• Flatten the RDF performance tax

• About 256 GB for ~2B triples in main-memory mode, i.e., $20k Dell box.

• When in doubt: Add. More. RAM.
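As a back-of-the-envelope check on the main-memory figure above, the arithmetic below works out the implied memory budget per triple. It assumes "GB" means 10^9 bytes and "~2B" means exactly 2 billion triples; both are rounding assumptions, not figures from the talk.

```python
# Rough per-triple memory budget implied by the slide's figures.
# Assumptions: "GB" is read as 10**9 bytes, "~2B triples" as 2 * 10**9.
total_bytes = 256 * 10**9
triples = 2 * 10**9

bytes_per_triple = total_bytes / triples
print(f"{bytes_per_triple:.0f} bytes per triple")  # prints "128 bytes per triple"
```

Reading "GB" as 2^30 bytes instead raises the estimate to roughly 137 bytes per triple, so the order of magnitude does not depend on the convention.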


Page 11: Stardog talk-dc-march-17

Scalability

• Stardog 1.0: scale up

• Disk-based joins for very large intermediate structures

• Triples compression

• Ideally efficient on-disk indices

• Stardog 2.0: scale out (shared-disk cluster)

• We think it’s easier to scale a fast DB than to speed up a scalable one...


Page 12: Stardog talk-dc-march-17

Lightweight

• ~34 KLOC for core system, ~10 KLOC of tests (1034 unit tests)

• Trivially simple installation:

• copy JAR & restart servlet container

• If you’ve ever used Sesame...

• May run: embedded, client-server; main memory or disk-backed modes; any combination of these


Page 13: Stardog talk-dc-march-17

Interfaces

• SNARL (Stardog Native API for RDF Language)

• Avro RPC—esp. the low-level TCP transport (coming soon...)—for Java & non-Java

• Sesame & Jena

• SPARQL Protocol (HTTP)
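To make the SPARQL Protocol interface concrete, here is a minimal sketch of a query-via-GET request as the W3C SPARQL Protocol describes it: the query text goes in a `query` URL parameter and the `Accept` header names the result format. The endpoint URL is a hypothetical placeholder, not an address given in the talk, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL for illustration only; not taken from the talk.
ENDPOINT = "http://localhost:8080/stardog/query"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Protocol query-via-GET request."""
    # The protocol passes the query text in the "query" URL parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for SPARQL results serialized as JSON.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it to a running server.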


Page 14: Stardog talk-dc-march-17

Logical Inference

1. OWL 2 QL, EL, and RL “query-time” reasoning

• No materialization (so: fast bulk loading)

• reasoning enabled per-query

2. OWL 2 DL reasoning via Pellet 3.0

• in-memory, schema reasoning

3. Integrity Constraint Validation via OWL 2

4. user-defined & SWRL rules


Page 15: Stardog talk-dc-march-17

OWL validation of RDF

• Use OWL ontologies to validate RDF instance data in Stardog.

• May be used as a guard to database modifications (so, if resulting data is invalid, transaction fails).

• W3C Member Submission to formalize this approach; stay tuned for details.

• See http://clarkparsia.com/pellet/icv/ for details
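The guard behaviour described above (validate the resulting data, fail the transaction if it is invalid) can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not Stardog's or Pellet's API: `GuardedStore`, `ConstraintViolation`, and `every_person_has_name` are hypothetical names, and the constraint is a plain Python function standing in for an OWL axiom.

```python
class ConstraintViolation(Exception):
    """Raised when a commit would leave the data in an invalid state."""


class GuardedStore:
    """Toy in-memory triple store that validates before committing."""

    def __init__(self, constraints):
        self.triples = set()
        # Each constraint maps a set of triples to a list of error strings.
        self.constraints = constraints

    def commit(self, additions):
        # Validate the data as it would look after the change.
        candidate = self.triples | set(additions)
        errors = [e for check in self.constraints for e in check(candidate)]
        if errors:
            # Invalid result: the transaction fails and nothing is stored.
            raise ConstraintViolation("; ".join(errors))
        self.triples = candidate


def every_person_has_name(triples):
    """Hypothetical constraint: each :Person must have a :name."""
    people = {s for s, p, o in triples if p == "rdf:type" and o == ":Person"}
    named = {s for s, p, o in triples if p == ":name"}
    return [f"{s} is a :Person without a :name" for s in sorted(people - named)]


store = GuardedStore([every_person_has_name])
store.commit([(":alice", "rdf:type", ":Person"), (":alice", ":name", "Alice")])
try:
    store.commit([(":bob", "rdf:type", ":Person")])  # no :name, so rejected
except ConstraintViolation as err:
    print("rejected:", err)
print(len(store.triples))
```

The point of the design is that the same ontology which a reasoner treats as a source of inferences is here read as a set of constraints, so missing data becomes an error rather than an unknown.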


Page 16: Stardog talk-dc-march-17

OWL 2 Support

• Stardog 1.0: query-time, query rewriting reasoner for SPARQL entailment regimes

• It will support all of OWL 2 QL, EL, and RL, with exceptions:

• limited support for datatypes reasoning

• i.e., won’t support user-defined datatypes

• will depend on customer demand
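The general idea behind query-time reasoning by query rewriting can be shown with a toy example: instead of materializing inferred types at load time, a query for a class is expanded into a union over that class and its subclasses. This is a sketch of the technique in general, not Stardog's rewriting algorithm; the class hierarchy and the function names are made up for illustration.

```python
# Hypothetical schema: each key is a direct subclass of its value.
SUBCLASS_OF = {
    ":Student": ":Person",
    ":Professor": ":Person",
    ":PhDStudent": ":Student",
}


def subclasses(cls):
    """Return cls plus all of its direct and indirect subclasses, sorted."""
    found = {cls}
    changed = True
    while changed:  # repeat until no new subclass is discovered
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in found and sub not in found:
                found.add(sub)
                changed = True
    return sorted(found)


def rewrite_type_query(var, cls):
    """Expand 'var a cls' into a UNION over cls and its subclasses."""
    branches = [f"{{ {var} a {c} }}" for c in subclasses(cls)]
    return "SELECT " + var + " WHERE { " + " UNION ".join(branches) + " }"


rewritten = rewrite_type_query("?x", ":Person")
print(rewritten)
```

Because the stored data is untouched, bulk loading pays no reasoning cost, and a query that does not ask for reasoning can skip the rewriting step entirely.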


Page 17: Stardog talk-dc-march-17

Statistical Inference

• Corleone is a machine learning system for RDF and OWL

• Optimized for Stardog

• Multiple classifier & cluster algorithms

• Clusters (similarity) and classifies (predicts) by RDF class & individual

• Machine learning must still be tuned; no magic bullets


Page 18: Stardog talk-dc-march-17

Transactions

• Supports optional ACID transactions on database mutations

• 2-phase commit based on Java Transaction API

• Tx’d writes 2x to 8x slower, depending on lots of variables

• Writes may be asynchronous & queued
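For readers unfamiliar with two-phase commit, the following is a minimal sketch of the protocol's shape: a coordinator first asks every participant to prepare, and commits only if all of them vote yes. It does not use the Java Transaction API and says nothing about how Stardog implements it; `Participant` and `two_phase_commit` are hypothetical names.

```python
class Participant:
    """Toy resource that can vote on and then apply a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1 vote: True means "ready to commit".
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Commit on all participants only if every one of them votes yes."""
    # Phase 1: collect votes; all() stops at the first "no".
    if all(p.prepare() for p in participants):
        # Phase 2a: unanimous yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2b: at least one "no", so everyone rolls back.
    for p in participants:
        p.rollback()
    return False


ok = two_phase_commit([Participant("index"), Participant("log")])
failed = [Participant("index"), Participant("log", can_commit=False)]
not_ok = two_phase_commit(failed)
print(ok, not_ok, [p.state for p in failed])
```

The extra round of voting is one plausible source of the write slowdown quoted above, although the talk does not break that figure down.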


Page 19: Stardog talk-dc-march-17

Search

• Indexes RDF individuals and literals

• Results are 2-tuples (url|value, score)

• Based on Lucene: very fast, very scalable

• Can use 1 of 6 algorithms to partition RDF individuals from a graph

• via SPARQL DESCRIBE hook

• Will be integrated with SPARQL syntax...


Page 20: Stardog talk-dc-march-17

RDF as Graph

• SPARQL isn’t ideal for every use case

• Graph algorithm processing on RDF purely as a graph

• Stardog supports Gremlin, the ad hoc standard for graph database query languages

• Gremlin makes graph algorithms easy to write

• More optimized Gremlin support for 1.0
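As an example of processing RDF purely as a graph, the sketch below runs a breadth-first shortest-path search over a list of triples, treating each triple as a directed edge from subject to object and ignoring the predicate. It is written in plain Python rather than Gremlin, and the sample data and the `shortest_path` name are made up for illustration.

```python
from collections import deque


def shortest_path(triples, start, goal):
    """Breadth-first search from start to goal over subject -> object edges."""
    graph = {}
    for s, _p, o in triples:
        graph.setdefault(s, []).append(o)

    queue = deque([[start]])  # each queue entry is a path from start
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable from start


triples = [
    (":a", ":knows", ":b"),
    (":b", ":knows", ":c"),
    (":a", ":knows", ":d"),
    (":d", ":knows", ":c"),
    (":c", ":knows", ":e"),
]
path = shortest_path(triples, ":a", ":e")
print(path)
```

Unbounded path search of this kind is awkward to express in SPARQL 1.0, which is the sort of gap a graph traversal language is meant to fill.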


Page 21: Stardog talk-dc-march-17

Architecture

[Figure: architecture diagram. Box labels, in extraction order: Implementations; Sesame, Jena, Empire; HTTP API, Native API, Avro API; Stardog API; SPI, Runtime; Transactions; Stardog RDF; Stardog Core; Query; Exec; Optimizer; Plan, Filter API; Query Rewriting/Reasoning; Index API, SPI; CP Util, IO Util, Stardog Util, Sesame Ext; Plan API.]


Page 22: Stardog talk-dc-march-17

Status

• Stardog 0.4.6 alpha release to alpha testers on 15 March 2011

• It feels damn good to ship code, even if it’s just an alpha! :)

• Weekly updates till beta period starts, then bimonthly updates till 1.0 release


Page 23: Stardog talk-dc-march-17

The Private Beta

• Doin’ it old school: private beta, invitation only

• Helps us keep commercial focus

• ~1 April to 30 May

• [email protected] if yr interested: give name, org, area of interest, etc.

• Rolling releases, new features, bug fixes, etc

• ~90 organizations signed up for beta so far


Page 24: Stardog talk-dc-march-17

Roadmap

• 1.0 in mid-Summer

• SPARQL 1.1, MRMW

• stored procedures in any JVM lang

• Shiro-based security layer

• native OWL 2 RL reasoner

• provenance API

• graph algorithms & an RDF path language

• performance improvements continuously


Page 25: Stardog talk-dc-march-17

Thanks! Questions?

• http://stardog.com/

• http://clarkparsia.com/

• http://twitter.com/candp

• http://twitter.com/stardog_db
