22
SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff [email protected] @avometric Many thanks to: Mike Dean, Ian Emmons, Gail Mitchell, Doug Reid, Rick Schantz from BBN Hanspeter Pfister from Harvard SEAS Phil Zeyliger from Cloudera Prakash Manghwani

SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

  • Upload
    dangdat

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

SHARD Triple-Store:

Tools for Web-Scale SemWebKurt Rohloff

[email protected]

@avometric

Many thanks to:

Mike Dean, Ian Emmons, Gail Mitchell,

Doug Reid, Rick Schantz from BBN

Hanspeter Pfister from Harvard SEAS

Phil Zeyliger from Cloudera

Prakash Manghwani

Page 2: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

2

Semantic Web / Graph Data

• Vision from Tim Berners-Lee at W3C.

• Create a web of data– Support use by intelligent agents.– Data described using ontologies.– Data represented as digraphs.– “Web 3.0.”

• Emerging commercially– Use by NYTimes, BBC, Pharma, …– Numerous startups.– Oracle, MySQL have SemWeb support.

• Government use…

Page 3: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Object Graph Example

BBN

City

Person

Massachusetts

elmer

Cambridge

Company

“BBN Technologies”

“Tad Elmer”

president

headquarters

US

rdf:type

rdf:type

rdf:type

locatedIn

rdf:type name

name

State

locatedIn

Countryrdf:type

Organization

rdfs:subClassOf

Page 4: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

SemWeb Layer Cake

4

Knowledge

Storage

Querying

Reasoning

Page 5: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

W3C Resource Description

Framework (RDF)

subject object

predicate

• RDF graph is made up of individual statements.

• Subject and predicate are Uniform Resource Identifiers

(URIs).

• You can also make statements about statements (e.g.

timestamp, confidence, etc.)

Page 6: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

RDF/XML

6

<rdf:RDF

xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns="http://example.org/business-ont#">

<Company rdf:ID="BBN">

<name>BBN Technologies</name>

<headquarters rdf:resource="http://www.state.ma.us/cities#Cambridge"/>

<president rdf:resource="http://www.bbn.com/management#elmer"/>

</Company>

</rdf:RDF>BBN

elmer

Cambridge

“BBN Technologies”

president

headquarters

name

Page 7: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

SPARQL Query

All people who own a car made in Detroit:

SELECT ?person

WHERE {

?person :owns ?car .

?car a :Car .

?car :madeIn :Detroit .

}

?person ?carowns

madeIn Detroit

Cara

7

Page 8: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Answering Queries

Kurt car0 Ford

ownsmadeBy

madeIn

Detroit

livesIn

Cambridge

aa

City

Car

a

?person ?car

owns

madeIn

Detroit

Cara

8

Page 9: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Sample of Triple-Stores

• Parliament by BBN (from DAPRA DAML.)

• OWLIM by OntoText (several versions.)

• Allegrograph from Franz.

• MySQL and Oracle Solutions.

• LarKC by DERI Galway.

• Mulgara.

• Hive- and Pig-based experimental triple-stores.

• Etc…

9

Page 10: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Triple-Store Design Considerations

• Scalable – web-scale?

• High Assurance.

• Cost Effective – commodity hardware?

• Modular inferred data separation.

• Robustness.

• Considerations as endless as applications.

10

Page 11: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Map-Reduce Triple-Store Proof of

Concept

11

Page 12: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

SHARD Triple-Store Built on Hadoop

Prioritized goals:

•Commodity hardware, ONLY.

•Web scalable.

•Robust.

12

Page 13: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

More Specifically

• Cloud-based triple-store on HDFS.

– Method calls at client.

– Processing in cloud.

– Move results to local machine.

• Massively scalable.

• SPARQL queries.

• Basic inferencing.

13

Page 14: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Data Persistence Advice from SHARD

• Down to “bare metal” in HDFS for efficiency.

– No Berkeley DB, no C-stores, …. Nothing.

• Simple data storage as flat files.

– Lists of (predicate, object) pairs for every subject by line.

– Ex: Kurt owns car0 livesin Cambridge

• Simple often really is better…

14

Page 15: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

HDFS Graph Storage

Kurt car0 Ford

ownsmadeBy

madeIn

Detroit

livesIn

Cambridge

aa

City

Car

a

15

Graphs saved as flat-file in HDFS:

Kurt owns car0 livesIn Cambridge

Car0 a Car madeBy Ford madeIn Detroit

Cambridge a City

Detroit a City

Page 16: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Query Processing

• BBN-developed query processor.

– Starting integration with “standard” interfaces

• Jena, Sesame.

• SHARD supports “most” of SPARQL.

– Like most commercial triple-stores.

• Large performance improvements possible with

improved query reordering.

16

Page 17: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Iterative Query Response Construction

Source Data

s op

s op

s op

s op

s op

s op

?person ?carowns

madeIn

Detroit

Cara

?person ?car

owns

1st clause results

s op

s op

s op

s op

2nd clause results

s op

s op

op

op

?person ?carowns

Cara

2nd clause results

s op

s op

op

op

op

op 17

Page 18: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Test Data

• Deployed code on Amazon EC2 cloud.

– 19 XL nodes.

• 6000 LUBM university dataset.

– Approximately 800 million edges in graph.

• In general, performed comparably to

“industrial” monolithic triple-stores.

18

Page 19: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

SHARD Open-Source Release

• BSD license.

• Check:

– My webpage

– Sourceforge (SHARD-3store)

19

Page 20: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

More info?

• Tim Berners-Lee’s seminal SciAmerican article.

• W3C for “recommended” standards.

• Jena and Sesame frameworks.

• SemWebCentral for other open-source.

• Please come up and talk with me for more info!

20

Page 21: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Thanks!

Questions?

Kurt Rohloff

[email protected]

@avometric

21

Page 22: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff

Performance Comparison

• Proof o’ Concept: For 6000 universities

(approx. 800 million triples):

Query 1: 404 sec. (approx 0.1 hr.)

Query 9: 740 sec. (approx 0.2 hr.)

Query 14: 118 sec. (approx 0.03 hr.)

• Sesame+DAMLDB:

Query 1: approx 0.1hr,

Query 9: approx 1 hr

Query 14: approx. 1 hr

• Jena+DAMLDB for 550 million triples:

Query 1: approx 0.001 hr,

Query 9: approx 1 hr

Query 14: approx. 5 hr 22