Scaling Jena in a commercial environment
The Ingenta MetaStore Project
Purpose
● Give an example of a big, commercial app using Jena
● Share experiences and problems
What is the metastore?
An RDF triple store which:
• Holds Ingenta's bibliographic metadata
• Centralised system
• Flexible format
• Scalable
• Distributable
• Easily integratable
Existing Systems
[Diagram: existing systems around the live Article Headers store - 4.3 million Article Headers, 8 million References, 20 million External Holdings; connected systems include the Publishers/Titles Database, the IngentaConnect Website, Preprints, Publishers and Other Aggregators]
Architecture of new system
[Diagram: RDF triplestore (PostgreSQL) with a Master replicated to two Slaves; a read-only XML API (Jena) serves IngentaConnect and other clients; a Primary Loader (Jena) writes to the Master, fed via a JMS queue from other systems and customer data; other loaders/enhancers connect through a second JMS queue]
RDFS Modelling – What was the data anyway?
Standard Vocabularies
• Dublin Core
• PRISM
• FOAF

Custom Vocabularies
• Identifiers
• Structure
• Branding
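A minimal sketch of how the prefixed names used throughout these slides expand to full property URIs. The dc: and foaf: namespace URIs are the standard ones; the prism: URI shown (and the absence of the custom Ingenta namespaces) is an assumption for illustration.

```java
import java.util.Map;

// Expand a prefixed name (e.g. "dc:identifier") into a full property URI.
// The prism: namespace version is an assumption; check the vocabulary in use.
public class Prefixes {
    static final Map<String, String> NS = Map.of(
        "dc", "http://purl.org/dc/elements/1.1/",
        "foaf", "http://xmlns.com/foaf/0.1/",
        "prism", "http://prismstandard.org/namespaces/1.2/basic/");

    static String expand(String qname) {
        String[] parts = qname.split(":", 2);
        String ns = NS.get(parts[0]);
        if (ns == null)
            throw new IllegalArgumentException("unknown prefix: " + parts[0]);
        return ns + parts[1];
    }
}
```

In Jena this mapping would normally live in the model's `PrefixMapping` rather than a hand-rolled map.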
Some stats about schemas
28 Classes
72 Properties
4/18th from Standard Vocabs
Journal XML, with highlights
Hosting description XML, with highlights
OK, enough about your project, tell me something about Jena!
OK....
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?
How did we choose an RDF Engine?
Experimented with Java APIs
• Jena + PostgreSQL
• Sesame + PostgreSQL
• Kowari + native
Method of testing
Why did we choose Jena?
• Relational database backend
• Usability, support
• Easy to debug
• Schemagen
• Scalability
What problems did we have?
1. Insertion - performance
2. Ontologies – memory
3. Encapsulation – limiting flexibility
(most problems were due to scale)
The Project - Scale
Number of triples = ~200 million and keeps growing
Size on disk = 65 GB
Result of loading 4.3 million articles and references
Some details of database tables
jena_long_lit – ~4.5 million records
jena_long_uri - ~0.14 million records
Prob 1. Insertion performance
* Task - load backdata
* What does that actually involve?
For each article:
– Get metadata from database 1
– Add metadata from database 2
– Reform into new RDFS model
– Query the store – look for relevant resources
– Model.read
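The per-article steps above can be sketched roughly as follows. `Triple`, `buildTriples` and the two metadata maps are hypothetical stand-ins; the real loader assembles a Jena `Model` and calls `Model.read`/`Model.add` against the store.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for an RDF statement.
record Triple(String s, String p, String o) {}

class ArticleLoader {
    // Merge metadata rows from the two source databases into one set of
    // triples about the article, ready to be written to the store.
    static List<Triple> buildTriples(String articleUri,
                                     Map<String, String> db1,
                                     Map<String, String> db2) {
        Map<String, String> merged = new LinkedHashMap<>(db1);
        merged.putAll(db2); // db2 values win on conflict (an assumption)
        List<Triple> triples = new ArrayList<>();
        merged.forEach((p, o) -> triples.add(new Triple(articleUri, p, o)));
        return triples;
    }
}
```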
* Problem
* Possible Solutions?
- Turn off index rebuild
- Turn off duplicate checking
- Batching
Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
[Chart: insert time per triple (ms, log scale 1 to 1,000) against batch size (0 to 50,000 triples)]
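The batching idea can be sketched generically: instead of paying per-insert index and duplicate-check overhead on every write, accumulate statements in memory and flush them in one bulk operation per batch. The `flusher` callback is a hypothetical stand-in for a bulk `Model.add` against the triplestore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffer items and hand them to the store in fixed-size batches.
class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher;
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    // Must be called once at the end to push any partial final batch.
    void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

The cost side of the trade-off is visible in the sketch: larger batches amortize more overhead, but data is not visible in the store until its batch flushes, and an unflushed batch is lost on failure.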
Prob 2. Ontologies – memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?
Prob 3. Encapsulation – limiting flexibility
* Not really a problem with Jena – an experience
* Why are we encapsulating the Jena code?
* What is the problem with that?
* Solutions?
Performance Testing SPARQL
Standard query – TITLE TYPE QUERY
SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
Performance Testing SPARQL
SELECT ?title ?issue ?article
WHERE {
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
NO TYPES QUERY
Performance Testing SPARQL
SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
TITLE TYPE QUERY
Performance Testing SPARQL
SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal .
  ?issue rdf:type struct:Issue .
  ?article rdf:type struct:Article .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
ALL TYPES QUERY
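The three benchmark queries differ only in which rdf:type constraints they include, which a small helper makes explicit. The URIs match the slides; the string-assembly helper itself is illustrative (in practice the query would go straight to Jena's ARQ engine).

```java
import java.util.List;

// Rebuild each benchmark query variant from the shared pattern body.
class QueryVariants {
    static final String BODY =
        "  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .\n" +
        "  ?issue prism:isPartOf ?title .\n" +
        "  ?issue prism:volume ?volumeLiteral .\n" +
        "  ?issue prism:number ?issueLiteral .\n" +
        "  ?article prism:isPartOf ?issue .\n" +
        "  ?article prism:startingPage ?firstPageLiteral .\n" +
        "  FILTER ( ?volumeLiteral = \"20\" )\n" +
        "  FILTER ( ?issueLiteral = \"4\" )\n" +
        "  FILTER ( ?firstPageLiteral = \"539\" )\n";

    // typeConstraints: zero (NO types), one (TITLE type), or three (ALL types)
    // triple patterns such as "?title rdf:type struct:Journal".
    static String build(List<String> typeConstraints) {
        StringBuilder q = new StringBuilder("SELECT ?title ?issue ?article WHERE {\n");
        typeConstraints.forEach(t -> q.append("  ").append(t).append(" .\n"));
        return q.append(BODY).append("}").toString();
    }
}
```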
Performance Testing SPARQL
[Chart: query time (secs, 0 to 30) against store size (0 to 160 million triples), for the NO types, ALL types and TITLE type only queries]
Title type only - <1.5 secs for 150 million triples
TEST CONDITIONS
Jena 2.3
PostgreSQL 7
Debian
Intel(R) Xeon(TM) CPU 3.20GHz
6 SCSI drives
4 GB RAM
Where are we now with the project?
Recent Work
* Loaded 4.3 million articles through the batching process; ongoing loading now in place
* Non-journal content modelled
* REST API
Current Work
* Replication
* SPARQL merging
* Phase out batching and use queues instead
Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore
The End