Scaling Jena in a commercial environment
The Ingenta MetaStore Project
Purpose
● Give an example of a big, commercial app using Jena
● Share experiences and problems
What is the metastore?
An RDF triple store which:
• Holds Ingenta's bibliographic metadata
• Centralised system
• Flexible format
• Scalable
• Distributable
• Easily integratable
Existing Systems
[Diagram: existing systems around the live Article Headers store - 4.3 million Article Headers, 8 million References, 20 million External Holdings; connected systems include the Publishers/Titles Database, the IngentaConnect Website, Preprints, Publishers and Other Aggregators]
Architecture of new system
[Diagram: RDF triplestore (PostgreSQL) with a Master replicated to two Slaves; a read-only XML API (Jena) serves IngentaConnect and other clients; a Primary Loader (Jena) writes to the Master, fed via a JMS queue from other systems and customer data; other loaders/enhancers connect through a second JMS queue]
RDFS Modelling – What was the data anyway?
Standard Vocabularies
• Dublin Core
• PRISM
• FOAF

Custom Vocabularies
• Identifiers
• Structure
• Branding
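A minimal sketch of how the prefixed names used throughout these slides expand to full property URIs. The dc: and foaf: namespace URIs are the standard ones; the prism: URI shown (and the absence of the custom Ingenta namespaces) is an assumption for illustration.

```java
import java.util.Map;

// Expand a prefixed name (e.g. "dc:identifier") into a full property URI.
// The prism: namespace version is an assumption; check the vocabulary in use.
public class Prefixes {
    static final Map<String, String> NS = Map.of(
        "dc", "http://purl.org/dc/elements/1.1/",
        "foaf", "http://xmlns.com/foaf/0.1/",
        "prism", "http://prismstandard.org/namespaces/1.2/basic/");

    static String expand(String qname) {
        String[] parts = qname.split(":", 2);
        String ns = NS.get(parts[0]);
        if (ns == null)
            throw new IllegalArgumentException("unknown prefix: " + parts[0]);
        return ns + parts[1];
    }
}
```

In Jena this mapping would normally live in the model's `PrefixMapping` rather than a hand-rolled map.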
Some stats about schemas
28 Classes
72 Properties
4/18th from Standard Vocabs
Journal XML, with highlights
Hosting description XML, with highlights
OK, enough about your project, tell me something about Jena!
OK....
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?
How did we choose an RDF Engine?
Experimented with Java APIs
• Jena + PostgreSQL
• Sesame + PostgreSQL
• Kowari + native
Method of testing
Why did we choose Jena?
• Relational database backend
• Usability, support
• Easy to debug
• Schemagen
• Scalability
What problems did we have?
1. Insertion - performance
2. Ontologies – memory
3. Encapsulation – limiting flexibility
(most problems were due to scale)
The Project - Scale
Number of triples = ~200 million and keeps growing
Size on disk = 65 GB
Result of loading 4.3 million articles and references
Some details of database tables
jena_long_lit – ~4.5 million records
jena_long_uri - ~0.14 million records
Prob 1. Insertion performance
* Task - load backdata
* What does that actually involve?
For each article:
– Get metadata from database 1
– Add metadata from database 2
– Reform into new RDFS model
– Query the store – look for relevant resources
– Model.read
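The per-article steps above can be sketched roughly as follows. `Triple`, `buildTriples` and the two metadata maps are hypothetical stand-ins; the real loader assembles a Jena `Model` and calls `Model.read`/`Model.add` against the store.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for an RDF statement.
record Triple(String s, String p, String o) {}

class ArticleLoader {
    // Merge metadata rows from the two source databases into one set of
    // triples about the article, ready to be written to the store.
    static List<Triple> buildTriples(String articleUri,
                                     Map<String, String> db1,
                                     Map<String, String> db2) {
        Map<String, String> merged = new LinkedHashMap<>(db1);
        merged.putAll(db2); // db2 values win on conflict (an assumption)
        List<Triple> triples = new ArrayList<>();
        merged.forEach((p, o) -> triples.add(new Triple(articleUri, p, o)));
        return triples;
    }
}
```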
* Problem
* Possible Solutions?
- Turn off index rebuild
- Turn off duplicate checking
- Batching
Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
[Chart: insert time per triple (ms, log scale 1 to 1,000) against batch size (0 to 50,000 triples)]
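The batching idea can be sketched generically: instead of paying per-insert index and duplicate-check overhead on every write, accumulate statements in memory and flush them in one bulk operation per batch. The `flusher` callback is a hypothetical stand-in for a bulk `Model.add` against the triplestore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffer items and hand them to the store in fixed-size batches.
class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher;
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    // Must be called once at the end to push any partial final batch.
    void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

The cost side of the trade-off is visible in the sketch: larger batches amortize more overhead, but data is not visible in the store until its batch flushes, and an unflushed batch is lost on failure.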
Prob 2. Ontologies – memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?
Prob 3. Encapsulation – limiting flexibility
* Not really a problem with Jena – an experience
* Why are we encapsulating the Jena code?
* What is the problem with that?
* Solutions?
Performance Testing SPARQL
Standard query – TITLE TYPE QUERY
SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
Performance Testing SPARQL
SELECT ?title ?issue ?article
WHERE {
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
NO TYPES QUERY
Performance Testing SPARQL
SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
TITLE TYPE QUERY
Performance Testing SPARQL
SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal .
  ?issue rdf:type struct:Issue .
  ?article rdf:type struct:Article .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
ALL TYPES QUERY
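The three benchmark queries differ only in which rdf:type constraints they include, which a small helper makes explicit. The URIs match the slides; the string-assembly helper itself is illustrative (in practice the query would go straight to Jena's ARQ engine).

```java
import java.util.List;

// Rebuild each benchmark query variant from the shared pattern body.
class QueryVariants {
    static final String BODY =
        "  ?title dc:identifier <http://metastore.ingenta.com/content/issn/02670836> .\n" +
        "  ?issue prism:isPartOf ?title .\n" +
        "  ?issue prism:volume ?volumeLiteral .\n" +
        "  ?issue prism:number ?issueLiteral .\n" +
        "  ?article prism:isPartOf ?issue .\n" +
        "  ?article prism:startingPage ?firstPageLiteral .\n" +
        "  FILTER ( ?volumeLiteral = \"20\" )\n" +
        "  FILTER ( ?issueLiteral = \"4\" )\n" +
        "  FILTER ( ?firstPageLiteral = \"539\" )\n";

    // typeConstraints: zero (NO types), one (TITLE type), or three (ALL types)
    // triple patterns such as "?title rdf:type struct:Journal".
    static String build(List<String> typeConstraints) {
        StringBuilder q = new StringBuilder("SELECT ?title ?issue ?article WHERE {\n");
        typeConstraints.forEach(t -> q.append("  ").append(t).append(" .\n"));
        return q.append(BODY).append("}").toString();
    }
}
```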
Performance Testing SPARQL
[Chart: query time (secs, 0 to 30) against store size (0 to 160 million triples), for the NO types, ALL types and TITLE type only queries]
Title type only - <1.5 secs for 150 million triples
TEST CONDITIONS
Jena 2.3
PostgreSQL 7
Debian
Intel(R) Xeon(TM) CPU 3.20GHz
6 SCSI drives
4 GB RAM
Where are we now with the project?
Recent Work
* Loaded 4.3 million articles through the batching process; ongoing loading now in place
* Non-journal content modelled
* REST API
Current Work
* Replication
* SPARQL merging
* Phase out batching and use queues instead
Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore
The End