
Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote


Virtuoso, The Prometheus of RDF presented by Orri Erling (Virtuoso Program Manager)


Page 1: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

By Orri Erling, Virtuoso Program Manager

OpenLink Software

Virtuoso: The Prometheus of RDF-based Relational Data Management

Page 2: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Linked Data at Dawn

The Promise and the Practice

The Science of Speed

The Structure which Is

Ongoing Research

License CC-BY-SA 4.0 (International).

Page 3: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Linked Data Promises

RDF is a generic, minimalistic model for describing things

RDF has global identifiers and data is self-describing

URIs may be dereferenceable

RDF is flexible to query, does not force a single hierarchical view like XML


Page 4: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Linked Data Scenarios

RDF is used because of

schema flexibility

global identifiers

Inference, if present, is usually trivial

Subclass

Sub-property
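
To make "trivial" inference concrete, here is a minimal sketch of subclass and sub-property entailment over a set of triples. The prefixed names, the helper names `ancestors` and `infer`, and the in-memory tuple representation are all assumptions made for illustration; the slides do not describe how Virtuoso implements inference.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def ancestors(edges):
    """Map each term to the set of its transitive super-terms."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    def walk(term, seen):
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                walk(p, seen)
        return seen

    return {term: walk(term, set()) for term in list(parents)}


def infer(triples):
    """Return the input triples plus subclass and sub-property entailments."""
    triples = set(triples)
    super_classes = ancestors((s, o) for s, p, o in triples if p == SUBCLASS)
    super_props = ancestors((s, o) for s, p, o in triples if p == SUBPROP)
    inferred = set(triples)
    # Sub-property: (s p o) and p subPropertyOf q gives (s q o).
    for s, p, o in triples:
        for q in super_props.get(p, ()):
            inferred.add((s, q, o))
    # Subclass: (s type C) and C subClassOf D gives (s type D).
    for s, p, o in list(inferred):
        if p == RDF_TYPE:
            for d in super_classes.get(o, ()):
                inferred.add((s, RDF_TYPE, d))
    return inferred
```

Given `ex:Dog rdfs:subClassOf ex:Mammal` and `ex:rex rdf:type ex:Dog`, `infer` adds `ex:rex rdf:type ex:Mammal`. A production store would more likely expand the query than materialize the extra triples; that choice is outside what this sketch shows.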


Page 5: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Where Triples Come From

Relational extracts or web content is converted to and stored as triples

NLP extraction

New applications with RDF as primary data model

Doing SPARQL against data in RDBs is possible but rare, and does not deliver the same flexibility


Page 6: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Linked Data Verticals and Patterns

Publishing: tagging & annotations, evolving vocabularies

Archives: self description, long term identifiers, many versions of schema

Semantic search: structured, semi-structured, and full text, all in one

Business intelligence: many sources, ease of adding sources, no 6 month DW schema change cycle

E-science, often in life sciences: common interchange format, nano-publications, NLP extracts, different users cook their data differently, provenance


Page 7: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

The Hopes and Perceptions

The age of ad hoc

Find insight in any data, when you need it, from any source, any format

No data warehouse planning cycles; make your own from the pieces you need, when you need it

Still, data integration remains hard work; quality and coverage of sources vary

Flexibility may be there, but are performance and scalability on the level?


Page 8: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Yes, But ...

Web and Big Data: Everybody reinvents the triple. Self-description, long term identifiers, key-value pairs in many non-RDF use cases

SPARQL and RDF would be the natural, standards-compliant choice if they beat SQL, information retrieval, custom big data, key-value, and map-reduce solutions

Is this intrinsic to linked data or is this lack of engineering?

Linked data has unique advantages in breadth of coverage and expressivity but performance must not lag behind.


Page 9: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

What is the RDF Tax?

90% of bad performance comes from non-optimal query plans

Some comes from indexing too much (e.g., SQL bulk load with no indices is 50x faster than the equivalent in RDF with all indexed)

Some comes from string ops on URIs, literals

Some comes from having a join for every attribute. Vectoring and right plans help, though
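
The "join for every attribute" cost can be illustrated with a toy comparison. This is a sketch only: the dictionaries `rows` and `spo`, the sample entity, and the probe counters are hypothetical stand-ins for a row store and a triple index, not a model of Virtuoso's storage.

```python
# Row store: one lookup returns every attribute of the entity.
rows = {"ex:p1": {"name": "Ann", "age": 41, "city": "Leipzig"}}

# Triple store: the same facts, indexed by (subject, predicate).
spo = {
    ("ex:p1", "name"): ["Ann"],
    ("ex:p1", "age"): [41],
    ("ex:p1", "city"): ["Leipzig"],
}


def fetch_row(key, wanted):
    """One index probe, then column access within the row."""
    row = rows[key]
    return {a: row[a] for a in wanted}, 1


def fetch_triples(subject, wanted):
    """One index probe (a self-join on subject) per attribute."""
    out, probes = {}, 0
    for a in wanted:
        probes += 1
        out[a] = spo[(subject, a)][0]
    return out, probes
```

Both functions return the same values for `["name", "age", "city"]`, but the triple version reports three probes against one for the row. Vectoring, as the slide notes, softens this by running each probe for a whole batch of subjects at once rather than one subject at a time.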


Page 10: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

The Bane of the Triple

When data is stored as triples:

There is structure still but it is harder to exploit. Schema re-emerges as correlations

More joins make more possible query plans, bigger errors in plan cost estimation

More joining reduces locality

Lack of schema causes needless indexing; data takes more space

A URI for everything takes space and time

For the same workload, Virtuoso SQL can also be 2–20x faster than Virtuoso SPARQL
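
A rough way to see why more joins mean more possible plans: the number of left-deep join orders over n inputs is n!. Reading k attributes of one entity is a single table access in SQL but k triple patterns in SPARQL, so the plan space for that fragment alone grows factorially. The function name below is an assumption for illustration, and the count ignores bushy plans, access-path choices, and the pruning that real optimizers apply.

```python
from math import factorial


def left_deep_orders(n_patterns: int) -> int:
    """Number of left-deep join orders for n joined inputs (n!)."""
    return factorial(n_patterns)


for n in (1, 3, 6, 10):
    print(n, left_deep_orders(n))
# 1 1
# 3 6
# 6 720
# 10 3628800
```

Each extra join also adds another cardinality estimate that can be wrong, which is the "bigger errors in plan cost estimation" half of the same bullet.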


Page 11: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

The Question is Raised

LOD2 FP7, now ending: RDF Performance parity with relational?

SQL is the senior science. Who ignores history is bound to repeat it

Integral mastery of RDB science is a prerequisite, but do not forget the subtle twists of schema-less-ness


Page 12: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Virtuoso RDF Relational DBMS Leadership

2000–2006, v1.x–4.x: SQL row store with SQL federation and XML

2007–2008, v5.x–6.x: SPARQL, adapted for RDF quads with more compression, bitmap indices, special data types, RDF awareness in query optimization

2009, v6.x: Scale-out cluster-capable

2010–2013, v7.x: Column store, vectored execution, 3x more space efficient, 10+x more speed

2013: Star Schema benchmark with SPARQL, 100x MySQL SQL, 0.8x MonetDB SQL

2014: Top of the line SQL analytics, 500 Gtriples, Structure Awareness


Page 13: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Triples Done Right, so?

Column-store techniques are a good fit; index-based triple storage does not get much better

RAM-only pointer-based techniques can be faster but cost 10–100x more to scale up

To take RDF to SQL parity, Virtuoso must first be on the level with the best in SQL

TPC-H is the checklist for mastery of DW and query optimization; who survives shall not fear

Parity is achieved when running with triples, just like with tables


Page 14: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Structure is Everywhere

CWI in LOD2:

90% of triples in Common Crawl fall into 20 tables

All relational extractions are 100% tables

Even DBpedia is 90% covered by 500 tables, but is unusually heterogeneous, albeit not very large
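
One simple way to measure this kind of coverage is to group subjects by the exact set of properties they use and ask what share of all triples the most common sets account for. The sketch below does that; exact-match property sets and the names `property_sets` and `coverage` are assumptions for illustration. The slides do not describe how the CWI analysis was done, and a real analysis would likely merge similar sets rather than require exact matches.

```python
from collections import Counter, defaultdict


def property_sets(triples):
    """Map each subject to the frozenset of predicates it uses."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    return {s: frozenset(ps) for s, ps in props.items()}


def coverage(triples, top_k):
    """Fraction of triples whose subject has one of the top_k most
    frequent property sets (ranked by number of triples)."""
    triples = list(triples)
    if not triples:
        return 0.0
    sets = property_sets(triples)
    # Weight each property set by how many triples its subjects carry.
    weight = Counter(sets[s] for s, _, _ in triples)
    covered = sum(n for _, n in weight.most_common(top_k))
    return covered / len(triples)
```

With two subjects sharing `{name, age}`, one with `{name, email}`, and one with `{title}`, `coverage(triples, 1)` is 4/7: the single most common "table" already holds more than half of the triples.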


Page 15: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

The Glorious Dawn: Structure is the Servant, not the Tyrant

A set of subjects with all the same single-valued properties is in fact a table. So, store it as a table

Allow exceptions, e.g., sometimes multiple values, different values in different graphs, extra properties, etc.

If it is big, it has repeating structure

All RDF semantics are preserved; any triple is possible, but the common ones are SQL compact and SQL fast

With tables, query optimization returns to SQL complexity and is much more reliable

So, more tricks from the SQL analytics bag become safe and applicable
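
The idea of a table with exceptions can be sketched as follows. Subjects that have exactly one value for every declared property become rows; everything else stays as plain triples, so no triple is lost. The function names, the `columns` argument, and the in-memory layout are assumptions for illustration, and named graphs are left out; this is not a description of Virtuoso's actual storage format.

```python
from collections import defaultdict


def split(triples, columns):
    """Store subjects that have exactly one value for every declared
    column as table rows; keep every other triple as a plain triple."""
    by_sp = defaultdict(list)
    for s, p, o in triples:
        by_sp[(s, p)].append(o)
    subjects = {s for s, _ in by_sp}
    table, leftovers = {}, []
    for s in subjects:
        if all(len(by_sp.get((s, c), ())) == 1 for c in columns):
            table[s] = tuple(by_sp[(s, c)][0] for c in columns)
    for (s, p), objs in by_sp.items():
        if s in table and p in columns:
            continue  # already held in the row
        leftovers.extend((s, p, o) for o in objs)
    return table, leftovers


def all_triples(table, leftovers, columns):
    """Reconstruct the full triple set, so any triple pattern still works."""
    for s, row in table.items():
        for c, o in zip(columns, row):
            yield (s, c, o)
    yield from leftovers
```

With `columns = ("name", "age")`, a subject with one name and one age lands in the table, its extra `email` triple goes to `leftovers`, and a subject with two names stays entirely in `leftovers`. Running `all_triples` over the result returns the original triple set.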

Page 16: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Gains from Structure Awareness

3+x Load Speed

2x more space efficiency

SPARQL queries against regular data within 10–20% of SQL speeds

Just declare which properties tend to occur together; no strict schema-first like with SQL

Later, self configuration


Page 17: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

The Cycle of Adventure

Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web

Pioneers: Life on the frontier is hard, infrastructure missing or bad

Same everyday problems also in Utopia

Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma

Reconciliation: schema-first and schema-last converge in structure awareness

Page 18: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Present FP7 Research

LDBC - Transparency and Relevance for Graph DB, RDF performance

GeoKnow - GeoData is everywhere, how to carry the planet in your pocket

LOD2 - Where no triple has gone before (and come back)

Open PHACTS - A Data Platform for Drug Discovery


Page 19: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

LDBC - Linked Data Benchmark Council

Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web

Pioneers: Life on the frontier is hard, infrastructure missing or bad

Same everyday problems also in Utopia

Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma

Reconciliation: Some of the rebel thinking becomes mainstream, e.g., schema-first and schema-last converge in structure awareness


Page 20: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

LDBC, Independent Industry Forum for Benchmarking

The TPC for the frontiers of database

Bootstrapped in the LDBC FP7, continues as independent industry association

OpenLink, Ontotext, Neo Technology, Sparsity as founding members

IBM, Oracle Labs, Systap, SPARQL City already joined

DB superstars Peter Boncz and Thomas Neumann as founders and scientific lead


Page 21: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

LDBC Benchmarks

Social Network

Online: Lookups, updates, analysis of social environment

Business Intelligence: Spotting trends, key players, big query

Graph analytics: Community detection, PageRank, graph metrics

Semantic Publishing

Modeled after the BBC linked data portal, online lookups, drill downs and updates


Page 22: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

GeoKnow - The Planet in your Pocket

Ms. Globe and Mr. Cube have a thing going on:

Mr. Cube: Desiloization ... integrated metadata ... Explicit semantics ...

Ms. Globe: I can feel it ... but are you man enough? ... you need to show me.


Page 23: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Planet Scale Roadmap

Jan 2014:

Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide OpenStreetMap

Virtuoso SQL adds 5x more power


Page 24: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Next: Jan 2015

Parity between SPARQL and SQL via structure awareness

Geospatial data clustering

Graph analytics close to the data — Pregel, Giraph, etc., in the DB itself

Adding fine-grained geo dimension to LDBC social network benchmark


Page 25: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

The LOD2 scaling adventures

Experiments at CWI's Scilens cluster

Jan 2013: 150 Gtriples (8 x 256GB RAM)

Aug 2014: 500 Gtriples (12 x 256GB RAM)

Some trillion-triple claims exist, but do not detail any query workload

BSBM explore and BI workloads

10x speed gains for BI queries between 2013 and 2014

Bulk load at 6M triples/s

All done in triples, structure awareness will go further still
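
For a sense of scale, the figures on this slide can be turned into two back-of-the-envelope numbers: aggregate RAM per triple, and how long a load would take at the quoted rate. This is a sketch under stated assumptions: "256GB" is read as 256 GiB, the 6M triples/s rate is assumed to hold for the whole load, and nothing here implies the data sets were fully memory-resident. The slides do not say which run the load rate was measured on.

```python
GIB = 2**30  # assumption: "256GB" on the slide is read as 256 GiB


def ram_bytes_per_triple(nodes, gib_per_node, triples):
    """Aggregate cluster RAM divided by the number of triples."""
    return nodes * gib_per_node * GIB / triples


def load_hours(triples, triples_per_second):
    """Hours needed to load at a constant rate."""
    return triples / triples_per_second / 3600


print(round(ram_bytes_per_triple(8, 256, 150e9), 1))   # 14.7
print(round(ram_bytes_per_triple(12, 256, 500e9), 1))  # 6.6
print(round(load_hours(500e9, 6e6), 1))                # 23.1
```

On these assumptions the 2014 run had less than half the RAM per triple of the 2013 run, and a 500 Gtriple load at 6M triples/s would take roughly a day.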

Page 26: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Open PHACTS

Partners:


Page 27: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Virtuoso Now

Snapshot of RDF Linked Data customers in the Enterprise:

Data.Gov (U.S. Govt. Open Linked Data initiative)

Bank of America, Booz Allen Hamilton, Northrop Grumman, Elsevier, French National Library, Samsung, Globo

Daimler Benz, Johnson & Johnson, Bayer, St. Jude Medical, Fujitsu, Syngenta, and many more


Page 28: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Virtuoso Availability

Most capabilities as open source

Commercial adds:

Cluster scale-out

SQL Federation

Replication (SQL & RDF)

Advanced RDF security; ABAC & RBAC (ACLs)

Wide tables and more

Up-to-the-minute tech previews via v7fasttrack on GitHub, e.g., superfast TPC-H implementation


Page 29: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Virtuoso Future

Preview of structure-aware RDF store in fall 2014 via v7fasttrack

Integrated graph analytics framework

Embed complex graph algorithms, e.g., community detection, shortest path inside SPARQL/SQL

Comparison of SQL and SPARQL for big data analytics


Page 30: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Linked Data Now

Adoption across major industries

Superior flexibility and time to solution

Dramatic performance gains in the last 5 years

Benchmarking will continue to drive progress, to the benefit of users and vendors alike

Run circles around most open source SQL in SPARQL:

Virtuoso SPARQL beats MySQL in SSB by 100x

With structure awareness, SPARQL to match the best in SQL for data warehousing, OLTP

Linked Data no longer a long shot but a technology that makes sense

Page 31: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

About OpenLink Software

OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry-acclaimed technology innovator in the following areas:

ODBC, JDBC, ADO.NET, and OLE DB compliant Data Access Drivers for Oracle, Microsoft SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL

High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology

Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)

Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)

Web Application Server Technology

Linked Data Deployment & Management

Identity Management

Page 32: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Office Locations

USA

OpenLink Software, Inc., 10 Burlington Mall Road, Suite 265, Burlington, MA 01803. Tel.: +1 781 273 0900. Fax: +1 781 229 8030

UK

OpenLink Software Ltd., Airport House, Purley Way, Croydon, Surrey CR0 0XZ. Tel.: +44 (0)20 8681 7701. Fax: +44 (0)20 8681 7702


Page 33: Virtuoso, The Prometheus of RDF -- Semantics 2014 Conference Keynote

Additional Information

Web Sites

OpenLink Software

YouID – Digital Identity Card (Certificate) Generator

OpenLink Data Spaces – Semantically enhanced Personal & Enterprise Data Spaces & Collaboration Platform

OpenLink Virtuoso - Hybrid Data Management, Integration, Application, and Identity Server

Universal Data Access Drivers - High-Performance ODBC, JDBC, ADO.NET, and OLE-DB Drivers

LDAP and NetID-TLS – How to use LDAP scheme URIs with NetID-TLS Authentication

Social Media Data Spaces

http://www.openlinksw.com/weblog/oerling/ (Orri Erling weblog)

http://kidehen.blogspot.com (Kingsley Idehen weblog)

http://www.openlinksw.com/blog/~kidehen/ (Kingsley Idehen weblog)

https://twitter.com/OpenLink (Twitter)

Hashtags: #LinkedData #SemanticWeb #BigData #RDF (Anywhere).
