Click here to load reader

A Scalable RDBMS-Based Inference Engine for RDFS/OWL

  • View
    33

  • Download
    3

Embed Size (px)

DESCRIPTION

A Scalable RDBMS-Based Inference Engine for RDFS/OWL. Oracle New England Development Center [email protected] 1. Agenda. Background 10gR2 RDF 11g RDF/OWL 11g OWL support RDFS++, OWLSIF, OWLPrime Inference design & implementation in RDBMS Performance - PowerPoint PPT Presentation

Text of A Scalable RDBMS-Based Inference Engine for RDFS/OWL

  • A Scalable RDBMS-Based Inference Engine for RDFS/OWL Oracle New England Development [email protected]*

  • AgendaBackground10gR2 RDF11g RDF/OWL11g OWL supportRDFS++, OWLSIF, OWLPrimeInference design & implementation in RDBMSPerformanceCompleteness evaluation through queriesFuture work

    *

  • StorageUse DMLs to insert triples incrementallyinsert into rdf_data values (, sdo_rdf_triple_s(1, , , ));Use Fast Batch Loader with a Java interfaceInference (forward chaining based)Support RDFS inferenceSupport User-Defined rulesPL/SQL API create_rules_indexQuery using SDO_RDF_MATCHSelect x, y from table(sdo_rdf_match((?x rdf:type :Protein) (?x :name ?y) .));Seamless SQL integrationShipped in 2005Oracle 10gR2 RDFOracle Database*

  • Oracle 11g RDF/OWLNew featuresBulk loaderNative OWL inference support (with optional proof generation)Semantic operatorsPerformance improvementMuch faster compared to 10gR2LoadingQueryInference

    Shipped (Linux platform) in 2007

    Java API support (forthcoming)Jena & Sesame

    *

  • Oracle 11g OWL is a scalable, efficient, forward-chaining based reasoner that supports an expressive subset of OWL-DL*

  • Why?Why inside RDBMS?Size of semantic data grows really fast.RDBMS has transaction, recovery, replication, security, RDBMS is efficient in processing queries.Why OWL-DL?It is a widely adopted ontology language standard.Why OWL-DL subset?Have to support large ontologies (with large ABox)Hundreds of millions of triples and beyondNo existing reasoner handles complete DL semantics at this scaleNeither Pellet nor KAON2 can handle LUBM10 or ST ontologies on a setup of 64 Bit machine, 4GB HeapWhy forward chaining?Efficient query supportCan accommodate any graph query patterns

    1 The summary Abox: Cutting Ontologies Down to Size. ISWC 2006*

  • OWL Subsets SupportedThree subsets for different applicationsRDFS++RDFS plus owl:sameAs and owl:InverseFunctionalProperty

    OWLSIF (OWL with IF semantics)Based on Dr. Horsts pD* vocabulary

    OWLPrimerdfs:subClassOf, subPropertyOf, domain, rangeowl:TransitiveProperty, SymmetricProperty, FunctionalProperty, InverseFunctionalProperty, owl:inverseOf, sameAs, differentFromowl:disjointWith, complementOf,owl:hasValue, allValuesFrom, someValuesFromowl:equivalentClass, equivalentProperty

    Jointly determined with domain experts, customers and partners

    1 Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary OWL DLOWL LiteOWLPrime*

  • RDFS has 14 entailment rules defined in the SPEC.E.g. rule : aaa rdfs:domain XXX . uuu aaa yyy . uuu rdf:type XXX .

    OWLPrime has 50+ entailment rules.E.g. rule : aaa owl:inverseOf bbb . bbb rdfs:subPropertyOf ccc . ccc owl:inverseOf ddd . aaa rdfs:subPropertyOf ddd . xxx owl:disjointWith yyy . a rdf:type xxx . b rdf:type yyy . a owl:differentFrom b . Semantics Characterized by Entailment Rules These rules have efficient implementations in RDBMS*

  • One very heavily used space is that where RDFS plus some minimal OWL is used to enhance data mapping or to develop simple schemas. -James Hendler

    Complexity distribution of existing ontologies Out of 1,200+ real-world OWL ontologiesCollected using Swoogle, Google, Protg OWL Library, DAML ontology library 43.7% (or 556) ontologies are RDFS30.7% (or 391) ontologies are OWL Lite20.7% (or 264) ontologies are OWL DL.Remaining OWL FULLApplications of Partial DL Semantics1 http://www.mindswap.org/blog/2006/11/11/an-alternative-view-for-owl-11/2 A Survey of the web ontology landscape. ISWC 2006*

    Chart2

    556

    391

    264

    61

    Sheet1

    RDFSLiteDLFull

    55639126461

    Sheet1

    0

    0

    0

    0

    Sheet2

    Sheet3

  • Option1: add user-defined rulesBoth 10gR2 RDF and 11g RDF/OWL supports user-defined rules in this form (filter is supported)

    E.g. to support core semantics of owl:intersectionOf

  • Option2: Separation in TBox and ABox reasoningTBox tends to be small in sizeGenerate a class subsumption tree using complete DL reasoners (like Pellet, KAON2, Fact++, Racer, etc)ABox can be arbitrarily largeUse Oracle OWL to infer new knowledge based on the class subsumption tree from TBox

    Support Semantics beyond OWLPrime (2)TBoxTBox & Complete class treeABoxDL reasoner OWL Prime*

  • 11g OWL Inference PL/SQL APISEM_APIS.CREATE_ENTAILMENT(Index_namesem_models(GraphTBox, GraphABox, ), sem_rulebases(OWLPrime),passes,Inf_components,Options)Use PROOF=T to generate inference proof

    SEM_APIS.VALIDATE_ENTAILMENT(sem_models((GraphTBox, GraphABox, ), sem_rulebases(OWLPrime),Criteria,Max_conflicts,Options)Above APIs can be invoked from Java clients through JDBCTypical Usage:First load RDF/OWL dataCall create_entailment to generate inferred graphQuery both original graph and inferred data Inferred graph contains only new triples! Saves time & resourcesTypical Usage:First load RDF/OWL dataCall create_entailment to generate inferred graphCall validate_entailment to find inconsistencies

    *

  • Advanced OptionsGive users more control over inference processSelective inference (component based)Allows more focused inference. E.g. give me only the subClassOf hierarchy.

    Set number of passesNormally, inference continue till no further new triples foundUsers can set the number of inference passes to see if what they are interested has already been inferredE.g. I want to know whether this person has more than 10 friends

    Set tablespaces used, parallel index build

    Change statistics collection scheme

    *

  • 11g OWL Usage ExampleCreate an application tablecreate table app_table(triple sdo_rdf_triple_s);Create a semantic modelexec sem_apis.create_sem_model(family,app_table,triple);Load dataUse DML, Bulk loader, or Batch loaderinsert into app_table (triple) values(1, sdo_rdf_triple_s(family', , , ));Run inferenceexec sem_apis.create_entailment(family_idx,sem_models(family), sem_rulebases(owlprime));Query both original model and inferred dataselect p, o from table(sem_match('( ?p ?o)', sem_models(family'), sem_rulebases(owlprime), null, null));

    *After inference is done, what will happen if- New assertions are added to the graphInferred data becomes incomplete. Existing inferred data will be reused if create_entailment API invoked again. Faster than rebuild.- Existing assertions are removed from the graphInferred data becomes invalid. Existing inferred data will not be reused if the create_entailment API is invoked again.

  • Separate TBox and ABox ReasoningUtilize Pellet and Oracles implementation of Jena Graph APICreate a Jena Graph with Oracle backendCreate a PelletInfGraph on top of itPelletInfGraph.getDeductionsGraphIssues encountered: no subsumption for anonymous classes from Pellet inference.
  • Soundness Soundness of 11g OWL verified throughComparison with other well-tested reasonersProof generationA proof of an assertion consists of a rule (name), and a set of assertions which together deduce that assertion.Option PROOF=T instructs 11g OWL to generate proof

    TripleID1 :emailAddress rdf:type owl:InverseFunctionaProperty .TripleID2 :John :emailAddress :John_at_yahoo_dot_com .TripleID3 :Johnny :emailAddress :John_at_yahoo_dot_com .

    :John owl:sameAs :Johnny (proof := TripleID1, TripleID2, TripleID3, IFP)

    *

  • Design & Implementation *

  • Design Flow Extract rulesEach rule implemented individually using SQLOptimization SQL TuningRule dependency analysisDynamic statistics collectionBenchmarkingLUBMUniProtRandomly generated test cases

    TIPAvoid incremental index maintenancePartition data to cut cost Maintain up-to-date statistics

    *

  • Background- Storage schemeTwo major tables for storing graph dataVALUES table stores mapping from URI (etc) to integersIdTriplesTable stores basically SID, PID, OID

    Execution Flow Inference StartCheck/Fire Rule 1

    Check/Fire Rule 2

    Check/Fire Rule n InsertNew triples?Y Create1234N5Exchange Table

    . Exchange Partition IdTriplesTable

    Partition for a semantic modelNew Partition for inferred graph6Un-indexed, Partitioned Temporary Table SID PID OID

    . *

  • Performance Evaluation*

  • Database SetupLinux based commodity PC (1 CPU, 3GHz, 2GB RAM)Database installed on machine semperf3

    semperf3

    semperf1

    semperf2Giga-bit NetworkDatabase 11g Two other PCs are just serving storage over network*

  • Machine/Database ConfigurationNFS configurationrw,noatime,bg,intr,hard,timeo=600,wsize=32768,rsize=32768,tcpHard disks: 320GB SATA 7200RPM (much slower than RAID). Two on each PCDatabase (11g release on Linux 32bit platform)*

  • Tablespace ConfigurationCreated bigfile (temporary) tablespacesLOG files located on semperf3 diskA

    *

  • Inference Performance Results collected on a single CPU PC (3GHz), 2GB RAM (1.4G dedicate to DB), Multiple Disks over NFS2.52k triples/s6.49k triples/sAs a reference (not a comparison)

    BigOWLIM loads, inferences, and stores (2GB RAM, P4 3.0GHz, java -Xmx1600)

    - LUBM50 in 26 minutes - LUBM1000 in 11h 20min

    Note: Our inference time does not include loading time! Also, set of rules is different.1 From OWLIM Pragmatic OWL Semantic Repository slides, Sept. 2007*

    Ontology (size)RDFSOWLPrimeOWLPrime + Pellet on TBox#Triples inferred(millions)Time#Triples inferred(millions)Time#Triples inferred(millions)Time

    LUBM506.8 million2.7512min 14s3.058m 01s3.258min 21sLUBM1000133.6 million55.097h 19min61.257hr 0min65.257h 12mUniProt20 million3.424min 06s50.83hr 1minNANA

    Chart1

    8.3

    212

    432

    Inference time (minutes)

    Number of universities

    Inference Time (minutes)

    OWLPrime Inference (with Pellet on TBox)

    Sheet1

    UniversitiesTime (minutes)Speed

    508.36.49

    500212

    10004322.52

    Sheet1

    Inference time (minutes)

    Number of universities

    Inference Time (minutes)

    OWLPrime Inference (with Pellet on TBox)

    Sheet2

    Sheet3

  • Query Answering After Inference

    LUBM ontology has intersectionOf, Restriction etc. that are not supported by OWLPrime*

    Ontology LUBM50

    6.8 million & 3+ million inferred LUBM Benchmark QueriesQ1Q2Q3Q4Q5Q6Q7

    OWLPrime# answers413063471939373059Complete?YYYYYNNOWLPrime + Pellet on TBox# answers413063471951984267Complete?YYYYYYY

  • Query Answering After Inference (2)*

    Ontology LUBM50

    6.8 million & 3+ million inferred LUBM Benchmark QueriesQ8Q9Q10Q11Q12Q13Q14

    OWLPrime# answers5916653802240228393730Complete?NNNYNYYOWLPrime + Pellet on TBox# answers779013639422415228393730Complete?YYYYYYY

  • Query Answering After Inference (3)*

    Ontology LUBM1000

    133 million & 60+ million inferred LUBM Benchmark QueriesQ1Q2Q3Q4Q5Q6Q7

    OWLPrime# answers42528634719792476559Complete?YUnknownYYYNNOWLPrime + Pellet on TBox# answers425286347191044738167Complete?YUnknownYYYUnknownY

  • Query Answering After Inference (4)*

    Ontology LUBM1000

    133 million & 60+ million inferred LUBM Benchmark QueriesQ8Q9Q10Q11Q12Q13Q14

    OWLPrime# answers59161319690224047607924765Complete?NNNYNUnknownUnknownOWLPrime + Pellet on TBox# answers779027298242241547607924765Complete?YUnknownYYYUnknownUnknown

  • Implement more rules to cover even richer DL semanticsFurther improve inference performanceSeek a standardization of the set of rules.To promote interoperability among vendorsLook into schemes that cut the size of ABox Look into incremental maintenance

    Future Work*

  • For More Informationhttp://search.oracle.comorhttp://www.oracle.com/semantic technologies*

  • Appendix*

    *