Validating statistical index data represented in RDF using SPARQL queries. WESO Research Group, University of Oviedo, Spain. Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez


Presented at the W3C RDF Validation Workshop: https://www.w3.org/2012/12/rdf-val/


Page 1: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Validating statistical index data represented in RDF using SPARQL queries

WESO Research Group
University of Oviedo, Spain

Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez

Page 2: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Motivation

The WebIndex Project
  Measure impact of the Web in different countries
  First publication: 2012, 2 web sites:
    Visualization: http://thewebindex.org
    Data portal: http://data.webfoundation.org

Page 3: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

WebIndex Workflow

Workflow diagram: Raw data (Excel), Computed data (Excel), External sources, All data (RDF datastore), Visualizations

Page 4: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Technical details

Index made from
  61 countries (100 planned for 2013)
  85 indicators:
    51 Primary (questionnaires)
    34 Secondary (external sources)

WebIndex RDF data
  Modeled on top of RDF Data Cube
  More than 1 million triples
  Linked data: DBPedia, Organizations, etc.
  Records data computation

Page 5: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

WebIndex computation process (1)

Simplified with one indicator, 3 years and 4 countries

Steps: Mean and Average growth (imputation), z-score (normalization)

Raw Data

Country   2009   2010   2011
Spain     4      5      3
Finland   4             6
Armenia   1
Chile     6      8

Imputed Data

Country   2009   2010   2011
Spain     4      5      3
Finland   4      5      6
Armenia   1      1      1
Chile     6      8      10.6

Filtered Data (Indicator A): only Spain, Finland and Chile are kept

Normalized Data (z-scores)

Country   2009    2010    2011
Spain     -0.57   -0.57   -0.92
Finland   -0.57   -0.57   -0.14
Chile     1.15    1.15    1.06

More details can be found here: http://thewebindex.org/about/methodology/computation/
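A minimal Python sketch of the z-score step. It assumes, since the slide does not say so explicitly, that each year is normalized across the countries that remain after filtering (Armenia is absent from the normalized table) and that the sample standard deviation is used. Under those assumptions the result agrees with the normalized table to within 0.01; the slide appears to truncate rather than round. The names `filtered` and `z_scores` are illustrative only.

```python
from statistics import mean, stdev

# Imputed values for indicator A, restricted to the three countries
# that appear in the normalized table (assumption: Armenia was filtered out).
filtered = {
    "Spain":   {2009: 4, 2010: 5, 2011: 3},
    "Finland": {2009: 4, 2010: 5, 2011: 6},
    "Chile":   {2009: 6, 2010: 8, 2011: 10.6},
}


def z_scores(table):
    """Normalize each year (column) across countries with the sample std dev."""
    years = sorted({year for row in table.values() for year in row})
    result = {country: {} for country in table}
    for year in years:
        column = [table[country][year] for country in table]
        mu, sigma = mean(column), stdev(column)
        for country in table:
            result[country][year] = (table[country][year] - mu) / sigma
    return result


normalized = z_scores(filtered)
for country, row in normalized.items():
    print(country, [round(row[year], 2) for year in sorted(row)])
```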

Page 6: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

WebIndex computation Process (2)

Simplified with one indicator, 3 years and 4 countries

Normalized Data (z-scores)

Country   2009    2010    2011
Spain     -0.57   -0.57   -0.92
Finland   -0.57   -0.57   -0.14
Chile     1.15    1.15    1.06

Adjusted data: $x_i \leftarrow x_i + \delta$

Country   A   B   C     D     ...
Spain     8   7   9.1   7.1   ...
Finland   7   8   7.1   8     ...
Chile     8   9   7.6   6     ...

Group indicators

Country   Readiness   Impact   Web   Composite
Spain     5.7         3.5      5.1   4.5
Finland   5.5         3.9      7.1   4.9
Chile     6.7         4.5      7.6   5.1

Rankings

Country   Readiness   Impact   Web   Composite
Spain     2           3        3     3
Finland   3           2        2     2
Chile     1           1        1     1

More details can be found here: http://thewebindex.org/about/methodology/computation/

Page 7: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Example of data representation in RDF

Each observation follows the RDF Data Cube vocabulary extended with metadata about how it was obtained

Example:

obs:obsM23 a qb:Observation ;
    cex:computation [ a cex:Z-Score ;
                      cex:observation obs:obsA23 ;
                      cex:slice slice:sliceA09 ;
                    ] ;
    cex:value -0.57 ;
    cex:md5-checksum "2917835203..." ;
    cex:indicator indicator:A ;
    cex:concept country:ESP ;
    qb:dataSet dataset:A-Normalized ;
    # ... other declarations omitted for brevity

It declares that the value of this observation was obtained as the z-score of obs:obsA23 over slice:sliceA09

Page 8: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Computex

Vocabulary for statistical computations
Extends RDF Data Cube
Ontology at: http://purl.org/weso/computex
Some terms:
  cex:Concept
  cex:Indicator
  cex:Computation
  cex:WeightSchema
  qb:Observation
  ...

Page 9: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Validation process

Last year (2012)
  Template based validation
  MD5 checksum of each observation and some fields

This year (2013)
  SPARQL based validation
  3 levels of validation:
    RDF Data Cube
    Shapes of data
    Computation process

Ultimate goal: automatically compute the index

Page 10: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Validation approach

SPARQL CONSTRUCT queries instead of ASK

IF no error THEN empty model
ELSE RDF graph with error information

CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam [ cex:name "obs" ;    cex:value ?obs ] ,
                   [ cex:name "value1" ; cex:value ?value1 ] ,
                   [ cex:name "value2" ; cex:value ?value2 ] ;
    cex:msg "Observation has two different values"
  ] .
} WHERE {
  ?obs a qb:Observation .
  ?obs cex:value ?value1 .
  ?obs cex:value ?value2 .
  FILTER ( ?value1 != ?value2 )
}

CONSTRUCT queries enable debugging
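The idea of returning error data instead of a yes/no answer can be sketched in plain Python, without an RDF library. This is an illustration only: the triples are hypothetical test data, the function name `check_single_value` is made up, and each conflicting pair is reported once, whereas the SPARQL query above would match both orderings of the pair.

```python
from collections import defaultdict

# Triples as (subject, predicate, object); shortened, hypothetical test data.
triples = [
    ("obs:o1", "rdf:type", "qb:Observation"),
    ("obs:o1", "cex:value", 4.0),
    ("obs:o2", "rdf:type", "qb:Observation"),
    ("obs:o2", "cex:value", 5.0),
    ("obs:o2", "cex:value", 6.0),
]


def check_single_value(triples):
    """Return one error record per observation value conflict (empty list = valid)."""
    observations = {s for s, p, o in triples
                    if p == "rdf:type" and o == "qb:Observation"}
    values = defaultdict(set)
    for s, p, o in triples:
        if p == "cex:value" and s in observations:
            values[s].add(o)
    errors = []
    for obs in sorted(values):
        vals = sorted(values[obs])
        for i, v1 in enumerate(vals):
            for v2 in vals[i + 1:]:
                errors.append({
                    "msg": "Observation has two different values",
                    "obs": obs, "value1": v1, "value2": v2,
                })
    return errors


errors = check_single_value(triples)
print(errors)
```

An empty list plays the role of the empty model; a non-empty list carries the parameters needed to locate the offending observation.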

Page 11: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

SPARQL queries: RDF Data Cube

We converted RDF Data Cube integrity constraints from ASK to CONSTRUCT queries

ASK WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}

CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam [ cex:name "dim" ; cex:value ?dim ] ;
    cex:msg "Every Dimension must have a declared range"
  ] .
} WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}

Page 12: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

SPARQL expressivity

SPARQL can express complex validation patterns. Example: Mean

CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam # ...omitted
    cex:msg "Mean value does not match" ] .
} WHERE {
  ?obs a qb:Observation ;
       cex:computation ?comp ;
       cex:value ?val .
  ?comp a cex:Mean .
  { SELECT (AVG(?value) AS ?mean) ?comp WHERE {
      ?comp cex:observation ?obs1 .
      ?obs1 cex:value ?value .
    } GROUP BY ?comp
  }
  FILTER ( abs(?mean - ?val) > 0.0001 )
}
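The check performed by the Mean query amounts to a tolerance comparison. A small Python sketch, using the Finland 2010 value from the simplified example (declared value 5, assumed here to be the mean of 4 and 6); `check_mean` is a hypothetical helper, not part of Computex.

```python
from statistics import mean


def check_mean(declared, inputs, tolerance=0.0001):
    """Return an error record if `declared` is not the mean of `inputs`, else None."""
    expected = mean(inputs)
    if abs(expected - declared) > tolerance:
        return {"msg": "Mean value does not match",
                "declared": declared, "expected": expected}
    return None


print(check_mean(5, [4, 6]))
print(check_mean(5.5, [4, 6]))
```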

Page 13: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Limitations of SPARQL expressivity

Some built-in functions are not standardized
  Example: z-score employs standard deviation.
    It requires built-in function: sqrt
    Available in some SPARQL implementations

Ranking of values was not obvious (*)
  2 solutions:
    Using GROUP_CONCAT
    Check a value against all the other values
  Neither solution is efficient

(*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql

Page 14: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

SPARQL expressivity

Handling series with RDF Collections

Average growth:

CONSTRUCT {
  # ... omitted for brevity
} WHERE {
  ?obs cex:computation [ a cex:AverageGrowth ; cex:observations ?ls ] ;
       cex:value ?val .
  ?ls rdf:first [ cex:value ?v1 ] .
  { SELECT ?ls ( SUM(?v_n / ?v_n1) / COUNT(*) AS ?meanGrowth ) WHERE {
      ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ;
                      rdf:rest [ rdf:first [ cex:value ?v_n1 ] ] ] .
    } GROUP BY ?ls
  }
  FILTER ( abs(?meanGrowth * ?v1 - ?val) > 0.001 )
}

* This solution was provided by Joshua Taylor
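The arithmetic of the average growth check can be sketched in Python. The sketch assumes the collection lists values newest first, which is inferred from the query (each element is divided by the one that follows it, and the result is multiplied by the first element) and is not stated on the slide. The function names are hypothetical.

```python
def mean_growth(values_newest_first):
    """Average ratio between each value and the next (older) one in the list."""
    if len(values_newest_first) < 2:
        raise ValueError("at least two values are needed to compute growth")
    pairs = zip(values_newest_first, values_newest_first[1:])
    ratios = [newer / older for newer, older in pairs]
    return sum(ratios) / len(ratios)


def check_average_growth(declared, values_newest_first, tolerance=0.001):
    """True if `declared` matches mean growth applied to the newest value."""
    expected = mean_growth(values_newest_first) * values_newest_first[0]
    return abs(expected - declared) <= tolerance


# Chile in the simplified example: 8 (2010) and 6 (2009).
predicted = mean_growth([8, 6]) * 8
print(predicted)
```

With these inputs the prediction is 8 × 8/6, roughly 10.67, which is consistent with the 10.6 shown for Chile in 2011 if the slide truncates to one decimal. A value stored as 10.6 would not pass the 0.001 tolerance, so the real data presumably keeps more digits.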

Page 15: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

RDF Profiles

Descriptions (with constraints) of dataset families
Can be used to validate (and compute) RDF graphs
Examples:
  Cube (statistical data)
    Normalization Update queries + Integrity constraints
  Computex (statistical computations)
    Adds more queries
  Other user defined profiles

Profiles diagram: Cube, Computex, WebIndex, CloudIndex

Page 16: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Cube Profile

W3C Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)

1 base ontology + RDF Schema inference
2 update queries (normalization algorithm)
21 integrity constraint queries

cex:cube a cex:ValidationProfile ;
    cex:name "RDF Data Cube" ;
    cex:ontologyBase cube:cube.ttl ;
    cex:expandSteps (
        [ cex:name "Closure" ; cex:uri cube:closure.ru ]
        [ cex:name "Flatten" ; cex:uri cube:flatten.ru ]
    ) ;
    cex:integrityQuery
        [ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ] ;
    # ...
    .

Page 17: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Computex Profile

1 base ontology: http://purl.org/weso/computex

Imports RDF Data Cube

cex:computex a cex:ValidationProfile ;
    cex:name "Computex" ;
    cex:ontologyBase cex:computex.ttl ;
    cex:import profile:cube.ttl ;
    cex:expandSteps (
        [ cex:name "Copy Raw" ; cex:uri cex:copyRaw.sparql ]
        # Other UPDATE queries
    ) ;
    cex:integrityQuery
        [ cex:name "Adjusted" ; cex:uri cex:adjusted.sparql ] ,
        # Other integrity queries

Page 18: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Computex Validation tool

Available at: http://computex.herokuapp.com

Features
  Informative error messages
  Option to generate EARL report
  Based on profiles
  2 validation profiles (Cube and Computex)

Source code in Scala available at: http://github.com/weso/computex

Page 19: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Computex Validation Tool

http://computex.herokuapp.com

Page 20: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Conclusions

WebIndex = use case for RDF Validation
SPARQL queries as an implementation approach
Challenges:
  Expressivity limits of SPARQL
  Complexity of some queries

Concept of RDF Profiles
  Declarative, Turtle syntax
  Intermediate level between OWL and SPARQL

Page 21: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

END OF PRESENTATION

Page 22: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

COMPLEMENTARY SLIDES

Page 23: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Limits of SPARQL expressivity

Ranking of values using GROUP_CONCAT

SELECT ?x ?v ?ranking {
  ?x :value ?v .
  { SELECT (GROUP_CONCAT(?x; separator="") AS ?ordered) {
      { SELECT ?x { ?x :value ?v . } ORDER BY DESC(?v) }
    }
  }
  BIND (str(?x) AS ?xName)
  BIND (strbefore(?ordered, ?xName) AS ?before)
  BIND ((strlen(?before) / strlen(?xName)) + 1 AS ?ranking)
} ORDER BY ?ranking

Page 24: Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

Limits of SPARQL expressivity

Ranking of values comparing all values

SELECT ?x ?v (COUNT(*) AS ?ranking) WHERE {
  ?x :value ?v .
  [] :value ?u .
  FILTER ( ?v <= ?u )
}
GROUP BY ?x ?v
ORDER BY ?ranking

* This solution was provided by Joshua Taylor
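The same counting idea in Python, as a sketch: the rank of an item is the number of values greater than or equal to its own. The input uses the Readiness scores from the simplified example; `rank_by_comparison` is an illustrative name.

```python
def rank_by_comparison(values):
    """Rank each item by counting the values that are >= its own value."""
    return {
        name: sum(1 for other in values.values() if value <= other)
        for name, value in values.items()
    }


readiness = {"Spain": 5.7, "Finland": 5.5, "Chile": 6.7}
ranking = rank_by_comparison(readiness)
print(sorted(ranking.items(), key=lambda item: item[1]))
```

Every value is compared with every other value, so the cost grows quadratically with the number of items. Tied values all receive the lowest rank of the tied group (two equal top scores would both get rank 2).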