NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas (1,3), Martin Brümmer (1), Sebastian Hellmann (1,3),
Jens Lehmann (1), Lazaros Ioannidis (2)
(1) AKSW, University of Leipzig
(2) Aristotle University of Thessaloniki
(3) DBpedia Association
2014-05-27
Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27 1 / 33
LOD Cloud (2011)
Linguistic Communities
Linguistic workshops & conferences
Linguistic LOD Cloud (LLOD Cloud)
Problem definition
Linguistic (related) Data
Purpose-driven definition
Increasing data, ontologies & vocabularies
Newcomers find it hard to understand the ontologies and follow updates
Validation is essential
Many different pipelines (parsing, annotation, disambiguation, etc.)
Errors are propagated
Validation is only partially provided by maintainers (incomplete)
Focus on Lemon & NIF (proof of concept)
Lemon - Lexicon Model for Ontologies
Models lexicon and machine-readable dictionaries
RDF-native form
Linguistically sound structure (LMF)
Separation of the lexicon and ontology layers
Linking to data categories → arbitrarily complex linguistic description
Principle of least power: the less expressive the language, the more reusable the data.
http://lemon-model.net/
Lemon - Example
    :lexicon a lemon:Lexicon ;
        lemon:entry :Pizza , :Tortilla .

    :Pizza a lemon:LexicalEntry ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .

    :Tortilla a lemon:LexicalEntry ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
Lemon - Example (Correct)
    :lexicon a lemon:Lexicon ;
        lemon:language "en" ;
        lemon:entry :Pizza , :Tortilla .

    :Pizza a lemon:LexicalEntry ;
        lemon:canonicalForm [ lemon:writtenRep "Pizza"@en ] ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .

    :Tortilla a lemon:LexicalEntry ;
        lemon:canonicalForm [ lemon:writtenRep "Tortilla"@en ] ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
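The two Lemon examples differ in that the corrected one adds lemon:language to the lexicon and a lemon:canonicalForm to each entry. A minimal Python sketch of that kind of check follows, assuming triples are held as plain tuples of prefixed names; the REQUIRED table and the function name are hypothetical and not part of Lemon or of any validator.

```python
# Triples as (subject, predicate, object) tuples; prefixed names are plain strings.
INCOMPLETE = [
    (":lexicon", "rdf:type", "lemon:Lexicon"),
    (":lexicon", "lemon:entry", ":Pizza"),
    (":Pizza", "rdf:type", "lemon:LexicalEntry"),
]

# Hypothetical rule table: class -> property each instance is expected to carry.
REQUIRED = {
    "lemon:Lexicon": "lemon:language",
    "lemon:LexicalEntry": "lemon:canonicalForm",
}


def missing_properties(triples):
    """Return sorted (subject, property) pairs for instances lacking a required property."""
    present = {(s, p) for s, p, _ in triples}
    missing = set()
    for s, p, o in triples:
        if p == "rdf:type" and o in REQUIRED and (s, REQUIRED[o]) not in present:
            missing.add((s, REQUIRED[o]))
    return sorted(missing)


print(missing_properties(INCOMPLETE))
```

Adding the lemon:language and lemon:canonicalForm triples to the data makes the returned list empty.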
NIF - NLP Interchange Format
RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
In a nutshell:
Logical formalisation of strings and annotations
Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147
Reuse of RDF tool stack
Decreases development cost for integration
Integrated in:
DBpedia Spotlight, Stanford Core NLP, OpenNLP, RDFace, Validator, CoNLL converter, ...
NIF - Overview
NIF - Example
    <http://abc.com/doc#char=0,17>
        a nif:Context ;
        a nif:RFC147String ;
        nif:beginIndex "0" ;
        nif:endIndex "17" ;
        nif:isString "My dog likes pizza" .

    <http://abc.com/doc#char=2,7>
        a nif:RFC5147String ;
        nif:anchorOf "dog" ;
        nif:referenceContext <http://abc.com/doc#char=0,17> ;
        itsrdf:taClassRef dbo:Animal .
NIF - Example (Correct)
    <http://abc.com/doc#char=0,18>
        a nif:Context ;
        a nif:RFC5147String ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "18"^^xsd:nonNegativeInteger ;
        nif:isString "My dog likes pizza"^^xsd:string .

    <http://abc.com/doc#char=2,7>
        a nif:RFC5147String ;
        nif:beginIndex "2"^^xsd:nonNegativeInteger ;
        nif:endIndex "7"^^xsd:nonNegativeInteger ;
        nif:anchorOf "dog"^^xsd:string ;
        nif:referenceContext <http://abc.com/doc#char=0,18> ;
        itsrdf:taClassRef dbo:Animal .
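One change between the two NIF examples is the context's end index, which moves from 17 to 18. A small Python sketch of the underlying rule follows, assuming a context is meant to span its whole string, so that its end index equals the number of characters; the function name is illustrative only.

```python
def context_indices_ok(is_string: str, begin_index: int, end_index: int) -> bool:
    """Assumed rule: a context starts at 0 and ends at the length of its string."""
    return begin_index == 0 and end_index == len(is_string)


text = "My dog likes pizza"
print(len(text))                         # 18
print(context_indices_ok(text, 0, 17))   # False
print(context_indices_ok(text, 0, 18))   # True
```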
Maintainer validation
Lemon
Python script
24 tests for structural criteria
too slow on big datasets
poor reporting
NIF
SPARQL queries
11 tests for common errors
not complete
Built on previous work
Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri. In WWW 2014.
Horizontal, multi-domain data quality assessment
Massive detection of errors for five large-scale LOD data sets
291 vocabularies, independent of their domain or purpose
New contributions:
Relation to OWL reasoners
Test Driven Data Engineering Ontology
Domain-specific validation
Quickly improving existing validation options provided by maintainers
Test-Driven Data Development Methodology
Test case: a data constraint that involves one or more triples
Test suite: a set of test cases for testing a dataset
Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
A Fail is reported as an error, warning or notice
RDF: basis for both data and schema
Unified model facilitates automatic test case generation
SPARQL serves as the test case definition language
Example test case
A nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex
Test cases are written in SPARQL
    SELECT ?s WHERE {
        ?s nif:beginIndex ?v1 .
        ?s nif:endIndex ?v2 .
        FILTER ( ?v1 > ?v2 ) }
We query for errors
Success: Query returns empty result set
Fail: Query returns results
Every result we get is a violation instance
Timeout / Error: needs further investigation on SPARQL engine capabilities, query syntax or query complexity
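The mapping from query results to a test status can be sketched in Python. This is an in-memory analogue of the SPARQL test case above, not the way any particular tool runs it: the data, the function names and the blanket exception handler are assumptions, and Timeout is not modelled.

```python
def begin_after_end(triples):
    """Python analogue of the SPARQL test case: subjects with beginIndex > endIndex."""
    begins = [(s, o) for s, p, o in triples if p == "nif:beginIndex"]
    ends = [(s, o) for s, p, o in triples if p == "nif:endIndex"]
    return sorted({s for s, b in begins for s2, e in ends if s == s2 and b > e})


def run_test_case(test, triples):
    """Empty result -> Success, any result -> Fail, exception -> Error."""
    try:
        violations = test(triples)
    except Exception:  # stand-in for engine or network problems
        return "Error", []
    return ("Fail" if violations else "Success"), violations


# Hypothetical data: ex:b has its indices swapped.
DATA = [
    ("ex:a", "nif:beginIndex", 2), ("ex:a", "nif:endIndex", 7),
    ("ex:b", "nif:beginIndex", 9), ("ex:b", "nif:endIndex", 4),
]
print(run_test_case(begin_after_end, DATA))  # ('Fail', ['ex:b'])
```

Each element of the returned list corresponds to one violation instance.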
Patterns & Bindings
Data Quality Test Patterns (DQTP): abstract patterns that can be further refined into concrete data quality test cases using test pattern bindings
Existing library of 20 patterns
    SELECT ?s WHERE {
        ?s %%P1%% ?v1 .
        ?s %%P2%% ?v2 .
        FILTER ( ?v1 %%OP%% ?v2 ) }
Bindings: mapping of variables to valid pattern replacements
    P1 => nif:beginIndex  |  SELECT ?s WHERE {
    P2 => nif:endIndex    |      ?s nif:beginIndex ?v1 .
    OP => >               |      ?s nif:endIndex ?v2 .
                          |      FILTER ( ?v1 > ?v2 ) }
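Instantiating a pattern with a set of bindings amounts to placeholder substitution. A minimal Python sketch, assuming placeholders always have the form %%NAME%%; the function name is illustrative and the code is not taken from RDFUnit.

```python
import re

PATTERN = "SELECT ?s WHERE { ?s %%P1%% ?v1 . ?s %%P2%% ?v2 . FILTER ( ?v1 %%OP%% ?v2 ) }"


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace every %%NAME%% placeholder with its bound value."""
    def substitute(match):
        return bindings[match.group(1)]  # raises KeyError if a placeholder is unbound
    return re.sub(r"%%(\w+)%%", substitute, pattern)


query = instantiate(PATTERN, {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"})
print(query)
```

Raising on an unbound placeholder is a design choice of this sketch: a half-instantiated query would otherwise be sent to the endpoint and fail there.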
Test Auto Generators (TAGs)
RDF(S) & OWL (partial) support
Query schema for supported axioms
    SELECT DISTINCT ?T1 ?T2 WHERE {
        ?T1 owl:disjointWith ?T2 . }
For every result, a binding to a pattern is generated and a test case is instantiated
Supported axioms at the moment:
RDFS: domain & range
OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated
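The generator idea can be sketched in Python for the owl:disjointWith case: each axiom found in the schema yields one binding and therefore one test case. The pattern text, the ex: classes and the function name below are hypothetical and chosen only for illustration.

```python
# Hypothetical pattern: a resource typed with two disjoint classes is a violation.
DISJOINT_PATTERN = "SELECT ?s WHERE { ?s a %%T1%% . ?s a %%T2%% . }"


def generate_disjointness_tests(schema_triples):
    """Create one instantiated test query per owl:disjointWith axiom in the schema."""
    tests = []
    for subject, predicate, obj in schema_triples:
        if predicate == "owl:disjointWith":
            query = DISJOINT_PATTERN.replace("%%T1%%", subject).replace("%%T2%%", obj)
            tests.append(query)
    return tests


# Hypothetical schema fragment.
SCHEMA = [
    ("ex:Cat", "owl:disjointWith", "ex:Dog"),
    ("ex:Cat", "rdfs:subClassOf", "ex:Animal"),
]
for test in generate_disjointness_tests(SCHEMA):
    print(test)
```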
Test Case Elicitation Workflow
TD(D)D vs Reasoners
SPARQL test cases detect a subset of the validation errors detectable by an OWL reasoner. Limited by:
SPARQL endpoint reasoning support
limitations of the OWL-to-SPARQL translation
SPARQL test cases detect validation errors not expressible in OWL
OWL reasoning is often not feasible on large datasets.
Datasets are already deployed and accessible via SPARQL endpoints
The pattern library is a more user-friendly way to build validation rules than modelling OWL axioms.
requires familiarity
uncommon validations require manual SPARQL test cases
Data Engineering Ontology
Input / Output entirely in RDF
Model the methodology in OWL
test suites, test cases, patterns, auto generators
Strict to serve as a validation layer
Four different levels of error reporting
simple test case report (success, fail) / enriched with counts
violation instance reporting / enriched with annotations
Reuse dcterms, prov, spin, rlog
Data Engineering Ontology - Definition & Generation
Data Engineering Ontology - Result Representation
Lemon & NIF Test case elicitation
RDFUnit Suite implements our methodology
Run on Lemon & NIF ontologies
TAGs could not yet handle some complex owl:Restrictions
owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases
Manual test cases for constraints not captured in OWL.
Total Domain Range Datatype Card. Disj. Func. I. Func. Manual
Lemon 182 40 34 1 29 64 3 1 10
NIF 96 42 24 4 6 10 10
Example of manual Lemon test case
lemon:narrower denotes that one sense of a word is narrower than the other and must never be symmetric or contain cycles.
    SELECT DISTINCT ?s WHERE {
        ?s lemon:narrower+ ?narrower .
        ?narrower lemon:narrower+ ?s . }
lemon:language must not have a language tag (RDF 1.1 to the rescue)
    SELECT DISTINCT ?s WHERE {
        ?s lemon:language ?v1 .
        FILTER ( lang(?v1) != "" ) }
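The lemon:narrower query matches every sense that can reach itself through one or more narrower links, which covers both symmetric pairs and longer cycles. A Python sketch of the same reachability check over an in-memory list of (sense, narrower sense) pairs follows; the sense names are made up for illustration.

```python
def narrower_cycles(pairs):
    """Return the senses that can reach themselves via one or more narrower links."""
    graph = {}
    for source, target in pairs:
        graph.setdefault(source, set()).add(target)

    def reachable(start):
        # Depth-first traversal collecting everything reachable in at least one step.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(s for s in graph if s in reachable(s))


# Hypothetical senses: a -> b -> c -> a forms a cycle, d -> e does not.
print(narrower_cycles([("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]))  # ['a', 'b', 'c']
```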
Example of manual NIF test case
Ensure that the nif:beginIndex and nif:endIndex offsets are correct
    SELECT DISTINCT ?s WHERE {
        ?s nif:anchorOf ?anchorOf ;
           nif:beginIndex ?beginIndex ;
           nif:endIndex ?endIndex ;
           nif:referenceContext [ nif:isString ?referenceString ] .
        BIND (SUBSTR(?referenceString, ?beginIndex, (?endIndex - ?beginIndex)) AS ?test) .
        FILTER ( str(?test) != str(?anchorOf) ) . }
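The idea behind this test, comparing the annotated text with the slice of the context string selected by the offsets, can be sketched in Python. The sketch assumes zero-based offsets with an exclusive end; Python slices count from 0 whereas SPARQL's SUBSTR counts from 1, so this is not a literal port of the query. The sample offsets are illustrative.

```python
def anchor_matches(reference_string: str, begin_index: int, end_index: int, anchor_of: str) -> bool:
    """True when the text between the two offsets equals the annotated string."""
    return reference_string[begin_index:end_index] == anchor_of


text = "My dog likes pizza"
print(anchor_matches(text, 13, 18, "pizza"))  # True
print(anchor_matches(text, 13, 17, "pizza"))  # False: the slice is "pizz"
```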
Evaluation Datasets
Name | Description | Ontology | Type
lemon datasets
LemonUby Wiktionary EN | Conversion of the English Wiktionary into the UBY-LMF model | lemon, UBY-LMF | Dictionary
LemonUby Wiktionary DE | Conversion of the German Wiktionary into the UBY-LMF model | lemon, UBY-LMF | Dictionary
LemonUby Wordnet | Conversion of the Princeton WordNet 3.0 into the UBY-LMF model | lemon, UBY-LMF | WordNet
DBpedia Wiktionary | Conversion of the English Wiktionary into lemon | lemon | Dictionary
QHL | Multilingual translation graph from more than 50 lexicons | lemon | Dictionary
NIF datasets
Wikilinks | Sample of 60976 randomly selected phrases linked to Wikipedia articles | NIF | NER
DBpedia Spotlight dataset | 58 manually NE annotated natural language sentences | NIF | NER
KORE 50 evaluation dataset | 50 NE annotated natural language sentences from the AIDA corpus | NIF | NER
News-100 | 100 manually annotated German news articles | NIF | NER
RSS-500 | 500 manually annotated sentences from 1,457 RSS feeds | NIF | NER
Reuters-128 | 128 news articles manually curated | NIF | NER
Evaluation results
Dataset      Size  SC   FL  TO  ER  Auto Errors  Man Errors  MWarn  MInfo
WiktDBp      60M   177  5   -   -   3.746.103    7.521.791   -      3.582.837
WktEN        8M    168  14  -   -   752.018      394.766     -      633.270
WktDE        2M    170  12  -   -   273.109      66.268      -      155.598
Wordnet      4M    166  16  -   -   257.228      36          -      257.204
QHL          3M    170  11  -   1   433.118      538.933     -      538.016
Wikilinks    0.6M  91   4   -   1   141.528      21.246      -      -
News-100     13K   91   2   -   3   3.510        -           -      -
RSS-500      10K   91   2   -   3   3.000        -           -      -
Reuters-128  7K    91   2   -   3   2.016        -           -      -
Spotlight    3K    92   3   -   1   662          68          -      -
KORE50       2K    89   6   -   1   301          55          -      -
SC = success, FL = fail, TO = timeout, ER = error (number of test cases)
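The status columns can be cross-checked against the test case totals reported earlier (182 for Lemon, 96 for NIF): for each dataset, SC + FL + TO + ER should add up to the size of the test suite that was run. A short Python sketch with the counts copied from the table, reading "-" as zero (an assumption):

```python
# (SC, FL, TO, ER) per dataset, copied from the results table; "-" is read as 0.
LEMON = {
    "WiktDBp": (177, 5, 0, 0), "WktEN": (168, 14, 0, 0), "WktDE": (170, 12, 0, 0),
    "Wordnet": (166, 16, 0, 0), "QHL": (170, 11, 0, 1),
}
NIF = {
    "Wikilinks": (91, 4, 0, 1), "News-100": (91, 2, 0, 3), "RSS-500": (91, 2, 0, 3),
    "Reuters-128": (91, 2, 0, 3), "Spotlight": (92, 3, 0, 1), "KORE50": (89, 6, 0, 1),
}


def totals(results):
    """Sum the four status counts for every dataset."""
    return {name: sum(counts) for name, counts in results.items()}


print(set(totals(LEMON).values()))  # {182}
print(set(totals(NIF).values()))    # {96}
```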
Conclusion
Extended a previously introduced methodology for test-driven quality assessment
Data engineering ontology
Devised 277 test cases for NLP datasets using the Lemon and NIF vocabularies
Revealed a substantial number of errors for Lemon & NIF datasets
Future directions
extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF)
automatic dependencies between test cases
wrap RDFUnit for NLP services (integrated in NIF)
Thank you!
Dimitris Kontokostas
With kind support of
John McCrae (Lemon model)
http://rdfunit.aksw.org
http://github.com/AKSW/RDFUnit
#eswc2014kontokostas