NLP Data Cleansing Based on Linguistic Ontology Constraints

Dimitris Kontokostas (1,3), Martin Brümmer (1), Sebastian Hellmann (1,3), Jens Lehmann (1), Lazaros Ioannidis (2)

(1) AKSW, University of Leipzig

(2) Aristotle University of Thessaloniki

(3) DBpedia Association

2014-05-27

Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27 1 / 33

LOD Cloud (2011)



Linguistic Communities


Linguistic workshops & conferences



Linguistic LOD Cloud (LLOD Cloud)


Problem definition

Linguistic (related) Data

Purpose-driven definition

Increasing Data, ontologies & vocabularies

Newcomers → hard to understand the ontologies and follow updates

Validation is essential

Many different pipelines (parsing, annotation, disambiguation, etc.)

Errors are propagated

Partially provided by maintainers (incomplete)

Focus on Lemon & NIF (proof of concept)


Lemon - Lexicon Model for Ontologies

Models lexicon and machine-readable dictionaries

RDF-native form

Linguistically sound structure (LMF)

Separation of the lexicon and ontology layers

Linking to data categories → arbitrarily complex linguistic description

Principle of least power: the less expressive the language, the more reusable the data.

http://lemon-model.net/


Lemon - Example

    :lexicon a lemon:Lexicon ;
        lemon:entry :Pizza , :Tortilla .

    :Pizza a lemon:LexicalEntry ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .

    :Tortilla a lemon:LexicalEntry ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .


Lemon - Example (Correct)

    :lexicon a lemon:Lexicon ;
        lemon:language "en" ;
        lemon:entry :Pizza , :Tortilla .

    :Pizza a lemon:LexicalEntry ;
        lemon:canonicalForm [ lemon:writtenRep "Pizza"@en ] ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .

    :Tortilla a lemon:LexicalEntry ;
        lemon:canonicalForm [ lemon:writtenRep "Tortilla"@en ] ;
        lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .
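The difference between the two Lemon examples can be checked mechanically. The following is a minimal Python sketch, not RDFUnit's implementation: it holds a few of the triples as plain tuples instead of using an RDF store, the blank node labels `_:s1` and `_:f1` are hypothetical, and the helper name `missing_property` is an assumption made for illustration.

```python
# Triples as (subject, predicate, object) tuples; a stand-in for a real RDF store.
RDF_TYPE = "rdf:type"

# Mirrors the first example: no lemon:language, no lemon:canonicalForm.
incomplete = [
    (":lexicon", RDF_TYPE, "lemon:Lexicon"),
    (":lexicon", "lemon:entry", ":Pizza"),
    (":Pizza", RDF_TYPE, "lemon:LexicalEntry"),
    (":Pizza", "lemon:sense", "_:s1"),
]

# Mirrors the corrected example (blank node labels are hypothetical).
complete = incomplete + [
    (":lexicon", "lemon:language", "en"),
    (":Pizza", "lemon:canonicalForm", "_:f1"),
    ("_:f1", "lemon:writtenRep", "Pizza@en"),
]


def missing_property(triples, rdf_class, prop):
    """Return instances of rdf_class that have no value for prop."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == rdf_class}
    with_prop = {s for s, p, o in triples if p == prop}
    return sorted(instances - with_prop)
```

Calling `missing_property(incomplete, "lemon:LexicalEntry", "lemon:canonicalForm")` lists `:Pizza`, while the same call on `complete` returns an empty list.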


NIF - NLP Interchange Format

RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations

In a nutshell:

Logical formalisation of strings and annotations

Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147

Reuse of RDF tool stack

Decreases development cost for integration

Integrated in:

DBpedia Spotlight, Stanford CoreNLP, OpenNLP, RDFace, Validator, CoNLL converter, ...


NIF - Overview


NIF - Example

    <http://abc.com/doc#char=0,17>
        a nif:Context ;
        a nif:RFC147String ;
        nif:beginIndex "0" ;
        nif:endIndex "17" ;
        nif:isString "My dog likes pizza" .

    <http://abc.com/doc#char=2,7>
        a nif:RFC5147String ;
        nif:anchorOf "dog" ;
        nif:referenceContext <http://abc.com/doc#char=0,17> ;
        itsrdf:taClassRef dbo:Animal .


NIF - Example (Correct)

    <http://abc.com/doc#char=0,18>
        a nif:Context ;
        a nif:RFC5147String ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "18"^^xsd:nonNegativeInteger ;
        nif:isString "My dog likes pizza"^^xsd:string .

    <http://abc.com/doc#char=2,7>
        a nif:RFC5147String ;
        nif:beginIndex "2"^^xsd:nonNegativeInteger ;
        nif:endIndex "7"^^xsd:nonNegativeInteger ;
        nif:anchorOf "dog"^^xsd:string ;
        nif:referenceContext <http://abc.com/doc#char=0,18> ;
        itsrdf:taClassRef dbo:Animal .
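One of the differences between the two NIF examples is the end index of the context: the string "My dog likes pizza" has 18 characters, not 17. A rough Python sketch of that single check follows. It assumes each context is given as a hypothetical `(uri, begin, end, text)` tuple and that the index span of a context should equal the length of its `nif:isString` value; the function name `context_violations` is illustrative only.

```python
def context_violations(contexts):
    """Return URIs whose index span does not match the length of nif:isString."""
    bad = []
    for uri, begin, end, text in contexts:
        # len() counts Unicode code points, which is the assumed counting unit here.
        if end - begin != len(text):
            bad.append(uri)
    return bad


first_example = [("http://abc.com/doc#char=0,17", 0, 17, "My dog likes pizza")]
corrected_example = [("http://abc.com/doc#char=0,18", 0, 18, "My dog likes pizza")]
```

Under these assumptions, `context_violations(first_example)` reports the context URI and `context_violations(corrected_example)` reports nothing.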


Maintainer validation

Lemon

Python script

24 tests for structural criteria

too slow on big datasets

poor reporting

NIF

SPARQL queries

11 tests for common errors

not complete


Built on previous work

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri. In WWW 2014.

Horizontal, multi-domain data quality assessment

Massive detection of errors for five large-scale LOD data sets

291 vocabularies, independent of their domain or purpose

New contributions:

Relation to OWL reasoners

Test Driven Data Engineering Ontology

Domain-specific validation

Quickly improving existing validation options provided by maintainers


Test-Driven Data Development Methodology

Test case: a data constraint that involves one or more triples

Test suite: a set of test cases for testing a dataset

Status: Success, Fail, Timeout (complexity) or Error (e.g. network)

Fail: Error, warning or notice

RDF: basis for both data and schema

Unified model facilitates automatic test case generation

SPARQL serves as the test case definition language
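The four statuses can be modelled in a few lines. The sketch below is an illustration in Python and not RDFUnit's code: a test case is represented by any callable that returns the violating resources, and the names `Status`, `run_test_case`, `run_test_suite` and `slow_query` are assumptions introduced here.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAIL = "fail"
    TIMEOUT = "timeout"
    ERROR = "error"


def run_test_case(query):
    """Run one test case; `query` is any callable returning violating resources."""
    try:
        violations = list(query())
    except TimeoutError:
        return Status.TIMEOUT, []
    except Exception:
        # Any other failure (e.g. network) is reported as an error status.
        return Status.ERROR, []
    return (Status.FAIL if violations else Status.SUCCESS), violations


def run_test_suite(test_cases):
    """Run a dict of named test cases and return name -> (status, violations)."""
    return {name: run_test_case(query) for name, query in test_cases.items()}


def slow_query():
    # Hypothetical test case that simulates an endpoint timeout.
    raise TimeoutError("endpoint took too long")


results = run_test_suite({
    "clean": lambda: [],
    "dirty": lambda: [":Pizza"],
    "slow": slow_query,
})
```

A test case with no violations succeeds, one with violations fails, and exceptions are mapped to the timeout or error status.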


Example test case

A nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex

Test cases are written in SPARQL

    SELECT ?s WHERE {
        ?s nif:beginIndex ?v1 .
        ?s nif:endIndex ?v2 .
        FILTER ( ?v1 > ?v2 ) }

We query for errors

Success: Query returns empty result set

Fail: Query returns results

Every result we get is a violation instance

Timeout / Error: needs further investigation on SPARQL engine capabilities, query syntax or query complexity
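The same "query for errors" idea can be shown without a SPARQL engine. The following Python sketch assumes triples are plain tuples, that each resource has at most one begin and one end index, and that the `ex:` resources are hypothetical sample data; it is an analogue of the query above, not a replacement for it.

```python
def begin_after_end(triples):
    """Return resources whose nif:beginIndex is greater than their nif:endIndex."""
    begin = {s: int(o) for s, p, o in triples if p == "nif:beginIndex"}
    end = {s: int(o) for s, p, o in triples if p == "nif:endIndex"}
    return sorted(s for s in begin if s in end and begin[s] > end[s])


# Hypothetical sample data: ex:ok is consistent, ex:bad is not.
triples = [
    ("ex:ok", "nif:beginIndex", "2"),
    ("ex:ok", "nif:endIndex", "7"),
    ("ex:bad", "nif:beginIndex", "9"),
    ("ex:bad", "nif:endIndex", "4"),
]

violations = begin_after_end(triples)
status = "fail" if violations else "success"
```

An empty `violations` list corresponds to success; every entry in a non-empty list is one violation instance.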


Patterns & Bindings

Data Quality Test Patterns (DQTP): abstract patterns that can be further refined into concrete data quality test cases using test pattern bindings

Existing library of 20 patterns

    SELECT ?s WHERE {
        ?s %%P1%% ?v1 .
        ?s %%P2%% ?v2 .
        FILTER ( ?v1 %%OP%% ?v2 ) }

Bindings: mapping of variables to valid pattern replacements

    P1 => nif:beginIndex  |  SELECT ?s WHERE {
    P2 => nif:endIndex    |      ?s nif:beginIndex ?v1 .
    OP => >               |      ?s nif:endIndex ?v2 .
                          |      FILTER ( ?v1 > ?v2 ) }
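Instantiating a pattern with bindings amounts to placeholder substitution. The sketch below shows one possible way to do this in Python; the function name `instantiate` and the single-line pattern layout are assumptions for illustration, and RDFUnit may implement the step differently.

```python
import re

PATTERN = "SELECT ?s WHERE { ?s %%P1%% ?v1 . ?s %%P2%% ?v2 . FILTER ( ?v1 %%OP%% ?v2 ) }"


def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder with its bound value."""
    def replace(match):
        # A missing binding raises KeyError instead of leaving a placeholder behind.
        return bindings[match.group(1)]

    return re.sub(r"%%(\w+)%%", replace, pattern)


query = instantiate(
    PATTERN,
    {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"},
)
```

Raising on a missing binding is a deliberate choice in this sketch: a half-instantiated query would otherwise be sent to the endpoint and fail there.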


Test Auto Generators (TAGs)

RDF(s) & OWL (partial) support

Query schema for supported axioms

    SELECT DISTINCT ?T1 ?T2 WHERE {
        ?T1 owl:disjointWith ?T2 . }

For every result, a binding to a pattern is generated and a test case is instantiated

Supported axioms at the moment:

RDFS: domain & range

OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated
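A test auto generator can be pictured as a loop over schema axioms that fills a pattern once per axiom. The Python sketch below does this for `owl:disjointWith` only. The pattern text in `DISJOINT_PATTERN`, the function name `generate_disjoint_tests` and the `ex:` classes are assumptions made for illustration; they are not taken from the RDFUnit pattern library.

```python
# Assumed pattern: a resource typed with two disjoint classes is a violation.
DISJOINT_PATTERN = (
    "SELECT DISTINCT ?s WHERE { ?s rdf:type %%T1%% . ?s rdf:type %%T2%% . }"
)


def generate_disjoint_tests(schema_triples):
    """Create one SPARQL test case per owl:disjointWith axiom in the schema."""
    tests = []
    for t1, predicate, t2 in schema_triples:
        if predicate == "owl:disjointWith":
            query = DISJOINT_PATTERN.replace("%%T1%%", t1).replace("%%T2%%", t2)
            tests.append(query)
    return tests


# Hypothetical schema with one supported axiom and one ignored triple.
schema = [
    ("ex:Cat", "owl:disjointWith", "ex:Dog"),
    ("ex:Cat", "rdfs:subClassOf", "ex:Animal"),
]

tests = generate_disjoint_tests(schema)
```

Other supported axioms would follow the same shape, each with its own schema query and pattern.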


Test Case Elicitation Workflow


TD(D)D vs Reasoners

SPARQL test cases detect a subset of validation errors detectable by an OWL reasoner. Limited by

SPARQL endpoint reasoning support

limitations of the OWL-to-SPARQL translation

SPARQL test cases detect validation errors not expressible in OWL

OWL reasoning is often not feasible on large datasets.

Datasets are already deployed and accessible via SPARQL endpoints

Pattern library: a more user-friendly approach for building validation rules compared to modelling OWL axioms.

requires familiarity

non-common validations require manual SPARQL test cases


Data Engineering Ontology

Input / Output entirely in RDF

Model the methodology in OWL

test suites, test cases, patterns, auto generators

Strict to serve as a validation layer

Four different levels of error reporting

simple test case report (success, fail) / enriched with counts

violation instance reporting / enriched with annotations

Reuse dcterms, prov, spin, rlog


Data Engineering Ontology - Definition & Generation


Data Engineering Ontology - Result Representation


Lemon & NIF Test case elicitation

RDFUnit Suite implements our methodology

Run on Lemon & NIF ontologies

TAGs could not yet handle some complex owl:Restrictions

owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases

Manual test cases for constraints not captured in OWL.

Total Domain Range Datatype Card. Disj. Func. I. Func. Manual

Lemon 182 40 34 1 29 64 3 1 10

NIF 96 42 24 4 6 10 10


Example of manual Lemon test case

lemon:narrower denotes that one sense of a word is narrower than the other and must never be symmetric or contain cycles.

    SELECT DISTINCT ?s WHERE {
        ?s lemon:narrower+ ?narrower .
        ?narrower lemon:narrower+ ?s . }

lemon:language must not have a language tag (RDF 1.1 to the rescue)

    SELECT DISTINCT ?s WHERE {
        ?s lemon:language ?v1 .
        FILTER ( lang(?v1) != "" ) }
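The lemon:narrower constraint is a cycle check over a directed graph. As a rough illustration outside SPARQL, the Python sketch below returns every node that can reach itself through one or more narrower links, which matches what the property-path query selects. The edge list and the function name `on_narrower_cycle` are hypothetical.

```python
def on_narrower_cycle(edges):
    """Return nodes that can reach themselves via one or more narrower links."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)

    def reachable(start):
        # Iterative depth-first search over outgoing narrower links.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(node for node in graph if node in reachable(node))


# Hypothetical senses: :a, :b and :c form a cycle, :d hangs off it.
edges = [(":a", ":b"), (":b", ":c"), (":c", ":a"), (":c", ":d")]
cyclic = on_narrower_cycle(edges)
```

A symmetric pair such as `(":x", ":y"), (":y", ":x")` is simply a cycle of length two, so it is caught by the same check.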


Example of manual NIF test case

Ensure that nif:beginIndex and nif:endIndex are correct

    SELECT DISTINCT ?s WHERE {
        ?s nif:anchorOf ?anchorOf ;
           nif:beginIndex ?beginIndex ;
           nif:endIndex ?endIndex ;
           nif:referenceContext [ nif:isString ?referenceString ] .
        BIND ( SUBSTR(?referenceString, ?beginIndex, (?endIndex - ?beginIndex)) AS ?test ) .
        FILTER ( str(?test) != str(?anchorOf) ) . }
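The intent of this test, comparing nif:anchorOf with the substring that the indices select from the context, can be sketched in Python as follows. The tuple layout, the function name `anchor_mismatches` and the sample offsets are assumptions for illustration. Python slices are zero-based, whereas SPARQL's SUBSTR counts from 1, so this is an analogue of the check and not a line-for-line translation of the query.

```python
def anchor_mismatches(annotations):
    """Return URIs whose anchor text differs from context[begin:end]."""
    bad = []
    for uri, begin, end, anchor, context in annotations:
        if context[begin:end] != anchor:
            bad.append(uri)
    return bad


text = "My dog likes pizza"

# Hypothetical annotations over the example sentence.
annotations = [
    ("http://abc.com/doc#char=13,18", 13, 18, "pizza", text),  # consistent
    ("http://abc.com/doc#char=7,11", 7, 11, "likes", text),    # slice is "like"
]

mismatches = anchor_mismatches(annotations)
```

Only the second annotation is reported, because its end index cuts the anchor one character short.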


Evaluation Datasets

Name | Description | Ontology | Type

lemon datasets

LemonUby Wiktionary EN | Conversion of the English Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary

LemonUby Wiktionary DE | Conversion of the German Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary

LemonUby Wordnet | Conversion of the Princeton WordNet 3.0 into UBY-LMF model | lemon, UBY-LMF | WordNet

DBpedia Wiktionary | Conversion of the English Wiktionary into lemon | lemon | Dictionary

QHL | Multilingual translation graph from more than 50 lexicons | lemon | Dictionary

NIF datasets

Wikilinks | Sample of 60976 randomly selected phrases linked to Wikipedia articles | NIF | NER

DBpedia Spotlight dataset | 58 manually NE annotated natural language sentences | NIF | NER

KORE 50 evaluation dataset | 50 NE annotated natural language sentences from the AIDA corpus | NIF | NER

News-100 | 100 manually annotated German news articles | NIF | NER

RSS-500 | 500 manually annotated sentences from 1,457 RSS feeds | NIF | NER

Reuters-128 | 128 news articles manually curated | NIF | NER


Evaluation results

Dataset      Size   SC   FL  TO  ER  Auto Errors  Man Errors  MWarn  MInfo
WiktDBp      60M    177  5   -   -   3.746.103    7.521.791   -      3.582.837
WktEN        8M     168  14  -   -   752.018      394.766     -      633.270
WktDE        2M     170  12  -   -   273.109      66.268      -      155.598
Wordnet      4M     166  16  -   -   257.228      36          -      257.204
QHL          3M     170  11  -   1   433.118      538.933     -      538.016
Wikilinks    0.6M   91   4   -   1   141.528      21.246      -      -
News-100     13K    91   2   -   3   3.510        -           -      -
RSS-500      10K    91   2   -   3   3.000        -           -      -
Reuters-128  7K     91   2   -   3   2.016        -           -      -
Spotlight    3K     92   3   -   1   662          68          -      -
KORE50       2K     89   6   -   1   301          55          -      -

SC: success, FL: fail, TO: timeout, ER: error (test case outcomes). Auto / Man: violations from automatically generated / manual test cases; MWarn and MInfo: warnings and notices from manual test cases.
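As a quick consistency check on the results, the four outcome counts of each dataset should add up to the number of test cases elicited for its vocabulary (182 for Lemon, 96 for NIF). The Python sketch below copies the counts from the table and assumes that "-" means zero; the variable names are illustrative.

```python
# Test cases per vocabulary, as reported in the elicitation table.
TOTAL_TESTS = {"lemon": 182, "nif": 96}

# (dataset, vocabulary, SC, FL, TO, ER); "-" in the table is read as 0.
rows = [
    ("WiktDBp", "lemon", 177, 5, 0, 0),
    ("WktEN", "lemon", 168, 14, 0, 0),
    ("WktDE", "lemon", 170, 12, 0, 0),
    ("Wordnet", "lemon", 166, 16, 0, 0),
    ("QHL", "lemon", 170, 11, 0, 1),
    ("Wikilinks", "nif", 91, 4, 0, 1),
    ("News-100", "nif", 91, 2, 0, 3),
    ("RSS-500", "nif", 91, 2, 0, 3),
    ("Reuters-128", "nif", 91, 2, 0, 3),
    ("Spotlight", "nif", 92, 3, 0, 1),
    ("KORE50", "nif", 89, 6, 0, 1),
]

# Datasets whose outcome counts do not add up to the vocabulary total.
inconsistent = [
    name
    for name, vocab, sc, fl, to, er in rows
    if sc + fl + to + er != TOTAL_TESTS[vocab]
]
```

With the values above, `inconsistent` is empty, so every test case is accounted for in every run.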


Conclusion

Extended a previously introduced methodology for test-driven quality assessment

Data engineering ontology

Devised 277 test cases for NLP datasets using the Lemon and NIF vocabularies

Revealed a substantial number of errors for Lemon & NIF datasets

Future directions

extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF)

automatic dependencies between test cases

wrap RDFUnit for NLP services (integrated in NIF)


Thank you!

Dimitris Kontokostas

With kind support of

John McCrae (Lemon model)

http://rdfunit.aksw.org

http://github.com/AKSW/RDFUnit

#eswc2014kontokostas

