NIF 2.0 Tutorial: Content Analysis and the Semantic Web


Creating Knowledge out of Interlinked Data


http://lod2.eu

AKSW, Universität Leipzig

Sebastian Hellmann

Content Analysis
and the Semantic Web

NIF 2.0 Tutorial

http://nlp2rdf.org

http://lod2.eu

http://slideshare.net/kurzum

Sebastian Hellmann

- Researcher working on the LOD2 EU project
- AKSW (Agile Knowledge Engineering and Semantic Web) research group in Leipzig - http://aksw.org
- InfAI (Institute for Applied Informatics) - http://infai.org

ALL DEMOS ARE AVAILABLE AT: http://nlp2rdf.org/leipzig-24-9-2013

Introduction



End users have tasks for NLP, but each new tool is a challenge:

- How to download and start it?
- What kind of annotations does it use?
- How well does it perform (on my domain)?
- If badly, are there any alternatives? How can I find them?
- Is it open source?

Lots of know-how is needed to exploit NLP.

Lots of data is needed to exploit NLP.

Barriers to NLP

The Semantic Gap

Part 1: Exploiting free, open and interoperable (FOI) language resources

Part 2: Connecting text to these resources

Part 3: Tools, demos, infrastructure

From a walled garden to an interoperable infrastructure

Part 1: Exploiting free, open and interoperable (FOI) language resources


http://lod-cloud.net

Linguistic/NLP Data currently filed under cross-domain


Linked Open Data

- All datasets provide open access to individual records via HTTP
- Many are free (no payment required, as in royalty-free)
- Some are openly licensed, e.g. CC-0 or CC-BY-SA

=> Open access also applies to published HTML on the WWW, but in LOD the data itself is published unrendered via RDF

Question: Who knows how to add a new bubble to the LOD cloud?

http://datahub.io/group/linguistics
https://github.com/jmccrae/llod-cloud.py
http://validator.lod-cloud.net/validate.php


Question: What are the most important data sets and ontologies for NLP?

Who has used what?

FOI data

Analysis of mentions of Wikipedia / DBpedia at LREC 2012:

- Wikipedia: 163 papers - https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2Fproceedings%2Flrec2012+wikipedia+filetype%3Apdf
- DBpedia: 24 papers - https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2Fproceedings%2Flrec2012+dbpedia+filetype%3Apdf

FOI data 1: Wikipedia / DBpedia

Training data for NLP, e.g. URI, surrounding text, surface form (sf)

Probabilities:

- P(sf|URI): probability that wikipedia:Apple_Inc. is written as "apple" in text
- P(URI|sf): probability that "apple" in text refers to wikipedia:Apple_Inc.
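A minimal sketch of how the two probabilities could be estimated from link counts. The (surface form, URI) pairs and their counts below are made up for illustration; they are not taken from the DBpedia NLP datasets, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (surface form, URI) pairs, as they might be harvested from
# Wikipedia link anchors; the counts are invented for illustration.
mentions = (
    [("apple", "wikipedia:Apple_Inc.")] * 6
    + [("apple", "wikipedia:Apple")] * 4
    + [("Apple Inc.", "wikipedia:Apple_Inc.")] * 10
)

pair_counts = Counter(mentions)
sf_counts = Counter(sf for sf, _ in mentions)
uri_counts = Counter(uri for _, uri in mentions)


def p_uri_given_sf(uri: str, sf: str) -> float:
    """P(URI|sf): share of occurrences of the surface form that link to this URI."""
    return pair_counts[(sf, uri)] / sf_counts[sf]


def p_sf_given_uri(sf: str, uri: str) -> float:
    """P(sf|URI): share of links to this URI that use the surface form."""
    return pair_counts[(sf, uri)] / uri_counts[uri]


print(p_uri_given_sf("wikipedia:Apple_Inc.", "apple"))  # 6 of 10 "apple" mentions
print(p_sf_given_uri("apple", "wikipedia:Apple_Inc."))  # 6 of 16 links to Apple_Inc.
```

The two values differ because they are normalised by different totals: one by all mentions of the surface form, the other by all links to the URI.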

FOI data 1: Wikipedia / DBpedia

http://wiki.dbpedia.org/Datasets/NLP

FOI data: Wikipedia / DBpedia

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=sodium


Available data for Sodium

http://dbpedia.org/snorql

select ?labels where { <http://dbpedia.org/resource/Sodium> rdfs:label ?labels . } LIMIT 100

select ?altlabel where { ?redirect dbpedia-owl:wikiPageRedirects <http://dbpedia.org/resource/Sodium> . ?redirect rdfs:label ?altlabel . } LIMIT 100
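The redirect query can also be assembled programmatically. The sketch below only builds the query text and a request URL; it assumes that the resource in question is http://dbpedia.org/resource/Sodium (taken from the "Available data for Sodium" slide) and that the public endpoint is http://dbpedia.org/sparql, which the slides do not name. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint; not named on the slide.
ENDPOINT = "http://dbpedia.org/sparql"

# Prefixes are spelled out so the query does not rely on endpoint defaults.
PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
)


def alt_label_query(resource_uri: str, limit: int = 100) -> str:
    """Build a query for the labels of all pages that redirect to resource_uri."""
    return (
        PREFIXES
        + "SELECT ?altlabel WHERE {\n"
        + f"  ?redirect dbpedia-owl:wikiPageRedirects <{resource_uri}> .\n"
        + "  ?redirect rdfs:label ?altlabel .\n"
        + f"}} LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET request URL (SPARQL protocol 'query' parameter)."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = alt_label_query("http://dbpedia.org/resource/Sodium")
print(query)
print(request_url(query))
```

Labels of redirect pages are a cheap source of alternative surface forms for a resource.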

http://lcl.uniroma1.it/babelnet/explore.jsp?word=sodium&lang=EN

Wiktionary2RDF Mediator Wrapper

http://dbpedia.org/Wiktionary





[Figure: Wiktionary2RDF architecture; labels: Mediator, Lemon]



https://en.wiktionary.org/wiki/sodium#English

http://wiktionary.dbpedia.org/resource/sodium

Lemon Ontology - http://lemon-model.net


IntersectiveDataPropertyAdjective("extinct", dbpedia:conservationStatus, "EX")
IntersectiveDataPropertyAdjective("endangered", dbpedia:conservationStatus, "EN")

https://github.com/cunger/lemon.dbpedia

Christina Unger, John McCrae, Sebastian Walter, Sara Winter and Philipp Cimiano (2013): A lemon lexicon for DBpedia. NLP & DBpedia Workshop.

Part 2: Connecting text to these resources



https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki


Overview of existing tools: http://en.wikipedia.org/wiki/Knowledge_extraction#Tools


Developer's nightmare: all tools belong to a similar class of NLP tools (Wikifier or Named Entity Linking) and follow the SOA principle.

But they all have:

- Heterogeneous output formats (JSON, XML)
- Heterogeneous API parameters
- Heterogeneous ways of annotating text:
  - Some remove HTML internally, so the offsets are not usable
  - Some use byte offsets instead of character offsets
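The offset mismatch is easy to reproduce. The sketch below locates the same substring once in characters and once in UTF-8 bytes; the function names and the example string are chosen for illustration only.

```python
def char_span(text: str, needle: str) -> tuple[int, int]:
    """Begin/end offsets counted in Unicode code points (Python str indices)."""
    begin = text.index(needle)
    return begin, begin + len(needle)


def byte_span(text: str, needle: str, encoding: str = "utf-8") -> tuple[int, int]:
    """Begin/end offsets counted in bytes of the encoded text."""
    data, target = text.encode(encoding), needle.encode(encoding)
    begin = data.index(target)
    return begin, begin + len(target)


def byte_to_char_offset(text: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Convert a byte offset to a character offset by decoding the byte prefix."""
    return len(text.encode(encoding)[:byte_offset].decode(encoding))


text = "Universität Leipzig"
print(char_span(text, "Leipzig"))    # character offsets
print(byte_span(text, "Leipzig"))    # byte offsets, shifted by the 2-byte "ä"
print(byte_to_char_offset(text, byte_span(text, "Leipzig")[0]))
```

Every non-ASCII character before the annotation shifts the byte offsets further away from the character offsets, so the two kinds of output cannot be merged without conversion.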


Demo: http://rdface.aksw.org/new/tinymce/examples/rdface.html

ITS 2.0 - http://www.w3.org/TR/its20/

The Internationalization Tag Set (ITS) 2.0 enhances the foundation to integrate automated processing of human language into core Web technologies.

- Currently in Last Call
- Driven by the localization industry
- Embeds translation aids into HTML and XML
- Robust way to encode NLP information in HTML
- ITS 2.0 describes 20 data categories (ontology)

NIF overview

Summary

- Motivated the Walled Garden problem
- Overview of the emerging Web of Language resources
- Motivated the NLP tool heterogeneity problem
- Introduction of ITS 2.0 as a use case for NIF

Now: NIF 2.0

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

Reuse of existing standards such as RDF, OWL 2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147

Standardize access parameters, annotations (e.g. tokenization), validation and log messages.

A NIF workflow, however, can obviously not provide any better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.

Lower entry barrier, easy data integration, reusability of tools and conceptualisation, off-the-shelf solutions for common tasks.

NIF Overview

Relation of NIF to UIMA and GATE

"A Formal Framework for Linguistic Annotation" (2000) by Steven Bird and Mark Liberman. Take-home message: generic annotation formats should be based on graphs.

Ontologies in NIF (e.g. OLiA, lemon) can be hard-compiled for internal use (as is done in Stanbol).

WP3 Task 3.2 Community work: NLP2RDF

Not primarily aimed at increasing features or performance (F-measure)

WP3 Task 3.2 NIF overview

NIF turns out to have a unique selling proposition regarding NLP and RDF.

NIF will be the recommended RDF conversion of the W3C Internationalization Tag Set 2.0 (ITS 2.0) - http://www.w3.org/TR/its20/

There was no alternative RDF vocabulary for this conversion available.


RDFa parsers lose all provenance information:

dc:title "Wikinomics" .

https://en.wikipedia.org/wiki/RDFa

Available resources: http://persistence.uni-leipzig.org/nlp2rdf/

Disclaimer

Migration to the online presence is still on-going, but there are 15 scientific publications, e.g.

Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer (2013): Integrating NLP using Linked Data. 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia - http://svn.aksw.org/papers/2013/ISWC_NIF/public.pdf


Question:

What is a String?

NIF Basics

Counting strings is more difficult than it seems:

Three ways to count Unicode:

- Code units
- Code points
- Graphemes

Encoding: UTF-8, UTF-16, UTF-32

NIF Basics Unicode

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
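The difference between the counting units shows up as soon as a string leaves the Basic Multilingual Plane. A small sketch (the function name is hypothetical; graphemes are left out because the Python standard library has no grapheme segmentation):

```python
def count_units(text: str) -> dict[str, int]:
    """Count one string in code points and in code units per encoding form."""
    return {
        "code_points": len(text),                           # Python str is indexed by code point
        "utf8_units": len(text.encode("utf-8")),            # 8-bit code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
# so it needs four UTF-8 code units and a surrogate pair in UTF-16.
print(count_units("a\U0001D11E"))
```

Only the code point count and the UTF-32 count agree; UTF-8 and UTF-16 each give a different number for the same two-character string.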

Unicode Normal Form C

http://unicode.org/reports/tr15/#Norm_Forms

Unicode

Recommendation for RDF Literals


NIF uses Unicode Normal Form C

NIF counts in Code Points
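Both rules can be combined in one helper. This is a sketch of the counting rule as stated on the slide, not code from a NIF implementation; `nif_length` is a hypothetical name.

```python
import unicodedata

# "a" followed by U+0308 COMBINING DIAERESIS: looks like "ä" but is two code points.
decomposed = "Sa\u0308chsisch"
composed = unicodedata.normalize("NFC", decomposed)


def nif_length(text: str) -> int:
    """Length in code points after normalisation to Normal Form C."""
    return len(unicodedata.normalize("NFC", text))


print(len(decomposed), len(composed), nif_length(decomposed))
```

Without the normalisation step, two visually identical strings would yield different offsets for everything after the first combining character.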


Sadly, there are still implementation problems:

Java length() vs. PHP strlen() function: Java's length() counts UTF-16 code units, PHP's strlen() counts bytes

curl --data-urlencode i="" -d f=text "http://nlp2rdf.lod2.eu/nif-ws.php"

The Korean character is URL-encoded (%EB%8C%80) and counted as 3 characters (not NFC in PHP)
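The three-versus-one discrepancy can be reproduced from the URL-encoded form alone. The sketch below decodes the percent-encoded bytes quoted above and counts the result in the three units involved; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

encoded = "%EB%8C%80"       # URL-encoded form quoted on the slide
char = unquote(encoded)     # percent-decoding, interpreted as UTF-8 by default

code_points = len(char)                            # the unit NIF counts in
utf8_bytes = len(char.encode("utf-8"))             # what a byte-based strlen() reports
utf16_units = len(char.encode("utf-16-le")) // 2   # what a UTF-16 based length() reports

print(char, code_points, utf8_bytes, utf16_units)
print(quote(char) == encoded)
```

A service that counts bytes therefore reports offsets three times too large for text consisting of such characters.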

Demo


Now some RDF (finally):

Note that in NIF the document != the content of the document.

Two different documents can have the same content => they must not have the same URI.

Context

Annotations
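A rough sketch of what a context and one annotation could look like in Turtle, generated with plain string formatting. The document URIs under example.org are hypothetical, the namespace is assumed to be the NIF 2.0 core ontology namespace, and the property names (nif:isString, nif:referenceContext, nif:anchorOf, nif:beginIndex, nif:endIndex) are used as I understand the NIF 2.0 core vocabulary; check them against the published ontology before relying on them.

```python
import json

# Assumed namespace of the NIF 2.0 core ontology.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"


def string_uri(doc_uri: str, begin: int, end: int) -> str:
    """RFC 5147 style fragment identifying a character range of the text."""
    return f"{doc_uri}#char={begin},{end}"


def literal(text: str) -> str:
    """Quote a string as a Turtle literal (JSON escaping is a compatible subset)."""
    return json.dumps(text, ensure_ascii=False)


def context_turtle(doc_uri: str, text: str) -> str:
    """The context resource carries the whole text of one document."""
    return (
        f"<{string_uri(doc_uri, 0, len(text))}>\n"
        f"    a nif:Context , nif:RFC5147String ;\n"
        f"    nif:isString {literal(text)} ;\n"
        f'    nif:beginIndex "0"^^xsd:nonNegativeInteger ;\n'
        f'    nif:endIndex "{len(text)}"^^xsd:nonNegativeInteger .\n'
    )


def annotation_turtle(doc_uri: str, text: str, begin: int, end: int) -> str:
    """A substring resource that points back to its context."""
    return (
        f"<{string_uri(doc_uri, begin, end)}>\n"
        f"    a nif:String , nif:RFC5147String ;\n"
        f"    nif:referenceContext <{string_uri(doc_uri, 0, len(text))}> ;\n"
        f"    nif:anchorOf {literal(text[begin:end])} ;\n"
        f'    nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;\n'
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger .\n'
    )


text = "Leipzig is in Germany."
header = f"@prefix nif: <{NIF}> .\n@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\n"
print(header + context_turtle("http://example.org/doc1", text))
print(annotation_turtle("http://example.org/doc1", text, 0, 7))
```

Because the document URI is part of every string URI, the same text published in a second document (say http://example.org/doc2) yields a different context URI, which is the "document != content" point in RDF terms.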

Tokenization

Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations. Language Resources and Evaluation 46(1): 53-74 (2012)

NIF

Demo: http://nlp2rdf.lod2.eu/demo.php

SPARQL queries produce (find) errors

http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl
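The kind of constraint such a query can look for is illustrated below in plain Python rather than SPARQL. This is an analogous check written for this tutorial text, not one of the test cases from the suite; the `Annotation` class, the `validate` function and the message wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    begin: int
    end: int
    anchor_of: str


def validate(context: str, annotations: list[Annotation]) -> list[str]:
    """Return one message per annotation whose indices or anchor text are wrong."""
    errors = []
    for a in annotations:
        if not 0 <= a.begin <= a.end <= len(context):
            errors.append(f"char={a.begin},{a.end}: indices lie outside the context")
        elif context[a.begin:a.end] != a.anchor_of:
            errors.append(
                f"char={a.begin},{a.end}: anchor {a.anchor_of!r} does not match "
                f"context substring {context[a.begin:a.end]!r}"
            )
    return errors


context = "Leipzig is in Germany."
annotations = [
    Annotation(0, 7, "Leipzig"),     # consistent
    Annotation(14, 20, "Germany"),   # end index one too small
    Annotation(20, 30, "x"),         # outside the context
]
for message in validate(context, annotations):
    print(message)
```

A query-based validator has the same shape: each test selects the resources that violate one constraint and attaches a log message to them.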

RLOG - an RDF Logging Ontology

./validate.jar -i nif-erroneous-model.ttl -t file

Demo character count

Demo all errors

Validation over specification



NIF

http://www.w3.org/TR/its20/#conversion-to-nif

http://www.w3.org/TR/its20/#nif-backconversion
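The core idea behind such a conversion is to compute character offsets against the tag-free text while remembering which element carried the annotation. The sketch below is not the normative algorithm from the ITS 2.0 specification; it only handles the `its-ta-ident-ref` attribute, assumes every start tag has a matching end tag (no void elements such as `<br>`), and uses a hypothetical class name.

```python
from html.parser import HTMLParser


class TextAnalysisExtractor(HTMLParser):
    """Collect the plain text and the character spans of its-ta-ident-ref elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.text = ""
        self.annotations = []   # (begin, end, identifier URI)
        self._open = []         # one entry per open element, None if unannotated

    def handle_starttag(self, tag, attrs):
        ident = dict(attrs).get("its-ta-ident-ref")
        self._open.append((len(self.text), ident) if ident else None)

    def handle_endtag(self, tag):
        if self._open:
            entry = self._open.pop()
            if entry is not None:
                begin, ident = entry
                self.annotations.append((begin, len(self.text), ident))

    def handle_data(self, data):
        self.text += data


html = ('<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/Dublin">'
        'Dublin</span>!</p>')
parser = TextAnalysisExtractor()
parser.feed(html)
parser.close()
print(parser.text)
print(parser.annotations)
```

Each collected span can then be turned into a string URI with a `#char=begin,end` fragment, and the original element can be found again for the back-conversion.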

NIF

Demo

Load Terminological model or Inference Model

Reasoning

Open community: all feedback is welcome!

http://slideshare.net/kurzum

Websites:

http://dbpedia.org

http://nlp2rdf.org

http://lod2.eu

Thanks for your attention


NIF Tutorial, 2013/09/24


Address

University of Leipzig
Faculty of Mathematics and Computer Science
Institute of Computer Science
Department of Business Information Systems
Postfach 100920
04009 Leipzig
Germany
