Creating Knowledge out of Interlinked Data
LOD2 Presentation . 02.09.2010 . Page
http://lod2.eu
AKSW, Universität Leipzig
Sebastian Hellmann
Content Analysis
and the Semantic Web
NIF 2.0 Tutorial
http://nlp2rdf.org
http://lod2.eu
http://slideshare.net/kurzum
Sebastian Hellmann: researcher working on the LOD2 EU Project
AKSW (Agile Knowledge and the Semantic Web), research group in Leipzig - http://aksw.org
InfAI (Institute for Applied Informatics) - http://infai.org
ALL DEMOS ARE AVAILABLE AT: http://nlp2rdf.org/leipzig-24-9-2013
Introduction
End users have tasks for NLP, but:
Each new tool is a challenge:
How to download and start it?
What kind of annotations does it use?
How well does it perform (on my domain)?
If badly, are there any alternatives? How can I find them?
Open source?
Lots of know-how needed to exploit NLP.
Lots of data needed to exploit NLP.
Barriers to NLP
The Semantic Gap
Part 1: exploiting free, open and interoperable (FOI)
language resources
Part 2: Connecting text to these resources
Part 3: tools, demos, infrastructure
From a walled garden to
an interoperable infrastructure
Part 1: exploiting free, open and interoperable (FOI)
language resources
http://lod-cloud.net
Linguistic/NLP Data currently filed under cross-domain
Linked Open Data
- All datasets provide open access to individual records via HTTP
- Many are free (no payment required, as in royalty-free)
- Some are openly licensed, e.g. CC-0 or CC-BY-SA
=> Open access also applies to published HTML on the WWW, but in LOD the data itself is published unrendered via RDF
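A minimal sketch of what "published unrendered via RDF" can look like from the client side: the same HTTP URI is requested, but with an Accept header asking for an RDF serialization instead of HTML. The resource URI and the MIME type are assumptions chosen for illustration, and the request is only built, not sent.

```python
from urllib.request import Request


def rdf_request(resource_uri: str, mime_type: str = "text/turtle") -> Request:
    """Build (but do not send) a GET request that asks for RDF instead of HTML."""
    return Request(resource_uri, headers={"Accept": mime_type})


# Assumed example resource; any Linked Data URI could be used here.
req = rdf_request("http://dbpedia.org/resource/Sodium")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup; whether a given server honours the Accept header depends on that server.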
Question: Who knows how to add a new bubble to the LOD cloud?
http://datahub.io/group/linguistics
https://github.com/jmccrae/llod-cloud.py
http://validator.lod-cloud.net/validate.php
Question: What are the most important data sets and ontologies for NLP?
Who has used what?
FOI data
Analysis of mentions of Wikipedia / DBpedia at LREC 2012:
Wikipedia: 163 papers - https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2Fproceedings%2Flrec2012+wikipedia+filetype%3Apdf
DBpedia: 24 papers - https://www.google.com/webhp?q=site:http%3A%2F%2Fwww.lrec-conf.org%2Fproceedings%2Flrec2012+dbpedia+filetype%3Apdf
FOI data 1: Wikipedia / DBpedia
Training data for NLP, e.g. URI, surrounding text, surface form
Probabilities:
P(URI|sf): probability that the surface form "apple" refers to wikipedia:Apple_Inc.
P(sf|URI): probability that wikipedia:Apple_Inc. appears as "apple" in text
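As a rough sketch, both conditional probabilities can be estimated by counting (surface form, URI) pairs, e.g. pairs harvested from link anchors. In standard notation, P(URI|sf) is the share of a surface form's occurrences that point to the URI, and P(sf|URI) is the share of a URI's links that use the surface form. The pairs below are made-up illustration data, not counts from the DBpedia NLP datasets.

```python
from collections import Counter

# Hypothetical (surface form, URI) pairs, as they might be harvested from wiki links.
links = [
    ("apple", "wikipedia:Apple_Inc."),
    ("apple", "wikipedia:Apple_Inc."),
    ("apple", "wikipedia:Apple"),
    ("Apple Inc.", "wikipedia:Apple_Inc."),
    ("Apple Inc.", "wikipedia:Apple_Inc."),
]

pair_counts = Counter(links)
sf_counts = Counter(sf for sf, _ in links)
uri_counts = Counter(uri for _, uri in links)


def p_uri_given_sf(uri: str, sf: str) -> float:
    """Relative frequency with which the surface form links to the URI."""
    return pair_counts[(sf, uri)] / sf_counts[sf]


def p_sf_given_uri(sf: str, uri: str) -> float:
    """Relative frequency with which the URI is expressed by the surface form."""
    return pair_counts[(sf, uri)] / uri_counts[uri]


print(p_uri_given_sf("wikipedia:Apple_Inc.", "apple"))
print(p_sf_given_uri("apple", "wikipedia:Apple_Inc."))
```

Plain relative frequencies are used here for brevity; a real system would likely add smoothing for unseen pairs.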
http://wiki.dbpedia.org/Datasets/NLP
http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=sodium
Available data for Sodium
http://dbpedia.org/snorql
select ?labels where { rdfs:label ?labels . } LIMIT 100
select ?altlabel where { ?redirect dbpedia-owl:wikiPageRedirects . ?redirect rdfs:label ?altlabel .} LIMIT 100
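The two queries above need a concrete resource in the subject and object position, which is not preserved in this transcript. A hedged sketch of how they could be assembled, assuming the resource is http://dbpedia.org/resource/Sodium (suggested by the surrounding Sodium example, not confirmed by the source) and the usual rdfs and dbpedia-owl prefixes:

```python
# Assumption: the resource under discussion is DBpedia's Sodium page.
RESOURCE = "http://dbpedia.org/resource/Sodium"

PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
)


def label_query(resource_uri: str, limit: int = 100) -> str:
    """All rdfs:label values of the resource."""
    return (
        PREFIXES
        + f"SELECT ?labels WHERE {{ <{resource_uri}> rdfs:label ?labels . }} LIMIT {limit}"
    )


def redirect_label_query(resource_uri: str, limit: int = 100) -> str:
    """Labels of pages that redirect to the resource (alternative surface forms)."""
    return (
        PREFIXES
        + "SELECT ?altlabel WHERE {\n"
        + f"  ?redirect dbpedia-owl:wikiPageRedirects <{resource_uri}> .\n"
        + "  ?redirect rdfs:label ?altlabel .\n"
        + f"}} LIMIT {limit}"
    )


print(label_query(RESOURCE))
print(redirect_label_query(RESOURCE))
```

The printed strings can be pasted into the SNORQL interface mentioned above.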
http://lcl.uniroma1.it/babelnet/explore.jsp?word=sodium&lang=EN
Wiktionary2RDF Mediator Wrapper
http://dbpedia.org/Wiktionary
[Figure labels: Mediator, Lemon]
http://lcl.uniroma1.it/babelnet/explore.jsp?word=sodium&lang=EN
https://en.wiktionary.org/wiki/sodium#English
http://wiktionary.dbpedia.org/resource/sodium
Lemon Ontology - http://lemon-model.net
IntersectiveDataPropertyAdjective("extinct", dbpedia:conservationStatus, "EX")
IntersectiveDataPropertyAdjective("endangered", dbpedia:conservationStatus, "EN")
https://github.com/cunger/lemon.dbpedia
Christina Unger, John McCrae, Sebastian Walter, Sara Winter and Philipp Cimiano (2013): A lemon lexicon for DBpedia. NLP & DBpedia Workshop
Part 2: Connecting text to these resources
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Overview of existing tools: http://en.wikipedia.org/wiki/Knowledge_extraction#Tools
Developer's nightmare:
All tools belong to a similar class of NLP tools (Wikifier or Named Entity Linking) and follow the SOA principle
But they all have:
Heterogeneous output formats (JSON, XML)
Heterogeneous API parameters
Heterogeneous ways of annotating text:
Some remove HTML internally, so the returned offsets are not usable on the original document
Some use byte offsets instead of character offsets
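To illustrate why byte offsets and character offsets diverge, here is a small sketch that converts a UTF-8 byte offset into a character (code point) offset. The helper name and the sample sentence are made up for illustration; the assumption is that the tool in question reports offsets into the UTF-8 encoded text.

```python
def byte_to_char_offset(text: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Convert an offset counted in encoded bytes into one counted in code points.

    Raises UnicodeDecodeError if the byte offset falls inside a multi-byte character.
    """
    prefix = text.encode(encoding)[:byte_offset]
    return len(prefix.decode(encoding))


text = "Universit\u00e4t Leipzig"  # contains one two-byte character (U+00E4)

char_start = text.index("Leipzig")
byte_start = text.encode("utf-8").index(b"Leipzig")

print(char_start, byte_start)                  # the two offsets differ by one
print(byte_to_char_offset(text, byte_start))   # back to the character offset
```

Every non-ASCII character before an annotation shifts the byte offset further away from the character offset, so the two conventions only agree on pure ASCII text.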
Demo: http://rdface.aksw.org/new/tinymce/examples/rdface.html
ITS 2.0 - http://www.w3.org/TR/its20/
The Internationalization Tag Set (ITS) 2.0 enhances the foundation to integrate automated processing of human language into core Web technologies.
Currently in Last Call
Driven by localization industry
Embed translation aids into HTML and XML
Robust way to encode NLP information in HTML
ITS 2.0 describes 20 data categories => ontology
NIF overview
Summary
Motivated the Walled Garden problem
Overview of the emerging Web of Language resources
Motivated the NLP tool heterogeneity problem
Introduced ITS 2.0 as a use case for NIF
Now: NIF 2.0
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
Reuse of existing standards such as RDF, OWL 2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147
Standardize access parameters, annotations (e.g. tokenization), validation and log messages.
A NIF workflow, however, can obviously not provide any better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.
Lower entry barrier, easy data integration, reusability of tools and conceptualisation, off-the-shelf solutions for common tasks.
NIF Overview
Relation of NIF to UIMA and GATE
"A Formal Framework for Linguistic Annotation" (2000) by Steven Bird and Mark Liberman; take-home message: generic annotation formats should be based on graphs
Ontologies in NIF (e.g. OLiA, lemon) can be hard-compiled for internal use (as is done in Stanbol)
WP3 Task 3.2 Community work: NLP2RDF
Not primarily aimed at increasing features or performance (F-Measure)
WP3 Task 3.2 NIF overview
NIF turns out to have a unique selling proposition regarding NLP and RDF
NIF will be the recommended RDF conversion of the W3C Internationalization Tag Set 2.0 (ITS 2.0) - http://www.w3.org/TR/its20/
There was no alternative RDF vocabulary for this conversion available.
RDFa parsers lose all provenance information:
 dc:title "Wikinomics" .
https://en.wikipedia.org/wiki/RDFa
Available resources: http://persistence.uni-leipzig.org/nlp2rdf/
Disclaimer
Migration to the online presence is still on-going, but there are 15 scientific publications, e.g.
Integrating NLP using Linked Data. Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer. 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, (2013) - http://svn.aksw.org/papers/2013/ISWC_NIF/public.pdf
Question:
What is a String?
NIF Basics
Counting strings is more difficult than it seems:
Three ways to count Unicode:
Code Units
Code Points
Graphemes
Encodings: UTF-8, UTF-16, UTF-32
NIF Basics Unicode
Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hexadecimal). Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
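The difference between code units and code points can be made concrete with a short sketch. The three-character sample string is an arbitrary choice; in Python 3, len() on a str counts code points, and the number of code units follows from the encoded byte length divided by the unit width.

```python
# Bytes per code unit for each encoding form (little-endian variants avoid a BOM).
UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(text: str, encoding: str) -> int:
    """Number of code units needed to represent text in the given encoding form."""
    return len(text.encode(encoding)) // UNIT_BYTES[encoding]


# 'a' (U+0061), a-umlaut (U+00E4), MUSICAL SYMBOL G CLEF (U+1D11E)
sample = "a\u00e4\U0001d11e"

print("code points:", len(sample))
for enc in UNIT_BYTES:
    print(enc, "code units:", code_units(sample, enc))
```

Graphemes are not covered here because the Python standard library has no grapheme cluster segmentation.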
Unicode Normal Form C
http://unicode.org/reports/tr15/#Norm_Forms
Unicode
Recommendation for RDF Literals
NIF uses Unicode Normal Form C
NIF counts in Code Points
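A minimal sketch of what these two rules imply for string lengths, using Python's standard unicodedata module. The function name nif_length is a made-up helper for this example, not part of any NIF library.

```python
import unicodedata


def nif_length(text: str) -> int:
    """Length in code points after normalizing to Unicode Normal Form C."""
    return len(unicodedata.normalize("NFC", text))


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

print(len(decomposed), nif_length(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Without normalization the same visible text can yield different offsets, which is why a fixed normal form matters for offset-based annotation.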
Unicode
Sadly, there are still implementation problems:
Java length() vs. PHP strlen() function
curl --data-urlencode i="" -d f=text "http://nlp2rdf.lod2.eu/nif-ws.php"
The Korean character is URL-encoded (#%EB%8C%80) and counted as 3 characters (not NFC in PHP)
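The count of 3 can be reproduced in a few lines. The character behind %EB%8C%80 is U+B300; Java's length() counts UTF-16 code units, while PHP's strlen() counts bytes. The sketch below mimics both counts in Python and does not call the web service.

```python
from urllib.parse import quote

char = "\ub300"  # the Korean syllable whose UTF-8 bytes are EB 8C 80

code_points = len(char)                          # what NIF counts
utf16_units = len(char.encode("utf-16-le")) // 2 # what Java's length() counts
utf8_bytes = len(char.encode("utf-8"))           # what PHP's strlen() counts

print(quote(char), code_points, utf16_units, utf8_bytes)
```

A byte-counting implementation therefore shifts every offset after the first non-ASCII character.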
Demo
Now some RDF (finally):
Note that in NIF the document is not the same as the content of the document.
Two different documents can have the same content => they must not have the same URI
Context
Annotations
Tokenization
Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations. Language Resources and Evaluation 46(1): 53-74 (2012)
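As a sketch of how context, annotations and tokenization fit together, the following generates Turtle for one nif:Context and one nif:Word per token, with RFC 5147 style #char=begin,end fragments. The document URI, the sample sentence and the whitespace tokenizer are assumptions for illustration; the NIF terms used (nif:Context, nif:Word, nif:RFC5147String, nif:isString, nif:anchorOf, nif:referenceContext, nif:beginIndex, nif:endIndex) come from the NIF 2.0 core ontology.

```python
import re

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def literal(text: str) -> str:
    """Quote a single-line string as a Turtle literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def index(n: int) -> str:
    return f'"{n}"^^xsd:nonNegativeInteger'


def nif_turtle(doc_uri: str, text: str) -> str:
    """Serialize the text as a NIF context plus whitespace-separated tokens."""
    context = f"<{doc_uri}#char=0,{len(text)}>"
    out = [
        f"@prefix nif: <{NIF}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"{context} a nif:Context, nif:RFC5147String ;",
        f"    nif:isString {literal(text)} ;",
        f"    nif:beginIndex {index(0)} ;",
        f"    nif:endIndex {index(len(text))} .",
    ]
    # Assumption: naive whitespace tokenization; offsets are code point offsets.
    for m in re.finditer(r"\S+", text):
        out += [
            "",
            f"<{doc_uri}#char={m.start()},{m.end()}> a nif:Word, nif:RFC5147String ;",
            f"    nif:referenceContext {context} ;",
            f"    nif:anchorOf {literal(m.group())} ;",
            f"    nif:beginIndex {index(m.start())} ;",
            f"    nif:endIndex {index(m.end())} .",
        ]
    return "\n".join(out)


turtle = nif_turtle("http://example.org/doc1", "My favourite actress")
print(turtle)
```

Because the context URI is derived from the document URI, two documents with identical text still receive distinct URIs.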
NIF
Demo: http://nlp2rdf.lod2.eu/demo.php
SPARQL queries produce (find) errors
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.ttl
RLOG - An RDF Logging Ontology
./validate.jar -i nif-erroneous-model.ttl -t file
Demo character count
Demo all errors
Validation over specification
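The actual validator runs the SPARQL test suite linked above. As a plain-Python stand-in for the kind of check behind the character count demo, the sketch below verifies that each annotation's anchor text equals the context substring between its begin and end index. The function name, the tuple format and the log messages are assumptions for illustration and do not use the RLOG vocabulary.

```python
def validate(context: str, annotations: list) -> list:
    """Check (begin, end, anchor) triples against the context string.

    Returns a list of (level, message) pairs, one per violated annotation.
    """
    log = []
    for begin, end, anchor in annotations:
        if not (0 <= begin <= end <= len(context)):
            log.append(("ERROR", f"char={begin},{end}: indices outside the context string"))
        elif context[begin:end] != anchor:
            log.append(
                ("ERROR", f"char={begin},{end}: anchor {anchor!r} != {context[begin:end]!r}")
            )
    return log


context = "My favourite actress"
annotations = [
    (3, 12, "favourite"),  # consistent
    (13, 20, "actres"),    # anchor text does not match the substring
    (15, 40, "x"),         # end index beyond the context
]

for level, message in validate(context, annotations):
    print(level, message)
```

Expressing such checks as queries rather than prose is what makes the validation reusable across implementations.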
NIF
http://www.w3.org/TR/its20/#conversion-to-nif
http://www.w3.org/TR/its20/#nif-backconversion
NIF
Demo
Load Terminological model or Inference Model
Reasoning
Open community: all feedback is welcome!
http://slideshare.net/kurzum
Websites:
http://dbpedia.org
http://nlp2rdf.org
http://lod2.eu
Thanks for your attention
NIF Tutorial, 2013/09/24
Address
University of Leipzig
Faculty of Mathematics and Computer Science
Institute of Computer Science
Department of Business Information Systems
Postfach 100920
04009 Leipzig
Germany