What You Can Make Out of Linked Data


Tutorial given at the 38th Internationalization & Unicode Conference.


What you can make out of Linked Data
Marco Fossati <fossati@spaziodati.eu>
Steven R. Loomis <srloomis@us.ibm.com>

Let's meet the presenters first!

Marco Fossati
Natural Language Processing Advocate
Recommender Systems Aficionado
Open Data Apologist

Steven R. Loomis
IBM
Chair, Unicode ULI-TC
Projects: ICU, CLDR, ULI

Outline

1. Linked Open Data 101

2. DBpedia

3. The ULI use case


Warning! Highly interactive tutorial


Let's get started!


Linked Open Data 101
The Big Picture

What is data?
Data is how we express facts in a reusable form.

Why data? The ingredients for...

...Information, Knowledge, Wisdom

OK, it's data. What else?

Big: billions of facts ("Santa Clara is a city")
Linked: richly structured
Open: open licenses

Facts, not words

A fact is...

An assertion about the world

Subject + predicate + object

A triple

Human mind → Natural language → Machine

Human mind
Perceiving relationships between entities

Natural language"Elvis Presley sings Jailhouse Rock"

14

Machine
The triple: Elvis Presley (subject), sings (predicate), Jailhouse Rock (object)

The graph
Rich structure made of triples

From the web of documents...

...to the web of entities

The web of entities

An entity can be...

Identified

Described through relationships

Understood both by humans and machines


Towards a WWW of entities

Identify via HTTP URIs

http://dbpedia.org/resource/Elvis_Presley

Describe via RDF statements

:Elvis_Presley :sings :Jailhouse_Rock

Understand via

HTML for humans

RDF for machines
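To see all three of these in practice, you can ask a SPARQL endpoint to hand back everything it knows about an entity. A minimal sketch, runnable as-is at dbpedia.org/sparql:

  # Return every RDF triple DBpedia holds about the Elvis Presley entity
  DESCRIBE <http://dbpedia.org/resource/Elvis_Presley>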


Hands-on Time!

https://pad.okfn.org/p/DBpediaULI


Next in line…


DBpedia
Extracting Knowledge from Wikipedia

DBpedia is…

A. …a data extraction framework for Wikipedia's semi-structured data

B. …an open-source community effort

Why?


Wikipedia can’t answer simple questions
“What do Santa Clara and San Francisco have in common?”

Wikipedia can’t answer complex questions
“Which black-and-white movies produced in Italy have soundtracks composed by musicians born in a city of the Trentino-Alto Adige region with fewer than 40,000 inhabitants?”
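For contrast, here is a sketch of how that question could be phrased against DBpedia once the facts are triples. The property and category names (dbo:musicComposer, dbo:region, the black-and-white films category) are illustrative assumptions about the ontology, not the exact modelling:

  # Sketch only: property and category URIs are assumptions, check them on dbpedia.org
  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dct: <http://purl.org/dc/terms/>

  SELECT DISTINCT ?film WHERE {
    ?film a dbo:Film ;
          dct:subject <http://dbpedia.org/resource/Category:Italian_black-and-white_films> ;
          dbo:musicComposer ?composer .
    ?composer dbo:birthPlace ?city .
    ?city dbo:region <http://dbpedia.org/resource/Trentino-Alto_Adige> ;
          dbo:populationTotal ?population .
    FILTER ( ?population < 40000 )
  }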

The story so far

Project started in 2007

From good ol’ PHP to Java + Scala

Steadily growing community

Internationalization Committee

Freely available on GitHub


Data in Wikipedia
Title
Short abstract
Long abstract

Structure in Wikipedia
Infobox
Images

Structure in Wikipedia

Links

Categories

Structure in Wikipedia
Interlanguage Links

Much more at

http://dbpedia.org/Datasets


DBpedia Extraction Framework (DEF)

Wikipedia dump → Extractors → RDF graph

Extractors

Article Features

Abstract, redirects, categories, geo-coordinates, interlanguage links, etc.

Infobox

Raw

Mapping-based


Raw Infobox Extractor

:Elvis_Presley

:born “Elvis Aaron Presley…”

:died “August 16, 1977…”

:restingPlace “Graceland…”

:education “L.C. Humes…”

:occupation “Singer…”


The Big Issues
Data is heterogeneous! Data is multilingual!

Solution
• The DBpedia ontology as multilingual glue
• Wikipedia-to-ontology mapping

DBpedia Ontology
Encoding worldwide encyclopedic knowledge

Mapping-based Extractor

Combines what belongs together

Separates what is different
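A quick way to see what the mapping buys you is to compare a raw infobox property with its mapped counterpart for the same entity. A sketch, assuming dbp:born and dbo:birthDate are both populated for this article:

  PREFIX dbp: <http://dbpedia.org/property/>
  PREFIX dbo: <http://dbpedia.org/ontology/>

  SELECT ?rawBirth ?mappedBirth WHERE {
    # dbp:born is extracted verbatim from the infobox (an untyped string);
    # dbo:birthDate is the ontology-mapped, typed equivalent.
    OPTIONAL { <http://dbpedia.org/resource/Elvis_Presley> dbp:born ?rawBirth }
    OPTIONAL { <http://dbpedia.org/resource/Elvis_Presley> dbo:birthDate ?mappedBirth }
  }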


DIEF: Mapping-Based Infobox Extractor

The Mappings Wiki
Anybody can contribute at mappings.dbpedia.org

Download the latest DBpedia dump at

http://downloads.dbpedia.org/current/


English SPARQL endpoint
dbpedia.org/sparql
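For instance, the “simple question” from earlier, what Santa Clara and San Francisco have in common, reduces to a two-pattern query (resource URIs assumed; verify the exact names on dbpedia.org):

  SELECT DISTINCT ?property ?value WHERE {
    # Any property-value pair shared by both cities
    <http://dbpedia.org/resource/Santa_Clara,_California> ?property ?value .
    <http://dbpedia.org/resource/San_Francisco> ?property ?value .
  }
  LIMIT 100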


Language chapters
DBpedia in your mother tongue

Active chapters

International (English-based)

Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Spanish


Host your own language chapter!


Applications
Get the best out of DBpedia data

Knowledge Graphs
Highly informative summaries in your own language

Question Answering
“Who is Bram Stoker?”

Entity Linking
Detecting Things in Text

Language- and Domain-specific Resources for Short Sentence Classification

Automatic Huge Gazetteers

DBpedia Stakeholders
Who is using the knowledge base?

Open Government
Linking Local Data

Digital Libraries
Enriching the Catalogue

Data-driven Journalism
Building Infographics

Hands-on Time!

https://pad.okfn.org/p/DBpediaULI


And now the final part!


The ULI use case
Putting Linked Open Data to work

What’s wrong with Localization Interoperability?

Inconsistent application, implementation, and interpretation of standards

Lack of clear requirements for localization data interchange

Unicode Localization Interoperability

Technical Committee of Unicode

Focus Areas:

1. Translation memory

2. Translation source strings / translations

3. Segmentation rules

ULI: Segmentation

Given: Thanks to Dr. Jones for this effort.

UAX #29 segmentation: |Thanks to Dr.| Jones for this effort.|
English: |Thanks to Dr. Jones for this effort.|

ULI Suppression: Abbreviations

English: Mr. Mrs. Dr. St. …
Spanish: Sr. Dto. Sra. Avda. …
Russian: проф. февр. тел. кв. …

Demo: ULI Breaks

http://demo.icu-project.org/icu-bin/icusegments

DEMO

DBpedia applied to ULI (University of Leipzig)
Sebastian Hellmann, Martin Brümmer, Dimitris Kontokostas

Opportunity:

Help segmentation by supplying abbreviation data

Yes!

Evaluation shows that, especially for short texts, abbreviations can improve the precision and recall of segmentation.

Success rate

Multilingual, with over 100 languages
Structured data eases extraction
Additional data, like entity types and categories

Example: Mr.

“MR” disambiguation page links to the “Mr.” article.

Ends in full stop, so may be an abbreviation.

The “Mr.” SPARQL query

  SELECT ?entryExample ?exampleTested ?indegreeRanking
  WHERE {
    <http://dbpedia.org/resource/Mr.> rdfs:label ?entryExample ;
                                      rdfs:comment ?exampleTested .
    FILTER ( lang(?entryExample) = lang(?exampleTested) )
    # subselect:
    { SELECT count(?in) as ?indegreeRanking
      WHERE { ?in ?p <http://dbpedia.org/resource/Mr.> } }
  }
  LIMIT 100
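The query can be pasted directly into dbpedia.org/sparql, which can return TSV or JSON. The subselect counts the triples pointing at the entity (?in); that indegree serves as a rough popularity ranking for the abbreviation.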

DEMO

Example DBpedia data (English)

St. → Street
<http://en.wikipedia.org/wiki/Street>
Types: <http://schema.org/Place>, <http://dbpedia.org/ontology/Place>, <http://dbpedia.org/ontology/PopulatedPlace>

Example DBpedia data (Russian)

Проф. → Профессор (Professor)

<http://ru.wikipedia.org/wiki/Профессор>

1. Get abbreviation URIs
2. Load DBpedia data into a local DB
3. Query the data via SPARQL and output TSV (see the sketch below)
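Step 3 could look like the following sketch: rdfs:label is standard RDFS, and the full-stop filter is the same heuristic described for “Mr.” above.

  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?entity ?label (lang(?label) AS ?language) WHERE {
    ?entity rdfs:label ?label .
    # Keep only labels ending in a full stop: abbreviation candidates
    FILTER ( STRENDS(STR(?label), ".") )
  }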

22,859 abbreviations with 78,197 meanings in 99 languages

Long tail: only 25 languages have more than 100 abbreviations; only 7 have more than 1,000.

[Charts: Long tail (total abbrevs); Long tail (total abbrevs), zoomed]

ULI Process

Wikipedia → DBpedia extraction → ULI review (Translation Memory comparison, manual review) → CLDR abbreviation suppressions

[Image: "Lupa.na.encyklopedii" by Julo, own work, public domain, via Wikimedia Commons]

Comparison with Translation Memory

Entry      % in TM
Corp.      0.0307%
St.        0.0023%
P.T.T.C.   0%

[Image: "Trichtermitfilter" by Gmhofmann, own work, CC BY-SA 3.0, via Wikimedia Commons]

CLDR Input

Extract abbreviations from CLDR localized data

Days of week: Sun. Mon. Tue. Wed. Thu. …

Months: Jan. Feb. Mar. …

etc…

Manual Review

CLDR output format

  <segmentations>
    <segmentation type="SentenceBreak">
      <!-- From ULI data, http://uli.unicode.org -->
      <suppressions type="standard">
        <suppression>Port.</suppression>
        <suppression>Alt.</suppression>
        <suppression>Di.</suppression>
        <suppression>Ges.</suppression>
        <suppression>frz.</suppression>
        …

CLDR 26 Output

http://cldr.unicode.org

“Break Suppression”

de: 239, en: 151, es: 164, fr: 82, it: 45, pt: 170, ru: 18

Challenges

"Long Tail" Languages

harder to find existing TM data

harder to find linguistic rules/review

harder to find tagged corpora to benchmark

Systematic issues with using redirects/disambiguation

Opportunity

Scope:
Non-full-stop punctuation, e.g., "Yahoo!"
Language-specific abbreviation rules
Context (Medical, Business, …)

Leverage:
Schema/taxonomy ("Place" vs. "Person", etc.) to filter
DBpedia lists
Additional LOD

Thank You!

Further Q&A?


Slides & contact info: https://pad.okfn.org/p/DBpediaULI
