What You Can Make Out of Linked Data


Tutorial given at the 38th Internationalization & Unicode Conference.


What you can make out of Linked Data
Marco Fossati <fossati@spaziodati.eu>
Steven R. Loomis <srloomis@us.ibm.com>

Let's meet the presenters first!

Marco Fossati
Natural Language Processing Advocate
Recommender Systems Aficionado
Open Data Apologist

Steven R. Loomis
IBM
Chair, Unicode ULI-TC
Projects: ICU, CLDR, ULI

Outline

1. Linked Open Data 101

2. DBpedia

3. The ULI use case


Warning! Highly interactive tutorial


Let's get started!


Linked Open Data 101
The Big Picture

What is data?
Data is how we express facts in a reusable form.

Why data? The ingredients for...

...Information, Knowledge, Wisdom

OK, it's data. What else?

Big: billions of facts ("Santa Clara is a city")
Linked: richly structured
Open: open licenses

Facts, not words

A fact is...

An assertion about the world

Subject + predicate + object

A triple

Human mind → Natural language → Machine

Human mind
Perceiving relationships between entities

Natural language"Elvis Presley sings Jailhouse Rock"

14

Machine
The triple: Elvis Presley (subject), sings (predicate), Jailhouse Rock (object)

The graph
Rich structure made of triples

From the web of documents...

...to the web of entities

The web of entities

An entity can be...

Identified

Described through relationships

Understood both by humans and machines


Towards a WWW of entities

Identify via HTTP URIs

http://dbpedia.org/resource/Elvis_Presley

Describe via RDF statements

:Elvis_Presley :sings :Jailhouse_Rock

Understand via

HTML for humans

RDF for machines
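To see all three of these in practice, you can ask a SPARQL endpoint to hand back everything it knows about an entity. A minimal sketch, runnable as-is at dbpedia.org/sparql:

  # Return every RDF triple DBpedia holds about the Elvis Presley entity
  DESCRIBE <http://dbpedia.org/resource/Elvis_Presley>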


Hands-on Time!

https://pad.okfn.org/p/DBpediaULI


Next in line…


DBpedia
Extracting Knowledge from Wikipedia

DBpedia is…

A. …a data extraction framework for Wikipedia's semi-structured data

B. …an open-source community effort

Why?


Wikipedia can’t answer simple questions
“What do Santa Clara and San Francisco have in common?”

Wikipedia can’t answer complex questions
“Which black-and-white movies produced in Italy have soundtracks composed by musicians born in a city of the Trentino-Alto Adige region with fewer than 40,000 inhabitants?”
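For contrast, here is a sketch of how that question could be phrased against DBpedia once the facts are triples. The property and category names (dbo:musicComposer, dbo:region, the black-and-white films category) are illustrative assumptions about the ontology, not the exact modelling:

  # Sketch only: property and category URIs are assumptions, check them on dbpedia.org
  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dct: <http://purl.org/dc/terms/>

  SELECT DISTINCT ?film WHERE {
    ?film a dbo:Film ;
          dct:subject <http://dbpedia.org/resource/Category:Italian_black-and-white_films> ;
          dbo:musicComposer ?composer .
    ?composer dbo:birthPlace ?city .
    ?city dbo:region <http://dbpedia.org/resource/Trentino-Alto_Adige> ;
          dbo:populationTotal ?population .
    FILTER ( ?population < 40000 )
  }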

The story so far

Project started in 2007

From good ol’ PHP to Java + Scala

Steadily growing community

Internationalization Committee

Freely available on GitHub


Data in Wikipedia
Title
Short abstract
Long abstract

Structure in Wikipedia
Infobox
Images

Structure in Wikipedia

Links

Categories

Structure in Wikipedia
Interlanguage Links

Much more at

http://dbpedia.org/Datasets


DBpedia Extraction Framework (DEF)

Wikipedia dump → Extractors → RDF graph

Extractors

Article Features

Abstract, redirects, categories, geo-coordinates, interlanguage links, etc.

Infobox

Raw

Mapping-based


Raw Infobox Extractor

:Elvis_Presley

:born “Elvis Aaron Presley…”

:died “August 16, 1977…”

:restingPlace “Graceland…”

:education “L.C. Humes…”

:occupation “Singer…”


The Big Issues
Data is heterogeneous! Data is multilingual!

Solution
• The DBpedia ontology as multilingual glue
• Wikipedia-to-ontology mapping

DBpedia Ontology
Encoding worldwide encyclopedic knowledge

Mapping-based Extractor

Combines what belongs together

Separates what is different
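A quick way to see what the mapping buys you is to compare a raw infobox property with its mapped counterpart for the same entity. A sketch, assuming dbp:born and dbo:birthDate are both populated for this article:

  PREFIX dbp: <http://dbpedia.org/property/>
  PREFIX dbo: <http://dbpedia.org/ontology/>

  SELECT ?rawBirth ?mappedBirth WHERE {
    # dbp:born is extracted verbatim from the infobox (an untyped string);
    # dbo:birthDate is the ontology-mapped, typed equivalent.
    OPTIONAL { <http://dbpedia.org/resource/Elvis_Presley> dbp:born ?rawBirth }
    OPTIONAL { <http://dbpedia.org/resource/Elvis_Presley> dbo:birthDate ?mappedBirth }
  }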


DIEF: Mapping-Based Infobox Extractor

The Mappings Wiki
Anybody can contribute at mappings.dbpedia.org

Download the latest DBpedia dump at

http://downloads.dbpedia.org/current/


English SPARQL endpoint
dbpedia.org/sparql
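For instance, the “simple question” from earlier, what Santa Clara and San Francisco have in common, reduces to a two-pattern query (resource URIs assumed; verify the exact names on dbpedia.org):

  SELECT DISTINCT ?property ?value WHERE {
    # Any property-value pair shared by both cities
    <http://dbpedia.org/resource/Santa_Clara,_California> ?property ?value .
    <http://dbpedia.org/resource/San_Francisco> ?property ?value .
  }
  LIMIT 100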


Language chapters
DBpedia in your mother tongue

Active chapters

International (English-based)

Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Spanish


Host your own language chapter!


Applications
Get the best out of DBpedia data

Knowledge Graphs
Highly informative summaries in your own language

Question Answering
“Who is Bram Stoker?”

Entity Linking
Detecting Things in Text

Language- and Domain-specific Resources for Short Sentence Classification

Automatic Huge Gazetteers

DBpedia Stakeholders
Who is using the knowledge base?

Open Government
Linking Local Data

Digital Libraries
Enriching the Catalogue

Data-driven Journalism
Building Infographics

Hands-on Time!

https://pad.okfn.org/p/DBpediaULI


And now the final part!


The ULI use case
Putting Linked Open Data to work

What’s wrong with Localization Interoperability?

Inconsistent application, implementation, and interpretation of standards

Lack of clear requirements for localization data interchange

Unicode Localization Interoperability

Technical Committee of Unicode

Focus Areas:

1. Translation memory

2. Translation source strings / translations

3. Segmentation rules

ULI: Segmentation

Given: Thanks to Dr. Jones for this effort.

UAX #29 segmentation: |Thanks to Dr.| Jones for this effort.|
English: |Thanks to Dr. Jones for this effort.|

ULI Suppression: Abbreviations

English: Mr. Mrs. Dr. St. …
Spanish: Sr. Dto. Sra. Avda. …
Russian: проф. февр. тел. кв. …

Demo: ULI Breaks

http://demo.icu-project.org/icu-bin/icusegments

DEMO

DBpedia applied to ULI (University of Leipzig)
Sebastian Hellmann, Martin Brümmer, Dimitris Kontokostas

Opportunity:

Help segmentation by supplying abbreviation data

Yes!

Evaluation shows that, especially for short texts, abbreviations can improve the precision and recall of segmentation.

Success rate

Multilingual, with over 100 languages
Structured data eases extraction
Additional data, like entity types and categories

Example: Mr.

“MR” disambiguation page links to the “Mr.” article.

Ends in full stop, so may be an abbreviation.

The “Mr.” SPARQL query

  SELECT ?entryExample ?exampleTested ?indegreeRanking
  WHERE {
    <http://dbpedia.org/resource/Mr.> rdfs:label ?entryExample ;
                                      rdfs:comment ?exampleTested .
    FILTER ( lang(?entryExample) = lang(?exampleTested) )
    # subselect:
    { SELECT count(?in) as ?indegreeRanking
      WHERE { ?in ?p <http://dbpedia.org/resource/Mr.> } }
  }
  LIMIT 100
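The query can be pasted directly into dbpedia.org/sparql, which can return TSV or JSON. The subselect counts the triples pointing at the entity (?in); that indegree serves as a rough popularity ranking for the abbreviation.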

DEMO

Example DBpedia data (English)

St. → Street
<http://en.wikipedia.org/wiki/Street>
Types: <http://schema.org/Place>, <http://dbpedia.org/ontology/Place>, <http://dbpedia.org/ontology/PopulatedPlace>

Example DBpedia data (Russian)

Проф. → Профессор (Professor)

<http://ru.wikipedia.org/wiki/Профессор>

1. Get abbreviation URIs
2. Load DBpedia data into a local DB
3. Query the data via SPARQL and output TSV (see the sketch below)
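Step 3 could look like the following sketch: rdfs:label is standard RDFS, and the full-stop filter is the same heuristic described for “Mr.” above.

  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?entity ?label (lang(?label) AS ?language) WHERE {
    ?entity rdfs:label ?label .
    # Keep only labels ending in a full stop: abbreviation candidates
    FILTER ( STRENDS(STR(?label), ".") )
  }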

22,859 abbreviations with 78,197 meanings in 99 languages

Long tail: only 25 languages have more than 100 abbreviations; only 7 have more than 1,000.

[Charts: Long tail (total abbrevs); Long tail (total abbrevs), zoomed]

ULI Process

Wikipedia → DBpedia extraction → ULI review (Translation Memory comparison, manual review) → CLDR abbreviation suppressions

[Image: "Lupa.na.encyklopedii" by Julo, own work, public domain, via Wikimedia Commons]

Comparison with Translation Memory

Entry      % in TM
Corp.      0.0307%
St.        0.0023%
P.T.T.C.   0%

[Image: "Trichtermitfilter" by Gmhofmann, own work, CC BY-SA 3.0, via Wikimedia Commons]

CLDR Input

Extract abbreviations from CLDR localized data

Days of week: Sun. Mon. Tue. Wed. Thu. …

Months: Jan. Feb. Mar. …

etc…

Manual Review

CLDR output format

  <segmentations>
    <segmentation type="SentenceBreak">
      <!-- From ULI data, http://uli.unicode.org -->
      <suppressions type="standard">
        <suppression>Port.</suppression>
        <suppression>Alt.</suppression>
        <suppression>Di.</suppression>
        <suppression>Ges.</suppression>
        <suppression>frz.</suppression>
        …

CLDR 26 Output

http://cldr.unicode.org

“Break Suppression”

de: 239, en: 151, es: 164, fr: 82, it: 45, pt: 170, ru: 18

Challenges

"Long Tail" Languages

harder to find existing TM data

harder to find linguistic rules/review

harder to find tagged corpora to benchmark

Systematic issues with using redirects/disambiguation

Opportunity

Scope:
Non-full-stop punctuation, e.g., "Yahoo!"
Language-specific abbreviation rules
Context (Medical, Business, …)

Leverage:
Schema/taxonomy ("Place" vs. "Person", etc.) to filter
DBpedia lists
Additional LOD

Thank You!

Further Q&A?


Slides & contact info: https://pad.okfn.org/p/DBpediaULI
