
Graph based semantic annotation for enriching documents with … · 2014-09-30 · "Semantic Linking of Learning Object Repositories to Dbpedia”. Manuel Lama, Juan Carlos Vidal,



citius.usc.es

Centro Singular de Investigación en Tecnoloxías da Información

UNIVERSIDADE DE SANTIAGO DE COMPOSTELA

Graph based semantic annotation for enriching documents with linked data

Supervisors: Manuel Lama, Juan C. Vidal

Estefanía Otero García


Description of the problem

Learning Fruits

Digital books with clear and concise text. They have a friendly interface and interactive audiovisual resources:

• The content is identical to that of the textbooks, but with links to other content that give access to relevant and complementary information.

• They are based on an XML format that describes the structure of the Learning Fruit (LF), which facilitates their use by e-learning applications.


Description of the problem


Open problems

PROB1: Getting additional and complementary information about terms for which:

• there are no links to complementary web pages or other content

• no explanation is provided.

PROB2: Searching for information inside the content of the LF:

• It is necessary to navigate through links to external content and web pages


ADEGA 1.0

Solution


Semantic annotation of the Learning Fruit content

Automatically associate the set of relevant terms of the LF with instances of a densely populated ontology

• Each relevant term is associated with a graph of instances that contains information about the term in the LF context

• Each instance has a link to an external web page


Required ontology characteristics

• It represents general-purpose encyclopedic knowledge

• It is populated with a huge number of instances, each of which has a link to a document or web page

Solution: Ontology


DBpedia content is modeled using a shallow, cross-domain ontology based on standard vocabularies.

• 529 classes and 2,333 properties

The DBpedia ontology is populated by extracting data from the structured information in Wikipedia:

• Infobox templates

• Information from categories

Each DBpedia instance is linked to the Wikipedia page from which its information has been extracted
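As a rough illustration of the instance-to-page link, the sketch below builds a SPARQL query string that asks for the Wikipedia page of one DBpedia resource. It assumes the link is exposed through the `foaf:isPrimaryTopicOf` property, as in public DBpedia releases. The function name and the example resource are illustrative only, and the queries that ADEGA actually issues are not given in these slides.

```python
def wikipedia_page_query(resource_uri: str) -> str:
    """Build a SPARQL query asking for the Wikipedia page of a DBpedia resource."""
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?page WHERE {\n"
        f"  <{resource_uri}> foaf:isPrimaryTopicOf ?page .\n"
        "}"
    )


# Illustrative resource URI; any DBpedia instance URI could be used instead.
query = wikipedia_page_query("http://dbpedia.org/resource/Cleopatra")
print(query)
```

The string can then be sent to any SPARQL endpoint that hosts DBpedia data; no network call is made here.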

Solution: DBpedia


Statistics (English version)

Ontology size

• 2.46 billion RDF triples

• 4 million entities

Entity number

Persons          832,000
Places           639,000
Organizations    209,000
Species          226,000
Films             78,000
Music Albums     116,000
Video games       18,500
Diseases           5,600

External links

Images                  24,600,000
External web pages      27,600,000
RDF repositories        45,000,000
Wikipedia Categories    67,000,000
YAGO Categories         41,200,000

Solution: DBpedia


Annotation based on instance GENERATION

Description: These proposals identify the terms of the document to annotate and automatically create instances in the ontology. This requires:

• Identifying the ontology concept to which the instance belongs.

• Assigning values to the attributes of the instance.

Advantage V1: Ontologies with a huge number of instances are not necessary.

Limitations L1: They require complex language processing techniques, combined with machine learning techniques, to detect the concepts to which instances and attributes belong.

L2: They apply only to a very restricted number of general concepts (people, organizations and places).

State of the art


Annotation based on instance SEARCH

Description: These proposals identify the terms of the document to annotate and then search a densely populated ontology for the instances that best represent the semantics of each term. This requires:

• Applying disambiguation techniques between instances.

• Using the context in the search.

Advantages V1: They do not require natural language processing techniques as complex as those of annotation based on instance generation:

• Named Entity Recognition

• Syntactic and semantic similarity between terms

Limitations L1: An ontology populated with a huge number of instances is necessary

State of the art


No context-based proposal exists that properly annotates a term with a semantic graph

• DBpedia Ranker discards relevant taxonomic relations

State of the art


It identifies the relevant terms that characterize the LF: the context

Each term is associated with a DBpedia instance

The semantic graph is obtained by filtering instances using the context

ADEGA 1.0

Framework


All your Documents Enriched with Graph Annotations


Morphology Analysis

Stanford POS tagger

Set of nouns, proper nouns and compound nouns extracted from the LF content:

Pharaoh, Egypt, Ancient Egypt, Cleopatra VII, Ramesses II, tomb, temple, priest, Horus, Nile, God, …
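The noun extraction step can be sketched as a filter over part-of-speech tags. The snippet below is a minimal sketch that assumes the text has already been tagged (for example by the Stanford POS tagger) with Penn Treebank tags, and that a compound noun is simply a run of consecutive noun tokens. The function name, the sample sentence and its tags are hypothetical and not taken from ADEGA.

```python
# Penn Treebank tags for common and proper nouns, singular and plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_noun_terms(tagged_tokens):
    """Return noun terms, merging consecutive noun tokens into one compound term."""
    terms, current = [], []
    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            current.append(word)
        elif current:
            terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    # Remove duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(terms))


# Hand-tagged sample sentence (hypothetical tagger output).
tagged = [
    ("Ramesses", "NNP"), ("II", "NNP"), ("built", "VBD"), ("a", "DT"),
    ("temple", "NN"), ("near", "IN"), ("the", "DT"), ("Nile", "NNP"),
]
terms = extract_noun_terms(tagged)
print(terms)
```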

ADEGA 1.0

Framework: Context


Similarity Analysis

Hybrid metric: SoftTFIDF (TF-IDF + Jaro-Winkler)

Clusters of terms composed of words that share a similar meaning or arise from the same root.

Pharaoh, Egypt, Ancient Egypt, Cleopatra VII, Ramesses II, tomb, temple, priest, Horus, Nile, god, …

{Egypt, egyptian, egyptologist}

{Cleopatra, Cleopatra VII}

{god, gods}
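To make the clustering idea concrete, the sketch below implements only the Jaro-Winkler part of the hybrid metric and groups terms greedily when their similarity reaches a cut-off. The TF-IDF weighting of SoftTFIDF is left out, and the 0.85 threshold, the greedy grouping strategy and all function names are assumptions made for illustration rather than ADEGA's actual settings.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def cluster_terms(terms, threshold=0.85):
    """Greedy grouping: a term joins the first cluster holding a similar member."""
    clusters = []
    for term in terms:
        for cluster in clusters:
            if any(jaro_winkler(term.lower(), m.lower()) >= threshold for m in cluster):
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


clusters = cluster_terms(
    ["Egypt", "egyptian", "egyptologist", "Cleopatra", "Cleopatra VII", "god", "gods", "Nile"]
)
print(clusters)
```

With this cut-off the sample terms fall into the same three groups shown on the slide, plus a singleton for Nile; a stricter threshold would split off longer derivations such as egyptologist.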

ADEGA 1.0

Framework: Context


Frequency Analysis

Number of times the term appears in the LF fields

Final relevance

It is calculated using the frequency weighted by the relevance of each LF field (pα)

{Egypt, egyptian, egyptologist}

{Cleopatra, Cleopatra VII}

{god, gods}

LF Context
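A minimal sketch of a field-weighted frequency is given below. It assumes the final relevance of a term is the sum, over the LF fields, of the number of occurrences in that field multiplied by the field weight pα. The field names, the weights and any normalisation step are hypothetical, since the slide does not list them.

```python
def term_relevance(field_counts, field_weights):
    """Sum the per-field frequencies of a term, each weighted by its field relevance."""
    return sum(
        field_weights.get(field, 0.0) * count
        for field, count in field_counts.items()
    )


# Hypothetical field weights (the p_alpha values) and counts for one term cluster.
weights = {"title": 3.0, "description": 2.0, "content": 1.0}
egypt_counts = {"title": 1, "content": 4}
relevance = term_relevance(egypt_counts, weights)
print(relevance)
```

Under this assumption a single occurrence in a heavily weighted field such as the title can count as much as several occurrences in the body text.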

ADEGA 1.0

Framework: Context

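[Formula: final relevance of a term, computed from its frequency in each LF field weighted by the field relevance pα]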


[Figure: annotation pipeline. Context extraction (frequency, similarity, …) and URI identification, with the graph explored over levels 1, 2, 3 and 4]

Example LF text: Luxor is part of the ancient city called Uaset (in ancient Egyptian), also known as Thebes (in Greek), which Homer named "the city of a hundred gates" for its numerous gates. It is the city of the great temples of ancient Egypt (Luxor and Karnak) and of the famous necropolises of the bank ...

Context terms: Thebes, Necropolis, Pharaohs, ...

Relevance values: 0.71, 0.65, 0.64, 0.61

Paleolithic node

#level    #nodes
1         537
2         12,825
3         156,770
4         2,950,620

There are graph nodes that are not relevant to semantically describe the terms of the document

• Context is used to discriminate the relevant instances for the semantic description of the LF

ADEGA 1.0

Framework: Filter graph


Depth First Search algorithm (depth limited)

• The exploration determines the relevance of each node, which depends on the relevance of the child nodes connected to it

ADEGA 1.0

Framework: Filter graph

[Figure: example graph with nodes n1 to n7, made up of URI nodes and text nodes; the edges carry DBpedia relation weights (wr12, wr17, wr23, wr24, wr45, wr46)]

A node is relevant if it exceeds a threshold (frequency, diversity)
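The depth-limited exploration can be sketched as a recursive function in which text nodes carry a score and each URI node aggregates the scores of its children. The weighted sum used for aggregation, the threshold value, the edge weights and the choice of which nodes are text nodes are all assumptions made for this sketch; the slides only state that a node's relevance depends on its children and that a node is kept when it exceeds a threshold.

```python
def explore(graph, text_scores, node, threshold, max_depth, kept, depth=0, path=()):
    """Depth-limited DFS: return the relevance of `node`, adding relevant nodes to `kept`."""
    if node in text_scores:
        # Text (literal) node: its score is assumed to be precomputed from the context.
        relevance = text_scores[node]
    elif depth >= max_depth:
        # Depth limit reached: do not expand this URI node any further.
        relevance = 0.0
    else:
        ancestors = path + (node,)
        relevance = 0.0
        for child, weight in graph.get(node, []):
            if child in ancestors:
                continue  # skip edges that would close a cycle
            relevance += weight * explore(
                graph, text_scores, child, threshold, max_depth, kept, depth + 1, ancestors
            )
    if relevance > threshold:
        kept.add(node)
    return relevance


# Hypothetical graph: node -> list of (child, relation weight).
graph = {
    "n1": [("n2", 0.5), ("n7", 1.0)],
    "n2": [("n3", 1.0), ("n4", 0.5)],
    "n4": [("n5", 1.0), ("n6", 1.0)],
}
# Hypothetical context scores of the text nodes.
text_scores = {"n3": 0.4, "n5": 0.6, "n6": 0.0, "n7": 0.1}

kept = set()
root_relevance = explore(graph, text_scores, "n1", threshold=0.3, max_depth=3, kept=kept)
print(sorted(kept), root_relevance)
```

In this sketch a node reachable through several paths is evaluated once per path, which is one reason why the number of visited nodes grows quickly with the depth limit.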


Ancient Egypt, Assyrian empire, Babylonia, Canary Island, Cantabrian mountain, Caspian sea, Caucasus, Cleopatra VII, Desert, Earth, Egypt, Enlil, Euphrates, Pharaoh, Fossil, Giza, God, Guadalquivir, Gudea, Hammurabi, Homo Erectus, Homo Habilis, Horus, Human, Ishtar, Neanderthal, Nile, Oceanic climate, Osiris, Paleolithic, Prehistory, ...

Learning Fruits                             #terms
The landscape of the earth                  7
The river civilizations of Mesopotamia      13
The landscape of Spain and Europe           10
The Paleolithic and our remote ancestors    10
Ancient Egypt                               10

ADEGA 1.0

Validation


ADEGA 1.0

Validation: Results


Comparative: ADEGA vs RelFinder

• RelFinder is configured with an exploration depth of 2 levels, and the context terms are given as its input.

• The F1-score is used to compare ADEGA and RelFinder, using the same number of instances in both algorithms.
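For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a set of retrieved instances against a set of instances judged relevant; the instance names are placeholders and do not come from the validation data.

```python
def f1_score(retrieved, relevant):
    """Harmonic mean of precision and recall for two collections of instances."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Placeholder instance identifiers, for illustration only.
retrieved = {"a", "b", "c", "d"}
relevant = {"a", "b", "e"}
score = f1_score(retrieved, relevant)
print(round(score, 4))
```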

ADEGA 1.0

Validation: Comparative


ADEGA 1.0

Computational Issues

Each additional level of exploration exponentially increases the number of visited nodes

Jump from level to 1 implies visiting nodes

Exploration results for 1 term

Variable                                 Value
Average nodes visited                    248,035.02
Average nodes discarded                  199,461.73
Average nodes processed                  48,573.29
Average literals processed               44,165.53
Average URLs processed                   4,407.76
Average number of SPARQL queries         22,882.64
Mean time per query in ms                9.91
Mean time of ADEGA (depth = 3) in ms     376,414.17

80% of the visited nodes are discarded

91% of the processed nodes are text nodes, the most costly nodes

67% of computational time was used to query DBpedia

x10 terms (avg) = 50 min to obtain a solution
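The percentages above can be checked against the averages in the table with a few lines of arithmetic. The variable names below are illustrative. The first two ratios reproduce the 80% and 91% figures; the last two, derived only from the table's means, come out at roughly 60% and 63 minutes, so the slide's 67% and 50 min were presumably computed from per-run measurements rather than from these averages.

```python
# Averages for one term, copied from the table of exploration results.
visited = 248_035.02
discarded = 199_461.73
processed = 48_573.29
literals = 44_165.53
queries = 22_882.64
ms_per_query = 9.91
total_ms = 376_414.17

discard_share = discarded / visited                 # share of visited nodes discarded
literal_share = literals / processed                # share of processed nodes that are text
query_share = queries * ms_per_query / total_ms     # share of time spent querying DBpedia
minutes_for_10_terms = 10 * total_ms / 60_000       # sequential run over 10 terms

print(f"{discard_share:.1%} {literal_share:.1%} {query_share:.1%} {minutes_for_10_terms:.1f} min")
```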


Journal Publications

“Semantic Linking of Learning Object Repositories to DBpedia”. Manuel Lama, Juan Carlos Vidal, Estefanía Otero-García, Alberto Bugarín, and Senén Barro. Educational Technology & Society 15, no. 4 (2012): 47-61. JCR = 1.171

“Graph-based semantic annotation for enriching documents with linked data”. Juan C. Vidal, Manuel Lama, Estefanía Otero-García, Alberto Bugarín. Knowledge-Based Systems (2013). JCR = 4.104


Publications


ADEGA

Demo user interface

All your Documents Enriched with Graph Annotations

http://tec.citius.usc.es/adega


ADEGA applications

UNIVERSIA ANNOTATION

Classification of Universia resources

http://tec.citius.usc.es/universia/lookup/


ADEGA applications

MENTOR EMPRENDE

http://www.redemprendia.org/mentor/mentoremprende


Questions?