citius.usc.es
Centro Singular de Investigación en Tecnoloxías da Información
UNIVERSIDADE DE SANTIAGO DE COMPOSTELA
Graph based semantic annotation for enriching documents with linked data
Supervisors: Manuel Lama, Juan C. Vidal
Estefanía Otero García
Description of the problem
Learning Fruits
Digital books with clear and consistent text. They have a friendly interface and interactive audiovisual resources:
• The content is identical to that of the textbooks, but with links to other content that give access to relevant and complementary information.
• They are based on an XML format that describes the structure of the Learning Fruit (LF), which facilitates their use by e-learning applications.
Description of the problem
Open problems
PROB1: Get additional and complementary information about terms that:
• have no links to complementary web pages or other content
• are not explained in the text.
PROB2: Search for information inside the content of the LF:
• It is necessary to navigate through links to external content and web pages.
ADEGA 1.0
Solution
Semantic annotation of the Learning Fruit content
Automatically associate the set of relevant terms of the LF to instances of a densely populated ontology
• Each relevant term is associated with a graph of instances that contains information about the term in the LF context
• Each instance has a link to an external web page
Required ontology characteristics
• It represents general-purpose encyclopedic knowledge
• It is populated with a huge number of instances, each linked to a document or web page
Solution: Ontology
DBpedia content is modeled using a shallow, cross-domain ontology based on standard vocabularies.
• 529 classes and 2,333 properties
The DBpedia ontology is populated by extracting data from the structured information in Wikipedia:
• Infobox templates
• Information from categories
Each DBpedia instance is linked to the Wikipedia page from which its information was extracted.
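As a minimal sketch of how the link from an instance to its Wikipedia page could be retrieved, the helper below builds a SPARQL query string. It assumes the link is exposed through the `foaf:isPrimaryTopicOf` property; the function name `wikipedia_page_query` is hypothetical, and the slides do not say which property or query ADEGA actually uses.

```python
def wikipedia_page_query(resource_uri: str) -> str:
    """Build a SPARQL query asking for the Wikipedia page of a DBpedia instance.

    Assumption: the page is linked through foaf:isPrimaryTopicOf.
    """
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        f"SELECT ?page WHERE {{ <{resource_uri}> foaf:isPrimaryTopicOf ?page }}"
    )


# Example: query text for the DBpedia instance of the Nile.
query = wikipedia_page_query("http://dbpedia.org/resource/Nile")
print(query)
```

The query is only built here, not sent; running it would require a SPARQL endpoint and a client of your choice.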
Solution: DBpedia
Statistics (English version)
Ontology size
• 2.46 billion RDF triples
• 4 million entities
Number of entities
• Persons: 832,000
• Places: 639,000
• Organizations: 209,000
• Species: 226,000
• Films: 78,000
• Music albums: 116,000
• Video games: 18,500
• Diseases: 5,600
External links
• Images: 24,600,000
• External web pages: 27,600,000
• RDF repositories: 45,000,000
• Wikipedia categories: 67,000,000
• YAGO categories: 41,200,000
Solution: DBpedia
Annotation based on instance GENERATION
Description: These proposals identify the terms of the document to annotate and automatically create instances in the ontology. This requires:
• Identifying the ontology concept to which the instance belongs.
• Assigning values to the attributes of the instance.
Advantage
V1: Ontologies with a huge number of instances are not necessary.
Limitations
L1: They require complex language processing techniques, combined with machine learning techniques, to detect the concepts to which instances and attributes belong.
L2: They apply only to a very restricted number of general concepts (people, organizations and places).
State of the art
Annotation based on instance SEARCH
Description: These proposals identify the terms of the document to annotate and then search a densely populated ontology for the instances that best represent the semantics of each term. This requires:
• Applying disambiguation techniques between instances.
• Using the context in the search.
Advantage
V1: They do not require natural language processing techniques as complex as those of annotation based on instance generation.
• Named Entity Recognition
• Syntactic and semantic similarity between terms
Limitation
L1: A populated ontology with a huge number of instances is necessary.
State of the art
No existing context-based proposal properly annotates a term with a semantic graph.
• DBpedia Ranker discards relevant taxonomic relations.
State of the art
ADEGA identifies the relevant terms that characterize the LF: the context.
Each term is associated with a DBpedia instance.
The semantic graph is obtained by filtering instances using the context.
ADEGA 1.0
Framework
All your Documents Enriched with Graph Annotations
Morphology analysis (Stanford POS tagger)
Set of nouns, proper nouns and compound nouns extracted from the LF content:
Pharaoh, Egypt, Ancient Egypt, Cleopatra VII, Ramesses II, tomb, temple, priest, Horus, Nile, God, …
ADEGA 1.0
Framework: Context
Similarity analysis
Hybrid metric: SoftTFIDF (TF-IDF + Jaro-Winkler)
Clusters of terms composed of words that share a similar meaning or arise from the same root.
Terms: Pharaoh, Egypt, Ancient Egypt, Cleopatra VII, Ramesses II, tomb, temple, priest, Horus, Nile, god, …
Clusters:
{Egypt, egyptian, egyptologist}
{Cleopatra, Cleopatra VII}
{god, gods}
…
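The grouping of near-identical terms can be illustrated with a small sketch. It implements only the Jaro-Winkler half of the hybrid metric (the TF-IDF weighting of SoftTFIDF is omitted) and uses a greedy grouping rule. The threshold of 0.85, the grouping strategy and the names `jaro`, `jaro_winkler` and `cluster_terms` are assumptions for illustration, not the settings used by ADEGA.

```python
from typing import List


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def cluster_terms(terms: List[str], threshold: float = 0.85) -> List[List[str]]:
    """Greedy grouping: a term joins the first cluster with a similar member."""
    clusters: List[List[str]] = []
    for term in terms:
        for cluster in clusters:
            if any(jaro_winkler(term.lower(), m.lower()) >= threshold
                   for m in cluster):
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


TERMS = ["Egypt", "egyptian", "egyptologist", "Cleopatra",
         "Cleopatra VII", "god", "gods", "Nile"]
print(cluster_terms(TERMS))
```

With these inputs the sketch groups the Egypt, Cleopatra and god variants and leaves Nile on its own. A full SoftTFIDF implementation would additionally weight each token by its TF-IDF score.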
ADEGA 1.0
Framework: Context
Frequency analysis
Frequency: number of times the term appears in the LF fields.
Final relevance: calculated using the frequency weighted by the relevance of each LF field (pα).
{Egypt, egyptian, egyptologist}
{Cleopatra, Cleopatra VII}
{god, gods}
…
LF Context
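A minimal sketch of the weighted frequency described above: for each cluster of term variants, occurrences are counted per LF field and multiplied by that field's weight pα. The field names, the weight values and the function `term_relevance` are hypothetical; any normalization applied by the real formula is not reproduced here, and multi-word terms would need phrase matching instead of the simple tokenization used.

```python
from typing import Dict, Iterable

# Hypothetical field weights (p_alpha); the real values are not given in the slides.
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "body": 1.0}


def term_relevance(cluster: Iterable[str],
                   fields: Dict[str, str],
                   weights: Dict[str, float]) -> float:
    """Sum over fields of (occurrences of any variant) * (field weight)."""
    variants = {v.lower() for v in cluster}
    score = 0.0
    for name, text in fields.items():
        tokens = (tok.strip(".,;:()") for tok in text.lower().split())
        frequency = sum(1 for tok in tokens if tok in variants)
        score += frequency * weights.get(name, 0.0)
    return score


LF_FIELDS = {
    "title": "Ancient Egypt",
    "body": "The Egyptian gods were many. Horus was a god of Egypt.",
}

print(term_relevance({"Egypt", "egyptian", "egyptologist"}, LF_FIELDS, FIELD_WEIGHTS))
print(term_relevance({"god", "gods"}, LF_FIELDS, FIELD_WEIGHTS))
```

Because the title carries a higher weight than the body in this example, a single occurrence in the title contributes more than one in the body.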
ADEGA 1.0
Framework: Context
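[Equation: final relevance, a sum over the LF fields of the term frequency weighted by the field relevance pα]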
[Figure: annotation process. Context extraction (frequency, similarity, …), URI identification, and graph exploration across levels 1 to 4.]
Example LF text: "Luxor is part of the ancient city called Uaset (in ancient Egyptian), also known as Thebes (in Greek), which Homer called 'the city of a hundred gates' because of its many gates. It is the city of the great temples of ancient Egypt (Luxor and Karnak) and of the famous necropolises on the bank ..."
Context terms: Thebes, Necropolis, Pharaohs, ...
Relevance values: 0.71, 0.65, 0.64, 0.61
Paleolithic node
#level #nodes
1 537
2 12,825
3 156,770
4 2,950,620
Some graph nodes are not relevant for semantically describing the terms of the document.
• The context is used to select the instances that are relevant to the semantic description of the LF.
ADEGA 1.0
Framework: Filter graph
Depth First Search algorithm (depth limited)
• The exploration determines the relevance of each node, which depends on the relevance of the child nodes connected to it.
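The following is a rough sketch of a depth-limited depth-first exploration in which a node's relevance is derived from its children. Several details are assumptions made for illustration: text nodes are scored by the share of context terms they contain, a URI node takes the best child score multiplied by the relation weight, and a node is kept when its score exceeds a threshold. The toy graph, weights and texts are invented; the actual relevance function of ADEGA is not given in the slides.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # URI node -> [(child, relation weight)]


def text_relevance(text: str, context: Set[str]) -> float:
    """Share of context terms that occur in a text node (assumed scoring)."""
    if not context:
        return 0.0
    words = set(text.lower().split())
    return len(words & context) / len(context)


def explore(graph: Graph, texts: Dict[str, str], node: str, context: Set[str],
            max_depth: int, threshold: float, kept: Set[str],
            depth: int = 0) -> float:
    """Depth-limited DFS; returns the node score and records relevant nodes."""
    if node in texts:                      # text node: scored directly
        score = text_relevance(texts[node], context)
    elif depth >= max_depth:               # depth limit reached: stop exploring
        score = 0.0
    else:                                  # URI node: best weighted child score
        score = max(
            (weight * explore(graph, texts, child, context,
                              max_depth, threshold, kept, depth + 1)
             for child, weight in graph.get(node, [])),
            default=0.0,
        )
    if score > threshold:
        kept.add(node)
    return score


# Toy graph (hypothetical weights and texts).
graph: Graph = {
    "n1": [("n2", 0.9), ("n7", 0.5)],
    "n2": [("n3", 1.0), ("n4", 0.8)],
    "n4": [("n5", 1.0), ("n6", 1.0)],
}
texts = {
    "n3": "pharaoh of ancient egypt",
    "n5": "a river in spain",
    "n6": "the nile in egypt",
    "n7": "video game",
}
context = {"egypt", "pharaoh", "nile"}

kept: Set[str] = set()
explore(graph, texts, "n1", context, max_depth=3, threshold=0.5, kept=kept)
print(sorted(kept))
```

The depth limit bounds the recursion even if the graph contains cycles. Lowering `max_depth` prunes deeper branches and therefore shrinks the filtered graph.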
ADEGA 1.0
Framework: Filter graph
[Figure: example graph with nodes n1 to n7, distinguishing URI nodes from text nodes. Edges carry DBpedia relation weights: wr12, wr17, wr23, wr24, wr45, wr46.]
A node is relevant if it exceeds a threshold.
Frequency, Diversity
Ancient Egypt, Assyrian empire, Babylonia, Canary Island, Cantabrian mountain, Caspian sea, Caucasus, Cleopatra VII, Desert, Earth, Egypt, Enlil, Euphrates, Pharaoh, Fossil, Giza, God, Guadalquivir, Gudea, Hammurabi, Homo Erectus, Homo Habilis, Horus, Human, Ishtar, Neanderthal, Nile, Oceanic climate, Osiris, Paleolithic, Prehistory, ...
Learning Fruits #terms
The landscape of the earth 7
The river civilizations of Mesopotamia 13
The landscape of Spain and Europe 10
The Paleolithic and our remote ancestors 10
Ancient Egypt 10
ADEGA 1.0
Validation
ADEGA 1.0
Validation: Results
Comparison: ADEGA vs RelFinder
• RelFinder is configured with an exploration depth of 2 levels, and the context terms are given as input.
• The F1-score is used to compare ADEGA and RelFinder, using the same number of instances in both algorithms.
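For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a set of returned instances against a set of expected ones; the example sets are invented and do not come from the validation data.

```python
from typing import Set


def f1_score(retrieved: Set[str], relevant: Set[str]) -> float:
    """Harmonic mean of precision and recall for two sets of instances."""
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Invented example: 3 of 4 returned instances are correct, 5 were expected.
returned = {"Egypt", "Nile", "Horus", "Spain"}
expected = {"Egypt", "Nile", "Horus", "Osiris", "Giza"}
print(f1_score(returned, expected))  # precision 0.75, recall 0.60
```

Fixing the number of returned instances for both algorithms, as the comparison does, keeps the precision denominators equal so the scores are directly comparable.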
ADEGA 1.0
Validation: Comparative
ADEGA 1.0
Computational Issues
Each additional level of exploration exponentially increases the number of visited nodes.
Jumping from one level to the next multiplies the number of nodes to visit.
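The growth can be made concrete with the per-level node counts reported earlier for the Paleolithic node (537, 12,825, 156,770 and 2,950,620 nodes at levels 1 to 4). The short sketch below computes the ratio between consecutive levels; the helper name `growth_factors` is hypothetical.

```python
from typing import Dict

# Node counts per exploration level for the Paleolithic node (from the slides).
NODES_PER_LEVEL: Dict[int, int] = {1: 537, 2: 12_825, 3: 156_770, 4: 2_950_620}


def growth_factors(counts: Dict[int, int]) -> Dict[int, float]:
    """Ratio of nodes at each level to nodes at the previous level."""
    levels = sorted(counts)
    return {lvl: counts[lvl] / counts[prev]
            for prev, lvl in zip(levels, levels[1:])}


for level, factor in growth_factors(NODES_PER_LEVEL).items():
    print(f"level {level - 1} -> {level}: x{factor:.1f}")
```

Each step multiplies the node count by a factor between roughly 12 and 24 for this node, which is why the exploration depth has to be limited.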
Exploration results for 1 term
Variable Value
Average nodes visited 248,035.02
Average nodes discarded 199,461.73
Average nodes processed 48,573.29
Average literals processed 44,165.53
Average URLs processed 4,407.76
Average number of SPARQL queries 22,882.64
Mean time per query in ms 9.91
Mean time of ADEGA (depth = 3) in ms 376,414.17
80% of the visited nodes are discarded
91% of the processed nodes are text nodes, the most costly nodes
67% of the computational time is used to query DBpedia
× 10 terms (avg) = 50 min to obtain a solution
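The first two percentages can be checked directly against the averages in the results table. This is a small arithmetic sketch using the reported values; the variable names are chosen here for illustration.

```python
# Averages reported in the exploration results for 1 term.
nodes_visited = 248_035.02
nodes_discarded = 199_461.73
nodes_processed = 48_573.29
literals_processed = 44_165.53

discarded_share = nodes_discarded / nodes_visited   # share of visited nodes discarded
text_share = literals_processed / nodes_processed   # share of processed nodes that are text

print(f"discarded: {discarded_share:.0%}, text nodes: {text_share:.0%}")
```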
Journal Publications
"Semantic Linking of Learning Object Repositories to DBpedia". Manuel Lama, Juan Carlos Vidal, Estefanía Otero-García, Alberto Bugarín, and Senén Barro. Educational Technology & Society 15, no. 4 (2012): 47-61. JCR = 1.171
"Graph-based semantic annotation for enriching documents with linked data". Juan C. Vidal, Manuel Lama, Estefanía Otero-García, Alberto Bugarín. Knowledge-Based Systems (2013). JCR = 4.104
Publications
ADEGA
Demo user interface
All your Documents Enriched with Graph Annotations
http://tec.citius.usc.es/adega
ADEGA applications
UNIVERSIA ANNOTATION
Classification of Universia resources
http://tec.citius.usc.es/universia/lookup/
ADEGA applications
MENTOR EMPRENDE
http://www.redemprendia.org/mentor/mentoremprende
Questions?