Semantic Linking & Retrieval for Digital Libraries
Dr. Stefan Dietze
11.02.2016
Institut für Informatik/Universität Bonn
29/03/16 1Stefan Dietze
Overview: research/application context
Information (types)
Bibliographic (meta)data
Research information
Educational (meta)data
Web & social data
Stakeholders
Archival organisations
Digital libraries
Publishers
Resource providers/consumers
Domains
Life Sciences
Computer Science
Learning Analytics
...
Data-centric tasks
Publishing, preservation, annotation, crawling, search, retrieval ...
Overview: contents
Introduction & motivation
Publishing, linking and profiling
Publishing & linking (bibliographic) data
Dataset profiling & linking
Retrieval & search
Entity retrieval in large graphs
Embedded (bibliographic) Web data
Entity summarisation from Web markup
Outlook and future directions
Part 1, knowledge graphs and linked data: [WebSci13, SWJ15], [ESWC13, ESWC14], [ISWC15]
Part 2, beyond LD: embedded semantics: [ongoing]
Linked Data diversity: example library & scholarly data
Linked Data: W3C standards & de-facto standard for sharing data on the Web (roughly 1000 datasets, 100 bn triples), adopted specifically by library/GLAM sector & life sciences
Strong focus on established knowledge graphs, e.g. Yago, DBpedia, Freebase (still)
Vocabularies/Schemas
BIBO, Bibliographic Ontology
BIRO, Bibliographic Reference Ontology
CITO, Citation Typing Ontology
SPAR vocabularies (incl. CITO, BIRO)
SWRC (Semantic Web Dogfood)
Functional Req. for Bibliographic Records (FRBR)
Nature Publishing Group Ontology
mEducator Educational Resources
....
Datasets
EUROPEANA
British Library
German, French and Spanish national libraries
Nature Publishing Group
Hochschulbibliothekszentrum NRW
Elsevier Scholarly Publications
TED Talks
mEducator Linked Educational Resources
Open Courseware Consortium
LAK Dataset
...
Initiatives
W3C Library Linked Data Incubator Group
Linked Library Data group on DataHub
LinkedUniversities.org
LinkedEducation.org
W3C Linked Open Education Community Group
...
Challenge: efficient search for suitable resources & datasets
„Quality“: currency, dynamics, accessibility [Buil-Aranda2013], correctness [Paulheim2013], schema compliance [Hogan2012]
Domains/topics: which datasets/resources address topic XY (e.g. „microbiology“) ?
Types: statistical data, bibliographic resources, AV resources, scholarly publications?
Links: related datasets?
Data publishing, linking and profiling: LinkedUp
Dataset
Catalog/Registry
http://data.linkededucation.org/linkedup/catalog/
LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning solutions)
LinkedUp Catalog: largest collection of LD/Open Data for educationally relevant resources (approx. 50 Datasets)
Original datasets published with key content providers, automatically extracted metadata
mEducator: medical educational resources
EC-funded eContentPlus project (2009-2012)
Exploratory search through semantic and clustering techniques
Lifting/enriching/clustering medical metadata
Common vocabularies (MeSH, SNOMED, BioPortal etc.)
mEducator dataset: first Linked Data corpus of enriched OER metadata, used by a number of applications
Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.
Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.
LAK Dataset: facilitating scientometrics
Linked Data corpus of "Learning Analytics" publications of the last 5 years (~800 publications)
Metadata, full-text & automated linking (DBLP, SWDF, DBpedia)
Wide adoption (http://lak.linkededucation.org)
1. Data extraction & vocabulary definition
2. Entity co-reference resolution & linking
3. Applications & analysis
Concept              Type                 #
Reference            npg:Citation         7885
Author               foaf:Person          1214
Conference Paper     swrc:InProceedings   652
Organization         foaf:Organization    365
Journal Paper        bibo:Article         45
Proceedings Volume   swrc:Proceedings     15
Journal Volume       bibo:Journal         9
Dietze, S., Taibi, D., D’Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Semantic Web Journal, 2015.
Schema analysis & mapping
LinkedUp Catalog: dataset index & registry, federated search
“Federated queries” through schema mappings
Dataset accessibility
Linking & topic profiling
Schema/Types
Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
[Figure: type co-occurrence graph, e.g. po:Programme, yov:Video, bibo:Book]
D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Schema analysis & mapping: co-occurrence after mapping
201 frequently occurring types, mapped into 79 types
[Figure: type co-occurrence graph after mapping, e.g. bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video]
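The effect of schema mapping on type co-occurrence can be illustrated with a minimal sketch. Here two types are taken to co-occur when the same dataset uses both; the dataset names, the type assignments and the mapping table are hypothetical placeholders, not the mappings used in the LinkedUp Catalog.

```python
from collections import Counter
from itertools import combinations

# Hypothetical example data: which types each dataset uses.
DATASET_TYPES = {
    "dataset_a": {"yov:Video", "foaf:Document"},
    "dataset_b": {"po:Programme", "foaf:Document"},
    "dataset_c": {"bibo:Book", "foaf:Document"},
}

# Hypothetical mapping of source types onto shared target types.
TYPE_MAPPING = {
    "yov:Video": "bibo:Film",
    "po:Programme": "bibo:Film",
    "bibo:Book": "bibo:Document",
    "foaf:Document": "bibo:Document",
}


def type_cooccurrence(dataset_types, mapping=None):
    """Count, per pair of types, the number of datasets using both types."""
    mapping = mapping or {}
    counts = Counter()
    for types in dataset_types.values():
        # Apply the mapping first; unmapped types are kept unchanged.
        mapped = sorted({mapping.get(t, t) for t in types})
        counts.update(combinations(mapped, 2))
    return counts


before = type_cooccurrence(DATASET_TYPES)
after = type_cooccurrence(DATASET_TYPES, TYPE_MAPPING)
print(len(before), "distinct type pairs before mapping")
print(len(after), "distinct type pairs after mapping")
```

After mapping, the video and programme datasets share one type pair, which is the kind of overlap that makes federated queries across datasets feasible.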
Data(set) interlinking
Linking entities/datasets through a combination of (i) a "semantic (graph-based) connectivity score (SCS)" (based on Katz centrality) and (ii) a "co-occurrence-based measure (CBM)" (similar to Normalised Google Distance)
Evaluation: outperforming Explicit Semantic Analysis (ESA)
Example: Yovisto Video (yov:Video)
<yo:Video …>
<dc:title>Lecture 29 – Stem Cells</dc:title>
…
</yo:Video…>
Example: British Library Book (bibo:Book)
<bibo:Book …>
<bibo:title>Über den Hungertyphus</.>
<bibo:creator>Rudolf Virchow</…>
</bibo:Book…>
[Figure: both resources connected via DBpedia entities db:Medicine, db:Rudolf Virchow, db:Cell Biology; SCS = 0.32, CBM = 0.24]
Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
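The two measures can be sketched as follows. This is a simplified illustration, not the formulation from the ESWC 2013 paper: the connectivity score is a truncated Katz-style sum over walks between two entities, and the co-occurrence measure is the standard Normalised Google Distance. The example graph, the damping factor `beta`, the walk length limit and the frequency counts are all assumptions.

```python
import math
from collections import defaultdict


def connectivity_score(edges, source, target, max_len=3, beta=0.5):
    """Katz-style score: sum of beta**l times the number of walks of length l."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    walks = {source: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for node, count in walks.items():
            for nb in neighbours.get(node, ()):
                nxt[nb] += count
        walks = nxt
        score += (beta ** length) * walks.get(target, 0)
    return score


def cooccurrence_distance(fx, fy, fxy, n):
    """Normalised Google Distance from frequencies fx, fy, joint fxy, corpus size n."""
    if fxy == 0:
        return math.inf  # the two terms never co-occur
    if n <= min(fx, fy):
        raise ValueError("corpus size must exceed the term frequencies")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical entity graph around the slide's example entities.
EDGES = [
    ("db:Medicine", "db:Rudolf_Virchow"),
    ("db:Rudolf_Virchow", "db:Cell_biology"),
]
print(connectivity_score(EDGES, "db:Medicine", "db:Cell_biology"))
print(cooccurrence_distance(fx=100, fy=100, fxy=10, n=10000))
```

A higher connectivity score and a lower co-occurrence distance both indicate a stronger relation, so a combined measure would need to invert or rescale one of them before merging.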
Scalable profiling of datasets
Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets
Technically trivial, but scalability issues: LOD Cloud 1000+ datasets with <100 billion RDF statements
Efficient approach: sampling & ranking for balance between scalability and precision/recall
[Figure: Yovisto video "Lecture 29 – Stem Cells" (yov:Video) from the dataset catalog/registry; candidate entities db:Cell (Biology) vs db:Cell (Microprocessor); topics db:Biology, db:Cell biology]
Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
Efficient dataset profiling
1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
2. Entity- & topic-extraction (NER via DBpedia Spotlight, category mapping & -expansion)
3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
Result: weighted dataset-topic profile graph
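Steps 1 and 3 can be sketched in a few lines; step 2 (entity and topic extraction with DBpedia Spotlight) is skipped and replaced by a hand-made graph. This is a minimal sketch under stated assumptions: the sampling weights, the example graph, the damping value `beta` and the handling of dangling nodes are illustrative choices, not the settings of the ESWC 2014 approach.

```python
import random


def weighted_sample(resources, weights, k, seed=0):
    """Weighted sampling without replacement (Efraimidis-Spirakis random keys)."""
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / w), r)
             for r, w in zip(resources, weights) if w > 0]
    keyed.sort(reverse=True)
    return [r for _, r in keyed[:k]]


def pagerank_with_priors(edges, priors, beta=0.3, iterations=50):
    """Power iteration: rank = (1 - beta) * propagated rank + beta * prior."""
    nodes = set(priors)
    for u, v in edges:
        nodes.update((u, v))
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            if out[n]:
                share = rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for m in nodes:
                    nxt[m] += rank[n] * prior[m]
        rank = {n: (1 - beta) * nxt[n] + beta * prior[n] for n in nodes}
    return rank


# Hypothetical dataset -> entity -> category graph (stands in for step 2).
EDGES = [
    ("dataset", "db:Stem_cell"),
    ("db:Stem_cell", "cat:Cell_biology"),
    ("db:Stem_cell", "cat:Biology"),
    ("cat:Cell_biology", "cat:Biology"),
]
ranks = pagerank_with_priors(EDGES, priors={"dataset": 1.0})
topics = sorted((n for n in ranks if n.startswith("cat:")), key=ranks.get, reverse=True)
print(topics)
print(weighted_sample(["r1", "r2", "r3"], [5.0, 1.0, 1.0], k=2))
```

Placing all prior mass on the dataset node biases the ranking towards topics reachable from that dataset, which is the intuition behind ranking "with priors".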
Search & exploration of datasets through topic profiles
Applied to entire LOD cloud/graph
Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)
Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf-idf on entire datasets)
http://data-observatory.org/lod-profiles/
Search: entity retrieval on large structured datasets?
Challenges
How to efficiently retrieve related entities/resources for a given query?
Explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods (e.g. BM25F, Blanco et al., ISWC2011)
Query type affinity?
[Figure: query "entities related to <James D. Watson>" against a large dataset/crawl, e.g. LinkedUp dataset graph, LIVIVO dataset, BTC2014]
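For orientation, the BM25F baseline mentioned above can be sketched as follows. This is a minimal textbook-style version, assuming tokenised entity descriptions with two fields; the field names, field weights, the parameters `k1` and `b`, and the example entities are illustrative assumptions rather than the configuration used by Blanco et al.

```python
import math


def bm25f_score(query_terms, doc, field_weights, avg_len, doc_freq, n_docs,
                k1=1.2, b=0.75):
    """Score one entity description; doc maps field name -> list of tokens."""
    score = 0.0
    for term in query_terms:
        # Field-weighted, length-normalised pseudo term frequency.
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = doc.get(field, [])
            norm = 1.0 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tokens.count(term) / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturation is applied once, after the fields are combined.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical entity descriptions.
DOCS = {
    "watson": {"label": ["james", "watson"], "description": ["american", "biologist"]},
    "crick": {"label": ["francis", "crick"], "description": ["worked", "with", "watson"]},
}
WEIGHTS = {"label": 3.0, "description": 1.0}
AVG_LEN = {"label": 2.0, "description": 2.5}
DOC_FREQ = {"watson": 2}

scores = {name: bm25f_score(["watson"], doc, WEIGHTS, AVG_LEN, DOC_FREQ, n_docs=2)
          for name, doc in DOCS.items()}
print(scores)
```

Because matches are combined across fields before saturation, a hit in a highly weighted field such as the label counts for more than the same hit in a long description.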
Entity retrieval: approach
(I) Offline processing (clustering to address link sparsity)
1. Feature vectors (lexical and structural features)
2. Bucketing: per type (LSH algorithm)
3. Clustering: X-means & spectral clustering per bucket
(II) Online processing (retrieval)
1. Retrieval & expansion: a) BM25F results b) expansion from clusters (related entities)
2. Re-ranking (context terms & query type affinity)
Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
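The online stage can be sketched as below, assuming the offline stage has already assigned each entity to a cluster. The discount for expanded entities and the type boost are hypothetical stand-ins for the paper's re-ranking by context terms and query type affinity; the entity names, cluster ids and scores are made up for illustration.

```python
def expand_and_rerank(bm25f_results, clusters, entity_types, query_type,
                      expansion_discount=0.7, type_boost=0.5):
    """bm25f_results: {entity: score}; clusters: {entity: cluster id}."""
    members = {}
    for entity, cluster_id in clusters.items():
        members.setdefault(cluster_id, []).append(entity)

    scores = dict(bm25f_results)
    # Expansion: add cluster neighbours of each retrieved entity at a discount.
    for entity, score in bm25f_results.items():
        for other in members.get(clusters.get(entity), []):
            if other not in bm25f_results:
                scores[other] = max(scores.get(other, 0.0), score * expansion_discount)

    # Re-ranking: boost entities whose type matches the type asked for in the query.
    for entity in scores:
        if entity_types.get(entity) == query_type:
            scores[entity] *= 1.0 + type_boost
    return sorted(scores, key=lambda e: (-scores[e], e))


# Hypothetical inputs.
BM25F = {"James_D_Watson": 2.0, "Watson_computer": 1.8}
CLUSTERS = {"James_D_Watson": 1, "Francis_Crick": 1, "Watson_computer": 2}
TYPES = {"James_D_Watson": "Person", "Francis_Crick": "Person",
         "Watson_computer": "Software"}

ranking = expand_and_rerank(BM25F, CLUSTERS, TYPES, query_type="Person")
print(ranking)
```

The point of the expansion step is that an entity with no lexical match and no explicit link to the query result can still be retrieved through cluster membership.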
Entity retrieval: evaluation
Dataset
BTC2014 (1.4 billion triples)
92 SemSearch queries
Methods
Our approaches: XM: X-means, SP: Spectral
Baselines: B: BM25F, S1: Tonon et al. [SIGIR12]
Conclusions
XM & SP outperform baselines
Clustering to remedy link sparsity
Relevance to query crucial
Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
Outcomes & impact ?
Tangible outcomes / impact
Open Datasets
Applications
Vocabularies & Schemas
Initiatives & Working Groups
VOL
+ vocabularies for educational resource & service modeling
W3C Community Group „Open Linked Education“
DCMI Task Force on LRMI
W3C Schema Bib Extend Group
Tutorial & workshop series on Linked Data & Learning
LinkedUniversities, LinkedEducation.org
KEYSTONE WG „Search and Profiling of LD“
….
http://linkeduniversities.org
beyond LD: embedded semantics
The Web as a knowledge base: semantics on the Web?
The Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
vs
Linked Data: approx. 1000 datasets & 100 billion statements, i.e. a different order of magnitude with respect to scale & dynamics
Other "semantics" (structured facts) on the Web?
Embedded semantics: Web page markup & schema.org
Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
“Web Data Commons” (Meusel & Paulheim [ISWC2014])
• Markup from Common Crawl (2.2 billion pages): 17 billion RDF quads
• Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
Same order of magnitude as “the Web”
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
...
</div>
RDF statements
node1 actor _node-x
node1 actor Robin Wright
node1 genre Comedy
node2 actor T. Hanks
node2 distributed by Paramount Pic.
node3 actor Tom Cruise
node3 distributed by Paramount Pic.
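How Microdata markup turns into statements like the ones above can be sketched with Python's standard `html.parser`. This is a deliberately simplified extractor: it assumes flat items whose `itemprop` elements contain only text, and it ignores nested items, `href`/`content` attribute values and RDFa or Microformats. The `node1`-style subject labels mirror the slide and are not produced by any particular extraction tool.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (subject, predicate, object) statements from flat Microdata items."""

    def __init__(self):
        super().__init__()
        self.statements = []
        self._item = None    # label of the current item, e.g. "node1"
        self._prop = None    # itemprop currently being read
        self._buffer = []
        self._count = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self._count += 1
            self._item = f"node{self._count}"
            if "itemtype" in attributes:
                self.statements.append((self._item, "type", attributes["itemtype"]))
        if "itemprop" in attributes and self._item is not None:
            self._prop = attributes["itemprop"]
            self._buffer = []

    def handle_data(self, data):
        if self._prop is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing tag ends the current property.
        if self._prop is not None:
            value = "".join(self._buffer).strip()
            self.statements.append((self._item, self._prop, value))
            self._prop = None


def extract_statements(html):
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.statements


PAGE = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""
for statement in extract_statements(PAGE):
    print(statement)
```

Running the same extractor over many pages yields one unlinked node per page, which is where the coreference and redundancy issues of markup data come from.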
Using markup as global knowledge base/graph?
Using markup as global knowledge base: state of the art
Glimmer (http://glimmer.research.yahoo.com): entity retrieval (BM25F) on WDC dataset [Blanco, Mika & Vigna, ISWC2011]
Goal: obtaining entity summary (or entity-centric knowledge graph) for a given query?
Tasks: document annotation, knowledge base augmentation, semantic enrichments
Challenges: specific characteristics of markup data
Characteristic: Example
Coreferences: 18,000 results for <"Iphone 6", type, s:Product> (8.6 quads on average)
Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in WDC2013
Lack of links: largely unlinked entity descriptions / subgraphs
Errors (typos & schema violations, see Meusel et al. [ESWC2015]):
  Wrong namespaces, such as http://schma.org
  Undefined types & predicates: 9.7% in WDC, less common than in LOD
  Confusion of datatype and object properties: <s1, s:publisher, "Springer">, 24.35% object property issues vs 8% in LOD
  Data property range violations: e.g. literals vs numbers (12.6% in WDC vs 4.6% in LOD)
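One of the error classes above, misspelled namespaces such as http://schma.org, can be repaired with simple string similarity. The sketch below uses `difflib` from the standard library; the list of known hosts and the similarity cutoff are assumptions, and this is not the set of heuristics proposed by Meusel and Paulheim.

```python
import difflib
from urllib.parse import urlsplit

KNOWN_HOSTS = ["schema.org"]  # assumption: vocabularies we expect to see


def fix_namespace(uri, known_hosts=KNOWN_HOSTS, cutoff=0.8):
    """Replace a misspelled vocabulary host by its closest known host, if any."""
    parts = urlsplit(uri)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    match = difflib.get_close_matches(host, known_hosts, n=1, cutoff=cutoff)
    if not match:
        return uri  # nothing similar enough: leave the URI untouched
    return parts._replace(netloc=match[0]).geturl()


print(fix_namespace("http://schma.org/Movie"))
print(fix_namespace("http://example.com/Movie"))
```

A conservative cutoff matters here: an overly tolerant match would silently rewrite legitimate third-party vocabularies into schema.org terms.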
Ongoing work: entity summarisation from markup data
Extract (domain-specific) knowledge bases and knowledge graphs for digital libraries
Sources: Web crawls, WDC or large (domain-specific) crawls, e.g. publishers, universities, libraries etc.
Experiments on WDC data: 87.6% MAP; coverage: on average 57% additional facts compared to DBpedia
Query
Nucleic Acids, type:(Article)
Candidate Facts
node1 name Molecular structure of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
node2 name Francis Crick
node2 name Cricks
Entity Summary/Graph
name Molecular structure of nucleic acids
author James D. Watson, Francis Crick
publisher Nature
datePublished 1953
Entity summarisation: steps
1. Retrieval (from Web page markup: Web crawls, WDC or large domain-specific crawls)
2. Fact selection (clustering, heuristics, trained classifier)
New Queries
James D. Watson, type:(Person)
Francis Crick, type:(Person)
Nature, type:(Organization)
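The fact selection step can be illustrated with the simplest of the listed options, a frequency heuristic. This is a sketch under stated assumptions: the observation counts are invented, the choice of single-valued predicates and the `min_support` threshold are hypothetical, and the ongoing work described here also considers clustering and trained classifiers.

```python
from collections import Counter, defaultdict

# Hypothetical observations of (predicate, value) across crawled pages.
CANDIDATE_FACTS = (
    [("name", "Molecular structure of nucleic acids")] * 3
    + [("author", "James D. Watson")] * 3
    + [("author", "Francis Crick")] * 2
    + [("author", "Cricks")]
    + [("publisher", "Nature")] * 2
    + [("datePublished", "1953")] * 3
    + [("datePublished", "1956")]
)


def select_facts(candidate_facts,
                 single_valued=("name", "publisher", "datePublished"),
                 min_support=2):
    """Keep the majority value for single-valued predicates, supported values otherwise."""
    votes = defaultdict(Counter)
    for predicate, value in candidate_facts:
        votes[predicate][value] += 1
    summary = {}
    for predicate, counter in votes.items():
        if predicate in single_valued:
            summary[predicate] = [counter.most_common(1)[0][0]]
        else:
            summary[predicate] = sorted(
                value for value, count in counter.items() if count >= min_support)
    return summary


summary = select_facts(CANDIDATE_FACTS)
for predicate, values in summary.items():
    print(predicate, values)
```

The selected author and publisher values then serve as new queries, so the summary can grow into an entity-centric graph.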
Embedded data for digital libraries / life sciences?
Unprecedented source of bibliographic data
Metadata about scholarly articles (s:ScholarlyArticle): 6,793,764 quads, 1,184,623 entities, 429 distinct predicates (in WDC / 1 type alone)
Top domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org
[Figure: # entities and # statements per PLD (ranked), count on log scale]
Domains, topics, disciplines?
Life Sciences and Computer Science predominant
Top-10 article titles
Most important publishers/journals, libraries represented
=> Domain-specific & targeted crawls = unprecedented source of data
Future work: improving entity-centric tasks for digital libraries
• Web data as knowledge resource
• Background knowledge/structured data
• Training data & ground truths
• ....
Sources: embedded data, unstructured (Web) documents, Linked Data, knowledge graphs and LD (Yago, Freebase, Pubmed, DBLP etc.)
Entity
node1 name Molecular structure of nucleic acids
node1 author James D. Watson
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
Entity
node2 name Francis Crick
node2 name Cricks
node2 born 1916
Improving data-centric tasks for large (bibliographic/life sciences) corpora, e.g. LIVIVO
• KB construction & augmentation
• Document annotation
• Entity recognition, disambiguation, interlinking
• Search & retrieval ...
Acknowledgements: team
Besnik Fetahu (L3S)
Ivana Marenzi (L3S)
Ujwal Gadiraju (L3S)
Eelco Herder (L3S)
Ran Yu (L3S)
Ricardo Kawase (L3S)
Pracheta Sahoo (L3S, IIT India)
Bernardo Pereira Nunes (L3S, PUC Rio)
+ external collaborators
References (presented work)
Dietze, S., Taibi, D., D’Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Semantic Web Journal, 2016.
Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.
Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S. Human beyond the Machine: Challenges and Opportunities of Microtask Crowdsourcing. In: IEEE Intelligent Systems, Volume 30 Issue 4 – Jul/Aug 2015.
Gadiraju, U., Kawase, R., Dietze, S, Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI Conference on Human Factors in Computing Systems (CHI2015), April 18-23, Seoul, Korea.
Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Nunes, B. P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, in: The Semantic Web: Semantics and Big Data, Proceedings of the 10th Extended Semantic Web Conference (ESWC2013), Lecture Notes in Computer Science Vol. 7882, Springer Berlin Heidelberg, 2013.
http://www.stefandietze.net
Selected related work
Entity retrieval
Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012), Portland, Oregon, USA, August 2012.
Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (ISWC) 2011, pages 83-97.
Embedded markups & Web Data Commons
Robert Meusel, Petar Petrovski, Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. Proceedings of the 13th International Semantic Web Conference (ISWC 2014), RBDS Track, Trentino, Italy, October 2014.
Robert Meusel and Heiko Paulheim: Heuristics for Fixing Common Errors in Deployed schema.org Microdata. Proceedings of the 12th Extended Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, May 2015
Linked Data quality
Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, SPARQL Web-Querying Infrastructure: Ready for Action?, International Semantic Web Conference 2013 (ISWC2013).
Paulheim H., Bizer, C., Type Inference on Noisy RDF Data, Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525
Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker., S., An empirical survey of Linked Data conformance. Journal of Web Semantics 14, 2012