Semantic Linking & Retrieval for Digital Libraries

Dr. Stefan Dietze

11.02.2016

Institut für Informatik/Universität Bonn

Overview: research/application context

Information (types)

Bibliographic (meta)data

Research information

Educational (meta)data

Web & social data

Stakeholders

Archival organisations

Digital libraries

Publishers

Resource providers/consumers

Domains

Life Sciences

Computer Science

Learning Analytics

...

Data-centric tasks

Publishing, preservation, annotation, crawling, search, retrieval ...

Overview: contents

Introduction & motivation

Publishing, linking and profiling

Publishing & linking (bibliographic) data

Dataset profiling & linking

Retrieval & search

Entity retrieval in large graphs

Embedded (bibliographic) Web data

Entity summarisation from Web markup

Outlook and future directions

knowledge graphs and linked data

beyond LD: embedded semantics

[ESWC13, ESWC14]

[ISWC15]

[WebSci13, SWJ15]

[ongoing]

Linked Data diversity: example library & scholarly data

Linked Data: W3C standards & de-facto standard for sharing data on the Web (roughly 1000 datasets, 100 bn triples), adopted in particular by the library/GLAM sector & life sciences

Strong focus on established knowledge graphs, e.g. Yago, DBpedia, Freebase (still)

Vocabularies/Schemas

BIBO, Bibliographic Ontology

BIRO, Bibliographic Reference Ontology

CITO, Citation Typing Ontology

SPAR vocabularies (incl. CITO, BIRO)

SWRC (Semantic Web Dogfood)

Functional Requirements for Bibliographic Records (FRBR)

Nature Publishing Group Ontology

mEducator Educational Resources

....

Datasets

EUROPEANA

British Library

German, French and Spanish national libraries

Nature Publishing Group

Hochschulbibliothekszentrum NRW

Elsevier Scholarly Publications

TED Talks

mEducator Linked Educational Resources

Open Courseware Consortium

LAK Dataset

...

Initiatives

W3C Library Linked Data Incubator Group

Linked Library Data group on DataHub

LinkedUniversities.org

LinkedEducation.org

W3C Linked Open Education Community Group

...

Challenge: efficient search for suitable resources & datasets

"Quality": currency, dynamics, accessibility [Buil-Aranda2013], correctness [Paulheim2013], schema compliance [Hogan2012]

Domains/topics: which datasets/resources address topic XY (e.g. "microbiology")?

Types: statistical data, bibliographic resources, AV resources, scholarly publications?

Links: related datasets?

Data publishing, linking and profiling: LinkedUp

Dataset Catalog/Registry: http://data.linkededucation.org/linkedup/catalog/

LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)

LinkedUp Catalog: largest collection of LD/Open Data for educationally relevant resources (approx. 50 datasets)

Original datasets published with key content providers, with automatically extracted metadata

Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.

Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.

mEducator: medical educational resources

EC-funded eContentPlus project (2009-2012)

Exploratory search through semantic and clustering techniques

Lifting/enriching/clustering medical metadata

Common vocabularies (MeSH, SNOMED, BioPortal etc.)

mEducator dataset: first Linked Data corpus of enriched OER metadata, used by a number of applications

LAK Dataset: facilitating scientometrics

Concept               Type                  #
Reference             npg:Citation          7885
Author                foaf:Person           1214
Conference Paper      swrc:InProceedings    652
Organization          foaf:Organization     365
Journal Paper         bibo:Article          45
Proceedings Volume    swrc:Proceedings      15
Journal Volume        bibo:Journal          9

Linked Data corpus of "Learning Analytics" publications of the last 5 years (~800 publications)

Metadata, full-text & automated linking (DBLP, SWDF, DBpedia)

Wide adoption (http://lak.linkededucation.org)

1. Data extraction & vocabulary definition

2. Entity co-reference resolution & linking

3. Applications & analysis

Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Dietze, S., Taibi, D., D’Aquin, M., Semantic Web Journal, 2015.

LinkedUp Catalog: dataset index & registry, federated search

"Federated queries" through schema mappings

Dataset accessibility

Linking & topic profiling

Schema/Types

Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Schema analysis & mapping

[Figure: how do types such as po:Programme, yov:Video and bibo:Book relate?]

Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types)

[Figure: type co-occurrence graph after mapping, with labelled types bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video]
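A minimal sketch of how schema mappings can support both the co-occurrence analysis and "federated queries" over heterogeneous datasets. The mapping table, the dataset names and the function names are hypothetical and chosen only for illustration; the actual LinkedUp mappings are not given on the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from dataset-specific types to a shared target type.
TYPE_MAPPING = {
    "po:Programme": "bibo:Film",
    "yov:Video": "bibo:Film",
    "bibo:Book": "bibo:Document",
    "foaf:Document": "bibo:Document",
}


def map_types(types, mapping):
    """Replace each type by its mapped target; unmapped types are kept."""
    return {mapping.get(t, t) for t in types}


def type_cooccurrence(datasets, mapping=None):
    """Count in how many datasets each pair of types occurs together."""
    counts = Counter()
    for types in datasets.values():
        current = map_types(types, mapping) if mapping else set(types)
        for pair in combinations(sorted(current), 2):
            counts[pair] += 1
    return counts


def expand_query_type(target, mapping):
    """Source types a query for `target` has to cover in a federated setting."""
    return sorted({src for src, tgt in mapping.items() if tgt == target} | {target})


# Toy input: two datasets that describe videos with different vocabularies.
datasets = {
    "dataset_a": {"yov:Video", "foaf:Person"},
    "dataset_b": {"po:Programme", "foaf:Person"},
}
before = type_cooccurrence(datasets)
after = type_cooccurrence(datasets, TYPE_MAPPING)
```

Before mapping, the two datasets share no type pair; after mapping, both contribute to the same pair (bibo:Film, foaf:Person), which is the effect the co-occurrence comparison on the slides illustrates.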

Data(set) interlinking

Linking entities/datasets through a combination of (i) a "semantic (graph-based) connectivity score (SCS)" (based on Katz centrality) and (ii) a "co-occurrence-based measure (CBM)" (similar to Normalised Google Distance)

Evaluation: outperforming Explicit Semantic Analysis (ESA)

Yovisto Video (yov:Video)

<yo:Video …>
<dc:title>Lecture 29 – Stem Cells</dc:title>
</yo:Video…>

British Library Book (bibo:Book)

<bibo:Book …>
<bibo:title>Über den Hungertyphus</.>
<bibo:creator>Rudolf Virchov</…>
</bibo:Book…>

DBpedia entities: db:Medicine, db:Rudolf Virchow, db:Cell Biology

SCS = 0.32

CBM = 0.24

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M. A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
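A minimal sketch of the two families of measures named on this slide: a Katz-style connectivity score that sums damped walk counts between two entities, and the standard Normalised Google Distance on co-occurrence counts. The function names, the damping factor beta, the walk-length limit and the toy graph are assumptions for illustration; the exact SCS and CBM definitions are in the cited ESWC 2013 paper and may differ from this sketch.

```python
import math


def katz_score(adj, source, target, beta=0.5, max_len=3):
    """Sum over lengths 1..max_len of beta**length * (#walks source -> target)."""
    walks = {source: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, count in walks.items():
            for neighbour in adj.get(node, ()):
                nxt[neighbour] = nxt.get(neighbour, 0) + count
        walks = nxt
        score += beta ** length * walks.get(target, 0)
    return score


def normalised_google_distance(fx, fy, fxy, n):
    """NGD from occurrence counts fx, fy, co-occurrence count fxy, corpus size n."""
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Toy undirected graph over the entities shown on the slide (edges are invented).
adj = {
    "Cell_biology": ["Biology", "Medicine"],
    "Medicine": ["Rudolf_Virchow", "Cell_biology"],
    "Biology": ["Cell_biology"],
    "Rudolf_Virchow": ["Medicine"],
}
scs_like = katz_score(adj, "Cell_biology", "Rudolf_Virchow", beta=0.5, max_len=2)
```

In the toy graph the only connection of length at most two runs via Medicine, so the score is beta squared. A lower NGD means stronger co-occurrence, so a combined linking score would need to invert or rescale it before mixing it with the connectivity score.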

Scalable profiling of datasets

Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets

Technically trivial, but scalability issues: LOD Cloud 1000+ datasets with <100 billion RDF statements

Efficient approach: sampling & ranking for balance between scalability and precision/recall

[Figure: Yovisto video (yov:Video, "Lecture 29 – Stem Cells") from the dataset catalog/registry, with labels db:Cell (Biology), db:Cell (Microprocessor), db:Biology, db:Cell biology]

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

Efficient dataset profiling

1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)

2. Entity- & topic-extraction (NER via DBpedia Spotlight, category mapping & -expansion)

3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)

Result: weighted dataset-topic profile graph
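A minimal sketch of steps 1 and 3 above: weighted sampling of resources followed by a PageRank-with-priors ranking of the topics they link to. Step 2 (entity and category extraction) is assumed to have produced the entity-to-category edges already. All names, weights and the toy graph are hypothetical; the cited paper also evaluates HITS with Priors and K-Step Markov, which are not shown here.

```python
import random


def weighted_sample(resources, weights, k, seed=0):
    """Sample k resources without replacement, proportionally to their weight."""
    rng = random.Random(seed)
    # Efraimidis-Spirakis keys: the k largest u**(1/w) form the sample.
    keyed = sorted(resources, key=lambda r: rng.random() ** (1.0 / weights[r]),
                   reverse=True)
    return keyed[:k]


def pagerank_with_priors(edges, priors, damping=0.85, iterations=50):
    """PageRank whose teleport distribution is given by `priors`."""
    nodes = set(priors)
    for src, dst in edges:
        nodes.update((src, dst))
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * prior[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    nxt[dst] += share
            else:
                # Dangling node: hand its mass back according to the priors.
                for m in nodes:
                    nxt[m] += damping * rank[n] * prior[m]
        rank = nxt
    return rank


# Toy profile graph: sampled entities point to the categories they were mapped to.
weights = {"e1": 3.0, "e2": 2.0, "e3": 1.0, "e4": 1.0}
sample = weighted_sample(list(weights), weights, k=3)
edges = [("e1", "Cell_biology"), ("e2", "Cell_biology"), ("e3", "Medicine")]
rank = pagerank_with_priors(edges, priors={"e1": 1.0, "e2": 1.0, "e3": 1.0})
```

The resulting scores can serve as edge weights between a dataset and its topics. In this toy graph Cell_biology ends up above Medicine because two of the three prior entities point to it.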

Search & exploration of datasets through topic profiles

Applied to entire LOD cloud/graph

Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)

Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets)

http://data-observatory.org/lod-profiles/

Search: entity retrieval on large structured datasets?

Challenges

How to efficiently retrieve related entities/resources for a given query?

Explicit entity links (owl:sameAs etc.) are sparse yet important to facilitate state-of-the-art methods (e.g. BM25F, Blanco et al., ISWC2011)

Query type affinity?

Large dataset/crawl, e.g. LinkedUp dataset graph, LIVIVO dataset, BTC2014

Entities related to <James D. Watson>?

Entity retrieval: approach

(I) Offline processing (clustering to address link sparsity)

1. Feature vectors (lexical and structural features)

2. Bucketing: per type (LSH algorithm)

3. Clustering: X-means & Spectral clustering per bucket

Improving Entity Retrieval on Structured Data, Fetahu, B., Gadiraju, U., Dietze, S., 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).

(II) Online processing (retrieval)

1. Retrieval & expansion: a) BM25F results b) expansion from clusters (related entities)

2. Re-Ranking (context terms & query type affinity)
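A minimal sketch of the online stage (II) under simplifying assumptions: the baseline scores stand in for BM25F results, the cluster assignments are assumed to come from the offline stage (I), and the re-ranking formula (score times context-term overlap times type affinity) is invented for illustration. None of the names or weights below are taken from the cited paper.

```python
def expand_and_rerank(base_results, cluster_of, members, entity_terms,
                      entity_type, query_terms, type_affinity,
                      expansion_discount=0.5):
    """Expand baseline results with cluster mates, then re-rank all candidates."""
    # 1b. Expansion: cluster mates inherit a discounted score from the result
    #     that introduced them (the best such score wins).
    expanded = {}
    for entity, score in base_results.items():
        for mate in members.get(cluster_of.get(entity), ()):
            if mate not in base_results:
                expanded[mate] = max(expanded.get(mate, 0.0),
                                     expansion_discount * score)
    candidates = {**expanded, **base_results}

    # 2. Re-ranking: boost by context-term overlap and query type affinity.
    query = set(query_terms)

    def final_score(entity):
        terms = entity_terms.get(entity, set())
        overlap = len(query & terms) / len(query) if query else 0.0
        affinity = type_affinity.get(entity_type.get(entity), 0.0)
        return candidates[entity] * (1.0 + overlap) * (1.0 + affinity)

    return sorted(candidates, key=final_score, reverse=True)


# Toy data: one baseline hit; its cluster contributes a related entity.
ranking = expand_and_rerank(
    base_results={"James_D_Watson": 2.0},
    cluster_of={"James_D_Watson": 0, "Francis_Crick": 0, "Watson_computer": 1},
    members={0: ["James_D_Watson", "Francis_Crick"], 1: ["Watson_computer"]},
    entity_terms={"James_D_Watson": {"james", "watson", "dna"},
                  "Francis_Crick": {"francis", "crick", "dna"}},
    entity_type={"James_D_Watson": "Person", "Francis_Crick": "Person"},
    query_terms=["watson", "dna"],
    type_affinity={"Person": 1.0},
)
```

The point of the expansion step is that Francis_Crick is returned although no explicit link connects it to the baseline hit; only the shared cluster does, which is how clustering compensates for link sparsity.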

Entity retrieval: evaluation

Dataset

BTC2014 (1.4 billion triples)

92 SemSearch queries

Methods

Our approaches: XM: X-means, SP: Spectral

Baselines: B: BM25F, S1: Tonon et al. [SIGIR12]

Conclusions

XM & SP outperform baselines

Clustering to remedy link sparsity

Relevance to query crucial

Outcomes & impact ?

Tangible outcomes / impact

Open Datasets

Applications

Vocabularies & Schemas

Initiatives & Working Groups

+ vocabularies for educational resource & service modeling

W3C Community Group "Open Linked Education"

DCMI Task Force on LRMI

W3C Schema Bib Extend Group

Tutorial & workshop series on Linked Data & Learning

LinkedUniversities, LinkedEducation.org

KEYSTONE WG "Search and Profiling of LD"

...

http://linkeduniversities.org

The Web as a knowledge base: semantics on the Web?

The Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google

vs

Linked Data: approx. 1000 datasets & 100 billion statements, a different order of magnitude with respect to scale & dynamics

Other "semantics" (structured facts) on the Web?

Embedded semantics: Web page markup & schema.org

Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)

Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)

Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)

"Web Data Commons" (Meusel et al. [ISWC2014])

• Markup from Common Crawl (2.2 billion pages): 17 billion RDF quads

• Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)

Same order of magnitude as "the Web"

<div itemscope itemtype="http://schema.org/Movie">

<h1 itemprop="name">Forrest Gump</h1>

<span>Actor: <span itemprop="actor">Tom Hanks</span></span>

<span itemprop="genre">Drama</span>

...

</div>

RDF statements

node1  actor           _node-x
node1  actor           Robin Wright
node1  genre           Comedy
node2  actor           T. Hanks
node2  distributed by  Paramount Pic.
node3  actor           Tom Cruise
node3  distributed by  Paramount Pic.
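A minimal sketch of how flat Microdata such as the Forrest Gump snippet could be turned into property/value statements, using only Python's standard html.parser module. The class name MicrodataExtractor and the function extract_statements are hypothetical. The sketch assumes well-formed markup and ignores nested itemscope elements and attribute-valued properties (e.g. meta content), which a real extractor would have to handle.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collects (itemprop, text) pairs from a single, flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.statements = []
        self._open = []  # stack of [tag, itemprop or None, collected text parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        self._open.append([tag, attrs.get("itemprop"), []])

    def handle_data(self, data):
        # Text belongs to every enclosing element that carries an itemprop.
        for frame in self._open:
            if frame[1]:
                frame[2].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        _, prop, parts = self._open.pop()
        if prop:
            self.statements.append((prop, "".join(parts).strip()))


def extract_statements(html):
    """Return (item type, list of (property, value)) for the given markup."""
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.item_type, parser.statements


SNIPPET = """
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">Forrest Gump</h1>
<span>Actor: <span itemprop="actor">Tom Hanks</span></span>
<span itemprop="genre">Drama</span>
</div>
"""
item_type, statements = extract_statements(SNIPPET)
```

Each pair can then be emitted as a statement about a fresh node (node1, node2, ...) per page, which is why the same real-world entity ends up under many unlinked node identifiers.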

Using markup as global knowledge base/graph?

Goal: obtaining entity summary (or entity-centric knowledge graph) for a given query?

Tasks: document annotation, knowledge base augmentation, semantic enrichments

Using markup as global knowledge base - state of the art

Glimmer (http://glimmer.research.yahoo.com): entity retrieval (BM25F) on WDC dataset [Blanco, Mika & Vigna, ISWC2011]

Challenges: specific characteristics of markup data

Coreferences: 18,000 results for <"Iphone 6", type, s:Product> (8.6 quads on average)

Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in WDC2013

Lack of links: largely unlinked entity descriptions / subgraphs

Errors (typos & schema violations, see Meusel et al. [ESWC2015]):

• Wrong namespaces, such as http://schma.org

• Undefined types & predicates: 9.7% in WDC, less common than in LOD

• Confusion of datatype and object properties: <s1, s:publisher, "Springer">, 24.35% object property issues vs 8% in LOD

• Data property range violations: e.g. literals vs numbers (12.6% in WDC vs 4.6% in LOD)
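A minimal sketch of one possible repair for the "wrong namespace" error class, using fuzzy string matching from Python's standard difflib module. This is an illustrative heuristic, not the method of the cited ESWC 2015 paper; the function name fix_namespace, the list of known hosts and the similarity cutoff are assumptions.

```python
import difflib
from urllib.parse import urlparse

# Hypothetical whitelist of vocabulary hosts considered valid.
KNOWN_HOSTS = ["schema.org"]


def fix_namespace(uri, known_hosts=KNOWN_HOSTS, cutoff=0.8):
    """Return the URI with a corrected host, or None if no known host is close."""
    parts = urlparse(uri)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in known_hosts:
        return parts._replace(netloc=host).geturl()
    # Closest known host by difflib similarity ratio, if it is similar enough.
    match = difflib.get_close_matches(host, known_hosts, n=1, cutoff=cutoff)
    if not match:
        return None
    return parts._replace(netloc=match[0]).geturl()


repaired = fix_namespace("http://schma.org/Movie")
unknown = fix_namespace("http://example.com/Movie")
```

Returning None rather than guessing keeps unrelated vocabularies untouched. A conservative cutoff matters here, because a wrong correction would silently turn a foreign term into a schema.org term.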

Ongoing work: entity summarisation from markup data

Web page markup from Web crawls, WDC or large (domain-specific) crawls: e.g. publishers, universities, libraries etc.

Query

Nucleic Acids, type:(Article)

1. Retrieval

Candidate Facts

node1  name           Molecular structure of nucleic acids
node1  author         James D. Watson
node1  publisher      Nature
node1  datePublished  1956
node1  datePublished  1953
node2  name           Francis Crick
node2  name           Cricks

2. Fact selection (clustering, heuristics, trained classifier)

Entity Summary/Graph

name           Molecular structure of nucleic acids
author         James D. Watson, Francis Crick
publisher      Nature
datePublished  1953

New Queries

James D. Watson, type:(Person)

Francis Crick, type:(Person)

Nature, type:(Organization)

Extract (domain-specific) knowledge bases and knowledge graphs for digital libraries

Experiments on WDC data: 87.6% MAP; coverage: on average 57% additional facts compared to DBpedia
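A minimal sketch of a frequency-based heuristic for the fact selection step: candidate values are grouped per predicate, single-valued predicates keep the most frequent value, and multi-valued predicates keep every value with enough support. This is only one simple stand-in for the clustering, heuristics and trained classifier mentioned on the slide. The function name, the set of multi-valued predicates, the support threshold and the candidate counts are invented for illustration.

```python
from collections import Counter, defaultdict


def select_facts(candidates, multi_valued=frozenset({"author"}), min_support=2):
    """Pick summary values per predicate from (predicate, value) candidates."""
    votes = defaultdict(Counter)
    for predicate, value in candidates:
        votes[predicate][value.strip()] += 1

    summary = {}
    for predicate, counter in votes.items():
        if predicate in multi_valued:
            # Keep all values that were observed often enough.
            summary[predicate] = sorted(
                v for v, c in counter.items() if c >= min_support)
        else:
            # Single-valued predicate: majority vote.
            summary[predicate] = counter.most_common(1)[0][0]
    return summary


# Toy candidate facts, as if collected from several pages (counts are invented).
candidates = [
    ("datePublished", "1953"), ("datePublished", "1953"), ("datePublished", "1956"),
    ("author", "James D. Watson"), ("author", "James D. Watson"),
    ("author", "Francis Crick"), ("author", "Francis Crick"), ("author", "Cricks"),
    ("publisher", "Nature"),
]
summary = select_facts(candidates)
```

Majority voting exploits the redundancy of markup data: a conflicting date or a misspelt name that appears on a single page is outvoted by the value most pages agree on.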

Embedded data for digital libraries / life sciences?

Unprecedented source of bibliographic data

Metadata about scholarly articles (s:ScholarlyArticle): 6,793,764 quads, 1,184,623 entities, 429 distinct predicates (in WDC / 1 type alone)

Top domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org

[Figure: # entities and # statements per PLD (ranked), count on log scale]

Domains, topics, disciplines?

Life Sciences and Computer Science predominant

Top-10 article titles

Most important publishers/journals, libraries represented

=> Domain-specific & targeted crawls = unprecedented source of data

Future work: improving entity-centric tasks for digital libraries

Knowledge graphs and LD (Yago, Freebase, Pubmed, DBLP etc.)

Embedded data

Unstructured (Web) documents

Linked Data

Entity

node1  name           Molecular structure of nucleic acids
node1  author         James D. Watson
node1  publisher      Nature
node1  datePublished  1956
node1  datePublished  1953

Entity

node2  name  Francis Crick
node2  name  Cricks
node2  born  1916

• Web data as knowledge resource

• Background knowledge/structured data

• Training data & ground truths

• ...

Improving data-centric tasks for large (bibliographic/life sciences) corpora, e.g. LIVIVO

• KB construction & augmentation

• Document annotation

• Entity recognition, disambiguation, interlinking

• Search & retrieval ...

Acknowledgements: team

Besnik Fetahu (L3S)

Ivana Marenzi (L3S)

Ujwal Gadiraju (L3S)

Eelco Herder (L3S)

Ran Yu (L3S)

Ricardo Kawase (L3S)

Pracheta Sahoo (L3S, IIT India)

Bernardo Pereira Nunes (L3S, PUC Rio)

+ external collaborators

References (presented work)

Dietze, S., Taibi, D., D’Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Semantic Web Journal, 2016.

Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.

Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.

Gadiraju, U., Demartini, G., Kawase, R., Dietze, S. Human beyond the Machine: Challenges and Opportunities of Microtask Crowdsourcing. In: IEEE Intelligent Systems, Volume 30 Issue 4 – Jul/Aug 2015.

Gadiraju, U., Kawase, R., Dietze, S, Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI Conference on Human Factors in Computing Systems (CHI2015), April 18-23, Seoul, Korea.

Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).

Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Nunes, B. P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, in: The Semantic Web: Semantics and Big Data, Proceedings of the 10th Extended Semantic Web Conference (ESWC2013), Lecture Notes in Computer Science Vol. 7882, Springer Berlin Heidelberg, 2013.

http://www.stefandietze.net

Selected related work

Entity retrieval

Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012), Portland, Oregon, USA, August 2012.

Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (ISWC) 2011, pages 83-97.

Embedded markups & Web Data Commons

Robert Meusel, Petar Petrovski, Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. Proceedings of the 13th International Semantic Web Conference (ISWC 2014), RBDS Track, Trentino, Italy, October 2014.

Robert Meusel and Heiko Paulheim: Heuristics for Fixing Common Errors in Deployed schema.org Microdata. Proceedings of the 12th Extended Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, May 2015.

Linked Data quality

Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, SPARQL Web-Querying Infrastructure: Ready for Action?, International Semantic Web Conference 2013, (ISWC2013).

Paulheim, H., Bizer, C., Type Inference on Noisy RDF Data, Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.

Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., An empirical survey of Linked Data conformance. Journal of Web Semantics 14, 2012.

Thank you

• http://stefandietze.net

• http://data.l3s.de

• http://data.linkededucation.org/linkedup/catalog