Analysing Structured Scholarly Data Embedded in Web Pages. Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze. WWW 2016, April 11th, 2016, Montreal, Canada


Analysing Structured Scholarly Data Embedded in Web Pages

Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze

WWW 2016

April 11th, 2016, Montreal, Canada


OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK


INTRODUCTION (1/3)

The Web: nearly 46 trillion Web pages indexed by Google

VS

Linked Data: approx. 1000 datasets & 100 billion statements

● different order of magnitude w.r.t. scale & dynamics

Are there other semantics (structured facts) on the Web?


INTRODUCTION (2/3)

● Web pages embed structured data (microdata, microformats and RDFa)
  ○ Interpretation of web documents (search & retrieval)
● Increase in prevalence of embedded markup (2014 Google study of 12 bn pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al. [ISWC’14])
  ○ Markup from Common Crawl (2.2 bn pages)
  ○ 17 billion RDF quads
  ○ Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)


Other semantics (structured facts) on the Web!


INTRODUCTION (3/3)

Characteristics of Markup Data


MOTIVATION

● Embedded markup ⇒ sparsely linked, large % of coreferences, redundant statements

● Uptake and reuse of embedded markup is hindered by the lack of dynamics, scale

● Lack of understanding of the adoption of markup for scholarly resource metadata


WHAT WE BRING TO THE TABLE ...

● Study of scholarly data extracted from embedded annotations (Web Data Commons)

● Shape & characteristics of entity descriptions

● Level of adoption of terms & types, distributions across TLDs, PLDs, data publishers


RESEARCH QUESTIONS

RQ1 What are frequently used terms & types for scholarly data?

RQ2 How are statements about bibliographic data distributed across the web? Who are the key providers of bibliographic markup?

RQ3 What are the frequent errors that can be observed?


DATASET

● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities of type s:ScholarlyArticle or co-occurring on the same document with any s:ScholarlyArticle instance
  ○ 6,793,764 quads
  ○ 1,184,623 entities
  ○ 83 distinct classes
  ○ 429 distinct predicates
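The subset selection described on this slide can be sketched as a two-pass filter over N-Quads lines. This is only an illustration, not the authors' pipeline: the function names, the simplified regular expression, and the sample quads are assumptions, and a proper N-Quads parser would be needed for the full WDC dump.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "<http://schema.org/ScholarlyArticle>"

# Simplified N-Quads pattern: subject, predicate, object, graph (the page URL).
# Assumption: well-formed lines; this is not a complete N-Quads grammar.
QUAD_RE = re.compile(r'^(\S+)\s+<([^>]+)>\s+(.+?)\s+<([^>]+)>\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, page_url) or None if unparseable."""
    match = QUAD_RE.match(line.strip())
    return match.groups() if match else None


def scholarly_subset(lines):
    """Keep every quad found on a page that has at least one s:ScholarlyArticle."""
    quads = [q for q in map(parse_quad, lines) if q is not None]
    # Pass 1: collect pages that declare a ScholarlyArticle instance.
    pages = {g for _, p, o, g in quads if p == RDF_TYPE and o == SCHOLARLY_ARTICLE}
    # Pass 2: keep all statements co-occurring on those pages.
    return [q for q in quads if q[3] in pages]


# Made-up sample data for illustration only.
SAMPLE = [
    f'_:a <{RDF_TYPE}> <http://schema.org/ScholarlyArticle> <http://example.org/p1> .',
    '_:a <http://schema.org/name> "A title" <http://example.org/p1> .',
    f'_:b <{RDF_TYPE}> <http://schema.org/Person> <http://example.org/p1> .',
    f'_:c <{RDF_TYPE}> <http://schema.org/Person> <http://example.org/p2> .',
]

subset = scholarly_subset(SAMPLE)
```

In this toy input the s:Person on `p1` is kept because it shares a page with an article, while the s:Person on `p2` is dropped.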


DATASET - Considerations

● s:ScholarlyArticle is the only type which explicitly refers to scholarly articles
● We focus on schema.org, the most widely used schema
● Types considered ⇒ s:ScholarlyArticle, s:Person and s:Organization
  ○ 280,616 instances (s:ScholarlyArticle)
  ○ 847,417 instances (s:Person)
  ○ 3,798 instances (s:Organization)


SCHOLARLY TYPES & PREDICATES (1/2)

Cumulative dist. of predicates over instances across extracted types

[Figure; plot annotations: 1 to 14, 1 to 9, 1 to 4]
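One plausible reading of this plot is: for each instance, count its distinct predicates, then report the fraction of instances that have at most k predicates. A minimal sketch under that assumption follows; the function names and the toy triples are hypothetical and not taken from the study.

```python
from collections import Counter, defaultdict


def predicates_per_instance(triples):
    """Map each subject to its number of distinct predicates."""
    preds = defaultdict(set)
    for subject, predicate, _ in triples:
        preds[subject].add(predicate)
    return {subject: len(ps) for subject, ps in preds.items()}


def cumulative_distribution(counts):
    """Return [(k, fraction of instances with at most k distinct predicates)]."""
    histogram = Counter(counts.values())
    total = sum(histogram.values())
    running = 0
    result = []
    for k in sorted(histogram):
        running += histogram[k]
        result.append((k, running / total))
    return result


# Made-up triples for illustration only.
TRIPLES = [
    ("a", "name", "x"), ("a", "author", "y"), ("a", "datePublished", "z"),
    ("b", "name", "x"),
    ("c", "name", "x"), ("c", "author", "y"),
    ("d", "name", "x"),
]

cdf = cumulative_distribution(predicates_per_instance(TRIPLES))
```

Running the same computation once per type (s:ScholarlyArticle, s:Person, s:Organization) would give one curve per type.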


SCHOLARLY TYPES & PREDICATES (2/2)

Top-10 Predicates for s:ScholarlyArticle


DOMAINS & DOCUMENTS (1/5)

Distribution of Entities & Statements across PLDs


DOMAINS & DOCUMENTS (2/5)

Top-10 PLDs (ranked by no. of entities)
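A ranking like this can be sketched by grouping entities by the pay-level domain (PLD) of the page they were extracted from. The PLD extraction below is a deliberately naive assumption (last two host labels); it is wrong for suffixes such as `co.uk`, where a Public Suffix List lookup would be needed. The function names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def naive_pld(url):
    """Approximate the PLD as the last two labels of the hostname (assumption)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def top_plds(entity_pages, k=10):
    """Rank PLDs by number of distinct (entity, page) pairs, ties broken by name."""
    counts = Counter(naive_pld(url) for _, url in set(entity_pages))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up (entity_id, page_url) pairs for illustration only.
ENTITY_PAGES = [
    ("e1", "http://www.example.org/a"),
    ("e2", "http://papers.example.org/b"),
    ("e3", "https://journal.example.com/x"),
]

ranking = top_plds(ENTITY_PAGES)
```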


DOMAINS & DOCUMENTS (3/5)

Distribution of Entities & Statements across TLDs


DOMAINS & DOCUMENTS (4/5)

Distribution of Entities & Statements across HTML Documents


DOMAINS & DOCUMENTS (5/5)

Top-10 Documents Ranked According to Embedded Entities


TOPICS & PUBLICATION TYPES (1/4)

Distribution of Scholarly Articles across Publishers


TOPICS & PUBLICATION TYPES (2/4)

Top-10 Publishers and corresponding no. of Publications


TOPICS & PUBLICATION TYPES (3/4)

Top-10 Publication Types (genres) across WDC


TOPICS & PUBLICATION TYPES (4/4)

Top-10 Article Titles (ranked by frequency of occurrence)


FREQUENT ERRORS - Schema Violations

Top-10 Misused Predicates


CONCLUSIONS (1/2)

● First study on coverage & characteristics of bibliographic metadata embedded in web pages.

● Early adopters ⇒ publishers, libraries, other providers of bibliographic data.

● Usage of terms, types ⇒ dist. across providers, domains and topics follows a power law; few providers & documents contributing to majority of data.
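The observation that a few providers contribute the majority of the data can be checked with a simple concentration measure: the share of all statements that comes from the top fraction of providers. The sketch below is an assumed illustration with made-up counts, not figures from the study, and it does not fit or test a power law.

```python
import math


def top_share(counts, fraction=0.1):
    """Share of the total contributed by the top `fraction` of providers."""
    ordered = sorted(counts, reverse=True)
    k = max(1, math.ceil(len(ordered) * fraction))  # at least one provider
    return sum(ordered[:k]) / sum(ordered)


# Made-up statement counts per provider, for illustration only.
STATEMENTS_PER_PROVIDER = [1000, 200, 50, 20, 10, 5, 5, 4, 3, 3]

share = top_share(STATEMENTS_PER_PROVIDER, fraction=0.1)
```

A share close to 1 for a small fraction indicates a heavily skewed, long-tailed distribution across providers.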


CONCLUSIONS (2/2)

● Top-k genres & publishers indicate a bias towards French and English data providers.

● Article titles, PLDs & publishers ⇒ bias towards Computer Science and Life Sciences.

● In this study we only consider entities tagged explicitly as "scholarlyArticle"; a deeper analysis considering more types (article, book, etc.) and other creative works can shed light on the true scale and potential of embedded markup data.


FUTURE WORK

● Targeted crawl of typical providers of scholarly data (publishers, academic orgs., libraries, etc.)

● Consider implicitly typed bibliographic or creative work as scholarly data


LIMITATIONS

● Our study is limited to schema.org & the types of s:ScholarlyArticle, s:Person, s:Organization.

● We consider only explicitly linked scholarly works.