A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Robert Meusel, Christian Bizer and Heiko Paulheim

Page 1: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Robert Meusel, Christian Bizer and Heiko Paulheim

Page 2: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time - WIMS 2015 2

Motivation - LOD Cloud with 1,000 data providers

Page 3: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Motivation - schema.org Microdata with 700k data providers

Page 4: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Microdata in a Nutshell

- Adding structured information to web pages

• By marking up contents and entities

- Arbitrary vocabularies are possible

• Practically, only schema.org is deployed on a large scale

• Plus its historical predecessor: data-vocabulary.org

- Similar to RDFa

<div itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="name">Data and Web Science Group</span>
  <span itemprop="addressLocality">Mannheim</span>,
  <span itemprop="postalCode">68131</span>
  <span itemprop="addressCountry">Germany</span>
</div>

Page 5: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Schema.org in a Nutshell

- Vocabulary for marking up entities on web pages

• 675 classes and 965 properties (as of May 2015, release 2.0)

- Promoted and consumed by major search engine companies

• Google, Bing, Yahoo!, and Yandex

• Google Rich Snippets

- Community-driven evolution and development

Page 6: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Schema.org in a Nutshell – Coverage

- Schema.org has incorporated some popular vocabularies, such as:

• GoodRelations (2012)

• W3C BibExtend (2014)

• MusicBrainz vocabulary (2015)

• Automotive Ontology (2015)

Page 7: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Microdata with Schema.org in HTML Pages

<html>…<body>…
<div id="main-section" class="performance left" data-sku="M17242_580">
  <h1> Predator Instinct FG Fußballschuh </h1>
  <div>
    <meta content="EUR">
    <span data-sale-price="219.95">219,95</span>
…</body></html>

HTML pages directly embed markup languages to annotate items using different vocabularies

<html>…<body>…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
…</body></html>

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…
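
The step from annotated HTML to triples can be sketched in Python. This is a minimal illustration under stated assumptions, not the extractor that produced the corpus: the class `MicrodataSketch`, the helper `extract_triples`, the blank-node naming, and the closed version of the example markup are all hypothetical, and only element text and the `content` attribute are read as values.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the element stack.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class MicrodataSketch(HTMLParser):
    """Toy Microdata reader that collects (subject, predicate, object) triples."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.open_elements = []  # per open element: True if it opened an item
        self.items = []          # stack of (blank node, item type) for open items
        self.pending = None      # (subject, predicate) waiting for its text value
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("itemprop")
        parent = self.items[-1] if self.items else None
        opens_item = "itemscope" in attributes and tag not in VOID_TAGS
        if opens_item:
            self.counter += 1
            node = f"_:node{self.counter}"
            item_type = attributes.get("itemtype", "")
            self.triples.append((node, "rdf:type", item_type))
            if prop and parent:
                # Nested item: link it to the enclosing item.
                self.triples.append((parent[0], f"{parent[1]}/{prop}", node))
            self.items.append((node, item_type))
        elif prop and parent:
            # Predicates follow the <type>/<property> pattern used on the slide.
            predicate = f"{parent[1]}/{prop}"
            if "content" in attributes:
                self.triples.append((parent[0], predicate, attributes["content"]))
            else:
                self.pending = (parent[0], predicate)
        if tag not in VOID_TAGS:
            self.open_elements.append(opens_item)

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((*self.pending, data.strip()))
            self.pending = None

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_elements:
            return
        if self.open_elements.pop():
            self.items.pop()


def extract_triples(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


# Closed, simplified version of the product markup (an assumption for the demo).
EXAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price">219,95</span>
  </div>
</div>
"""

for triple in extract_triples(EXAMPLE):
    print(triple)
```

The sketch keeps one blank node per `itemscope`, so the product and its offer end up as two linked nodes. Language tags and URL-valued attributes such as `href` or `src` are deliberately left out.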

Page 8: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Wrap-Up

- Semantic annotations are used by more and more websites

- Entities on websites become machine-readable and machine-understandable

- schema.org together with Microdata is a success story

• Promoted by search engine companies

• Deployed by over 17% of all websites [1] (over 700k data providers)

- Usage is more compliant with the schema than, e.g., LOD [2]

[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html

[2] Meusel and Paulheim, ESWC 2015

Page 9: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Digging for Reasons

- So, Microdata is deployed more often and is often more schema-compliant, although there are millions of uncontrolled providers with different skill sets

- But why? Some hypotheses…

• Availability of documentation

• Tool support

• Business incentive

• Schema flexibility

- Can we confirm or reject those by looking at the data?

Page 10: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

A Diachronic Perspective

- Versions of schema.org are archived over time

• Plus: there are several crawl releases per year

• i.e., we can look at change over time

- If we look at both schema and deployed data, we may observe

• Adoption rates of schema changes

• Data-first changes to the schema

• Convergence or divergence of deployed data

Page 11: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

A Diachronic Perspective

- Three releases of the WDC Microdata corpus [1]

• 2012, 2013, and 2014

- Versions of schema.org that were valid

• At the beginning of the crawl

• At the end of the crawl

[1] http://webdatacommons.org/structureddata

Page 12: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Top-down Adoption

- How fast are changes in the schema adopted?

• New classes/properties

• Deprecations

• Domain/range changes

- Measuring adoption: challenges

• Different crawls

• Overall growth of deployed schema.org

- Measure: normalized usage increase (nui) from i to j:

• nui(s) > 1.05: usage of schema element s has increased significantly

• nui(s) < 0.95: usage of schema element s has decreased significantly
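
The nui formula itself is not reproduced in this transcript. One plausible reading, given the two challenges named above (different crawls, overall growth), is the share of data providers using element s in crawl j divided by the same share in crawl i. The sketch below assumes exactly that; the function names, the choice of PLD counts as input, and the example numbers are illustrative assumptions only.

```python
def nui(plds_with_s_i, total_plds_i, plds_with_s_j, total_plds_j):
    """Assumed definition: relative share of PLDs using s in crawl j over crawl i.

    Dividing by the crawl totals compensates for different crawl sizes and
    for the overall growth of schema.org deployment.
    """
    if plds_with_s_i == 0 or total_plds_j == 0:
        raise ValueError("undefined without usage in crawl i or PLDs in crawl j")
    # (a_j / n_j) / (a_i / n_i), rearranged to a single division.
    return (plds_with_s_j * total_plds_i) / (plds_with_s_i * total_plds_j)


def classify(value, upper=1.05, lower=0.95):
    """Apply the two thresholds from the slide."""
    if value > upper:
        return "increased"
    if value < lower:
        return "decreased"
    return "stable"


# Made-up counts: 100 of 1,000 PLDs in crawl i, 300 of 2,000 PLDs in crawl j.
value = nui(100, 1000, 300, 2000)
print(value, classify(value))
```

Under this reading, an element can gain absolute users and still show a nui below 0.95 when the overall deployment grows faster than its own usage.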

Page 13: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Top-down Adoption

- Adoption of new classes and properties

• Almost half of all introduced classes are never used!

• Similar for new properties

- Reasons

• Bulk-addition of vocabularies

  • not every term is equally needed

  • e.g., medical vocabulary

• Blind spot of our approach

  • some terms are mainly for e-mail markup

  • e.g., Actions

SURPRISE!

Page 14: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Top-down Adoption

- Main domains of positive adoption

• Metadata for web content (schema.org/WebSite has the highest nui)

• Broadcasting (e.g., TV episodes)

• Questions & Answers

• Postal addresses

- Classes featured in Google Rich Snippets

• Still growing at a high level (tens of thousands of data providers)

• But nui(s) < 0.95

Yellow Pages, Search Engine Listings

Collaboration with BBC and EBU

Influence of CMS adoption

Q&A pages, such as Stack Overflow

Page 15: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Top-down Adoption

- Adoption of domain/range changes

• Again: rather low overall adoption

- Adopted well for

• Products (height, width, itemCondition, …)

• Broadcasting domain (episode, actor, …)

Search Engine Listings

Collaboration with BBC and EBU

Page 16: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Top-down Adoption

- Adoption of deprecations

• Works well (29 out of 32 have a significantly low nui)

- Exceptions

• s:map (← s:hasMap)

• s:maps (← s:hasMap)

For Google Maps (lots of outdated tutorials)

Page 17: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Bottom-up Evolution

- Martin Luther

• Started the Protestant church

• A success story, too (like schema.org)

• (i.e., 800 million adopters worldwide)

- Famous quote:

• “Man muss […] dem gemeinen Mann aufs Maul schauen”

• (roughly: “You have to listen to the way the common man really speaks.”)

Martin Luther, 1483-1546

Disclaimer: I do not speak for the Protestant church.

Page 18: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Bottom-up Evolution

- Are new features in the schema first used “unofficially”?

• New classes/properties

• Domain/range changes

- Instrument for measurement: ROC curves

• True positives plotted against false positives

• tp: elements used before

• fp: elements not used before

• Ranking by #PLDs
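
The ROC construction can be sketched as follows. The reading assumed here is that candidate terms observed in an earlier crawl are ranked by the number of PLDs using them, and that a term counts as positive if it was later added to the schema; that interpretation, the function names, and the toy term names are assumptions, not details given on the slide.

```python
def rank_by_plds(pld_counts, later_added):
    """Return boolean labels for all terms, ordered by descending PLD count."""
    ranked = sorted(pld_counts, key=lambda term: (-pld_counts[term], term))
    return [term in later_added for term in ranked]


def roc_points(ranked_labels):
    """Walk down the ranking and record (false positive rate, true positive rate)."""
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative term")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical terms with made-up PLD counts from an earlier crawl.
usage = {"termA": 50, "termB": 30, "termC": 10, "termD": 5}
added_later = {"termA", "termC"}

curve = roc_points(rank_by_plds(usage, added_later))
print(curve, auc(curve))
```

An area well above 0.5 would indicate that widely used unofficial terms are more likely to be standardized later; ties in the ranking are not treated specially in this sketch.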

Page 19: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Bottom-up Evolution

- There are some mild influences observable

• Stronger for domain/range changes

  • especially range changes

• Weaker for new classes/properties

[Figure: panels 2012→2013, 2013→2014, and 2012→2014; series: classes, properties, domains, ranges]

Page 20: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Bottom-up Evolution

- Extension mechanism

• Allows for user-defined classes/properties

• Those become subclasses implicitly

- Analysis over time

• No measurable impact on standard evolution

• “Unofficial” use is likelier than use of the extension mechanism

s:Product/ElectronicProduct

s:price/reducedPrice
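
Terms written in this slash notation can be told apart from plain terms with a few lines of Python. The sketch below works on the compact `s:` notation used on the slide; the function names and the `official_terms` set are hypothetical and only serve the illustration.

```python
def split_extension(term):
    """Split 's:Product/ElectronicProduct' into ('s:Product', 'ElectronicProduct')."""
    parent, slash, extension = term.rpartition("/")
    if not slash:
        return term, None  # a plain term such as 's:Product'
    return parent, extension


def uses_extension(term, official_terms):
    """True if the term extends an official term via the slash notation."""
    parent, extension = split_extension(term)
    return extension is not None and parent in official_terms


# Hypothetical excerpt of the official vocabulary.
official_terms = {"s:Product", "s:price"}

for term in ["s:Product/ElectronicProduct", "s:price/reducedPrice", "s:Product"]:
    print(term, split_extension(term), uses_extension(term, official_terms))
```

The parent part is what makes the user-defined term an implicit subclass or subproperty of an official one.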

Page 21: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Overall Convergence

- Measuring convergence

• i.e., homogeneity of descriptions of classes

• Example: two instances of s:LocalBusiness

[Figure: two instance graphs. In each, node _:1 has type s:LocalBusiness and points to node _:2 of type s:PostalAddress; the literals are “Birmingham” and “Main Street 24” in the first graph and “Liverpool” and “Church Street 1” in the second]

Page 22: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Overall Convergence

- Recap

• RDF from Microdata is a set of trees

• i.e., we can enumerate all paths to leaf nodes (omitting literals)

- Example:

[Figure: instance graph with node _:1 of type s:LocalBusiness, node _:2 of type s:PostalAddress, and the literals “Liverpool” and “Church Street 1”]

rdf:type-s:LocalBusiness, s:address-rdf:type-s:PostalAddress, s:address-s:addressLocality, s:address-s:streetAddress
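
The path enumeration can be sketched with a small recursive function. The `Item` class and `enumerate_paths` are hypothetical names introduced for this illustration; the tree mirrors the Liverpool example, with type paths ending in the class name and literal-valued properties ending in the property name.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One Microdata item: its type plus property -> list of values."""
    item_type: str
    properties: dict = field(default_factory=dict)


def enumerate_paths(item, prefix=()):
    """Yield every path from the root item to a leaf, omitting literal values."""
    yield "-".join(prefix + ("rdf:type", item.item_type))
    for prop, values in item.properties.items():
        for value in values:
            if isinstance(value, Item):
                # Nested item: descend and keep the property on the path.
                yield from enumerate_paths(value, prefix + (prop,))
            else:
                # Literal value: the path stops at the property.
                yield "-".join(prefix + (prop,))


liverpool = Item("s:LocalBusiness", {
    "s:address": [Item("s:PostalAddress", {
        "s:addressLocality": ["Liverpool"],
        "s:streetAddress": ["Church Street 1"],
    })],
})

for path in enumerate_paths(liverpool):
    print(path)
```

Because literals are dropped, the Birmingham instance from the previous slide yields exactly the same set of paths, which is what makes the two descriptions comparable.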

Page 23: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Overall Convergence

- Using all paths, we can compute the entropy for each class as

- A low entropy indicates a high homogeneity

- We normalize both by maximum entropy and the total number of paths

• i.e., we use normalized entropy rate as a measure for homogeneity
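
The entropy formula is not reproduced in this transcript, so the following is only one way to read the description. It assumes that every instance of a class is summarized by its set of paths, that Shannon entropy is taken over the distribution of distinct path sets, that the maximum entropy is log2 of the number of instances, and that the rate is obtained by dividing by the number of distinct paths. All of these choices and the function names are assumptions.

```python
import math
from collections import Counter


def entropy(path_sets):
    """Shannon entropy (in bits) over the distinct path sets of a class."""
    counts = Counter(frozenset(paths) for paths in path_sets)
    total = len(path_sets)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def normalized_entropy_rate(path_sets):
    """Entropy divided by the assumed maximum and by the number of distinct paths."""
    n = len(path_sets)
    if n < 2:
        return 0.0  # a single description is trivially homogeneous
    distinct_paths = set().union(*path_sets)
    if not distinct_paths:
        return 0.0
    max_entropy = math.log2(n)  # reached when all instances differ
    return entropy(path_sets) / max_entropy / len(distinct_paths)


# Toy data: three instances share one description, one deviates.
same = {"rdf:type-s:LocalBusiness", "s:address-s:addressLocality"}
other = {"rdf:type-s:LocalBusiness", "s:name"}
print(normalized_entropy_rate([same, same, same, other]))
```

With identical descriptions the value is 0; it grows as instances of the same class are described by more diverse path sets.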

Page 24: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Overall Convergence

- Observations

• Overall entropy decreases over time

- Classes with high convergence rates

• WebSite, Blog, …

• Hotel, Restaurant, …

• Product, Offer, …

• Rating, Review

Influence of CMS adoption

Yellow pages

Google Rich Snippets

...all of the above

Page 25: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Key Adoption Drivers

- Search Engine Optimization

• Web site providers want to rank high in Google

• Direct business incentive!

- Tool adoption

• Major CMSs use schema.org

- Standard agility

• schema.org: 25 revisions in the last three years

• cf. FOAF: six revisions in the last eight years

Page 26: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Summary

- Both top-down and bottom-up adoption can be observed

- Homogeneity of the deployed schema increases over time

- The described empirical, data-driven study reveals valuable insights into how and why schema.org is a success story

- The observed key drivers and obstacles can also help to understand and analyze the adoption of other standards, e.g., LOD

- More fine-grained insights might be revealed when extending the analysis corpus to the mailing list archive and issue tracker

Page 27: A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Thank you! Questions? Feedback?

Raw data can be found on the website of WebDataCommons:

http://webdatacommons.org/structureddata/

More interesting datasets and analyses:

http://webdatacommons.org/index.html

Acknowledgement

The extraction and analysis of the datasets were supported by an AWS in Education Grant.