A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time
Robert Meusel, Christian Bizer and Heiko Paulheim
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time - WIMS 2015 2
Motivation - LOD Cloud with 1,000 data providers
Motivation - schema.org Microdata (MD) with 700k data providers
Microdata in a Nutshell
- Adding structured information to web pages
• By marking up contents and entities
- Arbitrary vocabularies are possible
• Practically, only schema.org is deployed on a large scale
• Plus its historical predecessor: data-vocabulary.org
- Similar to RDFa
<div itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="name">Data and Web Science Group</span>
  <span itemprop="addressLocality">Mannheim</span>,
  <span itemprop="postalCode">68131</span>
  <span itemprop="addressCountry">Germany</span>
</div>
Schema.org in a Nutshell
- Vocabulary for marking up entities on web pages
• 675 classes and 965 properties (as of May 2015, release 2.0)
- Promoted and consumed by major search engine companies
• Google, Bing, Yahoo!, and Yandex
• Google Rich Snippets
- Community-driven evolution and development
Schema.org in a Nutshell – Coverage
- Schema.org has incorporated some popular vocabularies, like:
• GoodRelations (2012)
• W3C BibExtend (2014)
• MusicBrainz vocabulary (2015)
• Automotive Ontology (2015)
Microdata with Schema.org in HTML Pages
<html>…<body>…
<div id="main-section" class="performance left" data-sku="M17242_580">
  <h1> Predator Instinct FG Fußballschuh </h1>
  <div>
    <meta content="EUR">
    <span data-sale-price="219.95">219,95</span>
…</body></html>
HTML pages directly embed markup languages to annotate items using different vocabularies
<html>…<body>…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
…</body></html>
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…
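The step from annotated HTML to triples can be sketched in a few lines of Python. The parser below is a minimal illustration, not the extraction pipeline used for the study: the class name MicrodataSketch, the blank-node naming, and the abbreviated rdf:type predicate are assumptions, and it only handles well-nested markup whose property values are plain text or a content attribute. Property URIs are formed as itemtype/itemprop, mirroring the triples shown above.

```python
from html.parser import HTMLParser

# Elements without an end tag; they never go on the open-element stack.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class MicrodataSketch(HTMLParser):
    """Collects (subject, predicate, object) tuples from itemscope/itemprop markup."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.open_tags = []   # one flag per open element: did it open an item?
        self.items = []       # stack of (node_id, itemtype) for open items
        self.pending = None   # (node_id, predicate) waiting for its text value
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        parent = self.items[-1] if self.items else None
        opens_item = "itemscope" in attrs
        if opens_item:
            self.counter += 1
            node = f"_:node{self.counter}"
            itemtype = attrs.get("itemtype", "")
            self.triples.append((node, "rdf:type", itemtype))
            if prop and parent:
                # Nested item: link it to the enclosing item.
                self.triples.append((parent[0], f"{parent[1]}/{prop}", node))
            self.items.append((node, itemtype))
        elif prop and parent:
            predicate = f"{parent[1]}/{prop}"
            if "content" in attrs:
                self.triples.append((parent[0], predicate, attrs["content"]))
            else:
                self.pending = (parent[0], predicate)
        if tag not in VOID_TAGS:
            self.open_tags.append(opens_item)

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((*self.pending, data.strip()))
            self.pending = None

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_tags:
            return
        if self.open_tags.pop():
            self.items.pop()


SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price">219,95</span>
  </div>
</div>
"""

parser = MicrodataSketch()
parser.feed(SAMPLE)
triples = parser.triples
for triple in triples:
    print(triple)
```

In this sketch the nested Offer becomes its own blank node and is linked to the Product through the offers property, which is what keeps price and currency attached to the offer rather than to the product.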
Wrap-Up
- Semantic annotations are used by more and more websites
- Entities on websites become machine-readable and machine-understandable
- schema.org together with Microdata is a success story
• Promoted by search engine companies
• Deployed by over 17% of all websites [1] (over 700k data providers)
- Usage is more compliant with the schema than, e.g., LOD [2]
[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
[2] Meusel and Paulheim, ESWC 2015
Digging for Reasons
- So, Microdata is deployed more often and is often more schema-compliant, although there are millions of uncontrolled providers with different skill sets
- But why? Some hypotheses…
• Availability of documentation
• Tool support
• Business incentive
• Schema flexibility
- Can we confirm/reject those by looking at the data?
A Diachronic Perspective
- Versions of schema.org are archived over time
• Plus: there are several crawl releases per year
• i.e., we can look at change over time
- If we look at both schema and deployed data, we may observe
• Adoption rates of schema changes
• Data-first changes to the schema
• Convergence or divergence of deployed data
A Diachronic Perspective
- Three releases of WDC Microdata corpus [1]
• 2012, 2013, and 2014
- Versions of schema.org that were valid
• At the beginning of the crawl
• At the end of the crawl
[1] http://webdatacommons.org/structureddata
Top-down Adoption
- How fast are changes in the schema adopted?
• New classes/properties
• Deprecations
• Domain/range changes
- Measuring adoption: challenges
• Different crawls
• Overall growth of deployed schema.org
- Measure: normalized usage increase (nui) from i to j:
• nui(s)>1.05: usage of schema element s has increased significantly
• nui(s)<0.95: usage of schema element s has decreased significantly
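The formula for nui is not reproduced in this transcript. The sketch below assumes one plausible reading that addresses both challenges listed above: the share of data providers using element s in crawl j, divided by the corresponding share in crawl i. The function names, the choice of provider counts as the unit, and all numbers are hypothetical; only the 1.05 and 0.95 thresholds come from the slide.

```python
def nui(used_i, total_i, used_j, total_j):
    """Ratio of the usage share of a schema element in crawl j to its share in crawl i.

    Assumed definition: (used_j / total_j) / (used_i / total_i).
    """
    if used_i == 0 or total_i == 0 or total_j == 0:
        raise ValueError("nui is undefined without usage in the earlier crawl")
    return (used_j / total_j) / (used_i / total_i)


def classify(value, upper=1.05, lower=0.95):
    """Apply the significance thresholds given on the slide."""
    if value > upper:
        return "increased"
    if value < lower:
        return "decreased"
    return "stable"


# Hypothetical counts: providers using the element / all schema.org providers.
growing = nui(used_i=1_000, total_i=100_000, used_j=3_000, total_j=200_000)
shrinking_share = nui(used_i=20_000, total_i=100_000, used_j=25_000, total_j=200_000)

print(round(growing, 3), classify(growing))
print(round(shrinking_share, 3), classify(shrinking_share))
```

The second example shows why normalization matters under this reading: the element gains providers in absolute terms, yet its share of the overall deployment shrinks, so its nui falls below 0.95.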
Top-down Adoption
- Adoption of new classes and properties
• Almost half of all introduced classes are never used!
• Similar for new properties
- Reasons
• Bulk-addition of vocabularies
  • not every term is equally needed
  • e.g., medical vocabulary
• Blind spot of our approach
  • some terms are mainly for e-mail markup
  • e.g., Actions
Top-down Adoption
- Main domains of positive adoption
• Metadata for web content (schema.org/WebSite has the highest nui)
• Broadcasting (e.g., TV Episodes)
• Questions & Answers
• Postal addresses
- Classes featured in Google Rich Snippets
• Still growth on a high level (tens of thousands of data providers)
• But nui(s)<0.95
Callouts: Yellow Pages; Search Engine Listings; Collaboration with BBC and EBU; Influence of CMS adoption; Q&A pages, such as Stack Overflow
Top-down Adoption
- Adoption of domain/range changes
• Again: rather low overall adoption
- Adopted well for
• Products (height, width, itemCondition, …)
• Broadcasting domain (episode, actor, ...)
Callouts: Search Engine Listings; Collaboration with BBC and EBU
Top-down Adoption
- Adoption of deprecations
• Works well (29 out of 32 have a significantly low nui)
- Exceptions
• s:map (← s:hasMap)
• s:maps (← s:hasMap)
Callout: For Google Maps (lots of outdated tutorials)
Bottom-up Evolution
- Martin Luther
• Started the Protestant church
• A success story, too (like schema.org)
• (i.e., 800 million adopters worldwide)
- Famous quote:
• “Man muss […] dem gemeinen Mann aufs Maul schauen”
• (roughly: “You have to listen to the way the common man really speaks.”)
Martin Luther, 1483-1546
Disclaimer: I do not speak for the Protestant church.
Bottom-up Evolution
- Are new features in the schema first used “unofficially”?
• New classes/properties
• Domain/range changes
- Instrument for measurement: ROC curves
• True positives plotted against false positives
• tp: elements used before
• fp: elements not used before
• Ranking by number of pay-level domains (#PLDs)
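The slide does not spell out how the ROC curves are constructed, so the following is only a sketch under stated assumptions: each candidate is a term observed in deployed data before it was part of the official schema, its score is the number of PLDs using it, and it counts as positive if it was later added to schema.org. The function names and the example data are hypothetical, and ties in the ranking are not treated specially.

```python
def roc_points(candidates):
    """Return ROC points (false positive rate, true positive rate).

    candidates: list of (pld_count, is_positive) pairs, ranked here by
    descending pld_count.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    positives = sum(1 for _, positive in ranked if positive)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative candidate")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, positive in ranked:
        if positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical candidates: (number of PLDs using the term, later added to schema?)
example = [(500, True), (300, False), (200, True), (50, False)]
points = roc_points(example)
area = auc(points)
print(points)
print(area)
```

An area close to 0.5 would mean that prior usage says little about which terms make it into the schema, while values well above 0.5 would point to a bottom-up influence.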
Bottom-up Evolution
- There are some mild influences observable
• Stronger for domain/range changes
  • especially range changes
• Weaker for new classes/properties
[Figure: results for classes, properties, domains, and ranges; panels 2012→2013, 2013→2014, 2012→2014]
Bottom-up Evolution
- Extension mechanism
• Allows for user-defined classes/properties
• Those become subclasses implicitly
- Analysis over time
• No measurable impact on standard evolution
• “Unofficial” use is likelier than use of the extension mechanism
Examples: s:Product/ElectronicProduct, s:price/reducedPrice
Overall Convergence
- Measuring convergence
• i.e., homogeneity of descriptions of classes
• Example: two instances of s:LocalBusiness
[Figure: two s:LocalBusiness instances, each with an s:PostalAddress; one with the values “Birmingham” and “Main Street 24”, the other with “Liverpool” and “Church Street 1”]
Overall Convergence
- Recap
• RDF from Microdata is a set of trees
• i.e., we can enumerate all paths to leaf nodes (omitting literals)
- Example:
[Figure: an s:LocalBusiness instance whose s:PostalAddress has the values “Liverpool” and “Church Street 1”]
rdf:type-s:LocalBusiness
s:address-rdf:type-s:PostalAddress
s:address-s:addressLocality
s:address-s:streetAddress
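The path enumeration for this example can be sketched as a short recursive function. The nested-dictionary representation of the entity and the function name enumerate_paths are assumptions made for illustration; the sketch keeps the class after rdf:type, drops literal values, and does not handle multi-valued properties.

```python
def enumerate_paths(node, prefix=()):
    """List all property paths from the root of an entity tree to its leaves.

    node: dict mapping property names to either a literal (str), a class name
    (for rdf:type), or a nested dict describing another entity.
    """
    paths = []
    for prop, value in node.items():
        step = prefix + (prop,)
        if isinstance(value, dict):
            # Nested entity: descend and extend the current path.
            paths.extend(enumerate_paths(value, step))
        elif prop == "rdf:type":
            # The class is part of the path, it is not a literal.
            paths.append("-".join(step + (value,)))
        else:
            # Literal value: keep the path, omit the literal itself.
            paths.append("-".join(step))
    return paths


liverpool = {
    "rdf:type": "s:LocalBusiness",
    "s:address": {
        "rdf:type": "s:PostalAddress",
        "s:addressLocality": "Liverpool",
        "s:streetAddress": "Church Street 1",
    },
}

paths = enumerate_paths(liverpool)
for path in paths:
    print(path)
```

Because literals are dropped, the Birmingham instance from the previous slide would yield exactly the same four paths, which is what makes the two descriptions count as homogeneous.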
Overall Convergence
- Using all paths, we can compute the entropy for each class
- A low entropy indicates high homogeneity
- We normalize both by maximum entropy and the total number of paths
• i.e., we use normalized entropy rate as a measure for homogeneity
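The entropy formula itself is not reproduced in this transcript. As a rough sketch of the idea, the code below assumes that each entity of a class is summarized by its set of paths, computes the Shannon entropy over how often each distinct path set occurs, and divides by the maximum entropy for that number of entities. The additional normalization by the total number of paths mentioned above is left out because its exact form is not given; the function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(path_sets):
    """Shannon entropy of the distribution of path sets, scaled to [0, 1].

    path_sets: one collection of paths per entity of the class.
    Returns 0.0 if all entities share one description and 1.0 if every
    entity is described differently.
    """
    n = len(path_sets)
    if n < 2:
        return 0.0
    counts = Counter(frozenset(paths) for paths in path_sets)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)  # maximum entropy: all n entities differ


common = ["rdf:type-s:LocalBusiness", "s:address-s:addressLocality"]
other = ["rdf:type-s:LocalBusiness", "s:name"]

identical = normalized_entropy([common, common, common, common])
mixed = normalized_entropy([common, common, common, other])

print(identical)
print(round(mixed, 4))
```

Under these assumptions, a class whose providers converge on one way of describing entities moves toward 0 over time, which matches the reading that low entropy means high homogeneity.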
Overall Convergence
- Observations
• Overall entropy decreases over time
- Classes with high convergence rates
• WebSite, Blog, … (influence of CMS adoption)
• Hotel, Restaurant, … (yellow pages)
• Product, Offer, … (Google Rich Snippets)
• Rating, Review (all of the above)
Key Adoption Drivers
- Search Engine Optimization
• Website providers want to rank high in Google
• Direct business incentive!
- Tool adoption
• Major CMSs use schema.org
- Standard Agility
• schema.org: 25 revisions in the last three years
• cf. FOAF: six revisions in the last eight years
Summary
- Both directions of adoption, top-down and bottom-up, can be observed
- Homogeneity of the deployed schema increases over time
- The empirical, data-driven study described here yields valuable insights into how and why schema.org is a success story
- The observed key drivers and obstacles can also help to understand and analyze the adoption of other standards, e.g., LOD
- More fine-grained insights might be revealed by extending the analysis corpus to the mailing list archive and issue tracker
Thank you! Questions? Feedback?
Raw data can be found on the website of WebDataCommons:
http://webdatacommons.org/structureddata/
More interesting datasets and analyses:
http://webdatacommons.org/index.html
Acknowledgement
The extraction and analysis of the datasets was supported by an AWS in Education Grant.