Validator and preview for the JobPosting data model of Schema.org Jindřich Mynarz Department of Information and Knowledge Engineering, University of Economics, Prague EC-WEB 2014, September 2, 2014

EC-WEB: Validator and Preview for the JobPosting Data Model of Schema.org


DESCRIPTION

The presentation describes a tool for validating and previewing instances of Schema.org JobPosting described in structured data markup embedded in web pages. The validator and preview were developed to help users of Schema.org produce data of better quality. In this way, the tool aims to improve the usability of the part of Schema.org that covers job postings. The paper discusses the implementation of the tool and the design of its validation rules, which are based on SPARQL 1.1. Results of an experimental validation of a job posting corpus harvested from the Web are presented. Among other findings, the results indicate that publishers of Schema.org JobPosting data often misunderstand the precedence rules employed by markup parsers and that they ignore the case-sensitivity of vocabulary names.

Citation preview

Page 1: EC-WEB: Validator and Preview for the JobPosting Data Model of Schema.org

Validator and preview for the JobPosting data model of Schema.org
Jindřich Mynarz
Department of Information and Knowledge Engineering, University of Economics, Prague

EC-WEB 2014, September 2, 2014

Page 2

Motivation

● Improving usability of vocabularies
● Provide feedback on the use of vocabularies
● Make vocabulary specification executable
● Help ensure basic level of data quality
● Capture application-specific requirements for data in validation rules

Page 3

DámePráci.eu project

“Matching jobs with unemployed through semantic data”
Data model using Schema.org with an extension for the job market.
Application for searching through job postings aggregated from distinct sources: www.damepraci.cz (in Czech)

Page 4

Validation method

● Rule-based, schema-aware validation
● Operates in the RDF data model
● Focuses on semantic errors, beyond well-formed markup
● Partial open world assumption
● Implemented as SPARQL 1.1 CONSTRUCT queries
● Error reporting via SPIN RDF vocabulary

Page 5

Background knowledge

schema.org
+ extension for job market (RDFS)
+ external enumerations:
● ISO 4217 currency codes (SKOS)
● ISO 639-1 language codes (SKOS)

Loaded in separate named graphs that the validation rules can reference.

Page 6

Validation rules

● Data completeness
● Distinction between datatype and object properties
● Conflicting data
● Datatype violations
● Invalid codes

Page 7

Data completeness

● At least 1 instance of schema:JobPosting
● Other type information (class membership, datatypes) left optional
● Empty literals
● Conditionally required data (e.g., compensation + currency)
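The completeness rules listed above could be sketched in plain Python as follows. This is only an illustration of the rule logic, not the tool's SPARQL implementation. The triple layout `(subject, predicate, object, is_literal)`, the error labels, and the property names `schema:baseSalary` and `schema:salaryCurrency` standing in for "compensation + currency" are all assumptions made for the example.

```python
RDF_TYPE = "rdf:type"
JOB_POSTING = "schema:JobPosting"
# Assumed property pair for the "compensation + currency" condition.
COMPENSATION = "schema:baseSalary"
CURRENCY = "schema:salaryCurrency"


def completeness_errors(triples):
    """Return (rule, subject, predicate) tuples for completeness problems.

    Each triple is (subject, predicate, object, is_literal).
    """
    errors = []

    # Rule 1: at least one instance of schema:JobPosting must be present.
    if not any(p == RDF_TYPE and o == JOB_POSTING for _, p, o, _ in triples):
        errors.append(("missing-job-posting", None, None))

    # Rule 2: literals must not be empty or whitespace-only.
    for s, p, o, is_literal in triples:
        if is_literal and not o.strip():
            errors.append(("empty-literal", s, p))

    # Rule 3: conditionally required data, compensation requires a currency.
    predicates_by_subject = {}
    for s, p, _, _ in triples:
        predicates_by_subject.setdefault(s, set()).add(p)
    for s, predicates in predicates_by_subject.items():
        if COMPENSATION in predicates and CURRENCY not in predicates:
            errors.append(("missing-currency", s, CURRENCY))

    return errors
```

Other type information is deliberately not checked here, which mirrors the slide's point that class membership and datatypes are left optional.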

Page 8

Distinction between datatype and object properties

● Object properties with literal objects instead of URIs or blank nodes (and vice versa for datatype properties)
● Simpler syntax of datatype properties
  ○ Avoiding nested objects or difficulties with finding an object's URI
● May be a symptom of incorrectly nested HTML elements
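A minimal sketch of this check, assuming the background knowledge has already been reduced to two sets of property names. The sets below are a small hand-picked excerpt for illustration, and the triple layout `(subject, predicate, object, is_literal)` and the error labels are hypothetical. The actual tool expresses such rules as SPARQL 1.1 queries.

```python
# Illustrative excerpts only; a real check would derive these from the schema.
DATATYPE_PROPERTIES = {"schema:title", "schema:addressLocality"}
OBJECT_PROPERTIES = {"schema:jobLocation", "schema:address"}


def property_kind_errors(triples):
    """Flag properties whose object kind contradicts the schema.

    Each triple is (subject, predicate, object, is_literal).
    """
    errors = []
    for s, p, _, is_literal in triples:
        if p in OBJECT_PROPERTIES and is_literal:
            # Object property used as datatype property.
            errors.append(("object-property-with-literal", s, p))
        elif p in DATATYPE_PROPERTIES and not is_literal:
            # Datatype property used as object property.
            errors.append(("datatype-property-with-resource", s, p))
    return errors
```

For example, a triple such as `("_:job", "schema:jobLocation", "Munich", True)` would be reported as an object property carrying a literal.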

Page 9

Conflicting data

● Mutually-exclusive properties
  ○ schema:jobLocation + schema:isRemoteWork true
● Cardinality violation for functional properties with > 1 object
  ○ schema:startDate, schema:currency, schema:availableVacancies
● Incompatible class membership inferences
  ○ schema:domainIncludes, schema:rangeIncludes
  ○ Incompatible class membership is instantiation of 2+ distinct classes that are not in rdfs:subClassOf relation.
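Two of the conflict checks above can be illustrated in plain Python. This is a sketch under stated assumptions, not the tool's SPARQL rules: triples are modelled as `(subject, predicate, object)` tuples, and the `superclasses` argument is a hypothetical precomputed map from each class to all of its transitive superclasses.

```python
from collections import defaultdict

# Functional properties named on the slide.
FUNCTIONAL = {"schema:startDate", "schema:currency", "schema:availableVacancies"}


def cardinality_errors(triples):
    """Return (subject, property) pairs where a functional property has > 1 distinct object."""
    objects = defaultdict(set)
    for s, p, o in triples:
        if p in FUNCTIONAL:
            objects[(s, p)].add(o)
    return sorted(key for key, values in objects.items() if len(values) > 1)


def incompatible_types(types, superclasses):
    """Return pairs of classes of one resource that are not in a subclass relation."""
    def related(a, b):
        return b in superclasses.get(a, set()) or a in superclasses.get(b, set())

    return sorted(
        (a, b) for a in types for b in types if a < b and not related(a, b)
    )
```

Repeating the same value for a functional property is not treated as a violation in this sketch, since the distinct objects are collected in a set.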

Page 10

Datatype violations

● Regular expressions, casting errors of XPath datatype constructor functions
● Date and time formats (xsd:date, xsd:duration)
  ○ Not conforming to regular expressions
  ○ Non-existent dates
  ○ Dates from the future
● Interval limits
  ○ Positive integers for schema:availableVacancies
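The date and interval checks above can be approximated with the Python standard library. This is a simplified sketch: it accepts only the plain `YYYY-MM-DD` form, whereas xsd:date also allows time zones and other variants, and the function names and return labels are hypothetical.

```python
import re
from datetime import date

# Simplified lexical form; real xsd:date also permits a time zone suffix.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def check_date(value, today=None):
    """Return None for a valid past-or-present date, else an error label."""
    today = today or date.today()
    if not DATE_PATTERN.fullmatch(value):
        return "malformed"          # does not conform to the regular expression
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return "non-existent"       # e.g. 30 February
    if parsed > today:
        return "future"
    return None


def is_positive_integer(value):
    """Interval limit, e.g. for schema:availableVacancies."""
    return re.fullmatch(r"\+?[0-9]+", value) is not None and int(value) > 0
```

The lexical check runs before the calendar check so that a malformed string and a well-formed but impossible date produce different error labels, which follows the distinction drawn on the slide.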

Page 11

Invalid codes

● Based on lookup in code lists enumerating every valid code

● Includes language codes (ISO 639-1) and currency codes (ISO 4217)
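A lookup-based check of this kind can be sketched as a simple set membership test. The code lists below are tiny excerpts chosen for illustration, and the exact-match, case-sensitive comparison is an assumption. The tool itself looks codes up in SKOS code lists loaded into named graphs.

```python
# Small illustrative excerpts, not the full ISO code lists.
CURRENCY_CODES = frozenset({"CZK", "EUR", "GBP", "USD"})   # ISO 4217
LANGUAGE_CODES = frozenset({"cs", "de", "en", "fr"})       # ISO 639-1


def invalid_codes(values, code_list):
    """Return the values that are not listed in the given code list."""
    return [value for value in values if value not in code_list]
```

With these excerpts, `invalid_codes(["EUR", "eur", "Kč"], CURRENCY_CODES)` would report `"eur"` and `"Kč"`, because only exact matches against the enumerated codes pass.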

Page 12

Implementation

Ruby on Rails web application backed by Jena Fuseki SPARQL 1.1 endpoint.
● Validates both RDFa and HTML5 Microdata
● Czech and English localization
● Validation results in HTML or JSON-LD
● RSpec tests for each validation rule
● Open source: https://github.com/OPLZZ/job-posting-validator

Page 13

Demo: bit.ly/broken-job-posting

Page 14

Preview

Page 15

Experimental validation of a JobPosting corpus

● 1332 seed URLs from 752 distinct pay-level domains obtained via Google Custom Search Engine restricted to schema:JobPosting

● Sample of 42 872 web pages obtained by crawling seed URLs

● Each page validated, validation results in JSON-LD loaded to Elasticsearch for exploration

Page 16

Most common errors

Page 17

Datatype property used as object property

Most common path to error: schema:title

Possible cause: incorrect understanding of markup precedence rules:

Markup:
<a property="title" href="#title">SEO guru</a>

Parsed as:
[] schema:title <#title> .

Intended:
[] schema:title "SEO guru" .

Page 18

Empty literal value

Most common path to error: schema:addressRegion

Possible cause: incomplete data used to generate HTML from fixed templates
Less common in manually marked-up HTML

Page 19

Incorrect character case in schema:Postaladdress

Both RDFa and HTML5 Microdata are case-sensitive.
Spread across 116 unique PLDs.

“The default mode of authoring [Schema.org markup] is copy and edit.” (R.V. Guha)
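One way to detect this kind of error is to compare an unknown term case-insensitively against the known vocabulary terms. The sketch below assumes a small hand-picked excerpt of Schema.org terms and a hypothetical helper name. It illustrates the idea and is not the validator's actual rule.

```python
# Illustrative excerpt of vocabulary terms with their correct casing.
KNOWN_TERMS = {"JobPosting", "PostalAddress", "Place", "jobLocation", "addressRegion"}
_BY_LOWERCASE = {term.lower(): term for term in KNOWN_TERMS}


def suggest_casing(term):
    """Return the correctly cased term if `term` differs only in case, else None."""
    if term in KNOWN_TERMS:
        return None  # already correct
    return _BY_LOWERCASE.get(term.lower())
```

Here `suggest_casing("Postaladdress")` would point to `PostalAddress`, while a term that is unknown regardless of case yields no suggestion.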

Page 20

Object property used as datatype property

Most common path to error: schema:jobLocation

Common cause: simpler markup without intermediate resources

<p property="jobLocation">Munich</p>

<p rel="jobLocation">
  <p rel="address">
    <p property="addressLocality">
      Munich
    </p>
  </p>
</p>

Page 21

Unsuccessful experiments

Web Data Commons
● Errors smoothed by extraction to RDF
● Not suitable as a source of seed URLs: job postings disappear quickly

Veterans Job Bank
● Data from few PLDs, lacks variety
● Severe restrictions on automated downloads through its API

Page 22

Questions?

Acknowledgements: The presented research was partially supported by the project of Operational Programme Human Resources and Employment no. CZ.1.04/5.1.01/77.00440.

Image credits:
Check List designed by Arthur Shlain from thenounproject.com
Puzzle designed by John from thenounproject.com