DESCRIPTION
The presentation describes a tool for validating and previewing instances of Schema.org JobPosting described in structured data markup embedded in web pages. The validator and preview were developed to help users of Schema.org produce data of better quality, and so to improve the usability of the part of Schema.org that covers job postings. The paper discusses the implementation of the tool and the design of its validation rules based on SPARQL 1.1. Results of an experimental validation of a job posting corpus harvested from the Web are presented. Among other findings, the results indicate that publishers of Schema.org JobPosting data often misunderstand the precedence rules employed by markup parsers and that they ignore the case-sensitivity of vocabulary names.
Validator and preview for the JobPosting data model of Schema.org
Jindřich Mynarz
Department of Information and Knowledge Engineering, University of Economics, Prague
EC-WEB 2014, September 2, 2014
Motivation
● Improving usability of vocabularies
● Provide feedback on the use of vocabularies
● Make vocabulary specification executable
● Help ensure basic level of data quality
● Capture application-specific requirements for data in validation rules
DámePráci.eu project
“Matching jobs with unemployed through semantic data”
Data model using Schema.org with an extension for the job market.
Application for searching through job postings aggregated from distinct sources: www.damepraci.cz (in Czech)
Validation method
● Rule-based, schema-aware validation
● Operates in the RDF data model
● Focuses on semantic errors, beyond well-formed markup
● Partial open world assumption
● Implemented as SPARQL 1.1 CONSTRUCT queries
● Error reporting via SPIN RDF vocabulary
Background knowledge
schema.org
+ extension for job market (RDFS)
+ external enumerations:
● ISO 4217 currency codes (SKOS)
● ISO 639-1 language codes (SKOS)
Loaded in separate named graphs that the validation rules can reference.
Validation rules
● Data completeness
● Distinction between datatype and object properties
● Conflicting data
● Datatype violations
● Invalid codes
Data completeness
● At least 1 instance of schema:JobPosting
● Other type information (class membership, datatypes) left optional
● Empty literals
● Conditionally required data (e.g., compensation + currency)
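The three completeness checks can be sketched in plain Python over (subject, predicate, object) tuples. This is an illustration only: the tool expresses its rules as SPARQL 1.1 CONSTRUCT queries, and the Literal wrapper, the function name and the property name schema:compensation are assumptions made for this sketch.

```python
from collections import namedtuple

# Literals are wrapped; plain strings stand for URIs or blank nodes.
Literal = namedtuple("Literal", "value")

RDF_TYPE = "rdf:type"
JOB_POSTING = "schema:JobPosting"
COMPENSATION = "schema:compensation"  # hypothetical property name
CURRENCY = "schema:currency"


def completeness_errors(triples):
    """Return (subject, message) pairs for completeness problems."""
    errors = []
    # Rule 1: at least one instance of schema:JobPosting.
    if not any(p == RDF_TYPE and o == JOB_POSTING for _, p, o in triples):
        errors.append((None, "no instance of schema:JobPosting"))
    # Rule 2: literals must not be empty or whitespace-only.
    for s, p, o in triples:
        if isinstance(o, Literal) and not o.value.strip():
            errors.append((s, f"empty literal for {p}"))
    # Rule 3: compensation requires a currency on the same subject.
    with_pay = {s for s, p, _ in triples if p == COMPENSATION}
    with_currency = {s for s, p, _ in triples if p == CURRENCY}
    for s in sorted(with_pay - with_currency):
        errors.append((s, "compensation given without currency"))
    return errors


sample = [
    ("_:job", RDF_TYPE, JOB_POSTING),
    ("_:job", "schema:title", Literal("SEO guru")),
    ("_:job", "schema:addressRegion", Literal("")),
    ("_:job", COMPENSATION, Literal("30000")),
]
errors = completeness_errors(sample)
```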
Distinction between datatype and object properties
● Object properties with literal objects instead of URIs or blank nodes (and vice versa for datatype properties)
● Simpler syntax of datatype properties
○ Avoiding nested objects or difficulties with finding an object's URI
● May be a symptom of incorrectly nested HTML elements
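A minimal Python sketch of this check, assuming the background schema has already been reduced to two sets of property names. The sets below are a hand-picked excerpt for illustration, not the tool's actual schema, and the tool itself implements the rule in SPARQL 1.1.

```python
from collections import namedtuple

# Literals are wrapped; plain strings stand for URIs or blank nodes.
Literal = namedtuple("Literal", "value")

# Assumed excerpt of the schema: which properties expect which kind of object.
DATATYPE_PROPERTIES = {"schema:title", "schema:addressLocality"}
OBJECT_PROPERTIES = {"schema:jobLocation", "schema:address"}


def property_kind_errors(triples):
    """Return (subject, property, message) for mismatched object kinds."""
    errors = []
    for s, p, o in triples:
        is_literal = isinstance(o, Literal)
        if p in OBJECT_PROPERTIES and is_literal:
            errors.append((s, p, "object property used as datatype property"))
        elif p in DATATYPE_PROPERTIES and not is_literal:
            errors.append((s, p, "datatype property used as object property"))
    return errors


sample = [
    ("_:job", "schema:title", "#title"),                # URI where text is expected
    ("_:job", "schema:jobLocation", Literal("Munich")),  # text where a resource is expected
    ("_:job", "schema:address", "_:address"),            # fine
]
kind_errors = property_kind_errors(sample)
```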
Conflicting data
● Mutually-exclusive properties
○ schema:jobLocation + schema:isRemoteWork true
● Cardinality violation for functional properties with > 1 object
○ schema:startDate, schema:currency, schema:availableVacancies
● Incompatible class membership inferences
○ schema:domainIncludes, schema:rangeIncludes
○ Incompatible class membership is instantiation of 2+ distinct classes that are not in rdfs:subClassOf relation.
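The first two kinds of conflict can be sketched in Python as follows. This is an assumption-laden illustration, not the tool's SPARQL 1.1 rules: objects are plain strings, the boolean is the string "true", and the function name is made up for this sketch. The class membership check is left out because it needs the rdfs:subClassOf hierarchy.

```python
from collections import defaultdict

# Properties treated as functional, as listed above.
FUNCTIONAL = {"schema:startDate", "schema:currency", "schema:availableVacancies"}


def conflict_errors(triples):
    """Return (subject, property, message) for conflicting statements."""
    values = defaultdict(set)
    for s, p, o in triples:
        values[(s, p)].add(o)

    errors = []
    # Cardinality: a functional property may have at most one distinct object.
    for (s, p), objs in sorted(values.items()):
        if p in FUNCTIONAL and len(objs) > 1:
            errors.append((s, p, "more than one value for a functional property"))
    # Mutual exclusion: remote work and a job location on the same posting.
    for (s, p), objs in sorted(values.items()):
        if (p == "schema:isRemoteWork" and "true" in objs
                and (s, "schema:jobLocation") in values):
            errors.append((s, p, "remote work combined with schema:jobLocation"))
    return errors


sample = [
    ("_:job", "schema:startDate", "2014-09-02"),
    ("_:job", "schema:startDate", "2014-10-01"),
    ("_:job", "schema:isRemoteWork", "true"),
    ("_:job", "schema:jobLocation", "_:place"),
]
conflicts = conflict_errors(sample)
```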
Datatype violations
● Regular expressions, casting errors of XPath datatype constructor functions
● Date and time formats (xsd:date, xsd:duration)
○ Not conforming to regular expressions
○ Non-existent dates
○ Dates from the future
● Interval limits
○ Positive integers for schema:availableVacancies
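The date and interval checks can be illustrated with the Python standard library. The sketch below is not the tool's rule set: the pattern is a simplification of xsd:date that ignores time zones and negative years, and the function names and messages are assumptions.

```python
import re
from datetime import date

# Simplified xsd:date pattern: YYYY-MM-DD without a time zone.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def date_error(value, today):
    """Return a message describing the problem with a date, or None."""
    if not DATE_PATTERN.fullmatch(value):
        return "does not match the date pattern"
    year, month, day = map(int, value.split("-"))
    try:
        parsed = date(year, month, day)
    except ValueError:
        # Well-formed but impossible, e.g. 30 February.
        return "non-existent date"
    if parsed > today:
        return "date from the future"
    return None


def vacancies_error(value):
    """Return a message if the value is not a positive integer, else None."""
    try:
        number = int(value)
    except ValueError:
        return "not an integer"
    return None if number > 0 else "not a positive integer"
```

For example, with the conference date as the reference day, date_error("2014-02-30", date(2014, 9, 2)) reports a non-existent date, while vacancies_error("0") reports a value outside the allowed interval.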
Invalid codes
● Based on lookup in code lists enumerating every valid code
● Includes language codes (ISO 639-1) and currency codes (ISO 4217)
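A lookup of this kind reduces to set membership. The sketch below uses small hand-written excerpts of the two code lists; the tool itself looks codes up in the SKOS enumerations loaded as background knowledge, so the sets and the function name are assumptions.

```python
# Small excerpts only; the real code lists enumerate every valid code.
CURRENCY_CODES = {"CZK", "EUR", "USD"}  # ISO 4217
LANGUAGE_CODES = {"cs", "de", "en"}     # ISO 639-1


def invalid_codes(values, code_list):
    """Return the values that are not present in the code list."""
    return [value for value in values if value not in code_list]


bad_currencies = invalid_codes(["EUR", "euro", "czk"], CURRENCY_CODES)
bad_languages = invalid_codes(["cs", "english"], LANGUAGE_CODES)
```

The comparison in this sketch is exact, so "czk" is reported even though "CZK" is valid.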
Implementation
Ruby on Rails web application backed by Jena Fuseki SPARQL 1.1 endpoint.
● Validates both RDFa and HTML5 Microdata
● Czech and English localization
● Validation results in HTML or JSON-LD
● RSpec tests for each validation rule
● Open source: https://github.com/OPLZZ/job-posting-validator
Preview
Experimental validation of a JobPosting corpus
● 1332 seed URLs from 752 distinct pay-level domains (PLDs) obtained via Google Custom Search Engine restricted to schema:JobPosting
● Sample of 42 872 web pages obtained by crawling seed URLs
● Each page validated, validation results in JSON-LD loaded to Elasticsearch for exploration
Most common errors
Datatype property used as object property
Most common path to error: schema:title
Possible cause: incorrect understanding of markup precedence rules.
Markup:
<a property="title" href="#title">SEO guru</a>
Parsed as:
[] schema:title <#title> .
Probably intended:
[] schema:title "SEO guru" .
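The precedence at play can be sketched as a small Python function. It is a simplified paraphrase of how RDFa 1.1 picks the object for an element carrying @property, not a parser: typed literals, @typeof and language tags are ignored, and the function name is made up for this illustration.

```python
def rdfa_property_object(attrs, text):
    """Simplified choice of the object for an element with @property.

    attrs is a dict of the element's attributes, text its text content.
    Returns a ("literal", value) or ("iri", value) pair.
    """
    # An explicit @content always supplies a literal.
    if "content" in attrs:
        return ("literal", attrs["content"])
    # Without @datatype, @rel and @rev, a link attribute wins over the text.
    if not any(name in attrs for name in ("datatype", "rel", "rev")):
        for name in ("resource", "href", "src"):
            if name in attrs:
                return ("iri", attrs[name])
    # Otherwise the text content becomes the literal.
    return ("literal", text)


with_href = rdfa_property_object({"property": "title", "href": "#title"}, "SEO guru")
without_href = rdfa_property_object({"property": "title"}, "SEO guru")
```

In this sketch the href on the link takes precedence over the link text, which is how the title ends up as a URI; dropping the href or adding an explicit content attribute yields the literal.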
Empty literal value
Most common path to error: schema:addressRegion
Possible cause: incomplete data used to generate HTML from fixed templates
Less common in manually marked-up HTML
Incorrect character case in schema:Postaladdress
Both RDFa and HTML5 Microdata are case-sensitive.
Spread across 116 unique PLDs.
“The default mode of authoring [Schema.org markup] is copy and edit.” (R.V. Guha)
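A case error like this one can be detected by comparing a term with the known vocabulary names case-insensitively. The sketch below is a hypothetical helper, not one of the tool's rules, and KNOWN_TERMS is a tiny hand-picked excerpt of Schema.org names.

```python
# Tiny excerpt of vocabulary names; a real check would load the full schema.
KNOWN_TERMS = {"JobPosting", "PostalAddress", "Place", "jobLocation"}


def case_suggestion(term):
    """Return the correctly cased name if term differs only in case.

    Returns None when the term is already valid or has no known match.
    """
    if term in KNOWN_TERMS:
        return None
    by_lowercase = {known.lower(): known for known in KNOWN_TERMS}
    return by_lowercase.get(term.lower())


suggestion = case_suggestion("Postaladdress")
```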
Object property used as datatype property
Most common path to error: schema:jobLocation
Common cause: simpler markup without intermediate resources
Simplified markup:
<p property="jobLocation">Munich</p>
Markup with intermediate resources:
<p rel="jobLocation">
  <p rel="address">
    <p property="addressLocality">
      Munich
    </p>
  </p>
</p>
Unsuccessful experiments
Web Data Commons
● Errors smoothed by extraction to RDF
● Not suitable as a source of seed URLs: job postings disappear quickly
Veterans Job Bank
● Data from few PLDs, lacks variety
● Severe restrictions on automated downloads through its API
Questions?
Acknowledgements: The presented research was partially supported by the project of Operational Programme Human Resources and Employment no. CZ.1.04/5.1.01/77.00440.
Image credits:
Check List designed by Arthur Shlain from thenounproject.com
Puzzle designed by John from thenounproject.com