
Undefined 1 (2012) 1–5
IOS Press

Quality Assessment Methodologies for Linked Open Data
A Systematic Literature Review and Conceptual Framework

Amrapali Zaveri a, Anisa Rula b, Andrea Maurino b, Ricardo Pietrobon c, Jens Lehmann a and Sören Auer a
a Universität Leipzig, Institut für Informatik, D-04103 Leipzig, Germany, E-mail: (zaveri, lehmann, auer)@informatik.uni-leipzig.de
b University of Milano-Bicocca, Department of Computer Science, Systems and Communication (DISCo), Innovative Technologies for Interaction and Services (Lab), Viale Sarca 336, Milan, Italy, E-mail: (anisa.rula, maurino)@disco.unimib.it
c Associate Professor and Vice Chair of Surgery, Duke University, Durham, NC, USA, E-mail: rpietro@duke.edu

Abstract. The development and standardization of semantic web technologies have resulted in an unprecedented volume of data being published on the Web as Linked Open Data (LOD). However, we observe widely varying data quality, ranging from extensively curated datasets to crowd-sourced and extracted data of relatively low quality. Data quality is commonly conceived as fitness for use. Consequently, a key challenge is to determine the data quality with respect to a particular use case. In this article, we present the results of a systematic review of approaches for assessing the data quality of LOD. We gather existing approaches and compare and group them under a common classification scheme. In particular, we unify and formalise commonly used terminologies across papers related to data quality. Additionally, a comprehensive list of the dimensions and metrics is presented. The aim of this article is to provide researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and the development of new approaches focused towards data quality.

Keywords: data quality, assessment, survey, Linked Data

1. Introduction

The development and standardization of semantic web technologies have resulted in an unprecedented volume of data being published on the Web as Linked Open Data (LOD). This emerging Web of Data comprises close to 50 billion facts represented as RDF triples. Although gathering and publishing such massive amounts of data is certainly a step in the right direction, data is only as useful as its quality. Datasets published on the Data Web already cover a diverse set of domains. Specifically, biological and health care data is available as Linked Data in a great variety, covering areas such as drugs, clinical trials, proteins, and diseases. However, data on the Web reveals a large variation in data quality. For example, data extracted from semi-structured or even unstructured sources, such as DBpedia, often contains inconsistencies as well as misrepresented and incomplete information.

Data quality is commonly conceived as fitness for use [31,57,34] for a certain application or use case. However, even datasets with quality problems might be useful for certain applications, as long as the quality is in the required range. For example, in the case of DBpedia, the data quality is perfectly sufficient for enriching Web search with facts or suggestions about common sense information, such as entertainment topics. In such a scenario, DBpedia can be used to show related movies and personal information when a user searches for an actor. In this case, it is rather negligible when, in relatively few cases, a related movie or some personal fact is missing.



For developing a medical application, on the other hand, the quality of DBpedia is probably completely insufficient. It should be noted that even the traditional document-oriented Web has content of varying quality and is still perceived to be extremely useful by most people. Consequently, a key challenge is to determine the quality of datasets published on the Web and make this quality information explicit. Assuring data quality is particularly a challenge in LOD, as it involves a set of autonomously evolving data sources. Other than on the document Web, where information quality can be only indirectly (e.g. via page rank) or vaguely defined, there are much more concrete and measurable data quality metrics available for structured information. Such data quality metrics include correctness of facts, adequacy of semantic representation, or degree of coverage.

There are already many methodologies and frameworks available for assessing data quality, all addressing different aspects of this task by proposing appropriate methodologies, measures and tools. In particular, the database community has developed a number of approaches [47,35,5,63]. However, data quality on the Web of Data also includes a number of novel aspects, such as coherence via links to external datasets, data representation quality, or consistency with regard to implicit information. Furthermore, inference mechanisms for knowledge representation formalisms on the web, such as OWL, usually follow an open world assumption, whereas databases usually adopt closed world semantics. Despite quality in LOD being an essential concept, few efforts are currently in place to standardize how quality tracking and assurance should be implemented. Moreover, there is no consensus on how the data quality dimensions and metrics should be defined.

Therefore, in this paper we present the findings of a systematic review of existing approaches that focus on assessing the quality of Linked Data. After performing an exhaustive survey and filtering articles based on their titles, we retrieved a corpus of 118 relevant articles published between 2002 and 2012. Further analyzing these 118 retrieved articles, a total of 21 papers were found relevant for our survey and form the core of this paper. These 21 approaches are compared in detail and unified with respect to:

– commonly used terminologies related to data quality,
– 26 different dimensions and their formalized definitions,
– metrics for each of the dimensions, along with a distinction between them being subjective or objective, and
– methodologies and supported tools used to assess data quality.

Our goal is to provide researchers and those implementing data quality protocols with a comprehensive understanding of the existing work, thereby encouraging further experimentation and new approaches.

This paper is structured as follows. In Section 2, we describe the survey methodology used to conduct the systematic review. In Section 3, we unify and formalize (a) the terminologies related to data quality, (b) definitions for each of the data quality dimensions and (c) metrics for each of the dimensions. In Section 4, we compare the selected approaches based on different perspectives, such as (a) dimensions, (b) metrics, (c) type of data and (d) level of automation, and (e) we compare three particular tools to gauge their usability for data quality assessment. In Section 5, we conclude with ideas for future work.

2. Survey Methodology

This systematic review was conducted by two reviewers from different institutions, following the systematic review procedures described in [33,44]. A systematic review can be conducted for several reasons, such as: (a) the summarisation and comparison, in terms of advantages and disadvantages, of various approaches in a field; (b) the identification of open problems; (c) the contribution of a joint conceptualization comprising the various approaches developed in a field; or (d) the synthesis of a new idea to cover the emphasized problems. This systematic review tackles, in particular, problems (a)-(c), in that it summarises and compares various data quality assessment methodologies, as well as identifies open problems related to Linked Open Data. Moreover, a conceptualization of the data quality assessment field is proposed. An overview of our search methodology and the number of retrieved articles at each step is shown in Figure 1.

Related surveys. In order to justify the need of conducting a systematic review, we first conducted a search for related surveys and literature reviews. We came across a study [34] conducted in 2005,


which summarizes 12 widely accepted information quality frameworks applied on the World Wide Web. The study compares the frameworks and identifies 20 dimensions common between them. Additionally, there is a comprehensive review [3] which surveys 13 methodologies for assessing the data quality of datasets available on the Web in structured or semi-structured formats. Our survey is different in that it focuses only on structured data and on approaches that aim at assessing the quality of LOD. Additionally, the prior review only focused on the data quality dimensions identified in the constituent approaches. In our survey, we not only identify existing dimensions but also introduce new dimensions relevant for assessing the quality of Linked Data. Furthermore, we describe quality assessment metrics corresponding to each of the dimensions and also identify whether they can be objectively or subjectively measured.

Research question. The goal of this review is to analyze existing methodologies for assessing the quality of structured data, with particular interest in Linked Data. To achieve this goal, we aim to answer the following general research question:

How can we assess the quality of Linked Data employing a conceptual framework integrating prior approaches?

We can divide this general research question into further sub-questions, such as:

– What are the data quality problems that each approach assesses?
– Which are the data quality dimensions and metrics supported by the proposed approaches?
– What kind of tools are available for data quality assessment?

Eligibility criteria. As a result of a discussion between the two reviewers, a list of eligibility criteria was obtained as follows:

– Inclusion criteria:

  * Studies published in English between 2002 and 2012.
  * Studies focused on data quality assessment for Linked Data.
  * Studies focused on provenance assessment of Linked Data.
  * Studies that proposed and implemented an approach for data quality assessment.
  * Studies that assessed the quality of Linked Data and reported issues.

– Exclusion criteria:

  * Studies that were not peer-reviewed or published.
  * Assessment methodologies that were published as a poster abstract.
  * Studies that focused on data quality management methodologies.
  * Studies that neither focused on Linked Data nor on other forms of structured data.
  * Studies that did not propose any methodology or framework for the assessment of quality in LOD.

Search strategy. Search strategies in a systematic review are usually iterative and are run separately by both members. Based on the research question and the eligibility criteria, each reviewer identified several terms that were most appropriate for this systematic review, such as data quality, data quality assessment, evaluation, methodology, improvement or linked data, which were used as follows:

– linked data AND (quality OR assessment OR evaluation OR methodology OR improvement)
– data OR quality OR data quality AND (assessment OR evaluation OR methodology OR improvement)

In our experience, searching in the title alone does not always provide us with all relevant publications. Thus, the abstract or full-text of publications should also potentially be included. On the other hand, since the search on the full-text of studies results in many irrelevant publications, we chose to apply the search query first on the title and abstract of the studies. This means a study is selected as a candidate study if its title or abstract contains the keywords defined in the search string.

After we defined the search strategy, we applied the keyword search in the following list of search engines, digital libraries, journals, conferences and their respective workshops.

Search Engines and digital libraries:

– Google Scholar
– ISI Web of Science
– ACM Digital Library
– IEEE Xplore Digital Library
– Springer Link
– Science Direct

[Fig. 1. Number of articles retrieved during the literature search.]

Journals:

– Semantic Web Journal
– Journal of Web Semantics
– Journal of Data and Information Quality
– Journal of Data and Knowledge Engineering

Conferences and their Respective Workshops:

– International Semantic Web Conference (ISWC)
– European Semantic Web Conference (ESWC)
– Asian Semantic Web Conference (ASWC)
– International World Wide Web Conference (WWW)
– Semantic Web in Provenance Management (SWPM)
– Consuming Linked Data (COLD)
– Linked Data on the Web (LDOW)
– Web Quality

Thereafter, the bibliographic metadata about the 118 potentially relevant primary studies was recorded using the bibliography management platform Mendeley1.

1 https://www.mendeley.com

Titles and abstract reviewing. Both reviewers independently screened the titles and abstracts of the retrieved 118 articles to identify the potentially eligible articles. In case of disagreement while merging the lists, the problem was resolved either by mutual consensus or by creating a list of articles to go under a more detailed review. Then, both reviewers compared the articles and, based on mutual agreement, obtained a final list of 64 articles to be included.

Retrieving further potential articles. In order to ensure that all relevant articles were included, the following additional strategies were applied:

– Looking up the references in the selected articles.
– Looking up the article title in Google Scholar and retrieving the "Cited By" articles to check against the eligibility criteria.
– Taking each data quality dimension individually and performing a related article search.

After performing these search strategies, we retrieved 4 additional articles that matched the eligibility criteria.


Extracting data for quantitative and qualitative analysis. As a result of the search, we retrieved 21 papers from 2002 to 2012, listed in Table 1, which are the core of our survey. Of these 21 papers, 9 articles focus on provenance-related quality assessment and 12 propose generalized methodologies.

Table 1: List of the selected papers.

Citation | Title
Gil et al., 2002 [18] | Trusting Information Sources One Citizen at a Time
Golbeck et al., 2003 [21] | Trust Networks on the Semantic Web
Mostafavi et al., 2004 [45] | An ontology-based method for quality assessment of spatial data bases
Golbeck, 2006 [20] | Using Trust and Provenance for Content Filtering on the Semantic Web
Gil et al., 2007 [17] | Towards content trust of web resources
Lei et al., 2007 [38] | A framework for evaluating semantic metadata
Hartig, 2008 [23] | Trustworthiness of Data on the Web
Bizer et al., 2009 [6] | Quality-driven information filtering using the WIQA policy framework
Böhm et al., 2010 [7] | Profiling linked open data with ProLOD
Chen et al., 2010 [12] | Hypothesis generation and data quality assessment through association mining
Flemming, 2010 [14] | Assessing the quality of a Linked Data source
Hogan et al., 2010 [26] | Weaving the Pedantic Web
Shekarpour et al., 2010 [53] | Modeling and evaluation of trust with an extension in semantic web
Fürber et al., 2011 [15] | SWIQA – a semantic web information quality assessment framework
Gamble et al., 2011 [16] | Quality, Trust and Utility of Scientific Data on the Web: Towards a Joint Model
Jacobi et al., 2011 [29] | Rule-Based Trust Assessment on the Semantic Web
Bonatti et al., 2011 [8] | Robust and scalable linked data reasoning incorporating provenance and trust annotations
Guéret et al., 2012 [22] | Assessing Linked Data Mappings Using Network Measures
Hogan et al., 2012 [27] | An empirical survey of Linked Data conformance
Mendes et al., 2012 [42] | Sieve: Linked Data Quality Assessment and Fusion
Rula et al., 2012 [50] | Capturing the Age of Linked Open Data: Towards a Dataset-independent Framework

Comparison perspective of selected approaches. There exist several perspectives that can be used to analyze and compare the selected approaches, such as:

– the definitions of the core concepts,
– the dimensions and metrics proposed by each approach,
– the type of data that is considered for the assessment,
– the level of automation of the supported tools.

Selected approaches differ in how they consider all of these perspectives and are thus compared and described in Section 3 and Section 4.

Quantitative overview. Out of the 21 selected approaches, only 5 (23%) were published in a journal, particularly only in the Journal of Web Semantics. On the other hand, 14 (66%) approaches were published in international conferences or workshops, such as WWW, ISWC and ICDE. Only 2 (11%) of the approaches were master theses or PhD workshop papers. The majority of the papers was published evenly distributed between the years 2010 and 2012 (4 papers each year – 57%), 2 papers were published in 2009 (9.5%) and the remaining 7 between 2002 and 2008 (33.5%).

3. Conceptualization

There exist a number of discrepancies in the definition of many concepts in data quality, due to the contextual nature of quality [3]. Therefore, we first describe and formally define the research context terminology (in Section 3.1) as well as the Linked Data quality dimensions (in Section 3.2), along with their respective metrics, in detail.

3.1. General terms

RDF Dataset. In this document, we understand a data source as an access point for Linked Data in the Web. A data source provides a dataset and may support multiple methods of access. The terms RDF triple, RDF graph and RDF dataset have been adopted from the W3C Data Access Working Group [4,24,10].

[Fig. 2. Linked Data quality dimensions and the relations between them, grouped into the Contextual, Trust, Intrinsic, Accessibility, Representational and Dataset Dynamicity categories. The dimensions marked with a * are newly introduced specifically for Linked Data.]

Given an infinite set U of URIs (resource identifiers), an infinite set B of blank nodes, and an infinite set L of literals, a triple ⟨s, p, o⟩ ∈ (U ∪ B) × U × (U ∪ B ∪ L) is called an RDF triple, where s, p and o are the subject, the predicate and the object of the triple, respectively. An RDF graph G is a set of RDF triples. A named graph is a pair ⟨G, u⟩, where G is an RDF graph and u ∈ U is its name. An RDF dataset is a set consisting of a default graph and named graphs: {G, (u1, G1), (u2, G2), ..., (un, Gn)}.
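To make these definitions concrete, the following is a minimal illustrative sketch (ours, not from any of the surveyed approaches) using the Python rdflib library; all URIs and the graph name are invented for the example.

```python
# Minimal sketch of the formal definitions above, using rdflib.
# All URIs and the graph name are invented for illustration.
from rdflib import Dataset, Literal, URIRef

ds = Dataset()  # an RDF dataset: a default graph plus named graphs

s = URIRef("http://example.org/airport/LEJ")  # subject in U
p = URIRef("http://example.org/locatedIn")    # predicate in U
o = Literal("Leipzig")                        # object in U ∪ B ∪ L

ds.add((s, p, o))  # an RDF triple, added to the default graph G

# A named graph (u1, G1): u1 is the graph's URI, G1 its set of triples.
g1 = ds.graph(URIRef("http://example.org/graphs/airports"))
g1.add((s, URIRef("http://example.org/iataCode"), Literal("LEJ")))

print(ds.serialize(format="nquads"))
```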

Data Quality. Data quality is commonly conceived as a multidimensional construct, with a popular definition being "fitness for use" [31]. Data quality may depend on various factors such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability and verifiability [57].

In terms of the Semantic Web, there are varying concepts of data quality. The semantic metadata, for example, is an important concept to be considered when assessing the quality of datasets [37]. On the other hand, the notion of link quality is another important aspect introduced in Linked Data, where it is automatically detected whether a link is useful or not [22]. Also, it is to be noted that data and information are used interchangeably in the literature.

Data Quality Problems. Bizer et al. [6] relate data quality problems to the design of web-based information systems which integrate information from different providers. In [42], the problem of data quality is related to values being in conflict between different data sources as a consequence of the diversity of the data.

In [14], the author does not provide a definition but implicitly explains the problems in terms of data diversity. In [26], the authors discuss errors, noise and difficulties, and in [27] the authors discuss modelling issues which prevent applications from exploiting the data.

Thus, the term data quality problems refers to a set of issues that can affect the potential of the applications that use the data.

Data Quality Dimensions and Metrics. Data quality assessment involves the measurement of quality dimensions or criteria that are relevant to the consumer. A data quality assessment metric or measure is a procedure for measuring an information quality dimension [6]. These metrics are heuristics that are designed to fit a specific assessment situation [39]. Since the dimensions are rather abstract concepts, the assessment metrics rely on quality indicators that allow for the assessment of the quality of a data source w.r.t. the criteria [14].


An assessment score is computed from these indicators using a scoring function.
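As an illustration of this last point, the following is a minimal sketch of a weighted scoring function that aggregates indicator values into an assessment score; the indicator names and weights are invented and not taken from any of the surveyed tools.

```python
# Minimal sketch of a scoring function: an assessment score computed as a
# weighted average of quality indicator values normalised to [0, 1].
# Indicator names and weights are illustrative only.

def assessment_score(indicators: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights[name] for name in indicators)
    return sum(weights[name] * value for name, value in indicators.items()) / total_weight

indicators = {"property_completeness": 0.92, "interlinking": 0.35, "license_present": 1.0}
weights = {"property_completeness": 0.5, "interlinking": 0.3, "license_present": 0.2}

print(assessment_score(indicators, weights))  # 0.765
```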

In [6], the data quality dimensions are classified into three categories according to the type of information that is used as a quality indicator: (1) Content Based – the information content itself; (2) Context Based – information about the context in which information was claimed; and (3) Rating Based – based on ratings about the data itself or the information provider.

However, we identify further dimensions (defined in Section 3.2) and also further categories to classify the dimensions, namely (1) Contextual, (2) Trust, (3) Intrinsic, (4) Accessibility, (5) Representational and (6) Dataset Dynamicity dimensions, as depicted in Figure 2.

Data Quality Assessment Method. A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' need in a specific use case [6]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.

3.2. Linked Data quality dimensions

After analyzing the 21 selected approaches in detail, we identified a core set of 26 different data quality dimensions that can be applied to assess the quality of Linked Data. In this section, we formalize and adapt the definition for each dimension to the Linked Data context. The metrics associated with each dimension are also identified and reported. Additionally, we group the dimensions following a classification introduced in [57], which is further modified and extended as:

– Contextual dimensions
– Trust dimensions
– Intrinsic dimensions
– Accessibility dimensions
– Representational dimensions
– Dataset dynamicity

These groups are not strictly disjoint but can partially overlap. Also, the dimensions are not independent from each other; correlations exist among dimensions in the same group or between groups. Figure 2 shows the classification of the dimensions into the above mentioned groups, as well as the inter- and intra-relations between them.

Use case scenario. Since data quality is described as fitness to use, we introduce a specific use case that will allow us to illustrate the importance of each dimension with the help of an example. Our use case is about an intelligent flight search engine, which relies on acquiring (aggregating) data from several datasets. It obtains information about airports and airlines from an airline dataset (e.g. OurAirports2, OpenFlights3). Information about the location of countries, cities and particular addresses is obtained from a spatial dataset (e.g. LinkedGeoData4). Additionally, aggregators in RDF pull all the information related to airlines from different data sources (e.g. Expedia5, Tripadvisor6, Skyscanner7, etc.), which allows a user to query the integrated dataset for a flight from any start and end destination for any time period. We will use this scenario throughout this section as an example of how data quality influences fitness to use.

3.2.1. Contextual dimensions

Contextual dimensions are those that highly depend on the context of the task at hand, as well as on the subjective preferences of the data consumer. There are three dimensions that are part of this group: completeness, amount-of-data and relevancy. They are listed, along with a comprehensive list of their corresponding metrics, in Table 2. The reference for each metric is provided in the table.

3.2.1.1. Completeness. Completeness is defined as "the degree to which information is not missing" in [5]. This dimension is further classified in [15] into the following categories: (a) Schema completeness, which is the degree to which entities and attributes are not missing in a schema; (b) Column completeness, which is a function of the missing values for a specific property or column; and (c) Population completeness, which refers to the ratio of entities represented in an information system to the complete population. In [42], completeness is distinguished on the schema level and the data level. On the schema level, a dataset is complete if it contains all of the attributes needed for a given task. On the data (i.e. instance) level, a dataset is complete if it contains all of the necessary objects for a given task. As can be observed, the authors in [5] give a general definition, whereas the authors in [15] provide a set of sub-categories for completeness.

2 http://thedatahub.org/dataset/ourairports
3 http://thedatahub.org/dataset/open-flights
4 linkedgeodata.org
5 http://www.expedia.com
6 http://www.tripadvisor.com
7 http://www.skyscanner.de


Table 2: Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of the no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using the vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S

On the other hand, in [42], the two types defined, i.e. the schema and data level completeness, can be mapped to the two categories (a) Schema completeness and (c) Population completeness provided in [15].

Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, which can thus be called ontology completeness; (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset and comparing it with the original dataset (or a gold standard dataset). It should be noted that in this case users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
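As an illustration of the property completeness case, the sketch below computes the share of instances of a class that carry a value for a given property, assuming the dataset fits in memory; the file name, class and property are invented for the flight use case.

```python
# Minimal sketch of the property completeness metric: the share of
# instances of a class that carry a value for a given property.
# File name, class and property are invented for the flight use case.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph().parse("airports.ttl", format="turtle")

airports = set(g.subjects(RDF.type, EX.Airport))
with_code = {a for a in airports if g.value(a, EX.iataCode) is not None}

completeness = len(with_code) / len(airports) if airports else 0.0
print(f"Property completeness for ex:iataCode: {completeness:.2f}")
```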

Example. In our use case, the flight search engine should contain complete information, so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, low price is of high importance. But looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases when there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data. The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14] and should be enough to approximate the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data covering all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large amount of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy. In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, the retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees, etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through it efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions. For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevancy dimension.

3.2.2. Trust dimensions

Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance. There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities in order to enable trust, assess authenticity and allow reproducibility.


Table 3: Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief and 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows on a scale of 1–9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm which utilises statistical techniques for computing trust values between 2 entities through a path and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the sources' trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the use of the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community, questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an interdependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability. Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is used as the means a consumer is provided with to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.
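As a simplified illustration of the document-level check (using, for brevity, a published digest as a stand-in for a full asymmetric signature over a canonicalised graph), consider the following sketch; the file name and digest are placeholders.

```python
# Simplified sketch: verifying a detached SHA-256 digest over an RDF
# document. A real deployment would sign a canonicalised graph with an
# asymmetric key; the file name and digest below are placeholders.
import hashlib

def digest_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

published_digest = "0000...placeholder"  # digest the data source vouched for

if digest_of("flights.rdf") == published_digest:
    print("Document matches what the source vouched for.")
else:
    print("Document was altered or corrupted in transit.")
```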

3.2.2.3. Reputation. The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose tracking reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example, a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation; they can be classified into manual and (semi-)automated approaches. The manual approach is via a survey in a community or by questioning other members who can help to determine the reputation of a source or of the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case, the search engine can decide to trust only the source with the higher reputation.

8 http://open-biomed.sourceforge.net/opmv/ns.html
9 http://www.w3.org/TR/prov-o


Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher, and the reliability of the statements.
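The following sketch (our illustration, not the exact algorithms of [53]) shows the common pattern behind such web-of-trust metrics: trust is propagated along each path by multiplication and aggregated over all paths with a weighting that favours shorter paths.

```python
# Illustrative sketch: propagate trust along each path through the web of
# trust by multiplication, then aggregate over all paths with weights
# favouring shorter paths. Values are invented.
import math

def path_trust(edge_values: list[float]) -> float:
    return math.prod(edge_values)  # propagation along one path

def aggregate_trust(paths: list[list[float]]) -> float:
    weights = [1.0 / len(p) for p in paths]  # shorter paths weigh more
    scores = [path_trust(p) for p in paths]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Two acquaintance paths from a user to a data source:
print(aggregate_trust([[0.9, 0.8], [0.7]]))  # ≈ 0.71
```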

3.2.2.4. Believability. In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is "true" [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies, such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source, and their prior experience [29]. Either all of the information that is provided can be trusted, if it is well-known, or only by looking at the data source can the decision about its credibility and believability be determined. Another method, proposed by Tim Berners-Lee, was that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10.

10 http://www.w3.org/DesignIssues/UI.html

Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

3.2.2.5. Licensing. "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions for data re-use.
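As an illustration, the sketch below checks for a machine-readable license statement in a voiD description, assuming dcterms:license is used, as is common practice; the file name is invented.

```python
# Minimal sketch: detecting a machine-readable license statement in a
# voiD description, assuming dcterms:license is used (common practice).
# The file name is invented.
from rdflib import Graph
from rdflib.namespace import DCTERMS

g = Graph().parse("void.ttl", format="turtle")

licenses = list(g.objects(predicate=DCTERMS.license))
if licenses:
    print("Machine-readable license(s):", *licenses)
else:
    print("No machine-readable license statement detected.")
```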

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data, and to produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or the license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl
12 http://vocab.deri.ie/void


Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions that are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
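As an illustration of the outlier-based metric, the following sketch flags values of a numeric property that deviate strongly from the mean; the data and threshold are invented.

```python
# Illustrative sketch of deviation-based outlier detection over the
# numeric values of one property (e.g. flight durations in hours).
# Data and threshold are invented.
from statistics import mean, stdev

def outliers(values: list[float], z: float = 3.0) -> list[float]:
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > z]

durations = [7.5, 8.0, 7.8, 8.2, 7.9, 80.0]  # one implausible value
print(outliers(durations, z=2.0))  # [80.0]
```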

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity can not be measured qualitatively, but only indirectly, by checking the authenticity of the source responsible for the information, that is, whether the dataset is neutral or whether the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the given information with other datasets providing the same information.

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e., those that do not contain useful values for data entries | using association rules (with the Apriori algorithm [1]), inverse relations or foreign-key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviation-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) x (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality, sameAs chains and description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring bias in qualitative data is far more challenging. In any case, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3 Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e., its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse an RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
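As a minimal illustration, rdflib's parser can serve as a stand-in for a full RDF validator service; the file name below is a hypothetical example.

```python
# Sketch of a syntax-validity check: try to parse an RDF document and
# report the first syntax error encountered.
from rdflib import Graph

def is_syntactically_valid(path, fmt="xml"):
    """Return True if the document parses as RDF in the given format."""
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception as err:  # rdflib raises format-specific parse errors
        print(f"syntax error in {path}: {err}")
        return False

print(is_syntactically_valid("flights.rdf"))   # hypothetical document
```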

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris instead of using geo:long. This causes the query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process the data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of a document, as such data can neither be processed nor can a consumer perform reasoning on it.

3.2.3.4 Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified into one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation and (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
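A minimal sketch of such network measures, computed with networkx over the owl:sameAs links of a dataset, is shown below; the input dump file is a hypothetical example, and only two of the measures named above are computed.

```python
# Sketch: interlinking degree and clustering coefficient over owl:sameAs.
import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

rdf = Graph().parse("interlinks.nt", format="nt")   # hypothetical dump

g = nx.Graph()
for s, _, o in rdf.triples((None, OWL.sameAs, None)):
    g.add_edge(str(s), str(o))

if g.number_of_nodes() > 0:
    degree = dict(g.degree())
    clustering = nx.clustering(g)   # per-node clustering coefficient
    print(f"avg interlinking degree: {sum(degree.values()) / len(degree):.2f}")
    print(f"avg clustering coefficient: "
          f"{sum(clustering.values()) / len(clustering):.2f}")
```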

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to links between different datasets, but also to internal links within a dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is the use of different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e., detecting incoherent mappings) even if no reference mapping is available is provided in [41].

3.2.3.5 Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


[Fig. 3. Different components related to the RDF representation of data: RDF and its serialisations (RDF/XML, N3, N-Triples, Turtle, RDFa), RDF-S, DAML, OWL and OWL2 with its profiles (EL, QL, RL, DL, Lite, Full), SWRL and domain-specific rules, and SPARQL with its protocol and Query Results JSON format.]

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed which come with certain inference and reasoning strategies for revealing implicit knowledge, which might then render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent with respect to the RDF inference rules but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF [32] standards, and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
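As a lightweight alternative to a full reasoner, the "entities as members of disjoint classes" metric from Table 4 can be computed directly; the following minimal sketch assumes a hypothetical local dump containing its owl:disjointWith axioms.

```python
# Sketch: ratio of entities typed with two owl:disjointWith classes
# to all typed entities in the dataset.
from rdflib import Graph
from rdflib.namespace import OWL, RDF

g = Graph().parse("dataset.ttl", format="turtle")   # hypothetical dump

violations = set()
for c1, _, c2 in g.triples((None, OWL.disjointWith, None)):
    # entities asserted to be members of both disjoint classes
    for entity in set(g.subjects(RDF.type, c1)) & set(g.subjects(RDF.type, c2)):
        violations.add(entity)

typed_entities = set(g.subjects(RDF.type, None))
if typed_entities:
    ratio = len(violations) / len(typed_entities)
    print(f"disjoint-class violation ratio: {ratio:.4f}")
```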

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and the same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6 Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and at the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' given in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
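A minimal sketch of both ratios follows. Note two simplifying assumptions: the dump file name is hypothetical, and "equivalent objects" are approximated here as subjects with identical property-value sets, whereas [42] measures intensional conciseness against a target schema.

```python
# Sketch of intensional and extensional conciseness as simple ratios.
from rdflib import Graph

g = Graph().parse("dataset.ttl", format="turtle")   # hypothetical dump

# Intensional conciseness: unique properties vs. all property usages.
all_properties = [p for _, p, _ in g]
intensional = len(set(all_properties)) / len(all_properties)

# Extensional conciseness: unique descriptions vs. all described subjects.
descriptions = {}
for s in set(g.subjects()):
    descriptions[s] = frozenset(g.predicate_objects(s))
extensional = len(set(descriptions.values())) / len(descriptions)

print(f"intensional conciseness: {intensional:.3f}")
print(f"extensional conciseness: {extensional:.3f}")
```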

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be resolved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information being stored in different ways, the extensional conciseness is reduced. This may lead to contradictory values and can be resolved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because accuracy does not heavily depend on the consumers' preference. Objectivity, on the other hand, is influenced by the user's preferences and by the type of information (e.g., the height of a building vs. a product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4 Accessibility dimensions. The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, namely availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1 Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. The former definition relates availability to the measurement of available information, rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of URIs.
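Two of these checks are easy to automate; a minimal sketch follows, with a hypothetical endpoint and resource URI.

```python
# Sketch: SPARQL endpoint accessibility and URI dereferencability.
import requests

ENDPOINT = "http://example.org/sparql"              # hypothetical
RESOURCE = "http://example.org/resource/Paris"      # hypothetical

def endpoint_available():
    """Check whether the server answers a trivial SPARQL query."""
    r = requests.get(ENDPOINT, params={"query": "ASK {}"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def dereferencable(uri):
    """Check that the URI does not answer with a 4xx/5xx error code."""
    r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                     allow_redirects=True, timeout=10)
    return r.status_code == 200

print("endpoint available:", endpoint_available())
print("URI dereferencable:", dereferencable(RESOURCE))
```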

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as a 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at the specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

The execution of queries over an integrated knowledge base can sometimes lead to low availability, due to several reasons such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided, for example by replicating servers or caching information.

3.2.4.2 Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. In [27], however, performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove expensive to support in performance- or data-intensive environments. We notice that neither of the aforementioned references provides a formal definition of performance: the former gives a general description without explaining what is meant by performance, while the latter describes an issue related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, whether a query is answered in a reasonable amount of time. Also, the detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
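A minimal sketch of the latency and scalability metrics from Table 5 follows; the endpoint is a hypothetical placeholder, the one-second latency threshold follows [14], and the 10% scalability tolerance is an assumption of this sketch.

```python
# Sketch: latency of one request vs. the average over ten requests.
import time

import requests

ENDPOINT = "http://example.org/sparql"   # hypothetical endpoint
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"

def request_seconds():
    start = time.monotonic()
    requests.get(ENDPOINT, params={"query": QUERY}, timeout=30)
    return time.monotonic() - start

single = request_seconds()
print(f"latency: {single:.2f}s (considered too slow above 1s per [14])")

# Scalability: ten requests, divided by ten, should not take much longer
# than a single request (10% tolerance is an assumption).
ten_avg = sum(request_seconds() for _ in range(10)) / 10
print(f"scalable: {ten_avg <= single * 1.1}")
```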

Example. In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance needs requires caching mechanisms.

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or of no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

Latency is the amount of time from issuing a query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs, and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, these should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, in order to provide query repeatability, explorational fluidity, as well as accessibility.

3.2.4.3 Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use it. The importance of security depends on whether the data needs to be protected and whether there is a cost if the data becomes unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, SSL can be used to keep this information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. Secure login credentials, or access via SSH or SSL, can be used as a means of keeping information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect, for which reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4 Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between the submission of a query by the user and the reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between the submission of a request by the user and the reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or the complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A slow response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are interrelated as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5 Representational dimensions. Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability, versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1 Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which should be compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.
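The related "keeping URIs short" metric from Table 6 can be automated; a minimal sketch follows, where the dump file name and the 80-character length threshold are assumptions of this sketch, not values prescribed by [27].

```python
# Sketch: flag long URIs and URIs containing query parameters.
from rdflib import Graph
from rdflib.term import URIRef

MAX_LENGTH = 80   # assumed threshold

g = Graph().parse("dataset.ttl", format="turtle")   # hypothetical dump

uris = {t for triple in g for t in triple if isinstance(t, URIRef)}
flagged = [u for u in uris if len(u) > MAX_LENGTH or "?" in u]

print(f"{len(flagged)} of {len(uris)} URIs are long or contain query parameters")
```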

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

The representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects, and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

3.2.5.2 Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner, and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves the aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.
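A minimal sketch of the vocabulary re-use ratio (also discussed in Section 4.2) follows; the dump file name and the particular list of "established" namespaces are illustrative assumptions.

```python
# Sketch: established vocabularies used / all vocabularies used.
from rdflib import Graph
from rdflib.namespace import split_uri

ESTABLISHED = {                                # assumed "established" set
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://purl.org/dc/terms/",
}

g = Graph().parse("dataset.ttl", format="turtle")   # hypothetical dump

namespaces = set()
for _, p, _ in g:
    try:
        ns, _name = split_uri(p)               # namespace of each predicate
    except ValueError:
        continue                               # URI that cannot be split
    namespaces.add(str(ns))

ratio = len(namespaces & ESTABLISHED) / len(namespaces)
print(f"vocabulary re-use ratio: {ratio:.2f}")
```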

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is then not a major problem in our search engine, since strong links are created between the datasets.

Re-using well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented across different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3 Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the terms understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can nevertheless be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. The provision of metadata about a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how the dataset can be used.
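The labelling metric from Table 6 can be computed directly; a minimal sketch, assuming a hypothetical local dump:

```python
# Sketch: share of described entities with an rdfs:label or rdfs:comment.
from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph().parse("dataset.ttl", format="turtle")   # hypothetical dump

entities = set(g.subjects())
labelled = {s for s in entities
            if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}

if entities:
    print(f"labelled entities: {len(labelled) / len(entities):.2%}")
```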

Example. Let us assume that our flight search engine allows a user to enter a start and a destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve the search performance in that case. For instance, when a user looks for a flight to US (a label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. The data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by checking the availability of the dataset in different serialisation formats and in different languages, as well as via different access methods.
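The content-negotiation metric from Table 6 can be checked by requesting the same resource with different Accept headers; a minimal sketch follows, with a hypothetical resource URI.

```python
# Sketch: which serialisations does the server provide via content
# negotiation for the same resource?
import requests

RESOURCE = "http://example.org/resource/Paris"   # hypothetical URI
FORMATS = ["application/rdf+xml", "text/turtle", "application/ld+json"]

for fmt in FORMATS:
    r = requests.get(RESOURCE, headers={"Accept": fmt}, timeout=10)
    served = r.headers.get("Content-Type", "")
    print(f"{fmt}: {'ok' if r.ok and fmt in served else 'not provided'}")
```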

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is able to understand it.

The provision of Linked Data in different languages, using language tags for literal values, contributes towards the versatility of a dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of a dataset. The provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify the accepted formats and languages by adding a corresponding Accept header to an HTTP request.

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset Dynamicity. An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 lists these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1 Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
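The "currency of statements" metric from Table 7 is a simple computation; a minimal sketch with illustrative timestamps:

```python
# Sketch of 1 - (current - last_modified) / (current - start_time) [50].
from datetime import datetime, timezone

def currency(last_modified, start_time, now=None):
    """Values close to 1 mean recently modified data."""
    now = now or datetime.now(timezone.utc)
    age = (now - last_modified).total_seconds()
    history = (now - start_time).total_seconds()
    return 1 - age / history

published = datetime(2012, 1, 1, tzinfo=timezone.utc)    # first published
modified = datetime(2012, 12, 1, tzinfo=timezone.utc)    # last modified
print(f"currency: {currency(modified, published):.3f}")
```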

Example. Consider a user who is looking for the price of a flight from Milan to Boston, and who receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold.

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time20)] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the published data has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

20 Identifies the time when the data was first published in the LOD cloud.
21 Older than the last modification time.

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that has not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

3.2.6.2 Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). If the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on the changes that are observed for a specific data value.

3.2.6.3 Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation, and excludes outdated data.
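One possible way of combining the two dimensions, not prescribed by the surveyed approaches but common in the wider data quality literature (following the classic formulation by Ballou et al.), is sketched below; the sensitivity exponent and the example values are assumptions.

```python
# Sketch: timeliness = max(0, 1 - currency / volatility) ** s,
# with currency as the age of the value and volatility as its validity span.
def timeliness(currency_age, volatility, s=1.0):
    """currency_age and volatility must use the same time unit."""
    return max(0.0, 1.0 - currency_age / volatility) ** s

# price data valid for 60s, last real-world change observed 45s ago
print(f"timeliness: {timeliness(45, 60):.2f}")
```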

Example. A user wants to book a flight from Milan to Boston, and our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine, corresponding to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile the data is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (comparison perspectives of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), the types of data (Section 4.3), the level of automation (Section 4.4) and the usability of the three specific tools (Section 4.5).

4.1 Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of the data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to this initial set of dimensions, including the problem of alternative data representation and its handling, i.e., dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit into this extended list of dimensions, such as the problem of the incoherency of interlinking between datasets, or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each of them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2 Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes in relation to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and the dimension were clearly mentioned [14,5]. For the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types, or the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity or relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but the scope and level of detail can only be measured subjectively.

4.3 Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on these different levels. The assessment can be associated with small-scale units of data, such as the assessment of RDF triples, up to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

[Table 8. Consideration of data quality dimensions in each of the included approaches. The rows list the approaches: Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011. The columns mark which of the following dimensions each approach covers: provenance, consistency, currency, volatility, timeliness, accuracy, completeness, amount-of-data, availability, understandability, relevancy, reputation, verifiability, interpretability, representational-conciseness, representational-consistency, licensing, performance, objectivity, believability, response-time, security, versatility, validity-of-documents, conciseness and interlinking.]


– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, 9 approaches assess data at both the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data at both the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, it can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can then be propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of the trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi or fully automated or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it requires a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. existence of a human-readable license), or they have to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss

23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php

Table 9. Qualitative evaluation of frameworks. (Application: G = General, S = Specific; "–" = not specified.)

Paper | Application | Goal | Type of data | Degree of automation | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | Triple, Graph | Semi-automated | yes
Golbeck et al. 2003 | G | Trust networks on the semantic web | Graph | Semi-automated | yes
Mostafavi et al. 2004 | S | Spatial data integration | Triple, Graph | – | no
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | Triple, Graph, Dataset | – | no
Gil et al. 2007 | S | Trust assessment of web resources | Graph | – | no
Lei et al. 2007 | S | Assessment of semantic metadata | Triple, Graph | – | no
Hartig 2008 | G | Trustworthiness of Data on the Web | Triple | Semi-automated | yes
Bizer et al. 2009 | G | Information filtering | Triple, Graph, Dataset | Manual | yes
Böhm et al. 2010 | G | Data integration | Triple, Graph | Semi-automated | yes
Chen et al. 2010 | G | Generating semantically valid hypotheses | Triple, Graph | – | no
Flemming et al. 2010 | G | Assessment of published data | Triple, Graph | Semi-automated | yes
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | Triple, Graph, Dataset | Semi-automated | yes
Shekarpour et al. 2010 | G | Method for evaluating trust | Graph | – | no
Fürber et al. 2011 | G | Assessment of published data | Triple, Graph | – | no
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | Graph, Dataset | – | no
Jacobi et al. 2011 | G | Trust assessment of web resources | Graph | – | no
Bonatti et al. 2011 | G | Provenance assessment for reasoning | Triple, Graph | – | no
Guéret et al. 2012 | S | Assessment of quality of links | Graph, Dataset | Automated | yes
Hogan et al. 2012 | G | Assessment of published data | Triple, Graph, Dataset | – | no
Mendes et al. 2012 | S | Data integration | Triple | Manual | yes
Rula et al. 2012 | G | Assessment of time-related quality dimensions | Triple, Graph | – | no

Each metric is provided with two input fields, one showing the assigned weight and another with the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool with the result of assessing the quality of LinkedCT, with a score of 15 out of 100.
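The scoring scheme can be pictured as a weighted aggregation over the metric values. The following Python sketch illustrates the idea only; the exact formula used by the tool is not documented in the surveyed paper, and the metric names in the example are hypothetical:

def weighted_quality_score(metric_values, weights):
    # Aggregate per-metric values (each normalised to [0, 1]) into a
    # 0-100 score using the user-assigned weights.
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(metric_values[m] * w for m, w in weights.items())
    return 100.0 * weighted_sum / total_weight

# Option (a) from the text: every metric gets weight 1.
values = {"stable_uris": 0.8, "human_readable_license": 1.0, "mailing_list": 0.0}
weights = {m: 1.0 for m in values}
print(weighted_quality_score(values, weights))  # 60.0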

On the one hand, the tool is easy to use, with form-based questions and adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality between two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function having a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lasUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lasUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
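To illustrate what such a scoring function computes, the following is a minimal Python sketch of a TimeCloseness-style function: it maps the age of the provenance timestamp to a score in [0, 1], decaying over the configured time span (7 days in Listing 1). The linear decay is our assumption; Sieve's actual implementation may differ.

from datetime import datetime, timezone

def time_closeness(last_updated, time_span_days=7.0, now=None):
    # Score 1.0 for data updated right now, decreasing linearly to
    # 0.0 for data older than `time_span_days` (assumed linear decay).
    now = now or datetime.now(timezone.utc)
    age_days = (now - last_updated).total_seconds() / 86400.0
    return max(0.0, 1.0 - age_days / time_span_days)

# Example: data updated 3.5 days ago with a 7-day span scores 0.5.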

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

Using this tool one is able to import several differ-ent file types of data (CSV Excel XML RDFXMLN-Triples or even JSON) and then perform cleaningaction via a browser-based interface By using a di-verse set of filters and facets on individual columns

27 http://code.zemanta.com/sparkica

LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation against Freebase28 helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice or based on standard SPARQL or SPARQL with full-text search is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs

specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only few approaches were actually accompanied by an implemented tool; that is, out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration and others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.

References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] Brickley, D., and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).

[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).

[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] Kifer, M., and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] Kitchenham, B. Procedures for performing systematic reviews.

[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).

[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. An ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).

[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 2: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

2 Linked Data Quality

glectable when in relatively few cases a related movieor some personal fact is missing For developing amedical application on the other hand the quality ofDBpedia is probably completely insufficient It shouldbe noted that even the traditional document-orientedWeb has content of varying quality and is still per-ceived to be extremely useful by most people Con-sequently a key challenge is to determine the qualityof datasets published on the Web and make this qual-ity information explicit Assuring data quality is par-ticularly a challenge in LOD as it involves a set of au-tonomously evolving data sources Other than on thedocument Web where information quality can be onlyindirectly (eg via page rank) or vaguely defined thereare much more concrete and measurable data qualitymetrics available for structured information Such dataquality metrics include correctness of facts adequacyof semantic representation or degree of coverage

There are already many methodologies and frame-works available for assessing data quality all address-ing different aspects of this task by proposing appro-priate methodologies measures and tools In particu-lar the database community has developed a numberof approaches [4735563] However data quality onthe Web of Data also includes a number of novel as-pects such as coherence via links to external datasetsdata representation quality or consistency with regardto implicit information Furthermore inference mech-anisms for knowledge representation formalisms onthe web such as OWL usually follow an open worldassumption whereas databases usually adopt closedworld semantics Despite the quality in LOD being anessential concept few efforts are currently in place tostandardize how quality tracking and assurance shouldbe implemented Moreover there is no consensus onhow the data quality dimensions and metrics should bedefined

Therefore in this paper we present the findings ofa systematic review of existing approaches that focustowards assessing the quality of Linked Data Afterperforming an exhaustive survey and filtering articlesbased on their titles we retrieved a corpus of 118 rel-evant articles published between 2002 and 2012 Fur-ther analyzing these 118 retrieved articles a total of 21papers were found relevant for our survey that form thecore of this paper These 21 approaches are comparedin detail and unified with respect to

ndash commonly used terminologies related to dataquality

ndash 26 different dimensions and their formalized def-initions

ndash metrics for each of the dimensions along with adistinction between them being subjective or ob-jective and

ndash methodologies and supported tools used to assessdata quality

Our goal is to provide researchers and those imple-menting data quality protocols with a comprehensiveunderstanding of the existing work thereby encourag-ing further experimentation and new approaches

This paper is structured as follows In Section 2 wedescribe the survey methodology used to conduct thesystematic review In Section 3 we unify and formal-ize (a) the terminologies related to data quality (b) def-initions for each of the data quality dimensions and(c) metrics for each of the dimensions In Section 4we compare the selected approaches based on differ-ent perspectives such as (a) dimensions (b) metrics(c) type of data (d) level of automation and (e) com-paring three particular tools to gauge their usabilityfor data quality assessment In Section 5 we concludewith ideas for future work

2 Survey Methodology

This systematic review was conducted by two re-viewers from different institutions following the sys-tematic review procedures described in [3344] A sys-tematic review can be conducted for several reasonssuch as (a) the summarisation and comparison interms of advantages and disadvantages of various ap-proaches in a field (b) the identification of open prob-lems (c) the contribution of a joint conceptualiza-tion comprising the various approaches developed ina field or (d) the synthesis of a new idea to cover theemphasized problems This systematic review tack-les in particular the problems (a)-(c) in that it sum-marises and compares various data quality assessmentmethodologies as well as identifies open problems re-lated to Linked Open Data Moreover a conceptualiza-tion of the data quality assessment field is proposedAn overview of our search methodology and the num-ber of retrieved articles at each step is shown in Figure1

Related surveys In order to justify the need ofconducting a systematic review we first conducteda search for related surveys and literature reviewsWe came across a study [34] conducted in 2005

Linked Data Quality 3

which summarizes 12 widely accepted informationquality frameworks applied on the World Wide WebThe study compares the frameworks and identifies20 dimensions common between them Additionallythere is a comprehensive review [3] which surveys13 methodologies for assessing the data quality ofdatasets available on the Web in structured or semi-structured formats Our survey is different in that itfocuses only on structured data and on approaches thataim at assessing the quality of LOD Additionally theprior review only focused on the data quality dimen-sions identified in the constituent approaches In oursurvey we not only identify existing dimensions butalso introduce new dimensions relevant for assessingthe quality of Linked Data Furthermore we describequality assessment metrics corresponding to each ofthe dimensions and also identified whether they can beobjectively or subjectively measured

Research question The goal of this review is to an-alyze existing methodologies for assessing the qualityof structured data with particular interest in LinkedData To achieve this goal we aim to answer the fol-lowing general research question

How can we assess the quality of Linked Data em-ploying a conceptual framework integrating prior ap-proaches

We can divide this general research question intofurther sub-questions such as

ndash What are the data quality problems that each ap-proach assesses

ndash Which are the data quality dimensions and met-rics supported by the proposed approaches

ndash What kind of tools are available for data qualityassessment

Eligibility criteria As a result of a discussion be-tween the two reviewers a list of eligibility criteria wasobtained as follows

ndash Inclusion criteria

lowast Studies published in English between 2002 and2012

lowast Studies focused on data quality assessment forLinked Data

lowast Studies focused on provenance assessment ofLinked Data

lowast Studies that proposed and implemented an ap-proach for data quality assessment

lowast Studies that assessed the quality of LinkedData and reported issues

ndash Exclusion criteria

lowast Studies that were not peer-reviewed or pub-lished

lowast Assessment methodologies that were publishedas a poster abstract

lowast Studies that focused on data quality manage-ment methodologies

lowast Studies that neither focused on Linked Datanor on other forms of structured data

lowast Studies that did non propose any methodologyor framework for the assessment of quality inLOD

Search strategy Search strategies in a systematic re-view are usually iterative and are run separately byboth members Based on the research question andthe eligibility criteria each reviewer identified severalterms that were most appropriate for this systematic re-view such as data quality data quality assessmentevaluation methodology improvement or linked datawhich were used as follows

ndash linked data and (quality OR assessment OR eval-uation OR methodology OR improvement)

ndash data OR quality OR data quality AND assess-ment OR evaluation OR methodology OR im-provement

In our experience searching in the title alone doesnot always provide us with all relevant publicationsThus the abstract or full-text of publications shouldalso potentially be included On the other hand sincethe search on the full-text of studies results in manyirrelevant publications we chose to apply the searchquery first on the title and abstract of the studies Thismeans a study is selected as a candidate study if its titleor abstract contains the keywords defined in the searchstring

After we defined the search strategy we applied thekeyword search in the following list of search enginesdigital libraries journals conferences and their respec-tive workshopsSearch Engines and digital libraries

ndash Google Scholarndash ISI Web of Sciencendash ACM Digital Libraryndash IEEE Xplore Digital Libraryndash Springer Linkndash Science Direct

4 Linked Data Quality

ACM Digital Library

Google Scholar

IEEE Xplore Digital Library

Science Direct

Springer Link

ISI Web of Science

Semantic Web Journal

Journal of Data and Information

Quality

Journal of Web Semantics

Search EnginesDigital Libraries

Journals

Scan article titles based on

inclusionexclusion

criteria

Import to Mendeley and

remove duplicates

Review Abstracts and includeexclude

articles Compare potentially shortlisted

articles among reviewers

10200

382

5422

2518

536

2347

137

317

62

Step 1

Step 2

Step 3

Step 4

Total number of articles

proposing methodologies for LD QA 12

Retrieve and analyse papers

from references

Step 5

Journal of Data and Knowledge

Engineering

110

118

4Total number of articles related to provenance

in LD 9

64

Fig 1 Number of articles retrieved during literature search

Journals

ndash Semantic Web Journalndash Journal of Web Semanticsndash Journal of Data and Information Qualityndash Journal of Data and Knowledge Engineering

Conferences and their Respective Workshops

ndash International Semantic Web Conference (ISWC)ndash European Semantic Web Conference (ESWC)ndash Asian Semantic Web Conference (ASWC)ndash International World Wide Web Conference (WWW)ndash Semantic Web in Provenance Management (SWPM)ndash Consuming Linked Data (COLD)ndash Linked Data on the Web (LDOW)ndash Web Quality

Thereafter the bibliographic metadata about the 118potentially relevant primary studies were recorded us-ing the bibliography management platform Mende-ley1

1httpswwwmendeleycom

Titles and abstract reviewing Both reviewers inde-pendently screened the titles and abstracts of the re-trieved 118 articles to identify the potentially eligiblearticles In case of disagreement while merging thelists the problem was resolved either by mutual con-sensus or by creating a list of articles to go under amore detailed review Then both the reviewers com-pared the articles and based on mutual agreement ob-tained a final list of 64 articles to be included

Retrieving further potential articles In order to en-sure that all relevant articles were included an addi-tional strategy was applied such as

ndash Looking up the references in the selected articlesndash Looking up the article title in Google Scholar and

retrieving the Cited By articles to check againstthe eligibility criteria

ndash Taking each data quality dimension individuallyand performing a related article search

After performing these search strategies we retrieved4 additional articles that matched the eligibility crite-ria

Linked Data Quality 5

Extracting data for quantitative and qualitativeanalysis As a result of the search we retrieved 21 pa-pers from 2002 to 2012 listed in Table 1 which arethe core of our survey Of these 21 9 articles focustowards provenance related quality assessment and 12propose generalized methodologies

Table 1List of the selected papers

Citation TitleGil etal 2002 [18] Trusting Information Sources One Citi-

zen at a Time

Golbeck et al2003 [21]

Trust Networks on the Semantic Web

Mostafavi etal2004 [45]

An ontology-based method for qualityassessment of spatial data bases

Golbeck 2006 [20] Using Trust and Provenance for ContentFiltering on the Semantic Web

Gil etal 2007 [17] Towards content trust of web resources

Lei etal2007 [38]

A framework for evaluating semanticmetadata

Hartig 2008 [23] Trustworthiness of Data on the Web

Bizer etal2009 [6]

Quality-driven information filtering us-ing the WIQA policy framework

Boumlhm etal2010 [7]

Profiling linked open data with ProLOD

Chen etal2010 [12]

Hypothesis generation and data qualityassessment through association mining

Flemming etal2010 [14]

Assessing the quality of a Linked Datasource

Hoganetal2010 [26]

Weaving the Pedantic Web

Shekarpour etal2010 [53]

Modeling and evaluation of trust with anextension in semantic web

Fuumlrberetal2011 [15]

Swiqa - a semantic web informationquality assessment framework

Gamble etal2011 [16]

Quality Trust and Utility of ScientificData on the Web Towards a Joint Model

Jacobi etal2011 [29]

Rule-Based Trust Assessment on the Se-mantic Web

Bonatti et al2011 [8]

Robust and scalable linked data reason-ing incorporating provenance and trustannotations

Gueacuteret et al2012 [22]

Assessing Linked Data Mappings UsingNetwork Measures

Hogan etal2012 [27]

An empirical survey of Linked Dataconformance

Mendes etal2012 [42]

Sieve Linked Data Quality Assessmentand Fusion

Rula etal2012 [50]

Capturing the Age of Linked OpenData Towards a Dataset-independentFramework

Comparison perspective of selected approachesThere exist several perspectives that can be used to an-alyze and compare the selected approaches such as

ndash the definitions of the core conceptsndash the dimensions and metrics proposed by each ap-

proachndash the type of data that is considered for the assess-

mentndash the level of automation of the supported tools

Selected approaches differ in how they consider allof these perspectives and are thus compared and de-scribed in Section 3 and Section 4

Quantitative overview Out of the 21 selected ap-proaches only 5 (23) were published in a journalparticularly only in the Journal of Web SemanticsOn the other hand 14 (66) approaches were pub-lished international conferences or workshops such asWWW ISWC and ICDE Only 2 (11) of the ap-proaches were master thesis and or PhD workshop pa-pers The majority of the papers was published evenlydistributed between the years 2010 and 2012 (4 paperseach year - 57) 2 papers were published in 2009(95) and the remaining 7 between 2002 and 2008(335)

3 Conceptualization

There exist a number of discrepancies in the defini-tion of many concepts in data quality due to the contex-tual nature of quality [3] Therefore we first describeand formally define the research context terminology(in Section 31) as well as the Linked Data quality di-mensions (in Section 32) along with their respectivemetrics in detail

31 General terms

RDF Dataset In this document we understand adata source as an access point for Linked Data in theWeb A data source provides a dataset and may sup-port multiple methods of access The terms RDF tripleRDF graph and RDF datasets have been adopted fromthe W3C Data Access Working Group [42410]

Given an infinite set U of URIs (resource identi-fiers) an infinite set B of blank nodes and an infi-nite set L of literals a triple 〈s p o〉 isin (U cup B) timesU times (U cup B cup L) is called an RDF triple wheres p o are the subject the predicate and the ob-ject of the triple respectively An RDF graph G is a

6 Linked Data Quality

Completeness

RelevancyAmount of

data

Verifiability ReputationBelievability

Licensing

Objectivity

Accuracy

Validity of documents

Consistency

Timeliness

CurrencyVolatility

Interlinking

Rep Conciseness

Rep Consistency

Conciseness

Interpretability

Understandibility

Versatility

Availability

PerformanceResponse Time

Security

RepresentationContextual

Trust

Intrinsic

Accessibility

Dataset Dynamicity

overlaps

Provenance

overlaps

Two dimensionsare related

Fig 2 Linked data quality dimensions and the relations between them The dimensions marked with a are newly introduced specifically forLinked Data

set of RDF triples A named graph is a pair 〈G u〉where G is called the default graph and u isin U AnRDF dataset is a set of default and named graphs =G (u1 G1) (u2 G2) (un Gn)

Data Quality Data quality is commonly conceivedas a multidimensional construct with a popular defini-tion as the fitness for use [31] Data quality may de-pend on various factors such as accuracy timelinesscompleteness relevancy objectivity believability un-derstandability consistency conciseness availabilityand verifiability [57]

In terms of the Semantic Web there are varying con-cepts of data quality The semantic metadata for exam-ple is an important concept to be considered when as-sessing the quality of datasets [37] On the other handthe notion of link quality is another important aspectin Linked Data that is introduced where it is automati-cally detected whether a link is useful or not [22] Alsoit is to be noted that data and information are inter-changeably used in the literature

Data Quality Problems Bizer et al [6] defines dataquality problems as the choice of web-based informa-tion systems design which integrate information from

different providers In [42] the problem of data qualityis related to values being in conflict between differentdata sources as a consequence of the diversity of thedata

In [14] the author does not provide a definition butimplicitly explains the problems in terms of data diver-sity In [26] the authors discuss about errors or noiseor difficulties and in [27] the author discuss about mod-elling issues which are prone of the non exploitationsof those data from the applications

Thus data quality problem refers to a set of issuesthat can affect the potentiality of the applications thatuse the data

Data Quality Dimensions and Metrics Data qual-ity assessment involves the measurement of quality di-mensions or criteria that are relevant to the consumerA data quality assessment metric or measure is a pro-cedure for measuring an information quality dimen-sion [6] These metrics are heuristics that are designedto fit a specific assessment situation [39] Since the di-mensions are rather abstract concepts the assessmentmetrics rely on quality indicators that allow for the as-sessment of the quality of a data source wrt the crite-

Linked Data Quality 7

ria [14] An assessment score is computed from theseindicators using a scoring function

In [6] the data quality dimensions are classified intothree categories according to the type of informationthat is used as quality indicator (1) Content Based -information content itself (2) Context Based - infor-mation about the context in which information wasclaimed (3) Rating Based - based on the ratings aboutthe data itself or the information provider

However we identify further dimensions (defined inSection 32) and also further categories to classify thedimensions namely (1) Contextual (2) Trust (3) In-trinsic (4) Accessibility (5) Representational and (6)Dataset Dynamicity dimensions as depicted in Fig-ure 2

Data Quality Assessment Method A data qualityassessment methodology is defined as the process ofevaluating if a piece of data meets in the informationconsumers need in a specific use case [6] The processinvolves measuring the quality dimensions that are rel-evant to the user and comparing the assessment resultswith the users quality requirements

32 Linked Data quality dimensions

After analyzing the 21 selected approaches in de-tail we identified a core set of 26 different data qual-ity dimensions that can be applied to assess the qual-ity of Linked Data In this section we formalize andadapt the definition for each dimension to the LinkedData context The metrics associated with each dimen-sion are also identified and reported Additionally wegroup the dimensions following a classification intro-duced in [57] which is further modified and extendedas

ndash Contextual dimensionsndash Trust dimensionsndash Intrinsic dimensionsndash Accessibility dimensionsndash Representational dimensionsndash Dataset dynamicity

These groups are not strictly disjoint but can par-tially overlap Also the dimensions are not indepen-dent from each other but correlations exists among di-mensions in the same group or between groups Fig-ure 2 shows the classification of the dimensions in theabove mentioned groups as well as the inter and intrarelations between them

Use case scenario Since data quality is describedas fitness to use we introduce a specific use case that

will allow us to illustrate the importance of each di-mension with the help of an example Our use case isabout an intelligent flight search engine which relieson acquiring (aggregating) data from several datasetsIt obtains information about airports and airlines froman airline dataset (eg OurAirports2 OpenFlights3)Information about the location of countries cities andparticular addresses is obtained from a spatial dataset(eg LinkedGeoData4) Additionally aggregators inRDF pull all the information related to airlines fromdifferent data sources (eg Expedia5 Tripadvisor6Skyscanner7 etc) that allows a user to query the in-tegrated dataset for a flight from any start and enddestination for any time period We will use this sce-nario throughout this section as an example of howdata quality influences fitness to use

321 Contextual dimensionsContextual dimensions are those that highly depend

on the context of the task at hand as well as on thesubjective preferences of the data consumer There arethree dimensions completeness amount-of-data andrelevancy that are part of this group which are alongwith a comprehensive list of their corresponding met-rics listed in Table 2 The reference for each metric isprovided in the table

3211 Completeness Completeness is defined asldquothe degree to which information is not missing in [5]This dimension is further classified in [15] into the fol-lowing categories (a) Schema completeness which isthe degree to which entities and attributes are not miss-ing in a schema (b) Column completeness which is afunction of the missing values for a specific property orcolumn and (c) Population completeness which refersto the ratio of entities represented in an informationsystem to the complete population In [42] complete-ness is distinguished on the schema level and the datalevel On the schema level a dataset is complete if itcontains all of the attributes needed for a given taskOn the data (ie instance) level a dataset is completeif it contains all of the necessary objects for a giventask As can be observed the authors in [5] give a gen-eral definition whereas the authors in [15] provide aset of sub-categories for completeness On the other

2httpthedatahuborgdatasetourairports3httpthedatahuborgdatasetopen-flights4linkedgeodataorg5httpwwwexpediacom6httpwwwtripadvisorcom7httpwwwskyscannerde

8 Linked Data Quality

Dimension Metric Description Type

Completeness

degree to which classes and properties arenot missing

detection of the degree to which the classes and propertiesof an ontology are represented [51542]

S

degree to which values for a property arenot missing

detection of no of missing values for a specific property [515]

O

degree to which real-world objects are notmissing

detection of the degree to which all the real-world objectsare represented [5152642]

O

degree to which interlinks are not missing detection of the degree to which instances in the dataset areinterlinked [22]

O

Amount-of-data appropriate volume of data for a particulartask

no of triples instances per class internal and external linksin a dataset [1412]

O

coverage scope and level of detail [14] S

Relevancyusage of meta-information attributes counting the occurrence of relevant terms within these at-

tributes or using vector space model and assigning higherweight to terms that appear within the meta-informationattributes [5]

S

retrieval of relevant documents sorting documents according to their relevancy for a givenquery [5]

S

Table 2

Comprehensive list of data quality metrics of the contextual di-mensions how it can be measured and itrsquos type - Subjective orObjective

hand in [42] the two types defined ie the schema anddata level completeness can be mapped to the two cat-egories (a) Schema completeness and (c) Populationcompleteness provided in [15]

Definition 1 (Completeness) Completeness refers tothe degree to which all required information is presentin a particular dataset In general completeness is theextent to which data is of sufficient depth breadth andscope for the task at hand In terms of Linked Datawe classify completeness as follows (a) Schema com-pleteness the degree to which the classes and proper-ties of an ontology are represented thus can be calledontology completeness (b) Property completenessmeasure of the missing values for a specific property(c) Population completeness is the percentage of allreal-world objects of a particular type that are repre-sented in the datasets and (d) Interlinking complete-ness has to be considered especially in Linked Dataand refers to the degree to which instances in thedataset are interlinked

Metrics Completeness can be measured by detect-ing the number of classes properties values and in-terlinks that are present in the dataset compared to theoriginal dataset (or gold standard dataset) It should benoted that in this case users should assume a closed-world-assumption where a gold standard dataset isavailable and can be used to compare against

Example In our use case the flight search engineshould contain complete information so that it includesall offers for flights (population completeness) For ex-

ample a user residing in Germany wants to visit herfriend in America Since the user is a student low priceis of high importance But looking for flights individ-ually on the airlines websites shows her flights withvery expensive fares However using our flight searchengine she finds all offers even the less expensive onesand is also able to compare fares from different air-lines and choose the most affordable one The morecomplete the information for flights is including cheapflights the more visitors the site attracts Moreoversufficient interlinks between the datasets will allow herto query the integrated dataset so as to find an optimalroute from the start to the end destination (interlinkingcompleteness) in cases when there is no direct flight

Particularly in Linked Data completeness is ofprime importance when integrating datasets from sev-eral sources where one of the goals is to increase com-pleteness That is the more sources are queried themore complete the results will be The completenessof interlinks between datasets is also important so thatone can retrieve all the relevant facts about a resourcewhen querying the integrated data sources Howevermeasuring completeness of a dataset usually requiresthe presence of a gold standard or the original datasource to compare with

3212 Amount-of-data The amount-of-data dimen-sion can be defined as ldquothe extent to which the volumeof data is appropriate for the task at hand [5] ldquoTheamount of data provided by a data source influencesits usability [14] and should be enough to approximate

Linked Data Quality 9

the true scenario precisely [12] While the authorsin [5] provide a formal definition the author in [14] ex-plains the dimension by mentioning its advantages andmetrics In case of [12] we analyzed the mentionedproblem and the respective metrics and mapped themto this dimension since it best fits the definition

Definition 2 (Amount-of-data) Amount-of-data refersto the quantity and volume of data that is appropriatefor a particular task

Metrics The amount-of-data can be measuredin terms of bytes (most coarse-grained) triples in-stances andor links present in the dataset Thisamount should represent an appropriate volume of datafor a particular task that is appropriate scope and levelof detail

Example In our use case the flight search engineacquires enough amount of data so as to cover alleven small airports In addition the data also cov-ers alternative means of transportation This helps toprovide the user with better travel plans which in-cludes smaller cities (airports) For example when auser wants to travel from Connecticut to Santa Bar-bara she might not find direct or indirect flights bysearching individual flights websites But using ourexample search engine she is suggested convenientflight connections between the two destinations be-cause it contains a large amount of data so as to coverall the airports She is also offered convenient com-binations of planes trains and buses The provisionof such information also necessitates the presence ofa large amount of internal as well as externals linksbetween the datasets so as to provide a fine grainedsearch for flights between specific places

In essence this dimension conveys the importanceof not having too much unnecessary informationwhich might overwhelm the data consumer or reducequery performance An appropriate volume of data interms of quantity and coverage should be a main aimof a dataset provider In the Linked Data communitythere is often a focus on large amounts of data beingavailable However a small amount of data appropri-ate for a particular task does not violate this defini-tion The aim should be to have sufficient breadth anddepth as well as sufficient scope (number of entities)and detail (number of properties applied) in a givendataset

3213 Relevancy In [5] relevancy is explained asldquothe extent to which information is applicable andhelpful for the task at hand

Definition 3 (Relevancy) Relevancy refers to the pro-vision of information which is in accordance with thetask at hand and important to the usersrsquo query

Metrics Relevancy is highly context dependent andcan be measured by using meta-information attributesfor assessing whether the content is relevant for a par-ticular task Additionally retrieval of relevant docu-ments can be performed using a combination of hyper-link analysis and information retrieval methods

Example When a user is looking for flights be-tween any two cities only relevant information iestart and end times duration and cost per personshould be provided If a lot of irrelevant data is in-cluded in the spatial data eg post offices trees etc(present in LinkedGeoData) query performance candecrease The user may thus get lost in the silos of in-formation and may not be able to browse through itefficiently to get only what she requires

When a user is provided with relevant informationas a response to her query higher recall with respect toquery answering can be obtained Data polluted withirrelevant information affects the usability as well astypical query performance of the dataset Using exter-nal links or owlsameAs links can help to pull in ad-ditional relevant information about a resource and isthus encouraged in LOD

Relations between dimensions For instance if adataset is incomplete for a particular purpose theamount of data is often insufficient However if theamount of data is too large it could be that irrelevantdata is provided which affects the relevance dimen-sion

322 Trust dimensionsTrust dimensions are those that focus on the trust-

worthiness of the dataset There are five dimensionsthat are part of this group namely provenance veri-fiability believability reputation and licensing whichare displayed along with their respective metrics in Ta-ble 3 The reference for each metric is provided in thetable

3221 Provenance There are many definitions inthe literature that emphasize different views of prove-nance

Definition 4 (Provenance) Provenance refers to thecontextual metadata that focuses on how to representmanage and use information about the origin of thesource Provenance helps to describe entities to enabletrust assess authenticity and allow reproducibility

10 Linked Data Quality

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief, 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for the people a person knows on a scale of 1-9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between two entities through a path, and (2) an aggregation algorithm, based on a weighting mechanism, for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the source's trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data, such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the VoID description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O

Table 3: Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).

Metrics: Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.
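As a minimal sketch of the first provenance metric in Table 3 (indication of metadata about a dataset), the following Python snippet checks whether a dataset description states a title, a publisher and a source. The use of rdflib, the Dublin Core terms and the example.org URIs are assumptions made for illustration only.

    from rdflib import Graph, URIRef
    from rdflib.namespace import DCTERMS

    DATASET = URIRef("http://example.org/flights/dataset")  # hypothetical dataset URI

    g = Graph()
    g.parse("http://example.org/flights/void.ttl", format="turtle")  # assumed VoID description

    # Presence of basic provenance metadata: title, publisher and source.
    for name, predicate in (("title", DCTERMS.title),
                            ("publisher", DCTERMS.publisher),
                            ("source", DCTERMS.source)):
        present = (DATASET, predicate, None) in g
        print(f"{name}: {'present' if present else 'missing'}")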

Example: Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2 Verifiability

Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is described as the means a consumer is provided with to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics: Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example: In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows the crawler to restrict itself to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.

8 http://open-biomed.sourceforge.net/opmv/ns.html
9 http://www.w3.org/TR/prov-o/
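As a sketch of this graph-signing mechanism (assuming rdflib; the file name is hypothetical, and a plain SHA-256 digest stands in for a real public-key signature): since a signature must hold for any serialisation of the same graph, the graph is first canonicalised and the digest is computed over the canonical form.

    import hashlib
    from rdflib import Graph
    from rdflib.compare import to_canonical_graph

    g = Graph()
    g.parse("flights.ttl", format="turtle")  # assumed input file

    # Canonicalise so that the digest is independent of blank-node labels
    # and serialisation order; a real deployment would sign this digest
    # with a private key rather than just printing it.
    nt = to_canonical_graph(g).serialize(format="nt")
    if isinstance(nt, bytes):  # older rdflib versions return bytes
        nt = nt.decode("utf-8")
    canonical = "\n".join(sorted(line for line in nt.splitlines() if line.strip()))
    print("canonical digest:", hashlib.sha256(canonical.encode("utf-8")).hexdigest())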

3.2.2.3 Reputation

The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics: Reputation is usually a score, for example, a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual or (semi-)automated approaches. The manual approach is via a survey in a community, or by questioning other members who can help to determine the reputation of a source or of the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.
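As a sketch of the (semi-)automated approach, a reputation score can be approximated by running PageRank over a graph of links or trust assertions between sources; the networkx library and the toy edge list below are assumptions made for the example.

    import networkx as nx

    # Hypothetical web of trust: an edge (a, b) means source a links to,
    # or vouches for, source b.
    G = nx.DiGraph([
        ("aggregator.example", "airlineA.example"),
        ("aggregator.example", "airlineB.example"),
        ("airlineA.example", "airlineB.example"),
        ("airlineB.example", "airlineA.example"),
    ])

    # PageRank as a simple reputation proxy: sources referenced by many
    # well-referenced sources obtain a higher score.
    for source, score in sorted(nx.pagerank(G).items(), key=lambda kv: -kv[1]):
        print(f"{source}: {score:.3f}")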

Example: The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources may report conflicting prices (or times) for a particular flight number. In that case, the search engine can decide to trust only the source with the higher reputation.

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.

3.2.2.4 Believability

In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics: Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example: In our flight search engine use case, if the flight information is provided by trusted and well-known airlines such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international airlines. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the provided information can be trusted if it is well-known, or the decision on its credibility and believability can be made only by looking at the data source. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

10 http://www.w3.org/DesignIssues/UI.html

3.2.2.5 Licensing

"In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics: Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.

Example: Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void
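A minimal sketch of the machine-readable licence check (assuming rdflib and a hypothetical VoID description at an example.org URL): it simply looks for dcterms:license triples.

    from rdflib import Graph
    from rdflib.namespace import DCTERMS

    g = Graph()
    g.parse("http://example.org/flights/void.ttl", format="turtle")  # assumed VoID file

    # Machine-readable indication of a license, as in Table 3.
    licenses = list(g.objects(predicate=DCTERMS.license))
    if licenses:
        print("machine-readable license(s):", ", ".join(str(l) for l in licenses))
    else:
        print("no machine-readable license found")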


Relations between dimensions: Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability and vice-versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3 Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1 Accuracy

In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real-world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real-world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics: Accuracy can be measured by checking the correctness of the data in a data source, that is, by the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or an open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
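As a sketch of the outlier-detection metric, a simple distance-based test (here, values more than two standard deviations from the mean) can flag suspicious numeric values; the flight durations below are fabricated for illustration.

    import statistics

    # Hypothetical flight durations (in hours) for the same route,
    # collected from different sources.
    durations = [8.1, 8.3, 7.9, 8.2, 8.0, 23.0]

    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)

    # Distance-based outlier test: flag values far from the mean.
    outliers = [d for d in durations if abs(d - mean) > 2 * stdev]
    print("suspect values:", outliers)  # flags 23.0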

Example: In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights from Paris in Texas to New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high-quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2 Objectivity

In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and therefore is classified as a subjective dimension.

Metrics: Objectivity cannot be measured directly, but only indirectly, by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example: In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]) or inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviation-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality, sameAs chains and description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real-world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4: Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).

However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3 Validity-of-documents

"Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics: A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
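A minimal syntax check along these lines, assuming rdflib and a hypothetical input file: parsing failures surface as exceptions, which is enough to flag documents that do not conform to the grammar of their serialisation.

    from rdflib import Graph

    def is_syntactically_valid(path: str, fmt: str = "xml") -> bool:
        """Return True if the document parses as RDF in the given format."""
        try:
            Graph().parse(path, format=fmt)
            return True
        except Exception as err:  # rdflib raises parser-specific errors
            print(f"{path}: {err}")
            return False

    print(is_syntactically_valid("flights.rdf"))  # assumed input file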

Example: In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as such data can neither be processed nor can a consumer perform reasoning on it.

3.2.3.4 Interlinking

When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics: Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
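As a sketch of such network measures, the snippet below (assuming rdflib and networkx, with a hypothetical input file) builds a graph from owl:sameAs triples and computes the average interlinking degree and clustering coefficient.

    import networkx as nx
    from rdflib import Graph
    from rdflib.namespace import OWL

    rdf = Graph()
    rdf.parse("flights.ttl", format="turtle")  # assumed input file

    # Build an undirected graph whose edges are owl:sameAs links.
    G = nx.Graph()
    for s, o in rdf.subject_objects(OWL.sameAs):
        G.add_edge(str(s), str(o))

    if G.number_of_nodes():
        degrees = [d for _, d in G.degree()]
        print("average interlinking degree:", sum(degrees) / len(degrees))
        print("average clustering coefficient:", nx.average_clustering(G))
    else:
        print("no owl:sameAs links found")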

Example: In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common that different URIs identify the same real-world object in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5 Consistency

Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


[Fig. 3: Different components related to the RDF representation of data, covering RDF serialisations (RDF/XML, N3, NT, Turtle, RDFa), query facilities (SPARQL with its protocol and JSON query results), and schema and rule layers (RDF-S, OWL/OWL2 with the EL, QL and RL profiles, DL-Lite, SWRL, domain-specific rules).]

Metrics: On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent with respect to the RDF inference rules but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example, according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data rather than to RDF and its representation formats.
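A small sketch of one consistency metric from Table 4 (entities as members of disjoint classes), assuming rdflib and a dataset whose ontology declares owl:disjointWith axioms in the same file:

    from rdflib import Graph

    g = Graph()
    g.parse("flights.ttl", format="turtle")  # assumed data plus ontology

    # Find entities typed with two classes that are declared disjoint.
    query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT DISTINCT ?entity ?c1 ?c2 WHERE {
        ?c1 owl:disjointWith ?c2 .
        ?entity a ?c1 , ?c2 .
    }"""
    violations = list(g.query(query))
    for entity, c1, c2 in violations:
        print(f"{entity} is a member of disjoint classes {c1} and {c2}")
    print("violations:", len(violations))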

Example: Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and the same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6 Conciseness

The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the absence of redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to the absence of redundant attributes, and (ii) extensional conciseness (data level), which refers to the absence of redundant objects.

Metrics: As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
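A minimal sketch of the extensional ratio, assuming rdflib, a hypothetical input file, and a deliberately naive notion of redundancy (two subjects are treated as duplicates if they share exactly the same set of property-value pairs):

    from rdflib import Graph

    g = Graph()
    g.parse("flights.ttl", format="turtle")  # assumed input file

    # Extensional conciseness: unique descriptions / total descriptions.
    descriptions = {}
    for s in set(g.subjects()):
        descriptions[s] = frozenset(g.predicate_objects(s))

    total = len(descriptions)
    unique = len(set(descriptions.values()))
    ratio = unique / total if total else 1.0
    print(f"extensional conciseness: {unique}/{total} = {ratio:.2f}")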

Example: In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this leads to low extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions: Both the accuracy and objectivity dimensions focus on the correctness of representing real-world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference, whereas objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4 Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions that are part of this group, namely availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1 Availability

Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics: Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
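A minimal availability probe, assuming the requests library and hypothetical example.org URLs: it issues a trivial SPARQL query against the endpoint and dereferences a resource URI, flagging 4xx/5xx responses.

    import requests

    ENDPOINT = "http://example.org/sparql"          # assumed endpoint
    RESOURCE = "http://example.org/flights/A123"    # assumed resource URI

    # Accessibility of the SPARQL endpoint.
    r = requests.get(ENDPOINT, params={"query": "ASK { ?s ?p ?o }"}, timeout=10)
    print("endpoint reachable:", r.ok)

    # Dereferencability of a URI: 4xx/5xx codes indicate problems.
    r = requests.get(RESOURCE,
                     headers={"Accept": "application/rdf+xml"},
                     timeout=10, allow_redirects=True)
    print("resource status:", r.status_code, "ok" if r.ok else "broken")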

Example: Let us consider the case in which a user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2 Performance

Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27] performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. We can notice that neither reference provides a formal definition of performance: the former gives a general description without explaining what is meant by performance, while the latter describes an issue related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process data.

Metrics: Performance is measured based on the scalability of the data source, that is, whether a query is answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or the usage of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
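A sketch of the latency and throughput metrics from Table 5, assuming requests and a hypothetical endpoint: latency is approximated by the mean time to answer a simple request, and throughput by the number of such requests answered per second.

    import time
    import requests

    ENDPOINT = "http://example.org/sparql"  # assumed endpoint
    QUERY = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"}

    # Latency: average response time over a handful of requests;
    # [14] considers an average above one second to be too slow.
    times = []
    for _ in range(10):
        start = time.monotonic()
        requests.get(ENDPOINT, params=QUERY, timeout=10)
        times.append(time.monotonic() - start)
    latency = sum(times) / len(times)

    print(f"average latency: {latency:.3f}s")
    # Sequential approximation of throughput; a real assessment would
    # issue concurrent requests.
    print(f"approx. throughput: {1 / latency:.1f} requests/s")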

Example: In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users.


Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or detection of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials or use of SSL or SSH | O
Security | data is of a proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5: Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).

Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3 Security

Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics: Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example: In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, SSL can be used to keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4 Response-time

Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics: Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example: A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions: The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5 Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1 Representational-conciseness

Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics: Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example: A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, in [27] the use of very long URIs, or those that contain query parameters, is regarded as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding Accept header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).

3.2.5.2 Representational-consistency

Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply not only compatibility with previous data, but also with data from other sources.

Metrics: Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.
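A minimal sketch of this check, assuming rdflib, a hypothetical input file, and an illustrative (non-exhaustive) shortlist of established vocabularies: it extracts the namespaces of all predicates used and reports how many belong to the shortlist.

    from rdflib import Graph
    from rdflib.namespace import split_uri

    # Hypothetical shortlist of established vocabularies.
    WELL_KNOWN = {
        "http://xmlns.com/foaf/0.1/",
        "http://purl.org/dc/terms/",
        "http://www.w3.org/2004/02/skos/core#",
    }

    g = Graph()
    g.parse("flights.ttl", format="turtle")  # assumed input file

    used = set()
    for p in set(g.predicates()):
        try:
            ns, _ = split_uri(p)
            used.add(str(ns))
        except ValueError:  # URIs that cannot be split are skipped
            continue

    print(f"namespaces used: {len(used)}, "
          f"well-known among them: {len(used & WELL_KNOWN)}")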

Example: In our use case, consider different airline companies using different notations for representing their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models and can thus handle different formats of data. The exchange of information in different formats is then not a major problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to perform a search for frequently used properties and classes in the LOD cloud.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community, but the conventions can still be applied to other domains.

3.2.5.3 Understandability

Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
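A minimal sketch of the labelling metric from Table 6 (ratio of entities carrying an rdfs:label or rdfs:comment), assuming rdflib and a hypothetical input file:

    from rdflib import Graph
    from rdflib.namespace import RDFS

    g = Graph()
    g.parse("flights.ttl", format="turtle")  # assumed input file

    entities = set(g.subjects())
    labelled = {s for s in entities
                if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}
    ratio = len(labelled) / max(len(entities), 1)
    print(f"labelled entities: {len(labelled)}/{len(entities)} = {ratio:.2f}")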

Example: Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset4, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability

Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

- http://rdf.freebase.com/ns/m.049jnng
- http://rdf.freebase.com/ns/m.043j22x
- Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility

In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as by the different access methods provided.

Example: Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
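A sketch of the content-negotiation check from Table 6, assuming requests and a hypothetical resource URI: the same URI is requested with different Accept headers, and the returned Content-Type indicates whether the requested format is honoured.

    import requests

    RESOURCE = "http://example.org/flights/A123"  # assumed resource URI

    for accept in ("text/turtle", "application/rdf+xml", "text/html"):
        r = requests.get(RESOURCE, headers={"Accept": accept}, timeout=10)
        served = r.headers.get("Content-Type", "unknown")
        honoured = accept in served
        print(f"asked for {accept:24} got {served:32} honoured: {honoured}")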

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity
An important aspect of data is its update over time.

The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency) Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that is not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time^20)] [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source^21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

20 start time: the time when the data was first published in the LOD cloud.
21 older than the last modification time.
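For illustration, the first metric of Table 7 can be computed directly once these timestamps are available; the following Python sketch uses invented timestamps:

from datetime import datetime

def currency_of_statement(last_modified, start_time, now):
    """1 - [(current time - last modified time) / (current time - start time)],
    where start time is when the data was first published in the LOD cloud."""
    elapsed = (now - last_modified).total_seconds()
    lifetime = (now - start_time).total_seconds()
    return 1 - elapsed / lifetime

# Invented example: published in 2010, last modified one hour ago
now = datetime(2012, 11, 30, 12, 0)
print(currency_of_statement(datetime(2012, 11, 30, 11, 0),
                            datetime(2010, 1, 1), now))  # close to 1, i.e. current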

3.2.6.2. Volatility Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility) Volatility can be defined as the length of time during which the data remains valid.

Metrics Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example Let us consider the aforementioned use case where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed related to a specific data value.

3.2.6.3. Timeliness Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness) Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and excludes outdated data.
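A simple operationalisation of this combination, under the assumption that data counts as timely while the time elapsed since its last update (currency) does not exceed its validity interval (volatility), could look as follows:

from datetime import datetime, timedelta

def is_timely(last_modified, volatility, now):
    """Data is timely if the elapsed time since the last update does not
    exceed the length of the validity interval (the volatility)."""
    return (now - last_modified) <= volatility

# Flight prices are assumed to remain valid for one minute (see Section 3.2.6.2)
now = datetime(2012, 11, 30, 12, 2, 0)
print(is_timely(datetime(2012, 11, 30, 12, 0, 30),
                timedelta(minutes=1), now))  # False: the price is outdated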

Example A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practically feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2] the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), the types of data (Section 4.3), the level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions defines data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representations and their handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions such as interlinking, volatility and currency in order to cover all the identified problems in all of the included approaches, while also mapping each of them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8], (b) a set of approaches covering more than five dimensions [6,14,27,42] and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which compares the desired outcomes to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
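As an illustration, the ratio above can be approximated by comparing the namespaces of the properties used in a dataset against a reference list of established vocabularies; both the list and the namespace heuristic in the following Python sketch are simplifying assumptions:

from rdflib import Graph

# Assumed (non-exhaustive) reference list of well-established vocabularies
ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def namespace_of(uri):
    # Crude heuristic: the namespace ends at the last '#' or '/'
    u = str(uri)
    return u[:max(u.rfind("#"), u.rfind("/")) + 1]

g = Graph()
g.parse("dataset.ttl", format="turtle")  # placeholder input file

namespaces = {namespace_of(p) for p in g.predicates()}
print(len(namespaces & ESTABLISHED) / len(namespaces))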

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD we distinguish the assessment process operating on three types of data:


Table 8: Consideration of data quality dimensions in each of the included approaches (a matrix marking, for each approach, which of the 26 dimensions it covers). Approaches: Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011. Dimensions: Provenance; Consistency; Currency; Volatility; Timeliness; Accuracy; Completeness; Amount-of-data; Availability; Understandability; Relevancy; Reputation; Verifiability; Interpretability; Rep.-conciseness; Rep.-consistency; Licensing; Performance; Objectivity; Believability; Response-time; Security; Versatility; Validity-of-documents; Conciseness; Interlinking.


– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data at both the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data at both the graph and dataset levels [16,22] and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
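In the simplest case such propagation amounts to aggregating the fine-grained scores; the sketch below averages triple-level ratings (assumed to lie in [0, 1]) into a rating for the whole source:

def source_rating(triple_ratings):
    """Propagate triple-level ratings to the data source by averaging."""
    return sum(triple_ratings) / len(triple_ratings)

# Invented ratings for three statements originating from one source
print(source_rating([0.9, 0.7, 1.0]))  # approx. 0.87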

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment at a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, while others are only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and the respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time as well as an understanding of the required XML file structure and specification.

Semi-automated The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and the respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users,

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT, with a score of 15 out of 100.

based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at: http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9: Qualitative evaluation of frameworks.

Paper | Application (G/S) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al. 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | - | - | - | - | -
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al. 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al. 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al. 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al. 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | -


Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool displaying the result of assessing the quality of LinkedCT, with a score of 15 out of 100.
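The final score is essentially a weighted aggregation of the individual metric values; a minimal sketch of such a scoring function (with invented values and uniform weights) is given below:

def weighted_score(metrics):
    """metrics: pairs of (measured value in [0, 1], user-assigned weight).
    Returns the score scaled to the 0-100 range used by the tool."""
    total_weight = sum(weight for _, weight in metrics)
    return 100 * sum(value * weight for value, weight in metrics) / total_weight

# Three invented metric values, each weighted 1
print(weighted_score([(0.20, 1), (0.10, 1), (0.15, 1)]))  # 15.0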

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation of each step. Also, the assignment of weights to each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, the tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function producing a value between 0 and 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly intended to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import data in several different file types (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that meaning is given to the values. Reconciliation against Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data to and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is a substantial amount of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust, and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. An ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Smith, B., Lewis, S. E., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, C. F., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 3: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 3

which summarizes 12 widely accepted informationquality frameworks applied on the World Wide WebThe study compares the frameworks and identifies20 dimensions common between them Additionallythere is a comprehensive review [3] which surveys13 methodologies for assessing the data quality ofdatasets available on the Web in structured or semi-structured formats Our survey is different in that itfocuses only on structured data and on approaches thataim at assessing the quality of LOD Additionally theprior review only focused on the data quality dimen-sions identified in the constituent approaches In oursurvey we not only identify existing dimensions butalso introduce new dimensions relevant for assessingthe quality of Linked Data Furthermore we describequality assessment metrics corresponding to each ofthe dimensions and also identified whether they can beobjectively or subjectively measured

Research question The goal of this review is to an-alyze existing methodologies for assessing the qualityof structured data with particular interest in LinkedData To achieve this goal we aim to answer the fol-lowing general research question

How can we assess the quality of Linked Data em-ploying a conceptual framework integrating prior ap-proaches

We can divide this general research question intofurther sub-questions such as

ndash What are the data quality problems that each ap-proach assesses

ndash Which are the data quality dimensions and met-rics supported by the proposed approaches

ndash What kind of tools are available for data qualityassessment

Eligibility criteria As a result of a discussion be-tween the two reviewers a list of eligibility criteria wasobtained as follows

ndash Inclusion criteria

lowast Studies published in English between 2002 and2012

lowast Studies focused on data quality assessment forLinked Data

lowast Studies focused on provenance assessment ofLinked Data

lowast Studies that proposed and implemented an ap-proach for data quality assessment

lowast Studies that assessed the quality of LinkedData and reported issues

ndash Exclusion criteria

lowast Studies that were not peer-reviewed or pub-lished

lowast Assessment methodologies that were publishedas a poster abstract

lowast Studies that focused on data quality manage-ment methodologies

lowast Studies that neither focused on Linked Datanor on other forms of structured data

lowast Studies that did non propose any methodologyor framework for the assessment of quality inLOD

Search strategy Search strategies in a systematic re-view are usually iterative and are run separately byboth members Based on the research question andthe eligibility criteria each reviewer identified severalterms that were most appropriate for this systematic re-view such as data quality data quality assessmentevaluation methodology improvement or linked datawhich were used as follows

ndash linked data and (quality OR assessment OR eval-uation OR methodology OR improvement)

ndash data OR quality OR data quality AND assess-ment OR evaluation OR methodology OR im-provement

In our experience searching in the title alone doesnot always provide us with all relevant publicationsThus the abstract or full-text of publications shouldalso potentially be included On the other hand sincethe search on the full-text of studies results in manyirrelevant publications we chose to apply the searchquery first on the title and abstract of the studies Thismeans a study is selected as a candidate study if its titleor abstract contains the keywords defined in the searchstring

After we defined the search strategy we applied thekeyword search in the following list of search enginesdigital libraries journals conferences and their respec-tive workshopsSearch Engines and digital libraries

ndash Google Scholarndash ISI Web of Sciencendash ACM Digital Libraryndash IEEE Xplore Digital Libraryndash Springer Linkndash Science Direct

4 Linked Data Quality

ACM Digital Library

Google Scholar

IEEE Xplore Digital Library

Science Direct

Springer Link

ISI Web of Science

Semantic Web Journal

Journal of Data and Information

Quality

Journal of Web Semantics

Search EnginesDigital Libraries

Journals

Scan article titles based on

inclusionexclusion

criteria

Import to Mendeley and

remove duplicates

Review Abstracts and includeexclude

articles Compare potentially shortlisted

articles among reviewers

10200

382

5422

2518

536

2347

137

317

62

Step 1

Step 2

Step 3

Step 4

Total number of articles

proposing methodologies for LD QA 12

Retrieve and analyse papers

from references

Step 5

Journal of Data and Knowledge

Engineering

110

118

4Total number of articles related to provenance

in LD 9

64

Fig 1 Number of articles retrieved during literature search

Journals

ndash Semantic Web Journalndash Journal of Web Semanticsndash Journal of Data and Information Qualityndash Journal of Data and Knowledge Engineering

Conferences and their Respective Workshops

ndash International Semantic Web Conference (ISWC)ndash European Semantic Web Conference (ESWC)ndash Asian Semantic Web Conference (ASWC)ndash International World Wide Web Conference (WWW)ndash Semantic Web in Provenance Management (SWPM)ndash Consuming Linked Data (COLD)ndash Linked Data on the Web (LDOW)ndash Web Quality

Thereafter the bibliographic metadata about the 118potentially relevant primary studies were recorded us-ing the bibliography management platform Mende-ley1

1httpswwwmendeleycom

Titles and abstract reviewing Both reviewers inde-pendently screened the titles and abstracts of the re-trieved 118 articles to identify the potentially eligiblearticles In case of disagreement while merging thelists the problem was resolved either by mutual con-sensus or by creating a list of articles to go under amore detailed review Then both the reviewers com-pared the articles and based on mutual agreement ob-tained a final list of 64 articles to be included

Retrieving further potential articles In order to en-sure that all relevant articles were included an addi-tional strategy was applied such as

ndash Looking up the references in the selected articlesndash Looking up the article title in Google Scholar and

retrieving the Cited By articles to check againstthe eligibility criteria

ndash Taking each data quality dimension individuallyand performing a related article search

After performing these search strategies we retrieved4 additional articles that matched the eligibility crite-ria

Linked Data Quality 5

Extracting data for quantitative and qualitativeanalysis As a result of the search we retrieved 21 pa-pers from 2002 to 2012 listed in Table 1 which arethe core of our survey Of these 21 9 articles focustowards provenance related quality assessment and 12propose generalized methodologies

Table 1List of the selected papers

Citation TitleGil etal 2002 [18] Trusting Information Sources One Citi-

zen at a Time

Golbeck et al2003 [21]

Trust Networks on the Semantic Web

Mostafavi etal2004 [45]

An ontology-based method for qualityassessment of spatial data bases

Golbeck 2006 [20] Using Trust and Provenance for ContentFiltering on the Semantic Web

Gil etal 2007 [17] Towards content trust of web resources

Lei etal2007 [38]

A framework for evaluating semanticmetadata

Hartig 2008 [23] Trustworthiness of Data on the Web

Bizer etal2009 [6]

Quality-driven information filtering us-ing the WIQA policy framework

Boumlhm etal2010 [7]

Profiling linked open data with ProLOD

Chen etal2010 [12]

Hypothesis generation and data qualityassessment through association mining

Flemming etal2010 [14]

Assessing the quality of a Linked Datasource

Hoganetal2010 [26]

Weaving the Pedantic Web

Shekarpour etal2010 [53]

Modeling and evaluation of trust with anextension in semantic web

Fuumlrberetal2011 [15]

Swiqa - a semantic web informationquality assessment framework

Gamble etal2011 [16]

Quality Trust and Utility of ScientificData on the Web Towards a Joint Model

Jacobi etal2011 [29]

Rule-Based Trust Assessment on the Se-mantic Web

Bonatti et al2011 [8]

Robust and scalable linked data reason-ing incorporating provenance and trustannotations

Gueacuteret et al2012 [22]

Assessing Linked Data Mappings UsingNetwork Measures

Hogan etal2012 [27]

An empirical survey of Linked Dataconformance

Mendes etal2012 [42]

Sieve Linked Data Quality Assessmentand Fusion

Rula etal2012 [50]

Capturing the Age of Linked OpenData Towards a Dataset-independentFramework

Comparison perspective of selected approachesThere exist several perspectives that can be used to an-alyze and compare the selected approaches such as

ndash the definitions of the core conceptsndash the dimensions and metrics proposed by each ap-

proachndash the type of data that is considered for the assess-

mentndash the level of automation of the supported tools

Selected approaches differ in how they consider allof these perspectives and are thus compared and de-scribed in Section 3 and Section 4

Quantitative overview Out of the 21 selected ap-proaches only 5 (23) were published in a journalparticularly only in the Journal of Web SemanticsOn the other hand 14 (66) approaches were pub-lished international conferences or workshops such asWWW ISWC and ICDE Only 2 (11) of the ap-proaches were master thesis and or PhD workshop pa-pers The majority of the papers was published evenlydistributed between the years 2010 and 2012 (4 paperseach year - 57) 2 papers were published in 2009(95) and the remaining 7 between 2002 and 2008(335)

3 Conceptualization

There exist a number of discrepancies in the defini-tion of many concepts in data quality due to the contex-tual nature of quality [3] Therefore we first describeand formally define the research context terminology(in Section 31) as well as the Linked Data quality di-mensions (in Section 32) along with their respectivemetrics in detail

31 General terms

RDF Dataset In this document we understand adata source as an access point for Linked Data in theWeb A data source provides a dataset and may sup-port multiple methods of access The terms RDF tripleRDF graph and RDF datasets have been adopted fromthe W3C Data Access Working Group [42410]

Given an infinite set U of URIs (resource identi-fiers) an infinite set B of blank nodes and an infi-nite set L of literals a triple 〈s p o〉 isin (U cup B) timesU times (U cup B cup L) is called an RDF triple wheres p o are the subject the predicate and the ob-ject of the triple respectively An RDF graph G is a

6 Linked Data Quality

Completeness

RelevancyAmount of

data

Verifiability ReputationBelievability

Licensing

Objectivity

Accuracy

Validity of documents

Consistency

Timeliness

CurrencyVolatility

Interlinking

Rep Conciseness

Rep Consistency

Conciseness

Interpretability

Understandibility

Versatility

Availability

PerformanceResponse Time

Security

RepresentationContextual

Trust

Intrinsic

Accessibility

Dataset Dynamicity

overlaps

Provenance

overlaps

Two dimensionsare related

Fig 2 Linked data quality dimensions and the relations between them The dimensions marked with a are newly introduced specifically forLinked Data

set of RDF triples A named graph is a pair 〈G u〉where G is called the default graph and u isin U AnRDF dataset is a set of default and named graphs =G (u1 G1) (u2 G2) (un Gn)

Data Quality Data quality is commonly conceivedas a multidimensional construct with a popular defini-tion as the fitness for use [31] Data quality may de-pend on various factors such as accuracy timelinesscompleteness relevancy objectivity believability un-derstandability consistency conciseness availabilityand verifiability [57]

In terms of the Semantic Web there are varying con-cepts of data quality The semantic metadata for exam-ple is an important concept to be considered when as-sessing the quality of datasets [37] On the other handthe notion of link quality is another important aspectin Linked Data that is introduced where it is automati-cally detected whether a link is useful or not [22] Alsoit is to be noted that data and information are inter-changeably used in the literature

Data Quality Problems Bizer et al [6] defines dataquality problems as the choice of web-based informa-tion systems design which integrate information from

different providers In [42] the problem of data qualityis related to values being in conflict between differentdata sources as a consequence of the diversity of thedata

In [14] the author does not provide a definition butimplicitly explains the problems in terms of data diver-sity In [26] the authors discuss about errors or noiseor difficulties and in [27] the author discuss about mod-elling issues which are prone of the non exploitationsof those data from the applications

Thus data quality problem refers to a set of issuesthat can affect the potentiality of the applications thatuse the data

Data Quality Dimensions and Metrics Data qual-ity assessment involves the measurement of quality di-mensions or criteria that are relevant to the consumerA data quality assessment metric or measure is a pro-cedure for measuring an information quality dimen-sion [6] These metrics are heuristics that are designedto fit a specific assessment situation [39] Since the di-mensions are rather abstract concepts the assessmentmetrics rely on quality indicators that allow for the as-sessment of the quality of a data source wrt the crite-

Linked Data Quality 7

ria [14] An assessment score is computed from theseindicators using a scoring function

In [6] the data quality dimensions are classified intothree categories according to the type of informationthat is used as quality indicator (1) Content Based -information content itself (2) Context Based - infor-mation about the context in which information wasclaimed (3) Rating Based - based on the ratings aboutthe data itself or the information provider

However we identify further dimensions (defined inSection 32) and also further categories to classify thedimensions namely (1) Contextual (2) Trust (3) In-trinsic (4) Accessibility (5) Representational and (6)Dataset Dynamicity dimensions as depicted in Fig-ure 2

Data Quality Assessment Method. A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' needs in a specific use case [6]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.

3.2. Linked Data quality dimensions

After analyzing the 21 selected approaches in detail, we identified a core set of 26 different data quality dimensions that can be applied to assess the quality of Linked Data. In this section, we formalize and adapt the definition of each dimension to the Linked Data context. The metrics associated with each dimension are also identified and reported. Additionally, we group the dimensions following a classification introduced in [57], which is further modified and extended as:

– Contextual dimensions
– Trust dimensions
– Intrinsic dimensions
– Accessibility dimensions
– Representational dimensions
– Dataset dynamicity

These groups are not strictly disjoint but can partially overlap. Also, the dimensions are not independent from each other, but correlations exist among dimensions in the same group or between groups. Figure 2 shows the classification of the dimensions into the above mentioned groups as well as the inter- and intra-relations between them.

Use case scenario. Since data quality is described as fitness to use, we introduce a specific use case that will allow us to illustrate the importance of each dimension with the help of an example. Our use case is about an intelligent flight search engine, which relies on acquiring (aggregating) data from several datasets. It obtains information about airports and airlines from an airline dataset (e.g. OurAirports2, OpenFlights3). Information about the location of countries, cities and particular addresses is obtained from a spatial dataset (e.g. LinkedGeoData4). Additionally, aggregators in RDF pull all the information related to airlines from different data sources (e.g. Expedia5, Tripadvisor6, Skyscanner7, etc.), which allows a user to query the integrated dataset for a flight from any start and end destination for any time period. We will use this scenario throughout this section as an example of how data quality influences fitness to use.

3.2.1. Contextual dimensions
Contextual dimensions are those that highly depend on the context of the task at hand, as well as on the subjective preferences of the data consumer. There are three dimensions that are part of this group, completeness, amount-of-data and relevancy, which are listed in Table 2 along with a comprehensive list of their corresponding metrics. The reference for each metric is provided in the table.

3.2.1.1. Completeness
Completeness is defined as "the degree to which information is not missing" in [5]. This dimension is further classified in [15] into the following categories: (a) Schema completeness, which is the degree to which entities and attributes are not missing in a schema; (b) Column completeness, which is a function of the missing values for a specific property or column; and (c) Population completeness, which refers to the ratio of entities represented in an information system to the complete population. In [42], completeness is distinguished on the schema level and the data level. On the schema level, a dataset is complete if it contains all of the attributes needed for a given task. On the data (i.e. instance) level, a dataset is complete if it contains all of the necessary objects for a given task. As can be observed, the authors in [5] give a general definition, whereas the authors in [15] provide a set of sub-categories for completeness.

2 http://thedatahub.org/dataset/ourairports
3 http://thedatahub.org/dataset/open-flights
4 http://linkedgeodata.org
5 http://www.expedia.com
6 http://www.tripadvisor.com
7 http://www.skyscanner.de


Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of the no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using a vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S

Table 2. Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

On the other hand, the two types defined in [42], i.e. schema- and data-level completeness, can be mapped to the two categories (a) Schema completeness and (c) Population completeness provided in [15].

Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented (thus it can also be called ontology completeness); (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset and comparing it with the original dataset (or a gold standard dataset). It should be noted that in this case, users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
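As an illustration, the following is a minimal sketch (in Python with rdflib) of how population and interlinking completeness could be computed under such a closed-world assumption; the file names and the choice of owl:sameAs as the interlinking predicate are assumptions made for the example, not prescribed by the metrics above.

```python
# A minimal sketch of closed-world completeness metrics; the file names
# are hypothetical and owl:sameAs is assumed as the interlinking predicate.
from rdflib import Graph
from rdflib.namespace import OWL

def population_completeness(dataset: Graph, gold: Graph) -> float:
    """Ratio of gold-standard subjects that also appear in the dataset."""
    gold_subjects = set(gold.subjects())
    return len(gold_subjects & set(dataset.subjects())) / len(gold_subjects)

def interlinking_completeness(dataset: Graph) -> float:
    """Fraction of subjects participating in at least one owl:sameAs link."""
    subjects = set(dataset.subjects())
    linked = set(dataset.subjects(OWL.sameAs, None))
    return len(linked) / len(subjects) if subjects else 0.0

dataset = Graph().parse("airports.ttl", format="turtle")    # hypothetical file
gold = Graph().parse("gold_standard.ttl", format="turtle")  # hypothetical file
print(population_completeness(dataset, gold))
print(interlinking_completeness(dataset))
```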

Example. In our use case, the flight search engine should contain complete information, so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, low price is of high importance. But looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases when there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data
The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14], and should be enough to approximate the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
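For instance, the triple and instance-per-class counts could be obtained with simple SPARQL queries, as in the rough sketch below; the endpoint URL is a hypothetical placeholder.

```python
# A rough sketch counting triples and instances per class over SPARQL;
# the endpoint URL is a hypothetical placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
endpoint.setReturnFormat(JSON)

endpoint.setQuery("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
result = endpoint.query().convert()
print("triples:", result["results"]["bindings"][0]["triples"]["value"])

endpoint.setQuery("""
    SELECT ?class (COUNT(?s) AS ?instances)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?instances)
""")
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["class"]["value"], row["instances"]["value"])
```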

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data so as to cover all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large amount of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy
In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.
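A simple sketch of the term-counting variant of this metric is shown below: it counts occurrences of task-relevant keywords within meta-information attributes such as rdfs:label and rdfs:comment. The keyword list and file name are assumptions made for illustration.

```python
# A simple sketch of counting task-relevant terms in meta-information
# attributes; the keyword list and file name are hypothetical.
from rdflib import Graph
from rdflib.namespace import RDFS

KEYWORDS = {"flight", "airport", "airline"}  # hypothetical task terms

def relevancy_score(g: Graph) -> int:
    """Count keyword occurrences in rdfs:label and rdfs:comment literals."""
    hits = 0
    for predicate in (RDFS.label, RDFS.comment):
        for literal in g.objects(None, predicate):
            hits += sum(1 for token in str(literal).lower().split()
                        if token in KEYWORDS)
    return hits

g = Graph().parse("airports.ttl", format="turtle")  # hypothetical file
print(relevancy_score(g))
```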

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees, etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through it efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions. For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevancy dimension.

3.2.2. Trust dimensions
Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance
There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities, to enable trust, assess authenticity and allow reproducibility.


Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief, 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows, on a scale of 1–9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between 2 entities through a path, and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the sources' trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community, questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O

Table 3. Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.
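A minimal sketch of such a metadata check is given below; the use of Dublin Core terms and the file name are assumptions for illustration, as provenance can be expressed with several vocabularies.

```python
# A minimal sketch checking a dataset description for basic provenance
# metadata; the Dublin Core terms and the file name are assumptions.
from rdflib import Graph
from rdflib.namespace import DCTERMS

g = Graph().parse("void_description.ttl", format="turtle")  # hypothetical file

for predicate in (DCTERMS.title, DCTERMS.creator,
                  DCTERMS.publisher, DCTERMS.source):
    # Triple-pattern containment check with wildcards on subject/object.
    present = (None, predicate, None) in g
    print(predicate, "present" if present else "missing")
```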

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability
Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion refers to the means provided to a consumer which can be used to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.

3.2.2.3. Reputation
The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual and (semi-)automated approaches. The manual approach is via a survey in a community, by questioning other members who can help to determine the reputation of a source, or by asking the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case,

8 http://open-biomed.sourceforge.net/opmv/ns.html
9 http://www.w3.org/TR/prov-o


the search engine can decide to trust only the source with the higher reputation.

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are the trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.

3.2.2.4. Believability
In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known airlines, such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or only by looking at the data source can the decision about its credibility and believability be determined. Another method, proposed by Tim Berners-Lee, was that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a

10 http://www.w3.org/DesignIssues/UI.html

button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

3.2.2.5. Licensing
"In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.
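A small sketch of detecting a machine-readable license statement is given below; dcterms:license is one common way to state a license in a VoID description, and the file name is a hypothetical placeholder.

```python
# A small sketch detecting a machine-readable license statement;
# dcterms:license is an assumed convention, the file name is hypothetical.
from rdflib import Graph
from rdflib.namespace import DCTERMS

g = Graph().parse("void_description.ttl", format="turtle")  # hypothetical file

licenses = list(g.objects(None, DCTERMS.license))
if licenses:
    print("machine-readable license(s):", [str(uri) for uri in licenses])
else:
    print("no machine-readable license statement detected")
```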

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl
12 http://vocab.deri.ie/void


Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions
Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy
In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy are identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real-world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real-world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
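As an illustration of the outlier-detection approach, the rough sketch below applies a simple distribution-based test to the numeric values of one property; the property URI, file name and three-sigma threshold are assumptions made for the example.

```python
# A rough sketch of distribution-based outlier detection over the numeric
# values of one property; property URI, file name and the three-sigma
# threshold are illustrative assumptions.
from statistics import mean, stdev
from rdflib import Graph, URIRef

PRICE = URIRef("http://example.org/ontology/price")  # hypothetical property

g = Graph().parse("flights.ttl", format="turtle")    # hypothetical file
values = [float(o) for o in g.objects(None, PRICE)]

# stdev requires at least two values; a sketch, not a robust implementation.
mu, sigma = mean(values), stdev(values)

# Flag values more than three standard deviations from the mean as
# candidate semantic inaccuracies to be reviewed manually.
outliers = [v for v in values if abs(v - mu) > 3 * sigma]
print(f"{len(outliers)} outlier(s) out of {len(values)} values")
```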

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high-quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity
In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured directly, but only indirectly, by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.


Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori Algorithm [1]) or inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviation-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of the use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality and sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities, stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real-world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of the use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness & validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).


However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents
"Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
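A minimal sketch of such a syntax check is to attempt to parse the document and report any parser error, as below; the file name is a hypothetical placeholder, and a dedicated validator service could of course be used instead.

```python
# A minimal sketch of syntax validation: try to parse a document and
# report the parser error if it is not well-formed. File name is
# hypothetical.
from rdflib import Graph

def is_syntactically_valid(path: str, fmt: str = "xml") -> bool:
    """Return True if the document parses as RDF, False otherwise."""
    try:
        Graph().parse(path, format=fmt)  # "xml" denotes RDF/XML in rdflib
        return True
    except Exception as error:  # rdflib raises format-specific parse errors
        print(f"parse error in {path}: {error}")
        return False

print(is_syntactically_valid("airports.rdf"))  # hypothetical file
```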

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process the data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4. Interlinking
When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
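The rough sketch below computes two simple indicators of this kind: the owl:sameAs interlinking degree and the number of links pointing outside the local namespace. The file name and the local namespace (borrowed from the conciseness example later in this section) are illustrative assumptions.

```python
# A rough sketch of simple interlinking indicators; file name and local
# namespace are illustrative assumptions.
from rdflib import Graph
from rdflib.namespace import OWL

LOCAL_NS = "http://airlines.org/"  # hypothetical local namespace

g = Graph().parse("airports.ttl", format="turtle")  # hypothetical file

subjects = set(g.subjects())
same_as_pairs = list(g.subject_objects(OWL.sameAs))
external = [o for _, o in same_as_pairs if not str(o).startswith(LOCAL_NS)]

degree = len(same_as_pairs) / len(subjects) if subjects else 0.0
print("interlinking degree:", degree)
print("links to external data providers:", len(external))
```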

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency
Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


[Figure 3: Different components related to the RDF representation of data, including the representation formats (RDF/XML, N3, N-Triples, Turtle, RDFa), query access (SPARQL with its protocol and query results in JSON), schema and ontology languages (RDF-S, OWL, OWL2 with its EL, QL and RL profiles, DL-Lite, DAML) and rules (SWRL, domain-specific rules).]

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
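A small sketch of one such check, taken from Table 4, is to find entities declared as members of two disjoint classes; it assumes that owl:disjointWith axioms are loaded together with the data, and the file name is a hypothetical placeholder.

```python
# A small sketch of one consistency check: entities that are members of
# two disjoint classes. Assumes owl:disjointWith axioms are present in
# the loaded graph; file name is hypothetical.
from rdflib import Graph

g = Graph().parse("flights_with_ontology.ttl", format="turtle")  # hypothetical

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?entity ?c1 ?c2
WHERE {
    ?entity a ?c1, ?c2 .
    ?c1 owl:disjointWith ?c2 .
}
"""
for entity, c1, c2 in g.query(query):
    print(f"{entity} is a member of disjoint classes {c1} and {c2}")
```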

Example. Let us assume a user looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness
The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
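A rough sketch of the extensional variant is given below; it treats instances that share an identical set of property-value pairs as redundant representations of one object, which is a simplifying assumption (real duplicate detection would use record linkage), and the file name is hypothetical.

```python
# A rough sketch of extensional conciseness: the ratio of unique object
# representations to all representations. The equivalence criterion
# (identical property-value sets) and the file name are assumptions.
from rdflib import Graph

g = Graph().parse("flights.ttl", format="turtle")  # hypothetical file

# Describe each subject by its frozen set of (predicate, object) pairs.
descriptions = {s: frozenset(g.predicate_objects(s)) for s in set(g.subjects())}

total = len(descriptions)
unique = len(set(descriptions.values()))
print("extensional conciseness:", unique / total if total else 1.0)
```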

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this reduces the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real-world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions
The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions that are part of this group, availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability
Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information, rather than to the method of accessing the information as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
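A small sketch of two such checks is shown below: whether a SPARQL endpoint answers a trivial query, and whether a resource URI dereferences without a 4xx/5xx error. Both URLs are hypothetical placeholders.

```python
# A small sketch of availability checks; both URLs are hypothetical.
import requests

ENDPOINT = "http://example.org/sparql"          # hypothetical endpoint
RESOURCE = "http://example.org/resource/A123"   # hypothetical resource URI

# Endpoint accessibility: send a trivial ASK query over HTTP GET.
response = requests.get(ENDPOINT, params={"query": "ASK { ?s ?p ?o }"},
                        timeout=10)
print("endpoint reachable:", response.ok)

# Dereferencability: request RDF and follow 303 redirects to the document.
response = requests.get(RESOURCE, headers={"Accept": "application/rdf+xml"},
                        timeout=10, allow_redirects=True)
print("dereferencable:", response.status_code == 200)
```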

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is the 404 Not Found code, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance
Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27] performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we notice that no formal definition is provided for performance. In the former reference, the authors give a general description of performance without explaining what is meant by it. In the latter reference, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset; that is, the more performant a data source, the more efficiently a system can process data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
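The latency and scalability metrics from Table 5 could be approximated as in the rough sketch below: a source is considered scalable if the time to answer ten requests, divided by ten, is not longer than the time to answer one request. The endpoint URL is a hypothetical placeholder, and single timings are of course subject to network noise.

```python
# A rough sketch of the latency and scalability metrics from Table 5;
# the endpoint URL is hypothetical and timings are noisy in practice.
import time
import requests

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
QUERY = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"}

def timed_request() -> float:
    """Return the wall-clock time of one HTTP request to the endpoint."""
    start = time.perf_counter()
    requests.get(ENDPOINT, params=QUERY, timeout=10)
    return time.perf_counter() - start

single = timed_request()
batch_avg = sum(timed_request() for _ in range(10)) / 10

print(f"latency: {single:.3f}s (considered too low if over one second)")
print("scalable:", batch_avg <= single)
```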

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to be able to serve 100 simultaneous users.


Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or of no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too low [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer an amount of ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security
Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost if the data becomes unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can serve as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4. Response-time
Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or the complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A long response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related with each other as follows: the performance of a system is related to the availability and response-time dimensions; only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability, versatility as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. The concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, in [27] the use of very long URIs or those that contain query parameters is an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.


Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detection of whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats; identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is then not a problem in our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies (the survey restricts itself to the needs of the OBO Foundry community, but the conventions can be applied to other domains). Another possibility is to use LODStats [13], which allows one to perform a search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
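A sketch of the labelling metric from Table 6 (entities with an rdfs:label or rdfs:comment over all described entities), assuming the dataset fits in memory and using the rdflib library; the file name is a placeholder:

    from rdflib import Graph
    from rdflib.namespace import RDFS

    def labelled_entity_ratio(graph: Graph) -> float:
        """No. of entities with an rdfs:label or rdfs:comment divided by
        the total no. of entities described in the data."""
        described = set(graph.subjects())
        labelled = set(graph.subjects(RDFS.label)) | set(graph.subjects(RDFS.comment))
        return len(labelled & described) / len(described) if described else 0.0

    g = Graph()
    g.parse("dataset.ttl", format="turtle")  # placeholder file name
    print(f"labelling ratio: {labelled_entity_ratio(g):.2f}")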

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), then the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, human-readable labels and descriptions of defined classes, properties and entities to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.
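The blank-node metric from Table 6 can be approximated in a few lines of rdflib; the file name is again a placeholder, and counting triples touched by a blank node is one possible reading of the metric:

    from rdflib import BNode, Graph

    def blank_node_ratio(graph: Graph) -> float:
        """Fraction of triples with a blank node as subject or object;
        blank nodes carry no globally unique identifier, so their meaning
        cannot be interpreted or referenced from outside the dataset."""
        if len(graph) == 0:
            return 0.0
        affected = sum(1 for s, _, o in graph
                       if isinstance(s, BNode) or isinstance(o, BNode))
        return affected / len(graph)

    g = Graph()
    g.parse("dataset.ttl", format="turtle")  # placeholder file name
    print(f"blank-node ratio: {blank_node_ratio(g):.2%}")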

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

- http://rdf.freebase.com/ns/m.049jnng
- http://rdf.freebase.com/ns/m.043j22x
- Boston Logan Airport


For the first two items no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats, in different languages, as well as via different access methods.

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding accept header to an HTTP request.
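A sketch of the content-negotiation metric from Table 6, requesting a resource with different Accept headers; the resource URI and the three MIME types are illustrative choices:

    import urllib.request

    def negotiable_formats(resource: str, mime_types: list) -> dict:
        """Check which serializations a resource can be retrieved in
        via HTTP content negotiation (Accept header)."""
        results = {}
        for mime in mime_types:
            request = urllib.request.Request(resource, headers={"Accept": mime})
            try:
                with urllib.request.urlopen(request, timeout=10) as response:
                    content_type = response.headers.get("Content-Type", "")
                    results[mime] = content_type.startswith(mime)
            except OSError:
                results[mime] = False
        return results

    print(negotiable_formats("http://dbpedia.org/resource/Boston",
                             ["application/rdf+xml", "text/turtle", "text/html"]))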

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", by providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
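A worked sketch of the "currency of statements" metric from Table 7, which normalises the age of the data by the lifetime of the dataset; the timestamps are illustrative:

    from datetime import datetime

    def currency_score(last_modified: datetime, start: datetime,
                       now: datetime) -> float:
        """Currency of statements as in Table 7:
        1 - (now - last_modified) / (now - start), where start is the
        time the data was first published. 1.0 means just updated;
        values near 0.0 mean the data has not changed since first
        publication."""
        lifetime = (now - start).total_seconds()
        if lifetime <= 0:
            return 1.0
        age = (now - last_modified).total_seconds()
        return 1.0 - age / lifetime

    # Illustrative timestamps
    print(currency_score(last_modified=datetime(2012, 10, 28),
                         start=datetime(2010, 1, 1),
                         now=datetime(2012, 11, 1)))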

Example. Consider a user who is looking for the price of a flight from Milan to Boston, and who receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time)], where the start time identifies when the data was first published in the LOD cloud [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source, i.e., the semantic web source is older than the last modification of the original [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time inaccurate representation of data | 1 - (no. of time-inaccurate instances / total no. of instances) [14] | O


3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). Both these metrics are combined together to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and she is interested in the price information. The price is considered as volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed related to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
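The survey does not fix a single formula for this combination; the sketch below uses the max(0, 1 - age/validity) form that is common in the general data quality literature (cf. [2]) and should be read as one possible instantiation, not the prescribed metric:

    def timeliness(age: float, volatility: float) -> float:
        """Timeliness from currency (age of the value) and volatility
        (length of time the value remains valid): 1.0 for fresh data,
        0.0 once the age exceeds the validity period."""
        if volatility <= 0:
            return 0.0
        return max(0.0, 1.0 - age / volatility)

    # A price valid for 120 s that was refreshed 30 s ago is still timely
    print(timeliness(age=30, volatility=120))  # 0.75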

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may not be necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, nor on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
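To make the ratio form concrete, the vocabulary re-use metric above could be computed as follows; the whitelist of "established" namespaces is an illustrative assumption (no such normative list exists), and rdflib is used to enumerate the property namespaces a dataset actually uses:

    from rdflib import Graph
    from rdflib.namespace import split_uri

    # Illustrative whitelist of established vocabulary namespaces
    ESTABLISHED = {
        "http://xmlns.com/foaf/0.1/",
        "http://www.w3.org/2000/01/rdf-schema#",
        "http://purl.org/dc/terms/",
    }

    def vocabulary_reuse_ratio(graph: Graph) -> float:
        """No. of established vocabularies used / total no. of
        vocabularies used, taking the namespaces of all property URIs
        as a proxy for the vocabularies used by the dataset."""
        namespaces = set()
        for _, predicate, _ in graph:
            try:
                namespace, _ = split_uri(predicate)
            except ValueError:
                continue  # URI that cannot be split into namespace + local name
            namespaces.add(str(namespace))
        return len(namespaces & ESTABLISHED) / len(namespaces) if namespaces else 0.0

    g = Graph()
    g.parse("dataset.ttl", format="turtle")  # placeholder file name
    print(f"vocabulary reuse: {vocabulary_reuse_ratio(g):.2f}")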

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or by the absence of dereferencability issues.
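A sketch of the combined availability check, probing the reachability of the server, the SPARQL endpoint and the RDF dump with plain HTTP requests; all three URLs are placeholders:

    import urllib.parse
    import urllib.request

    def reachable(url: str, timeout: int = 10) -> bool:
        """True if the URL answers an HTTP GET with a non-error status."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status < 400
        except OSError:
            return False

    ask = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }"})
    checks = {                                   # placeholder URLs
        "server": "http://example.org/",
        "sparql_endpoint": "http://example.org/sparql?" + ask,
        "rdf_dump": "http://example.org/dumps/dataset.nt.gz",
    }
    print({name: reachable(url) for name, url in checks.items()})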

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8: Consideration of data quality dimensions in each of the included approaches (the per-approach checkmark matrix could not be recovered from the extracted text). Approaches (rows): Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011. Dimensions (columns): Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.

- RDF triples, which focus on individual triple assessment.
- RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
- Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22] and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, and others are only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure and specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss/) [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/), used in [26], checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and their respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF (http://trdf.sourceforge.net/) approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.


Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php), where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weights and another with the calculated value.

Table 9: Qualitative evaluation of frameworks (Application: G = General, S = Specific; Type of data: Triple / Graph / Dataset; Degree of automation: Manual / Semi-automated / Automated)

Paper | G/S | Goal | Triple | Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al. 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al. 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | - | - | - | - | -
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al. 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al. 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al. 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al. 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | - | - | - | - | -


In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assigning of weights for each metric and the calculation of the score is straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de/), used first to assess the quality of two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function having a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random, PickMostFrequent.

    <Sieve>
      <QualityAssessment>
        <AssessmentMetric id="sieve:recency">
          <ScoringFunction class="TimeCloseness">
            <Param name="timeSpan" value="7"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
        <AssessmentMetric id="sieve:reputation">
          <ScoringFunction class="ScoredList">
            <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
      </QualityAssessment>
    </Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess data quality for a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine (http://code.zemanta.com/sparkica/) is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation with Freebase (http://www.freebase.com/) helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible (http://refine.deri.ie/reconciliationDocs) using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions, along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool; that is, out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, and others were easy to use but provided only results with limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] BECKETT, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] BRICKLEY, D., AND GUHA, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] FÜRBER, C., AND HEPP, M. SWIQA - a semantic web information quality assessment framework. In ECIS (2011).

[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).

[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] HAYES, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/

[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).

[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] KIFER, M., AND BOLEY, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/

[33] KITCHENHAM, B. Procedures for performing systematic reviews.

[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto - Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).

[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND THE PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] MOSTAFAVI, M. A., EDWARDS, G., AND JEANSOULIN, R. An ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Fig. 1. Number of articles retrieved during literature search. (The figure depicts the selection workflow: Step 1, keyword search over search engines and digital libraries (ACM Digital Library, Google Scholar, IEEE Xplore Digital Library, Science Direct, Springer Link, ISI Web of Science) as well as journals (Semantic Web Journal, Journal of Data and Information Quality, Journal of Web Semantics, Journal of Data and Knowledge Engineering); Step 2, scanning article titles based on inclusion/exclusion criteria; Step 3, import into Mendeley and removal of duplicates; Step 4, review of abstracts and comparison of potentially shortlisted articles among reviewers; Step 5, retrieval and analysis of papers from references; yielding a total of 12 articles proposing methodologies for LD quality assessment and 9 articles related to provenance in LD.)

Journals

– Semantic Web Journal
– Journal of Web Semantics
– Journal of Data and Information Quality
– Journal of Data and Knowledge Engineering

Conferences and their Respective Workshops

– International Semantic Web Conference (ISWC)
– European Semantic Web Conference (ESWC)
– Asian Semantic Web Conference (ASWC)
– International World Wide Web Conference (WWW)
– Semantic Web in Provenance Management (SWPM)
– Consuming Linked Data (COLD)
– Linked Data on the Web (LDOW)
– Web Quality

Thereafter, the bibliographic metadata about the 118 potentially relevant primary studies were recorded using the bibliography management platform Mendeley1.

1 https://www.mendeley.com

Title and abstract reviewing: Both reviewers independently screened the titles and abstracts of the retrieved 118 articles to identify the potentially eligible articles. In case of disagreement while merging the lists, the problem was resolved either by mutual consensus or by creating a list of articles to undergo a more detailed review. Then both reviewers compared the articles and, based on mutual agreement, obtained a final list of 64 articles to be included.

Retrieving further potential articles: In order to ensure that all relevant articles were included, the following additional strategies were applied:

– Looking up the references in the selected articles
– Looking up each article title in Google Scholar and retrieving the "Cited By" articles to check against the eligibility criteria
– Taking each data quality dimension individually and performing a related article search

After performing these search strategies, we retrieved 4 additional articles that matched the eligibility criteria.


Extracting data for quantitative and qualitative analysis: As a result of the search, we retrieved 21 papers from 2002 to 2012, listed in Table 1, which are the core of our survey. Of these 21 papers, 9 articles focus on provenance-related quality assessment and 12 propose generalized methodologies.

Table 1
List of the selected papers

Citation | Title
Gil et al., 2002 [18] | Trusting Information Sources One Citizen at a Time
Golbeck et al., 2003 [21] | Trust Networks on the Semantic Web
Mostafavi et al., 2004 [45] | An ontology-based method for quality assessment of spatial data bases
Golbeck, 2006 [20] | Using Trust and Provenance for Content Filtering on the Semantic Web
Gil et al., 2007 [17] | Towards content trust of web resources
Lei et al., 2007 [38] | A framework for evaluating semantic metadata
Hartig, 2008 [23] | Trustworthiness of Data on the Web
Bizer et al., 2009 [6] | Quality-driven information filtering using the WIQA policy framework
Böhm et al., 2010 [7] | Profiling linked open data with ProLOD
Chen et al., 2010 [12] | Hypothesis generation and data quality assessment through association mining
Flemming et al., 2010 [14] | Assessing the quality of a Linked Data source
Hogan et al., 2010 [26] | Weaving the Pedantic Web
Shekarpour et al., 2010 [53] | Modeling and evaluation of trust with an extension in semantic web
Fürber et al., 2011 [15] | SWIQA - a semantic web information quality assessment framework
Gamble et al., 2011 [16] | Quality, Trust and Utility of Scientific Data on the Web: Towards a Joint Model
Jacobi et al., 2011 [29] | Rule-Based Trust Assessment on the Semantic Web
Bonatti et al., 2011 [8] | Robust and scalable linked data reasoning incorporating provenance and trust annotations
Guéret et al., 2012 [22] | Assessing Linked Data Mappings Using Network Measures
Hogan et al., 2012 [27] | An empirical survey of Linked Data conformance
Mendes et al., 2012 [42] | Sieve: Linked Data Quality Assessment and Fusion
Rula et al., 2012 [50] | Capturing the Age of Linked Open Data: Towards a Dataset-independent Framework

Comparison perspective of selected approaches: There exist several perspectives that can be used to analyze and compare the selected approaches, such as:

– the definitions of the core concepts
– the dimensions and metrics proposed by each approach
– the type of data that is considered for the assessment
– the level of automation of the supported tools

The selected approaches differ in how they consider each of these perspectives and are thus compared and described in Section 3 and Section 4.

Quantitative overview: Out of the 21 selected approaches, only 5 (23%) were published in a journal, particularly only in the Journal of Web Semantics. On the other hand, 14 (66%) approaches were published in international conferences or workshops such as WWW, ISWC and ICDE. Only 2 (11%) of the approaches were master theses and/or PhD workshop papers. The majority of the papers were published between the years 2010 and 2012, evenly distributed (4 papers each year, 57%); 2 papers were published in 2009 (9.5%) and the remaining 7 between 2002 and 2008 (33.5%).

3. Conceptualization

There exist a number of discrepancies in the definition of many concepts in data quality, due to the contextual nature of quality [3]. Therefore, we first describe and formally define the research context terminology (in Section 3.1) as well as the Linked Data quality dimensions (in Section 3.2), along with their respective metrics, in detail.

3.1. General terms

RDF Dataset: In this document we understand a data source as an access point for Linked Data in the Web. A data source provides a dataset and may support multiple methods of access. The terms RDF triple, RDF graph and RDF dataset have been adopted from the W3C Data Access Working Group [4,24,10].

Given an infinite set U of URIs (resource identifiers), an infinite set B of blank nodes and an infinite set L of literals, a triple ⟨s, p, o⟩ ∈ (U ∪ B) × U × (U ∪ B ∪ L) is called an RDF triple, where s, p and o are the subject, the predicate and the object of the triple, respectively.

Fig. 2. Linked Data quality dimensions and the relations between them. The dimensions marked with * are newly introduced specifically for Linked Data. (The figure arranges the 26 dimensions, i.e. completeness, amount-of-data, relevancy, provenance, verifiability, reputation, believability, licensing, accuracy, objectivity, validity of documents, consistency, conciseness, interlinking, timeliness, currency, volatility, availability, performance, response time, security, versatility, representational conciseness, representational consistency, interpretability and understandability, into the groups Contextual, Trust, Intrinsic, Accessibility, Representational and Dataset Dynamicity, and indicates overlaps and relations between dimensions.)

An RDF graph G is a set of RDF triples. A named graph is a pair ⟨G, u⟩, where G is called the default graph and u ∈ U. An RDF dataset is a set of default and named graphs = {G, (u1, G1), (u2, G2), ..., (un, Gn)}.
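To make the notation concrete, the following is a minimal sketch that builds a named graph inside an RDF dataset using the Python rdflib library; the namespace, graph name and triples are illustrative assumptions, not part of the formal definition.

from rdflib import Dataset, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace

ds = Dataset()            # an RDF dataset: a default graph plus named graphs
g = ds.graph(EX.flights)  # a named graph <G, u> with name u = ex:flights

# RDF triples <s, p, o> with s, p in U and o in L
g.add((EX.A123, EX.departsFrom, Literal("Paris")))
g.add((EX.A123, EX.arrivesAt, Literal("New York")))

print(len(g))  # number of triples in the named graph: 2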

Data Quality: Data quality is commonly conceived as a multidimensional construct, with a popular definition being the "fitness for use" [31]. Data quality may depend on various factors such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability and verifiability [57].

In terms of the Semantic Web, there are varying concepts of data quality. The semantic metadata, for example, is an important concept to be considered when assessing the quality of datasets [37]. On the other hand, the notion of link quality is another important aspect in Linked Data, where it is automatically detected whether a link is useful or not [22]. It should also be noted that data and information are used interchangeably in the literature.

Data Quality Problems: Bizer et al. [6] relate data quality problems to the design choices of web-based information systems that integrate information from different providers. In [42], the problem of data quality is related to values being in conflict between different data sources as a consequence of the diversity of the data.

In [14], the author does not provide a definition but implicitly explains the problems in terms of data diversity. In [26], the authors discuss errors, noise and difficulties, and in [27] the authors discuss modelling issues which prevent applications from exploiting the data.

Thus, data quality problems refer to a set of issues that can affect the potential of the applications that use the data.

Data Quality Dimensions and Metrics: Data quality assessment involves the measurement of quality dimensions or criteria that are relevant to the consumer. A data quality assessment metric or measure is a procedure for measuring an information quality dimension [6]. These metrics are heuristics that are designed to fit a specific assessment situation [39]. Since the dimensions are rather abstract concepts, the assessment metrics rely on quality indicators that allow for the assessment of the quality of a data source wrt. the criteria [14]. An assessment score is computed from these indicators using a scoring function.

In [6], the data quality dimensions are classified into three categories according to the type of information that is used as quality indicator: (1) Content Based: the information content itself; (2) Context Based: information about the context in which information was claimed; (3) Rating Based: based on the ratings about the data itself or the information provider.

However, we identify further dimensions (defined in Section 3.2) and also further categories to classify the dimensions, namely (1) Contextual, (2) Trust, (3) Intrinsic, (4) Accessibility, (5) Representational and (6) Dataset Dynamicity dimensions, as depicted in Figure 2.

Data Quality Assessment Method: A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' need in a specific use case [6]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.

3.2. Linked Data quality dimensions

After analyzing the 21 selected approaches in detail, we identified a core set of 26 different data quality dimensions that can be applied to assess the quality of Linked Data. In this section we formalize and adapt the definition of each dimension to the Linked Data context. The metrics associated with each dimension are also identified and reported. Additionally, we group the dimensions following a classification introduced in [57], which is further modified and extended as:

– Contextual dimensions
– Trust dimensions
– Intrinsic dimensions
– Accessibility dimensions
– Representational dimensions
– Dataset dynamicity

These groups are not strictly disjoint but can partially overlap. Also, the dimensions are not independent from each other; correlations exist among dimensions in the same group or between groups. Figure 2 shows the classification of the dimensions in the above mentioned groups as well as the inter- and intra-relations between them.

Use case scenario: Since data quality is described as fitness to use, we introduce a specific use case that will allow us to illustrate the importance of each dimension with the help of an example. Our use case is about an intelligent flight search engine, which relies on acquiring (aggregating) data from several datasets. It obtains information about airports and airlines from an airline dataset (e.g. OurAirports2, OpenFlights3). Information about the location of countries, cities and particular addresses is obtained from a spatial dataset (e.g. LinkedGeoData4). Additionally, aggregators in RDF pull all the information related to airlines from different data sources (e.g. Expedia5, Tripadvisor6, Skyscanner7, etc.), which allows a user to query the integrated dataset for a flight from any start and end destination for any time period. We will use this scenario throughout this section as an example of how data quality influences fitness to use.

3.2.1. Contextual dimensions
Contextual dimensions are those that highly depend on the context of the task at hand as well as on the subjective preferences of the data consumer. There are three dimensions that are part of this group: completeness, amount-of-data and relevancy, which are listed in Table 2 along with a comprehensive list of their corresponding metrics. The reference for each metric is provided in the table.

3.2.1.1. Completeness: Completeness is defined as "the degree to which information is not missing" in [5]. This dimension is further classified in [15] into the following categories: (a) Schema completeness, which is the degree to which entities and attributes are not missing in a schema; (b) Column completeness, which is a function of the missing values for a specific property or column; and (c) Population completeness, which refers to the ratio of entities represented in an information system to the complete population. In [42], completeness is distinguished on the schema level and the data level. On the schema level, a dataset is complete if it contains all of the attributes needed for a given task. On the data (i.e. instance) level, a dataset is complete if it contains all of the necessary objects for a given task. As can be observed, the authors in [5] give a general definition, whereas the authors in [15] provide a set of sub-categories for completeness.

2 http://thedatahub.org/dataset/ourairports
3 http://thedatahub.org/dataset/open-flights
4 linkedgeodata.org
5 http://www.expedia.com
6 http://www.tripadvisor.com
7 http://www.skyscanner.de

Table 2
Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using a vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S

On the other hand, in [42] the two types defined, i.e. schema and data level completeness, can be mapped to the two categories (a) Schema completeness and (c) Population completeness provided in [15].

Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, which can thus be called ontology completeness; (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics: Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset compared to the original dataset (or gold standard dataset). It should be noted that in this case users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
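As a minimal sketch of such a gold-standard comparison (assuming the closed world mentioned above), population completeness for one class could be computed as follows with rdflib; the function and its parameters are illustrative, not a prescribed implementation.

from rdflib import Graph, RDF

def population_completeness(dataset: Graph, gold: Graph, cls) -> float:
    """Fraction of the gold standard's instances of `cls` also present in `dataset`."""
    gold_instances = set(gold.subjects(RDF.type, cls))
    if not gold_instances:
        return 1.0  # nothing required, trivially complete
    found = {s for s in dataset.subjects(RDF.type, cls) if s in gold_instances}
    return len(found) / len(gold_instances)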

Example: In our use case, the flight search engine should contain complete information so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, low price is of high importance. But looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases when there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data: The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14], and should be enough to approximate the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics: The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances, and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
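A rough sketch of these counts with rdflib might look as follows; which counts are "appropriate" for a given task remains a judgement outside the code.

from collections import Counter
from rdflib import Graph, OWL, RDF

def amount_of_data_stats(g: Graph) -> dict:
    """Coarse amount-of-data indicators: triples, instances per class, sameAs links."""
    instances_per_class = Counter(o for _, _, o in g.triples((None, RDF.type, None)))
    sameas_links = sum(1 for _ in g.triples((None, OWL.sameAs, None)))
    return {
        "triples": len(g),
        "instances_per_class": dict(instances_per_class),
        "sameAs_links": sameas_links,
    }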

Example: In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data so as to cover all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large amount of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy: In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics: Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.
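A simple term-counting sketch over meta-information attributes is shown below; the choice of rdfs:label and dcterms:description as meta-information properties is an assumption.

from rdflib import Graph
from rdflib.namespace import DCTERMS, RDFS

def relevancy_term_count(g: Graph, resource, terms) -> int:
    """Count occurrences of query terms within selected meta-information attributes."""
    count = 0
    for prop in (RDFS.label, DCTERMS.description):
        for value in g.objects(resource, prop):
            text = str(value).lower()
            count += sum(text.count(term.lower()) for term in terms)
    return count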

Example: When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through it efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions: For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevancy dimension.

3.2.2. Trust dimensions
Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance: There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities, to enable trust, assess authenticity and allow reproducibility.

Table 3
Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1 = absolute belief, -1 = absolute disbelief and 0 = lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows on a scale of 1 to 9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between two entities through a path, and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the sources' trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data, such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values to the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O


Metrics: Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.
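For the first provenance metric in Table 3 (presence of basic metadata about a dataset), a hedged sketch could check a VoID description for a title and a description; the vocabulary choice (VoID plus Dublin Core terms) is an assumption.

from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

def has_basic_dataset_metadata(void_graph: Graph) -> bool:
    """True if some void:Dataset carries both a title and a description."""
    for dataset in void_graph.subjects(RDF.type, VOID.Dataset):
        has_title = any(True for _ in void_graph.objects(dataset, DCTERMS.title))
        has_desc = any(True for _ in void_graph.objects(dataset, DCTERMS.description))
        if has_title and has_desc:
            return True
    return False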

Example: Our flight search engine constitutes information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability: Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is used as the means a consumer is provided with, which can be used to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics: Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example: In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website publishes cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows the crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus assuring the user that the data she receives is in fact the data that the source has vouched for.

3.2.2.3. Reputation: The authors in [17] associate the reputation of an entity either with the result of direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics: Reputation is usually a score, for example, a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation; they can be classified into manual or (semi-)automated approaches. The manual approach is via a survey in a community, by questioning other members who can help to determine the reputation of a source, or by asking the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.

Example: The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case, the search engine can decide to trust only the source with the higher reputation.

8 http://open-biomed.sourceforge.net/opmv/ns.html
9 http://www.w3.org/TR/prov-o

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are the trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.

3.2.2.4. Believability: In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics: Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.
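A minimal sketch of the trusted-provider check could look as follows; the provider list and the use of dcterms:publisher are illustrative assumptions.

from rdflib import Graph
from rdflib.namespace import DCTERMS

# hypothetical whitelist of trusted providers
TRUSTED_PROVIDERS = {"http://www.lufthansa.com", "http://www.britishairways.com"}

def provider_is_trusted(g: Graph, dataset_uri) -> bool:
    """True if any stated publisher of the dataset appears in the trusted list."""
    return any(str(publisher) in TRUSTED_PROVIDERS
               for publisher in g.objects(dataset_uri, DCTERMS.publisher))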

Example: In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about a previously unknown airline, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or only by looking at the data source can the decision about its credibility and believability be determined. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

10 http://www.w3.org/DesignIssues/UI.html

3.2.2.5. Licensing: "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics: Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.
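A sketch of the machine-readable part of this check is given below; probing dcterms:license and cc:license in a VoID description is one plausible choice, not the only one.

from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS

CC = Namespace("http://creativecommons.org/ns#")  # Creative Commons REL vocabulary

def has_machine_readable_license(g: Graph, dataset_uri) -> bool:
    """True if the dataset description carries a license triple."""
    return (any(True for _ in g.objects(dataset_uri, DCTERMS.license)) or
            any(True for _ in g.objects(dataset_uri, CC.license)))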

Example: Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on what permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or the license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl
12 http://vocab.deri.ie/void


Relations between dimensions: Verifiability is related to the believability dimension but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks an information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, this implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions
Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions that are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy: In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and the syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics: Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
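As one hedged example of the outlier-based metric, a simple distribution-based check over the numeric values of a property (say, ticket prices) could be sketched as follows; the z-score threshold is an illustrative assumption.

from statistics import mean, stdev
from rdflib import Graph

def outlier_values(g: Graph, prop, z_threshold: float = 3.0) -> list:
    """Flag numeric values of `prop` lying more than z_threshold std devs from the mean."""
    values = [float(o) for _, _, o in g.triples((None, prop, None))]  # assumes numeric literals
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]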

Example: In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy in terms of labelling as well as classification can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity: In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics: Objectivity cannot be measured qualitatively, but indirectly, by checking the authenticity of the source responsible for the information, that is, whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example: In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.

Table 4
Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality and sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities, stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents: "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics: A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
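A small sketch of syntactic validation, using rdflib's parser in place of a dedicated RDF validator, is shown below; the format parameter is an assumption about the serialisation at hand.

from rdflib import Graph

def is_syntactically_valid(path: str, fmt: str = "xml") -> bool:
    """True if the document parses as RDF in the given serialisation format."""
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception:  # any parse error signals an invalid document
        return False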

Example: In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4. Interlinking: When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks; (b) Terminological Reasoning; (c) Data Transformation; (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics: Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
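A crude sketch of one such measure, the fraction of subjects taking part in at least one owl:sameAs link, is given below; fuller network measures (clustering coefficient, centrality) would require a graph library on top of this.

from rdflib import Graph, OWL

def sameas_interlinking_degree(g: Graph) -> float:
    """Fraction of subjects in the dataset that occur in an owl:sameAs link."""
    subjects = set(g.subjects())
    if not subjects:
        return 0.0
    linked = {s for s, _, _ in g.triples((None, OWL.sameAs, None))}
    return len(linked) / len(subjects)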

Example: In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets but also to internal links within the dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency: Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.

Fig. 3. Different components related to the RDF representation of data. (The figure depicts the RDF stack: serialisation formats such as RDF/XML, N3, N-Triples, Turtle and RDFa; schema and ontology languages such as RDF-S, DAML, OWL with its Lite, DL and Full variants, and OWL2 with its EL, QL and RL profiles; rule languages such as SWRL and domain specific rules; and SPARQL together with its protocol and JSON query results format.)

Metrics: On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
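As one concrete instance, the disjoint-classes metric of Table 4 can be approximated with a SPARQL query over the dataset; this sketch only finds directly asserted violations, while a reasoner would also catch inferred ones.

from rdflib import Graph

DISJOINT_VIOLATIONS = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?entity WHERE {
    ?c1 owl:disjointWith ?c2 .
    ?entity a ?c1 , ?c2 .
}
"""

def disjoint_class_violations(g: Graph) -> list:
    """Entities asserted as members of two classes declared disjoint."""
    return [row.entity for row in g.query(DISJOINT_VIOLATIONS)]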

Example: Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness: The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the absence of redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to the absence of redundant attributes, and (ii) extensional conciseness (data level), which refers to the absence of redundant objects.

Metrics: As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) compared to the overall number of attributes or objects, respectively, present in a dataset.
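A sketch of extensional conciseness is given below; equating "equivalent objects" with identical property-value descriptions is a simplifying assumption.

from rdflib import Graph, RDF

def extensional_conciseness(g: Graph) -> float:
    """Ratio of unique instance descriptions to all instance representations."""
    instances = set(g.subjects(RDF.type, None))
    if not instances:
        return 1.0
    descriptions = {frozenset(g.predicate_objects(s)) for s in instances}
    return len(descriptions) / len(instances)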

Example: In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both the datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this leads to low extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions: Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. the height of a building vs. a product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions
The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions that are part of this group: availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1 Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
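As a rough illustration of how these metrics can be operationalised, the following Python sketch probes a SPARQL endpoint and the dereferencability of a single URI; it assumes the requests library, both URLs are hypothetical, and the content-type test is only a heuristic:

import requests

def sparql_endpoint_available(endpoint):
    # The endpoint should answer a trivial SPARQL query with HTTP 200.
    try:
        r = requests.get(endpoint, params={"query": "ASK { ?s ?p ?o }"},
                         headers={"Accept": "application/sparql-results+json"},
                         timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

def uri_dereferencable(uri):
    # The URI should return structured data, not a 4xx/5xx response.
    try:
        r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                         timeout=10, allow_redirects=True)
        return r.status_code == 200 and "rdf" in r.headers.get("Content-Type", "")
    except requests.RequestException:
        return False

print(sparql_endpoint_available("http://example.org/sparql"))  # hypothetical
print(uri_dereferencable("http://airlines.org/A123"))          # hypothetical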

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.

3.2.4.2 Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by performance. In the latter reference, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
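The scalability metric listed in Table 5 can be sketched as follows. This is a simplified probe assuming the requests library and a hypothetical endpoint; a real assessment would average over repeated runs:

import time
import requests

QUERY = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 10"}

def request_time(endpoint):
    # Seconds until a single HTTP request is answered.
    start = time.monotonic()
    requests.get(endpoint, params=QUERY, timeout=30)
    return time.monotonic() - start

def is_scalable(endpoint, n=10):
    # Table 5: the time to answer n requests divided by n should not
    # exceed the time to answer a single request.
    single = request_time(endpoint)
    batch = sum(request_time(endpoint) for _ in range(n))
    return batch / n <= single

endpoint = "http://example.org/sparql"                # hypothetical
print("low latency:", request_time(endpoint) <= 1.0)  # 1-second threshold, cf. [14]
print("scalable:", is_scalable(endpoint))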

Example. In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance needs requires caching mechanisms.

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer an amount of ten requests divided by ten is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5: Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type: Subjective or Objective.


Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) usage of hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3 Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost if the data becomes unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user that wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, SSL can be used to keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4 Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.
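The following sketch, assuming the requests library and a hypothetical endpoint, separates the time until the first bytes of the result arrive (latency, cf. the performance dimension) from the time until the complete response is received, which is what this dimension measures:

import time
import requests

def response_times(endpoint, query):
    start = time.monotonic()
    r = requests.get(endpoint, params={"query": query}, stream=True, timeout=60)
    next(r.iter_content(chunk_size=1024), b"")   # first bytes arrive
    first = time.monotonic() - start
    for _ in r.iter_content(chunk_size=1024):    # drain the full response
        pass
    complete = time.monotonic() - start
    return first, complete

latency, response_time = response_times(
    "http://example.org/sparql",                 # hypothetical
    "SELECT * WHERE { ?s ?p ?o } LIMIT 10000")
print(f"first result after {latency:.2f}s, complete after {response_time:.2f}s")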

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or the complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A long response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related with each other as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available, or has a low response time, can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5 Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1 Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, in [27] the use of very long URIs, or those that contain query parameters, is identified as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
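This recommendation translates into a simple check. In the following sketch, the length threshold is an assumption made for illustration, as the surveyed approaches do not prescribe a concrete limit:

from urllib.parse import urlparse

MAX_URI_LENGTH = 80   # assumed threshold, not prescribed in [27]

def uri_issues(uri):
    # Flags long URIs and URIs carrying query parameters [27].
    issues = []
    if len(uri) > MAX_URI_LENGTH:
        issues.append("too long")
    if urlparse(uri).query:
        issues.append("contains query parameters")
    return issues

for uri in ["http://airlines.org/A123",
            "http://flights.org/search?from=MIL&to=BOS&date=2012-05-01"]:
    print(uri, "->", uri_issues(uri) or "ok")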

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type: Subjective or Objective.

3.2.5.2 Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply not only compatibility with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some publish their data in RDF/XML and others in Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, necessary to handle different formats of data. The exchange of information in different formats is not a problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can still be applied to other domains.

3.2.5.3 Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
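The labelling metric from Table 6, i.e., the ratio of entities carrying an rdfs:label or rdfs:comment to all described entities, could be computed along the following lines; this sketch uses the rdflib library, and the dump file name is hypothetical:

from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph().parse("flights.ttl", format="turtle")   # hypothetical dump

subjects = set(g.subjects())                        # all described entities
labelled = {s for s in subjects
            if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}

ratio = len(labelled) / len(subjects) if subjects else 1.0
print(f"{ratio:.0%} of entities have a human-readable label or comment")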

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), then the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, human-readable labels and descriptions of defined classes, properties and entities to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.
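As a small illustration, the "misinterpretation of missing values" metric from Table 6, which flags blank nodes [27] since they lack globally unique identifiers, could be sketched as follows using the rdflib library and a hypothetical dump file:

from rdflib import Graph, BNode

g = Graph().parse("flights.ttl", format="turtle")   # hypothetical dump

# Count triples whose subject or object is a blank node.
blank = sum(1 for s, p, o in g
            if isinstance(s, BNode) or isinstance(o, BNode))
print(f"{blank} of {len(g)} triples involve blank nodes")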

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats, in different languages, as well as via different access methods.

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages, with the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format, in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify the accepted formats and languages by adding a corresponding Accept header to an HTTP request.
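A minimal content-negotiation probe, assuming the requests library and a hypothetical resource URI, could check which serialisation formats the server can deliver for the same resource:

import requests

FORMATS = {
    "RDF/XML": "application/rdf+xml",
    "Turtle": "text/turtle",
    "HTML": "text/html",
}

def available_formats(uri):
    # A format counts as supported if the server honours the Accept header.
    supported = []
    for name, mime in FORMATS.items():
        r = requests.get(uri, headers={"Accept": mime}, timeout=10)
        if r.status_code == 200 and mime in r.headers.get("Content-Type", ""):
            supported.append(name)
    return supported

print(available_formats("http://airlines.org/A123"))  # hypothetical URI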

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms a dataset is represented in (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1 Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", by providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
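The "currency of statements" metric from Table 7 can be written down directly; in this sketch the timestamps are illustrative, and the start time denotes when the data was first published:

from datetime import datetime

def currency(last_modified, start, now=None):
    # 1 - (current time - last modified time) / (current time - start time)
    now = now or datetime.utcnow()
    age = (now - last_modified).total_seconds()
    history = (now - start).total_seconds()
    return 1 - age / history if history > 0 else 0.0

published = datetime(2012, 1, 1)    # first published
modified = datetime(2012, 5, 20)    # last modification
print(f"currency: {currency(modified, published):.2f}")  # closer to 1 = fresher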

Example. Consider a user who is looking for the price of a flight from Milan to Boston and who receives updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time20)] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (no. of time-inaccurate instances / total no. of instances) [14] | O

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type: Subjective or Objective.
20 The time when the data was first published in the LOD cloud.
21 Older than the last modification time.

3.2.6.2 Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). Both these metrics are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered to be volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed relating to a specific data value.

3.2.6.3 Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
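A minimal sketch of this combination, with illustrative timestamps and the one-minute validity taken from the flight-price example of the volatility dimension, checks whether the time elapsed since the last update (currency) exceeds the length of time the value stays valid (volatility):

from datetime import datetime, timedelta

def is_timely(last_modified, volatility, now=None):
    # True if the value is still within its validity interval.
    now = now or datetime.utcnow()
    return (now - last_modified) <= volatility

price_updated = datetime(2012, 5, 20, 10, 0, 0)
print(is_timely(price_updated, timedelta(minutes=1),
                now=datetime(2012, 5, 20, 10, 0, 30)))   # True: still valid
print(is_timely(price_updated, timedelta(minutes=1),
                now=datetime(2012, 5, 20, 10, 5, 0)))    # False: outdated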

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of this time-related quality dimension, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine, corresponding to the volatility of the flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, so that it never becomes outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile the data is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.

We can also conclude that none of the approaches cover all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes in proportion to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

Table 8: Consideration of data quality dimensions in each of the included approaches. The table marks, for each approach (Bizer et al., 2009; Flemming et al., 2010; Böhm et al., 2010; Chen et al., 2010; Guéret et al., 2011; Hogan et al., 2010; Hogan et al., 2012; Lei et al., 2007; Mendes et al., 2012; Mostafavi et al., 2004; Fürber et al., 2011; Rula et al., 2012; Hartig, 2008; Gamble et al., 2011; Shekarpour et al., 2010; Golbeck et al., 2006; Gil et al., 2002; Golbeck et al., 2003; Gil et al., 2007; Jacobi et al., 2011; Bonatti et al., 2011), which of the 26 dimensions it covers: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep-conciseness, Rep-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness and Interlinking.

– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and less so at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data at both the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data at both the graph and dataset levels [16,22] and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or users have to assign weights to each of the pre-defined data quality metrics. The RDF Validator23, used in [26], checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.


Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields, one showing the assigned weight and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9: Qualitative evaluation of frameworks.

Paper | Application (G=General, S=Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al., 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al., 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | – | – | – | – | –
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al., 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al., 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al., 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al., 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | –



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanations for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function producing a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are Filter, Average, Max, Min, First, Last, Random, KeepSingleValueByQualityScore and PickMostFrequent.

26 http://ldif.wbsg.de/

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that the values are given meaning. Reconciliation against Freebase28 helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

27 http://code.zemanta.com/sparkica/

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to use for uploading and performing basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

28 http://www.freebase.com/
29 http://refine.deri.ie/reconciliationDocs

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness, or required a high level of interpretation.

Meanwhile, a substantial amount of research on data quality is being carried out, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] BECKETT, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210
[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] BRICKLEY, D., AND GUHA, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).
[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] HAYES, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210
[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).
[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] KIFER, M., AND BOLEY, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211
[33] KITCHENHAM, B. Procedures for performing systematic reviews.
[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).
[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND THE PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).
[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



Extracting data for quantitative and qualitative analysis. As a result of the search, we retrieved 21 papers from 2002 to 2012, listed in Table 1, which are the core of our survey. Of these 21 papers, 9 focus on provenance-related quality assessment and 12 propose generalized methodologies.

Table 1
List of the selected papers

Citation | Title
Gil et al., 2002 [18] | Trusting Information Sources One Citizen at a Time
Golbeck et al., 2003 [21] | Trust Networks on the Semantic Web
Mostafavi et al., 2004 [45] | An ontology-based method for quality assessment of spatial data bases
Golbeck, 2006 [20] | Using Trust and Provenance for Content Filtering on the Semantic Web
Gil et al., 2007 [17] | Towards content trust of web resources
Lei et al., 2007 [38] | A framework for evaluating semantic metadata
Hartig, 2008 [23] | Trustworthiness of Data on the Web
Bizer et al., 2009 [6] | Quality-driven information filtering using the WIQA policy framework
Böhm et al., 2010 [7] | Profiling linked open data with ProLOD
Chen et al., 2010 [12] | Hypothesis generation and data quality assessment through association mining
Flemming, 2010 [14] | Assessing the quality of a Linked Data source
Hogan et al., 2010 [26] | Weaving the Pedantic Web
Shekarpour et al., 2010 [53] | Modeling and evaluation of trust with an extension in semantic web
Fürber et al., 2011 [15] | SWIQA – a semantic web information quality assessment framework
Gamble et al., 2011 [16] | Quality, Trust and Utility of Scientific Data on the Web: Towards a Joint Model
Jacobi et al., 2011 [29] | Rule-Based Trust Assessment on the Semantic Web
Bonatti et al., 2011 [8] | Robust and scalable linked data reasoning incorporating provenance and trust annotations
Guéret et al., 2012 [22] | Assessing Linked Data Mappings Using Network Measures
Hogan et al., 2012 [27] | An empirical survey of Linked Data conformance
Mendes et al., 2012 [42] | Sieve: Linked Data Quality Assessment and Fusion
Rula et al., 2012 [50] | Capturing the Age of Linked Open Data: Towards a Dataset-independent Framework

Comparison perspective of selected approaches. There exist several perspectives that can be used to analyze and compare the selected approaches, such as:

– the definitions of the core concepts,
– the dimensions and metrics proposed by each approach,
– the type of data that is considered for the assessment,
– the level of automation of the supported tools.

The selected approaches differ in how they consider all of these perspectives and are thus compared and described in Section 3 and Section 4.

Quantitative overview. Out of the 21 selected approaches, only 5 (23%) were published in a journal, particularly only in the Journal of Web Semantics. On the other hand, 14 (66%) approaches were published at international conferences or workshops such as WWW, ISWC and ICDE. Only 2 (11%) of the approaches were master's theses or PhD workshop papers. The majority of the papers was published evenly distributed between the years 2010 and 2012 (4 papers each year, i.e. 57% in total), 2 papers were published in 2009 (9.5%) and the remaining 7 between 2002 and 2008 (33.5%).

3. Conceptualization

There exist a number of discrepancies in the definition of many concepts in data quality, due to the contextual nature of quality [3]. Therefore, we first describe and formally define the research context terminology (in Section 3.1) as well as the Linked Data quality dimensions, along with their respective metrics, in detail (in Section 3.2).

3.1. General terms

RDF Dataset. In this document we understand a data source as an access point for Linked Data in the Web. A data source provides a dataset and may support multiple methods of access. The terms RDF triple, RDF graph and RDF dataset have been adopted from the W3C Data Access Working Group [4,24,10].

Given an infinite set U of URIs (resource identifiers), an infinite set B of blank nodes and an infinite set L of literals, a triple 〈s, p, o〉 ∈ (U ∪ B) × U × (U ∪ B ∪ L) is called an RDF triple, where s, p and o are the subject, the predicate and the object of the triple, respectively. An RDF graph G is a set of RDF triples. A named graph is a pair 〈G, u〉, where G is called the default graph and u ∈ U. An RDF dataset is a set of default and named graphs {G, (u1, G1), (u2, G2), . . . , (un, Gn)}.

[Figure 2: Linked Data quality dimensions and the relations between them, grouped into the categories Contextual, Trust, Intrinsic, Accessibility, Representational and Dataset Dynamicity. The dimensions marked with * are newly introduced specifically for Linked Data.]


Data Quality. Data quality is commonly conceived as a multidimensional construct, with a popular definition being "fitness for use" [31]. Data quality may depend on various factors such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability and verifiability [57].

In terms of the Semantic Web, there are varying concepts of data quality. Semantic metadata, for example, is an important concept to be considered when assessing the quality of datasets [37]. On the other hand, the notion of link quality is another important aspect introduced in Linked Data, where it is automatically detected whether a link is useful or not [22]. It is also to be noted that data and information are used interchangeably in the literature.

Data Quality Problems. Bizer et al. [6] relate data quality problems to the design choices of web-based information systems, which integrate information from different providers. In [42], the problem of data quality is related to values being in conflict between different data sources, as a consequence of the diversity of the data.

In [14], the author does not provide a definition but implicitly explains the problems in terms of data diversity. In [26], the authors discuss errors, noise and difficulties, and in [27] the authors discuss modelling issues which prevent applications from exploiting the data.

Thus, the term data quality problems refers to a set of issues that can affect the potential of the applications that use the data.

Data Quality Dimensions and Metrics. Data quality assessment involves the measurement of quality dimensions or criteria that are relevant to the consumer. A data quality assessment metric or measure is a procedure for measuring an information quality dimension [6]. These metrics are heuristics that are designed to fit a specific assessment situation [39]. Since the dimensions are rather abstract concepts, the assessment metrics rely on quality indicators that allow for the assessment of the quality of a data source wrt. the criteria [14]. An assessment score is computed from these indicators using a scoring function.
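For concreteness, the following minimal sketch shows how such a scoring function could aggregate normalised indicator values into a single assessment score; the indicator names, weights and the weighted-mean aggregation are illustrative choices of ours, not taken from any surveyed approach:

```python
# A weighted scoring function over quality indicators (illustrative names).

def assessment_score(indicators, weights):
    """Weighted mean of indicator values, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in indicators)
    return sum(value * weights[name]
               for name, value in indicators.items()) / total_weight

indicators = {"property_completeness": 0.8,
              "syntactic_validity": 0.95,
              "interlinking": 0.4}
weights = {"property_completeness": 2.0,  # weighted higher for this use case
           "syntactic_validity": 1.0,
           "interlinking": 1.0}
print(round(assessment_score(indicators, weights), 3))  # -> 0.738
```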

In [6], the data quality dimensions are classified into three categories, according to the type of information that is used as quality indicator: (1) Content Based: the information content itself; (2) Context Based: information about the context in which information was claimed; (3) Rating Based: based on the ratings about the data itself or the information provider.

However, we identify further dimensions (defined in Section 3.2) and also further categories to classify the dimensions, namely (1) Contextual, (2) Trust, (3) Intrinsic, (4) Accessibility, (5) Representational and (6) Dataset Dynamicity dimensions, as depicted in Figure 2.

Data Quality Assessment Method. A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' needs in a specific use case [6]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.

3.2. Linked Data quality dimensions

After analyzing the 21 selected approaches in detail, we identified a core set of 26 different data quality dimensions that can be applied to assess the quality of Linked Data. In this section, we formalize and adapt the definition of each dimension to the Linked Data context. The metrics associated with each dimension are also identified and reported. Additionally, we group the dimensions following a classification introduced in [57], which is further modified and extended as follows:

– Contextual dimensions
– Trust dimensions
– Intrinsic dimensions
– Accessibility dimensions
– Representational dimensions
– Dataset dynamicity

These groups are not strictly disjoint but can partially overlap. Also, the dimensions are not independent from each other; rather, correlations exist among dimensions in the same group or between groups. Figure 2 shows the classification of the dimensions in the above mentioned groups as well as the inter- and intra-relations between them.

Use case scenario. Since data quality is described as fitness for use, we introduce a specific use case that will allow us to illustrate the importance of each dimension with the help of an example. Our use case is about an intelligent flight search engine, which relies on acquiring (aggregating) data from several datasets. It obtains information about airports and airlines from an airline dataset (e.g. OurAirports2, OpenFlights3). Information about the location of countries, cities and particular addresses is obtained from a spatial dataset (e.g. LinkedGeoData4). Additionally, aggregators in RDF pull all the information related to airlines from different data sources (e.g. Expedia5, Tripadvisor6, Skyscanner7, etc.), which allows a user to query the integrated dataset for a flight from any start and end destination for any time period. We will use this scenario throughout this section as an example of how data quality influences fitness for use.

3.2.1. Contextual dimensions

Contextual dimensions are those that highly depend on the context of the task at hand, as well as on the subjective preferences of the data consumer. There are three dimensions that are part of this group, namely completeness, amount-of-data and relevancy, which are listed, along with a comprehensive list of their corresponding metrics, in Table 2. The reference for each metric is provided in the table.

3.2.1.1. Completeness. Completeness is defined as "the degree to which information is not missing" in [5]. This dimension is further classified in [15] into the following categories: (a) Schema completeness, which is the degree to which entities and attributes are not missing in a schema; (b) Column completeness, which is a function of the missing values for a specific property or column; and (c) Population completeness, which refers to the ratio of entities represented in an information system to the complete population. In [42], completeness is distinguished on the schema level and the data level. On the schema level, a dataset is complete if it contains all of the attributes needed for a given task. On the data (i.e. instance) level, a dataset is complete if it contains all of the necessary objects for a given task. As can be observed, the authors in [5] give a general definition, whereas the authors in [15] provide a set of sub-categories for completeness. On the other hand, the two types defined in [42], i.e. schema and data level completeness, can be mapped to the categories (a) Schema completeness and (c) Population completeness provided in [15].

2 http://thedatahub.org/dataset/ourairports
3 http://thedatahub.org/dataset/open-flights
4 linkedgeodata.org
5 http://www.expedia.com
6 http://www.tripadvisor.com
7 http://www.skyscanner.de


Table 2
Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type (S = Subjective, O = Objective)

Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of the no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using a vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S


Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, which can thus be called ontology completeness; (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset and comparing it with the original dataset (or a gold standard dataset). It should be noted that in this case users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
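As an illustration, here is a minimal sketch of the property completeness metric using rdflib (our choice of library); it assumes a gold standard graph is available, in line with the closed-world assumption noted above, and the function name is hypothetical:

```python
# Property completeness: ratio of gold-standard subjects for which the
# assessed dataset also states a value for the given property.
from rdflib import Graph, URIRef

def property_completeness(dataset: Graph, gold: Graph, prop: URIRef) -> float:
    dataset_subjects = set(dataset.subjects(predicate=prop))
    gold_subjects = set(gold.subjects(predicate=prop))
    if not gold_subjects:
        return 1.0  # the gold standard requires no values for this property
    return len(dataset_subjects & gold_subjects) / len(gold_subjects)
```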

Example. In our use case, the flight search engine should contain complete information, so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, a low price is of high importance. But looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases when there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data. The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14] and should be enough to approximate the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
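A minimal sketch of how such coarse counts could be gathered with rdflib follows; treating any object URI outside the dataset's base namespace as an "external" link is our simplifying assumption, and the helper name is hypothetical:

```python
# Coarse amount-of-data indicators: triples, instances per class, external links.
from collections import Counter
from rdflib import Graph, RDF, URIRef

def amount_of_data(g: Graph, base_namespace: str) -> dict:
    instances_per_class = Counter(g.objects(predicate=RDF.type))
    external_links = sum(1 for s, p, o in g
                         if isinstance(o, URIRef)
                         and not str(o).startswith(base_namespace))
    return {"triples": len(g),
            "instances_per_class": dict(instances_per_class),
            "external_links": external_links}
```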

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data so as to cover all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large amount of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy. In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees, etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through them efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions. If a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevancy dimension.

3.2.2. Trust dimensions

Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance. There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities, to enable trust, assess authenticity and allow reproducibility.

Table 3
Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type (S = Subjective, O = Objective)

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief and 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows on a scale of 1–9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between 2 entities through a path, and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the source's trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data, such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation, or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community, questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.
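As an illustration, the following minimal sketch inspects a dataset description for basic provenance metadata (title, creator, source); using Dublin Core terms as the provenance vocabulary is our assumption, and the helper name and dataset URI are placeholders:

```python
# Check a dataset description for basic provenance metadata.
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

def provenance_indicators(g: Graph, dataset: URIRef) -> dict:
    return {"title": list(g.objects(dataset, DCTERMS.title)),
            "creator": list(g.objects(dataset, DCTERMS.creator)),
            "source": list(g.objects(dataset, DCTERMS.source))}
```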

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability. Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is described as the means a consumer is provided with to examine the data for correctness. Without such means, the assurance of the correctness of the data would come solely from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows restricting crawling only to verified datasets.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.

3.2.2.3. Reputation. The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example, a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual or (semi-)automated approaches. The manual approach is via a survey in a community, by questioning other members who can help to determine the reputation of a source, or via the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.
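For the (semi-)automated approach, a minimal sketch follows that derives a reputation score by running PageRank over a graph of datasets connected by the links they set to each other; the edge list is an illustrative placeholder:

```python
# PageRank-based reputation over a dataset link graph (placeholder edges).
import networkx as nx

links = [("aggregator", "airline-dataset"),
         ("aggregator", "spatial-dataset"),
         ("airline-dataset", "spatial-dataset")]
g = nx.DiGraph(links)
reputation = nx.pagerank(g)  # higher score = more link endorsements
print(sorted(reputation.items(), key=lambda kv: -kv[1]))
```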

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case, the search engine can decide to trust only the source with the higher reputation.

8 http://open-biomed.sourceforge.net/opmv/ns.html

9 http://www.w3.org/TR/prov-o/



Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher, and the reliability of the statements.

3.2.2.4. Believability. In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or only by looking at the data source can the decision about its credibility and believability be determined. Another method, proposed by Tim Berners-Lee, was that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

10 http://www.w3.org/DesignIssues/UI.html


3.2.2.5. Licensing. "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license are also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.
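A minimal sketch of the machine-readable license check follows, assuming the dataset's VoID description uses dcterms:license; the file location and dataset URI are illustrative placeholders:

```python
# Look for a dcterms:license triple in a dataset's VoID description.
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

void = Graph().parse("void.ttl", format="turtle")
dataset = URIRef("http://example.org/dataset")
licenses = list(void.objects(dataset, DCTERMS.license))
if licenses:
    print("machine-readable license found:", licenses)
else:
    print("no machine-readable license indicated")
```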

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data, and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void


Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that a piece of information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated to the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and the syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
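As an illustration of the distance-based outlier approach, here is a minimal sketch that flags numeric property values (e.g. fares for the same route) lying far from the median; the threshold and the data are illustrative:

```python
# Median-absolute-deviation outlier detection for numeric property values.
from statistics import median

def outliers(values, threshold=3.0):
    m = median(values)
    mad = median(abs(v - m) for v in values) or 1e-9  # avoid division by zero
    return [v for v in values if abs(v - m) / mad > threshold]

print(outliers([320.0, 340.0, 335.0, 310.0, 4250.0]))  # -> [4250.0]
```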

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured directly, but only indirectly, by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.
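A minimal sketch of the independent-confirmation check follows: a fact counts as confirmed when it appears in at least k of the consulted graphs; the threshold k and the helper name are our assumptions:

```python
# Count in how many independent graphs a triple occurs.
from rdflib import Graph

def confirmed_by(triple, graphs, k=2):
    """True if at least k of the given rdflib Graphs contain the triple."""
    return sum(1 for g in graphs if triple in g) >= k
```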

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of bias, where the review is falsified due to the provider's preferences or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information. However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.


Table 4
Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type (S = Subjective, O = Objective)

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviation-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals, which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality, sameAs chains and description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness & validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash, which can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O



3.2.3.3. Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
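A minimal sketch of such a syntactic check, using rdflib as the parser (any conformant RDF validator could be substituted; the file path is a placeholder):

```python
# Attempt to parse a document; a parse failure indicates invalid syntax.
from rdflib import Graph

def is_syntactically_valid(path: str, fmt: str = "xml") -> bool:
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception as e:  # rdflib raises format-specific parse errors
        print(f"invalid RDF document: {e}")
        return False
```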

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States), but is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes the query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process the data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as the data can then neither be processed nor reasoned over by a consumer.

3.2.3.4. Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through such typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios, as identified in [41]: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing.

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
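As an illustration, a minimal sketch computing some of these network measures with networkx, in the spirit of [22]; the list of links between resources is a placeholder:

```python
# Network measures over a graph of interlinked resources (placeholder edges).
import networkx as nx

links = [("ex:UnitedStates", "dbpedia:United_States"),
         ("ex:Paris", "dbpedia:Paris"),
         ("dbpedia:Paris", "geo:Paris")]
g = nx.Graph(links)

degrees = dict(g.degree())
print("interlinking degree (avg):", sum(degrees.values()) / g.number_of_nodes())
print("clustering coefficient:", nx.average_clustering(g))
print("degree centrality:", nx.degree_centrality(g))
```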

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the links between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links, but also the maintenance of these interlinks, is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which might then render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example, according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data rather than to RDF and its representation formats.
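As an illustration of domain/range checking, the following Python sketch (using rdflib; g is an assumed, already-loaded knowledge base including its schema) reports rdfs:range violations. It deliberately ignores class hierarchies and untyped or literal objects, so it is a simplification rather than a full reasoner:

from rdflib.namespace import RDF, RDFS

def range_violations(g):
    violations = []
    # For every property with a declared range ...
    for prop, rng in g.subject_objects(RDFS.range):
        # ... inspect every triple using that property.
        for s, o in g.subject_objects(prop):
            types = set(g.objects(o, RDF.type))
            if types and rng not in types:
                violations.append((s, prop, o, rng))
    return violations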

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and the same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
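A minimal sketch of such ratios in Python with rdflib follows; g is an assumed, already-loaded dataset, and the extensional check simplifies the definition above by treating two subjects with identical property-value sets as redundant object representations:

def intensional_conciseness(g):
    # Unique properties relative to all property usages.
    predicates = [p for _, p, _ in g]
    return len(set(predicates)) / len(predicates) if predicates else 1.0

def extensional_conciseness(g):
    # Unique entity descriptions relative to all entity descriptions.
    descriptions = [
        frozenset(g.predicate_objects(s)) for s in set(g.subjects())
    ]
    return len(set(descriptions)) / len(descriptions) if descriptions else 1.0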

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this leads to low extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real-world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group: availability, performance, security and response-time, displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, the SPARQL endpoints or the RDF dumps, and also by the dereferencability of the URIs.
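A hedged sketch of such checks in Python with the requests library is shown below; the endpoint URL and resource URI are placeholders to be supplied by the assessor:

import requests

def sparql_endpoint_available(endpoint):
    # The endpoint is considered available if it answers a trivial ASK query.
    try:
        r = requests.get(endpoint,
                         params={"query": "ASK { ?s ?p ?o }"},
                         headers={"Accept": "application/sparql-results+json"},
                         timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

def uri_dereferencable(uri):
    # A URI is dereferencable if it does not answer with a 4xx/5xx code.
    try:
        r = requests.head(uri, allow_redirects=True, timeout=10)
        return r.status_code < 400
    except requests.RequestException:
        return False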

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove expensive to support in performance- or data-intensive environments. We notice that neither reference provides a formal definition of performance: the former gives a general description without explaining what is meant by performance, while the latter describes an issue related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or the usage of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
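For instance, the latency metric of [14] (average answer time of an HTTP request, with one second as the threshold) could be approximated as follows; the endpoint and query are assumed inputs:

import time
import requests

def average_latency(endpoint, query, runs=10):
    # Average time in seconds until the endpoint answers;
    # above one second the latency would be considered too high [14].
    total = 0.0
    for _ in range(runs):
        start = time.monotonic()
        requests.get(endpoint, params={"query": query}, timeout=30)
        total += time.monotonic() - start
    return total / runs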

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O


Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost if the data becomes unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, SSL can be used to keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can serve as a means to keep the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. High response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability, versatility as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data but also influences the performance of data when queried. For example, in [27], the use of very long URIs or those that contain query parameters is an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
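A simple sketch of the corresponding detection follows, with an illustrative 80-character threshold (the threshold itself is an assumption, not prescribed by [27]):

from urllib.parse import urlparse

def flag_uri_issues(uris, max_length=80):
    # Flag URIs that are overly long or carry query parameters [27].
    flagged = []
    for uri in uris:
        if len(uri) > max_length or urlparse(uri).query:
            flagged.append(uri)
    return flagged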


Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding Accept header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations to represent their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data following the Linked Data principles, which are designed to support heterogeneous description models, which is necessary to handle different formats of data. The exchange of information in different formats is then not a major problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to perform a search for frequently used properties and classes in the LOD cloud.
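The re-use of vocabularies can be approximated by the ratio of established vocabularies used to all vocabularies used (cf. Section 4.2). A sketch with rdflib follows; g is an assumed, already-loaded dataset and the whitelist of established namespaces is purely illustrative:

from rdflib.namespace import split_uri

ESTABLISHED = {  # illustrative whitelist of well-known namespaces
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://purl.org/dc/terms/",
}

def vocabulary_reuse(g):
    namespaces = set()
    for _, p, _ in g:
        try:
            ns, _name = split_uri(p)
            namespaces.add(str(ns))
        except ValueError:
            continue  # skip properties that cannot be split
    if not namespaces:
        return 1.0
    return len(namespaces & ESTABLISHED) / len(namespaces)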

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to only considering the needs of the OBO Foundry community; these conventions can nevertheless be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
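The labelling metric from Table 6 can be sketched as a ratio in Python with rdflib; g is an assumed, already-loaded dataset:

from rdflib.namespace import RDFS

def label_coverage(g):
    # No. of entities with an rdfs:label or rdfs:comment
    # relative to the total no. of entities described [14].
    entities = set(g.subjects())
    if not entities:
        return 1.0
    labelled = {s for s in entities
                if g.value(s, RDFS.label) is not None
                or g.value(s, RDFS.comment) is not None}
    return len(labelled) / len(entities)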

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset4, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), then the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, human-readable labels and descriptions of defined classes, properties and entities to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats, in different languages, as well as via different access methods.

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is able to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
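A sketch of the content negotiation check (cf. Table 6) with the Python requests library follows; the list of formats to probe is an illustrative choice:

import requests

FORMATS = ["text/turtle", "application/rdf+xml", "text/html"]

def negotiated_formats(uri):
    # Return the serialization formats the resource actually serves
    # when requested via a corresponding Accept header [14].
    supported = []
    for fmt in FORMATS:
        try:
            r = requests.get(uri, headers={"Accept": fmt}, timeout=10)
            if r.status_code == 200 and fmt in r.headers.get("Content-Type", ""):
                supported.append(fmt)
        except requests.RequestException:
            continue
    return supported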

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms in which a dataset is represented (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
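Following the currency-of-statements metric of [50] listed in Table 7, a minimal Python sketch is given below; the timestamps are assumed to be available from the temporal metadata of the dataset:

from datetime import datetime

def currency_of_statement(last_modified, start_time, now=None):
    # 1 - (now - last modified time)/(now - start time), where start time
    # is when the data was first published [50].
    now = now or datetime.utcnow()
    age = (now - last_modified).total_seconds()
    lifetime = (now - start_time).total_seconds()
    return 1.0 - age / lifetime if lifetime > 0 else 0.0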

Example. Consider a user who is looking for the price of a flight from Milan to Boston and who receives updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

20 The time when the data was first published in the LOD cloud.
21 Older than the last modification time.

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time20)] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the published data has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O


3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). Both these metrics are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes every minute (the data remains valid for one minute). In order to have an up-to-date price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed relative to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
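One possible combination (an assumption for illustration, not a formula prescribed by the surveyed approaches) considers data timely while its age stays below its volatility window:

def timeliness_score(age_seconds, volatility_seconds):
    # age_seconds: time since the last update (currency component);
    # volatility_seconds: length of time the data remains valid.
    if volatility_seconds <= 0:
        return 0.0
    return max(0.0, 1.0 - age_seconds / volatility_seconds)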

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus provided by most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective metrics are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.


Table 8. Consideration of data quality dimensions in each of the included approaches.

[The checkmark matrix of this table is not recoverable from this rendering. The table indicates, for each of the 21 included approaches (Bizer et al., 2009; Flemming et al., 2010; Böhm et al., 2010; Chen et al., 2010; Guéret et al., 2011; Hogan et al., 2010; Hogan et al., 2012; Lei et al., 2007; Mendes et al., 2012; Mostafavi et al., 2004; Fürber et al., 2011; Rula et al., 2012; Hartig, 2008; Gamble et al., 2011; Shekarpour et al., 2010; Golbeck et al., 2006; Gil et al., 2002; Golbeck et al., 2003; Gil et al., 2007; Jacobi et al., 2011; Bonatti et al., 2011), which of the dimensions Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness and Interlinking it considers.]



We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset levels [16,22] and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or they have to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and the respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.


Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields, one showing the assigned weight and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9. Qualitative evaluation of frameworks.

Paper | General/Specific | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al., 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | - | - | - | - | -
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al., 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al., 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al., 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | -



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanations for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problems; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de/


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lasUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lasUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
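To make the semantics of such a metric concrete, the following Python sketch gives one plausible reading of a TimeCloseness-style scoring function (an illustration on our part, not Sieve's actual code): values updated within the configured timeSpan score close to 1, and older values decay towards 0.

from datetime import date

# Illustrative re-implementation of the idea behind a TimeCloseness-style
# scoring function: recent updates score near 1, old updates near 0.
def time_closeness(last_updated: date, today: date, time_span_days: int) -> float:
    age = (today - last_updated).days
    return max(0.0, 1.0 - age / time_span_days)

# A value last updated 3 days ago, with timeSpan = 7 as in Listing 1:
print(time_closeness(date(2012, 3, 1), date(2012, 3, 4), 7))  # 0.571...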

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source but rather to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation with Freebase28 helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data to and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data have been presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is a considerable amount of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).

[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).

[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] Kitchenham, B. Procedures for performing systematic reviews.

[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – deep fact validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).

[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).

[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



[Figure 2: diagram showing the quality dimensions (completeness, relevancy, amount of data, verifiability, reputation, believability, licensing, provenance, objectivity, accuracy, validity of documents, consistency, conciseness, interlinking, timeliness, currency, volatility, representational conciseness, representational consistency, interpretability, understandability, versatility, availability, performance, response time, security) arranged in the groups Contextual, Trust, Intrinsic, Accessibility, Representational and Dataset Dynamicity, with overlap and relatedness links between them.]

Fig. 2. Linked data quality dimensions and the relations between them. The dimensions marked with a * are newly introduced specifically for Linked Data.

set of RDF triples. A named graph is a pair ⟨G, u⟩, where G is called the default graph and u ∈ U. An RDF dataset is a set of default and named graphs: D = {G, (u1, G1), (u2, G2), ..., (un, Gn)}.

Data Quality. Data quality is commonly conceived as a multidimensional construct, with a popular definition being "fitness for use" [31]. Data quality may depend on various factors such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability and verifiability [57].

In terms of the Semantic Web, there are varying concepts of data quality. The semantic metadata, for example, is an important concept to be considered when assessing the quality of datasets [37]. On the other hand, the notion of link quality is another important aspect in Linked Data, where it is automatically detected whether a link is useful or not [22]. It is also to be noted that data and information are used interchangeably in the literature.

Data Quality Problems. Bizer et al. [6] relate data quality problems to the design of web-based information systems, which integrate information from different providers. In [42], the problem of data quality is related to values being in conflict between different data sources as a consequence of the diversity of the data.

In [14], the author does not provide a definition but implicitly explains the problems in terms of data diversity. In [26], the authors discuss errors, noise and difficulties, and in [27] the authors discuss modelling issues which prevent applications from exploiting the data.

Thus, the term data quality problem refers to a set of issues that can affect the potential of the applications that use the data.

Data Quality Dimensions and Metrics. Data quality assessment involves the measurement of quality dimensions or criteria that are relevant to the consumer. A data quality assessment metric or measure is a procedure for measuring an information quality dimension [6]. These metrics are heuristics that are designed to fit a specific assessment situation [39]. Since the dimensions are rather abstract concepts, the assessment metrics rely on quality indicators that allow for the assessment of the quality of a data source w.r.t. the criteria [14].


An assessment score is computed from these indicators using a scoring function.

In [6], the data quality dimensions are classified into three categories according to the type of information that is used as quality indicator: (1) Content Based – the information content itself; (2) Context Based – information about the context in which information was claimed; (3) Rating Based – based on the ratings about the data itself or the information provider.

However, we identify further dimensions (defined in Section 3.2) and also further categories to classify the dimensions, namely (1) Contextual, (2) Trust, (3) Intrinsic, (4) Accessibility, (5) Representational and (6) Dataset Dynamicity dimensions, as depicted in Figure 2.

Data Quality Assessment Method. A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' needs in a specific use case [6]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.

3.2. Linked Data quality dimensions

After analyzing the 21 selected approaches in detail, we identified a core set of 26 different data quality dimensions that can be applied to assess the quality of Linked Data. In this section, we formalize and adapt the definition for each dimension to the Linked Data context. The metrics associated with each dimension are also identified and reported. Additionally, we group the dimensions following a classification introduced in [57], which is further modified and extended as:

– Contextual dimensions
– Trust dimensions
– Intrinsic dimensions
– Accessibility dimensions
– Representational dimensions
– Dataset dynamicity

These groups are not strictly disjoint but can partially overlap. Also, the dimensions are not independent from each other, but correlations exist among dimensions in the same group or between groups. Figure 2 shows the classification of the dimensions into the above mentioned groups as well as the inter- and intra-relations between them.

Use case scenario. Since data quality is described as fitness to use, we introduce a specific use case that will allow us to illustrate the importance of each dimension with the help of an example. Our use case is about an intelligent flight search engine, which relies on acquiring (aggregating) data from several datasets. It obtains information about airports and airlines from an airline dataset (e.g. OurAirports2, OpenFlights3). Information about the location of countries, cities and particular addresses is obtained from a spatial dataset (e.g. LinkedGeoData4). Additionally, aggregators in RDF pull all the information related to airlines from different data sources (e.g. Expedia5, Tripadvisor6, Skyscanner7, etc.), which allows a user to query the integrated dataset for a flight from any start and end destination for any time period. We will use this scenario throughout this section as an example of how data quality influences fitness to use.

3.2.1. Contextual dimensions
Contextual dimensions are those that highly depend on the context of the task at hand as well as on the subjective preferences of the data consumer. There are three dimensions that are part of this group: completeness, amount-of-data and relevancy, which are listed, along with a comprehensive list of their corresponding metrics, in Table 2. The reference for each metric is provided in the table.

3.2.1.1. Completeness. Completeness is defined as "the degree to which information is not missing" in [5]. This dimension is further classified in [15] into the following categories: (a) Schema completeness, which is the degree to which entities and attributes are not missing in a schema; (b) Column completeness, which is a function of the missing values for a specific property or column; and (c) Population completeness, which refers to the ratio of entities represented in an information system to the complete population. In [42], completeness is distinguished on the schema level and the data level. On the schema level, a dataset is complete if it contains all of the attributes needed for a given task. On the data (i.e. instance) level, a dataset is complete if it contains all of the necessary objects for a given task. As can be observed, the authors in [5] give a general definition, whereas the authors in [15] provide a set of sub-categories for completeness. On the other hand, in [42], the two types defined, i.e. the schema and data level completeness, can be mapped to the two categories (a) Schema completeness and (c) Population completeness provided in [15].

2 http://thedatahub.org/dataset/ourairports
3 http://thedatahub.org/dataset/open-flights
4 linkedgeodata.org
5 http://www.expedia.com
6 http://www.tripadvisor.com
7 http://www.skyscanner.de


Table 2. Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of the no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using a vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S


Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, which can thus be called ontology completeness; (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset compared to the original dataset (or gold standard dataset). It should be noted that in this case, users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
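As an illustration of these counts, the following Python sketch (with hypothetical entity URIs) computes population and interlinking completeness against such a gold standard:

# Minimal sketch of completeness ratios under a closed-world assumption.
gold = {"ex:JFK", "ex:CDG", "ex:TXL", "ex:LAX"}      # gold standard entities
dataset = {"ex:JFK", "ex:CDG", "ex:TXL"}             # entities actually present
interlinked = {"ex:JFK", "ex:CDG"}                   # entities with at least one interlink

population_completeness = len(dataset & gold) / len(gold)      # 0.75
interlinking_completeness = len(interlinked) / len(dataset)    # 0.666...
print(population_completeness, interlinking_completeness)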

Example. In our use case, the flight search engine should contain complete information, so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, a low price is of high importance. Looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases when there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data. The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14] and should be enough to approximate the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.



Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
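The following Python sketch (a toy illustration with hypothetical triples) shows how such basic amount-of-data statistics could be gathered from a set of triples:

from collections import Counter

# Toy amount-of-data statistics over (subject, predicate, object) triples;
# vocabulary URIs are shortened for readability.
triples = [
    ("ex:flight1", "rdf:type", "ex:Flight"),
    ("ex:flight2", "rdf:type", "ex:Flight"),
    ("ex:TXL", "rdf:type", "ex:Airport"),
    ("ex:TXL", "owl:sameAs", "dbpedia:Berlin_Tegel_Airport"),
]

num_triples = len(triples)
instances_per_class = Counter(o for s, p, o in triples if p == "rdf:type")
external_links = [t for t in triples if t[1] == "owl:sameAs"]

print(num_triples, dict(instances_per_class), len(external_links))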

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data and thus covers all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large number of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy. In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, the retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.
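As a toy illustration of the first of these metrics, the Python sketch below counts occurrences of query terms within meta-information attributes such as rdfs:label; the attribute values and the (unweighted) counting scheme are our own assumptions:

# Toy sketch: count query terms within meta-information attributes.
meta = {
    "rdfs:label": "Direct flight Paris New York",
    "dcterms:subject": "flights, air travel",
}
query_terms = {"flight", "paris"}

score = sum(
    value.lower().count(term)
    for value in meta.values()
    for term in query_terms
)
print(score)  # 3: two matches in the label, one in the subject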

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees, etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through it efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions. For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevance dimension.

3.2.2. Trust dimensions
Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance. There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities in order to enable trust, assess authenticity and allow reproducibility.


Table 3. Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief and 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows, on a scale of 1-9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between 2 entities through a path, and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the sources' trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the use of the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community, questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability. Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14], the verifiability criterion is used as the means a consumer is provided with, which can be used to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website might publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept the provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.
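The following Python sketch illustrates the underlying idea in a strongly simplified form (our own illustration; real graph canonicalisation, as in [11], must additionally handle blank nodes, and the bare hash would be replaced by an actual asymmetric signature): the graph is brought into a canonical order before computing a digest, so the digest is independent of the particular serialisation order.

import hashlib

# Simplified sketch: canonicalise an N-Triples serialisation (sorted lines)
# and compute a digest over it; a real deployment would sign this digest.
ntriples = [
    '<http://ex.org/flight1> <http://ex.org/price> "299" .',
    '<http://ex.org/flight1> <http://ex.org/carrier> <http://ex.org/AirlineA> .',
]

def canonical_digest(lines):
    canonical = "\n".join(sorted(lines)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

published = canonical_digest(ntriples)            # computed by the source
received = canonical_digest(reversed(ntriples))   # recomputed by the consumer
print(published == received)                      # True: order does not matter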

3.2.2.3. Reputation. The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, an organisation, a group of people or a community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual or (semi-)automated approaches. The manual approach is via a survey in a community, by questioning other members who can help to determine the reputation of a source, or by asking the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case, the search engine can decide to trust only the source with the higher reputation.

8 http://open-biomed.sourceforge.net/opmv/ns.html

9 http://www.w3.org/TR/prov-o/



Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are the trust values, based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.
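The propagation/aggregation scheme of [53], mentioned in Table 3, can be pictured with the following Python sketch (an illustration under our own assumptions; trust values and path weights are hypothetical): trust is multiplied along each path of the web of trust, and the per-path results are combined by a weighted average.

from functools import reduce

def path_trust(edge_values):
    """Propagation: combine the trust values along one path by multiplication."""
    return reduce(lambda a, b: a * b, edge_values, 1.0)

def aggregate_trust(paths, weights):
    """Aggregation: weighted average of the trust computed per path."""
    scores = [path_trust(p) for p in paths]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Two paths from user A to source S in a web of trust:
paths = [[0.9, 0.8], [0.6]]   # A -> B -> S and A -> S
weights = [0.5, 1.0]          # e.g. shorter paths weigh more
print(aggregate_trust(paths, weights))  # (0.5*0.72 + 1.0*0.6) / 1.5 = 0.64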

3.2.2.4. Believability. In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies, such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or the decision about its credibility and believability can be made only by looking at the data source. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

10 http://www.w3.org/DesignIssues/UI.html


3.2.2.5. Licensing. "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl
12 http://vocab.deri.ie/void
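As a sketch of the machine-readable licensing metric, the following Python snippet uses the rdflib library to test whether a (hypothetical) dataset description carries a dcterms:license statement:

from rdflib import Graph, URIRef

# Sketch: does the dataset description contain a machine-readable license?
DCT_LICENSE = URIRef("http://purl.org/dc/terms/license")
dataset = URIRef("http://example.org/dataset/flights")  # hypothetical dataset URI

g = Graph()
g.add((dataset, DCT_LICENSE, URIRef("http://opendatacommons.org/licenses/odbl/")))

licenses = list(g.objects(dataset, DCT_LICENSE))
print("machine-readable license found:", bool(licenses), licenses)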


Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, this implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions
Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists these dimensions with their respective metrics, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and the syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].
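For the outlier-based metric from Table 4, a minimal distance-based check might look as follows (a Python sketch with made-up values; the 2-sigma threshold is a common rule of thumb, not prescribed by the surveyed approaches):

from statistics import mean, stdev

# Sketch of distance-based outlier detection for a numeric property,
# e.g. ticket prices reported for the same route.
prices = [210.0, 225.0, 198.0, 240.0, 215.0, 1999.0]
mu, sigma = mean(prices), stdev(prices)
outliers = [p for p in prices if abs(p - mu) > 2 * sigma]
print(outliers)  # [1999.0]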

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured directly, but only indirectly, by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of bias, where the review is falsified due to the provider's preferences or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information, from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.


Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous(13) annotation / representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality and sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities, stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness & validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates(14) | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O


However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
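A minimal sketch of such a check for ill-typed literals is given below (our own Python illustration, covering only two XSD datatypes):

import re
from datetime import datetime

# Sketch: does a literal's lexical form abide by its declared datatype?
def is_valid_literal(lexical: str, datatype: str) -> bool:
    if datatype == "xsd:integer":
        return re.fullmatch(r"[+-]?[0-9]+", lexical) is not None
    if datatype == "xsd:date":
        try:
            datetime.strptime(lexical, "%Y-%m-%d")
            return True
        except ValueError:
            return False
    return True  # unknown datatypes are not checked here

print(is_valid_literal("2013-01-01", "xsd:date"))  # True
print(is_valid_literal("2013-13-01", "xsd:date"))  # False: no 13th month
print(is_valid_literal("3.5", "xsd:integer"))      # False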

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user receives no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris instead of using geo:long. This causes the query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers being unable to process the data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of document invalidity, as such data can neither be processed nor reasoned over by a consumer.

3.2.3.4. Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through such typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified into one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, clustering coefficient, sameAs chains, centrality, and description richness through owl:sameAs links.
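As a simple illustration, the interlinking degree can be approximated as the proportion of resources that participate in owl:sameAs links. The sketch below (our own formulation, using rdflib) computes this ratio for a parsed graph; the other network measures would require a dedicated graph library and are omitted here.

    from rdflib import Graph
    from rdflib.namespace import OWL

    def interlinking_degree(g: Graph) -> float:
        # Resources linked to another resource via owl:sameAs,
        # relative to all subject resources in the graph.
        subjects = set(g.subjects())
        if not subjects:
            return 0.0
        linked = set(g.subjects(OWL.sameAs, None))
        return len(linked) / len(subjects)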

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common that different URIs identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to links between different datasets, but also to internal links within a dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it; the ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e., incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data. [Figure placeholder: the original graphic groups the components around RDF: schema and ontology languages (RDF-S, OWL, OWL 2 with its DL, Lite, Full, QL, RL and EL profiles, DAML, SWRL and domain-specific rules), serialization formats (RDF/XML, RDFa, N3/NT, Turtle), and SPARQL with its Protocol and Query Results JSON format; only these labels are recoverable from the extraction.]

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which might then render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.
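One inexpensive check in this spirit, which does not require a full reasoner, is the detection of violations of owl:FunctionalProperty declarations. The sketch below is our own illustration with rdflib; it assumes the graph contains both schema and instance data, and note that two distinct values only constitute a contradiction under a unique-name assumption.

    from collections import defaultdict
    from rdflib import Graph
    from rdflib.namespace import RDF, OWL

    def functional_property_violations(g: Graph):
        violations = []
        for prop in g.subjects(RDF.type, OWL.FunctionalProperty):
            values = defaultdict(set)
            for s, o in g.subject_objects(prop):
                values[s].add(o)
            for s, objs in values.items():
                if len(objs) > 1:  # several values for a functional property
                    violations.append((s, prop, objs))
        return violations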

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data; consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition stating that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
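A minimal sketch of both ratios, assuming the dataset and the target schema are already given as collections of property and object identifiers (our own rendering of the metrics from [42]):

    def intensional_conciseness(dataset_properties, target_schema_properties):
        # Unique attributes of the dataset relative to the overall
        # number of attributes in the target schema.
        return len(set(dataset_properties)) / len(set(target_schema_properties))

    def extensional_conciseness(object_representations):
        # Unique objects relative to all object representations;
        # equivalence is approximated here by exact identity.
        return len(set(object_representations)) / len(object_representations)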

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, the extensional conciseness is reduced. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real-world data. Objectivity overlaps with the concept of accuracy, but differs from it because accuracy does not heavily depend on the consumers' preference, whereas objectivity is influenced by the user's preferences and by the type of information (e.g., the height of a building vs. a product description). Objectivity is also related to the verifiability dimension: the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group: availability, performance, security and response-time, displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. The former definition relates availability to the measurement of available information, rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, of the SPARQL endpoints or of the RDF dumps, and also by the dereferencability of the URIs.
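The sketch below illustrates two of these checks with Python's requests library; the probe query and the status-code interpretation are common choices rather than prescriptions of the cited metrics, and the URLs passed in would be placeholders.

    import requests

    def sparql_endpoint_available(endpoint):
        # Probe the endpoint with a trivial ASK query.
        try:
            r = requests.get(endpoint, params={"query": "ASK {}"}, timeout=10)
            return r.status_code == 200
        except requests.RequestException:
            return False

    def uri_dereferencable(uri):
        # A URI counts as dereferencable if it does not answer
        # with a 4xx or 5xx response code.
        try:
            r = requests.get(uri, timeout=10, allow_redirects=True)
            return r.status_code < 400
        except requests.RequestException:
            return False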

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at the specified URI or that the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. In [27], however, performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove expensive to support in performance- or data-intensive environments. We can notice that neither reference provides a formal definition of performance: the former gives a general description without explaining what is meant by performance, while the latter describes an issue related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
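Latency, for instance, can be probed as sketched below (our own illustration; the one-second threshold stems from the metric listed in Table 5, and the URL is a placeholder):

    import time
    import requests

    def average_latency(url, runs=10):
        # Mean time until an HTTP response is received.
        total = 0.0
        for _ in range(runs):
            start = time.monotonic()
            requests.get(url, timeout=30)
            total += time.monotonic() - start
        return total / runs

    # The data source has acceptable latency if an average
    # request is answered within one second [14]:
    # average_latency("http://example.org/sparql") <= 1.0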

Example. In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users.


Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
 | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
 | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
 | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
 | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or of no redirection using the status code 303 See Other or no code 200 OK [14,26] | O
 | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
 | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
 | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
 | high throughput | no. of answered HTTP requests per second [14] | O
 | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
 | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
 | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5: Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type - Subjective or Objective.

Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance needs requires caching mechanisms.

Latency is the amount of time from issuing a query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, in order to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, SSL can be used to keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect, for which a reliable and secure infrastructure or appropriate methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, all of which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A slow response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are interrelated as follows: the performance of a system is related to the availability and response-time dimensions, since only a dataset that is available and answers with low response time can perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.
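Complementarily, the objective metric from Table 6 (keeping URIs short) can be sketched as follows; [27] does not fix a numeric threshold, so the 80-character limit below is our illustrative assumption.

    from urllib.parse import urlparse

    def uri_is_concise(uri, max_length=80):
        # Flags URIs that are overly long or carry query parameters.
        return len(uri) <= max_length and not urlparse(uri).query

    print(uri_is_concise("http://airlines.org/A123"))            # True
    print(uri_is_concise("http://airlines.org/flight?id=A123"))  # False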

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
 | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
 | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
 | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
 | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
 | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
 | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
 | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
 | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
 | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
 | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
 | misinterpretation of missing values | detecting use of blank nodes [27] | O
 | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
 | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
 | provision of the data in various languages | checking whether data is available in different languages [14] | O
 | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding Accept header to an HTTP request [14] | O
 | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective or Objective.

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is then not a problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF^15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb^16, SchemaCache^17 and Swoogle^18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies^19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can still be applied to other domains.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of metadata about a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
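The labelling metric from Table 6, for example, amounts to the ratio sketched below (using rdflib; treating all subjects as the set of entities is our simplification):

    from rdflib import Graph
    from rdflib.namespace import RDFS

    def labelling_ratio(g: Graph) -> float:
        # Entities with an rdfs:label or rdfs:comment,
        # relative to all entities (here: all subjects).
        entities = set(g.subjects())
        if not entities:
            return 0.0
        labelled = set(g.subjects(RDFS.label, None)) | set(g.subjects(RDFS.comment, None))
        return len(labelled & entities) / len(entities)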

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve search performance in that case. For instance, when a user looks for a flight to US (the label), then the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as by the provision of different access methods.
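Content negotiation, for instance, can be probed by requesting the same resource with different Accept headers, as in the following sketch (the listed MIME types are common RDF serializations, not an exhaustive set, and the URI is supplied by the caller):

    import requests

    FORMATS = ["application/rdf+xml", "text/turtle", "text/html"]

    def supported_serializations(uri):
        # Serializations the server actually delivers when they
        # are requested via the Accept header.
        supported = []
        for mime in FORMATS:
            r = requests.get(uri, headers={"Accept": mime}, timeout=10)
            if r.status_code == 200 and mime in r.headers.get("Content-Type", ""):
                supported.append(mime)
        return supported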

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is capable of understanding it.

Provision of Linked Data in different languages, using language tags for literal values, contributes towards the versatility of a dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable it is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", by providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
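Using the formula for the currency of statements from Table 7, the score can be computed from these timestamps as sketched below (our own Python rendering; the timestamps are assumed to be obtainable, e.g., from provenance metadata):

    from datetime import datetime

    def currency(last_modified, start, now=None):
        # 1 - [(current time - last modified time) /
        #      (current time - start time)], as in [50];
        # values near 1 mean recently updated, near 0 mean stale.
        now = now or datetime.utcnow()
        age = (now - last_modified).total_seconds()
        history = (now - start).total_seconds()
        return 1.0 - age / history if history > 0 else 0.0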

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives updates via email. The currency of the price information is measured with respect to the last update of the price information.



Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time^20)] [50] | O
 | currency of data source | (last modified time of the semantic web source < last modified time of the original source)^21 [15] | O
 | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with the source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
 | stating the recency and frequency of data validation | detection of whether the published data has been validated no more than a month ago [14] | O
 | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
 | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective.

20 The start time identifies the time when the first data was published in the LOD.
21 I.e., older than the last modification time.

The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be outdated. To determine whether the information is outdated, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed relative to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and excludes outdated data.
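A minimal sketch of this combination (our own formulation): information is considered timely as long as the time elapsed since its last update does not exceed its validity period, i.e., as long as it is still current given its volatility.

    def is_timely(seconds_since_update, validity_period):
        # Combines currency (elapsed time since the last change)
        # with volatility (length of time the data remains valid).
        return seconds_since_update <= validity_period

    # Flight prices valid for 60 seconds, last updated 90 s ago:
    print(is_timely(90, 60))  # False -> outdated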

Example. A user wants to book a flight from Milan to Boston, and our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time: the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile the data is: data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on its dimensions (Section 4.1), their respective metrics (Section 4.2), the types of data (Section 4.3), the level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions defines data quality as a whole, nor on the exact meaning of each dimension, a problem that also occurs in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of the data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to this initial set of dimensions, including the problem of alternative data representations and their handling, i.e., dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherent interlinking between datasets, or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while mapping each of them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. The table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8], (b) a set of approaches covering more than five dimensions [6,14,27,42], and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the proportion of desired outcomes to total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset
    ----------------------------------------------------
    total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. For the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is never associated with more than one dimension. Additionally, there are several ways of measuring a single dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely the accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, by misreported content types, or by the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured: schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but the scope and level of detail can only be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. In this section, we thus distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on these different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data, listed below.


Table 8: Consideration of data quality dimensions in each of the included approaches. [Table placeholder: the original matrix marks, for each of the 21 included approaches (Gil et al. 2002; Golbeck et al. 2003; Mostafavi et al. 2004; Golbeck et al. 2006; Gil et al. 2007; Lei et al. 2007; Hartig 2008; Bizer et al. 2009; Böhm et al. 2010; Chen et al. 2010; Flemming et al. 2010; Hogan et al. 2010; Shekarpour et al. 2010; Fürber et al. 2011; Gamble et al. 2011; Jacobi et al. 2011; Bonatti et al. 2011; Guéret et al. 2011; Hogan et al. 2012; Mendes et al. 2012; Rula et al. 2012), which of the 26 dimensions it considers: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking. The individual check marks are not recoverable from this rendering.]

– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, it can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of the trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

There are, however, no approaches that perform assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated, or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and the respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives users the flexibility of tweaking the tool to match their needs, it requires a lot of time as well as an understanding of the required XML file structure and specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool^22 [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or the user has to assign weights to each of the pre-defined data quality metrics. The RDF Validator^23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or to handle incomplete information. The tRDF^24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface^25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9: Qualitative evaluation of frameworks.

Paper | Application (G: general / S: specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al. 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al. 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | - | - | - | - | -
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al. 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al. 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al. 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al. 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | -



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, displaying the result of assessing the quality of LinkedCT with a score of 15 out of 100.
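The aggregation itself amounts to a weighted average of the metric values, scaled to the 0–100 range. The following Python sketch illustrates the idea; the metric names, values and weights are hypothetical and not taken from the tool.

# Minimal sketch of a Flemming-style weighted quality score.
# Metric values are normalised to [0, 1]; weights are user-assigned.
metrics = {
    "stable_uris":    {"value": 0.9, "weight": 1.0},
    "mailing_list":   {"value": 0.0, "weight": 1.0},
    "obsolete_terms": {"value": 0.4, "weight": 2.0},
}

def total_score(metrics):
    """Weighted average of metric values, scaled to 0-100."""
    weighted = sum(m["value"] * m["weight"] for m in metrics.values())
    total_weight = sum(m["weight"] for m in metrics.values())
    return 100 * weighted / total_weight

print(total_score(metrics))  # 42.5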

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights to each metric and the calculation of the score are straightforward and easy to adjust. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights to each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, which is used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these sources based on the quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value between 0 and 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
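To illustrate the kind of computation such a scoring function performs, the following Python sketch mimics the behaviour configured above for TimeCloseness: data updated just now scores 1.0, decaying linearly to 0.0 at the edge of the configured time span. This is an illustration of the idea, not Sieve's actual implementation.

from datetime import datetime, timezone

def time_closeness(last_updated, time_span_days=7.0):
    """TimeCloseness-style score: 1.0 for data updated just now,
    decaying linearly to 0.0 at the edge of the time span."""
    age = (datetime.now(timezone.utc) - last_updated).total_seconds() / 86400
    return max(0.0, 1.0 - age / time_span_days)

score = time_closeness(datetime(2012, 11, 1, tzinfo=timezone.utc))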

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that the values are given meaning. Reconciliation with Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.
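For illustration, the following Python sketch reproduces, outside the tool, the kinds of column checks LODGRefine automates: duplicate detection, trailing white space and blank cells. The sample values are invented.

from collections import Counter

# A column of raw values, as it might be imported from a CSV file.
rows = ["Lufthansa", "Lufthansa ", "", "British Airways", "Lufthansa"]

normalized = [v.strip() for v in rows]
duplicates = [v for v, n in Counter(normalized).items() if v and n > 1]
trailing_ws = [v for v in rows if v != v.strip()]   # e.g. "Lufthansa "
blank_cells = [i for i, v in enumerate(normalized) if not v]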

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published over a span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved: some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is a considerable amount of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in assessing the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment such that it allows a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] Batini, C. and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] Bizer, C. and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M. and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] Bonatti, P. A., Hogan, A., Polleres, A. and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] Bovee, M., Srivastava, R. P. and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] Brickley, D. and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] Chen, P. and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] Demter, J., Auer, S., Martin, M. and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] Fürber, C. and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] Gamble, M. and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] Gil, Y. and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] Gil, Y. and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).

[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] Golbeck, J., Parsia, B. and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] Guéret, C., Groth, P., Stadler, C. and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] Heath, T. and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] Hogan, A., Harth, A., Passant, A., Decker, S. and Polleres, A. Weaving the pedantic web. In LDOW (2010).

[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A. and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B. and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] Jacobi, I., Kagal, L. and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] Kifer, M. and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] Kitchenham, B. Procedures for performing systematic reviews.

[34] Knight, S. and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] Lee, Y. W., Strong, D. M., Kahn, B. K. and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] Lehmann, J., Gerber, D., Morsey, M. and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] Lei, Y., Nikolov, A., Uren, V. and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] Lei, Y., Uren, V. and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] Pipino, L., Wang, R., Kopcso, D. and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] Maynard, D., Peters, W. and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] Meilicke, C. and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] Mendes, P., Mühleisen, H. and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] Miller, P., Styles, R. and Heath, T. Open data commons, a license for open data. In LDOW (2008).

[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G. and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] Mostafavi, M., Edwards, G. and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] Pipino, L., Lee, Y. W. and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S. and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] Rula, A., Palmonari, M. and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] Scannapieco, M. and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P. and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] Shekarpour, S. and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] Suchanek, F. M., Gross-Amblard, D. and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).

[55] Wand, Y. and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] Wang, R. Y. and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



ria [14]. An assessment score is computed from these indicators using a scoring function.

In [6], the data quality dimensions are classified into three categories according to the type of information that is used as a quality indicator: (1) Content Based, the information content itself; (2) Context Based, information about the context in which information was claimed; (3) Rating Based, based on ratings about the data itself or the information provider.

However, we identify further dimensions (defined in Section 3.2) and also further categories to classify the dimensions, namely (1) Contextual, (2) Trust, (3) Intrinsic, (4) Accessibility, (5) Representational and (6) Dataset Dynamicity dimensions, as depicted in Figure 2.

Data Quality Assessment Method. A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumer's need in a specific use case [6]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.

3.2. Linked Data quality dimensions

After analyzing the 21 selected approaches in detail, we identified a core set of 26 different data quality dimensions that can be applied to assess the quality of Linked Data. In this section, we formalize and adapt the definition of each dimension to the Linked Data context. The metrics associated with each dimension are also identified and reported. Additionally, we group the dimensions following a classification introduced in [57], which is further modified and extended as:

– Contextual dimensions
– Trust dimensions
– Intrinsic dimensions
– Accessibility dimensions
– Representational dimensions
– Dataset dynamicity

These groups are not strictly disjoint but can partially overlap. Also, the dimensions are not independent from each other; correlations exist among dimensions in the same group or between groups. Figure 2 shows the classification of the dimensions into the above-mentioned groups as well as the inter- and intra-relations between them.

Use case scenario. Since data quality is described as fitness for use, we introduce a specific use case that will allow us to illustrate the importance of each dimension with the help of an example. Our use case is about an intelligent flight search engine, which relies on acquiring (aggregating) data from several datasets. It obtains information about airports and airlines from an airline dataset (e.g. OurAirports2, OpenFlights3). Information about the location of countries, cities and particular addresses is obtained from a spatial dataset (e.g. LinkedGeoData4). Additionally, aggregators in RDF pull all the information related to airlines from different data sources (e.g. Expedia5, Tripadvisor6, Skyscanner7, etc.), which allows a user to query the integrated dataset for a flight from any start and end destination for any time period. We will use this scenario throughout this section as an example of how data quality influences fitness for use.

3.2.1. Contextual dimensions

Contextual dimensions are those that highly depend on the context of the task at hand as well as on the subjective preferences of the data consumer. Three dimensions are part of this group: completeness, amount-of-data and relevancy. They are listed, along with a comprehensive list of their corresponding metrics, in Table 2. The reference for each metric is provided in the table.

3.2.1.1. Completeness. Completeness is defined as "the degree to which information is not missing" in [5]. This dimension is further classified in [15] into the following categories: (a) Schema completeness, which is the degree to which entities and attributes are not missing in a schema; (b) Column completeness, which is a function of the missing values for a specific property or column; and (c) Population completeness, which refers to the ratio of entities represented in an information system to the complete population. In [42], completeness is distinguished on the schema level and the data level: on the schema level, a dataset is complete if it contains all of the attributes needed for a given task; on the data (i.e. instance) level, a dataset is complete if it contains all of the necessary objects for a given task. As can be observed, the authors in [5] give a general definition, whereas the authors in [15] provide a set of sub-categories for completeness. On the other

2 http://thedatahub.org/dataset/ourairports
3 http://thedatahub.org/dataset/open-flights
4 linkedgeodata.org
5 http://www.expedia.com
6 http://www.tripadvisor.com
7 http://www.skyscanner.de


Table 2. Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of the no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using a vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S

hand, in [42] the two types defined, i.e. schema- and data-level completeness, can be mapped to the categories (a) Schema completeness and (c) Population completeness provided in [15].

Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, which can thus be called ontology completeness; (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset and comparing it with the original dataset (or a gold standard dataset). It should be noted that in this case users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
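As an illustration, the following Python sketch (using the rdflib library; the file names are placeholders) computes two such ratios against a gold standard: schema completeness as the fraction of gold-standard properties that also occur in the assessed dataset, and interlinking completeness as the fraction of subjects carrying at least one owl:sameAs link.

from rdflib import Graph
from rdflib.namespace import OWL

gold, data = Graph(), Graph()
gold.parse("gold_standard.ttl", format="turtle")
data.parse("dataset.ttl", format="turtle")

# Schema completeness: share of gold-standard properties used in the dataset.
gold_props = set(gold.predicates())
schema_completeness = len(gold_props & set(data.predicates())) / len(gold_props)

# Interlinking completeness: share of subjects with at least one sameAs link.
subjects = set(data.subjects())
linked = set(data.subjects(OWL.sameAs, None))
interlinking_completeness = len(linked & subjects) / len(subjects)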

Example. In our use case, the flight search engine should contain complete information, so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, a low price is of high importance. Looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases when there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data. The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14] and should be enough to approximate


the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the problem mentioned there and the respective metrics and mapped them to this dimension, since it fits the definition best.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
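A minimal sketch of such counts with rdflib (the file name is a placeholder):

from collections import Counter
from rdflib import Graph
from rdflib.namespace import RDF

g = Graph()
g.parse("dataset.ttl", format="turtle")

total_triples = len(g)                                     # coarse-grained volume
instances_per_class = Counter(g.objects(None, RDF.type))   # level of detail per class
print(total_triples, instances_per_class.most_common(5))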

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data covering all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large number of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy. In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the user's query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes to assess whether the content is relevant for a particular task. Additionally, the retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.
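The term-counting variant of this metric can be sketched in Python as follows; the attribute values and the query are invented for illustration, and a vector space model would additionally weight the terms.

meta_attributes = {
    "label": "Berlin Tegel Airport",
    "description": "International airport serving Berlin, Germany",
}

def relevance(query, attributes):
    """Count occurrences of query terms in meta-information attributes."""
    terms = query.lower().split()
    text = " ".join(attributes.values()).lower()
    return sum(text.count(t) for t in terms)

print(relevance("airport berlin", meta_attributes))  # 4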

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees, etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through them efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions. For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, irrelevant data may be provided, which affects the relevancy dimension.

3.2.2. Trust dimensions

Trust dimensions are those that focus on the trustworthiness of the dataset. Five dimensions are part of this group, namely provenance, verifiability, believability, reputation and licensing. They are displayed, along with their respective metrics, in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance. There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities, to enable trust, assess authenticity and allow reproducibility.


Table 3. Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1 = absolute belief, -1 = absolute disbelief and 0 = lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows on a scale of 1–9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between two entities through a path, and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the source's trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself: on the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider; on the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data helps determine not only the trustworthiness but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability. Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is used as the means a consumer is provided with to examine the data for correctness. Without such means, the assurance of the correctness of the data would have to come from the consumer's trust in that source. It can be observed that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept the provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.
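The following Python sketch illustrates the first half of this mechanism: deriving a deterministic digest of a graph (canonical blank-node labels via rdflib, sorted triples) which a source could then sign with its private key. It sketches the idea behind [11] rather than Carroll's actual algorithm; the file name is a placeholder.

import hashlib
from rdflib import Graph
from rdflib.compare import to_canonical_graph

g = Graph()
g.parse("dataset.ttl", format="turtle")

# Canonicalise blank-node labels, then sort the triples so the digest is
# independent of serialisation order.
nt = to_canonical_graph(g).serialize(format="nt")
if isinstance(nt, bytes):  # older rdflib versions return bytes
    nt = nt.decode("utf-8")
canonical = "\n".join(sorted(line for line in nt.splitlines() if line.strip()))
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()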

3.2.2.3. Reputation. The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose tracking reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example, a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation; they can be classified into manual and (semi-)automated approaches. The manual approach works via a survey in a community or by questioning other members who can help to determine the reputation of a source or of the person who published a dataset. The (semi-)automated approach can be performed through the use of external links or page ranks.

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case,

8 http://open-biomed.sourceforge.net/opmv/ns.html

9 http://www.w3.org/TR/prov-o/


the search engine can decide to trust only the source with the higher reputation.

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.

3.2.2.4. Believability. In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known airlines such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international airlines. On the other hand, if the user retrieves information about a previously unknown airline, she can decide whether to believe this information by checking whether the airline is well known or contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the provided information can be trusted if it is well known, or the decision about its credibility and believability can be made only by looking at the data source. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a

10 http://www.w3.org/DesignIssues/UI.html

button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

3.2.2.5. Licensing. "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked through the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data, and to produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.
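Checking for a machine-readable license indication can be as simple as looking for dcterms:license triples in the VoID description, as in the following rdflib sketch (the file name is a placeholder).

from rdflib import Graph
from rdflib.namespace import DCTERMS

void_desc = Graph()
void_desc.parse("void.ttl", format="turtle")

# Any dcterms:license triple counts as a machine-readable indication.
licenses = list(void_desc.objects(None, DCTERMS.license))
print("license found:" if licenses else "no machine-readable license", licenses)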

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void


Relations between dimensions. Verifiability is related to the believability dimension but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks the information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, this implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and assign syntactic accuracy to the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real-world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates the correctness of a value to the actual real-world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, by the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or an open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
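As an illustration of the functional-dependency check, the following rdflib sketch flags subjects stating more than one value for a property that should be single-valued; the property URI and the file name are placeholders.

from collections import defaultdict
from rdflib import Graph, URIRef

g = Graph()
g.parse("dataset.ttl", format="turtle")

# Hypothetical functional property: an airport has exactly one IATA code.
functional = URIRef("http://example.org/ontology/iataCode")

values = defaultdict(set)
for s, o in g.subject_objects(functional):
    values[s].add(o)

violations = {s: objs for s, objs in values.items() if len(objs) > 1}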

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high-quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured qualitatively, but only indirectly, by checking the authenticity of the source responsible for the information, whether the dataset is neutral, or whether the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of bias, where the review is falsified due to the provider's preferences or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information, from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information. However, objectivity


Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]) or inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation / representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality and sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real-world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness & validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O


can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
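A basic syntactic-validity check can be sketched by attempting to parse the document with an RDF parser, as below with rdflib; the W3C RDF Validator performs the same task as a service, and the file name is a placeholder.

from rdflib import Graph

g = Graph()
try:
    g.parse("document.rdf", format="xml")
    print("syntactically valid RDF/XML,", len(g), "triples")
except Exception as err:  # rdflib surfaces parser errors as exceptions
    print("invalid document:", err)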

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user receives no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process the data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as such data can neither be processed nor can a consumer perform reasoning on it.

3.2.3.4 Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through such typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality, and description richness through sameAs links.

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.
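The network measures above can be sketched in a few lines, assuming the Python libraries rdflib and networkx are available (the input file name is hypothetical): the owl:sameAs links are loaded into an undirected graph on which degree and clustering statistics are computed.

import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

rdf = Graph()
rdf.parse("airline-dataset.nt", format="nt")

# Build an undirected network whose edges are owl:sameAs links.
G = nx.Graph()
for s, _, o in rdf.triples((None, OWL.sameAs, None)):
    G.add_edge(str(s), str(o))

if G.number_of_nodes() > 0:
    # Interlinking degree: average number of sameAs links per resource.
    avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
    # Clustering coefficient of the sameAs network.
    print(avg_degree, nx.average_clustering(G))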

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links, but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e., incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5 Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


[Figure 3 appears here; only its labels survive extraction: RDF, RDF-S, RDFa, OWL and OWL2 (with the profiles EL, QL, RL, DL-Lite and Full), DAML, SWRL, domain-specific rules, the serialisation formats RDF/XML, N3, NT and Turtle, and SPARQL with its protocol and JSON query results.]

Fig. 3. Different components related to the RDF representation of data.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed which come with certain inference and reasoning strategies for revealing implicit knowledge, which might then render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent with respect to the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
B123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the

ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in data representation, which can be detected by using inference and reasoning.
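The contradiction in this example can be detected with a single SPARQL query; the sketch below assumes rdflib and a hypothetical property ex:destination that the ontology declares functional.

from rdflib import Graph

g = Graph()
g.parse("flights.nt", format="nt")

# Flights violating the one-destination-per-flight constraint.
VIOLATIONS = """
PREFIX ex: <http://example.org/flight/>
SELECT ?flight (COUNT(DISTINCT ?dest) AS ?n)
WHERE { ?flight ex:destination ?dest . }
GROUP BY ?flight
HAVING (COUNT(DISTINCT ?dest) > 1)
"""
for flight, n in g.query(VIOLATIONS):
    print(f"Inconsistent: {flight} has {n} destinations")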

3.2.3.6 Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' given in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.
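Both ratios from Definition 14 can be approximated programmatically. The sketch below assumes rdflib and networkx, uses a hypothetical target schema size, and treats owl:sameAs links as markers of redundant representations (a simplification, since duplicate detection in general requires entity resolution).

import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("fused-flights.nt", format="nt")

# Intensional conciseness: unique attributes vs. attributes in the
# target schema (the schema size of 50 is hypothetical).
TARGET_SCHEMA_SIZE = 50
intensional = len(set(g.predicates())) / TARGET_SCHEMA_SIZE

# Extensional conciseness: unique objects vs. all object representations,
# counting sameAs-connected resources as one unique object.
eq = nx.Graph()
eq.add_nodes_from(set(g.subjects()))
for s, _, o in g.triples((None, OWL.sameAs, None)):
    eq.add_edge(s, o)
extensional = (nx.number_connected_components(eq) / eq.number_of_nodes()
               if eq.number_of_nodes() else 0.0)
print(intensional, extensional)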

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this results in lower extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the dimensions accuracy and objectivity focus on the correctness of representing real-world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g., height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4 Accessibility

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, namely availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1 Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.
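A hedged sketch of two of the availability metrics from Table 5, using the Python requests library (the endpoint and resource URIs are hypothetical), checks SPARQL endpoint reachability and URI dereferencability:

import requests

def sparql_endpoint_up(endpoint):
    # Send a trivial ASK query and check for an HTTP 200 response.
    r = requests.get(endpoint, params={"query": "ASK { ?s ?p ?o }"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def dereferencable(uri):
    # A URI is dereferencable if requesting RDF does not yield a
    # 4xx/5xx response code.
    r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                     timeout=10, allow_redirects=True)
    return r.status_code == 200

print(sparql_endpoint_up("http://example.org/sparql"))
print(dereferencable("http://example.org/resource/Paris"))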

3.2.4.2 Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we can notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by it; in the latter, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
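The latency and scalability metrics from Table 5 can be sketched as follows, assuming the requests library and a hypothetical endpoint; for simplicity the ten requests are issued sequentially, whereas a real assessment would issue them concurrently.

import time
import requests

ENDPOINT = "http://example.org/sparql"  # hypothetical
PARAMS = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 10"}

def latency():
    start = time.monotonic()
    requests.get(ENDPOINT, params=PARAMS, timeout=30)
    return time.monotonic() - start

one = latency()
ten_avg = sum(latency() for _ in range(10)) / 10
print("acceptable latency:", one <= 1.0)  # one-second threshold [14]
print("scalable:", ten_avg <= one)        # average of ten requests vs. one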

Example. In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time


Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
 | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
 | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
 | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
 | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or of no redirection using the status code 303 See Other, or of no code 200 OK [14,26] | O
 | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
 | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
 | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
 | high throughput | no. of answered HTTP requests per second [14] | O
 | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
 | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
 | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5: Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type - Subjective or Objective.

required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) usage of hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3 Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user that wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or appropriate methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4 Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related with each other as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5 Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability, versatility, as well as interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1 Representational-conciseness. Representational-conciseness is defined only as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, [27] identified the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
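The corresponding metric from Table 6 can be sketched with rdflib; the length threshold below is our assumption, not a value given in the survey.

from rdflib import Graph, URIRef

MAX_LEN = 80  # hypothetical threshold for a "long" URI

g = Graph()
g.parse("dataset.nt", format="nt")

flagged = set()
for triple in g:
    for term in triple:
        # Flag URIs that are overly long or contain query parameters.
        if isinstance(term, URIRef):
            uri = str(term)
            if len(uri) > MAX_LEN or "?" in uri:
                flagged.add(uri)
print(len(flagged), "problematic URIs")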


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
 | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
 | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
 | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
 | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
 | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
 | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
 | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
 | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
 | (interpretability of data, cont.) | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
 | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
 | misinterpretation of missing values | detecting use of blank nodes [27] | O
 | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
 | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
 | provision of the data in various languages | checking whether data is available in different languages [14] | O
 | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
 | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and for download as an RDF dump [14] | O

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective or Objective.

3.2.5.2 Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, which is necessary to handle different formats of data. The exchange of information in different formats will not be a problem for our search engine, since strong links are created between datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3 Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can still be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
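The labelling metric from Table 6 reduces to a simple ratio; a sketch assuming rdflib (the file name is hypothetical):

from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph()
g.parse("dataset.nt", format="nt")

# Ratio of described entities carrying an rdfs:label or rdfs:comment.
entities = set(g.subjects())
labelled = {s for s in entities
            if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}
print(len(labelled) / len(entities) if entities else 0.0)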

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, human-readable labels and descriptions of defined classes, properties and entities to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats, in different languages, as well as via different access methods.

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format, in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding accept header to an HTTP request.
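Content negotiation can be probed as sketched below, assuming the requests library and a hypothetical resource URI: the same resource is requested under several Accept headers, and the formats the server actually serves are recorded.

import requests

FORMATS = ["application/rdf+xml", "text/turtle", "text/html"]

def negotiable_formats(uri):
    served = []
    for fmt in FORMATS:
        r = requests.get(uri, headers={"Accept": fmt}, timeout=10)
        # Count the format only if the server honoured the Accept header.
        if r.status_code == 200 and fmt in r.headers.get("Content-Type", ""):
            served.append(fmt)
    return served

print(negotiable_formats("http://example.org/resource/Paris"))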

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1 Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
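The statement-currency formula from Table 7 can be made concrete with a small worked example (the timestamps are hypothetical and would, in practice, come from temporal metadata attached to the data):

from datetime import datetime

def currency(last_modified, start_time, now):
    # 1 - (now - last modified)/(now - start time); 1.0 means fully current.
    return 1 - (now - last_modified).total_seconds() / \
               (now - start_time).total_seconds()

print(currency(last_modified=datetime(2012, 12, 20, 12, 0),
               start_time=datetime(2012, 1, 1),
               now=datetime(2012, 12, 21, 12, 0)))  # ~0.997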

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a

20 The start time identifies the time when the data was first published in the LOD cloud.
21 That is, older than the last modification time.


Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time^20)] [50] | O
 | currency of data source | (last modified time of the semantic web source < last modified time of the original source)^21 [15] | O
 | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated to a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
 | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
 | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
 | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective.

delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

3.2.6.2 Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed related to a specific data value.

3.2.6.3 Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
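The survey does not give an explicit formula for this combination; one common choice in the data quality literature (an assumption on our part, not the survey's own definition) scores timeliness as max(0, 1 - age/volatility):

def timeliness(age_seconds, volatility_seconds):
    # 1.0 when the value was just updated, 0.0 once it outlives its validity.
    return max(0.0, 1.0 - age_seconds / volatility_seconds)

# Flight price valid for 60 s (volatility), last updated 90 s ago:
print(timeliness(90, 60))  # 0.0, i.e., the price information is outdated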

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may not be necessary or ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of the three specific tools (Section 4.5).

4.1 Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets, or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all data quality dimensions that are relevant for LOD.

4.2 Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
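The vocabulary re-use ratio above can be sketched as follows, assuming rdflib and a hypothetical whitelist of established vocabulary namespaces:

from rdflib import Graph

# Hypothetical whitelist of established vocabularies.
ESTABLISHED = {"http://xmlns.com/foaf/0.1/",
               "http://purl.org/dc/terms/"}

def namespace(uri):
    # Crude namespace split at the last '#' or '/'.
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]

g = Graph()
g.parse("dataset.nt", format="nt")

vocabs = {namespace(str(p)) for p in g.predicates()}
print(len(vocabs & ESTABLISHED) / len(vocabs) if vocabs else 0.0)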

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types, or by the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3 Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8: Consideration of data quality dimensions in each of the included approaches.
[The checkmark matrix of this table could not be recovered from the extracted text; only its row and column labels survive.]
Approaches (rows): Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011.
Dimensions (columns): Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep-conciseness, Rep-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.


– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22], and 4 approaches assess data at triple, graph and dataset level [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.
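The propagation just described amounts to aggregating fine-grained scores upwards; a minimal sketch, with a simple average as the (assumed) aggregation function:

def propagate(scores):
    # Aggregate triple-level ratings into a graph- or dataset-level rating.
    return sum(scores) / len(scores)

triple_scores = [1.0, 0.8, 0.6]  # hypothetical triple-level ratings
print(propagate(triple_scores))  # graph-level rating: 0.8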

4.4 Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resource information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although it gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5 Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1 Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given an option for assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9
Qualitative evaluation of frameworks (Application: G = General, S = Specific; ✓ = applies, – = does not apply)

Paper | App. | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al., 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | –
Flemming, 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al., 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al., 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al., 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al., 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | –



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool displaying the result of assessing the quality of LinkedCT, with a score of 15 out of 100.
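To make the aggregation concrete, the following is a minimal sketch (in Python) of how such a weighted score can be computed; the metric names, values and weights are hypothetical and do not come from the tool itself:

    # Sketch of a Flemming-style weighted quality score: each metric yields
    # a value in [0, 1], the user assigns a weight per metric, and the
    # weighted mean is scaled to 0-100. Metric names/values are made up.

    def weighted_quality_score(values, weights):
        """Return a 0-100 score from per-metric values in [0, 1]."""
        total_weight = sum(weights[m] for m in values)
        if total_weight == 0:
            raise ValueError("at least one metric needs a non-zero weight")
        weighted_sum = sum(values[m] * weights[m] for m in values)
        return 100 * weighted_sum / total_weight

    values = {"stable_uris": 0.4, "dereferenceability": 0.1, "license": 0.0}
    weights = {"stable_uris": 1, "dereferenceability": 1, "license": 1}
    print(round(weighted_quality_score(values, weights)))  # -> 17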

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of the weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de


    <Sieve>
      <QualityAssessment>
        <AssessmentMetric id="sieve:recency">
          <ScoringFunction class="TimeCloseness">
            <Param name="timeSpan" value="7"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
        <AssessmentMetric id="sieve:reputation">
          <ScoringFunction class="ScoredList">
            <Param name="priority"
                   value="http://pt.wikipedia.org http://en.wikipedia.org"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
      </QualityAssessment>
    </Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool
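The scoring functions themselves are simple mappings from provenance metadata to [0, 1]. As an illustration (not Sieve's actual implementation), a TimeCloseness-style function could be sketched in Python as follows, scoring recently updated data close to 1 and data older than the configured timeSpan as 0:

    from datetime import datetime, timedelta

    def time_closeness(last_updated, time_span_days, now=None):
        """Score in [0, 1]: 1.0 for data updated just now, decaying
        linearly to 0.0 for data older than the configured time span."""
        now = now or datetime.utcnow()
        age = now - last_updated
        if age >= timedelta(days=time_span_days):
            return 0.0
        return 1.0 - age / timedelta(days=time_span_days)

    # With timeSpan=7 as in Listing 1, data updated 2 days ago scores ~0.71
    print(time_closeness(datetime.utcnow() - timedelta(days=2), 7))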

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data.

27 http://code.zemanta.com/sparkica


For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that meaning is given to the values. Reconciliation against Freebase28 helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data to and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is much research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] BECKETT, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] BRICKLEY, D., AND GUHA, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).

[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] HAYES, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).

[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] KIFER, M., AND BOLEY, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] KITCHENHAM, B. Procedures for performing systematic reviews. Tech. rep., Keele University, 2004.

[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).

[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND THE PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] SCHOBER, D., BARRY, S., LEWIS, E. S., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, F. C., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.


Table 2
Comprehensive list of data quality metrics of the contextual dimensions, how they can be measured, and their type – Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Completeness | degree to which classes and properties are not missing | detection of the degree to which the classes and properties of an ontology are represented [5,15,42] | S
Completeness | degree to which values for a property are not missing | detection of the no. of missing values for a specific property [5,15] | O
Completeness | degree to which real-world objects are not missing | detection of the degree to which all the real-world objects are represented [5,15,26,42] | O
Completeness | degree to which interlinks are not missing | detection of the degree to which instances in the dataset are interlinked [22] | O
Amount-of-data | appropriate volume of data for a particular task | no. of triples, instances per class, internal and external links in a dataset [14,12] | O
Amount-of-data | coverage | scope and level of detail [14] | S
Relevancy | usage of meta-information attributes | counting the occurrence of relevant terms within these attributes, or using the vector space model and assigning higher weight to terms that appear within the meta-information attributes [5] | S
Relevancy | retrieval of relevant documents | sorting documents according to their relevancy for a given query [5] | S

On the other hand, in [42] the two types defined, i.e. the schema and data level completeness, can be mapped to the two categories (a) Schema completeness and (c) Population completeness provided in [15].

Definition 1 (Completeness). Completeness refers to the degree to which all required information is present in a particular dataset. In general, completeness is the extent to which data is of sufficient depth, breadth and scope for the task at hand. In terms of Linked Data, we classify completeness as follows: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, thus also called ontology completeness; (b) Property completeness, a measure of the missing values for a specific property; (c) Population completeness, the percentage of all real-world objects of a particular type that are represented in the datasets; and (d) Interlinking completeness, which has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked.

Metrics. Completeness can be measured by detecting the number of classes, properties, values and interlinks that are present in the dataset compared to the original dataset (or a gold standard dataset). It should be noted that in this case users should assume a closed-world assumption, where a gold standard dataset is available and can be used to compare against.
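As an illustration, property completeness against a gold standard can be sketched with rdflib as below; the file names and the property URI are placeholders:

    from rdflib import Graph, URIRef

    def property_completeness(dataset, gold, prop):
        """Ratio of subjects carrying `prop` in the dataset to subjects
        carrying it in the gold standard (closed-world assumption)."""
        have = set(dataset.subjects(prop, None))
        expected = set(gold.subjects(prop, None))
        return len(have & expected) / len(expected) if expected else 1.0

    dataset = Graph().parse("flights.ttl")        # placeholder file names
    gold = Graph().parse("flights-gold.ttl")
    prop = URIRef("http://example.org/departureTime")  # hypothetical property
    print(property_completeness(dataset, gold, prop))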

Example. In our use case, the flight search engine should contain complete information, so that it includes all offers for flights (population completeness). For example, a user residing in Germany wants to visit her friend in America. Since the user is a student, a low price is of high importance. Looking for flights individually on the airlines' websites shows her flights with very expensive fares. However, using our flight search engine, she finds all offers, even the less expensive ones, and is also able to compare fares from different airlines and choose the most affordable one. The more complete the information for flights is, including cheap flights, the more visitors the site attracts. Moreover, sufficient interlinks between the datasets will allow her to query the integrated dataset so as to find an optimal route from the start to the end destination (interlinking completeness) in cases where there is no direct flight.

Particularly in Linked Data, completeness is of prime importance when integrating datasets from several sources, where one of the goals is to increase completeness. That is, the more sources are queried, the more complete the results will be. The completeness of interlinks between datasets is also important, so that one can retrieve all the relevant facts about a resource when querying the integrated data sources. However, measuring the completeness of a dataset usually requires the presence of a gold standard or the original data source to compare with.

3.2.1.2. Amount-of-data

The amount-of-data dimension can be defined as "the extent to which the volume of data is appropriate for the task at hand" [5]. "The amount of data provided by a data source influences its usability" [14] and should be enough to approximate the true scenario precisely [12].


While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
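Such counts can be gathered with simple SPARQL aggregate queries; the following rdflib sketch (the dataset file name is a placeholder) counts triples and instances per class:

    from rdflib import Graph

    g = Graph().parse("dataset.ttl")  # placeholder dataset

    # Number of triples: the most coarse-grained amount-of-data measure
    print(f"{len(g)} triples")

    # Instances per class
    query = """
    SELECT ?class (COUNT(?s) AS ?instances)
    WHERE { ?s a ?class }
    GROUP BY ?class
    """
    for row in g.query(query):
        print(row["class"], row["instances"])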

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data covering all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large number of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth as well as sufficient scope (number of entities) and detail (number of properties applied) in a given dataset.

3.2.1.3. Relevancy

In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, the retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees, etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through it efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions: For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevance dimension.

3.2.2. Trust dimensions

Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance

There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities in order to enable trust, assess authenticity and allow reproducibility.


Table 3
Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type – Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief and 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows on a scale of 1 - 9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between 2 entities through a path, and (2) an aggregation algorithm, based on a weighting mechanism, for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the sources' trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.

Example. Our flight search engine constitutes information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability

Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is used as the means a consumer is provided with to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept the provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.

3.2.2.3. Reputation

The authors in [17] associate the reputation of an entity either with the result of direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual or (semi-)automated approaches. The manual approach is via a survey in a community or by questioning other members who can help to determine the reputation of a source or of the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.
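For illustration, the link structure between datasets can be treated as a directed graph and PageRank used as a simple (semi-)automated reputation score; the edges below are hypothetical, and the networkx library is assumed to be available:

    import networkx as nx

    # Hypothetical link structure: an edge A -> B means dataset A links to B
    links = [
        ("http://airlineA.example.org", "http://geo.example.org"),
        ("http://airlineB.example.org", "http://geo.example.org"),
        ("http://geo.example.org", "http://airlineA.example.org"),
    ]
    g = nx.DiGraph(links)

    # PageRank yields a score per dataset that can serve as its reputation
    for dataset, score in sorted(nx.pagerank(g).items(), key=lambda x: -x[1]):
        print(f"{score:.3f}  {dataset}")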

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources report conflicting prices (or times) for a particular flight number. In that case, the search engine can decide to trust only the source with the higher reputation.

8 http://open-biomed.sourceforge.net/opmv/ns.html

9 http://www.w3.org/TR/prov-o



Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are the trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.

3.2.2.4. Believability

In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies, such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or only by looking at the data source can the decision on its credibility and believability be determined. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

10 http://www.w3.org/DesignIssues/UI.html


3.2.2.5. Licensing

"In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset clearly indicating the permissions of data re-use.
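A machine-readable license statement can be detected, for instance, by looking for dcterms:license triples in the VoID description; a minimal rdflib sketch follows (the file name is a placeholder):

    from rdflib import Graph
    from rdflib.namespace import DCTERMS

    void = Graph().parse("void.ttl")  # placeholder VoID description

    # Machine-readable indication of a license: any dcterms:license triple
    licenses = list(void.objects(None, DCTERMS.license))
    if licenses:
        print("License(s) found:", *licenses)
    else:
        print("no machine-readable license statement detected")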

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl
12 http://vocab.deri.ie/void


Relations between dimensions: Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks an information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions with their respective metrics which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness. The reference for each metric is provided in the table.

3.2.3.1. Accuracy

In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].
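A minimal sketch of a deviations-based outlier test is given below: values more than two standard deviations from the mean are flagged; the fares are made-up figures:

    from statistics import mean, stdev

    def deviation_outliers(values, k=2.0):
        """Flag values more than k standard deviations from the mean
        (a simple deviations-based outlier test)."""
        mu, sigma = mean(values), stdev(values)
        return [v for v in values if abs(v - mu) > k * sigma]

    # Hypothetical fares for one route; 9999 is likely a data entry error
    fares = [420, 435, 410, 450, 9999, 445, 430, 425, 440, 415]
    print(deviation_outliers(fares))  # -> [9999]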

3.2.3.2. Objectivity

In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured qualitatively, but only indirectly, by checking the authenticity of the source responsible for the information, whether the dataset is neutral, or whether the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preferences or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information, from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.


Table 4
Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type – Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality, sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities, stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O


However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents

"Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse an RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
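Syntactic validity can be approximated programmatically by attempting to parse the document and reporting the first error; a sketch with rdflib (file name and format are placeholders):

    from rdflib import Graph

    def is_syntactically_valid(path, fmt="xml"):
        """Try to parse an RDF document; report the first syntax error."""
        try:
            Graph().parse(path, format=fmt)
            return True
        except Exception as e:  # rdflib parsers raise format-specific errors
            print(f"{path}: {e}")
            return False

    print(is_syntactically_valid("flights.rdf"))  # placeholder document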

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user receives no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris instead of using geo:long. This causes the query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as such data can neither be processed nor can a consumer perform reasoning on it.

3.2.3.4. Interlinking

When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
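As one example of such a measure, the interlinking degree can be approximated as the fraction of instances that take part in at least one owl:sameAs link; a sketch with rdflib (the dataset file is a placeholder):

    from rdflib import Graph
    from rdflib.namespace import OWL, RDF

    g = Graph().parse("dataset.ttl")  # placeholder dataset

    instances = set(g.subjects(RDF.type, None))
    linked = set()
    for s, _, o in g.triples((None, OWL.sameAs, None)):
        linked.update({s, o} & instances)

    # Interlinking degree: fraction of instances with at least one sameAs link
    degree = len(linked) / len(instances) if instances else 0.0
    print(f"interlinking degree: {degree:.2f}")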

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links, but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency

Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26] consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data (serialisation formats such as RDF/XML, N3, N-Triples and Turtle; schema and ontology languages such as RDF-S, OWL and the OWL2 EL, QL and RL profiles; the SPARQL query language, protocol and JSON query results; and rule languages such as SWRL and domain-specific rules).

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example, according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
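
As an illustration of the domain/range check mentioned above, the following sketch (assuming rdflib; the file names are hypothetical, and the schema is assumed to declare rdfs:range for its properties) flags simple range violations:

from rdflib import Graph, RDF
from rdflib.namespace import RDFS

data = Graph().parse("data.nt", format="nt")
schema = Graph().parse("schema.ttl", format="turtle")

for prop, rng in schema.subject_objects(RDFS.range):
    for s, o in data.subject_objects(prop):
        types = set(data.objects(o, RDF.type))
        # a violation: the object is typed, but not with the declared range
        if types and rng not in types:
            print(f"Range violation: {s} {prop} {o} (expected {rng})")

Note that this simple check ignores subclass inference; a complete assessment requires a reasoner supporting the expressivity of the knowledge representation formalism, as discussed above.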

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and the same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
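
A minimal sketch of the extensional variant (assuming rdflib and networkx; the input file is hypothetical): identifiers connected by owl:sameAs are taken to denote the same real-world object, so the ratio of equivalence classes to identifiers approximates extensional conciseness.

from rdflib import Graph
from rdflib.namespace import OWL
import networkx as nx

g = Graph().parse("dataset.nt", format="nt")

net = nx.Graph()
for s in g.subjects():
    net.add_node(str(s))
for s, o in g.subject_objects(OWL.sameAs):
    net.add_edge(str(s), str(o))

n_identifiers = net.number_of_nodes()
n_objects = nx.number_connected_components(net)  # deduplicated objects
extensional_conciseness = n_objects / n_identifiers if n_identifiers else 1.0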

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this leads to low extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real-world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because accuracy does not heavily depend on the consumers' preference, whereas objectivity is influenced by the user's preferences and by the type of information (e.g., height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, namely availability, performance, security and response-time, displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information, rather than to the method of accessing the information as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, the SPARQL endpoints or the RDF dumps, and also by the dereferencability of the URIs.
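
The following sketch (assuming the requests library; the endpoint and resource URIs are hypothetical) illustrates two of these checks: probing a SPARQL endpoint with a trivial query, and dereferencing a URI to detect 4xx/5xx error codes:

import requests

def sparql_endpoint_available(endpoint):
    r = requests.get(endpoint, params={"query": "ASK { ?s ?p ?o }"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def uri_dereferencable(uri):
    r = requests.head(uri, headers={"Accept": "application/rdf+xml"},
                      allow_redirects=True, timeout=10)
    return r.status_code < 400  # 4xx/5xx indicate dereferencability issues

print(sparql_endpoint_available("http://example.org/sparql"))
print(uri_dereferencable("http://example.org/resource/A123"))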

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as a 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. In [27], however, performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove expensive to support in performance- or data-intensive environments. Neither of the aforementioned references provides a formal definition of performance: the former gives a general description without explaining what is meant by performance, while the latter describes an issue related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
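
A minimal sketch of the latency and scalability metrics from Table 5 (assuming the requests library; the endpoint is hypothetical):

import time
import requests

ENDPOINT = "http://example.org/sparql"
PARAMS = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"}

def request_time():
    start = time.time()
    requests.get(ENDPOINT, params=PARAMS, timeout=30)
    return time.time() - start

single = request_time()  # latency; [14] uses a one-second threshold
avg_of_ten = sum(request_time() for _ in range(10)) / 10
scalable = avg_of_ten <= single  # time for ten requests divided by ten vs. one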

Example. In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance needs requires caching mechanisms.

Table 5
Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured and their type (Subjective or Objective)

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or of no redirection using the status code 303 See Other, or of no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer an amount of ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O


Latency is the amount of time from issuing a query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, in order to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.
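
As a sketch of the non-repudiation aspect (assuming the Python cryptography package; the key and file names are hypothetical), a publisher could sign an RDF dump with an RSA private key so that consumers can verify its authenticity:

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

with open("private_key.pem", "rb") as f:
    key = serialization.load_pem_private_key(f.read(), password=None)

with open("dataset.nt", "rb") as f:
    dump = f.read()

# detached signature over the dump, published alongside the dataset
signature = key.sign(dump, padding.PKCS1v15(), hashes.SHA256())
with open("dataset.nt.sig", "wb") as f:
    f.write(signature)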


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can serve as a means to keep the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or suitable methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A long response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in the N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, in [27] the use of very long URIs, or of URIs that contain query parameters, is regarded as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
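
The "keeping URIs short" metric from Table 6 can be operationalised as in the following sketch (assuming rdflib; the 80-character threshold and the input file are illustrative assumptions):

from urllib.parse import urlparse
from rdflib import Graph, URIRef

g = Graph().parse("dataset.nt", format="nt")

uris = {term for triple in g for term in triple if isinstance(term, URIRef)}
problematic = [u for u in uris if len(u) > 80 or urlparse(str(u)).query]
short_uri_ratio = 1 - len(problematic) / len(uris) if uris else 1.0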

Table 6
Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type (Subjective or Objective)

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding Accept header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.
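
A minimal sketch of this metric (assuming rdflib; the list of established vocabularies is an illustrative assumption) computes the share of predicate namespaces that come from well-known vocabularies:

from rdflib import Graph

ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def namespace(uri):
    u = str(uri)
    return u[:max(u.rfind("#"), u.rfind("/")) + 1]

g = Graph().parse("dataset.nt", format="nt")

namespaces = {namespace(p) for p in set(g.predicates())}
reuse_ratio = len(namespaces & ESTABLISHED) / len(namespaces) if namespaces else 0.0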

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML while others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats will not be a problem in our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies (although the authors restrict themselves to the needs of the OBO Foundry community, the conventions can still be applied to other domains). Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
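
The labelling metric from Table 6 can be computed directly over a SPARQL endpoint, as in this sketch (assuming the SPARQLWrapper library; the endpoint is hypothetical):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setReturnFormat(JSON)

def count(query):
    sparql.setQuery(query)
    return int(sparql.query().convert()["results"]["bindings"][0]["n"]["value"])

total = count("SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }")
labelled = count("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT (COUNT(DISTINCT ?s) AS ?n)
    WHERE { ?s ?p ?o . ?s rdfs:label|rdfs:comment ?l }
""")
understandability = labelled / total if total else 0.0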

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve the search performance in that case. For instance, when a user looks for a flight to US (the label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity and understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether the information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport

For the first two items, no human-readable label is available; therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is capable of understanding it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify the accepted formats and languages by adding a corresponding Accept header to an HTTP request.
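
The content-negotiation metric from Table 6 can be probed as in the following sketch (assuming the requests library; the resource URI and the list of formats are illustrative assumptions):

import requests

FORMATS = ["application/rdf+xml", "text/turtle",
           "application/n-triples", "text/html"]

def negotiable_formats(uri):
    supported = []
    for fmt in FORMATS:
        r = requests.get(uri, headers={"Accept": fmt}, timeout=10)
        if r.ok and fmt in r.headers.get("Content-Type", ""):
            supported.append(fmt)
    return supported

print(negotiable_formats("http://example.org/resource/A123"))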

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
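
The currency-of-statements metric from Table 7 combines these components as 1 - (current time - last modified time) / (current time - start time). A minimal sketch (with illustrative timestamps):

from datetime import datetime

def currency(last_modified, start, now=None):
    now = now or datetime.utcnow()
    age = (now - last_modified).total_seconds()
    history = (now - start).total_seconds()
    return 1 - age / history

score = currency(last_modified=datetime(2012, 12, 1),
                 start=datetime(2010, 1, 1))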

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that is not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

Table 7
Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type (Subjective or Objective)

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time*)] [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source** [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

* the start time identifies the time when the first data have been published in the LOD cloud
** i.e., older than the last modification time


3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an up-to-date price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed with respect to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
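
One common way of combining the two dimensions, used in the general data quality literature and stated here as an assumption rather than a formula prescribed by the surveyed approaches, is timeliness = max(0, 1 - currency/volatility):

def timeliness(age_seconds, volatility_seconds):
    # age: time since the last update (currency component)
    # volatility: length of time the data remains valid
    return max(0.0, 1.0 - age_seconds / volatility_seconds)

# a flight price updated 30 seconds ago that remains valid for one minute
print(timeliness(age_seconds=30, volatility_seconds=60))  # 0.5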

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may not be necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (comparison perspectives of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), the types of data (Section 4.3), the level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus provided by most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representations and their handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping each of them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. The table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or of URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, by misreported content types or by the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, up to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

[Table 8. Consideration of data quality dimensions in each of the included approaches. For each approach (Bizer et al., 2009; Flemming et al., 2010; Böhm et al., 2010; Chen et al., 2010; Guéret et al., 2011; Hogan et al., 2010; Hogan et al., 2012; Lei et al., 2007; Mendes et al., 2012; Mostafavi et al., 2004; Fürber et al., 2011; Rula et al., 2012; Hartig, 2008; Gamble et al., 2011; Shekarpour et al., 2010; Golbeck et al., 2006; Gil et al., 2002; Golbeck et al., 2003; Gil et al., 2007; Jacobi et al., 2011; Bonatti et al., 2011), the table marks which of the 26 dimensions it considers: provenance, consistency, currency, volatility, timeliness, accuracy, completeness, amount-of-data, availability, understandability, relevancy, reputation, verifiability, interpretability, representational-conciseness, representational-consistency, licensing, performance, objectivity, believability, response-time, security, versatility, validity-of-documents, conciseness and interlinking.]

– RDF triples, which focus on individual triple assessment.

– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].

– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at the triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset level [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can then be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure and specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss) [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/), used in [26], checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and the respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or to handle incomplete information. The tRDF approach [23] (http://trdf.sourceforge.net) provides a framework to deal with the trustworthiness of RDF data, with the trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as with the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

[Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.]

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php) where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields, one showing the assigned weight and another with the calculated value.

Table 9
Qualitative evaluation of frameworks

Paper | Application (General/Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al., 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al., 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | - | - | - | - | -
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al., 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al., 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al., 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | -

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, presenting the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust. However, this tool has a few drawbacks: (1) the user needs adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de/), used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but rather to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; and (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface.

27 http://code.zemanta.com/sparkica/


By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation against Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; and (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published over a span of ten years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is a considerable amount of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] BECKETT, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] BRICKLEY, D., AND GUHA, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).

[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] HAYES, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).

[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming and Applications (2011), pp. 227–241.

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] KIFER, M., AND BOLEY, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] KITCHENHAM, B. Procedures for performing systematic reviews. Tech. rep., Keele University, 2004.

[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).

[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND THE PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] MOSTAFAVI, M. A., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] SCHOBER, D., BARRY, S., LEWIS, E. S., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, F. C., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.


the true scenario precisely [12]. While the authors in [5] provide a formal definition, the author in [14] explains the dimension by mentioning its advantages and metrics. In the case of [12], we analyzed the mentioned problem and the respective metrics and mapped them to this dimension, since it best fits the definition.

Definition 2 (Amount-of-data). Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task.

Metrics. The amount-of-data can be measured in terms of bytes (most coarse-grained), triples, instances and/or links present in the dataset. This amount should represent an appropriate volume of data for a particular task, that is, appropriate scope and level of detail.
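To make these counts concrete, the following sketch (in Python, using the rdflib library) computes triple, instance and link counts over a local dump; the file name is a placeholder, and taking owl:sameAs as the linking predicate is an assumption made only for this example:

from rdflib import Graph
from rdflib.namespace import RDF, OWL

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical local RDF dump

triples = len(g)                                  # total number of triples
instances = len(set(g.subjects(RDF.type, None)))  # distinct typed instances
links = sum(1 for _ in g.triples((None, OWL.sameAs, None)))  # owl:sameAs links

print(triples, instances, links)

Whether these counts constitute an appropriate amount of data can, of course, only be judged against the task at hand.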

Example. In our use case, the flight search engine acquires a sufficient amount of data so as to cover all, even small, airports. In addition, the data also covers alternative means of transportation. This helps to provide the user with better travel plans, which include smaller cities (airports). For example, when a user wants to travel from Connecticut to Santa Barbara, she might not find direct or indirect flights by searching individual flight websites. But using our example search engine, she is suggested convenient flight connections between the two destinations, because it contains a large amount of data so as to cover all the airports. She is also offered convenient combinations of planes, trains and buses. The provision of such information also necessitates the presence of a large amount of internal as well as external links between the datasets, so as to provide a fine-grained search for flights between specific places.

In essence, this dimension conveys the importance of not having too much unnecessary information, which might overwhelm the data consumer or reduce query performance. An appropriate volume of data, in terms of quantity and coverage, should be a main aim of a dataset provider. In the Linked Data community, there is often a focus on large amounts of data being available. However, a small amount of data appropriate for a particular task does not violate this definition. The aim should be to have sufficient breadth and depth, as well as sufficient scope (number of entities) and detail (number of properties applied), in a given dataset.

3.2.1.3. Relevancy. In [5], relevancy is explained as "the extent to which information is applicable and helpful for the task at hand".

Definition 3 (Relevancy). Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users' query.

Metrics. Relevancy is highly context dependent and can be measured by using meta-information attributes for assessing whether the content is relevant for a particular task. Additionally, retrieval of relevant documents can be performed using a combination of hyperlink analysis and information retrieval methods.

Example. When a user is looking for flights between any two cities, only relevant information, i.e. start and end times, duration and cost per person, should be provided. If a lot of irrelevant data is included in the spatial data, e.g. post offices, trees etc. (present in LinkedGeoData), query performance can decrease. The user may thus get lost in the silos of information and may not be able to browse through it efficiently to get only what she requires.

When a user is provided with relevant information as a response to her query, higher recall with respect to query answering can be obtained. Data polluted with irrelevant information affects the usability as well as the typical query performance of the dataset. Using external links or owl:sameAs links can help to pull in additional relevant information about a resource and is thus encouraged in LOD.

Relations between dimensions: For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, it could be that irrelevant data is provided, which affects the relevancy dimension.

3.2.2. Trust dimensions

Trust dimensions are those that focus on the trustworthiness of the dataset. There are five dimensions that are part of this group, namely provenance, verifiability, believability, reputation and licensing, which are displayed along with their respective metrics in Table 3. The reference for each metric is provided in the table.

3.2.2.1. Provenance. There are many definitions in the literature that emphasize different views of provenance.

Definition 4 (Provenance). Provenance refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities to enable trust, assess authenticity and allow reproducibility.


Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief and 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for the people a person knows, on a scale of 1–9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between two entities through a path, and (2) an aggregation algorithm, based on a weighting mechanism, for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the source's trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank, or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O

Table 3. Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type – Subjective or Objective.


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.
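As a simple illustration of analyzing such metadata, the following sketch (Python with rdflib) counts the resources that carry at least one common provenance statement; the file name and the particular choice of predicates are assumptions made for this example, not prescriptions of the surveyed approaches:

from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical dataset

provenance_predicates = [DCTERMS.creator, DCTERMS.publisher, DCTERMS.source,
                         PROV.wasDerivedFrom, PROV.wasAttributedTo]
# Resources described with at least one provenance statement.
documented = {s for p in provenance_predicates for s in g.subjects(p, None)}
print(len(documented), "resources with provenance metadata")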

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability. Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion is used as the means a consumer is provided with to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website could publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data allows crawling to be restricted to verified datasets only.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV8 or the recently introduced PROV vocabulary9. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.
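The following sketch illustrates the underlying idea: deriving a stable digest of a graph that a publisher could sign and a consumer recompute. Sorting the N-Triples lines is only a rough approximation of a proper graph canonicalisation such as [11], which must additionally handle blank nodes; the file name is a placeholder and a recent rdflib version (where serialize() returns a string) is assumed:

import hashlib
from rdflib import Graph

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical dataset

# Canonical-ish serialisation: sorted N-Triples lines.
canonical = "\n".join(sorted(g.serialize(format="nt").splitlines()))
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
print(digest)  # value a publisher could sign with a private key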

3.2.2.3. Reputation. The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose the tracking of reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual and (semi-)automated approaches. The manual approach is via a survey in a community, or by questioning other members who can help to determine the reputation of a source or of the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.
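As a sketch of the (semi-)automated approach, PageRank over the link graph between datasets can serve as a crude reputation score (Python with the networkx library; the datasets and links below are invented for illustration):

import networkx as nx

# Hypothetical directed graph of datasets: an edge A -> B means that
# dataset A links to dataset B (e.g. via owl:sameAs or other external links).
G = nx.DiGraph()
G.add_edges_from([("airlines", "dbpedia"), ("airlines", "geonames"),
                  ("lgd", "dbpedia"), ("dbpedia", "geonames")])

scores = nx.pagerank(G, alpha=0.85)  # reputation score per dataset, in [0,1]
print(scores)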

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources may report conflicting prices (or times) for a particular flight number.

8 http://open-biomed.sourceforge.net/opmv/ns.html
9 http://www.w3.org/TR/prov-o/


In that case, the search engine can decide to trust only the source with the higher reputation.

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher, and the reliability of the statements.

3.2.2.4. Believability. In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.
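A minimal sketch of the trusted-provider check (the whitelist and the provider URI are hypothetical):

# Hypothetical whitelist; in practice this could come from a curated
# registry or be derived from provenance metadata.
trusted_providers = {"http://www.lufthansa.com/", "http://www.britishairways.com/"}

provider = "http://unknown-airline.example.org/"  # provider of the dataset at hand
print("believable" if provider in trusted_providers
      else "requires further verification")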

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known airlines such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international airlines. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or the decision on its credibility and believability can be made only by looking at the data source. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10.

10 http://www.w3.org/DesignIssues/UI.html

Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

3.2.2.5. Licensing. "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a voiD description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.
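A sketch of the machine-readable check with rdflib: look for dcterms:license statements in a voiD description (the file name is a placeholder; other license properties, e.g. cc:license, could be checked analogously):

from rdflib import Graph
from rdflib.namespace import DCTERMS

g = Graph()
g.parse("void.ttl", format="turtle")  # hypothetical voiD description

licenses = list(g.objects(None, DCTERMS.license))  # machine-readable licenses
print(licenses if licenses else "no machine-readable license found")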

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data, and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or license details should be mentioned in the voiD12 file.

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void


Relations between dimensions: Verifiability is related to the believability dimension, but differs from it because even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions with their respective metrics which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy are identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
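A sketch of distance-based outlier detection over the values of a single numeric property (the values are invented, and the threshold of two standard deviations is one common choice rather than a prescription of the surveyed approaches):

import statistics

# Hypothetical durations, in minutes, of flights on one route;
# 4500 is an obviously suspicious entry.
values = [95, 102, 98, 110, 105, 4500, 99]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)

# Flag values more than two standard deviations away from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)  # -> [4500]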

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is to manually check accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured qualitatively, but only indirectly: by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.
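The last check can be sketched as a toy majority vote over independent sources reporting the same fact (the sources and values are invented for illustration):

# Hypothetical prices reported for the same flight by independent sources.
reports = {"sourceA": 320, "sourceB": 320, "sourceC": 410}

values = list(reports.values())
majority = max(set(values), key=values.count)  # value confirmed most often
agreement = values.count(majority) / len(values)
print("confirmed value:", majority, "agreement:", agreement)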

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preferences or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.


Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviation-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax of their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality and sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities, stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness & validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type – Subjective or Objective.


However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
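For instance, rdflib's parser can serve as a simple syntax validator (the file name is a placeholder):

from rdflib import Graph

g = Graph()
try:
    # A document violating the RDF/XML grammar raises a parse error.
    g.parse("document.rdf", format="xml")
    print("syntactically valid:", len(g), "triples")
except Exception as err:
    print("syntax error:", err)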

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user receives no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4. Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation and (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
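Two of these network measures can be sketched over the owl:sameAs link graph of a dataset as follows (with rdflib and networkx; the file name is a placeholder):

import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical dataset

# Undirected network built from owl:sameAs links.
N = nx.Graph()
for s, _, o in g.triples((None, OWL.sameAs, None)):
    N.add_edge(str(s), str(o))

if N.number_of_nodes() > 0:
    degrees = [d for _, d in N.degree()]
    print("average interlinking degree:", sum(degrees) / len(degrees))
    print("clustering coefficient:", nx.average_clustering(N))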

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links, but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data: serialisation formats (RDF/XML, N3, NT, Turtle, RDFa), schema and ontology layers (RDF-S, OWL, OWL 2 with its DL, EL, QL, RL, Lite and Full profiles, DAML), rule languages (SWRL, domain-specific rules) and SPARQL with its protocol and JSON query results.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent with respect to the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF [32] standards, and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
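As a small example of the first consistency metric of Table 4, entities stated to be members of two disjoint classes can be retrieved with a SPARQL query (via rdflib; the file name is a placeholder):

from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical dataset

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?entity WHERE {
    ?c1 owl:disjointWith ?c2 .
    ?entity a ?c1, ?c2 .
}"""
for row in g.query(query):
    print("entity in disjoint classes:", row.entity)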

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the

ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) compared to the overall number of attributes or objects, respectively, present in a dataset.
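A sketch of the extensional variant (the file name is a placeholder, and treating subjects with identical property-value sets as redundant representations is a simplifying assumption):

from rdflib import Graph

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical dataset

subjects = set(g.subjects())
# Group subjects by their complete property-value set.
groups = {}
for s in subjects:
    groups.setdefault(frozenset(g.predicate_objects(s)), []).append(s)

# Unique descriptions relative to all subject representations.
print("extensional conciseness:", len(groups) / len(subjects))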

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this reduces the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions: Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preferences. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions that are part of this group, namely availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, as well as by the dereferenceability of the URIs.
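A sketch of such an accessibility check over HTTP, using only the Python standard library (the endpoint URL and the trivial ASK query are placeholders):

import urllib.request

url = "http://example.org/sparql?query=ASK%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D"

try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print("available, status", resp.status)  # 200 OK: endpoint reachable
except Exception as err:                         # HTTP 4xx/5xx or network failure
    print("unavailable:", err)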

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferenceability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. In [27], however, performance is associated with issues such as avoiding prolix RDF features such as (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we can notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by it. In the latter reference, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features, or of the usage of slash-URIs, can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
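Latency, the time from issuing a request until the first data arrives, can be sketched as follows (standard library only; the endpoint URL is a placeholder):

import time
import urllib.request

url = "http://example.org/sparql?query=ASK%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D"

start = time.monotonic()
with urllib.request.urlopen(url, timeout=10) as resp:
    resp.read(1)                                 # wait for the first byte
latency = time.monotonic() - start
print("latency:", round(latency, 3), "seconds")  # cf. the one-second bound in Table 5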

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users.


Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferenceability issues | when a URI returns an error (4xx: client error, 5xx: server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or detection of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between the submission of a request by the user and the reception of the response from the system [5] | O

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type – Subjective or Objective.


Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.
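
As an illustration, the "no use of prolix RDF features" metric from Table 5 could be automated along the following lines. This is a sketch (not taken from the surveyed approaches) using the rdflib library; the input file name is a placeholder.

from rdflib import Graph, RDF

PROLIX_PREDICATES = {RDF.subject, RDF.predicate, RDF.object,  # reification
                     RDF.first, RDF.rest}                     # collections
PROLIX_CLASSES = {RDF.Bag, RDF.Seq, RDF.Alt}                  # containers

def uses_prolix_features(path):
    g = Graph()
    g.parse(path)  # rdflib guesses the serialization from the file extension
    for s, p, o in g:
        if p in PROLIX_PREDICATES:
            return True
        if p == RDF.type and o in PROLIX_CLASSES:
            return True
    return False

print(uses_prolix_features("dataset.ttl"))  # placeholder file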

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, SSL can be used to keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or appropriate methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.
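
A minimal sketch (not from the surveyed approaches) of such a measurement for a SPARQL endpoint is shown below, using the requests library; the endpoint URL and query are placeholders.

import time
import requests

def response_time(endpoint_url, query):
    # Delay between submitting a query and receiving the complete response.
    start = time.monotonic()
    r = requests.get(endpoint_url, params={"query": query}, timeout=60)
    r.raise_for_status()
    _ = r.content  # ensure the complete response body has been received
    return time.monotonic() - start

seconds = response_time("http://example.org/sparql",  # placeholder endpoint
                        "SELECT * WHERE { ?s ?p ?o } LIMIT 100")
print(f"response time: {seconds:.2f}s")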

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. High response times hinder the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions; only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability, versatility as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
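
For illustration, a simple check for the "keeping URIs short" metric in Table 6 might look as follows; this is a sketch (not from the surveyed approaches), and the 80-character length threshold is an arbitrary assumption.

from urllib.parse import urlparse

def uri_is_concise(uri, max_length=80):
    # Flag URIs that are very long or that contain query parameters.
    parsed = urlparse(uri)
    return len(uri) <= max_length and parsed.query == ""

print(uri_is_concise("http://dbpedia.org/resource/Boston"))        # True
print(uri_is_concise("http://example.org/res?id=42&session=abc"))  # False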


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats: identifying objects, and terms used to define the objects, with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective or Objective.

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is thus not a problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies; although restricted to the needs of the OBO Foundry community, these conventions can also be applied to other domains. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the terms understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
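
For instance, the labelling metric from Table 6 (the ratio of entities with an rdfs:label or rdfs:comment to all described entities) could be computed as sketched below; this is an illustration (not from the surveyed approaches) using the SPARQLWrapper library, with a placeholder endpoint.

from SPARQLWrapper import SPARQLWrapper, JSON

def count(endpoint, query):
    # Run a COUNT query and return the single numeric result.
    sw = SPARQLWrapper(endpoint)
    sw.setQuery(query)
    sw.setReturnFormat(JSON)
    result = sw.query().convert()
    return int(result["results"]["bindings"][0]["n"]["value"])

ALL_ENTITIES = "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }"
LABELLED = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT (COUNT(DISTINCT ?s) AS ?n)
WHERE { ?s rdfs:label|rdfs:comment ?l }
"""

endpoint = "http://example.org/sparql"  # placeholder endpoint
ratio = count(endpoint, LABELLED) / count(endpoint, ALL_ENTITIES)
print(f"entities with a human-readable label: {ratio:.1%}")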

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether the information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. The data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as by different access methods.

Example. Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is capable of understanding it.

Provision of Linked Data in different languages, through the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding accept header to an HTTP request.
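
A minimal sketch (not from the surveyed approaches) of the content-negotiation metric in Table 6 follows: it requests the same resource with different Accept headers and records which serializations the server can actually return. The resource URI is a placeholder.

import requests

FORMATS = ["text/turtle", "application/rdf+xml", "application/ld+json"]

def supported_serializations(uri):
    result = {}
    for fmt in FORMATS:
        r = requests.get(uri, headers={"Accept": fmt}, timeout=10)
        # Compare the returned media type (without parameters) to the request.
        returned = r.headers.get("Content-Type", "").split(";")[0].strip()
        result[fmt] = (r.status_code == 200 and returned == fmt)
    return result

print(supported_serializations("http://dbpedia.org/resource/Boston"))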

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms in which a dataset is represented (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified, by providing both currency of documents and currency of triples" [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
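
The "currency of statements" metric listed in Table 7 can be computed directly from these components; the following is a minimal sketch, with illustrative timestamps.

from datetime import datetime, timezone

def currency_of_statement(last_modified, start, now=None):
    # 1 - (current time - last modified time) / (current time - start time),
    # where the start time is when the data was first published in the LOD.
    now = now or datetime.now(timezone.utc)
    age = (now - last_modified).total_seconds()
    history = (now - start).total_seconds()
    return 1.0 - age / history

# Illustrative timestamps: a statement first published in 2010 and last
# modified in June 2012.
score = currency_of_statement(
    datetime(2012, 6, 1, tzinfo=timezone.utc),
    datetime(2010, 1, 1, tzinfo=timezone.utc))
print(round(score, 2))  # values closer to 1 indicate more current data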

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].


Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time)], where the start time identifies when the data was first published in the LOD [50] | O
Currency | currency of data source | detecting whether the last modified time of the semantic web source is older than the last modified time of the original source [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if temporal meta-information is associated with a source [38] | O
Volatility | no timestamp associated with the source | check if temporal meta-information is associated with a source | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective.


3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an up-to-date price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed relative to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and excludes outdated data.

Example. A user wants to book a flight from Milan to Boston, and our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine and corresponds to the volatility of flight information. Although the flight values are updated on time, the information the user receives about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile the data is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is a problem also occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8], (b) a set of approaches covering more than five dimensions [6,14,27,42] and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
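
For instance, the vocabulary re-use ratio above could be computed along the following lines; this is a sketch (not from any surveyed approach) using the rdflib library, where the whitelist of "established" vocabularies and the input file are illustrative assumptions.

from rdflib import Graph
from rdflib.namespace import split_uri

ESTABLISHED = {"http://xmlns.com/foaf/0.1/",          # illustrative whitelist
               "http://purl.org/dc/terms/",
               "http://www.w3.org/2004/02/skos/core#"}

def vocabulary_reuse_ratio(path):
    g = Graph()
    g.parse(path)
    namespaces = set()
    for _, p, _ in g:
        try:
            ns, _name = split_uri(p)   # namespace part of each predicate URI
            namespaces.add(str(ns))
        except ValueError:
            pass                       # skip URIs that cannot be split
    if not namespaces:
        return 0.0
    return len(namespaces & ESTABLISHED) / len(namespaces)

print(vocabulary_reuse_ratio("dataset.ttl"))  # placeholder file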

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring a dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL end-point and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8. Consideration of data quality dimensions in each of the included approaches. Approaches considered: Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011. Dimensions considered: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.


– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at the triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset level [16,22] and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, it can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it requires a lot of time and an understanding of the required XML file structure and specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss) [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/) used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF approach (http://trdf.sourceforge.net) [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as with the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.



Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.


Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php), where a user first specifies the name, URI and three entities of a particular data source. The user then receives a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.



Table 9. Qualitative evaluation of frameworks (Application: G = General, S = Specific; Type of data: RDF Triple / RDF Graph / Dataset; Degree of automation: Manual / Semi-automated / Automated).

Paper | Appl. | Goal | Triple | Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al. 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | – | – | – | – | –
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al. 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al. 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al. 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al. 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | – | – | – | – | –



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanation of each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de), used first to assess the quality of two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, the user should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine (http://code.zemanta.com/sparkica) is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,



LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that the values are given meaning. Reconciliation with Freebase (http://www.freebase.com) helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible using this tool (http://refine.deri.ie/reconciliationDocs). Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and the user must thus perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness, or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487-499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1-10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175-178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165-201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51-74.
[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659-666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats - an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA - a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1-8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227-239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162-176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1-136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227-241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159-172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133-146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto - Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51-60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), K-CAP '07, ACM, pp. 135-142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and the PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial databases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49-66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1-15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26-36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86-95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58-65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5-33.



Dimension | Metric | Description | Type
Provenance | indication of metadata about a dataset | presence of the title, content and URI of the dataset [14] | O
Provenance | computing personalised trust recommendations | using provenance of existing trust annotations in social networks [20] | S
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on the provenance information, which can be either unknown or a value in the interval [-1,1], where 1: absolute belief, -1: absolute disbelief and 0: lack of belief/disbelief [23] | O
Provenance | computing the trustworthiness of RDF statements | computing a trust value based on user-based ratings or an opinion-based method [23] | S
Provenance | detect the reliability and credibility of a person (publisher) | indication of the level of trust for people a person knows on a scale of 1-9 [21] | S
Provenance | computing the trust of an entity | construction of decision networks informed by provenance graphs [16] | O
Provenance | accuracy of computing the trust between two entities | by using a combination of (1) a propagation algorithm, which utilises statistical techniques for computing trust values between two entities through a path, and (2) an aggregation algorithm based on a weighting mechanism for calculating the aggregate value of trust over all paths [53] | O
Provenance | acquiring content trust from users | based on associations that transfer trust from entities to resources [17] | O
Provenance | detection of trustworthiness, reliability and credibility of a data source | use of trust annotations made by several individuals to derive an assessment of the source's trustworthiness, reliability and credibility [18] | S
Provenance | assigning trust values to data/sources/rules | use of trust ontologies that assign content-based or metadata-based trust values that can be transferred from known to unknown data [29] | O
Provenance | determining trust value for data | using annotations for data, such as (i) blacklisting, (ii) authoritativeness and (iii) ranking, and using reasoning to incorporate trust values into the data [8] | O
Verifiability | verifying publisher information | stating the author and his contributors, the publisher of the data and its sources [14] | S
Verifiability | verifying authenticity of the dataset | whether the dataset uses a provenance vocabulary, e.g. the Provenance Vocabulary [14] | O
Verifiability | verifying correctness of the dataset | with the help of an unbiased trusted third party [5] | S
Verifiability | verifying usage of digital signatures | signing a document containing an RDF serialisation or signing an RDF graph [14] | O
Reputation | reputation of the publisher | survey in a community questioned about other members [17] | S
Reputation | reputation of the dataset | analyzing references or page rank or by assigning a reputation score to the dataset [42] | S
Believability | meta-information about the identity of the information provider | checking whether the provider/contributor is contained in a list of trusted providers [5] | O
Licensing | machine-readable indication of a license | detection of the indication of a license in the voiD description or in the dataset itself [14,27] | O
Licensing | human-readable indication of a license | detection of a license in the documentation of the dataset or its source [14,27] | O
Licensing | permissions to use the dataset | detection of a license indicating whether reproduction, distribution, modification or redistribution is permitted [14] | O
Licensing | indication of attribution | detection of whether the work is attributed in the same way as specified by the author or licensor [14] | O
Licensing | indication of Copyleft or ShareAlike | checking whether the derived contents are published under the same license as the original [14] | O

Table 3: Comprehensive list of data quality metrics of the trust dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).


Metrics. Provenance can be measured by analyzing the metadata associated with the source. This provenance information can in turn be used to assess the trustworthiness, reliability and credibility of a data source, an entity, a publisher or individual RDF statements. There exists an inter-dependency between the data provider and the data itself. On the one hand, data is likely to be accepted as true if it is provided by a trustworthy provider. On the other hand, the data provider is trustworthy if it provides true data. Thus, both can be checked to measure the trustworthiness.

Example. Our flight search engine aggregates information from several airline providers. In order to verify the reliability of these different airline data providers, provenance information from the aggregators can be analyzed and re-used, so as to enable users of the flight search engine to trust the authenticity of the data provided to them.

In general, if the source information or publisher is highly trusted, the resulting data will also be highly trusted. Provenance data not only helps determine the trustworthiness, but additionally the reliability and credibility of a source [18]. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

3.2.2.2. Verifiability
Verifiability is described as the "degree and ease with which the information can be checked for correctness" [5]. Similarly, in [14] the verifiability criterion describes the means a consumer is provided with, which can be used to examine the data for correctness. Without such means, the assurance of the correctness of the data would come from the consumer's trust in that source. It can be observed here that, on the one hand, the authors in [5] provide a formal definition, whereas the author in [14] describes the dimension by providing its advantages and metrics.

Definition 5 (Verifiability). Verifiability refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Metrics. Verifiability can be measured either by an unbiased third party, if the dataset itself points to the source, or by the presence of a digital signature.

Example. In our use case, if we assume that the flight search engine crawls information from arbitrary airline websites, which publish flight information according to a standard vocabulary, there is a risk of receiving incorrect information from malicious websites. For instance, such a website might publish cheap flights just to attract a large number of visitors. In that case, the use of digital signatures for published RDF data makes it possible to restrict crawling to verified datasets.

Verifiability is an important dimension when a dataset includes sources with low believability or reputation. This dimension allows data consumers to decide whether to accept provided information. One means of verification in Linked Data is to provide basic provenance information along with the dataset, such as using existing vocabularies like SIOC, Dublin Core, the Provenance Vocabulary, the OPMV⁸ or the recently introduced PROV vocabulary⁹. Yet another mechanism is the usage of digital signatures [11], whereby a source can sign either a document containing an RDF serialisation or an RDF graph. Using a digital signature, the data source can vouch for all possible serialisations that can result from the graph, thus ensuring the user that the data she receives is in fact the data that the source has vouched for.
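To make the signing idea concrete, the following minimal sketch signs and verifies a graph in Python, assuming the rdflib (version 6 or later) and cryptography packages; the triple and the key handling are purely illustrative. It uses a naive canonicalisation (sorted N-Triples), which is only sound for graphs without blank nodes; a faithful implementation would use Carroll's graph canonicalisation [11].

```python
from rdflib import Graph
from cryptography.hazmat.primitives.asymmetric import ed25519

def canonical_bytes(graph: Graph) -> bytes:
    # Sorted N-Triples: every serialisation of the same ground graph
    # (i.e. one without blank nodes) yields the same byte string.
    lines = sorted(graph.serialize(format="nt").splitlines())
    return "\n".join(lines).encode("utf-8")

# The data source signs its graph once ...
g = Graph()
g.parse(data='<http://example.org/flight/A123> '
             '<http://example.org/departsFrom> '
             '<http://example.org/airport/CDG> .', format="nt")
source_key = ed25519.Ed25519PrivateKey.generate()
signature = source_key.sign(canonical_bytes(g))

# ... and a consumer verifies the received copy against the published
# public key; verify() raises InvalidSignature if the data was altered.
source_key.public_key().verify(signature, canonical_bytes(g))
print("signature valid")
```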

3.2.2.3. Reputation
The authors in [17] associate the reputation of an entity either with direct experience or with recommendations from others. They propose tracking reputation either through a centralized authority or via decentralized voting.

Definition 6 (Reputation). Reputation is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, organisation, group of people or community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset.

Metrics. Reputation is usually a score, for example a real value between 0 (low) and 1 (high). There are different possibilities to determine reputation, which can be classified into manual or (semi-)automated approaches. The manual approach is via a survey in a community or by questioning other members, who can help to determine the reputation of a source or of the person who published a dataset. The (semi-)automated approach can be performed by the use of external links or page ranks.

Example. The provision of information on the reputation of data sources allows conflict resolution. For instance, several data sources may report conflicting prices (or times) for a particular flight number.

8 http://open-biomed.sourceforge.net/opmv/ns.html

9 http://www.w3.org/TR/prov-o/


In that case, the search engine can decide to trust only the source with the higher reputation.

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are the trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher and the reliability of the statements.
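As a hedged illustration of the (semi-)automated approach, the sketch below propagates trust through such a web of trust using PageRank, assuming the networkx package; the publishers and trust weights are invented for the example.

```python
import networkx as nx

# Nodes are publishers; a weighted edge (a, b) means "a trusts b" to
# the given degree. All names and weights are invented.
web_of_trust = nx.DiGraph()
web_of_trust.add_weighted_edges_from([
    ("airline-a.example", "aggregator.example", 0.9),
    ("airline-b.example", "aggregator.example", 0.8),
    ("aggregator.example", "airline-a.example", 0.7),
    ("blog.example", "airline-b.example", 0.3),
])

# PageRank propagates trust along the edges and yields a normalised
# reputation score per publisher.
reputation = nx.pagerank(web_of_trust, weight="weight")
for publisher, score in sorted(reputation.items(), key=lambda kv: -kv[1]):
    print(f"{publisher}: {score:.3f}")
```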

3.2.2.4. Believability
In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known airlines such as Lufthansa, British Airways, etc., then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international airlines. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or the decision about its credibility and believability can be made only by looking at the data source. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web¹⁰.

10 http://www.w3.org/DesignIssues/UI.html

Pressing such a button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

3.2.2.5. Licensing
"In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License¹¹, which allows others to copy, distribute and use the data and produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license, or the license details should be mentioned in the VoID¹² file.

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void
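A minimal sketch of the machine-readable licensing check follows, assuming rdflib; the VoID document URL is hypothetical, and only the dcterms:license and cc:license properties are probed, although other license properties are in use as well.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS

CC = Namespace("http://creativecommons.org/ns#")

def has_machine_readable_license(void_url: str) -> bool:
    g = Graph()
    g.parse(void_url)  # rdflib fetches and parses the VoID description
    licenses = list(g.objects(None, DCTERMS.license)) + \
               list(g.objects(None, CC.license))
    return len(licenses) > 0

# Hypothetical dataset description:
print(has_machine_readable_license("http://example.org/dataset/void.ttl"))
```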


Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks the information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability, and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions
Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions with their respective metrics which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness. The reference for each metric is provided in the table.

3.2.3.1. Accuracy
In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
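The following minimal sketch illustrates a distance-based outlier check on numeric property values, using the robust median-absolute-deviation rule as one concrete instance of the statistical methods mentioned above; the values are invented flight durations.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    # Modified z-score (median absolute deviation); more robust than a
    # mean/stdev rule when a single extreme value distorts the sample.
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    return [v for v in values
            if mad and 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical flight durations in minutes for one route; 4800 is a
# data-entry error.
durations = [475, 480, 490, 485, 470, 480, 4800]
print(mad_outliers(durations))  # -> [4800]
```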

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity
In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured qualitatively, but only indirectly, by checking the authenticity of the source responsible for the information, whether the dataset is neutral or whether the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.


Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori algorithm [1]), or inverse relations, or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous¹³ annotation / representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality and sameAs chains, description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates¹⁴ | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4: Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).


However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making, and should be avoided.
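A sketch of such a cross-dataset confirmation check follows, assuming the SPARQLWrapper package; the second endpoint, the property and the resource are assumptions for illustration, and real sources would first require mapping equivalent properties onto each other.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = [
    "http://dbpedia.org/sparql",
    "http://example.org/airline/sparql",   # hypothetical second source
]
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?seats WHERE {
  <http://dbpedia.org/resource/Airbus_A320> dbo:numberOfSeats ?seats .
}"""

def fetch(endpoint: str) -> set:
    client = SPARQLWrapper(endpoint)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    results = client.query().convert()
    return {b["seats"]["value"] for b in results["results"]["bindings"]}

values = [fetch(e) for e in ENDPOINTS]
# The fact counts as confirmed only if independent sources agree on it.
print("confirmed" if set.intersection(*values) else "conflicting")
```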

3.2.3.3. Validity-of-documents
"Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
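A minimal sketch of the syntax-validation metric, assuming rdflib: a document counts as syntactically valid if the parser accepts it.

```python
from rdflib import Graph

def is_syntactically_valid(data: str, fmt: str = "xml") -> bool:
    try:
        Graph().parse(data=data, format=fmt)
        return True
    except Exception as error:  # rdflib raises parser-specific errors
        print(f"syntax error: {error}")
        return False

# Deliberately broken RDF/XML (unclosed elements):
broken = ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
          '<rdf:Description rdf:about="http://example.org/flight/A123">')
print(is_syntactically_valid(broken))  # -> False
```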

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process the data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as such data can neither be processed nor can a consumer perform reasoning on it.

3.2.3.4. Interlinking
When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) frameworks, (b) terminological reasoning, (c) data transformation, (d) query processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
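A hedged sketch of two of these network measures, computed over the owl:sameAs graph of a local dump with rdflib and networkx; data.ttl is a hypothetical file, and [22] defines the measures on a richer mapping network than this simplification.

```python
import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

rdf_graph = Graph().parse("data.ttl", format="turtle")  # hypothetical dump

# Build an undirected network from all owl:sameAs statements.
sameas_net = nx.Graph()
for s, o in rdf_graph.subject_objects(OWL.sameAs):
    sameas_net.add_edge(str(s), str(o))

if sameas_net.number_of_nodes() > 0:
    degrees = dict(sameas_net.degree())
    print("interlinking degree (avg):",
          sum(degrees.values()) / len(degrees))
    print("clustering coefficient:", nx.average_clustering(sameas_net))
```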

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links, but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency
Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data: serialisation formats (RDF/XML, N3, NT, Turtle, RDFa), schema and ontology languages (RDF-S, OWL, OWL2 with its EL, QL, RL, DL-Lite and Full profiles, DAML), rule languages (SWRL, domain-specific rules) and SPARQL with its protocol and query results in JSON.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF [32] standards, and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
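As one concrete, reasoner-free check, the sketch below queries for members of explicitly disjoint classes, assuming rdflib and a hypothetical local dump; a full consistency assessment would instead run a reasoner for the chosen profile.

```python
from rdflib import Graph

g = Graph().parse("data.ttl", format="turtle")  # hypothetical dump

# Entities typed with two classes that are declared owl:disjointWith
# each other constitute a logical contradiction.
violations = g.query("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT DISTINCT ?entity ?c1 ?c2 WHERE {
        ?c1 owl:disjointWith ?c2 .
        ?entity a ?c1, ?c2 .
    }""")
for entity, c1, c2 in violations:
    print(f"{entity} is a member of disjoint classes {c1} and {c2}")
```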

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness
The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
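A crude sketch of the two ratios follows, assuming rdflib and a hypothetical local dump. Note the simplification: the intensional metric of [42] compares against a target schema, and the extensional one counts object representations, whereas this sketch only measures duplication of properties and subjects within one file.

```python
from rdflib import Graph

g = Graph().parse("data.ttl", format="turtle")  # hypothetical dump

predicates = [p for _, p, _ in g]   # one entry per triple
subjects = [s for s, _, _ in g]

intensional = len(set(predicates)) / len(predicates) if predicates else 1.0
extensional = len(set(subjects)) / len(subjects) if subjects else 1.0
print(f"intensional conciseness (approx.): {intensional:.2f}")
print(f"extensional conciseness (approx.): {extensional:.2f}")
```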

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, the extensional conciseness is reduced. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions
The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, namely availability, performance, security and response-time, displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability
Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
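The following sketch probes two of these metrics over HTTP, assuming the requests package; both URLs are hypothetical.

```python
import requests

def endpoint_available(endpoint: str) -> bool:
    # A trivial ASK query suffices to test that the endpoint answers.
    r = requests.get(endpoint, params={"query": "ASK {}"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def dereferencable(uri: str) -> bool:
    # 4xx/5xx response codes signal dereferencability problems.
    r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                     allow_redirects=True, timeout=10)
    return r.status_code == 200

print(endpoint_available("http://example.org/sparql"))
print(dereferencable("http://example.org/resource/A123"))
```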

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as a 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance
Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. In [27], however, performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we notice that no formal definition of performance is provided: the former reference gives a general description of performance without explaining what is meant by it, while the latter describes an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or the usage of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
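A sketch of the latency and scalability metrics from Table 5, assuming the requests package and a hypothetical endpoint; production measurements would average over many runs and query mixes.

```python
import time
import requests

ENDPOINT = "http://example.org/sparql"  # hypothetical
QUERY = {"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 10"}

def timed_request() -> float:
    start = time.perf_counter()
    requests.get(ENDPOINT, params=QUERY, timeout=30)
    return time.perf_counter() - start

single = timed_request()
ten = [timed_request() for _ in range(10)]

print(f"latency: {single:.2f}s (should stay below ~1s [14])")
# The source scales if ten requests take (amortised) about as long as one.
print(f"scalability ratio: {(sum(ten) / 10) / single:.2f} (close to 1 is good)")
```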

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users.


Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or detection of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5: Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).

Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance requirements calls for caching mechanisms.

Latency is the amount of time from issuing a query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security
Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost associated with the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can keep her information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can serve as a means to keep information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, for which reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4. Response-time
Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A long response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available and has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions
Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness
Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
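A minimal sketch of the long-URI check follows, assuming rdflib and a hypothetical local dump; the 80-character threshold is an arbitrary assumption, since [27] does not fix one.

```python
from rdflib import Graph, URIRef

g = Graph().parse("data.ttl", format="turtle")  # hypothetical dump

# Collect every distinct URI node and flag long or parameterised ones.
uris = {t for triple in g for t in triple if isinstance(t, URIRef)}
for node in uris:
    uri = str(node)
    if len(uri) > 80 or "?" in uri:
        print("long or parameterised URI:", uri)
```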


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective (S) or Objective (O).

3.2.5.2. Representational-consistency
Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different serialisations to represent their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data according to the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is not problematic in our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF¹⁵ should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb¹⁶, SchemaCache¹⁷ and Swoogle¹⁸. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies¹⁹. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can nevertheless be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
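A minimal sketch of the first of these metrics, assuming the dataset is available locally as a Turtle file and fits in memory (the rdflib library and the file name are illustrative choices, not prescribed by the surveyed approaches):

from rdflib import Graph, RDFS, URIRef

def label_coverage(path):
    """Fraction of URI subjects that have at least one rdfs:label."""
    g = Graph()
    g.parse(path, format="turtle")
    subjects = {s for s in g.subjects() if isinstance(s, URIRef)}
    labelled = {s for s in g.subjects(RDFS.label, None) if isinstance(s, URIRef)}
    return len(labelled & subjects) / len(subjects) if subjects else 1.0

if __name__ == "__main__":
    print(label_coverage("dataset.ttl"))  # e.g. 0.85 -> 85% of entities are labelled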

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in such a case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport

For the first two items no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.
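One possible way to operationalise the globally-unique-identifiers aspect, sketched here under the same assumptions as above (an in-memory rdflib graph; this is our illustration, not a metric defined in the surveyed approaches), is to measure the share of subjects identified by URIs rather than blank nodes, since blank nodes carry no global identity:

from rdflib import Graph, BNode

def uri_identifiability(path):
    """Fraction of subjects identified by URIs rather than blank nodes."""
    g = Graph()
    g.parse(path, format="turtle")
    subjects = set(g.subjects())
    blank = {s for s in subjects if isinstance(s, BNode)}
    return (1 - len(blank) / len(subjects)) if subjects else 1.0

if __name__ == "__main__":
    print(uri_identifiability("dataset.ttl"))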

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and in different languages, as well as via different access methods.

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages, using language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
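The following minimal sketch illustrates such a content-negotiation check; the resource URI is only an example, and a server that honours the Accept header should answer with a matching Content-Type:

import urllib.request

RESOURCE = "http://dbpedia.org/resource/Boston"  # example resource

def negotiate(uri, mime):
    """Request the resource in the given format; return the served Content-Type."""
    req = urllib.request.Request(uri, headers={"Accept": mime})
    with urllib.request.urlopen(req) as resp:  # redirects are followed
        return resp.headers.get("Content-Type")

if __name__ == "__main__":
    for mime in ("text/turtle", "application/rdf+xml", "text/html"):
        print(mime, "->", negotiate(RESOURCE, mime))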

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms a dataset is represented in (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time(20))] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)(21) [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective.

(20) The start time identifies the time when the data was first published in the LOD cloud.
(21) I.e., older than the last modification time.
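A minimal sketch of the first metric in Table 7, the currency of statements from [50], assuming the three timestamps are available as datetime values:

from datetime import datetime

def currency_of_statement(last_modified, start_time, now=None):
    """Currency per [50]: 1 - (now - last_modified) / (now - start_time),
    where start_time is when the data was first published in the LOD cloud.
    Returns a value in [0, 1]; 1 means just modified, 0 means never updated."""
    now = now or datetime.utcnow()
    age = (now - last_modified).total_seconds()
    lifetime = (now - start_time).total_seconds()
    return (1 - age / lifetime) if lifetime > 0 else 0.0

if __name__ == "__main__":
    print(currency_of_statement(datetime(2012, 10, 1),
                                datetime(2012, 1, 1),
                                now=datetime(2012, 12, 1)))  # ~0.82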

3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies in the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed for a specific data value.
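A sketch of this measurement (our illustration; the timestamps are assumed to be available as datetime values):

from datetime import datetime, timedelta

def volatility(expiry_time, input_time):
    """Volatility as the distance between expiry time and input time,
    i.e. the length of time during which the data remains valid."""
    return expiry_time - input_time

if __name__ == "__main__":
    published = datetime(2012, 12, 1, 12, 0)
    expires = published + timedelta(minutes=1)  # a flight price valid for one minute
    print(volatility(expires, published))       # prints 0:01:00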

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation, and does not include outdated data.

Example. A user wants to book a flight from Milan to Boston, and our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of the flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, so that it never becomes outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practically feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.
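To make the interplay of the two components concrete, the following sketch (our illustration, not a formula from the surveyed approaches) flags a value as no longer timely once its age exceeds its volatility:

from datetime import datetime, timedelta

def is_timely(last_modified, validity, now=None):
    """True while the age of the value (currency component) is still
    within its validity period (volatility component)."""
    now = now or datetime.utcnow()
    return now - last_modified <= validity

if __name__ == "__main__":
    updated = datetime(2012, 12, 1, 12, 0)
    checked = datetime(2012, 12, 1, 12, 3)
    # Flight availability is refreshed every two minutes, so a
    # three-minute-old value is no longer timely.
    print(is_timely(updated, timedelta(minutes=2), now=checked))  # False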

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is: data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
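As a minimal sketch of such a ratio metric, the following computes the vocabulary re-use ratio; the whitelist of "established" namespaces is an illustrative assumption and could in practice be derived from a service such as LODStats:

# Illustrative whitelist of established vocabulary namespaces.
ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def vocabulary_reuse_ratio(used_namespaces):
    """No. of established vocabularies used / total no. of vocabularies used."""
    used = set(used_namespaces)
    return len(used & ESTABLISHED) / len(used) if used else 1.0

if __name__ == "__main__":
    used = {"http://xmlns.com/foaf/0.1/", "http://example.org/myvocab#"}
    print(vocabulary_reuse_ratio(used))  # 0.5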

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of (i) the server, (ii) the SPARQL endpoint and (iii) the RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types, or by the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

Table 8. Consideration of data quality dimensions in each of the included approaches. [The table is a checkmark matrix whose cell-level detail could not be recovered from the source; only the rows, columns and per-row totals survive. Approaches (rows), with the number of dimensions each considers: Bizer et al. 2009 (19); Flemming et al. 2010 (11); Böhm et al. 2010 (2); Chen et al. 2010 (1); Guéret et al. 2011 (2); Hogan et al. 2010 (2); Hogan et al. 2012 (7); Lei et al. 2007 (4); Mendes et al. 2012 (5); Mostafavi et al. 2004 (1); Fürber et al. 2011 (5); Rula et al. 2012 (1); Hartig 2008 (1); Gamble et al. 2011 (1); Shekarpour et al. 2010 (1); Golbeck et al. 2006 (1); Gil et al. 2002 (1); Golbeck et al. 2003 (1); Gil et al. 2007 (2); Jacobi et al. 2011 (1); Bonatti et al. 2011 (1). Dimensions (columns): Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep-conciseness, Rep-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.]

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated, or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources), and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time as well as an understanding of the required XML file structure and specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9. Qualitative evaluation of frameworks.

Paper | Application (G = general, S = specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al. 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | - | - | - | - | -
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al. 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al. 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al. 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al. 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | - | - | - | - | -



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration.properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de/

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element, which refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful for performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that the values are given meaning. Reconciliation with Freebase28 helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as export the data as RDF, which adds to the uniformity and usability of the dataset.

27 http://code.zemanta.com/sparkica/
28 http://www.freebase.com/
29 http://refine.deri.ie/reconciliationDocs

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data to and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed by each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results with limited usefulness or required a high level of interpretation.

Meanwhile, a substantial amount of research on data quality is being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C. and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C. and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M. and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A. and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P. and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D. and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P. and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M. and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C. and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M. and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y. and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y. and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B. and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C. and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210
[25] Heath, T. and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S. and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A. and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B. and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L. and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M. and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S. and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K. and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M. and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V. and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V. and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D. and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W. and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C. and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H. and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R. and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G. and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G. and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W. and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S. and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M. and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M. and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P. and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S. and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D. and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y. and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y. and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 11: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 11

Metrics Provenance can be measured by analyzingthe metadata associated with the source This prove-nance information can in turn be used to assess thetrustworthiness reliability and credibility of a datasource an entity a publishers or individual RDF state-ments There exists an inter-dependancy between thedata provider and the data itself On the one handdata is likely to be accepted as true if it is providedby a trustworthy provider On the other hand the dataprovider is trustworthy if it provides true data Thusboth can be checked to measure the trustworthiness

Example Our flight search engine constitutes infor-mation from several airline providers In order to verifythe reliability of these different airline data providersprovenance information from the aggregators can beanalyzed and re-used so as enable users of the flightsearch engine to trust the authenticity of the data pro-vided to them

In general if the source information or publisher ishighly trusted the resulting data will also be highlytrusted Provenance data not only helps determinethe trustworthiness but additionally the reliability andcredibility of a source [18] Provenance assertions area form of contextual metadata and can themselves be-come important records with their own provenance

3222 Verifiability Verifiability is described as theldquodegree and ease with which the information can bechecked for correctnessrdquo [5] Similarly in [14] the ver-ifiability criterion is used as the means a consumer isprovided with which can be used to examine the datafor correctness Without such means the assurance ofthe correctness of the data would come from the con-sumerrsquos trust in that source It can be observed herethat on the one hand the authors in [5] provide a for-mal definition whereas the author in [14] describes thedimension by providing its advantages and metrics

Definition 5 (Verifiability) Verifiability refers to thedegree by which a data consumer can assess the cor-rectness of a dataset and as a consequence its trust-worthiness

Metrics Verifiability can be measured either by anunbiased third party if the dataset itself points to thesource or by the presence of a digital signature

Example In our use case if we assume that theflight search engine crawls information from arbitraryairline websites which publish flight information ac-cording to a standard vocabulary there is a risk for re-ceiving incorrect information from malicious websitesFor instance such a website publishes cheap flights

just to attract a large number of visitors In that casethe use of digital signatures for published RDF dataallows to restrict crawling only to verified datasets

Verifiability is an important dimension when adataset includes sources with low believability or rep-utation This dimension allows data consumers to de-cide whether to accept provided information Onemeans of verification in Linked Data is to provide basicprovenance information along with the dataset suchas using existing vocabularies like SIOC Dublin CoreProvenance Vocabulary the OPMV8 or the recently in-troduced PROV vocabulary9 Yet another mechanismis by the usage of digital signatures [11] whereby asource can sign either a document containing an RDFserialisation or an RDF graph Using a digital signa-ture the data source can vouch for all possible seriali-sations that can result from the graph thus ensuring theuser that the data she receives is in fact the data thatthe source has vouched for

3223 Reputation The authors in [17] associatereputation of an entity either as a result from direct ex-perience or recommendations from others They pro-pose the tracking of reputation either through a cen-tralized authority or via decentralized voting

Definition 6 (Reputation) Reputation is a judgementmade by a user to determine the integrity of a sourceIt is mainly associated with a data publisher a personorganisation group of people or community of prac-tice rather than being a characteristic of a dataset Thedata publisher should be identifiable for a certain (partof a) dataset

Metrics Reputation is usually a score for exam-ple a real value between 0 (low) and 1 (high) Thereare different possibilities to determine reputation andcan be classified into manual or (semi-)automated ap-proaches The manual approach is via a survey in acommunity or by questioning other members who canhelp to determine the reputation of a source or by theperson who published a dataset The (semi-)automatedapproach can be performed by the use of external linksor page ranks

Example The provision of information on the rep-utation of data sources allows conflict resolution Forinstance several data sources report conflicting prices(or times) for a particular flight number In that case

8httpopen-biomedsourceforgenetopmvnshtml

9httpwwww3orgTRprov-o

12 Linked Data Quality

the search engine can decide to trust only the sourcewith higher reputation

Reputation is a social notion of trust [19] Trust isoften represented in a web of trust where nodes areentities and edges are the trust value based on a met-ric that reflects the reputation one entity assigns to an-other [17] Based on the information presented to auser she forms an opinion or makes a judgement aboutthe reputation of the dataset or the publisher and thereliability of the statements

3224 Believability In [5] believability is explainedas ldquothe extent to which information is regarded astrue and crediblerdquo Believability can also be termed asldquotrustworthiness as it is the subjective measure of ausers belief that the data is true [29]

Definition 7 (Believability) Believability is defined asthe degree to which the information is accepted to becorrect true real and credible

Metrics Believability is measured by checkingwhether the contributor is contained in a list of trustedproviders In Linked Data believability can be subjec-tively measured by analyzing the provenance informa-tion of the dataset

Example In our flight search engine use case ifthe flight information is provided by trusted and well-known flights companies such as Lufthansa BritishAirways etc then the user believes the informationprovided by their websites She does not need to ver-ify their credibility since these are well-known interna-tional flight companies On the other hand if the userretrieves information about an airline previously un-known she can decide whether to believe this infor-mation by checking whether the airline is well-knownor if it is contained in a list of trusted providers More-over she will need to check the source website fromwhich this information was obtained

This dimension involves the decision of which in-formation to believe Users can make these decisionsbased on factors such as the source their prior knowl-edge about the subject the reputation of the source andtheir prior experience [29] Either all of the informa-tion that is provided can be trusted if it is well-knownor only by looking at the data source can the decisionof its credibility and believability be determined An-other method proposed by Tim Berners-Lee was thatWeb browsers should be enhanced with an Oh yeahbutton to support the user in assessing the reliabilityof data encountered on the web10 Pressing of such a

10httpwwww3orgDesignIssuesUIhtml

button for any piece of data or an entire dataset wouldcontribute towards assessing the believability of thedataset

3225 Licensing ldquoIn order to enable informationconsumers to use the data under clear legal terms eachRDF document should contain a license under whichthe content can be (re-)usedrdquo [2714] Additionally theexistence of a machine-readable indication (by includ-ing the specifications in a VoID description) as well asa human-readable indication of a license is also impor-tant Although in both [27] and [14] the authors do notprovide a formal definition they agree on the use andimportance of licensing in terms of data quality

Definition 8 (Licensing) Licensing is defined as agranting of the permission for a consumer to re-use adataset under defined conditions

Metrics: Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.
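
As a sketch of the machine-readable part of this check, one can probe a dataset's VoID description for license triples with rdflib; the file name is hypothetical, and the set of predicates probed (dcterms:license, dcterms:rights, cc:license) is a common convention rather than an exhaustive list:

```python
from rdflib import Graph, Namespace

DCT = Namespace("http://purl.org/dc/terms/")
CC = Namespace("http://creativecommons.org/ns#")

def has_machine_readable_license(graph: Graph) -> bool:
    """True if any resource carries a license/rights statement."""
    for predicate in (DCT.license, DCT.rights, CC.license):
        if next(graph.triples((None, predicate, None)), None) is not None:
            return True
    return False

g = Graph().parse("void.ttl", format="turtle")  # the dataset's VoID description
print("License indicated:", has_machine_readable_license(g))
```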

Example: Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data, and to produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as the consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset clearly indicating the type of license or waiver, or the license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void


Relations between dimensions: Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated to the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability, and vice-versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2) because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy In [5], accuracy is defined as the “degree of correctness and precision with which information in an information system represents states of the real world”. Additionally, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly to semantic accuracy, and the syntactic accuracy to the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly to semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, the accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics: Accuracy can be measured by checking the correctness of the data in a data source, that is, by the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
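
A minimal sketch of the distribution-based outlier part of this metric follows, assuming numeric literals for a hypothetical property; functional-dependency checks would require additional, domain-specific rules:

```python
import statistics
from rdflib import Graph, URIRef

def numeric_outliers(graph, prop, z_max=3.0):
    """Distribution-based outlier detection: flag values of `prop` lying
    more than z_max standard deviations from the mean."""
    values = []
    for subj, obj in graph.subject_objects(prop):
        try:
            values.append((subj, float(obj)))
        except ValueError:
            continue  # skip non-numeric literals
    if len(values) < 3:
        return []
    mean = statistics.mean(v for _, v in values)
    stdev = statistics.stdev(v for _, v in values)
    if stdev == 0:
        return []
    return [(s, v) for s, v in values if abs(v - mean) / stdev > z_max]

g = Graph().parse("airports.ttl", format="turtle")
prop = URIRef("http://example.org/ontology/elevation")  # hypothetical property
for subj, value in numeric_outliers(g, prop):
    print(f"Suspicious value {value} on {subj}")
```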

Example: In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high-quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity In [5], objectivity is expressed as “the extent to which information is unbiased, unprejudiced and impartial”.

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics: Objectivity can not be measured qualitatively, but indirectly, by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example: In our use case, consider the reviews available for each airline regarding its safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information. However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making, and should be avoided.

Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e., those that do not contain useful values for data entries | using association rules (with the Apriori algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or whether the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation/representation | 1 - (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality, sameAs chains and description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real-world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and of relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type - Subjective or Objective.


3.2.3.3. Validity-of-documents “Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents” [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics: A syntax validator can be employed to assess the validity of a document, i.e., its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
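
In practice, any RDF parser can serve as such a validator; the sketch below treats a document as syntactically valid if rdflib parses it without raising an exception (the file name and serialisation format are assumptions):

```python
from rdflib import Graph

def is_syntactically_valid(path: str, fmt: str = "xml") -> bool:
    """Parse the document; any parser exception marks it as invalid."""
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception as exc:  # parser error types vary by serialisation
        print(f"Syntax error in {path}: {exc}")
        return False

print(is_syntactically_valid("flights.rdf"))
```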

Example: In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user receives no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4. Interlinking When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through such typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks, (b) Terminological Reasoning, (c) Data Transformation, (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics: Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality, and description richness through sameAs links.
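
The network-measure metrics can be approximated by loading the owl:sameAs statements into a graph library; below is a sketch with rdflib and networkx, where the input file is hypothetical and only two of the measures from [22] are shown:

```python
import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

rdf = Graph().parse("dataset.ttl", format="turtle")

# Build an undirected network whose edges are owl:sameAs statements
net = nx.Graph()
for s, o in rdf.subject_objects(OWL.sameAs):
    net.add_edge(str(s), str(o))

if net.number_of_nodes() > 0:
    avg_degree = sum(d for _, d in net.degree()) / net.number_of_nodes()
    print("Average interlinking degree:", avg_degree)
    print("Clustering coefficient:", nx.average_clustering(net))
```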

Example: In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e., incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency Consistency implies that “two or more values do not conflict with each other” [5]. Similarly, in [14] and [26], consistency is defined as “no contradictions in the data”. A more generic definition is that “a dataset is consistent if it is free of conflicting information” [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.

16 Linked Data Quality

[Fig. 3. Different components related to the RDF representation of data. The figure groups the components into RDF serialisation formats (RDF/XML, N3, NT, Turtle, RDFa), schema and ontology languages (RDF-S, DAML, OWL with DL, Lite and Full, OWL2 with its EL, QL and RL profiles, SWRL, domain-specific rules), and access mechanisms (the SPARQL protocol and query results in JSON).]

Metrics: On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which might then render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example, according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
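
A small part of such a check can be expressed without a full reasoner: the SPARQL query below flags entities asserted to be members of two classes declared disjoint. This is only a sketch over asserted types; inferred types would require an OWL reasoner, and the input files are hypothetical:

```python
from rdflib import Graph

g = Graph()
g.parse("ontology.ttl", format="turtle")  # schema with owl:disjointWith axioms
g.parse("data.ttl", format="turtle")      # instance data

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?entity ?c1 ?c2 WHERE {
    ?c1 owl:disjointWith ?c2 .
    ?entity a ?c1 , ?c2 .
}
"""
for row in g.query(query):
    print(f"{row.entity} is a member of disjoint classes {row.c1} and {row.c2}")
```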

Example: Let us assume a user looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
B123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), “a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)”. Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), “a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)”. Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the “degree to which data is free of redundancies, in breadth, depth and scope”. This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to the redundant attributes, and (ii) extensional conciseness (data level), which refers to the redundant objects.

Metrics: As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) and the overall number of attributes or objects, respectively, present in a dataset.
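
As a rough sketch of extensional conciseness, one can approximate 'equivalent objects' by grouping instances on a key such as rdfs:label; real duplicate detection would of course require proper entity resolution, and the input file is hypothetical:

```python
from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph().parse("flights.ttl", format="turtle")

# Unique objects relative to all object representations, approximated
# by treating instances with identical labels as equivalent
labels = [str(o) for o in g.objects(None, RDFS.label)]
if labels:
    print(f"Extensional conciseness: {len(set(labels)) / len(labels):.2f}")
```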

Example: In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this reduces the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions: Both the accuracy and objectivity dimensions focus on the correctness of representing real-world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g., height of a building vs. product description). Objectivity is also related to the verifiability dimension; that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions part of this group, namely availability, performance, security and response-time, which are displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability Availability refers to “the extent to which information is available, or easily and quickly retrievable” [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics: Availability of a dataset can be measured in terms of the accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
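
These checks are straightforward to automate over HTTP; below is a sketch using the requests library, where the probed endpoint and resource URI are merely examples:

```python
import requests

def endpoint_available(endpoint: str) -> bool:
    """Send a trivial ASK query and check for a 200 response."""
    try:
        r = requests.get(endpoint, params={"query": "ASK {}"},
                         headers={"Accept": "application/sparql-results+json"},
                         timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

def dereferencable(uri: str) -> bool:
    """A URI counts as dereferencable if it resolves without a 4xx/5xx code."""
    try:
        r = requests.head(uri, allow_redirects=True, timeout=10)
        return r.status_code < 400
    except requests.RequestException:
        return False

print(endpoint_available("http://dbpedia.org/sparql"))
print(dereferencable("http://dbpedia.org/resource/Boston"))
```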

Example: Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance Performance is denoted as a quality indicator which “comprises aspects of enhancing the performance of a source as well as measuring of the actual values” [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by it. In the latter reference, the authors describe the issues related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset; that is, the more performant a data source, the more efficiently a system can process its data.

Metrics: Performance is measured based on the scalability of the data source; that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or the usage of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
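
Latency and throughput can be estimated by timing repeated requests; below is a sketch (the endpoint and query are placeholders), which can also drive the scalability metric from Table 5 by comparing the per-request time of a batch of ten against a single request:

```python
import time
import requests

ENDPOINT = "http://dbpedia.org/sparql"  # any public SPARQL endpoint
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

def mean_latency(n: int = 10) -> float:
    """Average seconds per request over n trials."""
    start = time.perf_counter()
    for _ in range(n):
        requests.get(ENDPOINT, params={"query": QUERY},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=30)
    return (time.perf_counter() - start) / n

latency = mean_latency()
print(f"Mean latency: {latency:.3f}s, throughput: {1 / latency:.1f} req/s")
```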

Example: In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, of a URI without any supporting RDF metadata, of no redirection using the status code 303 See Other, or of no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too low [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer an amount of ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured, and their type - Subjective or Objective.


Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs, and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity, as well as accessibility.

3.2.4.3. Security Security refers to “the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers” [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics: Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example: In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can be potentially targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4. Response-time Response-time is defined as “that which measures the delay between submission of a request by the user and reception of the response from the system” [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics: Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example: A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. Low response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions: The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness Representational-conciseness is only defined as “the extent to which information is compactly represented” [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics: Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example: A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. The concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, in [27], the use of very long URIs, or of those that contain query parameters, is identified as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
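
The long-URI metric from Table 6 can be sketched as follows; the 80-character threshold is an arbitrary assumption, as [27] does not prescribe a specific limit:

```python
from rdflib import Graph, URIRef

MAX_LEN = 80  # heuristic threshold; not prescribed by the survey

def verbose_uris(graph: Graph):
    """Collect URIs that are overly long or carry query parameters."""
    seen = set()
    for triple in graph:
        for term in triple:
            if isinstance(term, URIRef) and term not in seen:
                seen.add(term)
                if len(term) > MAX_LEN or "?" in str(term):
                    yield str(term)

g = Graph().parse("dataset.ttl", format="turtle")
for uri in verbose_uris(g):
    print("Verbose URI:", uri)
```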


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type - Subjective or Objective.

3.2.5.2. Representational-consistency Representational-consistency is defined as “the extent to which information is represented in the same format” [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics: Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.
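
The vocabulary re-use ratio given in Section 4.2 can be sketched by extracting the namespaces of the predicates used and comparing them against a whitelist of established vocabularies; the whitelist below is a small, hypothetical sample:

```python
from rdflib import Graph

# Hypothetical sample of established vocabularies
ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

g = Graph().parse("dataset.ttl", format="turtle")

def namespace(uri: str) -> str:
    """A predicate's vocabulary is its namespace up to the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]

vocabs = {namespace(str(p)) for p in g.predicates()}
if vocabs:
    ratio = len(vocabs & ESTABLISHED) / len(vocabs)
    print(f"Established-vocabulary ratio: {ratio:.2f}")
```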

Example: In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as necessary to handle different formats of data. The exchange of information in different formats is not problematic in our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to perform a search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability Understandability is defined as the “extent to which data is easily comprehended by the information consumer” [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; these conventions can nevertheless be applied to other domains.

Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
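
The labelling metric from Table 6 reduces to a simple ratio; below is a sketch with rdflib over a hypothetical input file:

```python
from rdflib import Graph
from rdflib.namespace import RDF, RDFS

g = Graph().parse("dataset.ttl", format="turtle")

# Entities are approximated here as all typed subjects
entities = set(g.subjects(RDF.type, None))
labelled = {s for s in entities
            if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}

if entities:
    print(f"Human-readable labelling ratio: {len(labelled) / len(entities):.2f}")
```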

Example: Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset4, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability Interpretability is defined as the “extent to which information is in appropriate languages, symbols and units, and the definitions are clear” [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

- http://rdf.freebase.com/ns/m.049jnng
- http://rdf.freebase.com/ns/m.043j22x
- Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility In [14], versatility is defined as the “alternative representations of the data and its handling”.

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and in different languages, as well as via different access methods.

Example: Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages, with the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format, in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify the accepted formats and languages by adding a corresponding Accept header to an HTTP request.
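
Content negotiation can be probed by requesting the same resource under different Accept headers; below is a sketch with the requests library, where the resource URI and the set of media types are illustrative choices:

```python
import requests

FORMATS = {
    "text/turtle": "Turtle",
    "application/rdf+xml": "RDF/XML",
    "application/ld+json": "JSON-LD",
    "text/html": "HTML",
}

def supported_serialisations(uri: str):
    """Yield the representations a resource offers via content negotiation."""
    for mime, name in FORMATS.items():
        try:
            r = requests.get(uri, headers={"Accept": mime}, timeout=10)
            if r.ok and mime in r.headers.get("Content-Type", ""):
                yield name
        except requests.RequestException:
            continue

print(list(supported_serialisations("http://dbpedia.org/resource/Boston")))
```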

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms a dataset is represented in (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency Currency refers to “the age of data given as a difference between the current date and the date when the information was last modified”, providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as “the distance between the input date from the provenance graph to the current date”. According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics: The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
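
Given these components, the currency-of-statements ratio from Table 7 is directly computable; below is a sketch where the timestamps are made-up values standing in for metadata such as dcterms:modified:

```python
from datetime import datetime, timezone

def currency(last_modified, start, now=None):
    """Currency of a statement: 1 - (now - last_modified) / (now - start)."""
    now = now or datetime.now(timezone.utc)
    age = (now - last_modified).total_seconds()
    lifespan = (now - start).total_seconds()
    return 1 - age / lifespan if lifespan > 0 else 0.0

published = datetime(2011, 1, 1, tzinfo=timezone.utc)  # start (first published)
modified = datetime(2012, 11, 1, tzinfo=timezone.utc)  # last modified
print(f"Currency: {currency(modified, published):.2f}")
```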

Example: Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that has not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

20 Identifies the time when the first data have been published in the LOD cloud.
21 Older than the last modification time.

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time20)] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective.


3.2.6.2. Volatility Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics: Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example: Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed relating to a specific data value.

3.2.6.3. Timeliness Timeliness is defined as “the degree to which information is up-to-date” in [5], whereas in [14] the authors define the timeliness criterion as “the currentness of the data provided by a source”.

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics: Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation, and does not include outdated data.
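
One way to operationalise this combination, under the simplifying assumption that volatility is known as a fixed validity interval: data counts as timely while its currency (time since the last update) stays within that interval. A sketch:

```python
from datetime import datetime, timedelta, timezone

def is_timely(last_modified, validity, now=None):
    """Timely while the time since the last update (currency)
    stays within the validity interval (volatility)."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified <= validity

# Flight prices assumed valid for one minute; last update three minutes ago
last_update = datetime.now(timezone.utc) - timedelta(minutes=3)
print(is_timely(last_update, validity=timedelta(minutes=1)))  # False: outdated
```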

Example: A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may not be necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of the three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus provided by most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD, by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of the incoherency of interlinking between datasets, or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3 a data quality metric is aprocedure for measuring an information quality di-mension We notice that most of metrics take theform of a ratio which measures the desired out-comes to the total outcomes [38] For example for therepresentational-consistency dimension the metric fordetermining the re-usage of existing vocabularies takesthe form of a ratio as

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
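To make the ratio form concrete, the vocabulary re-usage metric above can be computed directly over an RDF dump. The following Python sketch (not part of any surveyed approach) uses rdflib; the whitelist of "established" vocabularies is a hypothetical assumption for illustration:

# Sketch: representational-consistency as the ratio of established
# vocabularies used in the dataset to all vocabularies used.
from rdflib import Graph
from rdflib.namespace import split_uri

ESTABLISHED = {  # hypothetical whitelist (assumption)
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def vocabulary_reuse_ratio(path):
    g = Graph()
    g.parse(path)  # serialization format inferred from the file extension
    vocabularies = set()
    for predicate in set(g.predicates()):
        try:
            namespace, _ = split_uri(predicate)  # vocabulary of each property
            vocabularies.add(str(namespace))
        except ValueError:  # skip URIs that cannot be split
            continue
    return len(vocabularies & ESTABLISHED) / len(vocabularies) if vocabularies else 0.0

print(vocabulary_reuse_ratio("dataset.ttl"))  # illustrative file name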

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.
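For instance, the long-URI check can be sketched as a simple predicate over each URI; the length threshold below is an illustrative assumption, not a value prescribed in [27]:

# Sketch: a URI is "kept short" if it stays below a length threshold
# and carries no query parameters.
from urllib.parse import urlparse

def keeps_uri_short(uri, max_length=80):  # threshold is an assumption
    parsed = urlparse(uri)
    return len(uri) <= max_length and not parsed.query

print(keeps_uri_short("http://example.org/resource/Berlin"))            # True
print(keeps_uri_short("http://example.org/page?id=12345&session=abc"))  # False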

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL end-point and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or by the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.
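As an illustration of an objectively measurable metric, property completeness can be computed as the share of instances of a class that state a given property. A minimal rdflib sketch, where the class and property come from a hypothetical example schema:

# Sketch: property completeness = instances with a value for the
# property / all instances of the class.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/schema/")  # hypothetical vocabulary

g = Graph()
g.parse("dataset.ttl")  # illustrative file name

instances = set(g.subjects(RDF.type, EX.Flight))
with_value = {s for s in instances if (s, EX.departureTime, None) in g}

completeness = len(with_value) / len(instances) if instances else 0.0
print(f"property completeness: {completeness:.2f}")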

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8. Consideration of data quality dimensions in each of the included approaches. [Only the row and column labels of the checkmark matrix are recoverable from the extracted text.] Rows (approaches): Bizer et al., 2009; Flemming et al., 2010; Böhm et al., 2010; Chen et al., 2010; Guéret et al., 2011; Hogan et al., 2010; Hogan et al., 2012; Lei et al., 2007; Mendes et al., 2012; Mostafavi et al., 2004; Fürber et al., 2011; Rula et al., 2012; Hartig, 2008; Gamble et al., 2011; Shekarpour et al., 2010; Golbeck et al., 2006; Gil et al., 2002; Golbeck et al., 2003; Gil et al., 2007; Jacobi et al., 2011; Bonatti et al., 2011. Columns (dimensions): Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.


– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22] and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can then be propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
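In its simplest form, such propagation is an aggregation of scores across levels. The toy sketch below, with made-up trust scores rather than values from any surveyed approach, averages triple-level scores up to the graph and dataset levels:

# Toy propagation: triple scores -> graph scores -> dataset score.
triple_scores = {  # hypothetical trust scores per named graph
    "graph-1": [0.9, 0.8, 0.95],
    "graph-2": [0.4, 0.6],
}

graph_scores = {g: sum(s) / len(s) for g, s in triple_scores.items()}
dataset_score = sum(graph_scores.values()) / len(graph_scores)

print(graph_scores)   # per-graph averages
print(dataset_score)  # average over all graphs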

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are semi-automated, some fully automated, and some only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or to handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss

23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9. Qualitative evaluation of frameworks.

Paper | Application (G: General / S: Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al., 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | –
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al., 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al., 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al., 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | – | – | – | – | –


Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, displaying the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value between 0 and 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First,

26 http://ldif.wbsg.de/

KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly intended to assess the data quality of a source, but rather to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica/


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that meaning is given to the values. Reconciliation with Freebase28 helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs

specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, a considerable amount of research on data quality is being carried out, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D., and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., D. K., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and the PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., G., E., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



the search engine can decide to trust only the source with the higher reputation.

Reputation is a social notion of trust [19]. Trust is often represented in a web of trust, where nodes are entities and edges are trust values based on a metric that reflects the reputation one entity assigns to another [17]. Based on the information presented to a user, she forms an opinion or makes a judgement about the reputation of the dataset or the publisher, and about the reliability of the statements.

3.2.2.4. Believability. In [5], believability is explained as "the extent to which information is regarded as true and credible". Believability can also be termed "trustworthiness", as it is the subjective measure of a user's belief that the data is true [29].

Definition 7 (Believability). Believability is defined as the degree to which the information is accepted to be correct, true, real and credible.

Metrics. Believability is measured by checking whether the contributor is contained in a list of trusted providers. In Linked Data, believability can be subjectively measured by analyzing the provenance information of the dataset.

Example. In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies, such as Lufthansa or British Airways, then the user believes the information provided by their websites. She does not need to verify their credibility, since these are well-known international flight companies. On the other hand, if the user retrieves information about an airline previously unknown to her, she can decide whether to believe this information by checking whether the airline is well-known or whether it is contained in a list of trusted providers. Moreover, she will need to check the source website from which this information was obtained.

This dimension involves the decision of which information to believe. Users can make these decisions based on factors such as the source, their prior knowledge about the subject, the reputation of the source and their prior experience [29]. Either all of the information that is provided can be trusted if it is well-known, or the decision about its credibility and believability can be made only by looking at the data source. Another method, proposed by Tim Berners-Lee, is that Web browsers should be enhanced with an "Oh, yeah?" button to support the user in assessing the reliability of data encountered on the web10. Pressing such a

10 http://www.w3.org/DesignIssues/UI.html

button for any piece of data or an entire dataset would contribute towards assessing the believability of the dataset.

3.2.2.5. Licensing. "In order to enable information consumers to use the data under clear legal terms, each RDF document should contain a license under which the content can be (re-)used" [27,14]. Additionally, the existence of a machine-readable indication (by including the specifications in a VoID description) as well as a human-readable indication of a license is also important. Although in both [27] and [14] the authors do not provide a formal definition, they agree on the use and importance of licensing in terms of data quality.

Definition 8 (Licensing). Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Metrics. Licensing can be checked by the indication of machine- and human-readable information associated with the dataset, clearly indicating the permissions of data re-use.
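The machine-readable part of this check can be approximated by querying the dataset's VoID description for a dcterms:license statement. A minimal sketch with rdflib; the file name is a placeholder:

# Sketch: does the VoID/metadata graph state a machine-readable license?
from rdflib import Graph
from rdflib.namespace import DCTERMS

g = Graph()
g.parse("void.ttl")  # placeholder for the dataset's VoID description

licenses = list(g.objects(None, DCTERMS.license))
print("machine-readable license present:", bool(licenses))
for lic in licenses:
    print("  license:", lic)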

Example. Since our flight search engine aggregates data from several data sources, a clear indication of the license allows the search engine to re-use the data from the airlines' websites. For example, the LinkedGeoData dataset is licensed under the Open Database License11, which allows others to copy, distribute and use the data, as well as to produce work from the data, allowing modifications and transformations. Due to the presence of this specific license, the flight search engine is able to re-use this dataset to pull geo-spatial information and feed it to the search engine.

Linked Data aims to provide users the capability to aggregate data from several sources; therefore, the indication of an explicit license or waiver statement is necessary for each data source. A dataset can choose a license depending on which permissions it wants to issue. Possible permissions include the reproduction of data, the distribution of data, and the modification and redistribution of data [43]. Providing licensing information increases the usability of the dataset, as consumers or third parties are thus made aware of the legal rights and permissiveness under which the pertinent data are made available. The more permissions a source grants, the more possibilities a consumer has while (re-)using the data. Additional triples should be added to a dataset, clearly indicating the type of license or waiver, or license details should be mentioned in the VoID12 file.

11 http://opendatacommons.org/licenses/odbl/
12 http://vocab.deri.ie/void


Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks the information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated to the provenance of a dataset. Moreover, if a dataset has a high reputation, it implies high believability and vice versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions
Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness, with their respective metrics. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". We also mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy are identified: syntactic and semantic. We associate the accuracy dimension mainly with semantic accuracy, and syntactic accuracy with the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly with semantic accuracy, which relates to the correctness of a value with respect to the actual real world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, by the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by the assumption of a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.
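As one concrete instance of outlier detection, a robust distance-based test over the numeric values of a property can be sketched as follows; the values and the cut-off of 3.5 are illustrative assumptions, not taken from the surveyed approaches:

# Sketch: flag values whose distance from the median, scaled by the
# median absolute deviation (MAD), exceeds a cut-off.
import statistics

values = [451.0, 449.5, 452.3, 450.8, 4501.0]  # made-up ticket prices

med = statistics.median(values)
mad = statistics.median(abs(v - med) for v in values)

outliers = [v for v in values if mad and abs(v - med) / mad > 3.5]
print(outliers)  # [4501.0] -- the suspicious value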

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy in terms of labelling as well as classification can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured directly, but only indirectly, by checking the authenticity of the source responsible for the information, and whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others, when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preferences or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information. However, objectivity


Dimension | Metric | Description | Type
Accuracy | detection of poor attributes, i.e. those that do not contain useful values for data entries | using association rules (using the Apriori Algorithm [1]), inverse relations or foreign key relationships [38] | O
Accuracy | detection of outliers | statistical methods such as distance-based, deviations-based and distribution-based methods [5] | O
Accuracy | detection of semantically incorrect values | checking the data source by integrating queries for the identification of functional dependency violations [15,7] | O
Objectivity | objectivity of the information | checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] | S
Objectivity | objectivity of the source | checking whether independent sources confirm a fact | S
Objectivity | no biased data provided by the publisher | checking whether the dataset is neutral or the publisher has a personal influence on the data provided | S
Validity-of-documents | no syntax errors | detecting syntax errors using validators [14] | O
Validity-of-documents | invalid usage of undefined classes and properties | detection of classes and properties which are used without any formal definition [14] | O
Validity-of-documents | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] | O
Validity-of-documents | invalid usage of vocabularies | detection of the improper usage of vocabularies [14] | O
Validity-of-documents | malformed datatype literals | detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] | O
Validity-of-documents | erroneous13 annotation / representation | 1 − (erroneous instances / total no. of instances) [38] | O
Validity-of-documents | inaccurate annotation, labelling, classification | (1 − inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) | O
Interlinking | interlinking degree, clustering coefficient, centrality, sameAs chains and description richness through sameAs | by using network measures [22] | O
Interlinking | existence of links to external data providers | detection of the existence and usage of external URIs and owl:sameAs links [27] | S
Consistency | entities as members of disjoint classes | no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] | O
Consistency | valid usage of inverse-functional properties | detection of inverse-functional properties that do not describe entities stating empty values [14] | S
Consistency | no redefinition of existing properties | detection of existing vocabulary being redefined [14] | S
Consistency | usage of homogeneous datatypes | no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] | O
Consistency | no stating of inconsistent property values for entities | no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] | O
Consistency | ambiguous annotation | detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] | S
Consistency | invalid usage of undefined classes and properties | detection of classes and properties used without any formal definition [27] | O
Consistency | misplaced classes or properties | detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] | O
Consistency | misuse of owl:DatatypeProperty or owl:ObjectProperty | detection of attribute properties used between two resources and of relation properties used with literal values [27] | O
Consistency | use of members of deprecated classes or properties | detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] | O
Consistency | bogus owl:InverseFunctionalProperty values | detecting uniqueness and validity of inverse-functional values [27] | O
Consistency | literals incompatible with datatype range | detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] | O
Consistency | ontology hijacking | detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] | O
Consistency | misuse of predicates14 | profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] | O
Consistency | negative dependencies/correlation among predicates | using association rules [7] | O
Conciseness | does not contain redundant attributes/properties (intensional conciseness) | number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] | O
Conciseness | does not contain redundant objects/instances (extensional conciseness) | number of unique objects in relation to the overall number of object representations in the dataset [42] | O
Conciseness | no unique values for functional properties | assessed by integration of uniqueness rules [15] | O
Conciseness | no unique annotations | check if several annotations refer to the same object [38] | O

Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).


can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3. Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
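Programmatically, a basic variant of this check can be approximated by attempting to parse the document with an RDF library and reporting any parser error. A minimal sketch, assuming rdflib and an RDF/XML input (the file name is a placeholder):

# Sketch: syntactic validity as "does the document parse as RDF/XML?"
from rdflib import Graph

def is_syntactically_valid(path):
    g = Graph()
    try:
        g.parse(path, format="xml")  # RDF/XML parser
        return True, f"{len(g)} triples parsed"
    except Exception as err:  # the parser raises on malformed documents
        return False, str(err)

print(is_syntactically_valid("document.rdf"))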

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user receives no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes the query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process the data as intended. Syntax errors, typos and the use of deprecated classes and properties all add to the problem of the invalidity of a document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4. Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks; (b) Terminological Reasoning; (c) Data Transformation; (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
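A hedged sketch of applying such network measures (not the implementation of [22]): build an undirected graph over the owl:sameAs links of a dataset and compute degree and clustering statistics with networkx; the file name is a placeholder:

# Sketch: network measures over the owl:sameAs interlinking structure.
import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

rdf = Graph()
rdf.parse("dataset.ttl")

g = nx.Graph()
for s, o in rdf.subject_objects(OWL.sameAs):  # one edge per sameAs link
    g.add_edge(str(s), str(o))

nodes = max(g.number_of_nodes(), 1)
print("average interlinking degree:", sum(d for _, d in g.degree()) / nodes)
print("clustering coefficient:", nx.average_clustering(g))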

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to the interlinking between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links, but also the maintenance of these interlinks, is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data. [Figure not reproducible from the extracted text; its labels include RDF and its serializations RDF/XML, N3/NT and Turtle, RDFa, RDF-S, OWL with the OWL2 profiles (DL, Lite, Full, QL, RL, EL), SPARQL (protocol and JSON query results), SWRL, DAML and domain-specific rules.]

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which might then render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent with respect to the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF [32] standards, and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
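For instance, the disjoint-classes metric from Table 4 can be sketched by materializing an OWL-RL closure with the owlrl library and then intersecting the memberships of classes declared disjoint; this is an illustrative check, not a full consistency proof:

# Sketch: find entities typed with two classes declared owl:disjointWith.
import owlrl
from rdflib import Graph
from rdflib.namespace import OWL, RDF

g = Graph()
g.parse("dataset.ttl")  # placeholder file name
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)  # materialize inferences

violations = []
for c1, c2 in g.subject_objects(OWL.disjointWith):
    members = set(g.subjects(RDF.type, c1)) & set(g.subjects(RDF.type, c2))
    violations.extend((e, c1, c2) for e in members)

print(f"{len(violations)} disjoint-class violations found")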

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight | From | To | Arrival | Departure
A123 | Paris | New York | 14:50 | 22:35
A123 | Paris | Singapore | 14:50 | 22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the

ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the absence of redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to the absence of redundant attributes, and (ii) extensional conciseness (data level), which refers to the absence of redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) compared to the overall number of attributes or objects, respectively, present in a dataset.
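Extensional conciseness, for example, can be approximated by comparing full entity descriptions: two identifiers with identical property-value sets count as one unique object. A toy rdflib sketch under that simplifying assumption:

# Sketch: extensional conciseness = unique object descriptions /
# all object representations in the dataset.
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl")  # placeholder file name

descriptions = {
    s: frozenset(g.predicate_objects(s)) for s in set(g.subjects())
}

total = len(descriptions)
unique = len(set(descriptions.values()))
print(f"extensional conciseness: {unique / total:.2f}" if total else "empty dataset")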

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to


represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this reduces the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. the height of a building vs. a product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4 Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions part of this group, namely availability, performance, security and response-time, as displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1 Availability

Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
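As an illustration of how such checks can be operationalized, the following sketch (assumptions: the Python requests library, a ten-second timeout, and example endpoint and resource URLs) probes a SPARQL endpoint with a trivial ASK query and dereferences a resource URI:

import requests

def sparql_endpoint_available(endpoint: str) -> bool:
    """Check whether the server responds to a trivial SPARQL ASK query."""
    try:
        r = requests.get(endpoint,
                         params={"query": "ASK { ?s ?p ?o }"},
                         headers={"Accept": "application/sparql-results+json"},
                         timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

def uri_dereferencable(uri: str) -> bool:
    """Check that dereferencing a resource URI does not yield a
    4xx client error or 5xx server error."""
    try:
        r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                         timeout=10)
        return r.status_code < 400
    except requests.RequestException:
        return False

print(sparql_endpoint_available("http://dbpedia.org/sparql"))    # example endpoint
print(uri_dereferencable("http://dbpedia.org/resource/Boston"))  # example resource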

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.

3.2.4.2 Performance

Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27] performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. We can notice that neither of the aforementioned references provides a formal definition of performance: the former gives only a general description without explaining what is meant by performance, while the latter describes issues related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features, or of slash-URIs, can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
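The latency, throughput and scalability metrics of Table 5 can be approximated with simple timed HTTP requests, as in the following sketch (sequential requests are an assumption of this sketch; a production harness would issue them concurrently):

import time
import requests

def latency(url: str) -> float:
    """Seconds until an HTTP request is answered; an average above
    one second is flagged by Flemming's latency metric [14]."""
    start = time.perf_counter()
    requests.get(url, timeout=30)
    return time.perf_counter() - start

def throughput(url: str, n: int = 10) -> float:
    """Number of answered HTTP requests per second [14]."""
    start = time.perf_counter()
    answered = sum(1 for _ in range(n) if requests.get(url, timeout=30).ok)
    return answered / (time.perf_counter() - start)

def scalable(url: str) -> bool:
    """Table 5 metric: the time to answer ten requests, divided by ten,
    should not exceed the time to answer a single request."""
    single = latency(url)
    batch = sum(latency(url) for _ in range(10)) / 10
    return batch <= single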

Example. In our use case, the target performance may depend on the number of users, i.e., it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or of no redirection using the status code 303 See Other or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data is provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer an amount of ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e., RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability and explorational fluidity as well as accessibility.

3.2.4.3 Security

Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.
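A first, very coarse check for this dimension — whether the access method uses a secure transport at all, and whether access is restricted to certain users — can be scripted as follows (a sketch only; deeper checks such as verifying digital signatures are out of scope here):

from urllib.parse import urlparse
import requests

def uses_secure_transport(url: str) -> bool:
    """True if the endpoint is served over HTTPS (SSL/TLS)."""
    return urlparse(url).scheme == "https"

def requires_credentials(url: str) -> bool:
    """Heuristic for proprietary data: the server answers with 401/403,
    i.e., access is allowed only to certain users."""
    try:
        return requests.get(url, timeout=10).status_code in (401, 403)
    except requests.RequestException:
        return False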


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4 Response-time

Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.
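Following Definition 18, the delay should be measured end-to-end, i.e., until the complete response body has been received; the sketch below does so for a SPARQL query (the endpoint and query are placeholders):

import time
import requests

def response_time(endpoint: str, query: str) -> float:
    """Seconds between submitting a query and receiving the
    complete response (Definition 18)."""
    start = time.perf_counter()
    r = requests.get(endpoint, params={"query": query},
                     headers={"Accept": "application/sparql-results+json"},
                     stream=True, timeout=60)
    _ = r.content  # consume the complete body, not just the first bytes
    return time.perf_counter() - start

print(response_time("http://dbpedia.org/sparql",
                    "SELECT * WHERE { ?s ?p ?o } LIMIT 100"))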

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available and has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5 Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1 Representational-conciseness

Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.
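In addition to the qualitative check, the "keeping URIs short" metric of Table 6 can be automated; the sketch below flags long URIs and URIs carrying query parameters (the 80-character threshold is our assumption, not taken from [27]):

from urllib.parse import urlparse

def uri_is_concise(uri: str, max_len: int = 80) -> bool:
    """False for overly long URIs or URIs containing query parameters [27]."""
    return len(uri) <= max_len and not urlparse(uri).query

print(uri_is_concise("http://airlines.org/A123"))                    # True
print(uri_is_concise("http://airlines.org/search?from=MIL&to=BOS"))  # False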

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats; identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

3.2.5.2 Representational-consistency

Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats will then not be a problem for our search engine, since strong links are created between the datasets.

Re-using well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF^15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb^16, SchemaCache^17 and Swoogle^18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies^19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info
17 http://schemacache.com
18 http://swoogle.umbc.edu
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can nevertheless be applied to other domains.

3.2.5.3 Understandability

Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
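The labelling metric of Table 6 — the share of entities carrying an rdfs:label or rdfs:comment — can be computed directly over an RDF graph, as in this rdflib-based sketch (the dump file name is hypothetical):

from rdflib import Graph
from rdflib.namespace import RDFS

def labelled_entity_ratio(g: Graph) -> float:
    """No. of entities with an rdfs:label or rdfs:comment / total no.
    of entities described in the data (Table 6, [14])."""
    entities = set(g.subjects())
    labelled = set(g.subjects(RDFS.label, None)) | \
               set(g.subjects(RDFS.comment, None))
    return len(labelled & entities) / len(entities) if entities else 1.0

g = Graph()
g.parse("flights.ttl", format="turtle")
print("labelled entities:", labelled_entity_ratio(g))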

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve search performance in that case. For instance, when a user looks for a flight to US (the label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability

Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.
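Among the metrics of Table 6, the blank-node metric is straightforward to automate; the sketch below reports the share of subject and object positions occupied by blank nodes, whose use can lead to misinterpretation of missing values [27]:

from rdflib import Graph, BNode

def blank_node_ratio(g: Graph) -> float:
    """Share of subject/object positions filled by blank nodes [27]."""
    positions = [n for s, p, o in g for n in (s, o)]
    blanks = sum(1 for n in positions if isinstance(n, BNode))
    return blanks / len(positions) if positions else 0.0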

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

- http://rdf.freebase.com/ns/m.049jnng
- http://rdf.freebase.com/ns/m.043j22x
- Boston Logan Airport

For the first two items, no human-readable label is available; therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. The use of existing well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility

In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and in different languages, as well as via different access methods.
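The content-negotiation metric of Table 6 can be verified by requesting the same resource under different Accept headers, as in this sketch (the resource URI is an example; the check is a heuristic based on the returned Content-Type):

import requests

def supports_format(uri: str, mime: str) -> bool:
    """True if the server honours the Accept header for the given
    serialization format [14]."""
    try:
        r = requests.get(uri, headers={"Accept": mime}, timeout=10)
        return r.ok and mime in r.headers.get("Content-Type", "")
    except requests.RequestException:
        return False

for mime in ("application/rdf+xml", "text/turtle", "text/html"):
    print(mime, supports_format("http://dbpedia.org/resource/Boston", mime))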

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is capable of understanding it.

Provision of Linked Data in different languages, using language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format, in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding accept header to an HTTP request.

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1 Currency

Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", covering both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
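The "currency of statements" metric of Table 7 [50] combines these components into a normalized score; a direct transcription of that formula (with start time denoting when the data first appeared in the LOD cloud) could look as follows:

from datetime import datetime, timezone
from typing import Optional

def currency_of_statement(last_modified: datetime,
                          start_time: datetime,
                          now: Optional[datetime] = None) -> float:
    """1 - [(current time - last modified time) / (current time - start time)],
    per Table 7 [50]. All datetimes are assumed timezone-aware (UTC)."""
    now = now or datetime.now(timezone.utc)
    lifetime = (now - start_time).total_seconds()
    if lifetime <= 0:
        return 0.0
    return 1 - (now - last_modified).total_seconds() / lifetime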

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type - Subjective or Objective.

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time^20)] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)^21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

20 The time when the data was first published in the LOD cloud.
21 Older than the last modification time.

3.2.6.2 Volatility

Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured using two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.
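A direct transcription of this measure (time expressed in seconds; timestamps assumed to be available as timezone-aware datetimes) is:

from datetime import datetime

def volatility(expiry_time: datetime, input_time: datetime) -> float:
    """Length of time the data remains valid: the distance between the
    expiry time and the input time, in seconds."""
    return (expiry_time - input_time).total_seconds()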

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed for a specific data value.

3.2.6.3 Timeliness

Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and excludes outdated data.
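Using the two components above, a minimal check of whether a value is still timely could look as follows (measuring the age of the data against its volatility window is one plausible reading of "combining currency and volatility", not a formula prescribed by the surveyed approaches):

from datetime import datetime, timezone
from typing import Optional

def is_timely(last_modified: datetime, volatility_seconds: float,
              now: Optional[datetime] = None) -> bool:
    """Data is timely while its age (currency component) has not yet
    exceeded the period during which it remains valid (volatility
    component). Datetimes are assumed timezone-aware (UTC)."""
    now = now or datetime.now(timezone.utc)
    age = (now - last_modified).total_seconds()
    return age <= volatility_seconds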

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information the user receives about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may not be necessary or ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions

Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1 Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions defines data quality as a whole, or on the exact meaning of each dimension, which is a problem also occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets, or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each of them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.

We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2 Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
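As an illustration of the ratio form, the vocabulary re-usage metric above can be sketched as follows (the whitelist of "established" vocabularies is our assumption; in practice it could be populated from a service such as LODStats [13]):

from rdflib import Graph

ESTABLISHED = {  # assumed whitelist of well-known vocabularies
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
}

def namespace_of(uri: str) -> str:
    """Crude namespace extraction: cut after the last '#' or '/'."""
    for sep in ("#", "/"):
        if sep in uri:
            return uri.rsplit(sep, 1)[0] + sep
    return uri

def vocabulary_reuse_ratio(g: Graph) -> float:
    """No. of established vocabularies used in the dataset / total no.
    of vocabularies used in the dataset."""
    vocabs = {namespace_of(str(p)) for p in g.predicates()}
    return len(vocabs & ESTABLISHED) / len(vocabs) if vocabs else 1.0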

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring each dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types, or the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3 Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8. Consideration of data quality dimensions in each of the included approaches.

Approaches (rows): Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011.

Dimensions (columns): Provenance; Consistency; Currency; Volatility; Timeliness; Accuracy; Completeness; Amount-of-data; Availability; Understandability; Relevancy; Reputation; Verifiability; Interpretability; Rep-conciseness; Rep-consistency; Licensing; Performance; Objectivity; Believability; Response-time; Security; Versatility; Validity-of-documents; Conciseness; Interlinking.


- RDF triples, which focus on individual triple assessment;
- RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
- Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.

4.4 Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi-automated or fully automated, and some are only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure and specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool^22 [14] requires the user to answer a few questions regarding the dataset (e.g., existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator^23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF^24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator
24 http://trdf.sourceforge.net

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5 Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1 Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface^25 where a user first specifies the name, URI and three entities for a particular data source. The user then receives a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields, one showing the assigned weight and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php

Table 9. Qualitative evaluation of frameworks.

Paper | Application (General/Specific) | Goal | Type of data (Triple/Graph/Dataset) | Degree of automation | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | Triple, Graph | Semi-automated | Yes
Golbeck et al. 2003 | G | Trust networks on the semantic web | Graph | Semi-automated | Yes
Mostafavi et al. 2004 | S | Spatial data integration | Triple, Graph | - | No
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | Triple, Graph, Dataset | - | No
Gil et al. 2007 | S | Trust assessment of web resources | Graph | - | No
Lei et al. 2007 | S | Assessment of semantic metadata | Triple, Graph | - | No
Hartig 2008 | G | Trustworthiness of Data on the Web | Triple | Semi-automated | Yes
Bizer et al. 2009 | G | Information filtering | Triple, Graph, Dataset | Manual | Yes
Böhm et al. 2010 | G | Data integration | Triple, Graph | Semi-automated | Yes
Chen et al. 2010 | G | Generating semantically valid hypothesis | Triple, Graph | - | No
Flemming et al. 2010 | G | Assessment of published data | Triple, Graph | Semi-automated | Yes
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | Triple, Graph, Dataset | Semi-automated | Yes
Shekarpour et al. 2010 | G | Method for evaluating trust | Graph | - | No
Fürber et al. 2011 | G | Assessment of published data | Triple, Graph | - | No
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | Graph, Dataset | - | No
Jacobi et al. 2011 | G | Trust assessment of web resources | Graph | - | No
Bonatti et al. 2011 | G | Provenance assessment for reasoning | Triple, Graph | - | No
Guéret et al. 2012 | S | Assessment of quality of links | Graph, Dataset | Automated | Yes
Hogan et al. 2012 | G | Assessment of published data | Triple, Graph, Dataset | - | No
Mendes et al. 2012 | S | Data integration | Triple | Manual | Yes
Rula et al. 2012 | G | Assessment of time related quality dimensions | Triple, Graph | - | No

In the end the user is presented with a score rangingfrom 0 to 100 representing the final data quality scoreAdditionally the rating of each dimension and the totalweight (based on 11 dimensions) is calculated usingthe user input from the previous steps Figure 4 showsan excerpt of the tool showing the result of assessingthe quality of LinkedCT with a score of 15 out of 100

On the one hand, the tool is easy to use, with form-based questions and adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2 Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)^26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from those data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random, PickMostFrequent.

26 http://ldif.wbsg.de

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify, in the DataSource folder, the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered as an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3 LODGRefine

LODGRefine^27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that the values are given meaning. Reconciliation against Freebase^28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible^29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

27 http://code.zemanta.com/sparkica
28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, and it is simple to upload raw data and perform basic cleansing steps on it. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions, along with their definitions and corresponding metrics. We also identified tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, and others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, a substantial amount of research on data quality is being carried out, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] BECKETT, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/.
[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] BRICKLEY, D., AND GUHA, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/.
[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).
[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] HAYES, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/.
[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).
[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] KIFER, M., AND BOLEY, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/.
[33] KITCHENHAM, B. Procedures for performing systematic reviews.
[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).
[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).
[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



Relations between dimensions. Verifiability is related to the believability dimension, but differs from it because, even though verification can find whether information is correct or incorrect, belief is the degree to which a user thinks that information is correct. The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated to the provenance of a dataset. Moreover, if a dataset has a high reputation, this implies high believability, and vice-versa. Licensing is also part of the provenance of a dataset and contributes towards its believability. Believability, on the other hand, can be seen as expected accuracy. Moreover, the verifiability, believability and reputation dimensions are also included in the contextual dimensions group (as shown in Figure 2), because they highly depend on the context of the task at hand.

3.2.3. Intrinsic dimensions

Intrinsic dimensions are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Table 4 lists the six dimensions with their respective metrics which are part of this group, namely accuracy, objectivity, validity-of-documents, interlinking, consistency and conciseness. The reference for each metric is provided in the table.

3.2.3.1. Accuracy. In [5], accuracy is defined as the "degree of correctness and precision with which information in an information system represents states of the real world". Also, we mapped the problems of inaccurate annotation, such as inaccurate labelling and inaccurate classification, mentioned in [38], to the accuracy dimension. In [15], two types of accuracy were identified: syntactic and semantic. We associate the accuracy dimension mainly to semantic accuracy, and the syntactic accuracy to the validity-of-documents dimension.

Definition 9 (Accuracy). Accuracy can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly to semantic accuracy, which relates to the correctness of a value with respect to the actual real-world value, that is, accuracy of the meaning.

13 Not being what it purports to be; false or fake.
14 Predicates are often misused when no applicable predicate exists.

Metrics. Accuracy can be measured by checking the correctness of the data in a data source, that is, by the detection of outliers or the identification of semantically incorrect values through the violation of functional dependency rules. Accuracy is one of the dimensions which is affected by assuming a closed or open world. When assuming an open world, it is more challenging to assess accuracy, since more logical constraints need to be specified for inferring logical contradictions.

Example. In our use case, suppose a user is looking for flights between Paris and New York. Instead of returning flights starting from Paris, France, the search returns flights between Paris in Texas and New York. This kind of semantic inaccuracy, in terms of labelling as well as classification, can lead to erroneous results.

A possible method for checking accuracy is an alignment with high quality datasets in the domain (reference datasets), if available. Yet another method is manually checking accuracy against several sources, where a single fact is checked individually in different datasets to determine its accuracy [36].
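
As a minimal illustration of the distribution-based outlier detection mentioned above, the following Python sketch (our own, not taken from any of the surveyed approaches) flags numeric property values whose z-score exceeds a threshold; the sample values and the threshold are assumptions.

from statistics import mean, stdev

def find_outliers(values, threshold=3.0):
    """Flag values whose z-score exceeds `threshold` as potential errors."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# e.g. elevations (in metres) extracted for some property; the 84000.0
# entry, presumably a unit error, would be flagged
print(find_outliers([120.0, 84.5, 110.2, 84000.0, 95.3]))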

3.2.3.2. Objectivity. In [5], objectivity is expressed as "the extent to which information is unbiased, unprejudiced and impartial".

Definition 10 (Objectivity). Objectivity is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Metrics. Objectivity cannot be measured qualitatively, but only indirectly, by checking the authenticity of the source responsible for the information, that is, whether the dataset is neutral or the publisher has a personal influence on the data provided. Additionally, it can be measured by checking whether independent sources can confirm a single fact.

Example. In our use case, consider the reviews available for each airline regarding safety, comfort and prices. It may happen that an airline belonging to a particular alliance is ranked higher than others when in reality it is not so. This could be an indication of a bias, where the review is falsified due to the provider's preference or intentions. This kind of bias or partiality affects the user, as she might be provided with incorrect information from expensive flights or from malicious websites.

One of the possible ways to detect biased information is to compare the information with other datasets providing the same information.


Accuracy:
– detection of poor attributes, i.e. those that do not contain useful values for data entries: using association rules (with the Apriori algorithm [1]) or inverse relations or foreign key relationships [38] (O)
– detection of outliers: statistical methods such as distance-based, deviations-based and distribution-based methods [5] (O)
– detection of semantically incorrect values: checking the data source by integrating queries for the identification of functional dependencies violations [15,7] (O)

Objectivity:
– objectivity of the information: checking for bias or opinion expressed when a data provider interprets or analyzes facts [5] (S)
– objectivity of the source: checking whether independent sources confirm a fact (S)
– no biased data provided by the publisher: checking whether the dataset is neutral or the publisher has a personal influence on the data provided (S)

Validity-of-documents:
– no syntax errors: detecting syntax errors using validators [14] (O)
– invalid usage of undefined classes and properties: detection of classes and properties which are used without any formal definition [14] (O)
– use of members of deprecated classes or properties: detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [14] (O)
– invalid usage of vocabularies: detection of the improper usage of vocabularies [14] (O)
– malformed datatype literals: detection of ill-typed literals which do not abide by the lexical syntax for their respective datatype [14] (O)
– erroneous13 annotation/representation: 1 - (erroneous instances / total no. of instances) [38] (O)
– inaccurate annotation, labelling, classification: (1 - inaccurate instances / total no. of instances) * (balanced distance metric [40] / total no. of instances) [38] (the balanced distance metric is an algorithm that calculates the distance between the extracted (or learned) concept and the target concept) (O)

Interlinking:
– interlinking degree, clustering coefficient, centrality, sameAs chains and description richness through sameAs: by using network measures [22] (O)
– existence of links to external data providers: detection of the existence and usage of external URIs and owl:sameAs links [27] (S)

Consistency:
– entities as members of disjoint classes: no. of entities described as members of disjoint classes / total no. of entities described in the dataset [14] (O)
– valid usage of inverse-functional properties: detection of inverse-functional properties that do not describe entities stating empty values [14] (S)
– no redefinition of existing properties: detection of existing vocabulary being redefined [14] (S)
– usage of homogeneous datatypes: no. of properties used with homogeneous units in the dataset / total no. of properties used in the dataset [14] (O)
– no stating of inconsistent property values for entities: no. of entities described inconsistently in the dataset / total no. of entities described in the dataset [14] (O)
– ambiguous annotation: detection of an instance mapped back to more than one real world object, leading to more than one interpretation [38] (S)
– invalid usage of undefined classes and properties: detection of classes and properties used without any formal definition [27] (O)
– misplaced classes or properties: detection of a URI defined as a class being used as a property, or a URI defined as a property being used as a class [27] (O)
– misuse of owl:DatatypeProperty or owl:ObjectProperty: detection of attribute properties used between two resources and relation properties used with literal values [27] (O)
– use of members of deprecated classes or properties: detection of use of the OWL classes owl:DeprecatedClass and owl:DeprecatedProperty [27] (O)
– bogus owl:InverseFunctionalProperty values: detecting uniqueness and validity of inverse-functional values [27] (O)
– literals incompatible with datatype range: detection of a datatype clash that can occur if the property is given a value (i) that is malformed or (ii) that is a member of an incompatible datatype [26] (O)
– ontology hijacking: detection of the redefinition by third parties of external classes/properties such that reasoning over data using those external terms is affected [26] (O)
– misuse of predicates14: profiling statistics support the detection of such discordant values or misused predicates and facilitate finding valid formats for specific predicates [7] (O)
– negative dependencies/correlation among predicates: using association rules [7] (O)

Conciseness:
– does not contain redundant attributes/properties (intensional conciseness): number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [42] (O)
– does not contain redundant objects/instances (extensional conciseness): number of unique objects in relation to the overall number of object representations in the dataset [42] (O)
– no unique values for functional properties: assessed by integration of uniqueness rules [15] (O)
– no unique annotations: check if several annotations refer to the same object [38] (O)

Table 4. Comprehensive list of data quality metrics of the intrinsic dimensions, how they can be measured and their type – Subjective (S) or Objective (O).


However, objectivity can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring the bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.
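
The fact-confirmation check can be sketched as follows; the endpoint URLs and the example triple are placeholders, and this SPARQLWrapper-based code is our own illustration rather than part of any surveyed approach.

from SPARQLWrapper import SPARQLWrapper, JSON

# assumed endpoints; any list of independent sources would do
ENDPOINTS = ["https://dbpedia.org/sparql",
             "https://example.org/airline/sparql"]

ASK = """ASK { <http://dbpedia.org/resource/Boston>
               <http://dbpedia.org/ontology/country>
               <http://dbpedia.org/resource/United_States> }"""

def confirmations(ask_query, endpoints):
    """Count how many independent sources confirm the fact."""
    count = 0
    for url in endpoints:
        sparql = SPARQLWrapper(url)
        sparql.setQuery(ask_query)
        sparql.setReturnFormat(JSON)
        try:
            if sparql.query().convert().get("boolean"):
                count += 1
        except Exception:
            pass  # an unreachable endpoint simply does not confirm
    return count

print(confirmations(ASK, ENDPOINTS), "of", len(ENDPOINTS), "sources confirm")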

3.2.3.3. Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
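
A minimal sketch of such a syntax check, using the rdflib library to attempt parsing a document; the file name is a placeholder:

from rdflib import Graph

def is_syntactically_valid(path, fmt="xml"):
    """Return True if the document parses as RDF/XML, False otherwise."""
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception as e:  # rdflib raises parser-specific errors
        print("invalid:", e)
        return False

print(is_syntactically_valid("airline-dataset.rdf"))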

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason for this is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and the use of deprecated classes and properties all add to the problem of the invalidity of the document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4. Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) frameworks, (b) terminological reasoning, (c) data transformation, (d) query processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, clustering coefficient, sameAs chains, centrality and description richness through sameAs links.
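
A rough sketch of how two of these network measures could be computed with networkx over the owl:sameAs graph, loosely following the idea of [22]; the file name is a placeholder:

import networkx as nx
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph().parse("airline-dataset.nt", format="nt")

# build an undirected graph whose edges are owl:sameAs statements
net = nx.Graph()
for s, _, o in g.triples((None, OWL.sameAs, None)):
    net.add_edge(str(s), str(o))

if net.number_of_nodes() > 0:
    degrees = dict(net.degree())
    print("average interlinking degree:",
          sum(degrees.values()) / net.number_of_nodes())
    print("average clustering coefficient:", nx.average_clustering(net))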

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since the names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to links between different datasets, but also to internal links within the dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5. Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


[Figure: RDF, RDF-S, RDFa, RDF/XML, N3, NT, Turtle, OWL, OWL2 (EL, QL, RL, DL, Full), DAML, SWRL, domain-specific rules, LD Resource Description, SPARQL (protocol, query results, JSON)]

Fig. 3. Different components related to the RDF representation of data.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations, such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF [32] standards, and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
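
The disjoint-classes metric of Table 4 can be sketched as follows; this naive illustration considers only explicitly asserted types and disjointness axioms, without a reasoner, and the file name is a placeholder:

from rdflib import Graph, RDF
from rdflib.namespace import OWL

g = Graph().parse("airline-dataset.ttl", format="turtle")

# collect explicitly declared disjoint class pairs
disjoint = [(c1, c2) for c1, _, c2 in g.triples((None, OWL.disjointWith, None))]

entities = set(g.subjects(RDF.type, None))
violating = {s for s in entities
             for c1, c2 in disjoint
             if (s, RDF.type, c1) in g and (s, RDF.type, c2) in g}

if entities:
    print("disjoint-class violation ratio:", len(violating) / len(entities))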

Example. Let us assume a user looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight  From   To         Arrival  Departure
A123    Paris  New York   14:50    22:35
B123    Paris  Singapore  14:50    22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition stating that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in the data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' given in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances), compared to the overall number of attributes or objects, respectively, present in a dataset.
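
A sketch of the extensional conciseness ratio of [42], under the simplifying assumption that two instances are equivalent when their property-value descriptions coincide; the file name is a placeholder:

from rdflib import Graph

g = Graph().parse("fused-flights.ttl", format="turtle")

def fingerprint(graph, subject):
    """A frozen set of (predicate, object) pairs describing one instance."""
    return frozenset(graph.predicate_objects(subject))

subjects = set(g.subjects())
unique = {fingerprint(g, s) for s in subjects}

if subjects:
    print("extensional conciseness:", len(unique) / len(subjects))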

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this lowers the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility dimensions

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, namely availability, performance, security and response-time, as displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information, rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. The availability of a dataset can be measured in terms of the accessibility of the server, the SPARQL endpoints or the RDF dumps, and also by the dereferencability of the URIs.
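
A minimal sketch of the endpoint-accessibility and dereferencability checks from Table 5; the URLs are placeholders:

import requests

def endpoint_available(endpoint_url):
    """Send a trivial SPARQL query and check for an HTTP 200 response."""
    r = requests.get(endpoint_url,
                     params={"query": "ASK {?s ?p ?o}"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def dereferencable(uri):
    """A URI counts as dereferencable if it does not yield a 4xx/5xx code."""
    r = requests.head(uri, allow_redirects=True, timeout=10)
    return r.status_code < 400

print(endpoint_available("https://dbpedia.org/sparql"))
print(dereferencable("http://dbpedia.org/resource/Boston"))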

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is the 404 Not Found code, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

The execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and should thus be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we can notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by it. In the latter reference, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, the detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
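
A sketch of the latency and ten-request scalability heuristics of Table 5, against a placeholder endpoint; sequential timing is a simplification:

import time
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = {"query": "SELECT * WHERE {?s ?p ?o} LIMIT 1"}

def latency(n=1):
    """Average seconds per request over n sequential requests."""
    start = time.perf_counter()
    for _ in range(n):
        requests.get(ENDPOINT, params=QUERY, timeout=30)
    return (time.perf_counter() - start) / n

single, batch = latency(1), latency(10)
print(f"latency: {single:.2f}s; considered too high if > 1s [14]")
print(f"scalable: {batch <= single * 1.1}")  # the ten-requests heuristic,
                                             # with a small tolerance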

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance needs requires caching mechanisms.


Availability:
– accessibility of the server: checking whether the server responds to a SPARQL query [14,26] (O)
– accessibility of the SPARQL endpoint: checking whether the endpoint responds to a SPARQL query [14,26] (O)
– accessibility of the RDF dumps: checking whether an RDF dump is provided and can be downloaded [14,26] (O)
– dereferencability issues: when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] (O)
– no structured data available: detection of dead links, or of a URI without any supporting RDF metadata, or of missing redirection using the status code 303 See Other or code 200 OK [14,26] (O)
– misreported content types: detection of whether the content is suitable for consumption and whether the content should be accessed [26] (S)
– no dereferenced back-links: detection of all local in-links or back-links, i.e. locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] (O)

Performance:
– no usage of slash-URIs: checking for usage of slash-URIs where large amounts of data are provided [14] (O)
– low latency: if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] (O)
– high throughput: no. of answered HTTP requests per second [14] (O)
– scalability of a data source: detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request (O)
– no use of prolix RDF features: detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] (O)

Security:
– access to data is secure: use of login credentials, or use of SSL or SSH (O)
– data is of proprietary nature: data owner allows access only to certain users (O)

Response-time:
– delay in response time: delay between submission of a request by the user and reception of the response from the system [5] (O)

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured and their type – Subjective (S) or Objective (O).


Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between the submission of a query by the user and the reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between the submission of a request by the user and the reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related with each other as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is defined only as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

The representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, in [27] the use of very long URIs, or of URIs that contain query parameters, is regarded as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.


Representational-conciseness:
– keeping URIs short: detection of long URIs or those that contain query parameters [27] (O)

Representational-consistency:
– re-use of existing terms: detecting whether existing terms from other vocabularies have been reused [27] (O)
– re-use of existing vocabularies: usage of established vocabularies [14] (S)

Understandability:
– human-readable labelling of classes, properties and entities by providing rdfs:label: no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] (O)
– indication of metadata about a dataset: checking for the presence of the title, content and URI of the dataset [14,5] (O)
– dereferenced representations giving human-readable metadata: detecting use of rdfs:label to attach labels or names to resources [14] (O)
– indication of one or more exemplary URIs: detecting whether the pattern of the URIs is provided [14] (O)
– indication of a regular expression that matches the URIs of a dataset: detecting whether a regular expression that matches the URIs is present [14] (O)
– indication of an exemplary SPARQL query: detecting whether examples of SPARQL queries are provided [14] (O)
– indication of the vocabularies used in the dataset: checking whether a list of vocabularies used in the dataset is provided [14] (O)
– provision of message boards and mailing lists: checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] (O)

Interpretability:
– interpretability of data: detecting the use of appropriate language, symbols, units and clear definitions [5] (S); detecting the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] (O)
– interpretability of terms: use of various schema languages to provide definitions for terms [5] (S)
– misinterpretation of missing values: detecting use of blank nodes [27] (O)
– dereferenced representations giving human-readable metadata: detecting use of rdfs:label to attach labels or names to resources [27] (O)
– atypical use of collections, containers and reification: detection of the non-standard usage of collections, containers and reification [27] (O)

Versatility:
– provision of the data in different serialization formats: checking whether data is available in different serialization formats [14] (O)
– provision of the data in various languages: checking whether data is available in different languages [14] (O)
– application of content negotiation: checking whether data can be retrieved in accepted formats and languages by adding a corresponding Accept header to an HTTP request [14] (O)
– accessing of data in different ways: checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] (O)

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type – Subjective (S) or Objective (O).

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves the aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.
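
As a rough illustration, the following sketch computes the share of predicates drawn from an assumed list of well-known namespaces; both the namespace list and the file name are placeholders:

from rdflib import Graph

WELL_KNOWN = ("http://xmlns.com/foaf/0.1/",
              "http://purl.org/dc/terms/",
              "http://www.w3.org/2000/01/rdf-schema#",
              "http://www.w3.org/2003/01/geo/wgs84_pos#")

g = Graph().parse("airline-dataset.ttl", format="turtle")
predicates = set(g.predicates())

reused = {p for p in predicates if str(p).startswith(WELL_KNOWN)}
if predicates:
    print("share of re-used terms:", len(reused) / len(predicates))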

Example. In our use case, consider different airline companies using different notations for representing their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models and are necessary to handle different formats of data. The exchange of information in different formats is then not a problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can still be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. The provision of metadata about a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
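
The labelling metric of Table 6 can be sketched as follows; the file name is a placeholder:

from rdflib import Graph, RDFS

g = Graph().parse("airline-dataset.ttl", format="turtle")

entities = set(g.subjects())
labelled = {s for s in entities
            if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}

if entities:
    print("labelled-entity ratio:", len(labelled) / len(entities))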

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset4, probably via string similarity. Understandable labels for cities, places, etc. improve the search performance in that case. For instance, when a user looks for a flight to US (the label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.
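
The blank-node metric of Table 6 (blank nodes cannot be referenced externally, which may hurt interpretability [27]) can be sketched as follows; the file name is a placeholder:

from rdflib import Graph, BNode

g = Graph().parse("airline-dataset.ttl", format="turtle")

blank = sum(1 for s, p, o in g
            if isinstance(s, BNode) or isinstance(o, BNode))

print(f"{blank} of {len(g)} triples involve blank nodes")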

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. The data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. The use of existing well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.
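
A sketch of the content-negotiation check from Table 6, probing a placeholder URI with different Accept headers:

import requests

FORMATS = ["text/turtle", "application/rdf+xml", "text/html"]

def negotiable(uri):
    """Return the subset of FORMATS the server actually serves for `uri`."""
    served = []
    for fmt in FORMATS:
        r = requests.get(uri, headers={"Accept": fmt},
                         allow_redirects=True, timeout=10)
        if r.ok and fmt in r.headers.get("Content-Type", ""):
            served.append(fmt)
    return served

print(negotiable("http://dbpedia.org/resource/Boston"))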

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

The provision of Linked Data in different languages, through the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. The provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify the accepted formats and languages by adding a corresponding Accept header to an HTTP request.

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
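
The currency-of-statements score of Table 7, 1 - (current time - last modified time) / (current time - start time) [50], can be computed as in the following sketch; the timestamps are placeholder values:

from datetime import datetime

def currency(last_modified, start, now=None):
    """Score in [0, 1]; 1 means just updated, 0 means never updated."""
    now = now or datetime.utcnow()
    age = (now - last_modified).total_seconds()
    lifetime = (now - start).total_seconds()
    return 1 - age / lifetime if lifetime > 0 else 0.0

score = currency(last_modified=datetime(2012, 12, 20, 14, 0),
                 start=datetime(2012, 1, 1),
                 now=datetime(2012, 12, 21, 9, 0))
print(f"currency of statements: {score:.3f}")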

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a

20 Identifies the time when the first data have been published in the LOD.
21 Older than the last modification time.


Currency:
– currency of statements: 1 - [(current time - last modified time) / (current time - start time20)] [50] (O)
– currency of data source: (last modified time of the semantic web source < last modified time of the original source)21 [15] (O)
– age of data: current time - created time [42] (O)

Volatility:
– no timestamp associated with the source: check if temporal meta-information is associated with the source [38] (O)

Timeliness:
– timeliness of data: expiry time < current time [15] (O)
– stating the recency and frequency of data validation: detection of whether the data published has been validated no more than a month ago [14] (O)
– no inclusion of outdated data: no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] (O)
– time-inaccurate representation of data: 1 - (time-inaccurate instances / total no. of instances) [14] (O)

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type – Subjective (S) or Objective (O).

delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

3.2.6.2. Volatility. Since there is no definition of volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information as it changes frequently; for instance, suppose the flight price changes every minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). If the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed for a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
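As a minimal illustration of combining the two components, the sketch below (with assumed semantics, not prescribed by any of the surveyed approaches) flags information as outdated when the time since the last update exceeds the validity period, i.e. the volatility, of the value:

    from datetime import datetime, timedelta

    def is_timely(last_modified: datetime,
                  volatility: timedelta,
                  now: datetime) -> bool:
        # Data is considered timely if the time elapsed since the last
        # update (the currency component) does not exceed the length of
        # time the value remains valid (the volatility component).
        return (now - last_modified) <= volatility

    # Flight prices assumed valid for one minute, last updated two
    # minutes before the check: the information is outdated.
    now = datetime(2012, 12, 21, 14, 50)
    print(is_timely(now - timedelta(minutes=2),
                    timedelta(minutes=1), now))  # False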

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that


corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practically feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions defines data quality as a whole, nor on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representations and their handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. The table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
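As an illustration, the sketch below computes the vocabulary re-usage ratio given above for a set of predicate URIs; the list of "established" vocabulary namespaces is a hypothetical input that an assessor would have to supply, and identifying a vocabulary by the namespace part of a URI is a simplifying assumption:

    def vocabulary_reuse_ratio(predicate_uris, established_namespaces):
        # Ratio of established vocabularies used in the dataset to all
        # vocabularies used. A vocabulary is identified here by the
        # namespace part of a predicate URI (up to the last '#' or '/').
        def namespace(uri):
            cut = max(uri.rfind('#'), uri.rfind('/'))
            return uri[:cut + 1]

        used = {namespace(p) for p in predicate_uris}
        reused = {ns for ns in used if ns in established_namespaces}
        return len(reused) / len(used) if used else 0.0

    predicates = ["http://xmlns.com/foaf/0.1/name",
                  "http://example.org/myvocab/flightNo"]
    print(vocabulary_reuse_ratio(predicates,
                                 {"http://xmlns.com/foaf/0.1/"}))  # 0.5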

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.
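A rough sketch of how the three accessibility metrics could be probed with plain HTTP requests is shown below; the endpoint URLs would be supplied by the assessor, and real assessments such as [14,26] implement considerably more elaborate checks:

    import requests

    def server_available(url):
        # Server accessibility: does the host answer an HTTP request?
        try:
            return requests.head(url, timeout=10).status_code < 400
        except requests.RequestException:
            return False

    def sparql_endpoint_available(endpoint):
        # SPARQL endpoint accessibility: does a trivial ASK query succeed?
        try:
            r = requests.get(endpoint,
                             params={"query": "ASK { ?s ?p ?o }"},
                             timeout=10)
            return r.status_code == 200
        except requests.RequestException:
            return False

    def rdf_dump_available(dump_url):
        # RDF dump accessibility: is the dump file downloadable?
        try:
            return requests.head(dump_url, allow_redirects=True,
                                 timeout=10).status_code == 200
        except requests.RequestException:
            return False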

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy and objectivity) or entirely objective (for example, accuracy and conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8. Consideration of data quality dimensions in each of the included approaches.

[The approach-dimension matrix could not be fully recovered from the text extraction. Its rows list the 21 included approaches and its columns the 26 dimensions (Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking); a checkmark indicates that an approach considers a dimension. The number of dimensions covered by each approach is recoverable: Bizer et al., 2009 (19); Flemming et al., 2010 (11); Böhm et al., 2010 (2); Chen et al., 2010 (1); Guéret et al., 2011 (2); Hogan et al., 2010 (2); Hogan et al., 2012 (7); Lei et al., 2007 (4); Mendes et al., 2012 (5); Mostafavi et al., 2004 (1); Fürber et al., 2011 (5); Rula et al., 2012 (1); Hartig, 2008 (1); Gamble et al., 2011 (1); Shekarpour et al., 2010 (1); Golbeck et al., 2006 (1); Gil et al., 2002 (1); Golbeck et al., 2003 (1); Gil et al., 2007 (2); Jacobi et al., 2011 (1); Bonatti et al., 2011 (1).]


– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at the triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset level [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
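A minimal sketch of such bottom-up propagation, assuming per-triple ratings are already available, could aggregate them to a source rating by averaging; the aggregation function is a design choice, not one prescribed by [18]:

    def source_rating(triple_ratings):
        # Propagate triple-level ratings (values in [0, 1]) to the
        # containing source/graph by simple averaging; other aggregation
        # functions (minimum, weighted mean) are equally conceivable.
        ratings = list(triple_ratings)
        return sum(ratings) / len(ratings) if ratings else 0.0

    # Ratings of the statements associated with one source.
    print(source_rating([0.9, 0.7, 1.0]))  # 0.8666...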

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement required.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users,

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator
24 http://trdf.sourceforge.net


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9. Qualitative evaluation of frameworks.
(Application: G = General, S = Specific; Type of data: RDF Triple, RDF Graph, Dataset; Degree of automation: Manual, Semi-automated, Automated.)

Paper | G/S | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al., 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al., 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | - | - | - | - | -
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al., 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al., 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al., 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | -


Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.
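The final score is thus essentially a weighted aggregation of the individual metric values. A minimal sketch of such a computation is shown below; this is not Flemming's actual implementation (which is available only as a web form), and the metric names and values are purely illustrative:

    def overall_score(metric_values, weights):
        # Weighted aggregation of per-metric values (each in [0, 1])
        # into a 0-100 quality score; every metric in metric_values is
        # expected to have a user-assigned weight.
        total_weight = sum(weights[m] for m in metric_values)
        weighted_sum = sum(metric_values[m] * weights[m]
                           for m in metric_values)
        return 100 * weighted_sum / total_weight if total_weight else 0.0

    # Two hypothetical metric values with equal weights yield a score
    # of 15 out of 100, as in the LinkedCT example of Figure 4.
    print(overall_score({"availability": 0.2, "consistency": 0.1},
                        {"availability": 1, "consistency": 1}))  # 15.0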

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, the tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function producing a value between 0 and 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValue, ByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
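For illustration, a TimeCloseness-style scoring function maps the age of the provenance timestamp into [0, 1] relative to the configured time span. The following Python sketch mimics the semantics suggested by the configuration in Listing 1; the exact formula used by Sieve may differ:

    from datetime import datetime

    def time_closeness(last_updated: datetime, now: datetime,
                       time_span_days: float) -> float:
        # Score 1.0 for data updated right now, decaying linearly to 0.0
        # for data older than the configured time span (cf. the timeSpan
        # parameter in Listing 1).
        age_days = (now - last_updated).total_seconds() / 86400.0
        return max(0.0, 1.0 - age_days / time_span_days)

    now = datetime(2012, 12, 21)
    print(time_closeness(datetime(2012, 12, 18), now,
                         time_span_days=7))  # ~0.57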

In addition, users should specify the DataSource folder and the homepage element, which refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not primarily intended to assess the data quality of a source but rather to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies so that the values are given meaning. Reconciliation with Freebase28 helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to use for uploading and performing basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, and others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, a substantial amount of research on data quality is being conducted, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in assessing the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust, and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – deep fact validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M. A., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Smith, B., Lewis, S. E., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, C. F., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.


14 Linked Data Quality

Dimension Metric Description Type

Accuracydetection of poor attributes ie those that donot contain useful values for data entries

using association rules (using the Apriori Algorithm [1]) orinverse relations or foreign key relationships [38]

O

detection of outliers statistical methods such as distance-based deviations-basedand distribution-based method [5]

O

detection of semantically incorrect values checking the data source by integrating queries for the iden-tification of functional dependencies violations [157]

O

Objectivity

objectivity of the information checking for bias or opinion expressed when a data providerinterprets or analyzes facts [5]

S

objectivity of the source checking whether independent sources confirm a fact Sno biased data provided by the publisher checking whether the dataset is neutral or the publisher has a

personal influence on the data providedS

Validity-of-documents

no syntax errors detecting syntax errors using validators [14] Oinvalid usage of undefined classes and prop-erties

detection of classes and properties which are used withoutany formal definition [14]

O

use of members of deprecated classes orproperties

detection of use of OWL classes owlDeprecatedClass andowl-DeprecatedProperty [14]

O

invalid usage of vocabularies detection of the improper usage of vocabularies [14] Omalformed datatype literals detection of ill-typed literals which do not abide by the lexical

syntax for their respective datatype [14]O

erroneous13 annotation representation 1 - (erroneous instances total no of instances) [38] Oinaccurate annotation labelling classifica-tion

(1 - inaccurate instances total no of instances) (balanceddistance metric [40] total no of instances) [38] (Balanceddistance metric is an algorithm that calculates the distancebetween the extracted (or learned) concept and the target con-cept)

O

interlinking interlinking degree clustering coefficientcentrality and sameAs chains descriptionrichness through sameAs

by using network measures [22] O

existence of links to external data providers detection of the existence and usage of external URIs andowlsameAs links [27]

S

Consistency

entities as members of disjoint classes no of entities described as members of disjoint classes totalno of entities described in the dataset [14]

O

valid usage of inverse-functional properties detection of inverse-functional properties that do not describeentities stating empty values [14]

S

no redefinition of existing properties detection of existing vocabulary being redefined [14] Susage of homogeneous datatypes no of properties used with homogeneous units in the dataset

total no of properties used in the dataset [14]O

no stating of inconsistent property valuesfor entities

no of entities described inconsistently in the dataset totalno of entities described in the dataset [14]

O

ambiguous annotation detection of an instance mapped back to more than one realworld object leading to more than one interpretation [38]

S

invalid usage of undefined classes and prop-erties

detection of classes and properties used without any formaldefinition [27]

O

misplaced classes or properties detection of a URI defined as a class is used as a property ora URI defined as a property is used as a class [27]

O

misuse of owldatatypeProperty orowlobjectProperty

detection of attribute properties used between two resourcesand relation properties used with literal values [27]

O

use of members of deprecated classes orproperties

detection of use of OWL classes owlDeprecatedClass andowl-DeprecatedProperty [27]

O

bogus owlInverse-FunctionalProperty val-ues

detecting uniqueness amp validity of inverse-functional val-ues [27]

O

literals incompatible with datatype range detection of a datatype clash that can then occur if the prop-erty is given a value (i) that is malformed or (ii) that is amember of an incompatible datatype [26]

O

ontology hijacking detection of the redefinition by third parties of external class-es properties such that reasoning over data using those exter-nal terms is affected [26]

O

misuse of predicates14 profiling statistics support the detection of such discordantvalues or misused predicates and facilitate to find valid for-mats for specific predicates [7]

O

negative dependenciescorrelation amongpredicates

using association rules [7] O

Concisenessdoes not contain redundant attributesprop-erties (intensional conciseness)

number of unique attributes of a dataset in relation to theoverall number of attributes in a target schema [42]

O

does not contain redundant objectsin-stances (extensional conciseness)

number of unique objects in relation to the overall number ofobject representations in the dataset [42]

O

no unique values for functional properties assessed by integration of uniqueness rules [15] Ono unique annotations check if several annotations refer to the same object [38] O

Table 4Comprehensive list of data quality metrics of the intrinsic dimensions how it can be measured and itrsquos type - Subjective or Objective

Linked Data Quality 15

can be measured only with quantitative (factual) dataThis can be done by checking a single fact individu-ally in different datasets for confirmation [36] Mea-suring the bias in qualitative data is far more challeng-ing However the bias in information can lead to er-rors in judgment and decision making and should beavoided

3233 Validity-of-documents ldquoValidity of docu-ments consists of two aspects influencing the usabilityof the documents the valid usage of the underlying vo-cabularies and the valid syntax of the documents [14]

Definition 11 (Validity-of-documents) Validity-of-documents refers to the valid usage of the underlyingvocabularies and the valid syntax of the documents(syntactic accuracy)

Metrics A syntax validator can be employed to as-sess the validity of a document ie its syntactic cor-rectness The syntactic accuracy of entities can be mea-sured by detecting the erroneous or inaccurate annota-tions classifications or representations An RDF val-idator can be used to parse the RDF document andensure that it is syntactically valid that is to checkwhether the document is in accordance with the RDFspecification

Example In our use case let us assume that the useris looking for flights between two specific locationsfor instance Paris (France) and New York (UnitedStates) However the user is returned with no resultsA possible reason for this is that one of the data sourcesincorrectly uses the property geolon to specify thelongitude of Paris instead of using geolong Thiscauses a query to retrieve no data when querying forflights starting near to a particular location

Invalid usage of vocabularies means referring to notexisting or deprecated resources Moreover invalid us-age of certain vocabularies can result in consumers notable to process data as intended Syntax errors typosuse of deprecated classes and properties all add to theproblem of the invalidity of the document as the datacan neither be processed nor a consumer can performreasoning on such data

3234 interlinking When an RDF triple containsURIs from different namespaces in subject and ob-ject position this triple basically establishes a link be-tween the entity identified by the subject (described inthe source dataset using namespace A) with the en-tity identified by the object (described in the targetdataset using namespace B) Through the typed RDFlinks data items are effectively interlinked The impor-

tance of mapping coherence can be classified in oneof the four scenarios (a) Frameworks (b) Termino-logical Reasoning (c) Data Transformation (d) QueryProcessing as identified in [41]

Definition 12 (interlinking) interlinking refers to thedegree to which entities that represent the same con-cept are linked to each other

Metrics interlinking can be measured by using net-work measures that calculate the interlinking degreecluster coefficient sameAs chains centrality and de-scription richness through sameAs links

Example In our flight search engine the instanceof the country United States in the airlinedataset should be interlinked with the instance Ame-rica in the spatial dataset This interlinking canhelp when a user queries for a flight as the search en-gine can display the correct route from the start des-tination to the end destination by correctly combin-ing information for the same country from both thedatasets Since names of various entities can have dif-ferent URIs in different datasets their interlinking canhelp in disambiguation

In the Web of Data it is common to use differ-ent URIs to identify the same real-world object oc-curring in two different datasets Therefore it is theaim of Linked Data to link or relate these two objectsin order to be unambiguous The interlinking refers tonot only the interlinking between different datasets butalso internal links within the dataset itself Moreovernot only the creation of precise links but also the main-tenance of these interlinks is important An aspect tobe considered while interlinking data is to use differ-ent URIs to identify the real-world object and the doc-ument that describes it The ability to distinguish thetwo through the use of different URIs is critical to theinterlinking coherence of the Web of Data [25] An ef-fort towards assessing the quality of a mapping (ie in-coherent mappings) even if no reference mapping isavailable is provided in [41]

3235 Consistency Consistency implies that ldquotwoor more values do not conflict with each other [5]Similarly in [14] and [26] consistency is defined asldquono contradictions in the data A more generic defi-nition is that ldquoa dataset is consistent if it is free of con-flicting information [42]

Definition 13 (Consistency) Consistency means thata knowledge base is free of (logicalformal) contradic-tions with respect to particular knowledge representa-tion and inference mechanisms

16 Linked Data Quality

RDF-S

RDFa

SPARQL

RDFXML

N3 NT

OWL

OWL2

SWRL

Domain specific

rules

RDF

DAMLLD ResourceDescription

DLLite

Full

QLRL

EL

Turtle

Protocol

Query Results JSON

Fig 3 Different components related to the RDF representation ofdata

Metrics On the Linked Data Web semantic knowl-edge representation techniques are employed whichcome with certain inference and reasoning strategiesfor revealing implicit knowledge which then mightrender a contradiction Consistency is relative to a par-ticular logic (set of inference rules) for identifying con-tradictions A consequence of our definition of con-sistency is that a dataset can be consistent wrt theRDF inference rules but inconsistent when taking theOWL2-QL reasoning profile into account For assess-ing consistency we can employ an inference engineor a reasoner which supports the respective expres-sivity of the underlying knowledge representation for-malism Additionally we can detect functional depen-dency violations such as domainrange violations

In practice RDF-Schema inference and reasoningwith regard to the different OWL profiles can be usedto measure consistency in a dataset For domain spe-cific applications consistency rules can be defined forexample according to the SWRL [28] or RIF stan-dards [32] and processed using a rule engine Figure 3shows the different components related to the RDFrepresentation of data where consistency mainly ap-plies to the schema (and the related components) of thedata rather than RDF and its representation formats

Example Let us assume a user looking for flightsbetween Paris and New York on the 21st of December2012 Her query returns the following resultsFlight From To Arrival DepartureA123 Paris NewYork 1450 2235B123 Paris Singapore 1450 2235The results show that the flight number A123 has twodifferent destinations at the same date and same timeof arrival and departure which is inconsistent with the

ontology definition that one flight can only have onedestination at a specific time and date This contradic-tion arises due to inconsistency in data representationwhich can be detected by using inference and reason-ing

3236 Conciseness The authors in [42] distin-guish the evaluation of conciseness at the schemaand the instance level On the schema level (inten-sional) ldquoa dataset is concise if it does not contain re-dundant attributes (two equivalent attributes with dif-ferent names) Thus intensional conciseness mea-sures the number of unique attributes of a dataset inrelation to the overall number of attributes in a tar-get schema On the data (instance) level (extensional)ldquoa dataset is concise if it does not contain redundantobjects (two equivalent objects with different iden-tifiers) Thus extensional conciseness measures thenumber of unique objects in relation to the overallnumber of object representations in the dataset Thedefinition of conciseness is very similar to the defini-tion of rsquouniquenessrsquo defined in [15] as the ldquodegree towhich data is free of redundancies in breadth depthand scope This comparison shows that concisenessand uniqueness can be used interchangeably

Definition 14 (Conciseness) Conciseness refers to theredundancy of entities be it at the schema or the datalevel Thus conciseness can be classified into (i) inten-sional conciseness (schema level) which refers to theredundant attributes and (ii) extensional conciseness(data level) which refers to the redundant objects

Metrics As conciseness is classified in two cate-gories it can be measured by as the ratio between thenumber of unique attributes (properties) or unique ob-jects (instances) compared to the overall number of at-tributes or objects respectively present in a dataset

Example In our flight search engine since datais fused from different datasets an example of inten-sional conciseness would be a particular flight sayA123 being represented by two different identifiersin different datasets such as httpairlinesorgA123 and httpflightsorgA123This redundancy can ideally be solved by fusing thetwo and keeping only one unique identifier On theother hand an example of extensional conciseness iswhen both these different identifiers of the same flighthave the same information associated with them inboth the datasets thus duplicating the information

While integrating data from two different datasetsif both use the same schema or vocabulary to repre-

Linked Data Quality 17

sent the data then the intensional conciseness is highHowever if the integration leads to the duplication ofvalues that is the same information is stored in differ-ent ways this leads to extensional conciseness Thismay lead to contradictory values and can be solved byfusing duplicate entries and merging common proper-ties

Relations between dimensions Both the dimensionsaccuracy and objectivity focus towards the correctnessof representing the real world data Thus objectivityoverlaps with the concept of accuracy but differs fromit because the concept of accuracy does not heavily de-pend on the consumersrsquo preference On the other handobjectivity is influenced by the userrsquos preferences andby the type of information (eg height of a buildingvs product description) Objectivity is also related tothe verifiability dimension that is the more verifiablea source is the more objective it will be Accuracyin particular the syntactic accuracy of the documentsis related to the validity-of-documents dimension Al-though the interlinking dimension is not directly re-lated to the other dimensions it is included in thisgroup since it is independent of the userrsquos context

324 AccessibilityThe dimensions belonging to this category involve

aspects related to the way data can be accessed andretrieved There are four dimensions part of thisgroup which are availability performance securityand response-time as displayed along with their cor-responding metrics in Table 5 The reference for eachmetric is provided in the table

3241 Availability Availability refers to ldquothe ex-tent to which information is available or easily andquickly retrievable [5] In [14] on the other handavailability is expressed as the proper functioning ofall access methods In the former definition availabil-ity is more related to the measurement of available in-formation rather than to the method of accessing theinformation as implied in the latter definition

Definition 15 (Availability) Availability of a datasetis the extent to which information is present obtain-able and ready for use

Metrics Availability of a dataset can be measuredin terms of accessibility of the server SPARQL end-points or RDF dumps and also by the dereferencabilityof the URIs

Example Let us consider the case in which the userlooks up a flight in our flight search engine How-

ever instead of retrieving the results she is presentedwith an error response code such as 4xx clienterror This is an indication that a requested resourceis unavailable In particular when the returned errorcode is 404 Not Found code she may assumethat either there is no information present at that spec-ified URI or the information is unavailable Naturallyan apparently unreliable system is less likely to beused in which case the user may not book flights afterencountering such issues

Execution of queries over the integrated knowledgebase can sometimes lead to low availability due to sev-eral reasons such as network congestion unavailabil-ity of servers planned maintenance interruptions deadlinks or dereferencability issues Such problems affectthe usability of a dataset and thus should be avoidedby methods such as replicating servers or caching in-formation

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. In [27], however, performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance or data intensive environments. We can notice that neither of the aforementioned references provides a formal definition of performance: the former gives a general description without explaining what is meant by performance, while the latter describes issues related to it.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process data.

Metrics: Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.

Example: In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying the performance needs requires caching mechanisms.

Table 5: Comprehensive list of data quality metrics of the accessibility dimensions, how it can be measured and its type - Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O
Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O
Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O
Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O
Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S
Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O
Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data are provided [14] | O
Performance | low latency | if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] | O
Performance | high throughput | no. of answered HTTP requests per second [14] | O
Performance | scalability of a data source | detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request | O
Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O
Security | access to data is secure | use of login credentials, or use of SSL or SSH | O
Security | data is of proprietary nature | data owner allows access only to certain users | O
Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O


Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.
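To make the latency and throughput metrics from Table 5 concrete, the following sketch (our illustration; the one-second latency threshold follows [14], and the endpoint URL is a placeholder) times repeated HTTP requests against an endpoint:

import time
import requests

ENDPOINT = "http://dbpedia.org/sparql"  # placeholder endpoint
PARAMS = {"query": "ASK { ?s ?p ?o }"}

def average_latency(n: int = 10) -> float:
    """Average seconds per HTTP request over n requests."""
    start = time.monotonic()
    for _ in range(n):
        requests.get(ENDPOINT, params=PARAMS, timeout=10)
    return (time.monotonic() - start) / n

def throughput(window_s: float = 5.0) -> float:
    """Answered HTTP requests per second within a fixed time window."""
    answered, deadline = 0, time.monotonic() + window_s
    while time.monotonic() < deadline:
        if requests.get(ENDPOINT, params=PARAMS, timeout=10).status_code == 200:
            answered += 1
    return answered / window_s

print("average latency too high (> 1 s):", average_latency() > 1.0)
print("throughput (requests/s):", throughput())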

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics: Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example: In our scenario, we consider a user that wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore a reliable and secure infrastructure or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.
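As a rough illustration of the security metrics in Table 5 (our sketch, not part of any surveyed tool; the probed URL is a placeholder), one can test whether access is forced through TLS and whether unauthenticated requests are rejected:

import requests

def security_indicators(url: str) -> dict:
    """Probe whether access is TLS-protected and/or requires credentials."""
    r = requests.get(url, timeout=10, allow_redirects=True)
    return {
        # The final URL after redirects reveals whether TLS is enforced.
        "served_over_tls": r.url.startswith("https://"),
        # 401/403 indicates that login credentials are required.
        "requires_credentials": r.status_code in (401, 403),
    }

print(security_indicators("https://example.org/protected/data.ttl"))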

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics: Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example: A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.
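The metric can be implemented directly as a wall-clock measurement around a query. The sketch below (our illustration, using the Python SPARQLWrapper library; endpoint and query are placeholders) measures the delay until the complete response has been received and parsed:

import time
from SPARQLWrapper import SPARQLWrapper, JSON

def response_time(endpoint: str, query: str) -> float:
    """Seconds between query submission and the fully parsed response."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    start = time.monotonic()
    client.query().convert()  # blocks until the complete response arrives
    return time.monotonic() - start

print(response_time("http://dbpedia.org/sparql",
                    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 100"))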

Relations between dimensions: The dimensions in this group are inter-related as follows. The performance of a system is related to the availability and response-time dimensions: only if a dataset is available, or has a low response time, can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics: Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example: A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identified the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
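The single metric for this dimension lends itself to a straightforward implementation. The sketch below (our illustration with rdflib; the 80-character cutoff is our arbitrary choice, not a threshold prescribed by [27], and the input file is a placeholder) flags long URIs and URIs containing query parameters:

from rdflib import Graph, URIRef

MAX_LENGTH = 80  # illustrative cutoff, not a threshold prescribed by [27]

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical input file

flagged = set()
for term in set(g.subjects()) | set(g.objects()):
    if isinstance(term, URIRef):
        uri = str(term)
        if len(uri) > MAX_LENGTH or "?" in uri:
            flagged.add(uri)

print(len(flagged), "verbose URIs found")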

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how it can be measured and its type - Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats: identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data but also with data from other sources.

Metrics: Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example: In our use case, consider different airline companies using different notations for representing their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats will then not be a problem in our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies (they restrict themselves to the needs of the OBO Foundry community, but the conventions can still be applied to other domains). Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.
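The corresponding metric can be computed as the ratio of established vocabularies to all vocabularies used (cf. Section 4.2). The following sketch (our illustration with rdflib; the list of "established" namespaces is an assumption for demonstration, and the input file is a placeholder) derives the vocabularies from the predicates of a dataset:

from rdflib import Graph
from rdflib.namespace import split_uri

# Illustrative list of "established" namespaces; a real assessment
# would use a curated catalogue instead.
ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://purl.org/dc/terms/",
}

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical input file

vocabularies = set()
for predicate in set(g.predicates()):
    try:
        namespace, _ = split_uri(predicate)
    except ValueError:
        continue  # URI cannot be split into namespace and local name
    vocabularies.add(str(namespace))

print("vocabulary reuse ratio:",
      len(vocabularies & ESTABLISHED) / len(vocabularies))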

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the terms understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.


Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how the dataset can be used.

Example: Let us assume that our flight search engine allows a user to enter a start and a destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve the search performance in that case. For instance, when a user looks for a flight to US (label), then the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure the clarity and understandability of their data.
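The labelling metric from Table 6 is a simple ratio. The sketch below (our illustration with rdflib; the input file is a placeholder) computes the fraction of entities that carry an rdfs:label or rdfs:comment [14]:

from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical input file

entities = set(g.subjects())
labelled = {s for s in entities
            if g.value(s, RDFS.label) is not None
            or g.value(s, RDFS.comment) is not None}

print("labelled entity ratio:", len(labelled) / len(entities))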

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether the information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user that is looking for a flight from Milan to Boston. The data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.
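The "misinterpretation of missing values" metric from Table 6 reduces to counting blank nodes [27]; a minimal rdflib sketch (ours, with a placeholder input file):

from rdflib import BNode, Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical input file

blank = sum(1 for s, p, o in g
            if isinstance(s, BNode) or isinstance(o, BNode))
print("triples involving blank nodes:", blank, "of", len(g))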

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and in different languages, as well as via different access methods.

Example: Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages, via language tags for literal values, contributes towards the versatility of a dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format, in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
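Content negotiation, one of the versatility metrics in Table 6, can be probed by requesting the same resource under different Accept headers and inspecting the returned Content-Type (our illustration; the resource URI is a placeholder):

import requests

RESOURCE = "http://dbpedia.org/resource/Boston"  # placeholder resource
FORMATS = ["text/turtle", "application/rdf+xml", "text/html"]

for accept in FORMATS:
    r = requests.get(RESOURCE, headers={"Accept": accept},
                     timeout=10, allow_redirects=True)
    print("asked for", accept, "- got", r.headers.get("Content-Type"))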

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms a dataset is represented in (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics: The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example: Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].


Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how it can be measured and its type - Subjective (S) or Objective (O)

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time*)] [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source** [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

* the start time identifies when the data was first published in the LOD
** i.e. the semantic web source is older than the last modification of the original source

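The "currency of statements" metric from Table 7 can be computed directly from temporal metadata, when available. A minimal sketch (ours; the timestamps are hypothetical values standing in for a dataset's temporal metadata):

from datetime import datetime, timezone

def currency(last_modified: datetime, start: datetime) -> float:
    """1 - (now - last modified)/(now - start); closer to 1 = more current."""
    now = datetime.now(timezone.utc)
    return 1 - (now - last_modified).total_seconds() / (now - start).total_seconds()

# Hypothetical timestamps standing in for a dataset's temporal metadata.
start = datetime(2010, 1, 1, tzinfo=timezone.utc)
last_modified = datetime(2012, 6, 1, tzinfo=timezone.utc)
print("currency score:", round(currency(last_modified, start), 2))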

3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies in the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics: Volatility can be measured using two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example: Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered as volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed relative to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics: Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.

Example: A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of this time-related quality dimension, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine, corresponding to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practically feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.
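One common way to operationalise the combination of currency and volatility (our illustration, following the classic max(0, 1 - currency/volatility) formulation from the data quality literature, not a formula given in the surveyed approaches) is:

def timeliness(currency_age_s: float, volatility_s: float) -> float:
    """currency_age_s: seconds since the last update;
    volatility_s: seconds the data remains valid."""
    return max(0.0, 1.0 - currency_age_s / volatility_s)

# A price updated 30 s ago that stays valid for 60 s is half-timely.
print(timeliness(30, 60))  # 0.5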

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile the data is: data that is highly volatile must be kept current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used. We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset
    ----------------------------------------------------
      total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that a particular dimension can have several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

Table 8: Consideration of data quality dimensions in each of the included approaches. The table is a matrix marking, for each of the 21 approaches (Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011), which of the 26 dimensions it considers: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.

– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and less so at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22], and 4 approaches assess data at triple, graph and dataset level [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can then be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
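A minimal sketch of such upward propagation (ours; the scores and graph names are hypothetical, and the mean is just one possible aggregation function):

from statistics import mean

# Hypothetical per-triple quality scores, keyed by the graph they belong to.
triple_scores = {
    "http://airlines.org/graph": [1.0, 0.8, 0.9],
    "http://flights.org/graph": [0.6, 0.7],
}

graph_scores = {g: mean(scores) for g, scores in triple_scores.items()}
dataset_score = mean(graph_scores.values())

print(graph_scores)
print("dataset-level score:", round(dataset_score, 2))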

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment at a finer-grained level, such as the triple or entity level, and the results should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement required.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it requires a lot of time as well as an understanding of the required XML file structure and specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss/) [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/) used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF (http://trdf.sourceforge.net/) approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as with the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.


Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php) where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.


Table 9: Qualitative evaluation of frameworks (Application: G = General, S = Specific; ✓ = applies, – = does not apply)

Paper | Appl. | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al. 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | – | – | – | – | –
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al. 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al. 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al. 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al. 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | –


In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanations for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de/), used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration.properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random, PickMostFrequent.


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, the user should specify, in the DataSource folder, the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine (http://code.zemanta.com/sparkica/) is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data.


For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that the values are given meaning. Reconciliation with Freebase (http://www.freebase.com/) helps mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible (see http://refine.deri.ie/reconciliationDocs) using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data to and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions with this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions, along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).

[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/

[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).

[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/

[33] Kitchenham, B. Procedures for performing systematic reviews.

[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).

[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. An ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).

[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.


can be measured only with quantitative (factual) data. This can be done by checking a single fact individually in different datasets for confirmation [36]. Measuring bias in qualitative data is far more challenging. However, bias in information can lead to errors in judgment and decision making and should be avoided.

3.2.3.3 Validity-of-documents. "Validity of documents consists of two aspects influencing the usability of the documents: the valid usage of the underlying vocabularies and the valid syntax of the documents" [14].

Definition 11 (Validity-of-documents). Validity-of-documents refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).

Metrics. A syntax validator can be employed to assess the validity of a document, i.e. its syntactic correctness. The syntactic accuracy of entities can be measured by detecting erroneous or inaccurate annotations, classifications or representations. An RDF validator can be used to parse the RDF document and ensure that it is syntactically valid, that is, to check whether the document is in accordance with the RDF specification.
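A minimal sketch of such a syntax check, assuming Python with the rdflib library; the input file name is hypothetical:

```python
from rdflib import Graph

def is_syntactically_valid(path: str, fmt: str = "xml") -> bool:
    """Try to parse an RDF document; parsing succeeds only if the
    document conforms to the syntax of the given serialisation."""
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception:
        # rdflib raises a parser-specific exception on malformed input
        return False

print(is_syntactically_valid("flights.rdf"))  # hypothetical input file
```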

Example. In our use case, let us assume that the user is looking for flights between two specific locations, for instance Paris (France) and New York (United States). However, the user is returned no results. A possible reason is that one of the data sources incorrectly uses the property geo:lon to specify the longitude of Paris, instead of using geo:long. This causes a query to retrieve no data when querying for flights starting near a particular location.

Invalid usage of vocabularies means referring to non-existing or deprecated resources. Moreover, invalid usage of certain vocabularies can result in consumers not being able to process data as intended. Syntax errors, typos, and use of deprecated classes and properties all add to the problem of the invalidity of a document, as the data can neither be processed nor can a consumer perform reasoning on such data.

3.2.3.4 Interlinking. When an RDF triple contains URIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (described in the source dataset using namespace A) and the entity identified by the object (described in the target dataset using namespace B). Through such typed RDF links, data items are effectively interlinked. The importance of mapping coherence can be classified in one of four scenarios: (a) Frameworks; (b) Terminological Reasoning; (c) Data Transformation; (d) Query Processing, as identified in [41].

Definition 12 (Interlinking). Interlinking refers to the degree to which entities that represent the same concept are linked to each other.

Metrics. Interlinking can be measured by using network measures that calculate the interlinking degree, cluster coefficient, sameAs chains, centrality and description richness through sameAs links.
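As one hedged illustration, the interlinking degree could be approximated as the share of resources that take part in at least one owl:sameAs link; a sketch assuming rdflib and a hypothetical dump file:

```python
from rdflib import Graph
from rdflib.namespace import OWL
from rdflib.term import URIRef

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical RDF dump

# resources occurring as subjects in the dataset
resources = {s for s in g.subjects() if isinstance(s, URIRef)}
# resources participating as subject of at least one owl:sameAs link
linked = {s for s, o in g.subject_objects(OWL.sameAs) if isinstance(s, URIRef)}

degree = len(linked) / len(resources) if resources else 0.0
print(f"interlinking degree: {degree:.2f}")
```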

Example. In our flight search engine, the instance of the country United States in the airline dataset should be interlinked with the instance America in the spatial dataset. This interlinking can help when a user queries for a flight, as the search engine can display the correct route from the start destination to the end destination by correctly combining information for the same country from both datasets. Since names of various entities can have different URIs in different datasets, their interlinking can help in disambiguation.

In the Web of Data, it is common to use different URIs to identify the same real-world object occurring in two different datasets. Therefore, it is the aim of Linked Data to link or relate these two objects in order to be unambiguous. Interlinking refers not only to links between different datasets but also to internal links within the dataset itself. Moreover, not only the creation of precise links but also the maintenance of these interlinks is important. An aspect to be considered while interlinking data is to use different URIs to identify the real-world object and the document that describes it. The ability to distinguish the two through the use of different URIs is critical to the interlinking coherence of the Web of Data [25]. An effort towards assessing the quality of a mapping (i.e. incoherent mappings), even if no reference mapping is available, is provided in [41].

3.2.3.5 Consistency. Consistency implies that "two or more values do not conflict with each other" [5]. Similarly, in [14] and [26], consistency is defined as "no contradictions in the data". A more generic definition is that "a dataset is consistent if it is free of conflicting information" [42].

Definition 13 (Consistency). Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms.


Fig. 3. Different components related to the RDF representation of data: RDF and its serialisations (RDF/XML, N3, NT, Turtle, RDFa), schema and ontology languages (RDF-S, DAML, OWL, and OWL2 with its profiles EL, QL, RL, DL-Lite and Full), rule languages (SWRL and domain-specific rules), and SPARQL with its protocol and query results in JSON.

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF standards [32], and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.

Example. Let us assume a user looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

Flight  From   To         Arrival  Departure
A123    Paris  New York   14:50    22:35
A123    Paris  Singapore  14:50    22:35

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to an inconsistency in data representation, which can be detected by using inference and reasoning.
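The contradiction in this example could be caught with a simple rule-based check; a sketch using rdflib and SPARQL, under the assumption that a property such as ex:destination is declared functional (the vocabulary and file name are hypothetical):

```python
from rdflib import Graph

g = Graph()
g.parse("flights.ttl", format="turtle")  # hypothetical dataset

# report subjects that violate a functional property, i.e. that carry
# two different values for ex:destination
violations = g.query("""
    PREFIX ex: <http://example.org/flight/>
    SELECT ?flight ?d1 ?d2
    WHERE {
        ?flight ex:destination ?d1 .
        ?flight ex:destination ?d2 .
        FILTER (?d1 != ?d2)
    }
""")
for flight, d1, d2 in violations:
    print(f"inconsistent: {flight} -> {d1} vs {d2}")
```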

3.2.3.6 Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' defined in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified in two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) compared to the overall number of attributes or objects, respectively, present in a dataset.
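A rough sketch of the extensional ratio, assuming rdflib and treating owl:sameAs clusters as one real-world object (general duplicate detection is much harder; the dump file is hypothetical):

```python
from rdflib import Graph
from rdflib.namespace import OWL
from rdflib.term import URIRef

g = Graph()
g.parse("fused.nt", format="nt")  # hypothetical fused dataset

instances = {s for s in g.subjects() if isinstance(s, URIRef)}

# union-find over owl:sameAs links to cluster equivalent identifiers
parent = {r: r for r in instances}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for s, o in g.subject_objects(OWL.sameAs):
    if s in parent and o in parent:
        parent[find(s)] = find(o)

unique_objects = {find(r) for r in instances}
ratio = len(unique_objects) / len(instances) if instances else 0.0
print("extensional conciseness:", ratio)
```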

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this lowers the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the dimensions accuracy and objectivity focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4 Accessibility. The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group: availability, performance, security and response-time, displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1 Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
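Two of these checks sketched with the Python requests library; the endpoint and resource URIs are hypothetical:

```python
import requests

ENDPOINT = "http://example.org/sparql"          # hypothetical SPARQL endpoint
RESOURCE = "http://example.org/resource/Paris"  # hypothetical resource URI

def sparql_endpoint_available(endpoint: str) -> bool:
    """Send a trivial ASK query; an HTTP 200 counts as accessible."""
    r = requests.get(endpoint, params={"query": "ASK { ?s ?p ?o }"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def uri_dereferencable(uri: str) -> bool:
    """A URI is dereferencable if requesting RDF does not yield 4xx/5xx."""
    r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                     allow_redirects=True, timeout=10)
    return r.status_code < 400

print(sparql_endpoint_available(ENDPOINT), uri_dereferencable(RESOURCE))
```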

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at the specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.

3.2.4.2 Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance or data intensive environments. In the aforementioned references, we notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by it. In the latter reference, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process data.

Metrics. Performance is measured based on the scalability of the data source, that is, whether a query is answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
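A sketch of the latency and scalability measurements from Table 5, assuming Python with requests against a hypothetical endpoint:

```python
import time
import requests

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

def latency(endpoint: str, query: str) -> float:
    """Seconds until the HTTP response arrives; [14] flags sources
    whose average latency exceeds one second."""
    start = time.monotonic()
    requests.get(endpoint, params={"query": query}, timeout=30)
    return time.monotonic() - start

# scalability proxy: time for ten requests divided by ten, compared
# with the time for a single request
single = latency(ENDPOINT, QUERY)
ten = sum(latency(ENDPOINT, QUERY) for _ in range(10)) / 10
print(f"single: {single:.2f}s, averaged over ten: {ten:.2f}s")
```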

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how they can be measured and their type - Subjective (S) or Objective (O).

Availability
- accessibility of the server: checking whether the server responds to a SPARQL query [14,26] (O)
- accessibility of the SPARQL endpoint: checking whether the server responds to a SPARQL query [14,26] (O)
- accessibility of the RDF dumps: checking whether an RDF dump is provided and can be downloaded [14,26] (O)
- dereferencability issues: when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] (O)
- no structured data available: detection of dead links, or of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] (O)
- misreported content types: detection of whether the content is suitable for consumption and whether the content should be accessed [26] (S)
- no dereferenced back-links: detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] (O)

Performance
- no usage of slash-URIs: checking for usage of slash-URIs where large amounts of data are provided [14] (O)
- low latency: if an HTTP request is not answered within an average time of one second, the latency of the data source is considered too high [14] (O)
- high throughput: no. of answered HTTP requests per second [14] (O)
- scalability of a data source: detection of whether the time to answer ten requests, divided by ten, is not longer than the time it takes to answer one request (O)
- no use of prolix RDF features: detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] (O)

Security
- access to data is secure: use of login credentials, or use of SSL or SSH (O)
- data is of proprietary nature: data owner allows access only to certain users (O)

Response-time
- delay in response time: delay between submission of a request by the user and reception of the response from the system [5] (O)

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) usage of hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, these should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3 Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost when data becomes unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, the use of SSL can keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect, and therefore reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4 Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5 Representational dimensions. Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1 Representational-conciseness. Representational-conciseness is defined only as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. Concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, [27] identified the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
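The long-URI detection from Table 6 as a small sketch, assuming rdflib; the length threshold and file name are hypothetical:

```python
from rdflib import Graph
from rdflib.term import URIRef

MAX_LEN = 80  # hypothetical threshold for a "short" URI

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical dump

uris = {t for triple in g for t in triple if isinstance(t, URIRef)}
# flag URIs that are overly long or contain query parameters
flagged = [u for u in uris if len(u) > MAX_LEN or "?" in u]

print(f"{len(flagged)} of {len(uris)} URIs are long or parameterised")
```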

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type - Subjective (S) or Objective (O).

Representational-conciseness
- keeping URIs short: detection of long URIs or those that contain query parameters [27] (O)

Representational-consistency
- re-use of existing terms: detecting whether existing terms from other vocabularies have been reused [27] (O)
- re-use of existing vocabularies: usage of established vocabularies [14] (S)

Understandability
- human-readable labelling of classes, properties and entities by providing rdfs:label: no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] (O)
- indication of metadata about a dataset: checking for the presence of the title, content and URI of the dataset [14,5] (O)
- dereferenced representations giving human-readable metadata: detecting use of rdfs:label to attach labels or names to resources [14] (O)
- indication of one or more exemplary URIs: detecting whether the pattern of the URIs is provided [14] (O)
- indication of a regular expression that matches the URIs of a dataset: detecting whether a regular expression that matches the URIs is present [14] (O)
- indication of an exemplary SPARQL query: detecting whether examples of SPARQL queries are provided [14] (O)
- indication of the vocabularies used in the dataset: checking whether a list of vocabularies used in the dataset is provided [14] (O)
- provision of message boards and mailing lists: checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] (O)

Interpretability
- interpretability of data: detect the use of appropriate language, symbols, units and clear definitions [5] (S)
- use of self-descriptive formats: detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] (O)
- interpretability of terms: use of various schema languages to provide definitions for terms [5] (S)
- misinterpretation of missing values: detecting use of blank nodes [27] (O)
- dereferenced representations giving human-readable metadata: detecting use of rdfs:label to attach labels or names to resources [27] (O)
- atypical use of collections, containers and reification: detection of the non-standard usage of collections, containers and reification [27] (O)

Versatility
- provision of the data in different serialization formats: checking whether data is available in different serialization formats [14] (O)
- provision of the data in various languages: checking whether data is available in different languages [14] (O)
- application of content negotiation: checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] (O)
- accessing of data in different ways: checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] (O)

3.2.5.2 Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations to represent their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is not a major problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3 Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community, but the conventions can still be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
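The labelling metric from Table 6 sketched with rdflib: the share of subjects carrying an rdfs:label or rdfs:comment (the dump file is hypothetical):

```python
from rdflib import Graph
from rdflib.namespace import RDFS

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical dump

entities = set(g.subjects())
# entities described with a human-readable label or comment
labelled = {s for s in entities
            if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}

ratio = len(labelled) / len(entities) if entities else 0.0
print("understandability (labelling):", ratio)
```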

Example. Let us assume that our flight search engine allows a user to enter a start and a destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.
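The content-negotiation metric (see Table 6) sketched with requests; the resource URI is hypothetical:

```python
import requests

URI = "http://example.org/resource/Boston"  # hypothetical resource

# ask the server for alternative serialisations via content negotiation
for accept in ("application/rdf+xml", "text/turtle", "text/html"):
    r = requests.get(URI, headers={"Accept": accept}, timeout=10)
    served = r.headers.get("Content-Type", "unknown")
    print(f"requested {accept:24s} -> got {served}")
```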

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms a dataset is represented in (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset dynamicity. An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1 Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.
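The currency-of-statements metric from Table 7 as a small arithmetic sketch; the timestamps are hypothetical and assumed to be available as metadata:

```python
from datetime import datetime

def currency(last_modified: datetime, start: datetime, now: datetime) -> float:
    """currency of statements [50]:
    1 - (now - last_modified) / (now - start),
    where start is the time the data was first published."""
    elapsed = (now - last_modified).total_seconds()
    history = (now - start).total_seconds()
    return 1.0 - elapsed / history

# hypothetical: published in 2010, last modified a day before the query
print(currency(datetime(2012, 12, 20), datetime(2010, 1, 1),
               datetime(2012, 12, 21)))  # close to 1: recently updated
```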

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that is not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type - Subjective (S) or Objective (O).

Currency
- currency of statements: 1 - [(current time - last modified time)/(current time - start time20)] [50] (O)
- currency of data source: (last modified time of the semantic web source < last modified time of the original source)21 [15] (O)
- age of data: current time - created time [42] (O)

Volatility
- no timestamp associated with the source: check if there is temporal meta-information associated with a source [38] (O)

Timeliness
- timeliness of data: expiry time < current time [15] (O)
- stating the recency and frequency of data validation: detection of whether the data published has been validated no more than a month ago [14] (O)
- no inclusion of outdated data: no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] (O)
- time-inaccurate representation of data: 1 - (time-inaccurate instances / total no. of instances) [14] (O)

20 Identifies the time when the data was first published in the LOD cloud.
21 Older than the last modification time.

3.2.6.2 Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes every minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed relative to a specific data value.

3.2.6.3 Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
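One hedged way to operationalise this combination: data is timely while its age (the currency component) has not yet exceeded the length of time it remains valid (the volatility component). A sketch with hypothetical names and timestamps:

```python
from datetime import datetime, timedelta

def is_timely(last_modified: datetime, volatility: timedelta,
              now: datetime) -> bool:
    """Timely if the time since the last update is still within
    the period the value remains valid."""
    age = now - last_modified   # currency component
    return age <= volatility    # volatility component

# flight price assumed valid for one minute (see the volatility example)
print(is_timely(datetime(2012, 12, 21, 10, 0, 0),
                timedelta(minutes=1),
                now=datetime(2012, 12, 21, 10, 0, 30)))  # True
```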

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information the user receives about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may not be necessary or ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that often the definitions given for currency and timeliness can be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of the three specific tools (Section 4.5).

4.1 Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus provided by most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2 Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

  no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset
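A sketch of this ratio, assuming rdflib and a hand-maintained whitelist of established vocabulary namespaces (the list shown is purely illustrative):

```python
from rdflib import Graph
from rdflib.term import URIRef

# illustrative whitelist of well-known vocabulary namespaces
ESTABLISHED = ("http://xmlns.com/foaf/0.1/",
               "http://purl.org/dc/terms/",
               "http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.parse("dataset.nt", format="nt")  # hypothetical dump

def namespace(uri: URIRef) -> str:
    """Heuristic: everything up to the last '#' or '/' of the URI."""
    u = str(uri)
    return u.rsplit("#", 1)[0] + "#" if "#" in u else u.rsplit("/", 1)[0] + "/"

vocabularies = {namespace(p) for p in g.predicates()}
reused = {v for v in vocabularies if v.startswith(ESTABLISHED)}

ratio = len(reused) / len(vocabularies) if vocabularies else 0.0
print("vocabulary re-use ratio:", ratio)
```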

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or by the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective metrics are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3 Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

Table 8. Consideration of data quality dimensions in each of the included approaches.

[Table 8 marks, for each of the 21 included approaches (Gil et al. 2002; Golbeck et al. 2003; Mostafavi et al. 2004; Golbeck et al. 2006; Gil et al. 2007; Lei et al. 2007; Hartig 2008; Bizer et al. 2009; Böhm et al. 2010; Chen et al. 2010; Flemming et al. 2010; Hogan et al. 2010; Shekarpour et al. 2010; Fürber et al. 2011; Gamble et al. 2011; Guéret et al. 2011; Jacobi et al. 2011; Bonatti et al. 2011; Hogan et al. 2012; Mendes et al. 2012; Rula et al. 2012), which of the following dimensions it considers: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep-conciseness, Rep-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness and Interlinking.]

– RDF triples, which focus on individual triple assessment.

– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].

– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22] and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
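A minimal sketch of such propagation, assuming purely for illustration that scores are aggregated by averaging (the surveyed approaches use different propagation strategies):

# Illustrative sketch of score propagation: per-triple quality scores are
# aggregated (here by averaging) to the graph level, and graph-level scores
# to the dataset level. The aggregation function is an assumption.

from statistics import mean

triple_scores = {
    "graph1": [0.9, 0.8, 1.0],   # per-triple scores within a named graph
    "graph2": [0.6, 0.7],
}

graph_scores = {g: mean(scores) for g, scores in triple_scores.items()}
dataset_score = mean(graph_scores.values())

print(graph_scores)    # graph1 -> 0.9, graph2 -> ~0.65
print(dataset_score)   # ~0.775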

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment at a fine-grained level, such as the triple or entity level, and the results should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement required.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23, used in [26], checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, along with a trust-aware extension, tSPARQL, and implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users,

22. http://linkeddata.informatik.hu-berlin.de/LDSrcAss

23. http://www.w3.org/RDF/Validator/
24. http://trdf.sourceforge.net


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is

25. Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9. Qualitative evaluation of frameworks.

| Paper | Application (G = General, S = Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support |
|---|---|---|---|---|---|---|---|---|---|
| Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓ |
| Golbeck et al. 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓ |
| Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | – |
| Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | – |
| Gil et al. 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | – |
| Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | – |
| Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓ |
| Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Böhm et al. 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓ |
| Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | – | – | – | – | – |
| Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓ |
| Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓ |
| Shekarpour et al. 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | – |
| Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | – |
| Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | – |
| Jacobi et al. 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | – |
| Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | – |
| Guéret et al. 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓ |
| Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | – |
| Mendes et al. 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓ |
| Rula et al. 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | – |


provided with two input fields: one showing the assigned weight and another with the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.
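The exact aggregation formula is not reproduced here; the following Python sketch merely illustrates the kind of weighted aggregation such a tool could perform, with invented metric names, values and weights:

# Hedged sketch (not Flemming's actual implementation): each metric value,
# normalised to [0, 1], is multiplied by its user-assigned weight and the
# result is scaled to a 0-100 score. All names and numbers are illustrative.

metrics = {            # metric -> (value in [0, 1], user-assigned weight)
    "dereferencability": (0.2, 1.0),
    "use_of_stable_uris": (0.5, 1.0),
    "human_readable_license": (0.0, 2.0),
}

total_weight = sum(w for _, w in metrics.values())
score = 100 * sum(v * w for v, w in metrics.values()) / total_weight
print(round(score))  # 18, a low score comparable to LinkedCT's 15/100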

On the one hand, the tool is easy to use, with its form-based questions and adequate explanations for each step. Also, assigning weights for each metric and calculating the score is straightforward and easy to adjust for each of the metrics. However, the tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26. http://ldif.wbsg.de

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
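The formula behind TimeCloseness is not given in the text; the following Python sketch shows a plausible scoring function of this kind, under the assumption of a linear decay within the configured time span:

# Plausible sketch (not Sieve's actual implementation) of a TimeCloseness-style
# scoring function: values updated within the configured time span score close
# to 1, older values decay linearly towards 0.

from datetime import datetime, timedelta

def time_closeness(last_updated, time_span_days=7):
    age = datetime.utcnow() - last_updated
    score = 1.0 - age / timedelta(days=time_span_days)
    return max(0.0, min(1.0, score))

print(time_closeness(datetime.utcnow() - timedelta(days=2)))  # ~0.71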

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27. http://code.zemanta.com/sparkica/


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies so that the values are given meaning. Reconciliation with Freebase28 helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three

28. http://www.freebase.com
29. http://refine.deri.ie/reconciliationDocs

specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool; that is, out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, and others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C. and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C. and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M. and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A. and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P. and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D. and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P. and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M. and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C. and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M. and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y. and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y. and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B. and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C. and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210
[25] Heath, T. and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S. and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A. and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B. and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L. and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M. and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S. and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K. and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M. and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V. and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V. and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D. and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W. and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C. and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H. and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R. and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G. and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G. and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W. and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S. and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M. and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M. and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P. and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S. and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D. and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y. and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y. and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



[Fig. 3: Different components related to the RDF representation of data — depicting RDF and its serialisations (RDF/XML, N3, NT, Turtle, RDFa, query results in JSON), schema and ontology languages (RDF-S, DAML, OWL, and the OWL2 profiles EL, QL, RL, DL-Lite and Full), rules (SWRL, domain specific rules) and the SPARQL protocol for access.]

Metrics. On the Linked Data Web, semantic knowledge representation techniques are employed, which come with certain inference and reasoning strategies for revealing implicit knowledge, which then might render a contradiction. Consistency is relative to a particular logic (set of inference rules) for identifying contradictions. A consequence of our definition of consistency is that a dataset can be consistent wrt. the RDF inference rules, but inconsistent when taking the OWL2-QL reasoning profile into account. For assessing consistency, we can employ an inference engine or a reasoner which supports the respective expressivity of the underlying knowledge representation formalism. Additionally, we can detect functional dependency violations such as domain/range violations.

In practice, RDF-Schema inference and reasoning with regard to the different OWL profiles can be used to measure consistency in a dataset. For domain-specific applications, consistency rules can be defined, for example according to the SWRL [28] or RIF [32] standards, and processed using a rule engine. Figure 3 shows the different components related to the RDF representation of data, where consistency mainly applies to the schema (and the related components) of the data, rather than to RDF and its representation formats.
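As an illustration of a simple consistency check, the following sketch (using the rdflib library, which is an assumption; the surveyed approaches rely on full RDFS/OWL reasoners) naively flags domain violations while ignoring subclass reasoning; the input file name is hypothetical:

# Naive domain-violation check with rdflib: for every property declaring an
# rdfs:domain, verify that each subject using that property is explicitly
# typed with the domain class. Subclass inference is deliberately ignored,
# so this under-approximates what a real reasoner would accept.

from rdflib import Graph, RDF, RDFS

g = Graph()
g.parse("flights.ttl")  # hypothetical input file

for prop, domain in g.subject_objects(RDFS.domain):
    for subj in g.subjects(prop, None):
        if (subj, RDF.type, domain) not in g:
            print(f"Domain violation: {subj} uses {prop} "
                  f"but is not typed as {domain}")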

Example. Let us assume a user is looking for flights between Paris and New York on the 21st of December 2012. Her query returns the following results:

| Flight | From | To | Arrival | Departure |
|---|---|---|---|---|
| A123 | Paris | New York | 14:50 | 22:35 |
| A123 | Paris | Singapore | 14:50 | 22:35 |

The results show that the flight number A123 has two different destinations at the same date and same time of arrival and departure, which is inconsistent with the ontology definition that one flight can only have one destination at a specific time and date. This contradiction arises due to inconsistency in data representation, which can be detected by using inference and reasoning.

3.2.3.6. Conciseness. The authors in [42] distinguish the evaluation of conciseness at the schema and the instance level. On the schema level (intensional), "a dataset is concise if it does not contain redundant attributes (two equivalent attributes with different names)". Thus, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema. On the data (instance) level (extensional), "a dataset is concise if it does not contain redundant objects (two equivalent objects with different identifiers)". Thus, extensional conciseness measures the number of unique objects in relation to the overall number of object representations in the dataset. The definition of conciseness is very similar to the definition of 'uniqueness' given in [15] as the "degree to which data is free of redundancies, in breadth, depth and scope". This comparison shows that conciseness and uniqueness can be used interchangeably.

Definition 14 (Conciseness). Conciseness refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes, and (ii) extensional conciseness (data level), which refers to redundant objects.

Metrics. As conciseness is classified into two categories, it can be measured as the ratio between the number of unique attributes (properties) or unique objects (instances) compared to the overall number of attributes or objects, respectively, present in a dataset.
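For example, extensional conciseness could be computed as in the following sketch, where equivalence between object representations is decided by a hypothetical key (a real system would derive it from owl:sameAs links or record-linkage heuristics):

# Sketch: extensional conciseness as the ratio of unique real-world objects
# to total object representations. The (URI, key) pairs are illustrative;
# the key stands in for whatever equivalence criterion a real system uses.

representations = [
    ("http://airlines.org/A123", "A123"),
    ("http://flights.org/A123", "A123"),   # same flight, second identifier
    ("http://airlines.org/B456", "B456"),
]

unique_objects = {key for _, key in representations}
extensional_conciseness = len(unique_objects) / len(representations)
print(extensional_conciseness)  # 2 unique objects / 3 representations ~ 0.67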

Example. In our flight search engine, since data is fused from different datasets, an example of intensional conciseness would be a particular flight, say A123, being represented by two different identifiers in different datasets, such as http://airlines.org/A123 and http://flights.org/A123. This redundancy can ideally be solved by fusing the two and keeping only one unique identifier. On the other hand, an example of extensional conciseness is when both these different identifiers of the same flight have the same information associated with them in both the datasets, thus duplicating the information.

While integrating data from two different datasets, if both use the same schema or vocabulary to represent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this reduces the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the dimensions accuracy and objectivity focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preferences. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, namely availability, performance, security and response-time, displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.
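Two of these checks can be sketched as follows (a simplified illustration using Python's requests library; the endpoint and resource URIs are placeholders):

# Sketch of two availability checks from Table 5: whether a SPARQL endpoint
# answers a trivial query, and whether a URI dereferences without a 4xx/5xx
# error. Real assessments would also inspect content types and redirects.

import requests

def sparql_endpoint_available(endpoint):
    r = requests.get(endpoint, params={"query": "ASK {}"},
                     headers={"Accept": "application/sparql-results+json"},
                     timeout=10)
    return r.status_code == 200

def dereferencable(uri):
    r = requests.get(uri, headers={"Accept": "application/rdf+xml"},
                     timeout=10, allow_redirects=True)
    return r.status_code < 400  # 4xx/5xx indicate dereferencability issues

print(sparql_endpoint_available("http://example.org/sparql"))
print(dereferencable("http://example.org/resource/A123"))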

Example. Let us consider the case in which a user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is the 404 Not Found code, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27] performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance or data intensive environments. In the aforementioned references, we can notice that no formal definition of performance is provided. In the former reference, the authors give a general description of performance without explaining what is meant by it; in the latter reference, the authors describe an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source is, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or usage of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.
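The latency and throughput metrics of Table 5 can be sketched as follows (a simplified illustration; the endpoint URL is a placeholder and the one-second threshold is taken from Table 5):

# Sketch: average latency over n HTTP requests and the number of answered
# requests per second, as a rough proxy for the metrics in Table 5.

import time
import requests

def average_latency(url, n=10):
    """Average seconds per request over n GET requests; prints throughput."""
    start = time.time()
    answered = sum(1 for _ in range(n)
                   if requests.get(url, timeout=10).status_code == 200)
    elapsed = time.time() - start
    print(f"throughput: {answered / elapsed:.1f} answered requests/second")
    return elapsed / n

if average_latency("http://example.org/sparql?query=ASK%20%7B%7D") > 1.0:
    print("average latency is above the one-second threshold of Table 5")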

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time


Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how it can be measured and its type - Subjective or Objective.

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O |
| Availability | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O |
| Availability | accessibility of the RDF dumps | checking whether an RDF dump is provided and can be downloaded [14,26] | O |
| Availability | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O |
| Availability | no structured data available | detection of dead links, or of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O |
| Availability | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S |
| Availability | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O |
| Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data is provided [14] | O |
| Performance | low latency | if an HTTP-request is not answered within an average time of one second, the latency of the data source is considered too low [14] | O |
| Performance | high throughput | no. of answered HTTP-requests per second [14] | O |
| Performance | scalability of a data source | detection of whether the time to answer an amount of ten requests divided by ten is not longer than the time it takes to answer one request | O |
| Performance | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O |
| Security | access to data is secure | use of login credentials, or use of SSL or SSH | O |
| Security | data is of proprietary nature | data owner allows access only to certain users | O |
| Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O |

required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user that wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can help keep the information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can be used as a means to keep the information secure, especially in cases of medical and governmental data. Additionally, adequate protection of a dataset against its alteration or misuse is an important aspect to be considered, and therefore reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data. The importance of security depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York, and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A slow response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related with each other as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has a low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as the representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. The concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or those that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
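A sketch of this metric follows; the 80-character threshold is an assumption, not a value prescribed in [27]:

# Sketch of the keeping-URIs-short metric: flag URIs that are overly long or
# carry query parameters. The length threshold is an illustrative choice.

from urllib.parse import urlparse

def concise_uri(uri, max_length=80):
    return len(uri) <= max_length and not urlparse(uri).query

print(concise_uri("http://airlines.org/A123"))                   # True
print(concise_uri("http://flights.org/search?from=MXP&to=BOS"))  # False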


Table 6. Comprehensive list of data quality metrics of the representational dimensions, how it can be measured and its type - Subjective or Objective.

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O |
| Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O |
| Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S |
| Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O |
| Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O |
| Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O |
| Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O |
| Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O |
| Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O |
| Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O |
| Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O |
| Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S |
| Interpretability | interpretability of data | detect the use of self-descriptive formats; identifying objects and terms used to define the objects with globally unique identifiers [5] | O |
| Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S |
| Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O |
| Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O |
| Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O |
| Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O |
| Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O |
| Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O |
| Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O |

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g. some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats will not be a problem in our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

15. http://xmlns.com/foaf/spec/
16. http://www.schemaweb.info/
17. http://schemacache.com/
18. http://swoogle.umbc.edu/
19. However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can nevertheless be applied to other domains.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
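The labelling metric can be sketched with rdflib as follows (an illustration under the assumption of a locally available dump; the file name is hypothetical):

# Sketch of the labelling metric from Table 6: the ratio of entities carrying
# an rdfs:label or rdfs:comment to all typed entities in the dataset.

from rdflib import Graph, RDF, RDFS

g = Graph()
g.parse("flights.ttl")  # hypothetical input file

entities = set(g.subjects(RDF.type, None))
labelled = {e for e in entities
            if (e, RDFS.label, None) in g or (e, RDFS.comment, None) in g}

print(len(labelled) / len(entities) if entities else 0.0)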

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats, in different languages, as well as via different access methods.

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
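A sketch of the content-negotiation check (using Python's requests library; the resource URI is a placeholder):

# Sketch: request the same resource with different Accept headers and inspect
# the returned content types, as in the content-negotiation metric of Table 6.

import requests

URI = "http://example.org/resource/A123"

for accept in ("text/turtle", "application/rdf+xml", "text/html"):
    r = requests.get(URI, headers={"Accept": accept}, timeout=10)
    print(accept, "->", r.status_code, r.headers.get("Content-Type"))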

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time.

The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", by providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how it can be measured and its type - Subjective or Objective

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time^20)] [50] | O |
| | currency of data source | (last modified time of the semantic web source < last modified time of the original source)^21 [15] | O |
| | age of data | current time - created time [42] | O |
| Volatility | no timestamp associated with the source | check if there are temporal meta-information associated to a source [38] | O |
| Timeliness | timeliness of data | expiry time < current time [15] | O |
| | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O |
| | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O |
| | time inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O |

20 the time when the data was first published in the LOD
21 older than the last modification time
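The first currency metric of Table 7 translates directly into code. A minimal sketch, with illustrative timestamps standing in for the temporal metadata that would normally be read from the dataset:

from datetime import datetime

def currency_of_statement(last_modified: datetime, start_time: datetime,
                          current: datetime) -> float:
    """Currency per [50]: 1 - (current - last_modified) / (current - start_time),
    where start_time is when the data was first published in the LOD."""
    age = (current - last_modified).total_seconds()
    history = (current - start_time).total_seconds()
    return 1.0 - age / history if history > 0 else 0.0

# Illustrative values: a price statement first published in January,
# last modified one hour before the current time.
now = datetime(2012, 6, 1, 12, 0)
print(currency_of_statement(datetime(2012, 6, 1, 11, 0),
                            datetime(2012, 1, 1), now))  # close to 1.0

A value close to 1 indicates a recently modified statement relative to its publication history; a value close to 0 indicates a statement that has not changed since it was first published.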

3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and she is interested in the price information. The price is considered as volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed related to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
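One common way to combine the two dimensions, borrowed from the general data quality literature and only one possible reading of this combination (not a formula prescribed by the surveyed approaches), is max(0, 1 - currency/volatility), where currency is here the age of the value and volatility its validity period:

def timeliness(age_seconds: float, volatility_seconds: float) -> float:
    """max(0, 1 - age/volatility): 1.0 for a fresh value, 0.0 once the
    value's validity period (volatility) has fully elapsed."""
    if volatility_seconds <= 0:
        return 0.0
    return max(0.0, 1.0 - age_seconds / volatility_seconds)

# Flight price example: prices remain valid for one minute (volatility);
# a price last recomputed 90 seconds ago is already out-dated.
print(timeliness(90, 60))   # 0.0
print(timeliness(15, 60))   # 0.75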

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that often the definitions given for currency and timeliness can be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8], (b) a set of approaches covering more than five dimensions [6,14,27,42] and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

(no. of established vocabularies used in the dataset) / (total no. of vocabularies used in the dataset)

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
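As an illustration of such a ratio metric, the sketch below computes the vocabulary re-use ratio with the Python rdflib library; the whitelist of "established" vocabularies is our assumption and would in practice come from a curated registry:

from rdflib import Graph
from rdflib.namespace import split_uri

# Assumed whitelist of established vocabulary namespaces (illustrative).
ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2000/01/rdf-schema#",
}

def vocabulary_reuse_ratio(graph: Graph) -> float:
    """no. of established vocabularies used / total no. of vocabularies used."""
    namespaces = set()
    for _, predicate, _ in graph:
        try:
            namespace, _ = split_uri(predicate)  # namespace of each predicate
        except ValueError:
            continue  # skip URIs that cannot be split into namespace and name
        namespaces.add(str(namespace))
    if not namespaces:
        return 0.0
    return len(namespaces & ESTABLISHED) / len(namespaces)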

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or by the absence of dereferencability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can be associated with small-scale units of data, such as the assessment of RDF triples, up to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8. Consideration of data quality dimensions in each of the included approaches

[The original matrix of this table (21 approaches × 26 dimensions, with check marks indicating which dimensions each approach considers) could not be recovered from the source. The 26 dimensions listed as columns are: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking. The approaches listed as rows are: Bizer et al., 2009; Flemming et al., 2010; Böhm et al., 2010; Chen et al., 2010; Guéret et al., 2011; Hogan et al., 2010; Hogan et al., 2012; Lei et al., 2007; Mendes et al., 2012; Mostafavi et al., 2004; Fürber et al., 2011; Rula et al., 2012; Hartig, 2008; Gamble et al., 2011; Shekarpour et al., 2010; Golbeck et al., 2006; Gil et al., 2002; Golbeck et al., 2003; Gil et al., 2007; Jacobi et al., 2011; Bonatti et al., 2011.]


– RDF triples, which focus on individual triple assessment.

– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].

– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22] and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
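A minimal sketch of such propagation, assuming quality scores have already been attached to individual triples; averaging is only one possible aggregation function (the cited approaches each use their own):

from statistics import mean

def propagate(triple_scores: dict) -> float:
    """Aggregate triple-level quality scores to a dataset-level score.
    Averaging is one design choice; min() would give a worst-case view."""
    return mean(triple_scores.values()) if triple_scores else 0.0

# Hypothetical scores attached to individual triples.
scores = {
    ("ex:flight1", "ex:price", "300"): 0.9,
    ("ex:flight1", "ex:departure", "MXP"): 1.0,
    ("ex:flight2", "ex:price", "250"): 0.4,
}
print(propagate(scores))  # ~0.77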

However, there are no approaches that perform assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment at a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or they have to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and the respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as with the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss

23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weights and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9. Qualitative evaluation of frameworks

| Paper | Application (General/Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support |
|---|---|---|---|---|---|---|---|---|---|
| Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓ |
| Golbeck et al., 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓ |
| Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | - |
| Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | - |
| Gil et al., 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | - |
| Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | - |
| Hartig, 2008 | G | Trustworthiness of Data on the Web | ✓ | - | - | - | ✓ | - | ✓ |
| Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓ |
| Böhm et al., 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓ |
| Chen et al., 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | - | - | - | - | - |
| Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓ |
| Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓ |
| Shekarpour et al., 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | - |
| Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | - |
| Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | - |
| Jacobi et al., 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | - |
| Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | - |
| Guéret et al., 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓ |
| Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | - |
| Mendes et al., 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓ |
| Rula et al., 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | - | - | - | - | - |



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool with the result of assessing the quality of LinkedCT: a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanations for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as the integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random, PickMostFrequent.

26 http://ldif.wbsg.de/


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lasUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lasUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element, which refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica/


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation against Freebase28 helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, and others were easy to use but provided only results with limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] BECKETT, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] BRICKLEY, D., AND GUHA, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).

[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] HAYES, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).

[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] KIFER, M., AND BOLEY, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] KITCHENHAM, B. Procedures for performing systematic reviews.

[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).

[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. An ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] SCHOBER, D., BARRY, S., LEWIS, E. S., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, F. C., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.



sent the data, then the intensional conciseness is high. However, if the integration leads to the duplication of values, that is, the same information is stored in different ways, this reduces the extensional conciseness. This may lead to contradictory values and can be solved by fusing duplicate entries and merging common properties.

Relations between dimensions. Both the accuracy and objectivity dimensions focus on the correctness of representing real world data. Thus, objectivity overlaps with the concept of accuracy, but differs from it because the concept of accuracy does not heavily depend on the consumers' preference. On the other hand, objectivity is influenced by the user's preferences and by the type of information (e.g. height of a building vs. product description). Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension. Although the interlinking dimension is not directly related to the other dimensions, it is included in this group since it is independent of the user's context.

3.2.4. Accessibility

The dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. There are four dimensions in this group, which are availability, performance, security and response-time, as displayed along with their corresponding metrics in Table 5. The reference for each metric is provided in the table.

3.2.4.1. Availability. Availability refers to "the extent to which information is available, or easily and quickly retrievable" [5]. In [14], on the other hand, availability is expressed as the proper functioning of all access methods. In the former definition, availability is more related to the measurement of available information rather than to the method of accessing the information, as implied in the latter definition.

Definition 15 (Availability). Availability of a dataset is the extent to which information is present, obtainable and ready for use.

Metrics. Availability of a dataset can be measured in terms of accessibility of the server, SPARQL endpoints or RDF dumps, and also by the dereferencability of the URIs.

Example. Let us consider the case in which the user looks up a flight in our flight search engine. However, instead of retrieving the results, she is presented with an error response code, such as 4xx client error. This is an indication that a requested resource is unavailable. In particular, when the returned error code is 404 Not Found, she may assume that either there is no information present at that specified URI or the information is unavailable. Naturally, an apparently unreliable system is less likely to be used, in which case the user may not book flights after encountering such issues.

Execution of queries over the integrated knowledge base can sometimes lead to low availability due to several reasons, such as network congestion, unavailability of servers, planned maintenance interruptions, dead links or dereferencability issues. Such problems affect the usability of a dataset and thus should be avoided by methods such as replicating servers or caching information.
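The availability metrics of Table 5 can be probed with plain HTTP requests. A minimal sketch using the Python requests library; the endpoint URL is illustrative and any public SPARQL endpoint could be substituted:

import requests

def sparql_endpoint_available(endpoint: str) -> bool:
    """Check whether the endpoint answers a trivial ASK query."""
    try:
        r = requests.get(endpoint,
                         params={"query": "ASK { ?s ?p ?o }"},
                         headers={"Accept": "application/sparql-results+json"},
                         timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

def uri_dereferencable(uri: str) -> bool:
    """Check dereferencability: no 4xx/5xx response on lookup."""
    try:
        r = requests.head(uri, allow_redirects=True, timeout=10)
        return r.status_code < 400
    except requests.RequestException:
        return False

# Illustrative endpoint.
print(sparql_endpoint_available("http://dbpedia.org/sparql"))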

3.2.4.2. Performance. Performance is denoted as a quality indicator which "comprises aspects of enhancing the performance of a source as well as measuring of the actual values" [14]. However, in [27], performance is associated with issues such as avoiding prolix RDF features, namely (i) reification, (ii) containers and (iii) collections. These features should be avoided, as they are cumbersome to represent in triples and can prove to be expensive to support in performance- or data-intensive environments. In the aforementioned references, we notice that no formal definition of performance is provided: the former reference gives a general description of performance without explaining what is meant by it, while the latter reference describes an issue related to performance.

Definition 16 (Performance). Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process its data.

Metrics. Performance is measured based on the scalability of the data source, that is, a query should be answered in a reasonable amount of time. Also, detection of the usage of prolix RDF features or the usage of slash-URIs can help determine the performance of a dataset. Additional metrics are low latency and high throughput of the services provided for the dataset.

Example. In our use case, the target performance may depend on the number of users, i.e. it may be required to serve 100 simultaneous users. Our flight search engine will not be scalable if the time required to answer all queries is similar to the time required when querying the individual datasets. In that case, satisfying performance needs requires caching mechanisms.

Table 5. Comprehensive list of data quality metrics of the accessibility dimensions, how it can be measured and its type - Subjective or Objective

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Availability | accessibility of the server | checking whether the server responds to a SPARQL query [14,26] | O |
| | accessibility of the SPARQL endpoint | checking whether the server responds to a SPARQL query [14,26] | O |
| | accessibility of the RDF dumps | checking whether a RDF dump is provided and can be downloaded [14,26] | O |
| | dereferencability issues | when a URI returns an error (4xx client error, 5xx server error) response code, or detection of broken links [14,26] | O |
| | no structured data available | detection of dead links, or detection of a URI without any supporting RDF metadata, or no redirection using the status code 303 See Other, or no code 200 OK [14,26] | O |
| | misreported content types | detection of whether the content is suitable for consumption and whether the content should be accessed [26] | S |
| | no dereferenced back-links | detection of all local in-links or back-links: locally available triples in which the resource URI appears as an object in the dereferenced document returned for the given resource [27] | O |
| Performance | no usage of slash-URIs | checking for usage of slash-URIs where large amounts of data is provided [14] | O |
| | low latency | if an HTTP-request is not answered within an average time of one second, the latency of the data source is considered too low [14] | O |
| | high throughput | no. of answered HTTP-requests per second [14] | O |
| | scalability of a data source | detection of whether the time to answer an amount of ten requests divided by ten is not longer than the time it takes to answer one request | O |
| | no use of prolix RDF features | detect use of RDF primitives, i.e. RDF reification, RDF containers and RDF collections [27] | O |
| Security | access to data is secure | use of login credentials, or use of SSL or SSH | O |
| | data is of proprietary nature | data owner allows access only to certain users | O |
| Response-time | delay in response time | delay between submission of a request by the user and reception of the response from the system [5] | O |

Latency is the amount of time from issuing the query until the first information reaches the user. Achieving high performance should be the aim of a dataset service. The performance of a dataset can be improved by (i) providing the dataset additionally as an RDF dump, (ii) using hash-URIs instead of slash-URIs and (iii) avoiding the use of prolix RDF features. Since Linked Data may involve the aggregation of several large datasets, they should be easily and quickly retrievable. Also, the performance should be maintained even while executing complex queries over large amounts of data, to provide query repeatability, explorational fluidity as well as accessibility.
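The scalability metric from Table 5 (the mean time over ten requests should not exceed the time of a single request) can be sketched as below; note that in practice network noise makes a single-shot comparison unreliable, so repeated runs would be needed:

import time
import requests

def scalable(endpoint: str, query: str, n: int = 10) -> bool:
    """Heuristic from Table 5: the time to answer n requests divided by n
    should not be longer than the time it takes to answer one request."""
    def timed_request() -> float:
        start = time.monotonic()
        requests.get(endpoint, params={"query": query}, timeout=30)
        return time.monotonic() - start

    single = timed_request()
    batch = sum(timed_request() for _ in range(n))
    return batch / n <= single

# Illustrative endpoint and query.
print(scalable("http://dbpedia.org/sparql",
               "SELECT * WHERE { ?s ?p ?o } LIMIT 10"))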

3.2.4.3. Security. Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted and hence protected against illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics. Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g. SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.


Example. In our scenario, we consider a user that wants to book a flight from a city A to a city B. The search engine should ensure a secure environment to the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, the use of SSL can keep her information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can serve as a means to keep information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.
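A minimal sketch of this measurement; time.monotonic is used so that clock adjustments do not distort the delay, and reading response.content forces the complete response to be received before the timer is stopped:

import time
import requests

def response_time(endpoint: str, query: str) -> float:
    """Seconds between submitting a query and receiving the complete response."""
    start = time.monotonic()
    response = requests.get(endpoint, params={"query": query}, timeout=60)
    _ = response.content  # ensure the full body has been received
    return time.monotonic() - start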

Example. A user is looking for a flight which includes multiple destinations: she wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. Low response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is providing low latency, so that the user is provided with a part of the results early on.

Relations between dimensions. The dimensions in this group are inter-related with each other as follows: the performance of a system is related to the availability and response-time dimensions. Only if a dataset is available or has low response time can it perform well. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. The concise representation not only contributes to the human readability of the data, but also influences the performance of data when queried. For example, in [27] the use of very long URIs, or of URIs that contain query parameters, is an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
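The corresponding metric from Table 6 (detection of long URIs or those containing query parameters) can be sketched as a simple check; the 80-character threshold is our assumption, not a value prescribed by [27]:

from urllib.parse import urlparse

def concise_uri(uri: str, max_length: int = 80) -> bool:
    """Flag URIs that are overly long or carry query parameters."""
    parsed = urlparse(uri)
    return len(uri) <= max_length and not parsed.query

# Illustrative URIs.
print(concise_uri("http://example.org/flight/MXP-BOS/2012-06-01"))   # True
print(concise_uri("http://example.org/s?id=4928&session=af31e"))     # False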


Table 6. Comprehensive list of data quality metrics of the representational dimensions, how it can be measured and its type - Subjective or Objective

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O |
| Representational-consistency | re-use existing terms | detection of whether existing terms from other vocabularies have been reused [27] | O |
| | re-use existing vocabularies | usage of established vocabularies [14] | S |
| Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O |
| | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O |
| | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O |
| | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O |
| | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O |
| | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O |
| | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O |
| | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O |
| Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S |
| | | detect the use of self-descriptive formats: identifying objects and terms used to define the objects with globally unique identifiers [5] | O |
| | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S |
| | misinterpretation of missing values | detecting use of blank nodes [27] | O |
| | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O |
| | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O |
| Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O |
| | provision of the data in various languages | checking whether data is available in different languages [14] | O |
| | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O |
| | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O |

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics: Representational consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from existing established vocabularies, to represent its entities.

Example: In our use case, consider different airline companies using different notations for representing their data, e.g. some publish their data as RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, as is necessary to handle different formats of data. The exchange of information in different formats is not a major problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies (the authors restrict themselves to the needs of the OBO Foundry community, but the conventions can also be applied to other domains). Another possibility is to use LODStats [13], which allows one to perform a search for frequently used properties and classes in the LOD cloud.
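The corresponding ratio metric (cf. Section 4.2) can be sketched as follows in Python using the rdflib library; the set of "established" vocabularies below is an illustrative assumption, and vocabularies are approximated by the namespaces of the predicates used in the dataset:

from rdflib import Graph
from rdflib.namespace import split_uri

ESTABLISHED = {  # illustrative, non-exhaustive choice of well-known vocabularies
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def vocabulary_reuse_ratio(graph):
    # no. of established vocabularies used / total no. of vocabularies used,
    # where vocabularies are approximated by predicate namespaces.
    namespaces = set()
    for predicate in graph.predicates():
        try:
            namespaces.add(str(split_uri(predicate)[0]))
        except ValueError:  # URI cannot be split into namespace and local name
            pass
    return len(namespaces & ESTABLISHED) / len(namespaces) if namespaces else 1.0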

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e. the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how the dataset can be used.

Example: Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure the understandability of their data.
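The labelling metric from Table 6 can be computed along the following lines (a minimal Python/rdflib sketch; entities are approximated here by the subjects appearing in the graph):

from rdflib import Graph
from rdflib.namespace import RDFS

def labelled_entity_ratio(graph):
    # no. of entities with an rdfs:label or rdfs:comment /
    # total no. of entities described in the data (cf. Table 6, [14])
    entities = set(graph.subjects())
    if not entities:
        return 1.0
    labelled = {s for s in entities
                if (s, RDFS.label, None) in graph
                or (s, RDFS.comment, None) in graph}
    return len(labelled) / len(entities)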

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport


For the first two items, no human-readable label is available; therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.
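The blank-node metric from Table 6 can be approximated as follows (a Python/rdflib sketch; we take the share of blank nodes among subject and object positions as the indicator, which is our own simplification):

from rdflib import Graph, BNode

def blank_node_ratio(graph):
    # Fraction of subject/object positions occupied by blank nodes,
    # a signal of the "misinterpretation of missing values" problem [27].
    positions = 0
    blanks = 0
    for s, p, o in graph:
        for term in (s, o):
            positions += 1
            if isinstance(term, BNode):
                blanks += 1
    return blanks / positions if positions else 0.0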

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.

Example: Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages, with the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding accept header to an HTTP request.
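The content-negotiation metric from Table 6 can be probed as sketched below (Python; the resource URI and the set of probed formats are illustrative assumptions, not prescribed by [14]):

import requests

FORMATS = ["text/turtle", "application/rdf+xml",
           "application/ld+json", "text/html"]

def available_serializations(resource_uri):
    # Probes which serializations the server returns when the
    # corresponding Accept header is sent with the request.
    supported = []
    for mime in FORMATS:
        response = requests.get(resource_uri,
                                headers={"Accept": mime},
                                allow_redirects=True, timeout=10)
        content_type = response.headers.get("Content-Type", "")
        if response.ok and mime in content_type:
            supported.append(mime)
    return supported

print(available_serializations("http://example.org/resource/Boston"))  # hypothetical URI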

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether the data is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g. in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", by providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics: The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example: Consider a user who is looking for the price of a flight from Milan to Boston and who receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that has not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need to have temporal metadata available, represented by one of the data models proposed in the literature [49].
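Given such temporal metadata, the currency-of-statements formula from Table 7 can be computed as in the following sketch (Python; the timestamps are assumed to have been extracted from the metadata beforehand):

from datetime import datetime, timezone

def currency_of_statement(last_modified, start_time, now=None):
    # 1 - [(current time - last modified time) / (current time - start time)],
    # following [50]; start_time is when the data was first published.
    now = now or datetime.now(timezone.utc)
    observed = (now - start_time).total_seconds()
    stale = (now - last_modified).total_seconds()
    return 1.0 - stale / observed

# E.g. data first published 100 days ago and last modified 10 days ago
# yields a currency of 1 - 10/100 = 0.9.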


Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time*)] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source)** [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time inaccurate representation of data | 1 - (no. of time-inaccurate instances / total no. of instances) [14] | O

* The time when the data was first published in the LOD cloud.
** I.e., older than the last modification time.

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).


3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics: Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example: Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed in relation to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics: Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
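A minimal sketch of this combination (Python; the volatility value, i.e. the length of time the data remains valid, is taken as given, e.g. the one-minute validity of flight prices from the volatility example above):

from datetime import datetime, timedelta, timezone

def is_timely(last_modified, volatility, now=None):
    # Data counts as timely while the time elapsed since its last
    # modification (its currency) does not exceed its volatility.
    now = now or datetime.now(timezone.utc)
    return (now - last_modified) <= volatility

# Flight prices assumed valid for one minute:
# is_timely(last_price_update, timedelta(minutes=1))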

Example: A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practically feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to this initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, by misreported content types or by the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective metrics are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can only be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along the relevant quality dimensions. Therefore, the assessment involves a comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:


Table 8: Consideration of data quality dimensions in each of the included approaches. [Matrix indicating, for each of the 21 included approaches (Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011), which of the 26 dimensions it considers: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.]


– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22], and 4 approaches assess data at triple, graph and dataset level [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
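Such propagation can be expressed very compactly; the following sketch (Python; averaging is only one of several conceivable aggregation strategies) illustrates both directions:

def propagate_up(triple_scores):
    # Aggregate per-triple quality scores into a source-level score,
    # in the spirit of [18]; here simply by averaging.
    return sum(triple_scores) / len(triple_scores) if triple_scores else None

def propagate_down(source_score, triples):
    # Assign a source-level trust rating to each of its statements,
    # in the spirit of [53].
    return {triple: source_score for triple in triples}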

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment at a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss) [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/), used in [26], checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or to handle incomplete information. The tRDF approach (http://trdf.sourceforge.net) [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension, tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as with the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and of their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only, at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php) where a user first specifies the name, URI and three entities of a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.


Table 9: Qualitative evaluation of frameworks.

Paper | Application (G = General, S = Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al. 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig 2008 | G | Trustworthiness of data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | – | – | – | – | –
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al. 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al. 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al. 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al. 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | –


In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, depicting the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, the tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de), used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function having a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random or PickMostFrequent.

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly intended to assess the data quality of a source, but rather to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine (http://code.zemanta.com/sparkica/) is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that it gives meaning to the values. Reconciliation against Freebase (http://www.freebase.com) helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible using this tool (see http://refine.deri.ie/reconciliationDocs). Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to use for uploading and performing basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications over the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool; that is, out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in assessing the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan. 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D., and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats: an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA: a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (Dec. 2007), 227–239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: a semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto: deep fact validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (K-CAP '07) (2007), ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and the PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M. A., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb. 1998), 58–65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 18: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

18 Linked Data Quality

Dimension Metric Description Type

Availability

accessibility of the server checking whether the server responds to a SPARQL query [1426]

O

accessibility of the SPARQL end-point

checking whether the server responds to a SPARQL query [1426]

O

accessibility of the RDF dumps checking whether a RDF dump is provided and can be down-loaded [1426]

O

dereferencability issues when a URIs returns an error (4xx client error 5xx server error)response code or detection of broken links [1426]

O

no structured data available detection of dead links or detection of a URI without any sup-porting RDF metadata or no redirection using the status code303 See Other or no code 200 OK [1426]

O

misreported content types detection of whether the content is suitable for consumptionand whether the content should be accessed [26]

S

no dereferenced back-links detection of all local in-links or back-links locally availabletriples in which the resource URI appears as an object in thedereferenced document returned for the given resource [27]

O

Performance

no usage of slash-URIs checking for usage of slash-URIs where large amounts of datais provided [14]

O

low latency if an HTTP-request is not answered within an average time ofone second the latency of the data source is considered too low[14]

O

high throughput no of answered HTTP-requests per second [14] Oscalability of a data source detection of whether the time to answer an amount of ten re-

quests divided by ten is not longer than the time it takes toanswer one request

O

no use of prolix RDF features detect use of RDF primitives ie RDF reification RDF con-tainers and RDF collections [27]

O

Security access to data is secure use of login credentials or use of SSL or SSH Odata is of proprietary nature data owner allows access only to certain users O

Response-time delay in response time delay between submission of a request by the user and recep-tion of the response from the system [5]

O

Table 5Comprehensive list of data quality metrics of the accessibility dimensions how it can be measured and itrsquos type - Subjective or Objective

quired to answer to all queries is similar to the timerequired when querying the individual datasets In thatcase satisfying performance needs requires cachingmechanisms

Latency is the amount of time from issuing the queryuntil the first information reaches the user Achievinghigh performance should be the aim of a dataset ser-vice The performance of a dataset can be improved by(i) providing the dataset additionally as an RDF dump(ii) usage of hash-URIs instead of slash-URIs and (iii)avoiding the use of prolix RDF features Since LinkedData may involve the aggregation of several largedatasets they should be easily and quickly retriev-able Also the performance should be maintained evenwhile executing complex queries over large amounts ofdata to provide query repeatability explorational fluid-ity as well as accessibility

3.2.4.3. Security

Security refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [14].

Definition 17 (Security). Security can be defined as the extent to which access to data can be restricted, and hence protected against its illegal alteration and misuse. It refers to the degree to which information is passed securely from users to the information source and back.

Metrics: Security can be measured based on whether the data has a proprietor or requires web security techniques (e.g., SSL or SSH) for users to access, acquire or re-use the data. The importance of security depends on whether the data needs to be protected and whether there is a cost of the data becoming unintentionally available. For open data, the protection aspect of security can often be neglected, but the non-repudiation of the data is still an important issue. Digital signatures based on private-public key infrastructures can be employed to guarantee the authenticity of the data.
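As an illustration only, the sketch below derives two of the indicators mentioned above for a given access URL: whether the channel is encrypted (HTTPS) and whether the server demands credentials. The URL is hypothetical.

import urllib.request
from urllib.error import HTTPError, URLError

def security_indicators(url):
    # Two rough indicators for the security metrics: encrypted transport
    # and credential-protected access (HTTP 401/403).
    indicators = {"encrypted_channel": url.startswith("https://"),
                  "requires_credentials": False}
    try:
        urllib.request.urlopen(url, timeout=5)
    except HTTPError as error:
        indicators["requires_credentials"] = error.code in (401, 403)
    except URLError:
        pass  # an unreachable server tells us nothing about security
    return indicators

print(security_indicators("https://example.org/data.rdf"))  # hypothetical URL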


Example: In our scenario, we consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, then she can potentially be targeted by private businesses, insurance companies, etc., which she is unlikely to want. Thus, SSL can be used to keep her information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. The use of secure login credentials, or access via SSH or SSL, can serve as a means to keep the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect to be considered, and therefore reliable and secure infrastructures or methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4. Response-time

Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics: Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example: A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or complexity of the user query, which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A slow response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to provide low latency, so that the user is provided with a part of the results early on.

Relations between dimensions: The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions, since a dataset can only perform well if it is available and has a low response time. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions

Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness

Representational-conciseness is only defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics: Representational-conciseness is measured by qualitatively verifying whether the RDF model that is used to represent the data is concise enough to be self-descriptive and unambiguous.

Example: A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human-readable is highly recommended for large-scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
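A small sketch of the corresponding check from Table 6 follows; the 80-character threshold is our own illustrative choice, since [27] flags very long URIs without fixing a number.

from urllib.parse import urlparse

def concise_uri_issues(uris, max_length=80):
    # Flag URIs that are overly long or carry query parameters, following
    # the representational-conciseness checks of [27]. The length
    # threshold is illustrative, not prescribed.
    issues = []
    for uri in uris:
        if urlparse(uri).query:
            issues.append((uri, "contains query parameters"))
        elif len(uri) > max_length:
            issues.append((uri, "longer than %d characters" % max_length))
    return issues

print(concise_uri_issues([
    "http://example.org/resource/Boston",
    "http://example.org/api?id=42&format=rdf",
]))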


Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detecting whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats; identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

Table 6: Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type - Subjective or Objective.

3.2.5.2. Representational-consistency

Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner, and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics: Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example: In our use case, consider different airline companies using different notations for representing their data, e.g., some publish their data in RDF/XML and others in Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, which is necessary to handle different formats of data. The exchange of information in different formats will then not be a problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info), SchemaCache (http://schemacache.com) and Swoogle (http://swoogle.umbc.edu). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies (the authors restrict themselves to the needs of the OBO Foundry community, but the conventions can still be applied to other domains). Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3. Understandability

Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the terms understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.

Example: Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), then the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure the clarity and understandability of their data.
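The labelling metric from Table 6 can be computed mechanically. The following sketch uses the rdflib library on a local file (a hypothetical dataset.ttl) and reports the share of described entities carrying an rdfs:label or rdfs:comment [14].

from rdflib import Graph, RDFS

def labelled_entity_ratio(path):
    # No. of entities with an rdfs:label or rdfs:comment divided by the
    # total no. of entities (subjects) described in the dataset [14].
    g = Graph()
    g.parse(path, format="turtle")
    subjects = set(g.subjects())
    labelled = {s for s in subjects
                if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}
    return len(labelled) / len(subjects) if subjects else 0.0

print(labelled_entity_ratio("dataset.ttl"))  # hypothetical input file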

3.2.5.4. Interpretability

Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user who is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport

For the first two items no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Also, interpretability contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5. Versatility

In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.

Example: Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages, through the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
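Content negotiation itself is easy to probe. The sketch below requests a (hypothetical) resource URI as Turtle with a German language preference and reports the content type the server actually returns, which is essentially the versatility check from Table 6 [14].

import urllib.request

def negotiate(url, mime="text/turtle", language="de"):
    # Ask for a specific serialization and language via HTTP Accept
    # headers and report what the server returns.
    request = urllib.request.Request(url, headers={
        "Accept": mime,
        "Accept-Language": language,
    })
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.headers.get("Content-Type")

print(negotiate("http://example.org/resource/Boston"))  # hypothetical URI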

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information, whereas interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency

Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", by providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics: The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example: Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time*)] [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source** [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Table 7: Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type - Subjective or Objective. (* start time: the time when the data was first published in the LOD; ** i.e., older than the last modification time.)
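The currency-of-statements metric from Table 7 [50] is directly computable once the temporal metadata are available; below is a minimal sketch with hypothetical timestamps.

from datetime import datetime, timezone

def currency_of_statement(last_modified, start_time, now=None):
    # Currency of statements from Table 7 [50]:
    # 1 - (now - last_modified) / (now - start_time),
    # where start_time is when the data first appeared in the LOD.
    # Values near 1 mean recently updated.
    now = now or datetime.now(timezone.utc)
    age = (now - last_modified).total_seconds()
    history = (now - start_time).total_seconds()
    return 1 - age / history if history > 0 else 0.0

print(currency_of_statement(
    datetime(2012, 11, 1, tzinfo=timezone.utc),  # hypothetical last update
    datetime(2010, 1, 1, tzinfo=timezone.utc),   # hypothetical first publication
))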

3.2.6.2. Volatility

Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics: Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example: Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes every minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed with respect to a specific data value.

3.2.6.3. Timeliness

Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics: Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
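The survey only states that timeliness combines currency and volatility; one concrete combination, taken from the general data quality literature rather than from any surveyed approach, is sketched below.

def timeliness(currency_s, volatility_s, sensitivity=1.0):
    # One classic way (Ballou et al.) to combine the two components:
    # max(0, 1 - currency/volatility) ** sensitivity. This exact formula
    # is an assumption here, not something prescribed by the survey.
    if volatility_s <= 0:
        return 0.0
    return max(0.0, 1.0 - currency_s / volatility_s) ** sensitivity

# A price updated 30 s ago that stays valid for 60 s scores 0.5.
print(timeliness(30, 60))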

Example: A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time: the user did not see that the flight's availability was depleted, because the change reached the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practically feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions

Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of the three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets, the World Wide Web, to make the data available, preferably under an open access license, and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is a problem also occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets, or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8], (b) a set of approaches covering more than five dimensions [6,14,27,42] and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches covers all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
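For illustration, this ratio can be computed as sketched below with rdflib; the whitelist of "established" vocabularies is a stand-in for a curated list, e.g., one derived from LODStats [13].

from rdflib import Graph
from rdflib.namespace import split_uri

# Stand-in whitelist; a real assessment would use a curated catalogue.
ESTABLISHED = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2000/01/rdf-schema#",
}

def vocabulary_reuse_ratio(path):
    # No. of established vocabularies used in the dataset divided by the
    # total no. of vocabularies used in the dataset.
    g = Graph()
    g.parse(path, format="turtle")
    vocabularies = set()
    for _, predicate, _ in g:
        try:
            namespace, _ = split_uri(predicate)
        except ValueError:  # URI that cannot be split into namespace/name
            continue
        vocabularies.add(str(namespace))
    if not vocabularies:
        return 0.0
    return len(vocabularies & ESTABLISHED) / len(vocabularies)

print(vocabulary_reuse_ratio("dataset.ttl"))  # hypothetical input file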

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types, or by the absence of dereferencability issues.

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured: the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved, and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

Table 8: Consideration of data quality dimensions in each of the included approaches. [In the original, this is a checkmark matrix relating the 21 included approaches (Bizer et al. 2009; Flemming 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011) to the 26 dimensions: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep-conciseness, Rep-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness and Interlinking; the individual cell values are not recoverable from this extraction.]

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22] and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated, or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure and specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss) [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/) used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF approach (http://trdf.sourceforge.net) [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php) where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given an option for assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.

Paper | Application (General/Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al. 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig 2008 | G | Trustworthiness of Data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al. 2010 | G | Generating semantically valid hypothesis | ✓ | ✓ | – | – | – | – | –
Flemming et al. 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al. 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al. 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al. 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al. 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | –

Table 9: Qualitative evaluation of frameworks.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de), used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from those sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set up the dumpLocation element as the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess data quality for a source, but instead to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine (http://code.zemanta.com/sparkica) is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation with Freebase (http://www.freebase.com) helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible (see http://refine.deri.ie/reconciliationDocs) using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness, or required a high level of interpretation.

Meanwhile, there is quite some research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment to allow a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] BECKETT, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.

[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] BRICKLEY, D., AND GUHA, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).

[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] HAYES, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/

[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).

[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] KIFER, M., AND BOLEY, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/

[33] KITCHENHAM, B. Procedures for performing systematic reviews.

[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).

[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 19: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 19

Example. In our scenario, consider a user who wants to book a flight from a city A to a city B. The search engine should ensure a secure environment for the user during the payment transaction, since her personal data is highly sensitive. If there is enough identifiable public information about the user, she can potentially be targeted by private businesses, insurance companies etc., which she is unlikely to want. Thus, SSL can be used to keep her information safe.

Security covers technical aspects of the accessibility of a dataset, such as secure login and the authentication of an information source by a trusted organization. Secure login credentials, or access via SSH or SSL, can be used as a means of keeping the information secure, especially in the case of medical and governmental data. Additionally, adequate protection of a dataset against alteration or misuse is an important aspect, for which a reliable and secure infrastructure or suitable methodologies can be applied [54]. Although security is an important dimension, it does not often apply to open data; its importance depends on whether the data needs to be protected.

3.2.4.4. Response-time. Response-time is defined as "that which measures the delay between submission of a request by the user and reception of the response from the system" [5].

Definition 18 (Response-time). Response-time measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.

Metrics. Response-time can be assessed by measuring the delay between submission of a request by the user and reception of the response from the dataset.

Example. A user is looking for a flight which includes multiple destinations. She wants to fly from Milan to Boston, Boston to New York and New York to Milan. In spite of the complexity of the query, the search engine should respond quickly with all the flight details (including connections) for all the destinations.

The response time depends on several factors, such as network traffic, server workload, server capabilities and/or the complexity of the user query, all of which affect the quality of query processing. This dimension also depends on the type and complexity of the request. A high response time hinders the usability as well as the accessibility of a dataset. Locally replicating or caching information are possible means of improving the response time. Another option is to return results incrementally with low latency, so that the user is provided with a part of the results early on.
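Operationally, this metric amounts to timing an HTTP request. The following is a minimal sketch in Python, using only the standard library (our own choice; the endpoint URL and query are placeholders, not prescribed by any of the surveyed approaches):

```python
import time
import urllib.parse
import urllib.request

def response_time(endpoint, query):
    """Delay, in seconds, between submitting a SPARQL query and
    receiving the complete response (cf. Definition 18)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=60) as response:
        response.read()  # wait until the *complete* response has arrived
    return time.monotonic() - start

# Placeholder endpoint; any public SPARQL endpoint could be substituted.
print(response_time("http://dbpedia.org/sparql",
                    "SELECT * WHERE { ?s ?p ?o } LIMIT 10"))
```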

Relations between dimensions. The dimensions in this group are inter-related as follows: the performance of a system is related to the availability and response-time dimensions, since a dataset can only perform well if it is available and has a low response time. Security is also related to the availability of a dataset, because the methods used for restricting users are tied to the way a user can access a dataset.

3.2.5. Representational dimensions
Representational dimensions capture aspects related to the design of the data, such as representational-conciseness, representational-consistency, understandability and versatility, as well as the interpretability of the data. These dimensions, along with their corresponding metrics, are listed in Table 6. The reference for each metric is provided in the table.

3.2.5.1. Representational-conciseness. Representational-conciseness is defined as "the extent to which information is compactly represented" [5].

Definition 19 (Representational-conciseness). Representational-conciseness refers to the representation of the data, which is compact and well formatted on the one hand, but also clear and complete on the other hand.

Metrics. Representational-conciseness is measured by qualitatively verifying whether the RDF model used to represent the data is concise enough to be self-descriptive and unambiguous.

Example. A user, after booking her flight, is interested in additional information about the destination airport, such as its location. Our flight search engine should provide only the information related to the location, rather than returning a chain of other properties.

Representation of RDF data in N3 format is considered to be more compact than RDF/XML [14]. A concise representation not only contributes to the human readability of the data, but also influences the performance of the data when queried. For example, [27] identifies the use of very long URIs, or of URIs that contain query parameters, as an issue related to representational-conciseness. Keeping URIs short and human readable is highly recommended for large scale and/or frequent processing of RDF data, as well as for efficient indexing and serialisation.
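As an illustration, the detection of long URIs and of URIs containing query parameters (the metric from [27]) can be sketched as follows; the length threshold of 80 characters is our own assumption, since [27] does not fix an exact limit:

```python
from urllib.parse import urlparse

MAX_URI_LENGTH = 80  # assumed threshold; [27] flags "very long" URIs without fixing a limit

def violates_conciseness(uri):
    """True if the URI is overly long or carries query parameters,
    following the representational-conciseness metric of [27]."""
    return len(uri) > MAX_URI_LENGTH or urlparse(uri).query != ""

uris = [
    "http://dbpedia.org/resource/Boston",         # short, clean: fine
    "http://example.org/api?id=4711&format=rdf",  # query parameters: flagged
]
for uri in uris:
    print(uri, "->", "flagged" if violates_conciseness(uri) else "ok")
```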

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O
Representational-consistency | re-use existing terms | detection of whether existing terms from other vocabularies have been reused [27] | O
Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S
Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O
Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O
Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O
Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O
Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O
Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O
Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O
Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O
Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S
Interpretability | interpretability of data | detect the use of self-descriptive formats; identifying objects and terms used to define the objects with globally unique identifiers [5] | O
Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S
Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O
Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O
Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O
Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O
Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O
Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding accept-header to an HTTP request [14] | O
Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O

3.2.5.2. Representational-consistency. Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency). Representational-consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics. Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies, or terms from established vocabularies, to represent its entities.

Example. In our use case, consider different airline companies using different notations for representing their data, e.g., some use RDF/XML and others use Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models and can therefore handle different formats of data. The exchange of information in different formats is not a problem for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies; although the authors restrict themselves to the needs of the OBO Foundry community, the conventions can also be applied to other domains. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.
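A minimal sketch of the corresponding ratio metric (see Section 4.2) is given below; the list of "established" vocabularies is purely illustrative and would in practice be drawn from a registry such as LODStats [13]:

```python
# Illustrative whitelist of established vocabulary namespaces.
ESTABLISHED_VOCABULARIES = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def vocabulary_reuse_ratio(vocabularies_used):
    """No. of established vocabularies used in the dataset divided by
    the total no. of vocabularies used in the dataset."""
    used = set(vocabularies_used)
    if not used:
        return 0.0
    return len(used & ESTABLISHED_VOCABULARIES) / len(used)

print(vocabulary_reuse_ratio([
    "http://xmlns.com/foaf/0.1/",        # established: counted
    "http://example.org/my-own-vocab#",  # home-grown: lowers the ratio
]))  # -> 0.5
```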

3.2.5.3. Understandability. Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics. Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how the dataset can be used.

Example. Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, the strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places etc. improve search performance in that case. For instance, when a user looks for a flight to US (a label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data, so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity and thus the understandability of their data.
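The labelling metric from [14] (see Table 6) — the number of entities with an rdfs:label or rdfs:comment over the total number of entities described — can be sketched as follows, assuming the rdflib library and a local data file (the file name is a placeholder):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def label_coverage(path, fmt="turtle"):
    """No. of entities with an rdfs:label or rdfs:comment divided by
    the total no. of entities described in the data, per [14]."""
    g = Graph()
    g.parse(path, format=fmt)
    entities = {s for s in g.subjects() if isinstance(s, URIRef)}
    labelled = {s for s in entities
                if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}
    return len(labelled) / len(entities) if entities else 0.0

print(label_coverage("dataset.ttl"))  # placeholder file name
```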

3.2.5.4. Interpretability. Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units, and the definitions are clear" [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics. Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example. Consider our flight search engine and a user who is looking for a flight from Milan to Boston. The data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport

For the first two items, no human-readable label is available; therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. The use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.
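For instance, the blank-node metric from Table 6 ([27] treats blank nodes as a source of misinterpretation, since they cannot be referenced externally) could be sketched as follows, again assuming rdflib and a placeholder local file:

```python
from rdflib import Graph, BNode

def blank_node_ratio(path, fmt="turtle"):
    """Share of nodes in the graph that are blank nodes, following the
    'misinterpretation of missing values' metric attributed to [27]."""
    g = Graph()
    g.parse(path, format=fmt)
    nodes = set(g.subjects()) | set(g.objects())
    blanks = {n for n in nodes if isinstance(n, BNode)}
    return len(blanks) / len(nodes) if nodes else 0.0

print(blank_node_ratio("dataset.ttl"))  # placeholder file name
```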

3.2.5.5. Versatility. In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics. Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.

Example. Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is able to understand it.

Provision of Linked Data in different languages, via language tags on literal values, contributes towards the versatility of a dataset. Providing a SPARQL endpoint as well as an RDF dump as access points is also an indication of the versatility of a dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
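Checking content negotiation (the versatility metric from [14] in Table 6) amounts to requesting the same resource with different Accept headers and inspecting the returned Content-Type. A minimal sketch, with an illustrative choice of formats:

```python
import urllib.request

# Serialisation formats to probe; the selection is illustrative.
FORMATS = ["text/turtle", "application/rdf+xml", "text/html"]

def negotiable_formats(resource_uri):
    """Formats in which the resource can actually be retrieved, probed
    by sending one Accept header per candidate format."""
    available = []
    for fmt in FORMATS:
        request = urllib.request.Request(resource_uri,
                                         headers={"Accept": fmt})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                if fmt in response.headers.get("Content-Type", ""):
                    available.append(fmt)
        except OSError:
            pass  # not retrievable in this format (or not at all)
    return available

print(negotiable_formats("http://dbpedia.org/resource/Boston"))
```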

Relations between dimensions. Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information, whereas interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6. Dataset dynamicity
An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1. Currency. Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified, by providing both currency of documents and currency of triples" [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that has not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need temporal metadata to be available and represented by one of the data models proposed in the literature [49].

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type – Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Currency | currency of statements | 1 − [(current time − last modified time)/(current time − start time)], where the start time is the time when the data was first published in the LOD [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source (i.e., older than the last modification time) [15] | O
Currency | age of data | current time − created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 − (time-inaccurate instances / total no. of instances) [14] | O
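The "currency of statements" metric from Table 7 can be computed directly from the timestamps, as the following sketch shows (the dates are invented for illustration):

```python
from datetime import datetime

def currency_of_statement(last_modified, start_time, now):
    """1 - (now - last modified)/(now - start time), per [50]; the start
    time is when the data was first published in the LOD. The score is
    1.0 for data modified right now and approaches 0.0 for data left
    untouched since publication."""
    age = (now - last_modified).total_seconds()
    history = (now - start_time).total_seconds()
    return 1.0 - age / history if history > 0 else 0.0

# Illustrative timestamps: published January 2010, modified November 2012.
print(currency_of_statement(datetime(2012, 11, 1),
                            datetime(2010, 1, 1),
                            now=datetime(2012, 12, 1)))  # ~0.97: fairly current
```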

3.2.6.2. Volatility. Since there is no definition of volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured from two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose that the flight price changes every minute (the data remains valid for one minute). In order to have an up-to-date price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed in relation to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.

Example. A user wants to book a flight from Milan to Boston, and our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine, corresponding to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not: the user did not perceive that the flight was no longer available, because the change reached the system only a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source values and the user's domain and purpose of interest.
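One way to operationalise the combination of currency and volatility — our own sketch, not a formula prescribed by the surveyed approaches — is to treat data as timely while the time elapsed since its last modification does not exceed the length of time it remains valid:

```python
from datetime import datetime, timedelta

def is_timely(last_modified, volatility, now):
    """Timely while the elapsed time since the last modification does
    not exceed the period during which the value remains valid."""
    return (now - last_modified) <= volatility

# Flight prices assumed valid for one minute, as in the volatility example.
print(is_timely(datetime(2012, 12, 1, 12, 0, 0),
                volatility=timedelta(minutes=1),
                now=datetime(2012, 12, 1, 12, 0, 30)))  # -> True
```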

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, whereas currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on its dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and the usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions defines data quality as a whole, nor on the exact meaning of each dimension, a problem which also occurs in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of the data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferenceability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to this initial set of dimensions, such as the problem of alternative data representations and their handling, i.e., dataset versatility. We therefore obtained a further set of quality dimensions from [14], one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were still problems that did not fit this extended list of dimensions, such as the incoherence of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. The table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used. We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

    no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. For the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferenceability issues.

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective metrics are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity or relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of a particular dimension are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but the scope and level of detail can only be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along the relevant quality dimensions. The assessment therefore involves a comparison between the obtained measurements and reference values, in order to enable a diagnosis of quality. The assessment considers different types of data, which describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. In this section, we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on these different levels. The assessment can range from small-scale units of data, such as the assessment of individual RDF triples, to the assessment of entire datasets, which potentially affects the assessment process.

Table 8. Consideration of data quality dimensions in each of the included approaches. (The table marks, for each approach, which of the 26 dimensions it considers. Approaches: Bizer et al. 2009; Flemming 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2012; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011. Dimensions: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.)

In LOD, we distinguish the assessment process operating on three types of data:

– RDF triples, which focus on the assessment of individual triples.
– RDF graphs, which focus on the assessment of entities, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on the assessment of datasets, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and fewer at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data at both the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data at both the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

If the assessment is provided at the triple level, it can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. Conversely, the assessment can be performed at the graph level and then propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of the trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferenceable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and the respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configuration for a quality assessment task. Although this gives users the flexibility of tweaking the tool to match their needs, it requires considerable time as well as an understanding of the required XML file structure and specification.

Semi-automated. The tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss) [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), and to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator) used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and the respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF approach (http://trdf.sourceforge.net) [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with both the data and the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface (available in German only, at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php) where a user first specifies the name, URI and three entities of a particular data source. The user then receives a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset which are important indicators of data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields, one showing the assigned weight and another with the calculated value.

Table 9. Qualitative evaluation of frameworks.

Paper | General/Specific | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al. 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al. 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig 2008 | G | Trustworthiness of data on the web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al. 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al. 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | –
Flemming 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al. 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al. 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al. 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al. 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al. 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | – | – | – | – | –

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, presenting the result of assessing the quality of LinkedCT with a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights to each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, the tool has a few drawbacks: (1) the user needs adequate knowledge about the dataset in order to correctly assign weights to each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de), used first to assess the quality of two or more data sources and second to fuse (integrate) the data from those data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as the integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

```xml
<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority"
               value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>
```

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.

In addition, users should specify the DataSource folder and the homepage element, which refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not primarily intended to assess the data quality of a source, but rather to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; and (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine (http://code.zemanta.com/sparkica/) is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import data in several different file types (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that the values are given meaning. Reconciliation against Freebase (http://www.freebase.com/) helps in mapping ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible using this tool (http://refine.deri.ie/reconciliationDocs). Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to use for uploading and performing basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions; and (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and the user must therefore perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions, along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of their usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Moreover, there was no formal validation of the methodologies that were implemented as tools, and only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covers all the data quality dimensions; in fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided results of only limited usefulness, or required a high level of interpretation.

Meanwhile, there is a substantial amount of research on data quality being carried out, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in assessing the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher-level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment such that it allows a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210

[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan. 2009), 1–10.

[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210

[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (Dec. 2007), 227–239.

[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).

[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210

[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).

[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211

[33] Kitchenham, B. Procedures for performing systematic reviews. Tech. rep., Keele University, 2004.

[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).

[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).

[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb. 1998), 58–65.

[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

[47] PIPINO L LEE Y W AND WANG R Y Data qualityassessment Communications of the ACM 45 4 (2002)

[48] REDMAN T C Data Quality for the Information Age 1st edArtech House 1997

[49] RULA A PALMONARI M HARTH A STADTMUumlLLERS AND MAURINO A On the diversity and availability oftemporal information in linked open data In ISWC (2012)

[50] RULA A PALMONARI M AND MAURINO A Captur-ing the age of linked open data Towards a dataset-independentframework In IEEE International Conference on SemanticComputing (2012)

[51] SCANNAPIECO M AND CATARCI T Data quality under acomputer science perspective Archivi amp Computer 2 (2002)1ndash15

[52] SCHOBER D BARRY S LEWIS E S KUSNIERCZYKW LOMAX J MUNGALL C TAYLOR F C ROCCA-SERRA P AND SANSONE S-A Survey-based naming con-ventions for use in obo foundry ontology development BMCBioinformatics 10 125 (2009)

[53] SHEKARPOUR S AND KATEBI S Modeling and evalua-tion of trust with an extension in semantic web Web Seman-tics Science Services and Agents on the World Wide Web 8 1(March 2010) 26 ndash 36

[54] SUCHANEK F M GROSS-AMBLARD D AND ABITE-BOUL S Watermarking for ontologies In ISWC (2011)

[55] WAND Y AND WANG R Y Anchoring data quality dimen-sions in ontological foundations Communications of the ACM39 11 (1996) 86ndash95

[56] WANG R Y A product perspective on total data quality man-agement Communications of the ACM 41 2 (Feb 1998) 58 ndash65

[57] WANG R Y AND STRONG D M Beyond accuracy whatdata quality means to data consumers Journal of ManagementInformation Systems 12 4 (1996) 5ndash33

Page 20: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

20 Linked Data Quality

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Representational-conciseness | keeping URIs short | detection of long URIs or those that contain query parameters [27] | O |
| Representational-consistency | re-use existing terms | detection of whether existing terms from other vocabularies have been reused [27] | O |
| Representational-consistency | re-use existing vocabularies | usage of established vocabularies [14] | S |
| Understandability | human-readable labelling of classes, properties and entities by providing rdfs:label | no. of entities described by stating an rdfs:label or rdfs:comment in the dataset / total no. of entities described in the data [14] | O |
| Understandability | indication of metadata about a dataset | checking for the presence of the title, content and URI of the dataset [14,5] | O |
| Understandability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [14] | O |
| Understandability | indication of one or more exemplary URIs | detecting whether the pattern of the URIs is provided [14] | O |
| Understandability | indication of a regular expression that matches the URIs of a dataset | detecting whether a regular expression that matches the URIs is present [14] | O |
| Understandability | indication of an exemplary SPARQL query | detecting whether examples of SPARQL queries are provided [14] | O |
| Understandability | indication of the vocabularies used in the dataset | checking whether a list of vocabularies used in the dataset is provided [14] | O |
| Understandability | provision of message boards and mailing lists | checking the effectiveness and the efficiency of the usage of the mailing list and/or the message boards [14] | O |
| Interpretability | interpretability of data | detect the use of appropriate language, symbols, units and clear definitions [5] | S |
| Interpretability | interpretability of data | detect the use of self-descriptive formats, identifying objects and terms used to define the objects with globally unique identifiers [5] | O |
| Interpretability | interpretability of terms | use of various schema languages to provide definitions for terms [5] | S |
| Interpretability | misinterpretation of missing values | detecting use of blank nodes [27] | O |
| Interpretability | dereferenced representations giving human-readable metadata | detecting use of rdfs:label to attach labels or names to resources [27] | O |
| Interpretability | atypical use of collections, containers and reification | detection of the non-standard usage of collections, containers and reification [27] | O |
| Versatility | provision of the data in different serialization formats | checking whether data is available in different serialization formats [14] | O |
| Versatility | provision of the data in various languages | checking whether data is available in different languages [14] | O |
| Versatility | application of content negotiation | checking whether data can be retrieved in accepted formats and languages by adding a corresponding Accept header to an HTTP request [14] | O |
| Versatility | accessing of data in different ways | checking whether the data is available as a SPARQL endpoint and is available for download as an RDF dump [14] | O |

Table 6. Comprehensive list of data quality metrics of the representational dimensions, how they can be measured and their type - Subjective (S) or Objective (O).

3.2.5.2 Representational-consistency

Representational-consistency is defined as "the extent to which information is represented in the same format" [5]. The definition of the representational-consistency dimension is very similar to the definition of uniformity, which refers to the re-use of established formats to represent data [14]. As stated in [27], the re-use of well-known terms to describe resources in a uniform manner increases the interoperability of data published in this manner and contributes towards the representational consistency of the dataset.

Definition 20 (Representational-consistency): Representational consistency is the degree to which the format and structure of the information conform to previously returned information. Since Linked Data involves aggregation of data from multiple sources, we extend this definition to imply compatibility not only with previous data, but also with data from other sources.

Metrics: Representational consistency can be assessed by detecting whether the dataset re-uses existing terms or vocabularies from established vocabularies to represent its entities.

Example: In our use case, consider different airline companies using different notations to represent their data, e.g., some providing RDF/XML and others Turtle. In order to avoid interoperability issues, we provide data according to the Linked Data principles, which are designed to support heterogeneous description models and can thus handle different formats of data. The exchange of information in different formats is not problematic for our search engine, since strong links are created between the datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets, but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF (http://xmlns.com/foaf/spec/) should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb (http://www.schemaweb.info/), SchemaCache (http://schemacache.com/) and Swoogle (http://swoogle.umbc.edu/). Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies; although these conventions are restricted to the needs of the OBO Foundry community, they can be applied to other domains. Another possibility is to use LODStats [13], which allows one to search for frequently used properties and classes in the LOD cloud.

3.2.5.3 Understandability

Understandability is defined as the "extent to which data is easily comprehended by the information consumer" [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability): Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.

Example: Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (the label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.
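As an illustration of the labelling metric from [14], the following minimal sketch counts the share of entities carrying an rdfs:label or rdfs:comment; it assumes the dataset is available as a local Turtle dump, and the file name is illustrative.

    from rdflib import Graph, RDFS

    g = Graph()
    g.parse("dataset.ttl", format="turtle")  # illustrative local dump

    # Entities are approximated by the distinct subjects described in the dataset.
    entities = set(g.subjects())

    # Entities with a human-readable rdfs:label or rdfs:comment.
    labelled = {s for s in entities
                if (s, RDFS.label, None) in g or (s, RDFS.comment, None) in g}

    coverage = len(labelled) / len(entities) if entities else 0.0
    print(f"Human-readable label coverage: {coverage:.2%}")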

3.2.5.4 Interpretability

Interpretability is defined as the "extent to which information is in appropriate languages, symbols, and units and the definitions are clear" [5].

Definition 22 (Interpretability): Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport

For the first two items, no human-readable label is available, therefore the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. Use of existing well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.
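The blank-node metric from [27] ("misinterpretation of missing values", cf. Table 6) can be approximated with a few lines of rdflib; this is a minimal sketch assuming a local dump, with an illustrative file name.

    from rdflib import Graph, BNode

    g = Graph()
    g.parse("dataset.ttl", format="turtle")  # illustrative local dump

    # Blank nodes have no global identifier and cannot be referenced
    # externally, so heavy use of them hurts interpretability [27].
    nodes = set(g.subjects()) | set(g.objects())
    blank = {n for n in nodes if isinstance(n, BNode)}

    ratio = len(blank) / len(nodes) if nodes else 0.0
    print(f"Share of blank nodes: {ratio:.2%}")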

3.2.5.5 Versatility

In [14], versatility is defined as the "alternative representations of the data and its handling".

Definition 23 (Versatility): Versatility mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and in different languages, as well as via different access methods.

Example: Consider a user from a non-English speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user is capable of understanding it.

Provision of Linked Data in different languages, with the use of language tags for literal values, contributes towards the versatility of the dataset. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
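Content negotiation can be probed mechanically; the following minimal sketch requests the same resource under several media types and checks whether the server honours the Accept header. The resource URI is illustrative.

    import requests

    uri = "http://example.org/resource/Boston"  # illustrative resource
    formats = ["text/turtle", "application/rdf+xml", "text/html"]

    for fmt in formats:
        # Ask for one specific serialisation and compare it with what is served.
        r = requests.get(uri, headers={"Accept": fmt}, timeout=10)
        served = r.headers.get("Content-Type", "")
        ok = r.ok and served.startswith(fmt)
        print(f"{fmt}: {'honoured' if ok else 'not honoured'} (got {served or 'n/a'})")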

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, since the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics. The reference for each metric is provided in the table.

3.2.6.1 Currency

Currency refers to "the age of data given as a difference between the current date and the date when the information was last modified", providing both the currency of documents and the currency of triples [50]. Similarly, the definition given in [42] describes currency as "the distance between the input date from the provenance graph to the current date". According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency): Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics: The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example: Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that is not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

| Dimension | Metric | Description | Type |
|---|---|---|---|
| Currency | currency of statements | 1 - [(current time - last modified time)/(current time - start time)], where the start time identifies when the data was first published in the LOD [50] | O |
| Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source, i.e., the semantic web source is older than the last modification of the original [15] | O |
| Currency | age of data | current time - created time [42] | O |
| Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O |
| Timeliness | timeliness of data | expiry time < current time [15] | O |
| Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O |
| Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O |
| Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O |

Table 7. Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured and their type - Subjective (S) or Objective (O).
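The currency-of-statements metric from [50] is straightforward to compute once the last-modified and first-published timestamps are available; below is a minimal sketch with illustrative timestamp values.

    from datetime import datetime, timedelta, timezone

    def currency_score(last_modified: datetime, start_time: datetime,
                       now: datetime) -> float:
        # Currency of statements per [50]:
        # 1 - (now - last_modified) / (now - start_time)
        age = (now - last_modified).total_seconds()
        history = (now - start_time).total_seconds()
        return 1.0 - age / history if history > 0 else 0.0

    now = datetime.now(timezone.utc)
    start = datetime(2010, 1, 1, tzinfo=timezone.utc)  # first published in the LOD
    modified = now - timedelta(hours=1)                # last price update

    print(f"Currency: {currency_score(modified, start, now):.4f}")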

3.2.6.2 Volatility

Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility): Volatility can be defined as the length of time during which the data remains valid.

Metrics: Volatility can be measured via two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example: Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose that the flight price changes each minute (the data remains valid for one minute). In order to have an up-to-date price list, the price should be updated within the time interval pre-defined by the system (one minute). If the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed with respect to a specific data value.

3.2.6.3 Timeliness

Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness): Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics: Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and excludes outdated data.
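One way to operationalise this combination, following the expiry-time metric of [15], is to derive the expiry time from the last modification plus the validity period given by the volatility dimension. The sketch below makes this assumption; the timestamp values are illustrative.

    from datetime import datetime, timedelta, timezone

    def is_timely(last_modified: datetime, validity: timedelta,
                  now: datetime) -> bool:
        # Timeliness per [15]: data is timely while its expiry time has not
        # passed; the expiry time is assumed to be the last modification
        # plus the validity period (the volatility of the value).
        return now < last_modified + validity

    now = datetime.now(timezone.utc)
    last_update = now - timedelta(minutes=3)
    price_validity = timedelta(minutes=1)  # flight prices change each minute

    print(is_timely(last_update, price_validity, now))  # False: outdated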

Example: A user wants to book a flight from Milan to Boston, and thus our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of the flight information. Although the flight values are updated on time, the information about the flight's availability does not reach the user on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, so that it never becomes outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions

Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4 Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1 Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions defines data quality as a whole, or on the exact meaning of each dimension, a problem that also occurs in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each of them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.

We can also conclude that none of the approaches covers all the data quality dimensions that are relevant for LOD.

4.2 Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset
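A minimal sketch of this ratio using rdflib is shown below; the whitelist of established vocabularies and the dataset file name are illustrative assumptions.

    from rdflib import Graph

    # Illustrative whitelist of established vocabularies.
    ESTABLISHED = {
        "http://xmlns.com/foaf/0.1/",
        "http://purl.org/dc/terms/",
        "http://www.w3.org/2004/02/skos/core#",
    }

    def namespace_of(uri: str) -> str:
        # Crude namespace split at the last '#' or '/'.
        return uri[:max(uri.rfind("#"), uri.rfind("/")) + 1]

    g = Graph()
    g.parse("dataset.ttl", format="turtle")  # illustrative local dump

    # The namespaces of all predicates approximate the vocabularies used.
    vocabularies = {namespace_of(str(p)) for p in set(g.predicates())}

    ratio = len(vocabularies & ESTABLISHED) / len(vocabularies) if vocabularies else 0.0
    print(f"Vocabulary re-use ratio: {ratio:.2%}")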

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.
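The "keeping URIs short" check lends itself to a simple predicate; in the sketch below, the length threshold is an assumption, since [27] does not fix a specific limit.

    import re

    MAX_LEN = 80  # assumed threshold; [27] fixes no specific length

    def is_concise(uri: str) -> bool:
        # Flags long URIs and URIs carrying query parameters [27].
        return len(uri) <= MAX_LEN and re.search(r"[?&=]", uri) is None

    print(is_concise("http://dbpedia.org/resource/Boston"))       # True
    print(is_concise("http://example.org/api?id=42&format=rdf"))  # False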

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three metrics, namely the accessibility of (i) the server, (ii) the SPARQL endpoint and (iii) the RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.
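The combined availability check can be scripted as simple HTTP probes; this is a sketch under the assumption that the three access points are known, and the URLs are illustrative.

    import requests

    # Hypothetical access points of a dataset: the server, the SPARQL
    # endpoint and the RDF dump (cf. the three availability metrics).
    checks = {
        "server": "http://example.org/",
        "sparql": "http://example.org/sparql?query=ASK%20%7B%3Fs%20%3Fp%20%3Fo%7D",
        "dump":   "http://example.org/dumps/data.nt.gz",
    }

    available = {}
    for name, url in checks.items():
        try:
            available[name] = requests.head(url, timeout=10, allow_redirects=True).ok
        except requests.RequestException:
            available[name] = False

    print(available)  # e.g. {'server': True, 'sparql': True, 'dump': False}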

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective metrics are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy and objectivity) or entirely objective (for example, accuracy and conciseness). But there are also cases when a particular dimension can be measured both objectively and subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3 Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

Table 8. Consideration of data quality dimensions in each of the included approaches. The table marks, for each approach (Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011), which of the 26 dimensions it considers: provenance, consistency, currency, volatility, timeliness, accuracy, completeness, amount-of-data, availability, understandability, relevancy, reputation, verifiability, interpretability, representational-conciseness, representational-consistency, licensing, performance, objectivity, believability, response-time, security, versatility, validity-of-documents, conciseness and interlinking.

– RDF triples, which focus on individual triple assessment;
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25];
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
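As a toy illustration of such propagation, the sketch below aggregates hypothetical per-triple trust ratings into a source-level rating by averaging, one simple choice of aggregation function (cf. [18]); both the triples and the ratings are made up.

    from statistics import mean

    # Hypothetical trust ratings in [0, 1] for individual statements.
    triple_ratings = {
        ("ex:flight42", "ex:price", "300"): 0.9,
        ("ex:flight42", "ex:departure", "MXP"): 0.7,
        ("ex:flight99", "ex:price", "120"): 0.4,
    }

    # Propagate triple-level ratings to the source level by aggregation;
    # here the overall source rating is the mean of its statement ratings.
    source_rating = mean(triple_ratings.values())
    print(f"Source rating: {source_rating:.2f}")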

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.

4.4 Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the involved tools are either semi-automated, fully automated or only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it requires a lot of time as well as an understanding of the required XML file structure and its specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool (http://linkeddata.informatik.hu-berlin.de/LDSrcAss) [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator (http://www.w3.org/RDF/Validator/) used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF approach (http://trdf.sourceforge.net) [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5 Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1 Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface (available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php) where a user first specifies the name, URI and three entities of a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another showing the calculated value.

Table 9. Qualitative evaluation of frameworks.

| Paper | Application (General/Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support |
|---|---|---|---|---|---|---|---|---|---|
| Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓ |
| Golbeck et al., 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓ |
| Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | – |
| Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | – |
| Gil et al., 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | – |
| Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | – |
| Hartig, 2008 | G | Trustworthiness of data on the Web | ✓ | – | – | – | ✓ | – | ✓ |
| Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
| Böhm et al., 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓ |
| Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | – |
| Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓ |
| Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓ |
| Shekarpour et al., 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | – |
| Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | – |
| Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | – |
| Jacobi et al., 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | – |
| Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | – |
| Guéret et al., 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓ |
| Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | – |
| Mendes et al., 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓ |
| Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | – | – | – | – | – |

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool.

[Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.]
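The final score is essentially a weighted aggregation of the individual metric values; a minimal sketch of such an aggregation, in the spirit of Flemming's tool [14] but with made-up metric names, weights and values, is given below.

    # Each metric has a user-assigned weight and a measured value in [0, 1];
    # the dataset score is the weighted mean scaled to 0-100.
    metrics = {
        # name: (weight, measured value) -- all values are illustrative
        "stable URIs": (1.0, 0.20),
        "dereferencability": (1.0, 0.10),
        "provision of an RDF dump": (1.0, 0.15),
    }

    total_weight = sum(w for w, _ in metrics.values())
    score = 100 * sum(w * v for w, v in metrics.values()) / total_weight
    print(f"Dataset quality score: {score:.0f} / 100")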

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation at each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust. However, the tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2 Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF, http://ldif.wbsg.de), used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from those data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as the integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

    <Sieve>
      <QualityAssessment>
        <AssessmentMetric id="sieve:recency">
          <ScoringFunction class="TimeCloseness">
            <Param name="timeSpan" value="7"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
        <AssessmentMetric id="sieve:reputation">
          <ScoringFunction class="ScoredList">
            <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
      </QualityAssessment>
    </Sieve>

Listing 1: A configuration of Sieve, a data quality assessment and data fusion tool.

In addition, the user should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly intended to assess the data quality of a source, but rather to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3 LODGRefine

LODGRefine (http://code.zemanta.com/sparkica) is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies, such that it gives meaning to the values. Reconciliation with Freebase (http://www.freebase.com) helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible (see http://refine.deri.ie/reconciliationDocs) using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data into and perform basic cleansing steps on. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5 Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, types of data and levels of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided results of only limited usefulness, or required a high level of interpretation.

Meanwhile, a considerable amount of research on data quality is being carried out, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher-level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C. and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C. and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D. and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P. and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C. and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M. and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y. and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y. and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T. and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M. and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews. Tech. rep., Keele University, 2004.
[34] Knight, S. and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C. and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M. and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S. and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y. and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y. and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 21: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 21

Metrics: Representational-consistency can be assessed by detecting whether the dataset re-uses existing vocabularies or terms from established vocabularies to represent its entities.

Example: In our use case, consider different airline companies using different notations for representing their data, e.g., some publish their data in RDF/XML and others in Turtle. In order to avoid interoperability issues, we provide data based on the Linked Data principles, which are designed to support heterogeneous description models, which is necessary to handle different formats of data. The exchange of information in different formats will not be a big deal in our search engine, since strong links are created between datasets.

Re-use of well-known vocabularies, rather than inventing new ones, not only ensures that the data is consistently represented in different datasets but also supports data integration and management tasks. In practice, for instance, when a data provider needs to describe information about people, FOAF15 should be the vocabulary of choice. Moreover, re-using vocabularies maximises the probability that data can be consumed by applications that may be tuned to well-known vocabularies, without requiring further pre-processing of the data or modification of the application. Even though there is no central repository of existing vocabularies, suitable terms can be found in SchemaWeb16, SchemaCache17 and Swoogle18. Additionally, a comprehensive survey done in [52] lists a set of naming conventions that should be used to avoid inconsistencies19. Another possibility is to use LODStats [13], which allows one to perform a search for frequently used properties and classes in the LOD cloud.

15 http://xmlns.com/foaf/spec/
16 http://www.schemaweb.info/
17 http://schemacache.com/
18 http://swoogle.umbc.edu/
19 However, they restrict themselves to considering only the needs of the OBO Foundry community; the conventions can nevertheless be applied to other domains.

3.2.5.3 Understandability. Understandability is defined as the “extent to which data is easily comprehended by the information consumer” [5]. Understandability is also related to the comprehensibility of data, i.e., the ease with which human consumers can understand and utilize the data [14]. Thus, the dimensions understandability and comprehensibility can be used interchangeably.

Definition 21 (Understandability). Understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information, where the data should be of sufficient clarity in order to be used.

Metrics: Understandability can be measured by detecting whether human-readable labels for classes, properties and entities are provided. Provision of the metadata of a dataset can also contribute towards assessing its understandability. The dataset should also clearly provide exemplary URIs and SPARQL queries, along with the vocabularies used, so that users can understand how it can be used.
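The label check lends itself to a simple ratio metric over the entities of a dataset. Below is a minimal sketch using the rdflib library, assuming the dataset fits in memory as an rdflib Graph; the function name and the restriction to rdfs:label (other labelling properties such as skos:prefLabel could equally be counted) are our own illustrative choices, not prescribed by any of the surveyed approaches.

    from rdflib import Graph, RDFS, URIRef

    def labeled_entity_ratio(graph: Graph) -> float:
        """Fraction of URI subjects carrying at least one human-readable
        rdfs:label (assumption: rdfs:label as the labelling property)."""
        subjects = {s for s in graph.subjects() if isinstance(s, URIRef)}
        if not subjects:
            return 0.0
        labeled = sum(1 for s in subjects if graph.value(s, RDFS.label) is not None)
        return labeled / len(subjects)

    g = Graph()
    g.parse("http://example.org/dataset.ttl", format="turtle")  # hypothetical dataset
    print(f"label coverage: {labeled_entity_ratio(g):.2f}")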

Example: Let us assume that our flight search engine allows a user to enter a start and destination address. In that case, strings entered by the user need to be matched to entities in the spatial dataset, probably via string similarity. Understandable labels for cities, places, etc. improve search performance in that case. For instance, when a user looks for a flight to US (label), the search engine should return the flights to the United States or America.

Understandability measures how well a source presents its data so that a user is able to understand its semantic value. In Linked Data, data publishers are encouraged to re-use well-known formats, vocabularies, identifiers, and human-readable labels and descriptions of defined classes, properties and entities, to ensure clarity in the understandability of their data.

3.2.5.4 Interpretability. Interpretability is defined as the “extent to which information is in appropriate languages, symbols, and units, and the definitions are clear” [5].

Definition 22 (Interpretability). Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer.

Metrics: Interpretability can be measured by the use of globally unique identifiers for objects and terms, or by the use of appropriate language, symbols, units and clear definitions.

Example: Consider our flight search engine and a user that is looking for a flight from Milan to Boston. Data related to Boston in the integrated data for the required flight contains the following entities:

– http://rdf.freebase.com/ns/m.049jnng
– http://rdf.freebase.com/ns/m.043j22x
– Boston Logan Airport

For the first two items, no human-readable label is available; therefore, the URI is displayed, which does not represent anything meaningful to the user besides the information that Freebase contains information about Boston Logan Airport. The third, however, contains a human-readable label, which the user can easily interpret.

The more interpretable a Linked Data source is, the easier it is to integrate with other data sources. Interpretability also contributes towards the usability of a data source. Use of existing, well-known terms, self-descriptive formats and globally unique identifiers increases the interpretability of a dataset.

3.2.5.5 Versatility. In [14], versatility is defined as the “alternative representations of the data and its handling”.

Definition 23 (Versatility). Versatility mainly refers to the alternative representations of data and their subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset.

Metrics: Versatility can be measured by the availability of the dataset in different serialisation formats and different languages, as well as via different access methods.

Example: Consider a user from a non-English-speaking country who wants to use our flight search engine. In order to cater to the needs of such users, our flight search engine should be available in different languages, so that any user has the capability to understand it.

Provision of Linked Data in different languages contributes towards the versatility of the dataset, with the use of language tags for literal values. Also, providing a SPARQL endpoint as well as an RDF dump as access points is an indication of the versatility of the dataset. Provision of resources in HTML format in addition to RDF, as suggested by the Linked Data principles, is also recommended to increase human readability. Similar to the uniformity dimension, versatility also enhances the probability of consumption and the ease of processing of the data. In order to handle the versatile representations, content negotiation should be enabled, whereby a consumer can specify accepted formats and languages by adding a corresponding Accept header to an HTTP request.
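Content negotiation itself can be exercised with a plain HTTP client. The sketch below, using Python's requests library against a hypothetical resource URI, requests the same resource once as Turtle and once as HTML; a versatile dataset would answer both with the corresponding content type.

    import requests

    uri = "http://example.org/resource/Boston"  # hypothetical resource URI

    # Ask for an RDF serialisation, then for a human-readable page,
    # by varying only the Accept header of the HTTP request.
    for accept in ("text/turtle", "text/html"):
        resp = requests.get(uri, headers={"Accept": accept}, timeout=10)
        print(accept, "->", resp.status_code, resp.headers.get("Content-Type"))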

Relations between dimensions: Understandability is related to the interpretability dimension, as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset, as the more versatile the forms in which a dataset is represented (e.g., in different languages), the more interpretable the dataset is. Although representational-consistency is not related to any of the other dimensions in this group, it is part of the representation of the dataset. In fact, representational-consistency is related to validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

3.2.6 Dataset Dynamicity

An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset proposed in the literature are currency, volatility and timeliness. Table 7 provides these three dimensions with their respective metrics; the reference for each metric is provided in the table.

3.2.6.1 Currency. Currency refers to “the age of data given as a difference between the current date and the date when the information was last modified”, providing both currency of documents and currency of triples [50]. Similarly, the definition given in [42] describes currency as “the distance between the input date from the provenance graph to the current date”. According to these definitions, currency only measures the time since the last modification, whereas we provide a definition that measures the time between a change in the real world and a change in the knowledge base.

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics: The measurement of currency relies on two components: (i) delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example: Consider a user who is looking for the price of a flight from Milan to Boston and receives updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about one hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be two hours, a price list that has not changed within this time is considered to be outdated. To determine whether the information is outdated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].

Table 7
Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type - Subjective or Objective

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time20)] [50] | O
Currency | currency of data source | last modified time of the semantic web source < last modified time of the original source21 [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time-inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

20 identifies the time when the first data have been published in the LOD
21 older than the last modification time
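The currency-of-statements metric in Table 7 is a simple computation once the timestamps are available. A minimal sketch follows, assuming the last-modified and start (first-published) times have already been extracted from the dataset's temporal metadata:

    from datetime import datetime
    from typing import Optional

    def currency_of_statement(last_modified: datetime, start_time: datetime,
                              now: Optional[datetime] = None) -> float:
        """1 - (now - last_modified) / (now - start_time), per [50]:
        close to 1.0 for freshly modified data, approaching 0.0 for data
        untouched since it was first published."""
        now = now or datetime.utcnow()
        age = (now - last_modified).total_seconds()
        lifetime = (now - start_time).total_seconds()
        return 1.0 - age / lifetime if lifetime > 0 else 0.0

    print(currency_of_statement(datetime(2012, 10, 1), datetime(2011, 1, 1),
                                now=datetime(2012, 11, 1)))  # ~0.95: recent update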

3.2.6.2 Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics: Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example: Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, suppose the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be outdated. Notice that volatility is estimated based on changes that are observed related to a specific data value.

3.2.6.3 Timeliness. Timeliness is defined as “the degree to which information is up-to-date” in [5], whereas in [14] the author defines the timeliness criterion as “the currentness of the data provided by a source”.

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics: Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and excludes outdated data.
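The survey does not fix a single formula for this combination, but following the currency example above (information stays current as long as its age does not exceed its validity period), one plausible operationalization is the sketch below; the function name and the rule are our own.

    from datetime import datetime, timedelta
    from typing import Optional

    def is_timely(last_modified: datetime, volatility: timedelta,
                  now: Optional[datetime] = None) -> bool:
        """Data counts as timely while the time elapsed since its last
        update does not exceed the length of time it remains valid."""
        now = now or datetime.utcnow()
        return (now - last_modified) <= volatility

    # Flight price assumed valid for two hours, as in the currency example:
    print(is_timely(datetime(2012, 11, 1, 9, 0), timedelta(hours=2),
                    now=datetime(2012, 11, 1, 10, 30)))  # True: 1.5 h old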

Example: A user wants to book a flight from Milan to Boston, and our flight search engine returns a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user's requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4 Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Timeliness thus depends on both the currency and volatility dimensions. Currency is in turn related to volatility, since the currency of data depends on how volatile it is: data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspectives of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1 Dimensions

The Linked Open Data paradigm is the fusion of three different research areas: the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e., the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet there were some problems that did not fit into this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the problems identified in the included approaches, while also mapping each problem to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.

We can also conclude that none of the approaches covers all data quality dimensions that are relevant for LOD.

4.2 Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset
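As a concrete illustration, this ratio can be computed by collecting the namespaces of all predicates in a dataset and intersecting them with a whitelist of established vocabularies; the whitelist below is a stand-in, since the survey does not enumerate one.

    from rdflib import Graph

    # Stand-in whitelist; any curated list of well-known namespaces would do.
    ESTABLISHED = {
        "http://xmlns.com/foaf/0.1/",
        "http://purl.org/dc/terms/",
        "http://www.w3.org/2000/01/rdf-schema#",
    }

    def namespace_of(uri: str) -> str:
        """Crude namespace extraction: cut after the last '#' or '/'."""
        cut = max(uri.rfind("#"), uri.rfind("/")) + 1
        return uri[:cut]

    def vocabulary_reuse_ratio(graph: Graph) -> float:
        """no. of established vocabularies used / total no. of vocabularies used."""
        namespaces = {namespace_of(str(p)) for _, p, _ in graph}
        if not namespaces:
            return 0.0
        return len(namespaces & ESTABLISHED) / len(namespaces)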

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches we first extracted the problem addressed along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or URIs containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.
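Such a combined check is easy to script. The sketch below probes hypothetical access points of a dataset with plain HTTP requests and covers only the accessibility part of the dimension; the endpoint URLs are invented for illustration.

    import requests

    def available(url: str) -> bool:
        """True if the URL answers with a non-error status within 10 s."""
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            return resp.status_code < 400
        except requests.RequestException:
            return False

    # Hypothetical access points of a dataset under assessment.
    checks = {
        "server": available("http://example.org/"),
        "SPARQL endpoint": available("http://example.org/sparql"),
        "RDF dump": available("http://example.org/dump.nt.gz"),
    }
    print(checks)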

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the user's perception of the respective dimension (e.g., via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, schema or ontology completeness can be measured subjectively, whereas property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand the scope and level of detail can be measured subjectively.

4.3 Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real-world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process.

Table 8
Consideration of data quality dimensions in each of the included approaches. Approaches (rows): Bizer et al. 2009; Flemming et al. 2010; Böhm et al. 2010; Chen et al. 2010; Guéret et al. 2011; Hogan et al. 2010; Hogan et al. 2012; Lei et al. 2007; Mendes et al. 2012; Mostafavi et al. 2004; Fürber et al. 2011; Rula et al. 2012; Hartig 2008; Gamble et al. 2011; Shekarpour et al. 2010; Golbeck et al. 2006; Gil et al. 2002; Golbeck et al. 2003; Gil et al. 2007; Jacobi et al. 2011; Bonatti et al. 2011. Dimensions (columns): Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.
In LOD, we distinguish the assessment process operating on three types of data:

– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and fewer at the dataset level (Table 9). Additionally, it can be seen that 9 approaches assess data at both the triple and graph levels [18,45,38,7,12,14,15,8,50], 2 approaches assess data at both the graph and dataset levels [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
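Propagation of this kind amounts to aggregating scores across levels. A minimal sketch follows, with per-triple scores keyed by the graph (source) containing them; averaging as the aggregation rule is our illustrative choice, not one mandated by the surveyed approaches.

    from collections import defaultdict
    from statistics import mean

    # (graph IRI, triple id) -> quality score in [0, 1]; toy input.
    triple_scores = {
        ("http://example.org/g1", "t1"): 0.9,
        ("http://example.org/g1", "t2"): 0.7,
        ("http://example.org/g2", "t3"): 0.4,
    }

    def propagate_to_graphs(scores):
        """Aggregate triple-level scores into a per-graph rating (mean)."""
        per_graph = defaultdict(list)
        for (graph, _), score in scores.items():
            per_graph[graph].append(score)
        return {graph: mean(values) for graph, values in per_graph.items()}

    graph_ratings = propagate_to_graphs(triple_scores)
    dataset_rating = mean(graph_ratings.values())  # one further step up
    print(graph_ratings, dataset_rating)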

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and the results should then be propagated to the dataset level.

4.4 Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, the tools involved are either semi- or fully automated or, in some cases, only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss/
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5 Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1 Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. The user then receives a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php

Table 9
Qualitative evaluation of frameworks

Paper | Application (General/Specific) | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | -
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | -
Gil et al., 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | -
Hartig, 2008 | G | Trustworthiness of data on the Web | ✓ | - | - | - | ✓ | - | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓
Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | - | - | - | - | -
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | -
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | -
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | -
Jacobi et al., 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | -
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | -
Guéret et al., 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | -
Mendes et al., 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓
Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | -

Each metric is provided with two input fields: one showing the assigned weight and another the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, presenting the result of assessing the quality of LinkedCT with a score of 15 out of 100.
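The aggregation behind such a score is a plain weighted average scaled to 0-100. A minimal sketch of that computation follows; the metric names, weights and measured values are invented for illustration, and Flemming's tool uses its own pre-defined metric set.

    # metric -> (user-assigned weight, measured value in [0, 1]); toy values.
    metrics = {
        "stable URIs": (1.0, 0.8),
        "human-readable license": (1.0, 0.0),
        "SPARQL endpoint available": (2.0, 1.0),
    }

    def weighted_quality_score(metrics) -> float:
        """Weighted average of metric values, scaled to a 0-100 score."""
        total_weight = sum(w for w, _ in metrics.values())
        weighted_sum = sum(w * v for w, v in metrics.values())
        return 100.0 * weighted_sum / total_weight if total_weight else 0.0

    print(f"{weighted_quality_score(metrics):.0f} / 100")  # 70 / 100 here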

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2 Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, which is used first to assess the quality between two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration property in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function having a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de/

    <Sieve>
      <QualityAssessment>
        <AssessmentMetric id="sieve:recency">
          <ScoringFunction class="TimeCloseness">
            <Param name="timeSpan" value="7"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
        <AssessmentMetric id="sieve:reputation">
          <ScoringFunction class="ScoredList">
            <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
      </QualityAssessment>
    </Sieve>

Listing 1. A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
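For illustration, a TimeCloseness-style scoring function can be approximated as in the sketch below, assuming (as the listing suggests) that recency is scored within a timeSpan window of days and decays linearly with age. This is our reading for illustration only, not Sieve's actual implementation.

    from datetime import datetime
    from typing import Optional

    def time_closeness(last_updated: datetime, time_span_days: float,
                       now: Optional[datetime] = None) -> float:
        """Score in [0, 1]: 1.0 for data updated right now, falling
        linearly to 0.0 for data older than time_span_days days."""
        now = now or datetime.utcnow()
        age_days = (now - last_updated).total_seconds() / 86400.0
        return max(0.0, 1.0 - age_days / time_span_days)

    print(time_closeness(datetime(2012, 10, 29), 7,
                         now=datetime(2012, 11, 1)))  # ~0.57: three days old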

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source but rather to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3 LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import data in several different file types (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies so that the values are given meaning. Reconciliation with Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

27 http://code.zemanta.com/sparkica/
28 http://www.freebase.com/
29 http://refine.deri.ie/reconciliationDocs

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to use for uploading and performing basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish “good” data are available. However, there is less focus on how to use this “good” data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher-level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.

References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan. 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[11] Carroll, J. Signing RDF graphs. Tech. Rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (Dec. 2007), 227–239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb. 1998), 58–65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 22: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

22 Linked Data Quality

For the first two items no human-readable label isavailable therefore the URI is displayed which doesnot represent anything meaningful to the user besidesthe information that Freebase contains informationabout Boston Logan Airport The third however con-tains a human-readable label which the user can easilyinterpret

The more interpretable a Linked Data source is themore easy it is to integrate with other data sourcesAlso interpretability contributes towards the usabilityof a data source Use of existing well-known termsself-descriptive formats and globally unique identifiersincrease the interpretability of a dataset

3255 Versatility In [14] versatility is defined toas the ldquoalternative representations of the data and itshandling

Definition 23 (Versatility) Versatility mainly refersto the alternative representations of data and its sub-sequent handling Additionally versatility also corre-sponds to the provision of alternative access methodsfor a dataset

Metrics Versatility can be measured by the avail-ability of the dataset in different serialisation formatsdifferent languages as well as different access meth-ods

Example Consider a user from a non-Englishspeaking country who wants to use our flight searchengine In order to cater to the needs of such usersour flight search engine should be available in differ-ent languages so that any user has the capability tounderstand it

Provision of Linked Data in different languages con-tributes towards the versatility of the dataset with theuse of language tags for literal values Also providinga SPARQL endpoint as well as an RDF dump as accesspoints is an indication of the versatility of the datasetProvision of resources in HTML format in addition toRDF as suggested by the Linked Data principles is alsorecommended to increase human readability Similarto the uniformity dimension versatility also enhancesthe probability of consumption and ease of processingof the data In order to handle the versatile representa-tions content negotiation should be enabled whereby aconsumer can specify accepted formats and languagesby adding a corresponding accept header to an HTTPrequest

Relations between dimensions Understandability isrelated to the interpretability dimension as it refers tothe subjective capability of the information consumerto comprehend information Interpretability mainlyrefers to the technical aspects of the data that is if it iscorrectly represented Versatility is also related to theinterpretability of a dataset as the more versatile formsa dataset is represented in (for eg in different lan-guages) the more interpretable a dataset is Althoughrepresentational-consistency is not related to any ofthe other dimensions in this group it is part of therepresentation of the dataset In fact representational-consistency is related to the validity-of-documents anintrinsic dimension because the invalid usage of vo-cabularies may lead to inconsistency in the documents

326 Dataset DynamicityAn important aspect of data is its update over time

The main dimensions related to the dynamicity of adataset proposed in the literature are currency volatil-ity and timeliness Table 7 provides these three dimen-sions with their respective metrics The reference foreach metric is provided in the table

3261 Currency Currency refers to ldquothe age ofdata given as a difference between the current date andthe date when the information was last modified byproviding both currency of documents and currencyof triples [50] Similarly the definition given in [42]describes currency as ldquothe distance between the inputdate from the provenance graph to the current dateAccording to these definitions currency only measuresthe time since the last modification whereas we pro-vide a definition that measures the time between achange in the real world and a change in the knowledgebase

Definition 24 (Currency). Currency refers to the speed with which the information (state) is updated after the real-world information changes.

Metrics. The measurement of currency relies on two components: (i) the delivery time (the time when the data was last modified) and (ii) the current time, both possibly present in the data models.

Example. Consider a user who is looking for the price of a flight from Milan to Boston and receives the updates via email. The currency of the price information is measured with respect to the last update of the price information. The email service sends the email with a delay of about 1 hour with respect to the time interval in which the information is determined to hold. In this way, if the currency value exceeds the time interval of the validity of the information, the result is said to be not current. If we suppose the validity of the price information (the frequency of change of the price information) to be 2 hours, a price list that is not changed within this time is considered to be out-dated. To determine whether the information is out-dated or not, we need to have temporal metadata available and represented by one of the data models proposed in the literature [49].



Table 7
Comprehensive list of data quality metrics of the dataset dynamicity dimensions, how they can be measured, and their type: Subjective (S) or Objective (O).

Dimension | Metric | Description | Type
Currency | currency of statements | 1 - [(current time - last modified time) / (current time - start time (note 20))] [50] | O
Currency | currency of data source | (last modified time of the semantic web source < last modified time of the original source) (note 21) [15] | O
Currency | age of data | current time - created time [42] | O
Volatility | no timestamp associated with the source | check if there is temporal meta-information associated with a source [38] | O
Timeliness | timeliness of data | expiry time < current time [15] | O
Timeliness | stating the recency and frequency of data validation | detection of whether the data published has been validated no more than a month ago [14] | O
Timeliness | no inclusion of outdated data | no. of triples not stating outdated data in the dataset / total no. of triples in the dataset [14] | O
Timeliness | time inaccurate representation of data | 1 - (time-inaccurate instances / total no. of instances) [14] | O

Note 20: the start time identifies when the data was first published in the LOD cloud.
Note 21: i.e., older than the last modification time.

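To make the "currency of statements" ratio from Table 7 concrete, here is a minimal Python sketch, under the assumption that the last-modified and start timestamps have already been extracted from the temporal metadata:

    from datetime import datetime
    from typing import Optional

    def statement_currency(last_modified: datetime, start_time: datetime,
                           now: Optional[datetime] = None) -> float:
        # 1 - (now - last_modified) / (now - start_time), following [50];
        # start_time is when the data was first published in the LOD cloud.
        now = now or datetime.utcnow()
        age = (now - last_modified).total_seconds()
        history = (now - start_time).total_seconds()
        return 1.0 - age / history if history > 0 else 0.0

    # A price statement modified 1 hour ago, in a source published 10 days ago:
    now = datetime(2012, 6, 1, 12, 0)
    print(statement_currency(datetime(2012, 6, 1, 11, 0),
                             datetime(2012, 5, 22, 12, 0), now))  # ~0.996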

3.2.6.2. Volatility. Since there is no definition for volatility in the core set of approaches considered in this survey, we provide one which applies to the Web of Data.

Definition 25 (Volatility). Volatility can be defined as the length of time during which the data remains valid.

Metrics. Volatility can be measured by two components: (i) the expiry time (the time when the data becomes invalid) and (ii) the input time (the time when the data was first published on the Web). These two components are combined to measure the distance between the expiry time and the input time of the published data.

Example. Let us consider the aforementioned use case, where a user wants to book a flight from Milan to Boston and is interested in the price information. The price is considered volatile information, as it changes frequently; for instance, it is estimated that the flight price changes each minute (the data remains valid for one minute). In order to have an updated price list, the price should be updated within the time interval pre-defined by the system (one minute). In case the user observes that the value has not been re-calculated by the system within the last few minutes, the values are considered to be out-dated. Notice that volatility is estimated based on changes that are observed related to a specific data value.

3.2.6.3. Timeliness. Timeliness is defined as "the degree to which information is up-to-date" in [5], whereas in [14] the author defines the timeliness criterion as "the currentness of the data provided by a source".

Definition 26 (Timeliness). Timeliness refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Metrics. Timeliness is measured by combining the two dimensions currency and volatility. Additionally, timeliness states the recency and frequency of data validation and does not include outdated data.
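The surveyed approaches do not fix a single formula for this combination; one common formulation from the general data quality literature, sketched below under that assumption, decays from 1 (just updated) to 0 (the data has outlived its validity period):

    def timeliness(time_since_update: float, volatility: float) -> float:
        # Both arguments in seconds: volatility is the length of time
        # the data remains valid; the score floors at 0 once it expires.
        return max(0.0, 1.0 - time_since_update / volatility)

    # Flight prices valid for 2 hours (7200 s), last updated 1 hour ago:
    print(timeliness(3600.0, 7200.0))  # 0.5 -- still current, but ageing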

Example. A user wants to book a flight from Milan to Boston, and thus our flight search engine will return a list of different flight connections. She picks one of them and follows all the steps to book her flight. The data contained in the airline dataset shows a flight company that is available according to the user requirements. In terms of the time-related quality dimensions, the information related to the flight is recorded and reported to the user every two minutes, which fulfils the requirement decided by our search engine that corresponds to the volatility of flight information. Although the flight values are updated on time, the information received by the user about the flight's availability is not on time. In other words, the user did not perceive that the availability of the flight was depleted, because the change was provided to the system a moment after her search.

Data should be recorded and reported as frequently as the source values change, and thus never become outdated. However, this may be neither necessary nor ideal for the user's purposes, let alone practical, feasible or cost-effective. Thus, timeliness is an important quality dimension, with its value determined by the user's judgement of whether the information is recent enough to be relevant, given the rate of change of the source value and the user's domain and purpose of interest.

3.2.6.4. Relations between dimensions. Timeliness, although part of the dataset dynamicity group, is also considered to be part of the intrinsic group, because it is independent of the user's context. In [2], the authors compare the definitions provided in the literature for these three dimensions and identify that the definitions given for currency and timeliness can often be used interchangeably. Thus, timeliness depends on both the currency and volatility dimensions. Currency is related to volatility, since the currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.

4. Comparison of selected approaches

In this section, we compare the selected approaches based on the different perspectives discussed in Section 2 (Comparison perspective of selected approaches). In particular, we analyze each approach based on the dimensions (Section 4.1), their respective metrics (Section 4.2), types of data (Section 4.3), level of automation (Section 4.4) and usability of three specific tools (Section 4.5).

4.1. Dimensions

The Linked Open Data paradigm is the fusion of three different research areas, namely the Semantic Web, to generate semantic connections among datasets; the World Wide Web, to make the data available, preferably under an open access license; and Data Management, for handling large quantities of heterogeneous and distributed data. Previously published literature provides a thorough classification of the data quality dimensions [55,57,48,30,9,46]. By analyzing these classifications, it is possible to distill a core set of dimensions, namely accuracy, completeness, consistency and timeliness. These four dimensions constitute the focus of most authors [51]. However, no consensus exists on which set of dimensions might define data quality as a whole, or on the exact meaning of each dimension, which is also a problem occurring in LOD.

As mentioned in Section 3, data quality assessment involves the measurement of data quality dimensions that are relevant to the consumer. We therefore gathered all data quality dimensions that have been reported as being relevant for LOD by analyzing the selected approaches. An initial list of data quality dimensions was first obtained from [5]. Thereafter, the problem being addressed in each approach was extracted and mapped to one or more of the quality dimensions. For example, the problems of dereferencability, the non-availability of structured data and content misreporting, as mentioned in [27], were mapped to the dimensions of completeness as well as availability.

However, not all problems present in LOD could be mapped to the initial set of dimensions, including the problem of alternative data representation and its handling, i.e. the dataset versatility. We therefore obtained a further set of quality dimensions from [14], which was one of the first studies focusing on data quality dimensions and metrics applicable to LOD. Yet, there were some problems that did not fit in this extended list of dimensions, such as the problem of incoherency of interlinking between datasets or the different aspects of the timeliness of datasets. Thus, we introduced new dimensions, such as interlinking, volatility and currency, in order to cover all the identified problems in all of the included approaches, while also mapping them to at least one dimension.

Table 8 shows the complete list of 26 Linked Data quality dimensions, along with their respective frequency of occurrence in the included approaches. This table presents the information split into three distinct groups: (a) a set of approaches focusing only on dataset provenance [23,16,53,20,18,21,17,29,8]; (b) a set of approaches covering more than five dimensions [6,14,27,42]; and (c) a set of approaches focusing on very few and specific dimensions [7,12,22,26,38,45,15,50]. Overall, it can be observed that the dimensions provenance, consistency, timeliness, accuracy and completeness are the most frequently used.


We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

no. of established vocabularies used in the dataset / total no. of vocabularies used in the dataset

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.
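A minimal sketch of such a ratio metric is shown below; the whitelist of "established" vocabularies is purely illustrative, and a real assessment would use a curated list (FOAF, Dublin Core, SKOS, etc.):

    # Illustrative whitelist of established vocabulary namespaces.
    ESTABLISHED = {"http://xmlns.com/foaf/0.1/", "http://purl.org/dc/terms/"}

    def vocabulary_reuse(vocabularies_used) -> float:
        # Ratio of established vocabularies to all vocabularies in the dataset.
        used = set(vocabularies_used)
        if not used:
            return 1.0
        return len(used & ESTABLISHED) / len(used)

    print(vocabulary_reuse({"http://xmlns.com/foaf/0.1/",
                            "http://example.org/my-ontology#"}))  # 0.5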

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed, along with the way in which it was assessed (the metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL endpoint and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types or the absence of dereferencability issues.
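A sketch of such a combined availability check follows; the access-point URLs are illustrative, real endpoints may need different probing, and each sub-check simply tests whether an HTTP request succeeds:

    import requests

    def availability_score(server: str, sparql_endpoint: str, dump: str) -> float:
        # Fraction of the three access points that answer an HTTP request.
        def reachable(url: str) -> bool:
            try:
                resp = requests.get(url, timeout=10, allow_redirects=True,
                                    stream=True)  # stream: avoid downloading dumps
                return resp.status_code < 400
            except requests.RequestException:
                return False
        checks = [reachable(server),
                  reachable(sparql_endpoint + "?query=ASK%20%7B%3Fs%20%3Fp%20%3Fo%7D"),
                  reachable(dump)]
        return sum(checks) / len(checks)

    print(availability_score("http://dbpedia.org",
                             "http://dbpedia.org/sparql",
                             "http://downloads.dbpedia.org/3.8/en/"))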

We also classify each metric as being Objectively (quantitatively) or Subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified, but depend on the user's perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy, objectivity) or entirely objective (for example, accuracy, conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches, in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

– RDF triples, which focus on individual triple assessment.
– RDF graphs, which focus on entity assessment, where entities are described by a collection of RDF triples [25].
– Datasets, which focus on dataset assessment, where a dataset is considered as a set of default and named graphs.


Table 8
Consideration of data quality dimensions in each of the included approaches.
Dimensions considered: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness, Interlinking.
Approaches (with the number of dimensions each covers): Bizer et al., 2009 (19); Flemming et al., 2010 (11); Böhm et al., 2010 (2); Chen et al., 2010 (1); Guéret et al., 2011 (2); Hogan et al., 2010 (2); Hogan et al., 2012 (7); Lei et al., 2007 (4); Mendes et al., 2012 (5); Mostafavi et al., 2004 (1); Fürber et al., 2011 (5); Rula et al., 2012 (1); Hartig, 2008 (1); Gamble et al., 2011 (1); Shekarpour et al., 2010 (1); Golbeck et al., 2006 (1); Gil et al., 2002 (1); Golbeck et al., 2003 (1); Gil et al., 2007 (2); Jacobi et al., 2011 (1); Bonatti et al., 2011 (1).
[The full per-approach checkmark matrix could not be recovered from the extracted text.]



We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22], and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment at a fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.
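A minimal sketch of the triple-to-source propagation just described, averaging per-triple ratings into a per-source score (the rating scale and identifiers are hypothetical):

    from collections import defaultdict

    def propagate_to_sources(triple_ratings):
        # Aggregate per-triple ratings into a per-source score by averaging.
        per_source = defaultdict(list)
        for (source, _triple_id), rating in triple_ratings.items():
            per_source[source].append(rating)
        return {src: sum(rs) / len(rs) for src, rs in per_source.items()}

    ratings = {("http://example.org/ds1", "t1"): 0.9,
               ("http://example.org/ds1", "t2"): 0.7,
               ("http://example.org/ds2", "t3"): 0.4}
    print(propagate_to_sources(ratings))  # ds1 -> 0.8, ds2 -> 0.4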

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimal amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. existence of a human-readable license), or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23, used in [26], checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and the respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time, while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.


Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weights and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9
Qualitative evaluation of frameworks. Application: G = General, S = Specific; Type of data: T = RDF Triple, G = RDF Graph, D = Dataset.

Paper | Application | Goal | Type of data | Degree of automation | Tool support
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | T, G | Semi-automated | Yes
Golbeck et al., 2003 | G | Trust networks on the semantic web | G | Semi-automated | Yes
Mostafavi et al., 2004 | S | Spatial data integration | T, G | - | No
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | T, G, D | - | No
Gil et al., 2007 | S | Trust assessment of web resources | G | - | No
Lei et al., 2007 | S | Assessment of semantic metadata | T, G | - | No
Hartig, 2008 | G | Trustworthiness of Data on the Web | T | Semi-automated | Yes
Bizer et al., 2009 | G | Information filtering | T, G, D | Manual | Yes
Böhm et al., 2010 | G | Data integration | T, G | Semi-automated | Yes
Chen et al., 2010 | G | Generating semantically valid hypotheses | T, G | - | No
Flemming et al., 2010 | G | Assessment of published data | T, G | Semi-automated | Yes
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | T, G, D | Semi-automated | Yes
Shekarpour et al., 2010 | G | Method for evaluating trust | G | - | No
Fürber et al., 2011 | G | Assessment of published data | T, G | - | No
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | G, D | - | No
Jacobi et al., 2011 | G | Trust assessment of web resources | G | - | No
Bonatti et al., 2011 | G | Provenance assessment for reasoning | T, G | - | No
Guéret et al., 2012 | S | Assessment of quality of links | G, D | Automated | Yes
Hogan et al., 2012 | G | Assessment of published data | T, G, D | - | No
Mendes et al., 2012 | S | Data integration | T | Manual | Yes
Rula et al., 2012 | G | Assessment of time related quality dimensions | T, G | - | No



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, displaying the result of assessing the quality of LinkedCT with a score of 15 out of 100.
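The tool's exact normalisation is not reproduced here, but a weighted aggregation of this kind plausibly looks like the following sketch (metric names, values and weights are hypothetical):

    def weighted_quality_score(values, weights):
        # Weighted average of metric values in [0, 1], scaled to 0-100.
        total = sum(weights[m] for m in values)
        return 100.0 * sum(values[m] * weights[m] for m in values) / total

    values = {"stable_uris": 1.0, "mailing_list": 0.0, "dereferencability": 0.3}
    weights = {"stable_uris": 1.0, "mailing_list": 1.0, "dereferencability": 1.0}
    print(round(weighted_quality_score(values, weights)))  # 43 (out of 100)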

On the one hand, the tool is easy to use, with form-based questions and adequate explanations for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources, and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input, along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random or PickMostFrequent.

26 http://ldif.wbsg.de/


    <Sieve>
      <QualityAssessment>
        <AssessmentMetric id="sieve:recency">
          <ScoringFunction class="TimeCloseness">
            <Param name="timeSpan" value="7"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
        <AssessmentMetric id="sieve:reputation">
          <ScoringFunction class="ScoredList">
            <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
            <Input path="?GRAPH/provenance:lastUpdated"/>
          </ScoringFunction>
        </AssessmentMetric>
      </QualityAssessment>
    </Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
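Sieve's actual scoring code is not reproduced here, but the TimeCloseness function configured in Listing 1 plausibly behaves like this sketch: a linear decay over the configured timeSpan (in days), so that freshly updated data scores 1.0 and data older than the window scores 0.0:

    from datetime import datetime

    def time_closeness(last_updated: datetime, time_span_days: float,
                       now: datetime) -> float:
        # Linear decay from 1.0 (just updated) to 0.0 (edge of the window).
        age_days = (now - last_updated).total_seconds() / 86400.0
        return max(0.0, 1.0 - age_days / time_span_days)

    print(time_closeness(datetime(2012, 6, 6), 7.0, datetime(2012, 6, 8)))  # ~0.71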

In addition, users should specify in the DataSource folder the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; therefore, the quality assessment can be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import data in several different file formats (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data.

27 http://code.zemanta.com/sparkica/


For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies so that the values are given meaning. Reconciliation against Freebase28 helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia, as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload data to and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model, and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions, along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications over the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool; that is, out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, and others were easy to use but provided only results with limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher-level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487-499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1-10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175-178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165-201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51-74.
[10] Brickley, D., and Guha, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659-666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats - an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA - a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1-8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227-239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162-176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1-136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A Semantic Web Rule Language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227-241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159-172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133-146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto - Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51-60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135-142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49-66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1-15.
[52] Schober, D., Smith, B., Lewis, S. E., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, C. F., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26-36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86-95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58-65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5-33.

Page 23: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 23

Dimension Metric Description Type

Currency currency of statements 1- [(current time - last modified time)(current time - starttime20)] [50]

O

currency of data source (last modified time of the semantic web source lt last mod-ified time of the original source)21 [15]

O

age of data current time - created time [42] OVolatility no timestamp associated with the source check if there are temporal meta-information associated to

a source [38]O

no timestamp associated with the source check if there are temporal meta-information associated toa source

O

Timeliness

timeliness of data expiry time lt current time [15] Ostating the recency and frequency of datavalidation

detection of whether the data published has been validatedno more than a month ago [14]

O

no inclusion of outdated data no of triples not stating outdated data in the dataset totalno of triples in the dataset [14]

O

time inaccurate representation of data 1 - (time-inaccurate instances total no of instances) [14] OTable 7

Comprehensive list of data quality metrics of the dataset dynamicity dimensions how it can be measured and itrsquos type - Subjective orObjective

delay of about 1 hour with respect to the time intervalin which the information is determined to hold In thisway if the currency value exceeds the time interval ofthe validity of the information the result is said to benot current If we suppose validity of the price infor-mation (the frequency of change of the price informa-tion) to be 2 hours the price list that is not changedwithin this time is considered to be out-dated To de-termine whether the information is out-dated or not weneed to have temporal metadata available and repre-sented by one of the data models proposed in the liter-ature [49]

3262 Volatility Since there is no definition forvolatility in the core set of approaches considered inthis survey we provide one which applies in the Webof Data

Definition 25 (Volatility) Volatility can be defined asthe length of time during which the data remains valid

Metrics Volatility can be measured by two compo-nents (i) the expiry time (the time when the data be-comes invalid) and (ii) the input time (the time whenthe data was first published on the Web) Both thesemetrics are combined together to measure the distancebetween the expiry time and the input time of the pub-lished data

Example Let us consider the aforementioned usecase where a user wants to book a flight from Milan toBoston and she is interested in the price informationThe price is considered as volatile information as itchanges frequently for instance it is estimated that theflight price changes each minute (data remains valid

for one minute) In order to have an updated price listthe price should be updated within the time intervalpre-defined by the system (one minute) In case theuser observes that the value is not re-calculated by thesystem within the last few minutes the values are con-sidered to be out-dated Notice that volatility is esti-mated based on changes that are observed related to aspecific data value

3263 Timeliness Timeliness is defined as ldquothedegree to which information is up-to-daterdquo in [5]whereas in [14] the author define the timeliness cri-terion as ldquothe currentness of the data provided by asource

Definition 26 Timeliness refers to the time point atwhich the data is actually used This can be interpretedas whether the information is available in time to beuseful

Metrics Timeliness is measured by combining thetwo dimensions currency and volatility Additionallytimeliness states the recency and frequency of data val-idation and does not include outdated data

Example A user wants to book a flight from Mi-lan to Boston and thus our flight search engine will re-turn a list of different flight connections She picks oneof them and follows all the steps to book her flightThe data contained in the airline dataset shows a flightcompany that is available according to the user require-ments In terms of time-related quality dimension theinformation related to the flight is recorded and re-ported to the user every two minutes which fulfils therequirement decided by our search engine that cor-

24 Linked Data Quality

responds to the volatility of a flight information Al-though the flight values are updated on time the infor-mation received to the user about the flightrsquos availabil-ity is not on time In other words the user did not per-ceive that the availability of the flight was depleted be-cause the change was provided to the system a momentafter her search

Data should be recorded and reported as frequentlyas the source values change and thus never becomeoutdated However this may not be necessary nor idealfor the userrsquos purposes let alone practical feasible orcost-effective Thus timeliness is an important qual-ity dimension with its value determined by the userrsquosjudgement of whether information is recent enough tobe relevant given the rate of change of the source valueand the userrsquos domain and purpose of interest

3264 Relations between dimensions Timelinessalthough part of the dataset dynamicity group is alsoconsidered to be part of the intrinsic group because itis independent of the users context In [2] the authorscompare the definitions provided in the literature forthese three dimensions and identify that often the def-initions given for currency and timeliness can be usedinterchangeably Thus timeliness depends on both thecurrency and volatility dimensions Currency is relatedto volatility since currency of data depends on howvolatile it is Thus data that is highly volatile must becurrent while currency is less important for data withlow volatility

4 Comparison of selected approaches

In this section we compare the selected approachesbased on the different perspectives discussed in Sec-tion 2 (Comparison perspective of selected approaches)In particular we analyze each approach based onthe dimensions (Section 41) their respective metrics(Section 42) types of data (Section 43) level of au-tomation (Section 44) and usability of the three spe-cific tools (Section 45)

41 Dimensions

The Linked Open Data paradigm is the fusion ofthree different research areas namely the SemanticWeb to generate semantic connections among datasetsthe World Wide Web to make the data available prefer-ably under an open access license and Data Manage-ment for handling large quantities of heterogeneous

and distributed data Previously published literatureprovides a thorough classification of the data qual-ity dimensions [55574830946] By analyzing theseclassifications it is possible to distill a core set ofdimensions namely accuracy completeness consis-tency and timeliness These four dimensions constitutethe focus provided by most authors [51] However noconsensus exists on which set of dimensions might de-fine data quality as a whole or the exact meaning ofeach dimension which is also a problem occurring inLOD

As mentioned in Section 3 data quality assessmentinvolves the measurement of data quality dimensionsthat are relevant to the consumer We therefore gath-ered all data quality dimensions that have been re-ported as being relevant for LOD by analyzing the se-lected approaches An initial list of data quality dimen-sions was first obtained from [5] Thereafter the prob-lem being addressed in each approach was extractedand mapped to one or more of the quality dimensionsFor example the problems of dereferencability thenon-availability of structured data and content misre-porting as mentioned in [27] were mapped to the di-mensions of completeness as well as availability

However not all problems present in LOD couldbe mapped to the initial set of dimensions includ-ing the problem of the alternative data representa-tion and its handling ie the dataset versatility Wetherefore obtained a further set of quality dimensionsfrom [14] which was one of the first studies focusingtowards data quality dimensions and metrics applica-ble to LOD Yet there were some problems that did notfit in this extended list of dimensions such as the prob-lem of incoherency of interlinking between datasets orthe different aspects of the timeliness of datasets Thuswe introduced new dimensions such as interlinkingvolatility and currency in order to cover all the identi-fied problems in all of the included approaches whilealso mapping them to at least one dimension

Table 8 shows the complete list of 26 Linked Dataquality dimensions along with their respective fre-quency of occurrence in the included approaches Thistable presents the information split into three dis-tinct groups (a) a set of approaches focusing only ondataset provenance [23165320182117298] (b) aset of approaches covering more than five dimen-sions [6142742] and (c) a set of approaches focus-ing on very few and specific dimensions [712222638451550] Overall it can be observed that the di-mensions provenance consistency timeliness accu-racy and completeness are the most frequently used

Linked Data Quality 25

We can also conclude that none of the approachescover all data quality dimensions that are relevant forLOD

42 Metrics

As defined in Section 3 a data quality metric is aprocedure for measuring an information quality di-mension We notice that most of metrics take theform of a ratio which measures the desired out-comes to the total outcomes [38] For example for therepresentational-consistency dimension the metric fordetermining the re-usage of existing vocabularies takesthe form of a ratio as

no of established vocabularies used in the datasettotal no of vocabularies used in the dataset

Other metrics which cannot be measured as a ratiocan be assessed using algorithms Tables 2 3 4 5 6and 7 provide comprehensive lists of the data qualitymetrics for each of the dimensions

For some of the included approaches the problemits corresponding metric and a dimension were clearlymentioned [145] However for the other approacheswe first extracted the problem addressed along withthe way in which it was assessed (metric) Thereafterwe mapped each problem and the corresponding met-ric to a relevant data quality dimension For exam-ple the problem related to keeping URIs short (iden-tified in [27]) measured by the present of long URIsor those containing query parameters was mappedto the representational-conciseness dimension On theother hand the problem related to the re-use of exist-ing terms (also identified in [27]) was mapped to therepresentational-consistency dimension

We observed that for a particular dimension therecan be several metrics associated with it but one metricis not associated with more than one dimension Ad-ditionally there are several ways of measuring one di-mension either individually or by combining differentmetrics As an example the availability dimension canbe measured by a combination of three other metricsnamely accessibility of the (i) server (ii) SPARQLend-point and (iii) RDF dumps Additionally avail-ability can be individually measured by the availabilityof structured data misreported content types or by theabsence of dereferencability issues

We also classify each metric as being Objectively(quantitatively) or Subjectively (qualitatively) assessedObjective metrics are those which can be quantified

or for which a concrete value can be calculated Forexample for the completeness dimension the metricssuch as schema completeness or property complete-ness can be quantified On the other hand subjectivedimensions are those which cannot be quantified butdepend on the users perception of the respective di-mension (via surveys) For example metrics belong-ing to dimensions such as objectivity relevancy highlydepend on the user and can only be qualitatively mea-sured

The ratio form of the metrics can be generally ap-plied to those metrics which can be measured quanti-tatively (objectively) There are cases when the met-rics of particular dimensions are either entirely sub-jective (for example relevancy objectivity) or entirelyobjective (for example accuracy conciseness) Butthere are also cases when a particular dimension canbe measured both objectively as well as subjectivelyFor example although completeness is perceived asthat which can be measured objectively it also in-cludes metrics which can be subjectively measuredThat is the schema or ontology completeness can bemeasured subjectively whereas the property instanceand interlinking completeness can be measured objec-tively Similarly for the amount-of-data dimension onthe one hand the number of triples instances per classinternal and external links in a dataset can be measuredobjectively but on the other hand the scope and levelof detail can be measured subjectively

43 Type of data

The goal of an assessment activity is the analysis ofdata in order to measure the quality of datasets alongrelevant quality dimensions Therefore the assessmentinvolves the comparison between the obtained mea-surements and the references values in order to enablea diagnosis of quality The assessment considers dif-ferent types of data that describe real world objects ina format that can be stored retrieved and processedby a software procedure and communicated through anetwork Thus in this section we distinguish betweenthe types of data considered in the various approachesin order to obtain an overview of how the assessmentof LOD operates on such different levels The assess-ment can be associated from small-scale units of datasuch as assessment of RDF triples to the assessmentof entire datasets which potentially affect the assess-ment process In LOD we distinguish the assessmentprocess operating on three types of data

26 Linked Data Quality

Table8C

onsiderationofdata

qualitydim

ensionsin

eachofthe

includedapproaches

Bizeretal2009

4

4

4

4

4

4

4

4

4

4

4

4

4

4

4

4

4

4

4

Flemm

ingetal2010

4

4

4

4

4

4

4

4

4

4

4

Boumlhm

etal2010

4

4

Chen

etal2010

4

Gueacuteretetal2011

4

4

Hogan

etal2010

4

4

Hogan

etal2012

4

4

4

4

4

4

4

Leietal2007

4

4

4

4

Mendes

etal2012

4

4

4

4

4

Mostafavietal2004

4Fuumlrberetal2011

4

4

4

4

4

Rula

etal20124

Hartig2008

4

Gam

bleetal2011

4

Shekarpouretal2010

4

Golbeck

etal2006

4

Giletal2002

4

Golbeck

etal2003

4

Giletal2007

4

4

Jacobietal2011

4

Bonattietal2011

4

Approaches Dimensions

Provenance

Consistency

Currency

Volatility

Timeliness

Accuracy

Completeness

Amount-of-data

Availability

Understandability

Relevancy

Reputation

Verifiability

Interpretability

Rep-conciseness

Rep-consistency

Licensing

Performance

Objectivity

Believability

Response-time

Security

Versatility

Validity-of-documents

Conciseness

Interlinking

Linked Data Quality 27

– RDF triples, which focus on individual triple assessment;

– RDF graphs, which focus on the assessment of entities, where entities are described by a collection of RDF triples [25];

– Datasets, which focus on the assessment of datasets, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22], and 4 approaches assess data at triple, graph and dataset level [20,6,26,27]. There are 2 approaches that apply the assessment only at triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
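A minimal sketch of such upward propagation, assuming per-triple scores in [0, 1] and averaging as the simplest possible aggregation choice (the names are hypothetical):

from statistics import mean

def graph_score(triple_scores):
    # Aggregate the scores of the triples that make up one graph.
    return mean(triple_scores)

def dataset_score(graph_scores):
    # Aggregate per-graph scores into a single dataset-level score.
    return mean(graph_scores)

graphs = {"g1": [0.9, 0.8, 1.0], "g2": [0.4, 0.6]}   # triple-level scores
per_graph = {g: graph_score(s) for g, s in graphs.items()}
print(per_graph)                           # g1 ≈ 0.9, g2 = 0.5
print(dataset_score(per_graph.values()))   # ≈ 0.7

Downward propagation, as in [53], works the other way around: a graph-level trust rating is simply assigned to each triple contained in that graph.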

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated or, sometimes, only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources), and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. the existence of a human-readable license), or they have to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and their respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss

23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool

Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying the dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9: Qualitative evaluation of frameworks. (Application goal: G = general, S = specific; ✓ = applies, – = does not apply.)

Paper | Application Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi-automated | Automated | Tool support
Gil et al. 2002 | G: Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al. 2003 | G: Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al. 2004 | S: Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck 2006 | G: Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al. 2007 | S: Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al. 2007 | S: Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig 2008 | G: Trustworthiness of data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al. 2009 | G: Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al. 2010 | G: Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al. 2010 | G: Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | –
Flemming et al. 2010 | G: Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al. 2010 | G: Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al. 2010 | G: Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al. 2011 | G: Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al. 2011 | G: Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al. 2011 | G: Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al. 2011 | G: Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al. 2012 | S: Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al. 2012 | G: Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al. 2012 | S: Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al. 2012 | G: Assessment of time-related quality dimensions | ✓ | ✓ | – | – | – | – | –


Each metric is provided with two input fields: one showing the assigned weight and another the calculated value.

In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool with the result of assessing the quality of LinkedCT, which received a score of 15 out of 100.
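In essence, the final score is a weighted aggregation of the individual metric values. The following sketch illustrates such a computation with hypothetical metric values and weights; it is not Flemming's actual formula, which is defined in [14]:

def weighted_score(metrics):
    # metrics: list of (measured value in [0, 1], user-assigned weight).
    total_weight = sum(w for _, w in metrics)
    return 100 * sum(v * w for v, w in metrics) / total_weight

metrics = [(0.2, 1), (0.1, 1), (0.15, 2)]   # three hypothetical metrics
print(round(weighted_score(metrics)))        # 15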

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of the weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality between two or more data sources and second to fuse (integrate) the data from the data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool.
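For illustration, the TimeCloseness function referenced in Listing 1 can be pictured as a score that decays linearly from 1 to 0 over the configured timeSpan. The following sketch reflects our reading of these semantics, not Sieve's actual implementation:

from datetime import datetime, timedelta

def time_closeness(last_updated, now, time_span_days=7):
    # Score of 1 for data updated just now, linearly decaying to 0
    # for data older than the configured time span.
    age = now - last_updated
    if age >= timedelta(days=time_span_days):
        return 0.0
    return 1.0 - age / timedelta(days=time_span_days)

now = datetime(2012, 6, 15)
print(time_closeness(datetime(2012, 6, 14), now))  # ≈ 0.86 (1 day old)
print(time_closeness(datetime(2012, 6, 1), now))   # 0.0 (older than 7 days)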

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, which is an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface.

27 http://code.zemanta.com/sparkica


By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that the values are given meaning. Reconciliation to Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as export the data as RDF, which adds to the uniformity and usability of the dataset.
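The per-column cleaning steps just described can be pictured with a toy sketch in plain Python (our own illustration, not LODGRefine's actual engine):

# One column of a messy dataset: trailing spaces, duplicates, blanks.
column = ["Berlin ", "Leipzig", "Berlin", "", "Milan"]

trimmed = [v.strip() for v in column]              # fix trailing spaces
duplicates = {v for v in trimmed if trimmed.count(v) > 1 and v}
filled = [v if v else "UNKNOWN" for v in trimmed]  # replace blank cells

print(duplicates)  # {'Berlin'}
print(filled)      # ['Berlin', 'Leipzig', 'Berlin', 'UNKNOWN', 'Milan']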

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, types of data and levels of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs

Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite a lot of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.

[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).

[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] BECKETT, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210.

[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.

[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan. 2009), 1–10.

[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.

[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.

[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.

[10] BRICKLEY, D., AND GUHA, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210.

[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.

[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.

[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.

[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.

[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).

[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.

[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.

[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.

[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).

[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).

[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).

[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).

[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).

[24] HAYES, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210.

[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.

[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).

[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).

[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.

[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.

[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.

[32] KIFER, M., AND BOLEY, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211.

[33] KITCHENHAM, B. Procedures for performing systematic reviews.

[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.

[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.

[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.

[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.

[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.

[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.

[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).

[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).

[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).

[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).

[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).

[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.

[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).

[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.

[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).

[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).

[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.

[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb. 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

[39] LEO PIPINO RICAHRD WANG D K AND RYBOLD WDeveloping Measurement Scales for Data-Quality Dimen-sions vol 1 ME Sharpe New York 2005

[40] MAYNARD D PETERS W AND LI Y Metrics for evalua-

Linked Data Quality 33

tion of ontology-based information extraction In Workshop onldquoEvaluation of Ontologies for the Webrdquo (EON) at WWW (May2006)

[41] MEILICKE C AND STUCKENSCHMIDT H Incoherence asa basis for measuring the quality of ontology mappings In3rd International Workshop on Ontology Matching (OM) at theISWC (2008)

[42] MENDES P MUumlHLEISEN H AND BIZER C Sieve Linkeddata quality assessment and fusion In LWDM (March 2012)

[43] MILLER P STYLES R AND HEATH T Open data com-mons a license for open data In LDOW (2008)

[44] MOHER D LIBERATI A TETZLAFF J ALTMAN D GAND PRISMA GROUP Preferred reporting items for system-atic reviews and meta-analyses the prisma statement PLoSmedicine 6 7 (2009)

[45] MOSTAFAVI M G E AND JEANSOULIN R Ontology-based method for quality assessment of spatial data basesIn International Symposium on Spatial Data Quality (2004)vol 4 pp 49ndash66

[46] NAUMANN F Quality-Driven Query Answering for Inte-grated Information Systems vol 2261 of Lecture Notes inComputer Science Springer-Verlag 2002

[47] PIPINO L LEE Y W AND WANG R Y Data qualityassessment Communications of the ACM 45 4 (2002)

[48] REDMAN T C Data Quality for the Information Age 1st edArtech House 1997

[49] RULA A PALMONARI M HARTH A STADTMUumlLLERS AND MAURINO A On the diversity and availability oftemporal information in linked open data In ISWC (2012)

[50] RULA A PALMONARI M AND MAURINO A Captur-ing the age of linked open data Towards a dataset-independentframework In IEEE International Conference on SemanticComputing (2012)

[51] SCANNAPIECO M AND CATARCI T Data quality under acomputer science perspective Archivi amp Computer 2 (2002)1ndash15

[52] SCHOBER D BARRY S LEWIS E S KUSNIERCZYKW LOMAX J MUNGALL C TAYLOR F C ROCCA-SERRA P AND SANSONE S-A Survey-based naming con-ventions for use in obo foundry ontology development BMCBioinformatics 10 125 (2009)

[53] SHEKARPOUR S AND KATEBI S Modeling and evalua-tion of trust with an extension in semantic web Web Seman-tics Science Services and Agents on the World Wide Web 8 1(March 2010) 26 ndash 36

[54] SUCHANEK F M GROSS-AMBLARD D AND ABITE-BOUL S Watermarking for ontologies In ISWC (2011)

[55] WAND Y AND WANG R Y Anchoring data quality dimen-sions in ontological foundations Communications of the ACM39 11 (1996) 86ndash95

[56] WANG R Y A product perspective on total data quality man-agement Communications of the ACM 41 2 (Feb 1998) 58 ndash65

[57] WANG R Y AND STRONG D M Beyond accuracy whatdata quality means to data consumers Journal of ManagementInformation Systems 12 4 (1996) 5ndash33

Page 25: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 25

We can also conclude that none of the approaches cover all data quality dimensions that are relevant for LOD.

4.2. Metrics

As defined in Section 3, a data quality metric is a procedure for measuring an information quality dimension. We notice that most of the metrics take the form of a ratio, which measures the desired outcomes relative to the total outcomes [38]. For example, for the representational-consistency dimension, the metric for determining the re-usage of existing vocabularies takes the form of the ratio:

(no. of established vocabularies used in the dataset) / (total no. of vocabularies used in the dataset)
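To make this concrete, the following is a minimal sketch (in Python) of how such a ratio metric could be computed over the set of vocabulary namespaces used in a dataset; the whitelist ESTABLISHED_VOCABS and the example namespaces are our own illustrative assumptions, not part of any surveyed approach:

# Minimal sketch of a ratio-form metric: re-usage of established vocabularies.
# ESTABLISHED_VOCABS is a hypothetical whitelist; in practice it might be
# derived from a registry of widely used vocabularies (assumption).

ESTABLISHED_VOCABS = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://www.w3.org/2004/02/skos/core#",
}

def vocabulary_reuse_ratio(used_vocabs):
    """Return |established vocabularies used| / |all vocabularies used|."""
    if not used_vocabs:
        return 0.0
    established = {v for v in used_vocabs if v in ESTABLISHED_VOCABS}
    return len(established) / len(used_vocabs)

# Example: a dataset using FOAF, DC terms and one custom vocabulary.
used = {
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/dc/terms/",
    "http://example.org/myvocab#",  # hypothetical dataset-specific vocabulary
}
print(vocabulary_reuse_ratio(used))  # 2/3 ~ 0.67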

Other metrics, which cannot be measured as a ratio, can be assessed using algorithms. Tables 2, 3, 4, 5, 6 and 7 provide comprehensive lists of the data quality metrics for each of the dimensions.

For some of the included approaches, the problem, its corresponding metric and a dimension were clearly mentioned [14,5]. However, for the other approaches, we first extracted the problem addressed along with the way in which it was assessed (metric). Thereafter, we mapped each problem and the corresponding metric to a relevant data quality dimension. For example, the problem related to keeping URIs short (identified in [27]), measured by the presence of long URIs or those containing query parameters, was mapped to the representational-conciseness dimension. On the other hand, the problem related to the re-use of existing terms (also identified in [27]) was mapped to the representational-consistency dimension.

We observed that for a particular dimension there can be several metrics associated with it, but one metric is not associated with more than one dimension. Additionally, there are several ways of measuring one dimension, either individually or by combining different metrics. As an example, the availability dimension can be measured by a combination of three other metrics, namely accessibility of the (i) server, (ii) SPARQL end-point and (iii) RDF dumps. Additionally, availability can be individually measured by the availability of structured data, misreported content types, or by the absence of dereferencability issues.
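As an illustration of how such a combined metric might be operationalised, the sketch below probes the three accessibility sub-metrics with plain HTTP requests and averages them into a single availability score. The URLs are placeholders, and the equal-weight aggregation is our assumption for illustration, not a prescription from any of the surveyed approaches:

import requests  # third-party HTTP library

def is_accessible(url, timeout=10):
    """Return 1 if the resource answers with a non-error status, else 0."""
    try:
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        return 1 if response.status_code < 400 else 0
    except requests.RequestException:
        return 0

def availability_score(server_url, sparql_endpoint, dump_url):
    """Average the three accessibility sub-metrics into one score in [0, 1]."""
    checks = [
        is_accessible(server_url),
        is_accessible(sparql_endpoint),
        is_accessible(dump_url),
    ]
    return sum(checks) / len(checks)

# Placeholder URLs for illustration only.
print(availability_score(
    "http://example.org/",
    "http://example.org/sparql",
    "http://example.org/dumps/data.nt.gz",
))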

We also classify each metric as being objectively (quantitatively) or subjectively (qualitatively) assessed. Objective metrics are those which can be quantified, or for which a concrete value can be calculated. For example, for the completeness dimension, metrics such as schema completeness or property completeness can be quantified. On the other hand, subjective dimensions are those which cannot be quantified but depend on the users' perception of the respective dimension (e.g. via surveys). For example, metrics belonging to dimensions such as objectivity and relevancy highly depend on the user and can only be qualitatively measured.

The ratio form of the metrics can generally be applied to those metrics which can be measured quantitatively (objectively). There are cases when the metrics of particular dimensions are either entirely subjective (for example, relevancy or objectivity) or entirely objective (for example, accuracy or conciseness). But there are also cases when a particular dimension can be measured both objectively as well as subjectively. For example, although completeness is perceived as a dimension that can be measured objectively, it also includes metrics which can be subjectively measured. That is, the schema or ontology completeness can be measured subjectively, whereas the property, instance and interlinking completeness can be measured objectively. Similarly, for the amount-of-data dimension, on the one hand, the number of triples, instances per class, and internal and external links in a dataset can be measured objectively, but on the other hand, the scope and level of detail can be measured subjectively.
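For instance, the objectively measurable property completeness mentioned above can be computed as a simple ratio of subjects carrying a value for a property to all subjects expected to have one. A minimal sketch, assuming the data is already loaded as (subject, property, value) triples; the example resources are hypothetical:

def property_completeness(triples, subjects, prop):
    """Ratio of subjects with at least one value for `prop`
    to all subjects expected to have it."""
    if not subjects:
        return 0.0
    populated = {s for (s, p, _) in triples if p == prop and s in subjects}
    return len(populated) / len(subjects)

# Toy example: two of three persons have a birth date recorded.
triples = [
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:bob",   "ex:birthDate", "1975-05-23"),
    ("ex:carol", "ex:name",      "Carol"),
]
persons = {"ex:alice", "ex:bob", "ex:carol"}
print(property_completeness(triples, persons, "ex:birthDate"))  # 2/3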

4.3. Type of data

The goal of an assessment activity is the analysis of data in order to measure the quality of datasets along relevant quality dimensions. Therefore, the assessment involves the comparison between the obtained measurements and the reference values, in order to enable a diagnosis of quality. The assessment considers different types of data that describe real world objects in a format that can be stored, retrieved and processed by a software procedure, and communicated through a network. Thus, in this section, we distinguish between the types of data considered in the various approaches in order to obtain an overview of how the assessment of LOD operates on such different levels. The assessment can range from small-scale units of data, such as the assessment of RDF triples, to the assessment of entire datasets, which potentially affects the assessment process. In LOD, we distinguish the assessment process operating on three types of data:

Table 8. Consideration of data quality dimensions in each of the included approaches. [The checkmark matrix of this table could not be recovered from the source. Its rows are the 21 included approaches (Gil et al., 2002 through Rula et al., 2012) and its columns the 26 dimensions: Provenance, Consistency, Currency, Volatility, Timeliness, Accuracy, Completeness, Amount-of-data, Availability, Understandability, Relevancy, Reputation, Verifiability, Interpretability, Rep.-conciseness, Rep.-consistency, Licensing, Performance, Objectivity, Believability, Response-time, Security, Versatility, Validity-of-documents, Conciseness and Interlinking.]

– RDF triples, which focus on individual triple assessment.

– RDF graphs, which focus on entities assessment, where entities are described by a collection of RDF triples [25].

– Datasets, which focus on datasets assessment, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level, and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at graph and dataset level [16,22] and 4 approaches assess data at triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated to the source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is the RDF triple level, or to a more generic one, that is the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].
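A minimal sketch of this kind of propagation, under the simplifying assumption that a graph-level score is the mean of its triple-level scores and that a graph-level rating is inherited unchanged by its triples; both aggregation choices are ours, for illustration only:

from statistics import mean

def propagate_up(triple_scores):
    """Propagate triple-level scores to the graph level (here: mean)."""
    return mean(triple_scores) if triple_scores else 0.0

def propagate_down(graph_score, n_triples):
    """Propagate a graph-level score to its triples (here: inheritance)."""
    return [graph_score] * n_triples

# Triple-level assessments of one graph, propagated upwards ...
graph_score = propagate_up([0.9, 0.7, 1.0])   # 0.866...
# ... and a graph-level trust rating propagated back to its 3 triples.
print(graph_score, propagate_down(0.8, 3))    # 0.866... [0.8, 0.8, 0.8]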

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated or, sometimes, only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated: The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e. SPARQL endpoints and/or dereferencable resources), and a set of new triples as input, and generates quality assessment reports.

Manual: The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although it gives the user the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated: The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g. existence of a human-readable license), or they have to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and their respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users, based on the trust network it builds. Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender for each email message, be it at a general level or with regard to a certain topic.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net

Fig. 4. Excerpt of Flemming's Data Quality Assessment tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding of not only the resources users want to access, but also of how the tools work and their configuration. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section, we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100, indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given an option for assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics, or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of the data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields: one showing the assigned weight and another with the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php

Table 9. Qualitative evaluation of frameworks. (Goal: G = general, S = specific; Type of data: T = RDF triple, Gr = RDF graph, D = dataset; Degree of automation: M = manual, SA = semi-automated, A = automated; Tool = tool support)

Paper | G/S | Application goal | T | Gr | D | M | SA | A | Tool
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al., 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig, 2008 | G | Trustworthiness of data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | –
Flemming et al., 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al., 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al., 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al., 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al., 2012 | G | Assessment of time related quality dimensions | ✓ | ✓ | – | – | – | – | –


In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool, showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.
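The final score can be understood as a weighted aggregation of the individual metric values, normalised to the 0–100 range. Below is a sketch of such a computation under that assumption; the metric names, values and weights are hypothetical, and the tool's actual weighting scheme may differ:

def quality_score(metric_values, weights):
    """Weighted average of per-metric values in [0, 1], scaled to 0..100."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(metric_values[m] * w for m, w in weights.items())
    return 100 * weighted / total_weight

# Hypothetical metric values and user-assigned weights (option (a): all 1).
values  = {"stable_uris": 0.2, "dereferencability": 0.1, "mailing_list": 0.0}
weights = {"stable_uris": 1.0, "dereferencability": 1.0, "mailing_list": 1.0}
print(quality_score(values, weights))  # 10.0 - a low score, comparable in
                                       # spirit to LinkedCT's 15 out of 100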

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, the assignment of weights for each metric and the calculation of the score are straightforward and easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of the data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, which is used first to assess the quality of two or more data sources and, second, to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration.properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValue, ByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a Data Quality Assessment and Data Fusion tool

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified. In particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered as an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import several different file types of data (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that it gives meaning to the values. Reconciliation with Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

27 http://code.zemanta.com/sparkica

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, as well as to upload and perform basic cleansing steps on raw data. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions with this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.
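The kinds of preliminary cleaning steps described above (trailing white spaces, duplicate rows) are straightforward to emulate programmatically. A small sketch over tabular data, purely to illustrate what LODGRefine automates interactively; the toy rows are our own example:

def clean_rows(rows):
    """Strip leading/trailing whitespace per cell and drop duplicate rows,
    preserving the original order of first occurrences."""
    seen, cleaned = set(), []
    for row in rows:
        normalized = tuple(cell.strip() for cell in row)
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(list(normalized))
    return cleaned

# Toy example with a trailing space and an exact duplicate.
rows = [["Alice", "Berlin "], ["Alice", "Berlin"], ["Bob", "Paris"]]
print(clean_rows(rows))  # [['Alice', 'Berlin'], ['Bob', 'Paris']]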

5. Conclusions and future work

In this paper, we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, the degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs


As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is quite some research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment, comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment such that it allows a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] AGRAWAL, R., AND SRIKANT, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] BATINI, C., CAPPIELLO, C., FRANCALANCI, C., AND MAURINO, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] BATINI, C., AND SCANNAPIECO, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] BECKETT, D. RDF/XML Syntax Specification (Revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210
[5] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] BÖHM, C., NAUMANN, F., ABEDJAN, Z., FENZ, D., GRÜTZE, T., HEFENBROCK, D., POHL, M., AND SONNABEND, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] BONATTI, P. A., HOGAN, A., POLLERES, A., AND SAURO, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] BOVEE, M., SRIVASTAVA, R. P., AND MAK, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] BRICKLEY, D., AND GUHA, R. V. RDF Vocabulary Description Language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210
[11] CARROLL, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] CHEN, P., AND GARCIA, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] DEMTER, J., AUER, S., MARTIN, M., AND LEHMANN, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] FÜRBER, C., AND HEPP, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] GAMBLE, M., AND GOBLE, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] GIL, Y., AND ARTZ, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] GIL, Y., AND RATNAKAR, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] GOLBECK, J. Inferring reputation on the semantic web. In WWW (2004).
[20] GOLBECK, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] GOLBECK, J., PARSIA, B., AND HENDLER, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] GUÉRET, C., GROTH, P., STADLER, C., AND LEHMANN, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] HARTIG, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] HAYES, P. RDF Semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210
[25] HEATH, T., AND BIZER, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011, ch. 2, pp. 1–136.
[26] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).
[27] HOGAN, A., UMBRICH, J., HARTH, A., CYGANIAK, R., POLLERES, A., AND DECKER, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] HORROCKS, I., PATEL-SCHNEIDER, P., BOLEY, H., TABET, S., GROSOF, B., AND DEAN, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] JACOBI, I., KAGAL, L., AND KHANDELWAL, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] KIFER, M., AND BOLEY, H. RIF Overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211
[33] KITCHENHAM, B. Procedures for performing systematic reviews. Tech. rep., Keele University, 2004.
[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto – Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).
[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND THE PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).
[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.


[20] GOLBECK J Using trust and provenance for content filteringon the semantic web In Workshop on Models of Trust on theWeb at the 15th World Wide Web Conference (2006)

[21] GOLBECK J PARSIA B AND HENDLER J Trust networkson the semantic web In Cooperative Intelligent Agents (2003)

[22] GUEacuteRET C GROTH P STADLER C AND LEHMANN JAssessing linked data mappings using network measures InESWC (2012)

[23] HARTIG O Trustworthiness of data on the web In STI Berlinand CSW PhD Workshop Berlin Germany (2008)

[24] HAYES P Rdf semantics Recommendation World Wide WebConsortium 2004 httpwwww3orgTR2004REC-rdf-mt-20040210

[25] HEATH T AND BIZER C Linked Data Evolving the Webinto a Global Data Space 1st ed No 11 in Synthesis Lectureson the Semantic Web Theory and Technology Morgan andClaypool 2011 ch 2 pp 1 ndash 136

[26] HOGAN A HARTH A PASSANT A DECKER S AND

POLLERES A Weaving the pedantic web In LDOW (2010)[27] HOGAN A UMBRICH J HARTH A CYGANIAK R

POLLERES A AND DECKER S An empirical survey oflinked data conformance Journal of Web Semantics (2012)

[28] HORROCKS I PATEL-SCHNEIDER P BOLEY H TABETS GROSOF B AND DEAN M Swrl A semantic web rulelanguage combining owl and ruleml Tech rep W3C May2004

[29] JACOBI I KAGAL L AND KHANDELWAL A Rule-basedtrust assessment on the semantic web In International confer-ence on Rule-based reasoning programming and applicationsseries (2011) pp 227 ndash 241

[30] JARKE M LENZERINI M VASSILIOU Y AND VASSIL-IADIS P Fundamentals of Data Warehouses 2nd ed SpringerPublishing Company 2010

[31] JURAN J The Quality Control Handbook McGraw-Hill NewYork 1974

[32] KIFER M AND BOLEY H Rif overview Tech repW3C June 2010 httpwwww3orgTR2012NOTE-rif-overview-20121211

[33] KITCHENHAM B Procedures for performing systematic re-views

[34] KNIGHT S AND BURN J Developing a framework for as-sessing information quality on the world wide web Informa-tion Science 8 (2005) 159 ndash 172

[35] LEE Y W STRONG D M KAHN B K AND WANGR Y Aimq a methodology for information quality assess-ment Information Management 40 2 (2002) 133 ndash 146

[36] LEHMANN J GERBER D MORSEY M AND NGONGA

NGOMO A-C DeFacto - Deep Fact Validation In ISWC(2012) Springer Berlin Heidelberg

[37] LEI Y NIKOLOV A UREN V AND MOTTA E Detectingquality problems in semantic metadata without the presence ofa gold standard In Workshop on ldquoEvaluation of Ontologies forthe Webrdquo (EON) at the WWW (2007) pp 51ndash60

[38] LEI Y UREN V AND MOTTA E A framework for eval-uating semantic metadata In 4th International Conference onKnowledge Capture (2007) no 8 in K-CAP rsquo07 ACM pp 135ndash 142

[39] LEO PIPINO RICAHRD WANG D K AND RYBOLD WDeveloping Measurement Scales for Data-Quality Dimen-sions vol 1 ME Sharpe New York 2005

[40] MAYNARD D PETERS W AND LI Y Metrics for evalua-

Linked Data Quality 33

tion of ontology-based information extraction In Workshop onldquoEvaluation of Ontologies for the Webrdquo (EON) at WWW (May2006)

[41] MEILICKE C AND STUCKENSCHMIDT H Incoherence asa basis for measuring the quality of ontology mappings In3rd International Workshop on Ontology Matching (OM) at theISWC (2008)

[42] MENDES P MUumlHLEISEN H AND BIZER C Sieve Linkeddata quality assessment and fusion In LWDM (March 2012)

[43] MILLER P STYLES R AND HEATH T Open data com-mons a license for open data In LDOW (2008)

[44] MOHER D LIBERATI A TETZLAFF J ALTMAN D GAND PRISMA GROUP Preferred reporting items for system-atic reviews and meta-analyses the prisma statement PLoSmedicine 6 7 (2009)

[45] MOSTAFAVI M G E AND JEANSOULIN R Ontology-based method for quality assessment of spatial data basesIn International Symposium on Spatial Data Quality (2004)vol 4 pp 49ndash66

[46] NAUMANN F Quality-Driven Query Answering for Inte-grated Information Systems vol 2261 of Lecture Notes inComputer Science Springer-Verlag 2002

[47] PIPINO L LEE Y W AND WANG R Y Data qualityassessment Communications of the ACM 45 4 (2002)

[48] REDMAN T C Data Quality for the Information Age 1st edArtech House 1997

[49] RULA A PALMONARI M HARTH A STADTMUumlLLERS AND MAURINO A On the diversity and availability oftemporal information in linked open data In ISWC (2012)

[50] RULA A PALMONARI M AND MAURINO A Captur-ing the age of linked open data Towards a dataset-independentframework In IEEE International Conference on SemanticComputing (2012)

[51] SCANNAPIECO M AND CATARCI T Data quality under acomputer science perspective Archivi amp Computer 2 (2002)1ndash15

[52] SCHOBER D BARRY S LEWIS E S KUSNIERCZYKW LOMAX J MUNGALL C TAYLOR F C ROCCA-SERRA P AND SANSONE S-A Survey-based naming con-ventions for use in obo foundry ontology development BMCBioinformatics 10 125 (2009)

[53] SHEKARPOUR S AND KATEBI S Modeling and evalua-tion of trust with an extension in semantic web Web Seman-tics Science Services and Agents on the World Wide Web 8 1(March 2010) 26 ndash 36

[54] SUCHANEK F M GROSS-AMBLARD D AND ABITE-BOUL S Watermarking for ontologies In ISWC (2011)

[55] WAND Y AND WANG R Y Anchoring data quality dimen-sions in ontological foundations Communications of the ACM39 11 (1996) 86ndash95

[56] WANG R Y A product perspective on total data quality man-agement Communications of the ACM 41 2 (Feb 1998) 58 ndash65

[57] WANG R Y AND STRONG D M Beyond accuracy whatdata quality means to data consumers Journal of ManagementInformation Systems 12 4 (1996) 5ndash33

Page 27: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 27

Data quality assessment can be applied at different levels of granularity:

– RDF triples, which focus on the assessment of individual triples.
– RDF graphs, which focus on the assessment of entities, where an entity is described by a collection of RDF triples [25].
– Datasets, which focus on the assessment of datasets, where a dataset is considered as a set of default and named graphs.

We can observe that most of the methods are applicable at the triple or graph level and less at the dataset level (Table 9). Additionally, it is seen that 9 approaches assess data both at the triple and graph level [18,45,38,7,12,14,15,8,50], 2 approaches assess data both at the graph and dataset level [16,22], and 4 approaches assess data at the triple, graph and dataset levels [20,6,26,27]. There are 2 approaches that apply the assessment only at the triple level [23,42] and 4 approaches that apply it only at the graph level [21,17,53,29].

However, if the assessment is provided at the triple level, this assessment can usually be propagated to a higher level, such as the graph or dataset level. For example, in order to assess the rating of a single source, the overall rating of the statements associated with that source can be used [18]. On the other hand, the assessment can be performed at the graph level, which can be further propagated either to a more fine-grained level, that is, the RDF triple level, or to a more generic one, that is, the dataset level. For example, the evaluation of trust of a data source (graph level) can be propagated to the statements (triple level) that are part of the Web source associated with that trust rating [53].

However, there are no approaches that perform the assessment only at the dataset level (cf. Table 9). A possible reason is that the assessment of a dataset should always pass through the assessment of a more fine-grained level, such as the triple or entity level, and should then be propagated to the dataset level.
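To make this propagation concrete, the following is a minimal sketch in Python; it is not taken from any of the surveyed tools, and all names and scores are illustrative. It aggregates triple-level scores upwards to the graph and dataset level, and pushes a graph-level trust rating down to the triples the graph contains.

from statistics import mean

# Illustrative triple-level quality scores, keyed by (graph URI, triple id).
triple_scores = {
    ("http://example.org/graphA", "t1"): 0.9,
    ("http://example.org/graphA", "t2"): 0.7,
    ("http://example.org/graphB", "t3"): 0.4,
}

def graph_score(graph_uri):
    """Propagate upwards: average the scores of all triples in a graph."""
    scores = [s for (g, _), s in triple_scores.items() if g == graph_uri]
    return mean(scores) if scores else None

def dataset_score(graph_uris):
    """Propagate further: a dataset is treated as a set of graphs."""
    scores = [graph_score(g) for g in graph_uris]
    return mean(s for s in scores if s is not None)

def propagate_down(graph_uri, trust_rating):
    """Propagate downwards: assign a graph-level trust rating to its triples."""
    return {t: trust_rating for (g, t) in triple_scores if g == graph_uri}

print(graph_score("http://example.org/graphA"))      # ~0.8
print(dataset_score(["http://example.org/graphA",
                     "http://example.org/graphB"]))  # ~0.6
print(propagate_down("http://example.org/graphB", 0.5))  # {'t3': 0.5}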

4.4. Level of automation

Out of the 21 selected approaches, 9 provide tool support for the assessment of data quality. These tools implement the methodologies and metrics defined in the respective approaches. However, due to the nature of the dimensions and related metrics, some of the involved tools are either semi- or fully automated, or sometimes only manually applicable. Table 9 shows the level of automation for each of the identified tools. The automation level is determined by the amount of user involvement.

Automated. The tool proposed in [22] is fully automated, as there is no user involvement. The tool automatically selects a set of resources, information from the Web of Data (i.e., SPARQL endpoints and/or dereferenceable resources) and a set of new triples as input, and generates quality assessment reports.

Manual. The WIQA [6] and Sieve [42] tools are entirely manual, as they require a high degree of user involvement. The WIQA Information Quality Assessment Framework enables users to filter information using a wide range of quality-based information filtering policies. Similarly, using Sieve, a user can define relevant metrics and respective scoring functions for their specific quality assessment task. This definition of metrics has to be done by creating an XML file which contains the specific configurations for a quality assessment task. Although this gives users the flexibility of tweaking the tool to match their needs, it involves a lot of time and an understanding of the required XML file structure as well as its specification.

Semi-automated. The different tools introduced in [14,26,7,18,23,21] are all semi-automated, as they involve a minimum amount of user involvement. Flemming's Data Quality Assessment Tool22 [14] requires the user to answer a few questions regarding the dataset (e.g., the existence of a human-readable license) or to assign weights to each of the pre-defined data quality metrics. The RDF Validator23 used in [26] checks RDF documents and reports any data quality problems that might exist. The tool developed in [7] enables users to explore a set of pre-determined data clusters for further investigation.

The TRELLIS user interface [18] allows several users to express their trust in a data source and its respective data. Decisions made by users on a particular source are stored as annotations, which can be used to analyze conflicting information or handle incomplete information. The tRDF24 approach [23] provides a framework to deal with the trustworthiness of RDF data, with a trust-aware extension tSPARQL, along with implementation strategies. In all of these components, a user is able to interact with the data as well as with the framework.

The approach proposed by [21] provides two applications, namely TrustBot and TrustMail. TrustBot is an IRC bot that makes trust recommendations to users based on the trust network it builds.

22 http://linkeddata.informatik.hu-berlin.de/LDSrcAss
23 http://www.w3.org/RDF/Validator/
24 http://trdf.sourceforge.net/


Fig. 4. Excerpt of Flemming's Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 out of 100.

Users have the flexibility to add their own URIs to the bot at any time while incorporating the data into a graph. TrustMail, on the other hand, is a configurable email client which displays the trust level that can be placed on the sender of each email message, be it at a general level or with regard to a certain topic.

Yet another aspect is the amount of knowledge required from users before they can use a tool. Although the tool proposed in [22] is fully automated, it requires a certain level of user understanding in order to interpret its results. The WIQA and Sieve tools require a deep understanding not only of the resources users want to access, but also of how the tools work and how they are configured. The above-mentioned semi-automated tools require a fair amount of knowledge from the user regarding the assessed datasets.

4.5. Comparison of tools

In this section we analyze three tools, namely Flemming's data quality tool [14], Sieve [42] and LODGRefine, to assess their usability for data quality assessment. In particular, we compare them with regard to ease of use, level of user interaction and applicability in terms of data quality assessment, also discussing their pros and cons.

4.5.1. Flemming's Data Quality Assessment Tool
Flemming's data quality assessment tool [14] is a simple user interface25 where a user first specifies the name, URI and three entities for a particular data source. Users then receive a score ranging from 0 to 100 indicating the quality of the dataset, where 100 represents the best quality.

After specifying dataset details (endpoint, graphs, example URIs), the user is given the option of assigning weights to each of the pre-defined data quality metrics. Two options are available for assigning weights: (a) assigning a weight of 1 to all the metrics or (b) choosing the pre-defined exemplary weights of the metrics defined for a semantic data source. In the next step, the user is asked to answer a series of questions regarding the dataset, which are important indicators of data quality for Linked Datasets and which cannot be quantified. These include, for example, questions about the use of stable URIs, the number of obsolete classes and properties, and whether the dataset provides a mailing list. Next, the user is presented with a list of dimensions and metrics, for each of which weights can be specified again. Each metric is provided with two input fields, one showing the assigned weights and another the calculated value.

25 Available in German only at http://linkeddata.informatik.hu-berlin.de/LDSrcAss/datenquelle.php


Table 9
Qualitative evaluation of frameworks. Application type: G = general, S = specific. Type of data: RDF Triple / RDF Graph / Dataset. Degree of automation: Manual / Semi-automated / Automated.

Paper | Type | Goal | Triple | Graph | Dataset | Manual | Semi-aut. | Automated | Tool support
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Gil et al., 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | – | – | ✓ | – | ✓
Golbeck et al., 2003 | G | Trust networks on the semantic web | – | ✓ | – | – | ✓ | – | ✓
Mostafavi et al., 2004 | S | Spatial data integration | ✓ | ✓ | – | – | – | – | –
Golbeck, 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | – | – | – | –
Gil et al., 2007 | S | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Lei et al., 2007 | S | Assessment of semantic metadata | ✓ | ✓ | – | – | – | – | –
Hartig, 2008 | G | Trustworthiness of data on the Web | ✓ | – | – | – | ✓ | – | ✓
Bizer et al., 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | – | – | ✓
Böhm et al., 2010 | G | Data integration | ✓ | ✓ | – | – | ✓ | – | ✓
Chen et al., 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | – | – | – | – | –
Flemming, 2010 | G | Assessment of published data | ✓ | ✓ | – | – | ✓ | – | ✓
Hogan et al., 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Shekarpour et al., 2010 | G | Method for evaluating trust | – | ✓ | – | – | – | – | –
Fürber et al., 2011 | G | Assessment of published data | ✓ | ✓ | – | – | – | – | –
Gamble et al., 2011 | G | Application of decision networks to quality, trust and utility assessment | – | ✓ | ✓ | – | – | – | –
Jacobi et al., 2011 | G | Trust assessment of web resources | – | ✓ | – | – | – | – | –
Bonatti et al., 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | – | – | – | – | –
Guéret et al., 2012 | S | Assessment of quality of links | – | ✓ | ✓ | – | – | ✓ | ✓
Hogan et al., 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | – | – | – | –
Mendes et al., 2012 | S | Data integration | ✓ | – | – | ✓ | – | – | ✓
Rula et al., 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | – | – | – | – | –



In the end, the user is presented with a score ranging from 0 to 100, representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) is calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool with the result of assessing the quality of LinkedCT: a score of 15 out of 100.
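The aggregation behind such a score is essentially a weighted average of normalized metric values. The following is a minimal sketch of that computation, assuming metric values in [0, 1]; the metric names, values and weights are illustrative, not Flemming's actual ones.

# Illustrative measured metric values for a dataset, each normalized to [0, 1].
metrics = {
    "stable_uris": 1.0,
    "dereferenceable_uris": 0.6,
    "machine_readable_license": 0.0,
}

# User-assigned weights, e.g. a weight of 1 for all metrics by default.
weights = {
    "stable_uris": 1,
    "dereferenceable_uris": 2,
    "machine_readable_license": 1,
}

def overall_score(metrics, weights):
    """Weighted average of the metric values, scaled to a 0-100 score."""
    total_weight = sum(weights.values())
    weighted_sum = sum(metrics[name] * w for name, w in weights.items())
    return round(100 * weighted_sum / total_weight)

print(overall_score(metrics, weights))  # 55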

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation for each step. Also, assigning the weights for each metric and calculating the score is straightforward, and the weights are easy to adjust for each of the metrics. However, this tool has a few drawbacks: (1) the user needs to have adequate knowledge about the dataset in order to correctly assign weights for each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions are missing from the analysis, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, as some could not be quantified and were not perceived to be true quality indicators.

4.5.2. Sieve
Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these data sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and IntervalMembership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified based on the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions, which are Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de/


<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance:lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1. A configuration of Sieve, a data quality assessment and data fusion tool.
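To illustrate what a scoring function like TimeCloseness computes, and how a fusion policy such as Max then picks among conflicting values, here is a small sketch; it is a hedged re-implementation of the general idea, not Sieve's actual code, and all data values are invented.

from datetime import date

def time_closeness(last_updated, today, time_span_days=7):
    """Score 1.0 for data updated today, decaying linearly to 0.0
    once the last update is older than the configured time span."""
    age_days = (today - last_updated).days
    return max(0.0, 1.0 - age_days / time_span_days)

# Two sources providing conflicting values for the same property.
sources = [
    {"value": "Berlin", "last_updated": date(2012, 3, 1)},
    {"value": "Bonn",   "last_updated": date(2012, 2, 20)},
]
today = date(2012, 3, 2)

# A Max-like fusion policy: keep the value from the best-scoring source.
best = max(sources, key=lambda s: time_closeness(s["last_updated"], today))
print(best["value"])  # Berlin (1 day old, score ~0.86; Bonn scores 0.0)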

In addition, users should specify the DataSource folder and the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified; in particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, there are some drawbacks that decrease its usability: (1) the tool is not mainly used to assess the data quality of a source, but instead to perform data fusion (integration) based on quality assessment; the quality assessment can therefore be considered an accessory that leverages the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; (3) its usage is limited to domains providing provenance metadata associated with the data source.

4.5.3. LODGRefine
LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful in performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import data in several different file formats (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns,

27 http://code.zemanta.com/sparkica/


LODGRefine can help a user to semi-automate the cleaning of her data. For example, this tool can help to detect duplicates, discover patterns (e.g., alternative forms of an abbreviation), spot inconsistencies (e.g., trailing white spaces) or find and replace blank cells. Additionally, this tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies such that the values are given meaning. Reconciliation against Freebase28 helps map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29 using this tool. Moreover, it is also possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, and it is easy to upload raw data and perform basic cleansing steps on it. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, this tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions using this tool; (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and the user must therefore perform transformations per column.

28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs
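As a rough illustration of what the SPARQL-based reconciliation mentioned above boils down to, the following sketch looks up candidate entities for an ambiguous textual value by matching rdfs:label at a public endpoint; the endpoint, query and helper name are illustrative assumptions, not LODGRefine's implementation.

from SPARQLWrapper import SPARQLWrapper, JSON

def reconcile(text, endpoint="http://dbpedia.org/sparql"):
    """Return candidate entity URIs whose English rdfs:label matches `text`."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?entity WHERE {
            ?entity rdfs:label "%s"@en .
        } LIMIT 5
    """ % text)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["entity"]["value"] for b in results["results"]["bindings"]]

print(reconcile("Berlin"))  # e.g. ['http://dbpedia.org/resource/Berlin', ...]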

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and the required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.



As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications published in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is currently emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved. Some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or required a high level of interpretation.

Meanwhile, there is a considerable amount of research on data quality being done, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment such that it allows a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


References

[1] Agrawal, R., and Srikant, R. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases (1994), pp. 487–499.
[2] Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. Methodologies for data quality assessment and improvement. ACM Computing Surveys 41, 3 (2009).
[3] Batini, C., and Scannapieco, M. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] Beckett, D. RDF/XML syntax specification (revised). Tech. rep., World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[5] Bizer, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[6] Bizer, C., and Cyganiak, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[7] Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grütze, T., Hefenbrock, D., Pohl, M., and Sonnabend, D. Profiling linked open data with ProLOD. In ICDE Workshops (2010), IEEE, pp. 175–178.
[8] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. Robust and scalable linked data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9, 2 (2011), 165–201.
[9] Bovee, M., Srivastava, R. P., and Mak, B. A conceptual framework and belief-function approach to assessing overall information quality. International Journal of Intelligent Systems 18, 1 (2003), 51–74.
[10] Brickley, D., and Guha, R. V. RDF vocabulary description language 1.0: RDF Schema. Tech. rep., W3C, 2004. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[11] Carroll, J. Signing RDF graphs. Tech. rep. HPL-2003-142, HP Labs, 2003.
[12] Chen, P., and Garcia, W. Hypothesis generation and data quality assessment through association mining. In IEEE ICCI (2010), IEEE, pp. 659–666.
[13] Demter, J., Auer, S., Martin, M., and Lehmann, J. LODStats – an extensible framework for high-performance dataset analytics. In EKAW (2012), LNCS, Springer.
[14] Flemming, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[15] Fürber, C., and Hepp, M. SWIQA – a semantic web information quality assessment framework. In ECIS (2011).
[16] Gamble, M., and Goble, C. Quality, trust and utility of scientific data on the web: Towards a joint model. In ACM WebSci (June 2011), pp. 1–8.
[17] Gil, Y., and Artz, D. Towards content trust of web resources. Web Semantics 5, 4 (December 2007), 227–239.
[18] Gil, Y., and Ratnakar, V. Trusting information sources one citizen at a time. In ISWC (2002), Springer-Verlag, pp. 162–176.
[19] Golbeck, J. Inferring reputation on the semantic web. In WWW (2004).
[20] Golbeck, J. Using trust and provenance for content filtering on the semantic web. In Workshop on Models of Trust on the Web at the 15th World Wide Web Conference (2006).
[21] Golbeck, J., Parsia, B., and Hendler, J. Trust networks on the semantic web. In Cooperative Intelligent Agents (2003).
[22] Guéret, C., Groth, P., Stadler, C., and Lehmann, J. Assessing linked data mappings using network measures. In ESWC (2012).
[23] Hartig, O. Trustworthiness of data on the web. In STI Berlin and CSW PhD Workshop, Berlin, Germany (2008).
[24] Hayes, P. RDF semantics. Recommendation, World Wide Web Consortium, 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
[25] Heath, T., and Bizer, C. Linked Data: Evolving the Web into a Global Data Space, 1st ed. No. 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 2011, ch. 2, pp. 1–136.
[26] Hogan, A., Harth, A., Passant, A., Decker, S., and Polleres, A. Weaving the pedantic web. In LDOW (2010).
[27] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. An empirical survey of linked data conformance. Journal of Web Semantics (2012).
[28] Horrocks, I., Patel-Schneider, P., Boley, H., Tabet, S., Grosof, B., and Dean, M. SWRL: A semantic web rule language combining OWL and RuleML. Tech. rep., W3C, May 2004.
[29] Jacobi, I., Kagal, L., and Khandelwal, A. Rule-based trust assessment on the semantic web. In International Conference on Rule-Based Reasoning, Programming, and Applications (2011), pp. 227–241.
[30] Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] Juran, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] Kifer, M., and Boley, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] Kitchenham, B. Procedures for performing systematic reviews.
[34] Knight, S., and Burn, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] Lehmann, J., Gerber, D., Morsey, M., and Ngonga Ngomo, A.-C. DeFacto – deep fact validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] Lei, Y., Nikolov, A., Uren, V., and Motta, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] Lei, Y., Uren, V., and Motta, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] Pipino, L., Wang, R., Kopcso, D., and Rybold, W. Developing measurement scales for data-quality dimensions. Vol. 1, M.E. Sharpe, New York, 2005.
[40] Maynard, D., Peters, W., and Li, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] Meilicke, C., and Stuckenschmidt, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] Mendes, P., Mühleisen, H., and Bizer, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] Miller, P., Styles, R., and Heath, T. Open data commons, a license for open data. In LDOW (2008).
[44] Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., and the PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] Mostafavi, M., Edwards, G., and Jeansoulin, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] Naumann, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] Pipino, L., Lee, Y. W., and Wang, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] Redman, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] Rula, A., Palmonari, M., Harth, A., Stadtmüller, S., and Maurino, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] Rula, A., Palmonari, M., and Maurino, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] Scannapieco, M., and Catarci, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] Schober, D., Barry, S., Lewis, E. S., Kusnierczyk, W., Lomax, J., Mungall, C., Taylor, F. C., Rocca-Serra, P., and Sansone, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] Shekarpour, S., and Katebi, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] Suchanek, F. M., Gross-Amblard, D., and Abiteboul, S. Watermarking for ontologies. In ISWC (2011).
[55] Wand, Y., and Wang, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] Wang, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb 1998), 58–65.
[57] Wang, R. Y., and Strong, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.

Page 28: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

28 Linked Data Quality

Fig 4 Excerpt of the Flemmings Data Quality Assessment tool showing the result of assessing the quality of LinkedCT with a score of 15 outof 100

based on the trust network it builds Users have theflexibility to add their own URIs to the bot at any timewhile incorporating the data into a graph TrustMailon the other hand is a configurable email client whichdisplays the trust level that can be placed on the senderfor each email message be it at a general level or withregard to a certain topic

Yet another aspect is the amount of knowledge re-quired from users before they can use a tool Althoughthe tool proposed in [22] is fully automated it requiresa certain level of user understanding in order to in-terpret its results The WIQA and Sieve tools requirea deep understanding of not only the resources userswant to access but also how the tools work and theirconfiguration The above-mentioned semi-automatedtools require a fair amount of knowledge from the userregarding the assessed datasets

45 Comparison of tools

In this section we analyze three tools namely Flem-mingrsquos data quality tool [14] Sieve [42] and LODGRe-fine to assess their usability for data quality assess-ment In particular we compare them with regard toease of use level of user interaction and applicabil-ity in terms of data quality assessment also discussingtheir pros and cons

451 Flemmingrsquos Data Quality Assessment ToolFlemmingrsquos data quality assessment tool [14] is a

simple user interface25 where a user first specifiesthe name URI and three entities for a particular datasource Users are then receive a score ranging from 0to 100 indicating the quality of the dataset where 100represents the best quality

After specifying dataset details (endpoint graphsexample URIs) the user is given an option for assign-ing weights to each of the pre-defined data quality met-rics Two options are available for assigning weights(a) assigning a weight of 1 to all the metrics or (b)choosing the pre-defined exemplary weight of the met-rics defined for a semantic data source In the next stepthe user is asked to answer a series of questions regard-ing the datasets which are important indicators of thedata quality for Linked Datasets and those which can-not be quantified These include for example ques-tions about the use of stable URIs the number of ob-solete classes and properties and whether the datasetprovides a mailing list Next the user is presented witha list of dimensions and metrics for each of whichweights can be specified again Each metric is pro-

25Available in German only at httplinkeddatainformatikhu-berlindeLDSrcAssdatenquellephp

Linked Data Quality 29

Table 9Qualitative evaluation of frameworks

Paper Application Goal Type of data Degree of automation Tool supportGeneralSpecific

RDFTriple

RDFGraph

Dataset Manual Semi-automated

Automated

Gil etal2002

G Approach to derive an as-sessment of a data sourcebased on the annotationsof many individuals

4 4 - - 4 - 4

Golbecketal 2003

G Trust networks on the se-mantic web

- 4 - - 4 - 4

Mostafavietal2004

S Spatial data integration 4 4 - - - - -

Golbeck2006

G Algorithm for computingpersonalized trust rec-ommendations using theprovenance of existingtrust annotations in socialnetworks

4 4 4 - - - -

Gil etal2007

S Trust assessment of webresources

- 4 - - - - -

Lei etal2007

S Assessment of semanticmetadata

4 4 - - - - -

Hartig 2008 G Trustworthiness of Dataon the Web

4 - - - 4 - 4

Bizer etal2009

G Information filtering 4 4 4 4 - - 4

Boumlhm etal2010

G Data integration 4 4 - - 4 - 4

Chen etal2010

G Generating semanticallyvalid hypothesis

4 4 - - - - -

Flemmingetal 2010

G Assessment of publisheddata

4 4 - - 4 - 4

Hogan etal2010

G Assessment of publisheddata by identifying RDFpublishing errors and pro-viding approaches for im-provement

4 4 4 - 4 - 4

Shekarpouretal 2010

G Method for evaluatingtrust

- 4 - - - - -

Fuumlrber etal2011

G Assessment of publisheddata

4 4 - - - - -

Gambleetal 2011

G Application of decisionnetworks to quality trustand utility assessment

- 4 4 - - - -

Jacobi etal2011

G Trust assessment of webresources

- 4 - - - - -

Bonatti etal2011

G Provenance assessment forreasoning

4 4 - - - - -

Gueacuteret etal2012

S Assessment of quality oflinks

- 4 4 - - 4 4

Hogan etal2012

G Assessment of publisheddata

4 4 4 - - - -

Mendesetal 2012

S Data integration 4 - - 4 - - 4

Rula etal2012

G Assessment of time relatedquality dimensions

4 4 - - - - -

30 Linked Data Quality

vided with two input fields one showing the assignedweights and another with the calculated value

In the end the user is presented with a score rangingfrom 0 to 100 representing the final data quality scoreAdditionally the rating of each dimension and the totalweight (based on 11 dimensions) is calculated usingthe user input from the previous steps Figure 4 showsan excerpt of the tool showing the result of assessingthe quality of LinkedCT with a score of 15 out of 100

On one hand the tool is easy to use with the form-based questions and adequate explanation for eachstep Also the assigning of the weights for each met-ric and the calculation of the score is straightforwardand easy to adjust for each of the metrics Howeverthis tool has a few drawbacks (1) the user needs tohave adequate knowledge about the dataset in order tocorrectly assign weights for each of the metrics (2)it does not drill down to the root cause of the dataquality problem and (3) some of the main quality di-mensions are missing from the analysis such as accu-racy completeness provenance consistency concise-ness and relevancy as some could not be quantified andwere not perceived to be true quality indicators

452 SieveSieve is a component of the Linked Data Integra-

tion Framework (LDIF)26 used first to assess the qual-ity between two or more data sources and second tofuse (integrate) the data from the data sources basedon their quality assessment In order to use this tool auser needs to be conversant with programming

The input of Sieve is an LDIF provenance meta-data graph generated from a data source Based onthis information the user needs to set the configurationproperty in an XML file known as integrationproperties The quality assessment procedure re-lies on the measurement of metrics chosen by the userwhere each metric applies a scoring function having avalue from 0 to 1

Sieve implements only a few scoring functions suchas TimeCloseness Preference SetMembershipThreshold and Interval Membership whichare calculated based on the metadata provided as in-put along with the original data source The con-figuration file is in XML format which should bemodified based on the use case as shown in List-ing 1 The output scores are then used to fuse thedata sources by applying one of the fusion functionswhich are Filter Average Max Min First

26httpldifwbsgde

KeepSingleValue ByQualityScore LastRandom PickMostFrequent

1 ltSievegt2 ltQualityAssessmentgt3 ltAssessmentMetric id=sieverecencygt4 ltScoringFunction class=TimeClosenessgt5 ltParam name=timeSpan value=7gt6 ltInput path=GRAPHprovenancelasUpdatedgt7 ltScoringFunctiongt8 ltAssessmentMetricgt9 ltAssessmentMetric id=sievereputationgt

10 ltScoringFunction class=ScoredListgt11 ltParam name=priority value=httppt

wikipediaorg httpenwikipediaorggt12 ltInput path=GRAPHprovenancelasUpdatedgt13 ltScoringFunctiongt14 ltAssessmentMetricgt15 ltSievegt

Listing 1 A configuration of Sieve a Data QualityAssessment and Data Fusion tool

In addition users should specify the DataSourcefolder the homepage element that refers to the datasource from which the entities are going to be fusedSecond the XML file of the ImportJobs that down-loads the data to the server should also be modified Inparticular the user should set up the dumpLocationelement as the location of the dump file

Although the tool is very useful overall there aresome drawbacks that decreases its usability (1) thetool is not mainly used to assess data quality for asource but instead to perform data fusion (integra-tion) based on quality assessment Therefore the qual-ity assessment can be considered as an accessory thatleverages the process of data fusion through evalua-tion of few quality indicators (2) it does not providea user interface ultimately limiting its usage to end-users with programming skills (3) its usage is limitedto domains providing provenance metadata associatedwith the data source

453 LODGRefineLODGRefine27 is a LOD-enabled version of Google

Refine which is an open-source tool for refining messydata Although this tool is not focused on data qualityassessment per se it is powerful in performing prelim-inary cleaning or refining of raw data

Using this tool one is able to import several differ-ent file types of data (CSV Excel XML RDFXMLN-Triples or even JSON) and then perform cleaningaction via a browser-based interface By using a di-verse set of filters and facets on individual columns

27httpcodezemantacomsparkica

Linked Data Quality 31

LODGRefine can help a user to semi-automate thecleaning of her data For example this tool can helpto detect duplicates discover patterns (eg alternativeforms of an abbreviation) spot inconsistencies (egtrailing white spaces) or find and replace blank cellsAdditionally this tool allows users to reconcile datathat is to connect a dataset to existing vocabulariessuch that it gives meaning to the values Reconcilia-tions to Freebase28 helps mapping ambiguous textualvalues to precisely identified Freebase entities Recon-ciling using Sindice or based on standard SPARQL orSPARQL with full-text search is also possible29 usingthis tool Moreover it is also possible to extend the rec-onciled data with DBpedia as well as export the dataas RDF which adds to the uniformity and usability ofthe dataset

These feature thus assists in assessing as well as im-proving the data quality of a dataset Moreover by pro-viding external links the interlinking of the dataset isconsiderably improved LODGRefine is easy to down-load and install as well as to upload and perform basiccleansing steps on raw data The features of reconcil-iation extending the data with DBpedia transformingand exporting the data as RDF are added advantagesHowever this tool has a few drawbacks (1) the useris not able to perform detailed high level data qualityanalysis utilizing the various quality dimensions usingthis tool (2) performing cleansing over a large datasetis time consuming as the tool follows a column datamodel and thus the user must perform transformationsper column

5 Conclusions and future work

In this paper we have presented to the best of ourknowledge the most comprehensive systematic reviewof data quality assessment methodologies applied toLOD The goal of this survey is to obtain a clear under-standing of the differences between such approachesin particular in terms of quality dimensions metricstype of data and level of automation

We survey 21 approaches and extracted 26 dataquality dimensions along with their definitions andcorresponding metrics We also identified tools pro-posed for each approach and classified them in rela-tion to data type the degree of automation and requiredlevel of user knowledge Finally we evaluated three

28httpwwwfreebasecom29httprefinederiiereconciliationDocs

specific tools in terms of usability for data quality as-sessment

As we can observe most of the publications focus-ing on data quality assessment in Linked Data are pre-sented at either conferences or workshops As our lit-erature review reveals the number of 21 publicationspublished in the span of 10 years is rather low Thiscan be attributed to the infancy of this research areawhich is currently emerging

In most of the surveyed literature the metrics wereoften not explicitly defined or did not consist of precisestatistical measures Also there was no formal valida-tion of the methodologies that were implemented astools Moreover only few approaches were actuallyaccompanied by an implemented tool That is out ofthe 21 approaches only 9 provided tool support Wealso observed that none of the existing implementedtools covered all the data quality dimensions In factthe best coverage in terms of dimensions was achievedby Flemmingrsquos data quality assessment tool with 11covered dimensions Our survey revealed that the flex-ibility of the tools with regard to the level of automa-tion and user involvement needs to be improved Sometools required a considerable amount of configurationand others were easy-to-use but provided only resultswith limited usefulness or required a high-level of in-terpretation

Meanwhile there is quite much research on dataquality being done and guidelines as well as recom-mendations on how to publish ldquogood data are avail-able However there is less focus on how to use thisldquogood data We deem our data quality dimensions tobe very useful for data consumers in order to assessthe quality of datasets As a next step we aim to inte-grate the various data quality dimensions into a com-prehensive methodological framework for data qualityassessment comprising the following steps

1 Requirements analysis2 Data quality checklist3 Statistics and low-level analysis4 Aggregated and higher level metrics5 Comparison6 Interpretation

We aim to develop this framework for data quality as-sessment allowing a data consumer to select and as-sess the quality of suitable datasets according to thismethodology In the process we also expect new met-rics to be defined and implemented

32 Linked Data Quality

References

[1] AGRAWAL R AND SRIKANT R Fast algorithms for miningassociation rules in large databases In International Confer-ence on Very Large Data Bases (1994) pp 487ndash499

[2] BATINI C CAPPIELLO C FRANCALANCI C AND MAU-RINO A Methodologies for data quality assessment and im-provement ACM Computing Surveys 41 3 (2009)

[3] BATINI C AND SCANNAPIECO M Data Quality Con-cepts Methodologies and Techniques (Data-Centric Systemsand Applications) Springer-Verlag New York Inc SecaucusNJ USA 2006

[4] BECKETT D RDFXML Syntax Specification(Revised) Tech rep World Wide Web Consor-tium 2004 httpwwww3orgTR2004REC-rdf-syntax-grammar-20040210

[5] BIZER C Quality-Driven Information Filtering in the Con-text of Web-Based Information Systems PhD thesis Freie Uni-versitaumlt Berlin March 2007

[6] BIZER C AND CYGANIAK R Quality-driven informationfiltering using the wiqa policy framework Web Semantics 7 1(Jan 2009) 1 ndash 10

[7] BOumlHM C NAUMANN F ABEDJAN Z FENZ DGRUumlTZE T HEFENBROCK D POHL M AND

SONNABEND D Profiling linked open data with prolod InICDE Workshops (2010) IEEE pp 175ndash178

[8] BONATTI P A HOGAN A POLLERES A AND SAUROL Robust and scalable linked data reasoning incorporatingprovenance and trust annotations Journal of Web Semantics 92 (2011) 165 ndash 201

[9] BOVEE M SRIVASTAVA R P AND MAK B A conceptualframework and belief-function approach to assessing overallinformation quality International Journal of Intelligent Sys-tems 18 1 (2003) 51ndash74

[10] BRICKLEY D AND GUHA R V Rdf vocabu-lary description language 10 Rdf schema Techrep W3C 2004 httpwwww3orgTR2004REC-rdf-schema-20040210

[11] CARROLL J Signing rdf graphs Tech rep HPL-2003-142HP Labs 2003

[12] CHEN P AND GARCIA W Hypothesis generation and dataquality assessment through association mining In IEEE ICCI(2010) IEEE pp 659ndash666

[13] DEMTER J AUER S MARTIN M AND LEHMANNJ Lodstats ndash an extensible framework for high-performancedataset analytics In EKAW (2012) LNCS Springer

[14] FLEMMING A Quality characteristics of linked data pub-lishing datasources Masterrsquos thesis Humboldt-Universitaumlt zuBerlin 2010

[15] FUumlRBER C AND HEPP M Swiqa - a semantic web infor-mation quality assessment framework In ECIS (2011)

[16] GAMBLE M AND GOBLE C Quality trust and utility ofscientific data on the web Towards a joint model In ACMWebSci (June 2011) pp 1ndash8

[17] GIL Y AND ARTZ D Towards content trust of web re-sources Web Semantics 5 4 (December 2007) 227 ndash 239

[18] GIL Y AND RATNAKAR V Trusting information sourcesone citizen at a time In ISWC (2002) Springer-Verlag pp 162ndash 176

[19] GOLBECK J Inferring reputation on the semantic web InWWW (2004)

[20] GOLBECK J Using trust and provenance for content filteringon the semantic web In Workshop on Models of Trust on theWeb at the 15th World Wide Web Conference (2006)

[21] GOLBECK J PARSIA B AND HENDLER J Trust networkson the semantic web In Cooperative Intelligent Agents (2003)

[22] GUEacuteRET C GROTH P STADLER C AND LEHMANN JAssessing linked data mappings using network measures InESWC (2012)

[23] HARTIG O Trustworthiness of data on the web In STI Berlinand CSW PhD Workshop Berlin Germany (2008)

[24] HAYES P Rdf semantics Recommendation World Wide WebConsortium 2004 httpwwww3orgTR2004REC-rdf-mt-20040210

[25] HEATH T AND BIZER C Linked Data Evolving the Webinto a Global Data Space 1st ed No 11 in Synthesis Lectureson the Semantic Web Theory and Technology Morgan andClaypool 2011 ch 2 pp 1 ndash 136

[26] HOGAN A HARTH A PASSANT A DECKER S AND

POLLERES A Weaving the pedantic web In LDOW (2010)[27] HOGAN A UMBRICH J HARTH A CYGANIAK R

POLLERES A AND DECKER S An empirical survey oflinked data conformance Journal of Web Semantics (2012)

[28] HORROCKS I PATEL-SCHNEIDER P BOLEY H TABETS GROSOF B AND DEAN M Swrl A semantic web rulelanguage combining owl and ruleml Tech rep W3C May2004

[29] JACOBI I KAGAL L AND KHANDELWAL A Rule-basedtrust assessment on the semantic web In International confer-ence on Rule-based reasoning programming and applicationsseries (2011) pp 227 ndash 241

[30] JARKE, M., LENZERINI, M., VASSILIOU, Y., AND VASSILIADIS, P. Fundamentals of Data Warehouses, 2nd ed. Springer Publishing Company, 2010.
[31] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[32] KIFER, M., AND BOLEY, H. RIF overview. Tech. rep., W3C, June 2010. http://www.w3.org/TR/2012/NOTE-rif-overview-20121211/
[33] KITCHENHAM, B. Procedures for performing systematic reviews. Tech. rep., Keele University, 2004.
[34] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[35] LEE, Y. W., STRONG, D. M., KAHN, B. K., AND WANG, R. Y. AIMQ: a methodology for information quality assessment. Information & Management 40, 2 (2002), 133–146.
[36] LEHMANN, J., GERBER, D., MORSEY, M., AND NGONGA NGOMO, A.-C. DeFacto - Deep Fact Validation. In ISWC (2012), Springer Berlin Heidelberg.
[37] LEI, Y., NIKOLOV, A., UREN, V., AND MOTTA, E. Detecting quality problems in semantic metadata without the presence of a gold standard. In Workshop on "Evaluation of Ontologies for the Web" (EON) at the WWW (2007), pp. 51–60.
[38] LEI, Y., UREN, V., AND MOTTA, E. A framework for evaluating semantic metadata. In 4th International Conference on Knowledge Capture (2007), no. 8 in K-CAP '07, ACM, pp. 135–142.
[39] PIPINO, L., WANG, R., KOPCSO, D., AND RYBOLD, W. Developing Measurement Scales for Data-Quality Dimensions, vol. 1. M.E. Sharpe, New York, 2005.
[40] MAYNARD, D., PETERS, W., AND LI, Y. Metrics for evaluation of ontology-based information extraction. In Workshop on "Evaluation of Ontologies for the Web" (EON) at WWW (May 2006).
[41] MEILICKE, C., AND STUCKENSCHMIDT, H. Incoherence as a basis for measuring the quality of ontology mappings. In 3rd International Workshop on Ontology Matching (OM) at the ISWC (2008).
[42] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[43] MILLER, P., STYLES, R., AND HEATH, T. Open data commons, a license for open data. In LDOW (2008).
[44] MOHER, D., LIBERATI, A., TETZLAFF, J., ALTMAN, D. G., AND THE PRISMA GROUP. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 6, 7 (2009).
[45] MOSTAFAVI, M., EDWARDS, G., AND JEANSOULIN, R. Ontology-based method for quality assessment of spatial data bases. In International Symposium on Spatial Data Quality (2004), vol. 4, pp. 49–66.
[46] NAUMANN, F. Quality-Driven Query Answering for Integrated Information Systems, vol. 2261 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[47] PIPINO, L., LEE, Y. W., AND WANG, R. Y. Data quality assessment. Communications of the ACM 45, 4 (2002).
[48] REDMAN, T. C. Data Quality for the Information Age, 1st ed. Artech House, 1997.
[49] RULA, A., PALMONARI, M., HARTH, A., STADTMÜLLER, S., AND MAURINO, A. On the diversity and availability of temporal information in linked open data. In ISWC (2012).
[50] RULA, A., PALMONARI, M., AND MAURINO, A. Capturing the age of linked open data: Towards a dataset-independent framework. In IEEE International Conference on Semantic Computing (2012).
[51] SCANNAPIECO, M., AND CATARCI, T. Data quality under a computer science perspective. Archivi & Computer 2 (2002), 1–15.
[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).
[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.
[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).
[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb. 1998), 58–65.
[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.


Table 9
Qualitative evaluation of frameworks. Application: G = general, S = specific; type of data: RDF Triple, RDF Graph, Dataset; degree of automation: Manual, Semi-automated (Semi), Automated (Auto); ✓ = applies, - = does not apply.

| Paper | Appl. | Goal | RDF Triple | RDF Graph | Dataset | Manual | Semi | Auto | Tool support |
|---|---|---|---|---|---|---|---|---|---|
| Gil et al. 2002 | G | Approach to derive an assessment of a data source based on the annotations of many individuals | ✓ | ✓ | - | - | ✓ | - | ✓ |
| Golbeck et al. 2003 | G | Trust networks on the semantic web | - | ✓ | - | - | ✓ | - | ✓ |
| Mostafavi et al. 2004 | S | Spatial data integration | ✓ | ✓ | - | - | - | - | - |
| Golbeck 2006 | G | Algorithm for computing personalized trust recommendations using the provenance of existing trust annotations in social networks | ✓ | ✓ | ✓ | - | - | - | - |
| Gil et al. 2007 | S | Trust assessment of web resources | - | ✓ | - | - | - | - | - |
| Lei et al. 2007 | S | Assessment of semantic metadata | ✓ | ✓ | - | - | - | - | - |
| Hartig 2008 | G | Trustworthiness of data on the Web | ✓ | - | - | - | ✓ | - | ✓ |
| Bizer et al. 2009 | G | Information filtering | ✓ | ✓ | ✓ | ✓ | - | - | ✓ |
| Böhm et al. 2010 | G | Data integration | ✓ | ✓ | - | - | ✓ | - | ✓ |
| Chen et al. 2010 | G | Generating semantically valid hypotheses | ✓ | ✓ | - | - | - | - | - |
| Flemming 2010 | G | Assessment of published data | ✓ | ✓ | - | - | ✓ | - | ✓ |
| Hogan et al. 2010 | G | Assessment of published data by identifying RDF publishing errors and providing approaches for improvement | ✓ | ✓ | ✓ | - | ✓ | - | ✓ |
| Shekarpour et al. 2010 | G | Method for evaluating trust | - | ✓ | - | - | - | - | - |
| Fürber et al. 2011 | G | Assessment of published data | ✓ | ✓ | - | - | - | - | - |
| Gamble et al. 2011 | G | Application of decision networks to quality, trust and utility assessment | - | ✓ | ✓ | - | - | - | - |
| Jacobi et al. 2011 | G | Trust assessment of web resources | - | ✓ | - | - | - | - | - |
| Bonatti et al. 2011 | G | Provenance assessment for reasoning | ✓ | ✓ | - | - | - | - | - |
| Guéret et al. 2012 | S | Assessment of quality of links | - | ✓ | ✓ | - | - | ✓ | ✓ |
| Hogan et al. 2012 | G | Assessment of published data | ✓ | ✓ | ✓ | - | - | - | - |
| Mendes et al. 2012 | S | Data integration | ✓ | - | - | ✓ | - | - | ✓ |
| Rula et al. 2012 | G | Assessment of time-related quality dimensions | ✓ | ✓ | - | - | - | - | - |

The user is provided with two input fields: one showing the assigned weights and another with the calculated value.

In the end, the user is presented with a score ranging from 0 to 100 representing the final data quality score. Additionally, the rating of each dimension and the total weight (based on 11 dimensions) are calculated using the user input from the previous steps. Figure 4 shows an excerpt of the tool with the result of assessing the quality of LinkedCT: a score of 15 out of 100.

On the one hand, the tool is easy to use, with form-based questions and an adequate explanation of each step. Also, assigning the weights for each metric and calculating the score is straightforward and easy to adjust for each of the metrics. However, the tool has a few drawbacks: (1) the user needs adequate knowledge about the dataset in order to correctly assign weights to each of the metrics; (2) it does not drill down to the root cause of a data quality problem; and (3) some of the main quality dimensions, such as accuracy, completeness, provenance, consistency, conciseness and relevancy, are missing from the analysis, as some could not be quantified and were not perceived to be true quality indicators.
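To make the aggregation step concrete, the following minimal Python sketch computes such a 0-100 score as a normalized weighted mean of per-metric ratings. The metric names, ratings and weights are hypothetical examples, and the actual formula used by Flemming's tool may differ.

```python
# Sketch of a weighted data quality score on a 0-100 scale.
# Metric names, ratings and weights below are hypothetical.

def quality_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-metric ratings in [0, 1] into a 0-100 score
    using a normalized weighted mean."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(ratings[metric] * weights[metric] for metric in ratings)
    return 100.0 * weighted_sum / total_weight

ratings = {"availability": 0.4, "consistency": 0.2, "timeliness": 0.0}
weights = {"availability": 2.0, "consistency": 1.0, "timeliness": 1.0}
print(f"{quality_score(ratings, weights):.0f} / 100")  # -> 25 / 100
```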

4.5.2. Sieve

Sieve is a component of the Linked Data Integration Framework (LDIF)26, used first to assess the quality of two or more data sources and second to fuse (integrate) the data from these sources based on their quality assessment. In order to use this tool, a user needs to be conversant with programming.

The input of Sieve is an LDIF provenance metadata graph generated from a data source. Based on this information, the user needs to set the configuration properties in an XML file known as integration properties. The quality assessment procedure relies on the measurement of metrics chosen by the user, where each metric applies a scoring function yielding a value from 0 to 1.

Sieve implements only a few scoring functions, such as TimeCloseness, Preference, SetMembership, Threshold and Interval Membership, which are calculated based on the metadata provided as input along with the original data source. The configuration file is in XML format and should be modified to fit the use case, as shown in Listing 1. The output scores are then used to fuse the data sources by applying one of the fusion functions: Filter, Average, Max, Min, First, KeepSingleValueByQualityScore, Last, Random and PickMostFrequent.

26 http://ldif.wbsg.de

<Sieve>
  <QualityAssessment>
    <AssessmentMetric id="sieve:recency">
      <ScoringFunction class="TimeCloseness">
        <Param name="timeSpan" value="7"/>
        <Input path="?GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="priority" value="http://pt.wikipedia.org http://en.wikipedia.org"/>
        <Input path="?GRAPH/provenance/lastUpdated"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
</Sieve>

Listing 1: A configuration of Sieve, a data quality assessment and data fusion tool.
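To illustrate what such a configuration computes, here is a small Python sketch of a TimeCloseness-style scoring function (freshness decaying linearly over the configured timeSpan) combined with a Max-style fusion decision. The decay semantics, function names and data layout are our assumptions based on the description above, not Sieve's actual implementation.

```python
from datetime import datetime, timedelta

def time_closeness(last_updated: datetime, now: datetime, time_span_days: int) -> float:
    """Score in [0, 1]: 1.0 for data updated right now, decaying linearly
    to 0.0 once the last update lies time_span_days or more in the past."""
    age = now - last_updated
    if age <= timedelta(0):
        return 1.0  # treat current (or future) timestamps as maximally fresh
    return max(0.0, 1.0 - age / timedelta(days=time_span_days))

def fuse_max(candidates: list[tuple[str, float]]) -> str:
    """Max-style fusion: keep the value from the source with the highest score."""
    return max(candidates, key=lambda pair: pair[1])[0]

now = datetime(2012, 6, 15)
score_a = time_closeness(datetime(2012, 6, 14), now, time_span_days=7)  # ~0.857
score_b = time_closeness(datetime(2012, 6, 1), now, time_span_days=7)   # 0.0
print(fuse_max([("value from source A", score_a), ("value from source B", score_b)]))
# -> value from source A
```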

In addition, users should specify in the DataSource folder the homepage element that refers to the data source from which the entities are going to be fused. Second, the XML file of the ImportJobs that downloads the data to the server should also be modified; in particular, the user should set the dumpLocation element to the location of the dump file.

Although the tool is very useful overall, some drawbacks decrease its usability: (1) the tool is not mainly intended to assess the data quality of a source but rather to perform data fusion (integration) based on quality assessment, so the quality assessment can be considered an accessory that supports the process of data fusion through the evaluation of a few quality indicators; (2) it does not provide a user interface, ultimately limiting its usage to end-users with programming skills; and (3) its usage is limited to domains that provide provenance metadata associated with the data source.

4.5.3. LODGRefine

LODGRefine27 is a LOD-enabled version of Google Refine, an open-source tool for refining messy data. Although this tool is not focused on data quality assessment per se, it is powerful for performing preliminary cleaning or refining of raw data.

Using this tool, one is able to import data in several different file types (CSV, Excel, XML, RDF/XML, N-Triples or even JSON) and then perform cleaning actions via a browser-based interface. By using a diverse set of filters and facets on individual columns, LODGRefine can help a user to semi-automate the cleaning of her data. For example, the tool can help to detect duplicates, discover patterns (e.g. alternative forms of an abbreviation), spot inconsistencies (e.g. trailing white spaces) or find and replace blank cells. Additionally, the tool allows users to reconcile data, that is, to connect a dataset to existing vocabularies so that the values are given meaning. Reconciliation against Freebase28 helps to map ambiguous textual values to precisely identified Freebase entities. Reconciling using Sindice, or based on standard SPARQL or SPARQL with full-text search, is also possible29. Moreover, it is possible to extend the reconciled data with DBpedia as well as to export the data as RDF, which adds to the uniformity and usability of the dataset.

27 http://code.zemanta.com/sparkica
28 http://www.freebase.com
29 http://refine.deri.ie/reconciliationDocs
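As a rough illustration of these column-oriented cleaning steps (trimming stray whitespace, filling blank cells, flagging duplicates), consider the following Python sketch. It only imitates operations that LODGRefine exposes through its interface; the function and the sample data are hypothetical.

```python
def clean_column(rows: list[dict], column: str, fill_blank: str = "unknown") -> list[str]:
    """Column-wise cleaning in the spirit of a refine tool: strip stray
    whitespace, replace blank cells, and report duplicate values."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for row in rows:
        value = (row.get(column) or "").strip()       # fix trailing white spaces
        row[column] = value if value else fill_blank  # find and replace blank cells
        if row[column] in seen:
            duplicates.append(row[column])            # detect duplicates
        seen.add(row[column])
    return duplicates

rows = [
    {"label": "Aspirin "},  # trailing whitespace
    {"label": "Aspirin"},   # duplicate once trimmed
    {"label": ""},          # blank cell
]
print(clean_column(rows, "label"))  # -> ['Aspirin']
```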

These features thus assist in assessing as well as improving the data quality of a dataset. Moreover, by providing external links, the interlinking of the dataset is considerably improved. LODGRefine is easy to download and install, and it is easy to upload data and perform basic cleansing steps on it. The features of reconciliation, extending the data with DBpedia, and transforming and exporting the data as RDF are added advantages. However, the tool has a few drawbacks: (1) the user is not able to perform a detailed, high-level data quality analysis utilizing the various quality dimensions; and (2) performing cleansing over a large dataset is time-consuming, as the tool follows a column data model and thus the user must perform transformations per column.

5. Conclusions and future work

In this paper we have presented, to the best of our knowledge, the most comprehensive systematic review of data quality assessment methodologies applied to LOD. The goal of this survey is to obtain a clear understanding of the differences between such approaches, in particular in terms of quality dimensions, metrics, type of data and level of automation.

We surveyed 21 approaches and extracted 26 data quality dimensions along with their definitions and corresponding metrics. We also identified the tools proposed for each approach and classified them in relation to data type, degree of automation and required level of user knowledge. Finally, we evaluated three specific tools in terms of usability for data quality assessment.

As we can observe, most of the publications focusing on data quality assessment in Linked Data are presented at either conferences or workshops. As our literature review reveals, the number of 21 publications in the span of 10 years is rather low. This can be attributed to the infancy of this research area, which is only now emerging.

In most of the surveyed literature, the metrics were often not explicitly defined or did not consist of precise statistical measures. Also, there was no formal validation of the methodologies that were implemented as tools. Moreover, only a few approaches were actually accompanied by an implemented tool: out of the 21 approaches, only 9 provided tool support. We also observed that none of the existing implemented tools covered all the data quality dimensions. In fact, the best coverage in terms of dimensions was achieved by Flemming's data quality assessment tool, with 11 covered dimensions. Our survey revealed that the flexibility of the tools with regard to the level of automation and user involvement needs to be improved: some tools required a considerable amount of configuration, while others were easy to use but provided only results of limited usefulness or requiring a high level of interpretation.

Meanwhile, a considerable amount of research on data quality is being carried out, and guidelines as well as recommendations on how to publish "good" data are available. However, there is less focus on how to use this "good" data. We deem our data quality dimensions to be very useful for data consumers in order to assess the quality of datasets. As a next step, we aim to integrate the various data quality dimensions into a comprehensive methodological framework for data quality assessment comprising the following steps:

1. Requirements analysis
2. Data quality checklist
3. Statistics and low-level analysis
4. Aggregated and higher-level metrics
5. Comparison
6. Interpretation

We aim to develop this framework for data quality assessment, allowing a data consumer to select and assess the quality of suitable datasets according to this methodology. In the process, we also expect new metrics to be defined and implemented.


Page 30: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

30 Linked Data Quality

vided with two input fields one showing the assignedweights and another with the calculated value

In the end the user is presented with a score rangingfrom 0 to 100 representing the final data quality scoreAdditionally the rating of each dimension and the totalweight (based on 11 dimensions) is calculated usingthe user input from the previous steps Figure 4 showsan excerpt of the tool showing the result of assessingthe quality of LinkedCT with a score of 15 out of 100

On one hand the tool is easy to use with the form-based questions and adequate explanation for eachstep Also the assigning of the weights for each met-ric and the calculation of the score is straightforwardand easy to adjust for each of the metrics Howeverthis tool has a few drawbacks (1) the user needs tohave adequate knowledge about the dataset in order tocorrectly assign weights for each of the metrics (2)it does not drill down to the root cause of the dataquality problem and (3) some of the main quality di-mensions are missing from the analysis such as accu-racy completeness provenance consistency concise-ness and relevancy as some could not be quantified andwere not perceived to be true quality indicators

452 SieveSieve is a component of the Linked Data Integra-

tion Framework (LDIF)26 used first to assess the qual-ity between two or more data sources and second tofuse (integrate) the data from the data sources basedon their quality assessment In order to use this tool auser needs to be conversant with programming

The input of Sieve is an LDIF provenance meta-data graph generated from a data source Based onthis information the user needs to set the configurationproperty in an XML file known as integrationproperties The quality assessment procedure re-lies on the measurement of metrics chosen by the userwhere each metric applies a scoring function having avalue from 0 to 1

Sieve implements only a few scoring functions suchas TimeCloseness Preference SetMembershipThreshold and Interval Membership whichare calculated based on the metadata provided as in-put along with the original data source The con-figuration file is in XML format which should bemodified based on the use case as shown in List-ing 1 The output scores are then used to fuse thedata sources by applying one of the fusion functionswhich are Filter Average Max Min First

26httpldifwbsgde

KeepSingleValue ByQualityScore LastRandom PickMostFrequent

1 ltSievegt2 ltQualityAssessmentgt3 ltAssessmentMetric id=sieverecencygt4 ltScoringFunction class=TimeClosenessgt5 ltParam name=timeSpan value=7gt6 ltInput path=GRAPHprovenancelasUpdatedgt7 ltScoringFunctiongt8 ltAssessmentMetricgt9 ltAssessmentMetric id=sievereputationgt

10 ltScoringFunction class=ScoredListgt11 ltParam name=priority value=httppt

wikipediaorg httpenwikipediaorggt12 ltInput path=GRAPHprovenancelasUpdatedgt13 ltScoringFunctiongt14 ltAssessmentMetricgt15 ltSievegt

Listing 1 A configuration of Sieve a Data QualityAssessment and Data Fusion tool

In addition users should specify the DataSourcefolder the homepage element that refers to the datasource from which the entities are going to be fusedSecond the XML file of the ImportJobs that down-loads the data to the server should also be modified Inparticular the user should set up the dumpLocationelement as the location of the dump file

Although the tool is very useful overall there aresome drawbacks that decreases its usability (1) thetool is not mainly used to assess data quality for asource but instead to perform data fusion (integra-tion) based on quality assessment Therefore the qual-ity assessment can be considered as an accessory thatleverages the process of data fusion through evalua-tion of few quality indicators (2) it does not providea user interface ultimately limiting its usage to end-users with programming skills (3) its usage is limitedto domains providing provenance metadata associatedwith the data source

453 LODGRefineLODGRefine27 is a LOD-enabled version of Google

Refine which is an open-source tool for refining messydata Although this tool is not focused on data qualityassessment per se it is powerful in performing prelim-inary cleaning or refining of raw data

Using this tool one is able to import several differ-ent file types of data (CSV Excel XML RDFXMLN-Triples or even JSON) and then perform cleaningaction via a browser-based interface By using a di-verse set of filters and facets on individual columns

27httpcodezemantacomsparkica

Linked Data Quality 31

LODGRefine can help a user to semi-automate thecleaning of her data For example this tool can helpto detect duplicates discover patterns (eg alternativeforms of an abbreviation) spot inconsistencies (egtrailing white spaces) or find and replace blank cellsAdditionally this tool allows users to reconcile datathat is to connect a dataset to existing vocabulariessuch that it gives meaning to the values Reconcilia-tions to Freebase28 helps mapping ambiguous textualvalues to precisely identified Freebase entities Recon-ciling using Sindice or based on standard SPARQL orSPARQL with full-text search is also possible29 usingthis tool Moreover it is also possible to extend the rec-onciled data with DBpedia as well as export the dataas RDF which adds to the uniformity and usability ofthe dataset

These feature thus assists in assessing as well as im-proving the data quality of a dataset Moreover by pro-viding external links the interlinking of the dataset isconsiderably improved LODGRefine is easy to down-load and install as well as to upload and perform basiccleansing steps on raw data The features of reconcil-iation extending the data with DBpedia transformingand exporting the data as RDF are added advantagesHowever this tool has a few drawbacks (1) the useris not able to perform detailed high level data qualityanalysis utilizing the various quality dimensions usingthis tool (2) performing cleansing over a large datasetis time consuming as the tool follows a column datamodel and thus the user must perform transformationsper column

5 Conclusions and future work

In this paper we have presented to the best of ourknowledge the most comprehensive systematic reviewof data quality assessment methodologies applied toLOD The goal of this survey is to obtain a clear under-standing of the differences between such approachesin particular in terms of quality dimensions metricstype of data and level of automation

We survey 21 approaches and extracted 26 dataquality dimensions along with their definitions andcorresponding metrics We also identified tools pro-posed for each approach and classified them in rela-tion to data type the degree of automation and requiredlevel of user knowledge Finally we evaluated three

28httpwwwfreebasecom29httprefinederiiereconciliationDocs

specific tools in terms of usability for data quality as-sessment

As we can observe most of the publications focus-ing on data quality assessment in Linked Data are pre-sented at either conferences or workshops As our lit-erature review reveals the number of 21 publicationspublished in the span of 10 years is rather low Thiscan be attributed to the infancy of this research areawhich is currently emerging

In most of the surveyed literature the metrics wereoften not explicitly defined or did not consist of precisestatistical measures Also there was no formal valida-tion of the methodologies that were implemented astools Moreover only few approaches were actuallyaccompanied by an implemented tool That is out ofthe 21 approaches only 9 provided tool support Wealso observed that none of the existing implementedtools covered all the data quality dimensions In factthe best coverage in terms of dimensions was achievedby Flemmingrsquos data quality assessment tool with 11covered dimensions Our survey revealed that the flex-ibility of the tools with regard to the level of automa-tion and user involvement needs to be improved Sometools required a considerable amount of configurationand others were easy-to-use but provided only resultswith limited usefulness or required a high-level of in-terpretation

Meanwhile there is quite much research on dataquality being done and guidelines as well as recom-mendations on how to publish ldquogood data are avail-able However there is less focus on how to use thisldquogood data We deem our data quality dimensions tobe very useful for data consumers in order to assessthe quality of datasets As a next step we aim to inte-grate the various data quality dimensions into a com-prehensive methodological framework for data qualityassessment comprising the following steps

1 Requirements analysis2 Data quality checklist3 Statistics and low-level analysis4 Aggregated and higher level metrics5 Comparison6 Interpretation

We aim to develop this framework for data quality as-sessment allowing a data consumer to select and as-sess the quality of suitable datasets according to thismethodology In the process we also expect new met-rics to be defined and implemented

32 Linked Data Quality

References

[1] AGRAWAL R AND SRIKANT R Fast algorithms for miningassociation rules in large databases In International Confer-ence on Very Large Data Bases (1994) pp 487ndash499

[2] BATINI C CAPPIELLO C FRANCALANCI C AND MAU-RINO A Methodologies for data quality assessment and im-provement ACM Computing Surveys 41 3 (2009)

[3] BATINI C AND SCANNAPIECO M Data Quality Con-cepts Methodologies and Techniques (Data-Centric Systemsand Applications) Springer-Verlag New York Inc SecaucusNJ USA 2006

[4] BECKETT D RDFXML Syntax Specification(Revised) Tech rep World Wide Web Consor-tium 2004 httpwwww3orgTR2004REC-rdf-syntax-grammar-20040210

[5] BIZER C Quality-Driven Information Filtering in the Con-text of Web-Based Information Systems PhD thesis Freie Uni-versitaumlt Berlin March 2007

[6] BIZER C AND CYGANIAK R Quality-driven informationfiltering using the wiqa policy framework Web Semantics 7 1(Jan 2009) 1 ndash 10

[7] BOumlHM C NAUMANN F ABEDJAN Z FENZ DGRUumlTZE T HEFENBROCK D POHL M AND

SONNABEND D Profiling linked open data with prolod InICDE Workshops (2010) IEEE pp 175ndash178

[8] BONATTI P A HOGAN A POLLERES A AND SAUROL Robust and scalable linked data reasoning incorporatingprovenance and trust annotations Journal of Web Semantics 92 (2011) 165 ndash 201

[9] BOVEE M SRIVASTAVA R P AND MAK B A conceptualframework and belief-function approach to assessing overallinformation quality International Journal of Intelligent Sys-tems 18 1 (2003) 51ndash74

[10] BRICKLEY D AND GUHA R V Rdf vocabu-lary description language 10 Rdf schema Techrep W3C 2004 httpwwww3orgTR2004REC-rdf-schema-20040210

[11] CARROLL J Signing rdf graphs Tech rep HPL-2003-142HP Labs 2003

[12] CHEN P AND GARCIA W Hypothesis generation and dataquality assessment through association mining In IEEE ICCI(2010) IEEE pp 659ndash666

[13] DEMTER J AUER S MARTIN M AND LEHMANNJ Lodstats ndash an extensible framework for high-performancedataset analytics In EKAW (2012) LNCS Springer

[14] FLEMMING A Quality characteristics of linked data pub-lishing datasources Masterrsquos thesis Humboldt-Universitaumlt zuBerlin 2010

[15] FUumlRBER C AND HEPP M Swiqa - a semantic web infor-mation quality assessment framework In ECIS (2011)

[16] GAMBLE M AND GOBLE C Quality trust and utility ofscientific data on the web Towards a joint model In ACMWebSci (June 2011) pp 1ndash8

[17] GIL Y AND ARTZ D Towards content trust of web re-sources Web Semantics 5 4 (December 2007) 227 ndash 239

[18] GIL Y AND RATNAKAR V Trusting information sourcesone citizen at a time In ISWC (2002) Springer-Verlag pp 162ndash 176

[19] GOLBECK J Inferring reputation on the semantic web InWWW (2004)

[20] GOLBECK J Using trust and provenance for content filteringon the semantic web In Workshop on Models of Trust on theWeb at the 15th World Wide Web Conference (2006)

[21] GOLBECK J PARSIA B AND HENDLER J Trust networkson the semantic web In Cooperative Intelligent Agents (2003)

[22] GUEacuteRET C GROTH P STADLER C AND LEHMANN JAssessing linked data mappings using network measures InESWC (2012)

[23] HARTIG O Trustworthiness of data on the web In STI Berlinand CSW PhD Workshop Berlin Germany (2008)

[24] HAYES P Rdf semantics Recommendation World Wide WebConsortium 2004 httpwwww3orgTR2004REC-rdf-mt-20040210

[25] HEATH T AND BIZER C Linked Data Evolving the Webinto a Global Data Space 1st ed No 11 in Synthesis Lectureson the Semantic Web Theory and Technology Morgan andClaypool 2011 ch 2 pp 1 ndash 136

[26] HOGAN A HARTH A PASSANT A DECKER S AND

POLLERES A Weaving the pedantic web In LDOW (2010)[27] HOGAN A UMBRICH J HARTH A CYGANIAK R

POLLERES A AND DECKER S An empirical survey oflinked data conformance Journal of Web Semantics (2012)

[28] HORROCKS I PATEL-SCHNEIDER P BOLEY H TABETS GROSOF B AND DEAN M Swrl A semantic web rulelanguage combining owl and ruleml Tech rep W3C May2004

[29] JACOBI I KAGAL L AND KHANDELWAL A Rule-basedtrust assessment on the semantic web In International confer-ence on Rule-based reasoning programming and applicationsseries (2011) pp 227 ndash 241

[30] JARKE M LENZERINI M VASSILIOU Y AND VASSIL-IADIS P Fundamentals of Data Warehouses 2nd ed SpringerPublishing Company 2010

[31] JURAN J The Quality Control Handbook McGraw-Hill NewYork 1974

[32] KIFER M AND BOLEY H Rif overview Tech repW3C June 2010 httpwwww3orgTR2012NOTE-rif-overview-20121211

[33] KITCHENHAM B Procedures for performing systematic re-views

[34] KNIGHT S AND BURN J Developing a framework for as-sessing information quality on the world wide web Informa-tion Science 8 (2005) 159 ndash 172

[35] LEE Y W STRONG D M KAHN B K AND WANGR Y Aimq a methodology for information quality assess-ment Information Management 40 2 (2002) 133 ndash 146

[36] LEHMANN J GERBER D MORSEY M AND NGONGA

NGOMO A-C DeFacto - Deep Fact Validation In ISWC(2012) Springer Berlin Heidelberg

[37] LEI Y NIKOLOV A UREN V AND MOTTA E Detectingquality problems in semantic metadata without the presence ofa gold standard In Workshop on ldquoEvaluation of Ontologies forthe Webrdquo (EON) at the WWW (2007) pp 51ndash60

[38] LEI Y UREN V AND MOTTA E A framework for eval-uating semantic metadata In 4th International Conference onKnowledge Capture (2007) no 8 in K-CAP rsquo07 ACM pp 135ndash 142

[39] LEO PIPINO RICAHRD WANG D K AND RYBOLD WDeveloping Measurement Scales for Data-Quality Dimen-sions vol 1 ME Sharpe New York 2005

[40] MAYNARD D PETERS W AND LI Y Metrics for evalua-

Linked Data Quality 33

tion of ontology-based information extraction In Workshop onldquoEvaluation of Ontologies for the Webrdquo (EON) at WWW (May2006)

[41] MEILICKE C AND STUCKENSCHMIDT H Incoherence asa basis for measuring the quality of ontology mappings In3rd International Workshop on Ontology Matching (OM) at theISWC (2008)

[42] MENDES P MUumlHLEISEN H AND BIZER C Sieve Linkeddata quality assessment and fusion In LWDM (March 2012)

[43] MILLER P STYLES R AND HEATH T Open data com-mons a license for open data In LDOW (2008)

[44] MOHER D LIBERATI A TETZLAFF J ALTMAN D GAND PRISMA GROUP Preferred reporting items for system-atic reviews and meta-analyses the prisma statement PLoSmedicine 6 7 (2009)

[45] MOSTAFAVI M G E AND JEANSOULIN R Ontology-based method for quality assessment of spatial data basesIn International Symposium on Spatial Data Quality (2004)vol 4 pp 49ndash66

[46] NAUMANN F Quality-Driven Query Answering for Inte-grated Information Systems vol 2261 of Lecture Notes inComputer Science Springer-Verlag 2002

[47] PIPINO L LEE Y W AND WANG R Y Data qualityassessment Communications of the ACM 45 4 (2002)

[48] REDMAN T C Data Quality for the Information Age 1st edArtech House 1997

[49] RULA A PALMONARI M HARTH A STADTMUumlLLERS AND MAURINO A On the diversity and availability oftemporal information in linked open data In ISWC (2012)

[50] RULA A PALMONARI M AND MAURINO A Captur-ing the age of linked open data Towards a dataset-independentframework In IEEE International Conference on SemanticComputing (2012)

[51] SCANNAPIECO M AND CATARCI T Data quality under acomputer science perspective Archivi amp Computer 2 (2002)1ndash15

[52] SCHOBER D BARRY S LEWIS E S KUSNIERCZYKW LOMAX J MUNGALL C TAYLOR F C ROCCA-SERRA P AND SANSONE S-A Survey-based naming con-ventions for use in obo foundry ontology development BMCBioinformatics 10 125 (2009)

[53] SHEKARPOUR S AND KATEBI S Modeling and evalua-tion of trust with an extension in semantic web Web Seman-tics Science Services and Agents on the World Wide Web 8 1(March 2010) 26 ndash 36

[54] SUCHANEK F M GROSS-AMBLARD D AND ABITE-BOUL S Watermarking for ontologies In ISWC (2011)

[55] WAND Y AND WANG R Y Anchoring data quality dimen-sions in ontological foundations Communications of the ACM39 11 (1996) 86ndash95

[56] WANG R Y A product perspective on total data quality man-agement Communications of the ACM 41 2 (Feb 1998) 58 ndash65

[57] WANG R Y AND STRONG D M Beyond accuracy whatdata quality means to data consumers Journal of ManagementInformation Systems 12 4 (1996) 5ndash33

Page 31: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

Linked Data Quality 31

LODGRefine can help a user to semi-automate thecleaning of her data For example this tool can helpto detect duplicates discover patterns (eg alternativeforms of an abbreviation) spot inconsistencies (egtrailing white spaces) or find and replace blank cellsAdditionally this tool allows users to reconcile datathat is to connect a dataset to existing vocabulariessuch that it gives meaning to the values Reconcilia-tions to Freebase28 helps mapping ambiguous textualvalues to precisely identified Freebase entities Recon-ciling using Sindice or based on standard SPARQL orSPARQL with full-text search is also possible29 usingthis tool Moreover it is also possible to extend the rec-onciled data with DBpedia as well as export the dataas RDF which adds to the uniformity and usability ofthe dataset

These feature thus assists in assessing as well as im-proving the data quality of a dataset Moreover by pro-viding external links the interlinking of the dataset isconsiderably improved LODGRefine is easy to down-load and install as well as to upload and perform basiccleansing steps on raw data The features of reconcil-iation extending the data with DBpedia transformingand exporting the data as RDF are added advantagesHowever this tool has a few drawbacks (1) the useris not able to perform detailed high level data qualityanalysis utilizing the various quality dimensions usingthis tool (2) performing cleansing over a large datasetis time consuming as the tool follows a column datamodel and thus the user must perform transformationsper column

5 Conclusions and future work

In this paper we have presented to the best of ourknowledge the most comprehensive systematic reviewof data quality assessment methodologies applied toLOD The goal of this survey is to obtain a clear under-standing of the differences between such approachesin particular in terms of quality dimensions metricstype of data and level of automation

We survey 21 approaches and extracted 26 dataquality dimensions along with their definitions andcorresponding metrics We also identified tools pro-posed for each approach and classified them in rela-tion to data type the degree of automation and requiredlevel of user knowledge Finally we evaluated three

28httpwwwfreebasecom29httprefinederiiereconciliationDocs

specific tools in terms of usability for data quality as-sessment

As we can observe most of the publications focus-ing on data quality assessment in Linked Data are pre-sented at either conferences or workshops As our lit-erature review reveals the number of 21 publicationspublished in the span of 10 years is rather low Thiscan be attributed to the infancy of this research areawhich is currently emerging

In most of the surveyed literature the metrics wereoften not explicitly defined or did not consist of precisestatistical measures Also there was no formal valida-tion of the methodologies that were implemented astools Moreover only few approaches were actuallyaccompanied by an implemented tool That is out ofthe 21 approaches only 9 provided tool support Wealso observed that none of the existing implementedtools covered all the data quality dimensions In factthe best coverage in terms of dimensions was achievedby Flemmingrsquos data quality assessment tool with 11covered dimensions Our survey revealed that the flex-ibility of the tools with regard to the level of automa-tion and user involvement needs to be improved Sometools required a considerable amount of configurationand others were easy-to-use but provided only resultswith limited usefulness or required a high-level of in-terpretation

Meanwhile there is quite much research on dataquality being done and guidelines as well as recom-mendations on how to publish ldquogood data are avail-able However there is less focus on how to use thisldquogood data We deem our data quality dimensions tobe very useful for data consumers in order to assessthe quality of datasets As a next step we aim to inte-grate the various data quality dimensions into a com-prehensive methodological framework for data qualityassessment comprising the following steps

1 Requirements analysis2 Data quality checklist3 Statistics and low-level analysis4 Aggregated and higher level metrics5 Comparison6 Interpretation

We aim to develop this framework for data quality as-sessment allowing a data consumer to select and as-sess the quality of suitable datasets according to thismethodology In the process we also expect new met-rics to be defined and implemented

32 Linked Data Quality

References

[1] AGRAWAL R AND SRIKANT R Fast algorithms for miningassociation rules in large databases In International Confer-ence on Very Large Data Bases (1994) pp 487ndash499

[2] BATINI C CAPPIELLO C FRANCALANCI C AND MAU-RINO A Methodologies for data quality assessment and im-provement ACM Computing Surveys 41 3 (2009)

[3] BATINI C AND SCANNAPIECO M Data Quality Con-cepts Methodologies and Techniques (Data-Centric Systemsand Applications) Springer-Verlag New York Inc SecaucusNJ USA 2006

[4] BECKETT D RDFXML Syntax Specification(Revised) Tech rep World Wide Web Consor-tium 2004 httpwwww3orgTR2004REC-rdf-syntax-grammar-20040210

[5] BIZER C Quality-Driven Information Filtering in the Con-text of Web-Based Information Systems PhD thesis Freie Uni-versitaumlt Berlin March 2007

[6] BIZER C AND CYGANIAK R Quality-driven informationfiltering using the wiqa policy framework Web Semantics 7 1(Jan 2009) 1 ndash 10

[7] BOumlHM C NAUMANN F ABEDJAN Z FENZ DGRUumlTZE T HEFENBROCK D POHL M AND

SONNABEND D Profiling linked open data with prolod InICDE Workshops (2010) IEEE pp 175ndash178

[8] BONATTI P A HOGAN A POLLERES A AND SAUROL Robust and scalable linked data reasoning incorporatingprovenance and trust annotations Journal of Web Semantics 92 (2011) 165 ndash 201

[9] BOVEE M SRIVASTAVA R P AND MAK B A conceptualframework and belief-function approach to assessing overallinformation quality International Journal of Intelligent Sys-tems 18 1 (2003) 51ndash74

[10] BRICKLEY D AND GUHA R V Rdf vocabu-lary description language 10 Rdf schema Techrep W3C 2004 httpwwww3orgTR2004REC-rdf-schema-20040210

[11] CARROLL J Signing rdf graphs Tech rep HPL-2003-142HP Labs 2003

[12] CHEN P AND GARCIA W Hypothesis generation and dataquality assessment through association mining In IEEE ICCI(2010) IEEE pp 659ndash666

[13] DEMTER J AUER S MARTIN M AND LEHMANNJ Lodstats ndash an extensible framework for high-performancedataset analytics In EKAW (2012) LNCS Springer

[14] FLEMMING A Quality characteristics of linked data pub-lishing datasources Masterrsquos thesis Humboldt-Universitaumlt zuBerlin 2010

[15] FUumlRBER C AND HEPP M Swiqa - a semantic web infor-mation quality assessment framework In ECIS (2011)

[16] GAMBLE M AND GOBLE C Quality trust and utility ofscientific data on the web Towards a joint model In ACMWebSci (June 2011) pp 1ndash8

[17] GIL Y AND ARTZ D Towards content trust of web re-sources Web Semantics 5 4 (December 2007) 227 ndash 239

[18] GIL Y AND RATNAKAR V Trusting information sourcesone citizen at a time In ISWC (2002) Springer-Verlag pp 162ndash 176

[19] GOLBECK J Inferring reputation on the semantic web InWWW (2004)

[20] GOLBECK J Using trust and provenance for content filteringon the semantic web In Workshop on Models of Trust on theWeb at the 15th World Wide Web Conference (2006)

[21] GOLBECK J PARSIA B AND HENDLER J Trust networkson the semantic web In Cooperative Intelligent Agents (2003)

[22] GUEacuteRET C GROTH P STADLER C AND LEHMANN JAssessing linked data mappings using network measures InESWC (2012)

[23] HARTIG O Trustworthiness of data on the web In STI Berlinand CSW PhD Workshop Berlin Germany (2008)

[24] HAYES P Rdf semantics Recommendation World Wide WebConsortium 2004 httpwwww3orgTR2004REC-rdf-mt-20040210

[25] HEATH T AND BIZER C Linked Data Evolving the Webinto a Global Data Space 1st ed No 11 in Synthesis Lectureson the Semantic Web Theory and Technology Morgan andClaypool 2011 ch 2 pp 1 ndash 136

[26] HOGAN A HARTH A PASSANT A DECKER S AND

POLLERES A Weaving the pedantic web In LDOW (2010)[27] HOGAN A UMBRICH J HARTH A CYGANIAK R

POLLERES A AND DECKER S An empirical survey oflinked data conformance Journal of Web Semantics (2012)

[28] HORROCKS I PATEL-SCHNEIDER P BOLEY H TABETS GROSOF B AND DEAN M Swrl A semantic web rulelanguage combining owl and ruleml Tech rep W3C May2004

[29] JACOBI I KAGAL L AND KHANDELWAL A Rule-basedtrust assessment on the semantic web In International confer-ence on Rule-based reasoning programming and applicationsseries (2011) pp 227 ndash 241

[30] JARKE M LENZERINI M VASSILIOU Y AND VASSIL-IADIS P Fundamentals of Data Warehouses 2nd ed SpringerPublishing Company 2010

[31] JURAN J The Quality Control Handbook McGraw-Hill NewYork 1974

[32] KIFER M AND BOLEY H Rif overview Tech repW3C June 2010 httpwwww3orgTR2012NOTE-rif-overview-20121211

[33] KITCHENHAM B Procedures for performing systematic re-views

[34] KNIGHT S AND BURN J Developing a framework for as-sessing information quality on the world wide web Informa-tion Science 8 (2005) 159 ndash 172

[35] LEE Y W STRONG D M KAHN B K AND WANGR Y Aimq a methodology for information quality assess-ment Information Management 40 2 (2002) 133 ndash 146

[36] LEHMANN J GERBER D MORSEY M AND NGONGA

NGOMO A-C DeFacto - Deep Fact Validation In ISWC(2012) Springer Berlin Heidelberg

[37] LEI Y NIKOLOV A UREN V AND MOTTA E Detectingquality problems in semantic metadata without the presence ofa gold standard In Workshop on ldquoEvaluation of Ontologies forthe Webrdquo (EON) at the WWW (2007) pp 51ndash60

[38] LEI Y UREN V AND MOTTA E A framework for eval-uating semantic metadata In 4th International Conference onKnowledge Capture (2007) no 8 in K-CAP rsquo07 ACM pp 135ndash 142

[39] LEO PIPINO RICAHRD WANG D K AND RYBOLD WDeveloping Measurement Scales for Data-Quality Dimen-sions vol 1 ME Sharpe New York 2005

[40] MAYNARD D PETERS W AND LI Y Metrics for evalua-

Linked Data Quality 33

tion of ontology-based information extraction In Workshop onldquoEvaluation of Ontologies for the Webrdquo (EON) at WWW (May2006)

[41] MEILICKE C AND STUCKENSCHMIDT H Incoherence asa basis for measuring the quality of ontology mappings In3rd International Workshop on Ontology Matching (OM) at theISWC (2008)

[42] MENDES P MUumlHLEISEN H AND BIZER C Sieve Linkeddata quality assessment and fusion In LWDM (March 2012)

[43] MILLER P STYLES R AND HEATH T Open data com-mons a license for open data In LDOW (2008)

[44] MOHER D LIBERATI A TETZLAFF J ALTMAN D GAND PRISMA GROUP Preferred reporting items for system-atic reviews and meta-analyses the prisma statement PLoSmedicine 6 7 (2009)

[45] MOSTAFAVI M G E AND JEANSOULIN R Ontology-based method for quality assessment of spatial data basesIn International Symposium on Spatial Data Quality (2004)vol 4 pp 49ndash66

[46] NAUMANN F Quality-Driven Query Answering for Inte-grated Information Systems vol 2261 of Lecture Notes inComputer Science Springer-Verlag 2002

[47] PIPINO L LEE Y W AND WANG R Y Data qualityassessment Communications of the ACM 45 4 (2002)

[48] REDMAN T C Data Quality for the Information Age 1st edArtech House 1997

[49] RULA A PALMONARI M HARTH A STADTMUumlLLERS AND MAURINO A On the diversity and availability oftemporal information in linked open data In ISWC (2012)

[50] RULA A PALMONARI M AND MAURINO A Captur-ing the age of linked open data Towards a dataset-independentframework In IEEE International Conference on SemanticComputing (2012)

[51] SCANNAPIECO M AND CATARCI T Data quality under acomputer science perspective Archivi amp Computer 2 (2002)1ndash15

[52] SCHOBER D BARRY S LEWIS E S KUSNIERCZYKW LOMAX J MUNGALL C TAYLOR F C ROCCA-SERRA P AND SANSONE S-A Survey-based naming con-ventions for use in obo foundry ontology development BMCBioinformatics 10 125 (2009)

[53] SHEKARPOUR S AND KATEBI S Modeling and evalua-tion of trust with an extension in semantic web Web Seman-tics Science Services and Agents on the World Wide Web 8 1(March 2010) 26 ndash 36

[54] SUCHANEK F M GROSS-AMBLARD D AND ABITE-BOUL S Watermarking for ontologies In ISWC (2011)

[55] WAND Y AND WANG R Y Anchoring data quality dimen-sions in ontological foundations Communications of the ACM39 11 (1996) 86ndash95

[56] WANG R Y A product perspective on total data quality man-agement Communications of the ACM 41 2 (Feb 1998) 58 ndash65

[57] WANG R Y AND STRONG D M Beyond accuracy whatdata quality means to data consumers Journal of ManagementInformation Systems 12 4 (1996) 5ndash33

Page 32: IOS Press Quality Assessment Methodologies for Linked Open ... · IOS Press Quality Assessment Methodologies for Linked Open Data ... digital libraries, journals,

32 Linked Data Quality

References

[1] AGRAWAL R AND SRIKANT R Fast algorithms for miningassociation rules in large databases In International Confer-ence on Very Large Data Bases (1994) pp 487ndash499

[2] BATINI C CAPPIELLO C FRANCALANCI C AND MAU-RINO A Methodologies for data quality assessment and im-provement ACM Computing Surveys 41 3 (2009)

[3] BATINI C AND SCANNAPIECO M Data Quality Con-cepts Methodologies and Techniques (Data-Centric Systemsand Applications) Springer-Verlag New York Inc SecaucusNJ USA 2006

[4] BECKETT D RDFXML Syntax Specification(Revised) Tech rep World Wide Web Consor-tium 2004 httpwwww3orgTR2004REC-rdf-syntax-grammar-20040210

[5] BIZER C Quality-Driven Information Filtering in the Con-text of Web-Based Information Systems PhD thesis Freie Uni-versitaumlt Berlin March 2007

[6] BIZER C AND CYGANIAK R Quality-driven informationfiltering using the wiqa policy framework Web Semantics 7 1(Jan 2009) 1 ndash 10

[7] BOumlHM C NAUMANN F ABEDJAN Z FENZ DGRUumlTZE T HEFENBROCK D POHL M AND

SONNABEND D Profiling linked open data with prolod InICDE Workshops (2010) IEEE pp 175ndash178

[8] BONATTI P A HOGAN A POLLERES A AND SAUROL Robust and scalable linked data reasoning incorporatingprovenance and trust annotations Journal of Web Semantics 92 (2011) 165 ndash 201

[9] BOVEE M SRIVASTAVA R P AND MAK B A conceptualframework and belief-function approach to assessing overallinformation quality International Journal of Intelligent Sys-tems 18 1 (2003) 51ndash74

[10] BRICKLEY D AND GUHA R V Rdf vocabu-lary description language 10 Rdf schema Techrep W3C 2004 httpwwww3orgTR2004REC-rdf-schema-20040210

[11] CARROLL J Signing rdf graphs Tech rep HPL-2003-142HP Labs 2003

[12] CHEN P AND GARCIA W Hypothesis generation and dataquality assessment through association mining In IEEE ICCI(2010) IEEE pp 659ndash666

[13] DEMTER J AUER S MARTIN M AND LEHMANNJ Lodstats ndash an extensible framework for high-performancedataset analytics In EKAW (2012) LNCS Springer

[14] FLEMMING A Quality characteristics of linked data pub-lishing datasources Masterrsquos thesis Humboldt-Universitaumlt zuBerlin 2010

[15] FUumlRBER C AND HEPP M Swiqa - a semantic web infor-mation quality assessment framework In ECIS (2011)

[16] GAMBLE M AND GOBLE C Quality trust and utility ofscientific data on the web Towards a joint model In ACMWebSci (June 2011) pp 1ndash8

[17] GIL Y AND ARTZ D Towards content trust of web re-sources Web Semantics 5 4 (December 2007) 227 ndash 239

[18] GIL Y AND RATNAKAR V Trusting information sourcesone citizen at a time In ISWC (2002) Springer-Verlag pp 162ndash 176

[19] GOLBECK J Inferring reputation on the semantic web InWWW (2004)

[20] GOLBECK J Using trust and provenance for content filteringon the semantic web In Workshop on Models of Trust on theWeb at the 15th World Wide Web Conference (2006)

[21] GOLBECK J PARSIA B AND HENDLER J Trust networkson the semantic web In Cooperative Intelligent Agents (2003)

[22] GUEacuteRET C GROTH P STADLER C AND LEHMANN JAssessing linked data mappings using network measures InESWC (2012)

[23] HARTIG O Trustworthiness of data on the web In STI Berlinand CSW PhD Workshop Berlin Germany (2008)

[24] HAYES P Rdf semantics Recommendation World Wide WebConsortium 2004 httpwwww3orgTR2004REC-rdf-mt-20040210

[25] HEATH T AND BIZER C Linked Data Evolving the Webinto a Global Data Space 1st ed No 11 in Synthesis Lectureson the Semantic Web Theory and Technology Morgan andClaypool 2011 ch 2 pp 1 ndash 136

[26] HOGAN A HARTH A PASSANT A DECKER S AND

POLLERES A Weaving the pedantic web In LDOW (2010)[27] HOGAN A UMBRICH J HARTH A CYGANIAK R

POLLERES A AND DECKER S An empirical survey oflinked data conformance Journal of Web Semantics (2012)

[28] HORROCKS I PATEL-SCHNEIDER P BOLEY H TABETS GROSOF B AND DEAN M Swrl A semantic web rulelanguage combining owl and ruleml Tech rep W3C May2004

[29] JACOBI I KAGAL L AND KHANDELWAL A Rule-basedtrust assessment on the semantic web In International confer-ence on Rule-based reasoning programming and applicationsseries (2011) pp 227 ndash 241

[30] JARKE M LENZERINI M VASSILIOU Y AND VASSIL-IADIS P Fundamentals of Data Warehouses 2nd ed SpringerPublishing Company 2010

[31] JURAN J The Quality Control Handbook McGraw-Hill NewYork 1974

[32] KIFER M AND BOLEY H Rif overview Tech repW3C June 2010 httpwwww3orgTR2012NOTE-rif-overview-20121211

[33] KITCHENHAM B Procedures for performing systematic re-views

[34] KNIGHT S AND BURN J Developing a framework for as-sessing information quality on the world wide web Informa-tion Science 8 (2005) 159 ndash 172

[35] LEE Y W STRONG D M KAHN B K AND WANGR Y Aimq a methodology for information quality assess-ment Information Management 40 2 (2002) 133 ndash 146

[36] LEHMANN J GERBER D MORSEY M AND NGONGA

NGOMO A-C DeFacto - Deep Fact Validation In ISWC(2012) Springer Berlin Heidelberg

[37] LEI Y NIKOLOV A UREN V AND MOTTA E Detectingquality problems in semantic metadata without the presence ofa gold standard In Workshop on ldquoEvaluation of Ontologies forthe Webrdquo (EON) at the WWW (2007) pp 51ndash60

[38] LEI Y UREN V AND MOTTA E A framework for eval-uating semantic metadata In 4th International Conference onKnowledge Capture (2007) no 8 in K-CAP rsquo07 ACM pp 135ndash 142

[39] LEO PIPINO RICAHRD WANG D K AND RYBOLD WDeveloping Measurement Scales for Data-Quality Dimen-sions vol 1 ME Sharpe New York 2005

[40] MAYNARD D PETERS W AND LI Y Metrics for evalua-

Linked Data Quality 33

tion of ontology-based information extraction In Workshop onldquoEvaluation of Ontologies for the Webrdquo (EON) at WWW (May2006)

[41] MEILICKE C AND STUCKENSCHMIDT H Incoherence asa basis for measuring the quality of ontology mappings In3rd International Workshop on Ontology Matching (OM) at theISWC (2008)

[42] MENDES P MUumlHLEISEN H AND BIZER C Sieve Linkeddata quality assessment and fusion In LWDM (March 2012)

[43] MILLER P STYLES R AND HEATH T Open data com-mons a license for open data In LDOW (2008)

[44] MOHER D LIBERATI A TETZLAFF J ALTMAN D GAND PRISMA GROUP Preferred reporting items for system-atic reviews and meta-analyses the prisma statement PLoSmedicine 6 7 (2009)

[45] MOSTAFAVI M G E AND JEANSOULIN R Ontology-based method for quality assessment of spatial data basesIn International Symposium on Spatial Data Quality (2004)vol 4 pp 49ndash66

[46] NAUMANN F Quality-Driven Query Answering for Inte-grated Information Systems vol 2261 of Lecture Notes inComputer Science Springer-Verlag 2002

[47] PIPINO L LEE Y W AND WANG R Y Data qualityassessment Communications of the ACM 45 4 (2002)

[48] REDMAN T C Data Quality for the Information Age 1st edArtech House 1997

[49] RULA A PALMONARI M HARTH A STADTMUumlLLERS AND MAURINO A On the diversity and availability oftemporal information in linked open data In ISWC (2012)

[50] RULA A PALMONARI M AND MAURINO A Captur-ing the age of linked open data Towards a dataset-independentframework In IEEE International Conference on SemanticComputing (2012)

[51] SCANNAPIECO M AND CATARCI T Data quality under acomputer science perspective Archivi amp Computer 2 (2002)1ndash15

[52] SCHOBER, D., SMITH, B., LEWIS, S. E., KUSNIERCZYK, W., LOMAX, J., MUNGALL, C., TAYLOR, C. F., ROCCA-SERRA, P., AND SANSONE, S.-A. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10, 125 (2009).

[53] SHEKARPOUR, S., AND KATEBI, S. Modeling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 1 (March 2010), 26–36.

[54] SUCHANEK, F. M., GROSS-AMBLARD, D., AND ABITEBOUL, S. Watermarking for ontologies. In ISWC (2011).

[55] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.

[56] WANG, R. Y. A product perspective on total data quality management. Communications of the ACM 41, 2 (Feb. 1998), 58–65.

[57] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.
