
Deep Product Comparison on the Semantic Web | Alex Stolz

Dissertation

Deep Product Comparison on the Semantic Web

Alex Stolz

First reviewer: Prof. Dr. Martin Hepp
Second reviewer: Prof. Dr. Michael Koch

Fakultät für Wirtschafts- und Organisationswissenschaften

Deep Product Comparison on the Semantic Web

Alex Stolz

Complete reprint of the dissertation approved by the Fakultät für Wirtschafts- und Organisationswissenschaften of the Universität der Bundeswehr München for the award of the academic degree of

Doctor of Economics and Social Sciences (Dr. rer. pol.).

Reviewers:

1. Univ.-Prof. Dr. Martin Hepp

2. Univ.-Prof. Dr. Michael Koch

The dissertation was submitted to the Universität der Bundeswehr München on 3 March 2016 and accepted by the Fakultät für Wirtschafts- und Organisationswissenschaften on 28 June 2016. The oral examination took place on 19 July 2016.

Abstract

Search plays a major role in information systems of today. It facilitates the finding of information on our desktop computers and mobile devices, in enterprise intranets, or on the Web. Yet, as the volume of data grows, it becomes increasingly difficult to get the required information. Problems arise in particular with regard to search efficiency (“Can the information be procured at low cost?”) and search effectiveness (“Are the returned results satisfying?”). An important use case in this context is the discovery of products on the Web. Product search is challenging for several reasons: (1) the amount of product-related documents has increased over time; (2) the data contained in those documents is mostly unstructured and heterogeneous; (3) products are multi-dimensional objects; and (4) users often have complex information needs. On that account, the quality and granularity of data are critical requirements for product search algorithms on the Web.

This thesis contributes a search framework for product offers on the Semantic Web, also known as the Web of Data. Structured data on the Web has grown rapidly over the last five years. The key drivers have been Linked Data sources and Web pages with embedded Microdata or RDFa markup. Structured data can mitigate many of the limitations of traditional Web searches for products. For instance, global resource identifiers on the Web ease product data integration. This way, it is possible to augment product data with fine-grained and high-quality product descriptions. In our work, this authoritative data is supplied via manufacturer datasheets and product classification systems. These granular product descriptions enable deep product comparison over several product dimensions. A crucial component of our solution is the implementation of a faceted search interface. Faceted search is a proper way to deal with the iterative and incremental nature of search. It engages users in the search process, letting them continually learn about the option space in an exploratory fashion. As an important innovation, our approach is data- or instance-driven, i.e., the availability of data determines the options presented to the user. This is in stark contrast to traditional search interfaces that typically rely on a system-wide, domain-specific, rigid conceptual structure. Our design choice eases searching within the often sparse graph of product information on the Web of Linked Data. Furthermore, it extends the feasibility of our approach to other application areas outside the narrow scope of e-commerce.


Summary (Zusammenfassung)

Search is one of the central functions of modern information systems. Search systems can be found on desktop computers and mobile devices, but in particular also in corporate networks and on the World Wide Web. For the latter systems above all, the growing volume of data poses an ever greater challenge. Difficulties arise with regard to both search efficiency (“Can the information be procured at low cost?”) and search effectiveness (“Are the returned results satisfying?”). The importance of information search, and its growing problems, become especially apparent in product search on the Web. Several reasons can be given for this: (1) On the one hand, the number of Web pages with product descriptions has risen considerably over the last years. (2) On the other hand, the contents of such Web documents are mostly unstructured and heterogeneous. In addition, (3) the multi-dimensional character of products and (4) the complex information needs of users make conventional product search more difficult. For this reason, high quality and granularity of data are indispensable prerequisites for search algorithms on the Web.

This dissertation presents a framework for product search on the Semantic Web, or Web of Data. The Semantic Web already provides a comprehensive amount of structured offer data. Over the last five years, its rapid growth has been driven in particular by the idea of Linked Open Data and has been marked by a massive spread of data in the syntactic variants Microdata and RDFa. Product search over the Semantic Web can mitigate numerous limitations of traditional Web search. First, structured data and globally valid identifiers on the Web simplify data integration. This makes it possible to augment product data with granular and high-quality product descriptions. In this work, such additional data is supplied via manufacturer catalogs and product classification standards. It enables a detailed product comparison that takes several product dimensions into account in parallel. Second, a user interface for filter-based product search is developed. The faceted search used here describes an iterative and incremental search paradigm that actively involves users. In this way, users can continually explore the option space in an exploratory fashion. An important innovation of the contribution presented here is a data- or instance-driven search interface, in which the options for further navigation are generated on the basis of the data currently available. This idea stands in contrast to the usual user interfaces for product search, which as a rule follow a system-wide, domain-specific, and rigidly predefined structure. This design decision also eases search in the so far often rather sparse graphs of product data on the Web of Linked Data. Finally, the approach can readily be transferred to further application areas outside e-commerce.

Acknowledgments

This thesis is primarily a result of my own enduring commitment to it. However, the entire PhD project also involved substantial effort, resources, and sacrifices from additional individuals, whom I herewith take the opportunity to thank.

Above all, I am greatly indebted to my PhD supervisor, Univ.-Prof. Dr. Martin Hepp, for his strong support and encouragement throughout the work on my thesis. I have been extremely lucky to meet him, especially since he gave me the opportunity to work in a highly stimulating and inspiring research environment.

Furthermore, I owe thanks to my friends and colleagues who have accompanied me over the years, in particular Dr. Christian Fürber, Dr. Mouzhi Ge, Tobias Ostheim, Andreas Radinger, Dr. Bene Rodriguez-Castro, Dr. Uwe Stoll, László Török, and Francesca Zarl (sorted in alphabetical order by their last names).

I dedicate this thesis to my beloved family – my wife Barbara with our son Luis, my parents Helga and Alois, my two sisters Beate and Verena, and my brother Dieter. Without your relentless patience, backup, and appreciation it would have never been possible to bring to completion such a complex and longstanding thesis project.

Many thanks to all of you!!

Alex Stolz
Neubiberg, February 2017


Contents

List of Tables
List of Figures
List of Listings
List of Abbreviations

1 Introduction
1.1 State of the Art
1.2 Problem Statement
1.3 Motivation
1.4 Research Questions
1.5 Research Method and Contributions
1.6 Publications
1.7 Thesis Outline

2 Background and Related Work
2.1 Relevant Economic Concepts
2.2 E-Business and E-Commerce
2.3 Semantic Web and Linked Data
2.4 Semantic Data Interoperability
2.5 Product Search

3 Data Collection
3.1 Problem Statement
3.2 State of the Art and Related Work
3.3 Sweet-Spot Deep Crawling Approach
3.4 Evaluation and Analysis
3.5 Conclusion

4 Product Model Master Data
4.1 Problem Statement
4.2 State of the Art and Related Work
4.3 Product Model Master Data for the Semantic Web
4.4 Evaluation
4.5 Conclusion

5 Product Type Information
5.1 Problem Statement
5.2 State of the Art and Related Work
5.3 Deriving Product Ontologies from Knowledge Organization Systems
5.4 Evaluation
5.5 Discussion
5.6 Conclusion

6 Cleansing and Enrichment
6.1 Problem Statement
6.2 Typology of Obstacles
6.3 Techniques
6.4 Evaluation
6.5 Implementation of a Data Management Web User Interface
6.6 Conclusion

7 Faceted Product Search on the Semantic Web
7.1 Problem Statement
7.2 State of the Art and Related Work
7.3 Adaptive Faceted Search Interface for Product Offers
7.4 Evaluation
7.5 Conclusion

8 Discussion and Conclusion
8.1 Summary
8.2 Contributions and Findings
8.3 Impact
8.4 Critical Review and Limitations
8.5 Future Work
8.6 Conclusion

A User Survey
A.1 System Usability Scale
A.2 Instructions
A.3 Results

B Index of DVD Contents

C Online Tools and Web Resources

Bibliography

List of Tables

2.1 Transaction activities [cf. PRW08, p. 42]
2.2 Categories of e-commerce transactions by entities involved [cited from Gri03]
2.3 Characterization of MDM, PDM, PLM, PIM, and ERP
2.4 Date and time formats as defined by ISO 8601
2.5 Selected snippet related to “kilogram” from the UN/CEFACT Common Code [Uni09a]
2.6 Characteristics of product identifier types
2.7 High-level comparison of product categorization standards
2.8 (X)HTML attributes defined for RDFa [based on Adi+13, Section 5]
2.9 Most important contributors to the definition of the term ontology

3.1 Structured data in the Web Data Commons [based on WebNDa; WebNDd; WebNDc]
3.2 Instance count and average number of properties in crawl dataset
3.3 Comparison of entity frequency in WDC and in GRC
3.4 Comparison of the amount of RDF triples in shops for WDC and GRC, sorted in descending order by the number of triples in GRC
3.5 Comparison of the amount of RDF triples in shops for WDC and GRC, filtered by domains for which WDC contains more triples than GRC

4.1 Comparison of product features between manufacturers and retailers
4.2 Mapping of product details from BMEcat to GoodRelations
4.3 Mapping of product features from BMEcat to GoodRelations
4.4 Mapping of a catalog group system in BMEcat to a rdfs:subClassOf hierarchy
4.5 Validation of BMEcat conversions
4.6 Product features in BSH BMEcat versus data from retailers publishing GoodRelations markup
4.7 Product searches for a digital camera model on popular e-marketplaces

5.1 Statistics of product classification standards and category systems

6.1 Obstacles with respective solutions
6.2 Statistics of entities in the crawl corpus
6.3 Data quality problems in the crawl corpus

7.1 Variety of properties and values in an automotive dataset
7.2 Results of SUS experiments
7.3 Web shops with number of matching items

List of Figures

1.1 Google rich snippet
1.2 Multi-parametric view of a powered hedge trimmer
1.3 Deep product comparison on the Web
1.4 Contributions of this thesis

2.1 Chapter outline
2.2 B2B and B2C e-commerce [adapted from Cha09a, p. 65]
2.3 Models of master data exchange: (a) Bilateral versus (b) multilateral [from SLÖ08]
2.4 Structural complexity increase of knowledge organization systems [adapted from Nat05, p. 17]
2.5 Semantic Web layer cake [adapted from DFH11, p. 20]
2.6 Evolution of the LOD cloud diagram
2.7 Relationship between URI, URL, and URN [based on BFM05]
2.8 RDF triple represented as a graph
2.9 Example as an RDF graph
2.10 RDF graph that corresponds to the Turtle example
2.11 RDFS language additions
2.12 Google rich snippet
2.13 Agent-Promise-Object principle [based on Hep15b]
2.14 Side-by-side comparison of RDF graph and SPARQL graph
2.15 Taxonomy of schema matching approaches [from RB01]
2.16 Ontology matching process [from ES07, p. 44]
2.17 Classification of matching techniques for ontology matching [from ES07, p. 65]
2.18 Side-by-side comparison of precision and recall
2.19 Faceted search interface
2.20 Processes of an NLP system [from Bat95]
2.21 Match classes [based on Col+06]

3.1 Distribution of syntaxes in the Web Data Commons
3.2 Average number of entities per domain in Common Crawl corpora (log-scaled y-axis)
3.3 Share of structured product offer data with respect to domains with RDFa (for gr:Offering) and Microdata (for s:Offer) in the Web Data Commons from 2012–2014
3.4 Share of structured product offer data with respect to all structured data markup in the Web Data Commons from 2012–2014
3.5 Flowchart of the crawling algorithm
3.6 Distribution of items per shop (log-scaled y-axis)
3.7 Boxplot of the distribution of items per shop
3.8 Ten most represented shops by offer count
3.9 Frequency of offer properties in crawl (upper 90% – 20 out of 55)
3.10 Frequency of flat offer properties in crawl (upper 90% – 24 out of 46)
3.11 Frequency of product properties in crawl (upper 90% – 11 out of 43)
3.12 Frequency of product model properties in crawl (upper 90% – 17 out of 24)
3.13 Comparison of the E/D-ratios for WDC and GRC

4.1 Enriching shop pages with product master data from manufacturers based on “strong identifiers” [from Hep12a]
4.2 Retailer and manufacturer data in GoodRelations
4.3 BMEcat 2005 skeleton [based on SLK05a]
4.4 Boxplot of the product offer count (with EANs) across Web shops in the crawl
4.5 Boxplots of the distribution of shop offers per EAN
4.6 Frequency distribution of EANs with respect to the number of product offers for a particular EAN in the dataset

5.1 Conceptual dynamics of the eCl@ss product categorization standard [based on ECl14]
5.2 Conceptual architecture of PCS2OWL
5.3 GenTax applied to a subset of the Google product taxonomy [cf. HdB07]
5.4 Reverse-engineering of the Google product taxonomy
5.5 Examples of valid and invalid subsumption relations from the GPC hierarchy when interpreted as product classes

6.1 Property hierarchy of quantitative values in GoodRelations [based on Hep11]
6.2 Base and derived units in QUDT linked via a common type qudt:LengthUnit [based on NAS10]
6.3 Data management Web user interface
6.4 Axiom replacement mechanism

7.1 Mock-up of a faceted search interface for e-commerce
7.2 Screenshot of our faceted product search prototype
7.3 Incremental search cycle among multiple search paradigms
7.4 Screenshot of a product details modal window with instance-based search filtering
7.5 Change of option space with 100 random walk iterations over a decision tree for 875 automobile offers
7.6 Screenshot of the search interface with real data from a household crawl

List of Listings

2.1 Example in RDF/XML
2.2 Example in N-Triples
2.3 Example in Turtle/N3
2.4 Example in RDFa
2.5 Example in JSON-LD
2.6 Example in Microformats
2.7 Example in Microdata
2.8 Schema.org in Microdata
2.9 Example query in SPARQL

4.1 Example of product details in Turtle/N3
4.2 Example of product features in Turtle/N3
4.3 Example of catalog group information in Turtle/N3

5.1 Calculating the number of hierarchy levels of product classification systems
5.2 Annotation example in Microdata syntax

6.1 Categorizing products with textual properties
6.2 SPARQL SELECT query to retrieve products with prices in “Euros”
6.3 Linking two entities with the gr:includesObject modeling pattern
6.4 Linking two entities with the gr:includes modeling shortcut
6.5 Modeling of intervals in GoodRelations
6.6 Price modeling patterns in schema.org
6.7 Namespace declarations used for Turtle/N3 and SPARQL examples
6.8 Adding RDF datatypes to plain literals
6.9 OWL definition of the gr:hasValueInteger property
6.10 SPARQL CONSTRUCT query to recover the correct datatype from schema information
6.11 Assigning correct RDF datatypes to literals with incorrect datatypes
6.12 Converting the invalid data value “5,0” to “5.0”
6.13 SPARQL CONSTRUCT query to convert invalid numerical values
6.14 Redundant product models with the same EAN
6.15 SPARQL CONSTRUCT query for product models based on arbitrary product identifiers
6.16 owl:sameAs links between redundant product model entities
6.17 Redundant product models based on the combination of manufacturer name and MPN
6.18 SPARQL CONSTRUCT query for consolidating redundant product models based on identical pairs of brand names and MPNs
6.19 Product offering definition in schema.org and GoodRelations
6.20 SPARQL CONSTRUCT query to convert a product offer in schema.org to the respective offer in GoodRelations
6.21 Product model definition in schema.org (actually schema.rdfs.org) and GoodRelations
6.22 Axiom to translate among two product model classes and product model instances
6.23 SPARQL SELECT query and triples returned by selecting all GoodRelations product models
6.24 Two equivalent modeling patterns for prices in schema.org
6.25 SPARQL CONSTRUCT query to translate between two equivalent modeling patterns within the same schema
6.26 Modeling shortcut and expanded version for attaching a product to an offer
6.27 SPARQL CONSTRUCT query to expand a shortcut pattern for products to its canonical long variant
6.28 Quantitative values as point values and intervals
6.29 SPARQL CONSTRUCT query to convert point values to intervals
6.30 Product model information based on matching EANs
6.31 SPARQL CONSTRUCT query to establish a link between products and product models with matching EANs
6.32 Product with and without product features from the product model
6.33 SPARQL CONSTRUCT query for the inheritance of product features from the product model
6.34 Product variant with and without product features from a related product model
6.35 SPARQL CONSTRUCT query for the inheritance of product features from product variants
6.36 Products where one (a toner cartridge) is a consumable for another (a printer)
6.37 SPARQL CONSTRUCT query to add gr:isConsumableFor link based on the MPN of one product contained in the product name of the other product
6.38 Two examples of where some unit codes (code of the unit of measurement and the currency code) are missing for quantitative values
6.39 SPARQL CONSTRUCT query to assign the default currency value wherever missing
6.40 SPARQL CONSTRUCT query to recover missing unit codes in quantitative values
6.41 Comparison of non-granular and granular quantitative value descriptions
6.42 SPARQL CONSTRUCT query that applies a heuristic to extract a value and a unit code from a free-text field
6.43 Intervals modeled in text as compared to individual intervals
6.44 SPARQL CONSTRUCT query to convert intervals in text (decimals or integers) to intervals modeled using appropriate properties
6.45 Base and derived units in QUDT represented in N3 syntax [adapted from Mas+11]
6.46 Unit conversion of quantitative values in GoodRelations
6.47 Provision of additional axioms that are not covered by QUDT
6.48 Example of a populated currency exchange rate instance
6.49 SPARQL CONSTRUCT rule for currency conversion with SPARQL
6.50 SPARQL SELECT with owl:deprecated
6.51 Interchangeable execution of two cleansing rules

List of Abbreviations

AI Artificial IntelligenceAP Application ProtocolAPI Application Programming InterfaceASCII American Standard Code for Information InterchangeASIN Amazon Standard Identification NumberAtom Atom Syndication FormatB2B Business-to-BusinessB2C Business-to-ConsumerB2G Business-to-GovernmentBIM Binary Independence ModelBME Bundesverband Materialwirtschaft, Einkauf und Logistik (Engl.: Federal

Association of Materials Management, Purchasing and Logistics)BPREF Binary PreferenceC2B Consumer-to-BusinessC2C Consumer-to-ConsumerC2G Consumer-to-GovernmentCAD Computer-aided DesignCAE Computer-aided EngineeringCAM Computer-aided ManufacturingCET Central European TimeCMS Content Management SystemCOINS COmmon INterest SeekerCPA Classification of Products by ActivityCPC Central Product ClassificationCPU Central Processing UnitCPV Common Procurement VocabularyCSV Comma-separated ValuesCTR Click-through RateCURIE Compact URICWA Closed-World AssumptioncXML Commerce XML

xix

List of Abbreviations xx

DAML DARPA Agent Markup LanguageDAML-ONT DAML Ontology LanguageDBMS Database Management SystemEAN European Article NumberebXML Electronic Business using XMLEDI Electronic Data InterchangeEDM Engineering Data ManagementELMAR Electronic Market Data FeedeOTD ECCMA Open Technical DictionaryEPC Electronic Product CodeERP Enterprise Resource PlanningETIM ElektroTechnisches InformationsModell (Engl.: Electro-Technical Infor-

mation Model)F-Logic Frame LogicFOAF Friend of a FriendFTP File Transfer ProtocolG2B Government-to-BusinessG2C Government-to-ConsumerG2G Government-to-GovernmentGATE General Architecture for Text EngineeringGDSN Global Data Synchronization NetworkGPC Global Product ClassificationGRAPPA Generic Request Architecture for Passive Provider AgentsGTIN Global Trade Item NumberHCI Human-Computer InteractionHTML Hypertext Markup LanguageHTTP Hypertext Transfer ProtocolID3 Iterative Dichotomizer 3IMEI International Mobile Station Equipment IdentityIoT Internet of ThingsIP Internet ProtocolIR Information RetrievalIRI Internationalized Resource IdentifierISBN International Standard Book NumberISO International Organization for StandardizationJSON JavaScript Object NotationJSON-LD JSON for Linked Data

List of Abbreviations xxi

KOS Knowledge Organization SystemLARKS Language for Advertisement and Request for Knowledge SharingLDF Linked Data FragmentLOD Linked Open DataLSI Latent Semantic IndexingMAP Mean Average PrecisionMDM Master Data ManagementMPN Manufacturer Part NumberN3 Notation 3NER Named Entity RecognitionNLP Natural Language ProcessingNLTK Natural Language ToolkitOAGIS Open Applications Group Integration SpecificationOCML Operational Conceptual Modeling LanguageOEM Original Equipment ManufacturerOGP Open Graph ProtocolOIL Ontology Inference LayerOM Ontology for Units of Measure and Related ConceptsOPDM Ontology-based Product Data ManagementOSI Open Systems InterconnectionOWA Open-World AssumptionOWL Web Ontology LanguagePDM Product Data ManagementPHP Hypertext PreprocessorPIM Product Information ManagementPLM Product Lifecycle ManagementPOS Part-of-SpeechPRICAT Price Catalog MessagePRODAT Product Data MessagePTO Product Types OntologyPZN Pharmazentralnummer (Engl.: Central Pharmaceutical Number)QA Question AnsweringQUDT Quantities, Units, Dimensions and TypesRDF Resource Description FrameworkRDFa Resource Description Framework in AttributesRDFS RDF SchemaRDQL RDF Data Query Language


REST Representational State Transfer
RFC Request for Comments
RFID Radio-Frequency Identification
RIF Rule Interchange Format
RNTD RosettaNet Technical Dictionary
RQL RDF Query Language
RSS Really Simple Syndication, sometimes Rich Site Summary or RDF Site Summary
RuleML Rule Markup Language
SaaS Software as a Service
SCM Supply Chain Management
SEO Search Engine Optimization
SERP Search Engine Results Page
SeRQL Sesame RDF Query Language
SHADE SHared DEpendency Engineering
SHOE Simple HTML Ontology Extension
SI International System of Units
SKOS Simple Knowledge Organization System
SKU Stock Keeping Unit
SPARQL SPARQL Protocol and RDF Query Language
SPIN SPARQL Inferencing Notation
SQL Structured Query Language
STEP Standard for the Exchange of Product Model Data
SUS System Usability Scale
SVG Scalable Vector Graphics
SWRL Semantic Web Rule Language
TCP Transmission Control Protocol
TED Tenders Electronic Daily
TF-IDF Term Frequency–Inverse Document Frequency
TREC Text REtrieval Conference
Turtle Terse RDF Triple Language
UBL Universal Business Language
UCUM Unified Code for Units of Measure
UDF User-defined Function
UNA Unique-Names Assumption
UNSPSC United Nations Standard Products and Services Code
UPC Universal Product Code


URI Uniform Resource Identifier
URL Uniform Resource Locator
URN Uniform Resource Name
VIN Vehicle Identification Number
VSO Vehicle Sales Ontology
W3C World Wide Web Consortium
WDC Web Data Commons
WSD Word Sense Disambiguation
WWW World Wide Web
WZ Klassifikation der Wirtschaftszweige (Engl.: German Classification of Economic Activities)
xCBL XML Common Business Library
XHTML Extensible Hypertext Markup Language
XML Extensible Markup Language
XOL XOL Ontology Exchange Language
XRO Exchange Rate Ontology

1 Introduction

1.1 State of the Art
1.2 Problem Statement
1.2.1 Large Quantity of Unstructured and Heterogeneous Product Data on the Web
1.2.2 Complex Information Needs
1.2.3 Insufficient Support for User Interaction
1.2.4 Research Problem and Research Hypothesis
1.3 Motivation
1.3.1 Implications of Constrained Web Searches
1.3.2 Economic Relevance
1.4 Research Questions
1.5 Research Method and Contributions
1.6 Publications
1.7 Thesis Outline

The World Wide Web (WWW) has undergone a remarkable evolution over the last two decades, the second half of which was characterized by new ways of humans interacting with Web applications, commonly known as the Web 2.0 [ORe05; ORe07]. Its novel practices and principles also accelerated the dynamics of content publishing. Static Web pages have gradually been replaced by dynamic Web content based on information from databases and other Web services (Web application programming interfaces (APIs)). In concrete terms, Web users set up blogs (also “Weblogs”) [e.g. Nar+04; Her+04; Sch+04] and Wikis [e.g. LC01] almost effortlessly in order to discuss topics of interest and to foster collaboration. As a result, a lot of useful content has since been generated, which culminated in the considerable amount of Web documents that we are facing today. The number of documents has quickly grown beyond what humans can process, especially if there is no mature search or recommendation engine in place that is able to cope with the large quantity, diversity, and granularity of data.

This problem can be well illustrated for e-commerce. There is an ongoing trend towards purchasing goods online, which caused numerous e-marketplaces to emerge over time.


Consequently, more and more sellers have been pushing their products for sale into electronic marketplaces, and ultimately, onto the Web. From 2005 to 2014, for example, retail e-commerce sales in the USA grew almost three-fold relative to total retail sales, totaling 7.7 percent (96 billion U.S. dollars) of the entire U.S. retail sales market in the fourth quarter of 2014 [Uni14]. From 2013 to 2014 alone, they grew by 14.7% [Uni14]. By comparison, total retail sales increased by only 3.8% over the same time period [Uni14]. In Germany, the overall e-commerce turnover for 2015 is estimated at 43 billion Euros [Han14] (roughly 49 billion U.S. dollars at an exchange rate of 1.1346¹ (EUR/USD)), which is three times as much as it was only ten years ago [Han14]. This increase in the value of the product volume traded online has an immediate impact on the efficiency and effectiveness of product searches and product recommendations on the Web, because the higher the degree of choice, the more prevalent the problem of product search and discovery becomes. In other words, a growing share of e-commerce increases the problem of consumer choice and all associated information processing on the Web.

In fact, it is currently very difficult and time-consuming to search for products and services on the Web. For example, consider the following information needs:

• Who can I employ to do the brickwork for my house?

• What is the best hotel for my weekend trip to Paris?

• What restaurant shall I choose for having dinner tonight?

• Which book should I read to learn Python programming?

• Which digital camera shall I buy?

Although traditional Web searches would return results to some of these questions, it often requires substantial human effort to satisfy the user’s actual information needs. Product search on the Web is complicated, because the necessary information is often spread over many different sites and pages and because the final choice is based on multidimensional trade-off decisions [cf. BLP98], e.g. between sometimes conflicting features. There are also learning effects in the search process, and there is an additional trade-off decision between the costs of investing more search effort and the expected gain in the form of a better final choice [cf. Sti61; BSV12].

¹ Exchange rate as of February 25, 2015. Available online at http://www.currency2currency.org/EUR/USD/20150225 (accessed on February 25, 2015)


1.1 State of the Art

Search systems nowadays play a major role in helping to find information on our desktop computers and mobile devices, in enterprise intranets, or on the Web; for an overview of the history of information retrieval, see e.g. [SC12]. While it is relatively straightforward to develop custom search systems for well-controlled collections [BP98], e.g. document searches on personal computers with rigid data and file structures and moderate amounts of data, the situation is quite different for more heterogeneous systems like the open Web. On the Web, virtually everyone is free to publish content from anywhere and in whatever language or data format is preferred [cf. BP98].

Classical search can only provide limited capabilities for fulfilling the complex information needs of today’s Internet users. Traditional, document-based Web search has its origins in information retrieval (IR) [MRS09] research and thus operates over a full-text index of documents found on the Web. Its core task is to find a number of relevant resources among a collection of documents, regardless of whether their contents are structured or not; combining information from multiple documents remains an open research challenge. Traditional research determines the relevance of documents relative to a user’s keyword query based on the frequency of the search terms appearing in the document and in the document collection, e.g. using a cosine similarity² score based on term frequency–inverse document frequency (TF-IDF)³. In addition, other algorithms may be applied to complement IR metrics, but they often vary from search engine to search engine. PageRank [Pag+98], a prominent ranking measure, for example considers the link structure of documents. The algorithm computes the importance of a document from a statistical analysis of its incoming and outgoing links, and this value is then propagated further to the adjacent nodes in the link graph. Accordingly, a Web page is generally the more relevant the more popular Web pages link to it. Modern search solutions like Google use a combination of very sophisticated approaches for keyword-based search; for an overview, see e.g. [BP98; Bif+05; Eva07; Su+14; Goo15c].
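The TF-IDF-based cosine scoring described above can be illustrated with a minimal sketch. It assumes a raw term count as term frequency and log(N/df) as inverse document frequency, which is only one of several common weighting variants; the three toy documents and the function names are hypothetical and not taken from any particular search engine.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the document frequencies."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors, df


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query; best match first."""
    doc_vectors, df = tf_idf_vectors(docs)
    n_docs = len(docs)
    q_tf = Counter(query)
    # Query terms that occur in no document carry no weight.
    q_vec = {t: q_tf[t] * math.log(n_docs / df[t]) for t in q_tf if t in df}
    scores = [(cosine(q_vec, d), i) for i, d in enumerate(doc_vectors)]
    return sorted(scores, reverse=True)


# Hypothetical toy collection.
docs = [
    "cordless hedge trimmer with long blade".split(),
    "digital camera with zoom lens".split(),
    "hedge trimmer replacement blade".split(),
]
ranking = rank("hedge trimmer blade".split(), docs)
```

In this toy collection the shortest document that contains all query terms ranks first, and the camera page, which shares no term with the query, receives a score of zero. The sketch only matches literal terms, which is exactly the limitation of keyword-based search discussed in this chapter.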

Since the early 2000s, structured data⁴ has found its way into the document-based Web in the form of the Semantic Web. The Semantic Web is the name of an extension of the current Web that assigns well-defined meaning to information, allowing computers and individuals to cooperate better, and in particular allowing computers to support humans in the task of combining and interpreting information on the Web [BHL01]. The Semantic Web relies on the Resource Description Framework (RDF) data model that was proposed as a W3C Recommendation in 2004 [MM04], and on Uniform Resource Identifiers (URIs) for uniquely identifying resources [BFM05]. The first applications that were built on top of RDF encoded meaning in an XML-based RDF syntax called RDF/XML [GS14]. Later on, the dissemination of Semantic Web content on the Web profited from the emergence of data formats like the Resource Description Framework in Attributes (RDFa) and Microdata. These data formats make it possible to embed structured data in traditional Web content in HTML, which caused the number of Web pages exposing structured data to increase significantly, as surveyed in [MP12; Biz+13; MPB14]. In the domain of e-commerce, a large body of online product data is already expressed this way using the GoodRelations [Hep08a; Hep12b] (mainly in RDFa syntax) and schema.org [SchND] (mainly in Microdata syntax) vocabularies [cf. MPB14]. Such structured information can help to improve the search experience and the accuracy of product searches.

² The angle between two term vectors q and d determines their similarity. q is the query term vector, whereas d denotes the document term vector.

³ Product of term frequency and inverse document frequency, thus adding relevance weights to the term vectors.

⁴ From here on, we use the term structured data to refer to data that is based on an explicit, formally defined data model.

Market-leading search engines have already started to make sense of this structured data. One effect that becomes immediately apparent is that they reward product pages that feature structured data markup by decorating search results [cf. Mik08; Haa+11] and thus highlighting them prominently on the search engine results pages (SERPs), a feature commonly referred to as rich snippets [GGH09] (Google) or rich captions [MicND] (Bing). A search result as displayed on Google, annotated with product ratings, price details, and stock availability, is illustrated in Figure 1.1. In addition, search engines benefit from structured product descriptions by obtaining relevance signals from the shop pages, which they might use to better assess the relevance of a page for a particular query. Moreover, the structured markup can be combined with knowledge that the search engines have gathered over time to accomplish novel and more useful forms of SERPs. Google Inc., for example, announced the Knowledge Graph⁵ in 2012, a knowledge base that is currently used to augment traditional search results with supplementary info boxes presenting summaries for certain entities in response to user queries⁶. While this knowledge base is up to now largely backed by third-party data sources (e.g. Freebase⁷, Wikipedia⁸, or the CIA World Factbook⁹) [Sin12], search engines obtain a considerable part of their valuable information from their large body of crawled documents from the Web, where they infer knowledge by utilizing advanced techniques such as natural language processing (NLP) or machine learning [cf. Don+14]. With the increased availability of structured data, building up such knowledge bases is simplified considerably, which promises to be very useful for the dynamic domain of products and services.

Figure 1.1: Google rich snippet

⁵ https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html (accessed on February 22, 2016)

⁶ http://www.business2community.com/online-marketing/strings-things-quick-primer-semantic-search-0621611 (accessed on July 22, 2014)

⁷ http://www.freebase.com/ (accessed on May 12, 2014)

⁸ http://www.wikipedia.org/ (accessed on July 22, 2014)

⁹ https://www.cia.gov/library/publications/the-world-factbook/ (accessed on July 22, 2014)

1.2 Problem Statement

With respect to state-of-the-art solutions, there are three main aspects that complicate product searches on the traditional, document-based Web, namely

1. the growing amount of product data published online, which is distributed, weakly structured, and heterogeneous,

2. complex information needs of Web users that are constrained by keyword-based search user interfaces, and

3. insufficient support for user interaction and limited opportunities for the user to learn about the option space.

In the following, we discuss the characteristics of these three problems in more detail. Thereupon, we identify the research problem and define the research hypothesis for this thesis.

1.2.1 Large Quantity of Unstructured and Heterogeneous Product Data on the Web

The essential characteristics of content published on the Web that prevent realizing deep product comparison, search, and discovery are

• the vast amount of product data,

• the data being mostly raw and unstructured,

• distributed content, and


• heterogeneous descriptions and representations.

The publication of more and more product data on the Web leads to a greater variety of searchable products. At the same time, it makes finding and comparing products on the Web increasingly challenging. For instance, the introduction of a new cell phone model by a manufacturer implies the publication of several corresponding offers by numerous online vendors with varying descriptions. This seems desirable at first glance, because goods supplied by many vendors give customers more choice and increase the chance of a better match; besides that, prices are lowered as a result of competition. However, the addition of product data increases the search space (or option space, information space), which makes it more difficult and expensive to find product offers relevant to a particular query. The increasing specificity [cf. PRW08, p. 43] and variety of these products further cause the option space to grow. While years ago there existed only a small number of cell phones with very similar specifications, the present market for electronics offers a wealth of mobile devices with ample product configurations, ranging from classical mobile phones (i.e. feature phones) through smartphones and “phablets”¹⁰ to tablet computers.

Raw and unstructured data further inhibits product search and comparability on the Web, especially for very specific products whose quality is not easy for potential buyers to grasp. A considerable part of the product data on the Web is maintained in relational databases, where it is typically stored in a well-structured form. However, once put online, much of the original information gets lost, because Web markup languages are not able to preserve the data structure [e.g. Haa+11]. Markup languages like the Hypertext Markup Language (HTML) [Hic+14] are mainly designed for presenting information to humans via their Web browsers, and machines are not able to extract information easily from this semi-structured Web content.

Product data on the Web resides in distributed data silos that are largely disconnected [cf. BHB09]. Since the Web is a distributed system, content can be created from anywhere by various people or systems, which makes it very difficult to consolidate. Moreover, the linkage among these disparate Web sites is generally weak, because Web links between documents carry no precise meaning about the types of relationships that hold between resources [BHB09].

Another problem of the Web is the high degree of heterogeneity, which is caused by the distributed nature of the WWW and the lack of consensus between different publishers of product data. An important example of heterogeneity in natural languages is the existence of homonyms, i.e. terms with different meanings that are spelled the very same way, and synonyms, i.e. distinct terms with identical or similar meaning [cf. NO95]. Homonyms and synonyms are very frequent in natural languages [cf. RB01], which complicates the reliable distinction of entities for IR-based search algorithms. Entity recognition becomes particularly challenging if search engines are unaware of the context (environmental variables), i.e. they are missing important contextual information either about the data item itself (the topic the data is about), or about the user intention (the topic the information need is about). Furthermore, heterogeneity is often attributable to different languages and standards for products. A few examples in the field of e-commerce are product descriptions in English versus German, units of measurement in inches versus centimeters, price specifications given in U.S. dollars versus Euros, or the use of different classification systems for the organization of products.

¹⁰ A class of mobile devices positioned in between mobile phones and tablet computers.

In business-to-business (B2B) scenarios, it is very common to categorize products according to product categorization standards (e.g. eCl@ss¹¹ or United Nations Standard Products and Services Code (UNSPSC)¹²) and proprietary catalog group systems, whereas in business-to-consumer (B2C) situations custom product category systems and product taxonomies are prevalent. On the Web, for example, there exist category systems like the Google product taxonomy¹³ and custom taxonomies to better organize Web shop items. The problem with classifications is that they are often designed for specific purposes and thus apply to different contexts [Hep06], and categories may be arranged in distinct structures and expressed using different terminology [cf. RB01]. Finding automatic ways to harmonize such schemas is a difficult endeavor tackled by the schema matching [RB01] and ontology matching [SE13] research communities.

1.2.2 Complex Information Needs

People who consider purchasing products online typically have varying information needs, often of a complex nature. Sometimes, they are already familiar with the characteristics of the products they are looking for (known-item seek [MR06, pp. 33–35]). More often, however, they only have vague knowledge about the option space, which gives rise to the need for exploratory or research searches [MR06, pp. 33–35].

¹¹ http://www.eclass.de/ (accessed on May 16, 2014)

¹² http://www.unspsc.org/ (accessed on May 16, 2014)

¹³ http://www.google.com/basepages/producttype/taxonomy.en-US.txt (accessed on July 22, 2014)

Since products are multi-dimensional objects (see Figure 1.2), current approaches that operate on unstructured, uni-dimensional data fail to support multi-parametric searches. Keyword searches, for example, are often inappropriate with regard to users’ complex and varying information needs. To illustrate their main shortcoming, imagine the following multi-parametric keyword query for a specific product (in this case, a powered hedge trimmer for gardening):

“Hedge trimmer, powered, light-weight, blade length of about 20 inches, at a price lower than 200 U.S. dollars, sorted by cheapest first.”

Figure 1.2: Multi-parametric view of a powered hedge trimmer (dimensions shown: Blade Length, Length, Weight, No Load Speed, Displacement, Engine Power, Price, Warranty Duration)

The limited capabilities of keyword-based search engines only allow for searching documents that contain exactly the same terms as specified in the query (i.e. compute the cosine similarity of term vectors), potentially expanding the query by synonyms or past search results that turned out relevant based on user clicks. Implementing highly preferable features like currency conversions (U.S. dollars to Euros) and unit conversions (inches to centimeters), or correctly interpreting fuzzy and ambiguous terms like light-weight, at about, or sort by cheapest first, are non-trivial tasks. Notwithstanding the fact that search engines are continually making progress at better understanding user inputs in the form of keyword queries, there is still room for improvement, especially with respect to product search and discovery over structured data.
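To make the contrast concrete, the following sketch evaluates the hedge trimmer query over structured offer records instead of keywords. Everything in it is an illustrative assumption rather than the approach developed in this thesis: the five offers are invented, "about 20 inches" is read as a tolerance of 10%, "light-weight" as at most 3.5 kg, and the EUR/USD rate is the one quoted in the introduction of this chapter.

```python
CM_PER_INCH = 2.54
EUR_TO_USD = 1.1346  # exchange rate quoted earlier in this chapter (February 25, 2015)

# Hypothetical offers with mixed units and currencies.
offers = [
    {"name": "Trimmer A", "blade_length": (51.0, "cm"), "weight_kg": 3.1, "price": (149.0, "EUR")},
    {"name": "Trimmer B", "blade_length": (20.0, "in"), "weight_kg": 2.8, "price": (189.0, "USD")},
    {"name": "Trimmer C", "blade_length": (24.0, "in"), "weight_kg": 2.5, "price": (120.0, "USD")},
    {"name": "Trimmer D", "blade_length": (50.0, "cm"), "weight_kg": 5.2, "price": (99.0, "EUR")},
    {"name": "Trimmer E", "blade_length": (19.5, "in"), "weight_kg": 3.4, "price": (210.0, "EUR")},
]


def to_inches(value, unit):
    """Normalize a length to inches."""
    if unit == "in":
        return value
    if unit == "cm":
        return value / CM_PER_INCH
    raise ValueError(f"unsupported length unit: {unit}")


def to_usd(amount, currency):
    """Normalize a price to U.S. dollars."""
    if currency == "USD":
        return amount
    if currency == "EUR":
        return amount * EUR_TO_USD
    raise ValueError(f"unsupported currency: {currency}")


def search(offers, blade_in, tolerance, max_price_usd, max_weight_kg):
    """Return the names of matching offers, cheapest first."""
    hits = []
    for offer in offers:
        blade = to_inches(*offer["blade_length"])
        price = to_usd(*offer["price"])
        if (abs(blade - blade_in) <= tolerance * blade_in
                and price < max_price_usd
                and offer["weight_kg"] <= max_weight_kg):
            hits.append((price, offer["name"]))
    return [name for _, name in sorted(hits)]


results = search(offers, blade_in=20.0, tolerance=0.10,
                 max_price_usd=200.0, max_weight_kg=3.5)
```

Once lengths and prices are normalized, the fuzzy parts of the query reduce to explicit thresholds. Choosing those thresholds is the genuinely hard part, and the values used here are arbitrary.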

1.2.3 Insufficient Support for User Interaction

Product search is not a static, one-turn search task that can be translated into a single query, but instead includes a learning effect about the option space, and a relaxation or refinement of constraints and preferences. Current approaches do not provide sufficient support for this kind of user interaction. In general, users are too heavily involved in the process of integrating information from multiple sources (e.g. combining product feature data from one site with reviews from a second and offers from multiple others), and too little able to contribute human intelligence and judgment to the process or to adjust their search based on the outcome of the previous attempt.

1.2.4 Research Problem and Research Hypothesis

Research problem. The traditional Web has limitations regarding deep product comparison, primarily due to the vast amount of unstructured and heterogeneous data, limited capabilities for data integration, and the missing support for advanced Web searches.

Our research strives to overcome these shortcomings by taking advantage of the Semantic Web, Linked Data, and related technologies. The Semantic Web is characterized by structured data with well-defined meaning formalized using ontologies [Gru93; Bor97; SBF98; GOS09]. By means of the GoodRelations ontology and schema.org, a lot of structured e-commerce data has already been made available on the Web. Accordingly, the research hypothesis is defined as follows:

Research hypothesis. The Semantic Web with its underlying data model (RDF), the notion of unique global identifiers for describing entities (URIs), and widely accepted vocabularies for e-commerce (GoodRelations and schema.org), allows for the fine-grained description of products and product offers and facilitates the integration of different information sources. It thus constitutes a suitable infrastructure for product search and discovery on the Web.
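As a rough illustration of what such a fine-grained description looks like, the sketch below models one offer as a set of RDF-style (subject, predicate, object) triples in plain Python. The shop URIs and the offer are hypothetical, the `Literal` wrapper and the `match` and `price_of` helpers are invented for this example, and only the GoodRelations namespace and the few terms used from it are taken from the vocabulary itself. A real application would use an RDF library and SPARQL instead of this hand-rolled pattern matching.

```python
from typing import NamedTuple

GR = "http://purl.org/goodrelations/v1#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SHOP = "http://shop.example.org/"  # hypothetical shop namespace


class Literal(NamedTuple):
    """Marks an object as a data value rather than a URI."""
    value: object


offer = SHOP + "offers/trimmer-1"
price = SHOP + "offers/trimmer-1#price"

# Each triple states one fact; URIs identify the offer and its price node.
triples = {
    (offer, RDF_TYPE, GR + "Offering"),
    (offer, GR + "name", Literal("Cordless hedge trimmer")),
    (offer, GR + "hasPriceSpecification", price),
    (price, RDF_TYPE, GR + "UnitPriceSpecification"),
    (price, GR + "hasCurrency", Literal("USD")),
    (price, GR + "hasCurrencyValue", Literal(179.0)),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def price_of(triples, offer_uri):
    """Follow the price specification link and return (amount, currency)."""
    for _, _, spec in match(triples, s=offer_uri, p=GR + "hasPriceSpecification"):
        currency = match(triples, s=spec, p=GR + "hasCurrency")[0][2].value
        amount = match(triples, s=spec, p=GR + "hasCurrencyValue")[0][2].value
        return amount, currency
    return None


offer_price = price_of(triples, offer)
```

Because every statement is keyed by global identifiers, triples published by different sites about the same resource can simply be pooled into one set, which is the integration property the hypothesis relies on.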

This thesis proposes a solution to help overcome the main drawbacks of shallow product searches over the traditional, document-based Web. Our approach enables deep product comparison based on granular product descriptions published as Linked Data on the Web. In other words, we develop a search framework over real, structured product data available online as GoodRelations in RDFa and/or Microdata. Moreover, with the inherent ability to integrate Web resources on the Semantic Web, we are able to increase the visibility of product offers with sparse or low-quality product details, as often published by Web shop owners.
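One way to picture how sparse offers can gain visibility is to join them with richer product model data through a shared identifier. The sketch below assumes, purely for illustration, that offers and product models both carry a GTIN; the records, field names, and the GTIN value are hypothetical, and the function is not the enrichment procedure developed later in the thesis.

```python
def enrich_offers(offers, product_models):
    """Attach product model features to each offer that shares a GTIN."""
    models_by_gtin = {model["gtin"]: model for model in product_models}
    enriched = []
    for offer in offers:
        model = models_by_gtin.get(offer.get("gtin"))
        merged = dict(offer)
        # Offers without a known GTIN stay as sparse as they were.
        merged["features"] = dict(model["features"]) if model else {}
        enriched.append(merged)
    return enriched


# Hypothetical sparse offers as they might be published by Web shops.
offers = [
    {"shop": "http://shop-a.example.org/", "gtin": "04012345678901", "price_usd": 179.0},
    {"shop": "http://shop-b.example.org/", "gtin": None, "price_usd": 149.0},
]

# Hypothetical product model master data with detailed features.
product_models = [
    {"gtin": "04012345678901", "features": {"blade_length_cm": 51.0, "weight_kg": 3.1}},
]

enriched = enrich_offers(offers, product_models)
```

The first offer becomes comparable on blade length and weight although the shop published only a price, whereas the second offer, lacking an identifier, cannot be enriched this way.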

1.3 Motivation

Our research was motivated by the unsatisfying situation of current product searches on the Web, and by the economic relevance of product search and discovery in markets.


1.3.1 Implications of Constrained Web Searches

As a result of the limited capabilities of current search approaches, users frequently base their buying decisions on sparse information rather than taking into account comprehensive product details for product comparison. More specifically, very early in the search process people tend to

1. reduce searches to one or two dimensions of the product, i.e. they merely rely on the price tag and on product names or descriptions instead of more detailed qualitative and quantitative product characteristics,

2. make a preliminary selection in favor of a small number of products, i.e. they narrow down the option space very quickly in order to avoid time-consuming manual comparisons of product items over multiple product dimensions, and

3. further investigate and compare among the selected products, e.g. by visiting the manufacturer Web sites and doing manual side-by-side comparisons of product datasheets.

This approach, unfortunately, makes it hard to find close-to-optimal product offers on the Web, because it prematurely reduces the option space on the basis of incomplete information. It also risks unfairly favoring low-priced but potentially suboptimal goods over the best product for a given need. Furthermore, the effort needed to find and visit the pages with the relevant information, usually deep links of Web sites, is substantial, as illustrated in Figure 1.3.

Figure 1.3: Deep product comparison on the Web (diagram labels: Search Engine Result Page; Site 1; Site 2; Page 1 to Page 5)

Until the user is satisfied with the collected results, if ever, they first need to visit a couple of Web sites listed on the SERPs and navigate through deep Web links in order to gather the necessary information. This task possibly spans a series of Web searches. Even if a decision is made at some point, chances are that somewhere else, in one of the many data silos, there would have been relevant information leading to superior results.

1.3.2 Economic Relevance

Suboptimal product searches may create a series of problems, among others considerable time wasted on finding answers to an information need, wrong buying decisions because of incorrect, incomplete, or outdated information, or lost revenue due to unsatisfied customers.

From an economic point of view, many of the possible shortcomings of poor product searches are connected with transaction cost theory [PRW08, p. 42]. In particular, excessively high search costs¹⁴ constitute a remarkable amount of the overall expenses in a market economy of today’s information age. According to an estimate from 2005, knowledge workers spend about 38% of their working time searching for information [McD05]. Similarly, people looking for products and services on the Web often dedicate precious time and money to their searches. The extent of search costs is determined by the difficulty of finding relevant information. The more specific the products, the more effort is usually needed to find the right product offer.

Another important driver of search costs is information asymmetry between market participants. In the worst case, lack of information might lead to adverse selection scenarios. A classical example is the “market for lemons” [Ake70], where, in a used-car market, good cars are not sold because their quality is uncertain and not visible to the customers. A similar situation may arise for products offered on the Web. If searches on the Web are very shallow, additional value propositions are not rewarded properly and hence there is no incentive for vendors of high-quality products to participate in the market. Because prices of qualitatively superior products are too high to attract potential customers, they are deselected very early in the search process in favor of low-quality products. The products remaining on the market are those with older technology, inferior specifications, or fewer product features. In other words, the technical limitations of the Web that impede a vendor’s ability to articulate the value proposition of a superior product properly can lead to a market in which such a product will no longer be offered.

¹⁴ Search costs are the costs that accrue during the information gathering phase of a transaction, i.e. the initial phase of a transfer of property rights (see Chapter 2).

The field of search theory has dedicated extensive research to quantifying the effect of uncertainty on searches in markets. In his seminal work “The Economics of Information”, Stigler [Sti61] developed a formal model to derive the optimal number of searches for products on a market for goods. According to him, because there is price dispersion (an effect of information asymmetries) in the market, the optimum number of searches is a function of the search costs in relation to the expected gain of lowering the price in every additional search iteration [Sti61].
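The trade-off between search costs and expected gain can be made tangible with a small numeric sketch. It assumes, purely for illustration, that price quotes are independent and uniformly distributed between a lowest and a highest price, in which case the expected minimum of n quotes is low + (high - low) / (n + 1). The stopping rule and all numbers are assumptions of this example, not figures from [Sti61].

```python
def expected_min_price(n, low, high):
    """Expected lowest of n independent quotes drawn uniformly from [low, high]."""
    return low + (high - low) / (n + 1)


def optimal_searches(low, high, cost_per_search, quantity=1, max_n=1000):
    """Keep searching while the expected saving of one more quote exceeds its cost."""
    n = 1
    while n < max_n:
        gain = quantity * (expected_min_price(n, low, high)
                           - expected_min_price(n + 1, low, high))
        if gain <= cost_per_search:
            break
        n += 1
    return n


# Hypothetical market: prices between 100 and 200, each search costs 2.
n_cheap_search = optimal_searches(low=100.0, high=200.0, cost_per_search=2.0)

# The same market when each search costs 10.
n_costly_search = optimal_searches(low=100.0, high=200.0, cost_per_search=10.0)
```

Under these assumptions the marginal saving of the next quote shrinks as (high - low) / ((n + 1)(n + 2)), so higher search costs cut the search short: six quotes are worthwhile at a cost of 2 per search, but only two at a cost of 10.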

In view of the economic implications, an important quality criterion of search is to minimize search costs by fostering transparency and user engagement. Instead of restricting users to shallow keyword searches over text corpora, allowing them to compare products over multiple product dimensions would better account for their individual preferences (see Figure 1.2). By exploiting the multi-dimensionality of products (see Section 1.2.2), users could more easily identify and sort out “lemons” offered on the Web, which would ultimately lower search frictions by preventing unfair discrimination against high-quality products. More precisely, it may empower users to make rational choices by sacrificing the marginally lower prices of goods for significantly higher quality of substitute products. Up until now, however, buying decisions on the Web have mostly been based on the price tag of products and to a lesser degree on product features and quality [cf. Kar+05; Cha09b; Nel70]. This makes it extremely difficult for manufacturers and retailers to express the value proposition of their products, especially if there are multiple competing products on the market. In other words, they cannot easily communicate the comparative advantages of their items to potential customers. Hence, if searches operated on richer product descriptions, then even very specific products could benefit, because additional relevance signals could be conveyed to search engines and other consuming applications, which would render these products more visible to potential customers.

1.4 Research Questions

In order to verify our research hypothesis stated in Section 1.2.4, we identified five research questions (RQ), each of which is addressed in a dedicated chapter of this thesis.

RQ 1. How can we obtain structured product offer data from the Web and what are its main characteristics? (Chapter 3)

RQ 2. How can we enrich product offers from the Web with granular, high-quality, and comprehensive product details from product model master data? (Chapter 4)

RQ 3. How can we supply product category information in order to support the organization and aggregation of, and the navigation over, product data? (Chapter 5)

RQ 4. What are the major gaps in the product data obtained in RQ 1, RQ 2, and RQ 3? How can we cleanse and integrate these data sources into a consolidated, enriched, and augmented view on product offers? (Chapter 6)

RQ 5. How can we realize product search with support for deep product comparison and incremental learning that is based on SPARQL queries over RDF data? (Chapter 7)

1.5 Research Method and Contributions

In this thesis, we develop a framework for incremental product search based on product and offer data from the Semantic Web.

In our thesis, we predominantly use quantitative research methods, namely in Chapter 3, Chapter 4, and Chapter 6, where we analyze different aspects of the data collected from a Web crawl and BMEcat catalogs, or in Chapter 5, where we report on structured data derived from product classification systems. In Chapter 7, we rely on experimental and qualitative research methods when we empirically evaluate our research prototype with a usability survey.

Furthermore, as most of our work is concerned with developing software artifacts to demonstrate the viability of our approach, we touch on principles from design science research for information systems, as researched by Hevner, March, Park, and Ram [Hev+04]:

“[The design science paradigm] seeks to create innovations that define the ideas, practices, technical capabilities, and products through which the analysis, design, implementation, management, and use of information systems can be effectively and efficiently accomplished (Denning 1997; Tsichritzis 1998).” [Hev+04]

As a proof of concept, we show that our proposal is able to cope with real structured e-commerce data from the Web.

Subsequently, we present the main contributions of this thesis (see Figure 1.4). They are in alignment with the aforementioned research questions (see Section 1.4) and, accordingly, they correspond to Chapters 3-7 of the thesis.

Figure 1.4: Contributions of this thesis (diagram relating Contributions 1-5 to the elements Product and Offer Data; Product Model Master Data; Product Category Data; Integration, Cleansing and Enrichment; Deep Product Search and Discovery)

Contribution 1. Collect real structured product and offer data from the Web.

We collect a sample of product offers from real Web shops that provide valuable insights into the nature of the structured data published on the Web, e.g. How many of the offered items feature product identifiers like Global Trade Item Numbers (GTINs) or manufacturer part numbers (MPNs)?, How much granularity is supplied by structured product descriptions of product offers on the Web?, etc. The contribution is more precisely composed of the following tasks:

1. Identify and develop a catalog of data sources on the Web that contain structured product data.

2. Extract structured content from the Web pages found on the Web.

3. Analyze the nature of the data.

In this context, a parallelized Web crawler for semantic e-commerce data is presented.
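Once offers have been extracted, questions such as the share of offered items that carry a GTIN or an MPN reduce to a simple coverage count. The following is a minimal sketch of such a count, not the crawler or the analysis code of this thesis; the sample offers and the field names `gtin` and `mpn` are hypothetical.

```python
def identifier_coverage(offers, keys=("gtin", "mpn")):
    """Return, per identifier key, the share of offers with a non-empty value."""
    total = len(offers)
    if total == 0:
        return {key: 0.0 for key in keys}
    return {
        key: sum(1 for offer in offers if offer.get(key)) / total
        for key in keys
    }


# Hypothetical sample of extracted offers (illustrative values only).
sample_offers = [
    {"name": "Laptop A", "gtin": "4006381333931", "mpn": "LA-100"},
    {"name": "Laptop B", "mpn": "LB-200"},
    {"name": "Laptop C"},
    {"name": "Laptop D", "gtin": "", "mpn": None},  # empty values do not count
]

coverage = identifier_coverage(sample_offers)
print(coverage)
```

Empty strings and missing keys are both treated as "no identifier", which is one possible convention; a real analysis might additionally validate the identifier format.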

Contribution 2. Develop a methodology to generate high-quality product model master data for the Semantic Web.

The granularity of structured product data published by shop owners as RDFa or Microdata on the Web is limited. This is considered a serious bottleneck for deep product comparison. We thus describe a novel integration approach to support online product offers with the addition of authoritative product model master data from manufacturers and large retailers. In this regard, we suggest a command-line tool to convert product catalogs in the established BMEcat catalog exchange format [SLK05a] to the GoodRelations e-commerce vocabulary [Hep08a] for the Semantic Web.

Contribution 3. Derive product type information from existing product categorization standards and proprietary product category systems.

Products can be further described semantically by product categories. This allows for more intelligent processing of product data, e.g. related products can be grouped together for accounting purposes. Product categories from product classification systems also represent important distinguishing characteristics for product search. In this contribution, we apply an enhanced version of the GenTax algorithm [HdB07] to convert classification systems and taxonomies into respective Web Ontology Language (OWL) hierarchies that are compatible with the GoodRelations vocabulary for e-commerce.

Contribution 4. Analyze conceptual gaps between product offer data on the Web and product-related information of derived data sources. Collate and combine the product data into a consolidated data space.

The data sources obtained in Contributions 1-3 need to be integrated and cleansed in order to obtain a consolidated data space of product offers as necessary for product search. More precisely, the data is collated in an RDF store with a SPARQL endpoint. After the conceptual gaps and the gaps in the product data have been identified, the data sources are consolidated, enriched, and augmented using respective SPARQL CONSTRUCT queries.
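To illustrate the idea of enrichment, the following sketch derives new statements for offers from matching product model master data. It is a plain-Python analogue of what a CONSTRUCT-style rule could produce, not the SPARQL queries used in this thesis. Triples are modeled as tuples, and all `ex:` names (`ex:Offer`, `ex:ProductModel`, `ex:gtin`, `ex:hasModel`, `ex:weightKg`) are hypothetical placeholders rather than terms of an actual vocabulary.

```python
RDF_TYPE = "rdf:type"


def construct_enrichment(triples):
    """Derive triples that attach model properties to offers sharing the model's GTIN.

    The input is left unchanged; only the newly derived triples are returned.
    """
    models = {s for s, p, o in triples if p == RDF_TYPE and o == "ex:ProductModel"}
    offers = {s for s, p, o in triples if p == RDF_TYPE and o == "ex:Offer"}
    gtin_of = {s: o for s, p, o in triples if p == "ex:gtin"}
    model_by_gtin = {gtin_of[m]: m for m in models if m in gtin_of}

    derived = set()
    for offer in offers:
        model = model_by_gtin.get(gtin_of.get(offer))
        if model is None:
            continue  # no matching master data; the offer stays as it is
        derived.add((offer, "ex:hasModel", model))
        for s, p, o in triples:
            # Copy every descriptive property of the model to the offer.
            if s == model and p not in (RDF_TYPE, "ex:gtin"):
                derived.add((offer, p, o))
    return derived


# Hypothetical data: one offer matches a product model by GTIN, one does not.
triples = {
    ("ex:offer1", RDF_TYPE, "ex:Offer"),
    ("ex:offer1", "ex:gtin", "4006381333931"),
    ("ex:offer1", "ex:price", "499.00"),
    ("ex:offer2", RDF_TYPE, "ex:Offer"),
    ("ex:model1", RDF_TYPE, "ex:ProductModel"),
    ("ex:model1", "ex:gtin", "4006381333931"),
    ("ex:model1", "ex:weightKg", "1.4"),
}

derived = construct_enrichment(triples)
print(sorted(derived))
```

Returning only the derived triples keeps the original data and the inferred additions separate, which mirrors the way a CONSTRUCT query yields a new graph instead of modifying its source.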

Contribution 5. Build a faceted search interface which supports deep product comparison and incremental learning via user interaction.

We develop a prototype as a proof-of-concept that exemplifies product search over Semantic Web data. The search system is realized using a faceted search interface over product offer data, which allows users to compare product offers based on product details rather than the price tag only. The incremental search strategy lets the user gradually refine and relax the search scope, thereby providing a means to learn about the option space. To our knowledge, this is the first comprehensive attempt to support deep product comparison over Semantic Web data sources.
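The refine-and-relax interaction can be sketched in a few lines of Python. This is an illustrative in-memory model, not the prototype described in this thesis (which operates on RDF data via SPARQL); the offers, the facet names `brand` and `ram_gb`, and all function names are hypothetical.

```python
def matches(offer, constraints):
    """True if the offer satisfies every active facet constraint."""
    return all(offer.get(facet) in allowed for facet, allowed in constraints.items())


def search(offers, constraints):
    """Return the offers inside the current search scope."""
    return [offer for offer in offers if matches(offer, constraints)]


def refine(constraints, facet, values):
    """Narrow the scope: add (or replace) the constraint on one facet."""
    updated = dict(constraints)
    updated[facet] = set(values)
    return updated


def relax(constraints, facet):
    """Widen the scope: drop the constraint on one facet."""
    return {f: v for f, v in constraints.items() if f != facet}


def facet_counts(offers, facet):
    """Count how many of the given offers fall under each value of a facet."""
    counts = {}
    for offer in offers:
        value = offer.get(facet)
        counts[value] = counts.get(value, 0) + 1
    return counts


# Hypothetical offers (illustrative values only).
offers = [
    {"name": "Laptop A", "brand": "Acme", "ram_gb": 8, "price": 499},
    {"name": "Laptop B", "brand": "Acme", "ram_gb": 16, "price": 799},
    {"name": "Laptop C", "brand": "Bolt", "ram_gb": 16, "price": 749},
]

scope = refine({}, "ram_gb", [16])           # only 16 GB models
narrowed = refine(scope, "brand", ["Acme"])  # additionally restrict the brand
widened = relax(narrowed, "ram_gb")          # give up the RAM requirement again

print([o["name"] for o in search(offers, scope)])
print([o["name"] for o in search(offers, narrowed)])
print([o["name"] for o in search(offers, widened)])
print(facet_counts(search(offers, scope), "brand"))
```

The facet counts computed over the current result set are what lets a user see, before committing to a further refinement, how the remaining options are distributed.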

1.6 Publications

With permission by the PhD committee and in accordance with the regulations at the Universität der Bundeswehr München, parts of the work presented in this thesis have already been published at peer-reviewed conferences. In the following, we disclose the relevant publications along with the topics they deal with:

• Product model master data for the Semantic Web:

A. Stolz, B. Rodriguez-Castro, and M. Hepp: “Using BMEcat Catalogs as a Lever for Product Master Data on the Semantic Web”. In: Proceedings of the 10th Extended Semantic Web Conference (ESWC 2013). Montpellier, France: Springer Berlin Heidelberg, 2013, pp. 623–638.

• Product category information for the Semantic Web:

A. Stolz, B. Rodriguez-Castro, A. Radinger, and M. Hepp: “PCS2OWL: A Generic Approach for Deriving Web Ontologies from Product Classification Systems”. In: Proceedings of the 11th Extended Semantic Web Conference (ESWC 2014). Anissaras/Hersonissou, Crete, Greece: Springer Berlin Heidelberg, 2014, pp. 644–658.

• Currency conversion in SPARQL as a contribution to data cleansing:

A. Stolz and M. Hepp: “Currency Conversion the Linked Data Way”. In: Proceedings of the First Workshop on Services and Applications over Linked APIs and Data (SALAD2013). Montpellier, France: CEUR Workshop Proceedings, 2013, pp. 44–55.

• Faceted search for deep product comparison on the Semantic Web:

– A. Stolz and M. Hepp: “Adaptive Faceted Search for Product Comparison on the Web of Data”. In: Proceedings of the 15th International Conference on Web Engineering (ICWE 2015). Rotterdam, The Netherlands: Springer Berlin Heidelberg, 2015, pp. 420–429.

– A. Stolz and M. Hepp: “An Adaptive Faceted Search Interface for Structured Product Offers on the Web”. In: Proceedings of the 4th International Workshop on Intelligent Exploration of Semantic Data (IESD 2015). Bethlehem, PA, USA, 2015, no pages.

The author of this thesis further authored and co-authored publications and technical reports during the work on this thesis that were not directly included in this work. Where relevant, they are cited as external references:

• A. Stolz, M. Ge, and M. Hepp: “GR4PHP: A Programming API for Consuming E-Commerce Data from the Semantic Web”. In: Proceedings of the First Workshop on Programming the Semantic Web (PSW 2012). Boston, MA, USA, 2012, no pages.

• A. Stolz and M. Hepp: “From RDF to RSS and Atom: Content Syndication with Linked Data”. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media (Hypertext 2013). Paris, France: ACM, 2013, pp. 236–241.

• A. Radinger, B. Rodriguez-Castro, A. Stolz, and M. Hepp: “BauDataWeb: The Austrian Building and Construction Materials Market as Linked Data”. In: Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS 2013). Graz, Austria: ACM, 2013, pp. 25–32.

• A. Stolz, B. Rodriguez-Castro, and M. Hepp: RDF Translator: A RESTful Multi-Format Syntax Converter for the Semantic Web. Technical Report TR–2013–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2013.

• A. Stolz and M. Hepp: GR2RSS: Publishing Linked Open Commerce Data as RSS and Atom Feeds. Technical Report TR–2014–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2014.

• A. Stolz and M. Hepp: “Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce”. In: Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015). Bethlehem, PA, USA, 2015, no pages.

1.7 Thesis Outline

This thesis is organized as follows:

• Chapter 2 introduces background and related work relevant for the topic of this thesis.

• Chapter 3 describes the implementation of a Web crawler and the data collection process, analyzes the nature of the instance data, and summarizes relevant statistics.

• Chapter 4 presents a converter from product catalogs encoded in the XML-based BMEcat format to GoodRelations in RDF.

• Chapter 5 details an approach for deriving Web ontologies for products and services with product classes and features from product classification systems.

• In Chapter 6, we develop a typology of common data quality problems and gaps in product data, report statistics on the prevalence of obstacles in the Web crawl from Chapter 3, and propose a data management interface for SPARQL-compliant RDF stores.

• Building on the preceding work, we suggest a faceted product search interface over RDF data in Chapter 7.

• Finally, Chapter 8 concludes our work by summarizing the contributions, discussing the results and limitations, and pointing to future work.

2 Background and Related Work

2.1 Relevant Economic Concepts
2.1.1 Transaction Cost Theory
2.1.1.1 Asset Specificity
2.1.1.2 Bounded Rationality
2.1.2 Information Economics and Search Theory
2.1.3 Utility and Preferences
2.2 E-Business and E-Commerce
2.2.1 Types of E-Commerce Transactions
2.2.2 Types of Goods for E-Commerce
2.2.3 Product Master Data
2.2.4 Product-related Information Systems
2.2.4.1 Product Data Management
2.2.4.2 Product Lifecycle Management
2.2.4.3 Product Information Management
2.2.4.4 Enterprise Resource Planning
2.2.5 Content Integration
2.2.6 Standards for B2B Data Interchange
2.2.6.1 Transaction and Catalog Exchange Standards
2.2.6.2 Code Standards
2.2.6.3 Product Identifiers for Electronic Business
2.2.7 Product and Services Classification Systems
2.2.7.1 Knowledge Organization
2.2.7.2 Product Categorization Standards
2.2.7.3 Proprietary Product Classification Systems and Taxonomies
2.2.8 Electronic Marketplaces
2.2.9 Electronic Tendering
2.3 Semantic Web and Linked Data
2.3.1 Web
2.3.1.1 World Wide Web
2.3.1.2 Semantic Web
2.3.1.3 Linked Data
2.3.2 Unique Identifiers
2.3.3 Resource Description Framework
2.3.4 RDF Serialization Formats
2.3.4.1 RDF/XML
2.3.4.2 Turtle
2.3.4.3 RDFa
2.3.4.4 JSON-LD
2.3.4.5 Non-RDF Syntaxes for the Semantic Description of Data
2.3.5 Ontology Languages
2.3.5.1 RDF Schema
2.3.5.2 OWL Web Ontology Language
2.3.6 Ontologies and Global Schemas
2.3.6.1 Schema.org
2.3.6.2 GoodRelations
2.3.6.3 Simple Knowledge Organization System
2.3.7 Query and Rule Languages
2.3.8 Storage and Reasoning
2.3.8.1 RDF Stores
2.3.8.2 SPARQL Endpoints
2.3.8.3 Reasoning
2.3.8.4 Open-World and Closed-World Assumptions
2.3.8.5 Non-Unique-Names Assumption
2.4 Semantic Data Interoperability
2.4.1 Data Integration and Heterogeneity
2.4.2 Schema and Ontology Matching
2.4.2.1 Schema Matching
2.4.2.2 Ontology Matching
2.4.3 Data and Instance Matching
2.4.4 String Matching
2.4.5 Data Cleansing
2.5 Product Search
2.5.1 Information Need
2.5.2 Search Types
2.5.3 Information Retrieval
2.5.3.1 Approaches
2.5.3.2 Ranking
2.5.3.3 Evaluation Criteria for Information Retrieval
2.5.3.4 Tools
2.5.4 Human-Computer Interaction
2.5.4.1 Static versus Dynamic Search
2.5.4.2 Lookup versus Learning
2.5.4.3 Searching versus Browsing
2.5.4.4 Interaction Paradigms for Search
2.5.4.5 Faceted Search Interfaces
2.5.4.6 Design Guidelines for Search Interfaces
2.5.5 Recommender Systems
2.5.5.1 Approaches
2.5.5.2 Limitations
2.5.5.3 Tools
2.5.6 Natural Language Processing
2.5.6.1 Approaches
2.5.6.2 Tools
2.5.7 Matchmaking
2.5.7.1 Characteristics of Matchmaking
2.5.7.2 Matchmaking and Information Retrieval
2.5.7.3 Ranking with Match Degrees
2.5.7.4 Related Research Fields and Applications Areas

This thesis focuses on product search and discovery on the Semantic Web. Hence, relevant theoretical background and related work include the economic significance of search, related aspects from e-business and e-commerce, the concepts of the Semantic Web and Linked Data, aspects of semantic data integration, as well as the main principles of disciplines at the intersection of product search, namely information retrieval (IR), human-computer interaction (HCI), recommender systems, natural language processing (NLP), and matchmaking. See Figure 2.1 for a chapter outline.


Figure 2.1: Chapter outline (2.1 Relevant Economic Concepts; 2.2 E-Business and E-Commerce; 2.3 Semantic Web and Linked Data; 2.4 Semantic Data Interoperability; 2.5 Product Search)

2.1 Relevant Economic Concepts

In the following, we introduce relevant economic concepts related to product search and discovery, including transaction cost theory, information economics, search theory, and utility.

2.1.1 Transaction Cost Theory

The trading of products and services over the market does not only involve the costs of the goods per se, but also additional costs that accrue due to the exchange of property rights on a good that is the subject of a transaction [PRW08, p. 42]. These costs are referred to as transaction costs, sometimes also coordination costs [cf. Coa60; Wil81; PRW08, p. 42].

Transaction costs were first addressed in 1937 by Ronald Coase in his seminal article “The Nature of the Firm”, albeit not yet termed as such. In seeking to explain the existence of the institution of a firm as an alternative to purchasing everything on the market, Coase argued that using the market mechanism is not free [Coa37]. According to him, a market transaction encompasses the costs of the good, but also the costs of using the market [Coa37]. The amount of transaction costs that accrue by using the market mechanism can thus be used as a theoretical model to explain the existence of firms.

Oliver E. Williamson raised the public awareness of transaction costs. While Williamson principally shared Coase’s conceptions, he added that internal costs such as for governance and coordination are present within firms [Wil81]. In other words, organizations have to spend on resources that go beyond the costs of the transformation process itself, e.g. time and money for instructing an inexperienced workforce or overhead to control opportunistic behavior [cf. Wil81].

Transaction costs accumulate as part of the activities of a transaction. In Table 2.1, we illustrate typical transaction activities using the example of purchasing a laptop. These activities are, in chronological order: search, selection, negotiation, contract set-up, exchange, supervision, and enforcement [cf. PRW08, p. 42].

Table 2.1: Transaction activities [cf. PRW08, p. 42]

Activity            Example

1. search           Seek and compare relevant laptop models
2. select           Choose suppliers that offer potentially interesting laptops
3. negotiate        Negotiate terms and conditions with one of the suppliers
4. set-up contract  Set up a contract with the supplier
5. exchange         Get laptop from supplier and transfer money in return for the laptop
6. supervise        Check product quality and stick to the contract
7. enforce          Exercise consumer rights (e.g. require a refund)

In his organizational failure framework, Oliver E. Williamson identifies four factors that influence the level of these transaction costs [cited from PRW08, p. 43]:

1. Bounded rationality: Actors are not perfectly rational because they are missing important information; either, because they do not have access to the relevant information, or because they are incapable of processing the information.

2. Opportunism: Opportunistic behavior such as individuals’ profit maximization causes markets to set up control mechanisms.

3. Specificity: Specific investments create dependencies between business partners, which can lead to opportunistic exploitations (opportunism).

4. Uncertainty/complexity: Environmental variables like prices, conditions, quantities are not easily predictable (bounded rationality).

In the following, we discuss specificity and bounded rationality as relevant factors for search costs.

2.1.1.1 Asset Specificity

One of the key drivers of transaction costs is the specificity of assets [PRW08, p. 43, p. 46]. Specificity is the value loss of a good if it is used for the next best utilization [PRW08, p. 43]. For instance, a wedding cake is a very specific good, because if not served at the particular wedding event it becomes immediately worthless. Similarly, investments in highly specialized machines are considered very specific. On the contrary, a kilogram of potatoes is not very specific, because its use is not confined by the purpose for which it was initially acquired [cf. PRW08, p. 43].

Williamson [Wil83] distinguishes four dimensions of asset specificity:

1. Site specificity, e.g. location-sensitive investments, such as a manufacturer processing copper and located next to a copper mine.

2. Physical asset specificity, e.g. the investment in a fully customized ERP system.

3. Human asset specificity, e.g. the investment in specialized employees.

4. Dedicated asset specificity, e.g. the investment in plants serving a single purpose.

Malone, Yates, and Benjamin [MYB87] identified yet another dimension, namely

5. Time specificity, e.g. time-critical assets such as goods that are perishable or a newspaper that becomes outdated.

The specificity of goods has been increasing for decades, favored by the globalization and mass customization of products that fostered the diversity of products and services available on markets [cf. PRW08, p. 9, p. 306]. Entering a bakery today, in contrast to a hundred years ago, presents us with a large variety of different sorts of bread. Rather than being restricted to only one type of bread, today we can select between dozens of them.

As a general rule of thumb, the higher the specificity of a product, the harder it is to procure it on a market. The rationale can be found in higher transaction costs, caused by specific investments and the associated risk of opportunistic behavior [PRW08, p. 44]. Nonetheless, the Web is often considered to facilitate transactions over the market [MYB87]. While in the past significant effort was necessary to find and select suppliers (e.g. using yellow pages), this has since changed, because vendors increasingly use the Web as their primary sales channel.

2.1.1.2 Bounded Rationality

Many economic models make simplifying assumptions about rational actors with perfect information (known as homo economicus [Per95]), i.e. utility-maximizing individuals and profit-maximizing firms [cf. Sim59]. In reality, though, the information of individuals is incomplete, i.e. actors have to deal with environmental factors like uncertainty and complexity as well as cognitive limitations in accessing and processing information [cf. PRW08, p. 43]. Bounded rationality [Sim97, p. 118], as Herbert A. Simon refers to this problem, causes people to make decisions under uncertainty (e.g. by applying heuristics), which at some point might lead to satisfactory but not necessarily optimal solutions [cf. Sim97, p. 119]. Bounded rationality is thus considered a trigger for higher transaction costs and, more specifically, for search costs.

2.1.2 Information Economics and Search Theory

As Stigler [Sti61] once pointed out, the value of information has for a long time been insufficiently considered by economists.

“One should hardly have to tell academicians that information is a valuable resource: knowledge is power. And yet it occupies a slum dwelling in the town of economics.” [Sti61]

The field of information economics studies the relevance, value, and characteristics of information in economies and economic decisions. As such, it deals with topics like information as a good, information asymmetry, and the price mechanism [cf. Sti61; SV99].

Because information has an influential character on decision-making of economic subjects, it plays an important role throughout the transformation process of a company. Information is considered as a value-adding good just like physical goods that can be purchased and sold [cf. SV99, pp. 3f.]. Indeed, “people are willing to pay for information” [SV99, p. 3].

For a company, an information advantage often translates into a competitive advantage. An information advantage is determined by information asymmetries [cf. Ake70], a situation where one party has more or better information than the other party. Information asymmetries are prominently used by the theoretical model of the principal-agent problem [PRW08, pp. 47–51; Ros73; JM76]. In a principal-agent problem, one party (agent) acts on behalf of a second party (principal) [e.g. Ros73; JM76]. The agent has an information advantage over the principal, e.g. being aware of its own strengths and weaknesses (hidden characteristics), actions (hidden action), and goals (hidden intention) [cf. Spr90]. The agent could thus behave opportunistically by concealing certain information. Typical principal-agent situations occur in employer-employee relationships, doctor-patient relationships, or insurance contracts [e.g. Ros73]. But they could as well appear in electronic business transactions, for example when a customer depends on the goodwill of vendors and the accuracy and transparency of their product descriptions. In general, the roles of the principal and the agent can be bidirectional, either depend on the point of view, or change during a transaction (e.g. the agent makes transaction-specific investments). The risk of opportunistic behavior inherent to principal-agent relationships often leads to the problems of adverse selection [Ake70], moral hazard [e.g. Arr63; Hol79], and hold-up [Gol76].

An often-discussed problem occurring in (online) marketplaces is price dispersion [e.g. PRS04; Hop08; BS00; Sti61]. Price dispersion is the result of consumers being under-informed about the prices of the goods [Hop08], which can be attributed to information asymmetries. The different price setting of the same goods by different vendors leads not only to market inefficiencies but also to considerable search effort. Thus, lowering the search costs plays a crucial role for reducing price dispersion in online markets [cf. BS00]. For a few categories of otherwise homogeneous goods, this might be a simple challenge and price comparison services are trying to fulfill that need. In a market with an increasing option space of product choices with a large number of relevant product dimensions this turns into a very hard problem.

Search theory studies the economic behavior in markets with search frictions, i.e. where individuals have imperfect information and invest time and effort in searching [Pis01]. While Pissarides [Pis01] focused in his research particularly on the labor market, using search theory to explain frictional unemployment [cf. Pis01], Stigler studied price dispersion. Stigler [Sti61] developed a function for the optimal number of searches based on search frictions due to price dispersion in markets. According to this, the optimal amount of search is reached when the marginal cost of an additional search (i.e., requesting a price quotation from an additional seller) matches or exceeds the expected returns [Sti61]. For lower-priced products, this can mean that even a low-cost search does not necessarily pay off [Nel70].
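The stopping rule can be made concrete with a small numerical sketch. It rests on assumptions that are introduced here for illustration only: prices quoted by sellers are independent and uniformly distributed on an interval [low, high], the buyer purchases one unit, and every additional quotation costs a fixed amount. Under these assumptions, the expected lowest of n quotes is low + (high - low) / (n + 1). The function names and all numbers are hypothetical.

```python
def expected_min_price(n, low, high):
    """Expected lowest of n independent quotes drawn uniformly from [low, high]."""
    return low + (high - low) / (n + 1)


def optimal_searches(low, high, cost_per_search, max_searches=1000):
    """Smallest n >= 1 at which one more quote no longer pays for itself.

    Searching continues while the expected saving from the next quote
    exceeds the cost of obtaining it.
    """
    n = 1
    while n < max_searches:
        marginal_saving = (
            expected_min_price(n, low, high) - expected_min_price(n + 1, low, high)
        )
        if marginal_saving <= cost_per_search:
            break
        n += 1
    return n


# A laptop priced between 400 and 600, each additional quote costing 5:
print(optimal_searches(400, 600, 5))

# A low-priced item between 4 and 6 at the same search cost:
print(optimal_searches(4, 6, 5))
```

With the same cost per search, the expensive item justifies several quotations while the cheap item justifies only the one needed to buy at all, which illustrates why search may not pay off for lower-priced products.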

2.1.3 Utility and Preferences

Jeremy Bentham defined utility as the “property in any object, whereby it tends to produce benefit, advantage, pleasure, good, or happiness” [Ben23].

The school of hedonism, which dates back to ancient Greece, was first concerned with utility maximization, more precisely with individuals that seek happiness (or pleasure) while minimizing pain [Wei12]. Utilitarianism is a prominent school of thought grounded on the principles of hedonism. It is represented by Jeremy Bentham and John Stuart Mill [Wei12]. Bentham was the founder of the greatest-happiness principle, denoting that an action is considered moral if it creates happiness to society as a whole, to the greatest number of people [Ben23]. Unlike Bentham, who considered all differences among pleasures to be quantifiable, Mill argued that pleasures exhibit different qualities, by treating physical forms of pleasure as inferior to higher, intellectual pleasures [Mil06].

In microeconomics, utility is typically modeled as a utility function [NS08, p. 87]. The utility function for a bundle consisting of two goods can be written as

utility = U(x, y) (2.1)

The utility function U describes the preference structure of an individual over a combination of goods, i.e. x and y [cf. NS08, p. 89]. Preferences are assumed to be complete, transitive, and continuous [NS08, pp. 87f.]. Completeness means that for each pair of goods there must exist a preference relation, i.e. either x is preferred to y or vice versa, or both alternatives are equally preferred [NS08, p. 87]. Transitivity requires that preferences are consistent, i.e. that if x is preferred to y and y to z, then x is preferred to z [NS08, p. 87]. And finally, continuity describes the case that similar goods must exhibit a similar utility, i.e. if x and y are similar, and x is preferred to z, then also y is preferred to z [NS08, pp. 87f.].
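As a minimal sketch of the first two axioms, the following snippet derives a weak preference relation from a utility function and checks completeness and transitivity over a handful of alternatives. The Cobb-Douglas form U(x, y) = sqrt(x * y) and the sample bundles are assumptions chosen for illustration. Each alternative is a bundle, i.e. a pair of quantities (x, y) of the two goods.

```python
from itertools import product


def utility(x, y):
    """Hypothetical utility function U(x, y) = sqrt(x * y) (Cobb-Douglas form)."""
    return (x * y) ** 0.5


def weakly_prefers(a, b):
    """True if bundle a is at least as good as bundle b under U."""
    return utility(*a) >= utility(*b)


# Illustrative bundles of quantities (x, y).
bundles = [(1, 4), (2, 2), (3, 3), (4, 1), (1, 1)]

# Completeness: every pair of bundles can be ranked in at least one direction.
complete = all(
    weakly_prefers(a, b) or weakly_prefers(b, a)
    for a, b in product(bundles, repeat=2)
)

# Transitivity: a >= b and b >= c must imply a >= c.
transitive = all(
    weakly_prefers(a, c)
    for a, b, c in product(bundles, repeat=3)
    if weakly_prefers(a, b) and weakly_prefers(b, c)
)

print(complete, transitive)
```

Any preference relation obtained by comparing the values of a real-valued utility function passes both checks, because the ordering of real numbers is itself complete and transitive. Bundles with equal utility, such as (1, 4) and (2, 2) here, are equally preferred.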

More contemporary research efforts in psychology indicate that increasing numbers of options make people feel less confident and less happy. In The Paradox of Choice: Why More Is Less, Barry Schwartz addresses the problem of too much choice in today’s society, which raises the individual’s expectations that are not easily satisfiable, and people are unhappy because they have the impression of not having made the best decision [Sch04]. Oulasvirta, Hukkinen, and Schwartz [OHS09] argue that this effect is also present in search engines of today, evaluating the effect of the number of search results displayed to the user.

As already noted for bounded rationality [Sim59], people frequently have imperfect information, thus they tend to choose options that to some degree match their expectations rather than trying to find the optimum. In other words, because the price of finding the optimum is often considerable, in some cases it may be superior to satisfice (e.g. to settle for a “fair price”) than to maximize or optimize (e.g. to aim for the “best price”) [cf. Sim59; cf. Sim97, p. 119].
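The trade-off between satisficing and maximizing can be illustrated with a short sketch. It assumes, for illustration only, that every inspected price quote incurs a fixed search cost and that the satisficer stops at the first quote at or below an aspiration level (the “fair price”). The quotes, the aspiration level, and the function names are hypothetical.

```python
def satisfice(prices, aspiration, cost_per_quote):
    """Stop at the first quote at or below the aspiration price.

    Returns (price paid, quotes inspected, total cost including search).
    Assumes at least one quote is available.
    """
    for inspected, price in enumerate(prices, start=1):
        if price <= aspiration:
            return price, inspected, price + inspected * cost_per_quote
    # No quote met the aspiration level: fall back to the best one seen.
    best = min(prices)
    return best, len(prices), best + len(prices) * cost_per_quote


def maximize(prices, cost_per_quote):
    """Inspect every quote and take the lowest price."""
    best = min(prices)
    return best, len(prices), best + len(prices) * cost_per_quote


# Hypothetical quotes in the order in which they are encountered.
quotes = [560, 520, 505, 540, 498, 530]

print(satisfice(quotes, aspiration=510, cost_per_quote=5))
print(maximize(quotes, cost_per_quote=5))
```

In this example the maximizer finds the lowest price (498) but inspects all six quotes, so the total outlay of 528 exceeds the satisficer's 520 for the “fair price” of 505 found after three quotes. With a lower search cost or a different ordering of quotes, the comparison can turn out the other way.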

2.2 E-Business and E-Commerce

When referring to business activities over the Internet, the two terms e-business and e-commerce are frequently used. Their relationship is not always clear-cut [cf. PRW08, p. 274; cf. Cha09a, p. 13]. Quite often, e-commerce is considered a subdiscipline of e-business [Cha09a, pp. 13f.]. The two other prevalent perceptions are that e-business and e-commerce are treated as equivalent and used synonymously, or that they are considered as partially overlapping [Cha09a, pp. 13f.]. Similarly, the term e-government is often used to refer to e-commerce applied to the public sector [Cha09a, pp. 28f.].

The term e-business was coined by the International Business Machines Corporation (IBM) and made public in 1997 as part of an advertisement campaign for their Internet-based transaction services. Consequently, IBM was the first to give a definition for e-business, stating that “e-business is about transforming key business processes by using Internet technologies” [IBM11]. According to this, e-business entails all business-related activities that are conducted over the Internet. This may in particular include activities such as the collaboration between business partners, customer service, or intra-organizational transactions. Under this premise, concepts like e-procurement, e-commerce, e-sales, or e-marketplaces fall into the broad definition of e-business. We assent to this broad meaning of e-business in our subsequent discussion about e-commerce.

In contrast to e-business, e-commerce is about trading goods over electronic systems. It generally puts more emphasis on the transactional aspect, i.e. the processes and activities related to purchasing and selling products and services online: “Electronic commerce is the exchange, distribution, or marketing of goods or services over the Internet” [Gol08]. The affected activities may encompass procurement, sales, advertising, service provisioning, and payment. E-commerce helps to improve the quality of the decision-making process, to lower costs, and to increase the speed of transactions [KR03].

2.2.1 Types of E-Commerce Transactions

Electronic transactions involve entities that act as either selling or buying parties [KW97, pp. 4f.; Cha09a, p. 11]. Sellers are often referred to as provider, producer, supplier, or server, whereas buyers are also known as consumer, customer, or client. This distinction of sell-side and buy-side e-commerce serves to define the types of e-commerce transactions with respect to the business entities involved.

The three most established and widely known transaction categories are business-to-business (B2B), business-to-consumer (B2C), and consumer-to-consumer (C2C) [Cha09a, p. 26; BGG01]. Table 2.2 shows all possible combinations of transaction categories.

Table 2.2: Categories of e-commerce transactions by entities involved [cited from Gri03]

                           Buyer
Seller         Business    Government    Consumer
Business       B2B         B2G           B2C
Government     G2B         G2G           G2C
Consumer       C2B         C2G           C2C

B2B e-commerce describes exchanges of products and services between supplier and manufacturer, manufacturer and wholesaler, or wholesaler and retailer. On the other hand, B2C denotes the transactions between retailer and consumer, less often between wholesaler and consumer (see Figure 2.2). By virtue of the Internet, traditional distribution channels like those where intermediaries are chained up as in Figure 2.2 become gradually less rigid [Cha09a, p. 65]. E.g., a manufacturer could open an online store where consumers can purchase products directly.

[Figure 2.2: supply chain Supplier → Manufacturer → Wholesaler → Retailer → Consumer, with the B2B and B2C segments marked]

Figure 2.2: B2B and B2C e-commerce [adapted from Cha09a, p. 65]

The most prominent example of a B2C marketplace is Amazon, a large online retailer whose business model relies on selling products to end customers. C2C, having been less common in the past, gained a lot of interest through eBay, an electronic auction and classifieds platform that, as an intermediary, facilitates transactions between individuals [cf. BGG01]. On eBay, people and organizations can both advertise and purchase goods, whereas the transaction fulfillment is controlled and handled by the platform provider.

Besides those addressed so far, there exist additional, less evident forms of transaction relationships (see Table 2.2), namely between business and government (B2G), consumer and government (C2G), consumer and business (C2B), government and government (G2G), government and consumer (G2C), and government and business (G2B) [e.g. Gri03].

New application areas for e-business have recently been gaining traction. Smartphones and tablet computers not only account for the largest number of computing devices sold in recent years [cf. Int15], but they also provide novel ways of conducting e-commerce, termed mobile commerce, mobile e-commerce, or m-commerce [e.g. VVK00; Sen00; BPJ02; TBH06]. Similarly, the new distribution channels opened up by social media are commonly referred to as social commerce [e.g. TBL10; WZ12], or more specifically, in the realm of Facebook, f-commerce [e.g. FH13].


2.2.2 Types of Goods for E-Commerce

A good in the context of e-commerce is known as a commodity or product that can be traded [cf. Mar04; Hil99]. The term good is often used synonymously with the terms commodity and product. The term commodity was used in particular by early economists to refer to goods [cf. Hil99]. A commodity, according to Marx, is “‘any thing necessary, useful or pleasant in life,’ an object of human wants, a means of existence in the broadest sense of the word” [Mar04, p. 20]. Commodities are determined by their use-value and exchange-value, meaning that they can satisfy human needs and can be exchanged for another commodity in a market [Mar04, pp. 19–21]. Hill [Hil99] characterizes goods as follows:

“Goods are entities of economic value over which ownership rights can be established. If ownership rights can be established they can also be exchanged, so that goods must be tradable.” [Hil99]

There are several possible ways of classifying products or goods, whereof we subsequently present some popular distinctions relevant for electronic search and discovery. One possibility is to group products by their physical properties [Hil99], i.e.

• tangible goods, and

• intangible goods.

In his article clarifying the difference between intangible goods and services (both of intangible nature), Hill argues that intangible goods, unlike services, have all essential economic characteristics of goods [Hil99]. Hence, “the traditional dichotomy between goods and services should be replaced by a breakdown between tangible goods, intangible goods and services” [Hil99]. While tangible goods are physical goods, i.e. products that can be touched like food, clothing, electronic devices, or sports equipment, intangible goods are non-physical products [cf. Hil99]. Information goods, playing an important role in information economics, are a special subset of intangible goods, which have a particular economic value [cf. Hil99]. The value of information goods is determined by the information they provide. Classical examples of this kind of goods are digital photographs, music, movies, spreadsheets, or software. Unlike physical goods, the reproduction of information goods does not cost significant additional amounts of money¹ [SV99, p. 3].

Nelson [Nel70] makes a distinction between

• search goods, and

• experience goods.

¹ The cost structure of information goods is usually made up of high fixed costs, namely the costs for design and production of the first copy, and rather low variable costs, namely the costs for producing additional copies [SV99, p. 3].

Search goods are those goods that are easy to evaluate without having seen or experienced them. By contrast, it is harder to assess the quality and characteristics of experience goods unless they have been used [Nel70]. This is particularly relevant for information goods, because for them it is generally difficult to assess their quality before having consumed them.

From a microeconomic point of view, products can be classified according to their relationships to each other. Economists dealing with microeconomics [e.g. NS08, p. 185] distinguish two important groups of products, namely

• complements, and

• substitutes.

Complements, on the one hand, are products that complement each other, e.g. product accessories like cream for coffee or sugar for tea. Substitutes, on the other hand, are products that can be substituted for one another, namely because they are of similar use (i.e. they create similar utility, e.g. coffee and tea) or because they come from different manufacturers with the same or a similar functionality (e.g. Pepsi Cola and Coca Cola) [cf. NS08, p. 185]. Further, it is possible that products exhibit no obvious relationship, e.g. milk and automobiles. This does not mean, however, that they are necessarily independent. In the given example, there is at least no evident relationship, unless milk were primarily consumed while driving a car or had been invented as fuel for automobiles.

2.2.3 Product Master Data

Master data refers to information artifacts about business entities, such as parties, places, and things [Whi+06a]. Accordingly, product master data is master data related to products, such as manufacturer information, product features, and product images. As its main characteristics, master data is (1) shared between applications and business processes among one or more departments or organizations; (2) relatively static and infrequently changed during business processes; and, (3) potentially arranged hierarchically (e.g. an employee is a member of a department within a firm) [Los09, p. 8]. The information stored about such objects comprises attributes, definitions, roles, connections, categories, and metadata [Los09, p. 6]. Because master data represents the core business entities of an organizational unit that persist for extended periods of time, master data is often stored centrally, referenced and accessed by various departments when needed.


Transactional and analytical systems use master data for executing business operations and reporting business information. The data generated by these systems is referred to as transactional (or operational) and analytical data [Ora11]. Transactional systems create, modify, archive, or delete operational data around existing master data. This comprises invoices and orders that refer to product information and customer details, respectively. Meanwhile, analytical systems compile decision-supporting reports relying on metrics about suppliers, customers, products, or employees, such as revenues, costs, and performance indicators [MV05].

The efforts to set up policies, methods, and infrastructure to capture, integrate, and share relevant master data between organizations’ stakeholders and information systems are called master data management (MDM) [Los09, pp. 8f.].

2.2.4 Product-related Information Systems

For quite some time, data has been stored in a decentralized fashion within companies. The increase in functionality and applications for desktop computers made it common for departments and employees to autonomously manage information, e.g. as text files, spreadsheets, or even printed on paper [cf. Los09, p. 3; Cha09a, pp. 165f.]. This led to many disparate data silos with heterogeneous information, which made it almost impossible to share information in an efficient way [Cha09a, pp. 165f.].

Nowadays, there exist various approaches and systems for the centralized organization and synchronization of product data within and across company borders. The most prominent concepts are product data management (PDM), product lifecycle management (PLM), product information management (PIM), and enterprise resource planning (ERP).

2.2.4.1 Product Data Management

Product data management (PDM) first arose in the 1980s and 1990s as a concept derived from engineering data management (EDM) [SI05, p. 1]. PDM describes the technology and software systems to integrate all information artifacts related to products. This includes capturing information from other systems like computer-aided design (CAD), computer-aided engineering (CAE), and computer-aided manufacturing (CAM), as well as providing interfaces to enterprise resource planning (ERP) and supply chain management (SCM) systems [cf. LMS07]. As such, PDM was a good fit for small to medium-sized companies that needed to maintain and share product information within an organization’s borders.


2.2.4.2 Product Lifecycle Management

Product lifecycle management (PLM) describes an integrated management approach that spans the entire product lifecycle [LMS07]. The idea mainly emerged at the end of the twentieth century, focusing on product and support innovation in an Internet-enabled globalization of markets and mass customization of products [Liu+09].

PLM aims at raising companies’ global competitiveness by facilitating the information flow within and across companies at every stage of the product lifecycle [e.g. SI05, p. 1; Liu+09; LMS07]. In other words, the challenge of PLM is to manage product information along the full product lifecycle [Sta11, p. 3; LMS07]. This specifically implies supporting the five phases of product development, namely imagination, definition, realization, use/support, and retirement/disposal [Sta11, p. 2] (i.e. the full path “from cradle to grave” [Sta11, p. 1]).

PLM is sometimes considered an advancement over PDM [SI05, p. 244]. In addition to what PDM covers, it can contribute useful product-related information that emerges in later stages of the product development. Such additional information complements the product specifications and product metadata already present in PDM systems [SI05, pp. 7f.].

2.2.4.3 Product Information Management

The goal of product information management (PIM) is to create “one shared source of product information” from where it can be distributed to different sales channels [Abr14, p. 3].

Online vendors face several challenges where spreadsheets are inappropriate [Abr14, p. 1]. The global market poses special requirements, forcing companies to sell products to stakeholders from different countries, at varying prices, and under differing brands [Abr14, p. 2]. In particular, this involves the need to offer a multitude and variety of products online, growing customer demand for detailed product information, or cross-media publishing [Abr14, p. 2].

PIM systems store product information in a central repository where heterogeneous sources of product master data are reconciled, and create a single product view for all stakeholders [cf. Whi07]. This single product view aims at serving various distribution channels such as printed product catalogs, flyers, Web pages, online stores, mobile phones, tablets, etc. [Abr14, p. 1]. For instance, a PIM system can be used simultaneously to power the content management system (CMS) of an online shop, to create electronic product data feeds for business partners, and to compile a print catalog for customer home delivery.

2.2.4.4 Enterprise Resource Planning

Enterprise resource planning (ERP) describes a system integration approach to overcome the fragmented application infrastructure that prevailed in companies in the 1990s [Cha09a, p. 166]. ERP typically refers to a monolithic system that functionally integrates the departments within a company, e.g. procurement, production, marketing, logistics, finance, and human resources [Cha09a, p. 167]. Unlike the previously presented approaches, ERP is not exclusively product-centered but also covers master data of other business entities, along with operational and analytical data, e.g. material requirements, orders, and invoices. Nevertheless, ERP is often used together with other systems like PDM or PIM. Among the most popular ERP system vendors are SAP, Oracle, IFS, and Microsoft [Gua+15; cf. BSG99].

In Table 2.3, we contrast MDM, PDM, PLM, PIM, and ERP along several dimensions that we distilled from our previous descriptions.

Table 2.3: Characterization of MDM, PDM, PLM, PIM, and ERP

Concept Focus                     MDM    PDM                PLM    PIM          ERP
Systems approach                  −      +                  −      +            +
Product-data-oriented             −      +                  +      +            −
Master-data-oriented              +      +                  −      +            −
Cross-corporate integration       −      −                  +      −            −
Full product lifecycle support    na (a) − (engineering)    +      − (sales)    na

(a) For some concepts a specific dimension is not applicable, which we have marked by “na” for “not available” or “not applicable”.

2.2.5 Content Integration

With corporate environments becoming more global, exchange of product information has to take place between organizations. Hence, the integration problem is aggravated because several heterogeneous applications and data sources that lack standardization are forced to interoperate [cf. Fen+01].

Within a single company’s boundaries, the problem can be solved using data warehousing systems [Los09, pp. 3f.; cf. Inm02, pp. 31–33]. The employment of data warehouses reflects the traditional view of data integration [DHI12, p. 272], which is about integrating data from a variety of databases into a unified one [DHI12, p. 272]. Even if there exist other approaches (portals, operational data stores, federated database systems, peer-to-peer integration, among others [ZD04]), data warehouses are still very popular in business contexts [cf. DHI12, pp. 272f.]. Traditional data warehouses, however, have an important limitation, i.e. they collect historical data and generate reports from various applications, which might not properly reflect the live status of the data at the original systems [Inm02, p. 35].

In business settings with many organizations, integration becomes even more challenging because of the variety of formats, schemas, etc. This problem is addressed by content integration, i.e. the “integration of operational information across enterprises” [SH01]. Content integration strives to consolidate potentially volatile content available from disparate organizations, in different formats, and at different levels of granularity. In particular, this includes, besides rich data sets, also weakly structured information objects, like for example images, scans, videos, PDF files, presentations, or informal notes and reports. Data warehousing techniques are too limited when integrating content from different data providers [SH01]. While Stonebraker and Hellerstein [SH01] propose a data federation approach with materialized views as a possible solution for content integration, the problem is still prevalent in many enterprises of today.

2.2.6 Standards for B2B Data Interchange

In e-business and e-commerce, there is a need for establishing consensus between business parties regarding the structure and semantics of the exchanged business documents and the identification of business objects. Otherwise, misunderstandings or transaction failures may arise that ultimately create unnecessary costs and hamper the willingness of organizations to continue doing business with each other. Standardization is thus a frequently mentioned requirement for the automated exchange of product information [e.g. HT05, p. 191]. By means of standardization, positive network effects can accrue [FS85; KS85] that raise the compatibility and interoperability within and across enterprises by reducing internal process costs as well as external transaction costs.

There is a series of important standards relevant for e-business that can roughly be grouped into process standards, transaction standards, catalog standards, classification standards, and identification standards [GI12, p. 6]. In the following, we take a closer look at these standards. In addition, we present code standards that ensure frictionless electronic communication between trading organizations from different domains and countries.


2.2.6.1 Transaction and Catalog Exchange Standards

In general, there are two established models of exchanging product information between corporate entities, as depicted in Figure 2.3. The first is bilateral exchange [SLÖ08], where vendors set up individual information channels with all their suppliers. This type of exchange becomes difficult to manage with a growing number of transaction partners. In particular, the number of mappings required to harmonize different product descriptions grows with the product of the numbers of vendors and manufacturers [Fen+01]. The alternative is a master data pool, provided by a data intermediary that procures master data from various manufacturers, potentially curates it, and furnishes it to different vendors. This is known as multilateral exchange [SLÖ08] and is the key idea underlying business-to-business marketplaces [Fen+01]. Instead of up to m × n, as few as m + n individual communications or mappings need to be set up [Fen+01].
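The difference between the two exchange models can be made concrete with a short calculation. The sketch below uses hypothetical function names and example sizes, and assumes that every manufacturer trades with every vendor in the bilateral case.

```python
def bilateral_mappings(m, n):
    """Each of m manufacturers maintains a mapping to each of n vendors."""
    return m * n

def multilateral_mappings(m, n):
    """Each party maintains a single mapping to the shared master data pool."""
    return m + n

# Hypothetical network sizes (manufacturers, vendors).
for m, n in [(2, 2), (10, 20), (100, 500)]:
    print(m, n, bilateral_mappings(m, n), multilateral_mappings(m, n))
```

For very small networks both models require a similar effort (four mappings each for m = n = 2), but the advantage of the data pool grows quickly with the number of participants.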

[Figure 2.3: (a) bilateral exchange, with Manufacturer 1 … m linked directly to Vendor 1 … n; (b) multilateral exchange, with Manufacturer 1 … m and Vendor 1 … n linked to a central Master Data Pool]

Figure 2.3: Models of master data exchange: (a) Bilateral versus (b) multilateral [from SLÖ08]

The Global Data Synchronization Network (GDSN) [SLÖ08; GS1ND] goes beyond these traditional exchange models, aiming at setting up a network of data pools that are synchronized in a way that supplier and customer sides obtain real-time and high-quality product data. The data pools are controlled and consistently updated by means of a central global registry. The registry ensures that whatever data pool has been selected by the trading partners, they always obtain the most recent product information via publish-subscribe [cf. SLÖ08]. Every change to product data is automatically propagated to all participating data pools in the network.

Transaction Formats. For the bilateral exchange of data, standard data exchange formats have been established in the past. Long before the existence of the World Wide Web (WWW), in the 1960s, companies had already started to replace paper-based communication by electronic exchanges of business documents relying on electronic data interchange (EDI) [Cha09a, pp. 176f.]. EDI uses a set of standard messages for automated processing by computers, which, however, are difficult for humans to grasp in terms of syntax, structure, and semantics. More recently, the most popular parts of the EDI standard have been mapped to Extensible Markup Language (XML) syntax in order to accommodate modern software systems and to reduce communication costs [cf. Hue00; PW97]. With XML, EDI-based data exchange can further benefit from the broad tooling support for XML, i.e. the syntactical correctness and completeness of a transmitted document can be validated much more easily than for a message stream [cf. PW97; cf. PRW08, p. 150]. Over time, the EDI communities have developed a number of industry-specific subsets, such as ODETTE focusing on the automotive sector in Europe, or UN/EDIFACT covering the fields of administration, commerce and transport [cf. PRW08, p. 150].

Besides EDI-based standards, a series of other data exchange formats obtained wide acceptance, such as Electronic Business using XML (ebXML), Commerce XML (cXML), XML Common Business Library (xCBL), Universal Business Language (UBL), Open Applications Group Integration Specification (OAGIS), RosettaNet, or OpenTRANS [cf. HT05, pp. 205–208; SLK04]. These exchange formats support typical processes of a business transaction, namely the automated exchange of product catalogs, quotations, customer orders, delivery notes, invoices, and payments.

Standards for Representing and Exchanging Product Details. The exchange standards presented so far are rather generic, and often provide insufficient support for describing products and services in much detail. For this reason, there exist complementing extensions and subsets of these standards.

Product Data Message (PRODAT) and Price Catalog Message (PRICAT) are two message types in the EANCOM subset of UN/EDIFACT that can be used to exchange product information [DLS01, p. 1532; HT05, p. 205]. PRODAT describes a message type for EDI that supports the exchange of product master data. Among others, it contains messages to indicate product-related details like product group information (message code “PGI”), currencies (“CUX”), and physical measurement values (“MEA”) [UN 14]. PRICAT defines a set of messages to exchange product catalogs transferred between trading partners by permitting them to indicate commercial details, such as price information, terms for payment and transport, and packaging information, but also product characteristics and categories [UN 12].


The Standard for the Exchange of Product Model Data (STEP), also known as ISO 10303 [Pra05], is a family of standards for the exchange of product master data [SLK04]. The STEP standard is intended for representing relevant product information throughout the full product life-cycle, including for example the specification of CAD/CAE/CAM objects as generated and used by enterprises involved in the design, engineering, and manufacturing of a product [cf. Pra05]. The standard has evolved into a modular architecture facilitating the development of application protocols (APs) to serve product representation among several applications [cf. Pra05]. STEP is formalized using a system-neutral and machine-understandable data modeling language, EXPRESS [Int04], and usually exchanged via a STEP file [Int02a]. EXPRESS schemas can also be illustrated using a human-friendly graphical notation (EXPRESS-G) [Int04], or serialized as XML (e.g. STEP-XML), as proposed in ISO 10303-28 [Int07a]. This neutrality of its specification language makes the STEP standard appropriate for both single- and cross-domain usage, e.g. integrating multiple ERP systems as well as connecting CAD systems with ERP systems. Nevertheless, STEP is not appropriate for describing commercial product properties, such as used for e-sales and e-procurement [SLK04].

Product catalog exchange formats can be considered standards for the exchange of product data via electronic product catalogs, mostly between PIM system operators (typically manufacturers and suppliers) and clients (retailers and customers). As opposed to transaction formats, pure product catalog exchange formats concern product-related information rather than transaction-related information in general. Numerous transaction formats have built-in support for exchanging product specifications; popular examples in XML are RosettaNet, cXML, and xCBL [cf. SLK04].

BMEcat [SLK05a; HS00] is an XML-based exchange format for product catalog data developed by the eBusiness Standardization Committee consortium in Germany, consisting of members like Fraunhofer IAO, Universität Duisburg-Essen, and industrial organizations. The standard has also proved successful in other European countries [SLK04]. It can be used together with B2B XML standards such as OpenTRANS, which defines business documents for the transactional part, whereas BMEcat focuses on the product catalog to be transmitted [SLK05a, p. 7].

2.2.6.2 Code Standards

Trading partners often speak different languages that go beyond the use of compatible data formats. Consensus needs to be established about date formats, locations, languages, currencies, and units of measurement. In the following, we summarize code standards for e-commerce that many transaction standards adhere to.

Date and Time Formats. The use of compatible formats for date and time is critical in many applications. For example, a date “12/10/2015” can mean two different things, namely

a) October 12, 2015 (DD/MM/YYYY²), or

b) December 10, 2015 (MM/DD/YYYY).

The ability to reliably distinguish these two variants is critical in business contexts. The potential for misinterpretation is further aggravated if the date is given in the following form: “12/10/15”. To overcome this problem, the International Organization for Standardization (ISO) suggests a standard, ISO 8601 [Int88], for representing dates and times in a uniform way. It is a powerful standard that covers a variety of possible formats (localized time and date, calendar week numbers, time intervals, recurring time intervals, durations). The basic date and time formats are outlined in Table 2.4³. ISO 8601 also allows more granular descriptions of dates and times. For instance, fractions of a second can be attached to the basic formats for date and time, e.g. “14:30:00.05”. Alternatively, time zone information can be supplied, e.g. “14:30:00+01:00” for Central European Time (CET).

Table 2.4: Date and time formats as defined by ISO 8601

Concept          Format                 Example
Date             YYYY-MM-DD             2015-10-12
Time             hh:mm:ss               14:30:00
Date and time    YYYY-MM-DDThh:mm:ss    2015-10-12T14:30:00
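The ambiguity of slash-separated dates and the ISO 8601 formats can be reproduced with Python's standard datetime module. This is a minimal sketch; the variable names are ours, and the fixed +01:00 offset is a simplification that ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

ambiguous = "12/10/2015"
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()    # reading (a)
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # reading (b)

# The ISO 8601 extended date format removes the ambiguity.
iso_a = day_first.isoformat()
iso_b = month_first.isoformat()

# Date and time with a fixed +01:00 offset, as in the CET example.
cet = timezone(timedelta(hours=1))
stamp = datetime(2015, 10, 12, 14, 30, 0, tzinfo=cet)
iso_stamp = stamp.isoformat()

# Parsing the ISO string back yields an equal, timezone-aware datetime.
parsed = datetime.fromisoformat(iso_stamp)

print(iso_a, iso_b, iso_stamp, parsed == stamp)
```

The same input string thus yields two different calendar dates depending on the assumed convention, whereas the ISO 8601 representation can be written and read back without loss.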

Country Codes. ISO 3166 is a standard for codes indicating geographical or geopolitical areas (i.e. countries, states, provinces). The standard is divided into three parts, namely ISO 3166-1 [Int13b], ISO 3166-2 [Int13c], and ISO 3166-3 [Int13d]:

ISO 3166-1: Countries, encoded in three different ways:

a) alpha-2: Two-letter country codes, e.g. “DE” for Germany

b) alpha-3: Three-letter country codes, e.g. “DEU” for Germany

c) numeric: Three-digit country codes, e.g. “276” for Germany

ISO 3166-2: Administrative subdivisions of countries (regions, districts, provinces, federated states), e.g. “DE-BY” for Bavaria, Germany

ISO 3166-3: Deprecated country names, e.g. “YUCS” for Yugoslavia

² YYYY denotes the year in the Gregorian calendar (referred to as CCYY, with CC for century digits, in [Int88]), MM the month, and DD the day of the month, respectively. Similarly, we refer to hours with hh, minutes with mm, and seconds with ss.

³ Separating hyphens and colons denote the more readable extended format [Int88]. They could also be omitted.

The two-letter country codes, defined in ISO 3166-1 alpha-2, are well-established and serve as the basis for the subdivisions specified in ISO 3166-2 [Hep08b, p. 24]. Two-letter geographical codes are used for example for top-level domains of Web addresses, e.g. http://www.example.de for a German Web site address.
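How the alpha-2 code acts as the common key can be sketched as follows. The lookup table is deliberately limited to the codes for Germany quoted above, and the function names are hypothetical.

```python
# Codes for Germany as quoted in the text; a real table would cover all countries.
ISO_3166_1 = {
    "DE": {"alpha3": "DEU", "numeric": "276", "name": "Germany"},
}

def split_subdivision(code):
    """Split an ISO 3166-2 code such as 'DE-BY' into (country, subdivision)."""
    country, _, subdivision = code.partition("-")
    if country not in ISO_3166_1 or not subdivision:
        raise ValueError(f"unknown or malformed code: {code!r}")
    return country, subdivision

def country_of_domain(hostname):
    """Return the alpha-2 code if the top-level domain is a known country code."""
    tld = hostname.rsplit(".", 1)[-1].upper()
    return tld if tld in ISO_3166_1 else None

print(split_subdivision("DE-BY"))
print(country_of_domain("www.example.de"))
```

Note that not every country-code top-level domain coincides with the ISO 3166-1 alpha-2 code (the United Kingdom uses .uk, whereas its alpha-2 code is “GB”), so a production lookup would need an explicit exception list.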

Language Codes. Language codes are defined by ISO 639 [IntND]. ISO 639 entails six standards (part 1 to part 6). Their specifications differ by the length of the letter code, ranging from two to four letters (alpha-2, alpha-3, and alpha-4). The number of languages encoded varies accordingly, i.e. ISO 639-1 [Int02b] covers most international languages, whereas ISO 639-2 [Int98], with codes consisting of only one additional letter, covers more languages [IntND]. Examples for ISO 639-1 are “de” for German and “en” for English. In ISO 639-2, for legacy reasons, some languages have two three-letter codes. E.g., “deu” and “ger” are both valid codes for German. ISO 639-3 [Int07b] preserves only the native code of each major language, i.e. “deu”. Furthermore, it covers not only living languages but also dead and ancient languages [IntND], e.g. Old High German and Middle High German are represented with “goh” and “gmh”.

Codes for Units of Measure. The UN/CEFACT Common Code standard defines codes consisting of up to three alphanumeric characters for the description of units of measurement [Uni09b]. The codes defined in the standard allow for the automated exchange of physical measures in international trading. The standard provides codes for both base and derived International System of Units (SI) units [Nat08, p. 9]. For mass, the base unit is kilogram⁴. Accordingly, gram is a derived unit for mass. The standard includes conversion factors that relate derived units to their compatible base units. The conversion factor for gram with respect to kilogram is e.g., as expected, “0.001” (see Table 2.5).

⁴ The Bureau International des Poids et Mesures (BIPM) keeps a reference mass prototype of exactly one kilogram made of platinum-iridium in Sèvres, near Paris, France [Nat08, p. 18].



Table 2.5: Selected snippet related to “kilogram” from the UN/CEFACT Common Code [Uni09a]

Common Code    Name                  Conversion Factor    Symbol
MC             microgram             10^-9 kg             µg
DJ             decagram              10^-2 kg             dag
DG             decigram              10^-4 kg             dg
KGM            kilogram              kg                   kg
GRM            gram                  10^-3 kg             g
CGM            centigram             10^-5 kg             cg
TNE            tonne (metric ton)    10^3 kg              t
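The conversion factors make it possible to convert between any two compatible units by going through the base unit. The following sketch uses the factors from Table 2.5. The function and dictionary names are hypothetical, and Decimal is our choice for avoiding binary floating-point rounding.

```python
from decimal import Decimal

# Conversion factors to kilogram, copied from Table 2.5.
TO_KILOGRAM = {
    "MC": Decimal("1e-9"),   # microgram
    "DJ": Decimal("1e-2"),   # decagram
    "DG": Decimal("1e-4"),   # decigram
    "KGM": Decimal("1"),     # kilogram (base unit)
    "GRM": Decimal("1e-3"),  # gram
    "CGM": Decimal("1e-5"),  # centigram
    "TNE": Decimal("1e3"),   # tonne (metric ton)
}

def convert_mass(value, from_code, to_code):
    """Convert a mass between two common codes via the base unit kilogram."""
    kilograms = Decimal(str(value)) * TO_KILOGRAM[from_code]
    return kilograms / TO_KILOGRAM[to_code]

print(convert_mass(250, "GRM", "KGM"))
print(convert_mass(2, "TNE", "GRM"))
```

Since every unit only needs a factor relative to the base unit, adding a new mass unit requires a single table entry rather than one factor per pair of units.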

Currency Codes. Currency codes are the “units of measurement” for monetary amounts. They greatly facilitate international electronic transactions by eliminating potential ambiguities about prices. ISO 4217 [Int08] is an international three-letter code standard for currencies. Currency codes are composed of the two-letter ISO 3166-1 [Int13b] country codes and, where possible, the initial letter of the currency name (e.g. “USD” for US dollars, “CHF” for Swiss francs) [Int08]. There are exceptions, though, as for example with Euros (“EUR”). Furthermore, ISO 4217 defines corresponding three-digit numeric codes [Int08]. E-business heavily relies on currency codes, e.g. e-commerce platforms and Web shops need to show price details in different currencies and calculate price values based on currency exchange rates. But even in our everyday lives, we are often confronted with currency codes, e.g. on passenger transport tickets of trains and airlines.
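The composition rule can be written down in a few lines. This is only an illustration with a hypothetical helper name; it is not a substitute for the official code list, precisely because of the exceptions mentioned above.

```python
def composed_code(country_alpha2, currency_name):
    """Two-letter country code plus the initial letter of the currency name."""
    return country_alpha2.upper() + currency_name[0].upper()

# (country code, currency name, ISO 4217 code) for the examples in the text.
examples = [
    ("US", "dollar", "USD"),
    ("CH", "franc", "CHF"),
]

for country, name, code in examples:
    print(country, name, composed_code(country, name) == code)

# The euro is not tied to a single country, so the rule does not produce "EUR".
print(composed_code("DE", "euro"))
```

Applications should therefore look up currency codes in the ISO 4217 list rather than derive them.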

2.2.6.3 Product Identifiers for Electronic Business

Electronic systems use product identifiers for the seamless integration and reliable electronic exchange of product information. The most relevant types of product identifiers for e-business are product item identifiers, product model identifiers, custom product identifiers, and product class identifiers.

For the subsequent discussion of the product identifier types, let us consider an example featuring the following product information about (a particular instance of) a car radio:

JVC KD-R741BTE Auto CD-Receiver

GPC: 10001527

Brand: JVC

Serial no.: 01-12-14-R741BTE-171

EAN-13: 4975769403750

MPN: KD-R741BTE

SKU: A2318110


Product Item Identifiers A product item identifier globally and uniquely identifies an individual instance of a particular product type. Examples of product item identifiers are serial numbers, either for tangible or intangible products. This includes software product keys, Vehicle Identification Numbers (VINs), or International Mobile Station Equipment Identity (IMEI) numbers to identify mobile phones. The Electronic Product Code (EPC) is a universal product identifier to identify virtually every physical object [GS1ND, p. 22]. EPC codes are typically encoded into radio-frequency identification (RFID) chips [GS1ND, p. 22].

The specified serial number of our example globally identifies a unique instance of a car radio, e.g. the one that we just purchased from a retail store. The serial number is typically assigned by the manufacturer, often following a generally accepted standard (e.g. VIN).

Product Model Identifiers A product model identifier globally and uniquely identifies the make and model of a product, i.e. a bundle of identical trade items, but not an individual product item. It can be considered as an identifier for the prototype or blueprint of a product [cf. GS116, p. 25]. An important product model identifier developed by GS1 is the Global Trade Item Number (GTIN) [GS115]. GTIN distinguishes four important codes with different lengths, namely GTIN-8, GTIN-12, GTIN-13, and GTIN-14 [GS115]. The trailing number indicates the number of digits that the identifier contains. On trade items, GTIN is encoded into EAN/UPC barcodes [GS1ND, p. 17] or, alternatively, into EPC codes used with RFID tags [GS1ND, pp. 21f.]. While the European Article Number (EAN) is mainly used in Europe, the Universal Product Code (UPC) is more popular in North America and other English-speaking countries. Both EAN-13 (13 digits) and UPC (12 digits) product codes can be mapped into the newer GTIN-14 number, namely by padding them with zeros to the left until 14 digits have been reached [GS115; GS116, p. 26].
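
The zero-padding described above can be sketched in a few lines of Python. In addition, the sketch validates the trailing check digit using the GS1 modulo-10 scheme (alternating weights 3 and 1, starting at the digit next to the check digit); this validation step is not discussed in the text above and is included only as a plausibility check. The function names are our own illustrative choices.

```python
def gtin_check_digit(payload: str) -> int:
    """Compute the GS1 modulo-10 check digit for the digits preceding it."""
    total = 0
    # Walk from the rightmost payload digit; weights alternate 3, 1, 3, ...
    for position, char in enumerate(reversed(payload)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """Check length (8, 12, 13, or 14 digits) and the trailing check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


def to_gtin14(code: str) -> str:
    """Left-pad a valid GTIN-8/12/13/14 with zeros to 14 digits."""
    if not is_valid_gtin(code):
        raise ValueError(f"not a valid GTIN: {code!r}")
    return code.zfill(14)


if __name__ == "__main__":
    # EAN-13 of the car radio from the running example
    print(to_gtin14("4975769403750"))
```

Because leading zeros contribute nothing to the weighted sum, the padded GTIN-14 keeps the same check digit as the original EAN-13 or UPC code.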

Amazon employs a proprietary product model identifier, the Amazon Standard Identification Number (ASIN) [Ama16]. Some industries even developed their own identification standards: Books, for instance, can be identified using the well-established International Standard Book Number (ISBN) [Int05], and the pharmaceutical industry standardized the PharmaCode (or Pharmaceutical Binary Code) for medicine and health-care products. The German equivalent is the Pharmazentralnummer (Engl.: Central Pharmaceutical Number) (PZN) [Inf15].

In our example, a 13-digit EAN code of the product model is provided (“4975769403750”), which is equivalent to the GTIN-13 code. The corresponding ASIN code that Amazon would assign is “B00B59P4YW”.

Custom Product Identifiers Manufacturers and vendors frequently assign custom codes to their products for internal reference. In practical scenarios, and to a lesser degree in academia, the terms manufacturer part number (MPN) and stock keeping unit (SKU) have been widely deployed to define product part numbers within the scope of manufacturer and vendor catalogs, respectively [e.g. EBa15; GooND]. More precisely, when manufacturers design and produce product parts, they often assign an MPN in order to refer to them internally. Similarly, vendors use SKUs to identify products of their inventories. Nonetheless, unlike GTIN numbers for example, these identifiers are generally not designed for use as globally unique identifiers.

Multiple MPNs may be assigned to one and the same product make and model. This holds especially true if more than one manufacturer is producing parts with respect to the same product specification. These manufacturers are considered as producing non-OEM parts, as opposed to original equipment manufacturer (OEM) parts only produced by the original manufacturer. We can see this for example in the automotive sector, where spare parts are often manufactured by various companies.

For the product item in our example, both the MPN (“KD-R741BTE”) and SKU (“A2318110”) are given.

Product Class Identifiers In e-business contexts, similar products (or product models) are often grouped into categories (see Section 2.2.7), which can be assigned global identifiers themselves.

The GPC code [GS105], as used in our example, is part of a multi-level hierarchical classification standard that, if read from top to bottom, is divided into the four tiers Segment > Family > Class > Brick. The lowest level of the Global Product Classification (GPC) standard denotes the most specific category. Each category is codified using an eight-digit number. Our car radio is characterized by a very specific category (“10001527” – “Car Audio CD Players/Changers”), i.e. it belongs to the lowest possible level of the GPC hierarchy:

Segment 68000000 (Audio Visual/Photography)

Family 68030000 (In-car Electronics)

Class 68030200 (Car Audio)

Brick 10001527 (Car Audio CD Players/Changers)
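
As a small illustration, the four-tier path above can be held in a simple ordered data structure. In this example the brick code shares no digits with the code of its class, so the sketch stores the path explicitly instead of trying to derive parent categories from code prefixes. The variable and function names are hypothetical.

```python
# GPC path of the car radio, ordered from the most general to the most
# specific tier; codes and titles are taken from the listing above.
GPC_PATH = [
    ("Segment", "68000000", "Audio Visual/Photography"),
    ("Family", "68030000", "In-car Electronics"),
    ("Class", "68030200", "Car Audio"),
    ("Brick", "10001527", "Car Audio CD Players/Changers"),
]


def most_specific(path):
    """Return the (tier, code, title) entry at the lowest level of the path."""
    return path[-1]


def breadcrumb(path):
    """Render the path in the 'Segment > Family > Class > Brick' order."""
    return " > ".join(f"{title} ({code})" for _tier, code, title in path)


if __name__ == "__main__":
    print(most_specific(GPC_PATH))
    print(breadcrumb(GPC_PATH))
```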


Comparison In Table 2.6, we summarize the main characteristics of the product-related identifier types by comparing them by their usual scope (locally or globally valid), the range (single item, or a group of the same or similar items), and the issuer of the identifier.

Table 2.6: Characteristics of product identifier types

Identifier   Scope             Range           Issuer
Item         global            single item     manufacturer
Model        global            same items      manufacturer
Custom       local             same items      manufacturer, vendor
Class        local or global   similar items   maintainer of the classification

2.2.7 Product and Services Classification Systems

This section presents some classification systems for organizing products and services. In this context, let us first clarify the term classification and introduce related terms that, even though often used interchangeably, generally mean different things.

2.2.7.1 Knowledge Organization

Organization and classification are inherent to humans [Hod00, p. 4]. From the beginning of our lives, we consciously and unconsciously categorize observations that we have made and compare them with each other. Just think of toy building blocks of different shapes and colors that a young child tries to match into given forms.

In literature, there does not exist a single, common understanding of classification or knowledge organization [Hep03, pp. 50–52; Hjø08]. Hjørland, for example, divides the organization of knowledge into a narrower and a broader perspective [cf. Hjø08]. While the broader meaning focuses on the social aspects of knowledge organization, such as the organization of reality (e.g. biological systematics, geographical classification, etc.), the narrower meaning refers to the technical challenges to ease the storage and retrieval of information (e.g. the building of taxonomies and thesauri) [Hjø08].

Many of the techniques for knowledge organization arose from the field of library science [Gar04; cf. Hod00, p. 3; cf. Hed10, pp. 1f.]. In library and information science, a classification represents a type of Knowledge Organization System (KOS) [Hod00, p. 3]. KOSs, according to Hodge, “encompass all types of schemes for organizing information and promoting knowledge management” [Hod00, p. 3]. Their purpose is to group information assets of digital libraries in a way to facilitate the information discovery process. The three main groups of KOSs are [Hod00, pp. 5–7; Hed10, p. 2]:

1. Term lists: Authority files, glossaries, dictionaries, and gazetteers.

2. Classifications and categories: Subject headings, categorization schemes, and taxonomies.

3. Relationship lists: Thesauri, semantic networks, and ontologies.

Hedden condenses these kinds of KOSs into controlled vocabularies, taxonomies, thesauri, and ontologies [Hed10, pp. 3–15]. The expressivity and complexity of these KOSs increase in the provided order, as illustrated in Figure 2.4.

[Figure: Term List, Taxonomy, Thesaurus, Ontology, ordered from simple to expressive and complex]

Figure 2.4: Structural complexity increase of knowledge organization systems [adapted from Nat05, p. 17]

Controlled Vocabulary A controlled vocabulary, as understood by Hedden, “may cover any kind of knowledge organization system, with the possible exclusion of highly structured semantic networks or ontologies” [Hed10, p. 3]. The understanding of a controlled vocabulary would thus typically range from term lists to simple relationship lists. In its simplest and most common form, a controlled vocabulary describes a closed list of terms [Hed10, p. 3; Gar04], e.g. a predefined list of category or concept names. With the limited range of available terms, the freedom to choose between arbitrary terms is constrained, which makes the use of a controlled vocabulary more deterministic and manageable, also in collaborative environments. Controlled vocabularies help ensure consistency, e.g. by indexing or tagging documents of a digital library more homogeneously and unambiguously [Hed10, p. 3]. More sophisticated controlled vocabularies also provide cross-referencing across terms and offer synonym sets or synonym rings [Hed10, p. 4; Nat05, p. 18], e.g. to help disambiguate terms and categories or to improve recall during information retrieval [cf. Nat05, p. 18].

Taxonomy The National Information Standards Organization (NISO) defines a taxonomy as a “collection of controlled vocabulary terms organized into a hierarchical structure” [Nat05, p. 9]. For Hedden, the term taxonomy carries a broader meaning [Hed10, p. 1, p. 6], namely representing “any means of organizing concepts of knowledge” [Hed10, p. 1]. Instead, Hedden uses the more accurate term hierarchical taxonomies to refer to the narrower understanding of taxonomies [Hed10, p. 6].

If a taxonomy is considered as a hierarchical arrangement of terms for concepts or categories, the hierarchy is built up of parent-child/broader-narrower relationships [Nat05, p. 9; Gar04], or is-a relations. In a biological hierarchy, for example, a cow is a mammal, whereas a mammal is an animal (potentially skipping some hierarchy levels). However, different interpretations of hierarchical relationships are possible, i.e. is-a relations do not necessarily mean the same in different contexts and in different classification schemes. Brachman [Bra83] addressed this problem, summarizing and analyzing a number of possible interpretations of the is-a relation.
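
The hierarchical reading of is-a relations can be sketched as follows. The example assumes one particular interpretation of is-a, namely a transitive broader-term relation in which every term has at most one parent and the hierarchy is free of cycles; as noted above, other interpretations are possible. The dictionary BROADER and both function names are illustrative assumptions.

```python
# Each term points to its direct parent (broader term).
BROADER = {
    "cow": "mammal",
    "human": "mammal",
    "mammal": "animal",
}


def ancestors(term):
    """Return all broader terms of `term`, nearest first.

    Assumes the hierarchy is acyclic; a cycle would loop forever.
    """
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def is_a(term, category):
    """Check whether `category` is reachable by following is-a relations."""
    return category in ancestors(term)


if __name__ == "__main__":
    print(ancestors("cow"))
    print(is_a("cow", "animal"))
```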

Thesaurus By the definition of NISO, a thesaurus is a “controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators” [Nat05, p. 9]. In a thesaurus, a collection of terms can be organized using different standard types of relationships [e.g. Gar04; Hed10, p. 10; Hod00, p. 6]:

• Hierarchical (e.g. is-a relations),

• associative (e.g. related terms), and

• equivalence (e.g. synonyms) [Hed10, p. 10; Hod00, p. 6].

These characteristics imply that a thesaurus merely extends a taxonomy by additional relationships with richer semantics [Gar04; Hed10, p. 10]. However, unlike a taxonomy, a thesaurus is less strict about a hierarchical order between its terms [Hed10, pp. 10f.]. In a thesaurus, if a broader/narrower relationship between terms does not hold, the relationship is simply omitted [Hed10, pp. 10f.].

In the ISO 25964 standard, the ISO publishes guidelines for the development and maintenance of thesauri (part 1 [Int11]) and for establishing interoperability with other vocabularies (part 2 [Int13a]).
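
A thesaurus entry with the three relationship types named above (hierarchical, associative, and equivalence) could be represented as in the following minimal sketch. The terms, the Entry class, and the helper functions are purely illustrative assumptions and do not follow the data model of ISO 25964. Note that the entry for “loudspeaker” carries no broader term, mirroring the point that a hierarchical relationship is simply omitted where it does not hold.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Relationships of one preferred term (hypothetical structure)."""
    broader: set = field(default_factory=set)    # hierarchical
    related: set = field(default_factory=set)    # associative
    synonyms: set = field(default_factory=set)   # equivalence


THESAURUS = {
    "car radio": Entry(
        broader={"car audio"},
        related={"loudspeaker"},
        synonyms={"car stereo", "autoradio"},
    ),
    "car audio": Entry(broader={"in-car electronics"}),
    "loudspeaker": Entry(related={"car radio"}),  # no broader term recorded
}


def preferred_term(label):
    """Resolve a label to its preferred term via the equivalence relation."""
    if label in THESAURUS:
        return label
    for term, entry in THESAURUS.items():
        if label in entry.synonyms:
            return term
    return None


def expand(label):
    """Return the preferred term plus its synonyms, e.g. to improve recall."""
    term = preferred_term(label)
    if term is None:
        return set()
    return {term} | THESAURUS[term].synonyms


if __name__ == "__main__":
    print(preferred_term("car stereo"))
    print(sorted(expand("car stereo")))
```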

Ontology Semantic networks6 [e.g. Sow13; Leh92] and ontologies [e.g. Gru93] are more powerful than thesauri, because they add further semantic richness to facilitate information sharing about a domain of discourse [Hed10, pp. 12f.; Gru93]. An ontology possibly entails all relationships of taxonomies and thesauri, plus it defines additional elements in order to express facts of a specific knowledge domain in greater detail [Hed10, p. 12]. These additions can be domain-specific (because terms and relationships can be defined at will), which makes it possible, for example, to express the following sentences:

6 “A semantic network or net is a graph structure for representing knowledge in patterns of interconnected nodes and arcs.” [Sow13]

A human is a mammal.

All mammals are mortals.

A proper definition and a more in-depth discussion about ontologies follow in Section 2.3.6.

2.2.7.2 Product Categorization Standards

The aforementioned classification schemes (especially taxonomies and thesauri) provide the basis for product (or services) categorization. Product categorization standards are widely accepted knowledge structures that often consist of thousands of hierarchically arranged categories. As already noted in Section 2.2.6.3, categories are typically assigned product class identifiers so that they can be referenced globally and unambiguously.

Product categories basically permit business partners to trade and negotiate based on common definitions for product types [GS105, p. 4]. More specifically, a product classification standard can have several benefits, among others [cf. GS105, p. 6]:

• more accurate product descriptions through shared product features, attributes, and multi-lingual support,

• lower likelihood of redundant redefinitions,

• seamless integration of product data between business partners, and

• improved search and discovery of product items.

In summary, it strengthens the competitiveness of a company, because betting on a widely used product classification standard facilitates collaboration on a global scale and increases the likelihood that the products and services are shown to target audiences [cf. HLS07].

The structure of product categorization standards is usually hierarchical [HLS07], i.e. categories (or classes) are arranged as in a taxonomy. Hence, product categorization standards are often referred to as product taxonomies. The branches of categorization standards can have variable or fixed depth, sometimes with varying conceptual coverage and level of detail [HLS07]. Furthermore, many standards provide means to describe categories in greater detail, namely by specifying product features, feature values, and translations of categories into various languages [HLS07]. Among the most relevant product categorization standards in industry are, in alphabetical order, Common Procurement Vocabulary (CPV) [Eur08a], eCl@ss7, ECCMA Open Technical Dictionary (eOTD) [EleND], ElektroTechnisches InformationsModell (Engl.: Electro-Technical Information Model) (ETIM)8, GPC [GS105], RosettaNet Technical Dictionary (RNTD), and United Nations Standard Products and Services Code (UNSPSC)9. Their main properties are outlined in Table 2.7. As all these standards are available in multiple languages, we decided against including this information in the table.

Table 2.7: High-level comparison of product categorization standards

Standard   Industrial Scope          Number of Hierarchy Levels   Includes Features
CPV        Public procurement (EU)   4                            -
eCl@ss     Cross-industrial          4                            +
eOTD       Cross-industrial          1                            +
ETIM       Electronics               2                            +
GPC        Cross-industrial          4                            +
RNTD       Electronics               1                            +
UNSPSC     Cross-industrial          4–5                          -

Since requirements are different, it is quite natural that multiple product categorization standards for different domains and purposes have evolved over time [cf. HLS07]. The level of detail and the scope determine whether we refer to a standard as being vertical or horizontal [cf. HLS07]. A vertical standard aims to cover a specific domain or industry accurately and in great depth (e.g. CPV, ETIM, and RNTD), whereas a horizontal standard captures a broad spectrum of domains or industries (e.g. eCl@ss, eOTD, GPC, and UNSPSC) but, of course, at the expense of detail (see the second column in Table 2.7).

Despite the sheer number of available product classification standards, they also reveal weaknesses. Fensel et al. [Fen+01] noted that UNSPSC as a horizontal standard is not very handy, i.e. it is not very descriptive, unintuitive, and shallow. Hepp, Leukel, and Schmitz [HLS07] provided metrics that confirm that UNSPSC is not only shallow but also varies in its level of detail between different branches within the standard. To this end, they defined a set of metrics with which they quantitatively analyzed four product and services classification standards, namely the horizontal standards eCl@ss, UNSPSC, eOTD, and one vertical standard, RNTD. Their major finding was that each standard is associated with trade-offs (the results indicate that only very few branches show comprehensive coverage [HLS07]), which is potentially attributable to the difficulty of catching up with real-world changes in huge knowledge bases, but also to differing requirements and scopes [cf. HLS07].

7 http://www.eclass.de/ (accessed on May 16, 2014)
8 http://www.etim.de/ (accessed on May 16, 2014)
9 http://www.unspsc.org/ (accessed on May 16, 2014)

2.2.7.3 Proprietary Product Classification Systems and Taxonomies

The use of product categorization standards might not be feasible in some cases, in particular when

• there is already a legacy (e.g. a proprietary) classification system in place,

• adopting a product classification standard would be beyond the immediate business needs for a given use case, or

• a product classification standard that would accommodate the requirements imposed by the particular domain is missing.

The last item addresses the problem outlined at the end of Section 2.2.7.2, i.e. that the coverage of an existing product classification standard is not comprehensive enough for a given use case [cf. HLS07].

If any of the aforementioned conditions are met, then a supplier, manufacturer, retailer, or vendor could opt for a proprietary classification system to organize their products. This has the advantage that it requires less commitment and is more flexible with respect to custom requirements. The downside is the lack of a consensus among users, which makes proprietary classification systems fairly hard to use in collaborative settings. People and organizations frequently employ proprietary classification systems and taxonomies for empowering custom-tailored solutions. Examples are application-specific category hierarchies for indexing and retrieval, such as navigation structures in Web shops or hierarchical directories of search engines. One prominent example of a proprietary classification system on the Web is the Google product taxonomy [Goo13]. It is a publicly available category structure released by Google Inc. that, among others, has the goal to support shop owners in specifying the correct Google product category for their products submitted to Google Shopping10. Many Web shops interested in having their products properly indexed by Google Shopping are thus incentivized to adhere to this proprietary category system.

10http://www.google.com/shopping (accessed on May 8, 2014)



2.2.8 Electronic Marketplaces

Literature about electronic markets reveals at least two distinct conceptions of markets:

1. A general view that concerns the market as a governance mechanism in contrast to the hierarchical organization [MYB87]11.

2. A narrower, system-oriented view with the market as an intermediary and facilitator that matches sellers and buyers, facilitates transactions, and provides the legal and regulatory infrastructure [Bak98].

In the context of this thesis, the system-oriented view of a market is the more interesting one, because it matches the idea of electronic marketplaces as places for trading. “Electronic markets [...] form a single selected institutional and technical platform for electronic commerce. Their common feature is the market coordination mechanism” [PRW08, p. 274].

An electronic marketplace, e-marketplace, or online market, brings together multiple buyers and sellers in a virtual market [Gri03]. Grieger [Gri03] conducted a rich survey of electronic marketplaces, summarizing selected definitions from literature. On the basis of his work, Grieger describes a marketplace as follows:

“A marketplace as a historically evolved institution allows customers and suppliers to meet at a certain place and at a certain time in order to communicate and to announce buying or selling intentions, which eventually match and may be settled. Today the institution market still does the same, but has occasionally been remodelled due to the evolution of media.” [Gri03]

According to Grieger [Gri03], the understanding of a market as of today largely correlates with the market definition from centuries ago, when people converged at market squares for trading their goods. In comparison with traditional markets, however, the geographical conditions have essentially changed for electronic marketplaces. Nowadays, trading is ubiquitous, i.e. it can be conducted virtually everywhere.

Schmid and Lindemann [SL98] suggest a division of marketplace transactions that is principally compatible with the transaction activities we have introduced in Section 2.1.1, even if more coarse-grained. They group transactions into three consecutive market transaction phases, namely information, agreement, and settlement [SL98]. Product search naturally belongs to the first of these phases, i.e. the information phase.

11 Malone, Yates, and Benjamin [MYB87] argue that information technology reduces coordination costs on markets, which favors markets more than hierarchies.


2.2.9 Electronic Tendering

Tendering is a process that has traditionally been known for being lengthy and complex [Du+04]. In general, it implies that the party demanding a good or service advertises tendering documents (formal request for tenders, comprising an early specification of the expected deliverables or terms and conditions), and companies that are able to deliver the good or service can bid by sending respective offers (known as tenders) [TB96, pp. 66f.]. Usually, the cheapest offer with sufficiently good quality is accepted from the list of bidders, given that both sides agree upon the terms and conditions [TB96, pp. 69f.].

In a strict sense, tendering is conducted by governmental institutions that outsource tasks to companies, frequently construction and engineering work [TB96, p. 66; Ker+00; TS08]. In particular, public institutions behave as the contracting entities (principals) whereas enterprises are the potential contractors that bid for the contract [Du+04]. With the open bidding process, governments seek to foster transparency and save costs by intensifying competition.

Electronic tendering (e-tendering) can further enhance the classical tendering process [e.g. TS08]. It makes sense to regard e-tendering as a special case of electronic procurement (or e-procurement) that mainly represents transaction activities involving businesses and/or public authorities (see Table 2.2 in Section 2.2.1). While traditional tendering is largely paper-based, time-intensive, and often geographically restricted (because of the difficulty of effective dissemination of the tendering details) [TS08], electronic tendering not only enhances the overall procurement process for the contractee, but also facilitates the bidding process for the contractor, leading to lower transaction costs. In a study from 2008, Tindsley and Stephenson surveyed experts in the field of construction in the United Kingdom and reported enhanced communication, time savings, and cost reductions through e-tendering as compared to traditional tendering [TS08].

The complex electronic tendering process requires sophisticated system support. It is especially critical that electronic tendering systems be able to support the full electronic tendering process [Ker+00] while implementing high security standards that are needed to fulfill legal requirements [Du+04]. One prominent online service where public tenders can be looked up is Tenders Electronic Daily (TED)12, provided by the European Union (EU). TED is regularly updated and ensures easy access to tenders for goods and services by taking advantage of the CPV categorization standard (see Section 2.2.7.2).

12http://ted.europa.eu/ (accessed on May 8, 2014)



2.3 Semantic Web and Linked Data

As a second pillar of this thesis (besides the business-related aspects discussed previously, see Figure 2.1), the Semantic Web and Linked Data constitute the technological underpinnings of our approach. Subsequently, we thus summarize the main ideas, principles, and technologies related to these two topics.

2.3.1 Web

Before getting more technical, it is necessary to become acquainted with the key terminology. For this reason, let us briefly outline the three most popular Web-related concepts, namely WWW, Semantic Web, and Linked Data.

2.3.1.1 World Wide Web

The World Wide Web (WWW) represents a huge global repository of resources, like documents and services, organized in a decentralized fashion that avails itself of the networked infrastructure and protocols provided by the Internet [cf. BF99, pp. 16–23]. An important characteristic of the Web is its openness [AH11, p. 2, pp. 6f.], i.e. that everyone is able to publish and disseminate thoughts, ideas, and contributions. The term World Wide Web is often abbreviated as WWW and frequently shortened to Web. It was invented by Tim Berners-Lee in 1989 while working at the Conseil Européen pour la Recherche Nucléaire13 (CERN) in Switzerland and is nowadays developed by the World Wide Web Consortium (W3C), which Tim Berners-Lee currently heads as its director. The development of the Web was primarily motivated by a social requirement rather than a technical need [BF99, p. 123]: Its initial aim was to facilitate the collaboration between research scientists. In this regard, Tim Berners-Lee stated in his book Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor:

“The Web is more a social creation than a technical one. I designed it for a social effect – to help people work together – and not as a technical toy.” [BF99, p. 123]

The WWW embodies the marriage of two ground-breaking technologies, the Internet and hypertext [BF99, p. 6]. While the Internet is a large network of computers with a basic protocol stack (e.g. TCP/IP as the network protocol family for the message transport), hypertext concerns the linkage between documents [Nel65]. In the context of the Web, the idea of hypertext has resulted in the concept of hyperlinks.

13 Engl.: European Organization for Nuclear Research


The three fundamental building blocks of the WWW are Uniform Resource Identifiers (URIs), Hypertext Transfer Protocol (HTTP), and Hypertext Markup Language (HTML) [BF99, p. 36]:

• URIs [BFM05] uniquely identify Web resources, as we will discuss in more detail in Section 2.3.2.

• HTTP denotes an application layer protocol traditionally specified in RFC 2616 [Fie+99], which was superseded in 2014 by multiple Requests for Comments (RFCs), namely RFC 7230–7235: RFC 7230 [FR14c] defines standard messages for client and server communications over the Web; RFC 7231 [FR14d] describes methods, status codes, and message headers; RFC 7232 [FR14b] specifies conditional requests; RFC 7233 [FLR14] addresses range requests to get partial content; RFC 7234 [FNR14] covers relevant aspects of caching (e.g. browser and proxy caches); and RFC 7235 [FR14a] discusses authentication.

• HTML [Hic+14] is a semi-structured markup language intended for human consumption via Web browsers. Anchor links (or hyperlinks) defined within HTML documents make it possible to navigate across different Web documents. For Web resources, these links are typically represented using HTTP URIs.
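
How URIs and HTML hyperlinks interact can be sketched with Python's standard library: the example extracts the anchor links of a small HTML fragment and resolves them against the URI of the containing document. The HTML fragment, the base URI under example.org, and the class name LinkExtractor are made up for illustration; no HTTP request is sent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the targets of <a href="..."> elements as absolute URIs."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative references are resolved against the document URI.
                self.links.append(urljoin(self.base_uri, href))


# Hypothetical document and location
page = (
    '<p>See <a href="/shopping">offers</a> and '
    '<a href="http://ted.europa.eu/">TED</a>.</p>'
)
extractor = LinkExtractor("http://www.example.org/index.html")
extractor.feed(page)

if __name__ == "__main__":
    for uri in extractor.links:
        parts = urlsplit(uri)
        print(parts.scheme, parts.netloc, parts.path)
```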

2.3.1.2 Semantic Web

From the early twenty-first century onwards, much effort of the W3C has gone into advancing the idea of the Semantic Web [SHB06]. Meanwhile, many people world-wide have started working and conducting research in the field of the Semantic Web. The Semantic Web basically constitutes an extension of the traditional Web:

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [BHL01]

Unlike the traditional Web, which helps people to publish Web pages and consume information in a human-friendly way (i.e. using HTML) via their Web browsers, the Semantic Web strives to help machines understand and process the content published on the Web more easily [cf. BF99, p. 177; cf. AH11, pp. 5f.].

As opposed to the traditional Web that is commonly known as the document-based Web, the Semantic Web is also referred to as the Web of Data, where data is becoming a first-class citizen. As Tim Berners-Lee has stated in an interview in 2006, “[...] data is a precious thing and will last longer than the systems themselves” [Bri06], meaning that with a focus on data it is possible to survive the decline of applications, but it also gives rise to novel applications and mash-ups that can rely on the readily available data. Similarly, data, once connected and integrated in an intelligent way, is able to unleash unprecedented potential [AH11, pp. 3f.]. The Semantic Web follows the separation of concerns [Dij82; cf. Par72] design principle of software engineering, featuring a high level of modularization. That is, the data model and the semantics are separate from syntax, and syntax is separate from the technologies used to identify resources. Similarly, because data is kept independent of its applications, it is possible to materialize multiple application scenarios with the same data.

The Semantic Web avails itself of the same technology stack as the Web, but with a few important additions necessary to better preserve meaning in the exchange of data. Figure 2.5 depicts an adapted version of the Semantic Web layer cake14 [cf. DFH11, p. 20] that visualizes the core technology stack of the Semantic Web. In our depiction, we omitted advanced topics addressed by the Semantic Web stack in [DFH11, p. 20] but partly irrelevant to this thesis, namely cryptography, rule languages like the Semantic Web Rule Language (SWRL) [Hor+04] and the Rule Interchange Format (RIF) [Bol+07], unifying logic, proof, and trust. Instead, we summarized rules and logical inferences into a common term reasoning [e.g. KD11, pp. 245–257]. Also, since XML has from the beginnings of the Semantic Web been considered the de-facto serialization standard for the Resource Description Framework (RDF) [SR14], we have updated this to a more general term data formats that now properly encompasses current serialization formats like XML [Bra+08], Terse RDF Triple Language (Turtle) [PC14], JavaScript Object Notation (JSON) [Bra14], etc.

[Figure: layers from bottom to top: Unicode and URI / IRI; RDF and Data Formats (XML, Turtle, JSON, ...); RDFS / OWL and SPARQL; Reasoning; User Interface and Applications]

Figure 2.5: Semantic Web layer cake [adapted from DFH11, p. 20]

14http://www.w3.org/2007/03/layerCake.svg (accessed on May 6, 2014)



The Semantic Web layer cake is organized hierarchically, with every layer abstracting from the layers underneath, which is similar to the OSI reference model for networking [II94, p. 28] (i.e. the upper layers rely on the technology provided by the bottom layers). The lowest layer in Figure 2.5 represents the character set (Unicode) that every layer on top of it is supposed to adhere to. Furthermore, it defines URIs [BFM05] and Internationalized Resource Identifiers (IRIs) [DS05] as identifiers for Web resources. On the second-lowest layer of Figure 2.5, the RDF data model is introduced, along with the data formats that can be used to encode it. RDF Schema (RDFS) [BG14] and the Web Ontology Language OWL [DS04] as ontology languages complement the RDF model with capabilities to realize advanced modeling patterns, such as subsumption relationships or constraints. The SPARQL Protocol and RDF Query Language (SPARQL) [HS13] denotes a protocol and the query language that make it possible to query RDF datasets, typically via a SPARQL endpoint [e.g. Bui+13]. Based on logical axioms defined at the RDFS and OWL levels, reasoners can infer additional, implicit knowledge [e.g. KD11, pp. 245–257]. Finally, a user interface or application can take advantage of the capabilities offered by the lower levels in the Semantic Web stack.
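
To make the interplay of data model, ontology language, querying, and reasoning more tangible, the following plain-Python sketch stores statements as subject-predicate-object triples, derives implicit statements from subclass axioms, and answers a simple triple pattern. It is a toy stand-in under strong simplifying assumptions: it uses abbreviated names instead of full URIs, covers only the transitivity and type-inheritance consequences of rdfs:subClassOf, and its match function merely imitates a single SPARQL triple pattern. The ex: names and both function names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Explicit statements; "ex:" names are made up for illustration.
triples = {
    ("ex:Human", SUBCLASS_OF, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS_OF, "ex:Mortal"),
    ("ex:socrates", RDF_TYPE, "ex:Human"),
}


def infer(graph):
    """Add implicit triples that follow from subclass axioms.

    Repeats two rules until nothing new is derived:
    - subclass transitivity: A subClassOf B, B subClassOf C => A subClassOf C
    - type inheritance:      x type A,       A subClassOf B => x type B
    """
    closure = set(graph)
    while True:
        derived = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                if p2 == SUBCLASS_OF and o == s2:
                    if p == SUBCLASS_OF:
                        derived.add((s, SUBCLASS_OF, o2))
                    elif p == RDF_TYPE:
                        derived.add((s, RDF_TYPE, o2))
        if derived <= closure:
            return closure
        closure |= derived


def match(graph, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


if __name__ == "__main__":
    for triple in match(infer(triples), s="ex:socrates", p=RDF_TYPE):
        print(triple)
```

Without the reasoning step, the pattern only returns the explicitly asserted type ex:Human; after inference, the types ex:Mammal and ex:Mortal are found as well.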

In a nutshell, the Semantic Web makes the following contributions:

1. It facilitates data integration through global Web resource identifiers.

2. It provides a data model for making assertions about real-world objects.

3. It adds meaning (disambiguation) through ontologies.

4. It offers query mechanisms and reasoning capabilities to consume data that was published on the Semantic Web.

2.3.1.3 Linked Data

The Linked Data movement has gained momentum over the past decade. The idea was first proposed by Tim Berners-Lee [cf. Ber06], who had already pioneered the WWW and the Semantic Web. Bizer, Heath, and Berners-Lee [BHB09] characterize the idea of Linked Data as follows:

“Linked Data is simply about using the Web to create typed links between data from different sources.” [BHB09]

In comparison to the Semantic Web, which largely deals with meaning and logics, the Linked Data idea focuses on moving towards a single huge data space of linked data expressed using RDF, in particular putting emphasis on the publishing aspects. The main building blocks of Linked Data are URIs and HTTP [BHB09]: URIs uniformly refer to data at global scale, and HTTP makes URIs dereferenceable so that information can be looked up easily.

To make Linked Data a reality, Berners-Lee suggested four rules in 2006 that every data provider should adhere to in order to accomplish the goal of a global data space of interconnected data. These Linked Data principles [e.g. Ber06; HB11, p. 7; BHB09; BK12, p. 34] read as follows [Ber06]:

“1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

4. Include links to other URIs. so that they can discover more things.” [Ber06]

Every dataset that complies with these four principles falls under the term Linked Data, including proprietary and unpublished datasets that are linked in a certain way. In contrast, Linked Open Data (LOD) describes that part of Linked Data that represents freely available data (especially governmental data [cf. BK12]). Based on their level of publishing quality, LOD datasets can be classified as 1-star (★) to 5-star (★★★★★) LOD15 [e.g. Ber06; HB11, p. 26; BK12, p. 17], where 5-star data considerably eases consumption:

★ Available on the Web (irrespective of format) under an open license (e.g. picture of a chart)

★★ Machine-readable structured data (e.g. Excel spreadsheet)

★★★ Non-proprietary data format (e.g. comma-separated values (CSV))

★★★★ Open W3C standards to identify things (e.g. URIs and RDF)

★★★★★ Links to other datasets
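The star scheme above can be read as a cumulative checklist. The following minimal Python sketch illustrates that reading; the class `DatasetProfile`, its field names, and the function `lod_stars` are hypothetical names introduced here for illustration, and the assumption that each star presupposes all lower ones is our interpretation of the scheme.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical description of how a dataset is published."""
    open_license: bool         # 1 star: on the Web under an open license
    machine_readable: bool     # 2 stars: structured data, e.g. a spreadsheet
    non_proprietary: bool      # 3 stars: open format, e.g. CSV
    uses_uris_and_rdf: bool    # 4 stars: W3C standards identify things
    links_to_other_data: bool  # 5 stars: links to other datasets


def lod_stars(profile: DatasetProfile) -> int:
    """Count stars, assuming each level presupposes all levels below it."""
    criteria = [
        profile.open_license,
        profile.machine_readable,
        profile.non_proprietary,
        profile.uses_uris_and_rdf,
        profile.links_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break  # a missing lower level caps the rating
        stars += 1
    return stars


# An openly licensed CSV dump without URIs or links would rate three stars.
csv_dump = DatasetProfile(True, True, True, False, False)
print(lod_stars(csv_dump))
```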

Following these guidelines, many tools have been developed to quickly bring data of various data silos into the Web of Data (“to bootstrap the Web of Data” [HB11, p. 30]), among others D2R Server, Triplify, and Virtuoso Universal Server [BHB09]. To track the progress of these efforts, people from the Linked Data community early on started to visualize the interlinked datasets in the LOD cloud diagram [cf. Sch+14]. Figure 2.6 documents the evolution of LOD from 2007 to 2014 [cf. Sch+14]. Updated over time, this diagram has constituted a popular indicator for the growth of the LOD graph. However, in the meantime, LOD has grown so large that the resulting graph has become difficult to fit into a single diagram.

15 http://5stardata.info/ (accessed on May 13, 2014)


[Figure 2.6: Evolution of the LOD cloud diagram. (a) LOD graph as of November 10, 2007 [from Sch+14]; (b) LOD graph as of August 30, 2014 (“Linked Datasets as of August 2014”) [from Sch+14]. The individual dataset labels of both diagrams are omitted here.]

A few prominent examples of LOD sources on the Web are DBpedia16 [Aue+07], Wikidata17 [VK14], and Freebase18 [Bol+08]. Furthermore, in the context of e-commerce, the review platform revyu.com19 [HM07] and productdb.org20 are very useful. More examples of LOD datasets can be looked up online21.

2.3.2 Unique Identifiers

Uniform Resource Identifiers (URIs) serve a similar purpose for Web resources as product identifiers do for products (see Section 2.2.6.3): They uniquely identify them. URIs are specified in the Request for Comments (RFC) 3986 [BFM05].

Among URIs, a general distinction is made between Uniform Resource Locators (URLs) and Uniform Resource Names (URNs) [BFM05, Section 1.1.3] (see Figure 2.7). The syntax of a URL defines the location of a resource and the type of accessibility [BFM05, Section 1.1.3], i.e. access through HTTP and the File Transfer Protocol (FTP) for the URLs provided in the example in Figure 2.7. A URN, on the other hand, only assigns a name to a resource, without specifying how the resource might be accessed or where it is located [BFM05, Section 1.1.3]. A URI is composed of the following component parts22 [BFM05, Section 3]:

16 http://dbpedia.org/ (accessed on May 12, 2014)
17 http://www.wikidata.org/ (accessed on May 12, 2014)
18 http://www.freebase.com/ (accessed on May 12, 2014)
19 http://revyu.com/ (accessed on May 12, 2014)
20 http://productdb.org/ (accessed on May 12, 2014)
21 http://linkeddata.org/data-sets (accessed on May 12, 2014)
22 This pattern shows only the most common component parts of a URI. Further refinements are possible.


[Figure 2.7: Relationship between URI, URL, and URN [based on BFM05]. URL and URN are shown as subsets of URI. URL examples: http://www.example.org/, ftp://ftp.example.org/. URN examples: urn:isbn:123-4-567-89012-X, urn:ietf:rfc:3986. Further URI examples: mailto:[email protected]:+49-89-12345678.]

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

We further exemplify a URI’s components on a fictitious geocoding Web page, represented (a) by a dereferenceable HTTP URL, and (b) by a URN as it could be used by an application.

a) http://www.example.org:8080/geocode?s=germany&c=munich&a=marienplatz#canvas
   \__/   \__________________/ \_____/ \______________________________/ \____/
    |              |              |                   |                   |
  scheme       authority         path               query             fragment
    |   __________________________|_____________
   / \ /                                        \
b) urn:example:geocode:germany:munich:marienplatz
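The decomposition shown above can be reproduced with a generic URI parser. The following is a minimal sketch using Python's standard `urllib.parse` module; the variable names are ours, and the sketch only covers the common components named in the pattern above.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.example.org:8080/geocode?s=germany&c=munich&a=marienplatz#canvas"
urn = "urn:example:geocode:germany:munich:marienplatz"

# (a) The HTTP URL splits into all five components.
u = urlsplit(url)
print(u.scheme)    # scheme
print(u.netloc)    # authority (host and port)
print(u.path)      # path
print(u.query)     # query
print(u.fragment)  # fragment

# The query string itself can be decomposed into key/value pairs.
params = parse_qs(u.query)
print(params)

# (b) The URN has a scheme and a path, but no authority.
n = urlsplit(urn)
print(n.scheme, repr(n.netloc), n.path)
```

Note that the URN yields an empty authority, because its hier-part does not start with "//".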

As an advancement to URIs, IRIs have been introduced in RFC 3987 [DS05]. They support resource identifiers with special characters. IRIs are based on the Universal Character Set (Unicode/ISO 10646) [DS05] and thus allow for a much wider range of characters than the American Standard Code for Information Interchange (ASCII) code table that underlies URIs. For example, a German blog could choose to use the following IRI for their blog post entry

http://www.example.org/blog/2014/März/01.html

whereas with URIs we either need to avoid special characters, namely

http://www.example.org/blog/2014/Maerz/01.html

or we have to percent-encode the special character properly

http://www.example.org/blog/2014/M%C3%A4rz/01.html


To retain compatibility between IRIs and URIs, the specification in RFC 3987 defines bidirectional mapping algorithms [DS05, Section 3]. RDF, for example, added support for IRIs in recommendation version 1.1 [SR14].
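The percent-encoding of the blog example can be sketched with Python's standard library. This is a simplified illustration, not the full RFC 3987 mapping: it only treats the path component (host names, for instance, require separate handling), and the reverse direction naively decodes every escape sequence. The function names `iri_path_to_uri` and `uri_path_to_iri` are hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def iri_path_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters in the path as UTF-8 octets."""
    parts = urlsplit(iri)
    encoded_path = quote(parts.path, safe="/")  # keep path separators
    return urlunsplit(parts._replace(path=encoded_path))


def uri_path_to_iri(uri: str) -> str:
    """Naive reverse mapping: decode percent-escapes in the path."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path=unquote(parts.path)))


iri = "http://www.example.org/blog/2014/März/01.html"
uri = iri_path_to_uri(iri)
print(uri)
print(uri_path_to_iri(uri))
```

The letter "ä" occupies two octets in UTF-8 (C3 A4), which is why it becomes two percent-encoded triplets.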

As a design decision for URIs, Tim Berners-Lee postulated in 1998 the use of cool URIs. A cool URI is “one which does not change” [Ber98], i.e. a permalink. The idea was to omit any details that might be subject to future changes, e.g. status information about the document (e.g. “draft”, “final”), underlying software mechanisms (e.g. “.php”, “cgi-bin”), or even metadata (e.g. author information, or storage details like disk names). According to this, cool URIs represent well-conceived and sustainable Web identifiers that are aimed at being simple, stable, and manageable [SC08]. Otherwise, every technological upgrade would involve significant maintenance overhead in the best case (e.g. setting up appropriate redirects). A poor URI design carries the risks of losing users and breaking existing applications.
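As a rough illustration of this guideline, the following Python sketch flags path details of the kinds mentioned above. The pattern list and the function name `volatile_details` are assumptions made for this example only; they are heuristics and not a test defined in [Ber98] or [SC08].

```python
import re
from urllib.parse import urlsplit

# Hypothetical heuristics for details that tend to change over time.
VOLATILE_PATTERNS = {
    "server-side file extension": re.compile(r"\.(php|asp|jsp|cgi)$", re.I),
    "software mechanism": re.compile(r"(^|/)cgi-bin(/|$)", re.I),
    "document status": re.compile(r"(^|/)(draft|final)(/|$)", re.I),
}


def volatile_details(uri: str) -> list:
    """Return the labels of all heuristics that match the URI path."""
    path = urlsplit(uri).path
    return [label for label, pattern in VOLATILE_PATTERNS.items()
            if pattern.search(path)]


print(volatile_details("http://www.example.org/cgi-bin/draft/report.php"))
print(volatile_details("http://www.example.org/blog/2014/01"))
```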

URIs on the Web can be used to refer to both information resources and non-information resources [JW04, Section 2.2]. This distinction becomes relevant when talking about resources on the Semantic Web. Web pages in the traditional sense, i.e. documents or information artifacts that can be retrieved (or dereferenced), constitute information resources [JW04, Section 2.2]. “[T]heir essential characteristics can be conveyed in a message” [JW04, Section 2.2]. By contrast, non-information resources describe entities whose essential characteristics cannot be transferred over a medium [JW04, Section 2.2]. This includes real-world objects like people, cars, books, clothing, etc. Their resource identity on the Semantic Web can be regarded as pointing to non-information resources [cf. SC08, Section 3]. For instance, the author of this thesis can be identified on the Web by a URI, which, however, is different from a Web document that presents information about the author. The technical problems that arose from the need to keep identifiers for real-world objects and their representations on the Web apart were debated in the context of the httpRange-14 issue [SC08, Section 4.2].

2.3.3 Resource Description Framework

The Resource Description Framework (RDF) is a framework for representing resources on the Web [SR14, Section 1]. It constitutes a basic data model that facilitates the exchange of knowledge representations in a distributed manner, based on the publication of graphs and URIs for identifying nodes and edge types in these graphs. In 2004, RDF became a W3C recommendation [MM04] and was thus accepted as the official standard for modeling data on the Semantic Web. In 2014, the W3C Working Group obsoleted the original specification in favor of RDF 1.1 [SR14], which suggested minor but important advancements to RDF, namely

• support for IRIs,

• a simpler mechanism for datatypes for all literals (even those with language tags, and plain literals have become obsolete), and

• a number of new serialization formats for RDF [cf. Woo14].

The main building blocks of the RDF data model are triples [SR14, Section 3.1], sometimes referred to as statements [AvH08, p. 68]. Each triple is composed of three elements in a given order: Subject, predicate, and object. A triple can be formally defined as follows:

$$(s, p, o) \in (R \cup B) \times R \times (R \cup B \cup L) \tag{2.2}$$

In this formula, R denotes a set of resource identifiers (e.g. URI or IRI), B represents blank nodes, and L stands for literal values (or constant values). Likewise, triples can be graphically represented as illustrated in Figure 2.8 (see footnote 23).

[Figure 2.8: RDF triple represented as a graph, with a subject node s connected to an object node o by an edge labeled p. (a) Object is a URI or blank node; (b) Object is an RDF literal.]

As in the formal definition above, the object can be either an addressable node (i.e. a URI or blank node) or a literal value. A literal value is an atomic value and is either a textual label (or string value), date value, or numeric value [SR14, Section 3.3]. With RDF 1.0, it was optional to assign a datatype to a literal. With RDF 1.1, datatypes are now mandatory for literals. If the datatype is omitted, then a triple store should assume a string literal (i.e. using a plain or simple literal is treated as “syntactic sugar” for a string-typed literal) [CWL14, Section 3.3]. For textual literals it is also possible to associate them with a language tag. In this context, the datatype is implicitly known and need not be supplied [CWL14, Section 3.3].
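The formal definition and the datatype rules can be made concrete in a few lines of Python. The following is a minimal sketch, not an RDF library: the classes `IRI`, `BlankNode`, and `Literal` and the function `is_valid_triple` are hypothetical names. It assumes the RDF 1.1 convention that language-tagged strings carry the datatype rdf:langString and that literals without an explicit datatype are treated as xsd:string.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


@dataclass(frozen=True)
class IRI:          # element of R
    value: str


@dataclass(frozen=True)
class BlankNode:    # element of B
    label: str


@dataclass(frozen=True)
class Literal:      # element of L
    lexical: str
    datatype: Optional[str] = None
    language: Optional[str] = None

    def effective_datatype(self) -> str:
        """Datatype a consumer should assume for this literal."""
        if self.language is not None:
            return RDF_LANGSTRING          # implied by the language tag
        return self.datatype or XSD_STRING  # simple literal means string


def is_valid_triple(s, p, o) -> bool:
    """Check (s, p, o) against (R u B) x R x (R u B u L)."""
    return (isinstance(s, (IRI, BlankNode))
            and isinstance(p, IRI)
            and isinstance(o, (IRI, BlankNode, Literal)))


name = Literal("Scoop of ice cream", language="en")
offer = IRI("http://www.example.com/#OfferIcecream")
prop = IRI("http://purl.org/goodrelations/v1#name")
print(is_valid_triple(offer, prop, name))
print(is_valid_triple(name, prop, offer))  # a literal cannot be a subject
```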

23 In order to draw this and the upcoming RDF graphs, we used a slightly customized version of the brilliant TextMate bundle developed by Peter Geil, Turtle.tmbundle (https://github.com/peta/turtle.tmbundle (accessed on May 20, 2014)), which can generate graphs from Turtle code.

The RDF data model allows the interlinking of RDF triples based on resource identifiers and blank nodes24. Because URIs are used to identify nodes and relationship types, a computer can reliably combine multiple statements into a consolidated graph by simple string comparison, i.e. multiple statements referring to the same subject or object can be collated. Herein lies the actual power of the Semantic Web, namely the ability to gather context information from various datasets and thus gradually connect the dots. Figure 2.9 of Section 2.3.4 exemplifies such a graph.
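The consolidation by string comparison can be sketched as a plain set union. In the following minimal Python example, triples are modeled as tuples of strings; the two source sets `shop_data` and `crawler_data` are hypothetical, and the sketch deliberately contains no blank nodes, since their locally scoped identifiers would have to be kept apart when merging.

```python
from collections import defaultdict

EX = "http://www.example.com/#"
GR = "http://purl.org/goodrelations/v1#"

# Two hypothetical sources making statements about the same URI.
shop_data = {
    (EX + "OfferIcecream", GR + "name", '"Scoop of ice cream"@en'),
}
crawler_data = {
    (EX + "OfferIcecream", GR + "hasBusinessFunction", GR + "Sell"),
    (EX + "OfferIcecream", GR + "name", '"Scoop of ice cream"@en'),
}

# Identical triples collapse; equal URIs denote the same node.
merged = shop_data | crawler_data

# Collate all statements that refer to the same subject.
by_subject = defaultdict(set)
for s, p, o in merged:
    by_subject[s].add((p, o))

for subject, statements in by_subject.items():
    print(subject, len(statements))
```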

Multiple RDF statements can themselves be made a resource and be grouped together as a single RDF graph [SR14, Section 3.5]. An RDF graph provides context information about a group of RDF triples. It can also be identified by a URI. A collection of RDF graphs in turn forms an RDF dataset. RDF datasets generally consist of one default graph and an arbitrary number of named graphs [CWL14, Section 4; Car+05].

2.3.4 RDF Serialization Formats

In the beginnings of the Semantic Web, the RDF/XML serialization format was introduced as a normative data format for RDF [SR14, Section 5.4]. Many tools were capable of handling XML documents and thus the XML toolset (parsing, transformation, search, etc.) could be applied to RDF/XML as well. Unfortunately, the fact that people had to learn XML before being able to work with RDF introduced an unnecessary barrier.

To mitigate this issue, a more human-friendly alternative syntax has been proposed and developed by Tim Berners-Lee: Notation 3 (N3) [Ber05; cf. BC11]. Since then, various additional data formats for RDF have evolved, most notably N-Triples [CS14a], Turtle [PC14], RDFa [Adi+13], and JSON-LD [SKL14].

In the following, let us introduce an example based on the GoodRelations [Hep08a] vocabulary for e-commerce (see Section 2.3.6.2) that will subsequently serve as our baseline for explaining the different syntax variants.

Example. Imagine your child is begging for some ice cream. Luckily, not too far from you there is an ice cream shop, which makes the following offer: “A single scoop of ice cream for only €1.10”.

24 Blank nodes have the limitation that their locally scoped identifiers cannot be linked from outside the graph. Their advantage is, however, that modelers do not need to care about minting identifiers for secondary resources, which is especially useful for graphs not meant to be published or linked from externally [cf. CWL14, Sections 3.4 and 3.5].


Our simple example can be visualized using the RDF graph shown in Figure 2.9. The top node represents an instance of an offer, whose name property is “Scoop of ice cream”. Please note also the English language tag supplied with the literal. The type of the offer is made explicit using the rdf:type property. Furthermore, the offer is intended for sale (using the business function property together with an instance gr:Sell). The tricky part is the modeling of the price. For our purpose, we decided to model the price specification as a blank node, since the price is only related to this offer and thus it usually makes no sense to refer to it independently. The price is indicated to be calculated per unit (via gr:UnitPriceSpecification), and the price value and price currency are modeled as individual properties.

[Figure 2.9: Example as an RDF graph. The node ex:OfferIcecream has rdf:type gr:Offering, gr:hasBusinessFunction gr:Sell, gr:name "Scoop of ice cream"@en, and gr:hasPriceSpecification pointing to a blank node. The blank node has rdf:type gr:UnitPriceSpecification, gr:hasCurrency "EUR"^^xsd:string, and gr:hasCurrencyValue "1.10"^^xsd:float.]

In the upcoming sections, we present the same example in different RDF syntaxes. To convert between those syntaxes, the author of this thesis has developed an online tool25 that translates between the most common data formats for RDF [SRH13a].

2.3.4.1 RDF/XML

RDF/XML is an XML-based data format for RDF. For a long time, it was regarded as the normative syntax for RDF. Because of the wide tool support for XML, RDF/XML was expected to be understood by most Semantic Web tools. RDF 1.1 XML Syntax [GS14] provides a syntax and grammar definition for RDF/XML. Without going into details, Listing 2.1 encodes our ice cream example based on those definitions.

25 http://rdf-translator.appspot.com/ (accessed on May 8, 2014)



<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:gr="http://purl.org/goodrelations/v1#">
  <gr:Offering rdf:about="http://www.example.com/#OfferIcecream">
    <gr:name xml:lang="en">Scoop of ice cream</gr:name>
    <gr:hasBusinessFunction rdf:resource="http://purl.org/goodrelations/v1#Sell"/>
    <gr:hasPriceSpecification>
      <gr:UnitPriceSpecification>
        <gr:hasCurrencyValue rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.10</gr:hasCurrencyValue>
        <gr:hasCurrency rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EUR</gr:hasCurrency>
      </gr:UnitPriceSpecification>
    </gr:hasPriceSpecification>
  </gr:Offering>
</rdf:RDF>

Listing 2.1: Example in RDF/XML
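Since RDF/XML is ordinary XML, a generic XML parser can read it. The following minimal Python sketch uses the standard library's `xml.etree.ElementTree` on an abridged copy of the example (the XML declaration and the business function are left out for brevity). It is only an illustration of applying the XML toolset: it relies on this particular nesting of elements, whereas RDF/XML permits several equivalent ways to write the same graph, so a dedicated RDF parser would be needed in general.

```python
import xml.etree.ElementTree as ET

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:gr="http://purl.org/goodrelations/v1#">
  <gr:Offering rdf:about="http://www.example.com/#OfferIcecream">
    <gr:name xml:lang="en">Scoop of ice cream</gr:name>
    <gr:hasPriceSpecification>
      <gr:UnitPriceSpecification>
        <gr:hasCurrencyValue rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.10</gr:hasCurrencyValue>
        <gr:hasCurrency rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EUR</gr:hasCurrency>
      </gr:UnitPriceSpecification>
    </gr:hasPriceSpecification>
  </gr:Offering>
</rdf:RDF>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "gr": "http://purl.org/goodrelations/v1#",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(RDF_XML)
offer = root.find("gr:Offering", NS)

# Attributes are addressed by their namespace-qualified names.
subject = offer.get("{%s}about" % NS["rdf"])
name = offer.find("gr:name", NS)
language = name.get(XML_LANG)

# The price details sit below the nested price specification.
price = offer.find(".//gr:hasCurrencyValue", NS).text
currency = offer.find(".//gr:hasCurrency", NS).text

print(subject, name.text, language, price, currency)
```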

2.3.4.2 Turtle

Turtle is a shorthand for Terse RDF Triple Language. Its spelled-out name indicates why it came into existence, namely to provide a terse data format for RDF. The following data formats pertain to the Turtle family of RDF languages [cf. SR14, Section 5]:

• N-Triples [CS14a] is an RDF syntax where RDF triples are written line by line. URIs are surrounded by angle brackets. Furthermore, language tags are appended to literals separated by an @-symbol, and datatypes are specified by attaching them to the literal separated by two consecutive caret (^) symbols. N-Triples is the simplest form of serializing an RDF graph and thus straightforward to process, although not the most compact and readable one. Our example in N-Triples looks as shown in Listing 2.2. Lines 5–7 are statements that belong to the price specification. As we know from before, the price specification is described by a blank node, which can be assigned an arbitrary identifier with a local scope. In our case, it is _:ub22bL7C28.

1 <http://www.example.com/#OfferIcecream> <http://purl.org/goodrelations/v1#hasBusinessFunction> <http://purl.org/goodrelations/v1#Sell> .
2 <http://www.example.com/#OfferIcecream> <http://purl.org/goodrelations/v1#name> "Scoop of ice cream"@en .
3 <http://www.example.com/#OfferIcecream> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#Offering> .
4 <http://www.example.com/#OfferIcecream> <http://purl.org/goodrelations/v1#hasPriceSpecification> _:ub22bL7C28 .
5 _:ub22bL7C28 <http://purl.org/goodrelations/v1#hasCurrency> "EUR"^^<http://www.w3.org/2001/XMLSchema#string> .
6 _:ub22bL7C28 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#UnitPriceSpecification> .
7 _:ub22bL7C28 <http://purl.org/goodrelations/v1#hasCurrencyValue> "1.10"^^<http://www.w3.org/2001/XMLSchema#float> .

Listing 2.2: Example in N-Triples

• Turtle [PC14] is a compact, human-readable syntax. It is mainly used to explain RDF content to people. Turtle is an extension of N-Triples, i.e. every valid N-Triples document is also a valid Turtle document. As compared to N-Triples, Turtle permits prefix directives that can be used to shorten otherwise lengthy URIs (e.g. http://www.example.com/#OfferIcecream) to terse Compact URIs (CURIEs) [BM10] (e.g. ex:OfferIcecream). Other syntactical improvements encompass shorthand notations like “a” instead of rdf:type, predicate lists (multiple predicate-object pairs separated by semicolons “;”), object lists (multiple objects separated by commas “,”), or statements contained within square brackets to delimit a blank node [PC14]. Consequently, the following two examples represent one and the same RDF graph as illustrated in Figure 2.10:

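[Figure 2.10: RDF graph that corresponds to the Turtle example. The node s is connected via p1 to a blank node, which in turn is connected via p2 to o1, via p2 to o2, and via p3 to o3.]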

a) Simple Turtle notation (equals N-Triples)

<s> <p1> _:b .

_:b <p2> <o1> .

_:b <p2> <o2> .

_:b <p3> <o3> .


b) Turtle notation

<s> <p1> [

<p2> <o1>, <o2> ;

<p3> <o3> ] .

Following this short introduction into the basics of the Turtle syntax, Listing 2.3 outlines our example in Turtle. Please note the prefix declarations for the vocabularies at the beginning of the code section.

@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <http://www.example.com/#> .

ex:OfferIcecream a gr:Offering ;
    gr:hasBusinessFunction gr:Sell ;
    gr:hasPriceSpecification [ a gr:UnitPriceSpecification ;
        gr:hasCurrency "EUR"^^xsd:string ;
        gr:hasCurrencyValue "1.10"^^xsd:float ] ;
    gr:name "Scoop of ice cream"@en .

Listing 2.3: Example in Turtle/N3

• Notation 3 (N3) [BC11] is a superset of Turtle, i.e. every valid Turtle document is also a valid N3 document. In comparison to Turtle, N3 provides even more syntactic simplifications that aim to facilitate readability (e.g. “=” in place of owl:sameAs). Hence, Listing 2.3 being valid Turtle also represents a valid snippet in N3.
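The effect of the prefix declarations in Listing 2.3 can be sketched in a few lines of Python. The snippet below is a minimal illustration, not a Turtle parser: the function names expand and to_ntriples are our own, and only prefixed names and the keyword a (shorthand for rdf:type) are handled.

```python
# Prefix table copied from the @prefix declarations in Listing 2.3.
PREFIXES = {
    "gr": "http://purl.org/goodrelations/v1#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://www.example.com/#",
}


def expand(term, prefixes=PREFIXES):
    """Expand a Turtle prefixed name such as 'gr:Offering' to a full IRI."""
    if term == "a":  # Turtle keyword that abbreviates rdf:type
        return prefixes["rdf"] + "type"
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    raise ValueError(f"unknown prefix in {term!r}")


def to_ntriples(subject, predicate, obj):
    """Write one triple of IRIs in the verbose N-Triples form."""
    return f"<{expand(subject)}> <{expand(predicate)}> <{expand(obj)}> ."


line = to_ntriples("ex:OfferIcecream", "a", "gr:Offering")
print(line)
```

The printed line corresponds to the rdf:type statement of Listing 2.2, which shows that the compact Turtle form and the verbose N-Triples form denote the same triple.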

In addition to the data formats outlined above, there are other syntaxes derived from Turtle that make it possible to describe multiple named graphs within one document. They have been added as valid RDF data formats in the new RDF 1.1 W3C recommendation [SR14]:

• N-Quads [Car14] is a line-based RDF syntax that extends N-Triples with support for encoding multiple named graphs.

• TriG [CS14b] is an extension of Turtle and defines named graphs as block sections of RDF triples surrounded by curly braces.
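As a rough illustration of how N-Quads extends N-Triples, the following Python sketch appends a graph label to an N-Triples statement. The helper name to_nquad and the graph IRI http://www.example.com/graphs/offers are assumptions made for this example only.

```python
def to_nquad(ntriple_line, graph_iri):
    """Turn one N-Triples statement into an N-Quads statement.

    The graph label is inserted as a fourth element before the final dot.
    """
    body = ntriple_line.rstrip()
    if not body.endswith("."):
        raise ValueError("an N-Triples statement must end with '.'")
    return f"{body[:-1].rstrip()} <{graph_iri}> ."


triple = ('<http://www.example.com/#OfferIcecream> '
          '<http://purl.org/goodrelations/v1#name> "Scoop of ice cream"@en .')
quad = to_nquad(triple, "http://www.example.com/graphs/offers")  # hypothetical graph IRI
print(quad)
```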


2.3.4.3 RDFa

The Resource Description Framework in Attributes (RDFa) [Adi+13] is a data format for encoding structured information in Web pages. It avails itself of (X)HTML²⁶, which is, thanks to its wide dissemination, a natural carrier for embedding data from the Semantic Web. As such, RDFa constitutes a lightweight alternative to RDF/XML, because it does not require complicated server configurations in order to deliver Semantic Web content. Since becoming a W3C recommendation in 2008 [Adi+08], RDFa has continually grown in popularity, which was confirmed by several research studies [e.g. Mik11; MP12; Biz+13]. Contrary to common belief, RDFa was not the first syntax to embed RDF in Web documents; similar approaches had been proposed before. For example, Simple HTML Ontology Extension (SHOE) can be considered a predecessor of RDFa and similar data formats for markup languages [e.g. LSR96; HH00]. As opposed to SHOE, RDFa is less intrusive, for it makes use of existing Extensible Hypertext Markup Language (XHTML) and HTML elements and attributes and needs to define only a few additional attributes. Table 2.8 summarizes all relevant (X)HTML attributes used by RDFa [Adi+13, Section 5], and Listing 2.4 outlines an RDFa snippet corresponding to our example.

Table 2.8: (X)HTML attributes defined for RDFa [based on Adi+13, Section 5]

Category    Attribute    Explanation
syntax      prefix       space-separated list of prefix-IRI pairs used for defining CURIEs
syntax      vocab        IRI/URI mapping for locally scoped attribute values
subject     about        IRI/URI/CURIE of the RDF subject
subject     typeof       space-separated list of RDF types for the RDF subject
predicate   rel          relationship between two resources (RDF predicate)
predicate   rev          inverse relationship between two resources (inverse of rel)
predicate   property     RDF predicate referring to a literal value
object      resource     IRI/URI/CURIE of the RDF object
object      href         IRI/URI of the RDF object if the resource is navigable
object      src          IRI/URI of the RDF object if the resource is embedded
object      content      if supplied, it takes precedence over the element content (literal value)
object      datatype     datatype of the literal value
object      (xml:)lang   language tag of the literal value

The snippet in Listing 2.4 encodes all its RDF content inside HTML attributes. This technique is referred to as RDFa in “Snippet Style” [HGR09], meaning that a snippet with hidden markup is created that can be placed almost anywhere in an HTML document.

26 In fact, RDFa 1.1 is compatible with any XML document like e.g. Scalable Vector Graphics (SVG) pictures [cf. Her+13, Section 1.1], although XHTML or HTML are the most common uses.


 1 <div xmlns="http://www.w3.org/1999/xhtml" prefix="
 2     gr: http://purl.org/goodrelations/v1#
 3     xsd: http://www.w3.org/2001/XMLSchema#">
 4   <div typeof="gr:Offering" about="http://www.example.com/#OfferIcecream">
 5     <div property="gr:name" xml:lang="en" content="Scoop of ice cream"></div>
 6     <div rel="gr:hasBusinessFunction" resource="http://purl.org/goodrelations/v1#Sell"></div>
 7     <div rel="gr:hasPriceSpecification">
 8       <div typeof="gr:UnitPriceSpecification">
 9         <div property="gr:hasCurrencyValue" datatype="xsd:float" content="1.10"></div>
10         <div property="gr:hasCurrency" datatype="xsd:string" content="EUR"></div>
11       </div>
12     </div>
13   </div>
14 </div>

Listing 2.4: Example in RDFa

This technique has both benefits and limitations. A remarkable advantage is that it is much easier to generate, i.e. an application programming interface (API) with template engine or a converter can create the snippet in an automatic way. Furthermore, it is fairly straightforward to afterwards incorporate the generated snippet into a Web page. The drawback of this technique is that the structured data is decoupled from the visible content on the Web page, which adds unnecessary redundancy. To give an example, line 5 of Listing 2.4 could equally be written as

<div property="gr:name" xml:lang="en">Scoop of ice cream</div>

which would reuse the text inside the HTML <div> element as metadata for RDF.
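That both variants of line 5 yield the same literal can be illustrated with a small Python sketch built on the standard html.parser module. It is deliberately naive and by no means a conforming RDFa processor: the class LiteralExtractor and the function extract are our own names, and we assume that an element carrying a property attribute contains plain text only.

```python
from html.parser import HTMLParser


class LiteralExtractor(HTMLParser):
    """Collect (property, literal) pairs; @content wins over element text."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # [property, text parts] of the element being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "property" not in attributes:
            return
        if "content" in attributes:
            # Hidden markup: the literal is taken from the content attribute.
            self.pairs.append((attributes["property"], attributes["content"]))
        else:
            # Visible markup: the literal is the text inside the element.
            self._open = [attributes["property"], []]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            prop, parts = self._open
            self.pairs.append((prop, "".join(parts).strip()))
            self._open = None


def extract(html):
    parser = LiteralExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


HIDDEN = '<div property="gr:name" xml:lang="en" content="Scoop of ice cream"></div>'
VISIBLE = '<div property="gr:name" xml:lang="en">Scoop of ice cream</div>'
print(extract(HIDDEN) == extract(VISIBLE))
```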

2.3.4.4 JSON-LD

JSON for Linked Data (JSON-LD)²⁷ [SKL14] is a data format for serializing Linked Data as JSON. JSON is a light-weight, text-based, and language-independent data interchange format [Bra14]. It is widely used in software engineering projects as an alternative to XML [cf. Bra+08]. Being based on JSON makes JSON-LD easy to work with for humans [LG12].

Listing 2.5 encodes our example as JSON-LD, whose meaning should be self-explanatory by now. In this example, all URIs are written out entirely, even though JSON-LD provides a mechanism equivalent to the prefix directives in Turtle or RDFa. I.e., one can define shorthand names (known as terms) within a context construct [SKL14, Section 5.1].

{
  "@id": "http://www.example.com/#OfferIcecream",
  "@type": "http://purl.org/goodrelations/v1#Offering",
  "http://purl.org/goodrelations/v1#hasBusinessFunction": {
    "@id": "http://purl.org/goodrelations/v1#Sell"
  },
  "http://purl.org/goodrelations/v1#hasPriceSpecification": {
    "@type": "http://purl.org/goodrelations/v1#UnitPriceSpecification",
    "http://purl.org/goodrelations/v1#hasCurrency": {
      "@type": "http://www.w3.org/2001/XMLSchema#string",
      "@value": "EUR"
    },
    "http://purl.org/goodrelations/v1#hasCurrencyValue": {
      "@type": "http://www.w3.org/2001/XMLSchema#float",
      "@value": "1.10"
    }
  },
  "http://purl.org/goodrelations/v1#name": {
    "@language": "en",
    "@value": "Scoop of ice cream"
  }
}

Listing 2.5: Example in JSON-LD

27 http://json-ld.org/ (accessed on May 16, 2014)
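To give an impression of what such a context does, the following Python sketch resolves terms and prefixes of a shortened variant of our example back to full IRIs. It is a strongly simplified stand-in for the JSON-LD expansion algorithm: it only looks at top-level keys and the @type value, it assumes that all context entries are plain strings, and the helper names expand_iri and expand are ours.

```python
import json

# Shortened variant of the example that uses a context with a prefix ("gr")
# and two terms ("name", "Offering").
doc = json.loads("""
{
  "@context": {
    "gr": "http://purl.org/goodrelations/v1#",
    "name": "gr:name",
    "Offering": "gr:Offering"
  },
  "@id": "http://www.example.com/#OfferIcecream",
  "@type": "Offering",
  "name": {"@language": "en", "@value": "Scoop of ice cream"}
}
""")


def expand_iri(value, context):
    """Resolve a term or compact IRI against a flat, string-only context."""
    seen = set()
    while value in context and value not in seen:  # term -> its definition
        seen.add(value)
        value = context[value]
    prefix, sep, suffix = value.partition(":")
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix  # compact IRI such as gr:name
    return value  # already an absolute IRI


def expand(document):
    """Expand top-level keys and the @type value; nested values stay as-is."""
    context = document.get("@context", {})
    result = {}
    for key, value in document.items():
        if key == "@context":
            continue
        if key == "@type":
            result[key] = expand_iri(value, context)
        elif key.startswith("@"):
            result[key] = value
        else:
            result[expand_iri(key, context)] = value
    return result


expanded = expand(doc)
print(json.dumps(expanded, indent=2))
```

The expanded keys coincide with the fully written-out IRIs used in Listing 2.5.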

2.3.4.5 Non-RDF Syntaxes for the Semantic Description of Data

In the context of facilitating the distribution and consumption of data on the Web, alternative syntaxes have been suggested that are not fully compliant with RDF. The ones that we are covering here are Microformats, Microdata, and the Open Graph Protocol (OGP). These formats have enjoyed considerable adoption in the past years [e.g. MP12; Biz+13], mainly pushed by large companies like Google, Yahoo!, and Facebook.

• Microformats²⁸ [cf. Kha06] reuses existing HTML attributes such as class and href. Microformats is not only a data format but also a vocabulary, because the eligible schema elements are specified from within Microformats [cf. Kha06]. Accordingly, for the product domain, predefined vocabulary terms exist such as the h-product class for products, the p-name property for the name of a product, as well as the p-price property for its price [Çel14]. The adoption of new vocabularies into the Microformats namespace is a non-trivial, tedious, and slow process [ABH11, pp. 160f.]. E.g., to prevent potential interferences between schemes or clashes between vocabulary terms, the Microformats standardization has to be centralized [ABH11, pp. 160f.].

28 http://microformats.org/ (accessed on May 16, 2014)

In Listing 2.6, we detail our running example in Microformats, this time modeled as a product rather than an offer, because the Microformats vocabulary makes no conceptual distinction between offers and products (unlike GoodRelations, see Section 2.3.6.2).

<div class="h-product hproduct">
  <h1 class="p-name fn">Scoop of ice cream</h1>
  <p>Price: <data class="p-price price" value="1.10">€1.10</data></p>
</div>

Listing 2.6: Example in Microformats

• Microdata [Hic13] is a syntax to annotate Web content with structured data. Similar to RDFa, it can be used to add metadata based on custom vocabulary terms to HTML markup. Unlike RDFa, which is closely related to the RDF data model, Microdata does not represent an RDF graph. It rather describes a special type of graph, i.e. a tree of nested groups (items) of name-value pairs [Hic13, Section 4]. In a way, it resembles the structure of a JSON document [cf. Hic13, Section 7]. Nonetheless, it requires some effort to map from HTML+Microdata to RDF [cf. Kel14], mainly because edge cases have to be sorted out, such as the special treatment of non-URI property names in Microdata or the handling of an ordered list of name-value pairs from Microdata in RDF [Kel14, Section 1.1]. Generally, Microdata is considered simpler than RDFa, with a less steep learning curve.

Listing 2.7 illustrates our previous example in Microdata. As you can see, Microdata is a more compact syntax than RDFa because local property identifiers inside an item inherit the scope of the corresponding item type (provided an itemscope attribute is in place). However, when using externally defined properties, the full URI of the property must be provided.

<div itemtype="http://purl.org/goodrelations/v1#Offering"
     itemid="http://www.example.com/#OfferIcecream" itemscope>
  <h1 itemprop="name">Scoop of ice cream</h1>
  <p itemprop="hasPriceSpecification"
     itemtype="http://purl.org/goodrelations/v1#UnitPriceSpecification" itemscope>
    Price: €<span itemprop="hasCurrencyValue">1.10</span>
    <meta itemprop="hasCurrency" content="EUR">
  </p>
  <link itemprop="hasBusinessFunction" href="http://purl.org/goodrelations/v1#Sell" />
</div>

Listing 2.7: Example in Microdata

• The Open Graph Protocol (OGP) [Fac12] is a simple data format and vocabulary that was inspired by RDFa, whose attributes it leverages [Fac12]. OGP allows supplying additional information in the HTML header about the object described on a Web page [Fac12]. It has been developed and is used by Facebook to enrich their social graph, e.g. to ease finding out more about the topics that people are interested in. The downside is that there is only a restricted set of terms available; in other words, OGP is not suitable for sophisticated use cases such as describing product offers like in our example.
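How a consumer might read OGP metadata from the HTML header can be sketched as follows. The sample markup and all of its values are invented for this illustration; og:title, og:type, og:url, and og:image are basic OGP properties, whereas the class name OpenGraphParser is our own.

```python
from html.parser import HTMLParser

# Hypothetical HTML header carrying OGP metadata in RDFa-style attributes.
SAMPLE_HEAD = """
<head prefix="og: http://ogp.me/ns#">
  <title>Scoop of ice cream</title>
  <meta property="og:title" content="Scoop of ice cream" />
  <meta property="og:type" content="website" />
  <meta property="og:url" content="http://www.example.com/icecream" />
  <meta property="og:image" content="http://www.example.com/icecream.jpg" />
</head>
"""


class OpenGraphParser(HTMLParser):
    """Collect the og:* properties found in <meta> elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if tag == "meta" and name.startswith("og:") and "content" in attributes:
            self.properties[name] = attributes["content"]


parser = OpenGraphParser()
parser.feed(SAMPLE_HEAD)
og = parser.properties
print(og)
```

The result is a flat set of name-value pairs, which reflects the restricted expressivity mentioned above: nested structures such as a price specification have no natural place in it.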

2.3.5 Ontology Languages

RDF is a data model that is not targeting a specific application domain [cf. AvH08, p. 84]. It consists of RDF triples to describe basic relationships between resources or to specify the type of a resource. This is not very powerful though. What is missing is a schema language that offers means to specify a blueprint for making assertions about real-world phenomena, comparable to a database schema that precisely describes how data can be stored in a database (e.g. compare the CREATE TABLE and INSERT INTO statements of the Structured Query Language (SQL)). Ontology languages serve exactly this purpose: They are the toolset for building ontologies. Corcho, Fernandez-Lopez, and Gomez-Perez [CFG03] conducted a comprehensive literature review on ontology languages. In their work, they systematically analyzed and compared Ontolingua, the Operational Conceptual Modeling Language (OCML), LOOM, the Frame Logic (F-Logic), the XOL Ontology Exchange Language (XOL), SHOE, RDF(S), the Ontology Inference Layer (OIL), and the DARPA Agent Markup Language (DAML) [CFG03].

In the following, we are concerned with RDFS and OWL, which have become the most popular ontology languages on the Web.


2.3.5.1 RDF Schema

The RDF Schema (RDFS) language adds a semantic layer on top of RDF, thus empowering RDF with a minimal set of modeling primitives [BG14, Section 1]. RDFS introduces the notion of classes and properties as specializations of resources [cf. BG14, Section 2 and Section 3]. Diverse classes and properties can be organized as hierarchies of subclasses and subproperties [e.g. AvH08, pp. 85–87; AH11, pp. 130f.; BG14]. For example, one might want to state that a car radio is a more specific type of an electronic device, just like a smartphone is. This simple hierarchy could be set up with RDFS using rdfs:subClassOf relationships [cf. AH11, pp. 131f.] as illustrated in Figure 2.11a. Furthermore, RDFS makes it possible to formalize basic constraints whereupon inferencing (or reasoning) over existing data can be applied. For instance, the domain (allowed subject types) and range (allowed object types) of a property can be supplied [e.g. AvH08, p. 90; AH11, pp. 130f.; BG14]. Figure 2.11b shows a property ex:hasPowerConsumption with a domain of ex:ElectronicDevice, i.e. the property is meant to be applied to instances of that specific type, including those class instances subsumed by ex:ElectronicDevice (i.e. instances of type ex:CarRadio and ex:Smartphone).

Figure 2.11: RDFS language additions. (a) Class hierarchy: ex:CarRadio and ex:Smartphone are each linked via rdfs:subClassOf to ex:ElectronicDevice. (b) Domain value restriction: ex:hasPowerConsumption is linked via rdfs:domain to ex:ElectronicDevice.

Unfortunately, the mechanism for domain and range information for properties in RDFS differs fundamentally from similar mechanisms in traditional database systems, because it triggers the automatic addition of type information instead of reporting a violation. Only in combination with disjointness axioms can this mechanism be used for actually constraining the use of properties to particular types; for an overview see [dBru+05].
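This behavior can be made tangible with a small forward-chaining sketch in Python that covers three RDFS entailment patterns: domain typing, type inheritance along rdfs:subClassOf, and the transitivity of rdfs:subClassOf. It is a minimal illustration under simplifying assumptions (triples are plain string tuples, all other RDFS rules are ignored), and the resources ex:MyRadio and ex:Banana are invented for the example.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"


def rdfs_closure(triples):
    """Apply three RDFS entailment patterns until nothing new is inferred."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Domain: using property p types the subject with p's domain.
                if p2 == DOMAIN and s2 == p:
                    new.add((s, RDF_TYPE, o2))
                # Instances of a subclass are instances of the superclass.
                if p == RDF_TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))
                # rdfs:subClassOf is transitive.
                if p == SUBCLASS and p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
        if new <= inferred:
            return inferred
        inferred |= new


schema = {
    ("ex:CarRadio", SUBCLASS, "ex:ElectronicDevice"),
    ("ex:Smartphone", SUBCLASS, "ex:ElectronicDevice"),
    ("ex:hasPowerConsumption", DOMAIN, "ex:ElectronicDevice"),
}
data = {
    ("ex:MyRadio", RDF_TYPE, "ex:CarRadio"),
    # Arguably a modeling mistake, yet RDFS does not reject it:
    ("ex:Banana", "ex:hasPowerConsumption", "5"),
}
closure = rdfs_closure(schema | data)
print(("ex:Banana", RDF_TYPE, "ex:ElectronicDevice") in closure)
```

Instead of flagging ex:Banana as an invalid subject, the closure simply contains the additional statement that ex:Banana is an ex:ElectronicDevice.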

2.3.5.2 OWL Web Ontology Language

The Web Ontology Language OWL is an even more powerful ontology language than RDFS. Historically, OWL emerged from an effort by a working group formed around DAML+OIL to extend RDFS by more descriptive elements [AvH08, p. 113; cf. AH11, pp. 369f.]. DAML+OIL itself was a language that resulted from joining the U.S. language DAML Ontology Language (DAML-ONT) and the European language OIL [AvH08, p. 113]. OWL already exists in its second version (OWL 2), although in the context of this thesis we refer to OWL 1 because it is the version still in wide use. OWL 1 became a W3C recommendation in 2004 [cf. DS04].

OWL addresses use cases that cannot be fulfilled by the primitive features provided by RDF and RDFS. For example, it extends RDFS, among others, by

• value and cardinality constraints on classes (e.g. max 2),

• class operations (union, intersection, and complement),

• enumerations,

• class relations (e.g. equivalent class),

• advanced property types (e.g. object, datatype, functional, symmetric, annotation),

• property relations (e.g. equivalent property),

• individuals, and

• individual identity (e.g. same as or different from) [cf. DS04].

Accordingly, OWL facilitates the modeling of advanced patterns from real-world use cases, such as:

Example (Class A is equivalent to class B).
<A> a owl:Class .
<B> a owl:Class .
<A> owl:equivalentClass <B> .

Example (Individual I describes the same entity as individual J).
<I> owl:sameAs <J> .

Example (It is only possible to have exactly 2 parents).
<hasParent> a owl:ObjectProperty .
[] a owl:Restriction ;
    owl:onProperty <hasParent> ;
    owl:cardinality "2"^^xsd:nonNegativeInteger .


Without going into detail, based on restrictions on the features outlined above, three general languages (OWL layers) for OWL 1 were suggested. They are, by decreasing expressivity [DS04, Section 8]:

1. OWL Full: Covers the full range of OWL constructs.

2. OWL DL: Restricted to primitives as found in description logics.

3. OWL Lite: Very light-weight and even more restricted than OWL DL, providing basic modeling primitives.

The kind of OWL layer to choose requires the ontology engineer to make a trade-off decision between expressive power and computational overhead (decidability, at the extreme) [DS04, Section 8]. While for practical concerns it might be perfectly reasonable to trade OWL DL for OWL Full, from a decidability point of view it is recommendable to remain within OWL DL, which allows OWL reasoners to infer knowledge in a deterministic way [DS04, Section 8]. For exactly this reason, many ontology engineers had decided in the past to restrict their ontologies to OWL DL, which is known to be a fair compromise between expressivity and computational complexity. However, since real-world scenarios (especially Web ontologies) often require more powerful features from OWL Full but are less prone to reasoning, it is tempting to trade OWL DL ontologies for the sake of better modeling capabilities as provided by OWL Full.

2.3.6 Ontologies and Global Schemas

Although the term ontology (or rather, Ontology (sic!)) also plays a major role in philosophy, being a branch that studies the role of the being in the world [e.g. SBF98; GOS09], we are subsequently concerned with the term in the field of artificial intelligence (AI). In Table 2.9, we list the three most cited definitions of ontologies.

Table 2.9: Most important contributors to the definition of the term ontology

Authors          Definition
[Gru93]          “An ontology is an explicit specification of a conceptualization.”
[Bor97, p. 12]   “An ontology is a formal specification of a shared conceptualization.”
[SBF98]          “An ontology is a formal, explicit specification of a shared conceptualisation.”

Gruber [Gru93] was the first to provide a succinct and generally accepted definition for what an ontology describes in the field of AI. The other authors borrowed from this initial definition and slightly modified it. Compared to Gruber’s definition from 1993, Borst [Bor97, p. 12] concluded that an ontology is formalized in order to make it accessible to a computer. Furthermore, the real-world phenomena encoded within an ontology need to be based on consensus [Bor97, p. 12]. Studer, Benjamins, and Fensel [SBF98] merged the previous two definitions into a third one, and stated that an ontology promises “a shared and common understanding of some domain that can be communicated across people and computers” [SBF98].

As indicated in the preceding section, ontology languages are used for building ontologies, whereas ontologies denote the schemas necessary to give meaning to data. Data without any schema²⁹ or ontology is of limited value, because nobody is able to interpret it. A schema defines a number of classes and properties (schema data) that help disambiguate information expressed in terms of these classes and properties (instance data). Gruninger and Lee summarize the main uses of an ontology as communication, computational inference, and reuse (and organization) of knowledge [GL02]. Consequently, an ontology facilitates the interaction between both human agents and systems [GL02]. Furthermore, it can improve systems (e.g. assist their specification) and help to better organize knowledge (e.g. re-using or extending ontologies) [GL02].

There exist different ontologies for varying scopes. One possible distinction is made between upper ontologies (also upper-level ontologies, or top-level ontologies) and domain ontologies [Gri+11, pp. 522f.]. Upper ontologies try to cover a large spectrum of domains [Gri+11, p. 523], but at the cost of not being able to provide detailed descriptions. Top-level ontologies are thus often extended by domain ontologies [Gri+11, p. 523]. Domain ontologies focus on a particular application domain, and consequently it is easier to capture concepts in greater detail [Gri+11, p. 523].

The two most notable conceptual schemas in the field of e-commerce are schema.org and GoodRelations. The first covers most popular application domains on the Web. However, it is not an upper ontology because the domains it supports are covered in detail. Rather, it can be considered an accumulation of multiple domain ontologies. The second describes a comprehensive domain ontology for e-commerce on the Web.

2.3.6.1 Schema.org

Schema.org³⁰ is an ongoing joint initiative of the leading search engine operators Google, Yahoo!, Bing, and Yandex. It strives to compile a single vocabulary that unifies a collection of popular Web schemas under a consolidated namespace (i.e. http://schema.org/).

29 The word ontology is often used synonymously with the terms vocabulary, schema, or data dictionary. Unless otherwise noted, for the rest of the thesis we will keep referring interchangeably to these terms.
30 http://schema.org/ (accessed on May 21, 2014)

The key goals of schema.org are to retain simplicity and to incrementally add further domains of interest, relying on feedback from a large community formed around the development and maintenance of the vocabulary. The vocabulary is intended to be understood by all search engines in order to provide the greatest benefit to the users. This way, developers and Web site owners have a clear benefit from annotating their Web documents with schema.org. Such benefits might be search engine result snippets, feeding Google’s Knowledge Graph, or a better search experience due to additional relevance signals sent to search engines. Especially the value proposition in the form of rich snippets [Goo16] has attracted attention from various people and organizations that are running their own Web sites. As of February 2016, Google states that it will display rich snippets on search engine results pages (SERPs) for

• products including offering details and ratings,

• reviews about products, restaurants, movies, and stores,

• recipes,

• events, and

• software applications [Goo16].

Figure 2.12 illustrates a rich snippet as generated by Google, indicating rating, reviews, price, and stock availability of a product. The corresponding schema.org Microdata snippet is outlined in Listing 2.8. It is worth mentioning that schema.org presumes certain default values, so the user does not need to specify that the worst possible rating is 1³¹ or that the product is intended for sale³². These defaults can be easily assumed by search engines and hence do not burden Web masters.

Figure 2.12: Google rich snippet

31 “If worstRating is omitted, 1 is assumed.” – schema.org documentation about http://schema.org/Rating (accessed on May 22, 2014)

32 “The business function (e.g. sell, lease, repair, dispose) of the offer or component of a bundle (TypeAndQuantityNode). The default is http://purl.org/goodrelations/v1#Sell” – schema.org documentation about http://schema.org/businessFunction (accessed on May 22, 2014)



<div itemtype="http://schema.org/Product" itemscope>
  <h1 itemprop="name">Power Sander</h1>
  <div itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
    <span itemprop="price">$80.30</span>
    <link itemprop="availability" href="http://schema.org/InStock" />Available
  </div>
  <div itemprop="review" itemtype="http://schema.org/Review" itemscope>
    <h2 itemprop="name">Great value for money!</h2>
    <div itemprop="reviewRating" itemtype="http://schema.org/Rating" itemscope>
      <span itemprop="ratingValue">4</span> out of
      <span itemprop="bestRating">5</span> stars
    </div>
  </div>
</div>

Listing 2.8: Schema.org in Microdata
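How a data consumer could apply the two default values quoted in the schema.org documentation (a worstRating of 1 and a businessFunction of gr:Sell) is sketched below. The dictionary layout of the items and the function name with_defaults are assumptions made for this illustration and do not reflect how any particular search engine processes markup.

```python
# Defaults as quoted from the schema.org documentation for Rating and
# businessFunction; the table structure itself is our own choice.
DEFAULTS = {
    "http://schema.org/Rating": {"worstRating": "1"},
    "http://schema.org/Offer": {
        "businessFunction": "http://purl.org/goodrelations/v1#Sell",
    },
}


def with_defaults(item):
    """Return a copy of the item in which omitted properties get defaults."""
    merged = dict(DEFAULTS.get(item["type"], {}))
    merged.update(item["properties"])  # explicit values always win
    return {"type": item["type"], "properties": merged}


# The rating and the offer as they appear in Listing 2.8.
rating = with_defaults({
    "type": "http://schema.org/Rating",
    "properties": {"ratingValue": "4", "bestRating": "5"},
})
offer = with_defaults({
    "type": "http://schema.org/Offer",
    "properties": {"price": "$80.30",
                   "availability": "http://schema.org/InStock"},
})
print(rating["properties"]["worstRating"], offer["properties"]["businessFunction"])
```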

In terms of data formats, Google has historically promoted Microdata as the preferred syntax for schema.org. Nevertheless, it now explicitly claims to support RDFa and JSON-LD as well [Goo15a].

2.3.6.2 GoodRelations

GoodRelations [Hep08a] is a standardized, light-weight vocabulary for e-commerce on the Semantic Web. The model is defined in OWL DL, which allows any RDFS-style reasoner to compute valuable inferences on GoodRelations data [Hep08a]. Although work on the ontology had started a couple of years earlier, the ontology first went public in 2008. In late 2012, it was incorporated into schema.org, as was publicly announced on the schema.org blog [Guh12]. Nonetheless, development of GoodRelations in its own namespace still continues in parallel [Hep15b]. This makes GoodRelations and schema.org good complements. While structured data markup in GoodRelations can take advantage of schema.org elements, schema.org conversely can be enhanced with existing individuals defined by GoodRelations.

GoodRelations aims to describe offers for products or services and their related resources on the Web [e.g. Hep08a; Hep15b]. A design decision behind the development of the ontology was to be simple yet flexible, i.e. to be an attractive option for the small Web shop owner while leaving the possibility for advanced modeling requirements imposed by industry. In particular, GoodRelations forgoes the detailed specification of products, because there exist classification standards and ontologies that can readily contribute them [cf. Hep08a]. The core part of the ontology entails the relationships between business entity (gr:BusinessEntity), offer (gr:Offering), and product or service (gr:ProductOrService), henceforth referred to as the agent-promise-object principle [Hep15b; Hep11]. As detailed in Figure 2.13, this structure allows modeling a business party (agent) that makes an offer (promise) for a product or service (object), which a second party can purchase in return for some compensation, followed by the transfer of the property rights from the first party to the purchasing party [Hep15b]. Now, having this separation of offer and product or service, it is possible to define multiple offerings (or promises) for a single product or service. In practice, this could be used to model bulk prices or to effectively enforce price differentiation. GoodRelations makes no prior assumptions about the characteristics of the promise, so in theory an offer could consist of a bundle of items, its validity could be temporally restricted, and on top of that a good could even be traded for good karma instead of in return for a monetary amount [cf. Hep15b].

Figure 2.13: Agent-Promise-Object principle [based on Hep15b]. Agent 1 (gr:BusinessEntity) makes a promise (gr:Offering) for an object (gr:ProductOrService); Agent 2 provides compensation in return for the transfer of rights.
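The separation of agent, promise, and object, and the resulting possibility of attaching several offerings to one product, can be sketched with three small Python data classes. This is an illustration only: the class names mirror the GoodRelations classes, but the attribute names, the min_quantity field used to mimic a bulk price, and the helper cheapest_offering are our own simplifications and not part of the vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ProductOrService:  # object, cf. gr:ProductOrService
    name: str


@dataclass(frozen=True)
class Offering:  # promise, cf. gr:Offering
    product: ProductOrService
    unit_price: float
    currency: str
    min_quantity: int = 1  # hypothetical field used to mimic bulk prices


@dataclass
class BusinessEntity:  # agent, cf. gr:BusinessEntity
    legal_name: str
    offerings: List[Offering] = field(default_factory=list)


def cheapest_offering(seller: BusinessEntity, product: ProductOrService,
                      quantity: int) -> Optional[Offering]:
    """Pick the lowest-priced offering that applies to the given quantity."""
    candidates = [o for o in seller.offerings
                  if o.product == product and o.min_quantity <= quantity]
    return min(candidates, key=lambda o: o.unit_price) if candidates else None


scoop = ProductOrService("Scoop of ice cream")
parlor = BusinessEntity("Example Ice Cream Parlor")
# Two promises for one and the same object: regular price and bulk price.
parlor.offerings.append(Offering(scoop, 1.10, "EUR"))
parlor.offerings.append(Offering(scoop, 0.90, "EUR", min_quantity=10))

single = cheapest_offering(parlor, scoop, 1)
bulk = cheapest_offering(parlor, scoop, 12)
print(single.unit_price, bulk.unit_price)
```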

The ontology provides basic support for the most frequently used properties and individuals in offering descriptions, such as product details, prices, and terms and conditions [Hep15b]. The GoodRelations ontology allows extending products (gr:SomeItems or gr:Individual, both subclasses of gr:ProductOrService) with product models (gr:ProductOrServiceModel, likewise a subclass of gr:ProductOrService), or datasheets, that can contribute detailed product information like product features. For that purpose, it defines a fully-fledged meta-model for expressing quantitative and qualitative product properties in OWL. In addition to that, GoodRelations allows refining product and product model descriptions with classes and features of comprehensive product classification standards and product ontologies. In this regard, a number of GoodRelations-compliant product ontologies exist that can provide a more detailed description of products; they are eClassOWL (product types and features) [Hep05b], FreeClass (construction and building materials) [cf. Rad+13], the consumer product ontologies³³ developed in the context of the Ontology-based Product Data Management (OPDM) project, and the Product Types Ontology (PTO)³⁴ with over 600,000 precise classes from Wikipedia. Moreover, the GoodRelations ontology was extended by a number of e-commerce verticals, including vocabularies for

• consumer electronics³⁵,

• tickets³⁶: concert, museum, airfare, and train tickets,

• accommodation³⁷: hotels, camping sites, vacation homes, etc.,

• vehicles in general³⁸: cars, boats, bikes, etc., and

• the automotive industry: car options³⁹, Volkswagen vehicles⁴⁰, and used cars⁴¹.

A number of tools have evolved since 2008 to complement the GoodRelations vocabulary. The tool chain consists of:

Publishing Tools

• RDF Book Mashup⁴²: Book offers from Amazon annotated with GoodRelations published as Linked Data on the Web.

• GoodRelations Snippet Generator⁴³: This form-based online tool helps small businesses and Web site owners with moderate Semantic Web experience to quickly generate custom RDFa snippets for embedding in their Web pages.

• Shop extensions: Over time, a significant number of plug-ins and modules for various Web shops have been developed. So far, there exist extensions for osCommerce⁴⁴, Magento Commerce⁴⁵, Joomla/VirtueMart⁴⁶, xt:Commerce⁴⁷, Oxid eShop⁴⁸, WordPress/WP e-Commerce⁴⁹, PrestaShop⁵⁰, and Drupal Commerce⁵¹.

33 http://www.ebusiness-unibw.org/ontologies/opdm/ (accessed on May 22, 2014)
34 http://www.productontology.org/ (accessed on May 8, 2014)
35 http://www.ebusiness-unibw.org/ontologies/consumerelectronics/v1.html (accessed on May 22, 2014)
36 http://www.heppnetz.de/ontologies/tio/ns.html (accessed on May 22, 2014)
37 http://ontologies.sti-innsbruck.at/acco/ns.html (accessed on May 22, 2014)
38 http://www.heppnetz.de/ontologies/vso/ns.html (accessed on May 22, 2014)
39 http://www.volkswagen.co.uk/vocabularies/coo/ns.html (accessed on May 22, 2014)
40 http://www.volkswagen.co.uk/vocabularies/vvo/ns.html (accessed on May 22, 2014)
41 http://ontologies.makolab.com/uco/ns.html (accessed on May 22, 2014)
42 http://wifo5-03.informatik.uni-mannheim.de/bizer/bookmashup/ (accessed on May 22, 2014)
43 http://www.ebusiness-unibw.org/tools/grsnippetgen/ (accessed on May 22, 2014)

• GR-Notify ping service⁵²: In order to see whether there exist problems with the shop extensions, it was key to keep track of new installations, which has been realized with a registration service where Web shops and site owners can submit their URIs. It was also useful for monitoring adoption of shop extensions.

Converters

• Google Product Feed Converter⁵³: Converts Google product feeds to GoodRelations.

• BMEcat2GoodRelations⁵⁴: Converter from BMEcat documents to GoodRelations. First developed as an online service [cf. Mat09], it was then replaced by a more scalable command-line tool (see [SRH13b] and Chapter 4).

• ELMAR2GoodRelations⁵⁵: Command-line converter from Electronic Market Data Feed (ELMAR) product data feeds to GoodRelations.

• PCS2OWL⁵⁶: Generic converter from product category systems to GoodRelations-compatible product ontologies (see [Sto+14] and Chapter 5).

Consuming Tools

• GR4PHP⁵⁷: A programming API intended for Hypertext Preprocessor (PHP) developers to fetch information about businesses and products from RDF stores [SGH12].

44 https://code.google.com/p/goodrelations-for-oscommerce/ (accessed on May 22, 2014)
45 http://www.magentocommerce.com/magento-connect/msemantic-semantic-seo-for-rich-snippets-in-google-and-yahoo.html (accessed on May 22, 2014)
46 https://code.google.com/p/goodrelations-for-joomla/ (accessed on May 22, 2014)
47 https://code.google.com/p/semantic-seo-for-xt-commerce/ (accessed on May 22, 2014)
48 http://wiki.oxidforge.org/Features/RDFa (accessed on May 22, 2014)
49 http://wordpress.org/plugins/wpec-goodrelations/ (accessed on May 23, 2014)
50 http://addons.prestashop.com/en/seo-prestashop-modules/3866-rich-snippets-and-semantic-seo-with-goodrelations.html (accessed on May 22, 2014)
51 https://drupal.org/project/commerce_goodrelations (accessed on May 22, 2014)
52 http://gr-notify.appspot.com/ (accessed on May 22, 2014)
53 http://www.ebusiness-unibw.org/tools/google-product-feed-converter/ (accessed on May 22, 2014)
54 http://wiki.goodrelations-vocabulary.org/Tools/BMEcat2GR (accessed on May 22, 2014)
55 https://code.google.com/p/elmar-to-goodrelations/ (accessed on May 22, 2014)
56 http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL (accessed on May 22, 2014)
57 https://code.google.com/p/gr4php/ (accessed on May 22, 2014)



• GR2RSS58: This tool combines query results over GoodRelations e-commerce datasets with content syndication, thus empowering Web site owners that run off-the-shelf CMSs with support for syndication formats (Really Simple Syndication, sometimes Rich Site Summary or RDF Site Summary (RSS), or Atom Syndication Format (Atom)) to enhance their Web pages with Semantic Web content [SH13b].

• GoodRelations Validator59: Modeling mistakes on the Semantic Web are quite common. This online service validates RDF documents for semantic validity in multiple check steps, i.e. it checks whether all ontology constraints are satisfied.

• GRCrawler60: The GoodRelations crawler is a focused Web crawler that extracts structured data in GoodRelations from a given set of Web resources (see Chapter 3).

• Browser extensions and mobile applications: Browser extensions and mobile applications are promising tools because they can effortlessly take into account a user’s context information. A browser extension, for example, has knowledge about the page a user is currently visiting, whereas a mobile application can regard geo-positional information. In the context of GoodRelations, the following extensions were showcased:

– A Firefox extension61 that looks up product data from an RDF store based on some selected text on a Web page (e.g. a product name to compare with products in an RDF store).

– A Google Chrome extension62 that shows the presence of rich GoodRelations markup in the current Web page and notifies the GR-Notify registration service accordingly.

– A mobile application based on a large dataset about points of interest in Ravensburg city63, letting the user explore surrounding restaurants or public institutions that are currently open.

Furthermore, a sophisticated vocabulary documentation system has been established for GoodRelations, which features examples in different syntaxes and considers the social aspect of ontology development, i.e. users can interact with developers using different social media channels like Twitter, Facebook, Google+, Quora, Stack Overflow, or Delicious [cf. Hep11]. The GoodRelations ecosystem also includes an information platform64 (static Web page and Wiki) with collected knowledge and external links to additional material (slides, videos, etc.), as well as a community mailing list accompanied by a corresponding list archive65.

58 http://www.stalsoft.com/gr2rss/ (accessed on May 22, 2014)
59 http://www.ebusiness-unibw.org/tools/goodrelations-validator/ (accessed on May 22, 2014)
60 http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler (accessed on May 22, 2014)
61 https://addons.mozilla.org/en-us/firefox/addon/goodrelations-extension/ (accessed on May 22, 2014)
62 http://www.stalsoft.com/grome (accessed on May 22, 2014)
63 http://wiki.goodrelations-vocabulary.org/Case_studies/Ravensburg (accessed on May 22, 2014)

2.3.6.3 Simple Knowledge Organization System

The Simple Knowledge Organization System (SKOS) is a W3C recommendation and constitutes an OWL Full ontology for representing controlled vocabularies such as classification schemes, taxonomies, thesauri, or subject heading systems [MB09, Section 1.2]. SKOS is complementary to ISO 25964 [Int11]. In contrast with ISO 25964, which gives general advice on how to build a decent thesaurus [Int11] (see Section 2.2.7.1), SKOS deals with the machine-readable representation of thesauri on the Web [MB09, Section 1.2].

SKOS differs from RDFS or OWL in being a data model to informally represent a thesaurus or classification scheme rather than a formal ontology or ontology language [MB09, Section 1.3]. Unlike SKOS structures, ontologies and ontology languages generally define axioms and facts [MB09, Section 1.3]. Knowledge representation languages like RDFS and OWL are further limited to a few very strong semantic relationships, i.e. rdfs:subClassOf and rdfs:subPropertyOf. This means that an instance of class A that is subsumed by class B is also an instance of class B. This does not pose a problem for well-organized hierarchies such as a car being a subclass of a vehicle, but the assertion will not hold e.g. for a garage being subsumed by a class building, even if a real estate broker might disagree66.

SKOS defines elements better suited for representing light-weight KOSs than RDFS or OWL. The main element of SKOS is the notion of a concept (skos:Concept), which is an instance of an OWL class. A SKOS concept is “an idea or notion” [MB09, Section 3.1]. Each concept can be equipped with textual labels. SKOS distinguishes between three types of labels, a preferred one (skos:prefLabel), an arbitrary number of alternative ones (skos:altLabel), and hidden labels (skos:hiddenLabel) [MB09, Sections 5.1 and 5.2]. Hidden labels are supposed not to be visible to the user, but to be considered for indexing purposes instead, e.g. by a digital library or an IR system [MB09, Section 5.1]. Concepts can be hierarchically organized with SKOS by taking advantage of the object properties skos:narrower and skos:broader [MB09, Section 8.1]. These and other additional semantic relationships (among others associative links like skos:related) allow for a more fine-grained description of the relations between concepts [cf. MB09, Section 8.1]. Accordingly, a concept A linked to another concept B via skos:broader does not necessarily imply that A is a specialization of B. So a garage in this case could be modeled as a narrower concept of a building.

64 http://www.goodrelations-vocabulary.org/ (accessed on May 22, 2014)
65 http://ebusiness-unibw.org/pipermail/goodrelations/ (accessed on May 22, 2014)
66 A garage is typically part of a building (meronomy, parthood, or whole-part relationship [cf. MR06, p. 216]), but not a building on its own. On paper, however, both might well be treated similarly.
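The contrast between rdfs:subClassOf and the SKOS hierarchy links can be made concrete with a minimal Python sketch. The dictionaries, the identifiers in the ex: namespace, and both helper functions are hypothetical and serve illustration only; the first helper merely mimics the subclass entailment described above, it is not a reasoner.

```python
# Hypothetical toy data, not taken from any real vocabulary.
SUBCLASS_OF = {"ex:Car": "ex:Vehicle"}      # rdfs:subClassOf: formal subsumption
BROADER = {"ex:garage": "ex:building"}      # skos:broader: informal hierarchy link


def entailed_types(asserted_class, subclass_of):
    """Follow rdfs:subClassOf upwards: an instance of a class is also an
    instance of every superclass (assumes an acyclic hierarchy)."""
    types = [asserted_class]
    while types[-1] in subclass_of:
        types.append(subclass_of[types[-1]])
    return types


def broader_concepts(concept, broader):
    """Walk skos:broader links; this only navigates the hierarchy and
    asserts nothing about class membership."""
    chain = []
    while concept in broader:
        concept = broader[concept]
        chain.append(concept)
    return chain


car_types = entailed_types("ex:Car", SUBCLASS_OF)
garage_context = broader_concepts("ex:garage", BROADER)
print(car_types)
print(garage_context)
```

In this sketch a car instance is entailed to be a vehicle, whereas the garage concept is only filed under the building concept for navigation purposes.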

2.3.7 Query and Rule Languages

The SPARQL Protocol and RDF Query Language (SPARQL) is a SQL67-like query language and a protocol for RDF. It was standardized by a W3C working group and received recommendation status in 2008 [PS08]. In 2013, it was updated to SPARQL 1.1 [HS13], which basically added query federation support and SPARQL Update functionality. SPARQL was not the first attempt to define a query language for RDF. For example, it was preceded by RDF Query Language (RQL), Sesame RDF Query Language (SeRQL), TRIPLE, RDF Data Query Language (RDQL), N3, and Versa [cf. Haa+04].

The query language borrows much of its syntax from the Turtle language for RDF [cf. PC14]. In particular, the basic graph pattern [HS13, Section 5.1] used to query an RDF graph consists of a set of triple patterns with variables, structured the same way as RDF triples [HS13, Section 5.1]. A variable is given by a string of characters introduced with a question mark, e.g. ?var. Every SPARQL query follows the same basic structure, whose components are prefix declarations, query result form, dataset definition, graph pattern, and solution modifiers [cf. HS13]. Some of these constituents are required, others are optional: Prefix declarations may be omitted if an abbreviated syntax with CURIEs is not needed. The dataset definition is not mandatory, because without supplying a named graph URI of a specific RDF subgraph, the query is executed on the default graph (which most often encompasses all named graphs as well). Finally, the solution modifiers, which permit limiting or manipulating the results, are optional as well [cf. HS13].

The SPARQL query language supports four types of queries [HS13, Section 16]:

• SELECT: This query type retrieves a result list by binding values to the variables supplied with the query result clause (the variables in the SELECT clause are called projection; the wildcard symbol “*” (asterisk) can be used to select all variables bound in the graph pattern).

• CONSTRUCT: This query type works like SELECT, but generates triples from a custom triple pattern template that is populated with the retrieved results.

• ASK: This query type returns the boolean value “true” if some data in the dataset matches the given basic graph pattern, otherwise “false”.

• DESCRIBE: This query type returns an RDF graph that describes a particular resource; which triples are included is left to the query processor.

67 SQL is the standard query language for querying relational databases. It has a SELECT-FROM-WHERE structure that the SPARQL query language adopted.

With SPARQL, it is possible to explicitly query named RDF graphs defined as graph URIs. Named graphs are either selected using the dataset definition FROM NAMED in front of the basic graph pattern [HS13, Section 13.2.2], or inside the graph pattern block [HS13, Section 13.3]. The syntax is then as follows:

GRAPH <g> { ... }

It is also possible to make a triple pattern an optional match. This can be done by enclosing the triple pattern within an OPTIONAL clause [HS13, Section 6]. Furthermore, filters on results are possible from within the graph pattern section [HS13, Section 5.2.2].

SPARQL 1.1 incorporated numerous new functionalities that were highly demanded by academia and industry. Among others, it contributed subqueries, the property path feature, variable assignments (BIND and VALUES), better filtering opportunities like FILTER NOT EXISTS ... for negation, and additional aggregate functions like MIN, MAX, or AVG [cf. HS13]. Furthermore, it added support for SPARQL Update queries, for which the most important applications are graph manipulations [GPP13, Section 3.1]: (1) Insert data into the graph; (2) delete data from the graph; (3) insert and/or delete certain data if the specified graph pattern matches; (4) load data from a Web resource into the graph; and, (5) empty an entire graph. SPARQL-1.1-capable endpoints support federated queries by virtue of the SERVICE keyword [PB13, Section 2]. Federated queries permit delegating portions of a query to remote SPARQL endpoints and combining the results locally afterwards [PB13; Bui+13].

Recall the ongoing ice cream example. Listing 2.9 shows a corresponding SPARQL query. It combines many of the SPARQL features presented herein. In particular,

• two prefix bindings are provided (lines 1–2);

• the query form is of type SELECT (line 5);

• the solution modifiers tell the query to remove duplicate results (keyword DISTINCT on line 5), to rank the results by the price (line 12), and to limit the returned result set to ten results (line 13);


     1  PREFIX gr: <http://purl.org/goodrelations/v1#>
     2  PREFIX ex: <http://www.example.com/#>
     3
     4  # query name, price, and currency of offer ex:OfferIcecream
     5  SELECT DISTINCT ?name ?price ?currency {
     6    ex:OfferIcecream
     7      gr:hasPriceSpecification [
     8        gr:hasCurrency ?currency ;
     9        gr:hasCurrencyValue ?price ] ;
    10      gr:name ?name .
    11  }
    12  ORDER BY ?price
    13  LIMIT 10

Listing 2.9: Example query in SPARQL

• in the projection three variables are selected, namely name, price, and currency (line 5);

• abbreviated Turtle syntax is used inside the basic graph pattern block (lines 6–10); and

• values are bound to variables for the name of the offer, its price and the related currency (lines 8–10).

The graph pattern used in the query can also be visualized as a graph (see Figure 2.14). A side-by-side comparison of the RDF graph (see Figure 2.14a) and the corresponding graph for the SPARQL query (see Figure 2.14b) reveals what values will be bound to variables when the query is executed.

[Figure 2.14: Side-by-side comparison of RDF graph and SPARQL graph. (a) RDF graph (see Figure 2.9): ex:OfferIcecream has rdf:type gr:Offering, gr:hasBusinessFunction gr:Sell, gr:name "Scoop of ice cream"@en, and gr:hasPriceSpecification pointing to a node of rdf:type gr:UnitPriceSpecification with gr:hasCurrency "EUR"^^xsd:string and gr:hasCurrencyValue "1.10"^^xsd:float. (b) SPARQL graph pattern: ex:OfferIcecream with gr:name ?name and gr:hasPriceSpecification pointing to a node with gr:hasCurrency ?currency and gr:hasCurrencyValue ?price.]
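How the graph pattern of Figure 2.14b is matched against the RDF graph of Figure 2.14a can be traced with a small Python sketch. It is a simplified illustration and not how a SPARQL engine is implemented: the functions are hypothetical, literals are reduced to plain Python values, the blank node of the price specification is written as "_:price", and the bracket syntax of the query is replaced by the helper variable ?spec.

```python
# Toy RDF graph for the ice cream offer (simplified literals, assumed blank node label).
GRAPH = {
    ("ex:OfferIcecream", "rdf:type", "gr:Offering"),
    ("ex:OfferIcecream", "gr:hasBusinessFunction", "gr:Sell"),
    ("ex:OfferIcecream", "gr:name", "Scoop of ice cream"),
    ("ex:OfferIcecream", "gr:hasPriceSpecification", "_:price"),
    ("_:price", "rdf:type", "gr:UnitPriceSpecification"),
    ("_:price", "gr:hasCurrency", "EUR"),
    ("_:price", "gr:hasCurrencyValue", 1.10),
}

# Basic graph pattern: a list of triple patterns with variables.
PATTERNS = [
    ("ex:OfferIcecream", "gr:hasPriceSpecification", "?spec"),
    ("?spec", "gr:hasCurrency", "?currency"),
    ("?spec", "gr:hasCurrencyValue", "?price"),
    ("ex:OfferIcecream", "gr:name", "?name"),
]


def is_variable(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Return the binding extended so that pattern equals triple, or None."""
    extended = dict(binding)
    for pattern_term, triple_term in zip(pattern, triple):
        if is_variable(pattern_term):
            if extended.setdefault(pattern_term, triple_term) != triple_term:
                return None  # variable already bound to a different value
        elif pattern_term != triple_term:
            return None      # constant term does not match
    return extended


def evaluate(patterns, graph):
    """Join the matches of all triple patterns into solution bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


solutions = evaluate(PATTERNS, GRAPH)
# Projection: keep only the variables named in the SELECT clause.
rows = [(s["?name"], s["?price"], s["?currency"]) for s in solutions]
print(rows)
```

The sketch yields a single solution, in which ?name, ?price, and ?currency are bound to the name, the price, and the currency of the offer.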


Rule languages allow specifying rules that otherwise would be impossible, or at least difficult, to express with OWL terminology. In logics, rules consist of an antecedent (or body) and a consequent (or head) [e.g. Hor+04, Section 1; GR98, p. 29; AvH08, p. 162]. In other words, a conclusion C is drawn from a set of premises P [cf. HR04, p. 5]. A rule consisting of a set of n premises and a conclusion can be formally represented as follows:

$$P_1, \ldots, P_n \rightarrow C \tag{2.3}$$

On the Semantic Web, this functionality is offered by rule languages like the Semantic Web Rule Language (SWRL) [Hor+04]. Due to the vast number of available rule languages and engines (e.g. Datalog, Rule Markup Language (RuleML), SWRL), the W3C chartered a working group that elaborated a format with a set of dialects charged with the interchange of rules across various rule languages, the Rule Interchange Format (RIF) [Bol+07]. Yet, even with SPARQL it is possible to execute a number of logical rules via SPARQL CONSTRUCT queries [e.g. AH11, pp. 88f., pp. 115f.]. SPARQL Inferencing Notation (SPIN)68, for example, avails itself of this capability and defines inference rules based on SPARQL CONSTRUCT query forms [AH11, p. 116].
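The idea of deriving a conclusion from a set of premises, much like a CONSTRUCT query populates a triple template, can be sketched in a few lines of Python. This is a deliberately restricted illustration: all premises share a single subject variable, the function apply_rule is hypothetical, and the resources ex:OfferCone and ex:SalesOffer are invented for the example.

```python
# Toy graph; ex:OfferCone is an invented second offer without a business function.
GRAPH = {
    ("ex:OfferIcecream", "rdf:type", "gr:Offering"),
    ("ex:OfferIcecream", "gr:hasBusinessFunction", "gr:Sell"),
    ("ex:OfferCone", "rdf:type", "gr:Offering"),
}


def apply_rule(graph, premises, conclusion):
    """Apply a rule P1, ..., Pn -> C for one subject variable ?x.

    premises:   list of (predicate, object) pairs that must all hold for ?x
    conclusion: (predicate, object) pair that is derived for ?x
    """
    subjects = {s for s, _, _ in graph}
    derived = set()
    for x in subjects:
        if all((x, p, o) in graph for p, o in premises):
            derived.add((x, conclusion[0], conclusion[1]))
    return derived


# Assumed rule: every gr:Offering with business function gr:Sell is an ex:SalesOffer.
derived = apply_rule(
    GRAPH,
    premises=[("rdf:type", "gr:Offering"), ("gr:hasBusinessFunction", "gr:Sell")],
    conclusion=("rdf:type", "ex:SalesOffer"),
)
print(derived)
```

Only ex:OfferIcecream satisfies both premises, so only one new triple is derived.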

2.3.8 Storage and Reasoning

In this section, we address the technologies to store and retrieve RDF data. In particular, we delineate the notions of a triple store and of a SPARQL endpoint. Then, we discuss the most relevant details and idiosyncrasies about reasoning over RDF data on the Semantic Web.

2.3.8.1 RDF Stores

An RDF store is a system that “allows storage of RDF data and schema information, and provides methods to access that information” [HBS09, p. 490]. Its essential components are a repository for the storage and a middleware for the access of the data [HBS09, p. 490]. Triple stores and quad stores are sometimes used as alternative terms for RDF stores, even though strictly speaking they describe specific types of RDF stores, i.e. RDF engines that store only triples or triples with context information (graph name), known as quads [cf. HBS09, p. 494].

Most of the RDF stores available today can be classified as either native RDF stores, DBMS-backed stores, or RDF wrappers [e.g. Has+11; FCB12]:

68 http://spinrdf.org/ (accessed on May 25, 2014)

• A native store implements an RDF-compliant storage layout, usually either persistent or in-memory [FCB12].

• A DBMS-backed store maps RDF triples to relational database tables while taking advantage of existing and well-proven database technologies. There are three common ways of storing RDF triples in relational databases, namely using vertical tables, property tables, and horizontal tables [SN10].

• An RDF wrapper creates an RDF view over otherwise RDF-agnostic data sources. Typically, such RDF views are created from structured relational data or made accessible via Web APIs. Hence, this kind of RDF repository provides read access only [Has+11]. A tool for generating such RDF views over relational databases is D2RQ69.
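As an illustration of the DBMS-backed approach, the following Python sketch stores triples in a single three-column relational table, which is assumed here to correspond to the vertical table layout, and answers the ice cream query by self-joins. The table name, the column names, and the simplified literal values are assumptions made for this example; real stores typically add indexes and dictionary encoding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per triple: subject, predicate, object.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:OfferIcecream", "rdf:type", "gr:Offering"),
        ("ex:OfferIcecream", "gr:name", "Scoop of ice cream"),
        ("ex:OfferIcecream", "gr:hasPriceSpecification", "_:price"),
        ("_:price", "gr:hasCurrency", "EUR"),
        ("_:price", "gr:hasCurrencyValue", "1.10"),
    ],
)

# Every triple pattern of the graph pattern becomes one alias of the table;
# shared variables turn into join conditions.
rows = conn.execute(
    """
    SELECT n.o AS name, v.o AS price, c.o AS currency
    FROM triples AS spec
    JOIN triples AS c ON c.s = spec.o AND c.p = 'gr:hasCurrency'
    JOIN triples AS v ON v.s = spec.o AND v.p = 'gr:hasCurrencyValue'
    JOIN triples AS n ON n.s = spec.s AND n.p = 'gr:name'
    WHERE spec.s = 'ex:OfferIcecream'
      AND spec.p = 'gr:hasPriceSpecification'
    """
).fetchall()
print(rows)
```

The number of self-joins grows with the number of triple patterns, which is one reason why alternative layouts such as property tables exist.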

2.3.8.2 SPARQL Endpoints

SPARQL endpoints complement RDF stores with support for SPARQL queries over HTTP.

“SPARQL endpoints are RESTful services that accept queries over HTTP written in the SPARQL language [...] adhering to the SPARQL protocol [...]” [Bui+13]

According to this definition, the capabilities of SPARQL endpoints are determined by the SPARQL query language, and by the SPARQL protocol definition. The protocol definition standardizes the communication with SPARQL endpoints, namely by specifying the messages that every endpoint shall understand, the URI pattern every endpoint should adhere to, and the supported HTTP request methods [cf. Fei+13], i.e. GET and POST and, for certain SPARQL Update operations, PUT and DELETE.
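The following Python sketch shows how a query could be wrapped into HTTP requests according to the SPARQL protocol, once via GET and once via POST with a URL-encoded body. The endpoint URI is a hypothetical placeholder, and the requests are only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://www.example.com/sparql"  # hypothetical endpoint URI
QUERY = "ASK { ?s ?p ?o }"
ACCEPT = {"Accept": "application/sparql-results+json"}

# Query via GET: the query string travels in the "query" URL parameter.
get_request = Request(ENDPOINT + "?" + urlencode({"query": QUERY}), headers=ACCEPT)

# Query via POST: the same parameter is sent URL-encoded in the request body.
post_request = Request(
    ENDPOINT,
    data=urlencode({"query": QUERY}).encode("utf-8"),
    headers={"Content-Type": "application/x-www-form-urlencoded", **ACCEPT},
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either request object to urllib.request.urlopen would submit it; this step is omitted because the endpoint above does not exist.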

2.3.8.3 Reasoning

Reasoning, or inferencing, is a common term for the task of inferring new, implicit facts from existing information.

“In the context of the Semantic Web, inferencing simply means that given some stated information, we can determine other, related information that we can also consider as if it had been stated.” [AH11, p. 114]

69 http://d2rq.org/ (accessed on May 26, 2014)


Many RDF stores include reasoning capabilities [FCB12]. Some only support basic inferences over RDFS axioms (e.g. rdfs:subClassOf), while others implement very sophisticated reasoning including OWL axioms (e.g. owl:sameAs) and custom rules (e.g. SWRL rules)70 (see Section 2.3.7). For an overview of reasoning in semantic repositories, see [KD11, pp. 245–257].

To infer additional knowledge, rule-based reasoners generally apply one of the two inferencing strategies [e.g. KD11, p. 249]:

• Forward-chaining concludes the goal from (a series of) facts. Thus, it concludes from antecedent to consequent. It is further known as data-driven or bottom-up reasoning [GR98, p. 145; cf. KD11, p. 249]. This is also known as materialization [e.g. KD11, p. 249].

• Backward-chaining starts from the goal (hypothesis) and tries to derive assertions that hold true, eventually drawing on supporting facts (evidence). Hence, it concludes from consequent to antecedent. Backward-chaining is also called goal-driven or top-down reasoning [GR98, pp. 145f.; cf. KD11, p. 249]. A reasoner with a backward-chaining strategy fundamentally performs query rewriting [e.g. KD11, p. 249].
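Both strategies can be contrasted with a minimal Python sketch restricted to the rdfs:subClassOf rule. The data in the ex: namespace and the two functions are hypothetical, and the class hierarchy is assumed to be acyclic; production reasoners handle far more rules and cyclic hierarchies.

```python
# Hypothetical toy data: (subclass, superclass) and (instance, class) pairs.
SUBCLASS_OF = {("ex:Car", "ex:Vehicle"), ("ex:Vehicle", "ex:Product")}
TYPES = {("ex:myCar", "ex:Car")}


def forward_chain(types, subclass_of):
    """Data-driven: materialize every entailed rdf:type fact until a fixpoint."""
    closure = set(types)
    changed = True
    while changed:
        changed = False
        for instance, cls in list(closure):
            for sub, sup in subclass_of:
                if sub == cls and (instance, sup) not in closure:
                    closure.add((instance, sup))
                    changed = True
    return closure


def backward_chain(instance, goal_cls, types, subclass_of):
    """Goal-driven: rewrite the goal into goals over the subclasses and
    check them against the stated facts only."""
    if (instance, goal_cls) in types:
        return True
    return any(
        backward_chain(instance, sub, types, subclass_of)
        for sub, sup in subclass_of
        if sup == goal_cls
    )


materialized = forward_chain(TYPES, SUBCLASS_OF)
is_product = backward_chain("ex:myCar", "ex:Product", TYPES, SUBCLASS_OF)
print(sorted(materialized))
print(is_product)
```

Forward-chaining pays the inference cost once and stores the result, whereas backward-chaining leaves the data untouched and pays at query time.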

A more formal treatment of reasoning in description logics can be found in [MH09]; an overview of reasoning with the Web Ontology Language (OWL) is given in [HP11].

2.3.8.4 Open-World and Closed-World Assumptions

Many software systems, specifically database systems, are based on the premise that their data is complete and that everything not known to the system is deemed false. This principle, where the absence of information is regarded as negative information, is known as the closed-world assumption (CWA) [e.g. AvH08, p. 151; DFH11, p. 21]. CWA is described by the situation where a statement is regarded false when it is not supported by another statement, i.e. “[i]f a fact is not evaluated to be true [...], it is assumed to be false” [DFH11, p. 21].

By contrast, the Semantic Web is based on the open-world assumption (OWA) [AvH08, p. 151; DFH11, p. 21], which regards that in an open environment like the Web it is unrealistic to have full control over all the data. In such an environment it is possible that a dataset makes assertions about another dataset without the second taking note of it. In the simplest case, a person claims to be a friend of a second person. Unless the dataset with this claim is confirmed by an RDF store, it would be wrong to assume this claim to be false just based on the fact that it is not there. OWA thus implies that “a statement cannot be assumed true on the basis of a failure to prove it” [AvH08, p. 151].

70 Stardog, for instance, implements SWRL rules: http://docs.stardog.com/owl2/ (accessed on May 26, 2014)

OWA as a characteristic of the Semantic Web has significant practical impact. Simply put, absent information like “noisy street next to the hotel” does not mean that there is indeed no highway next to it. This information could simply be missing, either because the assertion was omitted or because it was not detected for it was made available from an unknown information source.
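The practical difference between the two assumptions can be sketched with two small Python functions. They are hypothetical helpers over an invented fact base; the point is merely that a missing fact evaluates to false under CWA but to unknown (None) under OWA, unless its negation is explicitly stated.

```python
# Invented fact base: nothing is stated about a noisy street next to the hotel.
FACTS = {("ex:Hotel", "ex:locatedNextTo", "ex:Park")}
NOISY_STREET = ("ex:Hotel", "ex:locatedNextTo", "ex:NoisyStreet")


def holds_cwa(fact, facts):
    """Closed world: whatever is not known to be true is taken to be false."""
    return fact in facts


def holds_owa(fact, facts, negated_facts=frozenset()):
    """Open world: True if stated, False only if the negation is stated,
    otherwise None (unknown)."""
    if fact in facts:
        return True
    if fact in negated_facts:
        return False
    return None


print(holds_cwa(NOISY_STREET, FACTS))
print(holds_owa(NOISY_STREET, FACTS))
```

Under CWA the hotel would be reported as not lying next to a noisy street, while under OWA the question remains open.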

As a potential solution to OWA on the Semantic Web, Polleres, Feier, and Harth proposed contextually scoped negation [PFH06]. The authors’ idea was to allow scoped negation, i.e. to define rules that are valid only in a particular context (with respect to a specific rule base).

2.3.8.5 Non-Unique-Names Assumption

A further peculiarity of the Semantic Web is that in OWL, without explicit statements like owl:sameAs or owl:differentFrom, it cannot be assumed that individuals with different names indeed represent different entities. In other words, OWL does not make the unique-names assumption (UNA) [cf. AvH08, p. 151]. In short, the absence of UNA implies that unless stated otherwise, one must always expect that people could use different terminology on the Web for referring to the same thing.
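A minimal Python sketch illustrates the consequence for identity checks. The function same_entity and the identifiers are hypothetical; the sketch only considers explicit owl:sameAs and owl:differentFrom statements (with owl:sameAs treated as an equivalence relation) and ignores all other sources of identity inference.

```python
def find(parent, x):
    """Return the representative of x in a simple union-find structure."""
    while parent.setdefault(x, x) != x:
        x = parent[x]
    return x


def same_entity(a, b, same_as, different_from):
    """True or False if decided by explicit statements, None if undecided."""
    parent = {}
    for x, y in same_as:  # merge names connected via owl:sameAs
        parent[find(parent, x)] = find(parent, y)
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a == root_b:
        return True
    for x, y in different_from:
        if {find(parent, x), find(parent, y)} == {root_a, root_b}:
            return False
    return None  # different names alone do not imply different entities


SAME_AS = [("ex:IceCream", "shop:Eis")]
DIFFERENT_FROM = [("ex:IceCream", "ex:Sorbet")]

print(same_entity("ex:IceCream", "shop:Eis", SAME_AS, DIFFERENT_FROM))
print(same_entity("shop:Eis", "ex:Sorbet", SAME_AS, DIFFERENT_FROM))
print(same_entity("ex:IceCream", "ex:Gelato", SAME_AS, DIFFERENT_FROM))
```

A system built on the unique-names assumption would answer the last question with false, whereas here it stays undecided.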

2.4 Semantic Data Interoperability

A consolidated view on the data is crucial for querying large data sets like the Semantic Web. To accomplish this goal, a considerable number of heterogeneous data sources need to be integrated.

The Semantic Web and Linked Data already made a significant step towards global data integration on the Web [cf. Gan+11, pp. 138f.]. The two design decisions of declaring a standard data model (RDF) and minting globally unique identifiers (URIs or IRIs) for conceptual entities greatly facilitate the interlinking of datasets. Yet, it still proves challenging to establish an integrated view over structured data on the Semantic Web, among others because of the variety of syntaxes for RDF71, the multitude of conceptual models (even for describing similar things), the different levels of data granularity, and the lack of common standards.

In this section, we discuss the main topics linked to the problem of semantic data interoperability that, although being an old and well-known research problem, is still far from being solved [ZD04; Hal05].

2.4.1 Data Integration and Heterogeneity

Data integration is “a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources” [HRO06]. The need for data integration emerges whenever different data sources, such as information systems within enterprises, need to be consolidated in order to create a uniform view on the data [ZD04; cf. HRO06; Len02].

The complexity of the data integration task increases as more heterogeneous data sources need to be merged [DHI12, p. 6; cf. Fen+01]. Some of these data sources are structured, having a well-defined schema (e.g. relational databases), while others are semi-structured (e.g. XML, HTML) or unstructured (e.g. plain, unorganized text) [DHI12, p. 6]. Because schemas are often developed independently, it can further occur that similar things are described differently, or that the level of detail differs across descriptions although relying on the same schema.

The heterogeneity of information manifests itself in a number of ways. Sheth [She99], for example, outlined three classes of information heterogeneity, namely syntactic, structural, and semantic heterogeneity. Similarly, Euzenat and Shvaiko consider syntactic (e.g. ontology language mismatch), terminological (e.g. product versus article), conceptual (intended meaning), and semiotic heterogeneity (subjective interpretation) [cf. ES07, pp. 40–42]. These differences in data quality and granularity pose serious challenges to data integration both for databases and the Web.

In general, data integration can take place at two conceptual levels, i.e. at schema and instance level. Subsequently, we address both cases, i.e. integration at schema level in the form of schema and ontology alignment, and integration at instance level by means of data or instance matching.

71 Although the variety of syntaxes does not pose a huge problem, because most syntaxes are designed around the RDF model.


2.4.2 Schema and Ontology Matching

Schema and ontology matching constitute approaches for finding mappings between two schemas [e.g. RB01] or ontologies [e.g. SE13] based on similarity measures. While schema matching is a branch of database research, ontology matching describes the respective counterpart for the Semantic Web.

2.4.2.1 Schema Matching

Schema matching deals with identifying correspondences between different database schemas [RB01]. A match “takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other” [RB01].

Rahm and Bernstein [RB01] elaborated a categorization of schema matching approaches, as illustrated in Figure 2.15. Their three main distinctions of schema matching approaches are as follows [RB01]:

• Schema-level versus instance-level: Besides those matches that consider schema information only, also schema matches relying on instance data are possible.

• Element-level versus structure-level: Matches can span one or multiple elements.

• Linguistic versus constraint-based: Computing matches based on linguistic features only or by taking into account restrictions like data types, integrity constraints, etc.

[Figure 2.15: Taxonomy of schema matching approaches [from RB01]. The figure divides schema matching approaches into individual matcher approaches (schema-only or instance/contents-based; element-level or structure-level; linguistic or constraint-based) and combining matchers (hybrid or composite, the latter with manual or automatic composition). Sample approaches named in the figure are name similarity, description similarity, global namespaces, type similarity, key properties, graph matching, IR techniques (word frequencies, key terms), and value pattern and ranges. Further criteria: match cardinality, auxiliary information.]

Furthermore, it is possible to rely not only on an individual but on multiple matching criteria (hybrid matcher) or to combine results of several matching algorithms (composite matcher) [RB01]. Many schema matching approaches also take advantage of additional input in the form of auxiliary information like dictionaries, thesauri, or previously computed mappings (e.g. to reapply a concatenation of first name and last name developed for the author name to the editor name of a book) [RB01]. Moreover, the mappings are not always simple one-to-one mappings but may also involve multiple schema elements, known as one-to-many and many-to-many mappings, respectively [DHI12, pp. 123f.; RB01]. For example, a simple one-to-one mapping is

Book.author = Author.name

whereas a more complex one-to-many mapping would be

Location.address = compose_address(Place.city, Place.zip, Place.street)
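The two mappings can be executed on a sample record as sketched below in Python. The record values are invented, and the body of compose_address is an assumption, since the text only names the function; the helper apply_mappings is hypothetical as well.

```python
def compose_address(city, zip_code, street):
    """Assumed composition rule: 'street, zip city'."""
    return f"{street}, {zip_code} {city}"


# Invented source record, keyed by the source schema elements.
source = {
    "Author.name": "Jane Doe",
    "Place.city": "Munich",
    "Place.zip": "80331",
    "Place.street": "Marienplatz 1",
}

# Each target element maps to a function and the source elements it consumes.
MAPPINGS = {
    "Book.author": (lambda name: name, ["Author.name"]),
    "Location.address": (compose_address, ["Place.city", "Place.zip", "Place.street"]),
}


def apply_mappings(record, mappings):
    """Compute all target elements from the source record."""
    return {
        target: function(*(record[element] for element in elements))
        for target, (function, elements) in mappings.items()
    }


target = apply_mappings(source, MAPPINGS)
print(target)
```

The one-to-one mapping copies a single value, whereas the second mapping needs all three source elements before the target value can be computed.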

Another categorization of matching techniques distinguishes between rule-based and learning-based solutions [DH05]. Rule-based solutions define rules based on schema information, which makes them straightforward and fast. As a downside, rule-based solutions do not exploit valuable information from instance data, nor do they include past matches. Learning-based approaches aim at addressing these drawbacks.
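A rule-based, schema-only matcher that combines a linguistic criterion (name similarity) with a constraint-based criterion (equal data types) might look like the following Python sketch. It is an illustrative hybrid matcher under simplifying assumptions, not one of the systems surveyed in [RB01]; the schemas, the threshold, and the function names are invented, and difflib's SequenceMatcher stands in for a proper string similarity measure.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Linguistic criterion: similarity of element names in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(schema_a, schema_b, threshold=0.6):
    """Return (element_a, element_b, score) for the best candidate per element.

    A schema is a dict mapping element names to data types.
    """
    matches = []
    for name_a, type_a in schema_a.items():
        best = None
        for name_b, type_b in schema_b.items():
            if type_a != type_b:  # constraint-based criterion: types must agree
                continue
            score = name_similarity(name_a, name_b)
            if score >= threshold and (best is None or score > best[2]):
                best = (name_a, name_b, score)
        if best is not None:
            matches.append(best)
    return matches


schema_a = {"author_name": "text", "price": "float"}
schema_b = {"authorname": "text", "price_eur": "float", "name": "int"}

matches = match_schemas(schema_a, schema_b)
print([(a, b, round(score, 2)) for a, b, score in matches])
```

Because only schema information is inspected, the sketch is fast but would miss correspondences whose names differ completely, which is the drawback that learning-based approaches try to address.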

2.4.2.2 Ontology Matching

Ontology matching is a related discipline to schema matching, for it applies established techniques from database schema matching to ontologies [cf. ES07, p. 61, p. 63; Cas+11]. Moreover, it extends schema matching by novel approaches peculiar to ontologies, e.g. reasoning-based matching that relies on ontologies [e.g. Cas+11].

The field of ontology matching is sometimes also referred to as ontology alignment [cf. Ehr07]. The variety of terms for ontology matching and their inconsistent use among scientific works is sometimes confusing. The authors in [Ehr07, pp. 23f.; ES07, p. 42] thus provided a terminological distinction. In the following, we briefly summarize the key terms in the context of ontology matching as found in [ES07, pp. 42f.]:

• A correspondence is the relationship that holds between concepts of different ontologies according to a matching algorithm [ES07, p. 42].

• Matching is the process of finding correspondences [ES07, p. 42].

• An alignment is a set of correspondences [ES07, p. 42].

• A mapping is the directed counterpart of an alignment [ES07, pp. 42f.].


• Merging is the creation of a new ontology from two others [ES07, p. 43].

• Integration is the inclusion of one ontology into another [ES07, p. 43].

The ontology matching process is to produce an alignment from two ontologies [e.g. Cas+11]. We can formalize the structure of the ontology matching process using the following equation [ES07, p. 44]:

$$A' = f(o, o', A, p, r) \tag{2.4}$$

The matching process is given by a function f that creates an output alignment A' from the following input parameters: Two ontologies o and o', an existing alignment A that needs to be extended, some relevant parameters p for the match operation (e.g. similarity thresholds), and additional resources r for inclusion (e.g. external knowledge bases and thesauri) [ES07, p. 44]. Figure 2.16 provides a graphical representation of the ontology matching process taken from [ES07, p. 44]. Of course, this matching process provides only the baseline for more advanced matching strategies, among others the combination of several matchers or similarities, learning strategies from instance data, probabilistic methods, or user involvement [ES07, p. 117].

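[Figure 2.16: Ontology matching process [from ES07, p. 44]. The matching operation takes the ontologies o and o', the input alignment A, parameters, and resources, and produces the output alignment A'.]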
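The signature of the matching function in Equation 2.4 can be mirrored in a short Python sketch. The representation of ontologies as dictionaries of labels, the single threshold parameter, the synonym dictionary used as resource, and the label-based similarity are all simplifying assumptions made for illustration; they are not prescribed by [ES07].

```python
from difflib import SequenceMatcher


def match(o, o_prime, A, p, r):
    """Return A' = f(o, o', A, p, r).

    o, o_prime: ontologies as dicts mapping concept identifiers to labels
    A:          input alignment, a set of (concept, concept', relation) triples
    p:          parameters, here {"threshold": float}
    r:          resources, here a dict mapping synonyms to a canonical term
    """
    already_aligned = {concept for concept, _, _ in A}
    A_prime = set(A)  # the input alignment is extended, not replaced
    for concept, label in o.items():
        if concept in already_aligned:
            continue
        label = r.get(label.lower(), label.lower())
        for concept2, label2 in o_prime.items():
            label2 = r.get(label2.lower(), label2.lower())
            if SequenceMatcher(None, label, label2).ratio() >= p["threshold"]:
                A_prime.add((concept, concept2, "="))
    return A_prime


# Invented toy ontologies and inputs.
o = {"o:Article": "Article", "o:Price": "Price"}
o_prime = {"p:Product": "Product", "p:Cost": "Price"}
A = {("o:Price", "p:Cost", "=")}
p = {"threshold": 0.9}
r = {"article": "product"}  # thesaurus entry: article is a synonym of product

A_prime = match(o, o_prime, A, p, r)
print(sorted(A_prime))
```

Here the existing correspondence for the price concepts is kept, and the thesaurus allows the article and product concepts to be aligned although their labels differ.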

As already mentioned, ontology matching has produced several matching techniques, a large part of which originates from schema matching. Euzenat and Shvaiko thus extended the taxonomy of automatic schema matching approaches, as presented in Figure 2.15, with further categories and aspects that apply to ontology matching [ES07, p. 65]. The resulting classification of matching techniques is given in Figure 2.17. Relevant changes to Figure 2.15 are highlighted in bold font face and/or emphasized with shaded background color.

The biggest difference to Figure 2.15 is the layered organization of the classification (three layers). The taxonomy is read either top-down or bottom-up towards the middle. The upper layer distinguishes first by the granularity of the input, followed by the interpretation of the input. The bottom level makes the distinction based on the kind of input used by the matching approaches [ES07, p. 64]. Compared to schema matching approaches, the auxiliary information is now encoded in the form of alignments that could be reused, ontologized thesauri and knowledge bases like WordNet RDF72 (derived from WordNet [Mil95]) or DBpedia73 (derived from Wikipedia), and upper-level and domain-specific ontologies. Furthermore, semantic characteristics can be taken advantage of, e.g. via model-based reasoning (see Figure 2.17).

[Figure 2.17: Classification of matching techniques for ontology matching [from ES07, p. 65]. Upper layer (granularity/input interpretation): element-level (syntactic, external) and structure-level (syntactic, external, semantic). Middle layer (basic techniques): string-based (name similarity, description similarity, global namespace); language-based (tokenisation, lemmatisation, morphology, elimination); linguistic resources (lexicons, thesauri); constraint-based (type similarity, key properties); alignment reuse (entire schema or ontology, fragments); upper-level and domain-specific ontologies (SUMO, DOLCE, FMA); data analysis and statistics (frequency distribution); graph-based (graph homomorphism, path, children, leaves); taxonomy-based (taxonomy structure); repository of structures (structure metadata); model-based (SAT solvers, DL reasoners). Bottom layer (kind of input), labels as recovered: terminological, linguistic, structural, internal, relational, extensional, semantic.]

An important question that arises is how to represent alignments that could be detected between ontologies. First of all, it depends on the kind of relationship that holds between the mapped concepts. The simplest relationship is equivalence, which could be represented using an owl:equivalentClass property from OWL [ES07, p. 45]. But also other axioms like disjointness or specialization could be valid mappings. E.g., a concept cat in one ontology could be subsumed by a concept pet in a second ontology. Similarly, alignments could carry additional information about a relationship, such as the confidence that it holds [ES07, p. 46].

2.4.3 Data and Instance Matching

Data matching and instance matching are two terms for the same problem and the counterparts of schema and ontology matching. Doan, Halevy, and Ives [DHI12] define data matching as “the problem of finding structured data items that describe the same real-world entity” [DHI12, p. 173]. Other terms referring to the same task can sometimes be found in database, artificial intelligence (AI), and Web literature, namely record linkage, tuple deduplication, duplicate identification, entity consolidation, co-reference resolution, object matching, and link discovery [cf. DH05; Cas+11]. In the following, we stick to the term instance matching.

Instance matching copes with the problem of duplicate representations of identical objects. This problem often arises when the same real-world entities appear in heterogeneous data sources, especially in multiple databases [cf. DHI12, p. 173] or in different documents on the Web [cf. Cas+11].

Let us consider a data integration example of two databases: The synchronization of two customer databases reveals customer entries for “James Doe” and “Jim Doe”. “Jim” is a common shorthand for “James”. A closer investigation of the two customer entries reveals that they have the same date of birth, which leads us to conclude that they most likely refer to one and the same customer.

72 http://wordnet-rdf.princeton.edu/ (accessed on June 4, 2014)
73 http://dbpedia.org/ (accessed on May 12, 2014)

Several techniques have been investigated to identify whether two representations match. We are not going to cover them in detail here, but the most prominent approaches for instance matching encompass rule-based, learning-based, clustering-based, probabilistic, and collective solutions [cf. DHI12, p. 174].

On the Semantic Web, instance matching tries to reconcile individuals or instances represented by URIs. To consolidate these entities, the owl:sameAs property has the same function for instances as the owl:equivalentClass property has for matching classes [Hog+10]. The Silk framework is a tool that employs instance matching techniques to discover links between entity pairs on the Web of Data [cf. Vol+09]. To achieve this, it offers a declarative language to specify the data sources and SPARQL queries for data retrieval, the type of the resulting links between entities, as well as conditions for these links in the form of similarity metrics whose scores may be weighted and aggregated [Vol+09]. Based on the resulting scores, the links are either created or not [Vol+09]. The supported similarity metrics used for discovering the links are various string similarity metrics (including exact string matches); similarities between numerals, dates, and URIs; distance metrics between concepts in a taxonomy; and set comparison [Vol+09].
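The idea of weighting and aggregating similarity scores and then deciding on a link can be illustrated with a minimal Python sketch. It does not use Silk or its declarative language; the property names, weights, threshold, example URIs, and the use of difflib as string metric are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def string_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a, b):
    """1.0 for identical values, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Hypothetical link condition: (property, similarity metric, weight).
CONDITION = [
    ("name", string_similarity, 0.6),
    ("birth_date", exact_match, 0.4),
]


def link_score(entity_a, entity_b):
    """Weighted average of the per-property similarity scores."""
    total_weight = sum(weight for _, _, weight in CONDITION)
    weighted = sum(
        weight * metric(entity_a[prop], entity_b[prop])
        for prop, metric, weight in CONDITION
    )
    return weighted / total_weight


def discover_links(source, target, threshold=0.8):
    """Emit an owl:sameAs triple for every pair scoring at or above the threshold."""
    links = []
    for uri_a, entity_a in source.items():
        for uri_b, entity_b in target.items():
            if link_score(entity_a, entity_b) >= threshold:
                links.append((uri_a, OWL_SAME_AS, uri_b))
    return links


# Made-up example data (example.org URIs are placeholders).
SOURCE = {
    "http://example.org/a/1": {"name": "James Doe", "birth_date": "1980-05-17"},
}
TARGET = {
    "http://example.org/b/7": {"name": "Jim Doe", "birth_date": "1980-05-17"},
    "http://example.org/b/8": {"name": "Jane Roe", "birth_date": "1975-01-02"},
}

LINKS = discover_links(SOURCE, TARGET)
for triple in LINKS:
    print(triple)
```

With these assumed weights, the matching date of birth compensates for the imperfect name match, so only the pair of “James Doe” and “Jim Doe” is linked.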

2.4.4 String Matching

Doan, Halevy, and Ives [DHI12] define string matching as “the problem of finding strings that refer to the same real-world entity” [DHI12, p. 95]. String matching is central to many data integration tasks, including schema, ontology, or instance matching [cf. DHI12, p. 95].

Two strings can be matched using a string similarity measure [cf. DHI12, p. 95]. Cohen, Ravikumar, and Fienberg [CRF03] provide a comparison of various string distance metrics ranging from simple edit distance metrics over token-based to hybrid forms. Their comprehensive list of considered string distance metrics includes among others Levenshtein, Jaro, Jaro-Winkler, Jaccard, Jensen-Shannon, or TF-IDF metrics [CRF03]. In the following, we describe the Levenshtein string distance [Lev66] in more detail, a popular yet very simple edit distance metric. The Levenshtein string distance is the minimum number of character edits (insertions, deletions, and substitutions) required to transform one sequence of characters into another. Returning to our previous example with the name mismatch between two customer entries, the edit distance between “James Doe” and “Jim Doe” is three (i.e. one character replacement and two deletions).

A similarity score is computed as one minus the distance between two objects o1 and o2 (provided that the distance is calculated as a normalized value in the range [0, 1]) [cf. BR11, p. 223]:

$$\mathrm{sim}(o_1, o_2) = 1 - d(o_1, o_2) \tag{2.5}$$

This means for the Levenshtein string distance that the similarity can be computed using a normalized string distance [cf. DHI12, p. 97] as follows:

$$\mathrm{sim}(s_1, s_2) = 1 - \frac{d(s_1, s_2)}{\max(\mathrm{length}(s_1), \mathrm{length}(s_2))} \tag{2.6}$$

In our case, we obtain a similarity score of 0.67:

$$\mathrm{sim}(\text{James Doe}, \text{Jim Doe}) = 1 - \frac{3}{\max(9, 7)} \approx 0.67 \tag{2.7}$$
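The computation can be retraced with a short Python sketch of the Levenshtein distance and the normalized similarity described above. The function names are our own, and the dynamic-programming formulation is one common way to compute the distance, not necessarily the one used in the cited sources.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform s1 into s2."""
    # previous[j] is the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 by c2 (or keep it)
            ))
        previous = current
    return previous[-1]


def similarity(s1, s2):
    """One minus the distance normalized by the length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are considered identical
    return 1 - levenshtein(s1, s2) / longest


print(levenshtein("James Doe", "Jim Doe"))           # 3
print(round(similarity("James Doe", "Jim Doe"), 2))  # 0.67
```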

2.4.5 Data Cleansing

Data cleansing (or data cleaning, scrubbing) [RD00] is a technique related to the methods of schema matching (ontology matching), instance matching, and other activities that improve the quality of the data [cf. RD00]. While the data integration tasks presented so far are primarily concerned with finding correspondences between related concepts or schemas [e.g. RB01; ES07; Cas+11], data cleansing “deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data” [RD00]. These errors may be caused by data quality problems like spelling mistakes, missing or invalid data, or inconsistencies. Rahm and Do identified two general dimensions for sources of data quality problems, i.e. problems caused by single versus multiple data sources (single-source versus multi-source scenarios), and problems arising either at the schema or instance level [RD00].

In a data warehouse environment, the general approach to data cleansing consists of an initial data analysis to spot frequent anomalies, followed by the definition of a transformation workflow and mapping rules for the data, a verification step to evaluate the transformation, the execution of the data transformation, and the replacement of the dirty data by the cleansed data [RD00]. In relational database management systems (DBMSs) with SQL support, this data transformation task is often conducted using user-defined functions (UDFs) [cf. RD00]. On the Semantic Web by comparison, some obvious data quality problems can be spotted using SPARQL queries as suggested in [FH10]. These data quality problems could then be partially curated by materializing the corrections via SPARQL Update queries.

Standardization and normalization (or canonicalization) represent important cleansing and pre-processing steps for data. Creating and adhering to standards ensures that the data is represented in a uniform and consistent way [RD00]. E.g., relying on a code standard like UN/CEFACT [Uni09b] allows seamless conversion between metric (e.g. “centimeter”) and imperial units (e.g. “inch”), and vice versa. Normalization (or canonicalization) is essential to address the variety of representations in textual descriptions, and thus to support instance matching. Normalization operations can be applied on words or sentences (word normalization, or token normalization [MRS09, pp. 28–34]) to turn them into a canonical form for easier comparison. Some relevant techniques are summarized below:

• String-based normalization [cf. ES07, pp. 76f.]:

– Case normalization: E.g., convert everything to lower-case letters.

– Removal of blank spaces, links, digits, and punctuation: E.g., “U.S.A.” becomes “USA”.

• Linguistic normalization [cf. ES07, pp. 84f.]:

– Tokenization: E.g., split sentences into tokens.

– Stemming and lemmatization: E.g., stemming [e.g. Lov68; Por80] would reduce “actor”, “actress”, “action” etc., to their basic stem “act”; lemmatization would further consider contextual information to derive lemmata, i.e. “performer” in addition to “actor”.

– Removal of stop words: E.g., “the”, “a”, “for”, etc.

• Extrinsic linguistic techniques:

– Usage of dictionaries and thesauri to match or disambiguate various terms [cf. ES07, p. 86].

– Expansion of common abbreviations and acronyms [cf. Sor+10]: E.g., “IR” to “information retrieval” or “Thu” to “Thursday”.
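Some of the normalization steps listed above can be chained as in the following Python sketch. The stop word list and the abbreviation table are small illustrative assumptions; stemming and lemmatization are omitted because they would require a dedicated linguistic library.

```python
import string

# Illustrative subsets only; real systems use much larger resources.
STOP_WORDS = {"the", "a", "an", "for", "of", "and"}
ABBREVIATIONS = {"ir": "information retrieval", "thu": "thursday"}


def normalize(text):
    """Turn a text into a list of normalized tokens."""
    text = text.lower()  # case normalization
    # Removal of punctuation, e.g. "u.s.a." becomes "usa".
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    expanded = []
    for token in tokens:
        # Expansion of known abbreviations; other tokens are kept unchanged.
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return expanded


print(normalize("The U.S.A. for IR"))  # ['usa', 'information', 'retrieval']
```

Note that the order of the steps matters in this sketch: the abbreviation lookup relies on the preceding case normalization and punctuation removal.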

For example, Culotta, Wick, Hall, Marzilli, and McCallum suggest a learning-based method based on edit distances to canonicalize data records [Cul+07]. Mauge, Rohanimanesh, and Ruvini [MRR12] apply this method to e-commerce inventory. In essence, they use the method to find and cluster synonyms among properties that they have first extracted from unstructured textual product descriptions [MRR12].

2.5 Product Search

In the following, we discuss relevant concepts and topics for product search.

2.5.1 Information Need

The information need is the lack of information that users seek to compensate with searches. Manning, Raghavan, and Schütze [MRS09] paraphrase it as “the topic about which the user desires to know more” [MRS09, p. 5]. Morville and Rosenfeld mention four types of information needs [MR06, pp. 34f.]:

1. Known-item seeking (“the right thing”),

2. exploratory seeking (“a few good things”),

3. exhaustive research (“everything”), and

4. re-finding (“need it again”).

Information needs are expressed using different types of queries. A query is how the user expresses the information need when looking for answers [MRS09, p. 5].

2.5.2 Search Types

According to the source of information considered, searches can be roughly classified into three main approaches:

1. Classical search (or traditional search) is the most common search paradigm and describes keyword searches over the document-based Web, based on IR techniques applied on textual descriptions [e.g. BP98].

2. Semantic search goes beyond keyword search by trying to better understand the intended meaning of the terms. Because semantic search approaches typically rely on structured data from the Semantic Web, they can take into account context information (e.g. to better understand queries) and augment traditional search results with additional information from the Semantic Web [e.g. GMM03; Hen10].


3. Hybrid search combines the two distinct search types, i.e. the flexibility of keyword-based search with context-aware semantic search [e.g. Bha+08; RSA04].

As a complement to these main search types, personalized searches take advantage of user preferences, interests, and context information [e.g. Law00; Pit+02]. Based on this knowledge, search systems can then provide custom-tailored search results.

2.5.3 Information Retrieval

Modern information retrieval (IR) describes a research field in computer science that aims to facilitate the access to information objects for users [BR11, p. 1]. In comparison to computer science, IR has a very long tradition that can be traced back about 5,000 years, when the Sumerians started to organize information on clay tablets for later retrieval [Sin01; BR11, pp. 1f.]. With the invention of paper, the amount of information increased further, which made storage and retrieval even more important [Sin01]. Precursors of modern IR systems were built as mechanical and electro-mechanical devices starting in the 1920s, when Emanuel Goldberg was able to build a system that searched for patterns in a catalog of entries stored on a roll of film [SC12]. A particularly relevant work was published in 1945 by Vannevar Bush, who described in his essay “As We May Think” the idea of memex, an assistive mechanical device in the form of a desk that was intended to help users in organizing knowledge by means of a persistent storage (augmenting what the human memory is capable of storing) and associations74 (resembling the functioning of the human brain) [Bus45]. The history of IR research and development is more comprehensively reviewed in [SC12].

The wide application of modern, computer-based IR systems emerged from library science and digital libraries [BR11, p. 3]. The actual breakthrough of IR came with the WWW and search engines, where the corpus of information to be handled rose very quickly, which challenged traditional IR techniques [BR11, p. 3].

A general and widely accepted definition of IR is provided in the book entitled Modern Information Retrieval: The Concepts and Technology behind Search by Baeza-Yates and Ribeiro-Neto:

“Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest.” [BR11, p. 1]

74 The idea of associations formed the basis of modern hypertext systems such as the WWW.


In the book An Introduction to Information Retrieval, Manning, Raghavan, and Schütze give a narrower definition of IR:

“Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” [MRS09, p. 1]

Following this definition (without the notes within the parentheses), IR addresses the finding of unstructured information artifacts from large collections. Taking into account the content from within the parentheses, the definition becomes even narrower, focusing on the retrieval of digital, text-based documents. As compared to the previous definition by Baeza-Yates and Ribeiro-Neto [BR11] which also regards non-textual content, this second definition defines IR from a natural language processing (NLP) point of view (which also reflects the primary research focus of the book authors).

2.5.3.1 Approaches

In general, IR models can be classified by textual properties of documents and, for retrieval on the Web, the link structure [BR11, p. 59]. Sometimes, multimedia retrieval is also taken into account, which of course has different characteristics than text retrieval [BR11, p. 59]. IR models based on text are further distinguished by the level of structure, i.e. whether the text is unstructured or semi-structured (e.g. HTML elements) [BR11, pp. 59f.]. Classical IR models are based on unstructured text, namely the boolean model, the vector-based model, and probabilistic models [BR11, p. 60]. In the following, we briefly sketch the main ideas of these classical IR models.

Boolean Model The boolean model of information retrieval treats documents as a sequence of words, represented by an inverted index [ZM06]. An inverted index, inverted file, or inverted file index is a dictionary-like data structure that stores terms alongside a list of documents they appear in, see e.g. [ZM06; MRS09, p. 6]. An inverted index is a very space-efficient way of storing sparse term-document matrices [MRS09, p. 6]. With boolean retrieval, queries are executed as boolean expressions (linked via AND, OR, or NOT connectors) evaluated over these dictionaries of index terms [MRS09, p. 4]. However, the boolean retrieval model has some downsides [SFW83]:

1. The number of results returned can vary greatly based on the search terms and boolean connectors used in the query [SFW83].

2. Retrieved results are not ranked as a query either matches the document terms or does not [SFW83].

3. Terms that appear in a query or document are not weighted and thus have equal importance, i.e. matches between queries and documents are exact [SFW83].

4. The results are often counterintuitive, because for the disjunction of query terms (e.g. information OR retrieval) a document with one of these terms is given the same relevance as a document containing both terms; and similarly, for the conjunction (e.g. information AND retrieval) a document that contains one of these terms is considered just as useless as a document that does not contain any of them [SFW83].

Alternative models are fuzzy or extended boolean retrieval models [cf. SFW83]. The former extends the boolean retrieval model by giving weights to terms within documents which allows for a ranked retrieval of documents [SFW83]. In the extended boolean retrieval model, it is further possible to assign weights to query terms whereby a user can indicate mixed importance values for various search terms [SFW83].
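To make the plain boolean model more tangible, the following Python sketch builds an inverted index over three made-up documents and evaluates boolean queries as set operations on the posting lists. The documents and function names are assumptions for illustration; production systems use compressed, sorted posting lists instead of Python sets.

```python
from collections import defaultdict

# Made-up toy collection: document ID -> text.
DOCUMENTS = {
    1: "information retrieval on the web",
    2: "product information for web shops",
    3: "retrieval of product data",
}


def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents it appears in."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


INDEX = build_inverted_index(DOCUMENTS)
ALL_DOCS = set(DOCUMENTS)


def postings(term):
    """Posting list of a term (empty set for unknown terms)."""
    return INDEX.get(term, set())


# AND, OR, and NOT correspond to set intersection, union, and complement.
both = postings("information") & postings("retrieval")
either = postings("information") | postings("retrieval")
only_information = postings("information") & (ALL_DOCS - postings("retrieval"))

print(sorted(both))              # [1]
print(sorted(either))            # [1, 2, 3]
print(sorted(only_information))  # [2]
```

The result sets are unranked: every document either satisfies the boolean expression or it does not, which is exactly the second downside listed above.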

Vector Space Model Salton, Wong, and Yang formalized the vector space model when they described an automated way to cluster documents for indexing [SWY75]. Each term in a document constitutes a single dimension in the vector space (that optionally might be weighted), i.e. a t-dimensional vector for t distinct terms in a text document [SWY75]. If documents are jointly relevant to a given user query, then they are nearby in the vector space, otherwise their vectors are distant [SWY75]. The similarity between two term vectors is usually determined by calculating the cosine of the angle (cosine similarity) or the inner product between them [SWY75; Sin01]. In the vector space model, it is common that in addition to documents also queries are represented as term vectors, since a query, composed of a sequence of words or a phrase, is ultimately text [Sin01]. Contrary to the boolean retrieval model, the vector space model provides a partial matching and ranking of documents with respect to queries, which is conducted based on the proximity of the term vectors in the vector space and the weights optionally assigned to the terms [BR11, p. 77].

Probabilistic Model An early proposal to use probability theory for establishing a ranking of documents was published in 1960 by Maron and Kuhns [MK60]. The probabilistic model for information retrieval was presented in 1976 by Robertson and Sparck Jones [RS76]. A probabilistic model uses statistical methods to estimate the probability of a document being relevant to a specific query [cf. Fuh92; BR11, pp. 80f.]. Probabilistic models are generally based on the probability ranking principle [Rob77], meaning that a system’s performance is optimal, if the documents are ranked by decreasing probability of relevance for a specific request on the basis of relevance judgments available to the system [e.g. Rob77; Cre+98; RZ09]. The binary independence model (BIM), a simple probabilistic model, assumes a binary index description of documents and that the terms in the documents are independently distributed [RS76]. Additional probabilistic models that are frequently mentioned in IR literature are, among others, BM25 [RZ09], but also binary independence indexing, staged logistic regression, 2-Poisson, or inference network models. For an overview of these probabilistic models, see [Fuh92; Cre+98].

Other Approaches Latent semantic indexing (LSI) (or latent semantic analysis) [Dee+90] is a novel vector-based approach that seeks to improve retrieval performance. It is based on singular-value decomposition of a term-document association matrix, in other words it leverages the latent semantic structure of documents and semantically indexes associated terms in a vector space [Dee+90]. This way it is possible to consider documents that in the first place seem irrelevant, yet are implicitly related to relevant documents based on some shared concepts. Similarly, irrelevant documents that appear relevant can be eliminated with higher confidence from the result set. Consequently, LSI is able to ameliorate the problems of synonymy (i.e. different words that mean identical or similar things [NO95], e.g. “car” versus “automobile”) and polysemy (or homonymy, i.e. identical or similar words that carry different meanings [NO95], e.g. “Jaguar” the big cat versus “Jaguar” the automobile brand name), that usually create problems for the conventional boolean and vector-based retrieval models [Dee+90]. While synonyms can negatively impact recall of IR systems, homonyms usually account for poor precision [Dee+90] (see Section 2.5.3.3 for a discussion of precision and recall in IR).

2.5.3.2 Ranking

IR algorithms generally rank results based on previously computed scores associated with documents [Sin01]. The result set might further be pruned to those documents with scores above a specified threshold value [BR11, p. 78].

Term weighting is a method to attach weights to terms relative to their importance in documents or document collections [cf. SB88]. Its goal is to improve retrieval effectiveness by retrieving as many relevant documents as possible and rejecting irrelevant documents [SB88]. The term frequency–inverse document frequency (TF-IDF) model is the most popular weighting scheme for IR [BR11, p. 68]. It relates the term frequency within a document (TF – term frequency) [Luh57] to the frequency of the term over the whole document collection (IDF – inverse document frequency) [Spa72]. Accordingly, a document is relevant with respect to a particular query term, if the term is frequent in that document. At the same time, the relevance of the document is lower if the term occurs in several documents of a collection instead of occurring in a single document only.

In large document collections, there is a risk of term weighting schemes favoring longer documents, because long documents typically expose a richer set of terms and a higher term frequency [SBM96]. To address this unfair treatment of shorter documents, IR algorithms employ document length normalization functions to correct for different-length documents. Popular techniques mentioned in [SBM96] are cosine normalization, maximum TF normalization, and byte length normalization.

The TF-IDF weighting scheme calculates ranking scores for documents as follows: Given a query q and a document d, the score associated with the document d is the sum of all TF-IDF weights of query terms t [cf. MRS09, p. 119]:

$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tfidf}_{t,d} \tag{2.8}$$

In the vector space model, the score is computed as the similarity between query and document vector [Sin01]. The similarity of two vectors can be determined by the cosine similarity (normalized dot product) [cf. MRS09, p. 124], i.e. the cosine of the angle between the two vectors $\vec{q}$ and $\vec{d}$ in the t-dimensional space. The vectors are represented as vectors of weighted terms, in their basic form TF-IDF weights associated with terms [BR11, p. 78].

$$\mathrm{score}(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \tag{2.9}$$
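Both scoring variants, the sum of TF-IDF weights (2.8) and the cosine similarity (2.9), can be sketched in a few lines of Python. The toy documents are made up, and the concrete weighting (raw term frequency multiplied by the natural logarithm of N/df) is only one of several common TF-IDF variants and thus an assumption of this sketch.

```python
import math
from collections import Counter

# Made-up toy collection.
DOCS = {
    "d1": "cheap digital camera with zoom lens",
    "d2": "digital camera bag",
    "d3": "garden hose with spray nozzle",
}


def tokenize(text):
    return text.lower().split()


TERM_COUNTS = {doc_id: Counter(tokenize(text)) for doc_id, text in DOCS.items()}
N = len(DOCS)


def idf(term):
    """Inverse document frequency: log(N / df), or 0 for unseen terms."""
    df = sum(1 for counts in TERM_COUNTS.values() if term in counts)
    return math.log(N / df) if df else 0.0


def tfidf(term, counts):
    """Raw term frequency times inverse document frequency."""
    return counts[term] * idf(term)


def score_sum(query, doc_id):
    """Equation (2.8): sum of the TF-IDF weights of the query terms in d."""
    counts = TERM_COUNTS[doc_id]
    return sum(tfidf(term, counts) for term in set(tokenize(query)))


def score_cosine(query, doc_id):
    """Equation (2.9): cosine similarity of the TF-IDF vectors of q and d."""
    q_counts = Counter(tokenize(query))
    d_counts = TERM_COUNTS[doc_id]
    q_vec = {term: tfidf(term, q_counts) for term in q_counts}
    d_vec = {term: tfidf(term, d_counts) for term in d_counts}
    dot = sum(weight * d_vec.get(term, 0.0) for term, weight in q_vec.items())
    norm = math.sqrt(sum(w * w for w in q_vec.values())) * \
        math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / norm if norm else 0.0


QUERY = "digital camera lens"
RANKING_SUM = sorted(DOCS, key=lambda d: score_sum(QUERY, d), reverse=True)
RANKING_COSINE = sorted(DOCS, key=lambda d: score_cosine(QUERY, d), reverse=True)
print(RANKING_SUM)
print(RANKING_COSINE)
```

For this query, both variants rank d1 before d2 and d3, since d1 contains all three query terms, including the rare term “lens”.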

As already discussed before, probabilistic models rank documents according to the probability ranking principle [Rob77], which essentially states that for maximal effectiveness systems should rank documents by the probability of being relevant to a given query.

2.5.3.3 Evaluation Criteria for Information Retrieval

In Modern Information Retrieval: The Concepts and Technology behind Search, Baeza-Yates and Ribeiro-Neto outline the central criterion for the success of an IR algorithm, that is the relevance of the presented results with respect to a given information need:

“[T]he primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.” [BR11, p. 4]


Precision and recall are two fundamental metrics for evaluating the quality of IR systems [e.g. Cle67]. They were initially defined as part of the Cranfield experiments in the 1950s [Cle67], a precursor of modern IR evaluation, where indexing systems were compared and evaluated with the help of systematically created reference collections. These test collections contain documents manually labelled as relevant or non-relevant relative to particular queries. Nowadays, similar test collections75 are provided in the context of the Text REtrieval Conference (TREC), addressing the various information needs from different application domains [e.g. Sin01; SC12], including Web search, medical search, enterprise search, and others.

In the following, we formally define precision, recall, and the F1-measure as three popular evaluation criteria for information retrieval algorithms.

Let $D_{\mathrm{retrieved}}$ denote the set of retrieved documents from a document collection $D$, and $D_{\mathrm{relevant}}$ be the set of documents deemed relevant (see Figure 2.18).

Figure 2.18: Side-by-side comparison of precision and recall
[Figure: two panels, (a) Precision and (b) Recall, each showing the sets D_relevant and D_retrieved within the document collection D.]

The precision, aiming for the highest possible share of relevant documents in a returned result set [Cle67], is calculated as the number of relevant documents retrieved divided by the number of all retrieved documents from a document collection [cf. MRS09, p. 155]:

$$\mathrm{precision} = \frac{|D_{\mathrm{relevant}} \cap D_{\mathrm{retrieved}}|}{|D_{\mathrm{retrieved}}|} \tag{2.10}$$

The recall, which strives to receive as many relevant documents as possible from a document collection [Cle67], is formalized as the number of relevant documents that could be retrieved with respect to all relevant documents in the document collection [cf. MRS09, p. 155]:

$$\mathrm{recall} = \frac{|D_{\mathrm{relevant}} \cap D_{\mathrm{retrieved}}|}{|D_{\mathrm{relevant}}|} \tag{2.11}$$

75 http://trec.nist.gov/data.html (accessed on June 23, 2014)


The values for precision and recall are rational numbers between 0 and 1. The higher the precision, the more of the retrieved documents are relevant. The higher the recall, the more of all the relevant documents could be retrieved. The difficulty is that precision and recall often exhibit an inverse relationship, i.e. improving one value usually comes at the cost of the other [Cle67]. Finally, the F1-measure relates these two values as their harmonic mean [cf. MRS09, p. 156]. It is given by

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{2.12}$$
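The three measures translate directly into set operations, as the following Python sketch shows. The document IDs are made up, and returning 0.0 for empty sets is a convention chosen for this sketch to avoid divisions by zero.

```python
def precision(relevant, retrieved):
    """Equation (2.10): share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(relevant & retrieved) / len(retrieved)


def recall(relevant, retrieved):
    """Equation (2.11): share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & retrieved) / len(relevant)


def f1(relevant, retrieved):
    """Equation (2.12): harmonic mean of precision and recall."""
    p = precision(relevant, retrieved)
    r = recall(relevant, retrieved)
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Made-up example: four relevant documents, three retrieved, two in common.
RELEVANT = {"d1", "d2", "d3", "d4"}
RETRIEVED = {"d1", "d2", "d5"}

print(round(precision(RELEVANT, RETRIEVED), 2))  # 0.67
print(round(recall(RELEVANT, RETRIEVED), 2))     # 0.5
print(round(f1(RELEVANT, RETRIEVED), 2))         # 0.57
```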

In addition to the F1-measure, there exist other measures that integrate precision and recall into a single number, e.g. mean average precision (MAP) or 11-point interpolated average precision [MRS09, pp. 159–161]. The 11-point interpolated average precision, for example, calculates the arithmetic mean of the interpolated precision at recall levels 0%, 10%, ..., 100% (eleven points) [MRS09, p. 159]. Often however, especially for Web searches, it is difficult to get hold of all the documents in a collection, which at least hampers the computation of the recall measure. If the set of retrieved documents is very large, then it is difficult to compute the precision metric, too. For this reason, other metrics are sometimes more appropriate. Subsequently we delineate a non-exhaustive selection of these metrics:

• Precision at n (P@n), where P@5 (precision at 5), P@10 (precision at 10), P@20 (precision at 20) are possible instances, is a very popular measure for Web searches [BR11, p. 140]. P@5 means that the result set is cut off after the fifth result, and the portion of relevant documents in this reduced result set gives the precision. This metric can be used to compare different IR systems, or to observe the behavior of a single system over a number of queries [BR11, pp. 139f.].

• R-precision computes the precision at the position R, given that R is the number of relevant documents for a particular query [BR11, p. 141]. This measure assumes that R is known. It thus addresses a potential drawback of P@n by relating the measure to the number of relevant documents, which can positively affect the measure for queries with many relevant results [MRS09, p. 161].

• Binary preference (BPREF) [BV04] is used if relevance judgements are incomplete in document collections, like in large corpora such as the Web where other measures generally treat the unjudged documents as non-relevant [BR11, p. 151]. In short, BPREF computes a metric over preference relations among documents, and is based on the number of documents judged as non-relevant by human experts that show up ahead of relevant documents [BR11, p. 151].
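The first two metrics of this list operate on a ranked result list and can be sketched in Python as follows. The ranking and the relevance judgements are made up; dividing by n even if fewer than n results were returned is a common convention that we assume here. BPREF is left out because its exact formula is not given in the text.

```python
def precision_at(n, ranking, relevant):
    """P@n: share of relevant documents among the first n results."""
    top = ranking[:n]
    return sum(1 for doc in top if doc in relevant) / n


def r_precision(ranking, relevant):
    """Precision at position R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return precision_at(r, ranking, relevant)


# Made-up ranked result list and relevance judgements for one query.
RANKING = ["d3", "d7", "d1", "d9", "d4", "d2"]
RELEVANT = {"d1", "d2", "d3"}

print(precision_at(5, RANKING, RELEVANT))           # 0.4
print(round(r_precision(RANKING, RELEVANT), 2))     # 0.67
```

In this example, two of the first five results are relevant (P@5 = 0.4), and two of the first R = 3 results are relevant (R-precision of about 0.67).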


In addition to the retrieval quality metrics, there are other important factors for evaluating IR systems such as system quality assessment and user utility tests (e.g., time required for completing the task, or usability scores) [MRS09, p. 168]. Accordingly, an IR system can be systematically evaluated in terms of usability by means of user-based experiments. A non-exhaustive list of popular methods for usability testing includes [cf. BR11, pp. 168–173]:

• Side-by-side panels: With this method, the results of two retrieval algorithms are compared next to each other [TH06]. Users are then asked for comparative judgements on the retrieval qualities of the two algorithms and, possibly, their interactions are logged [TH06].

• A/B testing: A/B testing is a controlled experiment where users are randomly assigned to one of two groups, typically of equal size (but also other splits are possible [e.g. MRS09, p. 70]), namely a control group A and a treatment group B [cf. Koh+09]. While for one user segment the conditions stay the same, the other group is faced with a different version with slightly changed parameters [Koh+09]. Running this kind of test can provide useful insights prior to adding new features, changing the user interface design, or renewing the underlying retrieval algorithm of a system.

• Click-through data: User clicks are recorded in the background while users are interacting with the system [Joa02]. From such data about the relevance of documents to a particular query, a retrieval algorithm can learn better rankings for future requests [Joa02].

In general, at least three settings for conducting user experiments are conceivable. The classical example is a lab setting where users are observed during task execution and interviewed [BR11, p. 168]. An alternative method is to collect usage data by logging user interaction with the system [cf. Joa02]. This requires additional implementation effort. The third option is to outsource the fulfillment of user tasks to people on the Web, known as crowdsourcing. Usability testing with crowdsourcing can be regarded as similar to a lab setting, with the important difference that a bigger audience can be reached at lower cost and with less time overhead [Liu+12].

2.5.3.4 Tools

There are many tools and applications that have been developed in the context of IR, among others commercial search engines like Google, Yahoo!, Bing, etc. Furthermore, there exist several open source projects. One of them is Apache Lucene76, an industrial-strength open source project that brings several of the benefits of IR algorithms also to smaller development projects. Lucene is a software library that adds full-text search and indexing capabilities to applications, supporting a range of different query types, several ranking models, and lots of configuration options [cf. MHG10].

2.5.4 Human-Computer Interaction

Human-computer interaction (HCI) describes an interdisciplinary field dealing withmethods on how people can interact with computer systems. For an overview, see e.g.[Car97; Mye98; Gru12]. Even if HCI also deals with past and future advances in input andoutput devices like the computer mouse or motion tracking devices [Mye98], the researchdirection we are mostly interested herein are user interface designs and interaction modelsfor search systems. White, Kules, Drucker, and Schraefel [Whi+06b] frame the challengesof search systems as “[r]ather than just providing search results, search systems shouldhelp users explore, overcome uncertainty, and learn” [Whi+06b]. Morville and Callender[MC10] characterize search as follows:

“[S]earch at its best is a conversation. It’s an iterative, interactive process where we find we learn. The answer changes the question. The process moves the goal. Search has the power to suggest, define, refine, cross-sell, upsell, relate, and educate. In fact, search is already among the most influential ways we learn.” [MC10, p. 9]

In the following three sections, we elaborate on the common distinguishing characteristics of search. After that, we summarize interaction paradigms for search, dedicating more space to faceted search interfaces, and conclude with design guidelines for search interfaces.

2.5.4.1 Static versus Dynamic Search

Search is traditionally oversimplified as a task accompanied by a static information need [Hea11]. In this view, search is about (a) identifying a problem, (b) articulating the information need, (c) formulating a query, and (d) evaluating the results [Hea11, p. 23]. While this model fits known-item seeking scenarios well (see Section 2.5.4.2), it is not suitable for searches where the user does not yet know exactly what he is looking for.

In fact, search is very often a dynamic process, where the information need adapts over time as users learn and collect new information [Hea11, p. 23]. Ideally, the design of user interfaces caters for these circumstances. Contemporary search interfaces hence try to continually engage the user in the search process. Information seekers revise queries in an iterative manner according to their changing information needs based on past results [Mar06]. While searching they collect, compare, filter, and digest pieces of information. This is sometimes referred to as the berrypicking model of search [Bat89].

76 http://lucene.apache.org/core/ (accessed on June 24, 2014)

In accordance with the dynamic search model, incremental search strategies are quite common. One such strategy is that users pose an initial query without fully specifying the information need (“testing the water”), which is only later refined in subsequent search iterations [Hea11, p. 26]. Another possibility is to do a series of smaller searches instead of writing a long complex query, in the hope of gradually approaching the final answer. This kind of incremental search strategy has been termed orienteering in [OJ93] [Hea11, p. 23].

One important problem with current systems is that the temporary storage and later integration of intermediate results is mostly left to the human and only lightly supported by the systems.

2.5.4.2 Lookup versus Learning

Depending on the information need and the prior knowledge of a domain, different search tasks might be most appropriate. Morville and Callender discern two of them, i.e.

• lookup (or known-item search), and

• learn (or exploratory search) [MC10, pp. 27f.].

Further, Marchionini [Mar06] makes a more subtle distinction by regarding lookup, learn, and investigate. With regard to Morville and Callender [MC10], the latter two tasks (learn and investigate) can be summarized under the common term exploratory search [Mar06].

Lookup “is the most basic kind of search task” [Mar06] and involves fact retrieval and question answering (QA), i.e. querying information about items that are already known [cf. MC10, p. 28]. On the other hand, learning and investigative searches are multi-step approaches that comprise knowledge discovery, comparison, and analysis, to name but a few [cf. MC10, p. 28]. Investigative search differs from learning in that it is a long-term process that aims to gather new knowledge, update existing knowledge, or fill potential gaps in knowledge [Mar06]. It hence constitutes an in-depth research process.


2.5.4.3 Searching versus Browsing

In general, we can characterize two main activities of an information-seeking agent, i.e.

1. searching (or querying) and

2. browsing (or navigating) [BR11, p. 4].

Morville and Rosenfeld [MR06, p. 35] further mention asking as a third information-seeking activity.

Sometimes, if an information need is vague, it may not be appropriate or feasible to formulate a query. E.g., it might be difficult to recall the right terms or descriptions of an information need [Hea11, p. 24]. In such cases, browsing or navigating the option space is more promising than querying. Nonetheless, these information seeking tasks are most commonly combined, i.e. a user is able to alternate between searching, browsing, and asking [MR06, pp. 35f.].

2.5.4.4 Interaction Paradigms for Search

The optimal search interface for user interaction depends on several factors. Hearst mentions three factors that affect the type of applicable search interface, namely

1. the kind of search task,

2. the time and effort that can be spent on the task, and

3. the past experience of the information-seeking agent [Hea11, p. 22].

For example, imagine someone wants to know more about a topic he currently lacks expertise in. Recall that in this case, navigating the option space and possibly finding new interesting things might be easier and lead to more promising results than having to write complex queries.

In the following, we outline the two most common search paradigms in use today:

1. Keyword search: Widely employed by search engines and digital libraries, the prevalent query method is to enter query terms into a search box, referred to as keyword search. Keyword search takes advantage of techniques from IR [cf. BR11, p. 3]. Keyword searches are intended to be easy to grasp for the vast majority of information seekers. A related but less intuitive approach is described by form-based query interfaces, i.e. guided search interfaces with multiple input fields for more expressive queries [Wei+13]. Due to their complexity for the average user, they are typically reserved for expert searchers or narrowly defined information needs.

2. Navigational search: Before the rise of keyword-based search engines, links were collected and manually organized into Web directories to facilitate the discovery of documents via category navigation [Din+05; Wei+13] (e.g. by Yahoo! Directory77, Open Directory78, etc.). This navigational search has become less popular for Web searches with the enormous growth of the Web and improvements to search engine accuracy. However, navigational search is implemented by many Web sites, e.g. news portals and e-commerce platforms, to facilitate page navigability or the discovery of products.

Well-designed search interfaces usually integrate both keyword-based and navigational searches. Amazon, for example, effectively combines keyword searches with navigational capabilities to let the user narrow down the information space (i.e. the product catalog).

Exploratory search goes one step further by blending querying and browsing strategies into a highly interactive user interface [Mar06]. Faceted search interfaces constitute an important instance of exploratory search.

2.5.4.5 Faceted Search Interfaces

Faceted search is a specific type of exploratory search [Wei+13], which in addition to traditional keyword search allows users to navigate the option space via browsing and filtering [Tun09, p. 24]. Faceted search interfaces are quite common for e-commerce platforms like eBay or Amazon.

Faceted search is a multi-dimensional search paradigm based on the concepts of facets79 and their facet values or terms [Tun09, pp. 7f.]. Facets can be roughly compared to categories orthogonal to each other. To give an example, possible facets for beer would be “style”, “location”, and “brewery”. Similarly, examples of respective facet values would be “wheat beer” for the style, “Munich” for the location, and “Paulaner” for the brewery (see Figure 2.19).

Faceted search is sometimes interchangeably used with the terms faceted navigation, faceted browsing, or guided navigation [Wei+13; cf. MC10, p. 95]. Tunkelang, however, makes a distinction between faceted search and faceted navigation (or browsing), in that he considers faceted navigation, like parametric searches, a predecessor of faceted search [Tun09, p. 21].

Figure 2.19: Faceted search interface (facets and values shown: Style: Wheat beer, Pale beer, Dark beer; Brewery: Paulaner, Erdinger, Hacker-Pschorr, Bitburger; Location: Munich, Erding)

77 http://dir.yahoo.com/ (accessed on June 18, 2014)
78 http://www.dmoz.org/ (accessed on June 18, 2014)
79 Important contributions to faceted classification were made by Shiyali Ramamrita Ranganathan, an Indian librarian, in the first half of the twentieth century, by proposing his extensible colon classification scheme for library science [Tun09, pp. 7f.].

Parametric searches are search interfaces based on boolean algebra, where the interface components essentially describe a combination of logical AND and OR connectors [Tun09, p. 21]. Multiple facets are connected by ANDs, whereas the option values within a facet are connected by ORs [Tun09, p. 21]. Selecting both the “Paulaner” and “Bitburger” breweries and the location “Munich” from Figure 2.19 thus yields the following boolean query:

Location(Munich) AND (Brewery(Paulaner) OR Brewery(Bitburger))
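To make the AND/OR semantics concrete, the following minimal Python sketch evaluates such a query against a small in-memory catalog. The catalog entries, the identifiers `b1` to `b4`, and the function name `parametric_search` are hypothetical and serve illustration only; they are not taken from [Tun09].

```python
from typing import Dict, List, Set

Item = Dict[str, str]
Selection = Dict[str, Set[str]]

# Hypothetical mini catalog; the entries are assumptions for illustration.
CATALOG: List[Item] = [
    {"id": "b1", "style": "Wheat beer", "brewery": "Paulaner", "location": "Munich"},
    {"id": "b2", "style": "Dark beer", "brewery": "Hacker-Pschorr", "location": "Munich"},
    {"id": "b3", "style": "Wheat beer", "brewery": "Erdinger", "location": "Erding"},
    {"id": "b4", "style": "Pale beer", "brewery": "Bitburger", "location": "Bitburg"},
]


def parametric_search(catalog: List[Item], selection: Selection) -> List[Item]:
    """AND across facets, OR across the values selected within one facet."""
    return [
        item
        for item in catalog
        # all(...) realizes the AND between facets; the set membership
        # test realizes the OR between the values of a single facet.
        if all(item[facet] in values for facet, values in selection.items())
    ]


# Location(Munich) AND (Brewery(Paulaner) OR Brewery(Bitburger))
selection: Selection = {
    "location": {"Munich"},
    "brewery": {"Paulaner", "Bitburger"},
}
hits = parametric_search(CATALOG, selection)
print([item["id"] for item in hits])  # prints ['b1']
```

In this sketch, a facet without any selected value is simply left out of `selection`. The query is evaluated in one shot, so the user only learns afterwards that the Bitburger option contributed no result.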

Faceted navigation adds guidance to parametric searches [Tun09, p. 23]. While parametric searches are executed as a one-shot query, faceted navigation typically provides immediate feedback by adapting the choices available in other facets to the currently selected options [Tun09, p. 23]. So, in the example above, after selecting “Munich” for the location, it would no longer be possible to choose the brewery “Bitburger”. In other words, since Bitburger beer is brewed in Bitburg, Rhineland-Palatinate, and not in Munich, the option for Bitburger disappears. Finally, a faceted search interface combines faceted navigation with keyword search [Tun09, p. 24]. Pioneering work on this type of search interface was conducted by Hearst et al., who suggested the Flamenco faceted search framework80 [Hea+02]. Wei et al. [Wei+13] compare a number of faceted search systems that have been presented in the past.
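The guidance step of faceted navigation can be sketched as follows: for a given facet, only those values are offered that still lead to a non-empty result set under the current selection. The catalog and the function names `matches` and `available_values` are hypothetical. Ignoring a facet's own selection when computing its remaining values is one possible design choice that we assume here, not a rule taken from [Tun09].

```python
from collections import Counter
from typing import Dict, List, Optional, Set

Item = Dict[str, str]
Selection = Dict[str, Set[str]]

# Hypothetical mini catalog; the entries are assumptions for illustration.
CATALOG: List[Item] = [
    {"id": "b1", "style": "Wheat beer", "brewery": "Paulaner", "location": "Munich"},
    {"id": "b2", "style": "Dark beer", "brewery": "Hacker-Pschorr", "location": "Munich"},
    {"id": "b3", "style": "Wheat beer", "brewery": "Erdinger", "location": "Erding"},
    {"id": "b4", "style": "Pale beer", "brewery": "Bitburger", "location": "Bitburg"},
]


def matches(item: Item, selection: Selection, skip: Optional[str] = None) -> bool:
    """True if the item satisfies every selected facet except `skip`."""
    return all(
        item[facet] in values
        for facet, values in selection.items()
        if facet != skip and values
    )


def available_values(catalog: List[Item], selection: Selection, facet: str) -> Counter:
    """Values of `facet` that can still be chosen, with their hit counts.

    The selection made in `facet` itself is ignored (assumed design choice),
    so alternatives within the same facet remain visible.
    """
    return Counter(
        item[facet] for item in catalog if matches(item, selection, skip=facet)
    )


# Before any selection, all four breweries are offered.
print(sorted(available_values(CATALOG, {}, "brewery")))
# After selecting Munich, Bitburger and Erdinger are no longer offered.
print(sorted(available_values(CATALOG, {"location": {"Munich"}}, "brewery")))
# prints ['Hacker-Pschorr', 'Paulaner']
```

Because only values with at least one hit are returned, a user following these options cannot navigate into an empty result set.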

The research on dynamic taxonomies [Sac00] is also closely linked to faceted search, because faceted search usually displays facets and facet values dynamically rather than showing rigid category structures that rely on static taxonomies.

Faceted search interfaces have a number of characteristics that distinguish them from other search paradigms: (1) They do not require the user to manually formulate complex queries, but they allow users to refine and relax search results based on facets and facet values; (2) they require little knowledge about the underlying data schema; (3) they make it easy to explore the option space in a guided and incremental fashion, where possible options for next navigation steps depend on the current selection; and, (4) faceted search interfaces solve the problem of “dead ends” [FH11], i.e. they eliminate unsatisfiable choices that could otherwise lead to empty result sets [cf. Wei+13].

80 http://flamenco.berkeley.edu/ (accessed on June 27, 2014)

2.5.4.6 Design Guidelines for Search Interfaces

A simple but very important rule in search interface design is simplicity, meaning that a search interface needs to be as simple as possible, avoiding any unnecessary complexities that could distract users [SBC97]. Here, we briefly summarize eight guidelines for good search user interface design. They were initially proposed in the context of IR systems in the work of Shneiderman, Byrd, and Croft [SBC97], and were further discussed in [Hea09]:

1. Consistency: Search user interfaces need to be consistent regarding terminology, layout, etc.; otherwise, usability might suffer [SBC97].

2. Shortcuts: Highly repetitive tasks or well-known navigation paths should be supported by shortcuts. Shortcuts can be realized in the form of keyboard shortcuts for better interaction, but also clickable page links to facilitate navigation (e.g. page breadcrumbs or deep links in search engine result snippets) [Hea09].

3. Informative feedback: The provided feedback during the search process should be informative to the user, so that the context of the particular query becomes clear [SBC97]. In this regard, it is necessary that a search interface quickly returns results, that search results are augmented with a short summary or snippet (possibly with the search terms highlighted), or that search terms are suggested to the user in the query formulation process [Hea09].

4. Design for closure: For a user it is usually hard to estimate the size of the option space. Hence, a system should be designed to relieve users of tedious and unnecessary guesswork regarding their search progress [SBC97].

5. Error handling: Simple errors should be reported as long as they give helpful insights and do not distract users [SBC97]. Furthermore, a system should take care to avoid certain errors, such as presenting users with an empty result set, or take action to become more resilient against different variants of the same query intent [Hea09].

6. Reversal of actions: It should be possible for users to undo actions in order to be able to return to past queries, which can become very powerful in combination with relevance feedback [SBC97].

7. User control: In general, users should feel like they have control over a search system [SBC97]. However, many search systems use search algorithms that a user does not necessarily have to understand in much detail. To some extent, a system could work autonomously, e.g. it could improve recall by eliminating stop words or expanding acronyms, or cluster results [Hea09]. Nevertheless, sometimes the user should take over control, e.g. the system could ask whether a spelling correction is desired [Hea09].

8. Short-term memory load: Any search interface design that places a high cognitive load on users is generally discouraged. For instance, information in one place in compact form is often better than the same piece of information distributed among multiple resources (e.g. many slides in a presentation for one idea) [SBC97]. In principle, users should not have to remember too much of the relevant information and instead be supported by the interface [Hea09]. This can include affordances for interaction (e.g. pre-filled default values in text fields, or tooltips), but also the maintenance of a search history [Hea09].

2.5.5 Recommender Systems

Nowadays, we find ourselves in a culture of mass customization rather than mass production [cf. FFS07; SKR99]. Mass production was the leading business model of the twentieth century, introduced in the early twentieth century by Henry Ford for producing millions of standardized Ford Model T automobiles [FFS07]. Mass customization is the imperative that companies are facing today [FFS07]. Customers have individual needs that can largely be fulfilled by the provision of multiple product variants. However, the vast array of options available for purchase makes people quickly feel overwhelmed [e.g. FFS07; SKR99].

Recommender systems are a means to help solve the problem of information overload that arises from this variety of products [SKR99]. They facilitate decision-making and are based on the same principle as traditional word-of-mouth recommendations via other people’s experiences [Bob+13].

“Recommender systems support users by identifying interesting products and services in situations where the number and complexity of offers outstrips the user’s capability to survey them and reach a decision.” [FFS07]


A good recommender system tries to understand the information need and to suggest the right products based on that. Possible business benefits of an accurate recommender system are to turn visitors into buyers, to stimulate cross-selling of products81, or to improve customer loyalty [SKR99].

2.5.5.1 Approaches

Recommendation approaches are usually distinguished by the type of information source. The most prominent information sources are user ratings and item82 characteristics. Adomavicius and Tuzhilin distinguish between three main categories of recommendation approaches [AT05]:

1. Content-based filtering considers item characteristics and suggests similar items based on prior interests or past purchases of the user, i.e. matching items to the user profile.

2. Collaborative filtering takes into account the profiles, preferences, and interests of other users. Based on similar profiles, a user is recommended items that other users have already preferred, rated, or visited previously.

3. Hybrid recommendation approaches combine different recommendation methods [Bur07; AT05], e.g. content-based and collaborative filtering, to overcome individual limitations of the single approaches.
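The difference between the first two categories can be illustrated with a minimal Python sketch. The ratings, item features, user names, and the `liked_at` threshold are hypothetical. Cosine similarity over rating vectors and Jaccard similarity over feature sets are common choices that we assume here; they are not prescribed by [AT05].

```python
import math
from typing import Dict, List, Set

# Hypothetical toy data: user -> {item: rating} and item -> feature set.
RATINGS: Dict[str, Dict[str, float]] = {
    "alice": {"i1": 5, "i2": 4},
    "bob": {"i1": 5, "i2": 4, "i3": 5},
    "carol": {"i1": 1, "i4": 5},
}
FEATURES: Dict[str, Set[str]] = {
    "i1": {"wheat", "munich"},
    "i2": {"wheat", "erding"},
    "i3": {"dark", "munich"},
    "i4": {"pale", "bitburg"},
    "i5": {"wheat", "munich"},  # an item that nobody has rated yet
}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse rating vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def collaborative(user: str, ratings: Dict[str, Dict[str, float]]) -> List[str]:
    """User-based collaborative filtering: similarity-weighted ratings of others."""
    scores: Dict[str, float] = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, rating in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


def content_based(
    user: str,
    ratings: Dict[str, Dict[str, float]],
    features: Dict[str, Set[str]],
    liked_at: float = 4.0,
) -> List[str]:
    """Content-based filtering: match item features to a profile of liked items."""
    rated = ratings[user]
    profile: Set[str] = set()
    for item, rating in rated.items():
        if rating >= liked_at:
            profile |= features[item]
    scores: Dict[str, float] = {}
    for item, feats in features.items():
        if item in rated:
            continue
        union = profile | feats
        # Jaccard similarity between the user profile and the item features.
        scores[item] = len(profile & feats) / len(union) if union else 0.0
    return sorted(scores, key=scores.get, reverse=True)


print(collaborative("alice", RATINGS))            # prints ['i3', 'i4']
print(content_based("alice", RATINGS, FEATURES))  # prints ['i5', 'i3', 'i4']
```

In this toy data, the collaborative variant cannot suggest `i5` because no user has rated it, whereas the content-based variant ranks it first on the basis of its features alone. A hybrid approach could, for instance, merge both rankings.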

There is more explicit and implicit information about users and items that can be gathered and utilized, namely user behavior, various types of context information (e.g. demographics), social profiles, or the Internet of Things (IoT) [Bob+13]. Accordingly, we can augment the aforementioned categorization of recommendation approaches. Burke mentions the following two additional recommendation approaches [Bur07]:

4. Demographic filtering takes into account the demographic profile of the user, i.e. the similarity of user profiles is calculated based on demographic attributes like location, gender, or age.

5. Knowledge-based filtering recommends items based on domain knowledge and user preferences.

Further, there is at least one other approach that became attractive with social networks:

6. Community-based filtering takes advantage of social networks, e.g. by limiting recommendations to the circle of friends whom the user typically trusts more than the anonymous crowd [e.g. RRS11, p. 13].

81 Cross-selling describes additional sales generated by related products, e.g. complementary goods.
82 “Items are the objects that are recommended.” [RRS11]

2.5.5.2 Limitations

There are several challenges that need to be solved for recommender systems. In the following, we outline selected problems that are repeatedly mentioned in literature [e.g. FFS07; Bob+13; RRS11; AT05].

Cold-Start Problem

A common limitation of recommender systems is the cold-start problem [e.g. Bob+13; RRS11, p. 13; Bur07; LKH14]. The cold-start problem describes the problem of an insufficient number of initial ratings that recommendations could be based on [Bob+13]. It emerges either when a new recommender system is launched for the first time, or when a new user or a new item is added to an existing recommender system [Bob+13]. While content-based filtering generally suffers from the new user problem only (i.e. the system has no prior knowledge about a user’s preferences), the collaborative method also faces new item challenges (i.e. there are not sufficiently many user ratings for a given item) [AT05]. The new user and new item problems are usually addressed by employing hybrid recommendation strategies, e.g. by combining content-based and collaborative filtering techniques [e.g. AT05; Bur07].

Sparsity

Sparsity of user ratings is a problem of collaborative filtering techniques [AT05]. Especially if the number of items is large, it can happen that several of them have not been rated yet. This problem can be mitigated by drawing on demographic filtering techniques, i.e. by recommending items based on user groups that share the same demographic attributes [AT05].

Diversity

A feature that is sometimes disregarded but still desirable for recommender systems is to establish some degree of diversity between recommended items [AT05]. If systems strive to suggest only items that match user profiles as closely as possible, then users will never be recommended serendipitous items, i.e. novel items that might be surprisingly interesting to them. To tackle this issue, recommender systems could avoid suggesting items that are too similar to the items seen before (rather than only those that are too dissimilar) [AT05].


2.5.5.3 Tools

Numerous commercial Web portals integrate recommendation engines to recommend items to users, e.g. Netflix, YouTube, Amazon, eBay, etc. But there also exist several non-commercial tools for building recommendation engines. In the course of their research (e.g. MovieLens83, a movie recommender system), the GroupLens research laboratory at the University of Minnesota developed an open source toolset for building and studying recommender systems, named LensKit84 [Eks+11]. In their complementing research paper, Ekstrand, Ludwig, Konstan, and Riedl [Eks+11] also give an overview of other popular recommender toolkits, among others Apache Mahout85, which is a machine learning library that implements recommenders based on clustering, classification, and collaborative filtering techniques.

2.5.6 Natural Language Processing

Natural language processing (NLP) is one branch of speech and language processing [JM09, p. 1]. Speech and language processing is a research area which can be subdivided into the fields of computational linguistics (linguistics), NLP (computer science), speech recognition (electrical engineering), and computational psycholinguistics (psychology) [JM09, p. 9]. According to Manaris, NLP spans the research disciplines of computer science, electrical engineering, mathematics, statistics, linguistics, psychology, philosophy, and biology [Man98, pp. 6f.].

Early research on NLP reaches back to the first years after World War II (late 1940s) when English scientists worked on machine translation with the aim of translating Soviet papers on physics from Russian to English [Man98, p. 7]. Since the 1990s, the field of NLP has attracted a lot of commercial and academic interest due to the dawn of the Web, an increasing number of text corpora made available for analysis (e.g. the Penn Treebank project with tag sets of annotated English text [MSM93]), significantly more powerful computers, and improvements to NLP models (e.g. the application of probabilistic models) [cf. JM09, pp. 12f.]. For a more detailed overview on NLP history, see e.g. [Bat95; Man98, pp. 7–12; cf. JM09, pp. 9–14].

It is not easy to find a universal definition for NLP, because of the many facets of language and its interdisciplinary nature [Man98, pp. 4f.]. Manaris defines NLP within the scope of HCI as

“[...] the discipline that studies the linguistic aspects of human-human and human-machine communication, develops models of linguistic competence and performance, employs computational frameworks to implement processes incorporating such models, identifies methodologies for iterative refinement of such processes/models, and investigates techniques for evaluating the resultant systems.” [Man98, p. 6]

83 http://movielens.org/ (accessed on June 27, 2014)
84 http://lenskit.grouplens.org/ (accessed on June 27, 2014)
85 http://mahout.apache.org/ (accessed on June 27, 2014)

NLP deals with text [cf. BKL09] and as such it borrows techniques from IR, machine learning, or text categorization [MS99, p. xxxi]. In IR, for example, NLP attracted wide attention as a means to better grasp the intent behind natural language queries and to index unstructured text for better retrieval [LS96].

The principal problem in dealing with natural language texts is the ambiguity in the language and the variety of structure in the sentences. Thus, “a practical NLP system must be good at making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope” [MS99, p. 18].

2.5.6.1 Approaches

NLP entails various processes that each rely on different kinds of knowledge sources. A general model of NLP systems was suggested in [Bat95]. Figure 2.20 shows the classes of processes that operate at varying knowledge levels [see also Man98, pp. 30–32] in order to interpret input text. NLP processors operate at the lexical (morphological features), syntactic (word structure), and semantic (meaning) levels, as well as at the levels of domain (known concepts and relations), discourse (sequence of sentences), and pragmatic (purpose or intended goal) knowledge.

Figure 2.20: Processes of an NLP system [from Bat95] (labels shown: Words, Understanding, Search, Output(s), Lexicon, Syntax, Semantics, Domain Model, Discourse/Pragmatics)

In Section 2.4 (semantic data interoperability), we have already learned some techniques of NLP. Subsequently, we extend them by briefly summarizing some of the most popular NLP approaches:


• Regular expressions for pattern matching and information extraction [cf. BKL09, pp. 97–106].

• (Pre-)processing steps based on words and sentences like stop word elimination [cf. BKL09, p. 60], tokenization [cf. BKL09, pp. 109–112], lemmatization [cf. BKL09, p. 108], sentence segmentation [cf. BKL09, pp. 112f.], or stemming (e.g. Lovins [Lov68] and Porter [Por80] stemmers) [cf. BKL09, pp. 107f.].

• Word sense disambiguation (WSD) [Nav09] to find out the intended meaning of a word [cf. BKL09, p. 28].

• Part-of-speech (POS) tagging as a method to categorize words based on the role they play in a sentence (i.e. noun, verb, etc.) [cf. BKL09, pp. 179–189]. POS tagging algorithms are typically based either on stochastic or on rule-based (e.g. the Brill tagger) approaches [Bri79]. This makes it possible to distinguish the sense of the word limit in “price limit” and “to limit a risk”.

• Named entity recognition (NER) to detect interesting named entities in text such as organizations, persons, or locations [cf. BKL09, pp. 281–283]. On top of that, relation extraction then tries to extract existing relationships between the named entities [cf. BKL09, pp. 283–285].

• A grammar to define the structure of natural language. Given an input sentence, the parser generates a parse tree based on the supplied grammar. A probabilistic context-free grammar is able to produce parse trees with associated probabilities that can be ranked [cf. BKL09, pp. 291–321].
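The first two items of the list above can be illustrated with a minimal sketch that relies on the Python standard library only. The sample text, the stop word list, the price pattern, and all function names are assumptions made for illustration. In particular, `naive_stem` is a crude suffix stripper and not an implementation of the Lovins or Porter stemmers.

```python
import re
from typing import List

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "it", "at", "in", "of", "and", "to", "with"}

# Regular expression for information extraction: a currency code and an amount.
PRICE = re.compile(r"(EUR|USD)\s?(\d+(?:\.\d{2})?)")

TEXT = "The camera is priced at EUR 499.00. It ships with two lenses."


def split_sentences(text: str) -> List[str]:
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def naive_stem(token: str) -> str:
    """Strip a few plural and gerund suffixes (NOT the Lovins or Porter algorithm)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence: str) -> List[str]:
    """Tokenization, stop word elimination, and stemming in sequence."""
    return [naive_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]


print(PRICE.findall(TEXT))                     # prints [('EUR', '499.00')]
print(split_sentences(TEXT)[1])                # prints It ships with two lenses.
print(preprocess(split_sentences(TEXT)[1]))    # prints ['ship', 'two', 'lens']
```

A production system would rather rely on an established library for these steps, because naive rules like the ones above fail on abbreviations, irregular word forms, and many other cases.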

In the course of this thesis, we will not use advanced NLP techniques. Integrating them into our approach is promising future work.

2.5.6.2 Tools

There exist a number of tools and programming libraries for NLP. Stanford NER86 is a NER tool written in Java. The Natural Language Toolkit (NLTK)87 is a powerful Python framework for NLP. For example, it supports NLP features like POS tagging, implements various classifiers, and comes with an interface for Stanford NER. The library is complemented by an accompanying book about NLP in Python [BKL09]. General Architecture for Text Engineering (GATE)88 [Cun02] is another popular framework that implements several features for text processing, including information extraction and NLP features. Finally, Apache OpenNLP89 is an open-sourced machine learning library for NLP with support for the most common NLP tasks.

86 http://nlp.stanford.edu/software/CRF-NER.shtml (accessed on June 20, 2014)
87 http://www.nltk.org/ (accessed on June 20, 2014)
88 https://gate.ac.uk/ (accessed on June 20, 2014)

2.5.7 Matchmaking

The term matchmaking is widely used in e-commerce literature and describes the problem of

1. searching for potential matches between supply90 and demand91, and

2. selecting the best fitting object among a set of candidates, even if no perfect match is available [e.g. Hof+00; Vei03; AL05; Di +03].

Di Noia, Di Sciascio, Donini, and Mongiello [Di +03] define the matchmaking problem as follows:

“The process of searching the space of possible matches between demand and supplies can be defined as Matchmaking. Notice that this process is quite different from simply finding, given a demand a perfectly matching supply (or vice versa). Instead it includes finding all those supplies that can to some extent fulfill a demand, eventually propose the best ones.” [Di +03]

While the matchmaking problem has existed ever since people began trading goods, it has become more complex, challenging, and important since market transactions are taking place in electronic networks, and recently on the WWW [cf. Di +03; DR10]. Freuder and Wallace already commented in 1998 on the prospective role of matchmaking, that “[i]ntelligent matchmakers can be regarded as a third generation tool for Internet accessibility, where hypertext constitutes the first generation, and search engines the second.” [FW98].

2.5.7.1 Characteristics of Matchmaking

Some aspects of matchmaking are often not sufficiently addressed by existing matchmaking approaches. The key characteristics are:

1. Matchmaking is a learning process where the user learns about the option space during search.

2. Matchmaking is an iterative process spanning multiple cycles.

3. Matchmaking is bi-directional, where both buyer and supplier have preferences that they might specify.

4. Matchmaking should support the gradual disclosure of information, i.e. only provide as much information as necessary and only when needed.

5. Matching items are multi-dimensional objects that need to be considered by the matchmaker and the corresponding resource specification language.

6. Matches are rarely perfect and most often partial.

7. Human involvement is key to successful matchmaking; this includes query refinement/relaxation and match explanation.

8. Matchmaking has to deal with multiple, heterogeneous sources and differing modalities.

9. Context (e.g. past queries, location, skills) is a relevant criterion for matchmaking.

10. Matchmaking often needs to deal with dynamic information.

11. Matchmaking should regard temporal constraints, i.e. be responsive to short-term needs.

89 https://opennlp.apache.org/ (accessed on June 28, 2014)
90 also: resource description, offer, advertisement
91 also: request, query

In the following, we summarize how these characteristics of matchmaking have been tackled in literature.

Learning Process

Matchmaking is a learning process [e.g. Hof+00; Vei+01; FW98]. Especially for complex products, a multi-phased dialog with intermediate feedback is indispensable [Hof+00]. Colucci et al. [Col+06] propose gradual refinement of the query in a multi-stage dialog.

Multi-Dimensional Matchmaking

Matchmaking is not a one-dimensional, but a multi-dimensional problem, where complex request and resource profiles need to be matched. On many Web portals, for example, products are compared according to their prices, and product characteristics have to be compared manually [AL05]. Agarwal and Lamparter suggest a semantic approach to match product descriptions from different providers in e-marketplaces based on multiple criteria [AL05]. In [Vei+01; Eit+01], the authors treat matchmaking as a multi-dimensional matching problem and suggest an XML-based matchmaking approach (GRAPPA) that is able to match heterogeneous and multi-dimensional items. They apply their matchmaker to the domain of human resources, where they match job applicant profiles to job advertisements.
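As a rough illustration of multi-dimensional matching with partial matches, the following Python sketch scores offers against a demand by the weighted share of satisfied constraints and ranks them accordingly. The scoring scheme, the dimensions, the weights, and the offers are hypothetical. The sketch reproduces neither GRAPPA nor the semantic approach of [AL05].

```python
from typing import Any, Callable, Dict, List, Tuple

Offer = Dict[str, Any]
# A demand maps each dimension to a (predicate, weight) pair; both the
# predicates and the weights are assumptions made for illustration.
Demand = Dict[str, Tuple[Callable[[Any], bool], float]]


def match_degree(demand: Demand, offer: Offer) -> float:
    """Weighted share of demand constraints that the offer satisfies (0.0 to 1.0)."""
    total = sum(weight for _, weight in demand.values())
    met = sum(
        weight
        for dimension, (predicate, weight) in demand.items()
        if dimension in offer and predicate(offer[dimension])
    )
    return met / total if total else 0.0


def rank(demand: Demand, offers: List[Offer]) -> List[Tuple[float, str]]:
    """Order offers by descending match degree; partial matches are kept."""
    scored = [(match_degree(demand, offer), offer["id"]) for offer in offers]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical demand: price counts twice as much as the other dimensions.
demand: Demand = {
    "price": (lambda p: p <= 500, 2.0),
    "megapixels": (lambda m: m >= 20, 1.0),
    "color": (lambda c: c == "black", 1.0),
}
offers: List[Offer] = [
    {"id": "o1", "price": 450, "megapixels": 24, "color": "silver"},
    {"id": "o2", "price": 520, "megapixels": 24, "color": "black"},
    {"id": "o3", "price": 600, "megapixels": 16, "color": "black"},
]

print(rank(demand, offers))  # prints [(0.75, 'o1'), (0.5, 'o2'), (0.25, 'o3')]
```

None of the three offers is a perfect match, yet the ranking still proposes the best fitting candidate. Comparing on price alone would only separate the offers along a single dimension.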

Human Involvement

Many approaches from literature underrate the importance of human intervention in the matchmaking process. The user should be able to explore the option space [e.g. Hof+00; Col+06]. Colucci et al. [Col+06] suggest a visual approach to matchmaking that lets users explore the marketplace option space. This approach permits users to set negotiable characteristics and to select bonus characteristics that the matchmaker should possibly take into account [Col+06].

Trust and Confidence

Some authors address the problem of trust and confidence in matchmaking [e.g. Col+06; Hof+00]. Hoffner, Facciorusso, Field, and Schade [Hof+00] propose a staged dialog between electronic marketplace customers and providers where trust builds up gradually as sensitive information is only disclosed as necessary. Colucci et al. [Col+06] show a semantic-based explanation of the match degree to the user.

Bidirectional Nature

Many matchmaking proposals disregard the bi-directional nature of matchmaking. That is, matchmaking should consider the symmetric exchange of information between customer and provider in order to also supply the provider with necessary details about the customer’s promises, and not only the customer with information about the supplier [Hof+00]. Ragone, Straccia, Bobillo, Di Noia, and Di Sciascio [Rag+08] merge the discovery and negotiation phase into a bilateral matchmaking approach to increase the utility of both transaction partners.

Temporal Aspect Finally, time can be an important criterion for successful matchmaking. One challenge is interrupted sessions. Hoffner, Facciorusso, Field, and Schade [Hof+00] noted that a virtual marketplace should provide long-life sessions, which can be interrupted and restarted or resumed. Hence, a matchmaking process is not limited in time [Hof+00]. Another aspect is the duration of the matchmaking process. Stollberg, Hepp, and Hoffmann [SHH07] suggested a caching mechanism to improve Web service discovery. The proposal reduces the search space for successive discovery operations [SHH07].


2.5.7.2 Matchmaking and Information Retrieval

Kuokka and Harada [KH96] address the problem of matchmaking in the context of the increasing availability of information on the Web. They propose matchmaking as a possible solution to notify users about information that appears and changes dynamically in a large information environment. They implemented and evaluated two matchmakers, SHared DEpendency Engineering (SHADE) and COmmon INterest Seeker (COINS), one that matches on formal representations and the other on free text [KH96]. To facilitate matchmaking on the Internet, Sycara, Widoff, Klusch, and Lu specified a language for describing agent capabilities, termed Language for Advertisement and Request for Knowledge Sharing (LARKS) [Syc+02]. They proposed a matchmaking agent based on LARKS whose matching process supports context, profile, similarity, signature, and constraint matching [Syc+02].

Di Noia, Di Sciascio, Donini, and Mongiello [Di +03] argue that classical IR techniques, as suggested in [KH96] and [Syc+99], are not appropriate for matchmaking. IR is mostly about matching free-text queries against document collections with textual content. The limitations of the vector space model of IR with regard to matchmaking can be revealed by an example consisting of a demand “apartment with 2 rooms in soho pets allowed no smokers” and a supply “apartment with 2 rooms in soho no pets smokers allowed” [Di +03]. In the vector space, both sentences are represented by identical term vectors, although their meanings are obviously contradictory.
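The effect can be reproduced with a few lines of Python. The following is only a minimal sketch of a bag-of-words term vector, assuming whitespace tokenization and raw term frequencies as weights; the helper names term_vector and cosine are illustrative and do not stem from [Di +03].

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> frequency. Word order is discarded."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


demand = "apartment with 2 rooms in soho pets allowed no smokers"
supply = "apartment with 2 rooms in soho no pets smokers allowed"

# Both texts contain exactly the same terms, so the vectors coincide
# even though the demand and the supply contradict each other.
print(term_vector(demand) == term_vector(supply))  # True
print(cosine(term_vector(demand), term_vector(supply)))
```

Because the model only counts terms, the position of “no” relative to “pets” and “smokers” is lost, which is precisely the information that distinguishes the two descriptions.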

2.5.7.3 Ranking with Match Degrees

The ranking of matches within matchmaking is multi-dimensional, taking into account multiple attributes that a user might choose from [Col+06].

The problem of many matchmaking approaches is that they lack a classification and ranking of matches [Di +03]. Di Noia, Di Sciascio, Donini, and Mongiello propose a ranking mechanism based on description logics, a logical language that OWL is based on. The ranking of supplies is determined by the type of relation: exact, potential, and partial matches [Di +03].

Colucci et al. [Col+06] provide a more detailed categorization of matches: exact, full, plug-in, potential, and partial matches [Col+06]. Figure 2.21 illustrates the possible match types given a demand D and a supply S [Col+06]:

• Exact: D = S (D is equivalent to S)

• Full: D ⊂ S (D is more general than S)

• Plug-in: D ⊃ S (D is more specific than S)

• Potential: D ∩ S ≠ ∅ and D \ S ≠ ∅ and S \ D ≠ ∅ (D is compatible with S)

• Partial: D ∩ S ≠ ∅ and ∃d ∈ D, ∃s ∈ S, where d is incompatible with s (D is not compatible with S)

Figure 2.21: Match classes [based on Col+06]. For the demand {car, red}, the figure shows one supply per match class:

Match class   Demand       Supply
Exact         {car, red}   {car, red}
Full          {car, red}   {car, red, convertible}
Plug-in       {car, red}   {car}
Potential     {car, red}   {car, convertible}
Partial       {car, red}   {car, blue}

The ranking is defined in the order partial → potential → full → exact. Special care is required for the ranking of plug-in matches, because they favor underspecified supplies [Col+06]. It must be guaranteed that more specific resource descriptions are not treated as inferior to unfairly generic or less specific descriptions.

Imagine a demand for a red car, D = {car, red}, and supplies S as in Figure 2.21. It is easy to see that the match classes can change when the request is revised. For example, a refinement to D = {car, red, convertible} would turn a previous exact match (S = {car, red}) into a plug-in match, whereas a relaxation to D = {car} transforms it into a full match.
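The match classes and the effect of revising a request can be illustrated with plain Python sets. This is a simplified sketch that treats demands and supplies as flat sets of features, as in Figure 2.21, rather than as description logic concepts as in [Col+06]. The function name match_class, the incompatible parameter (a list of feature pairs that exclude each other), and the "none" result for descriptions without any common feature are assumptions made for illustration.

```python
def match_class(demand, supply, incompatible=()):
    """Classify a demand/supply pair of feature sets into a match class."""
    d, s = frozenset(demand), frozenset(supply)
    conflicts = {frozenset(pair) for pair in incompatible}
    overlap = d & s
    # Partial: common features exist, but some feature pair is incompatible.
    if overlap and any(frozenset((x, y)) in conflicts for x in d for y in s):
        return "partial"
    if d == s:
        return "exact"
    if d < s:   # proper subset: the demand is more general than the supply
        return "full"
    if d > s:   # proper superset: the demand is more specific than the supply
        return "plug-in"
    if overlap:  # neither contains the other, so both differences are non-empty
        return "potential"
    return "none"  # assumption: no common feature at all


demand = {"car", "red"}
print(match_class(demand, {"car", "red"}))                 # exact
print(match_class(demand, {"car", "red", "convertible"}))  # full
print(match_class(demand, {"car"}))                        # plug-in
print(match_class(demand, {"car", "convertible"}))         # potential
print(match_class(demand, {"car", "blue"}, incompatible=[("red", "blue")]))  # partial

# Revising the request changes the class of the supply {car, red}:
print(match_class({"car", "red", "convertible"}, {"car", "red"}))  # plug-in
print(match_class({"car"}, {"car", "red"}))                        # full
```

The partial case is checked first because a pair such as {car, red} and {car, blue} would otherwise also satisfy the set conditions of a potential match.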

Buyer and seller preferences in e-marketplaces are often vague [e.g. Rag+08; AL05]. Consider a buyer looking for a car that costs less than 15,000 Euros and has a mileage of 50,000 kilometers. Now, a seller offers a car for 15,500 Euros with a mileage of 45,000 kilometers. This would in fact be an interesting offer, although it slightly exceeds the price limit. Ragone, Straccia, Bobillo, Di Noia, and Di Sciascio [Rag+08] propose a fuzzy matchmaking solution based on description logics that allows for conditional preferences by supplying hard and soft constraints. Agarwal and Lamparter [AL05] present an intelligent matchmaking portal where fuzzy user requests are relaxed to interval queries that are then matched against a marketplace. Stuckenschmidt and Kolb [SK08] suggest a partial matchmaking approach that uses approximate subsumption to match complex product descriptions. Based on this partial matchmaking approach, Nowakowski and Stuckenschmidt [NS10] present a demonstrator where a user can gradually relax constraints in a product catalog based on eClassOWL.
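The car example can be made concrete with a small sketch. It does not reproduce the fuzzy description logic approach of [Rag+08] or the portal of [AL05]; it merely assumes that both buyer constraints are upper bounds and that a soft constraint tolerates a fixed relative excess. The function name satisfies and the tolerance of 5% are hypothetical choices.

```python
def satisfies(offer, upper_bounds, tolerance=0.0):
    """Check an offer against upper bounds, each relaxed by a relative tolerance."""
    return all(
        offer[attribute] <= limit * (1 + tolerance)
        for attribute, limit in upper_bounds.items()
    )


request = {"price_eur": 15_000, "mileage_km": 50_000}  # buyer's limits
offer = {"price_eur": 15_500, "mileage_km": 45_000}    # seller's car

print(satisfies(offer, request))                  # False: the price is too high
print(satisfies(offer, request, tolerance=0.05))  # True: within 5% of the limit
```

With crisp constraints the offer is rejected outright, whereas the relaxed query retains it as a candidate that can be ranked below offers meeting the original limits.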


2.5.7.4 Related Research Fields and Application Areas

In fact, the problem of matchmaking can be found in many research fields and application domains outside the narrower context of business. Its interdisciplinary relevance has led to a rich body of work on matchmaking from various angles, like

• intelligent, agent-based systems [e.g. Syc+99; KH96],

• Web service discovery [e.g. Pao+02; LH03; GTB01; Sto+07; Wei+11],

• human resource matching (e.g. match job profiles with applicants) [e.g. VWM02],

• Grid resource matching (e.g. allocate computational resources for tasks) [e.g. RLS98; Har+04; LR05; Ama+09] and Cloud service discovery [e.g. Zar+13],

• real estate matching [e.g. Poo+11],

• recommendations in multiplayer games (e.g. suggestion of proper competitors) [e.g. JJD11; Man+11],

• genetic matchmaking (e.g. mating in biology) [e.g. Wed+95], or

• marital matchmaking (i.e. finding a spouse) [e.g. BD08].

3 Data Collection


3.2 State of the Art and Related Work
    3.2.1 Approaches for the Collection of Structured Data
        3.2.1.1 Standalone Crawlers
        3.2.1.2 Embedded Crawlers
        3.2.1.3 Extractors
    3.2.2 Web Data Commons
3.3 Sweet-Spot Deep Crawling Approach
    3.3.1 Ping-based Discovery of Relevant Web Sites
    3.3.2 Crawling Strategy and Implementation
        3.3.2.1 Sitemap-based Crawling
        3.3.2.2 Spider-based Crawling
    3.3.3 Extraction of Structured Data
        3.3.3.1 Identification
        3.3.3.2 Politeness
        3.3.3.3 Storage
3.4 Evaluation and Analysis
    3.4.1 Method
    3.4.2 Results
        3.4.2.1 Shop Statistics
        3.4.2.2 Property Statistics
    3.4.3 Comparison with Web Data Commons
        3.4.3.1 Quantitative Comparison of Entities in WDC and GRC
        3.4.3.2 Quantitative Comparison of Structured Data in Web Shops
3.5 Conclusion

In recent years, the publication of structured data inside the Hypertext Markup Language (HTML) content of Web sites has become a mainstream feature of commercial Web sites. In particular, e-commerce sites have started to add Resource Description Framework in Attributes (RDFa) or Microdata markup based on schema.org [SchND] and GoodRelations [Hep08a]. For many potential usages of this huge body of data, we need to crawl the sites and extract the data from the markup. Unfortunately, a lot of markup can be found in very deep branches of the sites, like product detail pages. Such pages are difficult to crawl because of their sheer number and because they often lack links pointing to them, which means they cannot be found by a spider-based crawling approach [e.g. BP98; CC03; Ara+01]. In this chapter, we analyze the approach taken by the Web Data Commons (WDC) initiative, propose an alternative, ping-based crawling strategy that focuses on the deep detail pages of e-commerce Web sites, and compare the results from our crawl of 2,628 shops with the WDC corpus, in particular with regard to the amount of data and vocabulary usage. We provide evidence that popular Web crawlers like Common Crawl fail to detect most of the product detail pages that hold a majority of the data, and show that the statistics of structured data in Web shops from our deep Web crawl differ significantly from the WDC dataset. Our crawl is used in the following chapters as data for our prototype.


Today, ever more vendors are offering and selling their products on the Web. The spectrum of online vendors ranges from a few very large retailers like Amazon¹ and Best Buy², featuring thousands of goods, to many small Web shops with much smaller assortments. Consequently, the amount of product information available online is very significant.

While product descriptions on the Web were mostly unstructured in the past, the situation has meanwhile changed. Numerous online shops have started to expose product offers using structured data markup in RDFa, Microdata, or Microformats. Repeated statistics of the Common Crawl corpus generally affirm this ongoing trend [MPB14] (see also Section 3.2.2 for more details).

To some extent, the semantic annotations of product offers on the Web have been promoted by several search engine operators that offer tangible benefits and incentives for Web shop owners. For instance, Goel, Guha, and Hansson publicly declared in 2009 via the Google Webmaster Central blog:

“Today, we’re announcing Rich Snippets, a new presentation of snippets that applies Google’s algorithms to highlight structured data embedded in web pages. [...] When searching for a product or service, users can easily see reviews and ratings [...] our experiments have shown that users find the new data valuable – if they see useful and relevant information from the page, they are more likely to click through.” [GGH09]

¹ http://www.amazon.com/ (accessed on October 19, 2015)
² http://www.bestbuy.com/ (accessed on October 19, 2015)

Similarly, Microsoft states on the Bing Ads Web site:

“Drive traffic to your sites for free with Bing Rich Captions. When customers search and your product pages appear in the organic search results, the price and ratings information may appear below the blue links in the search results. These details help potential customers make informed decisions about the organic results they want to click on and may move them closer to a purchase decision.” [MicND]

To summarize, search engines claim that additional data granularity can increase the visibility of single Web pages in the form of rich snippets (or rich captions, in Bing terminology) displayed on search engine results pages (SERPs). Early attempts with enhanced result snippets were conducted in the SearchMonkey project [Mik08], where the potential benefits of this technique were already laid out, namely increased and higher-quality traffic for Web site owners, and a better search experience for users [Mik08]. Haas, Mika, Tarjan, and Blanco also found a general preference of users for enhanced search results and measured a higher click-through rate (CTR) [Haa+11]. In addition to those immediate effects for publishers and consumers, structured data can provide useful relevance signals that search engines can take into account in their ranking algorithms to deliver more appropriate search results. These generic prospects obviously hold for the concrete case of structured product data as well.

Unfortunately, for regular data consumers other than Google or Bing³, it is difficult to make use of the wealth of product information on the Web. There are at least two problems that arise in this context:

1. Crawling the Web for structured product offers is very resource-consuming due to the number of relevant pages.

2. The data granularity of product offers supplied by Web shops is relatively low.

As problem 1 suggests, an obvious challenge is related to crawling the Web. The sheer size of the Web and its dynamics make it infeasible to conduct an exhaustive crawl that would yield a comprehensive and current snapshot of all product information published by Web shops. Due to scarce crawling resources, established crawling strategies usually prioritize the visiting of Web pages. For example, they take into account the contents of sitemaps [SS09], the page load time [SC10; cf. Cas+04], or they honor the link structure of Web pages (i.e. PageRank [Pag+98; BP98]) [SH15c; cf. MPB14; MMB14]. Thus, from a crawler point of view, landing pages of Web sites constitute the more popular part of the Web, whereas product item pages generally are of lower priority. On that account, we claim that popular crawling strategies often fail to reach the product detail pages where the interesting product data resides. This problem was publicly debated in a World Wide Web Consortium (W3C) mailing list thread from 2012⁴, where concerns were raised that the usual way of crawling the Web, as performed by Common Crawl, fails to reach a major part of relevant product item pages. We have shown in [SH15c] that Common Crawl systematically misses a large share of the deep detail pages of Web sites, and should thus not be regarded as a representative sample of the Web for e-commerce.

³ Popular search engines are often directed to Web pages proactively by shop owners.

Problem 2 refers to the issue that structured product information on the Web is generally not sufficient for deep product comparison, because its level of detail is very limited. The majority of structured product data on the Web reflects only very basic commercial details of product offers. A likely reason is that the immediately beneficial effects of adding structured data in major search engine results (like Google rich snippets) do not require sophisticated annotations of products, which would undoubtedly come at the cost of the simplicity of publishing structured data on the Web. To be valuable for regular data consumers though, unstructured product data would need to be cleansed and lifted to a higher data granularity (e.g. using natural language processing (NLP) techniques or machine learning [cf. PBB14]), or existing structured data would need to be enriched via additional data sources (see Chapters 4 and 5).

For testing the aforementioned assumptions, we use a “positive list” of e-commerce sites with data markup, which we obtained from a site registration service that is pinged by many popular extension modules for shop software applications. We then propose a crawling strategy for e-commerce, conduct a deep Web crawl based on our seed list, and analyze the quantity and structure of the extracted data. Finally, we determine the overlap of our crawl with the WDC corpus, whereby we challenge the representativeness of Common Crawl with respect to e-commerce data on the Web. Our goal is first to provide a corpus of data for our further work, but also to understand the usability of the WDC corpus for real-world applications in the e-commerce domain.

The rest of this chapter is structured as follows: Section 3.2 presents related work on Semantic Web crawlers and structured data statistics; in Section 3.3, we detail our deep crawling approach; in Section 3.4, we analyze our crawl and compare our work with the WDC dataset; and Section 3.5 concludes the chapter.

⁴ http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html (accessed on July 15, 2014) and http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html (accessed on July 15, 2014)


3.2 State of the Art and Related Work

In this section, we present current approaches for the collection of Semantic Web data and existing statistics about structured data on the Web.

3.2.1 Approaches for the Collection of Structured Data

Harvesting structured content from the Web poses specific challenges to Web crawlers, spiders, or sniffers, e.g. in terms of link traversal across resources and the indexing of structured content in heterogeneous formats from various data sources [HUD06]. Various approaches have been suggested for collecting structured data from the Web. Hereinafter, we distinguish between standalone crawlers [e.g. Dod06; Ise+10; MMB14], crawlers as part of semantic search engines [e.g. Din+04; Ore+08; Hog+11; DM11], and extractors [e.g. MPB14]. Even though we tried to cover the most prominent and widely used tools, it is beyond the scope of this work to offer a comprehensive survey of Semantic Web crawlers.

3.2.1.1 Standalone Crawlers

Popular standalone Semantic Web crawlers include Slug [Dod06] and LDSpider [Ise+10]. Both are configurable crawling frameworks targeted at harvesting Semantic Web data. In particular, they follow semantic links (i.e. RDF links between resources), extract structured data, and store metadata about visited pages. Meusel, Mika, and Blanco [MMB14] describe Anthelion, a focused crawler targeted at structured data in Web pages that uses an intelligent selection strategy based on a combination of an online classifier and a trade-off approach between exploitation and exploration of Web page candidates. As an alternative to Semantic Web crawlers, general-purpose Web crawling frameworks like Scrapy⁵ can be configured to crawl the Web for structured data, e.g. to scrape Web pages for RDFa and Microdata content.

3.2.1.2 Embedded Crawlers

Semantic search engines typically extract and index metadata from various data sources, e.g. annotated Web pages, published Resource Description Framework (RDF) files, or online RDF repositories. Examples of popular semantic search engines are Swoogle [Din+04], Sindice [Ore+08], SWSE [Hog+11], or Watson [DM11]. On top of this, the crawling components of major commercial search engines that have recently announced that they process structured data (e.g. Google, Bing, Yahoo!, or Yandex) could likewise be counted among this kind of crawlers.

⁵ http://scrapy.org/ (accessed on August 6, 2014)

3.2.1.3 Extractors

Extractors are software components that aim at extracting relevant information from previously conducted crawls or other data sources. In other words, extractors do not crawl autonomously. For example, WDC [e.g. MPB14] is a project that focuses on extracting structured data in RDFa, Microdata, and Microformats from the Common Crawl corpus. Similarly, the Virtuoso Sponger⁶ can be prompted to derive RDF data from a variety of different data sources, such as Web application programming interfaces (APIs), comma-separated values (CSV) files, or annotated Web pages.

3.2.2 Web Data Commons

Various studies indicate a recent increase of structured data markup on the Web. Based on a Bing crawler sample from early 2012, Mika and Potter reported that a significant share of over 30% of the Web pages in their broad Web crawl featured structured data markup [MP12]. The findings in the context of the Web Data Commons (WDC) project show similar results. WDC is an effort by a research group headed by Christian Bizer. They regularly extract metadata from the Common Crawl corpus, a project that aims at making Web content accessible to the public for research purposes and practical applications alike, so that people and organizations do not have to conduct costly crawls by themselves. A study based on such a Common Crawl corpus from February 2012 revealed that about 12% of all HTML pages already contained structured data markup (in contrast to 6% in 2010) [MB12]. For a corpus of November 2013, this figure is even higher at 26% [MPB14; WebNDd].

Figure 3.1 illustrates how RDFa and Microdata have evolved throughout three different WDC datasets from 2012 to 2014 [WebNDa; WebNDd; WebNDc]. Figure 3.1a shows the share of the formats with respect to the domains (or Web sites), whereas Figure 3.1b outlines it relative to all extracted entities. Although there was a significant growth of structured data (Microdata in particular) from 2012 to 2013, the increase slowed down in 2014.

⁶ http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSponger (accessed on May 26, 2014)


(a) Distinct domains with structured data (b) Number of entities from structured data

Figure 3.1: Distribution of syntaxes in the Web Data Commons

Table 3.1 gives a detailed overview of the main statistics of the three WDC datasets [WebNDb; WebNDa; WebNDd]. With regard to the total number of crawled domains and entities, the amount of structured data even decreased a bit from 2013 to 2014 (see Figure 3.1). Nonetheless, it is interesting to see that Microdata has outperformed RDFa, which is likely due to Microdata being perceived as the simpler data format and to the push by search engines, which have long indicated a slight preference for Microdata. Figure 3.2 shows the number of entities per domain based on the numbers from Table 3.1.

Table 3.1: Structured data in the Web Data Commons [based on WebNDa; WebNDd; WebNDc]

                August 2012               November 2013             December 2014
                  Domains       Entities    Domains       Entities    Domains       Entities
Total markup    2,286,277  1,811,471,956  1,779,935  4,264,562,758  2,722,425  5,516,068,263
Format
  RDFa            519,379    188,243,535    471,406    436,100,210    571,581    405,541,283
  Microdata       140,312    266,169,151    463,539  1,964,777,851    819,990  2,209,497,281
Class
  gr:Offering       1,342        371,864      2,199        498,825      2,196        440,403
  s:Offer           8,456     13,725,226     35,635    154,407,699     62,849    236,952,507

With respect to markup for product offers, the results are as promising as the overall growth of structured data markup. In the analysis of the Common Crawl corpus from August 2012, Bizer et al. [Biz+13] detected 1,342 sites with structured product offers in RDFa (gr:Offering entities), and 8,465 sites with structured product offers in Microdata syntax (s:Offer entities). This amounts to 0.26% of the sites containing markup in RDFa and 6.03% in Microdata, respectively [Biz+13]. A more recent extract from the Common Crawl corpus of December 2014 further reports 2,196 sites (0.38%⁷) with gr:Offering entities in RDFa, and 62,849 sites (7.66%⁸) with s:Offer entities in Microdata. The respective trends are depicted in Figure 3.3.

⁷ domains(gr:Offering) / domains(RDFa) = 2,196 / 571,581 = 0.38%
⁸ domains(s:Offer) / domains(Microdata) = 62,849 / 819,990 = 7.66%

Figure 3.2: Average number of entities per domain in Common Crawl corpora (log-scaled y-axis)

Figure 3.3: Share of structured product offer data with respect to domains with RDFa (for gr:Offering) and Microdata (for s:Offer) in the Web Data Commons from 2012–2014
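The domain shares quoted for August 2012 and December 2014 can be recomputed from the domain counts in Table 3.1. The snippet below is only a worked calculation; the dictionary layout and the helper name share_percent are illustrative.

```python
# Domain counts per dataset, taken from Table 3.1.
DOMAINS = {
    "August 2012": {"RDFa": 519_379, "Microdata": 140_312,
                    "gr:Offering": 1_342, "s:Offer": 8_456},
    "December 2014": {"RDFa": 571_581, "Microdata": 819_990,
                      "gr:Offering": 2_196, "s:Offer": 62_849},
}


def share_percent(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


for dataset, counts in DOMAINS.items():
    rdfa_share = share_percent(counts["gr:Offering"], counts["RDFa"])
    microdata_share = share_percent(counts["s:Offer"], counts["Microdata"])
    print(f"{dataset}: gr:Offering {rdfa_share}% of RDFa domains, "
          f"s:Offer {microdata_share}% of Microdata domains")
```

The calculation relates each offer class to the domains of its own syntax only, i.e. gr:Offering to RDFa domains and s:Offer to Microdata domains, as in Figure 3.3.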

Figure 3.4 relates the share of structured product offers to the total amount of structured data (RDFa, Microdata, and Microformats) in the WDC. In particular, it shows the share of gr:Offering and s:Offer entities at domain level (see Figure 3.4a) and entity level (see Figure 3.4b) relative to all structured data found in the Common Crawl corpora from 2012, 2013, and 2014 [WebNDa; WebNDd; WebNDc]. According to Figure 3.4a, more than two percent of the domains with structured data contained product offer data in 2013 and 2014, as opposed to half a percent in 2012. Similarly, as per Figure 3.4b, the amount of structured e-commerce data increased fivefold (in relative figures) between the 2012 and 2013 datasets.


(a) Percentage of domains (b) Percentage of entities

Figure 3.4: Share of structured product offer data with respect to all structured data markup in the Web Data Commons from 2012–2014

To sum up, a potential bias in these statistics cannot be ruled out (e.g. due to the variety of the crawl samples), but multiple independent studies provide more than anecdotal evidence that structured e-commerce data on the Web is on the rise.

3.3 Sweet-Spot Deep Crawling Approach

In this section, we present a focused, depth-first Web crawler for structured e-commerce data. A focused Web crawler is capable of quickly collecting pages on a specific topic [CvBD99]. In our case, the crawler focuses on the product detail pages of e-commerce Web sites that contain descriptions of product offers in GoodRelations. Using a depth-first crawling strategy permits us to quickly reach the deep levels of Web sites that otherwise would often not be judged relevant by search algorithms. This is in contrast with breadth-first crawling, which aims at quickly finding the most relevant pages according to a high link density [cf. NW01].

3.3.1 Ping-based Discovery of Relevant Web Sites

A central question for every Web crawler is where to start the crawling process. Usually, an initial list of seed Uniform Resource Identifiers (URIs) is provided that can be extended or updated continuously as the crawl progresses.

We prepared such an initial seed list for our GoodRelations crawl. Our primary data sources come from a list of URIs that we collected using a central registry component and notification service for GoodRelations-empowered Web pages and Web shops, namely GR-Notify⁹. Over time, the list of GoodRelations resources contained in the registration service has been augmented by ping submissions from the following channels:

• Notifications by popular open source Web shop extensions, e.g. for Magento, Joomla/VirtueMart, or PrestaShop. Some of them submit the shop URI autonomously after successful installation, others ask shop owners to manually register their shop URI via form-based submission.

• Log file analysis of our tool chain (e.g. GoodRelations Snippet Generator¹⁰), and crowdsourcing approaches like Grome¹¹, a Google Chrome browser plug-in that pings GR-Notify when a user with the active browser plug-in visits a page containing GoodRelations. The idea is similar to the one presented in [SM12].

• Human users pinging the registry service themselves via a form-based user interface for URI submission, e.g. implementers that followed the GoodRelations Quickstart Guide¹².

Furthermore, we added selected data sources from a manually maintained list of Web shops that we know to expose GoodRelations data¹³. This includes e-commerce service providers (e.g. www.rakuten.de, formerly www.tradoria.de), large retailers (e.g. www.sears.com), and small Web shops.

3.3.2 Crawling Strategy and Implementation

Our crawler extracts data from Web pages annotated with GoodRelations markup in RDFa and Microdata, including data about product offers, product instances, and product models [cf. Hep08a]. Furthermore, the crawler accepts the following types of input URIs (raw and compressed using gzip): XML sitemaps (or sitemap indexes for larger shops) [Sit08], single item pages, product category pages, and shop main pages.

The crawler was implemented as a Python module with parallelization, i.e. several domains can be crawled concurrently. The overall crawling task is split into smaller chunks that are individually distributed to a pool of processes¹⁴. Each process manages its own process stack, i.e. a distinct area of memory allocated to the program code and the variables of the respective process. This ensures the autonomy of every process.

⁹ http://gr-notify.appspot.com/ (accessed on May 22, 2014)
¹⁰ http://www.ebusiness-unibw.org/tools/grsnippetgen/ (accessed on May 22, 2014)
¹¹ http://www.stalsoft.com/grome (accessed on May 22, 2014)
¹² http://wiki.goodrelations-vocabulary.org/Quickstart (accessed on August 19, 2014)
¹³ http://wiki.goodrelations-vocabulary.org/Datasets (accessed on July 31, 2014)
¹⁴ On our machine, we spawned 32 processes, which is twice the number of central processing unit (CPU) cores.

Algorithm 1: Crawling algorithm
Input: List of seed URIs (S)

 1 for uri ∈ S do
       // parallelized execution
 2     extract the domain part from uri
 3     check_robots(domain)
 4     if found sitemap URIs (M) in robots.txt then
 5         for sitemap_uri ∈ M do
 6             read_sitemap(sitemap_uri, domain)
 7         end
 8     else
 9         if crawling of domain is allowed then
10             crawl(uri, domain)
11         end
12     end
13 end

The general steps of the crawling process are as follows:

1. Load and compile the list of seed URIs from text files stored in a local folder and filter out duplicate URIs.

2. Allocate a pool of worker processes.

3. Start a parallel crawl and continually assign one seed URI at a time to an idle worker process until finished.
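The three steps above can be sketched with the multiprocessing module of the Python standard library. This is not the original implementation, only a minimal sketch under stated assumptions: the seed files are assumed to be *.txt files with one URI per line, process_seed is a hypothetical placeholder for the per-seed work of Algorithm 1, and the default of 32 processes mirrors the setup reported for our machine.

```python
from multiprocessing import Pool
from pathlib import Path


def load_seeds(folder):
    """Step 1: read seed URIs from *.txt files and drop duplicates (order kept)."""
    seeds = {}
    for path in sorted(Path(folder).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            uri = line.strip()
            if uri:
                seeds.setdefault(uri, None)  # dict keys preserve insertion order
    return list(seeds)


def process_seed(uri):
    """Hypothetical placeholder for the per-seed work (lines 2-13 of Algorithm 1)."""
    return uri


def run_crawl(seeds, processes=32):
    """Steps 2 and 3: allocate a worker pool and hand out one seed URI at a time."""
    with Pool(processes=processes) as pool:
        # chunksize=1 gives a single seed URI to whichever worker becomes idle;
        # the remaining seeds wait in the pool's internal queue.
        for finished in pool.imap_unordered(process_seed, seeds, chunksize=1):
            print("finished", finished)
```

A crawl would then be started with run_crawl(load_seeds("seeds")), where "seeds" is an assumed folder name. On platforms that spawn rather than fork worker processes, this call belongs under an `if __name__ == "__main__":` guard.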

The flowchart in Figure 3.5 depicts the crawling algorithm in detail. Algorithm 1 formalizes the main program flow. The algorithm starts with a list of seed URIs, each of which is assigned to a free process from the process pool. If there is no idle process available (e.g. because there are more seed URIs than processes), the URI stays in the pipeline (implemented as a list of URIs) until it becomes eligible for processing. Lines 2–13 describe the code section that every process needs to run through before being ready to accept a new task from the seed list. The details of the crawling algorithm are as follows:

1. Try to fetch robots.txt from the root directory [cf. Kos07; Kos95]. If successful, adhere to the politeness policy and check if Sitemap: directives with references to sitemap files exist [cf. Sit08]. Otherwise, try to locate sitemap.xml in the root directory of the Web site.

2. From here on, the algorithm branches into one of two possible paths, based on whether a sitemap file was detected or not, as outlined in Figure 3.5.
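The robots.txt handling of step 1 can be approximated with urllib.robotparser from the Python standard library. The sketch below is an illustration rather than the original code: inspect_robots and the user agent name are hypothetical, the robots.txt content is passed in as lines so that fetching stays separate, and the fallback to sitemap.xml is assumed to apply whenever no Sitemap: directive is found. The fallback URI is only a candidate whose existence still has to be checked.

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent name


def inspect_robots(seed_uri, robots_lines):
    """Evaluate robots.txt lines for a seed URI.

    Returns whether the seed may be crawled, the sitemap URIs to try,
    and the crawl delay (None if robots.txt does not specify one).
    """
    parts = urlsplit(seed_uri)
    root = f"{parts.scheme}://{parts.netloc}/"
    parser = RobotFileParser(urljoin(root, "robots.txt"))
    parser.parse(robots_lines)
    # site_maps() returns None when robots.txt has no Sitemap: directive.
    sitemaps = parser.site_maps() or [urljoin(root, "sitemap.xml")]
    return {
        "allowed": parser.can_fetch(USER_AGENT, seed_uri),
        "sitemaps": sitemaps,
        "crawl_delay": parser.crawl_delay(USER_AGENT),
    }


robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
Sitemap: http://www.example.com/sitemap_index.xml
"""
info = inspect_robots("http://www.example.com/shop/item1.html", robots_txt.splitlines())
print(info)
```

The returned sitemap list decides which of the two branches of step 2 is taken, and the crawl delay feeds the politeness rule between successive requests to the same domain.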

3.3.2.1 Sitemap-based Crawling

When crawling a sitemap, a process extracts metadata from the list of URIs provided by the sitemap file (see Function read_sitemap). Besides sitemaps¹⁵, the function also handles sitemap index files. Sitemap index files are collections of sitemaps [Sit08]. A heuristic distinction between sitemaps and sitemap index files can be made based on the Extensible Markup Language (XML) root element names (e.g. sitemapindex versus urlset). Once the algorithm reaches a sitemap, the function stops the recursion and starts to extract metadata iteratively from the linked Web page URIs. Although not shown in the code snippet, read_sitemap is able to read gzip-compressed sitemaps, defines an upper threshold to prevent visiting too many pages with no metadata, and adheres to crawl delays. As an aside, our approach is currently limited to discovering only those pages explicitly listed in a sitemap. Hence, a partial sitemap pointing at category pages or a subset of the available product item pages yields an incomplete crawl.

Function read_sitemap
Input: sitemap_uri, domain

    if crawling of domain is allowed then
        read sitemap_uri contents and store found URIs in U
        if is sitemap index file then
            for uri ∈ U do
                // recursion
                read_sitemap(uri, domain)
            end
        else
            for uri ∈ U do
                extract_metadata(uri, domain)
            end
        end
    end
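The control flow of read_sitemap can be illustrated with a minimal Python sketch. It is not the crawler's actual implementation: the callables fetch and extract_metadata are hypothetical placeholders supplied by the caller, and the robots.txt check, gzip handling, page threshold, and crawl delays mentioned above are left out. The distinction between sitemap index files and sitemaps relies on the root element name, as described in the text.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Reduce a namespaced tag such as '{http://...}urlset' to 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (is_index, uris) for the content of a sitemap or sitemap index file."""
    root = ET.fromstring(xml_text)
    is_index = local_name(root.tag).lower() == "sitemapindex"
    uris = []
    for entry in root:          # <url> or <sitemap> entries
        for child in entry:     # only direct <loc> children are considered
            if local_name(child.tag) == "loc" and child.text:
                uris.append(child.text.strip())
    return is_index, uris


def read_sitemap(sitemap_uri, fetch, extract_metadata):
    """Resolve sitemap index files recursively and pass page URIs on.

    fetch(uri) must return the XML text of the file; extract_metadata(uri)
    is called for every page URI. Both are assumptions of this sketch.
    """
    is_index, uris = parse_sitemap(fetch(sitemap_uri))
    for uri in uris:
        if is_index:
            read_sitemap(uri, fetch, extract_metadata)  # recursion
        else:
            extract_metadata(uri)
```

Passing fetch as a parameter keeps the sketch independent of any particular HTTP client and makes it easy to try out with in-memory documents.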

¹⁵ http://www.sitemaps.org/ (accessed on August 21, 2014)


[Figure 3.5: flowchart not reproduced. It shows the loop over the list of seed URIs (S), the lookup of robots.txt for the domain d of each seed URI s, and two branches: the sitemap-based crawling part, which recursively resolves sitemap index files from the list of sitemap URIs (M) into lists of page URIs (U), and the spider-based crawling part, which recursively follows anchor links within d as long as the crawling depth is below MAX. Both branches check whether a URI is allowed as per robots.txt, wait for a time t as per the robots.txt politeness rule, extract metadata, and write N-Triples files. Parallelized paths are drawn in blue.]

Figure 3.5: Flowchart of the crawling algorithm


3.3.2.2 Spider-based Crawling

Our spider-based crawling approach, as indicated in Function crawl, performs a depth-first crawl. It starts by extracting metadata from a Web page and collects all outgoing links. The function then enters a recursion and crawls all discovered URIs unless a configurable maximum crawl depth has been reached. Outgoing links that have been visited before or that belong to a domain other than the one the current process is in charge of are skipped.

Function crawl
Input: uri, domain, [depth=0], MAX_CRAWL_DEPTH

    content ← extract_metadata(uri, domain)
    if content then
        read content, look for anchor links, and store found URIs in U
        filter outgoing links (U) pertaining to the same domain as domain
        filter outgoing links (U) that have not been visited yet
        if depth < MAX_CRAWL_DEPTH then
            for uri ∈ U do
                // recursion
                crawl(uri, domain, depth+1)
            end
        end
    end
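The following Python sketch mirrors the depth-first recursion of Function crawl under a few stated assumptions: fetch is a hypothetical callable that stands in for extract_metadata and returns the page content, the visited set is shared by the caller, and the depth limit of three hops is the value reported in Section 3.4.1. Unlike the pseudocode, the sketch checks the visited set at iteration time, so that pages reached through a sibling branch are not crawled twice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

MAX_CRAWL_DEPTH = 3  # three hops, as used for the evaluation crawl


class AnchorCollector(HTMLParser):
    """Collect the href values of all <a> elements of a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl(uri, domain, fetch, visited, depth=0):
    """Depth-first crawl restricted to one domain and a maximum depth."""
    visited.add(uri)
    content = fetch(uri)  # placeholder for extract_metadata(uri, domain)
    if not content:
        return
    collector = AnchorCollector()
    collector.feed(content)
    links = [urljoin(uri, href) for href in collector.hrefs]
    if depth < MAX_CRAWL_DEPTH:
        for link in links:
            same_domain = urlparse(link).netloc == domain
            if same_domain and link not in visited:
                crawl(link, domain, fetch, visited, depth + 1)  # recursion
```

Crawl delays and robots.txt checks are deliberately omitted here; they would wrap the call to fetch.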

3.3.3 Extraction of Structured Data

Finally, the function extract_metadata, invoked by both the read_sitemap and crawl functions, harvests metadata content and serializes the extracted triples to N-Triples files. The detection of structured markup within an HTML page is implemented with regular expressions. The regular expressions for matching RDFa and Microdata are as follows:

RDFa:

import re

# matches characteristic strings such as
# - xmlns:gr="http://purl.org/goodrelations/v1#" or
# - prefix="gr: http://purl.org/goodrelations/v1#"
regex_rdfa = re.compile(
    r"(xmlns:([a-z0-9]+)\s*=\s*[\"']{0,1}http://purl\.org/goodrelations/v1#[\"']{0,1}"
    r"|prefix\s*=\s*[\"']{0,1}[^\"']*([a-z0-9]+)\s*:\s*http://purl\.org/goodrelations/v1#[^\"']*[\"']{0,1})",
    re.IGNORECASE|re.DOTALL)


Microdata:

# matches characteristic strings such as
# - itemtype="http://purl.org/goodrelations/v1#SomeItems" or
# - itemprop="http://purl.org/goodrelations/v1#name"
regex_md = re.compile(
    r"item[a-z]+\s*=\s*[\"']{0,1}http://purl\.org/goodrelations/v1#[a-z0-9\-\_]+",
    re.IGNORECASE|re.DOTALL)
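To illustrate how the two patterns behave, the sketch below applies them to a few short HTML fragments. The helper detect_markup and the sample fragments are illustrative assumptions and not part of the crawler; the patterns themselves are the ones given above.

```python
import re

regex_rdfa = re.compile(
    r"(xmlns:([a-z0-9]+)\s*=\s*[\"']{0,1}http://purl\.org/goodrelations/v1#[\"']{0,1}"
    r"|prefix\s*=\s*[\"']{0,1}[^\"']*([a-z0-9]+)\s*:\s*http://purl\.org/goodrelations/v1#[^\"']*[\"']{0,1})",
    re.IGNORECASE | re.DOTALL)

regex_md = re.compile(
    r"item[a-z]+\s*=\s*[\"']{0,1}http://purl\.org/goodrelations/v1#[a-z0-9\-\_]+",
    re.IGNORECASE | re.DOTALL)


def detect_markup(html):
    """Return the syntaxes whose GoodRelations fingerprint occurs in html."""
    found = set()
    if regex_rdfa.search(html):
        found.add("rdfa")
    if regex_md.search(html):
        found.add("microdata")
    return found


samples = [
    '<div xmlns:gr="http://purl.org/goodrelations/v1#">',
    '<div prefix="gr: http://purl.org/goodrelations/v1#">',
    '<div itemscope itemtype="http://purl.org/goodrelations/v1#SomeItems">',
    '<div itemscope itemtype="http://schema.org/Product">',  # no GoodRelations
]
for sample in samples:
    print(sorted(detect_markup(sample)), sample)
```

Because both patterns look for the GoodRelations namespace URI, pages that only use other vocabularies, such as the last sample, are not flagged.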

The Python RDFLib library then takes care of parsing the content and loads the extracted triples into an in-memory graph, which is later serialized to N-Triples. The algorithm is able to decompress (or, more specifically, gunzip) pages ending in “.gz” or carrying the header “Content-Encoding: gzip” [cf. FR14d, Section 3.1.2.2].
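The decompression rule can be sketched in a few lines of Python. The function name maybe_gunzip and the plain dictionary of response headers are assumptions made for this illustration; a real HTTP client may expose headers differently or decompress transparently.

```python
import gzip


def maybe_gunzip(uri, headers, body):
    """Return the decompressed body if the response appears to be gzip-compressed.

    uri     -- the requested URI
    headers -- response headers as a dict (assumed to use canonical names)
    body    -- the raw response body as bytes
    """
    encoding = headers.get("Content-Encoding", "").lower()
    if uri.endswith(".gz") or encoding == "gzip":
        return gzip.decompress(body)
    return body
```

For example, a compressed sitemap served as sitemap.xml.gz is unpacked because of its suffix, whereas a regular HTML page without the header is returned unchanged.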

3.3.3.1 Identification

The crawler identifies itself in the Hypertext Transfer Protocol (HTTP) request header with the user agent field [cf. FR14d, Section 5.5.3]:

python-grcrawler/<version> (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)

This way, every Web site owner is able to contact us at any time with questions regarding our crawler.
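As a minimal sketch, a request carrying this identification could be built with Python's standard library as follows. The function build_request is a hypothetical helper, and the version number is a parameter because the concrete version string is not given here.

```python
import urllib.request

CRAWLER_INFO_URL = "http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler"


def build_request(uri, version):
    """Create a request object that carries the crawler's User-Agent string."""
    user_agent = "python-grcrawler/%s (%s)" % (version, CRAWLER_INFO_URL)
    return urllib.request.Request(uri, headers={"User-Agent": user_agent})


# "0.0" is a made-up version number used only for this illustration
request = build_request("http://www.example.org/", "0.0")
print(request.get_header("User-agent"))
```

The request object can then be passed to urllib.request.urlopen once the politeness checks described in the next section have succeeded.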

3.3.3.2 Politeness

A Web site owner can control the interaction behavior of our focused crawler with a robots.txt file [cf. Kos07]. Our crawler obeys all important robots.txt directives. That means it abandons sites if it is not allowed to crawl them, skips directories that are explicitly excluded for proprietary crawlers, and avoids overloading servers by respecting the indicated crawl delay. For instance, a crawl delay of one second, and thus a maximum of 60 requests per minute, can be requested as follows:

User-agent: python-grcrawler

Disallow:

Crawl-delay: 1


For those Web sites that lack a robots.txt file, we use our own politeness policy, i.e. we set the crawl delay to a default value of five seconds. The architecture of the crawler further ensures that no site is hit by more than one process simultaneously, as this would violate the policy constraints.
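The politeness rules above can be approximated with the robots.txt parser from Python's standard library. This is a sketch under assumptions rather than the crawler's code: the helper names are hypothetical, and applying the five-second default also to sites whose robots.txt lacks a Crawl-delay directive is a choice made for this example.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "python-grcrawler"
DEFAULT_CRAWL_DELAY = 5.0  # seconds, used when no crawl delay is known


def load_policy(robots_txt):
    """Parse the text of a robots.txt file; None means the site has no such file."""
    if robots_txt is None:
        return None
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(policy, uri):
    """Without a robots.txt file, every URI is considered crawlable."""
    return True if policy is None else policy.can_fetch(USER_AGENT, uri)


def crawl_delay(policy):
    """Return the crawl delay in seconds, falling back to the default."""
    delay = None if policy is None else policy.crawl_delay(USER_AGENT)
    return DEFAULT_CRAWL_DELAY if delay is None else float(delay)


policy = load_policy("User-agent: python-grcrawler\nDisallow:\nCrawl-delay: 1\n")
print(is_allowed(policy, "http://www.example.org/product/1"), crawl_delay(policy))
```

Ensuring that only one process talks to a given site, as described above, is a scheduling concern that this sketch does not cover.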

3.3.3.3 Storage

The purpose of our crawler is to find and store the contents of Web pages that contain GoodRelations data. So far, we are able to extract GoodRelations content encoded as RDFa and Microdata¹⁶. At the moment, we temporarily store all structured content we could gather in N-Triples files that are later uploaded to a private SPARQL endpoint for our research.

3.4 Evaluation and Analysis

In the following, we report statistics on the dataset obtained in late 2011/early 2012 by conducting a focused Web crawl relying on the aforementioned crawling strategy.

3.4.1 Method

We applied our focused Web crawl to a variety of the data sources mentioned in Section 3.3.1, namely (1) entries of a central URI submission service¹⁷, where shop extensions and shop operators could register their shop URIs, and (2) selected data sources from a manually maintained collection of datasets¹⁸. We spidered Web sites up to a maximum crawl depth of three hops. Some of the larger datasets obtained this way were not crawled entirely. For example, we could not conduct an exhaustive crawl of a large online retailer like wayfair.com and thus stopped the crawling at some point. Similarly, huge datasets like sears.com or bestbuy.com were not considered due to resource limitations¹⁹.

We stored the crawl data in an RDF store that exposes a SPARQL Protocol and RDF Query Language (SPARQL) endpoint. More precisely, we loaded it into a Virtuoso Open Source instance, which has some important characteristics that affected our choice. It provides the scalability needed to load large amounts of structured data into a database. It is fault-tolerant with respect to imperfect data, where other RDF stores would abort the loading process with an error message. And finally, it supports executing SPARQL queries to conveniently access the selected data that we need for our analysis.

¹⁶ In fact, we have added support for schema.org in Microdata and RDFa in the meantime, which, however, we did not yet support at the time of the crawl.
¹⁷ http://gr-notify.appspot.com/ (accessed on May 22, 2014)
¹⁸ http://wiki.goodrelations-vocabulary.org/Datasets (accessed on July 31, 2014)
¹⁹ Crawling all ∼15 million pages from sears.com would take roughly 174 days with a crawl delay of one second.
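The estimate of roughly 174 days for sears.com and the figure of 60 requests per minute for a one-second crawl delay follow from simple arithmetic, sketched below. The function names are illustrative, and the calculation is a lower bound because it ignores the time needed to download and process each page.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_minute(crawl_delay_s):
    """Maximum number of requests per minute for a given crawl delay."""
    return 60 / crawl_delay_s


def crawl_duration_days(num_pages, crawl_delay_s):
    """Lower bound on the crawl time: one request per crawl delay interval."""
    return num_pages * crawl_delay_s / SECONDS_PER_DAY


print(requests_per_minute(1))                 # 60 requests per minute
print(crawl_duration_days(15_000_000, 1))     # about 173.6 days
```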

For the statistics, we took advantage of the excellent Pandas²⁰ library for data analysis with the Python programming language [McK12], together with the IPython Notebook interactive development environment [cf. PG07], which encourages literate programming [cf. Knu84].

3.4.2 Results

Overall, our focused crawl yielded 20 GB of raw structured e-commerce data (i.e. files in N-Triples syntax) with a total of 188,765,183 triples.

3.4.2.1 Shop Statistics

Our dataset reflects information from 2,628 shops that expose structured e-commerce data. Among them, 2,314 shops provide product offers, contributing to a total of 3,197,130 offerings (see Table 3.2 in Section 3.4.2.2 for the full statistics). Consequently, a shop in our dataset comprises 1,382 offers on average.

The distribution of the number of items offered by the Web shops follows a power law, i.e. a small number of Web shops account for a large quantity of products, whereas the majority of the shops offer only very few items. For better readability, we used a logarithmically scaled instead of a linearly scaled y-axis in Figure 3.6. Accordingly, about 500 shops offer 1,000 or more products, and over 800 shops offer fewer than 100 products. The first quartile (25%) is at 37, the median (50%) at 145, and the third quartile (75%) at 530 product offers, as shown in Figure 3.7. The vertical lines at the edges in Figure 3.7 represent the whiskers, which are defined as 1.5 times the interquartile range (1.5 × (75% − 25%)). The maximum number of product offers gathered for a shop was 174,487 (not visible in Figure 3.7 due to outlier correction), and the minimum was one item.
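For the reported quartiles, the whisker limits can be worked out as in the short sketch below. It assumes the common boxplot convention in which whiskers reach at most 1.5 times the interquartile range beyond the first and third quartile; whether the plotting library applied exactly this convention is an assumption.

```python
def whisker_fences(q1, q3, factor=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1  # interquartile range
    return q1 - factor * iqr, q3 + factor * iqr


# quartiles of the number of product offers per shop, as reported above
lower, upper = whisker_fences(37, 530)
print(lower, upper)
```

Under this convention, the shop with 174,487 offers lies far beyond the upper limit, which is consistent with it being hidden as an outlier, while the lower limit is negative and therefore capped in practice by the observed minimum of one item.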

Figure 3.8 shows the ten shops with the most product offers in our crawl dataset. We assigned to each shop a distinct Uniform Resource Name (URN) identifier that we generated from the domain name of the Web shop, e.g. www.example.org resulted in urn:www.example.org.

²⁰ http://pandas.pydata.org/ (accessed on July 30, 2014)

Figure 3.6: Distribution of items per shop (log-scaled y-axis)

Figure 3.7: Boxplot of the distribution of items per shop

3.4.2.2 Property Statistics

To examine the nature of the properties used to describe products, product models, and product offers, we first measured the average number of properties across the entire dataset. This gave us the results outlined in the last column of Table 3.2.

Table 3.2: Instance count and average number of properties in crawl dataset

| Type | GoodRelations Concept | No. of Instances | Avg. Properties |
|---|---|---|---|
| Offers | gr:Offering | 3,197,130 | 13.43 |
| Flat offers | gr:Offering w/o gr:includes/gr:includesObject | 421,125 | 11.11 |
| Products | gr:ProductOrService (incl. subclasses) | 2,772,951 | 7.14 |
| Product models | gr:ProductOrServiceModel | 82,173 | 11.93 |


Figure 3.8: Ten most represented shops by offer count

Table 3.2 covers all classes of GoodRelations that allow for supplying product features, i.e. product offers, instances, and models. Not very surprisingly, we found that only very few product features are specified beyond those typically sufficient to satisfy search engines.

Offers. The bar chart in Figure 3.9 displays the most frequent properties (the upper 90%, i.e. 20 out of 55) used for product offers in the dataset. Most offers include general properties from the GoodRelations namespace, as indicated by the orange-colored bars. Image and page links from external vocabularies are represented by red bars. The single black bar for gr:hasBrand denotes the erroneous usage of this property with a product offer, which is not allowed in GoodRelations. The most frequent property, almost always present with offers, is gr:hasPriceSpecification.

Flat Offers. As illustrated in Figure 3.10, the statistics look somewhat different for flat offers than for offers with products attached. While the most frequent properties remain the same, noticeably more external vocabularies are involved, e.g. properties from the Open Graph Protocol (OGP) and review information. Also, many flat offers seem to rely on the outdated modeling pattern for GoodRelations that uses the rdfs:label and rdfs:comment annotation properties rather than the newer properties gr:name and gr:description. In total, the 90% most frequent properties for flat offers comprise 24 out of 46 properties. Again, some properties are used incorrectly. Instead of being attached to the offer directly, gr:hasCurrency should be attached to the price specification referenced via gr:hasPriceSpecification. Furthermore, the GoodRelations class gr:BusinessEntity is mistakenly used as a property.


Figure 3.9: Frequency of offer properties in crawl (upper 90% – 20 out of 55)

Figure 3.10: Frequency of flat offer properties in crawl (upper 90% – 24 out of 46)


Products. Contrary to our expectations, there is less variety of properties for product instances than for product offers in the crawl. Out of the 43 available product properties, only 11 account for the 90% most used ones, which are shown in Figure 3.11. Apart from generic product properties from GoodRelations, the most frequent properties are foaf:depiction, foaf:page, and yahoo:image. But even those properties are not specific to products.

Figure 3.11: Frequency of product properties in crawl (upper 90% – 11 out of 43)

Product Models. The situation is even more surprising for product models than for product instances. In the entire crawl dataset, only 24 different properties appear along with product models. Out of these 24 properties, 17 account for the 90% most used ones (see Figure 3.12). Among these, most are from external vocabularies, i.e. Friend of a Friend (FOAF) [BM14], RDF Schema (RDFS), the RDF Review vocabulary, and the Yahoo! SearchMonkey vocabulary. Nonetheless, the properties used are again not very specific.

3.4.3 Comparison with Web Data Commons

We evaluate our approach by comparing our findings to the results from the WDC initiative. In particular, we are interested in whether our focused GoodRelations crawl dataset (henceforth GRC) is more complete and covers a larger amount of useful product data than the latest Common Crawl, which is without a doubt several times bigger than our crawl.


Figure 3.12: Frequency of product model properties in crawl (upper 90% – 17 out of 24)

3.4.3.1 Quantitative Comparison of Entities in WDC and GRC

In [MPB14], Meusel, Petrovski, and Bizer present a table with the 30 most frequent RDFa classes in WDC. Some GoodRelations classes are listed there, which we use for a comparison with our crawl (GRC), as detailed in Table 3.3. Although the number of domains featuring GoodRelations data is similar in both datasets, we found many more entities in GRC than in WDC. The considerable differences in the entity/domain ratios²¹ of the two datasets underline this impression, as depicted in Figure 3.13.

Table 3.3: Comparison of entity frequency in WDC and in GRC

| GoodRelations Class | WDC 2013 Domains | WDC 2013 Entities | WDC 2013 E/D-Ratio | GRC Domains | GRC Entities | GRC E/D-Ratio |
|---|---|---|---|---|---|---|
| gr:Offering | 2,199 | 498,333 | 226.62 | 2,314 | 3,197,130 | 1,381.65 |
| gr:BusinessEntity | 2,155 | 394,556 | 183.09 | 1,031 | 425,769 | 412.97 |
| gr:UnitPriceSpecification | 1,681 | 429,409 | 255.45 | 2,271 | 3,557,590 | 1,566.53 |
| gr:SomeItems | 1,429 | 235,785 | 165.00 | 2,137 | 2,558,844 | 1,197.40 |
| gr:TypeAndQuantityNode | 1,221 | 187,865 | 153.86 | 455 | 637,409 | 1,400.90 |
| gr:QuantitativeValue | 1,032 | 192,560 | 186.59 | 1,259 | 1,553,239 | 1,233.71 |
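The E/D-ratios in Table 3.3 are the number of entities divided by the number of domains. The following sketch recomputes them from the table; the data structure and function name are chosen for illustration only.

```python
# (domains, entities) per GoodRelations class, copied from Table 3.3
TABLE_3_3 = {
    "gr:Offering": {"WDC": (2199, 498333), "GRC": (2314, 3197130)},
    "gr:BusinessEntity": {"WDC": (2155, 394556), "GRC": (1031, 425769)},
    "gr:UnitPriceSpecification": {"WDC": (1681, 429409), "GRC": (2271, 3557590)},
    "gr:SomeItems": {"WDC": (1429, 235785), "GRC": (2137, 2558844)},
    "gr:TypeAndQuantityNode": {"WDC": (1221, 187865), "GRC": (455, 637409)},
    "gr:QuantitativeValue": {"WDC": (1032, 192560), "GRC": (1259, 1553239)},
}


def ed_ratio(domains, entities):
    """Average number of entities per domain (Web site)."""
    return entities / domains


for gr_class, datasets in TABLE_3_3.items():
    wdc = ed_ratio(*datasets["WDC"])
    grc = ed_ratio(*datasets["GRC"])
    print("%-28s WDC %9.2f   GRC %9.2f" % (gr_class, wdc, grc))
```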

3.4.3.2 Quantitative Comparison of Structured Data in Web Shops

In addition to the quantitative comparison of the total numbers of entities in the datasets, we were interested in whether there are differences in the amount of structured data of individual Web shops. For this purpose, we required access to the raw data from WDC (in our case, the RDFa datasets were sufficient, because at the time of our crawl GoodRelations was used almost exclusively with RDFa). WDC publishes the datasets as a list of gzip-archived N-Quads files. In total, the uncompressed files for RDFa amount to 695 GB, each of them with a file size of about one gigabyte.

²¹ The E/D-ratio denotes the average number of entities per Web site.

Figure 3.13: Comparison of the E/D-ratios for WDC and GRC

Our idea was to match graph names from GRC to graph names from WDC. By a simple query (prefix declaration omitted) like

SELECT DISTINCT ?g WHERE { GRAPH ?g {?s a gr:Offering} }

against a SPARQL endpoint with the GRC dataset, we could obtain a list of all 2,314 named graphs [Car+05] in the form of URNs that expose offers in our dataset. We saved them as a list of graph names (without the urn: prefix) into a text file. Afterwards, we ran a grep command against WDC, counting the number of times a domain name appears in the dataset. Since N-Quads is a line-based syntax, the obtained result denotes the number of triples (or quads) in WDC.

More precisely, we executed the following command to extract the graph name part from the gzipped N-Quads files (i.e. every fourth term/column in a line) and to write it into a text file. The second command lists the first of the text files created this way.

$ ls *.nq.gz | sed 's/.nq.gz//' | parallel -j64 \
> "zcat {}.nq.gz | awk '{if(\$(NF)){print \$(NF-1)}}' >> {}.wdc_graphs.txt"

$ ls *.wdc_graphs.txt | head -n 1

ccrdf.html-rdfa.0.wdc_graphs.txt


We used the GNU parallel command line tool to distribute the task over several processes. It is run with the option “-j64” to spawn 64 processes, i.e. four processes per CPU core on our machine with 16 cores.

The next command shows how we used grep to count the number of lines from the previously created graph name files. Without going into detail, it uses the graph name file from GRC (grc_graph_uris.txt) as input to a parallelized grep. Thus, every process is assigned a graph name from GRC to search for within the collection of graph name files from WDC. The processed graph name together with the number of quads found in WDC are stored in a text file (num_triples.txt), of which the first two lines are displayed below.

$ cat grc_graph_uris.txt | parallel -j64 \

> ’grep {} *.wdc_graphs.txt | (var=$(wc -l); if [ "$var" -ne "0" ]; then echo {} $var; fi)’

> >> num_triples.txt

$ head -n 2 num_triples.txt

sanita.ro 29

shop.goodrelations.pl 11
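For readers who prefer a single script over the shell pipeline, the same counting idea can be sketched in Python. This is not the procedure that was actually run: the function names are hypothetical, the matching is a literal substring test (whereas grep interprets the dot in a domain name as a wildcard), and, like the awk call above, the sketch simply takes the second-to-last whitespace-separated term of each line as the graph name.

```python
import gzip
from collections import Counter


def read_nquads_gz(path):
    """Yield the lines of a gzip-compressed N-Quads file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def graph_names(nquads_lines):
    """Yield the graph term of each N-Quads line (the term before the final '.')."""
    for line in nquads_lines:
        terms = line.split()
        if len(terms) >= 5 and terms[-1] == ".":
            yield terms[-2]


def count_quads(graph_terms, domains):
    """Count, per domain, the graph terms that contain the domain name."""
    counts = Counter()
    for term in graph_terms:
        for domain in domains:
            if domain in term:
                counts[domain] += 1
    return counts
```

A call such as count_quads(graph_names(read_nquads_gz(path)), domains) would then yield the per-domain counts for one WDC file; summing the counters over all files gives totals comparable to num_triples.txt.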

There is a remarkable overlap of 214 shops between the two datasets. After some cleansing steps that involved consolidating www. and non-www. variants and eliminating graph names without a dot (e.g. “goodrelations” and “localhost”), we retained 205 distinct shops that appear in both datasets. In the following, we refer to these reduced datasets as WDC205 and GRC205. Table 3.4 presents the ten domains with the largest number of triples in GRC and compares them with the number of triples in WDC. One shop in this list (www.eastwood.com) is represented with more triples in WDC than in GRC. We explain this divergence by the crawls having been conducted at different points in time, with possible interim changes, and by the fact that we occasionally had to abandon the crawling of a Web site prematurely, e.g. because of interruptions like an imminent server restart, downtime, or similar. It could also be that the internal link structure of this Web site has more hops than the threshold of three hops used for our spider-based crawling approach. Yet another explanation is that WDC may sometimes spot more URIs via external links that our crawler cannot find because they are neither in the sitemap nor reachable via internal links from the very same domain, or that some sites have duplicate content under multiple URIs. Nevertheless, out of the 205 shops, we could only find ten shops where our dataset exhibited fewer triples than WDC (see Table 3.5 for a complete list).

Intuitively, following our observations from above, there are on average more triples for every shop in our GRC than in the WDC dataset. To verify this assumption, we stated the following null hypothesis:


Table 3.4: Comparison of the amount of RDF triples in shops for WDC and GRC, sorted in descending order by the number of triples in GRC

| Domain | Triples in WDC | Triples in GRC |
|---|---|---|
| www.shopforia.com | 226 | 5,469,335 |
| www.speichermarkt.de | 702 | 4,999,534 |
| www.macmall.com | 570,200 | 4,305,817 |
| www.mts-shop.eu | 42,226 | 3,984,312 |
| www.wayfair.com | 116,623 | 1,817,762 |
| www.parfumerie.nl | 63 | 1,703,252 |
| www.eastwood.com | 3,329,977 | 1,136,916 |
| www.takatomo.de | 23,706 | 1,047,689 |
| www.radonline.de | 865 | 883,364 |
| www.puissance-moteur.fr | 71 | 695,354 |
| ... | | |

Total: 205 sites in the sample

Table 3.5: Comparison of the amount of RDF triples in shops for WDC and GRC, filtered by domains for which WDC contains more triples than GRC

| Domain | Triples in WDC | Triples in GRC |
|---|---|---|
| www.eastwood.com | 3,329,977 | 1,136,916 |
| www.threadless.com | 1,828,048 | 71,341 |
| www.demartina.com | 14,355 | 12,854 |
| www.simplyglobo.com | 5,845 | 3,820 |
| www.pauladeenstore.com | 35,308 | 2,900 |
| www.foodnetworkstore.com | 439,303 | 2,276 |
| www.antfuel.com | 24,708 | 316 |
| www.heppnetz.de | 314 | 310 |
| franz.com | 1,587 | 113 |
| www.christian-junghanns.de | 116 | 83 |

Null hypothesis. The amount of data for every single Web shop in GRC is not larger than in the WDC dataset.

As the matching samples in the two datasets (WDC205 and GRC205) are not normally distributed²² and the domains appearing in the datasets are the same (and thus comparable to repeated measurements), we carried out a paired samples test on the number of triples, more precisely a Wilcoxon signed-rank test [Wil45]. As a result, the number of triples in GRC205 (median = 17,520) was significantly higher than in WDC205 (median = 98), t(205) = 970, p = 1.76 × 10⁻²⁹, r = −0.79. We thus conclude that our crawl collected on average significantly more structured product data per Web shop than the Common Crawl.

²² A Shapiro-Wilk test [SW65] with a null hypothesis of a normally distributed sample gave us the following p-values: WDC205: p = 4.58 × 10⁻³⁰; GRC205: p = 4.55 × 10⁻²⁸.
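To make the mechanics of the signed-rank test more tangible, the sketch below computes the two rank sums for the ten shops listed in Table 3.4 only. It is a didactic illustration in plain Python and does not reproduce the reported statistic, which is based on all 205 matched shops; the function name and the handling of ties (average ranks) and zero differences (dropped) are assumptions of this sketch.

```python
# (triples in WDC, triples in GRC) for the ten shops of Table 3.4
TABLE_3_4 = [
    (226, 5469335), (702, 4999534), (570200, 4305817), (42226, 3984312),
    (116623, 1817762), (63, 1703252), (3329977, 1136916), (23706, 1047689),
    (865, 883364), (71, 695354),
]


def signed_rank_sums(pairs):
    """Return (W_plus, W_minus) for paired observations (x, y), using d = y - x.

    Zero differences are dropped; tied absolute differences share their
    average rank.
    """
    diffs = sorted((y - x for x, y in pairs if y != x), key=abs)
    w_plus = w_minus = 0.0
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for d in diffs[i:j]:
            if d > 0:
                w_plus += avg_rank
            else:
                w_minus += avg_rank
        i = j
    return w_plus, w_minus


w_plus, w_minus = signed_rank_sums(TABLE_3_4)
print(w_plus, w_minus)
```

For these ten rows, only www.eastwood.com contributes to the negative rank sum, so the positive rank sum (49) dominates the negative one (6). The test statistic is the smaller of the two sums, and its significance is then assessed against the distribution expected under the null hypothesis.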


3.5 Conclusion

In this chapter, we have provided evidence that popular Web crawlers like Common Crawl fail to detect the product detail pages that contain a majority of the product data. We have proposed an alternative, ping-based crawling strategy that focuses on the deep detail pages of e-commerce Web sites, and compared the results from our crawl of 2,628 shop sites with the WDC corpus. Our statistics of structured data in Web shops differ significantly from the WDC dataset. In addition, we have shown that structured e-commerce data on the Web lacks data granularity, which for example limits its usefulness for deep product comparison and calls for additional techniques for enriching product data, as addressed in Chapter 6 of this thesis.

In the course of our research, we identified the following potential limitations in our approach that open up opportunities for future work:

• The two compared datasets were created at disparate points in time. Our crawl was conducted in late 2011/early 2012, while the latest Common Crawl snapshot originates from November 2013. During that period, the implementation of structured data on these Web sites might have changed. Conceivable changes include the addition and removal of pages, newly added or deleted triples, data format changes (e.g. from RDFa to Microdata), or migrations to other vocabularies (e.g. from GoodRelations to schema.org).

• Even though the coverage between the two compared datasets was already considerable (∼10%), we note that a crawl over exactly the shop URIs from the WDC dataset (i.e. 100% coverage) would allow an even better evaluation of our work.

• In this work, we by and large focused on GoodRelations and RDFa. Repeating the same experiments while taking schema.org and Microdata into account would be a worthwhile exercise that might lead to additional insights.

• A more comprehensive analysis would be possible with only a few additions to the crawler. These improvements include generating and storing further metadata (e.g. a timestamp of the extraction, or details about the detected data format and vocabulary), making it possible to resume interrupted crawls, utilizing sitemaps (e.g. the lastmod element) to enhance the crawling plan, and specifying a SPARQL endpoint location to which the crawled data is instantly uploaded by means of SPARQL Update [GPP13] queries.

4 Product Model Master Data



4.2.1 BMEcat
4.2.2 Other Approaches
4.3 Product Model Master Data for the Semantic Web
4.3.1 Aligning BMEcat with GoodRelations
4.3.1.1 Product Details
4.3.1.2 Product Features
4.3.1.3 Catalog Group Systems
4.3.1.4 Product and Catalog Group Map
4.3.2 Selected Modeling Problems
4.3.2.1 Datatype versus Object Properties
4.3.2.2 Float Value Ranges in Datatype Properties
4.3.2.3 Units of Measurement
4.3.3 Scalability
4.4 Evaluation
4.4.1 Coverage of Use Cases
4.4.2 Missing Product Features on the E-Commerce Web of Data
4.4.3 Leverage Effect of Product Master Data on the Web
4.5 Conclusion

For any business taking part in supply chain networks, it is crucial for reasons of efficiency and competitiveness to maintain a consolidated view of all its core business entities such as products, customers, or places. This data, which is shared across different applications, is generally referred to as master data [e.g. Ora11; MV05; Los09, p. 6; Whi+06a; Dre+08, p. 1]. For example, Loshin puts it as follows: “Master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections, and taxonomies” [Los09, p. 6]; and Oracle defines master data as “the business objects that are shared across more than one transactional application. This data represents the business objects around which the transactions are executed. This data also represents the key dimensions around which analytics are done. Master data creates a single version of the truth about these objects across the operational IT landscape” [Ora11]. For a literature review, see [Sil+11].

To date, the automatic exchange of product information between business partners in a value chain is typically done using business-to-business (B2B) catalog exchange standards such as Price Catalog Message (PRICAT) [UN 12], Commerce XML (cXML) [Ari16], BMEcat [SLK05a], or master data pools based on the Global Data Synchronization Network (GDSN) standard [SLÖ08]. At the same time, the Web of Data, in particular the GoodRelations [Hep08a] vocabulary, offers the necessary means to publish highly structured product data in a machine-readable format. The advantages of publishing rich product descriptions are manifold, including better integration and exchange of information between Web applications, high-quality data along the various stages of the value chain, and the opportunity to support more precise and more effective searches. In this chapter, we show that existing product catalogs can provide a huge lever for product offering descriptions on the Web. Initially, we (1) stress the importance of rich product master data for e-commerce on the Semantic Web, and then we (2) present a tool to convert BMEcat Extensible Markup Language (XML) data sources into an RDF-based data model anchored in the GoodRelations vocabulary. The benefits of our proposal are tested using product data collected from a set of ∼2,500 online retailers of varying sizes and domains, as described in the previous Chapter 3.


Online shopping has experienced significant growth during the last decade. Preliminary estimates of retail e-commerce sales in the USA show an increase of 14.7% between Q4¹ of 2013 and Q4 of 2014, while they grew to almost three times 2005 levels, totaling 7.7 percent (96 billion U.S. dollars) of the entire U.S. retail sales market [Uni14]. These recent statistics indicate a large body of different-sized online stores, ranging from major retailers like Amazon, Best Buy or Sears to small Web shops offering only tens or hundreds of products. Hence it comes as no surprise that instances of popular commodities are offered by a fairly large number of shopping sites. Many of those online shops maintain databases where they store information and data to describe their goods. Nonetheless, site owners find it difficult to obtain rich and high-quality product data for all of their items over time, especially if the specifications originate from product catalogs by different manufacturers. Large retailers might obtain this information in a semi-automated fashion via some form of catalog exchange format. However, small shop owners might have to enter products and feature data manually. This scenario produces repeated definitions of the same product features, but mostly with incomplete, inconsistent and outdated information across various online retailers. Scarce and inaccurate information about products ultimately hampers the effective matchmaking of products.

¹ Q4 = Fourth quarter

Another major source of product data for commodities are their manufacturers. These compile and maintain specifications of all of their products. Typically, their product catalogs are managed in product information management (PIM) systems that can export content to different types of media, e.g. electronic product catalogs as seen on many manufacturer sites or printed catalogs [Abr14, p. 1]. PIM systems host essential and core product data, also known as product master data [Whi07].

Table 4.1 presents a simple illustration of the situation using the example of three random products. The table compares the number of features provided by the goods’ manufacturers with the features found at a large leading online retailer’s Web site (i.e. Amazon) and at other online merchants of various sizes selected arbitrarily via the “Shopping” service of Google Germany². Unless otherwise specified, by “features” we mean structured product specifications (i.e. datasheets in tabular form published on the shop pages) without taking into account product pictures, product name, and product description. It can be seen that the richness of product data provided across the different sources varies significantly, but also that the manufacturers expose much more detailed product information than the retailers.

Table 4.1: Comparison of product features between manufacturers and retailers

| Manufacturer Product | Manufacturer Features | Retailer Product Features | Retailer | Coverage (a) |
|---|---|---|---|---|
| Samsung LED TV ES6300 | 89 | 15 | amazon.de | 28.09% |
| | | 39 | notebooksbilliger.de | |
| | | 22 | conrad-electronics.de | |
| | | 24 | voelkner.de | |
| Siemens Kettle TW86103 | 25 | 10 | amazon.de | 23.64% |
| | | 22 | redcoon.de | |
| | | 4 | quickshopping.de | |
| | | 13 | elektro-artikel-shop.de | |
| Suunto M5 Running Pack | 33 | 12 | amazon.de | 49% |
| | | 3 | sportscheck.com | |
| | | 1 | otto.de | |
| | | 15 | klepsoo.com | |
| | | 8 | tictactime.de | |

(a) “Coverage” = Ratio of average number of retailer features and manufacturer features
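The coverage measure defined in the table footnote can be written down as a short function. The sketch below applies it to the first product of Table 4.1; the function name is an illustrative choice.

```python
def coverage(manufacturer_features, retailer_features):
    """Average retailer feature count divided by the manufacturer feature count."""
    average = sum(retailer_features) / len(retailer_features)
    return average / manufacturer_features


# Samsung LED TV ES6300: 89 manufacturer features, four retailers
samsung = coverage(89, [15, 39, 22, 24])
print("%.2f%%" % (100 * samsung))
```

For this product, the four retailers expose 25 features on average, which corresponds to the 28.09% listed in the table.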

² http://www.google.de/shopping/ (accessed on May 8, 2014)


To date, product master data is typically passed along the value chain using B2B channels based on electronic data interchange (EDI) standards and catalog exchange formats such as BMEcat (a catalog format from the German Bundesverband Materialwirtschaft, Einkauf und Logistik (BME), Engl.: Federal Association of Materials Management, Purchasing and Logistics) [SLK05a]. Such standards can significantly help improve the automatic exchange of data. However, trading partners still have to negotiate and set up information channels bilaterally, which prevents them from establishing ad-hoc business relationships and raises the barriers for potential business partners that lack either the means or the budget to connect via imposed B2B standards. Similarly, end users are neglected, although they could benefit from the liberalization of enterprise data through better search and matchmaking services for products [Di +03].

One approach to tackle this issue is to publish rich product master data from PIM systems of manufacturers on the Web of Data, so that it can be electronically consumed by other merchants intending to trade these goods. Under this premise, retailers and Web shop owners could then rely on widely used product “strong identifiers” such as European Article Numbers (EANs), Universal Product Codes (UPCs), Global Trade Item Numbers (GTINs), or manufacturer part numbers (MPNs), to establish a link to this rich data straight from the manufacturers. Figure 4.1 illustrates an example of this approach, where the data from three different online merchants can be augmented with product descriptions and features as published by the manufacturer, on the basis of the corresponding product strong identifier. Each online merchant can then use this rich manufacturer information to augment and personalize their own offering of the product in question.
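The linking step can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the tooling described in this chapter: the record layout, the dictionary MANUFACTURER_MODELS, and the functions find_model and augment are hypothetical, and the values are taken from the toy example in Figure 4.1.

```python
from typing import Optional

# Hypothetical manufacturer master data, keyed by (identifier type, value).
MANUFACTURER_MODELS = {
    ("ean", "1234567890123"): {"brand": "ACME", "weight": "250g", "color": "blue"},
    ("mpn", "ACME123"): {"brand": "ACME", "weight": "250g", "color": "blue"},
}

# Order in which the strong identifiers of an offer are tried.
IDENTIFIER_KEYS = ("ean", "gtin14", "mpn")


def find_model(offer: dict) -> Optional[dict]:
    """Return the manufacturer record sharing a strong identifier with the offer."""
    for key in IDENTIFIER_KEYS:
        value = offer.get(key)
        if value and (key, value) in MANUFACTURER_MODELS:
            return MANUFACTURER_MODELS[(key, value)]
    return None


def augment(offer: dict) -> dict:
    """Merge manufacturer features into the offer; shop-supplied values win."""
    model = find_model(offer)
    if model is None:
        return dict(offer)
    return {**model, **offer}


shop_offer = {"ean": "1234567890123", "price": "99.99"}
enriched = augment(shop_offer)
```

Letting the shop-supplied values take precedence is a design choice of this sketch; it keeps offer-specific data such as the price under the control of the merchant.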

In this chapter, we outline the potential leverage of manufacturer datasheets from PIM systems for product model master data on the Web. In particular, we propose to use the XML-based BMEcat standard in order to make highly structured product feature data available on the Web of Data. We describe a conceptual mapping and the implementation of a respective software tool for automatically converting BMEcat documents into Resource Description Framework (RDF) data based on the GoodRelations vocabulary for e-commerce [Hep08a]. This is attractive, because most PIM software applications can export content to BMEcat. With our approach, a single tool can bring the wealth of data from established B2B environments to the Web of Data. Our proposal can manifest at Web scale and is suitable for every PIM system or catalog management software that can create BMEcat XML product data, which holds for about 82% of all such software systems that we are aware of, as surveyed in [Web11]. Furthermore, it can minimize the proliferation of repeated, incomplete, or outdated definitions of the same product master data across various online retailers by means of simplifying the consumption of authoritative product master data from manufacturers by online retailers of any size. As a result, the use of structured data in terms of the GoodRelations vocabulary by manufacturers and online retailers is also expected to bring additional benefits derived from being part of the Web of Data, such as search engine optimization (SEO) in the form of rich snippets [Goo16], or the possibility of better articulating the value proposition of products on the Web.

[Figure 4.1: diagram. A manufacturer Web site (http://www.acme.com) publishes a datasheet page with a high-quality picture, detailed features (weight: 250g, color: blue), and strong identifiers (EAN: 1234567890123, GTIN14: 12345678901234, MPN: ACME123, brand: ACME). Three shop sites with incomplete product features (http://www.shop1.com, http://www.shop2.de, http://www.shop3.uk) publish offer pages carrying strong identifiers and their own prices ($ 99.99, $ 102.10, $ 96.00). A search engine or browser plug-in combines an offer (price: $ 99.99) with the manufacturer's features (weight: 250g, color: blue).]

Figure 4.1: Enriching shop pages with product master data from manufacturers based on “strong identifiers” [from Hep12a]

Figure 4.2 indicates what the modeling of the approach looks like in GoodRelations. The upper part of the figure denotes the concepts that can generally be provided by Web shops, namely the description of the business entity, the offering description, and possibly a basic product description. The lower part depicts the product model master data as provided by manufacturers, which might include much more comprehensive and granular product details than those supplied by retailers. Via the GoodRelations predicate gr:hasMakeAndModel (e.g. materialized according to the implicit link of product strong identifiers), it is possible to add a link from a product to its product model and thus enrich the basic product descriptions with high-quality product model master data, e.g. from BMEcat catalogs.

To test our proposal, we converted representative real-world BMEcat catalogs of two well-known manufacturers and analyzed whether the results validate as correct RDF/XML datasets grounded in the GoodRelations ontology. Additionally, we identified examples that illustrate the problem scenario described, relying on structured data collected from approximately 2,500 online shops together with their product offerings. Our tests allowed us to confirm the immediate benefits and impact that adopting our approach can bring to both manufacturers and retailers.

[Figure 4.2: diagram. Web shops provide gr:BusinessEntity (gr:legalName, gr:hasGlobalLocationNumber, gr:hasPOS, ...), gr:Offering (gr:name, gr:hasBusinessFunction, gr:hasPriceSpecification, ...), and gr:ProductOrService (gr:name, gr:hasEAN_UCC-13, gr:hasMPN, gr:hasBrand, ...), connected via gr:offers and gr:includes. Manufacturers (BMEcat) provide gr:ProductOrServiceModel (gr:name, gr:hasEAN_UCC-13, gr:hasMPN, gr:hasBrand, gr:hasManufacturer, ..., { product features }). A product is connected to its product model by an implicit link using product "strong identifiers" or by the explicit link gr:hasMakeAndModel.]

Figure 4.2: Retailer and manufacturer data in GoodRelations

The rest of this chapter is structured as follows: Section 4.2 reviews previous efforts on product data management (PDM) with Semantic Web technologies; Section 4.3 covers the key concepts that are the basis for the BMEcat2GoodRelations tool; Section 4.4 focuses on the evaluation of our overall approach; and finally, the conclusions and future opportunities of our work are discussed in Section 4.5.


In this section, we briefly describe the main characteristics of the BMEcat format and summarize related work.

4.2.1 BMEcat

BMEcat is a sophisticated XML standard for the exchange of electronic product catalogs between suppliers and purchasing companies in B2B settings [HS00]. The current release is BMEcat 2005 [SLK05a], a fully backward-compatible update of BMEcat 1.2 [SLK05a, p. 17]. The most notable improvements over previous versions are the support of external catalogs and multiple languages, and the consistent renaming of the ambiguous term ARTICLE to PRODUCT [SLK05a, p. 17]. Figure 4.3 presents a high-level view of the document structure for the transmission of a catalog using BMEcat 2005.

HEADER
    CATALOG
    AGREEMENT
    SUPPLIER
    BUYER
T_NEW_CATALOG
    CATALOG_GROUP_SYSTEM
        CATALOG_STRUCTURE
    PRODUCT
        PRODUCT_DETAILS
        PRODUCT_FEATURES
        PRODUCT_ORDER_DETAILS
        PRODUCT_PRICE_DETAILS
    PRODUCT_TO_CATALOGGROUP_MAP

Figure 4.3: BMEcat 2005 skeleton [based on SLK05a]

A BMEcat document comprises a header and a transaction part [HS00]:

• The header part defines global settings such as defaults for currency, eligible regions or catalog language, and specifies seller and buyer parties involved in the transaction [HS00]. It further may state the agreement or contract that the document is based on [HS00]. The default values specified in the document header may be overridden by values defined at product instance level in the document [cf. SLK05a, p. 14].

• The transaction part consists of a product data section and data related to classification standards (e.g. eCl@ss³, United Nations Standard Products and Services Code (UNSPSC)⁴) or vendor-specific catalog group systems [cf. SLK05a, p. 14]. Product data sections consist of product-related information, feature data, price details, and order details [HS00]. The element name of the payload part determines the transaction type and can be one of T_NEW_CATALOG (new catalog), T_UPDATE_PRODUCTS (update of product data), and T_UPDATE_PRICES (update of price data) [HS00].
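As a minimal sketch of how the transaction type can be read off the element name, the following Python snippet inspects the children of the document root. The sample document is a heavily simplified toy input rather than a complete BMEcat catalog, and the helper names local_name and transaction_type are hypothetical.

```python
import xml.etree.ElementTree as ET

# The three transaction elements named in the text and their meaning.
TRANSACTION_TYPES = {
    "T_NEW_CATALOG": "new catalog",
    "T_UPDATE_PRODUCTS": "update of product data",
    "T_UPDATE_PRICES": "update of price data",
}

# Simplified toy document; real catalogs carry many more elements.
SAMPLE = """<BMECAT>
  <HEADER><CATALOG/><SUPPLIER/></HEADER>
  <T_NEW_CATALOG><PRODUCT><SUPPLIER_PID>P1</SUPPLIER_PID></PRODUCT></T_NEW_CATALOG>
</BMECAT>"""


def local_name(tag: str) -> str:
    """Strip an XML namespace of the form '{uri}name', if present."""
    return tag.rsplit("}", 1)[-1]


def transaction_type(xml_text: str) -> str:
    """Return the name of the first known transaction element below the root."""
    root = ET.fromstring(xml_text)
    for child in root:
        name = local_name(child.tag)
        if name in TRANSACTION_TYPES:
            return name
    raise ValueError("no transaction element found")


kind = transaction_type(SAMPLE)
```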

4.2.2 Other Approaches

The rise of B2B e-commerce revealed a series of new information management challenges in the area of product data integration [e.g. Fen+01; SH01]. Separately, the gradual realization of the Semantic Web vision has motivated significant efforts aimed at representing existing e-commerce-related data and product classification standards adopting open semantic technologies and data models [e.g. Hep06; Hep07b; BM08].

³ http://www.eclass.de/ (accessed on May 16, 2014)
⁴ http://www.unspsc.org/ (accessed on May 16, 2014)




Yet, in the particular context of managing product master data, two previous solutions [Bru+07; Wan+09] stand out based on their similarities with respect to our problem scenario. The study in [Bru+07] presents a meta-model in OWL DLP as part of a semantic application framework that can provide semantic capabilities to a generic PIM system. Wang et al. [Wan+09] have developed an extension that allows lifting the data from existing relational databases of leading master data management (MDM) systems to the RDF format. This allows semantic interoperability across organizations’ core data, applications and systems.

Both solutions share our reliance on Semantic Web technologies to facilitate product master data integration and consistency across separate data sources. However, there are several aspects where they deviate from the proposal that we are going to present in the upcoming sections, most notably: (a) Their scope focuses on closed corporate environments which may involve proprietary applications or standards rather than open technologies at the scale of an open Web of Data; and (b) being aimed at generic PIM and MDM systems, their level of abstraction is rather high, introducing additional degrees of separation with respect to the applicability to the problem scenario targeted by our conversion approach.

In that sense, our approach is, to the best of our knowledge, the only solution developed on the basis of open standards, readily available to both manufacturers and retailers to convert product master data from BMEcat into structured RDF data suitable for publication and consumption on the Web of Data.

4.3 Product Model Master Data for the Semantic Web

The implementation of the logic behind the alignments to be presented herein resulted in the BMEcat2GoodRelations converter. In 2009, Mark Mattern, a master’s student supervised by Martin Hepp at the University of Innsbruck, developed a first version of an online converter that implemented an extensive mapping from BMEcat to GoodRelations as part of his master’s thesis [Mat09]. Yet, the online converter turned out to be impractical with respect to extremely large BMEcat files. In the following, we give proper credit to the valuable work of Mattern [Mat09] by extending it with additional mappings specifically for product master data and suggesting a more robust tool architecture that can accommodate conversions of large BMEcat files. Our tool, BMEcat2GoodRelations, is a portable command line application written in Python. The tool facilitates the conversion of BMEcat XML files into their corresponding RDF representation anchored in the GoodRelations ontology for e-commerce. It scales well to file sizes of several hundred megabytes. For more information about the project, we refer to the project landing page hosting the open source code repository⁵, where one can find a detailed overview of all the features of the converter, including a comprehensive user’s guide. A round-tripping toy example that describes the file structure of the converter output is also available online⁶.

4.3.1 Aligning BMEcat with GoodRelations

In the following, we outline correspondences between elements of BMEcat and GoodRelations and propose a mapping between the BMEcat XML format and the GoodRelations vocabulary. Given their inherent overlap, a mapping between the models is reasonable, with some exceptions that require special attention. We will highlight these cases; nonetheless, we cannot cover the full alignment here.

For the mapping between the two schemas we considered the following aspects:

• Company details (address, contact details, etc.),

• product offer details,

• catalog group structures,

• product features including links to media objects, and

• references to external product classification standards.

Furthermore, multi-language descriptions in BMEcat are handled properly, namely by assigning corresponding language tags to RDF literals. An illustrative example of a catalog and its respective conversion is available online⁷. However, within the scope of this work we focus mainly on product model data. Also, we do not provide alignments for full classification standards that can be exchanged starting with BMEcat 2005, primarily because of the complexity and for legal reasons especially relevant when converting licensed classification standards. Moreover, there already exist proposals that focus on the conversion and publication of product classification standards (e.g. eClassOWL [Hep05b]).

⁵ http://code.google.com/p/bmecat2goodrelations/ (accessed on August 23, 2014)
⁶ http://www.ebusiness-unibw.org/projects/bmecat2goodrelations/example/ (accessed on October 19, 2014)
⁷ http://www.ebusiness-unibw.org/projects/bmecat2goodrelations/example/ (accessed on August 23, 2014)




4.3.1.1 Product Details

At the center of the proposed alignments are product details and product-related business details. Table 4.2 shows the BMEcat-2005-compliant mapping for product-specific details.

Table 4.2: Mapping of product details from BMEcat to GoodRelations

BMEcat                                       GoodRelations
----------------------------------------------------------------------------------------
PRODUCT                                      gr:Offering, gr:Individual/gr:SomeItems,
                                             gr:ProductOrServiceModel
  SUPPLIER_PID type={ean, gtin}              gr:hasEAN_UCC-13, gr:hasGTIN-14
  PRODUCT_DETAILS
    DESCRIPTION_SHORT lang={en, de, ...}     gr:name with language en, de, ...
    DESCRIPTION_LONG lang={en, de, ...}      gr:description with language en, de, ...
    INTERNATIONAL_PID type={ean, gtin}       gr:hasEAN_UCC-13, gr:hasGTIN-14
    MANUFACTURER_PID                         gr:hasMPN
    MANUFACTURER_NAME                        gr:hasManufacturer → gr:BusinessEntity →
                                             gr:legalName
    PRODUCT_STATUS type={new, used, ...}     gr:condition

Table 4.2 adds an additional level of detail to the PRODUCT → PRODUCT_DETAILS structure introduced in Figure 4.3. The element name highlighted in bold font face determines a new nesting level, e.g. PRODUCT consists of an attribute for the product identifier of the supplier and a subelement PRODUCT_DETAILS. The elements discussed in the present context are all mapped to properties of product instances, product models and offers in GoodRelations. However, our main interest lies in the alignment to gr:ProductOrServiceModel. The product identifier can be mapped in two different ways, at product level or at product details level, whereby the second takes precedence over the first (in other words, the globally scoped value is overwritten by the locally scoped value). Whether the EAN or the GTIN is mapped depends on the type attribute supplied with the BMEcat element. Furthermore, the mapping at product level allows specifying the MPN, product name and description, and condition of the product. Depending on the language attribute supplied along with the DESCRIPTION_SHORT and DESCRIPTION_LONG elements in BMEcat 2005, multiple translations of product name and description can be obtained. Lastly, the manufacturer name is mapped to a slightly more complex pattern in GoodRelations, i.e. the value of MANUFACTURER_NAME maps to the name of the legal entity attached to the product model via gr:hasManufacturer. Listing 4.1 gives an example of product details in Terse RDF Triple Language (Turtle) after having been mapped to GoodRelations.


samsung:LEDTV_ES6300 a gr:ProductOrServiceModel ;
    gr:name "Samsung LED TV ES6300"@en ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
    gr:hasMPN "ledtv_es6300"^^xsd:string ;
    gr:hasManufacturer samsung:Samsung .

samsung:Samsung a gr:BusinessEntity ;
    gr:legalName "Samsung Group"@en .

Listing 4.1: Example of product details in Turtle/N3
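The precedence rule and the type-dependent choice of the identifier property can be sketched as follows. The snippet is an illustration under simplifying assumptions and not the converter's actual code: the input dictionary stands in for already extracted BMEcat values, the function names are hypothetical, and literal escaping is left out.

```python
# GoodRelations property per value of the BMEcat "type" attribute (Table 4.2).
ID_PROPERTIES = {"ean": "gr:hasEAN_UCC-13", "gtin": "gr:hasGTIN-14"}


def identifier_triple(product: dict):
    """Pick the identifier; the PRODUCT_DETAILS-level value takes precedence."""
    for key in ("INTERNATIONAL_PID", "SUPPLIER_PID"):
        entry = product.get(key)
        if entry and entry.get("type") in ID_PROPERTIES:
            return ID_PROPERTIES[entry["type"]], entry["value"]
    return None


def to_turtle(subject: str, product: dict) -> str:
    """Serialize a few product details as Turtle (no escaping of literals)."""
    lines = [f"{subject} a gr:ProductOrServiceModel"]
    for lang, text in product.get("DESCRIPTION_SHORT", {}).items():
        lines.append(f'    gr:name "{text}"@{lang}')
    ident = identifier_triple(product)
    if ident:
        lines.append(f'    {ident[0]} "{ident[1]}"^^xsd:string')
    mpn = product.get("MANUFACTURER_PID")
    if mpn:
        lines.append(f'    gr:hasMPN "{mpn}"^^xsd:string')
    return " ;\n".join(lines) + " ."


product = {
    "SUPPLIER_PID": {"type": "ean", "value": "0000000000000"},
    "INTERNATIONAL_PID": {"type": "ean", "value": "1234567890123"},
    "DESCRIPTION_SHORT": {"en": "Samsung LED TV ES6300"},
    "MANUFACTURER_PID": "ledtv_es6300",
}
turtle = to_turtle("samsung:LEDTV_ES6300", product)
```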

4.3.1.2 Product Features

BMEcat allows products to be specified using vendor-specific catalog groups and features, or by referring to classification systems with externally defined categories and features. The mapping of product classes and features is shown in Table 4.3.

Table 4.3: Mapping of product features from BMEcat to GoodRelations

BMEcat                                GoodRelations
----------------------------------------------------------------------------------------
PRODUCT_FEATURES
  REFERENCE_FEATURE_SYSTEM_NAME       referenced classification system identifier
  REFERENCE_FEATURE_GROUP_ID          rdf:type (class id of classification system)
  REFERENCE_FEATURE_GROUP_NAME        gr:category
  FEATURE
    FNAME                             rdfs:label and property name in GoodRelations
    FDESCR                            rdfs:comment
    FVALUE                            gr:hasValueFloat
    FUNIT                             gr:hasUnitOfMeasurement
    FREF                              feature id of referenced classification system,
                                      property name in the GoodRelations context

The target GoodRelations property of the REFERENCE_FEATURE_GROUP_NAME element is gr:category. REFERENCE_FEATURE_SYSTEM_NAME (e.g. ECLASS-5.1) and REFERENCE_FEATURE_GROUP_ID have no direct mapping; rather, a combination of them unambiguously determines the class identifier of a reference classification system (e.g. eClassOWL [Hep05b]). Likewise, the FREF element can be used together with FVALUE and an optional FUNIT element to specify a feature whose property is referenced externally. Otherwise, if no FREF is available for a feature, then the feature is defined locally. The FUNIT element can be used to distinguish property types in GoodRelations, i.e. to assign a quantitative object property to the product model in RDF if a value for FUNIT is given, otherwise a datatype property. The distinction will be addressed in more detail in Section 4.3.2. Listing 4.2 gives an example of product features in Turtle after having been mapped to GoodRelations (prefix declarations omitted).

samsung:LEDTV_ES6300 a gr:ProductOrServiceModel ;
    samsung:compatible_3D "true"^^xsd:boolean ;
    samsung:screen_size [ a gr:QuantitativeValueFloat ;
        gr:hasValueFloat "101.6"^^xsd:float ;
        gr:hasUnitOfMeasurement "CMT"^^xsd:string ] .

samsung:compatible_3D a owl:DatatypeProperty ;
    rdfs:subPropertyOf gr:datatypeProductOrServiceProperty ;
    rdfs:label "Plays 3D content"@en ;
    rdfs:domain gr:ProductOrService ;
    rdfs:range xsd:boolean .

samsung:screen_size a owl:ObjectProperty ;
    rdfs:subPropertyOf gr:quantitativeProductOrServiceProperty ;
    rdfs:label "Screen size"@en ;
    rdfs:domain gr:ProductOrService ;
    rdfs:range gr:QuantitativeValue .

Listing 4.2: Example of product features in Turtle/N3

4.3.1.3 Catalog Group Systems

Catalog groups are hierarchical structures that are used for facilitating the navigation and finding of products in a catalog [HS00]. They often group products not only by their similarity but also by typical usages or target audiences. With catalog groups, it is possible to further refine product descriptions (see Chapter 5). A catalog group system is mapped using an rdfs:subClassOf hierarchy based on the GenTax algorithm [HdB07], which can create meaningful ontology classes for a specific context while at the same time preserving the original hierarchy, i.e. the catalog group taxonomy. Table 4.4 outlines the mapping of catalog groups in BMEcat to RDF. The hierarchy is determined by the group identifier of the catalog structure that refers to the identifier of its parent group.

Table 4.4: Mapping of a catalog group system in BMEcat to an rdfs:subClassOf hierarchy

BMEcat                                        RDF
----------------------------------------------------------------------------------------
CATALOG_GROUP_SYSTEM
  CATALOG_STRUCTURE                           owl:Class
    GROUP_ID                                  class name of owl:Class
    GROUP_NAME lang={en, de, ...}             rdfs:label with language en, de, ...
    GROUP_DESCRIPTION lang={en, de, ...}      rdfs:comment with language en, de, ...
    PARENT_ID                                 rdfs:subClassOf (class id of superclass)


Listing 4.3 provides an example in Turtle of a catalog group structure built up according to the GenTax algorithm [HdB07] (lines 2–12).

 1  # GenTax mapping
 2  samsung:TV-tax a owl:Class ;
 3      rdfs:label "Samsung TV [Category]"@en .
 4  samsung:TV-gen a owl:Class ;
 5      rdfs:subClassOf gr:ProductOrService ;
 6      rdfs:label "Samsung TV [Product type]"@en .
 7  samsung:LED_TV-tax a owl:Class ;
 8      rdfs:label "Samsung LED TV [Category]"@en .
 9  samsung:LED_TV-gen a owl:Class ;
10      rdfs:subClassOf gr:ProductOrService ;
11      rdfs:label "Samsung LED TV [Product type]"@en .
12  samsung:LED_TV-tax rdfs:subClassOf samsung:TV-tax .
13
14  # Assignment of product type to product model
15  samsung:LEDTV_ES6300 a gr:ProductOrServiceModel, samsung:LED_TV-gen .

Listing 4.3: Example of catalog group information in Turtle/N3

4.3.1.4 Product and Catalog Group Map

In order to link catalog groups and products, BMEcat maps group identifiers with product identifiers using PRODUCT_TO_CATALOGGROUP_MAP [SLK05a, pp. 220f.]. Accordingly, products in GoodRelations are assigned corresponding classes from the catalog group system, i.e. they are defined as instances (rdf:type) of classes derived from the catalog group hierarchy. In Listing 4.3 (line 15), a product type is assigned to a product model based on a mapping rule between product identifiers and catalog group identifiers specified in the BMEcat catalog.
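The following Python sketch reproduces the triple pattern visible in Listing 4.3 from a list of catalog groups and a product-to-group map. It mirrors only that pattern and is not an implementation of the full GenTax algorithm [HdB07]; the input structures, the samsung prefix, and the function names are assumptions made for illustration.

```python
# (GROUP_ID, PARENT_ID, GROUP_NAME) entries; hypothetical toy values.
GROUPS = [
    ("TV", None, "Samsung TV"),
    ("LED_TV", "TV", "Samsung LED TV"),
]
# Product identifier -> catalog group identifier, as given by the group map.
PRODUCT_TO_GROUP = {"LEDTV_ES6300": "LED_TV"}


def group_triples(groups, prefix="samsung"):
    """Create a '-tax' and a '-gen' class per group, following Listing 4.3."""
    triples = []
    for group_id, parent_id, name in groups:
        tax = f"{prefix}:{group_id}-tax"
        gen = f"{prefix}:{group_id}-gen"
        triples += [
            (tax, "a", "owl:Class"),
            (tax, "rdfs:label", f'"{name} [Category]"@en'),
            (gen, "a", "owl:Class"),
            (gen, "rdfs:subClassOf", "gr:ProductOrService"),
            (gen, "rdfs:label", f'"{name} [Product type]"@en'),
        ]
        if parent_id is not None:
            # PARENT_ID preserves the original catalog group hierarchy.
            triples.append((tax, "rdfs:subClassOf", f"{prefix}:{parent_id}-tax"))
    return triples


def product_triples(mapping, prefix="samsung"):
    """Type each product model with the '-gen' class of its catalog group."""
    return [
        (f"{prefix}:{pid}", "a", f"{prefix}:{gid}-gen")
        for pid, gid in mapping.items()
    ]


triples = group_triples(GROUPS) + product_triples(PRODUCT_TO_GROUP)
```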

4.3.2 Selected Modeling Problems

In the following, we cover aspects of the conversion where the alignment of the two schemas turned out to be challenging.

4.3.2.1 Datatype versus Object Properties

The Web Ontology Language OWL distinguishes between object properties and datatype properties [DS04]. The former category describes properties that link individuals to other individuals, whereas the latter links individuals to data values (literals), e.g. an entity with a numeric value or a textual description. The GoodRelations vocabulary further refines the categorization made by OWL by separating qualitative and quantitative object properties. BMEcat, on the other hand, does not explicitly discriminate types of features, so features (FEATURE) typically consist of FNAME, FVALUE and, optionally, an FUNIT element [cf. SLK05a, pp. 138–143]. The presence of the FUNIT element helps to distinguish quantitative properties from datatype and qualitative properties, because quantitative values are determined by numeric values and units of measurement, e.g. “150 millimeters” or “1 bar”. Thus, any other feature is either a qualitative or a datatype property.

It is impossible to define a rule that reliably distinguishes qualitative properties from datatype properties in an automated way during the conversion (e.g. are “S”, “M”, and “L” qualitative values describing garment sizes or rather simple literal values?), so we defer this task to the RDF world (potentially bringing in additional knowledge) and declare all such properties as datatype properties with a range of type string.

For those features whose values likely qualify as boolean values we provide a simple heuristic, i.e. if the feature value is one of “y”, “n”, “yes”, “no”, “true”, or “false”, then the property is treated as a boolean datatype property. Similarly, all rules that apply to properties also apply to their respective values, i.e. a quantitative property implies quantitative values, and so forth.
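The decision rules described in this section can be summarized in a short sketch. The function classify_feature and its return labels are hypothetical, and the case-insensitive comparison of feature values is an assumption that the text does not state.

```python
from typing import Optional

# Values that the heuristic treats as booleans.
BOOLEAN_VALUES = {"y", "n", "yes", "no", "true", "false"}


def classify_feature(fvalue: str, funit: Optional[str] = None) -> str:
    """Return the kind of property a BMEcat FEATURE is mapped to."""
    if funit:
        # A unit of measurement signals a quantitative object property.
        return "quantitative"
    if fvalue.strip().lower() in BOOLEAN_VALUES:
        # Assumption: values are compared case-insensitively.
        return "boolean"
    # Qualitative and plain literal values cannot be told apart reliably,
    # so everything else becomes a datatype property with a string range.
    return "string"


kinds = [
    classify_feature("101.6", "CMT"),
    classify_feature("Yes"),
    classify_feature("M"),
]
```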

4.3.2.2 Float Value Ranges in Datatype Properties

Unlike GoodRelations, BMEcat does not provide a dedicated construct for modeling range values. There are two possible ways to model them in BMEcat, though. Either the BMEcat supplier defines two separate features, or the range values are encoded in the FVALUE element of the feature. The first option defines a feature for the lower range value and a feature for the upper range value, respectively. The downside of this approach is that two unrelated GoodRelations properties arise. The second alternative, i.e. range values encoded as single feature values, leads to invalid literals (e.g. gr:hasValueFloat “10–20”^^xsd:float) when mapped to GoodRelations. For that reason, typical value patterns describing upper and lower ranges (like an operating temperature of “5–40” degrees Celsius) are mapped during conversion to pairs of gr:hasMinValueFloat and gr:hasMaxValueFloat properties and respective values in GoodRelations. This approach, however, works only for the most prevalent syntactical patterns for range values in text fields.
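A minimal sketch of such a pattern-based range detection is given below. The regular expression covers only a few separators (hyphen, en dash, "...", "to") that were chosen for illustration; the patterns recognized by the actual converter are not listed in this chapter, and the function name parse_fvalue is hypothetical.

```python
import re

# Two decimal numbers separated by a hyphen, an en dash, "..." or "to".
RANGE_PATTERN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*(?:-|–|\.\.\.|to)\s*(-?\d+(?:\.\d+)?)\s*$"
)


def parse_fvalue(fvalue: str):
    """Map an FVALUE string to GoodRelations value properties, if possible."""
    match = RANGE_PATTERN.match(fvalue)
    if match:
        return {
            "gr:hasMinValueFloat": float(match.group(1)),
            "gr:hasMaxValueFloat": float(match.group(2)),
        }
    try:
        return {"gr:hasValueFloat": float(fvalue)}
    except ValueError:
        # Neither a recognized range nor a single number, e.g. "34 x 28 x 33.5".
        return None


temperature = parse_fvalue("5–40")
```

Values that match neither case are returned as None here, so that a caller could emit a warning instead of writing an invalid xsd:float literal.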


4.3.2.3 Units of Measurement

Both BMEcat and GoodRelations recommend using UN/CEFACT [Uni06] Common Codes to describe units of measurement. In reality, though, it is common that suppliers of BMEcat catalogs export raw unit of measurement codes, i.e. just as they are found in their PIM systems. Instead of adhering to the standard three-letter UN/CEFACT Common Code, they often provide different representations of unit symbols, e.g. “cm”, “centimeters”, etc. in place of “CMT”. This is inconvenient with regard to potential applications that should consume the data and compare products based on feature descriptions. As a means to enhance the data quality already during the conversion process, our tool allows for the provision of a mapping table with invalid unit codes and their respective UN/CEFACT counterparts.
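Such a mapping table can be pictured as a simple lookup, as in the following sketch. The table contents, the lower-casing of the raw codes, and the function normalize_unit are assumptions for illustration; the format in which the converter expects its mapping table is documented in the user's guide and not reproduced here.

```python
# Hypothetical mapping from raw unit symbols to UN/CEFACT Common Codes.
UNIT_MAPPING = {
    "cm": "CMT",
    "centimeters": "CMT",
    "m": "MTR",
    "v": "VLT",
}


def normalize_unit(raw_unit: str, mapping=UNIT_MAPPING) -> str:
    """Return the mapped Common Code, or the trimmed input if none is known."""
    key = raw_unit.strip().lower()
    return mapping.get(key, raw_unit.strip())


units = [normalize_unit(u) for u in ("cm", " Centimeters ", "CMT", "bar")]
```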

4.3.3 Scalability

BMEcat files, especially those of large industrial companies, can easily exceed 100 MB in size. With such file sizes, an online tool like the one proposed in [Mat09] quickly reaches its limits. The tool presented herein is an advancement of the work by Mark Mattern incorporating some lessons learned, i.e. (a) realizing a decentralized architecture that allows the tool to be run offline via a command line interface, and (b) parsing file contents using an event-based parsing strategy that does not have to store the complete XML tree in main memory. This made it possible to convert a 500 MB input file in less than two hours on an Apple MacBook from 2008 with 4 GB of main memory and an Intel Core 2 Duo processor running at 2.4 GHz. Even more importantly, the memory consumption during processing was fairly low.
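The idea of event-based parsing can be illustrated with the iterparse function from Python's standard library. This is a sketch of the general technique on a toy input; which XML library and which events the converter itself uses is not stated in this chapter, and the function iter_product_ids is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Toy input; a real catalog would be read from a file on disk.
SAMPLE = b"""<BMECAT><T_NEW_CATALOG>
  <PRODUCT><SUPPLIER_PID>A1</SUPPLIER_PID></PRODUCT>
  <PRODUCT><SUPPLIER_PID>A2</SUPPLIER_PID></PRODUCT>
</T_NEW_CATALOG></BMECAT>"""


def iter_product_ids(source):
    """Yield SUPPLIER_PID values one product at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "PRODUCT":
            yield elem.findtext("SUPPLIER_PID")
            # Release the children of the product that was just processed;
            # the emptied PRODUCT element itself stays attached to its parent.
            elem.clear()


ids = list(iter_product_ids(io.BytesIO(SAMPLE)))
```

Because each product subtree is discarded right after it has been handled, memory use depends mainly on the size of a single product rather than on the size of the whole catalog.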

4.4 Evaluation

To evaluate our proposal, we implemented two use cases that allowed us to produce a large quantity of product model data from BMEcat catalogs. We tested the two BMEcat conversions using standard validators for the Semantic Web, presented in the upcoming Section 4.4.1. In Section 4.4.2, we then compare the product models obtained from one of the BMEcat catalogs with products collected from Web shops through our focused Web crawl from Chapter 3. Finally, we show the potential leverage of product master data from manufacturers with regard to products offered on the Web.


4.4.1 Coverage of Use Cases

We tested our conversion using BMEcat files from two manufacturers, one in the domain of high-tech electronic components (Weidmüller Interface GmbH und Co. KG⁸), the second one a supplier of white goods (BSH Hausgeräte GmbH⁹). In the case of Weidmüller, the conversion results are available online¹⁰. While the Weidmüller catalog comes with its own proprietary catalog group system, the products in the BSH catalog were classified according to eCl@ss 6.1. This allowed us to validate the BMEcat converter comprehensively. Although the conversions completed without errors, we could still detect a few issues in each dataset that will be covered subsequently.

To validate the output of our conversion, we used publicly available online and offline validators. In addition to that, our converter prints helpful warning messages to the standard output. In summary, the converter was tested using the following validation steps: (1) BMEcat2GoodRelations converter output (including error and warning messages, if any); (2) RDF/XML syntax validity¹¹; (3) Pellet validation¹² for spotting semantic, logical inconsistencies; and (4) GoodRelations-specific compliance tests¹³ to spot data model inconsistencies.

The converter has built-in check steps that detect common irregularities in the BMEcat data, such as wrong unit codes or invalid feature values. In Table 4.5, we list a number of warning messages that were output during the conversion of the BMEcat files, together with the validation results of the different validation tools. As shown in the table, the two conversions pass most validation checks, with a few data quality issues reported by some validators. In the BSH catalog, for example, some fields that require floating point values contain non-numeric values like “/”, “0.75/2.2”, “3*16”, or “34 x 28 x 33.5”, which originate from improper values in the BMEcat source. Another data quality problem reported is the usage of non-uniform codes for units of measurement, instead of adhering to the recommended 3-letter UN/CEFACT Common Codes (e.g. “MTR” for meters, “VLT” for Volt, etc.).

⁸ http://www.weidmueller.com/ (accessed on August 23, 2014)
⁹ http://www.bsh-group.com/ (accessed on August 23, 2014)
¹⁰ http://catalog.weidmueller.com/semantic/sitemap.xml (accessed on August 23, 2014)
¹¹ http://www.rdfabout.com/demo/validator/ (discontinued as of August 23, 2014; but the source code is still available), http://www.w3.org/RDF/Validator/ (accessed on August 23, 2014)
¹² http://clarkparsia.com/pellet/ (accessed on August 23, 2014)
¹³ http://www.ebusiness-unibw.org/tools/goodrelations-validator/ (accessed on May 22, 2014)


Table 4.5: Validation of BMEcat conversions

Validation                BSH                                      Weidmüller
----------------------------------------------------------------------------------------
BMEcat2GoodRelations      warnings: (a) wrong values where         warnings: (a) non-standard
converter                 numeric values were expected;            unit codes detected
                          (b) non-standard unit codes detected
RDF Validator             valid. warning: invalid lexical          valid
                          value for literal
W3C RDF Validation        valid                                    valid
Pellet                    valid. warning: malformed                valid
                          xsd:float detected
GoodRelations Validator   step 32 failed: non-compliance           valid
                          of float literal with xsd:float

4.4.2 Missing Product Features on the E-Commerce Web of Data

Table 4.1 from this chapter’s introduction has revealed a mismatch between the features published by manufacturers and those published by online retailers via offering descriptions. In this section, we describe one additional example that uses structured data on the Web of Data.

In addition to the manufacturer BMEcat files, we took a real dataset obtained from a focused Web crawl whereby we collected product data from approximately 2,500 shops (see Chapter 3). Figure 4.4 depicts the distribution of the product offer count across Web shops in the crawl. For this figure, we considered only product offers with EANs, which appeared in 847 shops. Furthermore, in order to remove any potential bias caused by multiple definitions of the same product on different pages (because of non-canonical Uniform Resource Identifiers (URIs) containing query strings like prod_id=1&sess_id=XYZ), the boxplot was generated using the count of product offers per shop with distinct EANs. Hence, according to Figure 4.4, only 25% of the Web shops offer more than 526¹⁴ products with distinct EANs, half of the shops offer fewer than 91 products, and one quarter of the shops offer fewer than 15 products. There is even one shop that offers 79,076 products with distinct EANs.
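The counting step behind the boxplot can be sketched as follows. The offer list is invented toy data, and the quartile method ("inclusive") is an assumption, since the method used to obtain the figures reported above is not stated; different quartile definitions can yield slightly different values.

```python
from collections import defaultdict
from statistics import quantiles

# (shop, EAN) pairs as they might result from a crawl; hypothetical toy data.
offers = [
    ("shop-a", "111"), ("shop-a", "111"), ("shop-a", "222"),
    ("shop-b", "111"),
    ("shop-c", "111"), ("shop-c", "222"), ("shop-c", "333"),
]


def distinct_ean_counts(pairs):
    """Count offers per shop, ignoring repeated definitions of the same EAN."""
    eans_by_shop = defaultdict(set)
    for shop, ean in pairs:
        eans_by_shop[shop].add(ean)
    return {shop: len(eans) for shop, eans in eans_by_shop.items()}


counts = distinct_ean_counts(offers)
# Quartiles of the per-shop counts (assumed method: "inclusive").
q1, median, q3 = quantiles(sorted(counts.values()), n=4, method="inclusive")
```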

¹⁴ The exact number of the upper quartile is 526.5, but since the number of products is discrete, we herein refer to 526 products.

[Figure 4.4: boxplot. Axis: offer count (broken axis, 0 to 600 and 78,900 to 79,100); min = 1, q1 = 15, median = 91, q3 = 526.5, max = 79,076.]

Figure 4.4: Boxplot of the product offer count (with EANs) across Web shops in the crawl

In Table 4.6, we complement the example given in the introduction with insights from our collected data. The products listed in the first column of the table represent product models from the BSH dataset that match product instances from Web shops based on identical EANs. In the current dataset, there exist 95 such matches based on EANs. The comparison of the number of properties from the manufacturer with the number of properties from the retailers shows a significant gain from augmenting retail product data with manufacturers’ product model master data. For instance, take the vacuum cleaner (German: Bodenstaubsauger) in row 2 of Table 4.6. It shows 30 product properties coming from the manufacturer and an average number of nine properties across the three shops offering that product. Therefore, the properties in the shops only amount to a fraction (30%) of the properties available from the manufacturer.

Table 4.6: Product features in BSH BMEcat versus data from retailers publishing GoodRelations markup

BSH product                               EAN            BSH features  Retailer features  Retailer                        Coverage (a)
TW86103 Wasserkocher                      4242003535615  25            10                 marketplace.b2b-discount.de     40%
Bodenstaubsauger Beutel VS06G2410 2400 W  4242003356364  30            10                 www.ay-versand.de               30%
                                                                       9                  www.megashop-express.de
                                                                       8                  fairplaysport.tradoria-shop.at
Mikrowelle HF25M5L2 Edelstahl             4242003429303  51            7                  www.european-gate.com           13.73%

(a) “Coverage” = ratio of the average number of retailer features to the number of BSH features
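As a worked check of the coverage column, the following sketch recomputes the ratio from the feature counts listed in Table 4.6. The dictionary layout and the function name `coverage` are illustrative choices, not part of the original tooling.

```python
from statistics import mean

# Feature counts taken from Table 4.6:
# product -> (BSH feature count, feature counts per retailer)
table_4_6 = {
    "TW86103 Wasserkocher": (25, [10]),
    "Bodenstaubsauger Beutel VS06G2410 2400 W": (30, [10, 9, 8]),
    "Mikrowelle HF25M5L2 Edelstahl": (51, [7]),
}


def coverage(bsh_features, retailer_features):
    """Average number of retailer features divided by the BSH feature count."""
    return mean(retailer_features) / bsh_features


coverages = {name: coverage(bsh, shops) for name, (bsh, shops) in table_4_6.items()}
for name, value in coverages.items():
    print(f"{name}: {value:.2%}")
```

For the vacuum cleaner, the three retailers expose (10 + 9 + 8) / 3 = 9 features on average, which is 30% of the 30 features provided by the manufacturer.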

The relatively constant number of product features contributed by retailers may be explained by the shop extensions that typically expose only standard features like product name, GTIN, EAN, stock keeping unit (SKU), product weight, and product dimensions. Although this helps to explain the numbers to some extent, it does not change our premise that structured product master data is still lacking on the Web.

We gathered all the data in a SPARQL-capable RDF store and extrapolated some statistics to substantiate the potential of our approach. The number of product models in the BSH BMEcat was 1,376 with an average count of 29 properties, while the Weidmüller BMEcat consisted of 32,585 product models with 47 properties on average created by our converter. By contrast, the product instances from the crawl only contain seven properties on average, the product offers 13, and product models nearly twelve (see Table 3.2 in Chapter 3).


4.4.3 Leverage Effect of Product Master Data on the Web

Table 4.6 from Section 4.4.2 confirmed the scenario presented in Table 4.1 in the introduction of this chapter by comparing BSH product model data and structured product data from a sample of ~2,500 online shops.

In this section, we present some specific examples of the number of online retailers that could readily benefit from leveraging our approach. To remain within the scope of the use cases discussed, the examples are chosen from the BSH BMEcat product catalog within the German e-commerce market.

We chose to check for the number of shops offering products using a sample size of 90 random product EANs from the BSH BMEcat. The sample size was selected based on a 95% confidence level and 10% confidence interval (margin of error), i.e. requiring a minimum of 90 samples given the population of 1,376 products in the BMEcat. Using the sample of EANs, we then looked up the number of vendors that offer the products by entering the EAN in the search boxes on Amazon.de¹⁵, Google Shopping Germany¹⁶, and the German comparison shopping site preissuchmaschine.de¹⁷. This gave us a distribution of shops grouped by EAN as outlined in the boxplots in Figure 4.5.
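The minimum of 90 samples can be reproduced with a standard sample size calculation. The sketch below assumes Cochran's formula with a finite population correction, a z-score of 1.96 for the 95% confidence level, and the most conservative proportion p = 0.5; the text does not state which formula was actually used, so this is only one plausible derivation.

```python
import math


def required_sample_size(population, z=1.96, margin=0.10, p=0.5):
    """Cochran's sample size with finite population correction (assumed method)."""
    # Sample size for an infinitely large population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correction for the finite population of product EANs
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(required_sample_size(1376))
```

With a population of 1,376 products, the uncorrected value is about 96 and the corrected value about 89.8, which rounds up to the 90 samples used in the experiment.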

Figure 4.5: Boxplots of the distribution of shop offers per EAN (one boxplot each for http://www.google.de/shopping/, http://www.amazon.de/, and http://www.preissuchmaschine.de/; x-axis: offer count, 0 to 60)

The numbers that we got from this experiment were surprisingly small. For example, there was a maximum number of 48 sellers offering a specific product. For half of the products that we tested, at least 16 offers appeared in the price comparison search engine preissuchmaschine.de. In the Amazon.de and Google Shopping Germany marketplaces by comparison, the number of offers for a product among the sample of product EANs was even lower. We can think of various explanations for this, namely that the marketplace regulations try to limit competition among market participants and, more importantly, that adding products to the marketplace presents a barrier to smaller shop owners (in the case of Google Shopping, a shop is asked to upload product data using a populated product feed or an application programming interface (API) [cf. Goo15b]). Furthermore, the small numbers may be due to (1) localized searches (all shopping comparison engines in the .de-domain), (2) the fact that shops rarely populate their products with EAN identifiers, or (3) the type of products in our sample, in this case from the domain of white goods that are likely not the most popular product category for selling online. More precisely, unsupported small shop owners may not find it very attractive to sell dishwashers online given the logistical effort involved.

¹⁵ http://www.amazon.de/ (accessed on August 23, 2014)
¹⁶ http://www.google.de/shopping/ (accessed on May 8, 2014)
¹⁷ http://www.preissuchmaschine.de/ (accessed on August 23, 2014)

To put Figure 4.5 (boxplots) in perspective, we did a comparison with a more popular product, i.e. “Canon PowerShot A2300 schwarz” (with EAN “8714574578828”). We repeated the above searches with the same online services, but now using (a) the EAN of this digital camera model and (b) the product name, suspecting that many retailers do not populate their products with EANs but use other “strong identifiers” instead. Table 4.7 summarizes the results of this analysis. Amazon.de and preissuchmaschine.de consistently returned 45 and 233 results, respectively. Google Shopping Germany, however, gave only four results when searching by EAN, but 144 results for a search by product name. These results indicate that using a combination of different types of “strong identifiers” could leverage product master data on the Semantic Web.

Table 4.7: Product searches for a digital camera model on popular e-marketplaces

Marketplace              EAN Hits  Product Name Hits
preissuchmaschine.de     45        45
Amazon.de                233       233
Google Shopping Germany  4         144

Figure 4.6 shows the frequency distribution of EANs among the ~2,500 shops of our Web crawl. This information also contributes to an estimate of the impact of our approach. Almost half a million of the EANs are unique. For those unique products there is no big gain from manufacturers publishing product model master data, except for the high-quality product descriptions. The benefit becomes clearer for the 101 EANs that appear between 100 and 250 times. If we managed to persuade a single manufacturer of these products to publish product model master data on the Web of Data, then at least 100 retailers could benefit immediately. The expected lever is even higher given that manufacturers usually produce more than one good, and that the effort for publishing the full product catalog as structured data is similar to that for one product, since there remains one BMEcat file to convert. Moreover, most retailers do not offer only one but many product items by various manufacturers.

Figure 4.6: Frequency distribution of EANs with respect to the number of product offers for a particular EAN in the dataset

4.5 Conclusion

The proliferation of online retailers in recent years was accompanied by a growing number of products being offered on the Web. Such a substantial increase of online goods introduces new data management challenges. More specifically, it involves how information, in particular products, features or descriptions, can be processed by stakeholders along the product lifecycle. Our experience after a survey of ~2,500 different-sized online merchants indicates that, under current conditions, product data from retail sites suffers from incomplete, inconsistent or outdated product detail information.

In this thesis, we have presented a conceptual mapping and workflow for lifting product model master data from the popular BMEcat standard to the GoodRelations data model and the RDF meta-model, which makes it possible to use the Semantic Web approach for augmenting product information from the Web. As a practical solution to mitigate the shortage of product master data in the context of e-commerce on the Web of Data, we have proposed the BMEcat2GoodRelations converter. This ready-to-use solution comes as a portable command line tool that converts product master data from BMEcat XML files into their corresponding OWL representation on the basis of the GoodRelations ontology. All interested merchants then have the possibility of electronically publishing and consuming this authoritative manufacturer data to enhance their product offerings, relying on widely adopted product “strong identifiers” such as EAN, UPC, GTIN, or MPN. Alternatively, consumers of retail site markup could augment the raw data with it.
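The augmentation via strong identifiers can be pictured as a simple key-based join. The following is a minimal sketch with hypothetical records, shop names, and property names; it is not the converter itself and only shows how an offer that carries an EAN can inherit manufacturer properties.

```python
# Hypothetical manufacturer master data keyed by EAN (property names are illustrative).
manufacturer_models = {
    "4242003356364": {"model": "VS06G2410", "power_w": 2400},
}

# Hypothetical retailer offers; the second EAN has no master data available.
retailer_offers = [
    {"shop": "shop-a.example", "ean": "4242003356364", "price_eur": 89.0},
    {"shop": "shop-b.example", "ean": "0000000000000", "price_eur": 10.0},
]


def augment(offers, models):
    """Enrich each offer with manufacturer properties that share its EAN."""
    enriched = []
    for offer in offers:
        master = models.get(offer["ean"], {})
        # Offer-specific values take precedence over master data on key clashes.
        enriched.append({**master, **offer})
    return enriched


for record in augment(retailer_offers, manufacturer_models):
    print(record)
```

Offers without a matching identifier pass through unchanged, which mirrors the observation that retailers need to attach strong identifiers before they can benefit.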

We argue that the construction of a firm basis of product master data is the prerequisite for useful product discovery and matchmaking scenarios. The data we have collected and analyzed should motivate manufacturers to release their product master data and encourage retailers to attach strong identifiers to their products. The immediate impact would be a huge lever for enriching online offers by product features and less effort to be put into data cleansing thanks to a gain in more high-quality data. Both factors would pave the way to more granular data analysis and search experience for organizations and individuals.

5 Product Type Information



5.2.1 Product Classification Standards
5.2.2 Proprietary Product Category Systems
5.3 Deriving Product Ontologies from Knowledge Organization Systems
5.3.1 Conceptual Approach
5.3.2 Transformation of a Product Classification System
5.3.3 Converting Property Types, Range Information, and Enumerated Values
5.3.4 Serialization and Deployment
5.4 Evaluation
5.4.1 Correctness of the Derived Product Ontologies
5.4.2 Statistics on New Product Classes and Properties
5.5 Discussion
5.5.1 Classification of Product and Offer Descriptions
5.5.2 Navigation over Product and Offer Data
5.5.3 Semantic Annotation of Products and Offers on the Web
5.6 Conclusion

The classification of products and services greatly facilitates reliable and efficient electronic exchanges of product data between organizations. Many companies classify products (a) according to generic or industry-specific product classification standards, or (b) by using proprietary category systems. Such classification systems often contain thousands of product classes that are updated as needed (e.g. to cover new types of products), which implies a large quantity of potentially useful product category information for e-commerce applications on the Web of Data. Thus, instead of engineering product ontologies from scratch, which is costly, tedious, error-prone, and requires maintenance effort, it is generally attractive to derive them from existing classifications. This approach has been studied in the literature before, e.g. [CG01; Kle02; ZL03; Hep05b; Hep06].



In this chapter, we (1) describe a generic, semi-automated method for deriving Web Ontology Language (OWL) ontologies from product classification standards and proprietary category systems, which is conceptually based on the GenTax algorithm [HdB07] and the approach used for eClassOWL [Hep05b], but extended and updated to match Linked Open Data (LOD) principles [Ber06]. Moreover, we (2) show that our approach generates logically and semantically correct vocabularies, and (3) demonstrate the practical benefit of our approach. The resulting product ontologies are compatible with the GoodRelations vocabulary for e-commerce and schema.org, and they can be used to enrich product and offer descriptions on the Semantic Web with granular product type information from existing data sources.


The categorization of products and services plays a crucial role for many businesses and business applications [Fen+01]. It enables reliable and efficient electronic transactions on product data between organizations in a dynamic domain, characterized by innovation and a high degree of product specificity. In concrete terms, product classes allow for intelligent decision-making and operations over aggregated data, e.g. facilitating spend analysis [cf. HLS07] or enhanced navigability and search within product catalogs [cf. HS00]. The ability to operate on groups of products is often superior to applying heuristics on unstructured product descriptions, especially for tasks that require abstractions over individual product models or that depend on accessing subtle differences between competing products. For instance, a search for a personal computer relying on textual matches will not only return personal computers but probably also related accessories or books that discuss the broad topic of personal computers. Alternatively, with class membership information, it is possible to reliably distinguish between personal computers and related, but not necessarily relevant, products. Moreover, it makes it possible to query an exhaustive set of personal computers, which otherwise, with heuristics, would be difficult and expensive.

In practice, organizations often arrange products and services according to informal product classification systems that are not based on knowledge representation principles, e.g. eCl@ss [EClND] or UNSPSC [UniND]. At the same time, the number of quality, practically relevant product ontologies on the Web is still limited [Hep07a], among other reasons because most ontology engineering work is done in the context of academic research projects where efforts rarely go beyond early prototypes [Hep05b]. For serious e-commerce applications on the Web of Data, though, we need a broad domain coverage of specific classes, properties, and enumerated values for describing products and services. For this reason, a cost-efficient solution able to accommodate business needs on the Web of Data would be very useful.

Product classification systems are suitable candidates for creating high-quality and low-cost product ontologies for the Web [e.g. Hep06]. In many areas of e-commerce, where domains are typically composed of thousands of classes and properties, it proves difficult to engineer domain ontologies manually, because that would mean getting hold of a large number of concepts [Hep06]. Moreover, the conceptual dynamics [Hep07a] underlying the domain of products and services, determined by ongoing product innovation and a high degree of product specificity, make the manual creation of product ontologies even more problematic [Hep06]. Let us exemplify the situation by comparing the release sizes [ECl14] of different versions of eCl@ss [EClND], a comprehensive industry standard for the classification and description of products and services (see Figure 5.1): While eCl@ss 5.1.4 had defined 30,329 classes in 2007, eCl@ss 6.1, announced only two years later, was already counting 32,795 classes. The differences become even more evident for eCl@ss 6.1 and eCl@ss 9.0 BASIC with an increase of 25%, reaching 40,870 concepts within only five years.

Figure 5.1: Conceptual dynamics of the eCl@ss product categorization standard [based on ECl14]

In place of engineering new domain ontologies, it is often more practical to derive product ontologies from works already in place, i.e. to reuse existing industrial taxonomies, as argued in [Hep06]. This has several benefits:

1. The product classifications provide a comprehensive coverage of the conceptual domains [Hep06], often in multiple languages.

2. There is no significant overhead involved for maintaining derived product ontologies; on the contrary, they are automatically kept up-to-date with amendments to the classifications conducted by domain experts in response to changes in the real world [cf. Hep06].

3. Existing industrial standards are popular and thus already in wide use to classify product instance data [cf. Hep06]. In other words, a large number of products in relational databases are already classified according to such product categorization standards. At the same time, numerous Web shops create and maintain proprietary category systems along with their product catalogs.

Hence, instead of manually crafting complex domain ontologies and thereby in a sense reinventing the wheel, it appears sensible to unlock the potential of existing, well-maintained knowledge organization structures and to classify products on the Semantic Web according to them.

Unfortunately, product classifications are difficult to use for the Semantic Web in their raw form. They generally offer weak, ambiguous “topic” semantics, i.e. the same category can be used for very different types of entities. Furthermore, they only define an informal hierarchy, so that it is unclear whether a subsumption hierarchy shall be described using rdfs:subClassOf, skos:broader/skos:narrower, skos:broaderTransitive/skos:narrowerTransitive, etc. Even if a hierarchical relationship between classes in a classification system could be described using the rdfs:subClassOf property (e.g. ex:Car rdfs:subClassOf ex:Vehicle), inconsistencies may still arise when applied to another subsumption path of the same classification system (e.g. ex:Gearbox rdfs:subClassOf ex:Vehicle).
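The problem with reading every category link as rdfs:subClassOf can be made concrete with a small sketch. It uses a plain dictionary instead of an RDF library, and the instance name ex:myGearbox is a hypothetical example; the point is only that subclass reasoning would classify a gearbox as a vehicle.

```python
def superclasses(cls, subclass_of):
    """Return all direct and transitive superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Category hierarchy naively interpreted as rdfs:subClassOf statements
subclass_of = {
    "ex:Car": ["ex:Vehicle"],      # plausible: every car is a vehicle
    "ex:Gearbox": ["ex:Vehicle"],  # topical link only: a gearbox is not a vehicle
}

# Hypothetical instance typed with the category it was filed under
instance_types = {"ex:myGearbox": "ex:Gearbox"}

inferred = superclasses(instance_types["ex:myGearbox"], subclass_of)
print("ex:myGearbox inferred as ex:Vehicle:", "ex:Vehicle" in inferred)
```

Both statements look identical in the source classification, which is why the hierarchy cannot be mapped to rdfs:subClassOf for product classes without further checks.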

In this chapter, we present a generic approach and a fully-fledged, modular, and largely automated tool for deriving Web ontologies from product classification systems. We show that our approach generates logically and semantically correct domain ontologies in OWL DL that

1. establish canonical Uniform Resource Identifiers (URIs) for every conceptual element in the original schema,

2. preserve the taxonomic structure of the original classification while making its categories usable in multiple contexts,

3. comply with the GoodRelations vocabulary for e-commerce [Hep08a] and schema.org, and

4. can be readily deployed according to LOD principles [HB11] on the Web of Data.

The results of our transformation unlock additional semantics that enable novel Web applications. Thanks to the enrichment of product master data and a more granular description of offers by virtue of product ontologies, search engines and other consumers of structured data can take advantage of product type information for product search, comparison and matchmaking.

The rest of this chapter is structured as follows: Section 5.2 covers relevant background about product classification systems and compares our proposal with relevant alternatives in the literature; Section 5.3 introduces our approach, whose results are then evaluated and discussed in Sections 5.4 and 5.5; and finally, Section 5.6 concludes our work and discusses future extensions.


For the scope of this research, we distinguish two groups of classification schemes (systems) relevant to the domain of commercial products and services. These are product classification standards and proprietary product category systems (or structures). The main aspects of both groups are discussed in this section. Additionally, there is further relevant information that cannot be included here due to space limitations, but is available online¹. This supplementary material gathers a series of key attributes for every classification system comprising version, organization(s) authoring and managing the classification, available data sources, official report, target usage domain, intended regional use, and level of multilingual support.

5.2.1 Product Classification Standards

Product classification standards (or product categorization standards) are widely accepted knowledge structures often consisting of thousands of categories [cf. HLS07]. They typically comprise: (a) Hierarchical structures for the aggregation of products, which for example allow for spend analysis or reasoning over hierarchical relations; (b) common features and values related to product categories; and (c) multilingual descriptions of the elements that make up the standard.

¹ http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (accessed on September 16, 2014)


The product classification standards that we considered at the time of this research are:

• Classification of Products by Activity (CPA) [Eur08b],

• Central Product Classification (CPC)²,

• Common Procurement Vocabulary (CPV) [Eur08a],

• eCl@ss³,

• ElektroTechnisches InformationsModell (Engl.: Electro-Technical Information Model) (ETIM)⁴,

• FreeClass [Han07],

• Global Product Classification (GPC) [GS105],

• proficl@ss⁵, and

• Klassifikation der Wirtschaftszweige (Engl.: German Classification of Economic Activities) (WZ) [Sta08].

The featured standards are based on industry consensus and exist for various business fields, be it horizontal or vertical industries. eCl@ss, proficl@ss, and GPC, for example, describe a wide range of products from multiple industrial sectors. By contrast, CPV is intended for the procurement domain, whereas ETIM is focused on the field of electronics. Two standards, CPA and WZ, put forward classifications of comprehensive economic activities instead of product classifications per se. Nonetheless, commercial products can be classified according to them and their use is common among governmental publishers of statistical data. To solve potential ambiguity problems of product names, standards such as eCl@ss, ETIM, and proficl@ss include synonyms to provide discriminatory features [cf. Nav09] and to retain higher recall in product search scenarios. Furthermore, many standards (CPA, CPV, FreeClass, and WZ) feature translations in various languages.

5.2.2 Proprietary Product Category Systems

Proprietary product category systems (or catalog group systems, category structures) are also suited for organizing products and services. Unlike product classification standards, catalog group systems are generally characterized by little community agreement. Instead of communities or standardization bodies, single organizations or small interest groups are taking the lead for the development of such category structures. Thus, they are accepted only by a relatively small number of stakeholders, and their usage is often limited to a narrow context, e.g. to represent a navigational structure in a Web shop. Some examples of catalog group hierarchies considered in the context of this work are proprietary product taxonomies like the Google product taxonomy [Goo13] and the productpilot⁶ category system (the proprietary category structure of a subsidiary of Messe Frankfurt), as well as product categories transmitted via catalog exchange formats like BMEcat⁷ [SLK05a]. The latter can take advantage of both product categorization standards and catalog group structures in order to organize types of products and services and to contribute additional granularity in terms of semantic descriptions, as previously covered in Chapter 4 [see also SRH13b].

² http://unstats.un.org/unsd/cr/registry/cpc-2.asp (accessed on September 16, 2014)
³ http://www.eclass.de/ (accessed on May 16, 2014)
⁴ http://www.etim.de/ (accessed on May 16, 2014)
⁵ http://www.proficlass.de/ (accessed on September 16, 2014)


This research work partially builds upon previous works in the area of transforming classification standards into Web ontologies. The challenges in the conversion of product classification standards were already discussed in [Hep05a; Hep06], whose findings led towards the development of the GenTax algorithm in [HdB07], still a core component of our solution. The subsequent initial release of the GoodRelations ontology [Hep08a] motivated the first transformation of the eCl@ss standard [cf. Hep05b] (version 5.1.4⁸) as a GoodRelations-compliant ontology relying on the GenTax methodology.

Alternatively, there have been previous efforts to convert other product classification schemes that are also supported by our tool: Most notably CPV ([PAA08], and another effort in the context of a project concerned with the publishing of open government data⁹), primarily used to streamline the procurement and tendering process in the public sector. On a broader scope, the research in [Vil11] provides the most recent and comprehensive survey of methods and tools for the refactoring of most types of non-ontological resources into ontological resources, i.e. Web ontologies. Villazon-Terrazas [Vil11] developed a comprehensive qualitative framework to categorize non-ontological resources based on their characteristics. One of the types of non-ontological resources acknowledged in his work are actually the general classification schemes for any given domain, such as those for products considered in the current research. In fact, two methods, [Hak+06], again GenTax, and a tool, SKOS2OWL¹⁰ [HR09], are identified to focus on the conversion of classification schemes into Web ontologies.

⁶ http://www.productpilot.com/ (accessed on September 16, 2014)
⁷ Developed by the German Bundesverband Materialwirtschaft, Einkauf und Logistik (Engl.: Federal Association of Materials Management, Purchasing and Logistics) (BME).
⁸ http://www.heppnetz.de/projects/eclassowl/ (accessed on September 16, 2014)
⁹ http://linked.opendata.cz/resource/dataset/cpv-2008 (accessed on September 16, 2014)

Yet, in summary, to the best of our knowledge, the approach described in this chapter is the only methodology with mature tool support that extends the features and capabilities of all the conversion efforts previously mentioned, on at least one, if not several of the following fronts: (1) The level of automation; (2) modular, extensible architecture supporting the conversion of an arbitrary number of classification systems; (3) the application to a broad set of non-ontological resources, i.e. almost all relevant classification schemes; (4) traceability including preservation of the taxonomic structure between the elements in the original classification scheme and those in the derived Web ontology; (5) improved support for properties and enumerations; (6) high degree of configuration options aimed at the deployment on the Web of Linked Open Data (LOD); and, lastly, (7) compliance to the GoodRelations and schema.org vocabularies, which currently allows for the publication of product information in various Web data formats (e.g. Microdata and Resource Description Framework in Attributes (RDFa)).

5.3 Deriving Product Ontologies from Knowledge Organization Systems

In this section, we present a generic, semi-automated approach to turn standards and proprietary product classification systems into respective product ontologies. Subsequently, we outline the conceptual architecture of our proposal, followed by a description of the conceptual transformation.

5.3.1 Conceptual Approach

Figure 5.2 depicts the conceptual approach of PCS2OWL¹¹. The tool consists of a modular architecture that builds upon three layers, namely parser, transformation process, and serializer. Prior to executing the script, a moderate amount of initial human labor is needed, mainly to prepare the import modules (parsers) for the respective classification systems, as indicated by the dashed rectangle in Figure 5.2. This preliminary task includes providing the essentials for mapping the taxonomy and setting up the handling of property types. Apart from defining these details, the parsers’ purpose is to load categories, features, and values of product classification systems into an internal model, which specifies ontology classes, properties, and individuals. The next steps, the transformation and serialization processes, are fully automated. In the transformation step, the internal model, consisting of entities for classes, properties, and individuals, is turned into a Resource Description Framework (RDF) graph that describes the final ontology. At this stage, the logical rules from the parsers are also applied to the internal model. Finally, the RDF model is serialized as RDF/XML, and all other files required for the online deployment of the product ontologies are created accordingly.

¹⁰ http://www.heppnetz.de/projects/skos2owl/ (accessed on September 16, 2014)
¹¹ http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL (accessed on May 22, 2014)

Figure 5.2: Conceptual architecture of PCS2OWL (inputs from product categorization standards and from proprietary hierarchies and catalog group structures: CPA 2008, CPC Ver.2, CPV 2008, eCl@ss 5.1.4 and 6.1, ETIM 4.0, FreeClass, GPC, proficl@ss 4.0, WZ 2008, Google product taxonomy, productpilot, BMEcat catalog groups; source formats .xls, .csv, .xml, .txt, .mdb, ...; custom parsers produce objects (classes, properties, individuals), the transformation produces RDF, and the serialization produces RDF/XML, HTML, and sitemap.xml)
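The three layers described above can be sketched in a few lines. This is a simplified illustration and not the PCS2OWL code: the class `InternalModel`, the functions `parse_csv`, `transform`, and `serialize`, the `id,label` input layout, and the example.org namespace are all assumptions, and the sketch emits N-Triples for brevity although the tool serializes RDF/XML.

```python
import csv
import io
from dataclasses import dataclass, field

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
OWL_CLASS = "<http://www.w3.org/2002/07/owl#Class>"
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
BASE = "http://example.org/pcs#"  # placeholder namespace, not a real ontology URI


@dataclass
class InternalModel:
    """Internal model filled by a parser (classes only in this sketch)."""
    classes: dict = field(default_factory=dict)  # category id -> label


def parse_csv(text):
    """Parser layer: hypothetical hand-written import module for 'id,label' rows."""
    model = InternalModel()
    for row in csv.DictReader(io.StringIO(text)):
        model.classes[row["id"]] = row["label"]
    return model


def transform(model):
    """Transformation layer: turn the internal model into RDF triples."""
    triples = []
    for cid, label in model.classes.items():
        subject = f"<{BASE}C_{cid}>"
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, RDF_TYPE, OWL_CLASS))
        triples.append((subject, RDFS_LABEL, f'"{escaped}"'))
    return triples


def serialize(triples):
    """Serializer layer: emit N-Triples (the tool itself writes RDF/XML)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


source = "id,label\n10,Cameras\n11,Digital Cameras\n"
print(serialize(transform(parse_csv(source))))
```

Only the parser depends on the source format, so supporting a further classification system in this sketch would mean adding another parser function while the transformation and serialization stay unchanged.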

In the context of this work, we have developed custom parsers for a number of popular categorization standards and proprietary taxonomies for products and services, previously introduced in Sections 5.2.1 and 5.2.2 and outlined in Figure 5.2. The input formats of the source files of the classification systems are irrelevant to the converter, since the parsers have to be hand-crafted anyway. For our conversions, we had to deal with Excel spreadsheets (files ending in “.xls”), comma-separated values (CSV) files (“.csv”), Extensible Markup Language (XML) files (“.xml”), database tables (“.mdb”), and plain text files (“.txt”).

The effort necessary for developing a parser module is negligible in comparison to hand-crafting a product ontology from scratch. For simple classification systems such as GPC or the Google product taxonomy that merely comprise classes and do not define properties, it was for example sufficient to extend an empty parser template by only twenty lines of additional code. Even the most complex parser module that we have created so far required less than 200 lines of custom code. This module for the FreeClass classification standard includes sophisticated rules for raising the data quality of the resulting product ontology.

5.3.2 Transformation of a Product Classification System

A core aspect of the transformation step is the creation of the product classes in the resulting ontology based on the source product classification system. To create the ontology classes, the PCS2OWL tool relies on the GenTax approach introduced in [HdB07]. GenTax makes it possible to generate a consistent OWL DL ontology while preserving the taxonomic structure of the original categories in the product classification system. In order to do so, the GenTax method creates, for each category in the product classification system, two corresponding OWL classes in the target ontology:

1. Broad topic: A broader taxonomic class that represents the category from the product classification system.

2. Specific type: A context-specific class, in our case in the domain of products and services.

For a given category identified as “ID” in the original product classification system, let us hereinafter refer to the pair of OWL classes that GenTax creates as C_ID-gen and C_ID-tax, following the naming convention of the original GenTax specification [HdB07].

There are additional design decisions that are applied in the conversion process to create the classes and the class structure of the resulting ontology [cf. HdB07]:

1. All taxonomic classes (C_ID-tax) are arranged in a subsumption class hierarchy via the rdfs:subClassOf relation to preserve the hierarchical structure of the corresponding categories in the original product classification system.

2. Every context-specific class (C_ID-gen) is defined as a subclass of the common GoodRelations product class gr:ProductOrService via the rdfs:subClassOf property to state that it represents entities that are products.

3. Every context-specific class (C_ID-gen) is at the same time also a subclass of the corresponding C_ID-tax taxonomic class, to preserve its traceability to the category in the original product classification system that it was derived from.

4. There exist no subsumption relationships between context-specific classes (C_ID-gen), because it is not possible to determine automatically whether a subsumption relation between two C_ID-gen classes holds or not, due to frequent anomalies and ambiguities in the original categorization schemas [HdB07].

Figure 5.3 shows an example that results from the conversion of the following fragment of the English version of the Google product taxonomy [Goo13]:

Cameras & Optics > Cameras > Digital Cameras

Cameras & Optics > Cameras > Disposable Cameras

Figure 5.3 exhibits all four design decisions of the GenTax algorithm outlined previously. The right side shows the taxonomic class hierarchy, whereas the left part describes the context-specific class hierarchy. The black solid arrows stand for the rdfs:subClassOf relationships. As indicated, (1) the taxonomic classes represent the categories in the Google product taxonomy and preserve the same hierarchical structure; (2) the context-specific classes represent actual products and services and, hence, are subsumed by gr:ProductOrService; (3) all context-specific (product) classes are at the same time subclasses of their respective taxonomic class, e.g. the product class C_Cameras-gen is a subclass of the broad topic C_Cameras-tax; and (4) no subsumption relation is imposed upfront between the product classes, thus in visual terms they are arranged as mutual pair-wise siblings. This design decision is based on the observation that the hierarchical relationships in informal category systems frequently suffer from modeling anomalies attributable to their specific intended usage; for a deeper analysis, see [Hep05a; Hep06; HdB07]. Nonetheless, a hierarchy between product classes might still be established either manually or automatically. If set up automatically, a statistical test might be necessary, i.e. to safeguard the reliability of the conceptual choice (e.g. the use of rdfs:subClassOf relationships to represent the hierarchy) by taking random samples whose validity is checked.
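The four design decisions can be illustrated with a minimal Python sketch. It is not the PCS2OWL implementation: triples are represented as plain tuples instead of an RDF library, and the identifier scheme derived from the labels (`gentax_triples`, `superclasses`) is a hypothetical simplification chosen for this example.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

SUBCLASS_OF = "rdfs:subClassOf"
PRODUCT_OR_SERVICE = "gr:ProductOrService"


def gentax_triples(paths: List[List[str]]) -> Set[Triple]:
    """Create a -tax and a -gen class for every category on the given paths."""
    triples: Set[Triple] = set()
    for path in paths:
        parent_tax = None
        for label in path:
            # Hypothetical identifier scheme: drop spaces, spell out "&".
            ident = label.replace("&", "and").replace(" ", "")
            tax, gen = f"C_{ident}-tax", f"C_{ident}-gen"
            if parent_tax is not None:
                triples.add((tax, SUBCLASS_OF, parent_tax))      # decision 1
            triples.add((gen, SUBCLASS_OF, PRODUCT_OR_SERVICE))  # decision 2
            triples.add((gen, SUBCLASS_OF, tax))                 # decision 3
            # Decision 4: no subsumption between -gen classes is emitted.
            parent_tax = tax
    return triples


def superclasses(cls: str, triples: Set[Triple]) -> Set[str]:
    """All classes reachable from cls via rdfs:subClassOf (transitive)."""
    found: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


paths = [
    ["Cameras & Optics", "Cameras", "Digital Cameras"],
    ["Cameras & Optics", "Cameras", "Disposable Cameras"],
]
triples = gentax_triples(paths)
```

With this data, `superclasses("C_DigitalCameras-gen", triples)` contains `C_Cameras-tax` and `gr:ProductOrService`, but not `C_Cameras-gen`, which mirrors the sibling arrangement of the product classes.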

The adoption of the GenTax approach provides several features to the resulting ontologies produced by the PCS2OWL tool. GenTax creates meaningful, practically useful product classes (i.e. “-gen” classes on the left side of Figure 5.3) by defining these as subclasses of gr:ProductOrService, which at the same time renders the resulting OWL DL product ontology compatible with GoodRelations and schema.org. By preserving the hierarchical structure of the product classification system (i.e. “-tax” classes on the right side of Figure 5.3), GenTax allows the execution of generalization/specialization queries based on the original product classification system. For example, it permits querying the common category C_Cameras-tax in order to get the union of all instances of the classes C_DigitalCameras-gen and C_DisposableCameras-gen. The use of the rdfs:subClassOf relationship in the taxonomic classes means that no reasoning capabilities beyond the widely supported RDF Schema (RDFS) inferencing are required to navigate through the taxonomic structure of the original product classification system in the generated ontology. Additionally, for traceability and provenance purposes, every class indicates the ontology that it is described by through the rdfs:isDefinedBy property; moreover, every taxonomic class specifies a hierarchy code (materialized via an annotation property :hierarchyCode) to link it to the corresponding category code used in the source classification system.

[Figure 5.3 (diagram): The right side, labeled TAX, shows the category concepts (topics, categories) “Cameras and Optics”, “Cameras”, “Digital Cameras”, and “Disposable Cameras”. They carry the broad meaning of the label so that the edges are subClassOf; e.g. “Cameras and Optics” represents all objects that are related to Cameras and Optics as a topic. The left side, labeled GEN, shows the generic concepts (objects of a clearly defined type) with the same labels. Each is based on the label in one narrow context (here: product or service) and is a subclass of gr:ProductOrService; e.g. “Digital Cameras” represents all actual Digital Cameras.]

Figure 5.3: GenTax applied to a subset of the Google product taxonomy [cf. HdB07]

5.3.3 Converting Property Types, Range Information, and Enumerated Values

Our approach extends the GenTax algorithm of the previous section. In addition to the extraction of OWL classes from hierarchical classifications, PCS2OWL converts features and feature values of product classification systems, thus contributing additional semantics to categories. The different types of properties that are supported by the tool are in line with the GoodRelations ontology and consist of

• qualitative properties (gr:qualitativeProductOrServiceProperty),

• quantitative properties (gr:quantitativeProductOrServiceProperty), and

• datatype properties (gr:datatypeProductOrServiceProperty).

Similarly, our tool distinguishes between enumerations or qualitative values (gr:QualitativeValue, e.g. the color “red”), quantitative values (gr:QuantitativeValue of type xsd:float or xsd:integer, e.g. values that indicate ranges like “500 milliliters”), and literal values with datatypes (xsd:float, xsd:integer, xsd:boolean, or xsd:string).

Custom rules and heuristics guide the distinction of the property types and related values. They have to be provided with the parser modules in order to be applicable for the subsequent transformation step where respective OWL properties are generated automatically. Thus, the quality of the conversion strongly depends on the correctness of this logic. As a general rule of thumb, a numerical value accompanied by a unit code in the product classification system will yield a quantitative value in the resulting product ontology, and not a qualitative value or a datatype literal. Similarly, if there are predefined values in a classification system, the corresponding properties and the individuals created from the values themselves will be of qualitative nature. Sometimes, classification standards even provide additional information to facilitate the distinction of the intended type of features and values. For example, ETIM indicates logical values with an “L” metadata flag, which are hence best mapped as boolean literals in RDF. A corresponding guideline for eCl@ss was detailed in [Hep08b].
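The rule of thumb above can be sketched as a small decision function. The `Feature` record, its field names, and the order in which the rules are tried are assumptions made for illustration; the actual parser modules may encode these heuristics differently for each classification system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

QUALITATIVE = "gr:qualitativeProductOrServiceProperty"
QUANTITATIVE = "gr:quantitativeProductOrServiceProperty"
DATATYPE = "gr:datatypeProductOrServiceProperty"


@dataclass
class Feature:
    """Hypothetical, simplified view of a feature in a classification system."""
    name: str
    unit_code: Optional[str] = None            # placeholder unit, e.g. "ml"
    predefined_values: List[str] = field(default_factory=list)
    type_flag: Optional[str] = None            # e.g. the ETIM "L" flag


def property_type(feature: Feature) -> str:
    """Pick the GoodRelations property type for a feature (assumed rule order)."""
    if feature.type_flag == "L":
        return DATATYPE        # logical value, mapped to an xsd:boolean literal
    if feature.unit_code:
        return QUANTITATIVE    # numerical value accompanied by a unit code
    if feature.predefined_values:
        return QUALITATIVE     # enumerated values become individuals
    return DATATYPE            # fall back to a plain datatype literal
```

For instance, `property_type(Feature("color", predefined_values=["red", "blue"]))` returns the qualitative property type, whereas a feature carrying a unit code is treated as quantitative.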

5.3.4 Serialization and Deployment

In this section, we describe the serialization and deployment of the resulting product ontologies. This includes deciding on a canonical URI pattern for publishing the entities on the Web, and providing alternative ways to support standards-compliant Web ontology deployments.

The product classes and related entities in the ontologies obey a common URI pattern, which comprises

1. the base URI of the ontology;

2. a prefix to help humans distinguish URIs of different entity types, namely “C_” for classes, “P_” for properties, and “V_” for values;

3. an identifier that is unique in the context of the category system, which for categories is typically the hierarchy code; and, for classes,

4. a suffix to distinguish context-specific or generic (“-gen”) classes from taxonomic (“-tax”) classes.

Following this pattern, the URI¹² of a context-specific class “Disposable Cameras” (hierarchy code “10001488”) in the resulting GPC product ontology is

http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_10001488-gen

PCS2OWL offers two deployment alternatives for product ontologies, namely based on hash and slash URIs. Hash URIs use the number sign character (“#”) to refer to entities, whereas slash URIs address entities directly, with the local name following a slash character (“/”). In the latter case, the difference between entities and their respective document representations is established using Hypertext Transfer Protocol (HTTP) forwarding with the status code 303 See Other [SC08, Section 4.2]. The two deployment alternatives are described in [SC08, Section 4].

The first option generates a single comprehensive dump of the RDF graph, which is serialized as RDF/XML. The downside of this approach is the potentially huge file size that can make it infeasible for large classification systems, because every HTTP request against the URI of a single element from the resulting ontology will require the transmission of the entire representation, as the hash fragment part is not sent to the server according to the specification for URIs [BFM05, Section 3.5; SC08, Section 4.1]. By contrast, the slash-based option generates a series of small RDF files, comprising separate files for all taxonomic and generic classes, and, if available, also for properties and individuals. This has the advantage that it allows serving smaller chunks of code for individual elements compared to its full dump counterpart. Moreover, with this option the tool creates a navigable documentation consisting of a set of interlinked Hypertext Markup Language (HTML) pages that mimic the hierarchical relationships. The two deployment alternatives imply different URI patterns that are of the following form:

http://example.org/pcs#C_1234-gen (hash-based)

http://example.org/pcs/C_1234-gen (slash-based)
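The URI pattern and its two deployment variants can be written down as a short Python helper. The function name `entity_uri` and its parameters are hypothetical and serve only to make the four components of the pattern explicit.

```python
from typing import Optional

# Prefixes that help humans tell entity types apart.
PREFIXES = {"class": "C_", "property": "P_", "value": "V_"}


def entity_uri(base: str, kind: str, identifier: str,
               variant: Optional[str] = None, separator: str = "/") -> str:
    """Compose base URI, type prefix, identifier and (for classes) suffix.

    separator is "/" for slash-based and "#" for hash-based deployments.
    """
    if separator not in ("/", "#"):
        raise ValueError("separator must be '/' or '#'")
    if kind == "class":
        if variant not in ("gen", "tax"):
            raise ValueError("classes need the variant 'gen' or 'tax'")
        suffix = f"-{variant}"
    else:
        suffix = ""  # properties and values carry no suffix
    # Strip a trailing separator from the base so it is not doubled.
    return f"{base.rstrip('/#')}{separator}{PREFIXES[kind]}{identifier}{suffix}"


gpc_base = "http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/"
print(entity_uri(gpc_base, "class", "10001488", "gen"))
print(entity_uri("http://example.org/pcs", "class", "1234", "gen", separator="#"))
```

The first call reproduces the GPC class URI given above, the second one the hash-based form of the example pattern.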

In addition to the creation of RDF/XML and HTML files, PCS2OWL also generates a Semantic Sitemap [Cyg+08], and an .htaccess¹³ file for the easy deployment on an Apache Web server. Content negotiation [FR14d, Sections 3.4 and 5.3] for the delivery of the Web resources is ensured using best practice patterns described online [BP08, Recipe 5]. For slash URIs it means that by dereferencing an arbitrary entity URI (e.g. a class URI), an HTML-preferring client is redirected to a respective HTML document using the HTTP response status code 303 See Other. Similarly, the client retrieves RDF/XML if the media type supplied with the HTTP Accept header is application/rdf+xml. In this sense, our approach constitutes a fully LOD-compliant deployment [HB11].

12 http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_10001488-gen (accessed on September 16, 2014)
13 https://httpd.apache.org/docs/current/howto/htaccess.html (accessed on February 19, 2016)
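The redirect behavior for slash URIs can be approximated in a few lines of Python. This is only a sketch of the decision logic: the actual deployment relies on Apache rewrite rules in the generated .htaccess file, and the document naming (entity URI plus “.html” or “.rdf”) as well as the simplified Accept-header parsing are assumptions of this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from media type to document extension.
SUPPORTED = {"text/html": ".html", "application/rdf+xml": ".rdf"}


def parse_accept(header: str) -> Dict[str, float]:
    """Parse an Accept header into {media range: q-value} (simplified)."""
    prefs: Dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs[media] = q
    return prefs


def quality(media: str, prefs: Dict[str, float]) -> float:
    """Quality of a media type: exact match, then type/*, then */*."""
    main = media.split("/")[0]
    for candidate in (media, main + "/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate]
    return 0.0


def negotiate(entity_uri: str, accept: str) -> Tuple[int, Optional[str]]:
    """Return (status code, Location) for a dereferenced slash URI."""
    prefs = parse_accept(accept or "*/*")
    # On ties, max() keeps the first entry, so HTML is preferred.
    best = max(SUPPORTED, key=lambda m: quality(m, prefs))
    if quality(best, prefs) <= 0:
        return 406, None
    return 303, entity_uri + SUPPORTED[best]


uri = "http://example.org/pcs/C_1234-gen"
print(negotiate(uri, "application/rdf+xml"))
print(negotiate(uri, "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

A client that only accepts application/rdf+xml is redirected to the RDF/XML document, while a typical browser Accept header leads to the HTML page; in both cases the status code is 303.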

5.4 Evaluation

In the following, we evaluate our approach. We focus on two key aspects, namely on the correctness of the conversion results and on the amount of new product classes, properties, and enumerations obtained.

5.4.1 Correctness of the Derived Product Ontologies

In this part of the evaluation, we analyze whether the product ontologies properly reflect the elements and the hierarchical structure provided by the original product classification systems. We first did a quantitative comparison of the conceptual elements in the product classification systems and all classes, properties and individuals of the corresponding product ontologies. For this purpose, we examined the number of concepts in the source files or database tables and contrasted them to the number of files produced for related types of concepts, e.g. the number of taxonomic classes in the ontologies. If the numbers matched, it implied that the concepts were successfully converted and are contained in the product ontologies, which was actually the case for all ontologies that we have generated.

We complemented and further confirmed our previous findings by an experiment conducted on a product ontology derived from the Google product taxonomy [Goo13]. The taxonomy file is publicly available online¹⁴ as plain text. It is line-based and characterized by a category tree whose hierarchical structure is expressed using delimiting angle brackets as follows:

Food, Beverages & Tobacco > Beverages > Coffee > Coffee Pods

The taxonomy is read from the left starting with the most generic concept and getting more specific moving to the right. Accordingly, “Coffee” is a more specific concept than “Beverages” with respect to the Google product taxonomy. Our idea was basically to reverse-engineer the original taxonomy starting from the product ontology that we loaded into a SPARQL Protocol and RDF Query Language (SPARQL) endpoint. A set of appropriate SPARQL queries allowed us to build up the whole hierarchy in a top concept → ... → bottom concept fashion. We then concatenated the respective RDFS labels using the exact same delimiters as advocated by the Google product taxonomy file format. And finally, the results of the concatenation were compared to the contents in the original source file. This way we were able to recreate an equivalent copy of the original file, which confirms the completeness and reliability of our conversion. Figure 5.4 illustrates the single steps of our evaluation approach, which are described in more detail online¹⁵.

14 http://www.google.com/basepages/producttype/taxonomy.en-US.txt (accessed on July 22, 2014)

[Figure 5.4 (diagram): The TXT source line “Food, Beverages & Tobacco > Beverages > Coffee > Coffee Pods” is converted into the class chain [Food,_Beverages_and_Tobacco] Food, Beverages & Tobacco (top node, most generic), [Beverages] Beverages, [Coffee] Coffee, [Coffee_Pods] Coffee Pods (leaf node, most specific). SPARQL queries, concatenation, and validation then reproduce the same TXT line.]

Figure 5.4: Reverse-engineering of the Google product taxonomy
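The round trip of Figure 5.4 can be mimicked without a SPARQL endpoint. In the following Python sketch, two dictionaries stand in for the rdfs:subClassOf and rdfs:label statements that the queries would return; the function name `rebuild_taxonomy_lines` is hypothetical, and the sketch assumes that the source file contains one line per category, including the intermediate ones.

```python
from typing import Dict, List

# Stand-ins for query results: class -> parent class, class -> RDFS label.
PARENTS: Dict[str, str] = {
    "Beverages": "Food,_Beverages_and_Tobacco",
    "Coffee": "Beverages",
    "Coffee_Pods": "Coffee",
}
LABELS: Dict[str, str] = {
    "Food,_Beverages_and_Tobacco": "Food, Beverages & Tobacco",
    "Beverages": "Beverages",
    "Coffee": "Coffee",
    "Coffee_Pods": "Coffee Pods",
}


def rebuild_taxonomy_lines(parents: Dict[str, str],
                           labels: Dict[str, str]) -> List[str]:
    """Rebuild one ' > '-delimited line per class, top concept first."""
    lines = []
    for cls in labels:
        path = []
        node = cls
        while node is not None:
            path.append(labels[node])
            node = parents.get(node)  # None once the top node is reached
        lines.append(" > ".join(reversed(path)))
    return lines


original = [
    "Food, Beverages & Tobacco",
    "Food, Beverages & Tobacco > Beverages",
    "Food, Beverages & Tobacco > Beverages > Coffee",
    "Food, Beverages & Tobacco > Beverages > Coffee > Coffee Pods",
]
rebuilt = rebuild_taxonomy_lines(PARENTS, LABELS)
print(sorted(rebuilt) == sorted(original))
```

Comparing the sorted lines avoids depending on the order in which classes are returned; a stricter check could additionally compare the line order of the source file.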

The last step in the evaluation of the conceptual correctness comprised a qualitative assessment of the consistency of the classification hierarchies. An important observation is that taxonomies are mostly created for a special purpose, e.g. to provide a navigational structure for products in a Web shop or to classify products from a procurement perspective. If we were able to show, though, that a significant share of the taxonomic classes form a valid subsumption hierarchy in the context of products and services, then the taxonomic relationships could be adopted for product classes as well. This would add a lot of value to e-commerce scenarios, e.g. by enhancing reasoning capabilities over products. For a related experiment, we chose the product ontology of GPC. We looked for inconsistent paths in the subsumption hierarchies with respect to the domain of products and services. Figure 5.5 exemplifies (a) a valid and (b) an invalid subsumption path. As Figure 5.5a demonstrates, “Juice Drinks” represents a valid subsumption path, both in the original context and in the target domain of products or services. Likewise, the subsumption path for swap drives is perfectly valid in the original context, since GPC is a classification standard for trading in the supply chain. However, it is invalid with regard to products and services (see Figure 5.5b): “Computer Drives” and therefore “Swap Drives” are not subclasses (or specializations) of “Computers/Video Games”; rather, they are parts of them.

(a) valid:
    [50000000] Food/Beverage/Tobacco
    [50200000] Beverages
    [50202300] Non Alcoholic Beverages Ready to Drink
    [10000223] Juice Drinks Ready to Drink (Shelf Stable)

(b) invalid:
    [65000000] Computing
    [65010000] Computers/Video Games
    [65010300] Computer Drives
    [10001134] Swap Drives

Figure 5.5: Examples of valid and invalid subsumption relations from the GPC hierarchy when interpreted as product classes

15 http://www.ebusiness-unibw.org/ontologies/pcs2owl/evaluation/ (accessed on September 16, 2014)

Many such inconsistent examples can be found in GPC¹⁶, which led us to conclude that the product taxonomy for the GPC product ontology cannot simply be derived from the original classification system automatically. A similar example of an inconsistent subsumption relationship, which we encountered above, is “Coffee Pods” from the Google product taxonomy: coffee pods are no true specializations of “Beverages” and “Coffee”, but consumables for coffee machines.

5.4.2 Statistics on New Product Classes and Properties

In Section 5.1, we have argued that our approach produces a large number of readily usable product classes for the Web that would be infeasible to craft and maintain manually. In order to support this claim, we will now analyze relevant statistics about the derived product ontologies¹⁷.

As a preliminary step, we loaded all product ontologies into a SPARQL endpoint. Storing each product ontology as a different named graph [Car+05] (urn:cpa, urn:gpc, etc.) allowed us later on to execute SPARQL queries based on their graph names. To give an example, we used the SPARQL 1.1 query of Listing 5.1 (prefix declarations omitted) to determine the number of hierarchy levels in the product ontologies. We executed the query repeatedly, incrementing the property path length by one unit in every step until we obtained no results.

    SELECT (COUNT(DISTINCT ?c) AS ?num_classes) WHERE {
      GRAPH <urn:gpc> {
        ?c a owl:Class .
        # {3}: fixed path length of three rdfs:subClassOf steps
        ?c rdfs:subClassOf{3} ?sc .
        FILTER NOT EXISTS {?c rdfs:subClassOf gr:ProductOrService}
      }
    }

Listing 5.1: Calculating the number of hierarchy levels of product classification systems

16 We drew a representative random sample of subsumption paths as explained in [HdB07], but a large number of relationships were identified as invalid with respect to the domain of products and services.
17 http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (accessed on September 16, 2014)

Increasing the property path length from three to four in the provided example yields zero results, meaning that the hierarchy depth of the product ontology is four, i.e. the longest existing path consists of four classes linked by three consecutive rdfs:subClassOf predicates. The FILTER statement of the query assures that only taxonomic classes are regarded, excluding those classes defined as products or services, which would otherwise lead to incorrect results.
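The iterative procedure can be reproduced over an in-memory hierarchy. The sketch below replaces the SPARQL endpoint with a dictionary that maps each taxonomic class to its direct superclasses (the “-gen” classes are assumed to be filtered out already) and assumes that the hierarchy contains no cycles. The function names are hypothetical; the sample data is the swap drives path from GPC.

```python
from typing import Dict, Set


def classes_with_path(parents: Dict[str, Set[str]], length: int) -> Set[str]:
    """Classes that start a chain of exactly `length` rdfs:subClassOf steps."""
    matches = set()
    for cls in parents:
        frontier = {cls}
        for _ in range(length):
            frontier = {p for c in frontier for p in parents.get(c, set())}
        if frontier:
            matches.add(cls)
    return matches


def hierarchy_depth(parents: Dict[str, Set[str]]) -> int:
    """Increase the path length until no class matches any more."""
    if not parents:
        return 0
    length = 1
    while classes_with_path(parents, length):
        length += 1
    # `length` is the first path length without results; the longest chain
    # has length - 1 steps and therefore links `length` classes.
    return length


GPC_PARENTS: Dict[str, Set[str]] = {
    "gpc:C_10001134-tax": {"gpc:C_65010300-tax"},  # Swap Drives
    "gpc:C_65010300-tax": {"gpc:C_65010000-tax"},  # Computer Drives
    "gpc:C_65010000-tax": {"gpc:C_65000000-tax"},  # Computers/Video Games
    "gpc:C_65000000-tax": set(),                   # Computing
}
print(hierarchy_depth(GPC_PARENTS))
```

For this excerpt, a path of length three exists only from the “Swap Drives” class and none of length four, so the computed depth is four, in line with the result reported for GPC.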

As reported in Section 5.3, our research took into account ten popular product classification standards, among them two different versions of eCl@ss, and three proprietary category structures. The common abbreviations of the product classification systems together with the versions that have been converted are given in the first column of Table 5.1.

Table 5.1: Statistics of product classification standards and category systems (columns “Levels” to “Top-Level Classes” are counts)

Classification System  Levels  Classes  Properties  Individuals  Top-Level Classes  Class Distr. (%)
CPC Version 2               5    4,409           0            0                 10                18
CPA 2008                    6    5,429           0            0                 21                53
CPV 2008                    4   10,419           0            0                254                 6
eCl@ss 5.1.4                4   30,329       7,136        4,720                 25                18
eCl@ss 6.1                  4   32,795       9,910        7,531                 27                16
ETIM 4.0                    2    2,213       6,346        7,001                 54                 8
FreeClass 2012              4    2,838         174        1,423                 11                21
GPC 2012                    4    3,831       1,710        9,562                 37                17
proficl@ss 4.0              6    4,617       4,243        6,815                 17                36
WZ 2008                     5    1,835           0            0                 21                33
Google prod. tax.           7    5,508           0            0                 21                17
productpilot                8    7,970           0            0                 20                28
BMEcat                     na       na           0            0                 na                na

The upper part of the table lists the statistics for the product categorization standards, whereas the lower three rows represent the proprietary category systems. For BMEcat we cannot report specific numbers, since the standard supports the transmission of catalog group structures of various sizes and types. Columns two to six capture the number of hierarchy levels, product classes, properties, value instances, and top-level classes for each product ontology. It is worth noting that some of the product ontologies have a fixed number of hierarchy levels (e.g. eCl@ss has four levels), while for others the numbers vary (e.g. proficl@ss, which has up to six levels). Similarly, some of them are quite shallow (e.g. ETIM with two levels), while others provide deep hierarchies (e.g. CPA with six levels) with sometimes redundant concept names at consecutive levels. The large quantity of entities (classes, properties, individuals) implies an extensive coverage of the product or services domain, which, if built up manually, would be prohibitively expensive and time-consuming. Besides product classes, some product ontologies also contain properties and individuals that contribute valuable product details for the Semantic Web. Lastly, the seventh column (“Class Distr. (%)”) indicates the distribution of classes within the derived product ontology [cf. HLS07, Table 2]. This distribution is measured as the percentage of classes that belong to the largest top-level class with respect to the total number of classes in the ontology. This value describes the topology of the hierarchical structure and is thus an indicator for the quality of the product ontology. For example, in CPA one (“manufactured products”) of the 21 top-level classes contains more than half of all the classes in the standard, while the classes in ETIM are more evenly distributed across various branches (only 8% of all classes belong to the largest class “hand tools”). For a similar analysis, which also inspired our approach, see [HLS07].
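The class distribution measure can be made concrete with a short Python sketch. It assumes a single-parent hierarchy given as a dictionary from class to parent (None for top-level classes) and counts a top-level class as part of its own branch; both points, as well as the function name `class_distribution`, are assumptions of this example and not taken from [HLS07].

```python
from typing import Dict, Optional


def class_distribution(parents: Dict[str, Optional[str]]) -> float:
    """Percentage of all classes that belong to the largest top-level branch."""
    if not parents:
        raise ValueError("empty hierarchy")

    def top_level(cls: str) -> str:
        # Walk up until a class without a parent is reached.
        while parents.get(cls) is not None:
            cls = parents[cls]
        return cls

    branch_sizes: Dict[str, int] = {}
    for cls in parents:
        root = top_level(cls)
        branch_sizes[root] = branch_sizes.get(root, 0) + 1
    return 100.0 * max(branch_sizes.values()) / len(parents)


# Toy hierarchy: branch A holds three of four classes.
toy = {"A": None, "A1": "A", "A2": "A", "B": None}
print(class_distribution(toy))
```

For the toy hierarchy the measure is 75 percent, i.e. a strongly unbalanced structure comparable in spirit to CPA, whereas a value close to that of ETIM would indicate evenly sized branches.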

Among the classification systems with multilingual support, CPA is the one with the most translations, featuring class labels in 26 languages on average. Other product ontologies that also support multiple languages are CPV with an average of 22.9 languages, FreeClass with 6.9 languages, and WZ and the productpilot category system both having two translations. The variety of languages supported increases the chance of finding products annotated with product classes on the Web more easily. Translations can also be valuable input to future hybrid product search approaches that combine formal knowledge representations with natural language processing.

5.5 Discussion

The following section presents a series of e-commerce use cases that embody some of the novel opportunities that search engines and other consumers of structured data can exploit in areas such as product search, comparison, and matchmaking. These opportunities arise from using the now available Web product ontologies from PCS2OWL, which make it possible to articulate more granular product descriptions across both the Web of Documents and the Web of Data.

Let us consider, for instance, an online retailer interested in improving its product trading and data management processes. One enhancement consists in the adoption of the GPC classification standard instead of developing a custom scheme from scratch, leveraging the GPC ontology on the Web. Now further imagine that our retailer has published on the Web a snippet in Microdata syntax as in Listing 5.2, describing an offer for a specific disposable camera in GoodRelations.

    <div itemtype="http://schema.org/Offer" itemscope itemid="http://example.org/#offer">
      <div itemprop="priceSpecification" itemscope
           itemtype="http://schema.org/UnitPriceSpecification">
        <meta itemprop="priceCurrency" content="USD" />
        <meta itemprop="price" content="12.0" />
      </div>
      <div itemprop="itemOffered" itemtype="http://schema.org/SomeProducts" itemscope
           itemid="http://example.org/#product">
        <link itemprop="additionalType"
              href="http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_10001488-gen" />
        <meta itemprop="name" content="Kodak 35mm Single Use Camera Flash" />

        <meta itemprop="mpn" content="KMF135" />
      </div>
    </div>

Listing 5.2: Annotation example in Microdata syntax

For readability, the qualified names of the vocabulary URIs involved are used hereinafter. They rely on the prefix declaration of gr: for GoodRelations [Hep08a], gpc: for the GPC product ontology¹⁸, s: for schema.org¹⁹, and ex:²⁰ for the product data traded by our online merchant.

5.5.1 Classification of Product and Offer Descriptions

Listing 5.2 specifies a disposable camera and the associated offer via the URIs ex:product and ex:offer, respectively. The offer is declared to be an instance of s:Offer (equivalent to gr:Offering) and is accompanied by a price specification consisting of a price value and a currency. The product is defined as an instance of the class s:SomeProducts (equivalent to gr:SomeItems) and, thanks to the additionalType property in schema.org/Microdata, it is an instance of the class gpc:C_10001488-gen as well. This definition, together with the existing linkage across the classes gpc:C_10001488-gen, gpc:C_10001488-tax, and the property gpc:hierarchyCode in the GPC Web ontology, materializes the product ex:product on the Web as an instance of the category “10001488” labeled as “Disposable Cameras” in the original GPC classification standard.

18 http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/ (accessed on September 16, 2014)
19 http://schema.org/ (accessed on May 21, 2014)
20 http://example.org/# (accessed on September 16, 2014)

5.5.2 Navigation over Product and Offer Data

The adoption of the GPC Web ontology would allow our online retailer to navigate along the product categories of the original GPC standard. Applied to the example in Listing 5.2, this navigation path is determined by the super- and subclasses of gpc:C_10001488-tax, which are defined via the rdfs:subClassOf relationship. For example, the immediate parent class of gpc:C_10001488-tax (the category of our camera) is gpc:C_68020100-tax²¹. Or, in terms of the original schema, the GPC product category “68020100 Photography” is the parent category of “10001488 Disposable Cameras”.
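The classification and navigation steps can be traced on a handful of statements. The following Python sketch uses plain tuples in place of an RDF store and contains only the excerpt of the GPC ontology mentioned in the text (higher levels of the hierarchy are omitted); the helper names `objects` and `categories_of` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

TRIPLES: Set[Triple] = {
    ("ex:product", "rdf:type", "gpc:C_10001488-gen"),
    ("gpc:C_10001488-gen", "rdfs:subClassOf", "gr:ProductOrService"),
    ("gpc:C_10001488-gen", "rdfs:subClassOf", "gpc:C_10001488-tax"),
    ("gpc:C_10001488-tax", "rdfs:subClassOf", "gpc:C_68020100-tax"),
    ("gpc:C_10001488-tax", "gpc:hierarchyCode", "10001488"),
    ("gpc:C_10001488-tax", "rdfs:label", "Disposable Cameras"),
    ("gpc:C_68020100-tax", "gpc:hierarchyCode", "68020100"),
    ("gpc:C_68020100-tax", "rdfs:label", "Photography"),
}


def objects(subject: str, predicate: str) -> List[str]:
    """All objects of the statements with the given subject and predicate."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def categories_of(item: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """(hierarchy code, label) pairs from the item's category upwards."""
    result = []
    for gen in objects(item, "rdf:type"):
        for tax in objects(gen, "rdfs:subClassOf"):
            if not tax.endswith("-tax"):
                continue  # skip gr:ProductOrService
            current: Optional[str] = tax
            while current is not None:
                codes = objects(current, "gpc:hierarchyCode")
                labels = objects(current, "rdfs:label")
                result.append((codes[0] if codes else None,
                               labels[0] if labels else None))
                parents = objects(current, "rdfs:subClassOf")
                current = parents[0] if parents else None
    return result


print(categories_of("ex:product"))
```

Starting from ex:product, the sketch first reaches the category “10001488 Disposable Cameras” via the “-gen” and “-tax” classes and then its parent category “68020100 Photography”.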

5.5.3 Semantic Annotation of Products and Offers on the Web

The fact that product classes are published on the Web using URIs renders them applicable for use with common Web data formats, such as Microdata, RDFa, and Facebook Open Graph Protocol (OGP). Product annotations in those syntaxes can lead to improvements with regard to the current state of the document-based Web, namely in the form of search engine result snippets (known as rich snippets or rich captions, respectively) [Goo16; MicND] and other mid-term benefits that may arise from providing more computer-accessible meaning. In this context, it is also worth noting that the existing alignment²² between the GoodRelations ontology and schema.org allows products to be annotated using URIs in both the gr: and the s: namespaces.

21 http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_68020100-tax (accessed on September 16, 2014)
22 http://wiki.goodrelations-vocabulary.org/Cookbook/Schema.org (accessed on September 16, 2014)

5.6 Conclusion

The ontology engineering task in the domain of products and services is typically tedious, costly, and time-consuming. To master this problem, we presented a generic method and a toolset for deriving product ontologies in a semi-automatic way from existing product classification standards and proprietary category systems, which is superior to building them up manually in several aspects. For example, it successfully addresses the generally large number of concepts in product categorization standards and the conceptual dynamics inherent to the domain of products and services. We have supported our contribution by converting 13 practically relevant product classification systems of different scopes, sizes, and structures, and have shown that we can generate useful product ontologies while effectively preserving the original taxonomic relationships. These ontologies are ready for deployment on the Web of LOD. Furthermore, we exemplified how products can be annotated using the derived product ontologies, rendering them more visible and discernible on the Web. In particular, employing product classes to semantically annotate product instances empowers product data consumers to find and aggregate products and respective offers with less effort. For example, they could be readily used for assisting faceted search over semantic e-commerce data.

As future work, we plan to extend the set of available parsers by additional product classification systems, and to publish all converted product ontologies, including those for which, at the time of writing this chapter, we had not yet been granted permission due to lack of copyright clearance. Moreover, we think that our product ontologies could attract related research fields, such as finding correspondences across product classification systems by means of ontology matching techniques. Similarly, we should point out that our generic toolset could be easily adapted to convert classification systems and taxonomies not necessarily connected to the domain of products and services.

6 Cleansing and Enrichment


6.2 Typology of Obstacles
    6.2.1 Redundant Entity Definitions
    6.2.2 Schema Heterogeneity
    6.2.3 Unit of Measurement Mismatches
    6.2.4 Missing, Invalid, and Inconsistent Data
        6.2.4.1 Missing Data
        6.2.4.2 Invalid Data
        6.2.4.3 Inconsistent Data
    6.2.5 Data Granularity Mismatches
    6.2.6 Natural Language Issues
6.3 Techniques
    6.3.1 Overview
    6.3.2 Preprocessing
        6.3.2.1 RDF Datatype Cleansing
        6.3.2.2 Other Cleansing Heuristics
    6.3.3 Entity Consolidation
        6.3.3.1 Entity Consolidation Based on Identifiers
        6.3.3.2 Entity Consolidation Based on Proper Names
    6.3.4 Schema Alignment
        6.3.4.1 Two Schemas with Structural Mismatches
        6.3.4.2 Two Schemas with Direct Correspondences
        6.3.4.3 One Schema with Multiple Patterns
        6.3.4.4 One Schema with Modeling Shortcuts
    6.3.5 Missing Relationships
        6.3.5.1 Product Model Information Based on Identifiers
        6.3.5.2 Product Model Information Based on Proper Names
        6.3.5.3 Product Feature Inheritance from Product Model to Product
        6.3.5.4 Product Feature Inheritance from Product Variants
        6.3.5.5 Consumables, Accessories, and Spare Parts
    6.3.6 Missing Attributes
    6.3.7 Data Lifting and Enrichment
    6.3.8 Unit Conversion and Canonicalization
        6.3.8.1 Conversion of Units of Measurement
        6.3.8.2 Currency Conversion
6.4 Evaluation
6.5 Implementation of a Data Management Web User Interface
    6.5.1 User Interface Tabs
        6.5.1.1 Loading Data
        6.5.1.2 Cleansing Rules
        6.5.1.3 Dynamic Rules
    6.5.2 Data Management with RDF Graphs
    6.5.3 Execution Order of Cleansing Rules
    6.5.4 Translation versus Canonicalization
6.6 Conclusion

Despite the existence of ontology languages and top-level and domain ontologies, e-commerce data at Web scale will typically exhibit a significant degree of structural and semantic heterogeneity, and suffer from data quality problems such as omissions or contradictions. Both will hamper the direct consumption of data for automated information processing. In this section, we analyze this problem, develop a typology of such obstacles, and develop and implement prototypical solutions for selected problems.


With the Web of Data [BHB09], there exists a rich source of product data that consuming clients, at least in theory, can immediately benefit from. In particular, the Web of Data promises sophisticated Web product search and matchmaking opportunities. Yet, in practice, many data-consuming applications find it very challenging to process raw product data from the Web, because it is heterogeneous and exhibits a range of data quality problems. Generally speaking, not the amount of data, but the variety in the data is the main bottleneck of the current e-commerce Web of Data. For research on data quality problems on the Web of Data, see e.g. [MP15; SH15c].


The diversity in the data is mainly linked to the fact that disparate data sources are developed independently by different parties and serve distinct purposes. Heterogeneity and data quality problems can already be observed at small scale. For example, they predominate within corporate settings whenever enterprises or departments have different business needs, or following mergers and acquisitions that require data integration between various information systems [cf. Hal05]. However, the situation gets much more complex and critical at larger scale, such as when harvesting e-commerce data from the Web. This implies dealing with varying data formats and vocabularies, often at different levels of granularity, competing modeling patterns among vendors and manufacturers, inconsistent use of units of measurement and currency units, or linguistic idiosyncrasies (e.g. homonyms and synonyms) [SH01]. One could argue that there exist standards (e.g. data formats, ontology languages, product ontologies, or code standards) that, if everybody adhered to them, could solve the heterogeneity problem. Unfortunately, this premise is unrealistic for a large body of distributed data sources like the Web. There are too many stakeholders, systems, and applications involved, each having individual requirements.

A consolidated view on the data is a prerequisite for effective product searches and matchmaking [e.g. Di +03]. To give an example, when looking for a new car, the query ought to include entities typed as cars, but automobiles as well. Furthermore, when searching for automobiles manufactured by BMW, we require all corresponding cars (1) to feature a property that links to the respective car manufacturer, and (2) to have a consistent entity representation for BMW. In fact, the great variety of product data on the Web complicates these data integration tasks.

In the following, we analyze the types of problems and their causes and sketch techniques for using the data for product search despite the underlying deficiencies. The rest of this chapter is structured as follows: In Section 6.2, we elaborate on a typology of obstacles relevant to the domain of e-commerce on the Web of Data; Section 6.3 presents viable techniques for overcoming these obstacles; as an evaluation, we demonstrate in Section 6.4 how frequent selected issues are in our real Web crawl (see also Chapter 3); in Section 6.5, we showcase a prototypical implementation that eases the cleansing and enrichment process; and finally, Section 6.6 concludes this chapter.

6.2 Typology of Obstacles

In this section, we present a categorization of prevalent obstacles that are regularly found in structured e-commerce data on the Web. In a sense, these obstacles are special kinds of data quality problems.

Data quality problems are a result of poor data quality. It is widely accepted that data quality can be defined with regard to consumers, e.g. Wang and Strong [WS96] characterize data quality as “data that are fit for use by data consumers” [WS96]. Some relevant dimensions for evaluating data quality are accuracy, completeness, relevancy, timeliness, ease of understanding, and accessibility of data [WS96]. Consequently, if any of these dimensions is not satisfied by the data, then we are facing a data quality problem.

Many data quality problems arise from data misinterpretation, i.e. the semantics of the data is not clear in every context [Mad03]. E.g., an attribute “price”, if not further specified, can be understood as with or without taxes included. This does not necessarily pose a problem as long as price values are viewed in isolation, but it might lead to incorrect results once prices are aggregated or compared.

Rahm and Do [RD00] developed a comprehensive categorization of data quality problems concerning data at schema and instance level that may appear in single-source or multi-source environments [RD00] (see Section 2.4.5). The problem categories that they mention are also relevant for e-commerce on the Web of Data, including integrity constraint violations, data entry errors, heterogeneous data models, and inconsistent data [cf. RD00]. In the context of this thesis, we focus on the following types of problems:

• Redundant entity definitions,

• schema heterogeneity,

• unit of measurement mismatches,

• missing, invalid, and inconsistent data,

• data granularity mismatches, and

• natural language issues.

6.2.1 Redundant Entity Definitions

In a distributed system like the World Wide Web (WWW), it is very likely that the exact same entities are defined more than once. This repeated definition of entities is commonly termed redundancy [cf. RD00]. A proper synchronization between data providers on the Web fails due to the vast number of data sources being created and maintained autonomously and in parallel. It is also generally easier to define entities locally instead of looking up an authoritative Uniform Resource Identifier (URI) on the Web, if such exists. Unfortunately, this makes query formulation more tedious and challenging as compared to textbook-style examples in the SPARQL Protocol and RDF Query Language (SPARQL) documentation or respective tutorials [cf. HS13].

For certain entity types we observed substantial redundancy. In the context of e-commerce, these entity types are

• manufacturer,

• brand,

• product model,

• dealer or vendor, and

• location.

This is because for these types of objects, there will typically be multiple Web resources that expose identical or near-identical information without explicit links. In some cases, the very same information is included in most or all pages of the same Web site (e.g. dealer master data might be contained in every single offer page of a shop), or product details are part of both manufacturer and dealer Web sites. In the context of e-commerce scenarios, many of these entities can be considered master data. For an overview of the notion of master data, see e.g. [Los09; Dre+08].

Regardless of the business case, the specification of master data is usually accepted among various systems, applications, and processes. Product model data published by a manufacturer, e.g., can be used by a great number of retailers and vendors. Therefore, repeated definitions of these kinds of entity types can be consolidated without introducing conflicts. Other entities on the Web should not be consolidated, though. For example, product offers are specific to a single vendor because of their unique prices and conditions. Consolidating two offers that belong to the same product, e.g. based on their product identifiers, would inevitably lead to inconsistencies and contradictions.

In summary, entity consolidation based on strong identifiers is only applicable to those objects that are actually representations of the very same object based on some identity criterion. For an overview of the problem of identity in the context of knowledge representation, see the OntoClean approach in [GW02] and [GW09].


6.2.2 Schema Heterogeneity

Redundant entity definitions, as addressed in the previous section, are a form of heterogeneity at the instance level, because a canonical representation of entities is missing. However, many data quality issues arise from differences at the schema level.

The problem of schema heterogeneity has persisted for decades, and one reason why “schema heterogeneity is difficult and time-consuming is that it requires both domain and technical expertise” [Hal05]. In other words, it requires a lot of human effort to create mappings between different schemas. Schema-level heterogeneity on the Web is characterized by structural and semantic discrepancies between ontologies [cf. Hal05]. For product data on the Semantic Web, we need to distinguish between two important levels of schema heterogeneity:

1. The top-level e-commerce data model: This level of schema heterogeneity refers to the conceptual differences due to competing data formats and vocabularies. For example, there exist several alternatives for modeling e-commerce data on the Semantic Web, namely

a) the h-product Microformat standard [Çel14], i.e. a particular data format and vocabulary for embedding product data in Hypertext Markup Language (HTML);

b) data-vocabulary.org [1], an earlier attempt by Google to establish a vocabulary on the Web, meanwhile deprecated in favor of schema.org;

c) GoodRelations [Hep08a] in its original (gr:) namespace, e.g. where an offer is linked to a business entity via gr:offers, and a product is linked to an offer via gr:includes;

d) schema.org [SchND] prior to the GoodRelations integration [Guh12], e.g. where an offer had to be linked to a product via s:offers;

e) schema.org extended with GoodRelations, where among others a new property s:itemOffered was added to schema.org that corresponds to gr:includes in GoodRelations [cf. Hep15a].

2. Product types, features, and enumerations: This level of schema heterogeneity entails structural and semantic differences between domain ontologies. For instance, there exist competing classifications that define product classes, properties, and individuals using distinct naming and structure.

[1] http://www.data-vocabulary.org/ (accessed on February 19, 2016)


In addition to that, even within a single schema, there can be heterogeneity in the form of multiple patterns or idioms for the same type of information. E.g., GoodRelations defines two classes for location, namely the older and meanwhile deprecated class gr:LocationOfSalesOrServicesProvisioning, and a new class gr:Location. Or, textual descriptions are attached to product-related items sometimes using rdfs:label and rdfs:comment, sometimes using domain-specific attributes like gr:name/gr:description or schema:name/schema:description.

6.2.3 Unit of Measurement Mismatches

In Section 2.2.6.2, we learned about a number of different code standards, i.e. date and time formats, country codes, language codes, codes for units of measure, and currency codes. In this section, we by and large focus on mismatches of codes for units of measurement and of currency codes, since currencies can be regarded as a special type of units of measure. The other code standards will not be considered here, first due to their general relevance beyond the narrow scope of product search, and second because mismatches among them can mostly be solved using heuristics with many degrees of freedom (e.g. to convert a date string into its canonical form).

Mismatches on units of measurement can generally manifest in two ways, namely

1. as various unit standards that one might choose from:

• UN/CEFACT [Uni09b],

• Unified Code for Units of Measure (UCUM) [SM13],

• Quantities, Units, Dimensions and Types (QUDT) [Hod+14],

• Ontology for Units of Measure and Related Concepts (OM) [RvAT13], and

2. as one standard, but with differing unit codes for describing the same physical dimensions (e.g. metric and imperial units like “gram” versus “pound”, or base and derived units such as “kilogram” versus “gram”).

In addition, data providers often expose units of measure that are not standards-compliant. This is the case, for example, when manufacturers fail to curate quantities using standardized units in their product information management (PIM) system, e.g. using “Volts” or “V” rather than the UN/CEFACT Common Code “VLT”, or when vendors price products with “Euros” or the currency symbol “€” instead of adhering to the ISO 4217 currency code “EUR”. This variety adds substantial complexity for data consumers.
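A data consumer could absorb part of this variety with a simple lookup step. The following is a minimal sketch in Python, assuming hand-maintained alias tables; the function name canonical_code and the table names are hypothetical, and only the label-to-code pairs named above are included.

```python
# Hypothetical alias tables; the pairs are the ones named in the text
# ("Volts"/"V" -> "VLT", "Euros"/"€" -> "EUR"), keyed in lower case.
UNIT_ALIASES = {"volts": "VLT", "v": "VLT"}
CURRENCY_ALIASES = {"euros": "EUR", "€": "EUR"}


def canonical_code(label, aliases):
    """Return the standard code for a free-text label, or None if unknown."""
    cleaned = label.strip()
    key = cleaned.lower()
    if key in aliases:
        return aliases[key]
    # A label that already is one of the known standard codes passes through.
    if cleaned.upper() in set(aliases.values()):
        return cleaned.upper()
    return None
```

Returning None for unknown labels keeps the decision about unmapped units with the caller, which seems safer than guessing a code.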


6.2.4 Missing, Invalid, and Inconsistent Data

As completeness and consistency of data are major factors for data quality, we will subsequently discuss them in more detail. The related problems in this context are

1. missing data,

2. invalid data, and

3. inconsistent data.

6.2.4.1 Missing Data

As we have described in Chapters 3 and 4, many product offers published by Web shops lack granular product data for enabling deep product comparison. Missing data can mean data that is actually absent from the representation, or data that is available, albeit only implicitly. By implicit data we mean latent information that is not explicitly stated, e.g. published in the form of a free-text field or image. In the following, we outline different kinds of missing data.

Missing Entity Type Information A lot of product data on the Web is not categorized. A possible reason is that most Web shops neither have the means to link their products to classification standards nor do they have a mechanism to mint URIs for every category from their category system. So it is very common to publish products without further specifying their kind or to attach a textual description, e.g. using a property like gr:category as shown in Listing 6.1.

ex:GoldenNecklace a gr:SomeItems ;
    # name and description of the item
    gr:name "18K yellow-gold necklace"@en ;
    gr:description "Necklace of yellow gold with a metal purity of 18K weighs 4 grams."@en ;
    # provide some category information
    gr:category "Jewelry > Necklace"@en ;
    # additional product features
    gr:hasWeight [ a gr:QuantitativeValueFloat ;
        # weight value as stated in the description above
        gr:hasValueFloat "4.0"^^xsd:float ;
        gr:hasUnitOfMeasurement "GRM"^^xsd:string ] .

Listing 6.1: Categorizing products with textual properties


Missing Relationships Although related entities could well be interlinked, conceptual gaps are often present in the data. For example, relationships are often missing between products and their respective product models, among products and product variants, or between products and their consumables, accessories, and spare parts.

Missing Attributes Many data quality problems originate from missing attributes in product data. Publishers often omit unit codes for quantitative values, or publish prices without currency units. Furthermore, statements about entities are often underspecified and thus ambiguous, in particular if value-added taxes are not detailed for prices, or if validity durations of product or price offers are incomplete, e.g. when either no start or end dates are supplied. The same holds for opening hours of physical stores.

Missing Datatype Information Especially when harvesting data from Microdata markup [Hic13], Resource Description Framework (RDF) datatype information will often be missing [cf. Kel14, Section 3.2]. Then, SPARQL queries and ordering operations will not work properly. For example, examine the SPARQL query in Listing 6.2. Imagine that some product data specifies the currency unit without the datatype xsd:string. In that case, the query would not match the data. Thus, the fixing of missing datatype information constitutes an important preprocessing step for product search.

SELECT ?product ?price
WHERE {
    ?product a gr:SomeItems ;
        gr:hasCurrency "EUR"^^xsd:string ;
        gr:hasCurrencyValue ?price .
}

Listing 6.2: SPARQL SELECT query to retrieve products with prices in “Euros”

6.2.4.2 Invalid Data

Invalid data comprises unexpected data values that a SPARQL processor cannot handle properly. In real-world data, such possible problems are the presence of

• data entry errors (e.g. misspelled product names, or erroneous product identifiers),

• string values where numerals are expected (e.g. a currency value that contains alphanumerical characters),


• wrong datatypes used for literals (e.g. xsd:string instead of xsd:float), or

• inconsistently used units of measurement (see Section 6.2.3).

6.2.4.3 Inconsistent Data

In addition to missing and invalid data, data can also be inconsistent. Such inconsistencies are often either formal conflicts or conflicts at the business logic level.

Integrity Constraint Violations A source of formal inconsistencies are integrity constraint violations [e.g. RD00; Hal05]. This includes violated domain or range constraints, non-compliance with cardinality constraints, or minimum values greater than maximum values. In this case, the conflict is between the instance data and the schema [cf. RD00].

Redundant Data with Conflicts When data sources are merged, assertions coming from different data sources are often conflicting [cf. RD00]. E.g., the same subject and property might appear multiple times with different values. In this case, the conflict is on the instance level between multiple instances.

Business Logic Violations Another problem source in the domain of e-commerce is that the more general and formal data quality problems are often complemented by specific inconsistencies at the business logic level. This includes cases in which the validity of a product offer ends before it starts (the same holds for opening hours), price specifications (or opening hours) are partially overlapping, the list price is set much lower than the retail price, the price tag is an extreme outlier, or multiple list prices were specified for the same product offer.
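Some of these business logic violations can be detected with plain comparisons once the offer data has been extracted. The following Python sketch assumes a hypothetical dictionary layout for an offer (the keys valid_from, valid_through, list_prices, and retail_price are illustrative only) and covers three of the cases; it flags any list price below the retail price, whereas a threshold for “much lower” would have to be chosen per application.

```python
from datetime import date


def business_logic_violations(offer):
    """Return a list of human-readable problems found in one offer record."""
    problems = []
    start, end = offer.get("valid_from"), offer.get("valid_through")
    # The validity interval must not end before it starts.
    if start is not None and end is not None and end < start:
        problems.append("validity ends before it starts")
    list_prices = offer.get("list_prices", [])
    # At most one list price is expected per offer.
    if len(list_prices) > 1:
        problems.append("multiple list prices")
    retail = offer.get("retail_price")
    # Simplification: any list price below the retail price is reported.
    if retail is not None and any(lp < retail for lp in list_prices):
        problems.append("list price lower than retail price")
    return problems
```

Overlapping price specifications and outlier detection are left out here because they require comparing several records or a price distribution.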

6.2.5 Data Granularity Mismatches

Even within a single schema, there often exist multiple patterns for the same type of information that differ in the amount of structure. Without proper handling, this will limit the recall for a query, because the respective SPARQL patterns will not match.

Basically, two types of data granularity mismatches stand out, namely

1. differing modeling patterns, and

2. weakly structured information.


GoodRelations offers some modeling shortcuts to ease the publication of recurring patterns in data, namely the properties gr:includes, gr:hasValue, gr:hasValueFloat, and gr:hasValueInteger [Hep11]. Without them, it was e.g. always tedious to model a product offer and its respective product instance. To link these two entities, the code in Listing 6.3 was needed [cf. Hep11].

ex:Offer gr:includesObject [ a gr:TypeAndQuantityNode ;
    gr:amountOfThisGood "1"^^xsd:float ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string ;
    gr:typeOfGood ex:Product ] .

Listing 6.3: Linking two entities with the gr:includesObject modeling pattern

The rationale was that by means of this flexible and powerful pattern one could model product offers consisting of a single product item or bundles with multiple items. However, most product offers do not require such complex modeling, as they include only one product item. Thus, a shortcut was added to GoodRelations that allows expressing this information more elegantly (see Listing 6.4) [cf. Hep11].

ex:Offer gr:includes ex:Product .

Listing 6.4: Linking two entities with the gr:includes modeling shortcut

Prior to consuming data based on this shortcut, the shortcut needs to be expanded to the canonical long form as shown above [Hep13].

Similar rules hold for the shorthand properties gr:hasValue, gr:hasValueFloat, and gr:hasValueInteger. They describe point values and can be easily expanded to respective intervals (e.g. gr:hasMinValue and gr:hasMaxValue with both having the same range value) in order to simplify query formulation.
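The two expansions can be illustrated outside of a triple store. The following Python sketch represents triples as plain string tuples, which is an assumption made for brevity; the function name expand_shortcuts and the blank node labels are hypothetical, and the typed interval properties (e.g. gr:hasMinValueFloat) are assumed to be the counterparts of the typed shortcuts.

```python
from itertools import count

# Shortcut property -> (minimum property, maximum property).
POINT_VALUE_SHORTCUTS = {
    "gr:hasValue": ("gr:hasMinValue", "gr:hasMaxValue"),
    "gr:hasValueFloat": ("gr:hasMinValueFloat", "gr:hasMaxValueFloat"),
    "gr:hasValueInteger": ("gr:hasMinValueInteger", "gr:hasMaxValueInteger"),
}


def expand_shortcuts(triples):
    """Return the input triples plus the long-form triples they imply."""
    fresh = count(1)  # counter for hypothetical blank node labels
    expanded = list(triples)
    for s, p, o in triples:
        if p == "gr:includes":
            # Long form of Listing 6.3: one unit (C62) of the product.
            node = f"_:taqn{next(fresh)}"
            expanded += [
                (s, "gr:includesObject", node),
                (node, "rdf:type", "gr:TypeAndQuantityNode"),
                (node, "gr:amountOfThisGood", '"1"^^xsd:float'),
                (node, "gr:hasUnitOfMeasurement", '"C62"^^xsd:string'),
                (node, "gr:typeOfGood", o),
            ]
        elif p in POINT_VALUE_SHORTCUTS:
            # A point value becomes an interval with identical bounds.
            low, high = POINT_VALUE_SHORTCUTS[p]
            expanded += [(s, low, o), (s, high, o)]
    return expanded
```

The sketch keeps the original shortcut triples, so queries written against either form continue to match.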

It is common practice on the Web to publish data at varying levels of data granularity. Some databases do not have a sophisticated data model and thus are only capable of exporting weakly structured information. Of course, it can also be the other way around, albeit less frequently. Another reason why data providers would publish weakly structured information is that there exist no adequate property definitions in the existing ontologies that they could rely on. Therefore, product data, especially product features, are often only available as plain string literals. For example, imagine a data provider using a property ex:hasOperatingVoltage with a value “220–240 V”. To unleash the semantics, the value should better be split into its value and unit constituents and the interval be modeled via two datatype properties gr:hasMinValue and gr:hasMaxValue, as shown in Listing 6.5.

ex:Product a gr:SomeItems ;
    ex:hasOperatingVoltage [ a gr:QuantitativeValue ;
        gr:hasMinValue 220 ;
        gr:hasMaxValue 240 ;
        gr:hasUnitOfMeasurement "VLT" ] .

Listing 6.5: Modeling of intervals in GoodRelations
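Splitting such a string literal into its constituents is a typical data lifting heuristic. As a minimal sketch in Python, assuming values of the form “number, dash, number, unit label”, a regular expression can recover the interval bounds and the unit; the function name lift_interval and the label-to-code table are hypothetical.

```python
import re

# Two numbers separated by a hyphen or an en dash, followed by a unit label.
INTERVAL = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)\s*(\S+)\s*$"
)

# Hypothetical label-to-code table; "V" -> "VLT" follows the UN/CEFACT
# Common Code mentioned in Section 6.2.3.
UNIT_CODES = {"V": "VLT"}


def lift_interval(text):
    """Split a string like '220-240 V' into (minimum, maximum, unit code)."""
    match = INTERVAL.match(text)
    if match is None:
        return None  # leave values that do not fit the pattern untouched
    low, high, unit = match.groups()
    return float(low), float(high), UNIT_CODES.get(unit, unit)
```

The three returned values correspond to gr:hasMinValue, gr:hasMaxValue, and gr:hasUnitOfMeasurement; values that do not fit the assumed pattern yield None and would need further heuristics.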

To give yet another example of weakly structured information, it is difficult for many systems (e.g. shop software) to keep price and currency values apart when exposing product offer data on the Web. For this reason, schema.org accepts four different modeling patterns for price specifications (see Listing 6.6). Please note that variants 1 and 2 outlined in Listing 6.6 represent shortcuts of variants 3 and 4 in schema.org.

# 1. Price attached directly to the offer as a textual property
ex:Offer a schema:Offer ; schema:price "100.0 USD" .
# 2. Price attached directly to the offer but in a more granular way
ex:Offer a schema:Offer ; schema:price "100.0" ; schema:priceCurrency "USD" .
# 3. Price modeled via a detailed price specification node and as a textual property
ex:Offer a schema:Offer ;
    schema:priceSpecification [ a schema:UnitPriceSpecification ;
        schema:price "100.0 USD" ] .
# 4. Price modeled via a detailed price specification node but in a more granular way
ex:Offer a schema:Offer ;
    schema:priceSpecification [ a schema:UnitPriceSpecification ;
        schema:price "100.0" ; schema:priceCurrency "USD" ] .

Listing 6.6: Price modeling patterns in schema.org

In summary, a data consumer faces difficulties both in taking advantage of weakly structured information and in catering for the variety of possible modeling patterns.
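For the price patterns of Listing 6.6, a consumer could reduce the textual and the granular variants to one representation. The following Python sketch is a simplification under the assumption that the currency follows the amount as a three-letter code; the function name split_price is hypothetical.

```python
import re

# Amount followed by a three-letter currency code, e.g. "100.0 USD".
PRICE_WITH_CURRENCY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Z]{3})\s*$")


def split_price(price, currency=None):
    """Return (amount, currency code) for a price literal."""
    match = PRICE_WITH_CURRENCY.match(price)
    if match:
        # Textual variants 1 and 3: amount and currency share one literal.
        return float(match.group(1)), match.group(2)
    # Granular variants 2 and 4: the currency is given separately, if at all.
    return float(price), currency
```

A literal that is neither a plain number nor a number with a currency code raises a ValueError in this sketch, which makes invalid price values visible instead of silently passing them on.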

6.2.6 Natural Language Issues

Despite an ontology-based representation, queries will frequently be hybrid in nature, i.e. include the search for certain keywords in string values. When dealing with natural language though, we need to be aware of some caveats. It is possible that entities are described using multiple, different languages, which is not inherently bad since it can increase recall. A keyword that fails to match items based on a specific language might still match them based on a translation. Unfortunately, for the same language a myriad of dialects and regional differences often exists. This can lead to ambiguity of terms and incompatible or incorrect spellings.

When speaking of ambiguous terminology, we frequently intend the problems connected with homonyms and synonyms that in information retrieval (IR) can lead to poor precision and recall ratios, respectively [cf. Dee+90]. Homonyms, on one hand, are terms that share the same name but carry different meanings [NO95]. An example thereof is “chair” [2], which, as a noun, can actually mean a piece of furniture or a professorship, and, as a verb, the act of leading a meeting, event, or discussion. On the other hand, synonyms are words with the same meaning even though their spellings are different [NO95]. “Car” and “automobile” are two different terms to refer to a motorized vehicle with four wheels. Similarly, “bicycle”, “cycle”, and “bike” all denote the same kind of objects. The American English term “elevator” is known as “lift” in British English, etc.

Finally, an important class of problems with natural language comprises incorrect and incompatible spellings, as well as acronyms [cf. RD00]. The first group, spelling mistakes, is due to data entry errors (e.g. “childern” instead of “children”, or “compair” instead of “compare”), while incompatible spellings are due to differences in the language, i.e. dialects or regional differences. In English, an important distinction is made between American English and British English, where “fiber” becomes “fibre” or “categorization” becomes “categorisation”. The third group, acronyms, are abbreviations for words. E.g., “Serial ATA” is often abbreviated as “SATA” or “European Union” as “EU”.

Simplistic SPARQL queries will fail as soon as a single one of the aforementioned problems is found in the corpus of data.
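One way to make keyword matching less brittle is to expand a keyword by its known variants before matching. The following Python sketch assumes a small hand-made table of synonym, spelling, and acronym groups taken from the examples above; the names VARIANT_GROUPS, expand_keyword, and matches are hypothetical. Homonyms are not addressed, since telling apart the meanings of a term like “chair” requires context.

```python
import re

# Hypothetical variant groups built from the examples in this section.
VARIANT_GROUPS = [
    {"car", "automobile"},
    {"bicycle", "cycle", "bike"},
    {"elevator", "lift"},
    {"fiber", "fibre"},
    {"sata", "serial ata"},
]


def expand_keyword(keyword):
    """Return the keyword together with all known variants, sorted."""
    key = keyword.strip().lower()
    variants = {key}
    for group in VARIANT_GROUPS:
        if key in group:
            variants |= group
    return sorted(variants)


def matches(keyword, text):
    """True if the text contains the keyword or a variant as a whole word."""
    return any(
        re.search(r"\b" + re.escape(variant) + r"\b", text, re.IGNORECASE)
        for variant in expand_keyword(keyword)
    )
```

Matching on word boundaries avoids false hits such as “car” inside “carbon”, at the cost of missing inflected forms like plurals.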

6.3 Techniques

In this section, we sketch potential solutions to the obstacles outlined in the preceding section.

[2] http://www.merriam-webster.com/dictionary/chair (accessed on November 3, 2014)


6.3.1 Overview

Table 6.1 gives an overview of the challenges and a short list of possible approaches. In the following, we address some of them in more detail.

Table 6.1: Obstacles with respective solutions

Challenge | Approaches
Redundant entity definitions | Entity consolidation / instance matching [e.g. RD00; DH05]
Schema heterogeneity | Schema alignment [e.g. RD00; RB01]
Unit of measurement mismatches | Unit conversion and canonicalization [cf. Cul+07; SH13a]
Missing, invalid, and inconsistent data | RDF datatype cleansing, cleansing heuristics, data mining, enrichment [e.g. RD00; DH05; Cul+07]
Data granularity mismatches | Data lifting heuristics, enrichment, schema matching [e.g. RD00; RB01]
Natural language issues | Word sense disambiguation (WSD), named entity recognition (NER) and named entity disambiguation (NED) [e.g. Nav09; MRN14], query expansion [e.g. Vor94]

In general, one might choose between multiple basic techniques for data cleansing and enrichment. The most popular techniques are summarized below along with their typical use cases:

• OWL and RDFS reasoning: Reasoners allow inferring implicit knowledge from explicitly stated facts using logical inferences (see Section 2.3.8.3). On the Semantic Web, reasoners for different ontology languages are available. An RDF Schema (RDFS) reasoner draws conclusions from RDFS statements, mainly rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range. A Web Ontology Language (OWL) reasoner, in its simplest form, infers knowledge from axioms including owl:equivalentClass, owl:equivalentProperty, or owl:sameAs. Reasoners can either materialize inferred triples, or perform reasoning tasks at query time [e.g. KD11, p. 249].

• SPARQL: With SPARQL, there basically exist two approaches for cleansing and enrichment. The first option is to formulate SPARQL queries, possibly using nested SELECT queries. This is very cumbersome, though, because it is limited by (a) the cognitive complexity of formulating such one-turn SPARQL queries, and (b) endpoint restrictions like the execution time and the length of the SPARQL queries, which cannot be arbitrarily long. More promising is the SPARQL CONSTRUCT feature that provides a convenient mechanism for defining custom rules. By SPARQL CONSTRUCT queries, new triples can be created (consequent) based on a graph pattern match (antecedent) [cf. AH11, pp. 88f., pp. 115f.]. Subsequent queries then execute over the original graph plus the newly materialized data. Another alternative for defining custom rules is the use of a dedicated rule language (e.g. Semantic Web Rule Language (SWRL) [Hor+04]), which we are not going to discuss in more detail here, though.

• Script-based approaches: For complex functions that are inefficient or impossible to achieve with OWL and RDFS inferencing and SPARQL rules, it is sometimes superior to export data in order to process it offline using scripts, or to take advantage of user-defined functions (UDFs) or functions built into a SPARQL endpoint. Viable use cases are geocoding and currency conversion.
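The idea of materializing inferred triples by repeatedly applying rules, where each pass runs over the original graph plus the newly derived data, can be sketched independently of a particular reasoner or SPARQL endpoint. The following Python example is an illustration under simplifying assumptions: triples are plain string tuples, the function names materialize and subclass_rule are hypothetical, and the only rule implemented is the rdfs:subClassOf type inheritance.

```python
def materialize(triples, rules):
    """Apply production rules until no rule yields a new triple."""
    graph = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived |= set(rule(graph))
        if derived <= graph:
            return graph  # fixpoint reached, nothing new was derived
        graph |= derived  # later passes see original plus derived triples


def subclass_rule(graph):
    """If ?x is a ?c and ?c is a subclass of ?d, then ?x is a ?d."""
    for x, p, c in graph:
        if p != "rdf:type":
            continue
        for c2, q, d in graph:
            if q == "rdfs:subClassOf" and c2 == c:
                yield (x, "rdf:type", d)
```

Each rule plays the role of one CONSTRUCT query, with the condition as antecedent and the yielded triple as consequent; the nested loop is quadratic and only meant for small illustrative graphs.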

Most of our solutions presented in this section will rely on SPARQL CONSTRUCT rules. We will use them to define production rules that materialize as new data.

For the rest of this chapter, we will omit the prefix declarations in the header of the Terse RDF Triple Language (Turtle)/Notation 3 (N3) examples and SPARQL queries for brevity. Unless otherwise stated, we assume the namespace declarations for Turtle and SPARQL [3] as provided in Listing 6.7.

# example namespace
@prefix ex: <http://www.example.com/> .

# domain ontologies
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix schema: <http://schema.org/> .
@prefix vso: <http://purl.org/vso/ns#> .
@prefix pto: <http://www.productontology.org/id/> .

# ontology languages
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# datatypes
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

Listing 6.7: Namespace declarations used for Turtle/N3 and SPARQL examples

[3] The syntax for declaring namespaces is slightly different in SPARQL, though: “@prefix” is replaced by the keyword “PREFIX” and the trailing dot is omitted.


6.3.2 Preprocessing

In the following, we consider some problems with the representation of structured data on the Web. These can mainly be fixed by simple preprocessing steps.

6.3.2.1 RDF Datatype Cleansing

Missing RDF datatypes can be added to simple, plain, or untyped literals based on the schema definition.

The left-hand side of Listing 6.8 shows an example where the RDF datatypes are missing. The right-hand side constitutes the same data but with the correct datatypes, as intended. Though the difference seems minimal, even such slight representational discrepancies can lead to a loss of recall in SPARQL queries over RDF data.

# left-hand side: RDF datatypes missing
ex:NumSeats a gr:QuantitativeValueInteger ;
    gr:hasValue "9" ;
    gr:hasUnitOfMeasurement "C62" .

# right-hand side: the same data with the correct datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
    gr:hasValueInteger "9"^^xsd:int ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string .

Listing 6.8: Adding RDF datatypes to plain literals

For historical reasons, RDF expects type information to be stated in each single literal rather than being taken from the underlying schema, which is burdensome for data publishers and thus often lacking in the resulting data. According to the RDF 1.1 specification [CWL14, Section 3.3], simple literals without a datatype or language tag are syntactic sugar for literals with xsd:string datatype. In the worst case, this could turn numeric literals into xsd:string literals as well. Thus, it is better to add the right datatype information to untyped RDF literals based on the datatype indicated in the schema. The GoodRelations vocabulary definition for the property gr:hasValueInteger is given in Listing 6.9.

gr:hasValueInteger a owl:DatatypeProperty ;
    rdfs:label "has value integer (0..1)"@en ;
    rdfs:domain gr:QuantitativeValueInteger ;
    rdfs:range xsd:int .

Listing 6.9: OWL definition of the gr:hasValueInteger property

With the SPARQL CONSTRUCT rule outlined in Listing 6.10, we can generate triples that add the RDF datatype to untyped literals based on the schema definition, leading to the graph on the right-hand side of Listing 6.8.


CONSTRUCT { ?s ?p ?new_o }
WHERE {
    ?s ?p ?o .
    ?p rdfs:range ?range .
    FILTER(datatype(?o) != ?range)
    # STR() yields the lexical form, so that typed literals can be re-typed as well
    BIND(STRDT(STR(?o), ?range) AS ?new_o)
}

Listing 6.10: SPARQL CONSTRUCT query to recover the correct datatype from schema information

The query in Listing 6.10 can also be used to fix literals with wrong datatypes by taking into account the schema constraints (e.g. range restrictions), as the side-by-side example in Listing 6.11 exhibits. Note that this approach assumes that the vocabulary is known and defines exactly one datatype per property. As soon as complex OWL class definitions are used as the ranges of a property (e.g. text or number), additional heuristics have to be employed.

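# Left: literals with incorrect datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
    gr:hasValueInteger "9"^^xsd:string ;
    gr:hasUnitOfMeasurement "C62"^^xsd:int .

# Right: literals with corrected datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
    gr:hasValueInteger "9"^^xsd:int ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string .
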
Listing 6.11: Assigning correct RDF datatypes to literals with incorrect datatypes

6.3.2.2 Other Cleansing Heuristics

Invalid data values are very common, especially differences in the formatting of numerical values. Regional differences affect the use of the decimal separator and the thousands separator, e.g. “1,200.5”, “1.200,5”, or “1200.5”. The last of these is the default format required and understood by most modern computers, i.e. no thousands separator and a decimal point.

In the examples listed in Listing 6.12, we contrast incorrect and correct numerical values.

# Left: invalid numerical value
ex:Weight a gr:QuantitativeValueFloat ;
    gr:hasValueFloat "5,0"^^xsd:float ;
    gr:hasUnitOfMeasurement "GRM"^^xsd:string .

# Right: valid numerical value
ex:Weight a gr:QuantitativeValueFloat ;
    gr:hasValueFloat "5.0"^^xsd:float ;
    gr:hasUnitOfMeasurement "GRM"^^xsd:string .

Listing 6.12: Converting the invalid data value “5,0” to “5.0”

The approach to transforming invalid data values might differ. For the specific use case presented herein, we supply a suitable SPARQL CONSTRUCT query in Listing 6.13 that replaces the decimal comma by a decimal point in any data values of type float, double, or decimal.


CONSTRUCT {?s ?p ?new_o}
WHERE {
    ?s ?p ?o .
    FILTER(datatype(?o) = xsd:float || datatype(?o) = xsd:double || datatype(?o) = xsd:decimal)
    # simplistic, a more generic solution would consider REGEX
    FILTER(CONTAINS(str(?o), ",") && !CONTAINS(str(?o), "."))
    BIND(STRDT(REPLACE(str(?o), ",", "."), datatype(?o)) AS ?new_o)
}

Listing 6.13: SPARQL CONSTRUCT query to convert invalid numerical values

A more robust solution would be to employ regular expressions (REGEX) instead of the CONTAINS and REPLACE functions in Listing 6.13. Although it is computationally more expensive, such a regular expression pattern as outlined subsequently is capable of matching numerical values like “+1.602e-19”:

[-+]?[0-9]*\.?[0-9]+([Ee][-+]?[0-9]+)?
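The following Python sketch illustrates how the pattern above could be combined with a separator heuristic outside of SPARQL. It is only an illustration under stated assumptions: the function name normalize_number is hypothetical, and the rule that the separator occurring last is the decimal separator is an assumption that is not part of the original approach.

```python
import re

# Pattern from the text: optional sign, digits, optional decimal point, optional exponent.
CANONICAL = re.compile(r"[-+]?[0-9]*\.?[0-9]+([Ee][-+]?[0-9]+)?")


def normalize_number(raw):
    """Return a lexical form without thousands separator and with a decimal point.

    Returns None if the value cannot be normalized by this simple heuristic.
    """
    text = raw.strip()
    if CANONICAL.fullmatch(text):
        return text  # already in the expected format
    has_comma, has_dot = "," in text, "." in text
    if has_comma and has_dot:
        # Assumption: the separator that occurs last is the decimal separator.
        if text.rfind(",") > text.rfind("."):
            candidate = text.replace(".", "").replace(",", ".")
        else:
            candidate = text.replace(",", "")
    elif has_comma and text.count(",") == 1:
        # A lone comma is treated as a decimal comma, as in Listing 6.13.
        candidate = text.replace(",", ".")
    else:
        return None
    return candidate if CANONICAL.fullmatch(candidate) else None


for value in ["1,200.5", "1.200,5", "1200.5", "5,0", "+1.602e-19"]:
    print(value, "->", normalize_number(value))
```

Values such as “1,200” or “1.200” remain inherently ambiguous without knowledge of the publisher's locale; the sketch reads the former as a decimal comma and leaves the latter unchanged.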

6.3.3 Entity Consolidation

A viable solution to redundant entity definitions is entity consolidation (or instance matching, see also Section 2.4.3). Redundantly defined entities share certain properties that are unique to them. For product models, for example, these unique properties are product identifiers. For business entities, even the legal name of the business might be sufficiently distinctive. Yet, if no such unique properties exist, it is often still possible to find a unique combination of properties for an entity, as we will see shortly.

6.3.3.1 Entity Consolidation Based on Identifiers

As already stated in Chapter 4, product identifiers are particularly suitable for entity consolidation. For product models, duplicate entity definitions can be consolidated using product identifiers such as the European Article Number (EAN) or the Global Trade Item Number (GTIN). Listing 6.14 shows two product models with identical EANs.

To consolidate these two product model entities, a SPARQL CONSTRUCT query as in Listing 6.15 can be issued. In addition to EAN-13, the query also captures other product identifiers, i.e. GTIN-8 and GTIN-14. The graph pattern in the query matches any combination of two product models with different Web identifiers (i.e. URIs), but the same product identifiers. For every matching pair of product models, the query generates a triple using the owl:sameAs property, denoting that the two entity definitions should be considered the same.

ex:Model1 a gr:ProductOrServiceModel ;
    gr:name "Siemens Silence Pro 1800"@en ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .
ex:Model2 a gr:ProductOrServiceModel ;
    gr:name "Siemens 1800W Vacuum Cleaner"@en ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .

Listing 6.14: Redundant product models with the same EAN

CONSTRUCT {?model2 owl:sameAs ?model1}
WHERE {
    ?model1 a gr:ProductOrServiceModel ;
        ?hasProductId ?productId1 .
    ?model2 a gr:ProductOrServiceModel ;
        ?hasProductId ?productId2 .
    FILTER (?hasProductId IN (gr:hasEAN_UCC-13, gr:hasGTIN-14, gr:hasGTIN-8))
    FILTER (?model1 != ?model2 && str(?productId1) = str(?productId2) && str(?productId1) != "")
}

Listing 6.15: SPARQL CONSTRUCT query for product models based on arbitrary product identifiers

Keep in mind that the consolidation of product entities on this basis is valid only for product models (e.g. two datasheets for the same consumer electronics commodity). Actual products of the same kind, e.g. on two different eBay auctions, do typically not refer to the very same object, but two objects of the same make and model.

The previous query, executed on the data in Listing 6.14, yields the two triples outlined in Listing 6.16. Recall from Chapter 2 that the equals sign (“=”) is syntactic sugar in N3 that represents the owl:sameAs property.

ex:Model1 = ex:Model2 .
ex:Model2 = ex:Model1 .

Listing 6.16: owl:sameAs links between redundant product model entities

The consolidation of other types of entities resembles the consolidation of product models. For these entity types, some identifiers are particularly suitable for consolidation, namely

• for business entities, identifiers attached using the GoodRelations properties gr:hasDUNS, gr:hasNAICS, gr:hasGlobalLocationNumber, gr:hasISICv4, and

• for locations, identifiers supplied with gr:hasGlobalLocationNumber or gr:hasISICv4.

At a more general level, this approach can be used for any pair of entities of the same type that share the same property value for a property that exposes a reliable identity criterion.
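The general idea can also be sketched outside of SPARQL. The following Python fragment is a minimal illustration, assuming that entities are available as plain dictionaries (a simplification introduced here, not the data model used in this work); the function name same_as_links is hypothetical. Instead of comparing all pairs, it groups entities by property and identifier value and emits one owl:sameAs link per pair in a group.

```python
from collections import defaultdict
from itertools import combinations


def same_as_links(entities, id_properties):
    """Derive owl:sameAs links between entities sharing an identifier value.

    entities: dict mapping an entity name to a dict of property -> value
    id_properties: properties regarded as reliable identity criteria
    """
    buckets = defaultdict(set)
    for entity, props in entities.items():
        for prop in id_properties:
            value = str(props.get(prop, "")).strip()
            if value:  # ignore missing and empty identifiers
                buckets[(prop, value)].add(entity)
    links = set()
    for members in buckets.values():
        # one direction per pair suffices, since owl:sameAs is symmetric
        for a, b in combinations(sorted(members), 2):
            links.add((a, "owl:sameAs", b))
    return sorted(links)


models = {
    "ex:Model1": {"gr:hasEAN_UCC-13": "1234567890123"},
    "ex:Model2": {"gr:hasEAN_UCC-13": "1234567890123"},
    "ex:Model3": {"gr:hasEAN_UCC-13": "9999999999999"},  # made-up value
}
print(same_as_links(models, ["gr:hasEAN_UCC-13", "gr:hasGTIN-14", "gr:hasGTIN-8"]))
```

Grouping by identifier avoids the pairwise comparison that a naive implementation of the graph pattern would perform, which may matter for large datasets.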

6.3.3.2 Entity Consolidation Based on Proper Names

As an alternative to strong identifiers, entity consolidation can also be based on

• proper names,

• combinations of multiple weak identifiers,

• combinations of proper names and weak identifiers, and

• other criteria.

When proper names are used, they need to be distinctive enough to reliably consolidate entities. Examples thereof are

• for product models, the product name contained in gr:name,

• for business entities, the legal name (gr:legalName), or

• for brands, the brand name in gr:name.

Proper names can be compared based on varying matching criteria. For example, consolidation with brand names could require an exact match of proper names. With the legal name, a small gazetteer of acronyms could help to match variants of business entity types like “Limited” and “Ltd.”. Moreover, product names could be compared using string distance metrics like the Levenshtein string distance [Lev66], Jaccard coefficient [Jac12], or other popular methods [cf. CRF03], in combination with a similarity threshold.
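The matching criteria mentioned above can be sketched in a few lines of Python. The fragment below is an illustration only: the gazetteer entries, the company name “Acme”, and the function names are assumptions made for this example, and the token-based variant of the Jaccard coefficient is one of several possible choices.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard coefficient over lowercased whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical mini gazetteer for legal forms (illustrative entries only).
LEGAL_FORMS = {"limited": "ltd", "ltd.": "ltd", "incorporated": "inc", "inc.": "inc"}


def normalize_legal_name(name):
    """Lowercase a legal name and map known legal-form variants to one token."""
    return " ".join(LEGAL_FORMS.get(token, token) for token in name.lower().split())


print(levenshtein("kitten", "sitting"))
print(jaccard("Siemens Silence Pro 1800", "Siemens 1800W Vacuum Cleaner"))
print(normalize_legal_name("Acme Limited") == normalize_legal_name("ACME Ltd."))
```

The two product names from Listing 6.14 share only one out of seven distinct tokens, which shows why name-based matching typically needs a carefully chosen threshold or additional evidence.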

Furthermore, combinations of names and weak identifiers can help to increase the reliability of proper name matches. E.g., a brand or manufacturer name could be used together with a manufacturer part number (MPN) to unambiguously identify a product model. Listing 6.17 gives an example where two models are defined. Although their product names are different, they could be consolidated, as they have the same brand name and MPN.

The consolidation rule is defined as a SPARQL CONSTRUCT query and outlined in Listing 6.18. Note that the current rule ignores capitalization. The normalization step could be further adjusted by stripping whitespace characters, punctuation, dashes, etc. The output of the query is the same as already indicated in Listing 6.16.

ex:Model1 a gr:ProductOrServiceModel ;
    gr:name "Bosch 9-Gallon Dust Extractor with Semi-Auto Filter Clean VAC090S"@en ;
    gr:hasBrand [ a gr:Brand ;
        gr:name "Bosch"@en ] ;
    gr:hasMPN "VAC090S"^^xsd:string .
ex:Model2 a gr:ProductOrServiceModel ;
    gr:name "Bosch Carpet Extractor 9 Gallon"@en ;
    gr:hasBrand [ a gr:Brand ;
        gr:name "Bosch"@en ] ;
    gr:hasMPN "VAC090S"^^xsd:string .

Listing 6.17: Redundant product models based on the combination of manufacturer name and MPN

CONSTRUCT {?model1 owl:sameAs ?model2}
WHERE {
    ?model1 a gr:ProductOrServiceModel ;
        gr:hasBrand [ gr:name ?brandName1 ] ;
        gr:hasMPN ?mpn1 .
    ?model2 a gr:ProductOrServiceModel ;
        gr:hasBrand [ gr:name ?brandName2 ] ;
        gr:hasMPN ?mpn2 .
    FILTER(?model1 != ?model2 &&
        lcase(?brandName1) = lcase(?brandName2) &&
        lcase(?mpn1) = lcase(?mpn2))
}

Listing 6.18: SPARQL CONSTRUCT query for consolidating redundant product models based on identical pairs of brand names and MPNs
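A stricter normalization step of the kind suggested above might look as follows in Python. This is a sketch under the assumption that only ASCII letters and digits are significant for matching; the function name match_key is hypothetical.

```python
import re


def match_key(brand, mpn):
    """Build a comparison key that ignores case, whitespace, punctuation, and dashes."""
    def squash(text):
        # keep ASCII letters and digits only (assumption of this sketch)
        return re.sub(r"[^0-9a-z]", "", text.lower())
    return (squash(brand), squash(mpn))


print(match_key("Bosch", "VAC090S"))
print(match_key(" BOSCH ", "vac-090 s") == match_key("Bosch", "VAC090S"))
```

Two product models would then be consolidation candidates if their keys are equal and non-empty. Note that such aggressive normalization can also merge part numbers that differ only in punctuation, so it trades precision for recall.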

As a third possibility, consider consolidating entities based on other criteria, e.g. based on matching addresses or geo coordinates between entities, or based on singular properties. Although this can be a quite powerful mechanism, it often involves fuzzy operations that might cause unwanted side-effects (e.g. consolidation of two distinct companies located in the same building).

6.3.4 Schema Alignment

A comprehensive list of approaches to overcome schema heterogeneity has already been outlined in Section 2.4.2, where we discussed schema and ontology matching. In this section, we elaborate on working solutions on how we can consolidate different e-commerce schemas on the Semantic Web. For schema alignment tasks, OWL reasoning is of particular importance, because it constitutes a simple way to exploit OWL and RDFS axioms without the need to define SPARQL CONSTRUCT rules.

6.3.4.1 Two Schemas with Structural Mismatches

Two schemas that represent the same fact but with different structures need to be harmonized in order to allow for comfortable querying. Listing 6.19 presents a structural mismatch between product offers in schema.org and GoodRelations.

# Left: schema.org
ex:Product a schema:SomeProducts ;
    schema:offers ex:Offer .
ex:Offer a schema:Offer .

# Right: GoodRelations
ex:Offer a gr:Offering ;
    gr:offers ex:Product .
ex:Product a gr:SomeItems .

Listing 6.19: Product offering definition in schema.org and GoodRelations

Listing 6.20 shows how the entities between these two schemas can be mediated using a SPARQL CONSTRUCT query. The query translates the data in the left part of Listing 6.19 to the data on the right-hand side.

CONSTRUCT {
    ?offer a gr:Offering ;
        gr:offers ?product .
    ?product a gr:SomeItems .
}
WHERE {
    ?product a schema:SomeProducts ;
        schema:offers ?offer .
    ?offer a schema:Offer .
}

Listing 6.20: SPARQL CONSTRUCT query to convert a product offer in schema.org to the respective offer in GoodRelations

6.3.4.2 Two Schemas with Direct Correspondences

Some classes, properties, or instances are either equivalent or at least very similar across different schemas. E.g., the schema:Person class defined in schema.org is more specific than the gr:BusinessEntity class defined in GoodRelations. Thus, the former can be defined as a subclass of the latter. Likewise, schema:ProductModel and gr:ProductOrServiceModel are equivalent classes. Listing 6.21 shows their class definitions in the respective vocabularies.

# Left: schema.org
schema:ProductModel a rdfs:Class ;
    rdfs:label "Product Model"@en ;
    rdfs:subClassOf schema:Product .

# Right: GoodRelations
gr:ProductOrServiceModel a owl:Class ;
    rdfs:label "Product or service model"@en ;
    rdfs:subClassOf gr:ProductOrService .

Listing 6.21: Product model definition in schema.org (actually schema.rdfs.org) and GoodRelations

The RDFS and OWL ontology languages offer properties to define simple alignment axioms between related concepts of different schemas [HB11, pp. 24f.]. For example, the properties to express equality in OWL are owl:equivalentClass between classes [HB11, p. 24], owl:equivalentProperty between properties [HB11, p. 24], and owl:sameAs between instances [cf. Vol+09]. Furthermore, somewhat weaker properties that most RDFS and OWL reasoners are capable of handling are rdfs:subClassOf and rdfs:subPropertyOf [cf. HB11, p. 24]. At line 1 in Listing 6.22, the two product model definitions are aligned using an owl:equivalentClass property. Furthermore, lines 3–4 define instances of these two classes.

1 schema:ProductModel owl:equivalentClass gr:ProductOrServiceModel .
2
3 ex:Model1 a gr:ProductOrServiceModel .
4 ex:Model2 a schema:ProductModel .

Listing 6.22: Axiom to translate among two product model classes and product model instances

Without reasoning support, the SPARQL SELECT query in Listing 6.23 only matches ex:Model1 of the knowledge base in Listing 6.22, since only this entity is explicitly defined as a GoodRelations product model. However, by employing an OWL reasoner capable of handling the owl:equivalentClass relationship, the same query also returns the product model defined in schema.org. With reasoning enabled, the query in Listing 6.23 thus yields both product models, as shown in the result table next to the query.

SELECT ?s
WHERE {
    ?s a gr:ProductOrServiceModel .
}

Result table:

    s
1   ex:Model2
2   ex:Model1

Listing 6.23: SPARQL SELECT query and triples returned by selecting all GoodRelations product models


6.3.4.3 One Schema with Multiple Patterns

Within the same schema, multiple patterns to represent the same information may exist. In schema.org, the price information can, among others, be attached to the product offer directly, or indirectly via a price specification entity. The two alternatives are outlined in Listing 6.24. Obviously, a query that respects only one type of modeling would miss entities expressed with the other pattern.


# Left: price attached directly (no intermediate price specification node)
ex:Offer a schema:Offer ;
    schema:price "100.0" ;
    schema:priceCurrency "USD" .

# Right: price attached via a price specification node
ex:Offer a schema:Offer ;
    schema:priceSpecification [
        a schema:UnitPriceSpecification ;
        schema:price "100.0" ;
        schema:priceCurrency "USD" ] .

Listing 6.24: Two equivalent modeling patterns for prices in schema.org

Under the assumption that a query is formulated that matches only the pattern on the right side of Listing 6.24, we can use the SPARQL CONSTRUCT query in Listing 6.25 to bring the non-matching pattern into the required form.

CONSTRUCT {
    ?offer schema:priceSpecification [ a schema:UnitPriceSpecification ;
        schema:price ?price ;
        schema:priceCurrency ?currency ] .
}
WHERE {
    ?offer a schema:Offer ;
        schema:price ?price ;
        schema:priceCurrency ?currency .
    FILTER NOT EXISTS {?offer schema:priceSpecification ?pspec}
}

Listing 6.25: SPARQL CONSTRUCT query to translate between two equivalent modeling patterns within the same schema

6.3.4.4 One Schema with Modeling Shortcuts

As modeling patterns are often complex although in frequent use, some schemas define modeling shortcuts for them. GoodRelations, for example, defines handy shortcuts like gr:includes, gr:hasValue, gr:hasValueFloat, and gr:hasValueInteger.

Listing 6.26 contrasts the shortcut gr:includes with its longer form. Comparing these two variants, it is easy to notice that the shortcut assumes some implicit defaults, such as that the offer is composed of only one item.

# Left: modeling shortcut
ex:Offer a gr:Offering ;
    gr:includes ex:Product .
# the current modeling pattern is sufficient for describing
# an offer that includes a single product item.
ex:Product a gr:SomeItems .

# Right: expanded version
ex:Offer a gr:Offering ;
    gr:includesObject [
        a gr:TypeAndQuantityNode ;
        gr:amountOfThisGood "1.0"^^xsd:float ;
        gr:hasUnitOfMeasurement "C62"^^xsd:string ;
        gr:typeOfGood ex:Product ] .
ex:Product a gr:SomeItems .

Listing 6.26: Modeling shortcut and expanded version for attaching a product to an offer

In a SPARQL endpoint, the modeling shortcut can be easily expanded by issuing a SPARQL CONSTRUCT query as illustrated in Listing 6.27. The advantage of shortcut expansion is that it facilitates query formulation, because queries can safely rely on the full modeling patterns.

CONSTRUCT {
    ?offer a gr:Offering ;
        gr:includesObject [ a gr:TypeAndQuantityNode ;
            gr:amountOfThisGood "1.0"^^xsd:float ;
            gr:hasUnitOfMeasurement "C62"^^xsd:string ;
            gr:typeOfGood ?product ] .
}
WHERE {
    ?offer gr:includes ?product .
    ?product a ?ptype .
    FILTER (?ptype != gr:ProductOrServiceModel)
    FILTER NOT EXISTS {?offer gr:includesObject [ gr:typeOfGood ?product ]}
}

Listing 6.27: SPARQL CONSTRUCT query to expand a shortcut pattern for products to its canonical long variant

In addition to product instances, the gr:includes shortcut also works with product models attached to product offers. In that case, the SPARQL CONSTRUCT query in Listing 6.27 would unfold one additional link gr:hasMakeAndModel between the product instance and the product model.

Another modeling shortcut exists for point values. They can be expanded to intervals so that they are matched by interval queries, e.g. querying the lower and upper limits of a quantitative value. A simple example of such an expansion is provided in Listing 6.28, where the left-hand side denotes the point value, and the right-hand side constitutes the corresponding interval definition.


# Left: point value
ex:ScreenSize a gr:QuantitativeValueFloat ;
    # rewrite point value below to interval
    gr:hasValueFloat "7"^^xsd:float ;
    gr:hasUnitOfMeasurement "INH"^^xsd:string .

# Right: interval
ex:ScreenSize a gr:QuantitativeValueFloat ;
    gr:hasMinValueFloat "7"^^xsd:float ;
    gr:hasMaxValueFloat "7"^^xsd:float ;
    gr:hasUnitOfMeasurement "INH"^^xsd:string .

Listing 6.28: Quantitative values as point values and intervals

The translation of point values to intervals for the example at hand can be achieved by employing the SPARQL CONSTRUCT query in Listing 6.29. The query ensures that the shortcut is only expanded if neither a minimum value nor a maximum value exists in the data. At the same time, it prevents duplicate expansion in case the same SPARQL CONSTRUCT query is executed more than once.

CONSTRUCT {
    ?qv gr:hasMinValueFloat ?v ;
        gr:hasMaxValueFloat ?v
}
WHERE {
    ?qv gr:hasValueFloat ?v .
    # RDF graph does not contain a property for the min value
    # (fresh variable, so that any existing min value blocks the expansion)
    FILTER NOT EXISTS {?qv gr:hasMinValueFloat ?min}
    # RDF graph does not contain a property for the max value
    FILTER NOT EXISTS {?qv gr:hasMaxValueFloat ?max}
}

Listing 6.29: SPARQL CONSTRUCT query to convert point values to intervals
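The intended behavior of the expansion, namely that it only applies if neither bound exists and that repeated execution adds nothing new, can be illustrated with a small Python sketch. It assumes that triples are given as plain tuples of strings, which is a simplification introduced for this example; the function name expand_point_values is hypothetical.

```python
def expand_point_values(triples):
    """Return the min/max triples to add for point values without any interval bound."""
    has_min = {s for s, p, _ in triples if p == "gr:hasMinValueFloat"}
    has_max = {s for s, p, _ in triples if p == "gr:hasMaxValueFloat"}
    new_triples = set()
    for s, p, o in triples:
        # expand only if neither a minimum nor a maximum value is present
        if p == "gr:hasValueFloat" and s not in has_min and s not in has_max:
            new_triples.add((s, "gr:hasMinValueFloat", o))
            new_triples.add((s, "gr:hasMaxValueFloat", o))
    return new_triples


graph = {
    ("ex:ScreenSize", "gr:hasValueFloat", "7"),
    ("ex:ScreenSize", "gr:hasUnitOfMeasurement", "INH"),
}
added = expand_point_values(graph)
print(sorted(added))
# A second run on the enriched graph yields no further triples.
print(expand_point_values(graph | added))
```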

6.3.5 Missing Relationships

In the following, we elaborate on rules and axioms and their usage for solving the problem of missing relationships in the data.

6.3.5.1 Product Model Information Based on Identifiers

Many Web shops are missing granular product information, which could easily be supplied by manufacturers (see Chapter 4). To equip them with product model information though, either an explicit, typed relationship or a shared identifier between product offers (or product instances) and product models needs to be in place. The idea is similar to the one for entity consolidation between the same concepts, as discussed in Section 6.3.3.1. The left part of Listing 6.30 shows a product item and a product model with the same product identifier. Consequently, it would make sense to materialize the correspondence between the product and the product model in the data. This assertion can be represented using the gr:hasMakeAndModel relationship, as indicated in the right part of Listing 6.30.



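# Left: product and product model without explicit link
ex:Product a gr:SomeItems ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .
# no link to make and model given
ex:Model a gr:ProductOrServiceModel ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .

# Right: product linked to its product model
ex:Product a gr:SomeItems ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
    gr:hasMakeAndModel ex:Model .
ex:Model a gr:ProductOrServiceModel ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .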

Listing 6.30: Product model information based on matching EANs

The respective SPARQL CONSTRUCT query to generate the bridge axiom between the product item and the product model is given in Listing 6.31.

CONSTRUCT {?product gr:hasMakeAndModel ?model}
WHERE {
    ?product gr:hasEAN_UCC-13 ?ean1 .
    FILTER NOT EXISTS {?product a gr:ProductOrServiceModel}
    ?model a gr:ProductOrServiceModel ;
        gr:hasEAN_UCC-13 ?ean2 .
    FILTER NOT EXISTS {?product gr:hasMakeAndModel ?model2}
    FILTER(str(?ean1) = str(?ean2) && str(?ean1) != "")
}

Listing 6.31: SPARQL CONSTRUCT query to establish a link between products and product models with matching EANs

6.3.5.2 Product Model Information Based on Proper Names

Corresponding products and product models could be linked based on proper names as well, or based on combinations of proper names and weak identifiers, or other criteria. This becomes especially relevant if no shared product identifier is available between a product and a product model. How links between products and product models are established differs by the type of unique key, i.e. product identifier or proper name, and is equivalent to what we have already seen in the context of entity consolidation with proper names in Section 6.3.3.2.

6.3.5.3 Product Feature Inheritance from Product Model to Product

Once a link between a product item and a product model is established, product features defined for the product model could be used to augment the product item and ultimately the product offer. Consider the following example in Listing 6.32, where the left part describes the source data with a product linking to a product model via gr:hasMakeAndModel, and where the right part shows the product data augmented by the product information from the product model (i.e. the triple that links to the weight of the product).


# Left: source data
 1 ex:Product a gr:SomeItems ;
 2     gr:hasMakeAndModel ex:Model ;
 3     gr:name "Galaxy S5 - White"@en .
 4 # no weight given
 5 ex:Model a gr:ProductOrServiceModel ;
 6     gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
 7     gr:name "Samsung Galaxy S5 W"@en ;
 8     gr:weight ex:WeightSGS5 .
 9 # the weight the product model refers to
10 ex:WeightSGS5 a gr:QuantitativeValueFloat ;
12     gr:hasUnitOfMeasurement "GRM"^^xsd:string .

# Right: product augmented by the weight of the product model
 1 ex:Product a gr:SomeItems ;
 2     gr:hasMakeAndModel ex:Model ;
 3     gr:name "Galaxy S5 - White"@en ;
 4     gr:weight ex:WeightSGS5 .
 5 ex:Model a gr:ProductOrServiceModel ;
 6     gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
 7     gr:name "Samsung Galaxy S5 W"@en ;
 8     gr:weight ex:WeightSGS5 .
 9 # the weight both entities refer to
10 ex:WeightSGS5 a gr:QuantitativeValueFloat ;
12     gr:hasUnitOfMeasurement "GRM"^^xsd:string .

Listing 6.32: Product with and without product features from the product model

The transformation in Listing 6.32, where the product item inherits product features of the product model, is achieved by the SPARQL CONSTRUCT query in Listing 6.33. Note that the query ensures that (1) the product itself is not a make and model (line 5); (2) the expanded properties are actual product features and not arbitrary non-technical properties (lines 7–9), since that could have unwanted side-effects (e.g. consider inheriting the name of the make and model); and, (3) the very same properties (with possibly different values) do not yet exist for the specific product item (line 11).

 1 CONSTRUCT {?product ?property ?modelValue}
 2 WHERE {
 3     ?model a gr:ProductOrServiceModel .
 4     ?product gr:hasMakeAndModel ?model .
 5     FILTER NOT EXISTS {?product a gr:ProductOrServiceModel}
 6     ?model ?property ?modelValue .
 7     VALUES ?superProperty {gr:qualitativeProductOrServiceProperty
 8         gr:quantitativeProductOrServiceProperty gr:datatypeProductOrServiceProperty}
 9     ?property rdfs:subPropertyOf ?superProperty .
10     # product does not have this property yet (also covers rdf:type statements)
11     FILTER NOT EXISTS {?product ?property ?productValue}
12 }

Listing 6.33: SPARQL CONSTRUCT query for the inheritance of product features from the product model
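The three conditions of the inheritance rule can be restated compactly in Python. The sketch below assumes that product and product model are given as dictionaries of property-value pairs and that the set of product feature properties is known in advance (in the query, this set is derived from rdfs:subPropertyOf statements); the function name inherit_features is hypothetical.

```python
def inherit_features(product, model, feature_properties):
    """Copy product features from the model that the product does not define itself."""
    augmented = dict(product)
    for prop, value in model.items():
        # only actual product features, and never overriding existing values
        if prop in feature_properties and prop not in augmented:
            augmented[prop] = value
    return augmented


product = {"gr:hasMakeAndModel": "ex:Model", "gr:name": "Galaxy S5 - White"}
model = {
    "gr:hasEAN_UCC-13": "1234567890123",
    "gr:name": "Samsung Galaxy S5 W",
    "gr:weight": "ex:WeightSGS5",
}
print(inherit_features(product, model, {"gr:weight"}))
```

In this example, only gr:weight is inherited; the name and the EAN of the product model are left untouched because they are not listed as product features.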


6.3.5.4 Product Feature Inheritance from Product Variants

Some product models are variants of other product models, i.e. they have most features in common. Therefore, a product variant will usually inherit all product-related properties from the base product model, save for those that are already defined by the variant, and those that are constituent parts of and thus unique to the other product model (e.g. product identifiers).

In Listing 6.34, we specify a product model for a Ford T, and a variant thereof, a red Ford T model. While it is pretty safe for the variant to inherit the technical features (like engine power and displacement, in our example), other details like the model date (or similar identifiers like Vehicle Identification Numbers (VINs) or GTINs) shall not be adopted, as indicated in the right box of Listing 6.34. Further, the color is not inherited, because the product variant already defines this property itself, which would otherwise lead to conflicts.

# Left: variant without inherited product features
 1 ex:RedFordTModel a gr:ProductOrServiceModel,
 2         vso:Automobile ;
 3     gr:isVariantOf ex:FordTModel ;
 4     vso:color "red"@en .
 5 # no engine displacement given
 6 # no engine power given
 7 ex:FordTModel a gr:ProductOrServiceModel,
 8         vso:Automobile ;
 9     gr:name "Ford T Model - Black"@en ;
10     vso:modelDate "2002-01-01"^^xsd:date ;
11     vso:color "black"@en ;
12     vso:engineDisplacement ex:DisplacementFT ;
13     vso:enginePower ex:PowerFT .
14 ex:PowerFT a gr:QuantitativeValueFloat ;
16     gr:hasUnitOfMeasurement "KWT"^^xsd:string .
17 ex:DisplacementFT a gr:QuantitativeValueFloat ;
19     gr:hasUnitOfMeasurement "LTR"^^xsd:string .

# Right: variant with inherited product features
 1 ex:RedFordTModel a gr:ProductOrServiceModel,
 2         vso:Automobile ;
 3     gr:isVariantOf ex:FordTModel ;
 4     vso:color "red"@en ;
 5     vso:engineDisplacement ex:DisplacementFT ;
 6     vso:enginePower ex:PowerFT .
 7 ex:FordTModel a gr:ProductOrServiceModel,
 8         vso:Automobile ;
 9     gr:name "Ford T Model - Black"@en ;
10     vso:modelDate "2002-01-01"^^xsd:date ;
11     vso:color "black"@en ;
12     vso:engineDisplacement ex:DisplacementFT ;
13     vso:enginePower ex:PowerFT .
14 ex:PowerFT a gr:QuantitativeValueFloat ;
16     gr:hasUnitOfMeasurement "KWT"^^xsd:string .
17 ex:DisplacementFT a gr:QuantitativeValueFloat ;
19     gr:hasUnitOfMeasurement "LTR"^^xsd:string .

Listing 6.34: Product variant with and without product features from a related product model

The logic behind the inheritance of product features based on the product model variant is defined by the SPARQL CONSTRUCT query in Listing 6.35. The query makes sure that (1) the expanded properties are actual product features and not arbitrary non-technical properties (lines 7–9); (2) certain properties (especially variant-specific identifiers) not already filtered out in step 1 are excluded (line 11); and, (3) properties that already exist for the variant (e.g. vso:color in Listing 6.34) are not overwritten (line 13).


 1 CONSTRUCT {?model2 ?property ?value1}
 2 WHERE {
 3     ?model1 a gr:ProductOrServiceModel .
 4     ?model2 a gr:ProductOrServiceModel ;
 5         gr:isVariantOf ?model1 .
 6     ?model1 ?property ?value1 .
 7     VALUES ?superProperty {gr:qualitativeProductOrServiceProperty
 8         gr:quantitativeProductOrServiceProperty gr:datatypeProductOrServiceProperty}
 9     ?property rdfs:subPropertyOf ?superProperty .
10     # exclude model-specific identifiers
11     FILTER(?property NOT IN (vso:modelDate))
12     # do not override existing properties
13     FILTER NOT EXISTS {?model2 ?property ?value2}
14 }

Listing 6.35: SPARQL CONSTRUCT query for the inheritance of product features from product variants

6.3.5.5 Consumables, Accessories, and Spare Parts

A lot of product data on the Semantic Web is published independently, so a proper linkage between products and their consumables, accessories, or spare parts is often missing. On the other hand, with this additional information valuable product recommendations could be made. Let us illustrate the problem using an example of a consumable⁴. Consider a retailer that publishes data about a laser printer. Now, it would be interesting to see suitable toner cartridges listed together with the printer. However, as toner cartridges are sold by many different vendors, this appears to be a difficult endeavor. Listing 6.36 shows the status quo on the left, and the target state on the right.

# Left: status quo
ex:Printer a gr:SomeItems,
        pto:Printer_(computing) ;
    gr:hasMPN "7800V_DN"^^xsd:string ;
    gr:name "Phaser 7800"@en .
ex:TonerCartridge a gr:SomeItems,
        pto:Toner_cartridge ;
    gr:name "Black Hi Capacity Toner Cartridge for Phaser 7800 (7800V_DN ...)"@en .
# no link for consumable given

# Right: target state
ex:Printer a gr:SomeItems,
        pto:Printer_(computing) ;
    gr:hasMPN "7800V_DN"^^xsd:string ;
    gr:name "Phaser 7800"@en .
ex:TonerCartridge a gr:SomeItems,
        pto:Toner_cartridge ;
    gr:name "Black Hi Capacity Toner Cartridge for Phaser 7800 (7800V_DN ...)"@en ;
    gr:isConsumableFor ex:Printer .

Listing 6.36: Products where one (a toner cartridge) is a consumable for another (a printer)

An important observation is that products on the Web often use a product model identifier of the consumer product in the product name or description of the consumable. In particular, the MPN is often part of the product name of the consumable. This small detail makes it possible, even if in a fairly fuzzy manner, to establish the relationship between consumer products and consumables. Listing 6.37 details a SPARQL CONSTRUCT rule to generate the gr:isConsumableFor link between a product and its consumable.

⁴ The situation is similar for accessories and spare parts, thus we will not cover them here.

CONSTRUCT {?consumable gr:isConsumableFor ?product}
WHERE {
    # omit deprecated classes
    VALUES ?ptype {gr:ProductOrService gr:ProductOrServiceModel gr:SomeItems gr:Individual}
    VALUES ?ctype {gr:ProductOrService gr:ProductOrServiceModel gr:SomeItems gr:Individual}
    ?product a ?ptype ;
        gr:hasMPN ?mpn .
    ?consumable a ?ctype ;
        gr:name ?consumableName .
    # comparison based on str(literal) is more robust than literal!
    FILTER(?product != ?consumable && CONTAINS(str(?consumableName), str(?mpn)))
}

Listing 6.37: SPARQL CONSTRUCT query to add gr:isConsumableFor link based on the MPN of one product contained in the product name of the other product

Alternatively, if there is no MPN or other product identifier available, one could try to find the product name of the consumer product in the product name or description of the consumable. This method is, of course, not as accurate as relying on product identifiers.

The sketched approach is of course very simplistic and will suffer from false positives (e.g. textual content like “not compatible with XYZ” or “supersedes XYZ” will also create a respective gr:isConsumableFor statement). In real-world applications, the pattern should be complemented by more advanced natural language processing approaches.
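To make the false-positive problem more tangible, the following Python sketch adds a crude guard to the substring test. It is by no means a substitute for natural language processing: the list of negative cues merely contains the two phrases mentioned above, the case-insensitive comparison deviates from Listing 6.37, and the function name is_consumable_for is hypothetical.

```python
# Phrases that, directly before the MPN, suggest the absence of a consumable relationship.
NEGATIVE_CUES = ("not compatible with", "supersedes")


def is_consumable_for(consumable_name, mpn):
    """Return True if the MPN occurs in the name and is not preceded by a negative cue."""
    name = consumable_name.lower()
    needle = mpn.lower()
    position = name.find(needle)
    if not needle or position == -1:
        return False
    prefix = name[:position].rstrip()
    return not any(prefix.endswith(cue) for cue in NEGATIVE_CUES)


print(is_consumable_for(
    "Black Hi Capacity Toner Cartridge for Phaser 7800 (7800V_DN ...)", "7800V_DN"))
print(is_consumable_for("Toner, not compatible with 7800V_DN", "7800V_DN"))
```

The guard only inspects the text immediately before the first occurrence of the MPN, so other phrasings would still slip through.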

6.3.6 Missing Attributes

While relevant attributes like the price value or the value of a quantitative characteristic are mostly present, other attributes are sometimes missing in the data, as for example the unit of measurement of a quantitative value, the currency code of a price, details about value-added taxes, or the start and/or end date of a validity period. Unfortunately, quantitative characteristics of products cannot be compared or filtered as long as the unit of measurement is missing, and missing validity dates in an open world can mean that a product offer is either valid or not. That is, the open-world assumption (OWA) [e.g. DFH11, p. 21], on which GoodRelations as an OWL ontology is based, does not allow drawing conclusions from the absence of a statement.

However, to some extent, we can recover missing attributes by assuming implicit defaults. If no unit code is present, we can assume that no relevant unit exists for the quantitative value and fall back to “C62”, which means “one unit or piece” [cf. Hep08b; Uni09a]. By comparison, the currency code could be determined based on the shop location. E.g., if the shop is located in the USA, it is likely that the prices are given in U.S. dollars, whereas a shop in Germany will likely publish prices in euros. Similarly, implicit default values might apply to value-added taxes based on experience. Finally, a heuristic for a missing start date of the price validity could be to rely on the date of the Hypertext Transfer Protocol (HTTP) request for the data (if available), and otherwise to assume the current date.
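The default heuristics described above can be summarized in a short Python sketch. All names in it are illustrative assumptions: the country-to-currency table contains only the two examples from the text, entities are simplified to dictionaries, and gr:validFrom is used here as the property for the start date of the validity.

```python
from datetime import date

# Illustrative lookup table with the two examples from the text.
CURRENCY_BY_COUNTRY = {"US": "USD", "DE": "EUR"}


def with_default_unit(quantitative_value):
    """Assume "C62" (one unit or piece) if no unit of measurement is given."""
    filled = dict(quantitative_value)
    filled.setdefault("gr:hasUnitOfMeasurement", "C62")
    return filled


def with_default_currency(price_spec, shop_country):
    """Derive a missing currency code from the location of the shop, if known."""
    filled = dict(price_spec)
    if "gr:hasCurrency" not in filled and shop_country in CURRENCY_BY_COUNTRY:
        filled["gr:hasCurrency"] = CURRENCY_BY_COUNTRY[shop_country]
    return filled


def with_default_valid_from(offer, http_date=None):
    """Use the date of the HTTP request if available, otherwise the current date."""
    filled = dict(offer)
    filled.setdefault("gr:validFrom", (http_date or date.today()).isoformat())
    return filled


print(with_default_unit({"gr:hasValueInteger": 4}))
print(with_default_currency({"gr:hasCurrencyValue": 49.99}, "US"))
print(with_default_valid_from({}, date(2016, 3, 3)))
```

Existing values are never overwritten, which mirrors the FILTER NOT EXISTS guards used in the SPARQL CONSTRUCT rules of this section.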

Listing 6.38 illustrates two examples of quantitative values that lack unit codes. The first example specifies a price without a currency code, and the second example models a quantitative value for the number of central processing unit (CPU) cores where the unit of measurement code is absent.

ex:Price a gr:UnitPriceSpecification ;
    gr:hasCurrencyValue "49.99"^^xsd:float .

ex:NumCPUCores a gr:QuantitativeValueInteger ;
    gr:hasValueInteger "4"^^xsd:int .

Listing 6.38: Two examples of where some unit codes (code of the unit of measurement and the currency code) are missing for quantitative values

If we knew, for example, that the price specification was published by a Web shop located in the USA, we could issue the SPARQL CONSTRUCT query shown in Listing 6.39.

CONSTRUCT {?price gr:hasCurrency "USD"^^xsd:string}
WHERE {
    ?price gr:hasCurrencyValue ?v .
    FILTER NOT EXISTS {?price gr:hasCurrency ?currency}
    FILTER(isNumeric(?v))
}

Listing 6.39: SPARQL CONSTRUCT query to assign the default currency value wherever missing

The SPARQL CONSTRUCT query generates the missing triple with the currency code defaulting to “USD”, as shown below:

ex:Price gr:hasCurrency "USD"^^xsd:string .


If we assume that the default code for the unit of measurement is “C62”, meaning “no unit code” or “a piece” [cf. Hep08b; Uni09a], the respective SPARQL CONSTRUCT query looks as indicated in Listing 6.40. For the sake of simplicity, let us assume that reasoning support is enabled for this query. Starting from the properties gr:hasMinValue and gr:hasMaxValue, an RDFS-style reasoner could then infer triples using the subproperties outlined in Figure 6.1. Similarly, the gr:QuantitativeValue class subsumes the classes gr:QuantitativeValueFloat and gr:QuantitativeValueInteger [cf. Hep11]. In addition to quantitative values, the query is also able to cope with entities typed as gr:TypeAndQuantityNode and the respective property gr:amountOfThisGood, which is used in combination with the gr:includesObject modeling pattern [Hep08b].

CONSTRUCT {?qv gr:hasUnitOfMeasurement "C62"^^xsd:string}
WHERE {
    VALUES ?qvt {gr:QuantitativeValue gr:TypeAndQuantityNode}
    ?qv a ?qvt .
    ?qv gr:hasMinValue|gr:hasMaxValue|gr:amountOfThisGood ?v .
    FILTER NOT EXISTS {?qv gr:hasUnitOfMeasurement ?uom}
}

Listing 6.40: SPARQL CONSTRUCT query to recover missing unit codes in quantitative values

[Figure 6.1 (diagram) relates the properties gr:hasValue, gr:hasMinValue, gr:hasMaxValue, gr:hasValueFloat, gr:hasMinValueFloat, gr:hasMaxValueFloat, gr:hasValueInteger, gr:hasMinValueInteger, and gr:hasMaxValueInteger.]

Figure 6.1: Property hierarchy of quantitative values in GoodRelations [based on Hep11]

Executing the SPARQL CONSTRUCT query in Listing 6.40 on the data in Listing 6.38 yields the following RDF triple:

ex:NumCPUCores gr:hasUnitOfMeasurement "C62"^^xsd:string .
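The default assumptions discussed in this section can also be sketched outside of SPARQL. The following Python fragment is a minimal illustration, not the implementation used in this work: the record layout (plain dictionaries), the function name apply_defaults, and the two-entry country table are hypothetical and only mirror the USA/Germany example given above.

```python
from datetime import date
from typing import Optional

# Hypothetical lookup table (illustrative subset): shop country -> likely currency.
DEFAULT_CURRENCY_BY_COUNTRY = {"US": "USD", "DE": "EUR"}

# UN/CEFACT Common Code for "one unit or piece", used as the implicit default.
DEFAULT_UNIT_CODE = "C62"


def apply_defaults(record: dict,
                   shop_country: Optional[str] = None,
                   request_date: Optional[date] = None) -> dict:
    """Return a copy of `record` with missing attributes filled by implicit defaults."""
    result = dict(record)
    # Quantitative value without a unit code: assume "C62".
    if "value" in result and not result.get("unit"):
        result["unit"] = DEFAULT_UNIT_CODE
    if "price" in result:
        # Price without a currency: guess from the shop location, if it is known.
        if not result.get("currency") and shop_country in DEFAULT_CURRENCY_BY_COUNTRY:
            result["currency"] = DEFAULT_CURRENCY_BY_COUNTRY[shop_country]
        # Missing validity start: date of the HTTP request, otherwise the current date.
        if not result.get("valid_from"):
            result["valid_from"] = request_date or date.today()
    return result


price = apply_defaults({"price": 49.99}, shop_country="US",
                       request_date=date(2016, 2, 20))
cores = apply_defaults({"value": 4})
print(price)
print(cores)
```

If the shop location is unknown or not covered by the table, the sketch leaves the currency unset instead of guessing, which is in line with the open-world reading of the data.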


6.3.7 Data Lifting and Enrichment

Data lifting and enrichment consists of improving the data granularity (i.e. the degree of explicit structure) by (1) rules to derive structured data from textual properties, and (2) enrichment using external resources.

The data lifting task essentially comprises heuristics like regular expressions to get structured data from text. Listing 6.41 exemplifies the situation by specifying a quantitative value where the unit code is included in the value property (i.e. “60.0 LTR”), as happens quite often. By comparison, the second quantitative value in the listing outlines the target state.

# non-granular description (source)
ex:Capacity a gr:QuantitativeValue ;
    gr:hasValue "60.0 LTR"^^xsd:string .
    # unit of measurement code missing in here

# granular description (target)
ex:Capacity a gr:QuantitativeValue ;
    gr:hasValue "60.0"^^xsd:float ;
    gr:hasUnitOfMeasurement "LTR"^^xsd:string .

Listing 6.41: Comparison of non-granular and granular quantitative value descriptions

In this specific case, it is sufficient to employ a simple heuristic to split the string into the value and code parts. This is exactly what the SPARQL CONSTRUCT rule in Listing 6.42 does. It binds the term before the whitespace character as the numeric part, and the term after the whitespace character as the unit code.

CONSTRUCT {
    ?qv ?qvp ?numericPart ;
        gr:hasUnitOfMeasurement ?uomPart .
}
WHERE {
    VALUES ?qvt {gr:QuantitativeValue gr:QuantitativeValueFloat gr:QuantitativeValueInteger
                 gr:TypeAndQuantityNode}
    ?qv a ?qvt .
    VALUES ?qvp {gr:hasValue gr:hasValueFloat gr:hasValueInteger gr:amountOfThisGood}
    ?qv ?qvp ?v .
    FILTER NOT EXISTS {?qv gr:hasUnitOfMeasurement ?uom}
    BIND(STRBEFORE(?v, " ") AS ?numericPart)
    BIND(STRAFTER(?v, " ") AS ?uomPart)
}

Listing 6.42: SPARQL CONSTRUCT query that applies a heuristic to extract a value and a unit code from a free-text field
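The same splitting heuristic can be expressed as a small Python function. This is only a sketch under stated assumptions: the function name split_value_and_unit is hypothetical, and the regular expression assumes a decimal number followed by whitespace and a two- or three-character uppercase alphanumeric code. Unlike the plain whitespace split, it rejects strings that do not follow this shape.

```python
import re
from typing import Optional, Tuple

# Assumed input shape: decimal number, whitespace, then a short uppercase
# alphanumeric code such as "LTR" or "C62".
VALUE_WITH_UNIT = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s+([A-Z0-9]{2,3})\s*$")


def split_value_and_unit(text: str) -> Optional[Tuple[float, str]]:
    """Split a string like '60.0 LTR' into (60.0, 'LTR'); return None otherwise."""
    match = VALUE_WITH_UNIT.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


print(split_value_and_unit("60.0 LTR"))
print(split_value_and_unit("sixty liters"))
```

Returning None for non-matching strings keeps the rule from emitting empty values or unit codes for free text that merely happens to contain a space.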

Another important class of data values often modeled as one textual property are open and closed intervals. Consider e.g. the seating capacity of a van, which is typically between one and nine passengers. Listing 6.43 gives a comparison of the same information (1) encoded as text, and (2) modeled in a more granular way using gr:hasMinValue and gr:hasMaxValue properties.

# (1) interval encoded as text
ex:SeatingCapacity a gr:QuantitativeValue ;
    # interval provided in text
    gr:hasValue "1-9"^^xsd:string ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string .

# (2) interval modeled with lower and upper boundaries
ex:SeatingCapacity a gr:QuantitativeValue ;
    gr:hasMinValue "1"^^xsd:int ;
    gr:hasMaxValue "9"^^xsd:int ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string .

Listing 6.43: Intervals modeled in text as compared to individual intervals

Notice that the property type used to attach the textual label is arbitrary. Instead of using gr:hasValue, we could likewise have used the textual properties gr:name or gr:description. The approach is essentially the same. To lift integer intervals encoded in text, the SPARQL CONSTRUCT query in Listing 6.44 could be used. A regular expression is applied to the textual label to match possible interval definitions, and the extracted numbers are assigned suitable datatypes.

CONSTRUCT {
    ?qv gr:hasMinValue ?minv ;
        gr:hasMaxValue ?maxv
}
WHERE {
    VALUES ?qvt {gr:QuantitativeValue gr:QuantitativeValueFloat gr:QuantitativeValueInteger}
    ?qv a ?qvt ;
        gr:hasValue ?v .
    FILTER(REGEX(?v, "^[0-9]+-[0-9]+$")) # matches integer intervals
    BIND(STRDT(STRBEFORE(?v, "-"), xsd:int) AS ?minv)
    BIND(STRDT(STRAFTER(?v, "-"), xsd:int) AS ?maxv)
}

Listing 6.44: SPARQL CONSTRUCT query to convert integer intervals in text to intervals modeled using appropriate properties
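To illustrate how the interval heuristic could be widened to decimal boundaries, the following Python sketch parses strings such as “1-9” or “2.5 - 7.5”. The function name parse_interval and the decision to reject negative numbers and reversed boundaries are assumptions made for this illustration; they are not part of the rule shown above.

```python
import re
from typing import Optional, Tuple

# Non-negative integer or decimal number (assumption: "." as decimal separator).
NUMBER = r"\d+(?:\.\d+)?"
INTERVAL = re.compile(rf"^\s*({NUMBER})\s*-\s*({NUMBER})\s*$")


def parse_interval(text: str) -> Optional[Tuple[float, float]]:
    """Parse 'min-max' into (min, max); return None if the text is not an interval."""
    match = INTERVAL.match(text)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    if low > high:
        # Reversed boundaries are treated as a data error rather than swapped.
        return None
    return low, high


print(parse_interval("1-9"))
print(parse_interval("2.5 - 7.5"))
```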

Enrichment, unlike data lifting, takes advantage of external resources. In particular, product data can be augmented by additional data from

• product classifications (for an overview of classification systems, see Chapter 5),

• linguistic and lexical databases (e.g. WordNet [Mil95], the Product Types Ontology (PTO)⁵),

• existing Linked Data sources (e.g. WordNet RDF⁶, DBPedia⁷, Freebase⁸, PTO).

⁵ http://www.productontology.org/ (accessed on February 20, 2016)

Besides the higher recall obtained through the additional data from these data sources, an important advantage is the linguistic aspect. The synonyms and translations found in these data sources can help to disambiguate terms whose meanings often conflict, as caused by homonyms and synonyms in natural language.

6.3.8 Unit Conversion and Canonicalization

Quantitative information is often based on different units. The GoodRelations ontology permits publishers to supply unit codes with quantitative values and price specifications, which need to be canonicalized prior to comparing them. In the following, we discuss possible solutions for the conversion of units of measurement and monetary amounts.

6.3.8.1 Conversion of Units of Measurement

For the representation of units of measure, there exist different code standards and ontologies. A popular code standard for units of measurement is the UN/CEFACT Common Codes. GoodRelations, for example, suggests using these three-letter unit codes for representing quantitative values. At the same time, the QUDT⁹ collection of ontologies, developed by TopQuadrant and the National Aeronautics and Space Administration (NASA), also has an attribute for UN/CEFACT Common Codes. QUDT is an ontology that encodes the knowledge needed to convert between different units of measure. Hence, quantitative values expressed in GoodRelations can be converted using the QUDT ontology [Hod+14] and its complementing vocabularies for unit conversion [e.g. Mas+11]. In [AH11, pp. 289–294], Allemang and Hendler give an overview of how unit conversion with QUDT works for GoodRelations.

Listing 6.45 highlights (in N3 syntax) the relevant parts from QUDT needed for the conversion between centimeters and meters. Each unit exhibits a conversion multiplier and an offset to convert it into a corresponding unit. Supplying a dimensional unit type (here: qudt:LengthUnit) makes it possible to establish the necessary linkage between compatible units, as depicted in Figure 6.2. A second unit type is further used to describe whether the corresponding unit represents a base unit or a derived unit.

⁶ http://wordnet-rdf.princeton.edu/ (accessed on June 4, 2014)
⁷ http://dbpedia.org/ (accessed on May 12, 2014)
⁸ http://www.freebase.com/ (accessed on May 12, 2014)
⁹ http://www.qudt.org/ (accessed on September 30, 2014)

@prefix qudt: <http://qudt.org/schema/qudt#> .
@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unit:Centimeter a qudt:LengthUnit, qudt:SIDerivedUnit ;
    qudt:uneceCommonCode "CMT" ;
    qudt:conversionMultiplier "0.01"^^xsd:double ;
    qudt:conversionOffset "0.0"^^xsd:double .

unit:Meter a qudt:LengthUnit, qudt:SIBaseUnit ;
    qudt:uneceCommonCode "MTR" ;
    qudt:conversionMultiplier "1"^^xsd:double ;
    qudt:conversionOffset "0.0"^^xsd:double .

Listing 6.45: Base and derived units in QUDT represented in N3 syntax [adapted from Mas+11]

[Figure 6.2 (diagram): unit:Centimeter is linked via rdf:type to qudt:LengthUnit and qudt:DerivedUnit; unit:Meter is linked via rdf:type to qudt:LengthUnit and qudt:SIBaseUnit.]

Figure 6.2: Base and derived units in QUDT linked via a common type qudt:LengthUnit [based on NAS10]

The formula for the conversion between two compatible values is as follows:

\[
\mathit{value}_{target} = \frac{(\mathit{offset}_{source} + \mathit{value}_{source} \times \mathit{multiplier}_{source}) - \mathit{offset}_{target}}{\mathit{multiplier}_{target}}
\tag{6.1}
\]
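As a worked example of Equation 6.1, the following Python sketch applies the formula to the centimeter and meter definitions from Listing 6.45. The Unit class and the convert function are hypothetical names introduced here. The two temperature units are an additional assumption, with multipliers and offsets derived relative to kelvin rather than copied from QUDT; they are included only to show that the offsets matter for units whose zero points differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    code: str          # UN/CEFACT Common Code
    multiplier: float  # conversion multiplier towards the reference unit
    offset: float      # conversion offset towards the reference unit


def convert(value: float, source: Unit, target: Unit) -> float:
    """Apply Equation 6.1 to convert `value` from `source` to `target`."""
    in_reference_unit = source.offset + value * source.multiplier
    return (in_reference_unit - target.offset) / target.multiplier


# Values as given in Listing 6.45.
CENTIMETER = Unit("CMT", 0.01, 0.0)
METER = Unit("MTR", 1.0, 0.0)

# Assumed values (relative to kelvin), for illustration only.
FAHRENHEIT = Unit("FAH", 5.0 / 9.0, 459.67 * 5.0 / 9.0)
CELSIUS = Unit("CEL", 1.0, 273.15)

print(convert(250.0, CENTIMETER, METER))     # 250 cm expressed in meters
print(convert(212.0, FAHRENHEIT, CELSIUS))   # boiling point of water in degrees Celsius
```

When the target is the base unit (multiplier 1, offset 0), the formula reduces to the multiplication and addition used for canonicalization.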

As said before, GoodRelations and QUDT both refer to the UN/CEFACT Common Code standard. While GoodRelations defines the gr:hasUnitOfMeasurement property, QUDT provides an attribute qudt:uneceCommonCode. The SPARQL CONSTRUCT query in Listing 6.46 outlines a generic conversion of quantitative values to their reference or base units (e.g. centimeters to meters). Executing this query yields new quantitative values carrying the same dimensions as before, but canonicalized to their base units.

Note that the expansion of the shortcut for point values to the long, canonical form of an interval with matching lower and upper boundaries must have been executed prior to applying this heuristic to point values.


PREFIX qudt: <http://qudt.org/schema/qudt#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
    ?s ?prop [ a gr:QuantitativeValue ;
        gr:hasMinValue ?min_value_new ;
        gr:hasMaxValue ?max_value_new ;
        gr:hasUnitOfMeasurement ?base_uom ] .
}
WHERE {
    ?s ?prop ?qv .
    VALUES ?qvt {gr:QuantitativeValue gr:QuantitativeValueFloat gr:QuantitativeValueInteger}
    ?qv a ?qvt ;
        gr:hasMinValue|gr:hasMinValueFloat|gr:hasMinValueInteger ?min_value ;
        gr:hasMaxValue|gr:hasMaxValueFloat|gr:hasMaxValueInteger ?max_value ;
        gr:hasUnitOfMeasurement ?uom .
    ?current_unit a ?unit_type ;
        qudt:uneceCommonCode ?qudt_uom ;
        qudt:conversionMultiplier ?multiplier ;
        qudt:conversionOffset ?offset .
    ?base_unit a qudt:SIBaseUnit, ?unit_type ;
        qudt:uneceCommonCode ?qudt_base_uom .
    FILTER (str(?uom) = str(?qudt_uom) && str(?uom) != str(?qudt_base_uom))
    BIND ((xsd:float(?min_value)*xsd:float(?multiplier))+xsd:float(?offset) AS ?min_value_new)
    BIND ((xsd:float(?max_value)*xsd:float(?multiplier))+xsd:float(?offset) AS ?max_value_new)
    BIND (STRDT(str(?qudt_base_uom), xsd:string) AS ?base_uom) # convert to xsd:string datatype
}

Listing 6.46: Unit conversion of quantitative values in GoodRelations

As an aside, when investigating QUDT we found units in the vocabulary for which no unit code was provided. For the most relevant ones, we thus decided to manually supply the missing axioms, as indicated in Listing 6.47.

@prefix qudt: <http://qudt.org/schema/qudt#> .
@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unit:Inch qudt:uneceCommonCode "INH"^^xsd:string .
unit:Inch qudt:uneceCommonCode "INC"^^xsd:string . # wrongly used unit code for inch
unit:OunceMass qudt:uneceCommonCode "ONZ"^^xsd:string .
unit:Gram qudt:uneceCommonCode "GRM"^^xsd:string .
unit:Kilogram qudt:uneceCommonCode "KGM"^^xsd:string .
unit:Kilometer qudt:uneceCommonCode "KMT"^^xsd:string .
unit:MilePerHour qudt:uneceCommonCode "HM"^^xsd:string .
unit:Liter qudt:uneceCommonCode "LTR"^^xsd:string .

Listing 6.47: Provision of additional axioms that are not covered by QUDT

6.3.8.2 Currency Conversion

In many business applications that could be built on the basis of Linked Open Data (LOD) [cf. Ber06; BHB09; HB11], the conversion of monetary amounts from one currency to another is a much-needed functionality. The ISO 4217 [Int08] standard for currencies currently defines 179 accepted currencies world-wide¹⁰. Despite the existence of currency conversion application programming interfaces (APIs) on the Web, their integration into operations over RDF data is still burdensome and requires proprietary code.

¹⁰ http://www.currency-iso.org/dam/downloads/table_a1.xml (accessed on September 30, 2014) currently lists 179 distinct currencies.

To the best of our knowledge, there exist no RDF-based Web services dedicated to exchange rates. QUDT, as a vocabulary for quantities, units, dimensions and types, covers all of the world’s currencies. Nonetheless, QUDT does not offer currency conversion, because it is a relatively static document, whereas exchange rates change very frequently, at least on a daily basis. For currency conversion, it is thus necessary to call an external Web service that retrieves the most current exchange rates (e.g. by invoking a SPARQL Inferencing Notation (SPIN) function) [cf. Knu09].

To fill this gap, we proposed in a scientific publication [SH13a] a conceptually clean and scalable way to add currency conversion functionality to the Web of Linked Data. In a nutshell, we

1. defined an OWL ontology¹¹ for modeling currency exchange rates in RDF, and

2. put online a RESTful [Fie00] Web service to serve RDF representations populated with the latest currency exchange rates from open Web APIs or data feeds. Our service is available online¹².

Having the exchange rates available as RDF triples makes our approach applicable to standard SPARQL processors. In a blog post, Knublauch [Knu13] picked up our online service to showcase how to implement currency conversion by defining SPIN functions.

¹¹ Exchange Rate Ontology (XRO), prefixed with xro:, and available online at http://purl.org/xro/ (accessed on September 30, 2014)
¹² http://www.currency2currency.org/ (accessed on September 30, 2014)


Listing 6.48 describes a sample currency exchange rate between Euros and U.S. dollars as published by our Web service.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xch_EUR: <http://www.currency2currency.org/EUR#> .
@prefix xro: <http://purl.org/xro/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

xch_EUR:USD a xro:ExchangeRateInfo ;
    rdfs:label "Euro to US Dollar"@en ;
    rdfs:comment "1 EUR = ? USD"@en ;
    xro:base <http://dbpedia.org/resource/Euro> ;
    xro:counter <http://dbpedia.org/resource/United_States_dollar> ;
    xro:rate "1.2436"^^xsd:decimal ;
    xro:inverseRate "0.804117079447"^^xsd:decimal ;
    dcterms:source <http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml> ;
    xro:timeOfConversion "2014-11-16T23:00:00+00:00"^^xsd:dateTime .

Listing 6.48: Example of a populated currency exchange rate instance

In Listing 6.49, we demonstrate a SPARQL CONSTRUCT query for currency conversion with our online service. The query shows how the prices of product offers are canonicalized to a base price expressed in Euros, eventually materialized by the query as an RDF graph consisting of the new price specification. Note that the currencies are modeled as DBPedia currencies (see Listing 6.48), which specify a property dbpprop:isoCode for holding the three-letter ISO 4217 currency codes.

The general formula to convert between prices is

\[
\mathit{price}_{A} = \mathit{rate}_{A2B} \cdot \mathit{price}_{B}
\tag{6.2}
\]

where A and B represent an arbitrary currency pair. In this formula, the price expressed in the currency A ($\mathit{price}_{A}$) is calculated from the price according to a currency B ($\mathit{price}_{B}$). The conversion factor is described by the exchange rate between those two currencies. More precisely, $\mathit{rate}_{A2B}$ means the rate of currency A with regard to currency B. Accordingly, the formal currency conversion of five Euros into US dollars with respect to the exchange rate from Listing 6.48 is:

\[
\mathit{price}_{USD} = \mathit{rate}_{USD2EUR} \cdot \mathit{price}_{EUR} = 1.2436 \cdot 5 = 6.22
\tag{6.3}
\]

PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX xro: <http://purl.org/xro/ns#>
PREFIX dbpprop: <http://dbpedia.org/property/>

CONSTRUCT {
    ?s gr:hasPriceSpecification [ a ?ptype ;
        gr:hasCurrencyValue ?base_price ;
        gr:hasCurrency ?base_code ;
        gr:hasUnitOfMeasurement ?uom ] .
}
WHERE {
    ?s gr:hasPriceSpecification ?ps .
    ?ps a ?ptype ;
        gr:hasCurrency ?code ;
        gr:hasCurrencyValue ?price .
    ?xrate xro:rate ?rate ;
        xro:base [ dbpprop:isoCode ?base_code ] ;
        xro:counter [ dbpprop:isoCode ?counter_code ] .
    # ?uom must be bound here to be carried over; without a unit code in the
    # source, the corresponding template triple is simply omitted
    OPTIONAL { ?ps gr:hasUnitOfMeasurement ?uom }
    FILTER (str(?counter_code) = str(?code) && str(?base_code) = "EUR" && ?rate != 0)
    BIND (?price/?rate AS ?base_price)
}

Listing 6.49: SPARQL CONSTRUCT rule for currency conversion
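The arithmetic behind Equation 6.3 and the division used in Listing 6.49 can be checked with a few lines of Python. This is a sketch only: the function names are hypothetical, the rate is the sample value from Listing 6.48, and rounding to two decimal places is an assumption that does not hold for every currency.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sample rate from Listing 6.48: 1 EUR = 1.2436 USD (base EUR, counter USD).
EUR_TO_USD_RATE = Decimal("1.2436")


def to_counter(base_amount: Decimal, rate: Decimal) -> Decimal:
    """Convert an amount in the base currency into the counter currency."""
    return base_amount * rate


def to_base(counter_amount: Decimal, rate: Decimal) -> Decimal:
    """Convert an amount in the counter currency back into the base currency."""
    if rate == 0:
        raise ValueError("exchange rate must not be zero")
    return counter_amount / rate


def round_money(amount: Decimal) -> Decimal:
    """Round to two decimal places (assumed minor unit)."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


usd = round_money(to_counter(Decimal("5"), EUR_TO_USD_RATE))
eur = round_money(to_base(usd, EUR_TO_USD_RATE))
print(usd, eur)
```

The guard against a zero rate corresponds to the ?rate != 0 condition in the query's FILTER clause.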

6.4 Evaluation

Data cleansing and enrichment are of significant importance for the Web of Linked Data, and in particular for deep product comparison over structured data. So far, we have only dealt with toy examples to show how data cleansing techniques can be used to solve data quality problems. In this section, we demonstrate that the data quality problems are also prevalent in our crawl, and thereby substantiate our work from the previous sections with real numbers.

In the following, we answer these questions with regard to our Web crawl from Chapter 3:

• How many redundantly defined product models are in the crawl?

• How many attributes incompatible with the UN/CEFACT Common Codes standard can be found in the data?

• How many invalid currency codes with respect to the ISO 4217 standard are used?

• How many products did we encounter that lack type information?

• How many products could potentially be linked to their product models?

• How many products could inherit product features from their respective product models?

• How many unit codes for quantitative data are missing?

• How many price specifications do not feature currency codes?

• How many price specifications lack information about value-added taxes?

• How many prices do not exhibit validity durations?

• How many entities use shortcuts (e.g. gr:includes, gr:hasValue) rather than the full modeling pattern?

To find answers to these questions, we executed a number of SPARQL SELECT queries on our crawl data. The absolute frequency of some interesting entities in the crawl corpus is shown in Table 6.2. Our evaluation takes into account two different datasets, namely the full crawl and, whenever an analysis of the full crawl was not feasible (e.g. if the query is too complex), a subset of the crawl. We obtained this subset by drawing a random sample of 100 offerings from the crawl. This gives us the means to test the aforementioned techniques under realistic conditions. Table 6.3 shows the results of our analysis by outlining the problems with their frequency in the data, complemented by a short explanation.

Table 6.2: Statistics of entities in the crawl corpus

| Entity Type | Number of Instances (a) | GoodRelations Concepts |
|---|---|---|
| Product offers | 3,097,631 | gr:Offering class |
| Product items | 2,674,366 | gr:ProductOrService (b), gr:SomeItems, and gr:Individual classes |
| Product models | 72,982 | gr:ProductOrServiceModel class |
| Price specifications | 3,517,854 | gr:UnitPriceSpecification (c) class |
| Quantitative values | 1,525,063 | gr:QuantitativeValue class |
| Quant. float values | 18,705 | gr:QuantitativeValueFloat class |
| Quant. integer values | 627 | gr:QuantitativeValueInteger class |
| Units of measurement | 5,189,941 | gr:hasUnitOfMeasurement property |
| Currencies | 3,379,722 | gr:hasCurrency property |

(a) Note that these statistics somewhat deviate from the statistics reported in Table 3.2 of Chapter 3, because this time we counted distinct instances.
(b) The class gr:ProductOrService was here considered as a product item, although in a strict sense it subsumes classes for product items (gr:SomeItems and gr:Individual) and product models (gr:ProductOrServiceModel).
(c) GoodRelations also defines other types of price specifications [cf. Hep11], whereas we limit ourselves to gr:UnitPriceSpecification for our analysis.

Table 6.3: Data quality problems in the crawl corpus

| Problem | Frequency | Ratio (a) | Description |
|---|---|---|---|
| Redundant product model entity definitions | 167 | 0.23% | EAN comparison based on perfect matches (including identical datatypes) |
| Invalid codes for the unit of measurement | 686 | 0.01% | Instances: “KG”, “” (empty string) |
| Invalid currency codes | 11,014 | 0.33% | Instances: “€”, “RO”, “EURO”, “” (empty string) |
| Missing product type information | 2,575,279 | 96.29% | By contrast, only 99,087 products have type information |
| Missing links between products and make and models | 47,830 | na (b) | Missing links according to identical EANs. In comparison, 71,957 gr:hasMakeAndModel links are available in the crawl right away |
| Products that could inherit product features | 55,233 | 2.07% | Based on the presence of explicit gr:hasMakeAndModel links |
| Missing unit codes for quantitative values | 0 (c) | na | Due to its complexity, the query was executed over a random sample of 100 offering instances only |
| Missing currency codes in price specifications | 167,956 | 4.77% | We counted based on the absence of a gr:hasCurrency relationship |
| No indication of gr:valueAddedTaxIncluded | 962,270 | 27.35% | The price specification lacks a statement about value-added taxes |
| Missing validity durations for prices | 2,471,861 | 70.27% | ... missing gr:validFrom |
| | 679,791 | 19.32% | ... missing gr:validThrough |
| | 678,000 | 19.27% | ... missing both validity start and end |
| gr:includes shortcuts | products: 2,078,480 | 77.72% | gr:includesObject → gr:typeOfGood |
| | product models: 565 | 0.77% | gr:includesObject → gr:typeOfGood → gr:hasMakeAndModel |
| gr:hasValue shortcuts | 18,586 | 1.22% | Shortcut for gr:hasMinValue and gr:hasMaxValue |
| gr:hasValueFloat shortcuts | 18,588 | 99.37% | Shortcut for gr:hasMinValueFloat and gr:hasMaxValueFloat |
| gr:hasValueInteger shortcuts | 0 | 0.00% | Shortcut for gr:hasMinValueInteger and gr:hasMaxValueInteger |

(a) The ratio is calculated with respect to the value for an appropriate entity from Table 6.2.
(b) For entity links it was not useful to calculate a ratio (thus “na” for “not available”), because potentially there could exist m × n links, where m is the number of product models and n the number of products.
(c) This number reflects the analysis of a random sample of 100 offering instances, but in reality the value is expected to be much higher.

Table 6.3 reveals the percentage of invalid currency codes and invalid codes for the units of measurement, together with the invalid instances encountered. However, most of the codes used are correct three-letter unit/currency codes. E.g., we could detect ten different codes for the units of measurement. Of those, nine are correct, i.e. “C62”, “CMT”, “GRM”, “INH”, “KGM”, “LBR”, “MGM”, “MTR”, and “ONZ”. “LBM” does not exist in the UN/CEFACT code table [cf. Uni09a]. Furthermore, we found 36 correct currency codes [cf. Int08], namely “ARS”, “AUD”, “BGN”, “BRL”, “CAD”, “CHF”, “CLP”, “CNY”, “COP”, “CZK”, “DKK”, “EUR”, “GBP”, “HUF”, “IDR”, “ILS”, “INR”, “IRR”, “JPY”, “KES”, “LTL”, “LVL”, “MKD”, “MXN”, “MYR”, “PHP”, “PLN”, “RON”, “RUB”, “SEK”, “TND”, “TRY”, “UAH”, “USD”, “VND”, and “ZAR”. Among them, two have since been replaced by the Euro, i.e. “LTL” (Lithuanian Litas) and “LVL” (Latvian Lats).
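A check of this kind boils down to comparing the codes found in the data against a list of accepted codes. The Python sketch below illustrates the idea. The function name partition_codes is hypothetical, and the two sets are deliberately small: the unit codes are the nine valid codes named above, and the currency set is only an excerpt, so neither represents the full UN/CEFACT or ISO 4217 code tables.

```python
from typing import Iterable, List, Set, Tuple

# The nine valid unit codes observed in the crawl (not the full UN/CEFACT table).
VALID_UNIT_CODES: Set[str] = {
    "C62", "CMT", "GRM", "INH", "KGM", "LBR", "MGM", "MTR", "ONZ",
}

# Excerpt of valid currency codes, for illustration only (not all of ISO 4217).
VALID_CURRENCY_CODES: Set[str] = {"EUR", "USD", "GBP", "CHF"}


def partition_codes(codes: Iterable[str],
                    valid: Set[str]) -> Tuple[List[str], List[str]]:
    """Split the distinct codes into sorted lists of (valid, invalid) codes."""
    distinct = set(codes)
    return sorted(distinct & valid), sorted(distinct - valid)


print(partition_codes(["KGM", "KG", "", "LBM", "C62", "KGM"], VALID_UNIT_CODES))
print(partition_codes(["EUR", "EURO", "RO", "USD"], VALID_CURRENCY_CODES))
```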

In addition, we encountered 62 different language tags in the crawl. Further, we evaluated whether wrong datatypes are present in the data. In the sample of 100 random product offers, we detected the following instances whose literals have wrong datatypes:

ex:Price1 gr:hasCurrencyValue "139.00"^^xsd:string .

ex:Price2 gr:valueAddedTaxIncluded "1"^^xsd:integer .

ex:Price3 gr:valueAddedTaxIncluded "true"^^xsd:string .

The correct instances would be:

ex:Price1 gr:hasCurrencyValue "139.00"^^xsd:float .

ex:Price2 gr:valueAddedTaxIncluded "true"^^xsd:boolean .

ex:Price3 gr:valueAddedTaxIncluded "true"^^xsd:boolean .
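The three corrections above follow a simple pattern: the property determines the expected datatype, and the lexical form is normalized if it can be interpreted under that datatype. A minimal Python sketch of this pattern is given below; the table EXPECTED_DATATYPES and the function fix_literal are hypothetical and cover only the two properties from the example.

```python
from typing import Tuple

# Expected datatypes for the two properties in the example (illustrative subset).
EXPECTED_DATATYPES = {
    "gr:hasCurrencyValue": "xsd:float",
    "gr:valueAddedTaxIncluded": "xsd:boolean",
}


def fix_literal(prop: str, lexical: str, datatype: str) -> Tuple[str, str]:
    """Return a corrected (lexical, datatype) pair, or the input if no fix applies."""
    expected = EXPECTED_DATATYPES.get(prop)
    if expected is None or expected == datatype:
        return lexical, datatype
    if expected == "xsd:boolean":
        normalized = lexical.strip().lower()
        if normalized in ("true", "1"):
            return "true", expected
        if normalized in ("false", "0"):
            return "false", expected
        return lexical, datatype
    if expected == "xsd:float":
        try:
            float(lexical)
        except ValueError:
            return lexical, datatype  # not numeric: leave untouched
        return lexical.strip(), expected
    return lexical, datatype


print(fix_literal("gr:hasCurrencyValue", "139.00", "xsd:string"))
print(fix_literal("gr:valueAddedTaxIncluded", "1", "xsd:integer"))
print(fix_literal("gr:valueAddedTaxIncluded", "true", "xsd:string"))
```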

Our statistics indicate that the data quality problems presented in this chapter are also prevalent in the crawl. We thus conclude that data cleansing and enrichment are essential for the Web as a whole.

6.5 Implementation of a Data Management Web User Interface

For ease of maintenance of the cleansing rules presented in this chapter, we developed a Web user interface for data management. The user interface incorporates three important functionalities, namely

• the loading and unloading of data from RDF files into an RDF store,

• rules for cleansing and the enrichment of data, and

• dynamic rules.


The data management user interface is illustrated in Figure 6.3. Tabs make it possible to switch between the three functions. In the example given, the tab for managing the loading and unloading of data is active. Data can be conveniently added to and removed from a SPARQL endpoint, and a tabular view allows examining the data of every RDF graph currently available in the RDF store.

Figure 6.3: Data management Web user interface

6.5.1 User Interface Tabs

In the following, we describe the three tabs of the user interface in Figure 6.3 in more detail. We arranged them by their natural order of execution (e.g. the data loading precedes the cleansing task).

6.5.1.1 Loading Data

First of all, some data needs to be present in a SPARQL endpoint before it can be cleansed or enriched. The data management section allows adding triples from a set of local RDF files placed in a particular folder. They are uploaded to a SPARQL endpoint using the insert functionality of SPARQL Update queries¹³. Similarly, triples can be deleted from the SPARQL endpoint at any time.

¹³ For selected SPARQL endpoints such as Stardog, Fuseki, and Virtuoso Open Source, we provided implementations relying on their native APIs to obtain better throughput.


6.5.1.2 Cleansing Rules

The next step is to cleanse the data and prepare it for product search. The cleansing and enrichment rules implemented in the prototype for the most part align with the techniques presented earlier in this chapter. After executing the cleansing rules, the products in the SPARQL endpoint are ideally canonicalized for effective querying.

6.5.1.3 Dynamic Rules

In addition to the cleansing rules executed prior to product search, some triples may be added to the RDF graph during the search process. We refer to this functionality as dynamic rules. They could be added in response to user interaction or an operating recommender system, e.g. by taking into account user settings, preferences, past purchases, etc. Up to now, we have implemented a simple mechanism where the user is asked whether to expand the search to products that belong to more generic categories than the current category. It would be straightforward to extend it to user-assisted ontology or instance matching.

6.5.2 Data Management with RDF Graphs

For effective data management, it must be possible to seamlessly add and remove triples from a SPARQL endpoint. In particular, it is a key requirement to always be able to return to the previous or initial state. For this reason, the amendments to the SPARQL endpoint need to be traced in a reliable way.

As data from specific data sources is added to a SPARQL endpoint, the new triples are assigned a named graph [Car+05] reflecting their provenance. The named graph is represented by a Uniform Resource Name (URN) composed of the name of the data source, i.e. the name of the RDF file or the cleansing rule that was applied. An example of such a graph name is urn:0-2-1-unit-conversion. It denotes the RDF graph with RDF triples generated by applying the cleansing rule for unit conversion. This simple technique allows any previously added RDF triples to be deleted comfortably. Nonetheless, it does not support the distinction between novel and prior statements, so conflicting assertions could arise.

In order to avoid conflicting assertions, we add some meta information together with the newly created RDF graph. I.e., we replace an entity by defining an updated entity and deprecate the old entity by attaching some meta information to the new graph. Our mechanism is illustrated in Figure 6.4. More precisely, we use the owl:deprecated annotation property to flag entities as deprecated if they were replaced by a newer version.

“An annotation with the owl:deprecated annotation property and the value equal to "true"^^xsd:boolean can be used to specify that an IRI is deprecated.” [MPP12, Section 5.5]

Hence, instead of replacing RDF triples in an RDF store, we only invalidate the old triples, which makes it possible to revert to the previous state later on. Since the meta information is part of the new graph, it will be discarded as well on deletion of the respective RDF graph.

[Figure 6.4 (diagram), panel (a) urn:original-graph.nt, weight in gram: ex:Product gr:weight ex:Weight; ex:Weight rdf:type gr:QuantitativeValue; ex:Weight gr:hasValue 250; ex:Weight gr:hasUnitOfMeasurement "GRM"^^xsd:string.]

[Figure 6.4 (diagram), panel (b) urn:new-graph.nt, weight in kilogram: ex:Weight owl:deprecated "true"^^xsd:boolean; ex:Product gr:weight ex:Weight2; ex:Weight2 rdf:type gr:QuantitativeValue; ex:Weight2 gr:hasValue 0.25; ex:Weight2 gr:hasUnitOfMeasurement "KGM"^^xsd:string.]

Figure 6.4: Axiom replacement mechanism

In a SPARQL query, data marked as deprecated can be filtered out conveniently, as illustrated in Listing 6.50. The query is intended to be applied to the joint RDF graph in Figure 6.4. Because the query filters out any deprecated entities, it does not match the definition of ex:Weight from Figure 6.4.

SELECT ?value ?uom
WHERE {
    ex:Product gr:weight ?weight .
    ?weight gr:hasValue ?value ;
        gr:hasUnitOfMeasurement ?uom .
    FILTER NOT EXISTS {?weight owl:deprecated true}
}

Listing 6.50: SPARQL SELECT with owl:deprecated
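The interplay of named graphs and deprecation flags can be simulated with a small in-memory model. The Python sketch below is an illustration under simplifying assumptions, not the store used by the prototype: the GraphStore class and the current_weights function are hypothetical, triples are plain tuples, and the data mirrors the two graphs of Figure 6.4.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, object]


class GraphStore:
    """Minimal named-graph store: graph name -> set of triples."""

    def __init__(self) -> None:
        self._graphs = defaultdict(set)

    def add(self, graph: str, triples: Iterable[Triple]) -> None:
        self._graphs[graph].update(triples)

    def remove_graph(self, graph: str) -> None:
        # Dropping a graph also drops the deprecation flags stored in it.
        self._graphs.pop(graph, None)

    def triples(self) -> Set[Triple]:
        return set().union(*self._graphs.values())


def current_weights(store: GraphStore, product: str) -> List[Tuple[object, object]]:
    """Return (value, unit) pairs of non-deprecated weight nodes of `product`."""
    triples = store.triples()
    deprecated = {s for s, p, o in triples if p == "owl:deprecated" and o is True}
    found = []
    for s, p, node in triples:
        if s == product and p == "gr:weight" and node not in deprecated:
            values = [o for s2, p2, o in triples if s2 == node and p2 == "gr:hasValue"]
            units = [o for s2, p2, o in triples
                     if s2 == node and p2 == "gr:hasUnitOfMeasurement"]
            found.extend((v, u) for v in values for u in units)
    return sorted(found)


store = GraphStore()
store.add("urn:original-graph.nt", [
    ("ex:Product", "gr:weight", "ex:Weight"),
    ("ex:Weight", "gr:hasValue", 250),
    ("ex:Weight", "gr:hasUnitOfMeasurement", "GRM"),
])
store.add("urn:new-graph.nt", [
    ("ex:Weight", "owl:deprecated", True),
    ("ex:Product", "gr:weight", "ex:Weight2"),
    ("ex:Weight2", "gr:hasValue", 0.25),
    ("ex:Weight2", "gr:hasUnitOfMeasurement", "KGM"),
])
print(current_weights(store, "ex:Product"))   # only the kilogram value
store.remove_graph("urn:new-graph.nt")
print(current_weights(store, "ex:Product"))   # reverted to the gram value
```

Because the deprecation flag lives in the new graph, deleting that graph restores the original state without any further bookkeeping.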

In a production environment, more advanced approaches for capturing additions and removals can be considered, like deltas on RDF graphs [BC04] or relevant parts of the PubSubHubbub protocol [PM10].


6.5.3 Execution Order of Cleansing Rules

Sometimes, we encounter non-trivial dependencies when executing inference rules for cleansing and enrichment. To give an example, it is safe to interchange the cleansing rules for expanding and fixing numeric values as shown in Listing 6.51, because neither of these rules makes assumptions about the other rule.

# Order (a): expansion first, then fixing numeric values
ex:QV gr:hasValue "11990,0"^^xsd:float .

# 1. expansion:
ex:QV gr:hasMinValue "11990,0"^^xsd:float .
ex:QV gr:hasMaxValue "11990,0"^^xsd:float .

# 2. fix numeric values:
ex:QV gr:hasMinValue "11990.0"^^xsd:float .
ex:QV gr:hasMaxValue "11990.0"^^xsd:float .

# Order (b): fixing numeric values first, then expansion
ex:QV gr:hasValue "11990,0"^^xsd:float .

# 1. fix numeric values:
ex:QV gr:hasValue "11990.0"^^xsd:float .

# 2. expansion:
ex:QV gr:hasMinValue "11990.0"^^xsd:float .
ex:QV gr:hasMaxValue "11990.0"^^xsd:float .

Listing 6.51: Interchangeable execution of two cleansing rules
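That the two rules commute can be verified mechanically by applying them in both orders and comparing the results. The following Python sketch does this for the example data. The two functions are simplified stand-ins for the actual cleansing rules, and the numeric fix assumes that a comma is always a decimal separator (no thousands separators).

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, lexical form of the literal)


def expand_point_values(triples: List[Triple]) -> List[Triple]:
    """Replace each gr:hasValue triple by gr:hasMinValue and gr:hasMaxValue triples."""
    result = []
    for s, p, o in triples:
        if p == "gr:hasValue":
            result.append((s, "gr:hasMinValue", o))
            result.append((s, "gr:hasMaxValue", o))
        else:
            result.append((s, p, o))
    return result


def fix_numeric_values(triples: List[Triple]) -> List[Triple]:
    """Replace a decimal comma by a decimal point in every literal."""
    return [(s, p, o.replace(",", ".")) for s, p, o in triples]


data = [("ex:QV", "gr:hasValue", "11990,0")]
order_a = fix_numeric_values(expand_point_values(data))
order_b = expand_point_values(fix_numeric_values(data))
print(order_a == order_b)
print(order_a)
```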

However, for other cleansing rules that rely on a certain pattern in the data, it is non-trivial. It matters for example whether we replace deprecated classes by their newcounterparts before we execute rules that are relying on these new classes. Similarly,the unit conversion functionality assumes unit values to be expressed as intervals, i.e.modeled as values with lower and upper boundaries. Thus, the rule to convert pointvalues to intervals needs to be executed beforehand.

It is therefore important to arrange cleansing rules in such a way that the execution order respects potential interdependencies between rules.
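One way to derive such an execution order is a topological sort over declared rule dependencies. The following is a minimal sketch, not the mechanism used by our data management interface: the rule names are hypothetical labels for the examples mentioned above, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical rule names; each rule maps to the set of rules that must run before it.
RULE_DEPENDENCIES = {
    "replace_deprecated_classes": set(),
    "rules_using_new_classes": {"replace_deprecated_classes"},
    "convert_point_values_to_intervals": set(),
    "convert_units": {"convert_point_values_to_intervals"},
}


def execution_order(dependencies):
    """Return the rules in an order in which every prerequisite comes first.

    TopologicalSorter raises graphlib.CycleError if the dependencies are cyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(RULE_DEPENDENCIES)
```

Rules without mutual dependencies, such as the two rules from Listing 6.51, may appear in either order.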

6.5.4 Translation versus Canonicalization

In general, there are two possible approaches for unit conversion in a SPARQL endpoint: either on-the-fly translation of values during query execution, or canonicalization. Canonicalization can be accomplished by materializing values in a given base unit (e.g. “KGM”). Compared to real-time translation, it is computationally less expensive for a SPARQL endpoint to compare values based on uniform units than to translate them on the fly. For applications, it is furthermore straightforward to define a function that translates values into their preferred units (e.g. to “GRM”). It is basically one conversion formula that needs to be applied, e.g. the value in “KGM” has to be multiplied by exactly one thousand to obtain the value in “GRM”.
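The two directions, canonicalization to the base unit and translation into an application's preferred unit, can be sketched as follows. The conversion table is a small illustrative excerpt with only the two unit codes mentioned above, and the function names `canonicalize` and `translate_from_base` are assumptions made for this example.

```python
from decimal import Decimal

# Conversion factors to the base unit "KGM" (excerpt for illustration only).
TO_KGM = {
    "KGM": Decimal("1"),
    "GRM": Decimal("0.001"),
}


def canonicalize(value, unit):
    """Materialize a value in the base unit KGM."""
    return Decimal(str(value)) * TO_KGM[unit]


def translate_from_base(value_kgm, target_unit):
    """Translate a canonical KGM value into the unit preferred by an application."""
    return Decimal(str(value_kgm)) / TO_KGM[target_unit]
```

For example, `translate_from_base("1.5", "GRM")` multiplies the canonical value by one thousand, as stated in the text.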


To consolidate various units, our data management interface executes the cleansing rule shown in Listing 6.46 that updates¹⁴ a quantitative value with the one represented in its base unit.

6.6 Conclusion

In this chapter, we have developed a typology of data quality problems for e-commerce and presented techniques to address these problems. We further analyzed the prevalence of these obstacles in the Web crawl from Chapter 3, whereby we confirmed an urgent need for data cleansing on the e-commerce Web of Data. We complemented our work with a data management Web user interface that facilitates the maintainability of data management, data cleansing, and enrichment rules.

The cleansing and enrichment rules introduced in this chapter are not meant to be exhaustive. Without any doubt, there exist many more cleansing rules that could be applied to e-commerce data at different levels of complexity. Besides the rules presented herein, we envisage more complex rules that take advantage of translations in various languages, natural language processing (NLP) techniques, or information extraction tasks. For example, missing price specifications could be obtained by employing a simple extraction rule to create the respective properties from prices in text.

¹⁴ Please note that in the example given in Listing 6.46 and in all previous examples of this chapter, the triple with the owl:deprecated property was omitted, since its rationale was only explained later.

7 Faceted Product Search on the Semantic Web



    7.2.1 Faceted Search
      7.2.2.1 Adaptive Faceted Search
      7.2.2.2 Faceted Search over RDF Data
      7.2.2.3 Faceted Search over Structured E-Commerce Data
      7.2.2.4 Our Approach
  7.3 Adaptive Faceted Search Interface for Product Offers
    7.3.1 Faceted Search User Interface
      7.3.1.1 Keyword Search
      7.3.1.2 Faceted Navigation
    7.3.2 Implementation
    7.3.3 Incremental Search Strategy
    7.3.4 Instance-based Search Filtering
    7.3.5 User Feedback
  7.4 Evaluation
    7.4.1 Impact of Search Specificity on the Size of the Result Set in Product Search
      7.4.1.1 Method
      7.4.1.2 Results
      7.4.1.3 Discussion
    7.4.2 Usability Studies of Faceted Search Interfaces for Products
      7.4.2.1 Method
      7.4.2.2 Results
      7.4.2.3 Discussion
    7.4.3 Proof of Concept with Real Product Data from the Web
  7.5 Conclusion


Linked Open Data (LOD) [e.g. BHB09; HB11] has become a popular paradigm for publishing and consuming data on the Web. In the past few years, a growing amount of e-commerce information has been published either as LOD or as embedded Microdata [cf. Hic13] or Resource Description Framework in Attributes (RDFa) [cf. Adi+13] markup in Hypertext Markup Language (HTML) that can be easily converted into Resource Description Framework (RDF) and combined with LOD sources.

Unfortunately, the usage of such data for product search and comparison remains an open challenge, for the following reasons: First, the products and services are themselves specific and heterogeneous with regard to their relevant characteristics. Second, the search process involves learning about the option space, i.e. it is difficult to formulate queries without knowing how well conceptual elements from the schemas are populated and how much they influence the size and characteristics of the result set. Third, there is also learning about correspondences in the underlying product features, i.e. approximate ontology alignments. For instance, the product features “input voltage” and “supply voltage” might be equivalent in the context of a particular product search, while they are not exactly equivalent in the general case. A human user might discover, refine, and revise such context-bound correspondences in the process of searching. Search interfaces for e-commerce data from the LOD paradigm have so far either tried to consolidate the data into a single set of product features and product categories or confronted the human users with the raw data and its inherent conceptual heterogeneities.

In this chapter, we present an adaptive faceted search interface over RDF data for deep product comparison on the Web that is directly based on the popularity of schema elements in the data and does not rely on a rigid, conceptual schema with hard-wired product features, thereby being suitable for arbitrary product domains and product evolution. We (1) present a proof of concept and demonstrate it with real product data from the Web, (2) provide some preliminary evidence that the product space in a sample dataset narrows down logarithmically by the number of product features used in a query, and (3) show that the usability of our approach is comparable with approaches with hard-wired product features, while improving the depth and breadth of product search and comparison. Our findings suggest that an instance-driven faceted search system for LOD, which dynamically adapts to user requirements and patterns in data, is a promising direction for future search interfaces for e-commerce and other application domains, and a precondition for meaningful product search on the Web of Linked Open Data.



In recent years, companies have increasingly added structured e-commerce data published as Microdata or RDFa markup to HTML Web pages [MP12; MB12; MPB14]. Such product, store, and offer data is primarily based on the GoodRelations and schema.org vocabularies. Although it is mainly provided for major search engines like Google, it forms a promising data source for novel Web applications and services.

Unfortunately, the available means for exploring this giant RDF graph of e-commerce information are limited. The diversity of products and data sources, the inherent learning effects during search, the heterogeneity in terms of data semantics with the resulting need to align data schema elements on the go, and the sparsity of the graph of product information create special requirements for product comparison solutions that are currently not met. On top of the technical challenges, products and services are typically characterized by a vast variety of product features that influence the overall utility of a certain product, trade-offs between such features, and a significant variation in item prices. Consequently, product comparison includes multi-dimensional, non-linear trade-off decisions. In essence, there are at least the following fundamental requirements that a product comparison solution on the Web should fulfill:

1. Multi-dimensional views on products. The complexity and dynamics of products and services necessitate multi-parametric search models based on distinguishing properties and attributes of product entities, which, on the Web of Data, can be realized by considering the structure of the available data. In other words, it is important to allow arbitrary paths for narrowing down the set of candidate products instead of hard-wired, sequential search processes, and to integrate the product feature perspective with the product price perspective at any given time in the search task at hand.

2. Support learning about the option space. Search is an iterative, incremental learning process [e.g. MC10, p. 9] rather than a static, one-shot query. For example, users grasp new information about the option space in every single search turn [Bat89], possibly leading to changes in their price expectations or the ranking of product feature preferences. Thus, users need a way to relax or refine their constraints and preferences based on how those modify the size of the option space.

3. Facilitate incremental, user-driven schema alignment. For product search with incremental learning, it is not only vital to assist in navigating and pruning the option space, but also to actively engage the user in extending and consolidating the schema level of the underlying data. Since users are likely to learn about correspondences in the underlying product features during the user interaction, the approximate alignment of conceptual elements should be integrated in the iterative search process, and be fed back to the graph. E.g., a user interface could ask the user for approval of a possible match between two product features.

4. Take into account the popularity of conceptual elements in the instance data. A user interface that is solely based on the schema elements defined in the underlying ontologies is likely inefficient, because the user lacks information about the availability of matching data (e.g. whether a property is used at all) and the relevance of a constraint on the option space (e.g. whether products differ in that property). Due to a sparsely populated graph of product information on the Web, efficient user interfaces should thus adapt to the actual usage of schema elements in the data rather than be based on predefined, rigid schema definitions.

5. Utilize metrics for the efficiency of the search process. An efficient search interface presents choices to the user that help to quickly narrow down the option space, e.g. by proposing discerning features that partition the option space in the best possible way, or by suggesting properties that promise the highest utility to a given user need. Note that such metrics should not simply look at the effect on the size of the option space (efficient partitioning into subsets), but also at the effects with regard to the quality of the fit to the users’ preference structures (maximize the utility in the economics sense).

Established search approaches fall short of deep product comparison at Web scale. Information retrieval (IR) (e.g. keyword searches over document collections) essentially flattens multi-dimensional product descriptions to a simple, one-dimensional term match. On the other extreme, query formulation with languages like the SPARQL Protocol and RDF Query Language (SPARQL) is generally too complex for the average user and lacks mediation between the conceptual models of the data and the mental models of human users. Other approaches suggested for browsing RDF data (e.g. Tabulator [Ber+06]) are too low-level for serious product search. As a result of these shortcomings, consumers tend to narrow down the set of candidate offers very early in the search process, which bears the risk that potentially interesting product offers are eliminated prematurely. Also, results are highly biased towards a single product or offer dimension (e.g. low prices) [Sac05].

In this chapter, we assume that faceted search [Tun09], a special form of exploratory search [Mar06], is an appropriate approximation for deep product comparison on the Web of Data. Faceted search is well established both in practice (e.g. eBay¹ and Amazon²) and in academia as a way to guide users through option spaces [e.g. Wei+13; FH11; ODD06]. In a nutshell, it constitutes a multi-dimensional interaction paradigm based on facet-value pairs, e.g. product dimensions, where facet views and the result set dynamically adapt to the actual data. That is, irrelevant options are hidden each time a user selects a facet or facet value.

As the key contribution of our work, (1) we propose an instance-driven, adaptive faceted search interface for deep product comparison on the Web of Linked Data. In particular, we describe the main components of the faceted search interface, and detail the iterative, incremental search strategy applied on the product domain. In this regard, we also introduce an instance-based search filtering approach and highlight the role of user feedback in RDF environments. We then evaluate our approach in the following ways: (2) We provide evidence that faceted product search leads to a logarithmic reduction of the result set; (3) we show empirically that our dynamic faceted search interface has no significant negative impact on usability; and (4) we showcase our approach using some real e-commerce data that we have collected from the Web.

The rest of this chapter is structured as follows: In Section 7.2, we compare our work with related works from the literature; Section 7.3 describes our faceted search interface over structured product data; in Section 7.4, we evaluate our approach; and finally, Section 7.5 concludes our work and discusses future directions.


In this section, we summarize the theoretical background of faceted search and point to related research on faceted search interfaces over structured data.

7.2.1 Faceted Search

Faceted search is a multi-dimensional search paradigm based on facet-value pairs. It offers interactive guidance to users via progressively refining a query against a collection of items [Tun09, p. 23; Wei+13]. In practice, in every selection step a user sees only those facet-value pairs for which further interaction is reasonable [e.g. Wei+13]. Thus, faceted search effectively eliminates the risk of hitting dead ends, i.e. the return of empty result sets [Tun09, p. 23].

¹ http://www.ebay.com/ (accessed on January 14, 2015)
² http://www.amazon.com/ (accessed on January 14, 2015)


While the term faceted search is sometimes equated with faceted browsing or faceted navigation [e.g. Wei+13; MC10, p. 95], it is often understood as an interaction paradigm consisting of faceted navigation complemented with keyword search functionality [e.g. Tun09, p. 24]. Substantial research related to faceted search was also conducted under the term dynamic taxonomies [Sac00; Sac05; ST09].

A faceted search interface is based on facets and facet values, or terms [cf. Wei+13]. Facets can be compared to mutually orthogonal categories (i.e. terms cannot appear in multiple categories) [e.g. Yee+03; ODD06; Wei+13], whereas facet values are instances of these categories. In the context of product search, facets represent product dimensions (e.g. features “color” or “material”) and facet values correspond to instances of these dimensions (e.g. “brown” or “wood”). Products are usually represented by multiple facets and facet values. User selections in faceted search interfaces map to boolean expressions for the filtering of the option space. While a selection of multiple facets generally leads to their conjunction (e.g. color “brown” and material “wood”), multiple facet values are mostly combined using disjunction (e.g. color “brown” or “red”) [cf. Hea+02]. In set-theoretic terms, conjunction corresponds to the intersection of items, and disjunction to their union.
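The boolean semantics described above can be illustrated with a few lines of Python. The sketch is purely illustrative: the toy product data and the function name `matching_products` are assumptions and are unrelated to the prototype presented later in this chapter.

```python
# Toy data: product name -> facet -> set of facet values (illustrative only).
PRODUCTS = {
    "table": {"color": {"brown"}, "material": {"wood"}},
    "chair": {"color": {"red"}, "material": {"wood"}},
    "shelf": {"color": {"brown"}, "material": {"metal"}},
}


def matching_products(products, selection):
    """Filter products by a selection of the form facet -> set of chosen values.

    Values of the same facet are combined by disjunction (at least one chosen
    value must be present); different facets are combined by conjunction.
    """
    result = set()
    for name, facets in products.items():
        if all(facets.get(facet, set()) & chosen for facet, chosen in selection.items()):
            result.add(name)
    return result


# (color "brown" or "red") and (material "wood")
hits = matching_products(PRODUCTS, {"color": {"brown", "red"}, "material": {"wood"}})
```

Here, `hits` contains the table and the chair, whereas the shelf is excluded because it fails the material constraint.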

Faceted search interfaces dynamically adapt to changes in selection. In other words, in response to user interaction, the facet views update as the option space shrinks or expands. Furthermore, the faceted navigation paradigm³ does not lead to dead ends or empty results [e.g. FH11], because a user is always presented facets for which instance data is available. Unlike parametric searches, where a user is forced into a sequential search order (e.g. choose camera type, then focal length, after that picture resolution, and finally color), faceted search allows users to drill down into the search space in any preferred order that best suits their individual learning abilities. A few commercially available faceted search solutions include simple mechanisms for indicating the effect of a facet or facet value on the size of the option space.


We position our work at the intersection of human-computer interaction (HCI), Semantic Web, and e-commerce. Accordingly, we deem relevant three research directions, namely (1) adaptive faceted search interfaces, (2) faceted search over RDF data, and (3) faceted product search on the Semantic Web.

³ Please note, however, that keyword searches as part of faceted search interfaces can clearly cause empty results.


7.2.2.1 Adaptive Faceted Search

In adaptive faceted search interfaces, user controls dynamically conform to the underlying data constrained by the current selection. An adaptive faceted search interface was proposed in [Abe+11] to investigate content within Twitter streams. Facets and facet values are computed based on semantic enrichment of Twitter messages. The search interface adapts according to frequency, user profile, temporal context, and diversification. In [Tva11], the author combines approaches from the Semantic Web, the Adaptive Web, and the Social Web. The goal is to facilitate information access on the Web via an adaptive, exploratory search approach relying on multiple search paradigms like keyword search and faceted navigation. Facets are generated automatically and customized based on the user’s individual relevance judgement. Another work related to personalized faceted search over Web document metadata was proposed in [KZL08], where the facet views adapt according to user ratings.

7.2.2.2 Faceted Search over RDF Data

As an easy-to-use alternative to SPARQL querying, faceted search has gained wide traction as a search paradigm for RDF data. Faceted search as a means to navigate over arbitrary datasets with structured data was formalized in [ODD06]. Unlike conventional faceted search, their approach introduces additional expressivity that allows navigating different types of resources (e.g. not only product offers). A similar approach develops a formal model for question answering (QA) based on faceted queries and also considers ontological reasoning [Are+14a]. The work in [FH11] combines the ease-of-use of faceted search with the expressive power of the SPARQL query language. In comparison to the two other works that operate on set operations over resources, this approach provides navigation through query transformations at the syntactic level. Some large-scale faceted search interfaces over real RDF datasets were suggested in [Hah+10] and [Are+14b]. In [Hah+10], the authors built a faceted search interface over structured Wikipedia infobox data (i.e. DBpedia [Aue+07]). The work in [Are+14b] studies limitations of conventional faceted search systems, and presents a faceted search interface over Yago [cf. SKW07] that incorporates full-text search on top of Lucene [cf. MHG10].

7.2.2.3 Faceted Search over Structured E-Commerce Data

A similar approach to ours was suggested in [VvDF12]. The authors demonstrate an implementation of a faceted product search interface over structured e-commerce data from the Web. The data store⁴ presently contains a selection of product offers along with review data from selected online stores. The authors claim that their RDF database can be extended with additional products by submitting the respective Uniform Resource Identifiers (URIs) of Web pages featuring product data in RDFa or Microformats. Albeit constituting a valuable contribution, their research does not address a couple of problems that we are tackling in our work. As the main difference, they only support basic commercial properties of product offers, whereas we provide deep product comparison via product features. Our faceted search interface is more directed towards user-friendliness as it supports all search paradigms (i.e. keyword search, faceted browsing, and instance-based search filtering) throughout the whole search process. Furthermore, unlike their approach that categorizes products into a rigid category structure, our faceted search interface is fully instance-driven.

7.2.2.4 Our Approach

In summary, the work presented in this chapter differs from the aforementioned works in at least one, if not several, of the following dimensions:

• Our application area is the domain of products,

• the search user interface is instance-driven,

• our approach is versatile, i.e. it supports arbitrary e-commerce data in RDF, as long as it adheres to the GoodRelations meta-model,

• the search user interface is designed to be fully iterative, i.e. user interaction feeds knowledge base augmentation and refinement, and

• the implementation is fully RDF- and SPARQL-1.1-based, i.e. constituting a native Semantic Web approach.

Some interesting properties presented in other works that we currently do not support are, among others, personalization and ontology-based adaptive facet generation.

7.3 Adaptive Faceted Search Interface for Product Offers

In the following, we describe an adaptive faceted search interface for product offers over RDF data.

⁴ http://xploreproducts.com/ (accessed on December 30, 2014)


7.3.1 Faceted Search User Interface

In Figure 7.1, we propose a mock-up of a general faceted search interface over e-commerce data. The faceted search interface combines the two interaction paradigms keyword search and faceted navigation. As illustrated in the graphic, the keyword search field is placed prominently at the top of the search interface, and the boxes surrounding the result list in the middle represent the faceted navigation controls. The User Dialog box, displayed on the upper right part, can further serve as a means to non-intrusively incorporate user feedback.

[Figure 7.1 (mock-up), labeled elements: keyword field with Search button; boxes “Product Details” (Feature 1 to Feature 3 with Value 1 and Value 2), “Commercial Properties” (Vendors: Vendor 1, Vendor 2; Price: < 10, 10-20), “Categories” (keyword field with Search button, Category 1 to Category 5), “Additional Config.s” (filter results by products with images; order items by lowest price first), and “User Dialog” (“Feature 1 and Feature 2 seem to be equivalent. Shall I consolidate them?” with Yes and No); paginated “Results” list with product image, name, description, price, and a Details link per product.]

Figure 7.1: Mock-up of a faceted search interface for e-commerce

Figure 7.2 depicts a screenshot of the faceted search prototype that we developed as the main contribution of this chapter. The demonstrated results are based on toy data modeled using the Vehicle Sales Ontology (VSO)⁵. Our online tool⁶ effectively integrates product details, product category information, and commercial properties related to product offers. Thereby, it is possible to conduct deep product comparison without having to overly focus on the price.

The user interface in large part corresponds to the mock-up from Figure 7.1. For every result in the result list, a link is provided that, when clicked, opens a modal window where the full product details show up. Nonetheless, the most interesting facts such as product image, name, description, and price are summarized in the result list. If the result set exceeds ten items, then the remaining results are moved to further result pages that can be accessed via pagination controls [cf. MC10, pp. 110–116]. The area to the left of the result list is dedicated to product-related details and commercial properties. It mainly includes product features and prices, but also manufacturers, vendors, payment options, or business functions. The right part features a category filter, along with additional filter configurations such as excluding product offers without images or reversing the result order. The search user interface makes heavy use of tooltips that help the user better understand the option space.

[Figure 7.2 (screenshot), annotated elements: SPARQL endpoint selection, expandable facets, price ranges, normalized price values, tooltips, product details link, pagination.]

Figure 7.2: Screenshot of our faceted product search prototype

⁵ http://purl.org/vso/ (accessed on January 15, 2015)
⁶ http://www.ebusiness-unibw.org/tools/product-search/ (accessed on January 15, 2015)

Our search interface is instance-driven. By that we mean that it is directly based on the constraints imposed on the underlying RDF data, i.e. it dynamically adapts to the availability of the data, such as the presence of product features or categories. Accordingly, there is no fixed schema necessary for generating the views, but rather the availability of the data determines the appearance. In other words, our approach is able to flexibly cope with structured product data from various data sources. Only the high-level product offer data builds upon the GoodRelations meta-model [Hep08a], which has also been made the core meta-model for e-commerce data in schema.org and is thus officially endorsed by Google, Microsoft, Yahoo!, and Yandex [Guh12]. This allows for useful guided navigation paths even in the absence of richly axiomatized products.

7.3.1.1 Keyword Search

We incorporated two kinds of keyword searches into our faceted search prototype: a first one for product offers, and a second one for searching within product categories. The keyword searches match terms within textual properties attached to objects, which for product offers includes names and descriptions of product offers, instances, and models, and for product categories labels and comments. An autocomplete feature assists the user in finding the right terminology. It is based on a light-weight SPARQL query executed over the product names and product category labels, respectively.

Simple keyword search functionality can be obtained using the SPARQL CONTAINS function [HS13], even though such queries are commonly costly for large datasets, as most SPARQL implementations iterate over all relevant objects. Some SPARQL endpoints thus support operations over optimized full-text indexes built from textual properties in the available data. Such search indexes like Apache Lucene [MHG10] even support wildcard queries or fuzzy string matches based on a given threshold limit [MHG10, pp. 99–101]. If supported by the underlying SPARQL endpoint, our prototype relies on Lucene for keyword searches. Otherwise, it falls back to the much slower SPARQL CONTAINS function.
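The fallback variant based on CONTAINS can be sketched as a small query builder. This is an illustrative sketch rather than the query used by our prototype: the function names, the restriction to gr:name on gr:Offering, the case-insensitive matching via LCASE, and the result limit are all assumptions made for this example.

```python
def escape_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def keyword_query(term, limit=10):
    """Build a SELECT query that matches a keyword against offer names."""
    literal = escape_literal(term.lower())
    return (
        "PREFIX gr: <http://purl.org/goodrelations/v1#>\n"
        "SELECT DISTINCT ?offer ?name WHERE {\n"
        "  ?offer a gr:Offering ;\n"
        "         gr:name ?name .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?name)), "{literal}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )
```

Because the filter has to be evaluated against every candidate name, such a query illustrates why a full-text index is preferable for large datasets.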

7.3.1.2 Faceted Navigation

The exploratory search capability is provided through the faceted navigation controls that complement the keyword searches. Faceted navigation uses boolean constraint filtering based on product dimensions, commercial properties of product offers, and product type information.

As the product features facet view in Figure 7.2 suggests, product features are initially displayed in compact form and expanded to their corresponding values as the user clicks on them. The numbers given in square brackets indicate the quantity of instances in the result set affected by applying the respective filter. A selection of multiple product features leads to their conjunction (logical and), i.e. items to appear in the result list need to match on every selected feature. For the remaining facet views, a disjunctive approach is used (logical or). If, for instance, a payment option has already been selected before, then the user could add a second payment option. Matching candidate offers then have to accept at least one of the selected payment options.

The facet values displayed in the faceted search interface correspond to qualitative, quantitative, and datatype properties in the GoodRelations meta-model. Unlike qualitative or datatype values, which may be implemented with checkboxes that a user can click on, quantitative values require a different treatment. One possible alternative is to group quantitative values into classes given as range intervals (e.g. “$ 0–20”, “$ 20–50”, etc.). Our approach, however, uses a range slider as illustrated in the price view in Figure 7.2. To build the range slider, we need to generate a useful number of classes of equal width. To obtain the class width, the interval between the minimum and maximum value is divided by the number of classes, where the number of classes equals the number of values up to a specified upper limit (e.g. a maximum of 30 classes). The height of the bars is calculated relative to the class with the highest frequency of values, where the scale is logarithmic and the maximum possible height is predefined. By displaying the frequency of values in every class, users can quickly gauge the possible outcomes of their decisions. Even though this approach works well for point values, we need to rely on a heuristic for range intervals: For closed intervals, we consider the lower bound for the classification; and for open intervals, we take whatever value is available.
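The class computation for the range slider can be sketched as follows for point values. The sketch rests on assumptions that the text leaves open: the function name `slider_classes`, the concrete logarithmic scaling via log(1 + count), the default maximum height of 100, and the handling of identical values as a single class are illustrative choices, not necessarily those of our prototype.

```python
import math


def slider_classes(values, max_classes=30, max_height=100.0):
    """Group point values into equal-width classes with log-scaled bar heights."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Number of classes: number of values, capped by the upper limit.
    n = 1 if hi == lo else min(len(values), max_classes)
    width = (hi - lo) / n
    counts = [0] * n
    for v in values:
        # The maximum value falls into the last class instead of opening a new one.
        idx = 0 if width == 0 else min(int((v - lo) / width), n - 1)
        counts[idx] += 1
    peak = max(counts)
    return [
        {
            "lower": lo + i * width,
            "upper": lo + (i + 1) * width,
            "count": count,
            # Height relative to the most frequent class, on a logarithmic scale.
            "height": max_height * math.log1p(count) / math.log1p(peak),
        }
        for i, count in enumerate(counts)
    ]


classes = slider_classes([0, 10, 10, 20])
```

In this toy example, four values yield four classes of width 5, and the class containing the two values of 10 receives the maximum bar height.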

7.3.2 Implementation

From a conceptual point of view, the front-end of the application is backed by data from an RDF store accessible through a SPARQL endpoint, which can be configured via the endpoint selection dropdown menu in Figure 7.2. As of writing this chapter, we had tested our prototype with three different SPARQL endpoints compliant with the World Wide Web Consortium (W3C) SPARQL 1.1 standard, namely Virtuoso Open Source Edition⁷, the Jena-TDB-based Fuseki SPARQL server⁸, and Stardog⁹. Every single facet view of the user interface is generated by executing its own, unique SPARQL query, where each one takes into account the current set of constraints. Facet-value pairs in RDF are represented by properties and instances (or values). Similarly, conjunction of facets and disjunction of facet values within a facet are realized in SPARQL using a sequence of triple patterns and UNION clauses, respectively. The human-readable labels shown throughout the user interface in place of URIs are, where available in the RDF store, extracted from product vocabularies and instance labels.

⁷ http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/ (accessed on May 26, 2014)
⁸ http://jena.apache.org/documentation/serving_data/ (accessed on May 26, 2014)
⁹ http://stardog.com/ (accessed on May 26, 2014)


The front-end of the application was realized using HTML 5 [Hic+14] for static content, JavaScript [Eic05] and jQuery¹⁰ for user interaction, and Twitter Bootstrap¹¹ for page responsiveness. The back-end is based on the Python programming language and the Jinja 2¹² templating engine for generating custom-tailored SPARQL queries that can later be submitted to the SPARQL endpoint. The application can be run on any general-purpose Apache Web server instance where the mod_python and mod_wsgi modules are enabled. If supported by the SPARQL endpoint, keyword searches execute over a full-text search engine such as Lucene [cf. MHG10].
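The mapping of user selections to SPARQL described in this section, i.e. conjunction of facets as a sequence of patterns and disjunction of facet values as UNION clauses, can be sketched as follows. For the sake of a self-contained example, the sketch uses plain Python string formatting instead of the Jinja 2 templates of our back-end. The ex: namespace, the property names, and the function names are hypothetical, and facet values are assumed to be already serialized as valid SPARQL terms.

```python
def facet_patterns(selection, subject="?product"):
    """Translate a selection (property -> list of SPARQL terms) into graph patterns.

    Each facet yields one group; alternative values of a facet are joined by
    UNION, and the groups in sequence are conjunctive.
    """
    groups = []
    for prop, values in selection.items():
        alternatives = [f"{{ {subject} {prop} {value} }}" for value in values]
        groups.append("  " + " UNION ".join(alternatives))
    return "\n".join(groups)


def result_query(selection):
    """Build the query for the result list under the current constraints."""
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT DISTINCT ?product WHERE {\n"
        + facet_patterns(selection)
        + "\n}"
    )


query = result_query({"ex:color": ['"brown"', '"red"'], "ex:material": ['"wood"']})
```

The resulting query contains one UNION group for the two color values followed by a single group for the material, so that a product has to satisfy both facets.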

7.3.3 Incremental Search Strategy

At every search step, the user is presented with dynamic facet and result views that represent the search space pruned according to the current selection. Product search presumably ends when the user is satisfied with the results or is unable to find a path leading to better search results. Until then, the search process may repeatedly switch between multiple search paradigms, as illustrated in Figure 7.3.

Figure 7.3: Incremental search cycle among multiple search paradigms (diagram nodes: Search Start, Keyword Search, Faceted Navigation, Instance-based Search Filtering, User Feedback / Alignments, Search Stop)

The user dialog in faceted search is fundamentally a decision tree problem, where the user interaction steps are branches of the tree. Because the facets are orthogonal to each other, the decision tree can be constructed in any order [ODD06]. However, if we want to optimize the search efficiency for the user, we have to create and, if necessary, update the resulting tree based on a “best split” strategy known from decision tree research in data mining [TSK05, p. 158]. The literature offers popular algorithms that iteratively choose the attributes maximizing the information gain, above all the Iterative Dichotomizer 3 (ID3) algorithm [Qui86] and its extensions C4.5 [Qui93] and ID5R [Utg89]. In this context, [KZL08] mention some popular facet-pair suggestion strategies, namely relying on frequency, probability, and information gain. The authors of [VFK13] further give an overview of different metrics appropriate for product search to help decide which facets should be presented to the user.

¹⁰ http://jquery.com/ (accessed on February 19, 2016)
¹¹ http://getbootstrap.com/ (accessed on February 19, 2016)
¹² http://jinja.pocoo.org/ (accessed on February 19, 2016)

Ideally, a user would thus be presented in every search step with the options (facets and facet values) that best partition the search space. At the time of writing this chapter, the splitting strategy employed by our prototype presented facets and facet values in descending order of their frequency in the data [cf. KZL08]. It is a very simple strategy, and we are well aware that it suffers from some important weaknesses. For example, if all or only very few of the available items exhibit a certain feature or feature-value pair, then the current algorithm tends to over- or underrate its value for the search progress. Similarly, the algorithm is easily misguided when multiple instances of the same feature (or interrelated features) belong to a single item (e.g. the feature-value pairs “varnish: red” and “color: red”).
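The first weakness can be made concrete with a small sketch that contrasts frequency ordering with a split-based score. The binary entropy used here is only an illustrative stand-in for a “best split” criterion and is not part of the prototype; the toy items and all function names are assumptions.

```python
from collections import Counter
from math import log2


def facet_frequencies(items):
    """Count how many items exhibit each (property, value) pair."""
    counts = Counter()
    for features in items:
        counts.update(set(features))
    return counts


def split_entropy(count, total):
    """Binary entropy of the split 'item has the pair' vs. 'item lacks it'."""
    p = count / total
    if p in (0.0, 1.0):
        return 0.0  # a pair shared by all (or no) items does not partition anything
    return -(p * log2(p) + (1 - p) * log2(1 - p))


# Hypothetical toy data: every item is red, fuel types are evenly split.
items = [
    {("color", "red"), ("fuelType", "petrol")},
    {("color", "red"), ("fuelType", "diesel")},
    {("color", "red"), ("fuelType", "petrol")},
    {("color", "red"), ("fuelType", "diesel")},
]
counts = facet_frequencies(items)
by_frequency = sorted(counts, key=lambda pair: (-counts[pair], pair))
by_entropy = sorted(
    counts, key=lambda pair: (-split_entropy(counts[pair], len(items)), pair)
)
print(by_frequency[0], by_entropy[0])
```

Frequency ordering ranks “color: red” first although selecting it leaves the option space unchanged, whereas the entropy-based ordering ranks it last.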

7.3.4 Instance-based Search Filtering

As a complement to faceted search filters over aggregated data, we herein present a novel idea for supporting the incremental search process, namely instance-based search filtering. Sometimes, a user might be looking at the details of a particular product offer (see Figure 7.4) and discover features that he or she intends to consider for the next filtering steps in the search process. On going back to the result list, the user would lose track of the respective features and find them only by chance in the list of displayed product features, namely if they are ranked high with respect to the current items in the option space. Of course, the same holds true for feature values and individuals as well. From a user interaction point of view, we can avoid this level of indirection by letting the user apply filters directly from the product details page (see Figure 7.4). As a nice side effect, this solution facilitates the comparison among similar products via pivoting [Ste+09, pp. 83f.] across a collection of items that share the same features. For this to work well, however, the properties in the RDF graph need to be consolidated first.


Figure 7.4: Screenshot of a product details modal window with instance-based search filtering

7.3.5 User Feedback

For product search with incremental learning, it is not only vital to assist in navigating and pruning the option space, but also to actively engage the user in the search process and to incorporate user feedback. As already depicted in the mock-up in Figure 7.1, a viable approach is to integrate a user dialog in the search interface. This approach goes beyond the traditional relevance feedback methods known from IR (e.g. Rocchio [Roc71; SB90]), where a system would typically consider explicit user feedback (via selection of relevant documents or click feedback) or implicit system feedback (via local or global analysis) to revise a query towards improving query results [BR11, pp. 178–180; cf. Hea+02].

Listing all possibilities of providing explicit user feedback in search systems is outside the scope of this work. Yet, a user interface could, for example, ask the user to approve a possible match between two features. Our current implementation takes user feedback into account in the form of a dialog box that pops up and informs the user about potentially interesting super-concepts of the concept under consideration. This allows the user to expand the search scope.


In an RDF environment, corresponding axioms that reflect this newly gathered knowledge can easily be added to the existing graph of data as named RDF graphs [Car+05], potentially managed on a per-user basis. This way, the newly created named graphs can persist in the RDF store in order to reflect past user experiences, or be cleared at some point, e.g. if the materialized axioms were highly specific to a particular information need and a user intends to restart the search from scratch. The basic idea behind the data management with RDF graphs was already explained in Chapter 6.
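As a minimal sketch of this idea, the following helpers produce SPARQL 1.1 Update requests that add feedback axioms to a per-user named graph and clear that graph again. The helper names, the graph name `urn:feedback:user-42`, and the shop property URI are hypothetical; how the prototype actually issues such updates is not shown here.

```python
OWL_EQUIVALENT_PROPERTY = "<http://www.w3.org/2002/07/owl#equivalentProperty>"


def insert_feedback(graph_uri, triples):
    """Return a SPARQL 1.1 Update request that adds triples to a named graph."""
    body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
    return f"INSERT DATA {{\n  GRAPH <{graph_uri}> {{\n{body}\n  }}\n}}"


def clear_feedback(graph_uri):
    """Return a SPARQL 1.1 Update request that empties the named graph."""
    # SILENT avoids an error if the graph does not exist (yet).
    return f"CLEAR SILENT GRAPH <{graph_uri}>"


# Hypothetical example: a user approves a match between two features.
user_graph = "urn:feedback:user-42"
update = insert_feedback(user_graph, [
    ("<http://shop.example.org/ns#colour>",
     OWL_EQUIVALENT_PROPERTY,
     "<http://purl.org/vso/ns#color>"),
])
print(update)
print(clear_feedback(user_graph))
```

Keeping the axioms in a separate named graph means that restarting the search from scratch only requires clearing that one graph, while the crawled product data remains untouched.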

7.4 Evaluation

This chapter investigates the appropriateness of faceted search interfaces for the Web of Data. To test for two fundamental aspects of search interfaces, namely search efficiency and usability, we first measure the impact of specificity in product search on the size of the result set using a simulation of random walks. Then, we conduct a usability study where we contrast our fully dynamic, data-driven faceted search interface to an altered instance of our search interface with hard-wired product features. Finally, we showcase our faceted product search approach on real e-commerce data from the Web.

7.4.1 Impact of Search Specificity on the Size of the Result Set in Product Search

We simulated a number of product searches to find out how dispersed the search space for products is and how well a faceted search approach performs on average at partitioning the option space.

7.4.1.1 Method

We took a random sample of 875 automobile offers¹³ from the mobile.de car listing Web site. We extracted the product features from the respective Web pages and populated an RDF graph by mapping product features to properties from the VSO ontology¹⁴. For the sake of simplicity, we did not consider quantitative values for our simulation but only qualitative and datatype properties. The variety of qualitative and datatype properties over the whole dataset is shown in Table 7.1.

¹³ More precisely, we took random result page numbers between 1 and 100 for random price ranges between 1 and 100,000 Euros.
¹⁴ http://purl.org/vso/ns (accessed on October 19, 2014)


Table 7.1: Variety of properties and values in an automotive dataset

Property                                        Variety of Values

http://purl.org/vso/ns#bodyStyle                6
http://purl.org/vso/ns#color                    24
http://purl.org/vso/ns#condition                5
http://purl.org/vso/ns#feature                  60
http://purl.org/vso/ns#fuelType                 10
http://purl.org/vso/ns#meetsEmissionStandard    5
http://purl.org/vso/ns#transmission             3

These numbers give a total of 113 possible property-value pairs. From this range of possible property-value combinations, we drew one entry at random and started from there a random walk simulating ten consecutive selection steps. After every selection step, we randomly picked a property-value pair from the reduced option space, which we obtained by issuing a SPARQL query.
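The simulation procedure can be sketched as follows. This is a simplified illustration under stated assumptions: in-memory sets of property-value pairs stand in for the SPARQL queries against the RDF store, the four toy cars and their values are invented, and the function name `random_walk` is hypothetical.

```python
import random


def random_walk(items, steps=10, rng=None):
    """Simulate one search session; return the option-space size after each step."""
    rng = rng or random.Random()
    remaining = list(items)
    selected = set()
    sizes = [len(remaining)]  # step 0: the full option space
    for _ in range(steps):
        # Candidates are the pairs still present in the reduced option space.
        candidates = sorted(set().union(*remaining) - selected) if remaining else []
        if not candidates:
            break  # nothing left to select
        pair = rng.choice(candidates)
        selected.add(pair)
        # Conjunctive filtering: keep only the items that exhibit the chosen pair.
        remaining = [features for features in remaining if pair in features]
        sizes.append(len(remaining))
    return sizes


# Hypothetical toy option space (the study used 875 car offers).
cars = [
    {("fuelType", "petrol"), ("transmission", "manual"), ("color", "red")},
    {("fuelType", "petrol"), ("transmission", "automatic"), ("color", "blue")},
    {("fuelType", "diesel"), ("transmission", "manual"), ("color", "red")},
    {("fuelType", "diesel"), ("transmission", "automatic"), ("color", "black")},
]
walks = [random_walk(cars, rng=random.Random(seed)) for seed in range(100)]
print(walks[0])
```

Because each new pair is drawn from the remaining items, the option space never becomes empty; a walk simply stops early once all pairs of the remaining items have been selected.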

7.4.1.2 Results

Figure 7.5 outlines the results of our simulation. At the beginning (step 0), the option space always entails the full range of 875 car offers. In search step 1, the median already goes down to circa 150 results, i.e. in 50% of the cases the first filtering step eliminates more than 700 of the 875 automobiles. After three product features have been selected, the median of the option space decreases to only three items.

For the sake of simplicity, our random walk does not include UNION clauses, i.e. the disjunctive selection of multiple facet values, which would expand the option space (e.g. selecting a car that offers either manual or automatic transmission). However, we argue that this operation is rare anyway in a real setting where users are seeking interesting product offers.

7.4.1.3 Discussion

We can see clearly from the analysis that the space of possibly matching products decreases logarithmically with the number of features specified in a query. This confirms our assumption that learning about the option space, i.e. relaxing and refining requirements and preferences based on the set of remaining choices, is a critical part of product search interaction. It also highlights that in specific branches of product search, and thus in sparsely populated decision trees, a search interface can benefit from being dynamically generated directly from the data about products and their characteristics.


Figure 7.5: Change of option space with 100 random walk iterations over a decision tree for 875 automobile offers

Of course, the findings presented are currently based on a single sample dataset of 875 cars, albeit one selected randomly from a very significant real dataset of a car sales portal. The effect of the number of features might be less pronounced if we took into account the correlation of features (e.g. that a stronger engine is likely to be found in combination with more seating capacity), from which we deliberately abstracted by selecting the features randomly. We would counter, however, that exactly these correlations between product features are unknown ex ante to a person exploring a product space, which underlines the importance of the learning effect of iterative product search.

7.4.2 Usability Studies of Faceted Search Interfaces for Products

Faceted search interfaces have more recently attracted significant research interest. Various demonstrators, user studies, and evaluations repeatedly attest to their superior usability compared with other search paradigms [e.g. Yee+03; KZL08; ODD06; FH11]. In a survey in [Wei+13], the authors systematically compare faceted search with other popular search paradigms.

Here, we conduct a usability study in order to find out whether our instance-driven search interface has a negative impact on user satisfaction, because hard-wired, consolidated user interfaces found in today’s commercial faceted search applications have the advantage that the facets presented can be based on popular mental models of human users. An instance-driven, adaptive faceted search interface bears the risk of being confusing to users, because the facets being presented and their names may change based on the available data.

7.4.2.1 Method

In order to evaluate that potentially negative effect, we prepared a second variant of our search interface that relies on hard-wired product features¹⁵. As the data to present in our search interfaces, we used a random subset of 25 car offers out of the random sample of 875 car offers from mobile.de.

We set up a usability study based on the System Usability Scale (SUS) [Bro96]. The questionnaire encompasses ten brief questions where each response is represented by a five-point Likert scale ranging from strongly agree to strongly disagree. SUS questions are designed to alternate between positive and negative statements. In addition, we included a gold question to filter out unreliable candidates based on an incorrect response. We placed the gold question at the end of the questionnaire because we feared that participants would otherwise give up too early, as answering it required some effort to look at the information displayed in the search interface. Finally, we asked for optional feedback, which turned out to be valuable for interpreting results in a later analysis. We put the questionnaire online so that participants could test the search interface and answer the questions remotely.
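For reference, the standard SUS scoring rule can be sketched as follows. The coding of responses is an assumption (1 = strongly disagree, 5 = strongly agree, odd-numbered items worded positively), and the function name `sus_score` is hypothetical; the questionnaire tooling used in the study is not shown here.

```python
def sus_score(responses):
    """Compute the SUS score (0 to 100) from ten responses coded 1..5.

    Assumes 1 = strongly disagree and 5 = strongly agree, with odd-numbered
    items worded positively and even-numbered items worded negatively.
    """
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("expected ten responses between 1 and 5")
    total = 0
    for index, response in enumerate(responses, start=1):
        # Positive items contribute (response - 1), negative ones (5 - response).
        total += (response - 1) if index % 2 == 1 else (5 - response)
    return total * 2.5


print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # consistent, favourable answers
print(sus_score([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]))  # same answer to every item
```

A participant who ignores the alternating wording and agrees with every statement ends up at the midpoint of the scale, which is why such response patterns are worth checking.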

We conducted two separate usability studies. The first one we ran with undergraduate students from our university, who specialize in business management or related fields. They were asked to assess the usability of the original search interface and, later, to repeat the same task with the amended search interface. Our second experiment harnessed the crowd workforce of the CrowdFlower platform, similar to the experimental setup in [Liu+12]. In contrast to the student experiment, we ran the usability test for both search interfaces in parallel with two distinct groups of participants.

¹⁵ http://www.ebusiness-unibw.org/tools/product-search-static/ (accessed on February 20, 2015)



7.4.2.2 Results

In the following, we report on the empirical results obtained from the two usability studies, as summarized in Table 7.2.

Table 7.2: Results of SUS experiments

                          Students          Crowdsourcing
                          A        B        A        B

No. participants          39       29       50       50
No. incorrect answers     5        3        13       9
No. answers considered    39       29       37       41
Avg. SUS score            66.54    72.59    65.00    68.75

Usability Experiment with Students. For the students’ ratings, we did not eliminate incorrect answers based on the gold question, because after closely investigating their individual responses we found that they did not fall into the trap of the alternating pattern of the SUS questions. The task completion rate [cf. SL05] for students was thus 34/39 = 87% for search interface A, and 26/29 = 90% for search interface B. From the feedback of the students, we further conclude that their expectations of our academic search interface were sometimes too high, as car sales portals like mobile.de are already very mature.

Our dynamic search interface (search interface A) achieved an average SUS score of 66.54, which is slightly below the average of 68¹⁶, the mean SUS score among 500 system usability studies. Adopting the qualitative “adjective” rating introduced in [BKM09], the search interface is considered “good” (SUS score close to 71.4). By comparison, the alternative, static search interface (search interface B) obtained an average SUS score of 72.59. We stated the following null hypothesis to test the difference in the usability scores for significance:

Null hypothesis. There is no difference among SUS scores for search interfaces A and B obtained through two samples from the same population of students.

A Shapiro-Wilk test [SW65] revealed that we cannot assume that both SUS score samples are normally distributed (p-values of 0.03 and 0.06), thus we compared the two samples using a non-parametric statistical test, i.e. the Wilcoxon rank-sum test [Wil45].

The average usability scores assigned by our students to search interface A (median = 70.00) did not differ significantly from usability scores assigned to search interface B (median = 75.00), W = -1.45, p = 0.15, r = -0.18.

¹⁶ http://www.measuringu.com/sus.php (accessed on January 29, 2015)
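The following sketch shows how a rank-sum comparison and the effect size r can be computed. It is a textbook normal-approximation variant with mid-ranks for ties, not necessarily the exact procedure or software used for the study; the function name and the sample scores are hypothetical, and degenerate input (all scores identical) is not handled.

```python
from math import erfc, sqrt


def rank_sum_test(sample_a, sample_b):
    """Wilcoxon rank-sum test (normal approximation, mid-ranks for ties).

    Returns (z, two-sided p-value, effect size r = z / sqrt(N)).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of the ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_a += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n_a * (n + 1) / 2
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_a - mean) / sqrt(variance)
    p = erfc(abs(z) / sqrt(2))  # two-sided p-value under the normal approximation
    return z, p, z / sqrt(n)


# Hypothetical SUS scores of two small groups.
z, p, r = rank_sum_test([62.5, 70.0, 55.0, 67.5], [75.0, 72.5, 80.0, 65.0])
print(round(z, 2), round(p, 2), round(r, 2))
```

With the values reported above, -1.45 / sqrt(39 + 29) is approximately -0.18, which suggests that the statistic reported as W is the standardized z value and that r was derived as z / sqrt(N).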


Usability Experiment with Crowdsourcing. Unlike in the previous experiment, we only accepted contributions from crowd workers who correctly answered the gold question. The task completion rate [cf. SL05] for crowd workers was 37/50 = 74% for search interface A, and 41/50 = 82% for search interface B.

Search interface A achieved an average SUS score of 65.00, which is below 68, but still “good” according to [BKM09]. Search interface B obtained an average SUS score of 68.75. The null hypothesis below was used to test whether the two usability scores significantly differ:

Null hypothesis. There is no difference among SUS scores for search interfaces A and B obtained through two different samples of crowd workers.

A Shapiro-Wilk test [SW65] revealed that we cannot assume that both SUS score samples are normally distributed (p-values of 0.13 and 0.01), thus we compared the two samples using a non-parametric statistical test, i.e. the Wilcoxon rank-sum test [Wil45].

The average usability scores assigned by the first group of crowd workers to search interface A (median = 65.00) did not differ significantly from usability scores assigned to search interface B by the second group of crowd workers (median = 73.75), W = -1.30, p = 0.19, r = -0.15.

7.4.2.3 Discussion

This analysis shows that, in principle, a fully dynamic search interface directly based on product features found in the data is not systematically less satisfying for users than one based on established, hard-wired product features used in existing car portals. However, we see a small negative effect on usability, which we expected, because the static, hard-wired set of search dimensions allows a higher degree of user familiarity with the terminology and conceptual model of a search interface. We conclude from that small negative effect that a data-driven search interface for products comes at a cost, which must be compensated for by additional gains in precision, recall, and eventually the utility of the finally selected product.

We would also like to stress that a usability-based evaluation of novel search interfaces has a systematic weakness, because it only analyzes how well a user can handle the interface, but not the quality of the choices eventually made (e.g. how well the finally selected product meets the user’s needs). As we have shown in the first part of the evaluation, the sparsity and heterogeneity of the product space indicate that a more precise navigation in the option space can return much better product matches.


7.4.3 Proof of Concept with Real Product Data from the Web

In this section, we provide a proof of concept by presenting a use case of our faceted product search interface with real e-commerce data from the Web.

To set up our product search demo, we conducted another focused Web crawl over Web shops featuring household appliances and added the data to our crawl dataset from Chapter 3. More precisely, we selected shops from the Rakuten Deutschland platform¹⁷ that were classified into categories related to the general topic household. In this way, we obtained 23 Web shops, of which 17 contained GoodRelations markup. Thanks to our earlier research on BMEcat catalogs (see Chapter 4), we were already in possession of high-quality, structured product model master data by BSH¹⁸. We used that data source to augment the product offer data from the Web crawl with product features. Overall, we found matching product models and product offers in 15 Web shops, as illustrated in Table 7.3, where we list all graph names (i.e. Uniform Resource Names (URNs) of named graphs [Car+05]) in the RDF store along with the number of matching items in each Web shop. Twelve shops were already contained in the big crawl, whereas three shops were added using the household crawl (their provenance is discernible based on the slightly different graph naming pattern that we used).

Table 7.3: Web shops with number of matching items

Graph Name                                          Number of Matches

urn:www.outdoorfurniture.ie                         1
urn:kitchenking.tradoria-shop.de                    1
urn:futterkiste.tradoria-shop.de                    2
urn:www.european-gate.com                           1
urn:elektrotresen.tradoria-shop.de                  20
urn:www.ay-versand.de                               3
urn:fairplaysport.tradoria-shop.at                  1
urn:marketplace.b2b-discount.de                     3
urn:data-filter-direkt.rakuten-shop.de.rdfa.nt      4
urn:heimundbuero.tradoria-shop.de                   8
urn:data-portens.rakuten-shop.de.rdfa.nt            3
urn:computeronlineshop.tradoria.de                  2
urn:data-www.staubsaugerbedarf24.de.rdfa.nt         1
urn:www.megashop-express.de                         3
urn:top-und-preiswert.tradoria-shop.de              50

After that, we executed cleansing and consolidation rules on the data, such as expanding gr:includes shortcuts to gr:includesObject patterns (see Section 6.2.5), or converting all price specifications to a common currency “EUR” (see Section 6.3.8.2). Our demonstrator is accessible online¹⁹ by selecting the SPARQL endpoint that contains the data from the household crawl. Figure 7.6 depicts a screenshot snippet of our search interface comprising real Web data from our crawl of household appliances. It becomes immediately apparent that the product categories are missing for this dataset, because neither the Web crawl nor the BMEcat catalog for BSH features a product classification. Furthermore, the figure indicates that four shops are selling the vacuum cleaner with the European Article Number (EAN) “4242003551202”. This example nicely shows the great potential of faceted search interfaces and their appropriateness for deep product comparison, because product offers can be compared directly based on their features, which we obtained by integrating product model data from a manufacturer.

¹⁷ Rakuten Deutschland was formerly known as Tradoria and has been acquired by Rakuten, a Japanese e-commerce company. The platform offers software as a service (SaaS), more precisely shop software as a service.
¹⁸ BSH Hausgeräte GmbH, a manufacturer specializing in household appliances.

Figure 7.6: Screenshot of the search interface with real data from a household crawl

7.5 Conclusion

In this chapter, we have proposed an instance-driven, adaptive faceted search interface as a way to navigate over the sparse graph of LOD for e-commerce on the Web with explicit support for user learning about the option space.

To support the viability of our approach, we have provided evidence that the selection steps in faceted search interfaces drill down the option space logarithmically, and we have shown that the usability loss of our dynamic, data-driven approach in comparison to an alternative with hard-wired product features is not statistically significant. Furthermore, we have demonstrated a proof of concept of our solution with real product data collected from the Web.

¹⁹ http://www.ebusiness-unibw.org/tools/product-search/ (accessed on January 15, 2015)

The small-scale usability study in this chapter also indicates that users seem to have gotten used to search interfaces that expose rigid navigation structures optimized for individual application domains. While this technique works well at a smaller scale, it is not feasible for e-commerce at Web scale over LOD, where diverse and dynamic product domains need to be consolidated. From the insights that we have gained in this chapter, we think that the ideal solution would consist of a search interface that, instead of being tied to the specification of the system or data structure, strives to adapt to user needs on a data-driven basis in the best possible way.

As future work, we envision enhancing our work with more accurate and context-sensitive user dialogs, personalization and diversification of facet and result views (e.g. personalized result ranking that goes beyond the simple sorting of prices), and better algorithms for partitioning the option space that would eventually lead to fewer search iterations. The way our search interface currently presents facet suggestions according to their frequency is still not optimal and needs further elaboration.

8 Discussion and Conclusion

8.1 Summary
8.2 Contributions and Findings
8.3 Impact
8.4 Critical Review and Limitations
    8.4.1 Prevalence and Validity of Web Ontologies for Products
    8.4.2 Product Data Dynamics
    8.4.3 Representativeness of Data Sources
    8.4.4 Faceted Search Interaction and Evaluation
    8.4.5 Scalability of SPARQL Queries over Large RDF Datasets
8.5 Future Work
    8.5.1 Data Collection with External Data Sources
    8.5.2 Ontology Matching
    8.5.3 Personalization and Recommendation
    8.5.4 Machine Learning
    8.5.5 Crowdsourcing
8.6 Conclusion

In this thesis, we have studied product search over structured e-commerce data published online. Our research hypothesis stated that structured product data is appropriate for realizing deep product comparison on the Web. In other words, the Semantic Web promises better integration of distributed and heterogeneous content via resource identifiers and shared, partly formalized conceptual data models than state-of-the-art solutions, and offers data formats and vocabularies for the fine-grained description of products and product offers. This way, the multi-dimensional nature of products can be accommodated, which is a necessary condition for deep product comparison.

In the following, we summarize the main achievements of our work. Then, we outline the individual contributions along with the core findings. Finally, we point to known limitations of our approach, and conclude with a prospective view on open challenges for future work.



8.1 Summary

In the introduction of this thesis, we have characterized the main bottlenecks associated with current product searches on the Web, which include the lack of precision of IR-based approaches over a large body of distributed, unstructured, and heterogeneous Web content, the complex and changing information needs of product searchers, and the insufficient support for user interaction. On that premise, we have put forward the argument that Semantic Web technologies can help to address these problems. The increasing popularity of and the growing interest in e-commerce over the Web have recently generated a considerable amount of useful structured product data piggybacked as Resource Description Framework in Attributes (RDFa) and Microdata in Hypertext Markup Language (HTML) Web pages. This structured data unlocks great potential for multi-parametric, exploratory search paradigms like faceted search.

However, as we have learned in this thesis, it is difficult to obtain a consolidated view over product information on the Web as required by product search. First of all, we have provided evidence that the large graph of structured product information on the Web is sparsely populated. In order to fill these gaps to some extent, we have suggested (a) to integrate product model master data derived from manufacturer datasheets based on strong identifiers, and (b) to foster the annotation of products with product categories from product classification systems. Together, these measures add numerous product features and product type information that facilitate the comparison of product offers. Another important problem is the heterogeneity in the product data, e.g. there is little consensus on the naming of product features or the usage of standards for product type information. In this context, we have proposed a generic and light-weight method to cleanse and enrich Resource Description Framework (RDF) data directly in a SPARQL Protocol and RDF Query Language (SPARQL) endpoint using a set of SPARQL CONSTRUCT rules and graph names. Finally, we have developed an adaptive, instance-driven faceted search interface for product offers over structured e-commerce data. In comparison with state-of-the-art search solutions that often suffer from a lack of precision (e.g. keyword searches typically offer high recall but at the cost of precision) and offer no support for incremental learning, faceted search interfaces empower users to learn about the option space via user interaction along several product dimensions. Hence, through the detailed comparison of products on a much wider range of product dimensions, faceted search helps to reduce the risk that a searcher prunes the option space too early in the search process. Moreover, a faceted search interface that can adapt to the actual data is able to deal with the conceptual gaps commonly found in the large graph of product information on the Web. With respect to future search paradigms for e-commerce data on the Web, we further envision search scenarios where consolidation rules are triggered by user feedback on the go and immediately fed back into the graph of product data.

8.2 Contributions and Findings

In the following, we summarize the core contributions and findings of our thesis:

Contribution 1 (Web Crawl of Product Offer Data). We have designed and implemented an e-commerce spider, which enabled us to conduct a substantial Web crawl over structured e-commerce data starting from a positive list of seed Uniform Resource Identifiers (URIs).

Findings. Our analysis of the crawl of 2,628 Web shops revealed that the graph of product information is sparsely populated, i.e. it exhibits very few product features and is thus of little help for deep product comparison on the Web. Furthermore, by comparing our dataset with an existing, open-source Web crawl (Web Data Commons (WDC)), we found that harvesting product data from Web pages is not straightforward, as the product details are typically located in the deep Web pages of a Web site. We base this conclusion on the fact that, for the same pay-level domains, our depth-first crawl was able to collect significantly more structured product data than available in the WDC dataset.

Contribution 2 (BMEcat Converter for Product Model Master Data). We have defined mappings and developed a command-line tool for converting product catalogs in BMEcat Extensible Markup Language (XML) format [SLK05b] to GoodRelations-compliant [Hep08a] product catalogs for the Semantic Web. We have tested and evaluated our conversion tool with product datasheets from two large manufacturers, i.e. Weidmüller and BSH, the latter of which was further used to analyze the overlap with data from the Web crawl.

Findings. Through the analyses of the two BMEcat-derived datasets, we discovered that product model master data by manufacturers is, at least at the time of writing, much more granular and complete than product data found on the Web. In order to fill the existing gap of sparse and low-quality product details supplied by vendors, we have evaluated the feasibility of equipping retailer product data with high-quality product model master data from manufacturers, and have provided a preliminary estimate of its potential leverage. Based on our insights, our recommendation is to rely on strong identifiers such as European Article Numbers (EANs), Universal Product Codes (UPCs), Global Trade Item Numbers (GTINs), or combinations of brand names and manufacturer part numbers (MPNs), to integrate product model master data into the sparse graph of product data from the Web.
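A minimal sketch of such an identifier-based integration is given below. It assumes that offers and product models are available as plain dictionaries with a `gtin` key; these data structures, the function names, and the feature values are hypothetical. The check digit computation follows the usual GS1 rule, and padding to 14 digits lets EAN-13 and UPC-A codes of the same item compare equal.

```python
def is_valid_gtin(code):
    """Check the GS1 check digit of a GTIN-8/12/13/14 given as a digit string."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    # Weights 3, 1, 3, ... are applied from the rightmost data digit leftwards.
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(digits[:-1])))
    return (10 - total % 10) % 10 == digits[-1]


def normalize_gtin(code):
    """Pad a valid GTIN to 14 digits; return None for invalid codes."""
    return code.zfill(14) if is_valid_gtin(code) else None


def enrich_offers(offers, product_models):
    """Attach manufacturer features to offers that share a valid GTIN."""
    models = {normalize_gtin(m["gtin"]): m for m in product_models}
    models.pop(None, None)  # drop models with invalid identifiers
    enriched = []
    for offer in offers:
        model = models.get(normalize_gtin(offer.get("gtin", "")))
        features = model["features"] if model else {}
        enriched.append({**offer, "features": features})
    return enriched


# Hypothetical data; only the EAN itself is taken from the text.
models = [{"gtin": "4242003551202", "features": {"category": "vacuum cleaner"}}]
offers = [
    {"shop": "urn:shop-a", "gtin": "4242003551202", "price": 199.0},
    {"shop": "urn:shop-b", "gtin": "4242003551203", "price": 189.0},  # bad check digit
]
print(enrich_offers(offers, models))
```

Validating the check digit before matching guards against mistyped identifiers in shop data, which would otherwise fail silently or, worse, match the wrong product model.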


Contribution 3 (Product Type Information from Product Classification Systems). We have proposed a semi-automatic tool for generating Web ontologies from product categorization standards and proprietary product category systems. We have supported our contribution by converting 13 product classification systems of different sizes, scopes, and structures, some of which either comprise or are available in multiple languages. These classification systems provide up to tens of thousands of classes, properties, and instances that can readily be used for annotating and categorizing products.

Findings. The statistics that we have reported on the converted product classification systems indicate a rich array of more or less specific product categories, which would be prohibitively expensive to maintain manually. Due to the inherent context-sensitivity of classification systems, the plain adoption of the subsumption hierarchy from the original, narrow context to the generic context of products and services is usually discouraged unless it introduces no inconsistencies. Where it is feasible, it offers the opportunity for reasoning over products. We have illustrated how product annotations by virtue of product classes render products more visible and discernible on the Web, which ultimately promotes multi-parametric product searches. Also, many of the available standards provide translations in various languages and/or synonyms, which are useful features for improving recall in product searches.

Contribution 4 (Cleansing and Enrichment). We have developed a typology of obstacles for product search, and have sketched techniques to overcome them by mainly relying on SPARQL CONSTRUCT rules. We have incorporated these rules into a prototype for data management with RDF graphs. We have complemented our contribution with some statistics on the prevalence of the obstacles in our Web crawl.

Findings. The statistics on our Web crawl suggest an urgent need for data cleansing on the e-commerce Web of Data: product models are defined redundantly, units of measurement and currency codes are used incorrectly, links between products and product models are missing, product type information is not indicated, and wrong datatypes are assigned. Accordingly, substantial preliminary cleansing effort is necessary in order to enable product search over RDF data.
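To make one of these obstacles concrete, the following sketch repairs incorrectly used currency codes. It is a plain-Python analogue of a cleansing rule and not one of the SPARQL CONSTRUCT rules of this thesis: triples are modeled as simple tuples, and the symbol-to-code table (including the reading of "$" as USD) is an assumption made for illustration. Like a CONSTRUCT rule, the function only produces the corrected statements and leaves the input untouched.

```python
# Assumed mapping from commonly misused values to ISO 4217 codes.
CURRENCY_FIXES = {"€": "EUR", "EURO": "EUR", "$": "USD", "£": "GBP"}


def fix_currency_codes(triples):
    """Yield corrected gr:hasCurrency triples for values that need repair."""
    for subject, predicate, value in triples:
        if predicate != "gr:hasCurrency":
            continue
        code = value.strip().upper()
        code = CURRENCY_FIXES.get(code, code)
        if code != value:
            # Emit only statements that differ from the original data.
            yield (subject, predicate, code)


offers = [
    ("ex:offer1", "gr:hasCurrency", "€"),
    ("ex:offer2", "gr:hasCurrency", "EUR"),
    ("ex:offer3", "gr:hasCurrency", " eur "),
]
corrections = list(fix_currency_codes(offers))
```

Here `corrections` holds repaired statements for `ex:offer1` and `ex:offer3` only, since `ex:offer2` already carries a valid code.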

Contribution 5 (Faceted Product Search). We have suggested an adaptive, faceted search interface over structured e-commerce data as an appropriate approximation for the essential requirements of product search, where the user is ideally able to learn about the option space during the search process. We have used a sample dataset to demonstrate the discriminatory value of faceted search terms using a random walk simulation, and have tested the usability of our search interface using student participants and crowd workers. As a proof of concept, we have shown that our prototype can master real structured product data from the Web.

Findings. Our results indicate that faceted search interfaces are well suited for quickly narrowing down the option space. After only three iterations (i.e. filtering steps), the option space of previously 875 items shrinks to three items on average. Furthermore, our proposed dynamic, instance-driven search interface has proven to be user-friendly, although the user survey pointed at some issues that would need to be solved when migrating from an academic prototype to a commercial product, e.g. the technical terminology used on the interface, the excessive load time, or the suboptimal presentation of facet-value pairs in the search interface.
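The idea of such a random walk over facets can be sketched as follows. This is a simplified illustration on synthetic data with hypothetical facet names, not the simulation used in this thesis, so the sizes it produces are not the figures reported above. In each step a facet that still discriminates between the remaining items is chosen at random, together with one of its observed values.

```python
import random


def random_walk(items, steps, rng):
    """Apply up to `steps` random facet-value filters; return the sizes seen."""
    remaining = list(items)
    sizes = [len(remaining)]
    for _ in range(steps):
        if not remaining:
            break
        # Keep only facets that still take more than one value.
        facets = sorted(
            facet for facet in remaining[0]
            if len({item[facet] for item in remaining}) > 1
        )
        if not facets:
            break
        facet = rng.choice(facets)
        value = rng.choice(sorted({item[facet] for item in remaining}))
        remaining = [item for item in remaining if item[facet] == value]
        sizes.append(len(remaining))
    return sizes


rng = random.Random(42)
# Synthetic option space of 875 items with four hypothetical facets.
cars = [
    {
        "color": rng.choice(["black", "blue", "red", "white"]),
        "fuel": rng.choice(["diesel", "electric", "petrol"]),
        "doors": rng.choice([2, 3, 4, 5]),
        "gearbox": rng.choice(["automatic", "manual"]),
    }
    for _ in range(875)
]
sizes = random_walk(cars, steps=3, rng=rng)
```

Because every chosen facet has at least two values among the remaining items, each filtering step removes at least one item, so `sizes` is strictly decreasing.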

8.3 Impact

As has become apparent from our Web crawl, a large portion of Semantic Web adopters in the product domain are small- to medium-sized Web shops that represent the long tail of the market for products and services [cf. And04]. With standard Web searches, it is very hard for them to communicate their value proposition over the Web, given the limited precision of keyword searches and, at the same time, the enormous marketing expenses of big competitors. By contrast, the semantic annotation of product offers with granular product details from manufacturers helps the average Web shop to increase the visibility of its products almost for free; instead, it can concentrate more on emphasizing the unique value propositions of its product offers. In addition, fine-grained product descriptions enhance the comparability of product items offered by different retailers along multiple product dimensions, which can be demonstrated by the way faceted search interfaces work.

On the basis of the foregoing considerations, we see an important, positive impact of our research on end customers and product vendors. An instance-driven, faceted product search approach essentially helps to bring the value proposition of each vendor more easily to consumers’ attention. Nonetheless, we think that other stakeholders will indirectly benefit as well, e.g. manufacturers and wholesalers through additional sales of their products.

From an economic point of view, successful product searches can help to mitigate search frictions in the market and eventually improve overall economic output by a reduction of price dispersion [cf. GQ02]. To be more precise, product comparison over structured data along multiple product dimensions is able to drastically reduce the search effort and search costs generally needed to find good-fitting candidate offers on the Web.


8.4 Critical Review and Limitations

This thesis has contributed a novel, instance-driven faceted search approach for e-commerce over the Web of Data. Nonetheless, there still remain some shortcomings that we will summarize in the following.

8.4.1 Prevalence and Validity of Web Ontologies for Products

Although we have presented in this thesis a powerful method to derive Web ontologies for products and services from product categorization standards, and delivered ready-to-use online deployments¹ of product ontologies for Classification of Products by Activity (CPA), Common Procurement Vocabulary (CPV), Global Product Classification (GPC), and Klassifikation der Wirtschaftszweige (Engl.: German Classification of Economic Activities) (WZ), their adoption on the Web remains scarce. Products on the Web are, if at all, primarily classified according to light-weight product ontologies, e.g. the Product Types Ontology (PTO)², or informally with proprietary category structures, most notably the Google product taxonomy [Goo13]. Based on our findings, we firmly believe that product ontologies derived from product categorization standards will not become popular on the Web very soon, unless BMEcat catalogs, which make extensive use of product categorization standards, are increasingly exposed to the Web of Linked Data.

It was further outside the scope of this thesis to carry out a systematic, in-depth analysis of whether, or to what extent, the subsumption relationships between classes in the original context also hold for the domain of products and services (see Chapter 5).

8.4.2 Product Data Dynamics

In general, product data tends to become outdated very quickly. Not only does the quantity of product offers fluctuate steadily, but the related terms and conditions might also change on a daily, if not an hourly, basis. These dynamics of product data on the Web pose serious challenges to the data management part of deep product comparison engines.

Furthermore, considerable parts of our work refer to a Web crawl dating back to late 2011/early 2012, when the distribution of Microdata markup with schema.org was still in its infancy. In late 2012, the concepts and properties from the GoodRelations vocabulary were largely integrated into schema.org [Guh12], which has led to an accelerated publication of product data using schema.org in Microdata. Because of this, the data reported in our evaluations has meanwhile become somewhat outdated, as becomes evident when looking at recent statistics about the deployment of structured data on the Web [MPB14].

¹ http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (accessed on September 16, 2014)
² http://www.productontology.org/ (accessed on May 8, 2014)

8.4.3 Representativeness of Data Sources

The proof of concept of our approach was demonstrated using real data collected from the Web, which was matched against the conversion results of a single BMEcat catalog from BSH. However, this small-scale experiment is not unconditionally representative with respect to the general feasibility of our approach, for at least two reasons. First, the data structure and quality of the Web crawl are arguably somewhat biased, as most data is created by Web shop extensions in an almost uniform fashion. This means that certain data quality problems may not surface until more generators of structured data become available on the Web. Second, as the BMEcat catalog matches merely 94 product offers from the Web crawl, we can only provide preliminary evidence that our approach works at large scale. To further strengthen the credibility of our approach, we would have to gather additional BMEcat catalogs from different manufacturers and match them against data from a Web crawl.

8.4.4 Faceted Search Interaction and Evaluation

We have proposed a faceted search interface for product search over structured e-commerce data on the Web. Faceted search essentially allows users to progressively narrow down the option space, relying on multi-dimensional product descriptions as can be provided by the Web of Data. For this purpose, faceted search paradigms use a boolean filtering mechanism where queries are refined and relaxed while navigating the option space. However, products have high-dimensional utility functions with strong non-linear components. For example, an eco-conscious car buyer would readily trade ten percent of a car’s engine power for half its fuel consumption. Classical faceted search cannot meet such requirements. At the same time, it is in the nature of matchmaking (see Chapter 2) to assume multi-dimensional, non-linear utility functions that require the parallel consideration of many dimensions. The result of matchmaking is a ranked list of potentially interesting candidates that to some extent fulfill the requirements. Unlike faceted search, matchmaking is able to return items that are underspecified with respect to a given demand. Accordingly, a search for red convertibles with a mileage of less than 50,000 kilometers and a price lower than 10,000 Euros would most likely match black convertibles that ran 20,000 kilometers and cost 9,000 Euros. Similarly, matchmaking would yield a red convertible that matches on all dimensions but has a price of 10,001 Euros, whereas faceted search is doomed to fail in that case. Nevertheless, we would object to the popular understanding of matchmaking that it requires significant annotation effort in the form of machine-understandable specifications, is computationally expensive, and describes an automatic, largely autonomous process without human intervention, which altogether limits its practicality. Mainly for these reasons, we regard faceted search as an appropriate approximation for product search.
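The convertible example can be replayed in a few lines. The sketch contrasts boolean facet filtering with a deliberately simple notion of match degree, namely the fraction of satisfied constraints. This linear score is an assumption made for illustration; it is neither the matchmaking approach reviewed in Chapter 2 nor a non-linear utility function. The mileage of the red convertible and the third car are invented for the example.

```python
# Demand: red convertible, under 50,000 km, under 10,000 Euros.
CONSTRAINTS = {
    "color": lambda value: value == "red",
    "body": lambda value: value == "convertible",
    "mileage_km": lambda value: value < 50_000,
    "price_eur": lambda value: value < 10_000,
}


def facet_filter(items, constraints):
    """Boolean filtering: keep items that satisfy every constraint."""
    return [
        item for item in items
        if all(test(item[key]) for key, test in constraints.items())
    ]


def match_degree(item, constraints):
    """Fraction of constraints that the item satisfies (0.0 to 1.0)."""
    satisfied = sum(test(item[key]) for key, test in constraints.items())
    return satisfied / len(constraints)


def matchmake(items, constraints):
    """Rank all items by descending match degree instead of filtering."""
    return sorted(
        items, key=lambda item: match_degree(item, constraints), reverse=True
    )


cars = [
    {"color": "black", "body": "convertible", "mileage_km": 20_000, "price_eur": 9_000},
    {"color": "red", "body": "convertible", "mileage_km": 30_000, "price_eur": 10_001},
    {"color": "white", "body": "sedan", "mileage_km": 120_000, "price_eur": 15_000},
]
strict_hits = facet_filter(cars, CONSTRAINTS)  # empty: no car passes all four tests
ranking = matchmake(cars, CONSTRAINTS)         # both convertibles rank above the sedan
```

Both convertibles satisfy three of the four constraints and therefore receive the same degree of 0.75, which shows why a realistic matchmaker needs weighted or distance-based scoring to separate a price that misses by one Euro from a wrong color.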

A crucial requirement for user interaction during product searches is to partition the option space according to a “best split” strategy (see Chapter 7). Our current solution presents facet-value pairs based on the frequency of instance matches in the data, which is suboptimal. Another related challenge is to find a ranking strategy for presenting results to the users, which necessitates the exploration of different ranking strategies beyond the order of prices. One option would be to rank results based on the match degree as reviewed for matchmaking (see Chapter 2).
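One conceivable reading of a “best split” is to offer first the facet whose values divide the remaining items most evenly. The sketch below scores facets by the Shannon entropy of their value distribution. Choosing entropy as the criterion is an assumption made here for illustration and is not necessarily the strategy discussed in Chapter 7; the facet names are hypothetical.

```python
import math
from collections import Counter


def split_entropy(items, facet):
    """Shannon entropy (in bits) of the facet's value distribution."""
    counts = Counter(item.get(facet) for item in items)
    total = len(items)
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


def best_split_facet(items, facets):
    """Return the facet that partitions the items most evenly."""
    return max(facets, key=lambda facet: split_entropy(items, facet))


offers = [
    {"fuel": "diesel", "color": "black"},
    {"fuel": "diesel", "color": "black"},
    {"fuel": "petrol", "color": "black"},
    {"fuel": "petrol", "color": "red"},
]
first_facet = best_split_facet(offers, ["color", "fuel"])
```

In this toy data `fuel` splits the four offers two against two (one bit of entropy), whereas `color` splits them three against one, so `fuel` is proposed first.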

Thanks to the implementation of a truly operational software artifact, we were able to showcase how faceted search interfaces can be used for advanced user interaction over product data in RDF. In a user study, we gained insights into the usability of our tool, but we did not measure user satisfaction, in particular whether the presented ranking places relevant results first with respect to a given information need. Established, objective evaluation metrics for retrieval algorithms that could be useful for this task include precision, recall, F1-measure, binary preference (BPREF), or similar measures (see Chapter 2).
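For reference, the set-based variants of three of these metrics can be computed as sketched below. The function name and the example identifiers are illustrative; rank-aware measures such as BPREF are not covered.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Four returned offers, of which two are among the three relevant ones.
precision, recall, f1 = precision_recall_f1(
    retrieved=["a", "b", "c", "d"], relevant=["a", "c", "e"]
)
```

In this example precision is 2/4, recall is 2/3, and F1 is 4/7.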

8.4.5 Scalability of SPARQL Queries over Large RDF Datasets

Faceted search over RDF data often translates into expensive SPARQL SELECT queries, for instance in the presence of multiple filter constraints that are applied in parallel. SPARQL CONSTRUCT queries, which we used to materialize the results of data cleansing or consolidation rules as RDF data, can also lead to scalability issues, especially if executed over large datasets. The setting presented in our thesis required loading all data from various Web shops into a central, consolidated SPARQL endpoint. From a performance point of view, it would be wiser to distribute the load, e.g. to use several federated SPARQL endpoints with portions of the data. Alternatively, once the technology stack of Linked Data Fragments (LDFs) [Ver+14] becomes more pervasive on the Web, and it can properly cope with even very complex SPARQL queries, we could think of submitting queries relying on the LDF client-server architecture, which takes care of splitting them into chunks of triple patterns, thus hitting a SPARQL endpoint with many cheap, light-weight queries rather than with a single complex one at great cost.
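How facet selections turn into a single query with several constraints can be sketched as follows. The query builder and the `http://example.org/` property IRIs are hypothetical and do not reproduce the queries generated by our prototype; the sketch matches plain string literals only and ignores datatypes and language tags.

```python
def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def facet_query(equals, ranges):
    """Build a SELECT query from value facets and numeric range facets."""
    lines = ["SELECT DISTINCT ?item WHERE {"]
    for prop, value in equals:
        lines.append(f'  ?item <{prop}> "{escape_literal(value)}" .')
    for index, (prop, low, high) in enumerate(ranges):
        # Every range facet adds one triple pattern and one FILTER.
        lines.append(f"  ?item <{prop}> ?v{index} .")
        lines.append(f"  FILTER(?v{index} >= {low} && ?v{index} <= {high})")
    lines.append("}")
    return "\n".join(lines)


query = facet_query(
    equals=[("http://example.org/color", "red")],
    ranges=[
        ("http://example.org/price", 0, 10000),
        ("http://example.org/mileage", 0, 50000),
    ],
)
```

Each additional facet adds another pattern that the endpoint has to join and filter within the same query, which is one way to see why such queries become costly on a single central endpoint.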

8.5 Future Work

In addition to the more concrete limitations mentioned in the previous section, our work has produced several more general ideas for future work. We currently use standard methods and relatively simple heuristics to fulfill our tasks, yet taking into account more advanced techniques from other research areas could bring considerable improvements.

8.5.1 Data Collection with External Data Sources

Within the scope of this work, we have integrated diverse data sources to create a uniform view over e-commerce data on the Web. What is still missing are novel ways to gather additional, granular data for product comparison. This could further increase the visibility of product offers, and pave the way for more advanced features like product recommendations. In particular, we envision three alternative, although complementary methods for data collection:

1. Extend the current data with external knowledge bases, e.g. review data³ [HM07], Freebase⁴ [Bol+08], Open Icecat⁵, DBpedia⁶ [Aue+07], etc.

2. Develop and apply a range of simple, yet powerful heuristics for data lifting in order to extract price details or other quantitative information out of raw text.

3. Employ natural language processing (NLP) techniques, e.g. named entity recognition (NER) and relation extraction, to extract product features from unstructured or semi-structured text, tabular data, etc.

Where available, it also makes sense to utilize the multi-language support provided by many data sources, which is especially common in the context of product ontologies.
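As an illustration of the second method in the list above, a simple data-lifting heuristic for prices might look as follows. The pattern, the small currency table, and the rule for telling thousands separators from decimal separators (a final group of exactly three digits is read as thousands) are assumptions of this sketch; the last one in particular will misread rare values such as "1.234" meant as a decimal number.

```python
import re
from decimal import Decimal

CURRENCIES = {"€": "EUR", "EUR": "EUR", "$": "USD", "USD": "USD"}
_CUR = r"(?:€|\$|EUR|USD)"
_NUM = r"\d+(?:[.,]\d+)*"
# A currency marker either precedes or follows the amount.
PRICE_RE = re.compile(
    rf"(?P<pre>{_CUR})\s?(?P<n1>{_NUM})|(?P<n2>{_NUM})\s?(?P<post>{_CUR})"
)


def parse_amount(token: str) -> Decimal:
    """Interpret separators heuristically, e.g. '1.299,00' as 1299.00."""
    parts = re.split(r"[.,]", token)
    if len(parts) > 1 and len(parts[-1]) != 3:
        # The last group is a decimal fraction, earlier groups are thousands.
        return Decimal("".join(parts[:-1]) + "." + parts[-1])
    return Decimal("".join(parts))


def extract_prices(text: str) -> list:
    """Return (amount, ISO 4217 code) pairs found in raw text."""
    prices = []
    for match in PRICE_RE.finditer(text):
        symbol = match.group("pre") or match.group("post")
        token = match.group("n1") or match.group("n2")
        prices.append((parse_amount(token), CURRENCIES[symbol]))
    return prices


found = extract_prices("Jetzt nur 1.299,00 € statt EUR 1.499")
```

Returning `Decimal` values avoids binary floating-point rounding when the lifted prices are later compared or aggregated.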

³ http://revyu.com/ (accessed on May 12, 2014)
⁴ https://www.freebase.com/ (accessed on May 12, 2014)
⁵ http://www.icecat.biz/ (accessed on April 10, 2015)
⁶ http://dbpedia.org/ (accessed on May 12, 2014)



8.5.2 Ontology Matching

Ontology matching tackles the problem of the semantic heterogeneity of ontologies on the Semantic Web. It computes alignments between related concepts, which can help to improve the effectiveness of product searches. Despite significant research effort in the past, ontology matching is still an open research challenge. For instance, while automatic alignments work fairly well for some application domains, they still perform poorly when applied to large-scale practical applications [ORG15]. In addition, the complexity of the matching problem grows proportionally with the size of the ontologies [SE13]. Thus, additional research has to be devoted to improving the matching quality and efficiency that are necessary to cope with large classification systems from the product domain.

8.5.3 Personalization and Recommendation

Although we have attached importance to personalized and context-aware searches in this thesis, we did not elaborate further on this topic. The consideration of additional metadata can greatly enhance the search experience, for instance by augmenting product search engines with recommender systems that can act on behalf of users. In practice, relevant background knowledge for recommendation algorithms can be obtained from information about similar item descriptions, user profiles, and user demographics, or from historical data about previous searches or past purchases (see Chapter 2). The e-commerce vocabularies on the Web, namely GoodRelations and schema.org, already offer ways to materialize useful axioms for item-based recommendations [TH14]. GoodRelations, for example, defines a rich variety of convenient object properties to express relationships between related products (gr:isSimilarTo, gr:isVariantOf, gr:predecessorOf, gr:successorOf), accessories and spare parts (gr:isAccessoryOrSparePartFor), and consumables (gr:isConsumableFor) [cf. Hep11]. These properties could serve as starting points for product recommendations within product search systems.
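A minimal sketch of how these properties could seed item-based recommendations is given below. Triples are modeled as plain tuples and the `ex:` identifiers are hypothetical; a real system would query an RDF store instead. Both directions are followed because, for example, an accessory points to the product it belongs to rather than the other way round.

```python
RECOMMENDATION_PROPERTIES = (
    "gr:isSimilarTo",
    "gr:isVariantOf",
    "gr:predecessorOf",
    "gr:successorOf",
    "gr:isAccessoryOrSparePartFor",
    "gr:isConsumableFor",
)


def related_products(product, triples):
    """Collect (property, direction, other product) for a given product."""
    related = []
    for subject, predicate, obj in triples:
        if predicate not in RECOMMENDATION_PROPERTIES:
            continue
        if subject == product:
            related.append((predicate, "outgoing", obj))
        elif obj == product:
            related.append((predicate, "incoming", subject))
    return related


graph = [
    ("ex:case", "gr:isAccessoryOrSparePartFor", "ex:phone"),
    ("ex:phone2", "gr:successorOf", "ex:phone"),
    ("ex:phone", "gr:isSimilarTo", "ex:otherPhone"),
    ("ex:phone", "gr:name", "Example phone"),
]
suggestions = related_products("ex:phone", graph)
```

For `ex:phone`, the sketch suggests the case as an accessory, `ex:phone2` as the successor model, and `ex:otherPhone` as a similar product.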

8.5.4 Machine Learning

Machine learning in the context of product data is another interesting line of research for the future. Among the manifold areas of application, one possible approach would be to assist in the classification of unlabeled products relying on supervised learning methods that use training data, or alternatively to employ unsupervised learning to find candidate clusters for new product classes within a collection of products. Machine learning is also a promising enabler for better ontology matching [cf. ORG15]. Yet another application domain for machine learning involves learning to rank, a relatively novel approach with the goal of learning ranking models that can be used for ranking items [e.g. Li11].

8.5.5 Crowdsourcing

Finally, while many of the aforementioned tasks can be done automatically or semi-automatically, human intelligence could be greatly beneficial for improving the outcomes of data cleansing (e.g. solve common data quality problems), NLP (e.g. manually label named entities or relationships), machine learning (e.g. prepare training data for supervised classification), or ontology matching (e.g. approve a potential match, or evaluate the mapping accuracy between two concepts [cf. SE13]). While human intelligence can also be provided by a community or interest group, it is often much easier to temporarily recruit workers from crowdsourcing platforms at a moderate payment. On that premise, many evaluation tasks in our thesis could be enhanced with crowdsourcing, e.g. by asking crowd workers to label product classes as appropriate for a product or not, to test Web sites and user interfaces for usability [Liu+12], as we also did in Chapter 7 for our search interface, or to make relevance judgments for a list of search results returned for a query.

8.6 Conclusion

Deep product comparison at Web scale promises to be one of the next big milestones for e-commerce, given the increasing availability of online structured markup for products and services in recent years. However, as has been shown in this thesis, product search is a non-trivial task that poses special challenges for data management and the design of search interfaces. At the time of writing this thesis, the contribution presented herein is the first serious attempt to develop a faceted search interface for deep product searches with user interaction over a large set of real structured e-commerce data on the Web. Although originally intended as a proposal for e-commerce, selected parts of our approach could just as well be applied to other application domains where searches over structured online data are in high demand, e.g. events, recipes, online library catalogs, and the like.

A User Survey

A.1 System Usability Scale
    A.1.1 Experiment Setup
    A.1.2 Questionnaire
A.2 Instructions
    A.2.1 Student Invitation E-Mails
        A.2.1.1 Variant 1
        A.2.1.2 Variant 2
    A.2.2 Crowdsourcing
A.3 Results
    A.3.1 Student Feedback
        A.3.1.1 Variant 1
        A.3.1.2 Variant 2
    A.3.2 Crowd Workers
        A.3.2.1 Variant 1
        A.3.2.2 Variant 2

To assess the usability of our search system, we prepared two slightly different variants of it:

1. Variant 1: Instance-driven search interface.

2. Variant 2: Rigid search interface with hard-wired product features.

To set up the experiment, some preliminary work was necessary to prepare the data to show up in the demonstrators:

1. Randomly select 50 result pages from mobile.de, each consisting of 20 results (Uniform Resource Identifier (URI) pattern template with random page, minimum price, and maximum price parameters).

2. Crawl individual product pages. We obtained data from 875 (out of 1,000) product pages from mobile.de.


3. Pick 25 automobile offers randomly to make them available in the data management interface, and load them into the Resource Description Framework (RDF) stores of the two demonstrators.

A.1 System Usability Scale

A.1.1 Experiment Setup

We conducted the same experiment with two different groups of people. Experiment 1 was conducted with students from our university, and experiment 2 with crowd workers from the CrowdFlower crowdsourcing platform.

In experiment 1, we asked the same group of students to first evaluate the usability of the first variant of our search interface (variant 1), and then to evaluate the second variant (variant 2).

In experiment 2, we designed a user study with crowd workers:

• The same conditions applied as for the students.

• We simultaneously assigned 50 tasks for variant 1 and 50 tasks for variant 2 to two independent groups of crowd workers.

• We paid ten dollar cents per judgment.

• We only accepted responses by workers from German-speaking countries (i.e. Germany, Austria, and Switzerland).

• We only accepted high-quality judgments in terms of the CrowdFlower platform.

• We only took into account judgments where the golden question (see below) was answered correctly.

A.1.2 Questionnaire

A System Usability Scale (SUS) questionnaire was used to assess the usability of our search systems. It consisted of the following ten questions (administered in German¹; given here in English):

1. I think that I would like to use this system frequently.

2. I found the system unnecessarily complex.

3. I thought the system was easy to use.

4. I think that I would need the support of a technical person to be able to use this system.

5. I found the various functions in this system were well integrated.

6. I thought there was too much inconsistency in this system.

7. I would imagine that most people would learn to use this system very quickly.

8. I found the system very cumbersome to use.

9. I felt very confident using the system.

10. I needed to learn a lot of things before I could get going with this system.

¹ http://minds.coremedia.com/2013/09/18/sus-scale-an-improved-german-translation-questionnaire/ (accessed on January 28, 2015)

Golden question for variant 1

11. What is the power (in kilowatts) of the most expensive product in our system?

Golden question for variant 2

11. What is the mileage of the only one-year-old car (Jahreswagen) in the system?

Furthermore, an optional twelfth question asked for feedback regarding future enhancements to the usability of our search systems (see Section A.3).
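The ten SUS items are conventionally aggregated into a single score between 0 and 100. The sketch below implements this standard scoring scheme, assuming that each item was answered on the usual five-point scale from 1 (strongly disagree) to 5 (strongly agree); the appendix does not restate the scale, so this is an assumption. Golden question 11 and the optional twelfth question do not enter the score.

```python
def sus_score(responses):
    """Standard SUS score (0 to 100) from ten answers on a 1-5 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten answers, each between 1 and 5")
    total = 0
    for number, answer in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if number % 2 == 1 else (5 - answer)
    return total * 2.5


neutral = sus_score([3] * 10)
best = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
```

A participant who answers every item with the neutral value 3 obtains 50, and full agreement with all positive items combined with full disagreement with all negative items yields 100.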

A.2 Instructions

A.2.1 Student Invitation E-Mails

To reach a critical mass of students, we sent out invitation e-mails to student mailing lists of our university (covering roughly 150 students in total).

A.2.1.1 Variant 1

For the first round, we prepared an e-mail with the following content (originally in German).

From: Alex Stolz <[email protected]>
Subject: Evaluation of a product search system
Date: Tue, 27 Jan 2015 16:18:39 +0100
To: <<students>>

Dear students,

I hereby invite you to take part in a short survey on the usability of a product search system.

As part of my research, I have developed a prototype for product search over the World Wide Web, whose usability we would now like to test with users. Since the tool is available online, I can conduct this evaluation remotely.

I would be pleased if many of you took part in the survey. The questionnaire will take you no more than 5-10 minutes. At the same time, you can make a valuable scientific contribution. You can find the instructions and the questionnaire at the following address:

<<url>>

Thank you in advance for your help!

Kind regards,

Alex Stolz

A.2.1.2 Variant 2

For the second round, we sent the e-mail presented below to the same students (originally in German). Please note that even though we reached 29 of the 39 students from the first round, there is no absolute guarantee that only students from the first round participated in the second round, since responses were accepted anonymously. Nonetheless, we trust our results mainly for two reasons: first, we selected students from past courses, so that we already knew each other; and second, the number of participants decreased from the assessment of search interface variant 1 to that of search interface variant 2. Observing the opposite would have been highly suspicious.

From: Alex Stolz <[email protected]>
Subject: Re: Evaluation of a product search system
Date: Fri, 30 Jan 2015 11:53:52 +0100
To: <<students>>

Dear students,

thank you for the ultimately large participation in my survey on the usability of the product search system and for all the constructive criticism I have received.

If you already took part in the previous study (and please only then!), I would ask you to take another 5 minutes of your time. I have now restructured the user interface of the product search system somewhat, so your opinion on the usability of the modified system would be very important to me.

The instructions and the questionnaire can now be found at the following web address:

<<url>>

Thank you very much for your renewed help!

Kind regards,

Alex Stolz

PS: I will not bother you with the evaluation of a third search system, I promise :-)

A.2.2 Crowdsourcing

For the crowdsourcing experiment, we provided task instructions upfront. The instructions presented to crowd workers are given below (originally in German):

Please answer the following 11 short questions to rate the user-friendliness of our product search system.

First visit the Web page given below, where you will find our search prototype. At present, the product catalog contains only vehicle offers from mobile.de (which is why the data, unlike the language of the user interface, is in German), but in principle offers of any product type are possible. Please note that this is an academic prototype that is still under development. Accordingly, queries may take rather long and some information may be displayed incorrectly. We will continue to work on improving these points in the future.

In this short survey, we would like to learn your opinion on the usability of the system. After familiarizing yourself with the user interface, how would you rate the operation of the search system?

A.3 Results

The optional twelfth question,

12. Do you have any other suggestions for further improving the usability of the system? (optional)

led to valuable critiques and feedback for possible improvements to our search interfaces.

A.3.1 Student Feedback

In the following, we list the responses obtained from the students for search interface variant 1 and variant 2.

A.3.1.1 Variant 1

On a mobile phone, the functions (full display of the features, price search) can only be used to a limited extent.

Menu navigation on one side only. Options that can be set on both sides appear cluttered/unfamiliar. With several selectable options (approx. >5), preferably use a drop-down menu. A function to show all results, ideally with “sorting tabs” (price ascending/descending), etc. I like the graphical representation of the distribution of offers within the price range very much.

At the bottom there should be a range of page numbers and not just the number 1. As it is, this easily gives the impression that there is only one page.

Being able to select the individual bars of the price distribution directly. A “rounder” / more user-friendly / more appealing interface.

A button that clears all selection modifiers at once.


Perhaps it is because mobile.de already has a mature and well-known search system (so the source may have been a rather unfortunate choice), but I found the UI rather complicated and not intuitive. The features one can search by are sometimes chosen somewhat inappropriately for the topic of cars, for example the engine displacement. Here the slider makes strange jumps from 1871.466667 ccm to 2542.15385 ccm, which does not really fit and makes the whole thing appear complicated. The short info shown in the results overview does not help much with further selection. Horsepower, engine displacement, and fuel consumption are missing entirely from the overview. Some of the selection criteria are superfluous. Who searches for cars by searching for car dealerships? The design is very plain and resembles a classical search engine rather than an attractive way to search for “my car”.

There should be a button for sorting the results by different categories. I did not find one at first glance.

When the price filter is set to a new range, only vehicles within the given interval should appear. However, nothing changes.

Unfortunately, I cannot handle it at all: the settings and options are buried too deeply, the input and the constant reloading of page elements are confusing, and the search does not respond to simple search terms such as “bmw”. All in all, not really usable in my opinion.

A function for sorting the search results according to particular preferences would be nice, e.g. sortable by price or engine displacement in ascending or descending order.

The constant reloading of the product details bothered me a lot during use. One small change and everything has to be refreshed... that would be no problem if it were fast. I counted 5 seconds while I was waiting (product features)! That is too long (from the university network; from outside it may take even longer). And one should settle on one language: the headings are in English and the options in German, and the selection of, e.g., power in turn has no German unit such as PS or the like.

It would be useful to be able to sort the search suggestions by various criteria (alphabetically, most expensive first, cheapest first, frequently searched topics, etc.).

The option to set different, and then correspondingly consistent, languages for the system would be desirable. At the moment there is a mix of German and English. I also think that most people cannot do anything with the “SPARQL” line, but I assume it only serves testing purposes anyway. It should illustrate, though, that one should not use overly specific terms.

Good idea! I recently finished my own car search. So far, no well-known platform allows searching by number of cylinders; that would be a great feature. Especially with regard to the downsizing trend, a search for, e.g., 6 cylinders and diesel or >4 cylinders generates real added value. Keep it up.

An overview of the number of pages. It would be easier if one could click on the bars in the price overview to get to the vehicle. An option to sort by color. Enable sorting by descending price.

I find it cumbersome that the entire page reloads after every change of settings and that a restriction can only be set afterwards. To set the power range, for example, I have to wait twice until the entire page has reloaded. The search field may not work optimally. When searching for “Porsche” there, no product appears although there apparently is one in the list (see most powerful vehicle).

The system cannot really be used on a mobile phone. The price bar can only be moved with great difficulty, and the complete details for the Dodge could not be viewed either.

- For better clarity, the price should have a period after the thousands digit (25.000,00€)
- When setting the price range, up to 5 decimal places occasionally appeared (2 should be enough ;-) )
- Once the lower limit of the price range has been set equal to the upper limit (e.g., to answer question no. 11), the two limits can no longer be separated or adjusted
- The adjustment bar of the price range is easy to overlook at first glance; the same applies to the price input field below it (it is not recognizable as a text field)
Despite these minor shortcomings, I consider the system an interesting solution that is pleasant to use. I hope the development will be continued & wish you much success with it :-)

1. Personally, I would give the sellers/vendors less space and rather let the radius (by distance in km) decide. In my opinion, this plays a much greater role in a serious vehicle search -> from my own experience!
2. The bar chart for the prices unsettles me a little. Two remarks on this: a) make the respective count visible above each individual bar; b) drop the bar chart, since in my opinion a statement based purely on price is not meaningful in an exact search for a model, unless the search is made so precise that only exactly one model with exactly the same equipment is compared. An example is a click on the category “Cabrio/Roadster”, after which this appears in the bar on the left; I do not find it meaningful: http://www.fotos-hochladen.net/view/beispielcp4x321bj8.jpg (accessed on March 31, 2015)
3. From my search for vehicles over several years, especially motorcycles, here are my constant search criteria, so that they may be given more attention: a) price b) make c) mileage d) power e) distance f) engine displacement (motorcycle) g) make h) additional equipment (ABS)
I hope I could help a little. Good luck with the further work.

The left sub-window “vendors” is of no interest to me; I would leave it out. The bullet-point detail overview for each vehicle is good!!

A.3.1.2 Variant 2

Clear improvement. The many sliders on the left are now somewhat confusing, but overall it has become considerably more usable for me.

I could not read the mileage on the mobile phone because the display, or rather scrolling within the display, still does not work on the iPhone.

The system is practical but nothing “fundamentally new”.

When opening a vehicle, for example the Jahreswagen (one-year-old car), many extras of the vehicle can be read. Because of the quantity and the plain listing, it is hard to get an overview there. I therefore suggest changing the order: first the most important key data (power, mileage, etc.), and only then the optional extras, presented more clearly. One possibility would be to divide the optional extras into categories, e.g., interior, safety, dashboard, etc. This gives a better overview. Overall, I nevertheless find the system very helpful and easy to use.

A button at the top right above the car display panel for sorting by the most relevant categories would be helpful, since I would normally first set which properties I want the car to have and then have the results sorted by price, distance from home, age, etc., according to my current needs.

I do not yet see an immediate, convincing argument that sets this system apart from other sites; this may of course also be down to me.

For the mileage slider, no numbers with up to 5 decimal places (only round numbers, possibly steps of 100 or similar).

When I narrow down the power with the slider and only one car remains, I can no longer change the power, e.g., adjust it downward again. I can only select something such as ABS and deselect it again so that everything is reset to the initial state. Once the power restriction has been selected, it cannot be deselected again the way equipment options can. The performance has improved considerably compared to the first version!

- Allow the desired range of engine power to be set in PS as well
- Allow the date/period of first registration to be set with a slider as well
- For better readability, add a period after the thousands digit of the price (e.g., 25.000,00 €)

Possibly introduce a “back button” that resets the selection in the respective field, such as “Ausstattung” (equipment) or “Kraftstoff” (fuel), and removes all check marks. Otherwise very nice.

I really like that a single click is now sufficient in the list for narrowing down results! This has made the system more user-friendly.

The “Jahreswagen” (one-year-old car) button only appeared after “Vorführwagen” (demonstration car) had been checked.

Clearer presentation of the equipment instead of a very long list.

I would like it if the images on such vehicle platforms could be zoomed.

1. Instead of having to use only the “Close” button to close an opened listing (e.g., the Jahreswagen searched for), I would additionally allow closing it by clicking next to the opened page; this increases user-friendliness.
2. If the system is tested or put into operation on a large scale, a simple listing by “years” for the first registration seems most sensible to me; another bar chart might also fit.
3. I would place “Fahrzeugart” (Jahres, Gebraucht, Neu) on the right side below the “Kategorie” field.
4. Important: The only essential feature one recognizes when looking at the middle column (search results) is the cars. Unfortunately, due to incorrectly placed (or perhaps intentional) information, only the make and model are shown here. At first glance, it is not possible to recognize first registration, km, power, ... Instead, the additional equipment of the car is displayed arbitrarily and without a fixed order; that is, sometimes air conditioning is at the top, sometimes further down, etc. = one cannot compare it. -> For my taste, the equipment should therefore be swapped with the vehicle properties, and the latter should be shown below the vehicle instead of the equipment.
Continued success :)

The first version was clearer and easier to use! One found one's way around more quickly than in the current second version.

Date format preferably DD.MM.YYYY (TT.MM.JJJJ)

A.3.2 Crowd Workers

In the following, we outline the responses obtained from crowd workers for search interface variant 1 and variant 2.

A.3.2.1 Variant 1

Easier filter deselection: with multiple filters, these have to be “clicked away” individually. Normally, filters can be cleared/reset (possibly per category).

Quick tips that appear when the cursor is moved over the function

a bit plain.

no

no, everything is fine

“Gebraucht-” and “Neuwagen” (used and new cars) should be selection items of their own, not just subcategories. I also would have liked to have the individual makes available for selection right away; this was not possible even when selecting “Manufacturers”. The bars on the price slider were also somewhat irritating at first.

no

Less information at once.

nope

The buttons (check boxes) are extremely small and therefore tiring to read.

too cumbersome

no

more pictures, not so static and cluttered

The languages should both be the same. The selection options should all be on the same side, not a few on the left and a few on the right.

No, it is OK as it is so far

The most expensive product has a power of 85.00 [KWT], which, however, is not available for selection in question 11.

The page as a whole still needs to be structured much better.

no

An option for ordering, e.g., starting cheap and getting more expensive, or similar.

A.3.2.2 Variant 2

no

no

No, good system!!!

none

This page does not explain what it does. I found this rather confusing when I saw the page for the first time.

When opening the item page: simply clicking outside the item box to close it would be a good feature.

good

please improve everything

too little content

no, none!

nothing

Move the bar chart down

A close button should always be visible on a detail page. Currently, no button can be found at either the top or the bottom once one has scrolled a bit. One therefore has to scroll all the way up or all the way down again to be able to close the window.

completely rework the presentation

none

Apart from the still relatively unappealing visual design, I found the search system very pleasant and innovative.

efw

no

Everything is okay as it is!!!!!!!!

Search does not work!

No, I like it a lot

I would appreciate being able to see immediately how many pages the offers are spread across.

no

15380.00 [KMT]

I think it is quite OK the way it currently is

B Index of DVD Contents

The attached DVD contains source code, data, and other materials. Below, we reproduce the table of contents of the index file (INDEX.txt) provided on the DVD.

1 datasets

(This directory contains all datasets created and used in the thesis.)

1.1 bmecat

1.1.1 bsh

1.1.2 weidmueller

1.2 crawl

1.2.1 big crawl

1.2.2 household crawl

1.3 pcs

1.4 product search

1.4.1 bsh and household crawl overlap

1.4.2 mobile.de samples

1.4.3 vso toy example

1.5 unit conversion

1.5.1 currencies

1.5.2 units of measure

2 demos

(This directory contains the prototypes and demos developed and presented in the thesis.)

2.1 bmecat conversions

2.1.1 bmecat2goodrelations

2.1.2 bp-feelthedifference

2.1.3 bsh

2.1.4 weidmueller

2.2 grcrawler household

2.3 pcs2owl conversions

2.4 pcs2owl landing page

2.5 product search interfaces

2.5.1 data supply for household crawl demo

2.5.2 product-search

2.5.3 product-search-static

3 evaluations

(This directory contains all source code and data related to the evaluations of the thesis.)

3.1 bmecat leverage

3.2 cleansing

3.3 crawl

3.4 lib

3.5 listings

3.6 pcs2owl

3.6.1 reverse engineering approach

3.6.2 subsumption relationship evaluation

3.7 product search interfaces

3.7.1 decision tree analysis

3.7.2 sus score

3.7.3 usability raw data

3.8 release sizes eclass

4 thesis

(This directory contains the thesis document along with supplementary material.)

4.1 authorship declarations

4.2 publications

5 tools and libraries

(This directory contains the source code of tools and libraries implemented in the thesis.)

5.1 thesis contributions

5.1.1 bmecat2goodrelations

5.1.2 currency2currency

5.1.3 grcrawler

5.1.4 grsnippetgen logreader

5.1.5 mobile.de scraper

5.1.6 pcs2owl

5.1.7 product-search

5.1.8 rdf-translator

5.1.9 rdflib serializers

5.2 third-party software packages

C Online Tools and Web Resources

The work on this thesis led to the publication of numerous tools, services, demonstrators, documentation, and source code repositories on the Web. Their links are listed below (accessed on April 15, 2015):

• http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler

• https://bitbucket.org/alexstolz/grcrawler

• http://wiki.goodrelations-vocabulary.org/Tools/BMEcat2GR

• https://github.com/alexstolz/bmecat2goodrelations

• https://github.com/alexstolz/bmecat2goodrelations/wiki/Usage

• http://www.ebusiness-unibw.org/projects/bmecat2goodrelations/example/

• http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL

• http://www.ebusiness-unibw.org/ontologies/pcs2owl/

• http://www.ebusiness-unibw.org/ontologies/pcs2owl/evaluation/

• https://bitbucket.org/alexstolz/pcs2owl/wiki/Home

• http://www.currency2currency.org/

• https://bitbucket.org/alexstolz/currency2currency

• http://www.ebusiness-unibw.org/tools/product-search/

• http://www.ebusiness-unibw.org/tools/product-search-static/

• https://bitbucket.org/alexstolz/product-search

• http://rdf-translator.appspot.com/

• http://www.stalsoft.com/publications/rdf-translator-TR.pdf

• https://bitbucket.org/alexstolz/rdf-translator

• https://github.com/alexstolz/rdflib-rdfa-serializer

• https://github.com/alexstolz/rdflib-microdata-serializer

Bibliography

Note: Missing publication year information is indicated by superscript “ND” (no date) in the citation label.

[Abe+11] F. Abel, I. Celik, G.-J. Houben, and P. Siehndel: “Leveraging the Semantics of Tweets for Adaptive Faceted Search on Twitter”. In: Proceedings of the 10th International Semantic Web Conference (ISWC 2011). Bonn, Germany, 2011, pp. 1–17.

[ABH11] B. Adida, M. Birbeck, and I. Herman: “Semantic Annotation and Retrieval: Web of Hypertext – RDFa and Microformats”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 5, pp. 157–190.

[Abr14] J. Abraham: Product Information Management: Theory and Practice. Springer International Publishing Switzerland, 2014.

[Adi+08] B. Adida, M. Birbeck, S. McCarron, and S. Pemberton: RDFa in XHTML: Syntax and Processing: A Collection of Attributes and Processing Rules for Extending XHTML to Support RDF. W3C Recommendation 14 October 2008. 2008. url: http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014/ (accessed on May 16, 2014).

[Adi+13] B. Adida, M. Birbeck, S. McCarron, and I. Herman: RDFa Core 1.1 – Second Edition: Syntax and Processing Rules for Embedding RDF through Attributes. W3C Recommendation 22 August 2013. 2013. url: http://www.w3.org/TR/2013/REC-rdfa-core-20130822/ (accessed on May 16, 2015).

[AH11] D. Allemang and J. Hendler: Semantic Web for the Working Ontologist. 2nd ed. Morgan Kaufmann Publishers, 2011.

[Ake70] G. A. Akerlof: “The Market for “Lemons”: Quality Uncertainty and the Market Mechanism”. In: Quarterly Journal of Economics 84 (3) (1970), pp. 488–500.

[AL05] S. Agarwal and S. Lamparter: “SMART – A Semantic Matchmaking Portal for Electronic Markets”. In: Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC 2005). Munich, Germany, 2005, pp. 405–408.

[Ama+09] B. R. Amarnath, T. S. Somasundaram, M. Ellappan, and R. Buyya: “Ontology-based Grid Resource Management”. In: Software Practice and Experience 39 (17) (2009), pp. 1419–1438.

[Ama16] Amazon: Was sind EANs, UPCs, ISBNs und ASINs? 2016. url: http://www.amazon.de/gp/seller/asin-upc-isbn-info.html (accessed on February 3, 2016).

[And04] C. Anderson: “The Long Tail”. In: Wired Magazine 12 (10) (2004), pp. 170–177.

[Ara+01] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepke, and S. Raghavan: “Searching the Web”. In: ACM Transactions on Internet Technology 1 (1) (2001), pp. 2–43.

[Are+14a] M. Arenas, B. Cuenca Grau, E. Kharlamov, S. Marciuska, and D. Zheleznyakov: “Faceted Search over Ontology-enhanced RDF Data”. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014). Shanghai, China, 2014, pp. 939–948.

[Are+14b] M. Arenas, B. Cuenca Grau, E. Kharlamov, S. Marciuska, and D. Zheleznyakov: “Towards Semantic Faceted Search”. In: Poster Proceedings of the 23rd International World Wide Web Conference (WWW 2014), Companion Volume. Seoul, Korea, 2014, pp. 219–220.

[Ari16] Ariba: cXML User’s Guide. Version 1.2.029. 2016.

[Arr63] K. J. Arrow: “Uncertainty and the Welfare Economics of Medical Care”. In: The American Economic Review 53 (5) (1963), pp. 941–973.

[AT05] G. Adomavicius and A. Tuzhilin: “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”. In: IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005), pp. 734–749.

[Aue+07] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives: “DBpedia: A Nucleus for a Web of Open Data”. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007). Busan, Korea, 2007, pp. 722–735.

[AvH08] G. Antoniou and F. van Harmelen: A Semantic Web Primer. 2nd ed. The MIT Press, 2008.

[Bak98] Y. Bakos: “The Emerging Role of Electronic Marketplaces on the Internet”. In: Communications of the ACM 41 (8) (1998), pp. 35–42.

[Bat89] M. J. Bates: “The Design of Browsing and Berrypicking Techniques for the Online Search Interface”. In: Online Information Review 13 (5) (1989), pp. 407–424.

[Bat95] M. Bates: “Models of Natural Language Understanding”. In: Proceedings of the National Academy of Sciences of the United States of America 92 (22) (1995), pp. 9977–9982.

[BC04] T. Berners-Lee and D. Connolly: Delta: An Ontology for the Distribution of Differences between RDF Graphs. Technical Report. MIT Computer Science and Artificial Intelligence Laboratory, 2004.

[BC11] T. Berners-Lee and D. Connolly: Notation3 (N3): A Readable RDF Syntax. W3C Team Submission 28 March 2011. 2011. url: http://www.w3.org/TeamSubmission/2011/SUBM-n3-20110328/ (accessed on May 16, 2014).

[BD08] A. A. Batabyal and G. J. DeAngelo: “To Match or Not to Match: Aspects of Marital Matchmaking under Uncertainty”. In: Operations Research Letters 36 (1) (2008), pp. 94–98.

[Ben23] J. Bentham: An Introduction to the Principles of Morals and Legislation. Clarendon Press, Oxford, 1823.

[Ber+06] T. Berners-Lee, Y. Chen, L. Chilton, D. Connolly, R. Dhanaraj, J. Hollenbach, A. Lerer, and D. Sheets: “Tabulator: Exploring and Analyzing Linked Data on the Semantic Web”. In: Proceedings of the 3rd International Semantic Web User Interaction Workshop (SWUI 2006). Athens, GA, USA, 2006.

[Ber05] T. Berners-Lee: Notation 3 Logic: An RDF Language for the Semantic Web. 2005. url: http://www.w3.org/DesignIssues/Notation3 (accessed on May 15, 2014).

[Ber06] T. Berners-Lee: Linked Data – Design Issues. 2006. url: http://www.w3.org/DesignIssues/LinkedData.html (accessed on May 8, 2014).

[Ber98] T. Berners-Lee: Cool URIs Don’t Change. 1998. url: http://www.w3.org/Provider/Style/URI (accessed on May 13, 2014).

[BF99] T. Berners-Lee and M. Fischetti: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperCollins Publishers, 1999.

[BFM05] T. Berners-Lee, R. T. Fielding, and L. Masinter: Uniform Resource Identifier (URI): Generic Syntax. Request for Comments 3986. 2005. url: http://www.ietf.org/rfc/rfc3986.txt (accessed on May 7, 2014).

[BG14] D. Brickley and R. V. Guha: RDF Schema 1.1. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/ (accessed on May 20, 2014).

[BGG01] R. Bapna, P. Goes, and A. Gupta: “Insights and Analyses of Online Auctions”. In: Communications of the ACM 44 (11) (2001), pp. 42–50.

[Bha+08] R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, and D. Petrelli: “Hybrid Search: Effectively Combining Keywords and Semantic Searches”. In: Proceedings of the 5th European Semantic Web Conference (ESWC 2008). Tenerife, Spain, 2008, pp. 554–568.

[BHB09] C. Bizer, T. Heath, and T. Berners-Lee: “Linked Data – The Story So Far”. In: International Journal on Semantic Web and Information Systems (IJSWIS) 5 (3) (2009), pp. 1–22.

[BHL01] T. Berners-Lee, J. Hendler, and O. Lassila: “The Semantic Web”. In: Scientific American 284 (5) (2001), pp. 34–43.

[Bif+05] A. Bifet, C. Castillo, P.-A. Chirita, and I. Weber: “An Analysis of Factors Used in Search Engine Ranking”. In: Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Chiba, Japan, 2005, pp. 48–57.

[Biz+13] C. Bizer, K. Eckert, R. Meusel, H. Mühleisen, M. Schuhmacher, and J. Völker: “Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis”. In: Proceedings of the 12th International Semantic Web Conference (ISWC 2013). Sydney, Australia, 2013, pp. 17–32.

[BK12] F. Bauer and M. Kaltenböck: Linked Open Data: The Essentials. Vienna, Austria: edition mono/monochrom, 2012.

[BKL09] S. Bird, E. Klein, and E. Loper: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, 2009.

[BKM09] A. Bangor, P. Kortum, and J. Miller: “Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale”. In: Journal of Usability Studies 4 (3) (2009), pp. 114–123.

[BLP98] J. R. Bettman, M. F. Luce, and J. W. Payne: “Constructive Consumer Choice Processes”. In: Journal of Consumer Research 25 (3) (1998), pp. 187–217.

[BM08] D. Beneventano and D. Montanari: “Ontological Mappings of Product Catalogues”. In: Poster Proceedings of the 3rd International Workshop on Ontology Matching (OM 2008). Karlsruhe, Germany, 2008.

[BM10] M. Birbeck and S. McCarron: CURIE Syntax 1.0: A Syntax for Expressing Compact URIs. W3C Working Group Note 16 December 2010. 2010. url: https://www.w3.org/TR/2010/NOTE-curie-20101216/ (accessed on February 9, 2016).

[BM14] D. Brickley and L. Miller: FOAF Vocabulary Specification 0.99. Namespace Document 14 January 2014 - Paddington Edition. 2014. url: http://xmlns.com/foaf/spec/20140114.html (accessed on April 20, 2015).

[Bob+13] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez: “Recommender Systems Survey”. In: Knowledge-Based Systems 46 (2013), pp. 109–132.

[Bol+07] H. Boley, M. Kifer, P.-L. Patranjan, and A. Polleres: “Rule Interchange on the Web”. In: Reasoning Web. Ed. by G. Antoniou, U. Aßmann, C. Baroglio, S. Decker, N. Henze, P.-L. Patranjan, and R. Tolksdorf. Vol. 4636. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, pp. 269–309.

[Bol+08] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor: “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD 2008). Vancouver, BC, Canada, 2008, pp. 1247–1250.

[Bor97] W. N. Borst: “Construction of Engineering Ontologies for Knowledge Sharing and Reuse”. PhD thesis. University of Twente, Enschede, The Netherlands, 1997.

[BP08] D. Berrueta and J. Phipps: Best Practice Recipes for Publishing RDF Vocabularies. W3C Working Group Note 28 August 2008. 2008. url: https://www.w3.org/TR/2008/NOTE-swbp-vocab-pub-20080828/ (accessed on February 19, 2016).

[BP98] S. Brin and L. Page: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. In: Proceedings of the Seventh International World Wide Web Conference (WWW 1998). Brisbane, Australia, 1998, pp. 107–117.

[BPJ02] S. Balasubramanian, R. A. Peterson, and S. L. Jarvenpaa: “Exploring the Implications of M-Commerce for Markets and Marketing”. In: Journal of the Academy of Marketing Science 30 (4) (2002), pp. 348–361.

[BR11] R. A. Baeza-Yates and B. A. Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search. 2nd ed. Addison-Wesley, 2011.

[Bra+08] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau: Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation 26 November 2008. 2008. url: http://www.w3.org/TR/2008/REC-xml-20081126/ (accessed on May 15, 2014).

[Bra14] T. Bray: The JavaScript Object Notation (JSON) Data Interchange Format. Request for Comments 7159. 2014. url: http://www.ietf.org/rfc/rfc7159.txt (accessed on May 16, 2014).

[Bra83] R. J. Brachman: “What IS-A Is and Isn’t: An Analysis of Taxonomic Links in Semantic Networks”. In: IEEE Computer 16 (10) (1983), pp. 30–36.

[Bri06] British Computer Society: Isn’t It Semantic? Interview. 2006. url: http://www.bcs.org/content/conWebDoc/3337 (accessed on May 16, 2014).

[Bri79] E. Brill: “A Simple Rule-based Part of Speech Tagger”. In: Proceedings of the Third Conference on Applied Natural Language Processing (ANLC 1992). Trento, Italy, 1992, pp. 152–155.

[Bro96] J. Brooke: “SUS – A Quick and Dirty Usability Scale”. In: Usability Evaluation in Industry. Ed. by P. Jordan, B. Thomas, B. Weerdmeester, and I. McClelland. Taylor & Francis, 1996, pp. 189–194.

[Bru+07] J.-S. Brunner, L. Ma, C. Wang, L. Zhang, D. C. Wolfson, Y. Pan, and K. Srinivas: “Explorations in the Use of Semantic Web Technologies for Product Information Management”. In: Proceedings of the 16th International World Wide Web Conference (WWW 2007). Banff, Alberta, Canada, 2007, pp. 747–756.

[BS00] E. Brynjolfsson and M. Smith: “Frictionless Commerce? A Comparison of Internet and Conventional Retailers”. In: Management Science 46 (4) (2000), pp. 563–585.

[BSG99] P. Bingi, M. K. Sharma, and J. K. Godla: “Critical Issues Affecting an ERP Implementation”. In: Information Systems Management 16 (3) (1999), pp. 7–14.

[BSV12] F. Branco, M. Sun, and J. M. Villas-Boas: “Optimal Search for Product Information”. In: Management Science 58 (11) (2012), pp. 2037–2056.

[Bui+13] C. Buil-Aranda, M. Arenas, O. Corcho, and A. Polleres: “Federating Queries in SPARQL 1.1: Syntax, Semantics and Evaluation”. In: Web Semantics: Science, Services and Agents on the World Wide Web 18 (1) (2013), pp. 1–17.

[Bur07] R. Burke: “Hybrid Web Recommender Systems”. In: The Adaptive Web. Ed. by P. Brusilovsky, A. Kobsa, and W. Nejdl. Vol. 4321. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007. Chap. 12, pp. 377–408.

[Bus45] V. Bush: “As We May Think”. In: The Atlantic Monthly 176 (1) (1945), pp. 101–108.

[BV04] C. Buckley and E. M. Voorhees: “Retrieval Evaluation with Incomplete Information”. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004). Sheffield, UK, 2004, pp. 25–32.

[Car+05] J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler: “Named Graphs”. In: Journal of Web Semantics 3 (4) (2005), pp. 247–267.

[Car14] G. Carothers: RDF 1.1 N-Quads: A Line-based Syntax for RDF Datasets. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-n-quads-20140225/ (accessed on May 16, 2014).

[Car97] J. M. Carroll: “Human-Computer Interaction: Psychology as a Science of Design”. In: Annual Review of Psychology 48 (1) (1997), pp. 61–83.

[Cas+04] C. Castillo, M. Marin, A. Rodriguez, and R. Baeza-Yates: “Scheduling Algorithms for Web Crawling”. In: Proceedings of the Joint Conference 10th Brazilian Symposium on Multimedia and the Web & 2nd Latin American Web Congress (WebMedia & LA-Web 2004). Ribeirao Preto-SP, Brazil, 2004, pp. 10–17.

[Cas+11] S. Castano, A. Ferrara, S. Montanelli, and G. Varese: “Ontology and Instance Matching”. In: Knowledge-Driven Multimedia Information Extraction and Ontology Evolution. Ed. by G. Paliouras, C. D. Spyropoulos, and G. Tsatsaronis. Vol. 6050. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2011, pp. 167–195.

[CC03] M. Chau and H. Chen: “Personalized and Focused Web Spiders”. In: Web Intelligence. Ed. by N. Zhong, J. Liu, and Y. Yao. Springer Berlin Heidelberg, 2003. Chap. 10, pp. 197–217.

[Çel14] T. Çelik: h-product. 2014. url: http://microformats.org/wiki/h-product (accessed on February 9, 2016).

[CFG03] O. Corcho, M. Fernandez-Lopez, and A. Gomez-Perez: “Methodologies, Tools and Languages for Building Ontologies. Where is Their Meeting Point?” In: Data & Knowledge Engineering 46 (1) (2003), pp. 41–64.

[CG01] O. Corcho and A. Gomez-Perez: “Solving Integration Problems of E-Commerce Standards and Initiatives through Ontological Mappings”. In: Proceedings of the Workshop on Ontologies and Information Sharing. Seattle, Washington, USA, 2001, pp. 131–140.

[Cha09a] D. Chaffey: E-Business and E-Commerce Management: Strategy, Implementation and Practice. 4th ed. Prentice Hall, 2009.

[Cha09b] C.-Y. C. Chang: “Does Price Matter? How Price Influences Online Consumer Decision-Making”. In: Japanese Journal of Administrative Science 22 (3) (2009), pp. 245–254.

[Cle67] C. Cleverdon: “The Cranfield Tests on Index Language Devices”. In: ASLIB Proceedings 19 (6) (1967), pp. 173–194.

[Coa37] R. Coase: “The Nature of the Firm”. In: Economica 4 (16) (1937), pp. 386–405.

[Coa60] R. Coase: “The Problem of Social Cost”. In: The Journal of Law & Economics 3 (1) (1960), pp. 1–44.

[Col+06] S. Colucci, T. Di Noia, E. Di Sciascio, F. M. Donini, A. Ragone, and R. Rizzi: “A Semantic-based Fully Visual Application for Matchmaking and Query Refinement in B2C E-Marketplaces”. In: Proceedings of the 8th International Conference on Electronic Commerce (ICEC 2006). Fredericton, Canada, 2006, pp. 174–184.

[Cre+98] F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell: “"Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval”. In: ACM Computing Surveys 30 (4) (1998), pp. 528–552.

[CRF03] W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg: “A Comparison of String Distance Metrics for Name-Matching Tasks”. In: Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb 2003). Acapulco, Mexico, 2003, pp. 73–78.

[CS14a] G. Carothers and A. Seaborne: RDF 1.1 N-Triples: A Line-based Syntax for an RDF Graph. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-n-triples-20140225/ (accessed on May 16, 2014).

[CS14b] G. Carothers and A. Seaborne: RDF 1.1 TriG: RDF Dataset Language. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-trig-20140225/ (accessed on May 16, 2014).

[Cul+07] A. Culotta, M. Wick, R. Hall, M. Marzilli, and A. McCallum: “Canonicalization of Database Records Using Adaptive Similarity Measures”. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007). San Jose, California, USA, 2007, pp. 201–209.

[Cun02] H. Cunningham: “GATE, a General Architecture for Text Engineering”. In: Computers and the Humanities 36 (2002), pp. 223–254.

[CvBD99] S. Chakrabarti, M. van den Berg, and B. Dom: “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”. In: Computer Networks 31 (11-16) (1999), pp. 1623–1640.

[CWL14] R. Cyganiak, D. Wood, and M. Lanthaler: RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/ (accessed on May 14, 2014).

[Cyg+08] R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello: “Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web”. In: Proceedings of the 5th European Semantic Web Conference (ESWC 2008). Tenerife, Spain, 2008, pp. 690–704.

[dBru+05] J. de Bruijn, A. Polleres, R. Lara, and D. Fensel: “OWL DL vs. OWL Flight: Conceptual Modeling and Reasoning for the Semantic Web”. In: Proceedings of the 14th International World Wide Web Conference (WWW 2005). Chiba, Japan, 2005, pp. 623–632.

[Dee+90] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman: “Indexing by Latent Semantic Analysis”. In: Journal of the American Society for Information Science 41 (6) (1990), pp. 391–407.

[DFH11] J. Domingue, D. Fensel, and J. A. Hendler: “Introduction to the Semantic Web Technologies”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 1, pp. 3–41.

[DH05] A. Doan and A. Y. Halevy: “Semantic-Integration Research in the Database Community”. In: AI Magazine 26 (1) (2005), pp. 83–94.

[DHI12] A. Doan, A. Halevy, and Z. Ives: Principles of Data Integration. Morgan Kaufmann Publishers, 2012.

[Di +03] T. Di Noia, E. Di Sciascio, F. M. Donini, and M. Mongiello: “A System for Principled Matchmaking in an Electronic Marketplace”. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003). Budapest, Hungary, 2003, pp. 321–330.

[Dij82] E. W. Dijkstra: “EWD 447: On the Role of Scientific Thought”. In: Selected Writings on Computing: A Personal Perspective. Springer New York, 1982, pp. 60–66.

[Din+04] L. Ding, T. Finin, A. Joshi, Y. Peng, R. Scott Cost, J. Sachs, R. Pan, P. Reddivari, and V. Doshi: “Swoogle: A Semantic Web Search and Metadata Engine”. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM 2004). Washington, DC, USA, 2004, pp. 652–659.

[Din+05] L. Ding, T. Finin, A. Joshi, Y. Peng, R. Pan, and P. Reddivari: “Search on the Semantic Web”. In: IEEE Computer 38 (10) (2005), pp. 62–69.

[DLS01] F.-D. Dorloff, J. Leukel, and V. Schmitz: “Standards für den Austausch von elektronischen Produktkatalogen”. In: WISU 30 (11) (2001), pp. 1528–1536.

[DM11] M. D’Aquin and E. Motta: “Watson, More Than a Semantic Web Search Engine”. In: Semantic Web 2 (1) (2011), pp. 55–63.

[Dod06] L. Dodds: Slug: A Semantic Web Crawler. 2006. url: http://www.ldodds.com/projects/slug/slug-a-semantic-web-crawler.pdf (accessed on July 23, 2014).

[Don+14] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang: “Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2014). New York, NY, USA, 2014, pp. 601–610.

[DR10] T. Di Noia and A. Ragone: “Electronic Markets, a Look Behind the Curtains: How Can Semantic Matchmaking and Negotiation Boost E-Commerce?” In: Proceedings of the 11th International Conference on Electronic Commerce and Web Technologies (EC-Web 2010). Bilbao, Spain, 2010, pp. 241–252.

[Dre+08] A. Dreibelbis, E. Hechler, I. Milman, M. Oberhofer, P. van Run, and D. Wolfson: Enterprise Master Data Management: An SOA Approach to Managing Core Information. IBM Press, 2008.

[DS04] M. Dean and G. Schreiber: OWL Web Ontology Language Reference. W3C Recommendation 10 February 2004. 2004. url: http://www.w3.org/TR/2004/REC-owl-ref-20040210/ (accessed on May 20, 2014).

[DS05] M. Dürst and M. Suignard: Internationalized Resource Identifiers (IRIs). Request for Comments 3987. 2005. url: http://www.ietf.org/rfc/rfc3987.txt (accessed on May 7, 2014).

[Du+04] R. Du, E. Foo, C. Boyd, and B. Fitzgerald: “Defining Security Services for Electronic Tendering”. In: Proceedings of the Second Australasian Information Security Workshop (AISW 2004). Dunedin, New Zealand, 2004, pp. 43–52.

[EBa15] EBay: What Is a Manufacturer Part Number? 2015. url: http://www.ebay.com/gds/What-Is-a-Manufacturer-Part-Number-/10000000177404842/g.html (accessed on February 16, 2016).

[EClND] ECl@ss e. V.: eCl@ss Classification and Product Description. url: http://www.eclass.de/ (accessed on May 16, 2014).

[ECl14] ECl@ss e. V.: Category:Products – wiki.eclass.eu. 2014. url: http://wiki.eclass.eu/wiki/Category:Products (accessed on September 17, 2014).

[Ehr07] M. Ehrig: Ontology Alignment: Bridging the Semantic Gap. Springer US, 2007.

[Eic05] B. Eich: “JavaScript at Ten Years”. In: Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming (ICFP 2005). Tallinn, Estonia, 2005, pp. 129–129.

[Eit+01] T. Eiter, D. Veit, J. P. Müller, and M. Schneider: “Matchmaking for Structured Objects”. In: Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001). Munich, Germany, 2001, pp. 186–194.

[Eks+11] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl: “Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit”. In: Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys 2011). Chicago, IL, USA, 2011, pp. 133–140.

[EleND] Electronic Commerce Code Management Association: Why eOTD? url: http://www.eccma.org/whyeotd.php (accessed on May 8, 2014).

[ES07] J. Euzenat and P. Shvaiko: Ontology Matching. Springer Berlin Heidelberg, 2007.

[Eur08a] European Commission: “Commission Regulation (EC) No 213/2008 of 28 November 2007 Amending Regulation (EC) No 2195/2002 of the European Parliament and of the Council on the Common Procurement Vocabulary (CPV) and Directives 2004/17/EC and 2004/18/EC of the European Parliament”. In: Official Journal of the European Union L 74 (2008).

[Eur08b] European Commission: “Regulation (EC) No 451/2008 of the European Parliament and of the Council of 23 April 2008 Establishing a New Statistical Classification of Products by Activity (CPA) and Repealing Council Regulation (EEC) No 3696/93”. In: Official Journal of the European Union L 145 (2008).

[Eva07] M. P. Evans: “Analysing Google Rankings Through Search Engine Optimization Data”. In: Internet Research 17 (1) (2007), pp. 21–37.

[Fac12] Facebook Inc.: The Open Graph Protocol. 2012. url: http://ogp.me/ (accessed on May 16, 2014).

[FCB12] D. C. Faye, O. Cure, and G. Blin: “A Survey of RDF Storage Approaches”. In: ARIMA Journal 15 (2012), pp. 11–35.

[Fei+13] L. Feigenbaum, G. T. Williams, K. G. Clark, and E. Torres: SPARQL 1.1 Protocol. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/ (accessed on May 26, 2014).

[Fen+01] D. Fensel, Y. Ding, B. Omelayenko, E. Schulten, G. Botquin, M. Brown, and A. Flett: “Product Data Integration in B2B E-Commerce”. In: IEEE Intelligent Systems 16 (4) (2001), pp. 54–59.

[FFS07] A. Felfernig, G. Friedrich, and L. Schmidt-Thieme: “Recommender Systems”. In: IEEE Intelligent Systems 22 (3) (2007), pp. 18–21.

[FH10] C. Fürber and M. Hepp: “Using Semantic Web Resources for Data Quality Management”. In: Proceedings of the 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2010). Lisbon, Portugal, 2010, pp. 211–225.

[FH11] S. Ferré and A. Hermann: “Semantic Search: Reconciling Expressive Querying and Exploratory Search”. In: Proceedings of the 10th International Semantic Web Conference (ISWC 2011). Bonn, Germany, 2011, pp. 177–192.

[FH13] C. Fantapié Altobelli and D. Hilger: “F-Commerce – Möglichkeiten und Grenzen von Facebook als Vertriebskanal am Beispiel von Dienstleistern”. In: Dienstleistungsmanagement und Social Media: Potenziale, Strategien und Instrumente. Ed. by M. Bruhn and K. Hadwich. Springer Fachmedien Wiesbaden, 2013. Chap. 6, pp. 469–489.

[Fie+99] R. T. Fielding, J. Gettys, J. C. Mogul, H. F. Nielsen, L. Masinter, P. J. Leach, and T. Berners-Lee: Hypertext Transfer Protocol – HTTP/1.1. Request for Comments 2616. 1999. url: http://www.ietf.org/rfc/rfc2616.txt (accessed on May 7, 2014).

[Fie00] R. T. Fielding: “Architectural Styles and the Design of Network-based Software Architectures”. PhD thesis. University of California, Irvine, 2000.

[FLR14] R. T. Fielding, Y. Lafon, and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Range Requests. Request for Comments 7233. 2014. url: http://www.ietf.org/rfc/rfc7233.txt (accessed on February 5, 2016).

[FNR14] R. T. Fielding, M. Nottingham, and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Caching. Request for Comments 7234. 2014. url: http://www.ietf.org/rfc/rfc7234.txt (accessed on February 5, 2016).

[FR14a] R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Authentication. Request for Comments 7235. 2014. url: http://www.ietf.org/rfc/rfc7235.txt (accessed on February 5, 2016).

[FR14b] R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests. Request for Comments 7232. 2014. url: http://www.ietf.org/rfc/rfc7232.txt (accessed on February 5, 2016).

[FR14c] R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. Request for Comments 7230. 2014. url: http://www.ietf.org/rfc/rfc7230.txt (accessed on February 5, 2016).

[FR14d] R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. Request for Comments 7231. 2014. url: http://www.ietf.org/rfc/rfc7231.txt (accessed on February 5, 2016).

[FS85] J. Farrell and G. Saloner: “Standardization, Compatibility, and Innovation”. In: Rand Journal of Economics 16 (1) (1985), pp. 70–83.

[Fuh92] N. Fuhr: “Probabilistic Models in Information Retrieval”. In: The Computer Journal 35 (3) (1992), pp. 243–255.

[FW98] E. C. Freuder and R. J. Wallace: “Suggestion Strategies for Constraint-based Matchmaker Agents”. In: Proceedings of the 4th International Conference on Principles and Practice of Constraint Programming (CP 1998). Pisa, Italy, 1998, pp. 192–204.

[Gan+11] F. L. Gandon, R. Krummenacher, S.-K. Han, and I. Toma: “Semantic Annotation and Retrieval: RDF”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 4, pp. 117–155.

[Gar04] L. M. Garshol: “Metadata? Thesauri? Taxonomies? Topic Maps! Making Sense of It All”. In: Journal of Information Science 30 (4) (2004), pp. 378–391.

[GGH09] K. Goel, R. V. Guha, and O. Hansson: Introducing Rich Snippets. Google Webmaster Central Blog. 2009. url: http://googlewebmastercentral.blogspot.de/2009/05/introducing-rich-snippets.html (accessed on August 11, 2014).

[GI12] GS1 Germany GmbH and Institut der deutschen Wirtschaft Köln Consult GmbH: Economic Success Thanks to eBusiness Standards: Entrepreneurs Show How It Works. Cologne, Germany, 2012.

[GL02] M. Gruninger and J. Lee: “Ontology Applications and Design”. In: Communications of the ACM 45 (2) (2002), pp. 39–41.

[GMM03] R. V. Guha, R. McCool, and E. Miller: “Semantic Search”. In: Proceedings of the Twelfth International World Wide Web Conference (WWW 2003). Budapest, Hungary, 2003, pp. 700–709.

[Gol08] A. Goldfarb: “Electronic Commerce”. In: The New Palgrave Dictionary of Economics. Ed. by S. N. Durlauf and L. E. Blume. 2nd ed. Basingstoke: Palgrave Macmillan, 2008.

[Gol76] V. P. Goldberg: “Regulation and Administered Contracts”. In: The Bell Journal of Economics 7 (2) (1976), pp. 426–448.

[GooND] Google: About Unique Product Identifiers. url: https://support.google.com/merchants/answer/160161?hl=en&ref_topic=6244294 (accessed on February 16, 2016).

[Goo13] Google: The Google Product Taxonomy. Google Merchant Center Help. 2013. url: https://www.google.com/basepages/producttype/taxonomy.en-US.txt (accessed on May 16, 2014).

[Goo15a] Google: About schema.org. 2015. url: https://developers.google.com/structured-data/schema-org (accessed on February 10, 2016).

[Goo15b] Google: Content API for Shopping – Best Practices. Google Developers. 2015. url: https://developers.google.com/shopping-content/v2/best-practices (accessed on February 19, 2016).

[Goo15c] Google: Inside Search: Algorithms. 2015. url: https://www.google.com/insidesearch/howsearchworks/algorithms.html (accessed on January 26, 2016).

[Goo16] Google: Rich Snippets. 2016. url: https://developers.google.com/structured-data/rich-snippets/ (accessed on February 10, 2016).

[GOS09] N. Guarino, D. Oberle, and S. Staab: “What Is an Ontology?” In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 1–17.

[GPP13] P. Gearon, A. Passant, and A. Polleres: SPARQL 1.1 Update. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/REC-sparql11-update-20130321/ (accessed on May 24, 2014).

[GQ02] T. Gupta and A. Qasem: “Reduction of Price Dispersion through Semantic E-Commerce: A Position Paper”. In: Proceedings of the Semantic Web Workshop. Hawaii, USA, 2002, pp. 1–2.

[GR98] J. C. Giarratano and G. D. Riley: Expert Systems: Principles and Programming. Boston, Massachusetts, USA: PWS Publishing Company, 1998.

[Gri+11] S. Grimm, A. Abecker, J. Völker, and R. Studer: “Ontologies and the Semantic Web”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 13, pp. 507–579.

[Gri03] M. Grieger: “Electronic Marketplaces: A Literature Review and a Call for Supply Chain Management Research”. In: European Journal of Operational Research 144 (2) (2003), pp. 280–294.

[Gru12] J. Grudin: “A Moving Target – The Evolution of Human-Computer Interaction”. In: The Human-Computer Interaction Handbook. Ed. by J. A. Jacko. Vol. 3. CRC Press, 2012. Chap. Introducti, pp. xxvii–lxi.

[Gru93] T. R. Gruber: “A Translation Approach to Portable Ontology Specifications”. In: Knowledge Acquisition 5 (2) (1993), pp. 199–220.

[GS1ND] GS1 AISBL: The Value and Benefits of the GS1 System of Standards. Brussels, Belgium: GS1.

[GS105] GS1 AISBL: Global Product Classification (GPC): The Global Language for Classifying Goods. 3rd ed. GS1, 2005.

[GS115] GS1 AISBL: Global Trade Item Number (GTIN). GS1, 2015.

[GS116] GS1 AISBL: GS1 General Specifications. GS1, 2016.

[GS14] F. Gandon and G. Schreiber: RDF 1.1 XML Syntax. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/ (accessed on May 15, 2014).

[GTB01] J. Gonzalez-Castillo, D. Trastour, and C. Bartolini: Description Logics for Matchmaking of Services. Technical Report HPL–2001–265. HP Laboratories Bristol, 2001.

[Gua+15] M. Guay, C. Pang, C. Hestermann, and N. Montgomery: Magic Quadrant for Single-Instance ERP for Product-Centric Midmarket Companies. Research Report. Stamford: Gartner, 2015.

[Guh12] R. V. Guha: Good Relations and Schema.org. 2012. url: http://blog.schema.org/2012/11/good-relations-and-schemaorg.html (accessed on April 15, 2015).

[GW02] N. Guarino and C. Welty: “Evaluating Ontological Decisions with OntoClean”. In: Communications of the ACM 45 (2) (2002), pp. 61–65.

[GW09] N. Guarino and C. Welty: “An Overview of OntoClean”. In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 201–220.

[Haa+04] P. Haase, J. Broekstra, A. Eberhart, and R. Volz: “A Comparison of RDF Query Languages”. In: Proceedings of the Third International Semantic Web Conference (ISWC 2004). Hiroshima, Japan, 2004, pp. 502–517.

[Haa+11] K. Haas, P. Mika, P. Tarjan, and R. Blanco: “Enhanced Results for Web Search”. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011). Beijing, China, 2011, pp. 725–734.

[Hah+10] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel: “Faceted Wikipedia Search”. In: Proceedings of the 13th International Conference on Business Information Systems (BIS 2010). Berlin, Germany, 2010, pp. 1–11.

[Hak+06] S. Hakkarainen, L. Hella, D. Strasunskas, and S. Tuxen: “A Semantic Transformation Approach for ISO 15926”. In: Proceedings of the First International Workshop on Ontologizing Industrial Standards (OIS 2006). Tucson, Arizona, USA, 2006, pp. 281–290.

[Hal05] A. Y. Halevy: “Why Your Data Won’t Mix”. In: ACM Queue 3 (8) (2005), pp. 50–58.

[Han07] O. Handle: “Konzeption und Realisierung eines branchenübergreifenden Produktklassifikationssystems für das Bauwesen unter Nutzung der produktspezifischen Fachkompetenz der Baustoffindustrie”. Master thesis. MCI Management Center Innsbruck, Innsbruck, Austria, 2007.

[Han14] Handelsverband Deutschland: E-Commerce-Umsätze. 2014. url: http://www.einzelhandel.de/index.php/presse/zahlenfaktengrafiken/item/110185-e-commerce-umsaetze (accessed on February 26, 2015).

[Har+04] A. Harth, S. Decker, Y. He, H. Tanmunarunkit, and C. Kesselman: “A Semantic Matchmaker Service on the Grid”. In: Poster Proceedings of the 13th International World Wide Web Conference (WWW 2004), Alternate Track. New York, NY, USA, 2004, pp. 326–327.

[Has+11] B. Haslhofer, E. Momeni, B. Schandl, and S. Zander: Europeana RDF Store Report: The Results of Qualitative and Quantitative Study of Existing RDF Stores in the Context of Europeana. Technical Report. EuropeanaConnect, 2011.

[HB11] T. Heath and C. Bizer: Linked Data: Evolving the Web into a Global Data Space. 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.

[HBS09] A. Hertel, J. Broekstra, and H. Stuckenschmidt: “RDF Storage and Retrieval Systems”. In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 489–508.

[HdB07] M. Hepp and J. de Bruijn: “GenTax: A Generic Methodology for Deriving OWL and RDF-S Ontologies from Hierarchical Classifications, Thesauri, and Inconsistent Taxonomies”. In: Proceedings of the 4th European Semantic Web Conference (ESWC 2007). Innsbruck, Austria, 2007, pp. 129–144.

[Hea+02] M. A. Hearst, A. Elliott, J. English, R. Sinha, K. Swearingen, and K.-P. Yee: “Finding the Flow in Web Site Search”. In: Communications of the ACM 45 (9) (2002), pp. 42–49.

[Hea09] M. A. Hearst: “The Design of Search User Interfaces”. In: Search User Interfaces. Cambridge University Press, 2009. Chap. 1.

[Hea11] M. A. Hearst: “User Interfaces for Search”. In: Modern Information Retrieval: The Concepts and Technology behind Search. Vol. 2. Addison-Wesley, 2011. Chap. 2, pp. 21–55.

[Hed10] H. Hedden: The Accidental Taxonomist. Information Today, 2010.

[Hen10] J. Hendler: “Web 3.0: The Dawn of Semantic Search”. In: Computer 43 (1) (2010), pp. 77–80.

[Hep03] M. Hepp: “Güterklassifikation als semantisches Standardisierungsproblem”. PhD thesis. Universität Würzburg, Würzburg, Germany, 2003.

[Hep05a] M. Hepp: “A Methodology for Deriving OWL Ontologies from Products and Services Categorization Standards”. In: Proceedings of the 13th European Conference on Information Systems (ECIS 2005). Regensburg, Germany, 2005, pp. 1–12.

[Hep05b] M. Hepp: “eClassOWL: A Fully-fledged Products and Services Ontology in OWL”. In: Poster and Demo Proceedings of the 4th International Semantic Web Conference (ISWC 2005). Galway, Ireland, 2005.

[Hep06] M. Hepp: “Products and Services Ontologies: A Methodology for Deriving OWL Ontologies from Industrial Categorization Standards”. In: International Journal on Semantic Web and Information Systems (IJSWIS) 2 (1) (2006), pp. 72–99.

[Hep07a] M. Hepp: “Possible Ontologies: How Reality Constrains the Development of Relevant Ontologies”. In: IEEE Internet Computing 11 (1) (2007), pp. 90–96.

[Hep07b] M. Hepp: “ProdLight: A Lightweight Ontology for Product Description Based on Datatype Properties”. In: Proceedings of the 10th International Conference on Business Information Systems (BIS 2007). Poznan, Poland, 2007, pp. 260–272.

[Hep08a] M. Hepp: “GoodRelations: An Ontology for Describing Products and Services Offers on the Web”. In: Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2008). Acitrezza, Italy, 2008, pp. 329–346.

[Hep08b] M. Hepp: GoodRelations: An Ontology for Describing Web Offerings. Technical Report pp. 2008–05–15. SEBIS, 2008.

[Hep11] M. Hepp: GoodRelations Language Reference. V 1.0, Release 2011-10-01. 2011. url: http://www.heppnetz.de/ontologies/goodrelations/v1.html (accessed on May 22, 2014).

[Hep12a] M. Hepp: GoodRelations for Manufacturers of Commodities. 2012. url: http://wiki.goodrelations-vocabulary.org/GoodRelations_for_manufacturers (accessed on November 12, 2015).

[Hep12b] M. Hepp: “The Web of Data for E-Commerce in Brief”. In: Proceedings of the 12th International Conference on Web Engineering (ICWE 2012). Berlin, Germany, 2012, pp. 510–511.

[Hep13] M. Hepp: Useful Rules, Axioms, and Mappings for GoodRelations. 2013. url: http://wiki.goodrelations-vocabulary.org/Axioms (accessed on February 20, 2016).

[Hep15a] M. Hepp: GoodRelations as Part of Schema.org. 2015. url: http://wiki.goodrelations-vocabulary.org/Cookbook/Schema.org (accessed on February 19, 2016).

[Hep15b] M. Hepp: “The Web of Data for E-Commerce: Schema.org and GoodRelations for Researchers and Practitioners”. In: Proceedings of the 15th International Conference on Web Engineering (ICWE 2015). Rotterdam, The Netherlands, 2015, pp. 723–727.

[Her+04] S. C. Herring, L. A. Scheidt, S. Bonus, and E. Wright: “Bridging the Gap: A Genre Analysis of Weblogs”. In: Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS 2004). Big Island, Hawaii, USA, 2004.

[Her+13] I. Herman, B. Adida, M. Sporny, and M. Birbeck: RDFa 1.1 Primer – Second Edition: Rich Structured Data Markup for Web Documents. W3C Working Group Note 22 August 2013. 2013. url: http://www.w3.org/TR/2013/NOTE-rdfa-primer-20130822/ (accessed on May 17, 2014).

[Hev+04] A. R. Hevner, S. T. March, J. Park, and S. Ram: “Design Science in Information Systems Research”. In: MIS Quarterly 28 (1) (2004), pp. 75–105.

[HGR09] M. Hepp, R. García, and A. Radinger: “RDF2RDFa: Turning RDF into Snippets for Copy-and-Paste”. In: Poster and Demo Proceedings of the 8th International Semantic Web Conference (ISWC 2009). Washington, DC, USA, 2009.

[HH00] J. Heflin and J. Hendler: “Searching the Web with SHOE”. In: Artificial Intelligence for Web Search. Papers from the AAAI Workshop. Menlo Park, CA, USA, 2000, pp. 35–40.

[Hic+14] I. Hickson, R. Berjon, S. Faulkner, T. Leithead, E. Doyle Navara, E. O’Connor, and S. Pfeiffer: HTML5: A Vocabulary and Associated APIs for HTML and XHTML. W3C Recommendation 28 October 2014. 2014. url: http://www.w3.org/TR/2014/REC-html5-20141028/ (accessed on April 13, 2015).

[Hic13] I. Hickson: HTML Microdata. W3C Working Group Note 29 October 2013. 2013. url: http://www.w3.org/TR/2013/NOTE-microdata-20131029/ (accessed on May 16, 2014).

[Hil99] P. Hill: “Tangibles, Intangibles and Services: A New Taxonomy for the Classification of Output”. In: The Canadian Journal of Economics 32 (2) (1999), pp. 426–446.

[Hjø08] B. Hjørland: “What is Knowledge Organization (KO)?” In: Knowledge Organization. International Journal devoted to Concept Theory, Classification, Indexing and Knowledge Representation 35 (2/3) (2008), pp. 86–101.

[HLS07] M. Hepp, J. Leukel, and V. Schmitz: “A Quantitative Analysis of Product Categorization Standards: Content, Coverage, and Maintenance of eCl@ss, UNSPSC, eOTD, and the RosettaNet Technical Dictionary”. In: Knowledge and Information Systems 13 (1) (2007), pp. 77–114.

[HM07] T. Heath and E. Motta: “Revyu.com: A Reviewing and Rating Site for the Web of Data”. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007). Busan, Korea, 2007, pp. 895–902.

[Hod+14] R. Hodgson, P. J. Keller, J. Hodges, and J. Spivak: QUDT – Quantities, Units, Dimensions and Data Types Ontologies. 2014. url: http://qudt.org/ (accessed on October 30, 2014).

[Hod00] G. Hodge: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. Washington, DC, USA: Council on Library and Information Resources, 2000.

[Hof+00] Y. Hoffner, C. Facciorusso, S. Field, and A. Schade: “Distribution Issues in the Design and Implementation of a Virtual Market Place”. In: Computer Networks 32 (6) (2000), pp. 717–730.

[Hog+10] A. Hogan, A. Polleres, J. Umbrich, and A. Zimmermann: “Some Entities Are More Equal than Others: Statistical Methods to Consolidate Linked Data”. In: Proceedings of the Workshop on New Forms of Reasoning for the Semantic Web: Scalable & Dynamic (NeFoRS 2010). Heraklion, Greece, 2010.

[Hog+11] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker: “Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine”. In: Web Semantics: Science, Services and Agents on the World Wide Web 9 (4) (2011), pp. 365–401.

[Hol79] B. Holmström: “Moral Hazard and Observability”. In: The Bell Journal of Economics 10 (1) (1979), pp. 74–91.

[Hop08] E. Hopkins: “Price Dispersion”. In: The New Palgrave Dictionary of Economics. Ed. by S. N. Durlauf and L. E. Blume. 2nd ed. Basingstoke: Palgrave Macmillan, 2008.

[Hor+04] I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and M. Dean: SWRL: A Semantic Web Rule Language Combining OWL and RuleML. W3C Member Submission 21 May 2004. 2004. url: http://www.w3.org/Submission/2004/SUBM-SWRL-20040521/ (accessed on May 26, 2014).

[HP11] I. Horrocks and P. F. Patel-Schneider: “KR and Reasoning on the Semantic Web: OWL”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 9, pp. 365–398.

[HR04] M. Huth and M. Ryan: Logic in Computer Science: Modelling and Reasoning about Systems. 2nd ed. Cambridge University Press, 2004.

[HR09] M. Hepp and A. Radinger: “SKOS2OWL: An Online Tool for Deriving OWL and RDF-S Ontologies from SKOS Vocabularies”. In: Poster and Demo Proceedings of the 8th International Semantic Web Conference (ISWC 2009). Washington, DC, USA, 2009.

[HRO06] A. Y. Halevy, A. Rajaraman, and J. J. Ordille: “Data Integration: The Teenage Years”. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006). Seoul, Korea, 2006, pp. 9–16.

[HS00] C. Hümpel and V. Schmitz: “BMEcat – An XML Standard for Electronic Product Data Interchange”. In: Proceedings of the First German Conference XML 2000. Heidelberg, Germany, 2000, pp. 1–11.

[HS13] S. Harris and A. Seaborne: SPARQL 1.1 Query Language. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/REC-sparql11-query-20130321/ (accessed on May 23, 2014).

[HT05] M. Hepp and R. Thome: “XML-Spezifikationen und Standards für den Datenaustausch”. In: Electronic Commerce und Electronic Business. Mehrwert durch Integration und Automation. Ed. by R. Thome, H. Schinzer, and M. Hepp. 3rd ed. Vahlen, München, 2005, pp. 191–216.

[HUD06] A. Harth, J. Umbrich, and S. Decker: “MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data”. In: Proceedings of the 5th International Semantic Web Conference (ISWC 2006). Athens, GA, USA, 2006, pp. 258–271.

[Hue00] C. Huemer: “XML vs. UN/EDIFACT or Flexibility vs. Standardisation”. In: Proceedings of the 13th International Bled Electronic Commerce Conference. Bled, Slovenia, 2000.

[IBM11] IBM: IBM100 – E-Business. 2011. url: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/ebusiness/transform/ (accessed on May 16, 2014).

[II94] International Organization for Standardization and International Electrotechnical Commission: Information Technology – Open Systems Interconnection – Basic Reference Model: The Basic Model. ISO/IEC 74. 1994.

[Inf15] Informationsstelle für Arzneispezialitäten: Technische Hinweise zur PZN-Codierung im Code 39. 2015.

[Inm02] W. H. Inmon: Building the Data Warehouse. 3rd ed. John Wiley & Sons, Inc., 2002.

[IntND] International Organization for Standardization: Language Codes – ISO 639. url: http://www.iso.org/iso/home/standards/language_codes.htm (accessed on May 16, 2014).

[Int02a] International Organization for Standardization: ISO 10303-21:2002: Industrial Automation Systems and Integration – Product Data Representation and Exchange – Part 21: Implementation Methods: Clear Text Encoding of the Exchange Structure. 2002.

[Int02b] International Organization for Standardization: ISO 639-1:2002: Codes for the Representation of Names of Languages – Part 1: Alpha-2 Code. 2002.

[Int04] International Organization for Standardization: ISO 10303-11:2004: Industrial Automation Systems and Integration – Product Data Representation and Exchange – Part 11: Description Methods: The EXPRESS Language Reference Manual. 2004.

[Int05] International Organization for Standardization: ISO 2108:2005: Information and Documentation – International Standard Book Number (ISBN). 2005.

[Int07a] International Organization for Standardization: ISO 10303-28:2007: Industrial Automation Systems and Integration – Product Data Representation and Exchange – Part 28: Implementation Methods: XML Representations of EXPRESS Schemas and Data, Using XML Schemas. 2007.

[Int07b] International Organization for Standardization: ISO 639-3:2007: Codes for the Representation of Names of Languages – Part 3: Alpha-3 Code for Comprehensive Coverage of Languages. 2007.

[Int08] International Organization for Standardization: ISO 4217:2008: Codes for the Representation of Currencies and Funds. 2008.

[Int11] International Organization for Standardization: ISO 25964-1:2011: Information and Documentation – Thesauri and Interoperability with Other Vocabularies – Part 1: Thesauri for Information Retrieval. 2011.

[Int13a] International Organization for Standardization: ISO 25964-2:2013: Information and Documentation – Thesauri and Interoperability with Other Vocabularies – Part 2: Interoperability with Other Vocabularies. 2013.

[Int13b] International Organization for Standardization: ISO 3166-1:2013: Codes for the Representation of Names of Countries and Their Subdivisions – Part 1: Country Codes. 2013.

[Int13c] International Organization for Standardization: ISO 3166-2:2013: Codes for the Representation of Names of Countries and Their Subdivisions – Part 2: Country Subdivision Code. 2013.

[Int13d] International Organization for Standardization: ISO 3166-3:2013: Codes for the Representation of Names of Countries and Their Subdivisions – Part 3: Code for Formerly Used Names of Countries. 2013.

[Int15] International Data Corporation: As Tablets Slow and PCs Face Ongoing Challenges, Smartphones Grab an Ever-Larger Share of the Smart Connected Device Market Through 2019, According to IDC. IDC Press Release. 2015. url: http://www.idc.com/getdoc.jsp?containerId=prUS25500515 (accessed on November 5, 2015).

[Int88] International Organization for Standardization: ISO 8601:1988: Data Elements and Interchange Formats – Information Interchange – Representation of Dates and Times. 1988.

[Int98] International Organization for Standardization: ISO 639-2:1998: Codes for the Representation of Names of Languages – Part 2: Alpha-3 Code. 1998.

[Ise+10] R. Isele, J. Umbrich, C. Bizer, and A. Harth: “LDSpider: An Open-Source Crawling Framework for the Web of Linked Data”. In: Poster and Demo Proceedings of the 9th International Semantic Web Conference (ISWC 2010). Shanghai, China, 2010.

[Jac12] P. Jaccard: “The Distribution of the Flora in the Alpine Zone”. In: New Phytologist 11 (1912), pp. 37–50.

[JJD11] J. Jimenez-Rodriguez, G. Jimenez-Diaz, and B. Diaz-Agudo: “Matchmaking and Case-based Recommendations”. In: Proceedings of the Workshop on Case-based Reasoning for Computer Games. Greenwich, London, UK, 2011, pp. 53–62.

[JM09] D. Jurafsky and J. H. Martin: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. Prentice Hall, 2009.

[JM76] M. C. Jensen and W. H. Meckling: “Theory of the Firm: Managerial Behavior, Agency Costs and Ownership Structure”. In: Journal of Financial Economics 3 (4) (1976), pp. 305–360.

[Joa02] T. Joachims: “Optimizing Search Engines Using Clickthrough Data”. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002). Edmonton, Alberta, Canada, 2002, pp. 133–142.

[JW04] I. Jacobs and N. Walsh: Architecture of the World Wide Web, Volume One. W3C Recommendation 15 December 2004. 2004. url: http://www.w3.org/TR/2004/REC-webarch-20041215/ (accessed on May 13, 2014).

[Kar+05] T. Karlsson, C. Kuttainen, L. Pitt, and S. Spyropoulou: “Price as a Variable in online Consumer Trade-offs”. In: Marketing Intelligence & Planning 23 (4) (2005), pp. 350–358.

[KD11] A. Kiryakov and M. Damova: “Storing the Semantic Web: Repositories”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 7, pp. 231–297.

[Kel14] G. Kellogg: Microdata to RDF – Second Edition: Transformation from HTML+Microdata to RDF. W3C Interest Group Note 16 December 2014. 2014. url: https://www.w3.org/TR/2014/NOTE-microdata-rdf-20141216/ (accessed on February 19, 2016).

[Ker+00] S. Kerridge, C. Halaris, G. Mentzas, and S. Kerridge: “Virtual Tendering and Bidding in the Construction Sector”. In: Proceedings of the First International Conference on Electronic Commerce and Web Technologies (EC-Web 2000). London, UK, 2000, pp. 379–388.

[KH96] D. Kuokka and L. Harada: “Integrating Information via Matchmaking”. In: Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies (JIIS) 6 (2-3) (1996), pp. 261–279.

[Kha06] R. Khare: “Microformats: The Next (Small) Thing on the Semantic Web?” In: IEEE Internet Computing 10 (1) (2006), pp. 68–75.

[Kle02] M. Klein: DAML+OIL and RDF Schema Representation of UNSPSC. 2002. url: http://www.cs.vu.nl/%7B~%7Dmcaklein/unspsc/ (accessed on February 19, 2016).

[Knu09] H. Knublauch: Currency Conversion with the Units Ontology, SPARQLMotion and SPIN. 2009. url: http://composing-the-semantic-web.blogspot.it/2009/09/currency-conversion-with-units-ontology.html (accessed on September 30, 2014).

[Knu13] H. Knublauch: Defining SPARQL Functions with SWP. 2013. url: http://composing-the-semantic-web.blogspot.de/2013/06/defining-sparql-functions-with-swp.html (accessed on November 17, 2014).

[Knu84] D. E. Knuth: “Literate Programming”. In: The Computer Journal 27 (2) (1984), pp. 97–111.

[Koh+09] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne: “Controlled Experiments on the Web: Survey and Practical Guide”. In: Data Mining and Knowledge Discovery 18 (1) (2009), pp. 140–181.

[Kos07] M. Koster: A Standard for Robot Exclusion. The Web Robots Pages. 2007. url: http://www.robotstxt.org/orig.html (accessed on February 18, 2016).

[Kos95] M. Koster: Robots in the Web: Threat or Treat? The Web Robots Pages. 1995. url: http://www.robotstxt.org/threat-or-treat.html (accessed on February 18, 2016).

[KR03] R. Kalakota and M. Robinson: “Electronic Commerce”. In: Encyclopedia of Computer Science. Vol. 4. Chichester, UK: John Wiley and Sons Ltd., 2003, pp. 628–634.

[KS85] M. L. Katz and C. Shapiro: “Network Externalities, Competition, and Compatibility”. In: The American Economic Review 75 (3) (1985), pp. 424–440.

[KW97] R. Kalakota and A. B. Whinston: Electronic Commerce: A Manager’s Guide. Addison-Wesley, 1997.

[KZL08] J. Koren, Y. Zhang, and X. Liu: “Personalized Interactive Faceted Search”. In: Proceedings of the 17th International World Wide Web Conference (WWW 2008). Beijing, China, 2008, pp. 477–485.

[Law00] S. Lawrence: “Context in Web Search”. In: IEEE Data Engineering Bulletin 23 (3) (2000), pp. 25–32.

[LC01] B. Leuf and W. Cunningham: The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, 2001.

[Leh92] F. Lehmann: “Semantic Networks”. In: Computers & Mathematics with Applications 23 (2-5) (1992), pp. 1–50.

[Len02] M. Lenzerini: “Data Integration: A Theoretical Perspective”. In: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002). Madison, Wisconsin, USA, 2002, pp. 233–246.

[Lev66] V. I. Levenshtein: “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals”. In: Soviet Physics Doklady 10 (8) (1966), pp. 707–710.

[LG12] M. Lanthaler and C. Gütl: “On Using JSON-LD to Create Evolvable RESTful Services”. In: Proceedings of the Third International Workshop on RESTful Design (WS-REST 2012). Lyon, France, 2012, pp. 25–32.

[LH03] L. Li and I. Horrocks: “A Software Framework for Matchmaking Based on Semantic Web Technology”. In: Proceedings of the Twelfth International World Wide Web Conference (WWW 2003). Budapest, Hungary, 2003, pp. 331–339.

[Li11] H. Li: “A Short Introduction to Learning to Rank”. In: IEICE Transactions on Information and Systems E94-D (10) (2011), pp. 1854–1862.

[Liu+09] W. Liu, Y. Zeng, M. Maletz, and D. Brisson: “Product Lifecycle Management: A Review”. In: Proceedings of the ASME 2009 International Design Engineering Technical Conferences and Computers & Information in Engineering Conference (IDETC/CIE 2009). San Diego, California, USA, 2009, pp. 1213–1225.

[Liu+12] D. Liu, R. G. Bias, M. Lease, and R. Kuipers: “Crowdsourcing for Usability Testing”. In: Proceedings of the 75th Annual Meeting of the Association for Information Science and Technology (ASIST 2012) 49 (1) (2012).

[LKH14] B. Lika, K. Kolomvatsos, and S. Hadjiefthymiades: “Facing the Cold Start Problem in Recommender Systems”. In: Expert Systems with Applications 41 (4) (2014), pp. 2065–2073.

[LMS07] M. Lenders, J. Müller, and G. Schuh: “PLM mit Modellcharakter”. In: CADplus Business+Engineering 46 (6) (2007), pp. 32–35.

[Los09] D. Loshin: Master Data Management. Morgan Kaufmann Publishers, 2009.

[Lov68] J. B. Lovins: “Development of a Stemming Algorithm”. In: Mechanical Translation and Computational Linguistics 11 (1/2) (1968), pp. 22–31.

[LR05] S. A. Ludwig and S. Reyhani: “Introduction of Semantic Matchmaking to Grid Computing”. In: Journal of Parallel and Distributed Computing 65 (12) (2005), pp. 1533–1541.

[LS96] D. D. Lewis and K. Sparck Jones: “Natural Language Processing for Information Retrieval”. In: Communications of the ACM 39 (1) (1996), pp. 92–101.

[LSR96] S. Luke, L. Spector, and D. Rager: “Ontology-based Knowledge Discovery on the World-Wide Web”. In: Proceedings of the Workshop on Internet-based Information Systems at the 13th National Conference on Artificial Intelligence (AAAI-96). 1996, pp. 96–102.

[Luh57] H. P. Luhn: “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”. In: IBM Journal of Research and Development 1 (4) (1957), pp. 309–317.

[Mad03] S. E. Madnick: “Oh, so That Is What You Meant! The Interplay of Data Quality and Data Semantics”. In: Proceedings of the 22nd International Conference on Conceptual Modeling (ER 2003). Chicago, IL, USA, 2003, pp. 3–13.

[Man+11] J. Manweiler, S. Agarwal, M. Zhang, R. Roy Choudhury, and P. Bahl: “Switchboard: A Matchmaking System for Multiplayer Mobile Games”. In: Proceedings of the 9th International Conference on Mobile Systems Applications and Services (MobiSys 2011). Bethesda, MD, USA, 2011, pp. 71–84.

[Man98] B. Manaris: “Natural Language Processing: A Human-Computer Interaction Perspective”. In: Advances in Computers. Ed. by M. V. Zelkowitz. Vol. 47. Academic Press, 1998, pp. 1–66.

[Mar04] K. Marx: A Contribution to the Critique of Political Economy. Chicago, IL, USA: Charles H. Kerr & Company, 1904.

[Mar06] G. Marchionini: “Exploratory Search: From Finding to Understanding”. In: Communications of the ACM 49 (4) (2006), pp. 41–46.

[Mas+11] J. E. Masters, I. Polikoff, R. Hodgson, and D. Mekonnen: QUDT Units Vocabulary (without Dimensions) Version 1.1. Turtle File. 2011. url: http://qudt.org/1.1/vocab/OVG_units-qudt-(v1.1).ttl (accessed on December 18, 2015).

[Mat09] M. Mattern: “Transforming BMEcat Catalogs into Semantic Web Annotation Data for Offerings”. Master thesis. University of Innsbruck, Innsbruck, Austria, 2009.

[MB09] A. Miles and S. Bechhofer: SKOS Simple Knowledge Organization System Reference. W3C Recommendation 18 August 2009. 2009. url: http://www.w3.org/TR/2009/REC-skos-reference-20090818/ (accessed on May 22, 2014).

[MB12] H. Mühleisen and C. Bizer: “Web Data Commons – Extracting Structured Data from Two Large Web Corpora”. In: Proceedings of the WWW2012 Workshop on Linked Data on the Web (LDOW 2012). Lyon, France, 2012.

[MC10] P. Morville and J. Callender: Search Patterns. O’Reilly Media, 2010.

[McD05] M. McDermott: “Knowledge Workers: You Can Gauge Their Effectiveness”. In: Leadership Excellence 22 (10) (2005), pp. 15–17.

[McK12] W. McKinney: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O’Reilly Media, 2012.

[MH09] R. Möller and V. Haarslev: “Tableau-based Reasoning”. In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 509–528.

[MHG10] M. McCandless, E. Hatcher, and O. Gospodnetic: Lucene in Action. 2nd ed. Manning Publications Co., 2010.

[MicND] Microsoft: Bing Rich Captions. Bing Ads. url: http://advertise.bingads.microsoft.com/en-us/bing-rich-captions (accessed on February 17, 2016).

[Mik08] P. Mika: “Anatomy of a SearchMonkey”. In: Nodalities Magazine (2008), pp. 1–7.

[Mik11] P. Mika: Microformats and RDFa Deployment across the Web. 2011. url: http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/ (accessed on May 16, 2014).

[Mil06] J. S. Mill: Utilitarianism. University of Chicago Press, 1906.

[Mil95] G. A. Miller: “WordNet: A Lexical Database for English”. In: Communications of the ACM 38 (11) (1995), pp. 39–41.

[MK60] M. E. Maron and J. L. Kuhns: “On Relevance, Probabilistic Indexing and Information Retrieval”. In: Journal of the ACM 7 (3) (1960), pp. 216–244.

[MM04] F. Manola and E. Miller: RDF Primer. W3C Recommendation 10 February 2004. 2004. url: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (accessed on May 16, 2014).

[MMB14] R. Meusel, P. Mika, and R. Blanco: “Focused Crawling for Structured Data”. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014). Shanghai, China, 2014, pp. 1039–1048.

[MP12] P. Mika and T. Potter: “Metadata Statistics for a Large Web Corpus”. In: Proceedings of the WWW2012 Workshop on Linked Data on the Web (LDOW 2012). Lyon, France, 2012.

[MP15] R. Meusel and H. Paulheim: “Heuristics for Fixing Common Errors in Deployed schema.org Microdata”. In: Proceedings of the 12th European Semantic Web Conference (ESWC 2015). Portoroz, Slovenia, 2015, pp. 152–168.

[MPB14] R. Meusel, P. Petrovski, and C. Bizer: “The WebDataCommons Microdata, RDFa and Microformat Dataset Series”. In: Proceedings of the 13th International Semantic Web Conference (ISWC 2014). Riva del Garda, Trentino, Italy, 2014, pp. 277–292.

[MPP12] B. Motik, P. F. Patel-Schneider, and B. Parsia: OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax (Second Edition). W3C Recommendation 11 December 2012. 2012. url: http://www.w3.org/TR/2012/REC-owl2-syntax-20121211/ (accessed on October 1, 2014).

[MR06] P. Morville and L. Rosenfeld: Information Architecture for the World Wide Web. 3rd ed. O’Reilly Media, 2006.

[MRN14] A. Moro, A. Raganato, and R. Navigli: “Entity Linking Meets Word Sense Disambiguation: A Unified Approach”. In: Transactions of the Association for Computational Linguistics (TACL) 2 (2014), pp. 231–244.

[MRR12] K. Mauge, K. Rohanimanesh, and J.-D. Ruvini: “Structuring E-Commerce Inventory”. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012). Jeju, Republic of Korea, 2012, pp. 805–814.

[MRS09] C. D. Manning, P. Raghavan, and H. Schütze: An Introduction to Information Retrieval. Cambridge University Press, 2009.

[MS99] C. D. Manning and H. Schütze: Foundations of Statistical Natural Language Processing. The MIT Press, 1999.

[MSM93] M. Marcus, B. Santorini, and M. A. Marcinkiewicz: Building a Large Annotated Corpus of English: The Penn Treebank. Technical Report MS–CIS–93–87. University of Pennsylvania, Department of Computer & Information Science, 1993.

[MV05] H. D. Morris and D. Vesset: Managing Master Data for Business Performance Management: The Issues and Hyperion’s Solution. IDC White Paper. IDC, 2005.

[MYB87] T. W. Malone, J. Yates, and R. I. Benjamin: “Electronic Markets and Electronic Hierarchies”. In: Communications of the ACM 30 (6) (1987), pp. 484–497.

[Mye98] B. A. Myers: “A Brief History of Human-Computer Interaction Technology”. In: Interactions 5 (2) (1998), pp. 44–54.

[Nar+04] B. A. Nardi, D. J. Schiano, M. Gumbrecht, and L. Swartz: “Why We Blog”. In: Communications of the ACM 47 (12) (2004), pp. 41–46.

[NAS10] NASA: The NASA Quantity – Unit – Dimension – Type Ontology. Ontology Documentation. 2010. url: http://www.qudt.org/qudt/owl/1.0.0/qudt/index.html (accessed on January 22, 2015).

[Nat05] National Information Standards Organization: Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. Technical Report ANSI/NISO Z39.19–2005 (R2010). National Information Standards Organization, 2005.

[Nat08] National Institute of Standards and Technology (NIST): The International System of Units (SI) – NIST Special Publication 330. Ed. by B. N. Taylor and A. Thompson. Gaithersburg, MD, USA: National Institute of Standards and Technology, 2008.

[Nav09] R. Navigli: “Word Sense Disambiguation: A Survey”. In: ACM Computing Surveys 41 (2) (2009), 10:1–10:69.

[Nel65] T. Nelson: “Complex Information Processing: A File Structure for the Complex, the Changing and the Indeterminate”. In: Proceedings of the 20th ACM National Conference. Cleveland, Ohio, USA, 1965, pp. 84–100.

[Nel70] P. Nelson: “Information and Consumer Behavior”. In: Journal of Political Economy 78 (2) (1970), pp. 311–329.

[NO95] C. F. Naiman and A. M. Ouksel: “A Classification of Semantic Conflicts in Heterogeneous Database Systems”. In: Journal of Organizational Computing 5 (2) (1995), pp. 167–193.

[NS08] W. Nicholson and C. Snyder: Microeconomic Theory: Basic Principles and Extensions. 10th ed. Thomson South-Western, 2008.

[NS10] P. Nowakowski and H. Stuckenschmidt: “Ontology-based Product Catalogues: An Example Implementation”. In: Proceedings of Multikonferenz Wirtschaftsinformatik (MKWI 2010). Göttingen, Germany, 2010, pp. 15–25.

[NW01] M. Najork and J. L. Wiener: “Breadth-First Search Crawling Yields High-Quality Pages”. In: Proceedings of the Tenth International World Wide Web Conference (WWW 2001). Hong Kong, China, 2001, pp. 114–118.

[ODD06] E. Oren, R. Delbru, and S. Decker: “Extending Faceted Navigation for RDF Data”. In: Proceedings of the 5th International Semantic Web Conference (ISWC 2006). Athens, GA, USA, 2006, pp. 559–572.

[OHS09] A. Oulasvirta, J. P. Hukkinen, and B. Schwartz: “When More is Less: The Paradox of Choice in Search Engine Use”. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). Boston, Massachusetts, USA, 2009, pp. 516–523.

[OJ93] V. L. O’Day and R. Jeffries: “Orienteering in an Information Landscape: How Information Seekers Get from Here to There”. In: Proceedings of the INTERCHI Conference on Human Factors in Computing Systems (INTERCHI 1993). Amsterdam, The Netherlands, 1993, pp. 438–445.

[Ora11] Oracle: Master Data Management. Oracle White Paper. Oracle, 2011.

[Ore+08] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello: “Sindice.com: A Document-oriented Lookup Index for Open Linked Data”. In: International Journal of Metadata, Semantics and Ontologies (IJMSO) 3 (1) (2008), pp. 37–52.

[ORe05] T. O’Reilly: What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software. 2005. url: http://oreilly.com/web2/archive/what-is-web-20.html (accessed on January 25, 2016).

[ORe07] T. O’Reilly: “What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software”. In: International Journal of Digital Economics 65 (2007), pp. 17–37.

[ORG15] L. Otero-Cerdeira, F. J. Rodriguez-Martinez, and A. Gomez-Rodriguez: “Ontology Matching: A Literature Review”. In: Expert Systems with Applications 42 (2) (2015), pp. 949–971.

[PAA08] L. Polo Paredes, J. M. Alvarez Rodriguez, and E. R. Azcona: “Promoting Government Controlled Vocabularies for the Semantic Web: The EUROVOC Thesaurus and the CPV Product Classification System”. In: Proceedings of the Semantic Interoperability in the European Digital Library Workshop (SIEDL 2008). Tenerife, Spain, 2008, pp. 111–122.

[Pag+98] L. Page, S. Brin, R. Motwani, and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical Report “1999–66”. Stanford InfoLab, 1998.

[Pao+02] M. Paolucci, T. Kawamura, T. R. Payne, and K. P. Sycara: “Semantic Matching of Web Services Capabilities”. In: Proceedings of the First International Semantic Web Conference (ISWC 2002). Chia, Sardinia, Italy, 2002, pp. 333–347.

[Par72] D. L. Parnas: “On the Criteria to Be Used in Decomposing Systems into Modules”. In: Communications of the ACM 15 (12) (1972), pp. 1053–1058.

[PB13] E. Prud’hommeaux and C. Buil-Aranda: SPARQL 1.1 Federated Query. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/REC-sparql11-federated-query-20130321/ (accessed on May 24, 2014).

[PBB14] P. Petrovski, V. Bryl, and C. Bizer: “Integrating Product Data from Websites Offering Microdata Markup”. In: Proceedings of the 23rd World Wide Web Conference (WWW 2014), Companion Volume: Workshop on Data Extraction and Object Search (DEOS 2014). Seoul, Korea, 2014, pp. 1299–1304.

[PC14] E. Prud’hommeaux and G. Carothers: RDF 1.1 Turtle: Terse RDF Triple Language. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-turtle-20140225/ (accessed on May 15, 2014).

[Per95] J. Persky: “Retrospectives: The Ethology of Homo Economicus”. In: Journal of Economic Perspectives 9 (2) (1995), pp. 221–231.

[PFH06] A. Polleres, C. Feier, and A. Harth: “Rules with Contextually Scoped Negation”. In: Proceedings of the 3rd European Semantic Web Conference (ESWC 2006). Budva, Montenegro, 2006, pp. 332–347.

[PG07] F. Pérez and B. E. Granger: “IPython: A System for Interactive Scientific Computing”. In: Computing in Science and Engineering 9 (3) (2007), pp. 21–29.

[Pis01] C. Pissarides: “Search, Economics of”. In: International Encyclopedia of the Social & Behavioral Sciences. Ed. by N. J. Smelser and P. B. Baltes. Oxford, UK: Elsevier, 2001, pp. 13760–13768.

[Pit+02] J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel: “Personalized Search”. In: Communications of the ACM 45 (9) (2002), pp. 50–55.

[PM10] A. Passant and P. N. Mendes: “SparqlPuSH: Proactive Notification of Data Updates in RDF Stores Using PubSubHubbub”. In: Proceedings of the Sixth Workshop on Scripting and Development for the Semantic Web (SFSW 2010). Heraklion, Greece, 2010.

[Poo+11] F. Poon, T. Chin, M. Bentrovato, O. Shafiq, A. Chen, F. Triant, J. Rokne, and R. Alhajj: “Semantically Enhanced Matchmaking of Consumers and Providers: A Canadian Real Estate Case Study”. In: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (iiWAS 2011). Ho Chi Minh City, Vietnam, 2011, pp. 198–205.

[Por80] M. F. Porter: “An Algorithm for Suffix Stripping”. In: Program 14 (3) (1980), pp. 130–137.

[Pra05] M. J. Pratt: “ISO 10303, the STEP Standard for Product Data Exchange, and Its PLM Capabilities”. In: International Journal of Product Lifecycle Management 1 (1) (2005), pp. 86–94.

[PRS04] X. Pan, B. T. Ratchford, and V. Shankar: “Price Dispersion on the Internet: A Review and Directions for Future Research”. In: Journal of Interactive Marketing 18 (4) (2004), pp. 116–135.

[PRW08] A. Picot, R. Reichwald, and R. T. Wigand: Information, Organization and Management. Springer Berlin Heidelberg, 2008.

[PS08] E. Prud’hommeaux and A. Seaborne: SPARQL Query Language for RDF. W3C Recommendation 15 January 2008. 2008. url: http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/ (accessed on May 23, 2014).

[PW97] B. Peat and D. Webber: Introducing XML/EDI... "the e-Business framework". 1997. url: http://web.archive.org/web/20011005233701/http://www.geocities.com/WallStreet/Floor/5815/start.htm (accessed on October 29, 2015).

[Qui86] J. R. Quinlan: “Induction of Decision Trees”. In: Machine Learning 1 (1) (1986), pp. 81–106.

[Qui93] J. R. Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[Rad+13] A. Radinger, B. Rodriguez-Castro, A. Stolz, and M. Hepp: “BauDataWeb: The Austrian Building and Construction Materials Market as Linked Data”. In: Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS 2013). Graz, Austria, 2013, pp. 25–32.

[Rag+08] A. Ragone, U. Straccia, F. Bobillo, T. Di Noia, and E. Di Sciascio: “Fuzzy Bilateral Matchmaking in E-Marketplaces”. In: Proceedings of the 12th International Conference on Knowledge-based Intelligent Information and Engineering Systems (KES 2008). Zagreb, Croatia, 2008, pp. 293–301.

[RB01] E. Rahm and P. A. Bernstein: “A Survey of Approaches to Automatic Schema Matching”. In: The VLDB Journal 10 (4) (2001), pp. 334–350.

[RD00] E. Rahm and H. H. Do: “Data Cleaning: Problems and Current Approaches”. In: IEEE Data Engineering Bulletin 23 (4) (2000), pp. 3–13.

[RLS98] R. Raman, M. Livny, and M. Solomon: “Matchmaking: Distributed Resource Management for High Throughput Computing”. In: Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC 1998). Chicago, IL, USA, 1998, pp. 140–146.

[Rob77] S. E. Robertson: “The Probability Ranking Principle in IR”. In: Journal of Documentation 33 (4) (1977), pp. 294–304.

[Roc71] J. J. Rocchio: “Relevance Feedback in Information Retrieval”. In: The SMART Retrieval System – Experiments in Automatic Document Processing. Ed. by G. Salton. Englewood Cliffs, New Jersey: Prentice Hall, 1971, pp. 313–323.

[Ros73] S. A. Ross: “The Economic Theory of Agency: The Principal’s Problem”. In: American Economic Review 63 (2) (1973), pp. 134–139.

[RRS11] F. Ricci, L. Rokach, and B. Shapira: “Introduction to Recommender Systems Handbook”. In: Recommender Systems Handbook. Ed. by F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. Springer US, 2011. Chap. 1, pp. 1–35.

[RS76] S. E. Robertson and K. Sparck Jones: “Relevance Weighting of Search Terms”. In: Journal of the American Society for Information Science 27 (3) (1976), pp. 129–146.

[RSA04] C. Rocha, D. Schwabe, and M. P. Aragao: “A Hybrid Approach for Searching in the Semantic Web”. In: Proceedings of the 13th International World Wide Web Conference (WWW 2004). New York, NY, USA, 2004, pp. 374–383.

[RvAT13] H. Rijgersberg, M. van Assem, and J. Top: “Ontology of Units of Measure and Related Concepts”. In: Semantic Web - Linked Data for Science and Education 4 (1) (2013), pp. 3–13.

[RZ09] S. E. Robertson and H. Zaragoza: “The Probabilistic Relevance Framework: BM25 and Beyond”. In: Foundations and Trends in Information Retrieval 3 (4) (2009), pp. 333–389.

[Sac00] G. M. Sacco: “Dynamic Taxonomies: A Model for Large Information Bases”. In: IEEE Transactions on Knowledge and Data Engineering 12 (3) (2000), pp. 468–479.

[Sac05] G. M. Sacco: “The Intelligent E-Store: Easy Interactive Product Selection and Comparison”. In: Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC 2005). Munich, Germany, 2005, pp. 240–248.

[SB88] G. Salton and C. Buckley: “Term-weighting Approaches in Automatic Text Retrieval”. In: Information Processing and Management: An International Journal 24 (5) (1988), pp. 513–523.

[SB90] G. Salton and C. Buckley: “Improving Retrieval Performance by Relevance Feedback”. In: Journal of the American Society for Information Science 41 (4) (1990), pp. 288–297.

[SBC97] B. Shneiderman, D. Byrd, and W. B. Croft: “Clarifying Search: A User-Interface Framework for Text Searches”. In: D-Lib Magazine 3 (1) (1997).

[SBF98] R. Studer, R. Benjamins, and D. Fensel: “Knowledge Engineering: Principles and Methods”. In: Data & Knowledge Engineering 25 (1-2) (1998), pp. 161–197.

[SBM96] A. Singhal, C. Buckley, and M. Mitra: “Pivoted Document Length Normalization”. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1996). Zurich, Switzerland, 1996, pp. 21–29.

[SC08] L. Sauermann and R. Cyganiak: Cool URIs for the Semantic Web. W3C Interest Group Note 03 December 2008. 2008. url: http://www.w3.org/TR/2008/NOTE-cooluris-20081203/ (accessed on May 13, 2014).

[SC10] A. Singhal and M. Cutts: Using Site Speed in Web Search Ranking. Google Webmaster Central Blog. 2010. url: https://googlewebmastercentral.blogspot.de/2010/04/using-site-speed-in-web-search-ranking.html (accessed on February 18, 2016).

[SC12] M. Sanderson and W. B. Croft: “The History of Information Retrieval Research”. In: Proceedings of the IEEE 100 (Centennial Issue) (2012), pp. 1444–1451.

[Sch+04] D. J. Schiano, B. A. Nardi, M. Gumbrecht, and L. Swartz: “Blogging by the Rest of Us”. In: Extended Abstracts of the 2004 Conference on Human Factors in Computing Systems (CHI EA 2004). Vienna, Austria, 2004, pp. 1143–1146.

[Sch+14] M. Schmachtenberg, C. Bizer, A. Jentzsch, and R. Cyganiak: Linking Open Data Cloud Diagram 2014. 2014. url: http://lod-cloud.net/ (accessed on February 26, 2015).

[SchND] Schema.org: Welcome to Schema.org. url: http://schema.org/ (accessed on October 19, 2015).

[Sch04] B. Schwartz: The Paradox of Choice: Why More Is Less. Harper Perennial, 2004.

[SE13] P. Shvaiko and J. Euzenat: “Ontology Matching: State of the Art and Future Challenges”. In: IEEE Transactions on Knowledge and Data Engineering 25 (1) (2013), pp. 158–176.

[Sen00] J. A. Senn: “The Emergence of M-Commerce”. In: Computer 33 (12) (2000), pp. 148–150.

[SFW83] G. Salton, E. A. Fox, and H. Wu: “Extended Boolean Information Retrieval”. In: Communications of the ACM 26 (12) (1983), pp. 1022–1036.

[SGH12] A. Stolz, M. Ge, and M. Hepp: “GR4PHP: A Programming API for Consuming E-Commerce Data from the Semantic Web”. In: Proceedings of the First Workshop on Programming the Semantic Web (PSW 2012). Boston, MA, USA, 2012.

[SH01] M. Stonebraker and J. M. Hellerstein: “Content Integration for E-Business”. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD 2001). Santa Barbara, California, USA, 2001, pp. 552–560.

[SH13a] A. Stolz and M. Hepp: “Currency Conversion the Linked Data Way”. In: Proceedings of the First Workshop on Services and Applications over Linked APIs and Data (SALAD2013). Montpellier, France, 2013, pp. 44–55.

[SH13b] A. Stolz and M. Hepp: “From RDF to RSS and Atom: Content Syndication with Linked Data”. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media (Hypertext 2013). Paris, France, 2013, pp. 236–241.

[SH14] A. Stolz and M. Hepp: GR2RSS: Publishing Linked Open Commerce Data as RSS and Atom Feeds. Technical Report TR–2014–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2014.

[SH15a] A. Stolz and M. Hepp: “Adaptive Faceted Search for Product Comparison on the Web of Data”. In: Proceedings of the 15th International Conference on Web Engineering (ICWE 2015). Rotterdam, The Netherlands, 2015, pp. 420–429.

[SH15b] A. Stolz and M. Hepp: “An Adaptive Faceted Search Interface for Structured Product Offers on the Web”. In: Proceedings of the 4th International Workshop on Intelligent Exploration of Semantic Data (IESD 2015). Bethlehem, PA, USA, 2015.

[SH15c] A. Stolz and M. Hepp: “Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce”. In: Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015). Bethlehem, PA, USA, 2015.

[SHB06] N. Shadbolt, W. Hall, and T. Berners-Lee: “The Semantic Web Revisited”. In: IEEE Intelligent Systems 21 (3) (2006), pp. 96–101.

[She99] A. P. Sheth: “Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics”. In: Interoperating Geographic Information Systems. Ed. by M. F. Goodchild, M. J. Egenhofer, R. Fegeas, and C. A. Kottman. Vol. 495. The Springer International Series in Engineering and Computer Science. Springer US, 1999, pp. 5–29.

[SHH07] M. Stollberg, M. Hepp, and J. Hoffmann: “A Caching Mechanism for Semantic Web Service Discovery”. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007). Busan, Korea, 2007, pp. 480–493.

[SI05] A. Saaksvuori and A. Immonen: Product Lifecycle Management. 2nd ed. Springer Berlin Heidelberg, 2005.

[Sil+11] R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo: “Managing One Master Data – Challenges and Preconditions”. In: Industrial Management & Data Systems 111 (1) (2011), pp. 146–162.

[Sim59] H. A. Simon: “Theories of Decision-Making in Economics and Behavioral Science”. In: The American Economic Review 49 (3) (1959), pp. 253–283.

[Sim97] H. A. Simon: Administrative Behavior: A Study of Decision-making Processes in Administrative Organisations. 4th ed. New York, NY, USA: The Free Press, 1997.

[Sin01] A. Singhal: “Modern Information Retrieval: A Brief Overview”. In: IEEE Data Engineering Bulletin 24 (4) (2001), pp. 35–43.

[Sin12] A. Singhal: Introducing the Knowledge Graph: Things, Not Strings. Google Official Blog. 2012. url: http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html (accessed on September 24, 2015).

[Sit08] Sitemaps.org: Sitemaps XML format. 2008. url: http://www.sitemaps.org/protocol.html (accessed on October 19, 2015).

[SK08] H. Stuckenschmidt and M. Kolb: “Partial Matchmaking for Complex Product and Service Descriptions”. In: Proceedings of Multikonferenz Wirtschaftsinformatik (MKWI 2008). Munich, Germany, 2008.

[SKL14] M. Sporny, G. Kellogg, and M. Lanthaler: JSON-LD 1.0: A JSON-based Serialization for Linked Data. W3C Recommendation 16 January 2014. 2014. url: http://www.w3.org/TR/2014/REC-json-ld-20140116/ (accessed on May 16, 2014).

[SKR99] J. B. Schafer, J. Konstan, and J. Riedl: “Recommender Systems in E-Commerce”. In: Proceedings of the 1st ACM Conference on Electronic Commerce (EC 1999). Denver, Colorado, USA, 1999, pp. 158–166.

[SKW07] F. M. Suchanek, G. Kasneci, and G. Weikum: “YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia”. In: Proceedings of the 16th International World Wide Web Conference (WWW 2007). Banff, Alberta, Canada, 2007, pp. 697–706.

[SL05] J. Sauro and J. R. Lewis: “Estimating Completion Rates from Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations”. In: Proceedings of the Human Factors and Ergonomics Society 49th Annual Meeting (HFES 2005) 49 (24) (2005), pp. 2100–2104.

[SL98] B. F. Schmid and M. A. Lindemann: “Elements of a Reference Model for Electronic Markets”. In: Proceedings of the 31st Hawaii International Conference on System Sciences (HICSS 1998). Kohala Coast, HI, 1998, pp. 193–201.

[SLK04] V. Schmitz, J. Leukel, and O. Kelkar: “XML-based Data Exchange of Product Model Data in E-Procurement and E-Sales: The Case of BMEcat 2.0”. In: Proceedings of the International Conference on Economic, Technical and Organisational Aspects of Product Configuration Systems (PETO 2004). Copenhagen, Denmark, 2004, pp. 97–107.

[SLK05a] V. Schmitz, J. Leukel, and O. Kelkar: Specification BMEcat 2005. Frankfurt am Main, Germany: Bundesverband Materialwirtschaft, Einkauf und Logistik e.V., 2005.

[SLK05b] V. Schmitz, J. Leukel, and O. Kelkar: Specification BMEcat 2005. Frankfurt am Main, Germany: Bundesverband Materialwirtschaft, Einkauf und Logistik e.V., 2005.

[SLÖ08] J. W. Schemm, C. Legner, and H. Österle: “Global Data Synchronization – Lösungsansatz für das überbetriebliche Produktstammdatenmanagement zwischen Konsumgüterindustrie und Handel?” In: Wertschöpfungsnetzwerke. Ed. by J. Becker, R. Knackstedt, and D. Pfeiffer. Physica-Verlag Heidelberg, 2008, pp. 173–192.

[SM12] T. Steiner and S. Mirea: “SEKI@home, or Crowdsourcing an Open Knowledge Graph”. In: Proceedings of the First International Workshop on Knowledge Extraction and Consolidation from Social Media (KECSM 2012). Boston, USA, 2012.

[SM13] G. Schadow and C. J. McDonald: The Unified Code for Units of Measure. Version: 1.9. 2013. url: http://unitsofmeasure.org/ucum.html (accessed on October 30, 2014).

[SN10] S. Sakr and G. Al-Naymat: “Relational Processing of RDF Queries: A Survey”. In: SIGMOD Record 38 (4) (2010), pp. 23–28.

[Sor+10] S. Sorrentino, S. Bergamaschi, M. Gawinecki, and L. Po: “Schema Label Normalization for Improving Schema Matching”. In: Data & Knowledge Engineering 69 (12) (2010), pp. 1254–1273.

[Sow13] J. F. Sowa: Semantic Networks. 2013. url: http://www.jfsowa.com/pubs/semnet.htm (accessed on May 16, 2014).

[Spa72] K. Sparck Jones: “A Statistical Interpretation of Term Specificity and Its Application in Retrieval”. In: Journal of Documentation 28 (1) (1972), pp. 11–21.

[Spr90] K. Spremann: “Asymmetrische Information”. In: Zeitschrift für Betriebswirtschaft 60 (5/6) (1990), pp. 561–586.

[SR14] G. Schreiber and Y. Raimond: RDF 1.1 Primer. W3C Working Group Note 25 February 2014. 2014. url: http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/ (accessed on May 16, 2014).

[SRH13a] A. Stolz, B. Rodriguez-Castro, and M. Hepp: RDF Translator: A RESTful Multi-Format Syntax Converter for the Semantic Web. Technical Report TR–2013–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2013.

[SRH13b] A. Stolz, B. Rodriguez-Castro, and M. Hepp: “Using BMEcat Catalogs as a Lever for Product Master Data on the Semantic Web”. In: Proceedings of the 10th Extended Semantic Web Conference (ESWC 2013). Montpellier, France, 2013, pp. 623–638.

[SS09] U. Schonfeld and N. Shivakumar: “Sitemaps: Above and Beyond the Crawl of Duty”. In: Proceedings of the 18th International World Wide Web Conference (WWW 2009). Madrid, Spain, 2009, pp. 991–1000.

[ST09] G. M. Sacco and Y. Tzitzikas: Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience. Springer Berlin Heidelberg, 2009.

[Sta08] Statistisches Bundesamt: Klassifikation der Wirtschaftszweige: Mit Erläuterungen. Wiesbaden, Germany: Statistisches Bundesamt, 2008.

[Sta11] J. Stark: Product Lifecycle Management: 21st Century Paradigm for Product Realisation. 2nd ed. Springer-Verlag London, 2011.

[Ste+09] M. Stefaner, S. Ferré, S. Perugini, J. Koren, and Y. Zhang: “User Interface Design”. In: Dynamic Taxonomies and Faceted Search. Ed. by G. M. Sacco and Y. Tzitzikas. Springer Berlin Heidelberg, 2009. Chap. 4, pp. 75–112.

[Sti61] G. J. Stigler: “The Economics of Information”. In: The Journal of Political Economy 69 (3) (1961), pp. 213–255.

[Sto+07] M. Stollberg, U. Keller, H. Lausen, and S. Heymans: “Two-Phase Web Service Discovery Based on Rich Functional Descriptions”. In: Proceedings of the 4th European Semantic Web Conference (ESWC 2007). Innsbruck, Austria, 2007, pp. 99–113.

[Sto+14] A. Stolz, B. Rodriguez-Castro, A. Radinger, and M. Hepp: “PCS2OWL: A Generic Approach for Deriving Web Ontologies from Product Classification Systems”. In: Proceedings of the 11th Extended Semantic Web Conference (ESWC 2014). Anissaras/Hersonissou, Crete, Greece, 2014, pp. 644–658.

[Su+14] A.-J. Su, Y. C. Hu, A. Kuzmanovic, and C.-K. Koh: “How to Improve Your Search Engine Ranking: Myths and Reality”. In: ACM Transactions on the Web (TWEB) 8 (2) (2014), 8:1–8:25.

[SV99] C. Shapiro and H. R. Varian: Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press, 1999.

[SW65] S. S. Shapiro and M. B. Wilk: “An Analysis of Variance Test for Normality (Complete Samples)”. In: Biometrika 52 (3-4) (1965), pp. 591–611.

[SWY75] G. Salton, A. Wong, and C. Yang: “A Vector Space Model for Automatic Indexing”. In: Communications of the ACM 18 (11) (1975), pp. 613–620.

[Syc+02] K. Sycara, S. Widoff, M. Klusch, and J. Lu: “LARKS: Dynamic Matchmaking Among Heterogeneous Software Agents in Cyberspace”. In: Autonomous Agents and Multi-Agent Systems 5 (2) (2002), pp. 173–203.

[Syc+99] K. Sycara, J. Lu, M. Klusch, and S. Widoff: “Matchmaking among Heterogeneous Agents on the Internet”. In: Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace. Stanford, USA, 1999, pp. 152–164.

[TB96] C. P. Thorpe and J. C. L. Bailey: Commercial Contracts: A Practical Guide to Deals, Contracts, Agreements and Promises. Woodhead Publishing, 1996.

[TBH06] R. Tiwari, S. Buse, and C. Herstatt: “From Electronic to Mobile Commerce: Opportunities through Technology Convergence for Business Services”. In: Asia Pacific Tech Monitor 23 (5) (2006), pp. 38–45.

[TBL10] E. Turban, N. Bolloju, and T.-P. Liang: “Social Commerce: An E-Commerce Perspective”. In: Proceedings of the 12th International Conference on Electronic Commerce: Roadmap for the Future of Electronic Business (ICEC 2010). Honolulu, Hawaii, 2010, pp. 33–42.

[TH06] P. Thomas and D. Hawking: “Evaluation by Comparing Result Sets in Context”. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM 2006). Arlington, Virginia, USA, 2006, pp. 94–101.

[TH14] L. Török and M. Hepp: “Towards Portable Shopping Histories: Using GoodRelations to Expose Ownership Information to E-Commerce Sites”. In: Proceedings of the 11th Extended Semantic Web Conference (ESWC 2014). Anissaras/Hersonissou, Crete, Greece, 2014, pp. 691–705.

[TS08] G. Tindsley and P. Stephenson: “E-Tendering Process within Construction: A UK Perspective”. In: Tsinghua Science and Technology 13 (S1) (2008), pp. 273–278.

[TSK05] P.-N. Tan, M. Steinbach, and V. Kumar: “Classification: Basic Concepts, Decision Trees, and Model Evaluation”. In: Introduction to Data Mining. 1st ed. Addison-Wesley, 2005. Chap. 4, pp. 145–205.

[Tun09] D. Tunkelang: Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool, 2009.

[Tva11] M. Tvarožek: “Exploratory Search in the Adaptive Social Semantic Web”. In: Information Sciences and Technologies Bulletin of the ACM Slovakia 3 (1) (2011), pp. 42–51.

[UN 12] UN Economic Commission for Europe: UN/EDIFACT – Price/Sales Catalogue Message. United Nations Directories for Electronic Data Interchange for Administration, Commerce and Transport. 2012. url: http://www.unece.org/trade/untdid/d12b/trmd/pricat_c.htm (accessed on May 16, 2014).

[UN 14] UN Economic Commission for Europe: UN/EDIFACT – Product Data Message. United Nations Directories for Electronic Data Interchange for Administration, Commerce and Transport. 2014. url: http://www.unece.org/fileadmin/DAM/trade/untdid/d13b/trmd/prodat_c.htm (accessed on May 16, 2014).

[UniND] United Nations Development Programme: The United Nations Standard Products and Services Code (UNSPSC). url: http://www.unspsc.org/ (accessed on May 16, 2014).

[Uni06] United Nations Economic Commission for Europe: Recommendation No. 20: Codes for Units of Measure Used in International Trade. Revision 4. UNECE, 2006.

[Uni09a] United Nations Economic Commission for Europe: Codes for Units of Measure Used in International Trade Revision 6 – Annex II & Annex III. UN/ECE CEFACT Trade Facilitation Recommendation No. 20. UNECE, 2009.

[Uni09b] United Nations Economic Commission for Europe: Recommendation No. 20: Codes for Units of Measure Used in International Trade. Revision 6. UNECE, 2009.

[Uni14] United States Census Bureau: Quarterly Retail E-Commerce Sales: 4th Quarter 2014. Washington, DC, USA, 2014. url: http://www2.census.gov/retail/releases/historical/ecomm/14q4.pdf (accessed on February 26, 2015).

[Utg89] P. E. Utgoff: “Incremental Induction of Decision Trees”. In: Machine Learning 4 (2) (1989), pp. 161–186.

[Vei+01] D. Veit, J. P. Müller, M. Schneider, and B. Fiehn: “Matchmaking for Autonomous Agents in Electronic Marketplaces”. In: Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS 2001). Montreal, Canada, 2001, pp. 65–66.

[Vei03] D. Veit: Matchmaking in Electronic Markets: An Agent-based Approach towards Matchmaking in Electronic Negotiations. Springer Berlin Heidelberg, 2003.

[Ver+14] R. Verborgh, O. Hartig, B. De Meester, G. Haesendonk, L. De Vocht, M. Vander Sande, R. Cyganiak, P. Colpaert, E. Mannens, and R. Van de Walle: “Querying Datasets on the Web with High Availability”. In: Proceedings of the 13th International Semantic Web Conference (ISWC 2014). Riva del Garda, Trentino, Italy, 2014, pp. 180–196.

[VFK13] D. Vandic, F. Frasincar, and U. Kaymak: “Facet Selection Algorithms for Web Product Search”. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013). San Francisco, CA, USA, 2013, pp. 2327–2332.

[Vil11] B. Villazon-Terrazas: “A Method for Reusing and Re-engineering Non-ontological Resources for Building Ontologies”. PhD thesis. Universidad Politécnica de Madrid, 2011.

[VK14] D. Vrandečić and M. Krötzsch: “Wikidata: A Free Collaborative Knowledgebase”. In: Communications of the ACM 57 (10) (2014), pp. 78–85.

[Vol+09] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov: “Silk – A Link Discovery Framework for the Web of Data”. In: Proceedings of the WWW2009 Workshop on Linked Data on the Web (LDOW 2009). Madrid, Spain, 2009.

[Vor94] E. M. Voorhees: “Query Expansion Using Lexical-Semantic Relations”. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994). Dublin, Ireland, 1994, pp. 61–69.

[VvDF12] D. Vandic, J.-W. van Dam, and F. Frasincar: “Faceted Product Search Powered by the Semantic Web”. In: Decision Support Systems 53 (3) (2012), pp. 425–437.

[VVK00] U. Varshney, R. J. Vetter, and R. Kalakota: “Mobile Commerce: A New Frontier”. In: Computer 33 (10) (2000), pp. 32–38.

[VWM02] D. Veit, C. Weinhardt, and J. P. Müller: “Multidimensional Matchmaking for Electronic Markets”. In: Applied Artificial Intelligence 16 (9-10) (2002), pp. 853–869.

[Wan+09] X. Wang, X. Sun, F. Cao, L. Ma, N. Kanellos, K. Zhang, Y. Pan, and Y. Yu: “SMDM: Enhancing Enterprise-wide Master Data Management Using Semantic Web Technologies”. In: Proceedings of the VLDB Endowment 2 (2) (2009), pp. 1594–1597.

[WebNDa] Web Data Commons: Web Data Commons Extraction Report – August 2012 Corpus. url: http://www.webdatacommons.org/structureddata/2012-08/stats/stats.html (accessed on July 22, 2014).

[WebNDb] Web Data Commons: Web Data Commons Extraction Report – February 2012 Corpus. url: http://www.webdatacommons.org/structureddata/2012-02/stats/stats.html (accessed on July 22, 2014).

[WebNDc] Web Data Commons: Web Data Commons – RDFa, Microdata, and Microformats Data Sets – December 2014. url: http://www.webdatacommons.org/structureddata/2014-12/stats/stats.html (accessed on June 1, 2015).

[WebNDd] Web Data Commons: Web Data Commons – RDFa, Microdata, and Microformats Data Sets – November 2013. url: http://www.webdatacommons.org/structureddata/2013-11/stats/stats.html (accessed on July 22, 2014).

[Web11] A. Weber: “Marktanalyse von Software für Produkt-Informations-Management (PIM)”. Bachelor thesis. Universität der Bundeswehr München, Neubiberg, Germany, 2011.

[Wed+95] C. Wedekind, T. Seebeck, F. Bettens, and A. J. Paepke: “MHC-dependent Mate Preferences in Humans”. In: Biological Sciences 260 (1359) (1995), pp. 245–249.

[Wei+11] D. Wei, T. Wang, J. Wang, and A. Bernstein: “SAWSDL-iMatcher: A Customizable and Effective Semantic Web Service Matchmaker”. In: Web Semantics: Science, Services and Agents on the World Wide Web 9 (4) (2011), pp. 402–417.

[Wei+13] B. Wei, J. Liu, Q. Zheng, W. Zhang, X. Fu, and B. Feng: “A Survey of Faceted Search”. In: Journal of Web Engineering 12 (1) (2013), pp. 41–64.

[Wei12] D. M. Weijers: “Hedonism and Happiness in Theory and Practice”. PhD thesis. Victoria University of Wellington, 2012.

[Whi+06a] A. White, D. Newman, D. Logan, and J. Radcliffe: Mastering Master Data Management. Research Report. Stamford: Gartner, 2006.

[Whi+06b] R. W. White, B. Kules, S. M. Drucker, and M. Schraefel: “Supporting Exploratory Search”. In: Communications of the ACM 49 (4) (2006), pp. 37–39.

[Whi07] A. White: Magic Quadrant for Product Information Management, 2Q07. Research Report. Stamford: Gartner, 2007.

[Wil45] F. Wilcoxon: “Individual Comparisons by Ranking Methods”. In: Biometrics Bulletin 1 (6) (1945), pp. 80–83.

[Wil81] O. E. Williamson: “The Economics of Organization: The Transaction Cost Approach”. In: The American Journal of Sociology 87 (3) (1981), pp. 548–577.

[Wil83] O. E. Williamson: “Credible Commitments: Using Hostages to Support Exchange”. In: American Economic Review 73 (4) (1983), pp. 519–538.

[Woo14] D. Wood: What’s New in RDF 1.1. W3C Working Group Note 25 February 2014. 2014. url: http://www.w3.org/TR/2014/NOTE-rdf11-new-20140225/ (accessed on May 8, 2014).

[WS96] R. Y. Wang and D. M. Strong: “Beyond Accuracy: What Data Quality Means to Data Consumers”. In: Journal of Management Information Systems 12 (4) (1996), pp. 5–34.

[WZ12] C. Wang and P. Zhang: “The Evolution of Social Commerce: The People, Management, Technology, and Information Dimensions”. In: Communications of the Association for Information Systems 31 (5) (2012), pp. 105–127.

[Yee+03] K.-P. Yee, K. Swearingen, K. Li, and M. A. Hearst: “Faceted Metadata for Image Search and Browsing”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2003). Fort Lauderdale, Florida, USA, 2003, pp. 401–408.

[Zar+13] M. Zaremba, S. Bhiri, T. Vitvar, and M. Hauswirth: “Matchmaking of IaaS Cloud Computing Offers Leveraging Linked Data”. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC 2013). Coimbra, Portugal, 2013, pp. 383–388.

[ZD04] P. Ziegler and K. R. Dittrich: “Three Decades of Data Integration - All Problems Solved?” In: Proceedings of the 18th IFIP World Computer Congress (WCC 2004). Toulouse, France, 2004, pp. 3–12.

[ZL03] Y. Zhao and J. Lövdahl: “A Reuse-based Method of Developing the Ontology for E-Procurement”. In: Proceedings of the Nordic Conference on Web Services (NCWS). Växjö, Sweden, 2003.

[ZM06] J. Zobel and A. Moffat: “Inverted Files for Text Search Engines”. In: ACM Computing Surveys 38 (2) (2006).

Deep Product Comparison on the Semantic Web | Alex Stolz

In this thesis, Alex Stolz analyzes how the Semantic Web and its growing amount of structured data on the basis of the GoodRelations and schema.org vocabularies can be used to provide a better product search paradigm and interaction model for the Web, and improved, data-driven e-commerce in general. He identifies five core problems and describes theoretically sound and practically usable solutions to each of them, namely (1) an efficient crawling method that can deal with the fact that relevant markup is found in the deep branches of e-commerce Web sites, (2) a conceptual approach and toolchain for integrating product model master data from PIM/PDM/PLM spheres, (3) a method for harvesting product category information from standards like eCl@ss and the UNSPSC, (4) data quality management, and (5) an interaction model and user interface that supports the incremental discovery of the product option space.

The work is the first comprehensive analysis of this topic and is suited for researchers and practitioners alike. Starting with a comprehensive survey of the state of the art, it first analyzes the problems at a conceptual level and then presents a state-of-the-art implementation of several prototypes as a proof of concept. All software and data described in the thesis are available online and can serve as valuable input for future work in industry and academia.

Keywords: Semantic Web, Web of Linked Data, Faceted Search, Data Quality Management, SPARQL Query Language, User Interaction, Schema.org, Web Crawl, GoodRelations, Web Data Commons, eCl@ss, eClass, UNSPSC, GPC, BMEcat

Alex Stolz – Universität der Bundeswehr München, Neubiberg – 2017