QUERY PROCESSING FOR HETEROGENEOUS DATA INTEGRATION
USING ONTOLOGIES
BY
HUIYONG XIAO
B.S., Huazhong University of Science and Technology, 1999
M.S., Tsinghua University, China, 2002
THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the University of Illinois at Chicago, 2006
Chicago, Illinois
Copyright by
Huiyong Xiao
2006
To my parents.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor, Professor Isabel Cruz, without whose patient
guidance and persistent support I could not have finished my doctoral research. She gave me
both the motivation to start new research topics and the freedom to pursue my research
interests. Not only has she taught me systematic research methodology and ethics, but she has
also given me numerous suggestions to improve my English writing. All that I learned from her
will benefit me in my future career.
I would also like to thank all the members of my preliminary examination and thesis defense
committees, including Professors Kevin Chang, Ajay Kshemkalyani, Bing Liu, Peter Nelson, Aris
Ouksel, and Clement Yu. They have given me valuable feedback and advice on my thesis
research.
I feel very fortunate to have had Kalyan Ayloo, William Sunna, Paul Varkey, Nalin Makar,
Feihong Hsu, Ryan Aviles, Fang Fang, and Amira Rahal as my colleagues; they have made my
life at UIC much easier.
I owe special thanks to my wife, with whose support I have been able to save a large amount
of time for my thesis work.
HYX
TABLE OF CONTENTS
1 INTRODUCTION
  1.1 Problem Description
  1.2 Ontology Based Data Integration
  1.3 Our Solutions
    1.3.1 Central Data Integration
    1.3.2 Peer-to-Peer Data Integration
  1.4 Contributions

2 CENTRAL DATA INTEGRATION
  2.1 Introduction
    2.1.1 Problem Description
    2.1.2 Semantic XML Data Integration
    2.1.3 Contributions
  2.2 Related Work
    2.2.1 Semantic Integration
    2.2.2 Query Languages
    2.2.3 Query Rewriting
  2.3 Framework
  2.4 Integrating Structure and Semantics
    2.4.1 Local XML Schemas and Local RDFS Ontologies
    2.4.2 The Global RDFS Ontology
    2.4.3 Data Integration Semantics
  2.5 Query Processing
    2.5.1 Query Languages
    2.5.2 Certain Answers and Query Containment
    2.5.3 Query Rewriting
  2.6 Summary

3 HYBRID PEER-TO-PEER DATA INTEGRATION
  3.1 Introduction
  3.2 Related Work
  3.3 The PEPSINT Architecture
  3.4 Mapping Process
    3.4.1 Mapping Local RDF Schemas to the Global Ontology
    3.4.2 Mapping Local XML Schemas to the Global Ontology
  3.5 Query Processing
    3.5.1 Assumptions
    3.5.2 Query Answering in Data Integration Mode
    3.5.3 Query Answering in Hybrid P2P Mode
  3.6 Summary
4 PURE PEER-TO-PEER DATA INTEGRATION
  4.1 Introduction
  4.2 Related Work
  4.3 System Overview
    4.3.1 The Layered Peer Architecture
    4.3.2 An Illustrative Example
    4.3.3 RDF Metadata Representation
    4.3.4 P2P Mapping and Query Answering
  4.4 P2P Mappings
    4.4.1 RDFMS Meta-Ontology
    4.4.2 P2P Mapping Language – PML
  4.5 P2P Query Processing
    4.5.1 Query Languages
    4.5.2 Query Rewriting
  4.6 Summary

5 DATA INTEROPERABILITY IN THE SEMANTIC DESKTOP
  5.1 Introduction
  5.2 Related Work
  5.3 The Layered Multi-Ontology Framework
  5.4 System Architecture
  5.5 Semantic Data Organization
    5.5.1 Annotation
    5.5.2 Association
    5.5.3 Representation
  5.6 Semantic Data Navigation
  5.7 Personal Information Applications
    5.7.1 Motivation
    5.7.2 MVC-based PIA Development
    5.7.3 Implementation
  5.8 Services-based Desktop Interoperation
  5.9 Semantic Query Processing
    5.9.1 Query Processing in a PIA
    5.9.2 A2A Query Processing
  5.10 Summary

6 GEOSPATIAL DATA MANAGEMENT IN E-GOVERNMENT
  6.1 Introduction
  6.2 Related Work
    6.2.1 Ontology Alignment
    6.2.2 Query Processing
  6.3 Data Heterogeneities
  6.4 Architecture
    6.4.1 Schema Transformation and Ontology Mapping
    6.4.2 Query Processing
  6.5 Ontology Mapping
    6.5.1 Schema Transformation
    6.5.2 Ontology Alignment
      6.5.2.1 Mapping Types
      6.5.2.2 Deduction Process
      6.5.2.3 Mapping Representation
  6.6 Query Processing
    6.6.1 Query Languages
    6.6.2 Query Rewriting and Answering
      6.6.2.1 Query Expansion
      6.6.2.2 Query Mapping
      6.6.2.3 Rewriting Constants
    6.6.3 Discussion
  6.7 Summary
7 CONCLUSIONS

CITED LITERATURE

VITA
LIST OF TABLES
I    INFERENCE RULES FOR SEMANTIC RELATIONS
II   MAPPINGS BETWEEN XML SOURCE SCHEMA S1 AND THE LOCAL ONTOLOGY R1
III  MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND LOCAL ONTOLOGIES
IV   MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND XML SOURCE SCHEMAS
V    RESOURCE-RESOURCE ASSOCIATIONS
VI   RDF PROPERTIES FOR THE REPRESENTATION OF ASSOCIATIONS
VII  SEMANTIC HETEROGENEITY RESULTED FROM DIFFERENT ENCODINGS OF LAND USE DATA
VIII ELEMENT-LEVEL SCHEMA TRANSFORMATION
IX   MAPPINGS BETWEEN XML SOURCE SCHEMA D1 AND LOCAL ONTOLOGY O1
LIST OF FIGURES
1  Two XML sources with heterogeneous schemas.
2  A central architecture for XML data integration.
3  Local ontologies generated from XML source schemas.
4  A conceptual view on local sources.
5  The hybrid peer-to-peer architecture of PEPSINT.
6  Mediation for peer-to-peer query rewriting.
7  Thesaurus-based schema mapping process.
8  Two XML sources with structural heterogeneities.
9  The ontology-based framework for the integration of XML sources.
10 Local ontologies R1 and R2 transformed from XML source schemas S1 and S2.
11 The global ontology G that results from merging R1 and R2.
12 The global database of G.
13 The retrieved database on S1 w.r.t. S2 and that on S2 w.r.t. S1.
14 The GLRewriting algorithm.
15 A part of XML data integration setting.
16 The LLRewriting algorithm.
17 An example of heterogeneous XML and RDF data sources.
18 The PEPSINT architecture.
19 RDF schemas transformed from local XML source schemas.
20 The global ontology and its mapping table.
21 The layered peer architecture.
22 A motivating example for P2P data integration.
23 Local RDFS ontologies.
24 The meta-ontology of RDFMS.
25 An example of P2P mappings represented in RDFMS.
26 The P2PRewriting algorithm.
27 An example of files in a PI space.
28 An ontology-based framework of a PIM system.
29 The architecture of MOSE.
30 An example of an email message.
31 Data organization in the application, domain, and resource layers. All ontologies are represented in RDFS. Two application ontologies for PIAs, i.e., picture management and publication management, are constructed. Below them are four ontologies for the domains of Email, Talk, Publication, and Photo, respectively. At the bottom, the resource-file and resource-resource associations are represented as triples or in a graph.
32 The browser for PIM.
33 The PIA designer.
34 Desktop services composition and execution.
35 The ADRewriting algorithm.
36 An example of XML schematic heterogeneity.
37 Local XML land use data sources.
38 The ontology-based architecture.
39 An example of local RDFS ontologies.
40 An example of mapping between two land use taxonomies. The labels over the edges represent mapping types, followed (in between parentheses) by the deduction rule(s) that can be applied, if any.
41 A fragment of ontology mappings represented in RDFS.
42 The QueryRewriting algorithm.
43 The QueryExpand algorithm.
44 The ConstantMapping algorithm.
LIST OF ABBREVIATIONS
GaV Global-as-View
LaV Local-as-View
GLaV Global-Local-as-View
RDF Resource Description Framework
RDFS RDF Schema
RDFMS RDF Mapping Schema
OWL Web Ontology Language
DAML+OIL DARPA Agent Markup Language and Ontology Interface Language
RQL RDF Query Language
RDQL RDF Data Query Language
c-RQL Conjunctive RQL
c-XQuery Conjunctive XQuery
P2P Peer-to-Peer
PML P2P Mapping Language
PEPSINT Peer-to-Peer Semantic Integration Framework
MOSE Multiple Ontology based Semantic Desktop
PIM Personal Information Management
PIA Personal Information Application
SUMMARY
Data integration provides the ability to manipulate data transparently across multiple
distributed data sources. We have comprehensively studied several scenarios in which the need
for heterogeneous data integration arises, including centralized integration of XML data sources,
hybrid peer-to-peer integration of XML and RDF data sources, pure peer-to-peer XML and
RDF data integration and interoperability, personal information management within and across
desktops, and geospatial data integration for e-Government.
There are three kinds of heterogeneity: syntactic heterogeneity, which is caused by the
different languages used for modeling the sources; schematic heterogeneity, which results
from the different structures of the source schemas; and semantic heterogeneity, which arises
when different sources contain instances with different meanings or interpretations.
The key notion of the emerging Semantic Web is that of an ontology, which is a formal
and explicit specification of a shared conceptualization. The use of ontologies can benefit data
integration tasks in a variety of ways, including metadata representation, global conceptualiza-
tion, support for high-level queries, declarative mediation, and mapping support. As the main
contribution of this thesis, we focus on the role of ontologies in data integration and propose
a series of ontology-based approaches to resolve the heterogeneities so as to achieve data inter-
operability. In this thesis, we report our achievements on ontology-based heterogeneous data
integration, and discuss the fundamental issues, including metadata representation, mapping
process, and query processing, in our approaches to different applications of data integration.
CHAPTER 1
INTRODUCTION
1.1 Problem Description
Data integration provides the ability to manipulate data transparently across multiple data
sources. It is relevant to a number of applications, including enterprise information integration,
medical information management, geographical information systems, and e-Commerce.
Depending on the architecture, there are two kinds of systems: central data integration systems
(3; 7; 29; 39; 81; 109) and peer-to-peer data integration systems (6; 11; 15; 40; 59; 86). A
central data integration system usually has a global schema, which provides the user with a
uniform interface to access the information stored in the data sources. In contrast, a
peer-to-peer data integration system has no global point of control over the data sources (or
peers); instead, any peer can accept user queries for the information distributed across the
whole system.
The two most important approaches for building a data integration system are Global-as-
View (GaV) and Local-as-View (LaV) (109; 70). In the GaV approach, every entity in the
global schema is associated with a view over the local source schemas. Querying strategies are
therefore simple, but the evolution of the local source schemas is not easily supported.
Conversely, the LaV approach permits changes to the source schemas without affecting the
global schema, since the local schemas are defined as views over the global schema, but query
processing can be complex.
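The GaV idea can be sketched in a few lines of code. The relation names and data below are invented for illustration (they are not from the thesis): each global entity is defined as a view over the sources, so answering a global query amounts to simply unfolding that view.

```python
# GaV sketch: a global relation defined as a view over two sources.
# Source tuples (toy data, invented for illustration):
source1 = [("b1", "a1"), ("b2", "a1"), ("b2", "a3")]   # (booktitle, name)
source2 = [("a2", "t1"), ("w3", "t2")]                  # (fullname, title)

# GaV: the hypothetical global relation publication(title, author) is a
# view over the sources; note source2 stores its columns in the reverse
# order, which the view definition normalizes.
def publication():
    yield from ((t, a) for (t, a) in source1)
    yield from ((t, a) for (a, t) in source2)

# A query over the global schema -- "authors of b2" -- is answered by
# unfolding the view definition, with no further rewriting needed.
print(sorted(a for (t, a) in publication() if t == "b2"))  # ['a1', 'a3']
```

Under LaV the direction is reversed: each source is described as a view over the global schema, so answering the same query requires a rewriting step rather than simple unfolding.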
Data sources can be heterogeneous in syntax, schema, or semantics, thus making data
interoperation a difficult task (16). Syntactic heterogeneity is caused by the use of different
models or languages. Schematic heterogeneity results from structural differences. Semantic
heterogeneity is caused by different meanings or interpretations of data in various contexts. To
achieve data interoperability, these heterogeneities must be resolved.
The advent of XML has created a syntactic platform for Web data standardization and
exchange. However, schematic data heterogeneity may persist, depending on the XML schemas
used (e.g., different nesting hierarchies). Likewise, semantic heterogeneity may persist even
when neither syntactic nor schematic heterogeneity occurs (e.g., when the same concept is
named differently). In this thesis, we are concerned with resolving all three kinds of
heterogeneity across different sources.
We call semantic data integration the process of using a conceptual representation of the
data and of their relationships to eliminate possible heterogeneities. At the heart of seman-
tic data integration is the concept of ontology, which is an explicit specification of a shared
conceptualization (55; 54).
Ontologies were developed by the Artificial Intelligence community to facilitate knowledge
sharing and reuse (56). Carrying semantics for particular domains, ontologies are largely used
for representing domain knowledge. A common use of ontologies is data standardization and
conceptualization via a formal machine-understandable ontology language. For example, the
global schema in a data integration system may be an ontology, which then acts as a mediator
for reconciling the heterogeneities between the different sources. As an example of the use of
ontologies in peer-to-peer data integration, we can produce for each source schema a local
ontology, which is made accessible to other peers so as to support semantic mappings between
different local ontologies.
1.2 Ontology Based Data Integration
An ontology is a formal, explicit specification of a shared conceptualization (55). In this
definition, “conceptualization” refers to an abstract model of some domain knowledge in the
world that identifies that domain’s relevant concepts. “Shared” indicates that an ontology
captures consensual knowledge, that is, knowledge accepted by a group. “Explicit” means that
the types of concepts in an ontology and the constraints on these concepts are explicitly defined.
Finally, “formal” means that the ontology should be machine-understandable.
Typical “real-world” ontologies include taxonomies on the Web (e.g., Yahoo! categories),
catalogs for on-line shopping (e.g., Amazon.com’s product catalog), and domain-specific stan-
dard terminology (e.g., UMLS1 and Gene Ontology2). As an online lexical database, WordNet3
is widely used for discovering semantic relationships between concepts.
Existing ontology languages include:
1http://www.nlm.nih.gov/research/umls/
2http://www.geneontology.org
3http://www.cogsci.princeton.edu/∼wn/
XML Schema. Strictly speaking, XML Schema is a schema language for Web data rather than
a semantic markup language. The database-compatible data types supported by XML Schema
provide a way to specify a hierarchical model.1 However, there are no explicit constructs for
defining classes and properties in XML Schema, so ambiguities may arise when mapping an
XML-based data model to a semantic model.
RDF and RDFS. RDF (Resource Description Framework) is a data model developed by the
W3C for describing Web resources.2 RDF allows for the specification of the semantics of
data in a standardized, interoperable manner. In RDF, a pair of resources (nodes) con-
nected by a property (edge) forms a statement: (resource, property, value). RDFS (RDF
Schema)3 is a language for describing vocabularies of RDF data in terms of primitives
such as rdfs:Class, rdf:Property, rdfs:domain, and rdfs:range. In other words, RDFS is used
to define the semantic relationships between properties and resources.
DAML+OIL. DAML+OIL (DARPA Agent Markup Language-Ontology Interface Language)
is a full-fledged Web-based ontology language developed on top of RDFS.4 It features an
XML-based syntax and a layered architecture. DAML+OIL provides modeling primitives
commonly used in frame-based approaches to ontology engineering, and formal semantics
1http://www.w3.org/TR/xmlschema-2
2http://www.w3.org/TR/rdf-primer
3http://www.w3.org/TR/rdf-schema
4http://www.w3.org/TR/daml+oil-reference
and reasoning support found in description logic approaches. It also integrates XML
Schema data types for semantic interoperability in XML.
OWL. OWL (Web Ontology Language) is a semantic markup language for publishing and
sharing ontologies on the Web. It is developed as a vocabulary extension of RDF and is
derived from DAML+OIL.1
Other ontology languages include SHOE (Simple HTML Ontology Extensions),2 XOL (Ontol-
ogy Exchange Language),3 and UML (Unified Modeling Language).4
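For illustration, the RDFS primitives mentioned above can be combined into a small vocabulary and a statement that uses it. The snippet is in Turtle syntax, and the ex: namespace is invented for the example:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab#> .

# Vocabulary: a Book class and a title property constrained by
# rdfs:domain and rdfs:range.
ex:Book   rdf:type    rdfs:Class .
ex:title  rdf:type    rdf:Property ;
          rdfs:domain ex:Book ;
          rdfs:range  rdfs:Literal .

# A statement (resource, property, value) using that vocabulary.
ex:b1 rdf:type ex:Book ;
      ex:title "b1" .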
Among all these ontology languages, we are most interested in XML Schema and RDFS
for their particular roles in data integration and the “Semantic Web” (42). More specifically,
XML Schema and RDFS both use XML syntax and can be used for data modeling and ontology
representation. However, they differ in that XML data has a document structure, given by the
nesting of elements within an individual XML document, whereas RDF data has a domain
structure, formed by concepts and the relationships between those concepts (40; 59).
1http://www.w3.org/TR/owl-ref
2http://www.cs.umd.edu/projects/plus/shoe
3http://www.ai.sri.com/pkarp/xol/
4http://www.uml.org/
Ontologies have been extensively used in data integration systems because they provide an
explicit and machine-understandable conceptualization of a domain. They have been used in
one of the following three ways (111):
Single ontology approach. All source schemas are directly related to a shared global ontol-
ogy that provides a uniform interface to the user (38). However, this approach requires
that all sources have nearly the same view on a domain, with the same level of granularity.
A typical example of a system using this approach is SIMS (7).
Multiple ontology approach. Each data source is described by its own (local) ontology sep-
arately. Instead of using a common ontology, local ontologies are mapped to each other.
For this purpose, an additional representation formalism is necessary for defining the
inter-ontology mappings. The OBSERVER system (81) is an example of this approach.
Hybrid ontology approach. A combination of the two preceding approaches is used. First,
a local ontology is built for each source schema, which, however, is not mapped to other
local ontologies, but to a global shared ontology. New sources can be easily added with
no need for modifying existing mappings. Our layered framework (38) is an example of
this approach.
The single and hybrid approaches are appropriate for building central data integration
systems, the former being more appropriate for GaV systems and the latter for LaV systems.
A hybrid peer-to-peer system, where a global ontology resides in a “super-peer”, can also use the
hybrid ontology approach (40). The multiple ontology approach is best suited to building
pure peer-to-peer data integration systems, where there are no super-peers.
We identify the following five uses of ontologies in data integration:
Metadata Representation. Metadata (i.e., source schemas) in each data source can be ex-
plicitly represented by a local ontology, using a single language.
Global Conceptualization. The global ontology provides a conceptual view over the schemat-
ically heterogeneous source schemas.
Support for High-level Queries. Given a high-level view of the sources, as provided by a
global ontology, the user can formulate a query without specific knowledge of the different
data sources. The query is then rewritten into queries over the sources, based on the
semantic mappings between the global and local ontologies.
Declarative Mediation. Query processing in a hybrid peer-to-peer system uses the global
ontology as a declarative mediator for query rewriting between peers.
Mapping Support. A thesaurus, formalized in terms of an ontology, can be used for the
mapping process to facilitate its automation.
In the following section we discuss five case studies, which correspond to the above five uses.
The first three case studies are in the context of centralized data integration systems, while the
last two are in the context of peer-to-peer data integration systems. We base our discussion on
our previous work (38; 39; 40; 113; 116).
1.3 Our Solutions
1.3.1 Central Data Integration
In this section, we describe three case studies on the use of ontologies in the context of central
data integration. To make the issues concrete, we use a running example involving two XML
sources and demonstrate how to enable semantic interoperation between them.
Example 1.1 Figure 1 displays two XML schemas (S1 and S2) and their respective documents
(D1 and D2), which are represented as trees. The two XML documents conform to different
schemas but represent data with similar semantics. In particular, both schemas represent a
many-to-many relationship between two concepts: book and author in S1 (equivalently denoted
by article and writer in S2). However, structurally speaking, they are different: S1 (book-
centric schema) has the author element nested under the book element, whereas S2 (author-
centric schema) has the article element nested under the writer element.
Semantically equivalent data elements, such as the authors of publication “b2”, can be
reached using different XML path patterns, respectively for schema S1 and schema S2:
Figure 1. Two XML sources with heterogeneous schemas.
/books/book[@booktitle="b2"]/author/@name
and
/writers/writer[article/@title="b2"]/@fullname
where the contents in the square brackets specify the constraints for the search patterns.
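The equivalence of the two navigation patterns can be checked on toy documents modeled after D1 and D2 in Figure 1; the data values here are illustrative, not taken from the thesis. Python's xml.etree.ElementTree supports the restricted XPath used below:

```python
import xml.etree.ElementTree as ET

# Book-centric document (schema S1): authors nested under books.
d1 = ET.fromstring("""
<books>
  <book booktitle="b1"><author name="a1"/></book>
  <book booktitle="b2"><author name="a1"/><author name="a3"/></book>
</books>""")

# Author-centric document (schema S2): articles nested under writers.
d2 = ET.fromstring("""
<writers>
  <writer fullname="a1"><article title="b2"/></writer>
  <writer fullname="a3"><article title="b2"/></writer>
</writers>""")

# S1: follow /books/book[@booktitle="b2"]/author/@name.
authors_s1 = [a.get("name")
              for a in d1.findall("./book[@booktitle='b2']/author")]

# S2: the same conceptual query must invert the navigation direction,
# selecting writers that contain an article titled "b2".
authors_s2 = [w.get("fullname")
              for w in d2.findall("./writer")
              if w.find("article[@title='b2']") is not None]

print(sorted(authors_s1))  # ['a1', 'a3']
print(sorted(authors_s2))  # ['a1', 'a3']
```

The same answer set is reached, but only by writing two structurally different queries, which is exactly the schematic heterogeneity the ontology layer is meant to hide.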
The example demonstrates that multiple XML schemas (or structures) can exist for a single
conceptual model. In comparison, the schema or ontology languages (e.g., RDFS, DAML+OIL,
and OWL) that operate on the conceptual level are structurally flat so that the user can
formulate a query from a conceptual perspective without considering the structure of the
source (3; 29; 111; 39).
Figure 2. A central architecture for XML data integration.
Figure 2 shows the architecture of a system that interoperates among schematically
heterogeneous data sources (39). The following three case studies examine in detail the
principles embodied in this architecture.
Case Study 1 - Metadata Representation
As a first step for bridging across the heterogeneities of diverse local sources, a local ontology
must be generated from each source database schema (e.g., relational, XML, or RDF). A local
ontology is a conceptualization of the elements and relationships between elements in each
source schema. To facilitate interoperation, those ontologies should be expressed using the
same model. Furthermore, for the sake of correct query processing, the structure of source
schemas and the integrity constraints (e.g., relational foreign keys) expressed on the schemas
should be preserved in the local ontology. We choose RDFS to represent each local ontology.
In our approach, ontology generation from source schemas is accomplished by model-based
schema transformation (38). In particular, the following approaches are taken for the relational
and XML schema transformation:
Relational Schema. Relations are converted into RDF classes and attributes into RDF prop-
erties, which are attached to the class corresponding to the relation to which the attributes
belong. Foreign key dependencies between two relations are represented by two properties
(corresponding to the two relations) sharing the same value in the target local ontology.
XML Schema. Complex-type elements are converted into RDF classes and simple-type el-
ements and attributes are converted into RDF properties. This transformation process
encodes the mapping information between each concept in the local RDF ontology and
the path to the corresponding element in the XML source. Nesting relationships between
XML elements are represented using a meta-property rdfx:contained; rdfx stands for the
namespace where contained is defined. This meta-property enables the RDF representa-
tion of the XML nesting structure, by connecting two RDF classes representing the two
nesting XML elements.
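The XML Schema transformation rule above can be sketched programmatically. This is a rough illustration, not the thesis implementation: the dictionary-based schema representation and the function name are invented, while the rule itself (complex-type elements become classes, attributes become properties, nesting becomes rdfx:contained) follows the text.

```python
# Sketch of the XML-to-RDFS transformation rules, over a simplified,
# hypothetical schema representation:
#   {element: {"attrs": [attribute, ...], "children": [element, ...]}}
def xml_schema_to_rdfs(schema):
    triples = []
    for elem, info in schema.items():
        cls = elem.capitalize()
        # Complex-type elements become RDF classes.
        triples.append((cls, "rdf:type", "rdfs:Class"))
        # Attributes become RDF properties attached to the class.
        for attr in info.get("attrs", []):
            triples.append((attr, "rdf:type", "rdf:Property"))
            triples.append((attr, "rdfs:domain", cls))
        # Nesting is preserved with the rdfx:contained meta-property.
        for child in info.get("children", []):
            triples.append((child.capitalize(), "rdfx:contained", cls))
    return triples

# Schema S1 from the running example: books/book*/author,
# with @booktitle on book and @name on author.
s1 = {
    "books": {"children": ["book"]},
    "book": {"attrs": ["booktitle"], "children": ["author"]},
    "author": {"attrs": ["name"]},
}
for triple in xml_schema_to_rdfs(s1):
    print(triple)
```

Among the emitted triples is ('Author', 'rdfx:contained', 'Book'), matching the local ontology S′1 of Figure 3.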
Example 1.2 Following Example 1.1, Figure 3 shows the local RDF ontologies S′1 and S′2,
which are generated respectively from the XML source schemas S1 and S2.
Figure 3. Local ontologies generated from XML source schemas.
Case Study 2 - Global Conceptualization
To make the integration system accessible through the uniform interface of the global on-
tology, semantic mappings are established between the global ontology and the local ontologies.
In our approach, this mapping process is accomplished during the construction of the global on-
tology, which is generated by merging the local ontologies, for example, using a GaV approach.
Each local ontology is merged, in turn, into the global ontology, which plays the role of the target ontology.
The process of ontology merging consists of several operations:
• Copying a class and/or its properties: classes and properties that do not exist in the
target ontology are copied into it.
• Class Merging: conceptually equivalent classes in the local and target ontologies are
combined into one class in the target ontology.
• Property Merging: conceptually equivalent properties of a class in the local and target
ontologies are combined into one property in the target ontology.
• Relationship Merging: conceptually equivalent relationships from one class c1 to another
class c2 in the local and target ontologies are combined into a single relationship in the
target ontology (i.e., an RDF property having c1 as its domain and c2 as its range).
• Class Generalization: related classes in the local and target ontologies can be generalized
into a superclass. The superclass can be obtained by searching an existing knowledge
domain (e.g., the DAML Ontology Library 1) or reasoning over a thesaurus.
We note that along with the above operations, semantic correspondences are established.
For example, for each element pL in a local ontology, if there exists a semantically equivalent
1http://www.daml.org/ontologies/
element pG in the global ontology, the two elements will be merged and a correspondence
between pL and pG will be generated.
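The merging loop can be sketched as follows. The name-based equivalence test is a deliberately naive stand-in for the semantic matching discussed above (thesaurus lookup, user input, or an algorithm such as PROMPT).

```python
def equivalent(a, b):
    # Placeholder for real semantic matching between ontology elements.
    return a.lower() == b.lower()

def merge(global_onto, local_onto):
    """Merge a local ontology into the target (global) ontology, recording a
    correspondence for every element that is copied or merged."""
    correspondences = []
    for element in local_onto:
        match = next((g for g in global_onto if equivalent(g, element)), None)
        if match is None:
            global_onto.append(element)   # copy: element is new to the target
            match = element
        correspondences.append((element, match))  # record the mapping
    return correspondences

g = ["Book", "Author"]
print(merge(g, ["book", "Writer"]))  # [('book', 'Book'), ('Writer', 'Writer')]
print(g)                             # ['Book', 'Author', 'Writer']
```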
Figure 4. A conceptual view on local sources.
Example 1.3 Figure 4 shows the global RDF ontologies generated by merging the local ontolo-
gies S′1 and S′2 of Example 1.2. Note that the classes (properties) represented in grey are merged
classes (properties), and the classes Book and Author are also extended, with Publication and
Person being their superclasses, respectively.
Case Study 3 - Support for High-level Queries
Given a conceptual view of available information sources, the user may pose a query in
terms of the global ontology. We say the query is a high-level query if its formulation does not
require awareness of particular source schemas. The query is then reformulated by a rewriting
algorithm into a subquery for each source. The subqueries over sources are subject to the
structure of source schemas, and may be expressed in a different language from that of the
high-level query. An inference mechanism may be needed in the query rewriting, for example,
when a concept involved in the query has super-concepts or sub-concepts.
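A minimal sketch of this inference-aware rewriting step, using the concept names of Example 1.4 (the hierarchy and correspondence tables are illustrative stand-ins for the mapping table of our framework):

```python
# Hypothetical concept hierarchy and concept-to-source correspondences.
subconcepts = {"Person": ["Author"], "Author": []}
source_concepts = {
    "Author": [("S1", "Author"), ("S2", "Writer")],
}

def expand(concept):
    """Inference step: a query on a concept also covers its sub-concepts."""
    result = [concept]
    for sub in subconcepts.get(concept, []):
        result.extend(expand(sub))
    return result

def rewrite(concept):
    """Produce one subquery target per source that materializes the concept."""
    targets = []
    for c in expand(concept):
        targets.extend(source_concepts.get(c, []))
    return targets

print(rewrite("Person"))  # [('S1', 'Author'), ('S2', 'Writer')]
```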
In addition to handling high-level queries on the global ontology, a bidirectional query
translation algorithm is also supported (39) (see Figure 2). In this case, we can translate a
query posed against an XML source to an equivalent query against any other XML source.
Example 1.4 Suppose the user asks the query “Find the persons who have written publication
b2.” This query will be expressed in an RDF query language such as RDQL.1 First, Person
has sub-concept Author, which corresponds to two different concepts (Author and Writer) in two
different RDF local databases. Therefore the initial query will be rewritten as two sub-queries to
those databases. In turn, those queries may be further rewritten using an XML query language
incorporating the path expressions of Example 1.1 (unless the data was materialized under the
RDF local ontologies). Using the bidirectional query translation mechanism, a query involving
the concepts Book and Author in one source will be translated into a query involving Article and
Writer in another data source, by using the correspondences established by the global ontology.
1http://www.hpl.hp.com/semweb/rdql.htm
Figure 5. The hybrid peer-to-peer architecture of PEPSINT.
1.3.2 Peer-to-Peer Data Integration
We consider again the two XML sources of Figure 1. However, this time they are connected
in a peer-to-peer architecture. We consider a hybrid peer-to-peer architecture with two types of
peers: super-peers containing the global RDF ontology, and peers each containing a data source
and an ontology. Each peer represents an autonomous information system and connects to a
super-peer via semantic mappings. Peer-to-peer data integration systems or frameworks include
LRM (Local Relational Model) (15), Hyperion (6), Piazza (59), PeerDB (86), SEWASIE (11),
and PEPSINT (40).
Case Study 4 - Declarative Mediation
The PEPSINT system is a hybrid peer-to-peer system whose architecture is shown in Fig-
ure 5. PEPSINT uses a GaV approach. The global ontology in a super-peer serves two functions:
(1) It provides the user with a uniform high-level view of the data sources in the distributed
peers, and (2) it serves as a mediator for query translation from one peer to another. The
former function is similar to the one described in Case Study 3. The latter function is discussed
in detail here.
The user can pose a query against the local XML or RDF data source in any peer. Locally,
the query will be executed on the local source to get a local answer. Meanwhile, the source
query is rewritten into a target query over every connected peer. The query rewriting utilizes
the global ontology, and the composition of mappings from the original peer to the super peer
with mappings from the super-peer to the target peers. By executing the target query, each
peer returns an answer to the original peer, called the remote answer. The local and remote
answers are integrated and returned to the user at the site of the originating peer.
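This query propagation can be sketched as follows. The peers, vocabularies, and mappings below are all hypothetical; the point is the composition of peer-to-super-peer and super-peer-to-peer mappings during rewriting.

```python
class Peer:
    def __init__(self, name, data):
        self.name, self.data = name, data
    def execute(self, concept):
        # Stand-in for running a real XML/RDF query on the local source.
        return self.data.get(concept, [])

# Mappings through the super-peer's global ontology (cf. Figure 5).
to_global = {"publication": "Publication", "book": "Publication"}
from_global = {("Publication", "p1"): "publication",
               ("Publication", "p2"): "book"}

def rewrite(concept, target):
    # Rewriting by composing the two mapping steps.
    return from_global[(to_global[concept], target.name)]

def answer(origin, concept, peers):
    local = origin.execute(concept)                  # local answer
    remote = [a for p in peers if p is not origin    # remote answers
                for a in p.execute(rewrite(concept, p))]
    return local + remote                            # integrated result

p1 = Peer("p1", {"publication": ["b1"]})
p2 = Peer("p2", {"book": ["b2"]})
print(answer(p1, "publication", [p1, p2]))  # ['b1', 'b2']
```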
Example 1.5 Consider two XML sources, one in peer p1 and the other in peer p2, and a global
ontology expressed in RDF in a super-peer. As shown in Figure 6, the global ontology consists
of a class Publication and two sub-classes Paper and Book. The Publication class is mapped to
the publication element of the XML source in p1, while the class Book corresponds to book of
the XML source in p2. An XML query Q1 on p1 involving publication will be rewritten to a
target query Q2 on p2 involving book. The XML fragments inside the dashed-line boxes
are integrated and returned as answers.
Case Study 5 - Mapping Support
Figure 6. Mediation for peer-to-peer query rewriting.
A thesaurus can be used for data integration to facilitate the automation of the schema mapping
process (99; 38). In particular, it can help discover the semantic relationships between
concepts in different schemas or ontologies. WordNet is an example of such a thesaurus. It
consists of a network of terms and their semantic relations (e.g., synonym, hypernym, and
hyponym). A term may have multiple senses, each corresponding to a synset (a set of synonyms).
A thesaurus-based schema matching approach has been devised for peer-to-peer data inte-
gration (113); this approach consists of the following three steps (as illustrated in Figure 7):
1. Path Exploration. Among the semantic relations between synsets in WordNet, we
choose those of synonymy, hyponymy/hypernymy (i.e., more specific/more general), and related-
to, when enumerating the paths between two arbitrary concepts from different local ontologies
in peers. As shown in Figure 7, six paths are found from Quantity to Number.
2. Path Selection. When multiple paths are found between two concepts, we choose the
optimal path, which corresponds to the most likely semantic relation between the two concepts.
For this purpose, semantic similarities (i.e., the number above each path in the figure) are
calculated for all the paths. The calculation is implemented by assigning a weight to each
semantic relation (e.g., 1 for synonymy and 0.8 for hypernymy) and then taking the average
of the weights along the path. The path with the highest similarity is then chosen as the optimal
path. If there is more than one such path, then the user’s intervention is needed.
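The similarity computation can be sketched as follows. The weights are those given in Figure 7; the candidate paths are illustrative (the first two correspond to the 1.0 and 0.9 paths from Quantity to Number in the figure).

```python
# Relation weights, as in Figure 7.
WEIGHTS = {"SYN": 1.0, "HYPER": 0.8, "HYPO": 0.8, "REL": 0.5}

def similarity(path):
    """Semantic similarity of a path = average of its relation weights."""
    return sum(WEIGHTS[r] for r in path) / len(path)

# Hypothetical candidate paths between two concepts.
paths = [["SYN", "SYN"], ["SYN", "HYPO"], ["HYPO", "HYPER"]]
best = max(paths, key=similarity)
print(best, similarity(best))  # ['SYN', 'SYN'] 1.0
```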
3. Semantic Derivation. The last step is to derive the (direct) semantic relationship,
Sem, between the two concepts by reasoning on the semantic relations along the optimal path
p between them. More specifically, Sem(p) = Sem(pn) is computed based on the following
Figure 7. Thesaurus-based schema mapping process. (Relation weights: SYN (synonym) 1, HYPER (hypernym) 0.8, HYPO (hyponym) 0.8, REL (related-to) 0.5.)
recursive algorithm, where pn = (r1, r2, ..., rn), and ri(1≤i≤n) are the edges (semantic relations)
along p.
Sem(pn) = Sem(pn−1) ∧ Sem(rn), if n > 1; (1.1)
Sem(pn) = ≈, ⊇, ⊆, or ∼, if n = 1. (1.2)
In the above formulas, the symbols ≈, ⊇, ⊆, and ∼ stand for the semantic relations of
synonymy, hypernymy, hyponymy, and related-to, respectively. The operation ∧ obeys the rules
that are shown in Table I. Specifically, the first row and the first column are the operands, and
the cells at the intersection of each pair of operands contain the results of the operation ∧ on
both operands. A question mark indicates that human intervention is needed.
TABLE I
INFERENCE RULES FOR SEMANTIC RELATIONS.

∧  ≈  ⊇  ⊆  ∼
≈  ≈  ⊇  ⊆  ∼
⊇  ⊇  ⊇  ?  ∼
⊆  ⊆  ?  ⊆  ∼
∼  ∼  ∼  ∼  ∼
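Table I can be encoded directly, giving a small interpreter for the recursive definition of Sem in Formulas (1.1) and (1.2); the example paths below are illustrative.

```python
# The ∧ operation of Table I; '?' means human intervention is needed.
AND = {
    ("≈", "≈"): "≈", ("≈", "⊇"): "⊇", ("≈", "⊆"): "⊆", ("≈", "∼"): "∼",
    ("⊇", "≈"): "⊇", ("⊇", "⊇"): "⊇", ("⊇", "⊆"): "?", ("⊇", "∼"): "∼",
    ("⊆", "≈"): "⊆", ("⊆", "⊇"): "?", ("⊆", "⊆"): "⊆", ("⊆", "∼"): "∼",
    ("∼", "≈"): "∼", ("∼", "⊇"): "∼", ("∼", "⊆"): "∼", ("∼", "∼"): "∼",
}

def sem(path):
    """Derive Sem(p) by folding ∧ over the relations along the path,
    following Formulas (1.1) and (1.2)."""
    result = path[0]                  # n = 1: the relation itself
    for r in path[1:]:                # n > 1: Sem(p[n-1]) ∧ r[n]
        result = AND[(result, r)]
        if result == "?":
            break                     # ambiguous: requires the user
    return result

print(sem(["≈", "≈"]))  # ≈  (e.g., Quantity ≈ Amount ≈ Number)
print(sem(["⊇", "⊆"]))  # ?  (human intervention needed)
```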
1.4 Contributions
This thesis is focused on the reconciliation of heterogeneities among distributed data sources
to achieve data integration and interoperability. Semantic Web technologies, centered on the use
of ontologies, are extensively used in our approaches. For each data integration setting described
in the above case studies, we discuss the fundamental issues of metadata representation, the
mapping process, and query processing. In particular,
we make the following contributions in this thesis:
• In Chapter 2, we propose an ontology-based approach to the integration of heterogeneous
XML sources. The global ontology takes into account both the XML nesting structure
and the domain structure, which are expressed in RDFS, so as to enable semantic inter-
operation between the XML sources. This integration process is lossless with respect to
the nesting structure of the XML sources, so that XML structural queries can be cor-
rectly rewritten. We refine the concepts of certain answers and of query containment,
in two cases of query processing: global-to-local query rewriting and local-to-local query
rewriting. A query rewriting algorithm that guarantees equivalence is provided for each
case of query rewriting.
• To achieve interoperability across heterogeneous data sources with schemas, we propose
an ontology-based framework, PEPSINT, built on a hybrid P2P architecture, as presented
in Chapter 3. The global RDF ontology is constructed using the GaV approach in the
super peer. It behaves not only as a central control point over the peers but also as a
mediator for query translation from peer to peer. We provide a set of query rewriting
algorithms that can be used to propagate a user’s query across the heterogeneous XML
or RDF data sources in PEPSINT. The integration of the answer structures is considered
in query processing.
• In our framework for pure P2P data integration, we propose a mapping language, namely
the P2P Mapping Language (PML), to express the semantic mappings among local on-
tologies that represent the local schemas. A meta-ontology called RDF Mapping Schema
(RDFMS) is used as a physical representation of the mappings. We also discuss the
process of P2P query answering across the individual peers, by considering the individual
architecture of each peer. We also define the semantics of PML based on first-order logic
(FOL), which enables the use of the mappings for query processing in the system. We
propose a P2P query rewriting algorithm to process conjunctive RQL (c-RQL) queries
across the P2P network. We discuss these issues in detail in Chapter 4.
• Within the Semantic Desktop vision, we propose a layered framework for personal infor-
mation management (PIM) in desktops, in which multiple ontologies playing a variety
of roles are employed. As elaborated in Chapter 5, this layered architecture enables the
organization, navigation, and manipulation of desktop data in a semantically rich way,
and provides certain advantages (e.g., flexibility and extensibility) over the use of a single
domain model. We particularly present the idea of 3D navigation, which is a combination
of the vertical, horizontal and temporal navigation in the personal information space. We
introduce the idea of personal information application (PIA). We also discuss the devel-
opment of PIAs in an MVC-based designer, namely the PIA designer. Two different ways
of inter-desktop information sharing and data integration are presented, i.e., by means of
PIA-based desktop services and by means of PIA-to-PIA (A2A) query processing.
• In the scenario of heterogeneous geospatial data integration, we propose an ontology align-
ment algorithm based on a set of deduction rules, which can be performed automatically
when certain pre-conditions are established. We also propose a sound query rewriting
algorithm based on the bidirectionality and composition of the mappings. The algorithm
can compute a contained rewriting of a query in both cases. Query containment ensures
that all the answers retrieved by executing the rewriting are a subset of the answer to the
original query, thus guaranteeing precise query answering across distributed data sources.
We present our work on geospatial data integration in Chapter 6.
In the following chapters, we describe each of the above-mentioned five approaches in the
following structure. Each chapter starts with an introduction of the problem to be addressed.
After reviewing the previous work related to that problem, we present the architecture of our
approach. Then, we focus on metadata representation, ontology mapping, and
ontology-based query processing. Finally, each chapter ends with a summary of our approach
and of future work that addresses open research challenges. We conclude the entire thesis in
Chapter 7.
CHAPTER 2
CENTRAL DATA INTEGRATION
2.1 Introduction
2.1.1 Problem Description
Data integration is the problem of combining data residing at different sources, and provid-
ing the user with a unified view of these data (70). It is relevant to a number of applications
including data warehousing, enterprise information integration, geographic information sys-
tems, and e-commerce applications. Data integration systems are usually characterized by
an architecture based on a global schema, which provides a reconciled and integrated view of
the underlying sources. These systems are called central data integration systems, and a large
number of such systems have been proposed (3; 7; 25; 29; 69; 81; 93; 105; 109).
There are two key issues in central data integration, namely system modeling and query
processing. For modeling the relation between the sources and the global schema, two basic
approaches have been proposed (24; 70; 109). The first approach, called Global-as-View (GaV),
expresses the global schema in terms of the data sources. The second approach, called Local-
as-View (LaV), requires the global schema to be specified independently from the sources, and
the relationships between the global schema and the sources are established by defining every
source as a view over the global schema.
Query processing in central data integration may require a query reformulation step: the
query over the global schema has to be reformulated in terms of a set of queries over the
sources. In the GaV approach, every entity in the global schema is associated with a view
over the local source schemas, therefore query processing in this case uses a simple “unfolding”
strategy (70). In contrast, query processing in LaV can be complex, since the local sources may
contain incomplete information. In this sense, query processing in LaV, called view-based query
processing (1; 27; 58), is similar to query answering with incomplete information (110). It can
also be the case that two data sources communicate in a peer-to-peer (P2P) way either through
the global schema or directly. Data exchange or query processing may occur in this case,
which requires data translation or query rewriting when heterogeneities are present between
the communicating sources (38; 68; 81; 93; 97).
The heterogeneities between distributed data sources can be classified as syntactic, schematic,
and semantic heterogeneities (16). Syntactic heterogeneity is caused by the use of different mod-
els or languages (e.g., relational and XML). Schematic heterogeneity results from the different
data organizations (e.g., aggregation or generalization hierarchies). Semantic heterogeneity is
caused by different meanings or interpretations of data. All these heterogeneities have to be
resolved, to achieve the goal of integration or interoperation. In this thesis, we consider the
semantic integration of XML data and data exchange between heterogeneous XML sources,
using ontologies.
XML documents that represent data with similar semantics may conform to different schemas.
Therefore, a user must construct queries in accordance with the different XML documents’
structures, even to retrieve fragments of information that have the same meaning. This fact makes
the formulation of queries over heterogeneous XML sources a nontrivial burden to the user.
Furthermore, this shortcoming of XML impedes the interoperation between XML sources since
the reformulation of XML queries from one source to another has to eliminate the structural
differences of the queries while presenting the same semantics. Let us illustrate this problem
using a running example.
Example 2.1 Figure 8 shows two XML schemas (S1 and S2) with their instances (i.e., XML
documents D1 and D2), which are represented as trees. It is obvious that S1 and S2 both
represent a many-to-many relationship between two concepts: book and author (equivalently
denoted article and writer in S2). However, structurally speaking, they are different: S1,
which is a book-centric schema, has the author element nested under the book element, whereas
S2, which is an author-centric schema, has the article element nested under the writer
element. Suppose our query target is “Find all the authors of the publication b2.” The XML
path expressions that are used to define the search patterns in the two schema trees can be
respectively written as /books/book[booktitle.text()="b2"]/author/name and
/writers/writer[article/title.text()="b2"]/fullname, where the contents in the square brackets
specify the constraints for the search patterns. We notice that although the above two search
patterns refer to semantically equivalent concepts, they follow two distinct XML paths.
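The structural difference can be made concrete with a small runnable sketch. The instance data below is illustrative (it mirrors the schemas of Figure 8 rather than its exact document contents), and the queries are adapted to the limited XPath subset supported by Python's xml.etree, so the author-centric source is traversed procedurally.

```python
import xml.etree.ElementTree as ET

# Hypothetical instances of the book-centric (S1) and author-centric (S2)
# schemas, both recording that authors a2 and a3 wrote publication b2.
d1 = ET.fromstring(
    "<books>"
    "<book><booktitle>b1</booktitle><author><name>a1</name></author></book>"
    "<book><booktitle>b2</booktitle>"
    "<author><name>a2</name></author>"
    "<author><name>a3</name></author>"
    "</book>"
    "</books>")

d2 = ET.fromstring(
    "<writers>"
    "<writer><fullname>a2</fullname><article><title>b2</title></article></writer>"
    "<writer><fullname>a3</fullname><article><title>b2</title></article></writer>"
    "</writers>")

# Book-centric source: authors are nested under books.
authors_d1 = [n.text for n in d1.findall("./book[booktitle='b2']/author/name")]

# Author-centric source: articles are nested under writers, so the
# semantically equivalent query must follow a different path.
authors_d2 = [w.findtext("fullname") for w in d2.findall("./writer")
              if w.findtext("article/title") == "b2"]

print(authors_d1)  # ['a2', 'a3']
print(authors_d2)  # ['a2', 'a3'] -- same answer, distinct search patterns
```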
2.1.2 Semantic XML Data Integration
The structural diversity of conceptually equivalent XML schemas leads to the fact that XML
queries over different schemas may represent the same semantics even though they are formulated
using two different alphabets and structures.

Figure 8. Two XML sources with structural heterogeneities.

In comparison, the schema languages used
for conceptual modeling are structurally flat, so that the user can formulate a conceptual query
without worrying about the structure of the source. RDF Schema (RDFS) (77),
DAML+OIL, and OWL are examples of languages used to create ontologies, which represent a
shared, formal conceptualization of the domain of knowledge (55). There are currently many
attempts to use conceptual schemas (or ontologies) (3; 5; 38) or conceptual queries (29; 31) to
overcome the problem of structural heterogeneities among XML sources.
In this chapter, we propose an ontology-based approach for the integration of XML sources.
We use the GaV approach to model the mappings between the source schemas and the global
ontology, which is, therefore, an integrated view of the source schemas. The global ontology
is expressed in terms of RDFS, which is at the core of several ontology languages (e.g., OWL
and DAML+OIL). In order to facilitate the mappings between the XML source schemas and
the global RDFS ontology, their syntactic disparity needs to be reconciled. To this end, we
first transform the heterogeneous XML sources into local RDFS ontologies (defined using the
RDFS space (22)), which are then merged into the global ontology. This transformation process
encodes the mapping information between each concept in the local ontology and the corre-
sponding element in the XML source. The ontology merging process can be semi-automatically
performed (e.g., by using the PROMPT algorithm (88)). In addition to the global ontology,
the merging process also produces a mapping table, which contains the mapping information
between concepts in the global ontology and concepts in the local ontologies. In our approach,
we can translate a query posed against the global ontology into subqueries over the sources.
We can also translate a query posed against an XML source to an equivalent query against any
other XML source. We call the query rewriting in the first case global-to-local query rewriting
and that in the second case local-to-local query rewriting. Given that we choose a GaV ap-
proach, the global ontology is a view over the local ontologies, therefore the process of mapping
a query over the global ontology to queries over the local ontologies is straightforward.
2.1.3 Contributions
We make the following contributions in this chapter:
• We propose an ontology-based approach to the integration of heterogeneous XML sources.
The global ontology takes into account both the XML nesting structure and the domain
structure, which are expressed in RDFS, so as to enable semantic interoperation between
the XML sources. This integration process is lossless with respect to the nesting structure
of the XML sources, so that XML structural queries can be correctly rewritten.
• We extend the RDFS space by defining additional metadata, which enables the encoding
of the nesting structure of the XML Schema in the RDF schema. We convert each of the
XML source schemas into a local RDFS ontology while preserving their structure, so that
they share a uniform representation with the global ontology.
• Finally, we refine the concepts of certain answers and of query containment, in two query-
ing modes: global-to-local query rewriting and local-to-local query rewriting. Further-
more, a query rewriting algorithm that guarantees equivalence is provided for each case
of query rewriting.
The rest of this chapter is organized as follows. Section 2.2 describes related work. Section
2.3 describes the framework for the integration of XML sources. Data integration and query
processing, which are the two key points in our approach, are discussed respectively in Sections
2.4 and 2.5. We draw conclusions and discuss future work in Section 2.6.
2.2 Related Work
There are a number of approaches addressing the problem of data integration or interoper-
ation among XML sources. We classify those approaches into three categories, depending on
their main focus, namely semantic integration, query languages, and query rewriting.
2.2.1 Semantic Integration
High-level Mediator Amann et al. propose an ontology-based approach to the integration of
heterogeneous XML Web resources in the C-Web project (3; 5). The proposed approach
is very similar to our approach except for the following differences. The first difference
is that they use a local-as-view (LaV) approach (24) with a hypothetical global ontology
that may be incomplete. The second difference is that they do not retain the XML
documents’ structures in their conceptual mediator so they cannot deal with the reverse
query translation (from the XML sources to the mediator). Our previous work involved
a layered approach for the interoperation of heterogeneous web sources, but the nesting
structure associated with XML was lost in the mapping from XML data to RDF data
(38).
Direct Translation Klein proposes a procedure to transform XML data directly into RDF
data by annotating the XML documents via external RDFS specifications (66). The
procedure makes the data in XML documents available for the Semantic Web. However,
since the proposed approach does not consider the document structure of XML sources,
it cannot propagate queries from one XML source to another XML source.
Semantics Encoding The Yin/Yang Web approach proposed by Patel-Schneider and Simeon
addresses the problem of incorporating the XML and RDF paradigms (94). They develop
an integrated model for XML and RDF by integrating the semantics and inferencing rules
of RDF into XML, so that XML querying can benefit from their RDF reasoner. But the
Yin/Yang Web approach does not solve the problem of query answering across heteroge-
neous sources, that is, sources with different syntax or data models. It also cannot process
higher-level queries such as RDQL. Lakshmanan and Sadri also propose an infrastructure
for interoperating over XML data sources by semantically marking up the information
contents of data sources using application-specific common vocabularies (68). However,
the proposed approach relies on the availability of an application-specific standard ontology
that serves as the global schema. This global schema contains information necessary
for interoperation, such as key and cardinality information for predicates. This approach
has the same problem as the Yin/Yang Web approach, that is, higher-level queries cannot
be processed down to XML queries.
2.2.2 Query Languages
CXQuery is a new XML query language proposed by Chen and Revesz, which borrows
features from both SQL and other XML query languages (31). It overcomes the limitations of
the XQuery language by allowing the user to define views, explicitly specify the schema of the
query answers, and query through multiple XML documents. However, CXQuery does not solve
the issue of structural heterogeneities among XML sources. The user has to be familiar with
the document structure of each XML source to formulate queries. Heuser et al. also present a
new language (CXPath) based on XPath for querying XML sources at the conceptual level (29).
CXPath is used to write queries over a conceptual schema that abstracts the semantic content
of several XML sources. However, they do not consider the situation of query translation from
the XML sources to the global conceptual schema.
2.2.3 Query Rewriting
Query rewriting is often a key issue for both mediator-based integration systems and peer-
to-peer systems. The Clio approach, which provides an example for the former case, mainly
addresses schema mapping and data transformation between nested schemas and/or relational
databases (97). It focuses on how to take advantage of schema semantics to generate the
consistent translations from source to target by considering the constraints and structure of the
Figure 9. The ontology-based framework for the integration of XML sources.
target schema. It uses queries to express the mappings from the data to the target schema. The
Piazza system is a peer-to-peer system that aims to solve the problem of data interoperation
between XML and RDF (59). The system achieves its interoperation in a low-level (syntactic)
way, i.e., through the interoperation of XML and the XML serialization of RDF, whereas we
aim to achieve the same objective at the semantic level. For example, our approach supports
a conceptual view of XML sources (to facilitate the formulation of queries) and allows for
conceptual queries (e.g., RDF queries).
2.3 Framework
In this section, we present the framework for the integration of XML data sources and
in particular we describe the integration of XML source schemas and query processing in the
integrated system.
As shown in Figure 9, we generate for each local XML source a local RDFS ontology, which
represents the source schema. These local RDFS ontologies are then merged into the global
RDFS ontology, which provides an overview of all the local ontologies and a mediation between
each pair of XML sources. In this merging process, a mapping table is also produced to contain
all the mappings, which are correspondences between the global ontology and local ontologies.
The ontology-based XML data integration framework I can be formalized as a quadruple
〈G,S, µ,M〉, where
• G is the global ontology expressed in RDFS over the alphabet AG. The alphabet comprises
the name of the classes and properties of G.
• S is the XML source schema expressed in a language LS over the alphabet AS , which
comprises the XML element names in S.
• µ is a schema transformation function, which generates a local RDFS ontology R for S,
such that R encodes the nesting structure specified by S.
• M is the mapping table consisting of a set of mappings between the global ontology G and
a set of n XML sources Si, where i ∈ [1..n]. Each entry in M is of the form (g, s1, ..., sn),
where g ∈ AG and si ∈ ASi ∪ {ε} for i ∈ [1..n]. Note that ε is used when a source schema
has no element corresponding to an element of G.
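A minimal sketch of the mapping table M and the role of ε; the entries below are illustrative (they loosely follow the concepts of the earlier case studies).

```python
EPS = None  # stands for ε: the source has no corresponding element

# Mapping table M: each entry maps a global element g to one element
# (or ε) per source.  Entries are hypothetical.
M = [
    ("Publication", "book",   "article"),
    ("Person",      "author", "writer"),
    ("price",       EPS,      "price"),
]

def sources_for(g):
    """Return the indices of the sources that can answer a query on g."""
    for entry in M:
        if entry[0] == g:
            return [i for i, s in enumerate(entry[1:], start=1) if s is not EPS]
    return []

print(sources_for("Person"))  # [1, 2]
print(sources_for("price"))   # [2] -- source 1 maps to ε for this element
```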
The first task of the framework is the integration of the distributed and heterogeneous XML
sources. Here, we are mainly concerned with the issue of schematic heterogeneity, that is, with
the different schema structures among the sources. The process of data integration contains
two steps: schema transformation and ontology merging.
In the first step, we use a local RDFS ontology to represent each XML source schema so as to
achieve a uniform representation for the next step. In other words, the schema transformation
function µ takes as input the source schema S, and the output is the local ontology R. The key
operation in this schema transformation is the preservation of the nesting structure of S. To this
end, we have to extend the RDFS space since it does not have a property to encode the nesting
structure between elements. In particular, we add a new RDF property, contained, in the
namespace of “http://www.example.org/rdf-extension” (abbreviated as rdfx). The RDF/XML
syntax for this property is described below.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdfx="http://www.example.org/rdf-extension#">
  <rdf:Property rdf:about="http://www.example.org/rdf-extension#contained">
    <rdfs:isDefinedBy rdf:resource="http://www.example.org/rdf-extension#"/>
    <rdfs:label>contained</rdfs:label>
    <rdfs:comment>The containment between two classes.</rdfs:comment>
    <rdfs:domain rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Property>
</rdf:RDF>
The second step is the merging (or integration) of all local ontologies, which generates the
global ontology as well as the mapping table. The merging is performed based on the semantics
of classes and properties from each of the local ontologies. In particular, the classes or properties
that have the same or similar (equivalent) semantics are merged into a single class or property of
the global ontology. Each of these correspondences is then recorded as an entry in the mapping
table. Different kinds of mappings can be established between two schemas or ontologies (116).
For this chapter, however, we consider only the equivalence type of mapping. We also do
not consider the different degrees to which two concepts may be equivalent. For instance, we
simply take book and article as equivalent concepts, although we could further refine such
equivalence. Additional domain-related knowledge (e.g., inheritance) may be considered. We
discuss these issues in more detail in Section 2.4.
It is worth mentioning that the global ontology in our system has two roles: (1) It provides
the user with access to the data with a uniform query interface to facilitate the formulation of
a query on all the XML sources; (2) It serves as the mediation mechanism for accessing the
distributed data through any of the XML sources.
Our framework handles user queries using a query rewriting strategy. More specifically,
query processing in our framework may occur in the following two directions, as shown in
Figure 9:
Global-to-local query rewriting. When the user poses a query q on the global ontology,
the system rewrites q into the union q′ of subqueries, one for each XML source. The
subqueries are then executed over the XML sources to get the answers, which are then
integrated (by using union) to produce the answer to q.
Local-to-local query rewriting. Given a query q posed on a local source, its answers then
include not only those retrieved from the local source, but also those from all the other
sources in the system. For the purpose of getting answers from the other sources, it
requires that q be rewritten (through the global ontology) into a union q′ of queries, one
on each of the other sources. Query rewriting in this direction is performed similarly to
that in peer-to-peer systems (101).
Query rewriting in both directions is based on the mapping information contained in the
mapping table. Each entry contains an element (RDF class or property) of the global ontology
and its corresponding elements in the local source schemas. Given that query rewriting is from a
query over one alphabet to that over another alphabet, the mapping table provides a convenient
way of finding the mapping between alphabets, in both rewriting directions. In addition, the
query languages used to formulate the queries have to be taken into consideration, since they
may have different expressiveness. We consider a subset of XQuery (19), called conjunctive
XQuery (c-XQuery), for queries over the XML sources and a subset of RDQL (62), namely
conjunctive RDQL (c-RDQL), for queries over the global RDFS ontology. We discuss in detail
query processing and related issues in Section 2.5.
2.4 Integrating Structure and Semantics
2.4.1 Local XML Schemas and Local RDFS Ontologies
To integrate heterogeneous XML data sources, we first transform the local XML schema into
a local RDFS ontology while preserving the XML document structure. By document structure,
we mean the structural relationship of objects specified in data-centric documents (21) by a
schema language (such as DTD, XML Schema, or RelaxNG1). In this chapter, we only focus on
the nesting structure (i.e., hierarchy). Other structural properties include order. A consequence
of not including order in our framework is that we cannot consider a query that involves the
order of the subelements of an element. However, this kind of query is of little interest in a
framework where we are mostly concerned with the semantics of the data.
Elements and attributes are the two basic building blocks of XML documents. Elements
can be defined as simple types, which cannot have element content and cannot carry attributes,
or complex types, which allow elements in their content and/or contain attributes. On the other
hand, all attribute declarations must reference simple types since attributes cannot contain other
elements or other attributes. From the perspective of XML Schema, these nesting relationships
are defined in terms of datatypes (simple or complex). An XML schema can be formalized as
an edge-labeled tree, namely an XML schema tree, as depicted in Figure 8. We overlook the
distinction between XML elements and attributes by considering both of them as vertices in
the XML schema tree.
1http://relaxng.sourceforge.net
Definition 2.1 An XML schema S over alphabet AS is an edge-labeled tree S = (V, E, λ),
where V is a set of vertices, E = {(vi, vj)|vi, vj ∈ V } is a set of edges, and λ is a labeling
function λ : E 7→ AS .
Before we discuss schema transformation, let us look at the formalization of ontologies.
Both the global ontology and local ontologies are actually RDF schemas defined in the RDFS
space, which is extended with the RDF property “rdfx:contained”. An RDF schema can be
formalized as a labeled graph, called RDF schema graph, as defined in Definition 2.2. We do
not elaborate on the data types of RDF properties and assume that they are all of type literal.
Also, we do not take into account the notion of namespace in the definition of both XML and
RDF schemas.
Definition 2.2 An RDF schema graph R over alphabet AR is a directed labeled graph R =
(V,E, λ), where V is a set of labeled vertices consisting of classes C, properties P , and data types
L, E = {(vi, vj)|vi, vj ∈ V } is a set of labeled edges, and λ is a labeling function λ : V ∪E 7→ AR,
such that
• ∀ v ∈ P , we have domain(v) ∈ C, range(v) ∈ C∪L, and λ((v, domain(v))) = “rdfs:domain”
and λ((v, range(v)))=“rdfs:range”;
• ∀ e = (vi, vj) ∈ E, we have λ(e)=“rdfs:subClassOf” (or “rdfx:contained”) if vi and vj ∈ C,
or λ(e) = “rdfs:subPropertyOf” if vi and vj ∈ P .
Now we are able to define the schema transformation function µ. Formally speaking, the
schema transformation function µ is a function µ : S 7→ R, where S = (VS , ES , λS), R =
(VR, ER, λR), and VR = C∪P , such that ∀ eij = (vi, vj) ∈ ES , we have µ(vj) ∈ VR, λR(µ(vj)) =
λS(eij), and furthermore:
(1) if ∃(vj , vk) ∈ ES , then µ(vj) ∈ C, (µ(vj), µ(vi)) ∈ ER, and λR(µ(vj), µ(vi)) = “rdfx:contained”;
(2) if @(vj , vk) ∈ ES , then µ(vj) ∈ P , (µ(vj), µ(vi)) ∈ ER, and λR(µ(vj), µ(vi)) = “rdfs:domain”.
The transformations thus defined fall into two categories:
Element-level transformation The element-level transformation converts from XML complex-
type elements to RDF classes and from XML simple-type elements to properties. For
example, for S1 in Example 2.1, we define the RDF classes Books, Book, and Author,
while taking booktitle and name as RDF properties of Book and Author, respectively,
as depicted in the resulting local RDFS ontology of Figure 10.
Structure-level transformation The structure-level transformation encodes the nesting struc-
ture of the XML schema into the local RDFS ontology. In particular, the nesting may
occur between two complex-type elements or between a complex-type element and its
child (simple) element. Following the element-level transformation, the nesting struc-
ture in the former case corresponds to a class-to-class relationship between two RDFS
classes, which are connected by the property rdfx:contained. The first item that defines
µ formalizes this case. In the latter case, the XML nesting structure corresponds to the
class-to-literal relationship in the local ontology, with the class and the literal connected
by the corresponding property. The second item that defines µ formalizes this case.
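The two transformation categories can be illustrated with a small sketch. This is a simplified, hypothetical encoding (not the thesis's implementation): the XML schema tree is given as a dict from each element label to its child labels; complex-type elements become classes linked by rdfx:contained (read here from container to containee, as in Figure 10), and simple-type elements become properties linked to their domain class.

```python
# A minimal sketch of the schema transformation function mu, assuming the
# XML schema tree is flattened into a dict: element label -> child labels.
def transform(schema_tree, root):
    classes, properties, edges = set(), set(), []

    def visit(node, parent):
        children = schema_tree.get(node, [])
        if children:                  # complex-type element -> RDFS class
            classes.add(node)
            if parent is not None:    # nesting: container contains containee
                edges.append((parent, node, "rdfx:contained"))
            for child in children:
                visit(child, node)
        else:                         # simple-type element -> RDF property
            properties.add(node)
            edges.append((node, parent, "rdfs:domain"))

    visit(root, None)
    return classes, properties, edges

# Source schema S1 of the running example: /books/book/{booktitle, author/name}
s1 = {"books": ["book"], "book": ["booktitle", "author"], "author": ["name"]}
classes, props, edges = transform(s1, "books")
```

The first branch corresponds to the structure-level transformation (item (1) defining µ), the second to the element-level transformation of leaf elements into properties (item (2)).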
By applying the schema transformation function to the two XML schemas in Figure 8, we
can get the resulting local ontologies as shown in Figure 10. We see that rdfx:contained
enables the representation of the nesting relationship. Specifically, by following the edges
of rdfx:contained from Books to Author in R1, we actually get the corresponding path
/books/book/author in S1. In terms of the alphabets, the schema transformation function
specifies a mapping between the alphabet of the source schema and that of the local ontology.
Table II lists the mapping between the XML schema S1 and the local RDFS ontology R1. For
simplicity, we use XPath to specify the XML elements. Also, the properties in the mapping
table are in the form of an RDF expression c.p, where c is the class associated with p.
TABLE II
MAPPINGS BETWEEN XML SOURCE SCHEMA S1 AND THE LOCAL ONTOLOGY R1
XPath expressions in S1          RDF expressions in R1
/books                           Books
/books/book                      Book
/books/book/booktitle            Book.booktitle
/books/book/author               Author
/books/book/author/name          Author.name
Figure 10. Local ontologies R1 and R2 transformed from XML source schemas S1 and S2.
2.4.2 The Global RDFS Ontology
Now that the source schemas are represented by local RDFS ontologies, we are able to merge
them to construct the global RDFS ontology. In other words, the process of ontology merging
takes as input the multiple local ontologies and returns a merged ontology as the output (108).
Ontology merging and ontology alignment, which require the mapping of ontologies, are
widely pursued research topics. Readers are referred to a thorough survey of the state of the
art in ontology mapping (64). In this chapter we do not intend to introduce a new technique for
ontology merging. Instead, we utilize existing techniques to generate the integrated ontology
from the local ontologies. In particular, we use an approach (such as PROMPT (88)) that
provides the following functionalities:
• Merging of classes: Multiple conceptually equivalent classes of the local ontologies are
combined into one class in the global ontology.
• Merging of properties: Multiple conceptually equivalent properties of the equivalent classes
in the local ontologies are combined as one property of the combined class in the global
ontology.
• Merging relationships between classes: Given two conceptually equivalent relationships,
e.g., p1 from a class c1 to another class c′1 and p2 from c2 to c′2, we combine p1 and p2
into one relationship p between the combined class c (of c1 and c2) and c′ (of c′1 and c′2).
• Copying a class or a property: If there does not exist a conceptually equivalent class or
property for a class c (or a property p of c), we simply copy c (or p, as a property of the
target class of c) into the global ontology.
• Generalizing semantically related classes into a superclass: The superclass can be obtained
by searching an existing knowledge domain (e.g., the DAML Ontology Library) or reason-
ing over a thesaurus such as WordNet.1 For example, we can find in the semantic network
of terms (consisting of terms and their semantic relations) that two classes (Author and
Writer) have the same hypernym (Person), which is then taken as a superclass of both
classes.
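A minimal sketch of the first four merging functionalities, assuming the equivalences between local concepts are already known (in practice a tool such as PROMPT would propose them), might look like the following; all names are drawn from the running example, and the superclass-generalization step is omitted for brevity:

```python
# A simplified sketch of ontology merging: equivalent local concepts are
# merged into one global concept, and unmatched concepts are copied, with
# "-" marking the side that has no corresponding concept.
def merge(r1_concepts, r2_concepts, equivalences):
    # equivalences: dict from (c1, c2) pairs to the chosen global name
    mapping_table = []
    matched1 = {c1 for (c1, _) in equivalences}
    matched2 = {c2 for (_, c2) in equivalences}
    for (c1, c2), g in equivalences.items():   # merge equivalent concepts
        mapping_table.append((g, c1, c2))
    for c in sorted(r1_concepts - matched1):   # copy R1-only concepts
        mapping_table.append((c, c, "-"))
    for c in sorted(r2_concepts - matched2):   # copy R2-only concepts
        mapping_table.append((c, "-", c))
    return mapping_table

r1 = {"Books", "Book", "Book.booktitle", "Author", "Author.name"}
r2 = {"Writers", "Article", "Article.title", "Writer", "Writer.fullname"}
eq = {("Book", "Article"): "Book",
      ("Book.booktitle", "Article.title"): "Book.title",
      ("Author", "Writer"): "Author",
      ("Author.name", "Writer.fullname"): "Author.name"}
table = merge(r1, r2, eq)
```

Each merged pair yields a three-column entry of the mapping table, which is exactly the structure of Table III below.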
Figure 11 shows the global ontology that results from merging the two local RDF ontologies
of Figure 10. The greyed classes and properties are merged classes and properties from the
original ontologies. For instance, Book in R1 and Article in R2 are merged into Book, whereas
1http://wordnet.princeton.edu
booktitle in R1 and title in R2 are merged into title. The classes Book and Author are
also respectively extended with the superclasses Publication and Person.
Besides the global ontology, the process of ontology merging also yields as an output the
mapping table that contains the mappings between the local RDFS ontologies and the global
RDFS ontology. In general, if a class, property, or relationship between classes p in the global
ontology is the result of merging pi and pj from different local ontologies, then a tuple of the
form (p, pi, pj) is generated. If a class or property p in the global ontology is only copied from pi
in a local ontology, then a tuple (p, pi) is produced. For instance, for the class Book.title (in
the global ontology), which is merged from Book.booktitle in R1 and Article.title in R2,
we generate a tuple in the mapping table: (Book.title, Book.booktitle, Article.title).
Table III lists all the mappings in our example.
Now that we have the one-to-one mappings M1 between the XML source schemas and their
local ontologies and the one-to-one mappings M2 between the local ontologies and the global
Figure 11. The global ontology G that results from merging R1 and R2.
TABLE III
MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND LOCAL ONTOLOGIES

RDF expressions in       RDF expressions in R1    RDF expressions in R2
the global ontology
Books                    Books                    -
Book                     Book                     Article
Book.title               Book.booktitle           Article.title
Authors                  -                        Writers
Author                   Author                   Writer
Author.name              Author.name              Writer.fullname
ontology, we can compose M1 and M2 to get the mappings M between the source schemas
and the global ontology. Table IV shows the results.
TABLE IV
MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND XML SOURCE SCHEMAS

RDF expressions in       XPath expressions in S1      XPath expressions in S2
the global ontology
Books                    /books                       -
Book                     /books/book                  /writers/writer/article
Book.title               /books/book/booktitle        /writers/writer/article/title
Authors                  -                            /writers
Author                   /books/book/author           /writers/writer
Author.name              /books/book/author/name      /writers/writer/fullname
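The composition of M1 with M2 amounts to chaining two lookups, as in the following sketch (illustrative names only, restricted to a few entries of the running example):

```python
# A sketch of composing the schema-to-local-ontology mappings M1 with the
# local-to-global ontology mappings M2, yielding the schema-to-global
# mappings M of Table IV.
m1 = {  # XML schema S1 -> local ontology R1
    "/books": "Books",
    "/books/book": "Book",
    "/books/book/booktitle": "Book.booktitle",
}
m2 = {  # local ontology R1 -> global ontology G
    "Books": "Books",
    "Book": "Book",
    "Book.booktitle": "Book.title",
}

def compose(m1, m2):
    """Compose two one-to-one mappings: XPath -> local concept -> global concept."""
    return {xpath: m2[concept] for xpath, concept in m1.items() if concept in m2}

m = compose(m1, m2)  # XML schema S1 -> global ontology G
```

Since both M1 and M2 are one-to-one, the composed mapping M is also one-to-one wherever both correspondences exist.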
2.4.3 Data Integration Semantics
In this subsection, we discuss the semantics of the data integration in our proposed frame-
work including the semantics of the XML (local) databases, the mapping table, and the RDFS
(global) database. The discussion of the syntax and semantics of queries is postponed until
Section 2.5. In what follows, we refer to a fixed, finite set Γ of constants, which is shared by all
data sources. We also refer to a finite set U of URIs.
There are two types of databases in the framework, i.e., the local XML databases and the
global RDF database. An XML database is an XML instance tree, and an RDF database is an
RDF instance graph.
Definition 2.3 (XML instance tree) Given an XML schema S = (VS , ES , λS), an instance
of S is an XML instance tree G = (VG , EG , τ, λG), where VG is a set of vertices, EG is a set of
edges, and
(1) τ is a typing function τ : VG 7→ VS , such that (a) ∀v ∈ VG, τ(v) ∈ VS , and (b) ∀(vi, vj) ∈
EG, (τ(vi), τ(vj)) ∈ ES .
(2) λG is a labeling function, such that (a) ∀v ∈ VG, λG(v) ∈ Γ ∪ {ε}, and (b) ∀(vi, vj) ∈ EG,
λG((vi, vj)) = λS((τ(vi), τ(vj))).
Definition 2.4 (RDF instance graph) Given an RDF schema S = (VS , ES , λS), where
VS = C ∪ P , an instance of S is an RDF instance graph G = (VG , EG , τ, λG), where VG is
a set of vertices, EG is a set of edges, λG is a labeling function λG : VG ∪ EG 7→ AS ∪ U ∪ Γ,
and τ is a typing function τ : VG ∪ EG 7→ VS ∪ {“rdf:Property”} ∪ {“rdfs:literal”}, such that
∀e = (vi, vj) ∈ EG, we have
(1) if τ(e)=“rdf:Property”, then λG(e)=“rdfx:contained” or “rdfs:subClassOf”, λG(vi) and
λG(vj) ∈ U , τ(vi) and τ(vj) ∈ C, and (τ(vi), τ(vj)) ∈ ES ;
(2) if τ(e) ∈ P , then λG(e) = λS(τ(e)), λG(vi) ∈ U , τ(vi) ∈ C, λS((τ(e), τ(vi))) = “rdfs:domain”,
λS((τ(e), τ(vj)))=“rdfs:range”, and
– λG(vj) ∈ U , when τ(vj) ∈ C;
– λG(vj) ∈ Γ, when τ(vj)=“rdfs:literal”;
The semantics of the mappings depends on the assumptions adopted. In the view-based
approach, there are three assumptions for the inter-schema mappings, namely soundness, com-
pleteness, and exactness (70). In particular, given a database D, a set of view definitions V
over D, and view extensions E of V, we say the views V are sound if VD ⊇ E , complete if
VD ⊆ E , and exact if VD = E . It is common to use the soundness assumption for view-based
data integration (70). Given that our framework adopts a GaV approach, it is natural to as-
sume an exact semantics, that is, the sources are complete with respect to the global database.
However, these assumptions are defined differently in our framework, where mappings are
represented by element correspondences in the mapping table.
Given an entry ti = (gi, si,1, ..., si,n) in the mapping table M(G,S1, ...,Sn), where gi ∈ G
and si,j ∈ Sj (1 ≤ j ≤ n), the semantics of the mappings can be captured by the concept
of valuation. Given the global database B of G and local databases Dj of Sj (1 ≤ j ≤ n), a
valuation of ti is a function σ, which maps ti to a tuple (vi, vi,1, ..., vi,n), where vi ∈ B, and
vi,j ∈ Dj (1 ≤ j ≤ n), such that τB(vi) = gi and τDj (vi,j) = si,j for j ∈ [1..n]. Under the exact
assumption, the semantics of the mapping tableM = {t1, ..., tm} is captured by a conjunction of
all the equalities (between the valuation of each global element and the union of the valuations
of its mapped local elements), that is:
∧1≤i≤m [σ(gi) = σ(si,1) ∪ ... ∪ σ(si,n)], such that for 1 ≤ k, l ≤ m,
(1) (gk, gl) ∈ EG ⇔ (σ(gk), σ(gl)) ∈ EB, and
(2) (sk,j, sl,j) ∈ ESj ⇔ (σ(sk,j), σ(sl,j)) ∈ EDj, for each j ∈ [1..n].
The definition of the semantics of sound (or complete) mappings is the same as the above
definition, except for the substitution of = by ⊇ (or ⊆). For simplicity, we abbreviate the
preceding assertion to σ(G) = σ(S1) ∪ ... ∪ σ(Sn). The global database B is then any database
such that σ(G) = σ(S1) ∪ ... ∪ σ(Sn) holds for the local databases D1, ...,Dn. Figure 12 shows
the global database (instances) for the data sources of Example 2.1.
2.5 Query Processing
2.5.1 Query Languages
RDQL (RDF Data Query Language) uses an SQL-like syntax. More specifically, the Select
clause identifies the variables to be returned to the application. The From clause specifies the
RDF model using an URI. The Where clause specifies the graph pattern as a list of triple
patterns. The And clause specifies the Boolean expressions. Finally, the Using clause provides
a way to shorten the length of the URIs. By overlooking the notion of namespace (i.e., URI)
Figure 12. The global database of G.
and the And clause, we get a conjunctive RDQL (c-RDQL) expression, which can be expressed
in a conjunctive formula:
ans( ~X) :- p1( ~X1), ..., pn( ~Xn).
where ~Xi = (xi, x′i) and pi is an RDF property of xi having the value x′i.
XQuery is a typed functional language that has an FLWR (i.e., For, Let, Where, Return)
syntax. For simplification, we assume that the XML query posed by the user is formulated
only in the form of FLWR expressions (19). In other words, we do not consider nesting FLWR
expressions, although they are allowed in XQuery. In particular, a conjunctive XQuery (c-
XQuery) is of the form:
ans( ~X) :- p1( ~X1), ..., pn( ~Xn).
where ~Xi = (xi, x′i) and pi is an XPath /e1/.../ek connecting xi to x′i. That is, each predicate
represents an expression xi/e1/.../ek/x′i, where each ej (1 ≤ j ≤ k) is an edge label along the
path from xi to x′i.
In both query definitions, ans( ~X) is the head of the query, denoted headq, and the remaining
part is the body of the query, denoted bodyq. We say that the query is safe if ~X ⊆ ~X1∪ ...∪ ~Xn.
The answer qD to a query q over a database D is the result of evaluating q over D. The
query evaluation is based on the concept of valuation and depends on the data model and the
query language used. Informally, a valuation ρ over the variables var(q) of a query q is a total
function from var(q) to constants (or URIs for RDF queries) in the domain Γ of the database,
where q is evaluated (2), as follows:
• In the XML model: given a c-XQuery q of the form ans( ~X) :- p1( ~X1), ..., pn( ~Xn) over an
XML instance graph D, we have
qD = {ρ( ~X)|ρ is a valuation over var(q) and pi = (ρ(xi), ρ(x′i)) is a fact in D, for each
~Xi = (xi, x′i), where i ∈ [1..n]}.
• In the RDF model: given a c-RDQL query q of the form ans( ~X) :- p1( ~X1), ..., pn( ~Xn) over
an RDF instance graph D, we have
qD = {ρ( ~X)|ρ is a valuation over var(q) and pi is a path connecting ρ(xi) and ρ(x′i) in
D, for each ~Xi = (xi, x′i), where i ∈ [1..n]}.
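The valuation-based evaluation above can be sketched by brute-force enumeration of all candidate valuations. This is an illustration only (exponential in the number of variables, and the facts below are hypothetical), not an efficient query processor:

```python
# A small sketch of conjunctive query evaluation by enumerating valuations.
# The database is a set of facts p(a, b) stored as triples (p, a, b); the
# query body is a list of atoms (p, x, x') over variables.
from itertools import product

def evaluate(head_vars, body, facts):
    # Collect the query's variables and the database's constants
    variables = sorted({v for _, x, y in body for v in (x, y)})
    constants = sorted({c for _, a, b in facts for c in (a, b)})
    answers = set()
    for values in product(constants, repeat=len(variables)):
        rho = dict(zip(variables, values))  # a candidate valuation
        if all((p, rho[x], rho[y]) in facts for p, x, y in body):
            answers.add(tuple(rho[v] for v in head_vars))
    return answers

# Hypothetical facts for a one-book instance of S1
facts = {("author", "book1", "auth1"),
         ("name", "auth1", "a1"),
         ("booktitle", "book1", "b1")}
# Body of q2 below: name(u, x), booktitle(v, y), author(v, u)
body = [("name", "u", "x"), ("booktitle", "v", "y"), ("author", "v", "u")]
answers = evaluate(("x", "y"), body, facts)
```

A valuation contributes an answer tuple only when every atom of the body maps to a fact of the database, mirroring the definitions for both the XML and RDF models.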
Example 2.2 Consider two queries q1 and q2. In particular, q1 is expressed over the global
ontology G in c-RDQL, to retrieve all the (Author, Book) pairs. The c-XQuery query q2 is
issued on local XML source S1, to retrieve all (Author, Book) pairs.
q1: ans(x, y) :- name(u, x), title(v, y), contained(u, v).
q2: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).
By evaluating q1 over the global database B (shown in Figure 12) and q2 over D1 (shown in
Figure 8), we obtain the following answer sets to both queries.
qB1 = {(a1, b1), (a2, b2), (a3, b2), (w1, t1), (w2, t2), (w3, t2)},
qD12 = {(a1, b1), (a2, b2), (a3, b2)}.
We finally assume that all the concepts in the local ontologies are mapped to the concepts
in the global ontology during the ontology integration process. That is, the mappings are
total, one-to-one mappings from the local RDF ontologies to the global ontology. However, it
is possible that some concept c or property p in the global ontology gets mapped to a local
ontology but not to another local ontology. This may lead to null values when a query involves
c or p. However, we do not consider this case in our discussion.
2.5.2 Certain Answers and Query Containment
The concept of certain answers has been introduced in view-based query processing to
represent the results of answering a global query (the query over the global schema) using view
extensions (1). In our framework, where the mappings are correspondences between elements
of the global ontology and elements of the source schemas, the concept of certain answers is
Figure 13. The retrieved database on S1 w.r.t. S2 and that on S2 w.r.t. S1.
redefined. We call the query posed on the global ontology a global query, and the query posed
over a local data source a local query. As previously discussed, these two queries are processed
in two different directions, i.e., the global-to-local direction and the local-to-local direction. The
certain answers to a global query are called global certain answers, while those to a local query
are called local certain answers.
Before we discuss the formalism for these two types of certain answers, we revisit the concept
of global database, from which we retrieve the global certain answers, and we introduce the
concept of retrieved database, where the local certain answers are computed.
Consider the local data sources D1, ...,Dn and the mapping table M(G,S1, ...,Sn) between
the global ontology G and the local source schemas S1, ...,Sn. The global database B is such
that σ(G) = ∪(1≤i≤n) σ(Si) holds on D1, ...,Dn. Likewise, the retrieved database Bk on a local
source Sk w.r.t. all the other local sources is the one satisfying σ(Sk) = ∪(1≤i≤n, i≠k) σ(Si),
whereas
the retrieved database Bk,l on Sk w.r.t. a particular local source Sl is the one satisfying σ(Sk) =
σ(Sl) (refer to Section 2.4 for the semantics of σ). Figure 13 shows an example of the retrieved
database on S1 w.r.t. S2 (on the left side) and the one on S2 w.r.t. S1 (on the right side), for
S1 and S2 as presented in Figure 8.
Based on the concept of global database and that of retrieved database, we formally define
both types of certain answers next.
Definition 2.5 (Certain answers) Let G be the global ontology of n XML source schemas
S1, ...,Sn respectively with databases D1, ...,Dn, M be the mapping table, q be a global query
posed over G, and qk be a local query on Sk. The global certain answers to q with respect
to D1, ...,Dn based on M are the results of evaluating q over the global database B, denoted
certM(q) = qB. The local certain answers to qk with respect to D1, ...,Dk−1,Dk+1, ...,Dn
based on M are computed by evaluating qk over the retrieved database Bk on Sk, denoted
certM,k(qk) = qBk .
While the global certain answers constitute the answer to a global query, the answer to
a local query qk contains both the local certain answers and those retrieved from the local
database Dk, that is, ans(qk) = certM,k(qk) ∪ qDk .
Query containment is a fundamental problem in database research. In general, query con-
tainment checks whether one query is contained in another. This problem has been studied
in the following three cases.
The first case is query containment in a single database D, over which the two queries are
posed, that is, D1 = D2 = D. Given a single database schema S over which q1 and q2 are
posed, we say q1 is contained in q2, denoted q1 ⊆ q2, if they have the same output schema and
qD1 ⊆ qD2 for every database D of S. The two queries q1 and q2 are said to be equivalent, denoted
q1 ≡ q2, if qD1 ⊆ qD2 and qD2 ⊆ qD1 (2).
The second case is query containment in data integration systems, where both queries are
posed over the global database. The data sources are usually homogeneous in the sense that the
same syntax is used. Given that the sources are expressed as views over the global database,
two queries are said to be equivalent relative to the same set of data sources, if for any source
databases they have the same set of certain answers. The query containment problem in this
case is called relative query containment (82).
The third case is also in homogeneous data integration systems, where data sources are
defined as views of the global schema, but the two queries are formulated in terms of different
alphabets. In particular, there are two kinds of queries, i.e., the queries qΣ over the alphabet
Σ of the global schema and the queries qV over the alphabet V of the view definitions. The
query containment in this case is called view-based containment and is discussed for different
situations such as containment between qΣ1 and qΣ
2 , between qΣ1 and qV2 , between qV1 and qΣ
2 ,
and between qV1 and qV2 (28).
In our case, we are interested in two kinds of containment, specifically the containment
between a global query q and a union of local queries q1, ..., qn, and the containment between two
local queries qk and ql. The first kind of containment, which we call global query containment,
is the same as the containment between qΣ1 and qV2. The second kind, however, differs from the
containment between qV1 and qV2 , in the sense that qk and ql refer to different alphabets but qV1
and qV2 are expressed over the same alphabet. We call the containment between qk and ql P2P
query containment, because of its likeness to query processing in a P2P system. Next we give
the formal definitions for these two containments in our framework.
Definition 2.6 (Global query containment) Let G be the global ontology over n XML source
schemas S1, ...,Sn, M be the mapping table, q be a global query posed over G, and q′ be a union
of local queries q1, ..., qn respectively over S1, ...,Sn. We say q is globally contained in q′,
denoted q ⊆M q′, if for any databases D1, ...,Dn, we have certM(q) ⊆ qD11 ∪ ... ∪ qDnn. We say
q and q′ are globally equivalent, denoted q ≡M q′, if q ⊆M q′ and q ⊇M q′.
q and q′ are globally equivalent, denoted q ≡M q′, if q ⊆M q′ and q ⊇M q′.
Definition 2.7 (P2P query containment) Let G be the global ontology over n XML source
schemas S1, ...,Sn, M be the mapping table, qi be a local query posed over Si, and qj be a local
query over Sj. We say qi is P2P contained in qj, denoted qi ⊆M qj, if for any databases
D1, ...,Dn, we have certM,i(qi) ∪ qDii ⊆ certM,j(qj) ∪ qDjj. We say qi and qj are P2P
equivalent, denoted qi ≡M qj, if qi ⊆M qj and qi ⊇M qj.
Example 2.3 Consider the following three queries q, q1, and q2 respectively on the global on-
tology G, local XML source S1, and local XML source S2. Also consider the mapping table M
shown in Table IV.
q: ans(x, y) :- name(u, x), title(v, y), contained(u, v).
q1: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).
q2: ans(x, y) :- /fullname(u, x), /title(v, y), /article(u, v).
By executing q on the global database B, q1 on D1 and on the retrieved database B1, and q2
on D2 and on the retrieved database B2, we obtain the following answers to the three queries.
certM(q) = qB: {(a1, b1), (a2, b2), (a3, b2), (w1, t1), (w2, t2), (w3, t2)}
qD11 : {(a1, b1), (a2, b2), (a3, b2)}
certM,1(q1) = qB11 : {(w1, t1), (w2, t2), (w3, t2)}
qD22 : {(w1, t1), (w2, t2), (w3, t2)}
certM,2(q2) = qB22 : {(a1, b1), (a2, b2), (a3, b2)}
Therefore, by Definition 2.6 and Definition 2.7, we have q ≡M (q1 ∪ q2) and q1 ≡M q2.
2.5.3 Query Rewriting
In a data integration system where the sources are described as views over the global schema,
query processing is called view-based query processing, which has two approaches, i.e., view-based
query answering and view-based query rewriting (27; 58). Likewise, there are two approaches to
answering a query in our framework, where mappings are expressed by correspondences. The
first approach utilizes the notion of (global or local) certain answers, as previously discussed.
The alternative approach is by query rewriting. Specifically, to answer a global (or local)
query q, the query is rewritten into a union of the queries over all the sources, using the
mappings. The integration of the answers retrieved from each source constitutes the answer to
q.
As mentioned before, there are two directions of query processing in our framework. We
expect that query rewriting in both directions is equivalent, in the sense that the rewriting
is globally (or P2P) equivalent to the original query. We present next two query rewriting
algorithms, i.e., GLRewriting for global-to-local query rewriting and LLRewriting for local-
to-local rewriting, which will ensure the equivalence of the rewritten queries.
Algorithm GLRewriting
Input:  1. q1 over the global ontology G: ans( ~X) :- p1( ~X1), ..., pm( ~Xm);
        2. M between the global ontology G and local XML schemas S1, ...,Sn.
Output: q2: union of the c-XQueries over S1, ...,Sn.
1   q2 = null;
2   For i = 1 to n do
3     headq = headq1; bodyq = null;
4     For j = 1 to m do
5       (c1, c2) = name of the class/property bound to (x1, x2), for ~Xj = (x1, x2);
6       Search M to find (d1, d2) such that {(c1, d1), (c2, d2)} ⊆ πG,Si(M);
7       If a path p exists from d1 to d2 in Si then
8         add p(x1, x2) to bodyq;
9       Else if a path p exists from d2 to d1 in Si then
10        add p(x2, x1) to bodyq;
11      Else add p(x, x1) and p′(x, x2) to bodyq, where x is a new variable bound to
        the lowest common ancestor d of d1 and d2, and p (p′) is the path from d to d1 (d2);
12    q2 = q2 ∪ q;
Figure 14. The GLRewriting algorithm.
We see that the algorithm GLRewriting adopts a strategy similar to the “unfolding”
strategy used by query processing in a GaV-based relational data integration system (70).
However, instead of substituting the predicates in a query q with the corresponding views, the
substitution of predicates in GLRewriting is guided by the correspondences in the mapping
table M, as stated in Lines 5 to 11. The calculation of the class or property (Line 5) bound
to different variables in q1 is as follows. For each predicate p(x1, x2): (1) if p is a property
connecting two classes c1 and c2, we say that x1 is bound to c1 and that x2 is bound to c2; (2)
if p connects a class c to a value (or literal) v, we say that x1 is bound to c and that x2 is bound
to p. Also, we note that the algorithm uses the relational algebra projection operator π (Line
6).
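For illustration, the substitution step of Lines 5 to 11 can be sketched in Python. The dictionary-based representation of schemas (each element path mapped to its parent path) and mapping tables, and all function names below, are our own simplifications rather than part of the algorithm's specification.

```python
# Sketch of GLRewriting's substitution step (Lines 5-11). A schema is modeled
# as a dict {absolute element path -> parent path (None for the root)}, and a
# mapping as a dict {ontology class/property name -> element path}.

def relative_path(schema, ancestor, descendant):
    """Return the path from ancestor down to descendant, or None if absent."""
    steps, node = [], descendant
    while node is not None and node != ancestor:
        steps.append(node[node.rfind("/"):])   # last step, e.g. "/author"
        node = schema.get(node)
    return "".join(reversed(steps)) if node == ancestor else None

def lowest_common_ancestor(schema, d1, d2):
    ancestors, node = set(), d1
    while node is not None:
        ancestors.add(node)
        node = schema.get(node)
    node = d2
    while node not in ancestors:
        node = schema.get(node)
    return node

def rewrite_predicate(schema, mapping, c1, c2, x1, x2, fresh):
    """Substitute one predicate of the global query by path predicates."""
    d1, d2 = mapping[c1], mapping[c2]           # Line 6: look up correspondences
    p = relative_path(schema, d1, d2)
    if p is not None:                           # Lines 7-8: path from d1 to d2
        return [(p, x1, x2)]
    p = relative_path(schema, d2, d1)
    if p is not None:                           # Lines 9-10: path from d2 to d1
        return [(p, x2, x1)]
    d = lowest_common_ancestor(schema, d1, d2)  # Line 11: neither is an ancestor
    x = fresh()                                 # new variable bound to d
    return [(relative_path(schema, d, d1), x, x1),
            (relative_path(schema, d, d2), x, x2)]
```

On the schema of Example 2.4, rewriting contained(u, v) with this sketch yields the switched-variable predicate /author(v, u), and on the setting of Example 2.5 it produces the two predicates /f_name(u, x) and /advisee/a_name(u, y).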
Example 2.4 Given a global query
q : ans(x, y) :- name(u, x), title(v, y), contained(u, v).
we use GLRewriting to rewrite q into a union of subqueries, each on a local XML source
(refer to the mapping table M of Table IV). For illustration, we only look at the rewriting of q
into a subquery q1 over the local source S1.
In particular, Line 5 computes the bound classes or properties of the variables (u, v, x, y) as
(Author, Book, Author.name, Book.title). By looking into M, we find the corresponding element sequence of (Author, Book, Author.name, Book.title) in S1 to be (/books/book/author, /books/book, /books/book/author/name, /books/book/booktitle). From Lines 7 to 11, we compute the predicates in the body of q1 as follows.

q1: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).
Note that for the predicate contained(u, v) in q, we generate in q1 a predicate /author(v, u),
where the order of the two variables is switched. This results from the computation performed
by Lines 9 and 10. In particular, u and v are respectively bound to Author and Book, which
respectively correspond to XML paths /books/book/author and /books/book. From S1, we
find that /author is the path from v to u, not the path from u to v.
Example 2.5 We give one more example to illustrate query rewriting when Line 11 is used.
Consider the following setting, where a local XML schema S1 (on the right side) is mapped to
Figure 15. A part of an XML data integration setting. (The global RDFS ontology G, with classes Advisor and Student connected by the property advises, appears on the left; the local XML schema S1, with element faculty containing f_name and advisee/a_name, appears on the right.)
the global RDFS ontology G (on the left side), as indicated by the dashed lines. The two classes
Advisor and Student are respectively instantiated with the name of faculty and the name of
advisee, that is, the mapping table contains two correspondences:
(Advisor, /faculty/f_name)
(Student, /faculty/advisee/a_name).
Now we consider rewriting a global c-RDQL query q: ans(x, y) :- advises(x, y) into a local c-XQuery query q′ over S1. It is apparent that x and y are bound to Advisor and Student, thus corresponding to /faculty/f_name and /faculty/advisee/a_name, respectively. Because /faculty/f_name and /faculty/advisee/a_name share the same ancestor /faculty, by using Line 11 we add two predicates /f_name(u, x) and /advisee/a_name(u, y) to the body of q′, generating the following local c-XQuery query q′:

ans(x, y) :- /f_name(u, x), /advisee/a_name(u, y).
Algorithm LLRewriting
Input:  1. q1 over a local XML schema S1: ans(X̄) :- p1(X̄1), ..., pm(X̄m);
        2. M between the global ontology G and local XML schemas S1, ..., Sn.
Output: q: a query over local XML schema S2.
1    headq = ans(X̄); bodyq = null;
2    For j = 1 to m do
3      (c1, c2) = name of the element bound to (x1, x2), for X̄j = (x1, x2);
4      Search M to find (d1, d2) such that {(c1, d1), (c2, d2)} ⊆ πS1,S2(M);
5      If a path p exists from d1 to d2 in S2 then
6        add p(x1, x2) to bodyq;
7      Else if a path p exists from d2 to d1 in S2 then
8        add p(x2, x1) to bodyq;
9      Else add p(x, x1) and p′(x, x2) to bodyq, where x is a new variable bound to
       the lowest common ancestor d of d1 and d2, and p (p′) is the path from d to d1 (d2);
Figure 16. The LLRewriting algorithm.
Algorithm LLRewriting differs from GLRewriting only in finding the elements bound to
the variables (Line 3) and in finding the corresponding elements from the mapping table (Line
4). Unlike in global-to-local rewriting, the result of using LLRewriting is a single c-XQuery.
Taking into account the definitions of global and P2P query containment, we prove below
that the algorithms GLRewriting and LLRewriting yield equivalent queries.
Theorem 2.1 Given a global query q over the global ontology G, its rewriting q′ as computed
by GLRewriting is globally equivalent to q, that is, q ≡M q′.
Proof sketch. To prove q ≡M q′, where q′ = q1 ∪ ... ∪ qn, we check whether certM(q) = q1^D1 ∪ ... ∪ qn^Dn, given the mapping table M(G, S1, ..., Sn). Taking into account the semantics of M, given any sequence u of values from the global database B that makes bodyq true, we can always find a sequence v of values from D1, ..., Dn, since σ(G) = σ(S1) ∪ ... ∪ σ(Sn). By GLRewriting, the sequence v is exactly the one that makes bodyqi true, where i ∈ [1..n]. Therefore, we have q^B ⊆ q1^D1 ∪ ... ∪ qn^Dn. Similarly, we can show that q^B ⊇ q1^D1 ∪ ... ∪ qn^Dn. By the definition of certain answers, we conclude that certM(q) = q1^D1 ∪ ... ∪ qn^Dn. □
Similarly, we have:
Theorem 2.2 Given a local query q1 over a local XML source S1, its rewriting q2 over the
local XML source S2 computed by LLRewriting is P2P equivalent to q1, that is, q1 ≡M q2.
We discuss here an interesting property, namely reversibility, of the local-to-local query
rewriting. Informally, consider a local query q1, which is rewritten into another local query
q2. If q2 can be rewritten back to a query q′1 (on the same source as q1) such that q1 ≡ q′1,
we say q′1 is a reverse query of q1. In the case that q2 and q′1 are computed using the same
rewriting algorithm, we say that the algorithm is reversible if every query that is rewritable by the algorithm has a reverse rewriting.
More generally, we consider a P2P data integration system with a cyclic path of P2P mappings, informally annotated as p1, M12, p2, ..., M(n−1)(n), pn, Mn1, p1, and an equivalent query rewriting algorithm that translates a query q1 (over p1) along this path until it comes back to p1 as the resulting query q′1. In the spirit of equivalent query rewriting, we expect that q1 ≡ q′1 and, furthermore, that (q1 ≡M q2), ..., (qn ≡M q′1) ⇒ q1 ≡ q′1 and q1 ≡ q′1 ⇒ (q1 ≡M q2), ..., (qn ≡M q′1). In other words, we expect that there exists a logical relationship between P2P query containment/equivalence and a reversible rewriting algorithm.
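As a toy illustration of the syntactic side of this property, one can compare a query body with its round-trip rewriting up to renaming of variables. This check is merely sufficient for q1 ≡ q′1, and the representation of bodies as lists of (path, variable, variable) tuples is our own, not part of the framework.

```python
def normalize(body):
    """Canonically rename variables so that alpha-equivalent conjunctive
    bodies (lists of (predicate, var, var) tuples) compare equal."""
    names, out = {}, []
    for pred, *vars_ in sorted(body):
        out.append((pred, tuple(names.setdefault(v, f"v{len(names)}")
                                for v in vars_)))
    return out

def is_reverse(q1_body, q1_roundtrip_body):
    """True if the round-trip rewriting is syntactically equivalent to q1."""
    return normalize(q1_body) == normalize(q1_roundtrip_body)
```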
2.6 Summary
XML and its schema languages do not express semantics but rather the document structure,
such as information about nesting. Therefore, semantically equivalent documents often present
different document structures when they originate from different applications. In this thesis,
we provide an ontology-based framework that aims to make XML documents interoperate at
the semantic level while retaining their nesting structure. The framework consists of two key
aspects: data integration and query processing.
For data integration, a global RDFS ontology is generated by merging the local RDFS
ontologies that are generated from each of the XML documents. At the same time, the mappings
between the global ontology and local XML schemas are manually established. We extend RDFS
by defining additional metadata that can encode the nesting structure of an XML document.
For query processing, we propose two query rewriting algorithms: one algorithm translates an
RDF query (posed on the global ontology) to an XML query; the other algorithm translates an
XML query (posed on one of the individual XML data sources) to another XML query (posed
on a different XML data source). In doing so, we discuss the problem of query containment
for two query languages, namely conjunctive RDQL (c-RDQL) and conjunctive XQuery (c-XQuery). It is shown that both query rewriting algorithms produce rewritings equivalent to the original queries, in terms of global and P2P query equivalence, respectively.
In the future, we will extend query processing in our framework, by taking into account
other data models, such as relational and RDF data sources. We will further study query
containment in the case of more expressive query languages, e.g., the complete RDQL and
XQuery. The concept of reversibility of query rewriting, especially in P2P data integration
systems, is also a direction for future research.
CHAPTER 3
HYBRID PEER-TO-PEER DATA INTEGRATION
3.1 Introduction
The Semantic Web has been proposed to add semantics to web content and to enable
interoperability among heterogeneous data sources. Both Extensible Markup Language (XML)
and Resource Description Framework (RDF) can be used to represent information on the Web.
However, there exists a wide gap between the two languages, since RDF data has domain
structure (the concepts and the relationships between concepts) while XML data has document
structure (the hierarchy of elements) (59).
An example is shown in Figure 17, in which the RDF schema R explicitly specifies two
concepts, Book and Publisher, as well as the publishedBy relationship. Figure 17 also shows
two XML schemas S1 and S2. Each of these XML schemas contains two concepts: book
and author (equivalently denoted by article and writer in S2). Conceptually, these two
XML schemas are quite similar. Structurally speaking, however, they are very different: S1
(book-centric schema) has the author element nested under the book element, whereas S2
(author-centric schema) has the article element nested under the writer element.
Furthermore, the wide diversity of possible XML schemas for a single conceptual model also
results in wide diversity for the XML queries. For instance, a user who wants to “List all the
publications” from two data sources corresponding to S1 and S2 may write the XML path expressions, respectively, as /books/book/@booktitle and /writers/writer/article/@title.

Figure 17. An example of heterogeneous XML and RDF data sources. (The figure shows the local XML schema S1 and its document D1 "books.xml", where books contains book elements with @booktitle and nested author elements with @name; the local XML schema S2 and its document D2 "writers.xml", where writers contains writer elements with @fullname and nested article elements with @title; and the local RDF schema R, defined in the namespace http://examples.org/local#, with classes Book and Publisher, literal-valued properties booktitle, ISBN, and name, and the publishedBy relationship, together with sample RDF data.)
We notice that although the two XML path expressions refer to semantically equivalent concepts, they follow two distinct XML paths. In contrast, schemas defined on the conceptual level
(known as conceptual schemas or ontologies) are flat in document structure, and therefore the
user can formulate a query without considering the structure of the source (we refer to such
queries as conceptual queries). RDF Schema (RDFS), DAML+OIL, and OWL are examples of
languages used to create conceptual schemas.
There are currently several attempts to use conceptual schemas (3; 5; 38; 39) and conceptual
queries (29; 31) to overcome the problem of structural heterogeneities among XML sources. In
this chapter, we propose a framework called PEPSINT (PEer-to-Peer Semantic INTegration
framework) to semantically integrate heterogeneous XML and RDF data sources in a P2P
environment. We discuss the architecture of PEPSINT, and present a solution for semantic
integration and query processing in the P2P heterogeneous environment. In brief, we make the
following contributions in this chapter:
• We propose a P2P schema-based data management framework, PEPSINT, built on a
hybrid P2P architecture, in which the global RDF ontology (constructed using the global-
as-view approach (70)) in the super peer behaves not only as a central control point over
the peers but also as a mediator for query translation from peer to peer.
• For the purpose of semantic integration, we propose an approach that preserves the domain structure of RDF and the document structure of XML. Specifically, the semantic
integration of XML and RDF data sources is implemented at the schema level (through
the schema matching process) and at the instance level (through the query answering
process).
• We also provide a set of query rewriting algorithms that can propagate a user’s query
across the heterogeneous XML or RDF data sources in PEPSINT. In our framework,
mappings connect the peer to the super peer, thus making query processing within the
network transparent to a user in any peer.
The rest of this chapter is organized as follows. Section 3.2 gives a review of related work.
In Section 3.3 we describe the architecture of PEPSINT and its main components. Section 3.4
discusses schema-based integration of RDF sources and (structurally dissimilar) XML sources.
Query processing in PEPSINT is covered in Section 3.5. Finally, we draw conclusions and
discuss future work in Section 3.6.
3.2 Related Work
The research community has, to date, produced several P2P data management systems that
aim to enable interoperability among distributed heterogeneous data sources.
The Edutella project (85) provides an RDF-based metadata infrastructure for P2P net-
works based on the JXTA framework (52). In Edutella, connections between peers are encoded
into a network topology known as the Edutella super-peer topology, which is similar to the
hybrid architecture used in PEPSINT. A Datalog-based query exchange language called RDF-QEL is proposed to serve as a common query interchange format; thus, a wrapper translates queries in local query languages, such as SQL and XPath, into RDF-QEL. Edutella does not support XML
sources directly, though the RDF data sources may be serialized in XML format.
PeerDB (86) is an agent-based P2P data management system where each peer holds a
relational database. The metadata for relations that are sharable with other peers is specified
in a local export dictionary. Unlike PEPSINT, there are no established mappings between peers.
Thus, query reformulation between peers in PeerDB is assisted by agents through a relation-matching strategy; this is a process of matching the metadata between relations in different
peers. XML and RDF data are not considered in the current implementation of PeerDB.
SEWASIE (11) is another agent-based P2P system that aims to integrate SEWASIE Information Nodes (SINodes), where each node acts as an autonomous mediator-based system. It contains
two types of agents: query agents that are responsible for query processing and answering;
and brokering agents (peers) that handle the mappings between nodes. Each brokering agent
directly controls at least one SINode and handles the creation and maintenance of semantic
relationships among concepts from different information nodes in the system. SEWASIE does
not currently support RDF data sources.
Hyperion (6) proposes an architecture for a P2P data management system for relational
databases (one stored at each peer). Similarly to PEPSINT, mapping tables and mapping
expressions (mapping tables that allow variables) are used to store connections between local
schemas in peers. A query manager uses the mapping tables and mapping expressions to
rewrite a query posed in terms of the local schema; the rewriting process produces a query that
is run over the schema of acquainted peers. Unlike PEPSINT, only relational data sources and
relational queries are supported by Hyperion.
The Piazza system (59) is a P2P data management system that, like PEPSINT, supports
interoperation of both XML and RDF data sources. Furthermore, both systems preserve the document structure of XML sources during interoperation. The differences from
PEPSINT are: (1) Piazza is based on the pure P2P architecture in which peers are connected
directly, whereas PEPSINT is built on top of a hybrid architecture with a super peer containing
the global ontology. This is a tradeoff between efficiency and autonomy (11). (2) Piazza uses a
(declarative) XQuery-based mapping language for mediating between nodes, whereas PEPSINT
utilizes mapping tables to store schema correspondences, which we believe results in easier construction and maintenance of mappings. (3) The Piazza system achieves its interoperability in
a low-level (syntactic) way, i.e., through the interoperability of XML and the XML serialization
of RDF. For this reason, the user has to write an RDF query in terms of an XQuery. The
query rewriting in Piazza is based on pattern matching between an XQuery expression and the
mappings. In contrast, PEPSINT supports RDF queries at the conceptual level (RDQL), as
well as XQuery. Query translation is realized by a collection of query rewriting algorithms.
3.3 The PEPSINT Architecture
There are two types of P2P architectures (84): the pure P2P architecture, in which no
central point of control exists and peers are autonomous but can communicate directly with
each other; and the hybrid P2P architecture that contains at least one central point of control.
The global control point(s) maintain either network control or the references to the remaining
peers. Based on the hybrid P2P architecture, PEPSINT contains two types of peers: the super
peer, containing the global RDF ontology, and the peers, containing local schemas and local
data sources. Each peer represents an autonomous information system and connects with the
super peer by establishing P2P mappings. As shown in Figure 18, the PEPSINT architecture
has four main components.
XML to RDF wrapper. Since XML is characterized by having a hierarchical document
structure while RDF has a flat document structure, it is hard for the user to directly map a
local XML schema to the global RDF ontology. To solve this problem, an XML to RDF wrapper
is used to transform the XML schema into a local RDF schema, which is then mapped to the
global ontology. This is a process that conceptualizes the XML elements into RDF concepts
while keeping their nesting information (by using a specialized RDF property).
Local XML and RDF schemas. The local XML and RDF schemas residing in peers
contain both data and metadata. For the purpose of semantic integration, we represent a local
RDF schema as a labeled digraph (from now on referred to as RDF schema graph). The domain
Figure 18. The PEPSINT architecture. (The super peer holds the global RDF ontology; each peer holds a local XML or RDF schema and a mapping table, with an XML to RDF wrapper at XML peers. The arrowed lines depict the mapping process and the two query processing fashions: data-integration, where a query Q1 is rewritten into subqueries Q11′, ..., Q1i′, ..., Q1n′ over the peers, and hybrid P2P, where a query Q2 at one peer is rewritten into Q2i′, ..., Q2n′ over the other peers.)
structure is explicitly represented by labeled vertices (concepts) and labeled arcs (relationships
between concepts). Likewise, a local XML schema is represented as a labeled tree (from now on
referred to as XML schema tree) that specifies nesting relationships between labeled vertices
(elements).
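These two representations can be captured by minimal data structures; the Python field names below are illustrative, not PEPSINT's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RDFSchemaGraph:
    """Labeled digraph: labeled vertices (concepts) and labeled arcs."""
    concepts: set = field(default_factory=set)
    arcs: set = field(default_factory=set)       # (concept, label, concept)

    def add_relationship(self, c1, label, c2):
        self.concepts |= {c1, c2}
        self.arcs.add((c1, label, c2))

@dataclass
class XMLSchemaTree:
    """Labeled tree: vertices are elements, edges encode nesting."""
    parent: dict = field(default_factory=dict)   # child element -> parent

    def add_nesting(self, parent_el, child_el):
        self.parent[child_el] = parent_el
```

For instance, the arc (Book, publishedBy, Publisher) of the schema R of Figure 17 is one entry of the graph, while the nesting of author under book in S1 is one entry of the tree.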
Global RDF ontology. The global RDF ontology in the super peer is a virtual mediated
schema integrated from distributed local RDF schemas (using the global-as-view approach (70)).
In PEPSINT, the global ontology has two roles: (1) It provides the user with a uniform and
complete view of data sources in the distributed peers; and (2) it serves as a mediator for
query translation from one peer to other peers. The global RDF ontology is a fairly simple
ontology: it does not contain high-level axioms, such as those available in DAML+OIL or OWL.
Mapping table. A mapping table stores mappings between local schemas and the global
ontology. We use XML path expressions to represent the elements contained in an XML schema,
and RDF path expressions to represent the concepts and relationships in an RDF schema.
The operation of PEPSINT can be divided into two phases: mapping (or design) phase and
query (or runtime) phase, as respectively indicated by the hollow arrowed lines and the solid
and dashed arrowed lines in Figure 18. To realize semantic integration of XML and RDF data
sources, domain structure and document structure must be preserved in both phases.
1. Mapping phase. Whenever a new peer joins the PEPSINT network, the peer gets
registered and indexed in the super peer by establishing mappings from its local schema to the
global ontology. The mappings are established through a process of schema matching 1 and
stored in the mapping table of the peer. During the process of schema matching, the global
ontology is extended by integration of the local schemas. As previously mentioned, the domain
structure and document structure of local schemas are encoded in the mappings.
2. Query phase. PEPSINT provides two query processing modes. (1) In the data-
integration mode, the user poses a query (source query) on the global ontology in the super
peer, which is then reformulated into multiple subqueries (target queries) over the XML and
RDF sources in the peers (one subquery for each source). By executing the target queries and
integrating their results, the system returns an answer to the user at the site of the super peer.
1 Schema matching is a basic problem in many database application domains, and currently it must be performed manually. A taxonomy covering most of the existing approaches to schema matching has been devised (99).
(2) In the hybrid P2P mode, the user can pose a source query on the local XML or RDF source
in some peer. Locally, the query will be executed on the local source to get a local answer.
Meanwhile, the source query is reformulated into a target query over every other peer through
transitive mappings (compositions of mappings from the original peer to the super peer and
mappings from the super peer to the other target peers). By executing the target query, each
peer returns an answer to the original peer, called the remote answer. The local and remote
answers are integrated and returned to the user at the site of the originating peer.
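The composition of the two mapping hops can be sketched as follows, assuming (as our own simplification) that each mapping table is a dictionary of one-to-one correspondences; pairs with no counterpart in the target peer are simply dropped, which is how partial mappings can lose information.

```python
def compose_mappings(peer_to_global, global_to_peer):
    """Compose peer -> super-peer and super-peer -> peer correspondences
    into a transitive peer -> peer mapping."""
    return {local: global_to_peer[g]
            for local, g in peer_to_global.items()
            if g in global_to_peer}
```

With the S1 and S2 columns of the mapping table of Figure 20, composition maps /books/book to /writers/writer/article through the global concept Book.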
Query translation is achieved by using the mappings in conjunction with a collection of
query rewriting algorithms. We discuss the mapping and query phases in greater detail in
Section 3.4 and Section 3.5, respectively. Running examples based on the schemas in Figure 17
will be used for illustration.
3.4 Mapping Process
In PEPSINT, the data sources residing at the peers may be either XML data modeled by
an XML schema language (e.g., XML Schema) or else RDF data whose classes and properties
are described using RDF Schema (RDFS). As previously mentioned, mappings between local
schemas and the global ontology are established by the schema matching process during the
registration of a peer to the super peer. The key operation in this process is the preservation
of the domain structure of RDF sources and the document structure of the XML sources.
3.4.1 Mapping Local RDF Schemas to the Global Ontology
Schema matching takes the global RDF ontology G (in the super peer) and a local RDF
schema R (in the peer) as the inputs and returns a set of mappings M between the elements of
G and the elements of R as the output. Meanwhile, the global ontology is updated by merging
or adding metadata from the local RDF schema.
Elements in an RDF schema include concepts and roles (also known as classes and properties
in RDFS terminology). When matching the local RDF schema with the global RDF ontology,
for each element pL in the local RDF schema, if there already exists in the global ontology
a semantically equivalent element pG, the two elements will be merged and a correspondence
such as (pL, pG) will be generated. Otherwise, the element pL will be copied into the global
ontology as pG, and a correspondence (pL, pG) will be generated as well. We define a group
of operations on the ontology to implement schema matching between two RDF schemas, e.g.,
merging of classes, merging of properties, merging of relationships between classes, and copying
a class and/or its properties. A concrete example is given in our previous work (39).
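The merge-or-copy discipline can be sketched as follows, with a hypothetical equivalence oracle standing in for the (currently manual) matching decision; the function and parameter names are ours, not part of the framework.

```python
def match_local_schema(global_elems, local_elems, equivalent):
    """For each local element pL, merge it with a semantically equivalent
    global element pG if one exists; otherwise copy pL into the global
    ontology. Return the generated correspondences (pL, pG)."""
    correspondences = []
    for p_local in local_elems:
        p_global = next((g for g in global_elems if equivalent(p_local, g)),
                        None)
        if p_global is None:              # no match: copy pL as pG
            p_global = p_local
            global_elems.append(p_global)
        correspondences.append((p_local, p_global))
    return correspondences
```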
3.4.2 Mapping Local XML Schemas to the Global Ontology
By transforming the participating local XML schema into a local RDF schema, we can
convert the problem of matching an XML schema with the global ontology into the problem of
matching an RDF schema with the global ontology, which is discussed in Section 3.4.1.
Figure 19. RDF schemas transformed from local XML source schemas. (The local RDFS ontology R1 connects Books to Book and Book to Author via rdfx:contained, with literal-valued properties booktitle and name; the local RDFS ontology R2 connects Writers to Writer and Writer to Article via rdfx:contained, with literal-valued properties fullname and title.)
[The global ontology graph: Books, Book, Author, and Authors connected by rdfx:contained; literal-valued properties title, name, and ISBN; and the publishedBy relationship from Book to Publisher, whose name is a literal.]

RDF path           RDF path           XML path expressions       XML path expressions
expressions in G   expressions in R   in S1                      in S2
Books              –                  /books                     –
Book               Book               /books/book                /writers/writer/article
Book.title         Book.booktitle     /books/book/@booktitle     /writers/writer/article/@title
Book.ISBN          Book.ISBN          –                          –
Book.publishedBy   Book.publishedBy   –                          –
Publisher          Publisher          –                          –
Publisher.name     Publisher.name     –                          –
Authors            –                  –                          /writers
Author             –                  /books/book/author         /writers/writer
Author.name        –                  /books/book/author/@name   /writers/writer/@fullname

Figure 20. The global ontology and its mapping table.
The schema transformation is carried out by the XML to RDF wrapper. The XML to RDF
wrapper converts XML attributes and simple elements to RDF properties; it converts XML
complex elements to RDF classes. The wrapper also encodes the element-attribute relationship
and the element-subelement relationship in XML schema respectively as the class-to-literal
relationship and the class-to-class relationship in the resulting RDF schema.
We choose to define a new, specialized RDF property rdfx:contained (the prefix rdfx stands
for the new namespace “http://pepsint.org/rdfx#”) to explicitly denote nesting relationships. In particular, given that two XML elements ei (parent element) and ej (child element)
are respectively converted into two RDF classes, ci and cj , the property rdfx:contained of ci is
then generated to connect ci to cj . Figure 19 shows the resulting local RDF schemas R1 and R2
that are respectively converted from the two XML schemas S1 and S2 shown in Figure 17. Finally, the global ontology G integrated from S1, S2 and R (in Figure 17) and its mapping table
are shown in Figure 20. The grayed concepts or roles are the ones merged from local sources.
We notice that both the rdfx:contained property in G and the mappings in the mapping table
encode the document structure of XML sources, so that either of them can be exploited for
tracking XML document structure in future query translations.
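A minimal wrapper following these rules might be sketched as below; the input representation (parent links for complex elements, attribute lists) and the capitalization convention for class names are our assumptions, not PEPSINT's code.

```python
def xml_schema_to_rdf(parent, attributes):
    """Convert an XML schema tree to RDF classes, literal-valued properties,
    and rdfx:contained arcs. `parent` maps each complex element to its parent
    (None for the root); `attributes` maps an element to its attribute or
    simple-element names."""
    classes = {el.capitalize() for el in parent}
    triples = []
    for el, par in parent.items():
        if par is not None:     # element-subelement nesting -> rdfx:contained
            triples.append((par.capitalize(), "rdfx:contained", el.capitalize()))
    for el, attrs in attributes.items():
        for a in attrs:         # attribute -> class-to-literal property
            triples.append((el.capitalize(), a, "Literal"))
    return classes, triples
```

Applied to S1 of Figure 17, this reproduces the shape of R1 in Figure 19: Books rdfx:contained Book, Book rdfx:contained Author, plus the literal-valued properties booktitle and name.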
3.5 Query Processing
3.5.1 Assumptions
For the simplicity of discussion, we make the following assumptions.
1. We assume the mappings from a local schema to the global ontology are total, one-to-one
mappings. On the other hand, the mappings from the global ontology to the whole set of local
schemas are total but not one-to-one mappings, since a concept in the global ontology might
be merged from multiple concepts of different local schemas (as a result of schema matching).
The mappings from the global ontology to a single local schema are one-to-one but they may
be partial mappings, which means a query run at a local source may result in an incomplete
answer.
2. We also assume that XML queries conform to a subset of XQuery (19), which we call
PXQuery (Partial XQuery) in this chapter. PXQuery consists of a non-nested FLWR expression
that includes four clauses: for, let, where, and return; the where clause may only contain
comparison operators. Other limitations of PXQuery include: (1) Only a single XML document
is involved in the query; (2) No new XML fragments are introduced in the query; (3) The path
expressions contained in the clauses only use child axes; (4) No type declarations, functions,
order clauses, and predicate filters are used.
3. To represent RDF queries, we use RDQL, which uses an SQL-like syntax (62). RDQL
consists of the following clauses: SELECT, FROM, WHERE, AND, and USING. We assume only com-
parison operators are used in the AND clause of the RDQL query. The FROM and USING clauses
are not the focus of our attention since they are not involved in query translation.
For the sake of convenience, we associate a PXQuery query Q with a triple (V_Q^R, V_Q^W, C_Q), where V_Q^R and V_Q^W are the two sets that respectively contain all the XML path expressions in the return clause and in the where clause, and C_Q contains the constraints, whose items are of the form v R c, where v ∈ V_Q^W, c stands for a constant, and R is a comparison operator (e.g., =, <, >, ≤, ≥, and ≠). Likewise, we associate an RDQL query Q with a triple (P_Q^S, P_Q^W, C_Q), where P_Q^S and P_Q^W respectively contain all the RDF path expressions in the SELECT clause and in the WHERE clause, and C_Q contains the constraints, whose items are of the form p R c, where p ∈ P_Q^W, c stands for a constant, and R is a comparison operator.
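These triples can be written down directly as data structures; the Python field names below are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class PXQueryTriple:
    return_paths: set = field(default_factory=set)  # V_Q^R
    where_paths: set = field(default_factory=set)   # V_Q^W
    constraints: set = field(default_factory=set)   # items (v, R, c)

@dataclass
class RDQLTriple:
    select_paths: set = field(default_factory=set)  # P_Q^S
    where_paths: set = field(default_factory=set)   # P_Q^W
    constraints: set = field(default_factory=set)   # items (p, R, c)
```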
3.5.2 Query Answering in Data Integration Mode
Query answering in data integration mode includes the following steps. We use a running
example for illustration.
1. Analyzing the source RDQL query to convert it from a string to a triple Qin: (P_Qin^S, P_Qin^W, C_Qin). In order to get the RDF path expressions in P_Qin^S and P_Qin^W, we have to match the triple patterns (specified in the WHERE clause) with the RDF graph corresponding to the local RDF schema. C_Qin contains all the constraints specified in both the triple patterns of the WHERE clause and the AND clause. Because of space limitations, we omit the detailed process of pattern matching in this chapter.
Example 3.1 To “find the publications written by a1”, the user poses the following query over the global ontology (the prefix go stands for the namespace “http://examples.org/global#”, where the global ontology is defined); the resulting Qin components are listed after the query.

SELECT ?title
WHERE (?book, <go:title>, ?title),
      (?book, <rdfx:contained>, ?author),
      (?author, <go:name>, ?name)
AND   (?name eq "a1")

P_Qin^S = {Book.title}
P_Qin^W = {Book, Book.title, Author, Author.name}
C_Qin = {(Author.name, eq, "a1")}
2. Rewriting the source query into target subqueries over the RDF or XML sources, by applying the query rewriting algorithm RDQL2RDQL or RDQL2PXQuery (once for each source), which utilizes mapping information stored in the mapping table of Figure 20. The output Qout of a query rewriting algorithm is a triple of the form (P_Qout^S, P_Qout^W, C_Qout) for an RDF source or (V_Qout^R, V_Qout^W, C_Qout) for an XML source. From Qout, we can compose the target query that is executable over the local source. Below is the result of this step for Example 3.1.

For the local RDF source R:
P_Qout^S = {Book.booktitle}, P_Qout^W = {Book, Book.booktitle}, C_Qout = {}.
The target RDF query is:
SELECT ?booktitle
WHERE (?book, <lo:booktitle>, ?booktitle)

For the local XML source S1:
V_Qout^R = {/books/book/@booktitle},
V_Qout^W = {/books/book, /books/book/@booktitle, /books/book/author, /books/book/author/@name},
C_Qout = {(/books/book/author/@name, =, "a1")}.
The target XML query is:
for $book in doc("books.xml")/books/book
where $book/author/@name = "a1"
return $book/@booktitle

For the local XML source S2:
V_Qout^R = {/writers/writer/article/@title},
V_Qout^W = {/writers/writer/article, /writers/writer/article/@title, /writers/writer, /writers/writer/@fullname},
C_Qout = {(/writers/writer/@fullname, =, "a1")}.
The target XML query is:
for $writer in doc("writers.xml")/writers/writer
where $writer/@fullname = "a1"
return $writer/article/@title
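The path substitution at the core of these rewritings can be sketched as follows. This is a simplification (the full algorithms must also recompose the FLWR or SELECT clauses), and the dictionary mapping is our own representation of one column of the mapping table.

```python
def rewrite_triple(source_triple, mapping):
    """Translate the path expressions of a source query triple into those of
    one target source. Paths without a correspondence are dropped, which is
    how partial mappings can lose constraints."""
    select, where, constraints = source_triple
    out_select = {mapping[p] for p in select if p in mapping}
    out_where = {mapping[p] for p in where if p in mapping}
    out_constraints = {(mapping[p], op, c)
                       for (p, op, c) in constraints if p in mapping}
    return out_select, out_where, out_constraints
```

With the S1 column of Figure 20, this translates the triple of Example 3.1 into the S1 output sets above; with the R column, the constraint on Author.name is dropped, since R has no correspondence for Author.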
3. Building an answer to the source query (on the global ontology G) by assembling
the fragment results returned from the local sources. We need not only to union the fragments returned from different sources, removing identical records, but also to join records based on a common key attribute. In addition, null values are filled into records that only partially cover the queried attributes. The result of an RDQL query is a table containing
URIs or string constants corresponding to the path expressions in the SELECT clause. For
example, the answer to the query of Example 3.1 is a table containing a single tuple ("b1"),
which is the union of results from S1 and S2. The record ("b3") returned from R is filtered
out, since the target query over R loses the query constraint during rewriting, because of the partial mappings from G to R (i.e., R has no correspondence for the class Author of G).
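The assembly step (union with duplicate removal and null padding) can be sketched as follows; the function and field names are illustrative, not the thesis's implementation.

```python
def assemble(fragments, columns):
    """Union fragment results (each fragment is a list of dicts keyed by
    attribute name), removing duplicate records and padding attributes
    that a fragment does not cover with None (null)."""
    seen, answer = set(), []
    for fragment in fragments:
        for record in fragment:
            row = tuple(record.get(c) for c in columns)
            if row not in seen:
                seen.add(row)
                answer.append(row)
    return answer

# Fragments from S1 and S2 for Example 3.1 both contribute ("b1",);
# the duplicate is removed.
assemble([[{"title": "b1"}], [{"title": "b1"}]], ["title"])  # -> [("b1",)]
```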
3.5.3 Query Answering in Hybrid P2P Mode
We only focus on the case of translating a source query in PXQuery from a peer to all the
other peers, since the translation of a source RDQL query is similar to what is done in data
integration mode (except for the transitive mappings). Query answering in hybrid P2P mode
includes the following steps.
1. Analyzing the source PXQuery query to convert it from a string into a triple Q_in : (VQR_in, VQW_in, CQ_in).
Example 3.2 To “list all the publications”, the user poses a query (over the local source S1) as shown below. The resulting Q_in components follow the query.
for $book in doc("books.xml")/books/book
return $book

VQR_in = {/books/book}, VQW_in = {}, CQ_in = {}
2. Rewriting the source query into a target query over all the other connected RDF or XML sources, by utilizing the query rewriting algorithm PXQuery2RDQL or PXQuery2PXQuery (once for each source) and the transitive mappings between the original data source and the target data source. The output of the query rewriting algorithm is a triple Q_out : (VQR_out, VQW_out, CQ_out) for a target XML data source or (PQS_out, PQW_out, CQ_out) for a target RDF data source.
An XML query must take into account the document structure of the XML source. The answer to an XML query is returned as a set of subtrees, each of which is rooted at one of the queried nodes (i.e., vertices in VQR). For instance, the answer to the XML query in Example 3.2 is the subtree rooted at book in S1 (see Figure 17). Therefore, the query rewriting algorithm also outputs a tree T whose children are the resulting subtrees of the answer. The result of this step, following Example 3.2, is shown below.
For the local RDF source R:
PQS_out = {Book}, PQW_out = {}, CQ_out = {}.
The target RDF query is:
SELECT ?book, ?title
WHERE (?book, <lo:booktitle>, ?title)
[Output tree T for R: root Book with literal-valued properties booktitle and ISBN, and publishedBy leading to Publisher with literal-valued property name.]
For the local XML source S2:
VQR_out = {/writers/writer/article}, VQW_out = {}, CQ_out = {}.
The target XML query is:
for $writer in doc("writers.xml")/writers/writer
for $article in $writer/article
return
  <book booktitle="{$article/@title}">
    <author name="{$writer/@fullname}"/>
  </book>
[Output tree T for S2: writers with repeated writer children (attribute @fullname), each with repeated article children (attribute @title).]
3. Building an answer to the source query (against the original data source) by computing the union of the local answer (returned from the originally queried peer) and the remote answers (returned from remote peers). To construct the remote answers, different methods are used depending on whether the target is an RDF source or an XML source. In the former case, because RDQL cannot represent document structure, the remote answer is built by organizing (based on the structure specified by T) the instances returned from executing the target RDQL query. In the latter case, the remote answer is formed by simply executing the target PXQuery query, which already represents the same structure as specified by T. For Example 3.2,
the final answer to the source query is shown below, where the three resulting lines come from
the local sources S1, S2, and R, respectively.
<book booktitle="b1"><author name="a1"/></book>
<book booktitle="b2"><author name="a2"/><author name="a3"/></book>
<book booktitle="b4"/>
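The organization of flat RDQL result tuples into the structure specified by T can be sketched as follows. This is a hypothetical helper fixed to the book/author shape of the example; the real algorithm works against the tree T itself.

```python
def build_remote_answer(tuples):
    """Group flat (booktitle, author) result tuples into one <book>
    element per title, nesting one <author> element per author.
    A None author means the source returned no author for that book."""
    books = {}
    for booktitle, author in tuples:
        books.setdefault(booktitle, [])
        if author is not None:
            books[booktitle].append(author)
    return [
        f'<book booktitle="{title}">'
        + "".join(f'<author name="{a}"/>' for a in authors)
        + "</book>"
        for title, authors in books.items()
    ]

# Tuples from the RDF source R: b2 has two authors, b4 has none.
build_remote_answer([("b2", "a2"), ("b2", "a3"), ("b4", None)])
```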
3.6 Summary
In this chapter, we propose a P2P schema-based data management framework called PEPSINT.
This framework aims to semantically integrate distributed heterogeneous XML and RDF data
sources. We discuss the construction of the architecture, maintenance of mappings, and query
processing in PEPSINT. In particular, semantic integration is implemented at schema-level
through the schema matching process and at instance-level through the query answering process.
A key aspect in these two processes is the preservation of domain and document structure, which
is realized by extending the RDF metadata space and providing a set of query rewriting algorithms. Because of this preservation, the user query can be correctly propagated across the
heterogeneous XML and RDF data sources in PEPSINT, so that information access within the
network is transparent to the user.
As for future work, we will: (1) Develop a proof of correctness for the query process. (2)
Design and implement a semantic web application (e.g., for bibliographic data exchange) in
PEPSINT to validate and evaluate the system. (3) Do a performance comparison of PEPSINT
with other P2P data management systems.
CHAPTER 4
PURE PEER-TO-PEER DATA INTEGRATION
4.1 Introduction
Research on peer-to-peer (P2P) computing techniques is flourishing with a number of pro-
posals on the related open issues such as robustness in dynamic P2P systems, reliability of
participants (peers), network performance, data coordination, and semantic issues (83). Among
these issues, data interoperability is fundamental, especially in the case of fine-grained (e.g.,
content-based) searches in a P2P network of data sources. Thus, P2P data management (or
integration) systems (PDMS) arise by combining schema-based data integration with a P2P
infrastructure (15; 59). In addition, the use of ontologies has been recognized as an effective
approach to promote the interoperability among distributed sources, by resolving their data
heterogeneities at a semantic level (39; 64; 87; 111). These two research trends lead to the
emergence of ontology-based P2P data management systems (OPDMS).
P2P ontology mapping and query processing are two important issues in an OPDMS. While
ontologies are used in local sources as a uniform conceptual metadata representation, which re-
solves the syntactic heterogeneity among sources in different peers, schematic (or structural) and
semantic heterogeneity may still exist. Therefore, ontology mappings are established between
peers to provide a common understanding of their data sources (64). Based on such ontology
mappings, a variety of data management tasks, such as data integration, query processing, and
data exchange, can be performed within the whole OPDMS. In this chapter, we propose a
framework for OPDMS and discuss the issue of query processing in this framework. In partic-
ular, we propose a P2P query rewriting algorithm that takes into account integrity constraints
specified on local data sources.
In our work, local RDFS1 ontologies are used to uniformly represent heterogeneous source
schemas. To represent the semantic mappings among these metadata (ontologies), we propose
a mapping language, namely the P2P Mapping Language (PML), which uses a meta-ontology
called RDF Mapping Schema (RDFMS). We also discuss the process of P2P query answering
in a layered framework, which we propose to manage any peer. In spite of its simplicity in
comparison with some mapping languages (e.g., Semantic Bridging Ontology used in MAFRA
(74)), PML is adequately expressive to represent most types of ontology mappings including
the equivalent, broader (more generalized), narrower (more specialized), union, and intersection
mappings. Furthermore, PML is extensible to define complex (e.g., many-to-many) mappings
and new mapping types (e.g., a sibling mapping based on two broader mappings), due to the
extensibility of RDFMS as is defined on top of RDFS. We define a first order logic (FOL)
semantics for PML, as well as for queries, which lays a unified foundation for query rewriting.
We consider a particular class of queries on RDFS ontologies, namely conjunctive RQL (c-RQL)
queries, and propose a P2P query rewriting algorithm.
1http://www.w3.org/TR/rdf-schema/
The rest of the chapter is organized as follows. In Section 4.2 we describe existing related
work. Section 4.3 gives an overview of our approach. In Section 4.4, we discuss in detail the
P2P mapping language PML, as well as the meta-ontology RDFMS, which is used for mapping
representation. The algorithm for P2P query answering, specifically for P2P query rewriting
based on the P2P mappings, is given in Section 4.5. Finally, Section 4.6 concludes and discusses
future work.
4.2 Related Work
Semantic data integration using conceptual models, such as E-R models and ontologies, has
been widely investigated in the literature (70; 87; 111). Many P2P data management systems
(PDMSs) have been recently proposed, such as the LRM model (15) and Piazza (59). Our
framework as proposed in this chapter is closer to Piazza, which deals with the integration of
XML data and XML serialization of RDF data from different peers. Piazza uses an XQuery-
based mapping language to represent schema mappings. Query answering is realized by pattern
matching between the tree representing the XQuery and the tree representing the mappings.
Examples of OPDMS include the SWAP architecture (46), and based on it, the Bibster
system (57). Our ontology-based query rewriting algorithm in OPDMS is similar to the com-
puteWTA algorithm proposed by Calvanese et al. (26) for query reformulation, both assuming
consistent ontology mappings. However, unlike in computeWTA, we allow partial ontology
mappings, i.e., it is not necessary to map all the atoms in the query to be rewritten. This
assumption is practically meaningful since the user’s burden in mapping two peers can be thus
reduced.
The representation of ontology mappings should facilitate the use of mappings for data
management tasks, including data exchange and query processing. Issues related to ontology
mapping have been studied widely (64; 87). For example, Lehti et al. propose an OWL-based
model particularly for XML data integration (69). For representing RDF schema mappings,
Omelayenko has proposed the use of a meta-ontology, RDFT (90). However, it is unclear how execution-specific constraint information and the data transformation dimension are attached to the bridges. Context OWL (C-OWL) (20) and the Semantic Bridging Ontology (SBO) (74) are
two similar ontology mapping languages, with the former based on an extended OWL syntax
and semantics and the latter represented in DAML+OIL. Both languages define a set of bridge
rules with an explicit semantics. However, the utilization of such rules for query processing
remains an open issue.
In the case where mappings are defined as (relational) views, query processing is often referred to in the literature as view-based query answering or rewriting (58). However, few view-based query processing algorithms address the issue of query rewriting over ontologies, which usually allow for more expressive constraint specification than most schema languages do.
4.3 System Overview
4.3.1 The Layered Peer Architecture
In a P2P data management system, a peer manages its local data source as in a traditional
database system. In addition, a peer also has to be able to communicate with the other peers by providing and consuming services. To this end, we propose for each peer
a layered architecture (as shown in Figure 21), by which distributed peers form a pure P2P
network.
[Figure content: each peer stacks four layers: an application layer with a GUI for the user; a service layer with a query module and a mapping module; a representation layer holding the local ontology and mappings (RDFS and RDFMS); and a syntax layer (RDF/XML) with an XML/RDB wrapper over the local data source. The query module of Peer 1 connects to the query modules of Peers 2 and 3.]

Figure 21. The layered peer architecture.
The peer architecture consists of four layers, in which each upper layer achieves its func-
tionality based on the lower ones. In particular, the syntax layer provides a uniform syntax
(RDF/XML) for serializing the local ontology and its instances. A wrapper is used to convert
the local source schemas and data into such local ontologies. The representation layer contains
the local ontology in RDFS and its mappings in RDFMS. The service layer implements schema
mapping and query processing, which are two main services that a peer can provide to the
network. The application layer contains a GUI (Graphic User Interface) for the user to initiate
query requests. The adoption of a layered peer architecture simplifies the resolution of peer-to-peer heterogeneities into level-to-level dependencies, thus facilitating the data interoperation
by making the layers more maintainable and reusable.
4.3.2 An Illustrative Example
[Figure content: Peer p1 (XML) holds p.dtd and p.xml: a department element containing faculty* elements (with name and pub, e.g., "M. Case" with pub "p01 p02" and "J. Adams" with pub "p01") and publication* elements (with id, type, and title, e.g., "p01"/"book"/"t1" and "p02"/"conference"/"t2"). Peer p2 (RDB) holds the relations proceedings(pid, year, title) with tuples (001, 2000, t1) and (002, 2000, t3); author(aid, affiliation, name) with tuples (001, UC, H. Luis), (002, UC, M. Case), and (003, UIC, J. Adams); and author_proc(aid, pid) with tuples (001, 001), (002, 001), (003, 001), and (001, 002). Peer p3 (RDF) holds f.rdfs and f.rdf: a Faculty class with literal-valued properties firstname, lastname, conf, and book, instantiated by a faculty member H. Luis with publications "t1" (book) and "t3", "t4" (conf).]

Figure 22. A motivating example for P2P data integration.
As shown in Figure 22, the three autonomous peers p1, p2, and p3 contain three data sources, which are heterogeneous in both syntax and schemata. In particular, Peer p1 contains information about faculty and publications in XML (p.xml) and DTD (p.dtd). The publication attribute pub, defined with type IDREFS, refers to one or more publication IDs (id). Such referential constraints define inclusion dependencies, as in relational databases. Peer p2 is a relational database containing conference proceedings. The attributes aid and pid in author_proc are foreign keys referring to author.aid and proceedings.pid, respectively, defining inclusion dependencies. Peer p3 contains an RDF document (f.rdf) with its RDF schema (f.rdfs) defined in RDFS. In comparison with XML data, RDF data is flat, because there is no nesting structure and there are no ordering constraints among the classes and properties.
In addition to syntactic heterogeneity, a notable structural difference among these data sources is that semantically equivalent terms are formulated in different ways. That is, the two types of publications, book and proceedings, are designed as values (instances) of an attribute in p1, as relation names in p2, and as property names in p3.
4.3.3 RDF Metadata Representation
The source schemas specify metadata about different data sources, in terms of elements
and attributes in XML schemas, relations and attributes in relational schemas, and classes
and properties in RDFS. A heterogeneous P2P integration system should provide a uniform
metadata representation to facilitate the P2P mapping process. For this purpose, wrappers are
used to transform heterogeneous schemas into the uniform representation (17; 40; 66).
In our approach, we choose to use RDFS to represent local metadata as a local ontology.
The following description summarizes the method of model-based schema transformation in our
previous work (40). For transformation from relational to RDFS, we represent relations as RDF
classes and attributes as RDF properties. For transformation from XML to RDF, we convert
complex-type elements into RDF classes and simple-type elements (with no subelements but
character contents) and attributes into RDF properties. The target RDF schema shall also
[Figure content: local ontology O1 in Peer p1: classes Department, Faculty, and Publication linked by rdfx:contained, with properties name, pub, id, title, and type ranging over Literal. Local ontology O2 in Peer p2: classes Author, Proceedings, and Author_proc, with properties aid, name, affiliation, pid, title, year, aid_1, and pid_1. Local ontology O3 in Peer p3: class Faculty with properties firstname, lastname, book, and conf.]

Figure 23. Local RDFS ontologies.
include the XML or relational referential constraints, which must be preserved for correct query translation between different data sources. We represent these constraints by two RDF properties (corresponding to the two attributes involved in a referential constraint) sharing the same value. Figure 23 shows the results of schema translation for the three sources in the example of Figure 22. Notice that the nesting relationship between two XML elements is preserved by a particular new RDF property, rdfx:contained, where rdfx is the new namespace (39).
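The relational-to-RDFS rule above (relations become classes, attributes become properties) can be sketched as follows; the function name and triple encoding are illustrative, not part of the thesis.

```python
def relational_to_rdfs(schema):
    """Translate a relational schema (relation name -> attribute list)
    into RDFS triples: each relation becomes an RDF class, each
    attribute an RDF property whose domain is the relation's class and
    whose range is Literal."""
    triples = []
    for relation, attributes in schema.items():
        triples.append((relation, "rdf:type", "rdfs:Class"))
        for attribute in attributes:
            triples.append((attribute, "rdf:type", "rdf:Property"))
            triples.append((attribute, "rdfs:domain", relation))
            triples.append((attribute, "rdfs:range", "rdfs:Literal"))
    return triples

# The author relation of Peer p2 yields one class and three properties.
relational_to_rdfs({"Author": ["aid", "name", "affiliation"]})
```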
4.3.4 P2P Mapping and Query Answering
In our framework, the P2P inter-schema mappings result from a process of matching the two
participating source schemas (99). In our previous work (113), we proposed a thesaurus-based
RDF schema matching algorithm by utilizing WordNet.1 In our approach, an inter-schema
mapping specifies correspondences between RDF classes or properties from two different source
schemas. The different types of mappings (e.g., equivalent, broader, or narrower) are determined
according to the comparison of the semantics of the mapped classes or properties. The mapping
information is stored in terms of instances of an RDF meta-ontology RDFMS (RDF Mapping
Schema), using in addition a mapping language, PML (P2P Mapping Language).
The process of P2P query answering includes three aspects: query execution, query rewrit-
ing, and answer integration. The user poses a query on a peer, which is first executed on that
peer. Meanwhile, the query is also forwarded to each of the linked peers, where the query is
rewritten into a new query that is executed locally and propagated further. Finally, answers
from every peer are returned to the host peer, where they are integrated to produce the answer.
For the purpose of query answering, we use a first-order relation based method to interpret
the inter-schema mappings. Actually, in our approach, both the mappings and heterogeneous
queries are interpreted by a set of first-order relations, so as to provide a unified environment
for query rewriting.
4.4 P2P Mappings
In this section, we discuss the representation of P2P semantic mappings using an RDF-based
meta-ontology, namely RDFMS, which, even if incomplete, is expressive enough to specify most
1http://www.cogsci.princeton.edu/˜wn/
[Figure content: class Map with subclasses EquivalentMap, BroaderMap, NarrowerMap, UnionMap, and IntersectionMap; Map has the properties leftElement, rightElement, and constrainedBy (with range Literal). Namespace: http://example.org/rdfms; prefix: rdfms.]

Figure 24. The meta-ontology of RDFMS.
commonly used mapping types. We also describe a mapping language PML, which uses an
FOL semantics and serves as an interface for the user to define and manipulate the mappings.
4.4.1 RDFMS Meta-Ontology
As shown in Figure 24, RDFMS provides one-to-one mappings such as equivalent (repre-
sented by EquivalentMap), broader (BroaderMap), and narrower (NarrowerMap). Regarding
the case of one-to-many mappings, RDFMS defines UnionMap and IntersectionMap, respectively, for the two types of logical combination (i.e., or and and) of the elements on the multiple-
element side. All these types of mappings are defined as classes inheriting from a common
class Map, which has three general properties that are also inherited by its subclasses. The
leftElement and rightElement properties are used to connect the mapped elements.
In order to represent the mapping expression (99) that a P2P mapping may carry, the
property constrainedBy is defined, whose data type is specified as Literal. An example of
the use of this property is &c1 (see Figure 25), which is used to confine the retrieval of the
instances from Peer p1 since Faculty is mapped to Author using NarrowerMap. Following the
example in Figure 23, we obtain the P2P mappings among the three local RDF schemas, as
shown in Figure 25. Note that every P2P inter-schema mapping is an instance of the RDFMS
meta-ontology.
[Figure content: the three local ontologies of Figure 23 linked by the mappings NM-1 (constrained by &c1), EM-1, BM-2 (constrained by &c2), EM-2, EM-3, and IM-1 (constrained by &c3), where &c1 = "Author.affiliation = 'UC'", &c2 = "Publication.type = 'conference'", and &c3 = "Author.affiliation = 'UIC'". Abbreviations: NM: NarrowerMap, BM: BroaderMap, EM: EquivalentMap, IM: IntersectionMap.]

Figure 25. An example of P2P mappings represented in RDFMS.
4.4.2 P2P Mapping Language – PML
We define a set of mapping atoms for defining different types of P2P semantic mappings,
according to the structure of the RDFMS meta-ontology. Listed below are mapping atoms and
their corresponding RDFMS representation.
• EM(c1, c2): there exists an instance m of EquivalentMap, such that c1 = m.leftElement
and c2 = m.rightElement.
• BM(c1, c2): there exists an instance m of BroaderMap, such that c1 = m.leftElement
and c2 = m.rightElement.
• NM(c1, c2): there exists an instance m of NarrowerMap, such that c1 = m.leftElement
and c2 = m.rightElement.
• UM(c1, c2): there exists an instance m of UnionMap, such that c1 = m.leftElement
and c2 = {x|x = m.rightElement}, or c1 = {x|x = m.rightElement} and c2 =
m.leftElement.
• IM(c1, c2): there exists an instance m of IntersectionMap, such that c1 = m.leftElement
and c2 = {x|x = m.rightElement}, or c1 = {x|x = m.rightElement} and c2 =
m.leftElement.
• CON(m, e): given an instance m of Map or its subclasses, we have e = m.constrainedBy.
We note that c1 and c2 in EM, BM, and NM correspond to RDFS classes or properties, whereas c1 and c2 in UM and IM can correspond to a set of classes or properties, to which the logical connectives or and and are applied, respectively.
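One possible (hypothetical) encoding of these mapping atoms as instances of the RDFMS classes, with fields mirroring leftElement, rightElement, and constrainedBy; the orientation in the example follows the Delta-NM semantics given below (every instance of the right element is an instance of the left one), which is our reading, not a definition from the thesis.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union


@dataclass
class Map:
    left: Union[str, FrozenSet[str]]        # rdfms:leftElement
    right: Union[str, FrozenSet[str]]       # rdfms:rightElement
    constrained_by: Optional[str] = None    # rdfms:constrainedBy (Literal)


class EquivalentMap(Map): pass
class BroaderMap(Map): pass
class NarrowerMap(Map): pass
class UnionMap(Map): pass          # the set-valued side is combined with or
class IntersectionMap(Map): pass   # the set-valued side is combined with and


# NM-1 of Figure 25: Faculty (p1) mapped to Author (p2), with the
# constraint &c1 confining retrieval to UC-affiliated authors.
nm1 = NarrowerMap("Author", "Faculty", "Author.affiliation = 'UC'")
```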
Assuming a finite set of class names C and a finite set of property names P, we define a FOL (first-order logic) semantics for the mapping atoms in terms of the following two predicates:
• C_EXT(c, x): the resource x is in the proper extent (i.e., a direct instance) of class c.
• P_EXT(x, p, y): the pair (x, y) is in the proper extent (i.e., a direct instance) of property p.
In our definition, a P2P mapping is allowed to connect not only two classes or two properties but also a class and a property. The interpretation ∆ of each P2P mapping atom varies according to the type of the mapped objects, as given below.
• ∆EM(c1, c2) implies:
∀x C_EXT(c1, x) ↔ C_EXT(c2, x), if c1, c2 ∈ C;
∀x1∀x2∀y P_EXT(x1, c1, y) ↔ P_EXT(x2, c2, y), if c1, c2 ∈ P;
∀x∀y C_EXT(c1, y) ↔ P_EXT(x, c2, y), if c1 ∈ C, c2 ∈ P.
• ∆BM(c1, c2) implies:
∀x C_EXT(c1, x) → C_EXT(c2, x), if c1, c2 ∈ C;
∀x1∀x2∀y P_EXT(x1, c1, y) → P_EXT(x2, c2, y), if c1, c2 ∈ P;
∀x∀y C_EXT(c1, y) → P_EXT(x, c2, y), if c1 ∈ C, c2 ∈ P.
• ∆NM(c1, c2) implies:
∀x C_EXT(c1, x) ← C_EXT(c2, x), if c1, c2 ∈ C;
∀x1∀x2∀y P_EXT(x1, c1, y) ← P_EXT(x2, c2, y), if c1, c2 ∈ P;
∀x∀y C_EXT(c1, y) ← P_EXT(x, c2, y), if c1 ∈ C, c2 ∈ P.
• ∆UM(c1, c2) implies:
∨i(∆EM(c1, ai)), where ai∈c2, if c1 ∈ C ∪ P, c2 ⊆ C ∪ P;
∨i(∆EM(ai, c2)), where ai∈c1, if c1 ⊆ C ∪ P, c2 ∈ C ∪ P.
• ∆IM(c1, c2) implies:
∧i(∆EM(c1, ai)), where ai∈c2, if c1 ∈ C ∪ P, c2 ⊆ C ∪ P;
∧i(∆EM(ai, c2)), where ai∈c1, if c1 ⊆ C ∪ P, c2 ∈ C ∪ P.
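For class-to-class mappings, these interpretations reduce to simple set relations between the two class extents; the following is a minimal sketch in our own encoding, not an implementation from the thesis.

```python
def class_mapping_holds(kind, ext1, ext2):
    """Check a class-to-class mapping atom against the extents of the
    two classes: EM requires equal extents, BM containment of ext1 in
    ext2, and NM the reverse containment."""
    if kind == "EM":   # forall x: C_EXT(c1, x) <-> C_EXT(c2, x)
        return set(ext1) == set(ext2)
    if kind == "BM":   # forall x: C_EXT(c1, x) -> C_EXT(c2, x)
        return set(ext1) <= set(ext2)
    if kind == "NM":   # forall x: C_EXT(c1, x) <- C_EXT(c2, x)
        return set(ext1) >= set(ext2)
    raise ValueError(f"unknown mapping kind: {kind}")

# A Faculty extent contained in an Author extent satisfies BM but not EM.
class_mapping_holds("BM", {"f1"}, {"f1", "a1"})
```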
The following is the interpretation ∆M1,2 for the mappings M1,2 between p1 and p2 in Figure 25:
∀x1∀x2∀y P_EXT(x1, name, y) ↔ P_EXT(x2, name, y),
∀x1∀x2∀y′ P_EXT(x1, title, y′) ↔ P_EXT(x2, title, y′),
∀x C_EXT(Faculty, x) → C_EXT(Author, x),
∀x C_EXT(Publication, x) ← C_EXT(Proceedings, x)
The FOL interpretation for ontology mappings enables standard reasoning on mappings as
well as the definition of more complex P2P mappings. For example, we can define a sibling
mapping SM such that SM(c1, c2) ⇔ NM(c1, c3) ∧ NM(c2, c3). Another example is the defini-
tion of a many-to-many mapping by composing two UnionMaps. Furthermore, an example for
reasoning on mappings can be such as ∆BM(c1, c2) ∧ ∆NM(c1, c2) ⇔ ∆EM(c1, c2). However, reasoning on mappings is not the focus of this chapter. Instead, we concentrate on how to use mappings for the purpose of query processing, specifically for query rewriting.
4.5 P2P Query Processing
4.5.1 Query Languages
Since the metadata of every source schema is expressed as a local ontology in RDFS, we can interpret a local query over the source schema as a conjunctive query, namely a conjunctive RQL query (c-RQL) (32), over the local ontology. A c-RQL query Q is of the form
ans(x) :– R1(x1), ..., Rn(xn).
where Ri is either C_EXT or P_EXT for i ∈ [1..n], and x ⊆ x1 ∪ ... ∪ xn. As usual, the ans part is called the head of the query, denoted head_Q, and the rest is called the body of the query, denoted body_Q. In this chapter, we consider only the class of local queries that can be expressed in c-RQL. The following gives two examples of translating local XPath (47) and relational queries into c-RQL queries, omitting the detailed procedure due to space limitations.
Consider an XPath query /department/faculty[@name="M. Case"] posed over p.xml in p1. The result of this query is the XML document tree (referred to as the answer tree) rooted at the first faculty element (see Figure 22). By considering the answer structure and the semantics of the query (for correct query rewriting), we can interpret the XPath query as follows. Note that all the elements and/or attributes involved in the answer tree and in the predicates (of an XPath query) are covered in the resulting c-RQL query.
ans(x, y, z) :– P_EXT(x, name, y), P_EXT(x, pub, z),
               y = "M. Case".
As another example, consider a relational conjunctive query posed on Peer p2 to “find all the publications written by authors from UIC”, as shown below.
ans(y) :– proceedings(x, y, z), author_proc(u, x),
          author(u, v, w), w = "UIC".
The following is the first-order relation based interpretation of the preceding relational conjunctive query.
ans(y) :– P_EXT(x1, pid, x), P_EXT(x1, title, y),
          P_EXT(x2, aid_1, u), P_EXT(x2, pid_1, x),
          P_EXT(x3, aid, u), P_EXT(x3, affiliation, w),
          w = "UIC"
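To make the first-order interpretation concrete, the following hedged sketch evaluates a conjunctive body of P_EXT atoms over a toy set of triples by a naive join. The data mirrors the UIC example above, but the function and encoding are our own.

```python
def eval_conjunctive(atoms, triples, head_vars):
    """Evaluate a conjunctive body of P_EXT atoms over triples by
    naively extending variable bindings atom by atom. Terms starting
    with '?' are variables; anything else is a constant."""
    bindings = [{}]
    for (x, p, y) in atoms:
        extended = []
        for b in bindings:
            for (s, prop, o) in triples:
                if prop != p:
                    continue
                b2, ok = dict(b), True
                for term, value in ((x, s), (y, o)):
                    if term.startswith("?"):
                        if b2.get(term, value) != value:
                            ok = False       # conflicting binding
                        else:
                            b2[term] = value
                    elif term != value:
                        ok = False           # constant mismatch
                if ok:
                    extended.append(b2)
        bindings = extended
    return {tuple(b[v] for v in head_vars) for b in bindings}

# Toy instance mirroring Peer p2: author 003 (UIC) wrote proceedings 001 ("t1").
triples = [("x1", "pid", "001"), ("x1", "title", "t1"),
           ("x2", "aid_1", "003"), ("x2", "pid_1", "001"),
           ("x3", "aid", "003"), ("x3", "affiliation", "UIC")]
atoms = [("?x1", "pid", "?x"), ("?x1", "title", "?y"),
         ("?x2", "aid_1", "?u"), ("?x2", "pid_1", "?x"),
         ("?x3", "aid", "?u"), ("?x3", "affiliation", "UIC")]
eval_conjunctive(atoms, triples, ["?y"])  # -> {("t1",)}
```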
4.5.2 Query Rewriting
The P2P query answering in our framework is a process of propagating a local query (ini-
tiated from a host peer p1) to every connected peer along the links. As previously mentioned,
this process includes three aspects: query execution, query rewriting, and answer integration.
Query rewriting can be seen as a function Q2 = f(Q1,M), where Q1 is the local query, M is the
set of P2P mappings, and Q2 is the resulting remote query. Based on the uniform first order
logic interpretation for both P2P mappings and user queries, the computation of f is realized
by the algorithm P2PRewriting as sketched below.
Algorithm P2PRewriting(Q, M)
Input: a conjunctive query Q over ontology O1; the mappings M between O1 and O2.
Output: a conjunctive query Q′ over O2.
  head_Q′ := head_Q; body_Q′ := null;
  Let ∆Q be the corresponding c-RQL of Q;
  Expand ∆Q into Q∗ using the constraints over O1;
  Let φ be body_Q∗;
  For each R(x) of φ
    For each ψ ∈ M
      Let R′(x′) be the result of applying ψ on R(x);
      Add R′(x′) into body_Q′ using a conjunction;
  Let G be the query graph of φ and G′ that of body_Q′;
  For each connected subgraph H ⊆ G
    Find the corresponding subgraph H′ of H in G′;
    If H′ is not connected then
      Expand H′ using the constraints on O2 into a connected graph H′′;
      If H′′ exists then add into body_Q′ all R′_i that contribute to the expansion of H′;
      Else output null;
  Output Q′;
Figure 26. The P2PRewriting algorithm.
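The per-atom rewriting loop of P2PRewriting can be sketched as follows. This is a deliberately simplified version: mappings are reduced to a one-to-one renaming dictionary, direction and constraints are ignored, and unmapped atoms are simply dropped (reflecting the partial-mapping assumption); all names are illustrative.

```python
def p2p_rewrite(body, mappings):
    """Apply one-to-one mappings atom by atom. Atoms are encoded as
    ("C_EXT", class, var) or ("P_EXT", subject, property, object);
    atoms whose class or property has no mapping are dropped."""
    rewritten = []
    for atom in body:
        if atom[0] == "C_EXT":
            _, cls, var = atom
            if cls in mappings:
                rewritten.append(("C_EXT", mappings[cls], var))
        else:
            _, s, prop, o = atom
            if prop in mappings:
                rewritten.append(("P_EXT", s, mappings[prop], o))
    return rewritten

# Q* of the running example; id, type, and pub have no mapping into O2.
M12 = {"title": "title", "name": "name",
       "Publication": "Proceedings", "Faculty": "Author"}
q_star = [("P_EXT", "p", "id", "x"), ("P_EXT", "p", "title", "y"),
          ("P_EXT", "p", "type", "z"), ("C_EXT", "Publication", "p"),
          ("P_EXT", "q", "pub", "x"), ("P_EXT", "q", "name", "H. Luis"),
          ("C_EXT", "Faculty", "q")]
p2p_rewrite(q_star, M12)  # keeps the four mapped atoms, renamed into O2
```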
The rest of our discussion elaborates on this algorithm by giving a concrete example. Sup-
pose that the user poses a query Q over Peer p1 (in a P2P network as shown in Figure 23):
“listing all papers written by H. Luis”, which is formulated as follows:
//publication[//faculty[contains(@pub, @id) and @name="H. Luis"]]
The first step of rewriting Q is the interpretation of Q as ∆Q. As previously mentioned,
the interpretation of an XPath query has to consider its answer structure. In this example,
the answer to Q covers the XML node publication and its children id, title, and type,
according to the schema structure in p1 (see Figure 22). Based on the local RDFS ontology of Peer p1, ∆Q is computed as follows:
ans(x, y, z) :– P_EXT(p, id, x), P_EXT(p, title, y),
               P_EXT(p, type, z), P_EXT(q, pub, x),
               P_EXT(q, name, "H. Luis")
The expansion of ∆Q uses the classic chase algorithm, which “chases” a tableau query with dependencies on a relational database (2). The following shows the result Q∗ of expanding ∆Q using the constraints on the ontology O1, and its rewriting Q′ resulting from the application of the mapping constraints M1,2. We note that the application of a mapping ϕ → ψ to a query predicate ϕ follows standard logical implication: from ϕ and ϕ → ψ, derive ψ.
Q∗ : ans(x, y, z) :– P_EXT(p, id, x), P_EXT(p, title, y),
                    P_EXT(p, type, z), C_EXT(Publication, p),
                    P_EXT(q, pub, x), P_EXT(q, name, "H. Luis"),
                    C_EXT(Faculty, q)
Q′ : ans(y) :– P_EXT(p, title, y), C_EXT(Proceedings, p),
              P_EXT(q, name, "H. Luis")
The query graph of a query is constructed by adding a node for each atom in the query and adding an edge between two nodes if their corresponding atoms contain the same variable. In the last step, the algorithm finds that the query graph of Q∗ is connected, whereas that of Q′ is not. Hence, Q′ has to be expanded (also using the chase algorithm) according to the constraints on O2, resulting in the following final rewriting Q′ of Q:
ans(y) :– P_EXT(p, title, y), C_EXT(Proceedings, p),
          P_EXT(q, name, "H. Luis"),
          P_EXT(p, pid, y2), P_EXT(q, aid, y1),
          P_EXT(x, aid_1, y1), P_EXT(x, pid_1, y2)
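The query-graph connectivity test (one node per atom, an edge whenever two atoms share a variable) can be sketched as follows; the atom encoding and variable naming are our own, not the thesis's.

```python
def query_graph_connected(atoms):
    """Build the query graph of a conjunctive body and test whether it
    is connected. Terms starting with '?' are variables."""
    def variables(atom):
        return {t for t in atom if t.startswith("?")}
    n = len(atoms)
    adjacent = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if variables(atoms[i]) & variables(atoms[j]):
                adjacent[i].add(j)
                adjacent[j].add(i)
    # Depth-first search from atom 0; connected iff every atom is reached.
    seen, stack = set(), [0] if n else []
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adjacent[v] - seen)
    return len(seen) == n

# Q' before expansion: the name atom shares no variable with the rest.
q_before = [("?p", "title", "?y"), ("Proceedings", "?p"),
            ("?q", "name", "H. Luis")]
# After chasing with the constraints on O2, the join atoms connect it.
q_after = q_before + [("?p", "pid", "?y2"), ("?q", "aid", "?y1"),
                      ("?x", "aid_1", "?y1"), ("?x", "pid_1", "?y2")]
```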
It is straightforward to obtain the corresponding relational conjunctive query from Q′, which is then executed over the RDB in Peer p2 to retrieve a local answer from p2. Similarly, we can rewrite Q′ into a query Q′′ over O3 and get a local answer from p3. The (global) answer to Q, after the integration of all local answers, is as follows, where the null values are caused by the fact that the P2P mappings are partial (i.e., not all atoms referred to by the query are mapped):
<publication id="" title="t1" type=""/>
<publication id="" title="t3" type=""/>
<publication id="" title="t4" type=""/>
We did not describe all the details because of space limitations.
In order to retrieve correct data from the P2P network, it is required that the remote query
Q2 rewritten from a local query Q1 be equivalent to Q1. The query rewriting satisfying this
condition is called equivalent query rewriting (58), which is defined for homogeneous relational
data integration. Query equivalence in terms of answer equivalence has also been defined (58).
Such equivalence, however, will have a different (less strict) meaning in the context of a heterogeneous P2P network. Informally, we say that two answers (to two different queries on different
data sources) are equivalent if they are structurally and semantically equivalent. However, such
equivalence does not entail identical answers. Although not proved in this chapter, our P2P
query rewriting guarantees semantic equivalence, which is based on the concept of reversibility
(39). To achieve semantic equivalence the following is needed: the correctness of source schema
representation in RDFS, a valid P2P ontology mapping, and the preservation of the answer
structures.
4.6 Summary
In this chapter, we describe an ontology-based approach to the data interoperability problem
in a heterogeneous P2P network. RDF techniques are used in our framework, through the use
of the RDFS local ontologies for metadata representation and the use of the RDFMS meta-
ontology for inter-schema mapping representation. Our contributions include a definition of the
syntax of PML (based on the RDFMS meta-ontology), a definition for its semantics in terms of
first-order relations, and a query answering algorithm that considers constraints in local data
sources.
For future work, we will further study the following aspects: (1) Due to the locality of P2P
systems, mappings between different pairs of peers may be designated by different people. This
can result in inconsistency between different inter-schema mappings. In addition, given two
large source schemas to be mapped, the user may want some inferencing to be performed
to derive new mappings from existing mappings automatically. In fact, the problem of map-
ping consistency and that of mapping inference are essentially the same in the case where the
inferencing involves multiple sets of inter-schema mappings (6). (2) In a P2P network, peers
are designed as autonomous nodes, and any peer can accept user queries. In such settings, an
established inter-schema mapping, say from Peer p1 to Peer p2, may be used both for query
rewriting from p1 to p2 and for that from p2 to p1. Given that the inter-schema mappings
are directional and a uniform query rewriting algorithm is deployed in the P2P system, the
utilization of a single inter-schema mapping for query rewriting in different directions has to
be treated differently. This arises because of the problem of bidirectionality of P2P mappings
(59).
CHAPTER 5
DATA INTEROPERABILITY IN THE SEMANTIC DESKTOP
5.1 Introduction
In 1945, Vannevar Bush put forward the first vision of personal information management
(PIM) system, Memex, by pointing out that the human mind “operates by associations”, and
we should “learn from it” in building Memex (23). The Hypertext systems (see the survey of
Conklin (33)), which flourished in the 1980s, reinforced this vision and, in a broader scope, led
to the current World Wide Web. Recently, with the Semantic Web vision (13), a number of PIM
systems associated with that vision, hence called Semantic Desktop, have been proposed. By
summarizing these proposals and taking into account the characteristics of personal information
(PI), we propose the following principles that a PIM system should follow:
Semantic data organization. Almost all existing approaches are trying to go beyond the
hierarchical directory model. The critical factors of semantic data organization include ade-
quate annotations, explicit semantics, meaningful associations, and a uniform representation. A
semantic-rich data organization has several advantages. First, the annotations and associations
(as the superimposed information over the coarse data (75)) form the context of the PI, thus
making the data more easily understandable. Second, the superimposed information also allows
for a finer and more flexible manipulation (e.g., browsing and querying) of the data. Third,
an explicit formal semantics for the data can facilitate reasoning on the data and deriving new
knowledge. Finally, the uniform representation can support the integration of data that may
be heterogeneous.

PI Space (C:\)
  papers:       WISE03-1.pdf, WISE03-submission.pdf, WISE03-camera.pdf, JoDS05.pdf
  photos\WISE:  myself.jpeg, talk.jpeg, with sam.jpeg
  talks:        WISE03.ppt, IDEAS04.ppt, AP2PC04.ppt, Super-invited.ppt
  emails:       Final submission of WISE.eml, Meeting on Monday.eml,
                WISE photos.eml, Register for WISE.eml

Figure 27. An example of files in a PI space.
Flexible data manipulation. A PIM system can provide integration, exchange, navigation,
and query processing of the stored personal information. The framework of PIM, including the
data model, query language, and user interface, should provide multiple ways to manipulate data
in a powerful and flexible manner. Furthermore, a PIM system should possess the capability for
seamless communication (or interoperability) with external sources (possibly in another PIM
system), e.g., in a peer-to-peer (P2P) way (103).
Rich visualization. Multiple visualizations can help the user in understanding data. Instead
of providing separate views of the data as most traditional applications do, a PIM system should
support data visualization from different perspectives, to offer a comprehensive view. Examples
include association-centric visualization (98) and time-centric visualization (51; 50).
Example 5.1 Figure 27 presents a fragment of PI space, which consists of four directories
of files in the hard drive C:\. The papers directory contains four papers in PDF format,
photos\WISE contains three pictures taken at the WISE ’03 conference, talks contains four
Powerpoint files that are respectively the slides of four talks, and emails contains four saved
email messages. Even if the concrete contents of all these files are unknown, we can tell from
their names (or the names of their respective directories) that several of them appear to be related
to one another. Unfortunately, their storage in different and possibly unrelated directories does
not show such inter-relationships, thus resulting in possible difficulties in locating the wanted
information. Some keyword-based searching techniques, e.g., offered by the Google Desktop
Search,1 can retrieve all files that are relevant to WISE. However, without further inspection
of the contents of each file, the user may not be able to discover certain associations between
them, e.g., that file JoDS05.pdf is an extended journal paper of WISE03-camera.pdf.
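The missing links in this example could be captured as superimposed triples over the files. The following stdlib sketch is our own illustration; the association names (extendedVersionOf, presentedAt) are hypothetical, not from the thesis:

```python
# Superimposed associations over the files of Figure 27, stored as
# (subject, predicate, object) triples. Association names are hypothetical.
triples = {
    ("papers/JoDS05.pdf", "extendedVersionOf", "papers/WISE03-camera.pdf"),
    ("talks/WISE03.ppt", "presentedAt", "WISE '03"),
    ("papers/WISE03-camera.pdf", "presentedAt", "WISE '03"),
}

def related_to(resource):
    """Everything linked to `resource` by an association, in either direction."""
    return ({s for s, _, o in triples if o == resource} |
            {o for s, _, o in triples if s == resource})

# Keyword search finds the files whose names mention WISE; the triples
# additionally surface that JoDS05.pdf extends the camera-ready paper.
hits = related_to("papers/WISE03-camera.pdf")
```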
From this example, we can see that the lack of semantic associations among the stored data
could be a handicap for data and knowledge discovery. In this chapter, we focus on issues of
semantic data organization and management in PIM, by taking the following approach:
1) We propose a layered framework for PIM, in which multiple ontologies playing a variety
of roles are employed. Specifically, the resource layer stores all the PI resources (using URIs),
metadata of the PI, and all kinds of associations using RDF. The domain layer contains the
ontologies specific to various domains that are used to structure the data and categorize the
resources. The application layer, built on top of the domain layer, is where the user constructs
different application ontologies for different purposes of data usage. This layered architecture
1http://desktop.google.com
enables: i) a semantics-rich environment for personal information management; ii) a flexible and
reusable system, by decoupling the domain and application ontologies, so that the construction
of application ontologies for different applications can reuse the underlying domain ontologies.
We argue that this provides certain advantages over the use of a single domain model for all
the PI (e.g., (44)).
2) We discuss in detail how to utilize superimposed information for semantic organization,
focusing on the construction of resource-file and resource-resource associations. We also present
the idea of 3D navigation, which is a combination of the vertical, horizontal and temporal
navigation in the PI space. The idea is inspired by some existing PIM systems including
MyLifeBits (51) and Placeless Documents (45), and is demonstrated in a browser.
3) We describe in detail the architecture of our semantic desktop system, named MOSE
(Multiple Ontology based Semantic DEsktop) and the challenges that we are addressing in the
course of its implementation.
4) In our framework, the basic unit for the user to manage the Semantic Desktop is the per-
sonal information application (PIA). Each PIA aims to accomplish or assist a specific task (e.g.,
bibliography management, paper composition, and trip planning). The PIAs can be standalone,
with their own application ontology, user interface, and workflows. Meanwhile, they can com-
municate with each other as if in a P2P network, by means of the connections (mappings)
established between their application ontologies. In this sense, different PIAs interoperate at a
semantic level. We discuss the development of personal information applications (PIAs), which leads
to inter-desktop information sharing and data integration by means of PIA-based desktop
services. We also describe query processing in our framework in two cases: within a single PIA
or between two PIAs, in a P2P query processing mode.
The rest of the chapter is structured as follows. In Section 5.3, we describe the layered
framework and its main components. The semantic organization of the PI (including the
concepts of annotation, association, and representation) is discussed in Section 5.5. Section 5.6
and Section 5.9 focus on two main ways of data manipulation, namely, navigation and query
processing. Finally, we conclude in Section 5.10.
5.2 Related Work
The term semantic desktop was first coined by Decker and Frank, who also stated the
need for a “networked semantic desktop” that is enabled by several key emerging technologies
including: the Semantic Web, P2P computing, and online social networking (41). The state-
of-the-art of semantic desktop has been comprehensively summarized by Sauermann (104).
Among the existing approaches to PIM in desktops, the Gnowsis project aims at a semantic
desktop environment that supports P2P data management based on desktop services (103).
Similarly to MOSE, Gnowsis uses ontologies for expressing semantic associations and RDF
for data modeling. SEMEX is another personal data integration framework that uses a fine-
grained annotation based on schemas, similar to our ontology-based framework (44). However,
a single domain model is provided as the unified interface for all data access. MyLifeBits (51),
Haystack (98), and Placeless Documents (45) are three PIM systems that support annotations
and collections. The concept of collection is essentially the same as the conceptualization (using
ontologies) of resources in our framework.
Existing interfaces provide a workspace for the end user to develop applications. Such ap-
plications have their own data model, data presentation, and control logic. Of such interfaces,
Haystack’s end user interface is the closest to the PIA designer presented in this chapter (98).
Both use channels as units of content; however, the PIA designer supports parameterized chan-
nels in its MVC-based application development environment, which enable the specification of
the business logic (i.e., the controller) of an application. Furthermore, the PIA designer provides
a way to compose distributed desktop services that are defined and implemented based on PIAs.
Other interfaces for personal data management are based on Wikis and include SemperWiki
(91) and WikSAR (8). However, they resemble a hypertext composer (or content manager)
providing the user with a means to put pieces of information together as a Wiki page.
5.3 The Layered Multi-Ontology Framework
Our framework follows the principle of superimposed information, i.e., data or metadata
“placed over” existing information sources (75). This concept seems particularly useful for
the organization, access, interconnection, and reuse of the information elements. We propose
for PIM a layered ontology-based framework, as shown in Figure 28, with the following data
components:
Personal information space. The personal information space may contain structured data
(e.g., relational), semi-structured data (e.g., XML), or unstructured data. Unstructured data
can be textual or non-textual (as in video, audio, or picture files). Furthermore, textual files
can be classified as simple-content or complex-content. More specifically, simple-content files
have no references to other files. Typical examples include people contacts and Bibtex entries.
[Figure 28 depicts the layered framework: the PI space at the bottom, containing textual files
(simple-content, such as contacts and Bibtex entries, or complex-content, such as papers,
reports, emails, and slides) and non-textual files (video, audio, pictures); the resource layer,
with the resource-file index, file metadata, and the RDF resource repository; the domain layer,
with domain ontologies 1 through m; and the application layer, with application ontologies 1
through n, connected by associations to the application layers of other PIM systems.]

Figure 28. An ontology-based framework of a PIM system.
In contrast, complex-content files have a flexible scheme of presentation, and may contain
references to other files, e.g., by means of citations or hypertext links (33). For example, a
paper in the PI space may cite another paper (existing in the PI space or an external space),
which, in turn, could cite other papers.
File description. We annotate each file using a file description (or metadata) consisting of
a set of properties of the file. Each item in the file description is a property-value pair. The
file description is the first-level (direct) annotation for the individual files, and has the same
scheme (structure) for the same type of files. For example, the following fragment contains a
typical description of a JPEG file.
Dimensions: 3072 × 2048 pixels
Device make: Canon
Color space: RGB
Focal Length: 75
......
Domain ontologies. A number of ontologies are published on the Web. Examples of such
ontology libraries include DAML Ontology Library,1 the Semantic Web Ontologies,2 and the
Protege OWL ontologies.3 The ontologies in these libraries are typically designed and organized
for different domains such as Conference, Person, Photo, and Email. In our framework, the
domain ontology layer is designed to be loosely-coupled with the other layers, to enable the
insertion and removal of ontologies as “plug-ins”.
Resource-file index and RDF repository. One of the roles of domain ontologies is to pro-
vide the basis for data classification. In order to establish the connections between the files and
the concepts in the domain ontologies, we treat each file as a resource, which is then classified
as an instance of one or more concepts. The resource-file index is a local database storing these
1http://www.daml.org/ontologies/
2http://www.schemaweb.info
3http://protege.stanford.edu/plugins/owl/owl-library/
connections between resources and files. Furthermore, the various types of associations among
resources (as instances of association of concepts in the domain ontologies) are stored in an
RDF repository. The resource-file index and the RDF repository are both in the resource layer,
providing resource instances for the domain ontologies in the domain layer above.
Application ontology. Above the domain layer is the application layer, which contains the
ontologies for different applications. The domain ontologies, as an intermediate layer between
the applications and the data, are meant to enhance the reusability and flexibility of the frame-
work. More specifically, the application ontologies are defined as views of the domain ontologies,
which can be reused for the construction of different application ontologies. In our framework,
each personal information application (PIA) is associated with an application ontology, has
access to relevant data, and is functionally independent of other applications. It may be infea-
sible to have a single ontology to cover various applications, e.g., for trip planning and paper
writing. Instead, as many PIAs as needed can be designed in one or more PIM systems, where
the PIAs can interoperate (e.g., through P2P query processing) for the purpose of integrating
relevant information. This issue is elaborated on in Section 5.9.
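The view relationship between the two ontology layers can be sketched as follows. This is our own illustration under assumed concept and property names, not the thesis's data model: an application ontology selects a subset of the properties exposed by the domain ontologies, so different PIAs reuse the same domain layer.

```python
# Sketch: an application ontology defined as a view over domain ontologies.
# Concept and property names are illustrative.
domain_ontologies = {
    "Publication": {"title", "author", "booktitle"},
    "Photo": {"title", "takenOn", "width", "height"},
    "Email": {"title", "sentBy", "sentOn", "attached"},
}

def define_view(selections):
    """selections: concept -> requested properties; the view keeps only
    properties that actually exist in the domain ontologies."""
    return {c: props & domain_ontologies[c]
            for c, props in selections.items()}

# A trip-planning PIA and a paper-writing PIA reuse the same domain layer:
trip_app = define_view({"Photo": {"title", "takenOn"},
                        "Email": {"sentBy", "sentOn"}})
paper_app = define_view({"Publication": {"title", "author", "booktitle"}})
```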
Besides the data components described above, a PIM system also needs some functional
components to perform all kinds of data and metadata processing, to make the framework
work as a whole. Such components include an indexer (for establishing and managing the
indexes of the files), a wrapper (for identifying and extracting resources from the files), and an
ontology designer (for importing and editing an ontology). Because of space limitations, we do
not elaborate further on these components.
5.4 System Architecture
[Figure 29 depicts the MOSE architecture: the data and metadata repositories (file descriptions,
the R-F index, and the ontology and resource repository) on top of the file system; the Semantic
Desktop server, comprising a wrapper library (for PDF, PPT, DOC, etc.), an annotator, an
indexer, a classifier, a query processor, an ontology designer, an ontology matcher, and the
Jena API, with application APIs on top; and the user interfaces (PIA browser, PIA designer,
and resource browser), with data and control flows connecting the components.]

Figure 29. The architecture of MOSE.
Figure 29 presents the architecture of MOSE (Multiple Ontology based Semantic DEsktop).
The following describes the primary components of the framework.
Our framework goes beyond the hierarchical directory based organization by means of two
types of ontologies: domain ontologies and application ontologies. The former represent the
conceptualization of different domains, thus providing a foundation for personal data
classification. The latter are designed to serve as the data model underlying personal information
applications (PIAs), which are developed by the end user. More details of how these ontologies
cooperate to enable a semantically powerful data manipulation in the semantic desktop are
given in Section 5.7.
File wrappers. The semantic organization is mainly based on a series of analysis and processing
steps applied to text documents in the personal information space. That is, we do not consider the
non-textual features of a file, although such features may facilitate data annotation (18). A
file wrapper is used to retrieve text from various types of files, such as PDF, PPT, and DOC.
The other functionality of file wrappers is to obtain from the file system the system-defined
properties of a file, e.g., its MIME type, size, and date.
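A wrapper of this kind can be sketched as follows. This is a minimal stand-in of our own: real wrappers for PDF, PPT, and DOC would plug in format-specific extractors, while here only plain text is handled, plus the system-defined properties mentioned above.

```python
# Sketch of a file wrapper: returns extracted text together with
# system-defined properties (MIME type, size, modification date).
import os
import mimetypes
import tempfile

def wrap(path):
    mime, _ = mimetypes.guess_type(path)
    stat = os.stat(path)
    props = {"mime": mime, "size": stat.st_size, "date": stat.st_mtime}
    text = ""
    if mime == "text/plain":  # only plain text handled in this sketch
        with open(path, encoding="utf-8") as f:
            text = f.read()
    return text, props

# Demo on a temporary text file:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello")
demo_text, demo_props = wrap(f.name)
os.remove(f.name)
```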
Annotator. The annotator is responsible for creating and enhancing the annotation (or meta-
data) of a file. It is fed with the results of file wrappers, including the retrieved text and its
standard properties, based on which it associates the file with property-value pairs. Most
current data annotators need input from users, although sometimes part of the annotations can
be obtained from the file content. In practice, a semi-automatic annotator is often provided,
such as the “easy” annotation mechanism of MyLifeBits (51). In MOSE, the annotations are
stored in a database, called file description.
Classifier. The classifier is one of the most important components for the semantic organiza-
tion in the framework. Given a file and its file description, the classifier provides the following
operations: (1) Identification of the file as a resource with a unique URI (Uniform Resource
Identifier); (2) Examination of the file content to explore the resources that are contained or
referred to by the file; (3) Population of domain ontologies with all discovered resources; (4) De-
termination of the associations between resources, called resource-resource (R-R) associations.
These resources and their associations are maintained in a resource repository.
Indexer. After being classified, a file is indexed in terms of the resources discovered in it
(e.g., the names of the authors in a publication). Such resource-file indices are stored in a
repository, called R-F index, for future use in query answering. There are three types of
R-F indices (also called R-F associations): identification, containment, and reference, which are
obtained by the first and second operations of the classifier. Given a query of keywords posed
by the user, the query processor of MOSE can first locate the corresponding resources and then
find the files that are identified as, containing, or referring to such resources, by means of the
R-F index.
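The lookup described above can be sketched as follows; the index entries are adapted from the examples used elsewhere in this chapter, and the representation (a list of tuples) is our own, not MOSE's storage format.

```python
# Sketch of the R-F index: each entry links a file to a resource through
# one of the three association types (identification, containment, reference).
rf_index = [
    ("c:/emails/WISE photos.eml", "identification", "#wisephotomsg"),
    ("c:/emails/WISE photos.eml", "containment", "#wise03photomyself"),
    ("c:/emails/WISE photos.eml", "reference", "#wise03conf"),
    ("c:/papers/WISE03-camera.pdf", "identification", "#wise03papercamera"),
]

def files_for(resource):
    """Files identified as, containing, or referring to `resource` --
    the lookup the query processor performs after locating resources."""
    return sorted({f for f, _, r in rf_index if r == resource})

hits = files_for("#wisephotomsg")
```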
Ontology designer and matcher. At the center of the framework of MOSE are the multiple
application and domain ontologies stored in the ontology repository. We provide an ontology
designer for the management of concepts and roles of individual ontologies, and an ontology
matcher for the maintenance of inter-ontology relationships (i.e., ontology mappings). Con-
sidering that most semantic desktop end users may lack the knowledge of particular ontology
languages (e.g., RDFS or OWL), the ontology designer should hide the details of such languages
but enable users to work with the conceptualization of their domains of interest. In addition,
to improve the precision of an automatic ontology mapping process, the ontology matcher may
be able to combine different ontology matching strategies (37; 64).
5.5 Semantic Data Organization
The layered architecture of our PIM framework described previously enables the reusability
and the organization of semantically rich data for PIM. In this section, we discuss in detail
the mechanisms that our framework uses to support the semantic organization of the PI space,
including those for semantic annotation, association, and representation.
5.5.1 Annotation
Given that the data in the PI space is the base information, all the other data components
in our framework are actually superimposed information over this base. The most fundamen-
tal function of the superimposed information is to provide semantic annotations of the base
information to enable powerful and accurate data access. We discuss the following two aspects:
File description. It is especially important to provide the searcher with a detailed description
of the nontextual files. When performing a keyword-based searching, the searcher matches the
submitted keywords (e.g., “Canon”) or key-value pairs (e.g., “Maker:Canon”) with the property-
value pairs of the file description, to find the right files requested by the user. Even for textual
files, taking into account such metadata will improve the effectiveness of full-text searching.
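The matching of keywords and key-value pairs against file descriptions can be sketched as follows. The descriptions and the query syntax ("key:value") are illustrative, taken from the JPEG example above rather than from a concrete MOSE interface.

```python
# Sketch of keyword and key:value search over file descriptions
# (property-value pairs), as described above.
descriptions = {
    "photos/WISE/myself.jpeg": {
        "Dimensions": "3072 x 2048 pixels",
        "Device make": "Canon",
        "Color space": "RGB",
        "Focal length": "75",
    },
}

def search(query):
    """'key:value' matches a specific property; a bare keyword
    matches any property value."""
    if ":" in query:
        key, _, value = query.partition(":")
        return [f for f, d in descriptions.items()
                if d.get(key, "").lower() == value.lower()]
    return [f for f, d in descriptions.items()
            if any(query.lower() in v.lower() for v in d.values())]
```

For instance, both `search("Device make:Canon")` and the bare keyword `search("Canon")` locate the photo, while a query for an absent value returns nothing.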
Domain ontologies. Given that a file is identified as a resource, we are able to annotate the
file using a domain ontology, by associating the resource with a concept of an ontology. The
domain ontology provides not only a context for understanding the data, but also semantic clues
for the precise data retrieval. For example, the user can query the PI using a query language
for RDF instead of using keywords. We note that a file can be an instance of more than one
concept, according to different classification criteria.
5.5.2 Association
In our framework, semantic associations are used to relate all the data (base information)
and metadata (superimposed information). There are two classes of associations: the resource-
file associations that are actually the resource-file indexes and the resource-resource associations
that are instances of the domain ontologies and are stored in the RDF repository.
Resource-file associations. In addition to the ontological resources that are used to identify
(through data classification) the files, a (textual) file may contain and refer to a number of
resources. Therefore, the resource-file associations can be one of the following: identification,
containment, and reference.
Example 5.2 Suppose that the user has saved an email message, which is an announcement of
a seminar, as shown in Figure 30. First, the email message can be classified as an instance of
the concept Email, provided that the concept exists in some domain ontology. Then, the system
can generate for the concept SeminarAnnouncement and its properties a new instance (i.e.,
resource), which is associated with the saved email by the relationship containment. Finally, a
reference association can be established between the resource http://www.tliap.nus.edu.sg/ (e.g.,
of the concept WebsiteAddress) and the email message.
The process of setting up the resource-file associations is that of recognizing resources
from the file description and/or the file content and then mapping them to the ontological
Figure 30. An example of an email message.
concepts. The user may determine the degree to which the resources should be extracted from
a file and its description. For instance, in the previous example, the user can further create
resources for the title and abstract of the seminar, and for the biography of the presenter. It is
expected that this process (as well as the process of discovering resource-resource associations,
as discussed later) can be maximally automated, to reduce the user’s burden. For this purpose,
we may utilize the following methods:
• Keyword extraction. From the text of a file, keywords can be extracted based on a
thesaurus or be highlighted manually by the user. Each keyword can be considered a
resource contained by the file. The matching of the resources with the concepts in the
domain ontologies can be guided by a thesaurus such as WordNet.1
1http://wordnet.princeton.edu
• Hyperlink analysis. For the textual files that include hyperlinks to classified resources
(e.g., a citation of a paper or a link to a webpage), we create for each hyperlink a reference-
type resource-file association, as well as a resource-resource association between the re-
ferring resource and the referred one.
• Natural language processing. We can utilize known techniques (e.g., (12)) to parse
each sentence of a text or its summary obtained by means of text summarization (76).
For each resulting triple 〈subject, predicate, object〉, we try to match it with the patterns
〈s, p, o〉 in the domain ontologies, where p is a property of the concept s and has a value
typed of o. If such pattern exists, a resource-resource association of type property and of
the form 〈subject, predicate, object〉 is generated.
• History. As the framework proceeds with such classification and cognition, more and
more knowledge about this process can be accumulated and reused by a new process.
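The pattern-matching step of the natural-language-processing method above can be sketched as follows. The concept assignments and the single pattern are illustrative; they mirror the 〈Singapore, implements, ITS〉 example discussed below rather than an actual extraction pipeline.

```python
# Sketch of matching an extracted sentence triple against ontological
# patterns <s, p, o>: if the subject's and object's concepts together
# with the predicate match a pattern, a property-type resource-resource
# association is generated.
patterns = {("Organization", "implements", "System")}
concept_of = {"Singapore": "Organization", "ITS": "System"}

def classify(triple):
    subject, predicate, obj = triple
    pattern = (concept_of.get(subject), predicate, concept_of.get(obj))
    if pattern in patterns:
        return ("property", triple)  # new resource-resource association
    return None  # no matching ontological pattern
```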
Resource-resource associations. We borrow from the Object Oriented Design (OOD) tech-
niques the following four types of relationships between objects: instantiation (i.e., member-
ship), property, aggregation (i.e., whole/part), and generalization (i.e., inheritance). These four
relationships, which are used in object models, are adopted to describe the associations among
concepts as well as resources in our framework. Note that “property” refers to a pattern as
identified, for example, using natural language processing techniques, which corresponds to a
user-defined property. For example, writes can be a property of the concept Author, connecting
Author to the concept Book. Table V summarizes the resource-resource associations in our
framework.
TABLE V

RESOURCE-RESOURCE ASSOCIATIONS.

Resource-resource   Intra-   Inter-   Intra-        Inter-        Domain-
associations        domain   domain   application   application   application
aggregation           √                   √                           √
property              √                   √                           √
instantiation         √                   √                           √
generalization        √                   √                           √
ontology mapping               √                        √             √
By using the previously described techniques, we can discover the resources and their asso-
ciations implied in the PI, and classify them into the domain ontologies, thus populating the
ontologies. In the example of Figure 30, it is possible to extract a pattern 〈Singapore, implements,
ITS〉, which could then be classified as an instance of an ontological pattern such as 〈Organization,
implements,System〉, where Organization and System are two concepts, and implements is a prop-
erty. Note that the user is allowed to choose the granularity of this knowledge (resource and
associations) discovery process, ranging from only taking the whole file as a single resource to
analyzing the detailed contents of the file.
In addition, ontology mappings may be established between correspondences that connect
concepts in different domain and application ontologies. Currently, we consider equivalence as
the only semantics for the mapping between two concepts, although richer semantics of the
mappings could be considered (64).
5.5.3 Representation
In our framework, all information, including file descriptions, the resources in the repository,
and the resource-file indexes, is represented in the Resource Description Framework (RDF),1
a W3C proposed standard. For the schema of these data (i.e., the application and domain
ontologies), we use the vocabulary language for RDF, RDF Schema (RDFS).2 The RDF model
is a semantic network, where the nodes denote the resources and the edges are properties that
represent the relations between resources. The network can also be seen as a set of statements
(triples) in the form of (subject, predicate, object). RDFS is used to define the vocabulary
(in terms of classes and properties) of the RDF data, such as rdfs:Class, rdf:Property, and
rdf:type. Table VI summarizes the RDFS vocabularies that are used to represent different types
of associations.
The use of RDF as the data model and RDFS as the ontology language in our framework
is motivated by the nature of the RDF as a Web resources description mechanism and the
fact that the PI is represented as a set of interrelated resources. In contrast, XML is not
chosen because it cannot represent semantic associations (42). Certainly, OWL (Web Ontology
Language), as built on top of RDFS, is more expressive for ontology representation. However,
the use of a slightly extended version of RDFS is adequate for representing resource-file and
resource-resource associations.
1http://www.w3.org/RDF/
2http://www.w3.org/TR/rdf-schema
TABLE VI

RDF PROPERTIES FOR THE REPRESENTATION OF ASSOCIATIONS.

Relationship     RDF property              Comments
aggregation      rdfx:contained            rdfx abbreviates the namespace where the
                                           property contained is defined. For example,
                                           <#a, rdfx:contained, #b> means that a contains b.
property         user-defined properties   For example, <#wise03talk, presentedBy, #xiao>
                                           means that wise03talk is connected to xiao by
                                           the association presentedBy.
instantiation    rdf:type                  For example, <#xiao, rdf:type, #Person> means
                                           that the resource xiao is an instance of the
                                           concept Person.
generalization   rdfs:subClassOf           rdfs:subPropertyOf is used for property
                                           generalization.
The extension to RDFS is as follows: we define in a namespace (abbreviated using the prefix
rdfx) a new RDF property, contained, which is used to represent the aggregation relationship.
For the representation of the instantiation and generalization relationships, we use rdf:type and
rdfs:subClassOf, respectively. The property relationship is represented naturally by an RDF
property defined in the user-defined namespace. Section 5.6 gives a concrete example of an RDF
representation.
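These conventions can be sketched as plain triples. The sketch below is our own, with the triples adapted from this chapter's examples; a real implementation would use an RDF library rather than Python tuples.

```python
# The four association types as RDF triples, using the vocabulary of
# Table VI ('rdfx' is the extension namespace described above).
triples = [
    ("#wisephotomsg", "rdfx:contained", "#wise03photomyself"),  # aggregation
    ("#wise03talk", "presentedBy", "#xiao"),                    # property
    ("#xiao", "rdf:type", "#Person"),                           # instantiation
    ("#ConferenceTalk", "rdfs:subClassOf", "#Talk"),            # generalization
]

def instances_of(concept):
    """Resources typed (directly) as `concept` via rdf:type."""
    return [s for s, p, o in triples
            if p == "rdf:type" and o == concept]
```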
5.6 Semantic Data Navigation
It is critical for a Semantic Desktop to provide the user with the capability to access the
stored data in a variety of ways. The user may want to browse the information by means of the
flexible and intelligent navigation in the information space, including the base and superimposed
information. The user may also desire that certain query facilities (e.g., keyword-based searching
or certain query languages) be provided by the framework. In this section, we discuss the
navigation in the data space of a Semantic Desktop. Query processing is discussed in the next
section.
The semantic data organization in our framework enables the navigation in the PI space,
making use of useful hints (e.g., the context of a concept being browsed) so as to facilitate the
user’s understanding of data. More specifically, by taking into account the layered architec-
ture, the semantic navigation in our framework can be performed in three directions: (1) In
vertical navigation, the user follows a path across layers. Two cases are possible for this way
of navigation: top-down from the application ontologies to the stored files and bottom-up from
the stored files to the application ontologies. (2) In horizontal navigation, the user follows links
of concepts (or resources) within one layer. Typically, there are three cases of horizontal nav-
igation, corresponding to each layer: application-to-application navigation, domain-to-domain
navigation, and file-to-file navigation. (3) In temporal navigation, the user can navigate by
following references in chronological order, each being a resource for the same real world object
with a time stamp associated with it. For example, the user may want to look at different
versions of a research paper.
All the base and superimposed information in the framework forms a directed graph, where
the vertices are the resources in the ontologies and the files stored in the PI space, and the
edges are the associations between the resources and files. We say that the three directions of
navigation together provide a three-dimensional (3D) navigation mechanism, which
can facilitate the construction of a browser. For instance, suppose the user is browsing a specific
application ontology in a visualized browser. When the user clicks on the node of a concept in the
[Figure 31 is a diagram spanning the application, domain, and resource layers; its textual
content is reproduced below.]

RDF repository:
<#wisephotomsg, rdf:type, #Email>
<#wise03photomyself, rdf:type, #Photo>
<#wise03conf, rdf:type, #Conference>
<#wise03talk, rdf:type, #ConferenceTalk>
<#wise03papercamera, rdf:type, #InProceedings>
<#jods05, rdf:type, #Article>
<#cruz, rdf:type, #Person>
<#xiao, rdf:type, #Person>

Resource-file index:
<"c:\emails\WISE photos.eml", rdfx:identification, #wisephotomsg>
<"c:\emails\WISE photos.eml", rdfx:contains, #wise03photomyself>
<"c:\emails\WISE photos.eml", rdfx:reference, #wise03conf>
<"c:\photos\myself.jpeg", rdfx:identification, #wise03photomyself>
<"c:\talks\WISE03.ppt", rdfx:identification, #wise03talk>
<"c:\talks\WISE03.ppt", rdfx:reference, #wise03talk>
<"c:\papers\WISE03-camera.pdf", rdfx:identification, #wise03papercamera>
<"c:\papers\JoDS05.pdf", rdfx:identification, #jods05>

[The diagram also depicts: two application ontologies (one for attending a conference, one
for picture management); four domain ontologies (Email, Talk, Publication, and Photo) with
properties such as sentBy, receivedBy, attached, writtenBy, editor, presentedBy, extends,
and extendedVersion; and the resource-layer graph connecting the instances wisephotomsg,
wise03photomyself, wise03papercamera, wise03conf, wise03talk, jods05, cruz, and xiao.
Mappings, rdfs:subClassOf edges, and user-defined properties are drawn as distinct edge
types.]
Figure 31. Data organization in the application, domain, and resource layers. All ontologies
are represented in RDFS. Two application ontologies for PIAs, i.e., picture management and
publication management, are constructed. Below them are four ontologies for the domains of
Email, Talk, Publication, and Photo, respectively. At the bottom, the resource-file and
resource-resource associations are represented as triples or in a graph.
[Figure 32 is a screenshot of the browser. It shows the Email application ontology mapped
to the domain ontology on the left, and, for the current resource (the email message), the
property pane:
1. title: WISE photos
2. attached: wise03photomyself
3. sentOn: 12/30/2003
4. sentBy: cruz
5. receivedBy: xiao]
Figure 32. The browser for PIM.
ontology, the browser can then choose to display the instances of the concept thus selected (by
vertical navigation), the context of the concept in the domain (also by vertical navigation), and
the associated concepts in other application ontologies (by horizontal navigation). Compared
to the traditional navigation approach that is based on hierarchical directories, 3D navigation
is based on semantic associations, similar to those that humans establish between concepts.
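The three navigation directions can be sketched over an edge-typed graph. The following is a minimal illustration, not MOSE's actual data structure: node names, edge labels, and the direction tags are all hypothetical.

```python
# Hypothetical sketch of 3D navigation over the PI graph. Each edge carries
# a direction tag so the browser can filter moves into vertical (across
# layers), horizontal (within a layer), or temporal (version chain) steps.

# (source, edge_label, target, direction) quadruples -- illustrative only
EDGES = [
    ("Photo@app",    "mapping",     "Photo@domain",   "vertical"),    # app -> domain layer
    ("Photo@domain", "rdf:type",    "wise03photo",    "vertical"),    # domain -> resource layer
    ("Photo@app",    "attached",    "Email@app",      "horizontal"),  # within application layer
    ("wise03photo",  "nextVersion", "wise03photo_v2", "temporal"),    # chronological references
]

def navigate(node: str, direction: str) -> list[tuple[str, str]]:
    """Return (edge_label, target) pairs reachable from `node` in one direction."""
    return [(lbl, tgt) for src, lbl, tgt, d in EDGES if src == node and d == direction]

print(navigate("Photo@app", "vertical"))    # top-down step into the domain layer
print(navigate("wise03photo", "temporal"))  # follow the version chain
```

A browser built on such a graph only needs to offer the three `direction` filters as distinct controls to realize the 3D mechanism.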
Example 5.3 Consider the scenario shown in Figure 31. The spirit of 3D navigation is demonstrated
in the browser of Figure 32. The current resource (file) that the user is browsing is an email
message (i.e., wisephotomsg), which has attached photos that were taken at WISE ’03.
The concepts to which this resource belongs are highlighted (in white) so as to show its
contexts. All associated resources are categorized and shown on the right tabbed
pane, which provides guidance for the user in navigating the PI space. The bottom-right pane
shows the timeline of the different versions of the current resource (if they exist) or of all the
resources belonging to the same concept as the current resource.
5.7 Personal Information Applications
5.7.1 Motivation
Example 5.4 A PhD student majoring in Chemistry has collected quite a few publications
related to her research area and is now compiling a literature survey. The publications are
stored as PDF files in different directories. For the literature survey she looks at a group of
selected papers. For each of those papers, she would like to read some of the interesting papers
that are referenced in that paper, which have already been downloaded and stored in the local
desktop. To locate those papers, she can browse the directory hierarchy, use the search capability
provided by the operating system (if she can remember the file names), or use desktop search
tools, such as the Google Desktop Search1 or MSN Desktop Search.2
1http://desktop.google.com
2http://toolbar.msn.com
As the literature survey progresses, the student becomes tired of switching between windows,
and wants to develop a bibliography management system such that the above mentioned func-
tionalities are integrated in a single interface. However, she finds it challenging to implement
such a system, which requires several components, including a database to store and retrieve
the citation relationships between pairs of publications. She asks the help of a friend majoring
in Computer Science, who develops for her such a standalone application in Java. Now, the
student is able to browse through her publications and the citation network easily. However, she
would like to share that application with her advisor and with the other students in the project
but is not able to do that. Furthermore, she would have liked to be able to access the publications
that the other group members have discovered and stored, but cannot do that either.
When all the papers have been discovered and interrelated she would finally like to integrate
the bibliography application with an application for paper composition. The paper composition
application would gather several pieces of information such as related literature, experimental
results, and comments/corrections from the advisor. However, she discovers that the two appli-
cations do not interoperate and she has to manually “import” the information that is gathered
by the bibliography application into the paper composition application.
There are several key considerations in the design of a PIA development tool. First, end
users may lack programming skills or be reluctant to write programs in the context of
organizing the information on their desktop. Therefore, the PIA development
environment, if provided, should hide the programming details from the user. The second
consideration has to do with the flexibility and expressiveness offered to the designer. Even though we do
not expect to invent another programming language, there are some fundamental functionali-
ties that we need to make available, such as data access, data presentation, and business logic.
Finally, there is the need to share the information related to the same application (or task)
between two end users, as well as the need to reuse and to interoperate among existing PIAs.
Based on its semantic data organization, MOSE provides a semantic tool for end users
to develop PIAs—the PIA designer. In this chapter, we describe how we exploit the MVC
(Model-View-Controller) methodology (67) for personal information application development
in the PIA designer, which addresses the issues illustrated above. In particular, we discuss how PIAs
can be formalized as desktop services and how such services can facilitate data interoperation
and integration across semantic desktops.
5.7.2 MVC-based PIA Development
The resource explorer allows for the “global” exploration of the resources and ontologies
in a desktop. However, views need to be tailorable for the users’ diverse tasks, as we see in
Example 5.4. To this end, MOSE provides a tool, the PIA designer, whose main design
objective is flexibility.
Each PIA can work in a standalone mode, with its own application ontology, user interface,
and workflows, aimed at a specific task (e.g., bibliography management, paper composition,
or trip planning). Meanwhile, different PIAs can communicate with each other as in a P2P
network, by means of the connections (mappings) established between their application on-
tologies. In MOSE, a PIA has two modes: a development mode and an execution mode.
The interfaces corresponding to these two modes are, respectively, the PIA designer (for the
development mode) and the PIA browser (for the execution mode); the user can switch from
one to the other at any time.
The development of a PIA uses the MVC (Model-View-Controller) methodology. In par-
ticular, in the development of a PIA, the “Model” can be an application ontology that has
been composed as a view over domain ontologies; the “View” consists of one or more compo-
nents that present data in different forms such as graph, text, and list; the “Controller”, which
is the business logic of the PIA, is a set of “if-then” rules, which enable the interaction and
synchronization between different data components. The data associated with components to
be displayed are retrieved from the repositories of ontologies and instances by queries named
parameterized channels.
The specification of a PIA, as defined by the user by means of the PIA designer (including
the model, view, and business logic), can be serialized in XML; this serialization is called the
PIA definition.
Now, the user can run a PIA in the PIA browser, which interprets and executes the PIA in
either an “online” mode (by directly switching from the designer to the browser) or an offline
mode (by loading from the PIA’s permanent serialization). The separation of the declarative
specifications from the interpretative execution greatly benefits the communication between
semantic desktops in terms of PIA interoperation, as we will see in the following sections.
5.7.3 Implementation
We have implemented a prototype of the PIA designer in Java, as shown in Figure 33.
Corresponding to the three basic elements of an application, we describe below the three
stages of application development.
Figure 33. The PIA designer.
Modeling. In the first stage, the user loads the application ontology from the ontology
repository, which represents the model underlying the PIA to be designed; it will be graphically
shown in the Data Model pane. The application ontology is mapped to the domain ontologies,
under which the resources representing personal information are classified. Actually, the appli-
cation ontology is constructed as a view over the domain ontologies in a “global as view” (GaV)
approach (70). This mapping process should not require the users’ programming expertise, but
only their awareness of the task and their knowledge of the domain.
Visualization. The second stage involves the design of the layout of the PIA, with one
or more visual components, each of which can be associated with a stream of data for its
presentation. The user drags the desired visual components from the Visual Component pane to
the PIA Browser Workspace pane. Examples of such components include TextPane, List, Table,
Graph, and File. The associated data can be resources, strings, files, or any other instances
of the ontologies; they are retrieved by queries, called channels (introduced in (98)), on the
application ontology. Some components, such as Button, Label, TextInput, and MessageBox, are
used to facilitate the interaction between the user and the PIA browser. A special component
called Services is used for desktop service composition, as discussed in Section 5.8.
Controller and parameterized channels. In the final stage, the controller (or business
logic) of a PIA is specified so as to realize rich interactions between the data and their views,
and to synchronize several visualizations. These controllers manage all possible updates of the
model and handle the events from the user interface, using “if-then” rules (more sophisticated
controls will be considered in future work) of the following form:
if Component1.event1(x1) and ... and Componentn.eventn(xn)
then Component1.action1(y1); ...; Componentm.actionm(ym);
endif
where xi, i ∈ [1..n], are parameters passed from the events, and yi, i ∈ [1..m], are the channels
that result in the actions. It often happens that the response of a component to some event needs
to take xi as a parameter to execute yi, especially when updating the data that is sensitive to
xi in a visual component. For this purpose, we introduce the concept of a parameterized channel,
i.e., a channel whose contents are determined by its parameters at runtime. In
MOSE, where channels are queries over ontologies, the parameter of a channel can be bound to
a variable or a constant in the query. By means of parameterized channels, an event started from
a component can pass any values to another component, thus enabling interactions between
different components.
Example 5.5 As shown in Figure 33, at the top left corner, the user loads the application
ontology (for publications), to develop a PIA for bibliography management. The application’s
user interface uses a Graph for displaying the citation network of papers, a TextPane for the
paper’s details, a List for the paper’s authors, and a TextPane for the author’s details.
To associate data with their proper visualization, the user defines the following channels,
in the syntax of RDQL (RDF Data Query Language), which has an SQL-like grammar (62).
Each channel takes the form of a string, which can then be fed into an RDQL interpreter (e.g.,
the one provided by the Jena API) for execution.
1. ch1(): “SELECT ?a, ?b WHERE (?a, cites, ?b)”
2. ch2(x): “SELECT ?a, ?b, ?c, ?d WHERE (” + x + “, title, ?a), (” + x + “, writtenBy,
?b), (” + x + “, year, ?c), (” + x + “, citedAs, ?d)”
3. ch3(x): “SELECT ?a WHERE (” + x + “, writtenBy, ?b), (?b, name, ?a)”
4. ch4(x): “SELECT ?a, ?b WHERE (” + x + “, institute, ?a), (” + x + “, email, ?b)”
As an example of a parameterized channel, the second query, ch2(x), returns the title, author,
year, and citation entry of a publication, which is bound to parameter x.
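The mechanism can be sketched in Python: a parameterized channel is a query template whose parameter is bound at runtime, mirroring ch2(x) above. This is an illustrative sketch only; in MOSE the resulting RDQL string would be handed to an interpreter (e.g., via the Jena API), which we do not reproduce here.

```python
# Sketch of a parameterized channel: a function that splices its runtime
# parameter into an RDQL-style query string, as in ch2(x) above.

def ch2(x: str) -> str:
    """Channel returning title, author, year, and citation entry of paper x."""
    return ("SELECT ?a, ?b, ?c, ?d WHERE "
            f"({x}, title, ?a), ({x}, writtenBy, ?b), "
            f"({x}, year, ?c), ({x}, citedAs, ?d)")

# Binding the parameter at runtime, e.g. when the user selects a paper node:
query = ch2("#wise03papercamera")
print(query)
```

The same pattern covers ch3 and ch4; a channel with no parameters, such as ch1, is simply a constant string.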
The data computed by executing a channel is presented in different forms depending on which
visual component is used to visualize it. For example, a Graph shows the data represented
as a graph, where nodes are resources and edges are their associations. To construct such a
graph, the nodes representing the same resource will be merged into a single one.
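The merging step can be sketched as follows. The function and its input are hypothetical; it only illustrates how tuples returned by a channel such as ch1 (pairs of citing and cited papers) collapse repeated resources into single nodes.

```python
# Sketch of how a Graph component could merge nodes: each tuple from a
# channel becomes an edge, and repeated occurrences of the same resource
# collapse into a single node via set membership.

def build_graph(pairs):
    """Return (nodes, edges) with duplicate resources merged into one node."""
    nodes, edges = set(), []
    for a, b in pairs:
        nodes.add(a)   # adding an existing resource is a no-op: nodes merge
        nodes.add(b)
        edges.append((a, b))
    return nodes, edges

nodes, edges = build_graph([("p1", "p2"), ("p1", "p3"), ("p3", "p2")])
print(len(nodes))  # 3 distinct resources, not 6 edge endpoints
```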
The following rules are defined to specify the controller; the first rule has no preconditions
and is thus triggered at the very beginning of the PIA’s run.
1. PaperGraph.update(ch 1())
2. if PaperGraph.isSelected(x) then PaperDetail.update (ch 2(x))
3. if PaperGraph.isSelected(x) then AuthorList.update (ch 3(x))
4. if AuthorList.isSelected(x) then AuthorDetail.update (ch 4(x))
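A minimal sketch of how such rules could be dispatched is given below. The rule table and the `update`/`fire` machinery are assumptions for illustration, not MOSE's implementation; the component and channel names follow Example 5.5.

```python
# Minimal sketch of the if-then controller: rules map a component event to
# actions that refresh other components through (parameterized) channels.

updated = []  # records (component, channel_call) pairs, standing in for UI updates

def update(component, channel_call):
    updated.append((component, channel_call))

RULES = {
    # (event source, event name) -> list of (target component, channel name)
    ("PaperGraph", "isSelected"): [("PaperDetail", "ch2"), ("AuthorList", "ch3")],
    ("AuthorList", "isSelected"): [("AuthorDetail", "ch4")],
}

def fire(component, event, x):
    """Trigger all rules whose precondition matches the event, passing x on."""
    for target, channel in RULES.get((component, event), []):
        update(target, f"{channel}({x})")

update("PaperGraph", "ch1()")          # rule 1: no precondition, runs at startup
fire("PaperGraph", "isSelected", "#wise03papercamera")
print(updated)
```

Selecting a paper node thus refreshes both the detail pane and the author list in one step, which is the synchronization the controller is meant to provide.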
5.8 Services-based Desktop Interoperation
As mentioned before, two PIAs can communicate in a P2P fashion based on the application
ontology mappings established between them. It is required in this case that the two PIAs are
designed for a similar task, for which they have their application ontologies partially or fully
overlapping. We say that this way of PIA interoperation (or integration) is on the semantic
level and is oriented to data models. Previous work has discussed such P2P ontology-based
query processing (26; 114; 115). In this section we discuss another type of PIA interoperation,
which is realized by means of desktop services, thus called service-oriented interoperation.
The notion of desktop service was first introduced into the vision of semantic desktop by
the Gnowsis system (103). However, to the best of our knowledge, there has been no definition
or formalization of desktop services. Next we give our own definition of what constitutes a desktop
service in terms of parameterized channels, and describe how this service-based mechanism
facilitates the data interoperation and integration in our semantic desktop vision. We assume
PIA-based desktop services in our discussion, and use both terms, PIA and desktop service,
interchangeably.
In general, a service (e.g., Web service1) must have its interface (i.e., input and output)
defined, while keeping the implementation of its operation hidden from the service consumer.
Intuitively, a PIA in MOSE consists of a set of visual components bound to parameterized
channels. In this sense, we can see a channel as the minimal unit of service, taking the para-
meters as input and its resulting data as output. Starting from this point, we are able to give
a definition of service based on the definition of parameterized channel.
Formally speaking, a parameterized channel q is a triple 〈M, I, O〉, where M is the under-
lying model (i.e., application ontology), I is a set of parameters (i.e., input), and O is the set
of tuples resulting from the execution of the channel (i.e., output). A desktop service s is a
5-tuple 〈Q, I, O, V, C〉, where

• Q = {q1, ..., qm, s1, ..., sn} is a set of channels q1, ..., qm and services s1, ..., sn, where
m ≥ 0, n ≥ 0, m + n ≥ 1, and si ≠ s, i ∈ [1..n];
1http://www.w3.org/2002/ws/
[Figure 34 is a diagram with two panels. In panel (a), remote execution of services: four
desktops (SemDesk 1 to SemDesk 4) host PIA-1 to PIA-4 with their application ontologies
AO-1 to AO-4; the PIA browser on SemDesk 4 sends requests to the other desktops and
receives responses. In panel (b), local execution of services: the definitions of the remote
PIAs are brought to SemDesk 4 and interpreted there. The legend distinguishes PIA
definitions from PIA implementations.]
Figure 34. Desktop services composition and execution.
• I ⊆ I1 ∪ ... ∪ Im ∪ I′1 ∪ ... ∪ I′n is the input, where Ii is the input of qi, i ∈ [1..m], and
I′i is the input of si, i ∈ [1..n];

• O ⊆ O1 ∪ ... ∪ Om ∪ O′1 ∪ ... ∪ O′n is the output, where Oi is the output of qi, i ∈ [1..m],
and O′i is the output of si, i ∈ [1..n];

• V = {v1, ..., vl} is the set of visual components, with vi being the component of oi, where
oi ∈ O, i ∈ [1..l];

• C = {c1, ..., ck} is a set of rules representing the control flows among the components.
The above recursive definition, based on channels as units, allows for a flexible compo-
sition of desktop services. Besides its self-defined channels qi, i ∈ [1..m], a PIA can reuse any
services si, i ∈ [1..n], and embed them in itself, by establishing which outputs oj of si are to be
shown in which view vj, j ∈ [1..l]. Then, the controller C, consisting of if-then rules, is used to
specify the composition (control and data flows) among these channels or services in the PIA.
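The recursive structure can be sketched with Python dataclasses. This is an illustrative simplification: the definition above only requires I and O to be subsets of the unions over the parts, whereas the sketch takes the full unions, and all names are hypothetical.

```python
# Sketch of the recursive service definition: a service aggregates channels
# and nested services; its input/output sets are derived from its parts.

from dataclasses import dataclass, field

@dataclass
class Channel:                      # parameterized channel q = <M, I, O>
    model: str                      # underlying application ontology
    inputs: frozenset               # parameter names
    outputs: frozenset              # result attributes

@dataclass
class Service:                      # desktop service s = <Q, I, O, V, C>
    parts: list = field(default_factory=list)   # channels and nested services (Q)
    views: list = field(default_factory=list)   # visual components (V)
    rules: list = field(default_factory=list)   # if-then control flow (C)

    def inputs(self):
        return frozenset().union(*(p.inputs if isinstance(p, Channel)
                                   else p.inputs() for p in self.parts))

    def outputs(self):
        return frozenset().union(*(p.outputs if isinstance(p, Channel)
                                   else p.outputs() for p in self.parts))

ch2 = Channel("PublicationAO", frozenset({"x"}), frozenset({"title", "author"}))
biblio = Service(parts=[ch2], views=["PaperDetail"])
composed = Service(parts=[biblio])   # a PIA reusing another service as a unit
print(composed.outputs())
```

Because `Service.parts` may itself contain services, composition nests to arbitrary depth, which is exactly what the recursive definition permits.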
Because of space limitations, we do not elaborate on the different types of service composition
(e.g., “sequential” and “parallel” flows) (96). Instead, we describe next how the service-oriented
inter-desktop communication is implemented, by means of service composition and execution,
in the two cases that are depicted in Figure 34.
The first case, as shown in Figure 34(a), is called remote execution of desktop services. In the
example, there are four services (PIA-1 to PIA-4), with their respective application ontologies
(AO-1 to AO-4). Suppose that PIA-4 is the starting point of the service execution, where the
user interacts with the PIA browser. All requests for both the data and the execution of other
services (defined and implemented in other desktops, but composed by the current service) are
driven by events from such interactions. Whenever a nested remote service (e.g., PIA-2 or PIA-
3) is triggered by the current service, a request for execution will be sent to the remote desktop
(e.g., SemDesk 2 or SemDesk 3), where the remote service will be executed. As a response to
the request, the remote service returns its execution results to the current service.
While the first case is similar to what happens with Web services, the second case of desktop
service execution (called local execution, as shown in Figure 34(b)) is quite different. In partic-
ular, whenever a service nested in the current service is activated, it will be locally interpreted
and executed by the PIA browser in the current desktop. However, the local execution of a
remote service (e.g., PIA-2) needs permission to access relevant data (e.g., AO-2) from a remote
desktop. If so, the data is then duplicated in the local desktop via a secure data transfer.
We note that the essential difference between the two cases of desktop service execution is
related to a tradeoff between control permission and data access. This flexibility is important
in a semantic desktop setting. Depending on their available resources, some desktops may be
reluctant to take a heavy workload while some others may be concerned with the privacy of
their data. Therefore, a desktop (when acting as a server) can choose whether to contribute
its computing power or share its data.
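The two execution modes and the permission tradeoff can be contrasted in a short sketch. All function and host names are hypothetical, and the "network" is simulated with strings; the point is only the control-flow difference between the two cases.

```python
# Sketch of the two execution modes for a nested desktop service.
# Remote execution ships the request to the owning desktop (it contributes
# computing power); local execution fetches the definition and data and
# interprets them in the local PIA browser (it requires data access).

def execute_remote(service, host, inputs):
    """Send inputs to `host`; the remote desktop runs the service itself."""
    return f"response from {host}: ran {service} with {inputs}"

def execute_local(service, host, inputs, has_data_permission):
    """Fetch the definition (and, with permission, the data), then run locally."""
    if not has_data_permission:
        raise PermissionError(f"{host} did not grant access to the data of {service}")
    definition = f"definition of {service} fetched from {host}"
    return f"local run of [{definition}] with {inputs}"

print(execute_remote("PIA-2", "SemDesk2", {"x": "#jods05"}))
print(execute_local("PIA-2", "SemDesk2", {"x": "#jods05"}, has_data_permission=True))
```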
5.9 Semantic Query Processing
Unlike navigation, which is an interactive process, query processing is performed without
further intervention from the user. To retrieve relevant data from the PI space, the user’s
request may be posed as a sequence of keywords or as a query formulated in a certain query
language.
The keyword-based search matches the input keywords and the vector of words in the
candidate documents, calculates the similarity for each of the matches, and returns to the
user the results after ranking them (102). The results of a search are usually evaluated using
statistical criteria such as precision, recall, or a combination of the two. The shortcoming of
keyword-based search is that the semantic associations between relevant data are not considered.
In contrast, query languages can provide a semantically richer access interface, thus facilitating
the data retrieval and improving the accuracy of the answers. However, a query is usually
performed based on an exact match between the query and the data, which can reduce the
recall of the answers, in the sense that some relevant but unmatched data is not retrieved.
Since the two approaches complement each other, it is desirable to provide both of them.
In this section, however, we mainly focus on query processing in our framework. We choose to
express the queries in RDQL (62); they can query both the resources and their associations.
We discuss how to process a query submitted by the user in two cases: within a PIA and across
different PIAs.
5.9.1 Query Processing in a PIA
In our framework, the user query is formulated in RDQL (RDF Data Query Language),
which uses an SQL-like syntax (62). To reduce the user’s burden, a graphical interface can be
used to facilitate the formulation of queries. For simplicity, we use a subset of RDQL that we
call conjunctive RDQL (c-RDQL), which can be expressed as a conjunctive formula: ans(~X) :-
p1(~X1), ..., pn(~Xn), where ~Xi = (xi, x′i) and pi is an RDF property of xi having the value x′i.
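The correspondence between the conjunctive form and an RDQL SELECT can be made concrete with a small sketch. The representation (lists of variables and atoms) and the emitted syntax are illustrative assumptions.

```python
# Sketch of a c-RDQL query as data: a head of answer variables and a body
# of property atoms p(x, x'), plus a translation into an RDQL-style string.

def to_rdql(head, body):
    """head: list of variable names; body: list of (property, subject, object) atoms."""
    vars_ = ", ".join(f"?{v}" for v in head)
    atoms = ", ".join(f"(?{s}, {p}, ?{o})" for p, s, o in body)
    return f"SELECT {vars_} WHERE {atoms}"

# ans(x, y, z) :- writtenBy(x, y), extendedVersion(x, z)  (the query of Example 5.6)
q1 = to_rdql(["x", "y", "z"],
             [("writtenBy", "x", "y"), ("extendedVersion", "x", "z")])
print(q1)
# SELECT ?x, ?y, ?z WHERE (?x, writtenBy, ?y), (?x, extendedVersion, ?z)
```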
In our framework, an application ontology is constructed over one or more domain ontologies,
and the files in the PI space are formalized as instances of the concepts in the domain ontologies.
If we consider the application ontology as the global ontology (since the user query is posed
on it), the whole system can be seen as a GaV data integration system (70). Therefore query
processing in a single PIA is performed as in a GaV system. In particular, when the user
poses a query (in RDQL) over the application ontology, the RDQL query is then rewritten into
a new RDQL query in terms of the domain ontologies, based on the mappings between the
global ontology and domain ontologies. By executing the rewritten query on the corresponding
domain ontologies, resources (files) that match the query are then returned as answers to the
query.
There are a number of algorithms for query rewriting in relational or XML data integration
systems (58). In a GaV-based integration system, query processing is performed using an “un-
folding” strategy (70). More specifically, to rewrite a query (e.g., a conjunctive query) that is
posed on the global schema or ontology, we simply substitute the predicates in the body of the
query with the corresponding view definitions. In our framework, where the mappings between
the application ontology and the domain ontologies are expressed as RDF class or property
correspondences, the algorithm for query rewriting is similar to this strategy.
By assuming that there are no integrity constraints over the application ontologies and that
the user queries are formulated in c-RDQL, we give the formal description of our query rewriting
algorithm for a single PIA, which we call ADRewriting (for rewriting from Application
ontologies to Domain ontologies), in Figure 35. For simplicity of the description, we do not
consider the namespaces of the ontologies.
Example 5.6 Suppose the user wants to list all conference papers with their authors and jour-
nal version, using the query q1 : ans(x, y, z) :- writtenBy(x, y), extendedV ersion(x, z), which
is posed on the application ontology of publication management. For the variables (x, y, z),
we get the classes that they refer to as (Paper, Person, Journal), as indicated by Line 3. By
looking into M, we find the corresponding class sequence as (Publication:InProceedings, Publica-
tion:Person, Publication:Article), where the names before the colons are domain ontology names.
From Lines 5 to 10, we compute the predicates in the body of q2 as follows.
Algorithm ADRewriting
Input:  1. q1 over the application ontology G: ans(~X) :- p1(~X1), ..., pm(~Xm);
        2. M: the mapping table between G and the domain ontologies S1, ..., Sn.
Output: q2: a c-RDQL query over S1, ..., Sn.
1   head_q2 = ans(~X); body_q2 = null;
2   For i = 1 to m do
3     (c1, c2) = names of the classes referred to by (x1, x2), for ~Xi = (x1, x2);
4     Search M to find (d1, d2) such that {(c1, d1), (c2, d2)} are two class
      correspondences in M;
5     Traverse S1, ..., and Sn by following all kinds of associations, to find the
      vertices v1, ..., vk connecting from d1 to d2;
6     If k = 0 then add p(x1, x2) (or p(x2, x1)) to body_q2, if there exists p
      connecting d1 to d2 (or d2 to d1);
7     Else for j = 1 to k − 1 do
8       Add p(x′j, x′j+1) (or p(x′j+1, x′j)) to body_q2, if p is not a mapping and
        connects vj to vj+1 (or vj+1 to vj);
9     Add p(x1, x′1) (or p(x′1, x1)) to body_q2, if p is not a mapping and connects
      d1 to v1 (or v1 to d1);
10    Add p(x′k, x2) (or p(x2, x′k)) to body_q2, if p is not a mapping and connects
      vk to d2 (or d2 to vk);
11  q2 = head_q2 :- body_q2;
Figure 35. The ADRewriting algorithm.
q2: ans(x, y, z) :- editor(x, y), extends(z, x)
By executing q2 over the RDF repository shown in Figure 31, we get the answer:
{(#wise03papercamera, #xiao, #jods05), (#wise03papercamera, #cruz, #jods05)}.
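The unfolding step of Example 5.6 can be sketched as follows. This is a deliberate simplification: ADRewriting works at the level of class correspondences and traverses the domain ontologies to find a connecting path, whereas here the mapping table M is assumed to be flattened into direct predicate correspondences, with a flag recording when the rewritten atom reverses its arguments.

```python
# Simplified sketch of the unfolding in ADRewriting. Each predicate in the
# query body is replaced via a (flattened, hypothetical) mapping table M;
# the boolean records whether the domain-ontology atom swaps the arguments.

M = {  # application-ontology predicate -> (domain predicate, swap args?)
    "writtenBy":       ("editor",  False),
    "extendedVersion": ("extends", True),   # extends(z, x) reverses the atom
}

def rewrite(body):
    """Rewrite a list of (predicate, subject, object) atoms via M."""
    out = []
    for pred, s, o in body:
        d_pred, swap = M[pred]
        out.append((d_pred, o, s) if swap else (d_pred, s, o))
    return out

q1_body = [("writtenBy", "x", "y"), ("extendedVersion", "x", "z")]
print(rewrite(q1_body))  # [('editor', 'x', 'y'), ('extends', 'z', 'x')]
```

The result matches the body of q2 in Example 5.6: editor(x, y), extends(z, x).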
5.9.2 A2A Query Processing
Application to application (A2A) query processing occurs when an application is attempting
to retrieve relevant data from another semantically related application, to answer a query. If
the PIAs are considered as connected peers (i.e., service providers for certain data access), the
A2A query processing is similar to that in peer-to-peer (P2P) systems (40; 59). Whether the
PIAs exist in a single desktop or are physically distributed makes no differences to the A2A
query processing.
A2A query processing consists of two steps of query rewriting. First, we rewrite the original
query q, which is posed on the application ontology G1, to a query q′ on the other application
ontology G2, according to the mappings between G1 and G2. Then, q′ is rewritten to a query
q′′ on the domain ontologies, to which G2 is mapped. Answers are obtained by executing
q′′ on the RDF repository. The second query rewriting is exactly the one described by the
algorithm ADRewriting, whereas the first rewriting is slightly different from ADRewriting.
In particular, unlike the total mapping from an application ontology to the domain ontologies,
some of the concepts in G1 may not be mapped to those in G2. Therefore, the answers returned
by q′′ may contain null values or Skolem functions for the unmapped concepts or properties.
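The effect of partial mappings can be sketched by composing two mapping tables. The concept names and tables below are hypothetical; the point is that a G1 concept with no image in G2 composes to a null, which later surfaces in the answers.

```python
# Sketch of composing A2A mappings: a G1 -> domain table is chained with a
# domain -> G2 table; G1 concepts with no counterpart in G2 map to None
# (standing in for the null values or Skolem terms in the answers).

g1_to_domain = {"Paper": "InProceedings", "Person": "Person", "Slide": "Talk"}
domain_to_g2 = {"InProceedings": "Publication", "Person": "Author"}  # no "Talk"

def compose(m1, m2):
    """Compose two mappings; concepts unmapped in the second become None."""
    return {c: m2.get(d) for c, d in m1.items()}

a2a = compose(g1_to_domain, domain_to_g2)
print(a2a)  # {'Paper': 'Publication', 'Person': 'Author', 'Slide': None}
```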
The A2A mappings can be derived by composing the mappings between G1 and the do-
main ontologies, inter-domain mappings, and those between G2 and the domain ontologies. To
evaluate both query rewriting processes, we need to check the equivalence (or containment)
between a query and its rewriting. A correct query rewriting is the one that is equivalent to
(or maximally contained in) the query. These two issues (reasoning on mappings (14; 87) and
reasoning on queries (28; 82)) have been extensively studied and are beyond the scope of this
chapter.
5.10 Summary
In this chapter, we present our design of a PIM system. We propose a layered ontology-
based framework, which aims to provide a semantics-rich environment for personal information
organization and manipulation. The multiple ontologies existing in different layers of the archi-
tecture explicitly support the data semantics. Furthermore, the decoupling of the domain layer
and the application layer enhances the flexibility and reusability of the framework. Specif-
ically, we discuss in detail the semantically enriched data organization, including the use of file
descriptions and domain ontologies as annotations, and the construction of resource-file and
resource-resource associations. We also introduce the idea of 3D navigation, which is used in a
desktop browser.
We also describe an MVC-based approach for personal information application (PIA) devel-
opment in MOSE. Based on that, we have formalized the concept of desktop service, building
on the notion of parameterized channel. Furthermore, we discussed how desktop services can
facilitate data interoperation and integration across distributed semantic desktops.
Finally, we discuss query processing in our framework in two cases: within a single personal
information application, PIA, and between two PIAs, using application to application (A2A)
communication. A formal query rewriting algorithm is presented for the single PIA case.
In the future, we will continue the study and implementation of our framework. It is clear
that much of the success of PIM systems lies in the successful automation of the different
mechanisms that are needed. In particular, we will look further into the automation of the
conceptualization of full-text files and that of matching resources to ontological concepts. Also,
we will elaborate on the idea of 3D navigation both by studying a model for temporal navigation
and by carrying out user studies. The study of A2A communication, including data exchange,
collaboration, and query processing will also be continued. While RDQL queries are expressive,
they may not be suitable for most users. We are therefore exploring visual queries that can
express a class of RDQL queries “appropriate” for the semantic desktop.
While the envisioned semantic desktop can be seen as a miniature of the prospective semantic
web, it has its particular features as well as challenges, such as automatic classification of
personal information into ontologies, context-aware information search, and flexible tools for
data manipulation and application development. In the future, we will work along the following
two directions: (1) We would like to make the outlined functionality accessible to most end users,
for example, by allowing natural language specifications to automatically formulate channels. In
this context, the previous work on conversion of natural language questions to formal queries is
of great interest (72). (2) We will also work on mechanisms for defining, publishing, discovering,
and composing desktop services so as to extend their current capabilities. Our goal is to provide
a semantic platform, where Web services and desktop services can be semantically integrated
in a seamless way, so as to achieve data integration and application interoperability across
semantic desktops.
CHAPTER 6
GEOSPATIAL DATA MANAGEMENT IN E-GOVERNMENT
6.1 Introduction
It is the objective of eGovernment to increase the cooperation among government orga-
nizations so as to enable effective overall assistance to citizens. To this end, achieving data
interoperability is a major objective (73). The advent of XML on one hand, and the emergence
of metadata standards on the other hand, will play an important role in achieving syntactic,
schematic (or structural), and semantic interoperability (16). However, years of autonomous and uncoordinated development of classification schemes by government organizations pose
enormous challenges in achieving cooperation in areas such as land use planning, healthcare,
transportation, and social services.
In this chapter, our focus is on data interoperability of distributed geospatial data. To
illustrate the reach of our approach we focus on examples that are derived from land use
applications. The heterogeneity of data in such applications is extreme in that each county
and each municipality may have a different model for their databases—resulting in schematic
heterogeneity—and/or a different classification scheme for their land use data—resulting in
semantic heterogeneity. We have worked with local governments in Wisconsin within the
scope of WLIS (Wisconsin Land Information System). In the state of Wisconsin, there are
hundreds of different land use data schemas and classifications associated with the land use
data sources of the different counties or municipalities, henceforth called local data sources,
therefore hindering the cooperation among the local governments to achieve comprehensive
land use planning across the borders of the different jurisdictions (112).
We propose an ontology-based approach to enable integration and interoperability of the
local data sources, which reconciles both schematic and semantic heterogeneities. An ontology
is a formal, explicit specification of a shared conceptualization; it can be either an axiomatized
set of concepts and relationship types or a taxonomy of entities (55). We call the first kind
of ontologies schema-like ontologies, since they can be associated with various constraints and
are allowed to have instances. In comparison, the second kind of ontologies usually include
the subconcept (or subclass) relationships between two entities, and are called taxonomy-like
ontologies in our discussion. In our approach, both ontologies co-exist.
The ontologies that we use to represent the structure of the local data sources, which we
call local ontologies, belong to the first type and can be obtained from the source schemas
through a schema transformation process. The second type of ontologies are the land use
ontologies, which are part of the local ontologies; they represent land use taxonomies, which
are used to classify land parcels in the local data sources according to their usage (for example,
agricultural, commercial, or residential). In addition, our approach uses a global ontology that
models the domain structure of the integration task and acts as a mediator among the local
sources. Similarly to the local ontologies, the global ontology contains a global land use ontology
that describes the land usage domain.
The key to our approach lies in establishing mappings between the concepts of the global
ontology and the concepts of the local ontologies. The process of establishing such mappings is
called alignment. When such mappings have been established, we say that the two ontologies are
aligned or matched. Using those mappings, a single query can then be expressed in terms of the
concepts of the global ontology (or of a local ontology) and be automatically rewritten and posed
against the other ontologies. Therefore, a single query can retrieve data from heterogeneous data
sources, thus allowing for land use planning across as many jurisdictions as needed, provided
that the corresponding ontologies have been aligned.
Whereas the local data sources are expressed in XML with a DTD schema, we express all
ontologies using RDF,1 which is at the core of several ontology languages such as OWL2 and
DAML+OIL.3 XML Schema (often seen as an enhanced, DTD-like language with a well-defined data typing system) is a schema language for web data. The database-compatible types that are supported by XML Schema provide a way to model data organized hierarchically. However, there are no explicit constructs for defining classes, properties, and relationships
between classes in XML Schema, therefore ambiguities may arise when determining the relation-
ship between two XML elements. For instance, the relationship (ownedBy) between an element
LandParcel and its child element Owner is implicitly indicated by their nesting relationship.
1http://www.w3.org/RDF/
2http://www.w3.org/2004/OWL/
3http://www.daml.org/
[Figure content: a) the land-centric schema nests an Owner element (with Owner_id) under a LandParcel element (with Land_id); b) the owner-centric schema nests LandParcel under Owner; c) the conceptual schema represents LandParcel and Owner as two classes explicitly connected by the ownedBy relationship.]

Figure 36. An example of XML schematic heterogeneity.
The example in Figure 36 illustrates the fact that two different XML data schemas can
represent the same conceptualization, namely a many-to-many relationship between land parcels
and their owners, thus resulting in schematic heterogeneity. Specifically, the land-centric schema
has a LandParcel element containing the Owner child element. In the owner-centric schema,
the LandParcel element is nested as a child element under Owner. In contrast, the conceptual
schema in the same picture explicitly represents the underlying semantics of both XML schemas.
Because there is no nesting involved, we say that the conceptual schema is structurally flat.
Using that conceptual schema facilitates both the alignment and querying processes, as no
consideration needs to be given to the structure of the source (4; 39).
In our approach, the global ontology provides an integrated view of the source schemas as
well as a uniform query interface. Mappings consisting of class or property correspondences
are then established between each local ontology and the global ontology. Given that both
the global and local ontologies contain two components (i.e., the schema and the taxonomy),
the mappings between the global ontology and a local ontology are accordingly subdivided into
two components: the schema-level mappings between the schemas of both ontologies and the
instance-level mappings between the land use taxonomies of both ontologies. Such mappings
can be respectively used to reconcile schematic and semantic heterogeneities.
Query processing in our approach involves query rewriting and can be performed in two
ways: global-to-local query processing and local-to-local query processing, based on the double
role played by the global ontology. Using its first role, we rewrite a query posed on the global
ontology into subqueries over the local sources—the global ontology acts as a uniform query
interface of the integration system. Using its second role, we translate a query posed on an XML
geospatial source to a query on any other XML geospatial source, taking the global ontology
as a mediator for the query rewriting and hence for the interoperation between local sources.
In addition to the ontology-based architecture for data integration and interoperation, we
make the following contributions in this chapter:
• We focus on the alignment process of the local land use ontology with the global land use
ontology and propose an ontology alignment algorithm based on a set of deduction rules; the alignment can be performed automatically (that is, without the intervention of users) when certain pre-conditions are established.
• We propose a sound query rewriting algorithm based on the bidirectionality and composition of the mappings. The mappings are stored in a file that uses the RDF/XML syntax, called the agreement file. The rewriting algorithm is used for both global-to-local and local-to-local query processing, and computes a contained rewriting of the query in both cases. Query containment ensures that all the answers retrieved by executing the rewriting are a subset of the answers to the original query, thus guaranteeing precise query answering across distributed data sources (70).
The rest of the chapter is organized as follows. We summarize related work in Section 6.2. The data heterogeneity issues that are associated with WLIS are presented in Section 6.3, and our architecture in Section 6.4. In Section 6.5, we discuss ontology alignment and focus on an automatic algorithm. The two cases of query processing are illustrated in Section 6.6, where we also describe in detail the query rewriting algorithm. We conclude in Section 6.7.
6.2 Related Work
Data integration and interoperability are critical for the implementation of eGovernment,
especially when access to distributed data is needed, such as in GIS (53), cross-jurisdictional
criminal investigation (78), coastal management (71), electronic elections (35), or air quality
management (61). In this chapter, we look at the semantic integration of heterogeneous geospa-
tial data using conceptual data models (70; 87; 111). In this scope, we discuss the issues of
ontology alignment and query processing.
6.2.1 Ontology Alignment
The work on ontology alignment considers related work on database schema matching, but
takes into consideration characteristics of ontologies (64; 87; 99; 106). Existing schema or
ontology matching techniques can be classified into three categories:
Element-level At the element level, matching can use various similarity measures based, for
example, on names of elements or their textual descriptions. A normalized numerical value
will be calculated for each of the matching candidates, and the best one is selected (10;
30; 92).
Structure-level The structure-level information that can be used by the matching process includes the graph or taxonomy underlying the schema or ontology. Graphs are used
as contextual information to map pairs of elements and the taxonomy can provide the
matching process with more semantics thus contributing to semantic-level matching. For
example, the AnchorPrompt algorithm compares the structures of the graphs that repre-
sent the ontologies or schemas and determines their similarity (89). Another example is
the idea of similarity flooding, which uses a hybrid matching algorithm that propagates
similarity through the graph (80). An example of semantic-level matching determines
the similarity of two concepts based on the similarities of their ancestors (100). In our
approach, we consider the semantic similarity of the concepts’ children, instead.
Instance-level Instance-level matching uses the actual contents (or instances) of the schema
or ontology elements. Examples include: 1) GLUE that employs machine-learning tech-
niques to determine mappings, particularly by using multiple learners that exploit the
information contained in the conceptual instances and in the taxonomic structure of the
ontologies, and then uses a probabilistic model to combine results of different learners
(43), 2) HICAL that exploits the data instances that overlap in two taxonomies to infer
mappings (63), and 3) the NLP-based method suggested by Fossati et al., where only the
instance documents associated with the nodes of ontologies are taken into account (48).
Other ontology or schema mapping tools, which combine some of the above mentioned
methods, include Chimaera (79), COMA++ (9), MAFRA (74), Clio (60), and PROMPT (88).
Regarding the two types of ontologies in WLIS, namely the schema-like ontologies (which
can have instances) and the taxonomy-like ontologies, the matching methods differ because
of the different characteristics presented by both types of ontologies. Specifically, the former
type contains various user-defined properties whereas the latter type usually includes only the sub-concept (or subclass) relationships between two entities. Therefore, the matching of the
latter type can benefit from most structure-level methods mentioned above.
6.2.2 Query Processing
When mappings are defined as (relational) views, query processing is often referred to in the literature as view-based query answering or rewriting (58). However, few view-based query processing algorithms address the issue of query rewriting over ontologies. As compared with schemas, ontologies allow for a more expressive specification of constraints than most schema languages do, thus raising issues that have been investigated by artificial intelligence research,
including deductive reasoning, ontology integration, knowledge discovery, and query approxi-
mation (107).
We divide ontology-based query processing techniques into two categories according to the
architecture being considered: a centralized architecture that uses a mediator and a peer-to-peer
architecture:
Centralized architecture We distinguish two types of query processing: GaV (global-as-
view) and LaV (local-as-view). In the first type, the ontology acts as a global schema. In
a system that exemplifies this kind of approach (95), queries are expressed using terms
from the vocabulary of the global description logic ontology. Query rewriting uses a
global-as-view (GaV) approach by translating the global query into an equivalent calculus
expression, which references only the objects available in the source databases. Instead,
in our approach we use RQL, a semantically rich query language (65).
Within the second type (LaV), Amann et al. propose a mediator architecture for the querying and integration of XML data and introduce a new mapping language to express the mappings between the global schema of the mediator and the XML resources, which are defined as local views over the global schema. A query rewriting algorithm is proposed that translates user queries according to existing source descriptions in XPath (4).
Peer-to-peer architecture Unlike the previous two approaches, SWAP handles queries in
a peer-to-peer (P2P) setting (46). The queries that are posed by the user on a local
node range from simple conjunction to recursion formulated in an RQL-related query
language. The local node rewrites the query into subqueries and distributes them to the
other peers, which rewrite the queries in a similar fashion; the answers are then retrieved and gathered. In this chapter we use a similar query answering approach, when
queries are posed on a local ontology and executed in a local-to-local fashion. In P2P
systems, a GLaV (global-local-as-view) approach (70) is commonly used, which can also
be applied to centralized architectures (49).
Our ontology-based query rewriting algorithm is similar to the computeWTA algorithm
proposed by Calvanese et al. for query reformulation (26) as both assume consistent ontology
mappings. However, unlike in computeWTA, we allow for the transformation of the values that
are contained in the query based on the instance-level ontology mappings. In this way, we can
address semantic heterogeneity.
Another approach considers constraint-based query processing in the Clio system for data
integration (117). It focuses mainly on schema mapping and data transformation between
nested schemas and/or relational databases by taking advantage of the schema semantics to
generate the consistent translations from source to target by considering the constraints and
structure of the target schema. In their approach, mappings are expressed using queries, whereas
in our approach mappings and queries exist independently.
6.3 Data Heterogeneities
Our application domain focuses on the Wisconsin Land Information System (WLIS) project,
which implements a distributed web-based system with heterogeneous data residing on local
and state servers (36).
As an example, Figure 37 shows two fragments of land parcel data, including their DTD
(on the left-hand side) and an XML fragment (on the right-hand side), which respectively exist
in the local systems of Eau Claire County and of Madison County. As we can observe, even
though the local XML sources present different structures and naming conventions, they share
a common domain with closely related meanings (or semantics), thus being ideal candidates for
an integration system.
The previous examples display syntactic homogeneity in that they both use XML but have
different structures, therefore displaying schematic heterogeneity. They may also encode their
DTD of S1:
    <?xml encoding="ISO-8859-1"?>
    <!ELEMENT LandUse (LandParcel)>
    <!ELEMENT LandParcel (AREA, BROAD, LU1, LU2, LU3, ..., JurisType, JurisName)>
    <!ELEMENT AREA (#PCDATA)>
    <!ELEMENT BROAD (#PCDATA)>
    <!ELEMENT LU1 (#PCDATA)>
    ......
    <!ELEMENT JurisType (#PCDATA)>
    <!ELEMENT JurisName (#PCDATA)>

XML fragment of S1:
    <LandUse>
      <LandParcel>
        <AREA>1704995.587470</AREA>
        <BROAD>A</BROAD>
        <LU1>AF</LU1>
        ......
        <JurisType>County</JurisType>
        <JurisName>EauClaire</JurisName>
      </LandParcel>
      ......
    </LandUse>

a) Local XML data source S1 of Eau Claire County.

DTD of S2:
    <?xml encoding="ISO-8859-1"?>
    <!ELEMENT LandUse (LandParcel)>
    <!ELEMENT LandParcel (AREA, LAND_USE, PARCEL_ID, ..., JurisType, JurisName)>
    <!ELEMENT AREA (#PCDATA)>
    <!ELEMENT LAND_USE (#PCDATA)>
    <!ELEMENT PARCEL_ID (#PCDATA)>
    ......
    <!ELEMENT JurisType (#PCDATA)>
    <!ELEMENT JurisName (#PCDATA)>

XML fragment of S2:
    <LandUse>
      <LandParcel>
        <AREA>1007908.5</AREA>
        <LAND_USE>9100</LAND_USE>
        <PARCEL_ID>246710</PARCEL_ID>
        ......
        <JurisType>County</JurisType>
        <JurisName>Madison</JurisName>
      </LandParcel>
      ......
    </LandUse>

b) Local XML data source S2 of Madison County.

Figure 37. Local XML land use data sources.
instances or values in different ways, thus displaying semantic heterogeneity, in the sense that
the same values may represent different meanings and that different values may have the same
meaning (111). Our discussion elaborates further on both kinds of heterogeneities. In the example shown in Figure 37, we see that the two source schemas overlap on most elements and both have the same nesting depth.

TABLE VII
SEMANTIC HETEROGENEITY RESULTING FROM DIFFERENT ENCODINGS OF LAND USE DATA.

Local Source             Element Name   Land Use Type Value   Description
Dane County RPC          Lucode         91                    Cropland Pasture
Racine County (SEWRPC)   Tag            811                   Cropland
                                        815                   Pasture and Other Agriculture
Eau Claire County        Lu1            AA                    General Agriculture
City of Madison          Land use       8110                  Farms

However, the elements of the land use codes are represented
differently in the two schemas: the schema S1 uses four elements broad, lu1, lu2, and lu3,
whereas S2 only uses a single element, namely land use. Furthermore, the values of such land
use codes (in the XML instances) are encoded in different ways, i.e., characters for S1 and
numbers for S2.
Land use codes in WLIS stand for land use types (or categories) and include, for example,
agriculture, commerce, industry, institutions and residences. Besides using different names in
different local source schemas, such land use codes use different classification schemes, thus
resulting in semantic heterogeneities across the local source schemas. This is illustrated by Table VII, where there are four element names (Lucode, Tag, Lu1, and Land use) from four different classification schemes. The descriptions in the table show that different values represent
closely related land use types.
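The reconciliation that such encodings call for can be pictured as a small translation table. The sketch below transcribes the (element name, value) pairs from Table VII; the single global category name "Cropland/Pasture" is our assumption for illustration, since the table's descriptions are closely related rather than identical.

```python
# Sketch of instance-level reconciliation of the land use codes in Table VII.
# The global category name "Cropland/Pasture" is assumed; the local
# (element name, value) pairs are transcribed from the table.

local_to_global = {
    ("Lucode", "91"):     "Cropland/Pasture",  # Dane County RPC
    ("Tag", "811"):       "Cropland/Pasture",  # Racine County (SEWRPC)
    ("Tag", "815"):       "Cropland/Pasture",  # Racine County (SEWRPC)
    ("Lu1", "AA"):        "Cropland/Pasture",  # Eau Claire County
    ("Land_use", "8110"): "Cropland/Pasture",  # City of Madison
}

def to_global(element_name, value):
    """Translate a local land use code into its global category, if mapped."""
    return local_to_global.get((element_name, value))

print(to_global("Lu1", "AA"))   # the Eau Claire code maps to the global category
print(to_global("Lu1", "RS"))   # unmapped codes yield None
```

A real agreement file would store such correspondences in RDF/XML rather than as a Python dictionary; the lookup structure, however, is the same.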
In order to integrate the distributed heterogeneous local geospatial data like in WLIS, it
is necessary to overcome data heterogeneity, which originates from having different state and
federal agencies involved in acquiring and storing geospatial data. The ontology-based solutions
to the data heterogeneity problem use either a single ontology approach, a multiple ontology
approach, or a hybrid approach.
In a single ontology approach such as SIMS (7) all information sources are directly related
to a shared global ontology. This approach requires that all sources provide nearly the same
view on a domain. In a multiple ontology approach such as OBSERVER (81) each informa-
tion source is described by its own ontology. It needs an additional representation formalism
defining the inter-ontology mapping between each pair of separate ontologies. Instead, we use a
hybrid approach. In our approach, a local ontology is generated for each local XML source that
represents its schema. In addition, a global ontology is defined to act as an integrated view and
a uniform access interface of the distributed data sources. Every local ontology is mapped to
this global ontology, by establishing the correspondences of their elements and attributes, which
results in an “agreement” on the local names. In addition to this schema level reconciliation,
it is also necessary to have a global land use taxonomy, to which the local land use taxonomies
are mapped, so as to achieve a common understanding of the semantics of the land use codes
used in local sources. All ontologies are represented using RDF and RDFS.
6.4 Architecture
In this section, we discuss the architecture (as shown in Figure 38) of our ontology-based
approach for heterogeneous geospatial data integration. We focus on ontology alignment and
[Figure content: the global ontology G, with its own user query interface, is connected to each local ontology O_i through an agreement M_i produced by the Agreement Maker. Each local site holds an XML source S_i together with its DTD D_i and land use hierarchy H_i; a local transformation derives from them the local ontology O_i, its instance I_i, and the local land use ontology. Each local site also offers a user query interface. Edge types in the figure distinguish ontology mappings, local transformations, local query processing, and global query processing.]

Figure 38. The ontology-based architecture.
briefly discuss query processing, leaving most of that discussion to Section 6.6.
6.4.1 Schema Transformation and Ontology Mapping
The ontology-based data integration process contains two steps: schema transformation and
ontology alignment. In the first step, for each local source, we transform the local DTD schema
into a local RDFS ontology, the XML instances under this schema into instances of the local
ontology, and the XML taxonomy of land use categories into an equivalent RDFS taxonomy
(as part of the local ontology). In the second step, we map a local RDFS ontology and the
local land use taxonomy to the global ontology and its land use taxonomy, respectively. The
mappings are then stored in an agreement file to be used for query processing.
The global ontology in our system has two roles: (1) It provides the user with access to the
data with a uniform query interface to facilitate the formulation of a query on all the XML
sources; (2) It serves as the mediation mechanism for accessing the distributed data through
any of the XML sources.
6.4.2 Query Processing
Depending on the particular role of the global ontology in the architecture, we distinguish
the following two query processing cases:
Global-to-local query processing The query is posed on the global ontology, which acts as a
uniform interface to access the distributed data sources. The global query is rewritten into
multiple subqueries over individual local ontologies in local systems, where the subqueries
are executed. The answers to these subqueries are then returned to the global interface
and integrated to form the answer to the global query.
Local-to-local query processing As an autonomous system, the local system can accept queries from the user and answer them by forwarding the queries to other local data
sources through the global ontology. This case of query answering is similar to that in
peer-to-peer systems, in the sense that it can propagate the query to one or more peers,
or simply propagate the query to all of them (46).
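The local-to-local case can be sketched as a composition of two agreements through the global ontology. This is a simplification, not the thesis's algorithm: mappings are reduced to name correspondences in dictionaries, the property names for S1 and S2 follow the running example, and the global names are assumed.

```python
# Local-to-local rewriting by composing two agreements through the global
# ontology G. Mappings are simplified to name correspondences; the local
# names follow the running example, the global names are assumptions.

m1_to_global = {            # agreement M1: local ontology O1 -> G
    "LandParcel": "LandParcel",
    "lu1": "landUseCode",
    "jurisName": "jurisdictionName",
}
m2_to_global = {            # agreement M2: local ontology O2 -> G
    "LandParcel": "LandParcel",
    "Land_use": "landUseCode",
    "jurisName": "jurisdictionName",
}

def rewrite_local_to_local(term, source_map, target_map):
    """Rewrite a term of one local ontology into the other, via G."""
    global_term = source_map[term]                        # source -> G
    global_to_target = {g: t for t, g in target_map.items()}
    return global_to_target[global_term]                  # G -> target

print(rewrite_local_to_local("lu1", m1_to_global, m2_to_global))  # -> Land_use
```

The composition relies on the bidirectionality of the mappings: each agreement is inverted as needed, so the same two files support both directions of rewriting.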
The agreement files are the basis for both cases of query processing. The rewriting of
a query from one ontology to another needs to refer to the relevant agreement file to find the concepts corresponding to the ones being queried. However, the mappings between a local
ontology and the global ontology and those between the local land use taxonomy and the global
land use taxonomy are used differently for query rewriting. We choose to use RQL (RDF Query
Language (65)) to express queries on ontologies. We discuss query processing in more detail in
Section 6.6.
6.5 Ontology Mapping
6.5.1 Schema Transformation
The first step of the integration of XML geospatial data sources is the transformation from
the XML source schema and data to an RDFS ontology and to RDF data. Due to the document
structure of XML, we may need to extend the RDFS vocabulary so as to be able to encode the
structure, which would otherwise be lost. In this chapter, we focus on the nesting structure,
while ignoring other implicit information such as the order information.
When the nesting structure represents a type hierarchy of elements (e.g., the land use taxonomy), the RDFS property rdfs:subClassOf will be adequate to model such information, where
XML elements are represented by RDFS classes. However, it is common that nesting represents
an implicit relationship between two XML elements (e.g., the ownedBy relationship between an element Owner and its child element LandParcel). In this case, in order to preserve the nesting structure in the local ontology, we introduce a new RDF property, namely contained, which is defined in the namespace rdfx. That is, while still representing the two XML elements using
RDFS classes, we use contained to connect the child-element class to the parent-element class.
Below we show the RDF/XML syntax for the contained property.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdfx="http://www.example.org/rdf-extension#">
  <rdf:Property rdf:about="http://www.example.org/rdf-extension#contained">
    <rdfs:isDefinedBy rdf:resource="http://www.example.org/rdf-extension#"/>
    <rdfs:label>contained</rdfs:label>
    <rdfs:comment>The containment between two classes.</rdfs:comment>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <rdfs:domain rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Property>
</rdf:RDF>
Elements and attributes are the two basic building blocks of a DTD. There are two types of elements: simple-type elements, which cannot have child elements or carry attributes, and complex-type elements, which can have child elements and/or carry attributes. On the other hand, all attribute declarations
must reference simple types since attributes cannot contain other elements or other attributes.
A well-formed XML document contains the hierarchical structure of elements and attributes
with the following kinds of relationships: 1) element and attribute relationship, where only
complex-type elements can carry attributes and attributes can only be of simple types, and 2)
element and sub-element relationship, where only complex-type elements can allow elements as
their children, but child elements can be either simple types or complex types.
Taking into account XML elements, attributes and their relationships, the transformation
from XML to RDF can further include element-level transformation and structure-level trans-
formation, as follows:
Element-level transformation Element-level transformation defines the basic classes and
properties of the local RDFS ontology according to the transformation correspondences
shown in Table VIII, with the structural relationships between the elements not being
considered for the time being. No new RDF metadata need be defined here because
rdfs:Class and rdf:Property are sufficient to express classes and properties. For
instance, to transform the DTD of S1 in Figure 37, we define two classes: LandUse and
LandParcel for the elements with the same name. The other elements become properties
of LandParcel, because they are simple-type subelements.
TABLE VIII
ELEMENT-LEVEL SCHEMA TRANSFORMATION

XML Schema concepts    RDF Schema concepts
Attribute              Property
Simple-type element    Property
Complex-type element   Class
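The element-level correspondences can be sketched as a small classification function. This is only an illustrative sketch: the DTD of source S1 (Figure 37) is abbreviated to a dictionary from element names to child lists, which sidesteps real DTD parsing.

```python
# Element-level transformation following Table VIII: complex-type elements
# become RDFS classes; simple-type elements and attributes become properties.
# The DTD of S1 (Figure 37) is abbreviated to a dict: element -> children.

dtd_s1 = {
    "LandUse": ["LandParcel"],
    "LandParcel": ["AREA", "BROAD", "LU1", "JurisType", "JurisName"],
    "AREA": [], "BROAD": [], "LU1": [], "JurisType": [], "JurisName": [],
}

def element_level_transform(dtd, attributes=()):
    """Classify DTD constructs into RDFS classes and properties."""
    classes = {e for e, children in dtd.items() if children}         # complex types
    properties = {e for e, children in dtd.items() if not children}  # simple types
    properties |= set(attributes)                                    # attributes
    return classes, properties

classes, properties = element_level_transform(dtd_s1)
print(sorted(classes))     # LandUse and LandParcel become classes
print(sorted(properties))  # the simple-type subelements become properties
```

Running this on the abbreviated DTD reproduces the outcome described in the text: LandUse and LandParcel become classes, and the remaining elements become properties.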
Structure-level transformation Structure-level transformation encodes the nesting struc-
ture of the XML schema into the local RDFS ontology (39). In particular, the nesting
[Figure content: a) the local RDFS ontology O1 for source S1 contains the classes LandUse and LandParcel, connected by rdfx:contained; LandParcel has the properties area, broad, lu1, lu2, lu3, jurisType, and jurisName, each with range Literal; its land use taxonomy is rooted at LandUseTag, with subclasses such as A and R, related by rdfs:subClassOf. b) the local RDFS ontology O2 for source S2 is analogous, with the properties area, Land_use, parcel_id, jurisType, and jurisName, and a land use taxonomy rooted at LandUseType, with subclasses such as 1, 9, 11, 12, 19, 91, 910, and 9100.]

Figure 39. An example of local RDFS ontologies.
may occur between two complex-type elements or between a complex-type element and its
child (as a simple-typed element). Following the element-level transformation, the nest-
ing structure in the former case corresponds to a class-to-class relationship between two
RDFS classes, which are connected by the property rdfx:contained. In the latter case,
the XML nesting structure corresponds to the class-to-literal relationship in the local on-
tology, with the class and the literal connected by the corresponding property. Table IX
lists the correspondences between the XML elements and the classes or properties in the
local RDFS ontology.
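The structure-level step can be rendered as the generation of triples: nesting between two complex-type elements yields an rdfx:contained triple between the corresponding classes, while a simple-type child yields a class-to-literal property. The sketch below reuses the abbreviated DTD representation from before; the direction of the contained triple (child as subject) and the property-naming helper are our assumptions for illustration.

```python
# Structure-level transformation: encode XML nesting as RDF-style triples.
# complex-to-complex nesting -> (child class, rdfx:contained, parent class)
#   (direction assumed: the child-element class is connected to the parent);
# complex-to-simple nesting  -> (parent class, property, "Literal").

dtd_s1 = {
    "LandUse": ["LandParcel"],
    "LandParcel": ["AREA", "BROAD", "LU1", "JurisType", "JurisName"],
}

def property_name(tag):
    # Naming convention observed in Table IX: all-caps tags are lowercased
    # (AREA -> area), CamelCase tags get a lower-case initial
    # (JurisType -> jurisType). A simplification, not a general rule.
    return tag.lower() if tag.isupper() else tag[0].lower() + tag[1:]

def structure_level_transform(dtd):
    complex_types = set(dtd)          # elements with declared children
    triples = []
    for parent, children in dtd.items():
        for child in children:
            if child in complex_types:
                triples.append((child, "rdfx:contained", parent))
            else:                     # simple-type child becomes a property
                triples.append((parent, property_name(child), "Literal"))
    return triples

for t in structure_level_transform(dtd_s1):
    print(t)
```

On the abbreviated DTD of S1 this produces one rdfx:contained triple (LandParcel inside LandUse) and one class-to-literal triple per simple-type subelement, matching the graph of Figure 39.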
As an example, Figure 39 shows the local ontologies (represented as graphs where nodes are classes and edges are properties) transformed from the XML schemas in Figure 37. The land use taxonomies are transformed into a hierarchy of classes and incorporated as part of the local ontologies, rooted at LandUseTag and LandUseType, respectively.

TABLE IX
MAPPINGS BETWEEN XML SOURCE SCHEMA D1 AND LOCAL ONTOLOGY O1

XPath expressions in D1          RDF expressions in O1
/LandUse                         LandUse
/LandUse/LandParcel              LandParcel
/LandUse/LandParcel/AREA         LandParcel.area
/LandUse/LandParcel/BROAD        LandParcel.broad
/LandUse/LandParcel/LU1          LandParcel.lu1
...                              ...
/LandUse/LandParcel/JurisType    LandParcel.jurisType
/LandUse/LandParcel/JurisName    LandParcel.jurisName
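For query processing, the correspondences of Table IX act as a lookup table from path expressions to ontology terms. A minimal sketch: the dictionary below transcribes the table, and the dotted notation is kept as a plain string.

```python
# Rewriting path expressions of D1 into RDF expressions of O1 using the
# correspondences of Table IX (transcribed literally as a lookup table).

d1_to_o1 = {
    "/LandUse": "LandUse",
    "/LandUse/LandParcel": "LandParcel",
    "/LandUse/LandParcel/AREA": "LandParcel.area",
    "/LandUse/LandParcel/BROAD": "LandParcel.broad",
    "/LandUse/LandParcel/LU1": "LandParcel.lu1",
    "/LandUse/LandParcel/JurisType": "LandParcel.jurisType",
    "/LandUse/LandParcel/JurisName": "LandParcel.jurisName",
}

def rewrite_xpath(xpath):
    """Translate an XPath of D1 into the corresponding RDF expression of O1."""
    try:
        return d1_to_o1[xpath]
    except KeyError:
        raise ValueError(f"no mapping for {xpath!r}")

print(rewrite_xpath("/LandUse/LandParcel/LU1"))  # -> LandParcel.lu1
```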
6.5.2 Ontology Alignment
The ontology alignment process takes as input a local ontology obtained using the previously
described transformation, and the global ontology. It then produces as output an agreement
containing the class and property correspondences between the two ontologies.
When performing the alignment, we must consider two cases, which correspond to the schema and taxonomy components in the global ontology and local ontologies (see Figure 39):
1) the schema-level mapping between the schema parts of two ontologies, where a concept (or a
role) of one ontology is mapped to a concept (or a role) of another ontology, and 2) the instance-
level mapping, where two corresponding concepts use two different classification schemes for
their instances, e.g., land use codes with different underlying taxonomies in WLIS.
Ontology alignment is in general a challenging task, with its degree of difficulty depending on the types of ontologies being considered (106). For instance, the mapping between two taxonomies consisting of only subClassOf relationships (i.e., the instance-level mapping in our setting) is believed to be much simpler than the one between two non-taxonomies containing various properties and relationships (i.e., the schema-level mapping). In this chapter, we primarily discuss the mapping between two (land use) taxonomies, and propose an automatic alignment algorithm that can deduce new mappings from existing ones based on certain rules.
6.5.2.1 Mapping Types
Figure 40 shows a fragment of two concrete land use taxonomies: the one on the left hand
side is from the local ontology O1 in Eau Claire County (as depicted in Figure 39), and the one
on the right hand side is from the global ontology G.
The two taxonomies are rooted at LandUseTag and LandUseCode, respectively. A
node in each taxonomy represents a class of land use, where the label contains its description
and its code (in parentheses). The dashed lines represent the mappings, that is, class
correspondences established based on the semantics of the classes. We consider the following
types of ontology mappings:
Semantic relationship Considering a set-theoretic semantics, the mapping between two classes
A and B (seen as two sets of instances) can be classified into five categories: superclass,
subclass, equivalent, approximate (or overlapping), and disjoint, respectively, A ⊇ B,
[Figure 40 graph: a) the land use taxonomy in local ontology O1, rooted at LandUseTag, with top-level classes Agricultural (A), Residential (R), Commercial (C), Industrial (I), and Public/Institutional (P); under Residential (R): Single Family Residences (RS), Mobile Home Parks (RSP), Non-mobile Home Parks (RSP), Duplexes (RD), Triplexes (RT), Multiple Family Dwellings having 4 units or more (RM), Home Occupations (RO), Vacant residential parcels (RV), and Parking Lots (RZ); under Agricultural (A): Cropland/pasture (AC) and Non-pasture (AN). b) The land use taxonomy in global ontology G, rooted at LandUseCode, with top-level classes Residential (1), Commercial (2), Industrial (3), Communication (4), Transportation (5), Institutional/Governmental (6), and Agricultural (9); under Residential (1): Single Family (111), Two Family (113), Multiple Family (115), Other Single Family (140), Mobile Homes (142), Seasonal Residential (190), and Other Residential (199); under Agricultural (9): Pasture (91) and Other (99). Dashed lines between the two taxonomies denote the ontology mappings.]

Figure 40. An example of mapping between two land use taxonomies. The labels over the
edges represent mapping types, followed (in parentheses) by the deduction rule(s)
that can be applied, if any.
A ⊆ B, A = B, (A ∩ B ≠ ∅) ∧ (A − B ≠ ∅) ∧ (B − A ≠ ∅), and A ∩ B = ∅. We will not
consider the category approximate in our query rewriting algorithm for reasons that will
be apparent later (but we will discuss in Section 6.6 ways in which this category can be
incorporated in query answering). We will also not consider disjoint mappings, as they
are not useful for query answering.
Cardinality Class correspondences are established pairwise between two ontologies (producing
one-to-one mappings). However, it is possible that a class from one ontology is mapped
to multiple classes from the other ontology, in a one-to-many mapping, or that multiple
classes are mapped to a single class, in a many-to-one mapping. To express such mappings,
we consider the union of the classes to which a single class (in the other ontology) maps.
For example, given two mappings A = B and A = C, we have that A = B ∪ C. This issue
will be further discussed in the query processing section.
Coverage We distinguish two types of mappings: fully covered and partially covered. Let
C and C′ be two classes to be mapped, such that C1, ..., Cm are subclasses of C, and
C′1, ..., C′n are subclasses of C′. We say that C (resp. C′) is fully covered if for each
child Ci ∈ {C1, ..., Cm} (resp. for each child C′j ∈ {C′1, ..., C′n}) there is a non-empty
subset of {C′1, ..., C′n} to which Ci is mapped (resp. a non-empty subset of
{C1, ..., Cm} to which C′j is mapped).
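Under this set-theoretic reading, the five semantic-relationship categories can be determined mechanically when two classes are viewed as instance sets. The following is a minimal illustrative sketch (the function and category names are our own, not part of the system):

```python
def classify(A, B):
    """Classify the mapping between classes A and B, viewed as instance sets,
    into the five set-theoretic categories."""
    A, B = set(A), set(B)
    if A == B:
        return "equivalent"   # A = B
    if A > B:
        return "superclass"   # A is a proper superset of B
    if A < B:
        return "subclass"     # A is a proper subset of B
    if A & B:
        return "approximate"  # overlapping: both differences non-empty
    return "disjoint"         # A and B share no instances
```

For instance, classify({"RT", "RM"}, {"RT"}) yields "superclass", corresponding to A ⊇ B.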
6.5.2.2 Deduction Process
In our approach, the ontology mapping process is performed semi-automatically, that is,
partly established manually by the user and partly obtained automatically using an inference
process based on deduction.
This semi-automatic ontology mapping process follows two principles: (1) The deduction of
the mapping between two nodes (one from each of the taxonomies being mapped) is determined
by the mappings between their child nodes. In other words, the mapping between two ontologies
is performed in a level-wise fashion, driven by inference rules that are defined based on
the mapping semantics. (2) User intervention is needed in two cases: when there is insufficient
information to determine the type of the mapping between two nodes (for example, when some
of the children of one node have not been mapped), or when there is conflicting information (for
example, when a node is inferred to be both a superset and a subset of the corresponding node).
We make the complete-partition assumption: for any class C in the taxonomy, its subclasses
C1, ..., Cn together form a complete partition of the class, that is, C = C1 ∪ ... ∪ Cn. For
instance, in the global taxonomy depicted in Figure 40, the two child classes Pasture(91) and
Other(99) of the Agricultural(9) class form a complete partition of Agricultural(9), since
Other(99) includes all agricultural lands that are not used for pasture.
We consider the following deduction rules:
Definition 6.1 (Deduction rules) Let C and C′ be two fully covered classes, and C1, ..., Cm
and C′1, ..., C′n be the subclasses of C and C′, respectively. Then, the mapping between C and
C′ can be obtained according to the following rules:

1) C = C′, if each Ci ∈ {C1, ..., Cm} is mapped to some k-element subset C′′ = {C′′1, ..., C′′k}
of {C′1, ..., C′n} (1 ≤ k ≤ n), such that Ci = C′′1 ∪ ... ∪ C′′k.

2) C ⊆ C′, if each Ci ∈ {C1, ..., Cm} is mapped to some k-element subset C′′ = {C′′1, ..., C′′k}
of {C′1, ..., C′n} (1 ≤ k ≤ n), such that Ci = C′′1 ∪ ... ∪ C′′k or Ci ⊆ C′′1 ∪ ... ∪ C′′k.

3) C ⊇ C′, if each Ci ∈ {C1, ..., Cm} is mapped to some k-element subset C′′ = {C′′1, ..., C′′k}
of {C′1, ..., C′n} (1 ≤ k ≤ n), such that Ci = C′′1 ∪ ... ∪ C′′k or Ci ⊇ C′′1 ∪ ... ∪ C′′k.
The deduction rules in Definition 6.1 can be proved sound and complete by induction, using
the set-theoretic semantics of each rule, under the complete-partition assumption and the
assumption that the user-defined mappings are semantically correct.
User intervention is needed in all other cases. The above rules assume a full mapping
between C and C ′. However, they still hold for the case of a partial mapping, provided that
we define the following supplemental rule: 4) Suppose that a class C is partially covered by C ′,
and that S is the subset of subclasses of C that are not mapped to any children of C ′. Then, we
create a temporary and empty subclass ⊥ of C ′, and add a superclass mapping from each class
in S to ⊥.
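Rules 1-3 together with the supplemental rule 4 can be sketched as a single check over the mappings of one class's children (a minimal Python illustration under the complete-partition assumption; the function name and the "=", "<=", ">=" relation encoding are our own, not the thesis implementation):

```python
def deduce_mapping(children_C, children_C2, child_maps):
    """Deduce the mapping type between C and C' from the mappings of C's
    children (rules 1-3 of Definition 6.1, plus supplemental rule 4).

    child_maps: for each mapped child of C, a pair (relation, targets), where
    relation is "=", "<=" (subclass), or ">=" (superclass), and targets is the
    subset of C'-children whose union the child is compared against.
    Returns "=", "<=", ">=", or None (user intervention needed).
    """
    relations = set()
    for c in children_C:
        if c not in child_maps:
            # Rule 4: an unmapped child is treated as a superclass of a
            # fresh, empty class ⊥ added under C'.
            relations.add(">=")
            continue
        rel, targets = child_maps[c]
        if not set(targets) <= set(children_C2):
            return None  # malformed mapping: targets outside C''s children
        relations.add(rel)
    if not relations:
        return None
    if relations == {"="}:
        return "="       # rule 1: C = C'
    if relations <= {"=", "<="}:
        return "<="      # rule 2: C ⊆ C'
    if relations <= {"=", ">="}:
        return ">="      # rule 3: C ⊇ C'
    return None          # conflicting evidence: ask the user
```

For example, if every child of a class maps with "=" or "<=" into the other taxonomy, rule 2 fires; a single unmapped child forces ">=" into the relation set, so only rule 3 (or user intervention) can apply.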
In Figure 40, the symbols and numbers (in parentheses) over the dashed lines (i.e.,
the class correspondences) indicate the mapping type and the adopted inference rule(s), re-
spectively. For example, "⊆ (2, 4)" over the mapping between the classes Residential(R) and
Residential(1) means that Residential(R) is a subclass of Residential(1), which is com-
puted by rules 2 and 4. The application of rule 4 is due to the fact that SeasonalResidential
(190) is unmapped, thus making Residential(1) partially covered.
We note that the inference on ontology mappings, as discussed above, occurs between two
ontologies (one being a local ontology and the other the global ontology). In our system,
there is another case that also requires reasoning. We will discuss this issue in more detail in
Section 6.6.
6.5.2.3 Mapping Representation
As a result of matching two ontologies, an agreement file is created to store the ontology
mappings. A number of methods have been proposed to represent ontology mappings in different
situations, e.g., using axioms or using a meta-ontology (74; 115). In our system, it is natural
to use RDFS as the language for mapping representation, given that all ontologies are
represented in RDF and RDFS. Moreover, as we will see in the next section, storing ontology
mappings in RDF has certain advantages for query processing.
Given the three types of ontology mapping semantics and the multiple inheritance of RDFS
classes, it is sufficient to use the RDFS property rdfs:subClassOf to represent all types
of ontology mappings. As for the non-taxonomical parts of the two ontologies, different kinds of
mappings can also be established between two properties, namely superproperty, subproperty,
or equivalence mappings. Similarly, we can use the RDFS property rdfs:subPropertyOf to
represent these property mappings. Figure 41 shows an example of the mappings between the
global ontology G and the local ontology O1 (also shown in Figure 39). The graph shows a
fragment of the mappings (indicated by the dashed lines) between the schema components of
both ontologies and the text shows a fragment of the corresponding mapping representation in
RDFS.
[Figure 41 graph: a) local ontology O1 for local source S1, with class LandParcel (connected to LandUse via rdfx:contained) and properties area, broad, lu1, lu2, lu3, jurisType, jurisName, ..., each with range Literal; b) global ontology G, with classes Land, LandParcel (a subclass of Land), and Owner, with properties such as area, luCode, jurisType, jurisName, and ownedBy, plus the Owner properties name, dob, ssn, and gender. Dashed lines indicate equivalence mappings between corresponding classes and properties of the two ontologies.]
<!DOCTYPE rdf:RDF [
  <!ENTITY G "urn:ontologies-advis-lab:global-ontology#">
  <!ENTITY O1 "urn:ontologies-advis-lab:local-ontology-1#">
  <!ENTITY O2 "urn:ontologies-advis-lab:local-ontology-2#"> ]>
<rdfs:Class rdf:about="&G;LandParcel">
  <rdfs:subClassOf rdf:Class="&O1;LandParcel"/>
</rdfs:Class>
<rdfs:Class rdf:about="&O1;LandParcel">
  <rdfs:subClassOf rdf:Class="&G;LandParcel"/>
  <rdfs:subClassOf rdf:Class="&G;Land"/>
</rdfs:Class>
...
<rdfs:Class rdf:about="&O1;RT">
  <rdfs:subClassOf rdf:Class="&G;115"/>
</rdfs:Class>
<rdfs:Class rdf:about="&O1;RM">
  <rdfs:subClassOf rdf:Class="&G;115"/>
</rdfs:Class>
...
Figure 41. A fragment of ontology mappings represented in RDFS.
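Because RDFS allows multiple inheritance, each mapping type compiles into one or two rdfs:subClassOf statements, with equivalence represented as mutual subsumption (as between G:LandParcel and O1:LandParcel in Figure 41). A hypothetical helper illustrating this compilation (not the system's code):

```python
def mapping_triples(a, b, relation):
    """Compile a class mapping between a and b into rdfs:subClassOf triples.
    relation is "<=" (a subclass of b), ">=" (a superclass of b), or "="
    (equivalence, expressed as mutual subsumption)."""
    if relation == "<=":
        return [(a, "rdfs:subClassOf", b)]
    if relation == ">=":
        return [(b, "rdfs:subClassOf", a)]
    if relation == "=":
        return [(a, "rdfs:subClassOf", b), (b, "rdfs:subClassOf", a)]
    raise ValueError("unsupported mapping type: " + relation)
```

Property mappings follow the same pattern with rdfs:subPropertyOf in place of rdfs:subClassOf.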
6.6 Query Processing
6.6.1 Query Languages
In WLIS, users can pose queries either on the global ontology, as a global query, or over any
of the integrated local sources, as a local query. A typical query such as "Where are all the
multiple family land parcels in Wisconsin?" would be relatively straightforward when using one
single local source, whose schema and taxonomy are familiar to the user, but much more difficult
when posed over a large set of local data sources, as users would have to know the schema and
taxonomy of each local data source and manually rewrite their queries for each one. In this
section, we describe how such queries can be automatically rewritten by our integrated system,
using the agreement files that are generated by the alignment process.
Among the many query languages for RDF data access, we use RQL (RDF Query Language),
a typed language following a functional approach (32). RQL is defined by a set of basic
queries and select-from-where (sfw) filters, which can be used to express meta-schema, schema,
and data queries. The sfw filters contain generalized path expressions and can be nested to
form more complex queries. For example, the above query, if posed over the global ontology,
can be expressed by the following RQL query:
SELECT a, b, c
FROM {$x}xyCoordinates{a}, {$x}bounding{b}, {$x}jurisName{c},
{$x}state{d}, {$x}luCode{e}
WHERE d = "Wisconsin" and e = "115"
This query is in the form of an sfw filter, consisting of the SELECT, FROM, and WHERE clauses.
The SELECT clause defines a projection over the variables of interest. In the FROM clause, we
use basic schema path expressions composed of the property name (e.g., bounding) and data
variables (e.g., $x) or class variables (e.g., a). The condition in the WHERE clause filters the
answers. We focus on a particular subset of RQL, namely conjunctive RQL (c-RQL), which is
of the following form:
ans(x) :– R1(x1), ..., Rn(xn).
where x ⊆ x1 ∪ ... ∪ xn are variables or constants, and Ri(xi) (i ∈ [1..n]) stands for a class
predicate C(x) or a property predicate P (x, y). As usual, ans(x) is the head of the query,
denoted headQ, and R1(x1), ..., Rn(xn) is the body of the query, denoted bodyQ. Most RQL
queries can be expressed in c-RQL. For instance, the RQL query on multiple family land parcels
can be expressed in c-RQL as follows:
ans(a, b, c) :– xyCoordinates(x, a), bounding(x, b), jurisName(x, c),
state(x, "Wisconsin"), luCode(x, "115")
where for each path expression in the RQL query we use a corresponding predicate (e.g.,
xyCoordinates(x, a) for {$x}xyCoordinates{a}).
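The semantics of such a conjunctive body (a join of class and property predicates sharing variables) can be illustrated with a naive backtracking evaluator over a set of RDF-like facts. This is an illustrative sketch under our own atom encoding, not part of the system:

```python
def eval_conjunctive(body, facts):
    """Evaluate a conjunctive query body over a list of facts.
    Atoms and facts are (predicate, (arg, ...)) pairs; arguments starting
    with '?' are variables. Returns the list of satisfying bindings."""
    def unify(args, tup, env):
        env = dict(env)
        for a, v in zip(args, tup):
            if a.startswith("?"):
                if env.get(a, v) != v:
                    return None       # variable already bound differently
                env[a] = v
            elif a != v:
                return None           # constant mismatch
        return env

    def solve(atoms, env):
        if not atoms:
            yield env
            return
        pred, args = atoms[0]
        for p, tup in facts:
            if p == pred and len(tup) == len(args):
                env2 = unify(args, tup, env)
                if env2 is not None:
                    yield from solve(atoms[1:], env2)

    return list(solve(list(body), {}))
```

For instance, evaluating the body state(?x, "Wisconsin"), luCode(?x, "115") over a toy fact set binds ?x to exactly those parcels satisfying both predicates.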
6.6.2 Query Rewriting and Answering
The query processing across the whole system can be performed in two directions, i.e.,
global-to-local and local-to-local. We propose a query rewriting algorithm, QueryRewriting,
which can be used in both cases. Query rewriting can be seen as a function Q′ = f(Q,M),
Algorithm QueryRewriting (Q, M)
Input: a conjunctive query Q over ontology O; the mappings M between ontologies O and O′.
Output: a union 𝒬 of conjunctive queries Q′ over O′.
1  headQ′ = headQ; bodyQ′ = null;
2  Q∗ = QueryExpand(Q, Σ), where Σ is the set of constraints over O;
3  Let φ be bodyQ∗;
4  Let M1 be the part of schema-level mappings in M;
5  For each R(x) of φ
6    For each ψ ∈ M1
7      Let R′(x′) be the result of applying ψ on R(x);
8      bodyQ′ = R′(x′) ∧ bodyQ′;
9  Q′ = QueryExpand(Q′, Σ′), where Σ′ is the set of constraints over O′;
10 Let M2 be the part of instance-level mappings in M;
11 𝒬 = ConstantMapping(Q′, M2);
12 Return 𝒬;
Figure 42. The QueryRewriting algorithm.
where Q is the query to be rewritten, called source query, M is the set of ontology mappings,
and Q′ is the resulting query, called target query. The algorithm is shown in Figure 42.
In the global-to-local case, the source query Q is a query posed on the global ontology G,
M is the set of mappings from G to every local ontology O1, ..., On, and the target query Q′ is
the union of multiple subqueries over O1, ..., On. In the local-to-local case, Q is a local query
posed on a local ontology Oi (i ∈ [1..n]), M is the set of mappings from Oi to one or more local
ontologies Oj (j ∈ [1..n] and j ≠ i), and Q′ is the union of multiple subqueries over all Oj. In
the latter case, M is, in fact, a set of compositions of the mappings from Oi to G with those
from G to Oj .
In the rest of this section, we describe in detail the four main steps of the QueryRewriting
algorithm: 1) expanding the source query using the source ontology constraints, 2) rewriting the
expanded source query into an intermediate target query using the schema-level mappings, 3)
expanding the intermediate target query using the target ontology constraints, and 4) mapping
the constants of the expanded intermediate target query to obtain the final target query using
instance-level mappings.
6.6.2.1 Query Expansion
In the above description of the QueryRewriting algorithm, we notice that both the source
query Q and the intermediate target query Q′ are expanded using the ontology constraints,
respectively in Line 2 and Line 9. This query expansion process, as described by the QueryEx-
pand function of Figure 43, uses the strategy of applying the ontology constraints to “chase”
the query, similarly to the chase algorithm that is used in relational databases to compute
dependency implications or optimize queries (2). In relational databases, a database constraint
can be represented as a tgd (tuple-generating dependency) of the form ∀x (ϕ(x) → ∃y ψ(x, y)),
where ϕ and ψ are conjunctions of atoms. In an ontology setting, we consider three kinds of
constraints, namely, subclass, subproperty, and typing constraints, all of which can be repre-
sented as a tgd. Specifically, a tgd ∀x C1(x) → C2(x) corresponds to a subclass constraint
C1 ⊆ C2; a tgd ∀x∀y P1(x, y) → P2(x, y) corresponds to a subproperty constraint P1 ⊆ P2; and
a tgd ∀x∀y P (x, y) → A(x) (resp. ∀x∀y P (x, y) → B(y)) corresponds to a typing constraint
that the instances of x (resp. y) are of type A (resp. B).
Algorithm QueryExpand (Q, Σ)
Input: a conjunctive query Q over ontology O; the constraints Σ over O.
Output: the query Q after the expansion.
1 Repeat
2   Let φ be bodyQ;
3   Let ψ : R1(x) → R2(x) be any dependency in Σ;
4   If there exists a homomorphism h from R1(x) to φ, but not from R1(x) ∧ R2(x) to φ, then
5     Extend h to a new homomorphism h′ from R1(x) ∧ R2(x) to φ;
6     Add h′(R2(x)) into bodyQ;
7   Else exit repeat;
8 End repeat
Figure 43. The QueryExpand algorithm.
Similarly to the chase algorithm, QueryExpand is a non-deterministic process that termi-
nates provided that the dependencies are acyclic and the applications of dependencies do not
introduce new variables into the query. Furthermore, it has been proved that the resulting
query Q′ = QueryExpand (Q, Σ) is equivalent to Q, denoted Q ≡ Q′, meaning that the answers
to both queries are the same over all the ontology instances that satisfy the constraints (2).
As an example, let us take the preceding query on multiple family land parcels, and denote
it by Q. As specified in the global ontology G, all the properties (e.g., xyCoordinates)
referred to in Q belong to the class LandParcel, thus leading to the corresponding typing con-
straints. Such constraints can be represented by a tgd of the form ∀x∀y P(x, y) → A(x) (e.g.,
∀x∀y xyCoordinates(x, y) → LandParcel(x)). By applying them to Q, we obtain the following
expansion of Q:
ans(a, b, c) :– xyCoordinates(x, a), bounding(x, b), jurisName(x, c),
state(x, "Wisconsin"), luCode(x, "115"), LandParcel(x)
Furthermore, given that the LandParcel class is a subclass of Land in G, the corresponding
tgd of this constraint (i.e., ∀x LandParcel(x) → Land(x)) is also applicable to the above
query. The final resulting expansion Q∗ of Q is as follows:
ans(a, b, c) :– xyCoordinates(x, a), bounding(x, b), jurisName(x, c),
state(x, "Wisconsin"), luCode(x, "115"), LandParcel(x),
Land(x)
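The expansion illustrated above can be sketched as a simple chase loop. The sketch below (our own encoding of atoms and tgds) handles only tgds whose right-hand side reuses left-hand-side variables, which covers the subclass, subproperty, and typing constraints considered here; termination assumes acyclic dependencies:

```python
def query_expand(body, constraints):
    """Chase-style expansion of a conjunctive query body.

    body: set of atoms like ("luCode", ("x", "115")).
    constraints: tgds as (lhs, rhs) patterns over variables, e.g.
    (("xyCoordinates", ("u", "v")), ("LandParcel", ("u",))).
    Repeatedly applies each tgd until no new atom can be added."""
    body = set(body)
    changed = True
    while changed:
        changed = False
        for (lp, largs), (rp, rargs) in constraints:
            for pred, args in list(body):
                if pred != lp or len(args) != len(largs):
                    continue
                env = dict(zip(largs, args))   # homomorphism from the lhs
                new_atom = (rp, tuple(env[v] for v in rargs))
                if new_atom not in body:
                    body.add(new_atom)
                    changed = True
    return body
```

Applied to the example, the typing constraint adds LandParcel(x) and the subclass constraint then adds Land(x), reproducing the expansion Q∗.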
6.6.2.2 Query Mapping
The key to query rewriting lies in Lines 4 to 7 of the QueryRewriting algorithm, which maps
the expanded source query Q∗ to a new query Q′ over the target ontology, based on the set
of schema-level mappings in M . Similarly to the ontology constraints used by QueryExpand,
ontology mappings can be treated as constraints specified over the source and the target on-
tologies. Therefore, we express ontology mappings in a tgd, which, however, is used for query
mapping in a way that is different from the use of ontology constraints for query expansion.
More specifically, a tgd ψ : ∀x R1(x) → R2(x) means that the satisfaction of the predicate
R1(x) by some instance implies the satisfaction of the predicate R2(x) by that same instance.
In the following discussion, we consider two ontologies O1 and O2. If ψ represents an ontology
constraint R1 ⊆ R2, where R1, R2 ∈ O1, the instances that make ψ hold are all the instances
of O1. Therefore, the expansion of a query on O1 involving R1 and using ψ, as specified by
the QueryExpand function, does not, in fact, expand the answer to the query, even though R2
may have a larger set of instances than R1. In comparison, in the case where ψ stands for an
ontology mapping R1 ⊆ R2, where R1 ∈ O1 and R2 ∈ O2, ψ actually specifies a constraint for
the data interoperation (exchange) from O1 to O2. In this case, the instances in O1 and in O2
are usually different. This means that the instances satisfying R1(x) may not exist in O2
to satisfy R2(x), thus demanding a potential data transfer from O1 to O2. In this sense,
ψ should be applied to rewrite a query referring to R2 into a query referring to R1, not
in the opposite direction. In other words, the query rewriting from O1 to O2 should use the
mapping constraints with their dependency implication being from O1 to O2.
In particular, the application of a dependency ψ : R2(x) → R1(x) to a query Q, as Line 7
of QueryRewriting indicates, is performed by taking the converse ψ′ of ψ (i.e., R1(x) → R2(x)),
followed by the operations specified in Lines 4 and 5 of QueryExpand. The resulting R′(x′) (in
Line 8 of QueryRewriting) is then h′(R2(x)) as in Line 6 of QueryExpand. The following shows
the result of mapping Q∗ (the expanded source query) to a query Q′ on the local ontology O1
according to the mapping M as presented in Figure 41:
ans(a, b, c) :– xyCoordinates(x, a), boundingBox(x, b), jurisName(x, c),
state(x, "Wisconsin"), lu1(x, "115"), LandParcel(x)
6.6.2.3 Rewriting Constants
Both the QueryExpand function and the query mapping process are performed at the schema
level. In comparison, the rewriting of the constants that are referred to in the query is based
Algorithm ConstantMapping (Q, M)
Input: a conjunctive query Q over ontology O′ with constants c1, ..., cn from O; the instance-level mappings M between ontologies O and O′.
Output: a union 𝒬 of conjunctive queries Q′ with constants from O′.
1  𝒬 = ∅;
2  c = (c1, ..., cn);
3  For each ci, with i ∈ [1..n]
4    Ai = {};
5    Let C be the class standing for ci;
6    For each C ⊇ C′ or C = C′ in M
7      Ai = Ai ∪ {c′}, where c′ is the constant represented by C′;
8    If there is no C ⊇ C′ or C = C′ in M then
9      Ai = {ci};
10 For each c′ ∈ A1 × ... × An
11   Q′ = Q;
12   Substitute c in Q′ with c′;
13   𝒬 = 𝒬 ∪ Q′;
Figure 44. The ConstantMapping algorithm.
on the instance-level mappings between two ontologies, particularly the mappings between two
land use taxonomies. We describe next the instance rewriting process of Figure 44.
Following the previous example, we have c = ("Wisconsin", "115"). From the mapping
between G and O1, as shown in Figure 40, we have that RT ⊆ 115 and RM ⊆ 115; therefore,
A1 = {"Wisconsin"} and A2 = {"RT", "RM"}, according to Lines 3 to 9. We thus have two
vectors of constants (c′ in the algorithm), ("Wisconsin", "RT") and ("Wisconsin", "RM"), and
can obtain the following union of queries Q according to Lines 10 to 13 of the ConstantMapping
algorithm:
ans(a, b, c) :– xyCoordinates(x, a), boundingBox(x, b), jurisName(x, c),
state(x, "Wisconsin"), lu1(x, "RT"), LandParcel(x)
∪
ans(a, b, c) :– xyCoordinates(x, a), boundingBox(x, b), jurisName(x, c),
state(x, "Wisconsin"), lu1(x, "RM"), LandParcel(x)
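The constant-rewriting step amounts to a cross product of per-constant alternatives. A minimal sketch, with the instance-level mappings encoded (our own choice) as a dictionary from each source code to the set of target codes whose classes it subsumes or equals:

```python
from itertools import product

def constant_mapping(constants, instance_maps):
    """Return one constant vector per conjunctive query in the union.
    Constants with no mapping (Lines 8-9 of the algorithm) are kept as-is."""
    alternatives = [sorted(instance_maps.get(c, {c})) for c in constants]
    return list(product(*alternatives))
```

For the example, constant_mapping(("Wisconsin", "115"), {"115": {"RT", "RM"}}) yields the two vectors ("Wisconsin", "RM") and ("Wisconsin", "RT"), one per query in the resulting union.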
6.6.3 Discussion
In this section we discuss several considerations related to our query processing strategy,
as well as some alternatives to our choices. First, we have assumed that the schema-level
mapping M between the two ontologies is a full mapping relative to the query to be
rewritten. More specifically, every relation atom (class or property) in the body of the query
must be mapped to some atom in the other ontology, with the mapping
type being ⊇ or ≡. In the case that nulls are not allowed in the queried atoms, this assumption
is necessary so as to get complete answers to the query. If nulls are allowed, the queried atoms
do not have to be fully mapped, since we can add null values at appropriate positions of the
answer returned by the target query.
Under the assumption of full mappings, we can prove the soundness of the QueryRewriting
algorithm, that is, that it computes a rewriting (target query) 𝒬 contained in the source query
Q, denoted 𝒬 ⊆ Q. The proof is sketched as follows. Let Q∗ be the expanded source query, Q′
the intermediate target query, and Q′′ the expanded intermediate target query. Given that
Q ≡ Q∗, Q′ ≡ Q′′, and Q′′ ≡ 𝒬 (2), it suffices to prove that Q′ ⊆ Q∗. Suppose that t is an
instance in the answer to Q′, i.e., t ∈ Q′(O), where O is the local ontology instance. Then t
makes every predicate R(x) in bodyQ′ true. According to Lines 5 to 8 of the QueryRewriting
algorithm, we have that every predicate S(x) in bodyQ∗ is also made true by t. This means
that t ∈ Q∗(G), where G is the global ontology instance, therefore Q′ ⊆ Q∗. We note that we
obtain a contained rewriting, instead of a maximally contained rewriting (70). This is actually
due to our preference for high precision rather than for high recall, which we discuss below.
Second, there are two important steps involved in the local-to-local query rewriting: query
conversion and mapping composition. The query conversion deals with the conversion of a
query (e.g., in XPath) native to the local system to a query (in c-RQL) on the local ontology.
However, only a particular class of XML queries, namely those whose expressive power does
not exceed that of c-RQL, can be represented in c-RQL. We give below an example of query conversion.
Consider an XPath query /LandUse/LandParcel[broad="A"] posed over O1 to retrieve
all the land parcels used as Agricultural. The result of this query is the set of XML document
trees rooted at the LandParcel element (see Figure 37). By considering the answer structure
and the semantics of the query, we convert the XPath query into the following c-RQL query.
ans(x1, x2, x3, ...) :– area(x, x1), ..., jurisType(x, x2), jurisName(x, x3),
broad(x, x4), lu1(x, x5), lu2(x, x6), lu3(x, x7),
x4 = "A".
We note that all the elements and/or attributes involved in the XML answer tree and in
the predicates of the XPath query are covered in the c-RQL query.
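For this restricted pattern, the conversion can be sketched as follows (a hypothetical helper handling only single-predicate filters of the form /A/B[prop="value"]; the real conversion must also account for the answer structure):

```python
import re

def xpath_to_crql(xpath, properties):
    """Convert an XPath query /A/B[prop="value"] into c-RQL body atoms plus
    an equality condition. `properties` lists the children of element B that
    appear in the answer tree; each becomes a predicate over a fresh variable."""
    m = re.fullmatch(r'/\w+/(\w+)\[(\w+)="(\w+)"\]', xpath)
    if not m:
        raise ValueError("unsupported XPath pattern")
    _, prop, value = m.groups()
    atoms = [(p, ("?x", "?" + p)) for p in properties]
    conditions = [("?" + prop, value)]
    return atoms, conditions
```

Applied to /LandUse/LandParcel[broad="A"], this produces one predicate per listed property of LandParcel, plus the condition that the broad variable equals "A", mirroring the c-RQL query above.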
The mapping composition is necessary to obtain the mappings M between two local source
schemas, based on which the QueryRewriting algorithm is performed. Similarly to the ontology
alignment, the composition depends on a set of inference rules, which derive a new mapping
from two or more existing mappings. The difference is that, while a mapping in the deduction
of the ontology alignment is between the same pair of ontologies, the two ontology mappings
involved in the composition are between two pairs of ontologies (O1, O2) and (O2, O3), with
a common intermediate ontology, O2. Let R1, R2, and R3 be three classes (or properties)
respectively from the ontologies O1, O2, and O3. Then, given the mapping from R1 to R2 and
that from R2 to R3, a new mapping R1 to R3 can be derived according to the following rules:
1) R1 = R3, if R1 = R2 and R2 = R3;
2) R1 ⊆ R3, if R1 ⊆ R2 and R2 ⊆ R3, or R1 = R2 and R2 ⊆ R3, or R1 ⊆ R2 and R2 = R3;
3) R1 ⊇ R3, if R1 ⊇ R2 and R2 ⊇ R3, or R1 = R2 and R2 ⊇ R3, or R1 ⊇ R2 and R2 = R3.
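These composition rules can be encoded directly (using our own "=", "<=", ">=" encoding of the mapping types):

```python
def compose(r12, r23):
    """Compose the mapping types between (O1, O2) and (O2, O3), returning
    the derived type between O1 and O3, or None if no rule applies."""
    if r12 == "=" and r23 == "=":
        return "="   # rule 1
    if r12 in ("=", "<=") and r23 in ("=", "<="):
        return "<="  # rule 2
    if r12 in ("=", ">=") and r23 in ("=", ">="):
        return ">="  # rule 3
    return None      # e.g., "<=" composed with ">=" is undetermined
```

Note that mixed compositions such as R1 ⊆ R2 with R2 ⊇ R3 yield no conclusion, which is one reason the composed local-to-local mappings may be a proper subset of the direct mappings.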
The last issue we discuss concerns the trade-off between the precision and recall of the query
processing. Currently, we do not consider the mapping type approximate (see Section 6.5.2).
Furthermore, the query rewriting algorithm only uses mappings that guarantee the correctness
of the query. For instance, given a mapping A ⊆ B and a query Q : {x|A(x)}, our query
rewriting algorithm would not rewrite Q to {x|B(x)}, unless the mapping was A ⊇ B or
A ≡ B. This ensures that we will not return to the user instances that do not belong to A.
But we may miss some instances of B that are also instances of A and should be included in
the answer to Q, thus lowering recall.
An alternative is to allow the approximate semantic relationship and to assign a score in
[0, 1] to every mapping, based on the similarity of the mapped classes or properties. Thus,
query rewriting can calculate an estimated precision of the target query. The consideration of
the approximate type can increase the automation of the ontology alignment process. In par-
ticular, we can integrate existing similarity-based ontology (or schema) matching methods (as
described in Section 6.2) in our alignment algorithm, so that the user does not have to interact
with the alignment process to disambiguate those mappings that cannot be inferred by the
deduction rules. Instead, the disambiguation can be performed by the similarity-based method,
and a similarity score can be assigned to the mapping. We have partially implemented this idea
in our ontology alignment interface (using the matching by definition criterion). Using this
approach, while the answer to a query may contain false positives, the recall will increase.
In practice, different scenarios impose different requirements on the mappings. For exam-
ple, an eCommerce application involving purchase orders requires a very precise and complete
translation of a query, whereas a search engine usually does not require an exact transforma-
tion (34).
6.7 Summary
In this chapter, we focused on data integration and interoperability across distributed
geospatial data sources. To illustrate the impact of our approach in a local eGovernment
setting, we showed practical examples that are derived from land use applications in the Wis-
consin Land Information System (WLIS) project. The data heterogeneities in such applications
include schematic heterogeneities, resulting from the fact that each county and each munici-
pality may have a different schema for their data, and semantic heterogeneities, resulting from
the different classification schemes used for the land use data.
We have proposed an ontology-based approach to achieve the integration and interoperability of
the distributed geospatial data sources by resolving both schematic and semantic heterogeneities. In
particular, we use a local ontology to represent both the local source schema and the taxonomy
used to encode instances—land use codes in our application—based on a schema transforma-
tion process. Similarly, the global ontology that models the eGovernment application domain
consists of two components: the global schema and the global land use taxonomy.
To achieve data interoperability, two different kinds of mappings are established between
the global ontology and each local ontology: schema mappings between the schema of both
ontologies and instance mappings between the (land use) taxonomies of both ontologies. The
schematic and semantic heterogeneities are reconciled by these two kinds of mappings. We base
the ontology mapping process on a deduction procedure.
We have discussed two modes of query processing in our system, global-to-local and
local-to-local (or peer-to-peer). The former mode is used to accomplish the data integration task, by
rewriting a global query (on the global ontology) into the union of subqueries on the multiple
local ontologies. The latter mode enables peer-to-peer interoperation between any pair of
sources, by means of local-to-local query rewriting. Query rewriting in both modes uses the
previously established mappings. While global-to-local query rewriting uses the mappings from
the global ontology to all local ontologies, local-to-local rewriting is based on the composition
of the mappings from the local ontology of the source database to the global ontology, followed
by the mappings from the global ontology to the local ontology of the target database. We
propose a c-RQL (conjunctive RQL) query rewriting algorithm, such that the resulting target
query is contained in the source query, thus providing sound answers to the source query.
Future work will focus on the following two topics: 1) Ontology alignment, and in particular
the deduction-based method. Currently, we make some assumptions on the topology of the
ontologies. Without such assumptions, we may need to consider the combination of our bottom-
up deduction process with top-down reasoning on mappings (e.g., (100)). 2) We will further
extend our query rewriting algorithm, so that it can take into account “approximate” mappings.
In this case, the precision and recall of query answering will depend on the similarity of the
underlying mappings, thus making the ability to determine mapping similarities a critical task.
CHAPTER 7
CONCLUSIONS
Data heterogeneity is the primary obstacle to achieving data interoperability among dis-
tributed data sources. To build an integrated system, whether in a centralized architecture or in
a peer-to-peer architecture, we have to resolve heterogeneities at different levels, including
syntactic, schematic, and semantic heterogeneities. The focus of this thesis is therefore on the ap-
plication of Semantic Web technologies, centered on ontologies, to data interoperability, so as
to achieve semantic data integration. We have proposed an ontology-based approach for both
central and peer-to-peer data integration. In doing so, we discuss three fundamental issues:
metadata representation, the mapping process, and query processing.
Our work, as presented in five scenarios in this thesis, can be summarized as
follows:
1. In our ontology-based framework for centralized integration of XML data sources, we
address the problem that semantically equivalent XML documents can have different
document structures, a consequence of the lack of explicit semantics in XML. The ontology-
based approach enables the interoperation of XML documents at the semantic level while
retaining their nesting structure. A global RDFS ontology is generated by merging the
local RDFS ontologies that are generated from each of the XML documents. By means of
the mappings established between the global ontology and local XML schemas, we are able
to process queries in two modes: from the global ontology to the local sources and from one
local source to another. For both modes, we propose a query rewriting algorithm, which
is shown to be an equivalent rewriting algorithm. In doing so, we discuss the problem
of query containment for two query languages, namely conjunctive RDQL (c-RDQL) and
conjunctive XQuery (c-XQuery).
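The containment test that underlies this equivalence result can be made concrete with the classical containment-mapping check for conjunctive queries: Q1 is contained in Q2 exactly when there is a homomorphism from Q2 into Q1 that maps head to head and body atoms onto body atoms. The following Python sketch illustrates that check on an abstract Datalog-style encoding; it is an illustration of the general technique, not the thesis's c-RDQL/c-XQuery implementation.

```python
from itertools import product

def is_contained(q1, q2):
    """Decide Q1 <= Q2 (every answer of Q1 is an answer of Q2) for
    conjunctive queries by searching for a containment mapping: a
    homomorphism from Q2's terms to Q1's terms that maps head to head
    and every body atom of Q2 onto some body atom of Q1.
    A query is (head, body); an atom is (predicate, terms);
    uppercase terms are variables, other terms are constants."""
    def is_var(t):
        return t[0].isupper()

    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for _, ts in body2 for t in ts if is_var(t)}
                   | {t for t in head2 if is_var(t)})
    terms1 = sorted({t for _, ts in body1 for t in ts} | set(head1))
    atoms1 = set(body1)
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        # the head of Q2 must map exactly onto the head of Q1
        if tuple(h.get(t, t) for t in head2) != tuple(head1):
            continue
        # every body atom of Q2 must map into Q1's body
        if all((p, tuple(h.get(t, t) for t in ts)) in atoms1
               for p, ts in body2):
            return True
    return False
```

The search is brute force, which reflects the NP-completeness of conjunctive-query containment; it is practical only for the small queries used in such illustrations.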
2. We have proposed a hybrid peer-to-peer framework, PEPSINT, for the integration of XML
and RDF data sources. We discuss the construction of the architecture, maintenance of
mappings, and query processing in PEPSINT. Data integration is carried out at the
schema level through schema matching and at the instance level through query
answering. A key aspect of both processes is the preservation of the domain
and document structure, which enables both the integration of source schemas and that
of answers from different local queries that may have different structures. Furthermore,
user queries can be correctly propagated across the network of heterogeneous XML and
RDF data sources, so that information access within the network is transparent to the
user.
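The propagation of a query across a chain of mapped peers can be pictured, in much simplified form, as rewriting its concepts hop by hop. The sketch below is only an abstraction of the idea (the concept names are invented, and dropping unmapped concepts is an assumption made for the example), not PEPSINT's actual algorithm.

```python
def propagate(query_concepts, mapping_path):
    """Rewrite the concepts of a query along a chain of peer-to-peer
    concept mappings. A concept with no mapping at some hop is dropped,
    so the propagated query may be weaker, but it never refers to a
    concept that the next peer does not understand."""
    concepts = list(query_concepts)
    for mapping in mapping_path:        # one hop per peer boundary
        concepts = [mapping[c] for c in concepts if c in mapping]
    return concepts
```

Each hop applies one pairwise mapping, so a peer only needs mappings to its neighbors for a query to reach the whole network.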
3. An ontology-based approach has been proposed to solve the data interoperability problem
in a heterogeneous pure P2P network. RDF and related techniques are used extensively
in our approach, including RDFS local ontologies for metadata representation and the
RDFMS meta-ontology for the representation of inter-schema mappings. Based on the
RDFMS meta-ontology, we introduce a P2P mapping language, PML, which expresses
mappings with a first-order logic semantics. P2P query answering in the system takes
into account constraints defined over the local data sources.
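To see what a first-order semantics for mappings buys, one can view a mapping as a rule over subject-predicate-object triples: whenever the body pattern matches a source triple, the head is derived under the resulting variable bindings. The sketch below (single-atom rules, invented predicate names, not PML's actual syntax) illustrates this rule semantics.

```python
def apply_rule(triples, body, head):
    """One-step application of a mapping rule body => head over
    subject-predicate-object triples. Uppercase terms in the patterns
    are variables; a match binds them consistently, and the head is
    instantiated under those bindings."""
    def match(pattern, triple):
        env = {}
        for p, t in zip(pattern, triple):
            if p[0].isupper():              # variable: bind or check
                if env.setdefault(p, t) != t:
                    return None
            elif p != t:                    # constant: must agree
                return None
        return env

    derived = []
    for triple in triples:
        env = match(body, triple)
        if env is not None:
            derived.append(tuple(env.get(t, t) for t in head))
    return derived
```

A rule such as (X, writtenBy, Y) => (X, hasAuthor, Y) restates one peer's data in another peer's vocabulary while preserving the bindings of X and Y.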
4. The data interoperability problem exists in the management of personal information
within and across desktops. We propose a layered multi-ontology based framework,
called MOSE, which aims to provide a semantics-rich environment for personal infor-
mation organization and manipulation. In particular, we focus on semantically enriched
data organization, including data annotation using domain ontologies, data association
by means of a network consisting of the various ontologies and their instances, and data
representation. We also propose an MVC-based approach to personal information
application (PIA) development using the PIA designer, on top of which we formalize
the notion of a desktop service in terms of parameterized channels. The data
interoperability in MOSE can be realized in two ways, one by means of desktop services
and the other by means of query processing across desktops. We discuss two cases of
query processing: within a single PIA and between two PIAs.
5. Finally, we illustrate the impact of our ontology-based approach on the data interoper-
ability problem in a local eGovernment setting. We propose an ontology-based approach
to achieve the integration and interoperability of the distributed geospatial data sources
by solving both schematic and semantic heterogeneities. Both kinds of heterogeneities
are reconciled by two different kinds of mappings that are established between the global
ontology and each local ontology: schema mappings between the schema of both ontolo-
gies and instance mappings between the (land use) taxonomies of both ontologies. We
propose a c-RQL (conjunctive RQL) query rewriting algorithm for the two cases of query
processing in the system.
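The role of the instance mappings can be sketched as a translation step applied to the land-use codes that occur in a query; the codes and the mapping below are invented for illustration and do not come from the actual taxonomies.

```python
def translate_codes(query_codes, instance_map):
    """Replace local land-use codes in a query by the global-taxonomy
    terms they map to; a local code may map to several global terms,
    and unmapped codes contribute nothing."""
    return sorted({g for code in query_codes
                   for g in instance_map.get(code, ())})
```

Schema mappings align the structure of the two ontologies, while this instance-level translation aligns the vocabularies used as query values.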
CITED LITERATURE
1. Serge Abiteboul and Oliver M. Duschka. Complexity of Answering Queries Using Materialized Views. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), pages 254–263, 1998.
2. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
3. Bernd Amann, Catriel Beeri, Irini Fundulaki, and Michel Scholl. Ontology-Based Integration of XML Web Resources. In Proceedings of the 1st International Semantic Web Conference (ISWC 2002), pages 117–131, 2002.
4. Bernd Amann, Catriel Beeri, Irini Fundulaki, and Michel Scholl. Querying XML Sources Using an Ontology-Based Mediator. In Proc. of the Confederated International Conferences DOA, CoopIS and ODBASE, LNCS 2519, Springer, 2002.
5. Bernd Amann, Irini Fundulaki, Michel Scholl, Catriel Beeri, and Anne-Marie Vercoustre. Mapping XML Fragments to Community Web Ontologies. In Proceedings of the 4th International Workshop on the Web and Databases (WebDB 2001), pages 97–102, 2001.
6. Marcelo Arenas, Vasiliki Kantere, Anastasios Kementsietsidis, Iluju Kiringa, Renee J. Miller, and John Mylopoulos. The Hyperion Project: From Data Integration to Data Coordination. SIGMOD Record, 32(3):53–58, 2003.
7. Yigal Arens, Craig A. Knoblock, and Chunnan Hsu. Query Processing in the SIMS Information Mediator. AAAI Press, May 1996.
8. David Aumueller and Soren Auer. Towards a Semantic Wiki Experience – Desktop Integration and Interactivity in WikSAR. In Proc. of the 1st ISWC Workshop on The Semantic Desktop, 2005.
9. David Aumueller, Hong Hai Do, Sabine Massmann, and Erhard Rahm. Schema and Ontology Matching with COMA++. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 906–908, 2005.
10. Sonia Bergamaschi, Silvana Castano, and Maurizio Vincini. Semantic Integration of Semistructured and Structured Data Sources. SIGMOD Record, 28(1):54–59, 1999.
11. Sonia Bergamaschi, Francesco Guerra, and Maurizio Vincini. A Peer-to-Peer Information System for the Semantic Web. In Proceedings of the International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2003), July 2003.
12. Matthew Berland and Eugene Charniak. Finding Parts in Very Large Corpora. In ACL, 1999.
13. Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, May 2001.
14. Philip A. Bernstein. Applying Model Management to Classical Meta Data Problems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR 2003), 2003.
15. Philip A. Bernstein, Fausto Giunchiglia, Anastasios Kementsietsidis, John Mylopoulos, Luciano Serafini, and Ilya Zaihrayeu. Data Management for Peer-to-Peer Computing: A Vision. In WebDB 2002, pages 89–94, 2002.
16. Yaser A. Bishr. Overcoming the semantic and other barriers to GIS interoperability. International Journal of Geographical Information Science, 12(4):299–314, 1998.
17. Christian Bizer. D2R MAP - A Database to RDF Mapping Language. In Proceedings of the 12th International World Wide Web Conference (WWW 2003), 2003.
18. Stephan Bloehdorn, Kosmas Petridis, Carsten Saathoff, Nikos Simou, Vassilis Tzouvaras, Yannis S. Avrithis, Siegfried Handschuh, Ioannis Kompatsiaris, Steffen Staab, and Michael G. Strintzis. Semantic Annotation of Images and Videos for Multimedia Analysis. In ESWC 2005, pages 592–607, 2005.
19. Scott Boag, Don Chamberlin, Mary F. Fernandez, Jonathan Robie, Daniela Florescu, and Jerome Simeon. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery, W3C Working Draft.
20. Paolo Bouquet, Fausto Giunchiglia, Frank van Harmelen, Luciano Serafini, and Heiner Stuckenschmidt. C-OWL: Contextualizing Ontologies. In Proc. of ISWC 2003, pages 164–179, 2003.
21. Ronald Bourret. XML and Databases. http://www.rpbourret.com/xml/XMLAndDatabases.htm, 2004.
22. Dan Brickley, R.V. Guha, and Brian McBride. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema, February 2004.
23. Vannevar Bush. As We May Think. The Atlantic Monthly, 176(1):101–108, 1945.
24. Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. On the Expressive Power of Data Integration Systems. In Proceedings of the 21st International Conference on Conceptual Modeling (ER 2002), pages 338–350, 2002.
25. Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Paolo Naggar, and Fabio Vernacotola. IBIS: Semantic Data Integration at Work. In Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE 2003), pages 79–94, 2003.
26. Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. What to Ask to a Peer: Ontology-based Query Reformulation. In Proceedings of the 9th International Conference on Principles of Knowledge Representation and Reasoning (KR 2004), pages 469–478, 2004.
27. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. View-Based Query Processing and Constraint Satisfaction. In The 15th Annual IEEE Symposium on Logic in Computer Science (LICS 2000), pages 361–371, 2000.
28. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. View-based Query Containment. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), pages 56–67, 2003.
29. Sandro Daniel Camillo, Carlos A. Heuser, and Ronaldo dos Santos Mello. Querying Heterogeneous XML Sources through a Conceptual Schema. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER 2003), pages 186–199, 2003.
30. Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati. Global Viewing of Heterogeneous Data Sources. IEEE Transactions on Knowledge and Data Engineering, 13(2):277–297, 2001.
31. Yi Chen and Peter Revesz. CXQuery: A Novel XML Query Language. In Proceedings of International Conference on Advances in Infrastructure for Electronic Business, Science, and Medicine on the Internet (SSGRR 2002w), 2002.
32. Vassilis Christophides, Gregory Karvounarakis, I. Koffina, Giorgos Kokkinidis, Aimilia Magkanaraki, Dimitris Plexousakis, G. Serfiotis, and Val Tannen. The ICS-FORTH SWIM: A Powerful Semantic Web Integration Middleware. In SWDB 2003, pages 381–393, 2003.
33. Jeff Conklin. Hypertext: An Introduction and Survey. IEEE Computer, 20(9):17–41, 1987.
34. Valerie Cross. Uncertainty in the Automation of Ontology Matching. In Proc. of the 4th International Symposium on Uncertainty Modeling and Analysis (ISUMA), pages 135–140, 2003.
35. Isabel F. Cruz, Afsheen Rajendran, and William Sunna. XML Database Integration for Visualizing US Election Results. In Proc. of the National Conference on Digital Government Research (dg.o), pages 403–406, 2002.
36. Isabel F. Cruz, Afsheen Rajendran, William Sunna, and Nancy Wiegand. Handling Semantic Heterogeneities using Declarative Agreements. In Proc. of ACM GIS 10th International Symposium on Advances in Geographic Information Systems, pages 168–174, 2002.
37. Isabel F. Cruz, William Sunna, and Anjli Chaudhry. Semi-Automatic Ontology Alignment for Geospatial Data Integration. In Proc. of the 3rd Int. Conf. on GIScience, pages 51–66, 2004.
38. Isabel F. Cruz and Huiyong Xiao. Using a Layered Approach for Interoperability on the Semantic Web. In Proceedings of the 4th International Conference on Web Information Systems Engineering (WISE 2003), pages 221–232, Rome, Italy, December 2003.
39. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for Semantic Interoperability between XML Sources. In Proceedings of the 8th International Database Engineering & Applications Symposium (IDEAS 2004), pages 217–226, July 2004.
40. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. Peer-to-Peer Semantic Integration of XML and RDF Data Sources. In The 3rd International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2004), July 2004.
41. Stefan Decker and Martin Frank. The Social Semantic Desktop. In Proc. of the WWW Workshop on Application Design, Development and Implementation Issues in the Semantic Web, 2004.
42. Stefan Decker, Sergey Melnik, Frank van Harmelen, Dieter Fensel, Michel C. A. Klein, Jeen Broekstra, Michael Erdmann, and Ian Horrocks. The Semantic Web: The Roles of XML and RDF. IEEE Internet Computing, 4(5):63–74, 2000.
43. AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Y. Halevy. Learning to Map between Ontologies on the Semantic Web. In Proc. of the 11th International World Wide Web Conference (WWW), pages 662–673, 2002.
44. Xin Dong and Alon Y. Halevy. A Platform for Personal Information Management and Integration. In CIDR, pages 119–130, 2005.
45. Paul Dourish, W. Keith Edwards, Anthony LaMarca, John Lamping, Karin Petersen, Michael Salisbury, Douglas B. Terry, and James Thornton. Extending Document Management Systems with User-specific Active Properties. ACM Transactions on Information Systems, 18(2):140–170, 2000.
46. Marc Ehrig, Christoph Tempich, Jeen Broekstra, Frank van Harmelen, Marta Sabou, Ronny Siebes, Steffen Staab, and Heiner Stuckenschmidt. SWAP - Ontology-based Knowledge Management with Peer-to-Peer Technology. In Proc. of WOW 2003, 2003.
47. Mary Fernandez, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and Norman Walsh. XQuery 1.0 and XPath 2.0 Data Model. http://www.w3.org/TR/xpath-datamodel, W3C Working Draft, October 2004.
48. Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel F. Cruz, Huiyong Xiao, and Rajen Subba. The Problem of Ontology Alignment on the Web: a First Report. In Proc. of the 2nd Web as Corpus Workshop (associated with the 11th Conference of the European Chapter of the ACL), pages 51–58, 2006.
49. Enrico Franconi, Gabriel M. Kuper, Andrei Lopatenko, and Ilya Zaihrayeu. A Distributed Algorithm for Robust Data Sharing and Updates in P2P Database Networks. In Current Trends in Database Technology - EDBT 2004 Workshops, LNCS 3268, Springer, pages 446–455, 2004.
50. Eric Freeman and David Gelernter. Lifestreams: A Storage Model for Personal Data. SIGMOD Record, 25(1):80–86, 1996.
51. Jim Gemmell, Gordon Bell, Roger Lueder, Steven M. Drucker, and Curtis Wong. MyLifeBits: Fulfilling the Memex Vision. In ACM Multimedia, pages 235–238, 2002.
52. Li Gong. JXTA: A Network Programming Environment. IEEE Internet Computing, 5(3):88–95, May 2001.
53. Michael Goodchild. Spatially Enabled E-Government. 8th International Seminar on GIS (Keynote Talk), Seoul, Korea, November 2003. http://www.csiss.org/aboutus/presentations/files/goodchild_seoul_nov03.pdf.
54. Thomas R. Gruber and Gregory R. Olsen. An Ontology for Engineering Mathematics. In Proceedings of the 4th International Conference on Principles of Knowledge Representation and Reasoning (KR 1994), pages 258–269, 1994.
55. Tom R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2):199–220, 1993.
56. Nicola Guarino. Formal Ontology and Information Systems. In Proceedings of the 1st International Conference on Formal Ontologies in Information Systems (FOIS 1998), pages 3–15, 1998.
57. Peter Haase, Jeen Broekstra, Marc Ehrig, Maarten Menken, Peter Mika, Mariusz Olko, Michal Plechawski, Pawel Pyszlak, Bjorn Schnizler, Ronny Siebes, Steffen Staab, and Christoph Tempich. Bibster - A Semantics-Based Bibliographic Peer-to-Peer System. In Proc. of ISWC 2004, pages 122–136, 2004.
58. Alon Y. Halevy. Answering Queries Using Views: A Survey. VLDB Journal, 10(4):270–294, 2001.
59. Alon Y. Halevy, Zachary G. Ives, Peter Mork, and Igor Tatarinov. Piazza: Data Management Infrastructure for Semantic Web Applications. In Proceedings of the 12th International World Wide Web Conference (WWW 2003), pages 556–567, 2003.
60. Mauricio A. Hernandez, Renee J. Miller, and Laura M. Haas. Clio: A Semi-Automatic Tool For Schema Mapping (demo). In Proc. of the ACM SIGMOD International Conference on Management of Data, page 607, 2001.
61. Eduard H. Hovy and Stefan Falke. Automating the Integration of Heterogeneous Databases. In Proc. of the 2004 National Conference on Digital Government Research (dg.o), 2004.
62. HP Labs. RDQL - RDF Data Query Language. http://www.hpl.hp.com/semweb/rdql.htm, 2005.
63. Ryutaro Ichise, Hideaki Takeda, and Shinichi Honiden. Rule Induction for Concept Hierarchy Alignment. In Proc. of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001.
64. Yannis Kalfoglou and Marco Schorlemmer. Ontology Mapping: the State of the Art. The Knowledge Engineering Review, 18(1):1–31, 2003.
65. Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis, and Michel Scholl. RQL: a declarative query language for RDF. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), pages 592–603, 2002.
66. Michel C. A. Klein. Interpreting XML Documents via an RDF Schema Ontology. In Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA 2002), pages 889–894, 2002.
67. Glenn E. Krasner and Stephen T. Pope. A Cookbook for Using the Model-View-Controller User Interface Paradigm in Smalltalk-80. Journal of Object-Oriented Programming, 1(3):26–49, August/September 1988.
68. Laks V. S. Lakshmanan and Fereidoon Sadri. Interoperability on XML Data. In Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), pages 146–163, 2003.
69. Patrick Lehti and Peter Fankhauser. XML Data Integration with OWL: Experiences and Challenges. In 2004 Symposium on Applications and the Internet (SAINT 2004), pages 160–170, 2004.
70. Maurizio Lenzerini. Data Integration: A Theoretical Perspective. In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2002), pages 233–246, Madison, Wisconsin, June 2002. ACM.
71. Ron Li, Keith W. Bedford, C. K. Shum, Xutong Niu, Feng Zhou, Vasilia Velissariou, J. Raul Ramirez, and Aidong Zhang. Integration of Multidimensional Geospatial Information for Coastal Management and Decision-Making. In Proc. of the 2005 National Conference on Digital Government Research (dg.o), page 231, 2005.
72. Yunyao Li, Huahai Yang, and H. V. Jagadish. NaLIX: an Interactive Natural Language Interface for Querying XML. In SIGMOD 2005 (Poster).
73. Hugo Lueders. Interoperability and Open Standards for eGovernment Services. http://xml.coverpages.org/Comptia-ISC-OpenStandards.pdf, January 2005.
74. Alexander Maedche, Boris Motik, Nuno Silva, and Raphael Volz. MAFRA - A MApping FRAmework for Distributed Ontologies. In Proc. of EKAW 2002, pages 235–250, 2002.
75. David Maier and Lois M. L. Delcambre. Superimposed Information for the Internet. In WebDB, pages 1–9, 1999.
76. Inderjeet Mani. Recent Developments in Text Summarization. In CIKM, pages 529–531, 2001.
77. Frank Manola, Eric Miller, and Brian McBride. RDF Primer. http://www.w3.org/TR/rdf-primer, February 2004.
78. Byron Marshall, Siddharth Kaza, Jennifer Jie Xu, Homa Atabakhsh, Tim Petersen, Chuck Violette, and Hsinchun Chen. Cross-Jurisdictional Activity Networks to Support Criminal Investigations. In Proc. of the 2004 National Conference on Digital Government Research (dg.o), 2004.
79. Deborah L. McGuinness, Richard Fikes, James Rice, and Steve Wilder. An Environment for Merging and Testing Large Ontologies. In Proc. of the 7th International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 483–493, 2000.
80. Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In Proc. of the International Conference on Data Engineering (ICDE), pages 117–128, 2002.
81. Eduardo Mena, Vipul Kashyap, Amit P. Sheth, and Arantza Illarramendi. OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies. In Proceedings of the 1st IFCIS International Conference on Cooperative Information Systems (CoopIS 1996), pages 14–25, 1996.
82. Todd D. Millstein, Alon Y. Halevy, and Marc Friedman. Query Containment for Data Integration Systems. Journal of Computer and System Sciences, 66(1):20–39, 2003.
83. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, and Zhichen Xu. Peer-to-Peer Computing. Technical Report HPL-2002-57, HP Laboratories Palo Alto, 2002.
84. Gianluca Moro, Aris M. Ouksel, and Claudio Sartori. Agents and Peer-to-Peer Computing: A Promising Combination of Paradigms. In Proceedings of the 1st International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2002), pages 1–14, 2002.
85. Wolfgang Nejdl, Boris Wolf, Changtao Qu, Stefan Decker, Michael Sintek, Ambjorn Naeve, Mikael Nilsson, Matthias Palmer, and Tore Risch. EDUTELLA: A P2P Networking Infrastructure Based on RDF. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), 2002.
86. Wee Siong Ng, Beng Chin Ooi, Kian Lee Tan, and Aoying Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. In Proceedings of the 19th International Conference on Data Engineering (ICDE 2003), pages 633–644, 2003.
87. Natalya Fridman Noy. Semantic Integration: A Survey of Ontology-Based Approaches. SIGMOD Record, 33(4):65–70, 2004.
88. Natalya Fridman Noy and Mark A. Musen. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI 2000), pages 450–455, 2000.
89. Natalya Fridman Noy and Mark A. Musen. Anchor-PROMPT: Using Non-local Context for Semantic Matching. In Proc. of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001.
90. Borys Omelayenko. RDFT: A Mapping Meta-Ontology for Web Service Integration. In Knowledge Transformation for the Semantic Web 2003, pages 137–153, 2003.
91. Eyal Oren. SemperWiki: a Semantic Personal Wiki. In Proc. of the 1st ISWC Workshop on The Semantic Desktop, 2005.
92. Luigi Palopoli, Domenico Sacca, and Domenico Ursino. An Automatic Technique for Detecting Type Conflicts in Database Schemes. In Proc. of the 7th International Conference on Information and Knowledge Management (CIKM), pages 306–313, 1998.
93. Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. Object Exchange Across Heterogeneous Information Sources. In Proceedings of the 11th International Conference on Data Engineering (ICDE 1995), pages 251–260, 1995.
94. Peter F. Patel-Schneider and Jerome Simeon. The Yin/Yang web: XML syntax and RDF semantics. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), pages 443–453, July 2002.
95. Martin Peim, Enrico Franconi, Norman W. Paton, and Carole A. Goble. Query Processing with Description Logic Ontologies Over Object-Wrapped Databases. In Proc. of the 14th International Conference on Scientific and Statistical Database Management (SSDBM), pages 27–36, 2002.
96. Chris Peltz. Web Services Orchestration and Choreography. Computer, 36(10):46–52, 2003.
97. Lucian Popa, Yannis Velegrakis, Renee J. Miller, Mauricio A. Hernandez, and Ronald Fagin. Translating Web Data. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pages 598–609, 2002.
98. Dennis Quan, David Huynh, and David R. Karger. Haystack: A Platform for Authoring End User Semantic Web Applications. In ISWC, pages 738–753, 2003.
99. Erhard Rahm and Philip A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, 2001.
100. M. Andrea Rodríguez and Max J. Egenhofer. Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering, 15(2):442–456, 2003.
101. Ozgur D. Sahin, Abhishek Gupta, Divyakant Agrawal, and Amr El Abbadi. Query Processing Over Peer-To-Peer Data Sharing Systems. Technical Report CSD-2002-28, University of California at Santa Barbara, 2002.
102. Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
103. Leo Sauermann. The Gnowsis Semantic Desktop for Information Integration. In The 3rd Conference on Professional Knowledge Management, pages 39–42, 2005.
104. Leo Sauermann, Ansgar Bernardi, and Andreas Dengel. Overview and Outlook on the Semantic Desktop. In Proc. of the 1st ISWC Workshop on The Semantic Desktop, 2005.
105. Leon A. Shklar, Amit P. Sheth, Vipul Kashyap, and Kshitij Shah. InfoHarness: Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information. In Proceedings of the 7th Conference on Advanced Information Systems Engineering (CAiSE 1995), pages 217–230, 1995.
106. Pavel Shvaiko and Jerome Euzenat. A Survey of Schema-Based Matching Approaches. Journal of Data Semantics, 4:146–171, 2005.
107. Heiner Stuckenschmidt. Query Processing on the Semantic Web. Kunstliche Intelligenz (KI), 17(3):22–, 2003.
108. Gerd Stumme and Alexander Maedche. Ontology Merging for Federated Ontologies for the Semantic Web. In Proceedings of the International Workshop on Foundations of Models for Information Integration (FMII 2001), pages 16–18, 2001.
109. Jeffrey D. Ullman. Information Integration Using Logical Views. In Proceedings of the 6th International Conference on Database Theory (ICDT 1997), pages 19–40, 1997.
110. Ron van der Meyden. Logical Approaches to Incomplete Information: A Survey. In Logics for Databases and Information Systems, pages 307–356, 1998.
111. Holger Wache, Thomas Vogele, Ubbo Visser, Heiner Stuckenschmidt, G. Schuster, H. Neumann, and S. Hubner. Ontology-Based Integration of Information - A Survey of Existing Approaches. In Proceedings of the IJCAI-01 Workshop on Ontologies and Information Sharing, 2001.
112. Nancy Wiegand, Dan Patterson, Naijun Zhou, Steve Ventura, and Isabel F. Cruz. Querying Heterogeneous Land Use Data: Problems and Potential. In National Conference for Digital Government Research (dg.o), pages 115–121, 2002.
113. Huiyong Xiao and Isabel F. Cruz. RDF-based Metadata Management in Peer-to-Peer Systems. In The 2nd IST Workshop on Metadata Management in Grid and P2P System (MMGPS 2004), 2004.
114. Huiyong Xiao and Isabel F. Cruz. Integrating and Exchanging XML Data Using Ontologies. LNCS Journal on Data Semantics, Springer Verlag, 2006. (To appear).
115. Huiyong Xiao and Isabel F. Cruz. Ontology-based Query Rewriting in Peer-to-Peer Networks. In Proc. of the 2nd International Conference on Knowledge Engineering and Decision Support, 2006.
116. Huiyong Xiao, Isabel F. Cruz, and Feihong Hsu. Semantic Mappings for the Integration of XML and RDF Sources. In Workshop on Information Integration on the Web (IIWeb 2004), August 2004.
117. Cong Yu and Lucian Popa. Constraint-Based XML Query Rewriting For Data Integration. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 371–382, 2004.
VITA
NAME: Huiyong Xiao
EDUCATION:
Ph.D., Computer Science, University of Illinois at Chicago, Chicago, Illinois, 2006.
M.S., Computer Science, Tsinghua University, Beijing, China, 2002.
B.S., Computer Science, Huazhong University of Sci. and Tech., Wuhan, China, 1999.
PUBLICATIONS:
1. Huiyong Xiao and Isabel F. Cruz. Integrating and Exchanging XML Data using Ontolo-
gies. Journal of Data Semantics, 2006 (To appear).
2. Huiyong Xiao and Isabel F. Cruz. Ontology-based Query Rewriting in Peer-to-Peer Net-
works. In Proceedings of The 2nd International Conference on Knowledge Engineering
and Decision Support, pages 11-18, May, 2006.
3. Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel F. Cruz, Huiyong Xiao, and
Rajen Subba. The Problem of Ontology Alignment on the Web: a First Report. In Pro-
ceedings of The 2nd Web as Corpus Workshop (in conjunction with the 11th Conference
of the European Chapter of the ACL), pages 51-58, April, 2006.
4. Isabel F. Cruz and Huiyong Xiao. The Role of Ontologies in Data Integration. Journal
of Engineering Intelligent Systems, 13(4):245-252, December, 2005.
5. Huiyong Xiao and Isabel F. Cruz. A Multi-Ontology Approach for Personal Information
Management. In Proceedings of The 1st Workshop on Semantic Desktop (in conjunction
with the 4th International Conference of Semantic Web), pages 19-33, November, 2005.
6. Huiyong Xiao and Isabel F. Cruz. RDF-based Metadata Management in Peer-to-Peer
Systems. The 2nd IST Workshop on Metadata Management in Grid and P2P System
(MMGPS), December, 2004.
7. Huiyong Xiao, Isabel F. Cruz, and Feihong Hsu. Semantic Mappings for the Integration
of XML and RDF Sources. Proceedings of VLDB Workshop on Information Integration
on the Web (IIWeb), pages 40-45, August, 2004.
8. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. Peer-to-Peer Semantic Integration of
XML and RDF Data Sources. The 3rd International Workshop on Agents and Peer-to-
Peer Computing (AP2PC), July, 2004. LNCS 3601, Springer 2005.
9. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for
Semantic Interoperability between XML Sources. In Proceedings of the 8th International
Database Engineering and Applications Symposium (IDEAS), pages 217-226, July, 2004.
IEEE Computer Society 2004.
10. Isabel F. Cruz and Huiyong Xiao. Using a Layered Approach for Interoperability on the
Semantic Web. In Proceedings of the 4th International Conference on Web Information
Systems Engineering (WISE), pages 221-232, December, 2003. IEEE Computer Society
2003.