XML Schema Integration Resources : Louise Lane & Kalpdrum Passi, Sanjay Madria and Mukesh Mohania - “A Model for XML Schema Integration”, and My Research in Fall, 2001 with Dr. Madria



Contents

• What is XML
• Data Integration
• Why business applications use XML
• What is XML Schema
• Different ways to integrate XML data
• XML Schema Integration
• XML Namespaces
• Phases in Schema Integration
• XML Schema Data Model
• Graphical representation of the model
• Conflicts resolution
• Integration phase
• Construction of Global schema
• Advantages
• Disadvantages
• Conclusion

What is XML

XML is a markup language for documents containing structured information.

A markup language is a mechanism to identify structures in a document.

XML documents are self-describing, thus XML provides a platform independent means to describe data and therefore, can transport data from one platform to another.

XML documents can be created and used by applications.

Data Integration

E-commerce applications use data from different sources, and that data needs to be integrated. A mediated schema is created to represent a particular application domain, and the data sources are mapped as views over the mediated schema.

Why Business applications use XML

Business applications need to exchange data with other applications.

The data should be independent of its representation and of the platform.

XML is also used when one or more organizations merge. When organizations merge, interoperability among documents is necessary which can be achieved using XML integration.

XML Schema

XML Schema is recommended by the W3C as the standard schema language for validating XML documents.

XML Schema has a stronger expressive power than DTD schema for the purpose of data exchange and integration from various sources of data.

Different ways to integrate XML data

• Integrating XML documents

• Mapping of local schemas to global/integrated schema if the global schema is known, or Querying the data to obtain the required global schema.

• Integrating XML Schemas

Extracting Schema from XML Documents

Minimal Spanning graphs from different documents can be extracted and the Schema can be constructed using these graphs.

Heuristic rules are applied on the obtained spanning graphs to construct the schema.

The paper “Re-engineering Structures from Web Documents” – Chuang-Hue, Ee-Peng, and Wee-Keong deals with constructing Schema in DTD for given XML documents.

Complexities in integrating XML Documents

1. Need to extract the schema from the document.

2. Integrate the schemas obtained or perform mapping from the individual schema documents to the global schema if the global schema is already present.

3. Parse the XML documents and integrate the data according to the global schema. Querying on XML documents can be done to obtain the integrated document.

Tukwila Data Integration System

Tukwila Data Integration system uses a mediated schema to integrate data from different sources.

The user poses a query over the mediated schema; the data integration system reformulates the query over the data sources and executes it.

Tukwila uses a query reformulator and optimizer to query large amounts of data efficiently. The MiniCon algorithm is used to map the query from the mediated schema to the data sources.

It uses an x-scan operator that can query streaming XML data.

Tukwila x-scan operator

To query an XML document, querying techniques like XML-QL and XQL need the complete XML document to be downloaded before it can be queried.

Tukwila x-scan operator contd..

Tukwila X-scan matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network.
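The pipelined behavior described above can be illustrated with a short Python sketch. It is not Tukwila's x-scan operator: it uses the standard library's incremental parser and matches one fixed path rather than a regular path expression. The function name `stream_matches` and the sample data are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_matches(source, path):
    """Yield the text of each element whose tag path from the root equals `path`.

    A match is reported as soon as the element's end tag has been parsed,
    without waiting for the rest of the document.
    """
    stack = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == path:
                yield (elem.text or "").strip()
            stack.pop()
            elem.clear()  # drop content of subtrees that are already processed


# Hypothetical sample data, loosely based on the GSE1 document.
DOC = b"""<gs_equipment>
  <machine><serial_number>FRD6754</serial_number></machine>
  <location><airport>Vancouver</airport><terminal>6A</terminal></location>
</gs_equipment>"""

airports = list(stream_matches(io.BytesIO(DOC), ["gs_equipment", "location", "airport"]))
```

In a real streaming setting, `source` would be a network stream instead of an in-memory buffer, and results would be consumed one at a time from the generator.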

XML Schema Integration

The automated integration of XML schemas is beneficial to both the traditional forms of view integration and database integration.

An integrated schema forms the basis for a valid query language over a particular set of XML documents.

Because the schemas to be integrated currently validate a set of existing XML documents, data integrity and continued document delivery are chief concerns of the integration process.

XML Schema requires the use of namespaces to uniquely identify schema structures (elements, attributes, datatypes, etc.).

The name of each structure is prefaced by a namespace prefix which identifies the namespace that the structure is defined within.
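As a small illustration of how a namespace distinguishes two structures with the same local name, the Python sketch below parses two root elements that differ only in namespace. The two namespace URIs are taken from the example schemas; the one-line documents are simplified assumptions.

```python
import xml.etree.ElementTree as ET

GSE1 = "http://www.GSE1example.org"
GSE2 = "http://www.GSE2example.org"

doc1 = ET.fromstring(f'<gs_equipment xmlns="{GSE1}"><machine/></gs_equipment>')
doc2 = ET.fromstring(f'<gs_equipment xmlns="{GSE2}"><machine/></gs_equipment>')

# ElementTree exposes every tag as "{namespace}localname".
tag1 = doc1.tag
tag2 = doc2.tag

# The local names agree, but the fully qualified names do not.
same_local_name = tag1.split("}")[1] == tag2.split("}")[1]
same_qualified_name = tag1 == tag2
```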

A practical example of schema integration is when two companies merge.

XML Namespace

Documents and schemas of the companies that merge

<?xml version="1.0" ?>
<gs_equipment xmlns="http://www.GSE1example.org"
              xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
              xsi:schemaLocation="http://www.GSE1example.org GSE1.xsd">
  <machine type="baggage_handler">
    <supplier>Air to Ground</supplier>
    <serial_number>FRD6754</serial_number>
    <service_agreement>
      <expiry_date>01-01-2006</expiry_date>
    </service_agreement>
    <service_hours>345</service_hours>
  </machine>
  <location>
    <airport>Vancouver</airport>
    <terminal>6A</terminal>
  </location>
</gs_equipment>

<?xml version="1.0" ?>
<gs_equipment xmlns="http://www.GSE2example.org"
              xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
              xsi:schemaLocation="http://www.GSE2example.org GSE2.xsd">
  <placement>
    <airport>Winnipeg</airport>
    <terminal>main</terminal>
  </placement>
  <machine type="tow_truck">
    <serial_number>123456145</serial_number>
    <vendor>Quick as a Jet GSE</vendor>
    <service_agreement>QJ-TT-123456145-September 2003</service_agreement>
    <service_hours>1090.75</service_hours>
  </machine>
</gs_equipment>

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
        targetNamespace="http://www.GSE1example.org"
        elementFormDefault="qualified"
        xmlns:GSE1="http://www.GSE1example.org">
  <element name="gs_equipment">
    <complexType>
      <sequence>
        <element ref="GSE1:machine" minOccurs="1" maxOccurs="1"/>
        <element ref="GSE1:location" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="machine">
    <complexType>
      <sequence>
        <element name="supplier" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="serial_number" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element ref="GSE1:service_agreement" minOccurs="1" maxOccurs="1"/>
        <element name="service_hours" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
      </sequence>
      <xsd:attribute name="type" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="boarding_stairs"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </complexType>
  </element>
  <element name="service_agreement">
    <complexType>
      <sequence>
        <element name="expiry_date" type="xsd:date" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="location">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
</schema>

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
        targetNamespace="http://www.GSE2example.org"
        elementFormDefault="qualified"
        xmlns:GSE2="http://www.GSE2example.org">
  <element name="gs_equipment">
    <complexType>
      <sequence>
        <element ref="GSE2:placement" minOccurs="1" maxOccurs="1"/>
        <element ref="GSE2:machine" minOccurs="0" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="placement">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="machine">
    <complexType>
      <all>
        <element name="vendor" type="xsd:string" minOccurs="0" maxOccurs="1"/>
        <element name="service_hours" type="xsd:decimal" minOccurs="0" maxOccurs="1"/>
        <element name="serial_number" type="xsd:positiveInteger" minOccurs="0" maxOccurs="1"/>
        <element name="service_agreement" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      </all>
      <xsd:attribute name="type" use="optional">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="tow_truck"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </complexType>
  </element>
</schema>

An object-oriented data model called XSDM (XML Schema Data Model) is defined.

A three-layered architecture consisting of pre-integration, comparison and integration is used for the integration.

A global schema must meet the following criteria: completeness, minimality and understandability.

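Optionality of elements is expanded to meet boundary restrictions: for example, an element that is required in one local schema but optional or absent in the other becomes optional in the global schema.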
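One way to read the expansion of optionality is as taking the loosest occurrence bounds found in the local schemas. The Python sketch below shows that reading; the function name `merge_occurs` and the exact rule (smallest minOccurs, largest maxOccurs) are assumptions, chosen because they agree with the occurrence bounds in the example global schema.

```python
def merge_occurs(a, b):
    """Return the loosest (minOccurs, maxOccurs) pair that covers both inputs.

    maxOccurs may be an int or the string "unbounded".
    """
    min_a, max_a = a
    min_b, max_b = b
    if "unbounded" in (max_a, max_b):
        merged_max = "unbounded"
    else:
        merged_max = max(max_a, max_b)
    return (min(min_a, min_b), merged_max)


# serial_number is required in GSE1 (1..1) and optional in GSE2 (0..1).
merged = merge_occurs((1, 1), (0, 1))
```

With these inputs the merged bounds are 0..1, so documents valid under either local schema still satisfy the occurrence constraint.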

Three Phases of integration

Pre-Integration: In this phase, element, attribute and datatype definitions are extracted by parsing the actual schema documents.

Comparison: In this phase, the correspondences between elements and attributes are determined either by using semantic learning or using human interaction.

Integration: In this phase, conflicts that exist between the corresponding elements and/or attributes such as naming conflicts, datatype conflicts and structural conflicts are resolved.
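The pre-integration step can be sketched in Python as a pass over a schema document that collects every named element declaration. This is a minimal sketch, not the authors' parser: the function name `extract_elements`, the tuple layout and the reduced sample schema are assumptions. The namespace URI is the one used in the example schemas.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2000/10/XMLSchema"

# Reduced, hypothetical schema fragment based on the GSE1 "location" element.
SCHEMA = f"""<xsd:schema xmlns:xsd="{XSD_NS}">
  <xsd:element name="location">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>"""


def extract_elements(schema_text):
    """Return (name, type, minOccurs, maxOccurs) for every named element declaration."""
    root = ET.fromstring(schema_text)
    found = []
    for elem in root.iter(f"{{{XSD_NS}}}element"):
        name = elem.get("name")
        if name is not None:  # skip references such as <element ref="..."/>
            found.append((name, elem.get("type"),
                          elem.get("minOccurs", "1"), elem.get("maxOccurs", "1")))
    return found


definitions = extract_elements(SCHEMA)
```

The comparison phase would then look for correspondences between the tuples extracted from each local schema.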

XML Schema Data Model (XSDM)

Four structures are defined: Node Object, Child Object, Datatype Object and Attribute Object.

Node Object: Represents an element, which may be either non-terminal or terminal. Each node has a set of properties that define it: Name, Namespace, Attribute, Datatype, Substitution Group Name, Child List and Node Type, which takes one of six values: terminal, sequence, choice, all, any or empty.

Child Object: Represents an element that is part of a child list. Each child has properties that define it: Name, Namespace, Max Occurrences and Min Occurrences.

XML Schema Data Model (XSDM) contd..

Datatype Object: Represents the datatype of elements and attributes. The properties that define it are Name, Variety (atomic, union, list), Kind (43 simple and derived datatypes) and Constraining Facets.

Attribute Object: Represents an attribute associated with a non-terminal or terminal element. The properties that define an attribute are Name, Namespace, Use, Datatype and Value (default value).
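The four XSDM structures map naturally onto plain record types. The Python sketch below is one possible encoding, not the notation used in the paper: the class and field names are assumptions derived from the property lists above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datatype:
    name: str
    variety: str = "atomic"              # atomic, union or list
    kind: Optional[str] = None           # the simple or derived datatype
    facets: dict = field(default_factory=dict)  # constraining facets


@dataclass
class Attribute:
    name: str
    namespace: str
    use: str                             # e.g. "required" or "optional"
    datatype: Datatype
    value: Optional[str] = None          # default value


@dataclass
class Child:
    name: str
    namespace: str
    min_occurrences: int = 1
    max_occurrences: int = 1


@dataclass
class Node:
    name: str
    namespace: str
    node_type: str                       # terminal, sequence, choice, all, any or empty
    datatype: Optional[Datatype] = None
    attributes: List[Attribute] = field(default_factory=list)
    substitution_group: Optional[str] = None
    child_list: List[Child] = field(default_factory=list)


GSE1 = "http://www.GSE1example.org"
location = Node("location", GSE1, "sequence",
                child_list=[Child("airport", GSE1), Child("terminal", GSE1)])
```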

Graphical Representation of XML Schemas

Graphical representation of sample schema for GSE1

Graphical representation of sample schema for GSE2

Conflict Resolution

Naming Conflicts:

Synonym naming conflict: Different names but the same definition. Resolved using substitution group names.

Homonym naming conflict: Same name but different structure. Homonym conflicts at non-terminals are called structural conflicts, and at terminals they are called datatype conflicts.

Conflict Resolution contd..

Datatype & scale differences:

Disjoint or incompatible datatypes – union

E.g. String, integer

Compatible datatypes – scale adjustment

E.g. Integer, float

Enumerated datatype – taking set of all the enumerations

E.g. {a,b}, {b,c} => {a,b,c}

Scale differences – constraint facet redefinition
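The datatype rules above can be sketched as a small Python function. This is an illustrative reading, not the paper's algorithm: the name `resolve_datatype`, the result tuples and the `NUMERIC_WIDENING` list are assumptions. The list covers only the numeric types that appear in the example schemas, ordered from narrowest to widest.

```python
# Assumed widening order: positiveInteger derives from integer, which derives from decimal.
NUMERIC_WIDENING = ["positiveInteger", "integer", "decimal"]


def resolve_datatype(a, b):
    """Resolve two local datatypes into one global datatype.

    A datatype is either a built-in type name (str) or an enumeration (frozenset of values).
    """
    a_enum, b_enum = isinstance(a, frozenset), isinstance(b, frozenset)
    if a_enum and b_enum:
        # Enumerated datatypes: take the set of all enumeration values.
        return ("enumeration", sorted(a | b))
    if a_enum or b_enum:
        raise ValueError("mixing an enumeration with a named type is not covered by this sketch")
    if a == b:
        return ("atomic", a)
    if a in NUMERIC_WIDENING and b in NUMERIC_WIDENING:
        # Compatible datatypes: keep the wider of the two.
        return ("atomic", max(a, b, key=NUMERIC_WIDENING.index))
    # Disjoint or incompatible datatypes: form a union.
    return ("union", sorted([a, b]))


service_hours = resolve_datatype("integer", "decimal")
serial_number = resolve_datatype("string", "positiveInteger")
machine_type = resolve_datatype(frozenset({"baggage_handler", "boarding_stairs"}),
                                frozenset({"baggage_handler", "tow_truck"}))
```

The three calls mirror the `service_hours`, `serial_number` and `type` declarations of the two example schemas.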

Conflict Resolution contd..

Structural Conflicts:

Type Conflicts: Terminal in one schema and non-terminal in another schema – Add both to the global schema.

Key conflicts:

If both schemas have their individual keys, then the global schema’s key should be a composite of both the keys.

If an element is declared as key in one schema and as a non-key in other schema, a complete knowledge of the data present in the documents is required.

If the same element is declared as key in both the schemas, a prefix can be added to the keys to make the key elements unique globally.

Integration phase

1. Constructing correspondences table

2. Constructing dependencies table

The correspondences table contains the information about the corresponding elements/attributes.

An entry in the Dependencies table denotes the dependency of an element on other elements/attributes.

The elements/attributes are integrated only after their dependencies are integrated.
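Integrating an element only after its dependencies amounts to processing the dependencies table in topological order. The Python sketch below shows this with the standard library's `graphlib` module (Python 3.9 or later). The table contents are an assumption based on the GSE1 example schema, and the dictionary layout is hypothetical.

```python
from graphlib import TopologicalSorter

# Each entry maps an element to the elements it depends on (its children here).
dependencies = {
    "gs_equipment": {"machine", "location"},
    "machine": {"supplier", "serial_number", "service_agreement", "service_hours"},
    "service_agreement": {"expiry_date"},
    "location": {"airport", "terminal"},
}

# static_order() yields every element after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

Several valid orders exist. The only guarantee is that a dependency always precedes the element that needs it, so `gs_equipment` comes last.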

Graphical representation of Global schema obtained

Construction of the Global schema Document

Once the integration process is completed, the global schema in XSDM notation is used to construct the global XML schema document.

The construction of the XML schema document is a straightforward process because all the data about the schema is present in the XSDM notation.
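As a rough illustration of this last step, the Python sketch below serializes one sequence-type node into a schema element declaration. It is an assumption-laden sketch rather than the authors' generator: the function name `build_sequence_element` and the tuple layout for children are hypothetical, and the namespace URI is the one used in the example schemas.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2000/10/XMLSchema"
ET.register_namespace("xsd", XSD_NS)  # serialize with the "xsd" prefix


def build_sequence_element(name, children):
    """Build an element declaration whose content is a sequence.

    children: list of (name, type, minOccurs, maxOccurs) tuples.
    """
    def q(local):
        return f"{{{XSD_NS}}}{local}"

    element = ET.Element(q("element"), {"name": name})
    complex_type = ET.SubElement(element, q("complexType"))
    sequence = ET.SubElement(complex_type, q("sequence"))
    for child_name, child_type, min_occurs, max_occurs in children:
        ET.SubElement(sequence, q("element"), {
            "name": child_name,
            "type": child_type,
            "minOccurs": str(min_occurs),
            "maxOccurs": str(max_occurs),
        })
    return element


location = build_sequence_element("location", [
    ("airport", "xsd:string", 1, 1),
    ("terminal", "xsd:string", 1, 1),
])
xml_text = ET.tostring(location, encoding="unicode")
```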

Global schema document

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema"
        xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
        targetNamespace="http://www.GSEMexample.org"
        elementFormDefault="qualified"
        xmlns:GSEM="http://www.GSEMexample.org"
        xmlns:GSE2="http://www.GSE2example.org">
  <element name="gs_equipment">
    <complexType>
      <choice>
        <sequence>
          <element ref="GSEM:machine" minOccurs="1" maxOccurs="1"/>
          <element ref="GSEM:location" minOccurs="1" maxOccurs="1"/>
        </sequence>
        <sequence>
          <element ref="GSEM:location" minOccurs="1" maxOccurs="1"/>
          <element ref="GSEM:machine" minOccurs="0" maxOccurs="1"/>
        </sequence>
      </choice>
    </complexType>
  </element>
  <element name="machine">
    <complexType>
      <all>
        <element name="supplier" type="xsd:string" minOccurs="0" maxOccurs="1"/>
        <element name="serial_number" type="GSEM:serial_number_type" minOccurs="0" maxOccurs="1"/>
        <element ref="GSEM:service_agreement" minOccurs="0" maxOccurs="1"/>
        <element ref="GSE2:service_agreement" minOccurs="0" maxOccurs="1"/>
        <element name="service_hours" type="xsd:decimal" minOccurs="0" maxOccurs="1"/>
        <element name="vendor" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      </all>
      <xsd:attribute name="type" use="optional">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="boarding_stairs"/>
            <xsd:enumeration value="tow_truck"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </complexType>
  </element>
  <xsd:simpleType name="serial_number_type">
    <xsd:union memberTypes="xsd:string xsd:positiveInteger"/>
  </xsd:simpleType>
  <element name="service_agreement">
    <complexType>
      <sequence>
        <element name="expiry_date" type="xsd:date" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="location" substitutionGroup="GSEM:placement">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="placement">
    <complexType>
      <sequence>
        <element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </sequence>
    </complexType>
  </element>
</schema>

  

Advantages

This method is useful when a required global schema is not present.

The global XML schema obtained is complete, minimal and understandable.

Human interaction is required only to a limited extent.

Even when the local schemas are large and complex, the global schema can be obtained efficiently.

Disadvantages

User interaction is required; the task cannot be done using semantic learning alone.

The method does not resolve all key conflicts. Complete knowledge of the data is required to resolve these.

The method has no cross-check on the user's input. The process may produce a non-minimal schema if the user doesn't recognize all the correspondences.

Conclusion

This method is successful in integrating schema documents.

The method explained is implementable.