Using Schematron as Schema Language in Conceptual Modeling ...crpit.com/confpapers/CRPITV143Benda.pdf · Using Schematron as Schema Language in Conceptual Modeling for XML Sobeslav

Using Schematron as Schema Language in Conceptual Modeling for XML

Sobeslav Benda Jakub Klımek Martin Necasky

XML and Web Engineering Research GroupCharles University in Prague

Malostranske namestı 25, 118 00 Praha 1, The Czech RepublicEmail: [email protected]

Abstract

Today, XML is a standard for message exchange insideand among IT infrastructures. For the exchange to workan XML format must be negotiated between the commu-nicating parties. The format is often expressed as an XMLschema. In our previous work, we introduced a conceptualmodel for XML, which utilizes modeling, evolution andmaintenance of a set of XML schemas and allows export-ing modeled formats into grammar-based XML schemalanguages like DTD and XML Schema. However, thereis another type of XML schema languages called rule-based languages with Schematron as their representative.Expressing XML schemas in Schematron has advantagesover grammar-based languages and in this paper, we iden-tify the advantages and we propose a method for easiercreation and maintenance of Schematron schemas usingour conceptual model. Also, we discuss the possibili-ties and limitations of translation from our grammar-basedconceptual model to the rule-based Schematron.

Keywords: XML schema, conceptual modeling, Schema-tron, translation

1 Introduction

Today, XML has many applications in various IT infras-tructures. When using XML, communication partnersmust agree on the used XML formats, i.e. which ele-ments and attributes may be present, in which order, etc.A specification of an XML format is an XML schema -a collection of rules which XML documents must satisfy.Programs that can automatically verify document valid-ity are called validators. There is a number of declarativelanguages called XML schema languages used for descrip-tion of schemas. The aim of these languages is to simplifythe creation, maintenance, readability and portability ofschemas.

The standardized schema languages are DocumentType Definition (DTD), W3C XML Schema and REg-ular LAnguage for XML Next Generation (Relax NG).These languages have differences in some features (e.g.expressive power, syntax complexity, object-oriented de-sign, etc). A common feature of these languages is

This work was supported in part by the Czech Science Foundation(GACR), grant number P202/10/0573 and in part by the grant SVV-2012-265312.

Copyright c©2013, Australian Computer Society, Inc. This paper ap-peared at the 9th Asia-Pacific Conference on Conceptual Modelling(APCCM 2013), Adelaide, Australia, January-February 2013. Confer-ences in Research and Practice in Information Technology (CRPIT), Vol.143, Flavio Ferrarotti and G. Grossmann, Ed. Reproduction for aca-demic, not-for-profit purposes permitted provided this text is included.

their formal background which is a Regular Tree Gram-mar (RTG) (Murata et al. 2005). RTG determines themaximum expressive power and gives instructions for theconstruction of validators. Commonly, we call these lan-guages grammar-based schema languages or grammarsfor short.

However, it is possible to express XML schemas inother languages that are not based on RTG. An example ofsuch language is an also standardized Schematron (Jelliffe2001). Briefly, Schematron allows describing schemas us-ing XPath conditions, that are evaluated over a given XMLdocument during validation. This brings interesting possi-bilities for the validation of XML documents.

Motivation In our previous work (Necasky, Mlynkova,Klımek & Maly 2012, Necasky, Klımek, Maly &Mlynkova 2012), we developed a methodology for mod-eling, evolution and maintenance of XML schemas usinga multilevel conceptual model based on Model Driven Ar-chitecture (MDA) (Miller & Mukerji 2003) and we intro-duced many extensions (Necasky, Mlynkova & Klımek2011, Klımek & Necasky 2011) and described several usecases (Necasky, Klımek & Maly 2011) of our approach.So far, we have only supported grammar–based XMLschema languages, because of their popularity due to un-derstandable declarations and efficient validation. Whileit is true that for relatively simple schemas DTD will doand for more complex structures XML Schema will pro-vide the necessary constructs, there are also drawbacksto these widely used languages. For example, when wevalidate documents using DTD or XML Schema, we usu-ally get a simple valid/invalid statement as a result. Inthe more interesting case of invalidity, the validators usu-ally return a built-in error message, which is often incom-prehensible, misleading and does not provide means forquality diagnostics (Nalevka 2010). In addition, it is of-ten not possible (or user-friendly) to pass them directlyto the user interface. Regarding this diagnostic prob-lem, Schematron schema can help. Schematron is oftendescribed as a language for description of integrity con-straints (Murata et al. 2005), but it is more than that. Us-ing Schematron, it is possible to describe most constraintsthat can be expressed by grammars. Moreover, it is possi-ble to describe many additional details and even structuralconstraints that we can not express using grammar-likelanguages like XML Schema. In (Jelliffe 2007), the au-thors identify the demand for Schematron-based solutionsfor XML schema management, which is another motiva-tion for adding support for Schematron to our conceptualmodel. Finally, when combined with the approach to ex-press integrity constraints in the conceptual model (Maly& Necasky 2012), Schematron becomes a unified schemalanguage for description of the structure and integrity con-straints of XML documents and a framework for detaileddiagnostics and error reporting. These advantages of usingSchematron outweigh its main disadvantage, which is its

Proceedings of the Ninth Asia-Pacific Conference on Conceptual Modelling (APCCM 2013), Adelaide, Australia

31

verbosity and complexity, because it can be eased by theusage of our conceptual model for schema management.

Outline The paper is organized as follows: In Section 2,we introduce our conceptual model for XML. In Section 3,we introduce the Schematron language. Section 4 con-tains the main contribution of this paper, the translationfrom the conceptual model to Schematron schemas. InSection 5 we discuss related work, in Section 6 we evalu-ate our approach and we conclude in Section 7.

2 Conceptual modeling of XML schemas

In this section, we briefly introduce our conceptual modelfor XML (see (Necasky, Mlynkova, Klımek & Maly2012) for full description). It is based on two levels of ab-straction. The Platform-Independent Model (PIM) mod-els the problem domain independently of any target plat-form such as XML or relational databases. The Platform-Specific Model (PSM) then provides description of howa part of the problem domain is represented in the targetplatform, in our case XML. A PSM schema is thereforea description of an XML format. From a PSM schema,we can automatically create a representation of the formatin a chosen XML schema language such as XML Schema(our previous work (Necasky, Mlynkova, Klımek & Maly2012)), or, in the case of this paper, Schematron. The mainfeature of the conceptual model is a mapping, which spec-ifies for each component in each PSM schema to whichcomponent in the PIM schema it corresponds. We ex-ploit this mapping for automatic propagation of changesbetween the two levels, which simplifies the managementof multiple XML schemas (Necasky, Klımek, Maly &Mlynkova 2012).

2.1 Platform-Independent Model

A PIM schema S is based on UML class diagrams andmodels real-world concepts and relationships betweenthem independently of the target platform (implementa-tion). It contains three types of components: classes, at-tributes and associations with the usual semantics. A sam-ple PIM schema is in Figure 1(a). Formally, we define itssimplification in Definition 1.

Definition 1 A platform-independent (PIM) schema is atriple S = (Sc,Sa,Sr) of disjoint sets of classes, at-tributes, and associations, respectively.

• Class C ∈ Sc has a name assigned by function name

• Attribute A ∈ Sa has a name, data type and cardi-nality assigned by functions name, type, and card,respectively. Moreover, A is associated with a classfrom Sc by function class.

• Association R ∈ Sr is a set R = {E1, E2}, whereE1 and E2 are called association ends of R. Rhas a name assigned by function name. Both E1and E2 have a cardinality assigned by function cardand are associated with a class from Sc by func-tion participant. We will call participant(E1) andparticipant(E2) participants of R. name(R) may beundefined, denoted by name(R) = λ.

For a class C ∈ Sc, we will use attributes (C ) to denotethe set of all attributes of C. Similarly, associations (C )will denote the set of all associations with C as a partici-pant.

2.2 Platform-Specific Model

The platform-specific model (PSM) specifies how a partof the reality modeled on the PIM level is represented inthe target platform, XML in our case, which makes thePSM schemas views of the PIM schema. Its advantage isthat the designer works in a UML-style way which is morecomfortable then editing the XML schema itself and alsoenables the maintenance of mappings to the PIM level.The individual constructs on the PSM level are, however,slightly modified to reflect the structure of XML docu-ments. A PSM schema represents an XML format andcan be automatically translated to the XML Schema lan-guage. Very briefly, classes represent complex types, at-tributes represent XML attributes or XML elements thathave simple types, associations represent the nesting re-lation. For full translation description see our previouswork (Necasky, Mlynkova, Klımek & Maly 2012). For-mally, PSM schema is defined by Definition 2. An exam-ple is in Figure 1(b). In Figure 1(c) there is a sample XMLdocument with its structure modeled by the PSM schema.

Definition 2 A platform-specific (PSM) schema is a tu-ple S ′ = (S ′c,S ′a,S ′r,S ′m, C′S′) of disjoint sets of classes,attributes, associations, and content models, respectively,and one specific class C′S′ ∈ S ′c called schema class.

• Class C ′ ∈ S ′c has a name assigned by function name

• Attribute A′ ∈ S ′a has a name, data type, cardinal-ity and XML form assigned by functions name, type,card and xform, respectively. xform(A′) ∈ {e, a} (el-ement or attribute). Moreover, it is associated with aclass from S ′c by function class and has a position as-signed by function position within the attributes as-sociated with class(A′).

• Association R′ ∈ S ′r is a pair R′ = (E′1, E

′2), where

E′1 and E′

2 are called association ends of R′. BothE′

1 and E′2 have a cardinality assigned by func-

tion card and each is associated with a class orcontent model from S ′c ∪ S ′m assigned by functionparticipant, respectively. We will call participant(E′

1)and participant(E′

2) parent and child and will de-note them by parent(R′) and child(R′), respectively.Moreover, R′ has a name assigned by function nameand has a position assigned by function positionwithin the associations with the same parent(R′).name(R′) may be undefined, denoted by name(R′) =λ.

• Content model M ′ ∈ S ′m has a content modeltype assigned by function cmtype. cmtype(M ′) ∈{sequence, choice, set}.

The graph (S ′c ∪ S ′m,S ′r) must be a forest of rootedtrees with one of its trees rooted in C′S′ . For C ′ ∈ S ′c,attributes (C ′ ) will denote the sequence of all attributesof C ′ ordered by position, i.e. attributes (C ′ ) = (A′

i ∈S ′a : class(A′

i) = C ′ ∧ i = position(A′i)). Similarly,

content (C ′ ) will denote the sequence of all associationswith C ′ as a parent ordered by position, i.e. content (C ′ )= (R′

i ∈ S ′r : parent(R′i) = C ′ ∧ i = position(R′

i)). Wewill call content (C ′ ) content of C ′.

2.3 Interpretation of PSM schema against PIMschema

A PSM schema represents a part of a PIM schema. Aclass, attribute or association in the PSM schema maybe mapped to a class, attribute or association in the PIMschema. The mapping specifies the semantics of classes,attributes and associations of the PSM schema in terms of

CRPIT Volume 143 - Conceptual Modelling 2013

32

0..* makes

0..*

0..*

1..*

1..*

Supply

amount

supply-price

date

name

email {1..*}

Supplier

Product

title

price

code

Item

tester

item-price

amount

Customer

name

email {1..*}

phone {0..*}

Purchase

code

create-date

status

Address

street

city

LocalAddress

has

GlobalAddress

country

ShippingAddress

gps

Items

Item

ProductBase

Product|

Purchase

@version

Customer

name

Contact

email {1..*}

phone {0..*}

ItemTester

@tester

ItemPricing

price

amount

cust items

item

PurRQSchema

purchaseRQ

1..*

Address

street

city

addr

ShippingAddress

country

gps

LocalAddress

<purchaseRQ version="1.1.4.">

<cust>

<name>Jakub Klímek</name>

<email>[email protected]</email>

<addr>

<street>Malostranské náměstí 25</street>

<city>Prague</city>

<country>Czech Republic</country>

<gps>50.088385,14.403629</gps>

</addr>

</cust>

<items>

<item tester="no">

<code>IT1234</code>

<title>Sample 4</title>

<price>20€</price>

<amount>4</amount>

</item>

</items>

</purchaseRQ>

(a) (b) (c)

Figure 1: Example of a PIM and a PSM schema and a sample corresponding XML document

the PIM schema. The mapping must meet certain condi-tions to ensure consistency between PIM schemas and thespecified semantics of the PSM schema. This mapping isthen utilized in various interesting use cases for the con-ceptual model like XML schema evolution and integra-tion (Klımek & Necasky 2010, Klımek & Necasky 2011,Necasky, Klımek, Maly & Mlynkova 2012). For the pre-cise conditions of the mapping, see (Necasky, Mlynkova,Klımek & Maly 2012). In this paper, we focus on thetranslation from a PSM schema to Schematron and there-fore, the precise definition of interpretation is not neces-sary here.

2.4 Conceptual model summary

In summary, the usefulness of our conceptual model forXML can be clearly seen when we, for example, ask ques-tions like ”In which of our hundred XML schemas used inour system is the concept of a customer represented?”and ”What impact on my XML schemas would this par-ticular change on the conceptual level have?”. Even bet-ter, with our extensions for evolution of XML schemas(Klımek & Necasky 2010, Necasky, Klımek, Maly &Mlynkova 2012) we can make changes to the PIM schema(e.g. change the representation of a customer’s name fromone string to firstname and lastname) and thosechanges can be automatically propagated to all the af-fected PSM schemas. Thanks to automated translationsfrom PSM schemas to, for example, XML Schema andback (Necasky, Mlynkova, Klımek & Maly 2012), we caneasily manage a whole system of XML schemas from theconceptual level all thanks to the interpretations. Theseextensions, however, are not trivial and are not in thescope of this paper. Also, it would be possible to gen-erate a clickable HTML documentation of a system mod-eled using our conceptual model. With the model, it isalso much easier and faster to grasp a system of multipleXML schemas when, for example, negotiating interfacesbetween two information systems. We already have ourmodel implemented in our tool called eXolutio1.

3 Schematron

Schematron is a declarative language which represents therule-based XML schema languages. These languages arenot based on construction of a grammatical infrastruc-ture. Instead, they use rules resembling if-then-else state-ments to describe constraints. These languages offer thefinest granularity of control over the format of the docu-ments (Vlist June 2002). We can even view constructs ofother schema languages as a syntactical sugar used insteadof sets of rule-based conditions. Schematron was designed

1http://exolutio.com

in 1999 by Rick Jelliffe. The language was standardizedin 2005 as ISO Schematron (Jelliffe 2001).

Example 1 Consider a specification of a com-plex element using the following DTD declaration<!ELEMENT purchase (item+,customer?)>.In Schematron, it is possible to describe the same seman-tics and cover valid instances using multiple intuitiveconditions, for example: If purchase element exists,the element can only have an item and a customerelements as children. The item element has at least oneoccurrence and the customer element has zero or oneoccurrence. If the customer element exists as a childelement of a purchase element, then the customerelement has no following sibling elements.

Schematron is not a standalone language. It is a gen-eral framework which allows schema designers to orga-nize conditions which are evaluated over the given doc-uments. These conditions are described using an under-lying XML query language such as the default XPath. Aresult of a validation is a report which contains informa-tion about evaluation of these conditions. Schematron isan XML-based language and uses only few elements andattributes for schema description.

3.1 Core constructs

Now we describe core Schematron constructs. The rootelement of every schema is a schema element introduc-ing the required XML namespace2. A pattern elementis a basic building stone for expressing an ordered collec-tion of Schematron conditions which are ordered in XMLdocument order. A rule is a Schematron condition whichallows a designer to specify a selection of nodes from agiven document and evaluation of predicates in the con-text of these nodes. The rule element has a requiredcontext attribute used for an expression in the under-lying query language. The value of the context at-tribute is commonly called a path. Predicates are spec-ified using a collection of assertions. An assertion is apredicate which can be positive or negative. An asser-tion is represented using the assert and report el-ements. Both elements have a required test attributefor specification of a predicate using the underlying querylanguage. Both elements also have a text content callednatural-assertion. Natural-assertion is a message in a nat-ural language, which a validator can return in the valida-tion report. A positive predicate is represented using anassert element and if it is evaluated as false, we saythat the assert is violated and the document is invalid. Anegative predicate is represented using a report elementand if it is evaluated as true, we say that the report is active

2http://purl.oclc.org/dsdl/schematron


33

http://exolutio.com

http://purl.oclc.org/dsdl/schematron

and a natural-assertion will be reported. Schematron is notonly a validation language. It is a more general XML re-porting language (Ogbuji 2004) where one type of reportis an error message.

Example 2 A pattern in Figure 2 selects all triangleelements from a document. In a context of everytriangle element a positive predicate specified with ex-pression count(vertex)=3 is evaluated. If the given

<pattern><rule context="triangle"><assert test="count(vertex)=3">The element ’triangle’ shouldhave 3 ’vertex’ elements.

</assert></rule>

</pattern>

Figure 2: Schematron pattern

triangle has for example four child vertex elements,then the predicate will be false and the following messagewill be reported: The element ’triangle’ should have 3’vertex’ elements.

3.2 Additional constructs

In addition to the core Schematron constructs we de-scribe other constructs used in practical applications.Schematron allows using metadata for introduced con-structs. Identifiers allow identification of a pattern insidea schema, a rule inside the pattern, etc. It is representedusing an id attribute. A role attribute can be used to as-sign special semantics to Schematron constructs. A diag-nostic is a natural-language message giving details abouta failed assertion, such as found versus expected valuesand repair hints. It is represented using a diagnosticelement with required id attribute and text content with amessage. Diagnostics are referenced by assertions usinga diagnostics attribute. We can use substitutions innatural-assertions for clearer result in validation reports.An element name is substituted by the name of a contextnode. An element value-of is substituted by a valuefound or calculated using an expression in the requiredselect attribute. Abstract rules provide a mechanismfor reducing schema size. An abstract rule can be invokedby other rules belonging to the same pattern. Variablesare substituted in assertion tests and other expressions be-fore the expression is evaluated. Phases allow to organizepatterns into identified parts. Every Schematron schemahas one default phase which includes all patterns. Beforevalidation, it can be determined which phase is used andwhich patterns are activated. This selected phase is calledan active-phase. A phase is represented using a phaseelement with an id attribute. One phase can have mul-tiple active elements which refer to patterns using apattern attribute. A variable is represented using alet element. If the variable is a child of a rule, thevariable is calculated in scope of the current rule and con-text. Otherwise, the variable is calculated within the con-text of the instance document root. Abstract patterns allowa common definition mechanism for structures which usedifferent names and paths, but which are the same other-wise.

3.3 Schematron implementations

An implementation of Schematron validation is very sim-ple in general, because it is based on already implementedXML technologies. There are two kinds of Schematronvalidation.

3.3.1 XSLT validation

For this approach, we only need an XSLT processor anda predefined XSLT script3. The script translates the givenSchematron schema to another XSLT script which is usedfor the actual XML document validation. During the val-idation, a given XML document is transformed into an-other XML document. This document is the result of thevalidation and may be formatted using standard Schema-tron Validation Report Language (SVRL) (Jelliffe 2001),which provides rich information about the validation pro-cess, e.g. XPaths for elements which violated assertions.

3.3.2 Special libraries

Another approach is to use a special (platform-dependent)library. Some libraries4 only wrap the described XSLTvalidation. However, there are other implementations notbased on XSLT. These libraries are based on the evaluationof XPath expressions. This allows a programmer to adaptthe validation for special requirements or possibilities of atarget platform. For example, we have implemented a C#validator called SchemaTron (Benda et al. 2011) provid-ing excellent performance for XML content-based mes-sage routing inside an intermediate service.

3.4 Schematron properties

In this section, we describe selected properties of Schema-tron schemas in the context of conceptual modeling ofXML and compare them with grammar-based schemas.We mostly consider XML Schema 1.0 as their representa-tive because we already support it in our conceptual modelfor XML.

3.4.1 Platform independence

Schematron is based on standard XML technologies,which are commonly implemented in many software en-vironments, e.g. XSLT processor is natively implementedin standard web browsers. For these reasons, we can seeSchematron as a platform independent XML schema lan-guage, because we do not need a specific validator.

3.4.2 Expressive power

Presently, there is no formal framework (Kwong & Gertz2006) which could capture the broad set of possibilities ofSchematron conditions.

The authors of (Lee & Chu 2000) provide some ba-sic expressive possibilities of Schematron, compare itwith other schema languages and show by example thatSchematron has an excellent expressive power and in thisregard it is (e.g. with XSD) a first class XML schemalanguage. The authors describe, that we can specify forexample: parent-child relationships, sequences, choicesamong elements and attributes, unordered sets, min andmax occurences, etc. Moreover, we can specify manyXML formats, which can not be expressed using gram-mars, for example conditional definitions (e.g. a presenceof elements or attributes is dependent on a value of anotherelement or attribute) or detailed integrity constraints, etc.

However, there is not any precise generalization ofSchematron rules which would provide a clever mappingof regular tree grammar into Schematron rules (and viceversa), but experiments show (Jelliffe 2007) that it is pos-sible to describe many instances of grammars in Schema-tron, even in diverse ways.

3http://www.schematron.com/tmp/iso-schematron-xslt1.zip

4http://www.probatron.org/


34

http://www.schematron.com/tmp/iso-schematron-xslt1.zip

http://www.schematron.com/tmp/iso-schematron-xslt1.zip

http://www.probatron.org/

4 PSM to Schematron translation

A PSM schema models a grammar-based XML formatspecification and its concepts are interpreted against PIMconcepts. There are several theoretical and practical prob-lems that we must consider when we want to describe thetranslation of a PSM schema to a Schematron schema.In particular, we need to identify groups of Schematronrules that impose equivalent constraints on the documentsas constructs of grammar-based languages would.

4.1 Overall view of the translation

The translation algorithm (see Algorithm 1) is fully auto-matic. It has a PSM schema on the input and it gradu-ally builds a Schematron schema on the output. The gen-erated schema covers grammatical structural constraintsnormally expressed in, for example, XSD.

Algorithm 1 Overall view of the translation algorithm1: <schema xmlns="http://purl.oclc.org/dsdl/schematron">

2: Generate allowed root element names (Section 4.2);3: Generate allowed names (Section 4.3);4: Generate allowed contexts (Section 4.4);5: Generate required structural constraints (Section 4.5);

6: Generate required sibling relationships (Section 4.6);

7: Generate required text restrictions (Section 4.7);8: </schema>

In the first step (line 2), we generate Schematron pat-terns for XML elements, which are allowed inside validXML documents as root elements. Similarly, we generatepatterns for allowed names of XML elements and XMLattributes in the next step on lines 3. On line 4, we pro-duce patterns for allowed contexts, i.e. paths where cer-tain names of elements (and attributes) may occur. Wecall the generated patterns absorbing patterns and we de-scribe them later. The patterns for validation of requiredcomplex element structures are produced in the steps onlines 5 and 6. These patterns are more complex, becausewe must generate an equivalent of regular expressions toobtain the semantics of regular grammars. We call thesepatterns conditional patterns and we describe them laterin this section. In the last step (line 7), the patterns for textrestrictions, i.e. validation of attribute values and simpleelement contents, are produced.

4.2 Allowed root element names

We need a tool for reporting names of elements which arenot allowed in the schema, but are present in the docu-ment.

Definition 3 An absorbing pattern is a Schematron pat-tern for an ordered set of paths P , where the last rule iscalled global and it contains the * wildcard somewhere inits path and no previous rules use wildcards.

In this definition we defined a special kind of aSchematron pattern which we call an absorbing patternand which allows Schematron to absorb elements (or at-tributes) specified by paths. It checks for all the allowedelements or attributes in the path and if none of them isfound, it matches whatever is found in the path using awildcard (absorbs it), so that the validation can continue.If the element or attribute is absorbed by the wildcard, itis a violation of the expected format and the element orattribute absorbed by the wildcard rule is reported. In thefirst step on line 2 in the overall Algorithm 1, we generate

an absorbing pattern for checking allowed root elementsan the intuitive way described here.

Example 3 As an example, consider the set of pathsP , which contains paths for all allowed root elements/request and /response. We generate the absorb-ing pattern that is in Figure 3. When the validated docu-

<pattern id="allowed-root-elements"role="absorbing-pattern">

<rule context="/request"><assert test="true()"/>

</rule><rule context="/response"><assert test="true()"/>

</rule><rule context="/*"><assert test="false()">The element ’<name/>’ isnot declared in the schemaas a root element.

</assert></rule>

</pattern>

Figure 3: Absorbing pattern example

ment has request or response element as a root, theelement is absorbed and the validation continues. Whenthe document has for example an x root element, it is ab-sorbed by the wildcard part of the pattern, which meansthat the document is invalid and the following message isreported: The element ’x’ is not declared in the schemaas a root element. However, the validation still continues,which is in contrast to XSD validation, which would endat this point.

4.3 Allowed names

The approach is very similar to generation of allowed rootelement names, because we also generate absorbing pat-terns. Patterns for allowed attributes are also similar, sowe skip them in this paper. Production of patterns forchecking allowed XML elements inside validated docu-ments follows this algorithm: We produce the set P of allpaths for allowed element names. For complex elements,we get them from the names of named associations whichhave classes as children in the PSM schema. For simpleelements, we get the names from PSM attributes A′ withXML form set to xform(A′) = e. From the paths andnames, an absorbing pattern is generated.

4.4 Allowed contexts

Now we introduce stricter patterns for checking allowedcontexts, i.e. paths inside documents. We also generateabsorbing patterns, but we need more sophisticated paths,because we absorb only element and attribute names inthe declared contexts, so the other names (contexts) breakvalidity.

Paths overview A path is described using an XPath ex-pression to select some nodes from the validated XMLdocument. When nodes are selected, we can evaluate as-sertions, i.e. certain XPath predicates in the context ofthese nodes. In general, we have two approaches to howwe can describe paths, i.e. absolute paths, for example/book/author/name or relative paths for examplename. If we want to design schemas more powerful thanDTD, i.e. local regular tree grammars (Murata et al. 2005),we need absolute paths to select nodes from documents.However, relative paths are also important for example to


35

design recursive declarations. There is also a possibility touse predicates in paths. We do not deal with predicates fornodes selection, because we aim to design a Schematronschema as simple as possible. For this purpose, we im-pose a SORE precondition on our PSM schemas in Def-inition 4. Every SORE is deterministic as required bythe XML specification and more than 99% of the reg-ular expressions in practical schemas are SOREs (Bexet al. 2006), so the precondition does not limit us muchand at the same time simplifies the translation a lot. Forinstance, ((a|b),c 0..*,d 0..1) 0..3 is SOREwhile a(a|b) 0..* is not as a occurs twice.

Definition 4 Let S′ be a PSM schema. We will call SOREprecondition an assumption on S′, that every complex ele-ment has content described using Single Occurrence Reg-ular Expression, i.e. every element (or attribute) name canoccur at most once in this regular expression.

Paths construction Here we describe the constructionof paths for a PSM schema. The main idea is as fol-lows. For each XML element and XML attribute decla-ration present in a PSM schema, we produce all possiblepaths (contexts) where they can occur. Every created pathis associated with a PSM component, i.e. a complex ele-ment, a simple element or an attribute declaration and thepairs are placed into the global set of pathsGp. In the nextstep, we perform sorting of Gp members. The resultingordered setGp = {(X ′, p);X ′ ∈ (S′

r∪S′a) and p is path}

is used for generation of Schematron rules in the order ofthis set in the rest of the translation. We sort members ofGp using the following ordering: (1) The absolute pathswithout recursions go first (2) The absolute paths with re-cursions follow, the longest path is the first one (3) Therelative paths go last, again, the longest path is the firstone to go.

Firstly, we need to create all paths for a given XML el-ement or XML attribute declaration. Let us mark the dec-laration – the given PSM component X ′ ∈ (S′

a ∪ S′r). We

build an ancestor tree for X ′ which represents all achiev-able ancestor PSM components ofX ′ in the PSM schema.Then we can translate all its paths from leaf nodes to rootnode into Schematron paths, i.e. XPath expressions. Foreach (X ′, p) ∈ Gp must hold that p is unique, which cor-responds to the SORE precondition in Definition 4.

Pattern for allowed element contexts Now we can pro-duce patterns for allowed contexts. We go through allmembers of the ordered set Gp and produce set of pathsP only for complex element names and simple elementnames (PSM attributes with XML form set to element).In the last step we produce an absorbing pattern for P with*. Similarly, we produce a pattern for allowed attributecontexts.

4.5 Required structural constraints

Now we have absorbing patterns for weak validation ofXML documents generated. These patterns say what isallowed inside the documents. Now we deal with restric-tions which say what the given document must satisfy.

Conditional pattern First of all, we specify anotherSchematron pattern, which we call conditional pattern (seeDefinition 5).

Definition 5 A conditional pattern is a Schematron pat-tern for a set of pairs E = {(p,A); where p is a path andA is a set of predicates}. It consists of several rules andthe document passes validation by this pattern only if allthe rules are satisfied.

Example 4 Consider a PSM schema where the root el-ement customer must have a ship-to child ele-ment which must have a street child element and thestreet must not have any child elements. We generate aconditional pattern, which is in Figure 4. The pattern re-

<pattern role="conditional-pattern"><rule context="/customer"><assert test="count(ship-to)=1"/>

</rule><rule context="/customer/ship-to"><assert test="count(street)=1"/>

</rule><rulecontext="/customer/ship-to/street"><assert test="count(*)=0"/>

</rule></pattern>

Figure 4: Conditional pattern example

sembles a collection of if-then conditions, because it says:If the customer element exists in the document as a root,it must hold that it has a ship-to child element. If theship-to element exists in the document as a child of thecustomer element, it must hold that it has a streetchild element, etc. The example demonstrates, that we cancreate such a pattern with chained rules. If we wanted todescribe this pattern using XML Schema, it would be:

<schema><element name="customer"><complexType><sequence><element name="ship-to"><complexType><sequence><element name="street"/>

</sequence></complexType>

</element></sequence></complexType>

</element></schema>

For the production of conditional patterns, we need toanalyze specifications of complex element contents. Thecomplex element declared in a PSM schema is preciselyspecified using a regular expression, so we need to analyzesuch regular expressions and translate them into Schema-tron predicates. The main idea is as follows. We translatethe regular expression into several conditional patterns.These patterns cover the same semantics as the regular ex-pression, when they are evaluated together.

We generate two conditional patterns for checkingstructural constraints as a part of Algorithm 1 in thestep on line 5. One of the generated patterns checks re-quired parent-child relationships, the other pattern checksrequired parent-attribute relationships and other relation-ships of attributes and elements (predicates for choicesamong attributes and elements). There can be also otherdistributions of conditions into patterns. For example, ev-erything may be inside one pattern, but we believe thatour distribution provides flexible solution, because we cancheck required elements only, required attributes only, etc.In the algorithm, we use translations of regular expres-sions into boolean expressions and then normalization ofboolean expressions into conjunctive normal form (CNF).Due to lack of space and because these are quite standardtransformations, we do not describe them here in detail.


36

Purchase

item1..*

ItemTester

@tester

ItemAmount

amount : int

price : double

|

Item

@code : ID

<pattern role="conditional-pattern">

<rule context="/purchase/item">

<assert test="@code and ((amount and price and

count(@tester)=0) or (count(amount|price)=0 and @tester))"/>

</rule>

</pattern><pattern role="conditional-pattern">

<rule context="/purchase/item">

<assert test="@code"/>

<assert test="amount or @tester"/>

<assert test="price or @tester"/>

<assert test="count(@tester)=0 or count(amount|price)=0"/>

<assert test="price or count(amount|price)=0"/>

<assert test="amount or count(amount|price)=0"/>

</rule>

</pattern>

(a)

(b)

(c)

Figure 5: Example of element item

Algorithm 2 Generate patterns for structural constrains1: Let E1 be an empty set of pairs (p,Ae);2: Let E2 be an empty set of pairs (p,Aa);3: for all (X ′, p) ∈ Gp do4: if X ′ ∈ S′

r then5: Let Ae be an empty set of predicates;6: Let Aa be an empty set of predicates;7: for all Y ∈ cnf (be ′(X ′)) do8: if Y has only elements in its literals then9: Add Y into Ae;

10: else11: Add Y into Aa;12: end if13: end for14: if Ae is not empty then15: Add (p,Ae) into E1;16: end if17: if Aa is not empty then18: Add (p,Aa) into E2;19: end if20: end if21: end for22: Generate conditional pattern for E1;23: Generate conditional pattern for E2;

Let us now take a look at Algorithm 2 for productionof patterns for structural constraints based on boolean ex-pressions. Firstly, we initialize two empty sets of pairs(p,Ae) (line 1) and (p,Aa) (line 2), where p is a path andAe is a set of associated predicates for elements, Aa isa set of associated predicates for attributes and relationswith elements. Then, we go through pairs (X ′, p) ∈ Gp

and when X ′ is a complex element declaration, we initial-ize new sets Ae and Aa (lines 5 and 6) and translate X ′

into boolean expression and the boolean expression intoconjunctive normal form (line 7). Then, we go throughobtained predicates. When a predicate (marked as Y ) hasonly elements in its literals we add it into Ae, else we addit into Aa. Then, if Ae or Aa is not empty, we add a pair(p,Ae) or (p,Aa) into E1 (line 15), or E2 (line 18), re-spectively. In the last step (lines 22 and 23), we generateconditional patterns for E1, E2. We have created two pat-terns, which cover certain structural constrains of modeledcomplex contents.

Example 5 As an example, consider a regular expressionwhich specifies the complex element item in Fig-ure 5(a): (@code,(amount,price)|@tester).We translate it into @code and ((amountand price and count(@tester)=0) or

(count(amount|price)=0 and @tester)),which is an XPath predicate that we use in Schematronassertion in Figure 5(b). This representation is quitestraightforward and corresponds well with grammar-based languages like XML Schema. However, it alsocomes with disadvantages in the form of poor diagnostics.As with XML Schema validation, when we would validatea document using the Schematron rule from Figure 5(b),we would only get a valid or invalid statement withoutfurther details. For this purpose, it is more advantageousto go into more detail and write the same rule as multiplesimpler rules. We transform the regular expression, whichis a logic formula, to a conjunctive normal form as seen inFigure 5(c). Now, we can create user-friendly diagnosticsfor each of these rules. Note that this is also an exampleof choice between attributes, which is not possible in XMLSchema but can be done using Schematron.

4.6 Required sibling relationships

In the previous section we generated structural constraintsusing boolean expressions, which allow to validate parent-child relationships. So far we did not deal with the orderof child elements inside a parent element. Here we de-scribe our approach based on the theory of regular expres-sions. The main idea is as follows. We build a finite stateautomaton for a given regular expression. We deal onlywith SORE so we can build the deterministic SORE au-tomaton, where every name of XML element is assignedto at most one inner state and it has one initial and onefinal state. Then we translate information obtained fromthis structure into Schematron conditions. We representthe transition function of the automaton using conditionalpatterns and we cover for example the order of XML ele-ments (sequences, choices among elements) and also car-dinalities zero or one (0..1, or ?), just one (1..1), zero ormore (0..*, or Kleene star *), one or more (1..*, or Kleenecross +). We can also provide clear natural-assertions anddiagnostics.

There are also some problems and exceptions. Firstly,we can not cover arbitrary numeric intervals of regular ex-pressions using this approach (it is possible to create anautomaton with numeric intervals, but it is not possible torepresent it in Schematron). We need another approachfor numeric constraints in general, which is not part ofthis paper. Secondly, a PSM content model SET compli-cates construction of the algorithm. The restriction (Defi-nition 6) for content model SET is similar to restriction ofXSD construct ALL.

Definition 6 Let S′ be a PSM schema. We introduceSET precondition, which is an assumption on S′, that for


37

each content model SET it must hold that it has namedassociations with classes as children in its content andthe content model is descendant of associations R′ ∈S′r, (name ′(R′) = λ ∨ child ′(R′) /∈ S′

c) in the complexcontent, where card ′(R′) = 0..1 or card ′(R′) = 1..1.

Now we can presume that we can build the automatonfor each complex element declaration, i.e. named associ-ation with class as a child. We also need to translate ob-tained information into Schematron rules. For each com-plex element and for elements in its content we producea set of predicates. These predicates are composed fromfollowing-sibling XPath axes. For each of the ob-tained predicates we generate a conditional pattern in thestep on line 6 in Algorithm 1.

Example 6 Consider content (title?,name,(phone|e-mail)+). We can represent this regularexpression using the SORE automaton in Figure 6(a).Then we generate Schematron rules (see Figure 6(b))which represent the automaton in Schematron. Notethat we use F := following-sibling substitutionfor code size reduction in this example. The first rulerepresents the initial state of the automaton and says whatelements can be at the first position in the content. Otherrules represent if-then conditions, i.e. if title elementexists, it has a name follower. If name element exists,it has a phone or an e-mail followers. If phoneelement exists, it has the phone element or the e-mailelement followers or no following-sibling elements.

4.7 Required text restrictions

In this section we show the final patterns of our proposedtranslation from a PSM schema of our conceptual modelfor XML to Schematron. They deal with validation ofdata types for simple element contents and attribute val-ues. The supported set of data types for PSM attributesis implementation dependent, as we need the datatypes tobe defined by the designer in Schematron. Our implemen-tation, eXolutio, provides XSD built-in simple data typesand we created the rules for their definition in Schema-tron, because Schematron does not provide built-in datatypes as default. We can, however, create many datatypes specifications using XPath expressions placed intoabstract rules or patterns. Schematron over XPath 1.0,which we describe here, is worse than XSD in this prac-tical aspect and we need to help ourselves by dependingon the designer to define the used datatypes. Examples ofthese definitions are in Figures 7 - 11.

Definition 7 Let S ′ be a PSM schema. We will call datatypes precondition an assumption on S ′, that each datatype used in S ′ has corresponding declaration in Schema-tron provided by the designer or by the eXolutio tool.

In the step on line 7 of Algorithm 1 we generate pat-terns for data types validation as extension rules of ourpredefined data type rules using the <extend/> elementin the <rule/> element.

4.8 Translation summary

In this section, we introduced the problem of automaticconstruction of Schematron schemas from PSM schemas.The translation is not simple, because we have differ-ent models - grammar-based PSM schema (and XMLSchema, DTD, etc.) and the rule-based Schematron.However, we showed that Schematron is a very powerfullanguage and it can express many grammatical structuralconstrains from the grammar-based languages and more.

<rule id="emptyString"abstract="true">

<assert test="string-length(normalize-space(text()))=0"/>

</rule>

Figure 7: Empty string data type

<rule id="string" abstract="true"><let name="str"

value="string(text())"/><assert test="$str"/>

</rule>

Figure 8: String data type

<rule id="normalizedString"abstract="true">

<extends rule="string"/><let name="nstr"

value="normalize-space($str)"/><assert

test="string-length($str)=string-length($nstr)"/>

</rule>

Figure 9: Normalized string data type

<rule id="boolean" abstract="true"><extends rule="string"/><assert test="$str=’true’

or $str=’false’"/></rule>

Figure 10: Boolean data type

We started with production of absorbing patterns,which allow to validate allowed occurrences of XML el-ements and XML attributes inside validated XML docu-ments. Then we produced conditional patterns for vali-dation of required grammatical structural constraints. Weanalyzed the most used parts of regular expressions whichcan be represented in Schematron. Then we generated pat-terns for validation of data types for simple element con-tents and attribute values.

There are some limitations to our approach that, how-ever, do not seem critical at the moment. The most visibleone is the lack of support for arbitrary numeric intervals incardinalities. We only support the usual 0..*, 0..1,1..*, 1..1. This is because the support for arbitraryintervals would necessarily lead to Schematron code ex-plosions which would only complicate and slow down thevalidation process.

5 Related work

In parallel to the research of translation of PSM schemasto Schematron, other PSM schema improvements are alsobeing researched. In particular the support for ObjectConstraint Language (OCL) (Object Constraint LanguageSpecification, version 2.0 2009) and its translation toSchematron for the specification of integrity constraints,where Schematron is used as a complement of grammar-based schemas. These patterns for integrity constraintsgenerated from OCL may be potentially merged with ourSchematron schemas. To our best knowledge, little workhas been done in the area of translations between Schema-tron and other XML schema languages. There are sourcesnot based on academic research which provide some ba-sic ideas and techniques for translation of grammar-basedschemas to Schematron schemas and vice versa. Mostwork in this area has been done by Rick Jelliffe and his


38

<rule context="context">

<assert test="*[1][self::title or

self::name]"/>

</rule>

<rule context="context/title">

<assert test="F::*[1][self::name]"/>

</rule>

<rule context="context/name">

<assert test="F::*[1][self::phone or

self::e-mail]"/>

</rule>

<rule context="context/phone">


self::e-mail] or not(F::*)"/>

</rule>

<rule context="context/e-mail">


self::e-mail] or not(F::*)"/>

</rule>

(a) (b)

Figure 6: Example of an automaton in Schematron

<rule id="double" abstract="true"><let name="num" value="number

(normalize-space(text()))"/><assert test="$num"/>

</rule>

Figure 11: Double data type

company Topologi5. They have implemented an XSDto Schematron converter6, because their customers pre-ferred Schematron diagnostics over XSD validation. Thegenerated schemas are called Schematron-ish grammars.In (Necasky, Mlynkova, Klımek & Maly 2012), we pro-vide formal description of mutual translation betweenPSM schemas and regular tree grammars.

6 Evaluation and implementation

Figure 12: PSM schema in eXolutio5http://www.topologi.com/6http://www.schematron.com/resource/

XSD2SCH-2010-03-11.zip

With our proposed method, we have generated severalSchematron schemas in various data domains using ourconceptual model. The schemas are verbose and cannot beshown here whole due to space limitations. Their structureis, however, shown in our examples throughout the paper.During our experiments, we found the Schematron basedvalidation as easy to use from a domain expert’s perspec-tive as a validation using XML Schema would be giventhat both can be generated from our conceptual model forXML. The downside of Schematron mentioned in our mo-tivation, which is its verbosity, is not a problem in the endbecause the user does not need to read the actual gener-ated Schematron. He only needs to give it as an inputto a Schematron validator. From the validation perfor-mance point of view, rule-based validation (e.g. Schema-tron) is computationally more expensive than the linearvalidation using grammar-based languages (such as XMLSchema) (Nalevka 2010). This could be a problem in anenvironment that requires high performance validations,such as routing of XML messages. Nevertheless, whenperformance is not an issue or when validating againstcomplex XML formats, the benefits in form of better di-agnostics are more important.

The reward for using our approach is much easier di-agnostics of a possible problem in the validated XML doc-ument because as we described in this paper, Schema-tron supports user-friendly and descriptive error messages.Also, its expressive power is greater than that of XMLSchema, which can be seen in Figure 5, where we use achoice between attributes, which is not possible to expressin XML Schema. Our experiments were done using ourimplementation of the conceptual model for XML, eXo-lutio.

eXolutio is an application developed in our researchgroup. Its base is the formalism for our conceptual modelfor XML described in (Necasky, Mlynkova, Klımek &Maly 2012) and a complex system of operations and theirpropagation between the levels of abstraction describedin (Necasky, Klımek, Maly & Mlynkova 2012). In ad-dition, it is a platform where novel extensions to XMLschema modeling and evolution are implemented. One ofthem is the approach described in this paper. In Figure 12there is a PSM schema modeled in eXolutio and in Fig-ure 13 there is the generated Schematron schema.


39

http://www.topologi.com/

http://www.schematron.com/resource/XSD2SCH-2010-03-11.zip

http://www.schematron.com/resource/XSD2SCH-2010-03-11.zip

Figure 13: Schematron schema in eXolutio

7 Conclusions

In this paper, we briefly introduced our conceptual modelfor XML as a basis for modeling and maintenance of XMLschemas independent of the target schema language. Thenwe introduced Schematron, a rule-based language that canbe used for XML schema description, and its constructs.Next, we described how a schema from our conceptualmodel can be translated to Schematron and described theadvantages over grammar-based languages such as XMLSchema. We have evaluated our approach and describedits implementation in our tool, eXolutio.

References

Benda, S., Zamecnık, B., Cicko, M., Sobotka, P., Kroupa,T. & Necasky, M. (2011), SchemaTron - Native C#validator of ISO Schematron language.URL: http://www.assembla.com/spaces/xrouter/documents/b5mJ6o1cyr4k0TeJe4gwI3/download?filename=XRouter-docs-en-05-SchemaTron.pdf

Bex, G. J., Neven, F., Schwentick, T. & Tuyls, K. (2006),Inference of Concise DTDs from XML Data, ACM.

Jelliffe, R. (2001), The Schematron – An XML StructureValidation Language using Patterns in Trees, ISO/IEC19757.URL: http://xml.ascc.net/resource/schematron/

Jelliffe, R. (2007), Converting XML Schemas to Schema-tron.URL: http://www.oreillynet.com/xml/blog/2007/09/converting_xml_schemas_to_sche.html

Klımek, J. & Necasky, M. (2010), Integration and Evolu-tion of XML Data via Common Data Model, in ‘Pro-ceedings of the 2010 EDBT/ICDT Workshops, Lau-sanne, Switzerland, March 22-26, 2010’, ACM, NewYork, NY, USA.

Klımek, J. & Necasky, M. (2011), Generating Lower-ing and Lifting Schema Mappings for Semantic WebServices, in ‘25th IEEE International Conference onAdvanced Information Networking and ApplicationsWorkshops, WAINA 2010, Biopolis, Singapore, 22-25March 2011’, IEEE Computer Society.

Kwong, A. & Gertz, M. (2006), On Tree Pattern Con-straints for XML Documents.URL: http://dbs.ifi.uni-heidelberg.de/fileadmin/publications/2003/on_tree_pattern.pdf

Lee, D. & Chu, W. W. (2000), Comparative Analysis ofSix XML Schema Languages, ACM.

Maly, J. & Necasky, M. (2012), Utilizing new capabilitiesof XML languages to verify integrity constraints, in‘Proceedings of Balisage: The Markup Conference2012’, Vol. 8.URL: http://www.balisage.net/Proceedings/vol8/html/Maly01/BalisageVol8-Maly01.html

Miller, J. & Mukerji, J. (2003), MDA Guide Version 1.0.1,Object Management Group.URL: http://www.omg.org/docs/omg/03-06-01.pdf

Murata, M., Lee, D., Mani, M. & Kawaguchi, K. (2005),Taxonomy of XML Schema Languages Using FormalLanguage Theory.URL: http://www.cobase.cs.ucla.edu/tech-docs/dongwon/mura0619.pdf

Nalevka, P. (2010), Grammar vs. Rules.URL: http://petrnalevka.blogspot.com/2010/05/grammar-vs-rules.html

Necasky, M., Klımek, J. & Maly, J. (2011), When the-ory meets practice: A case report on conceptual model-ing for XML, in ‘2011 Sixth International Conferenceon Digital Information Management (ICDIM)’, IEEE,pp. 242–251.

Necasky, M., Klımek, J., Maly, J. & Mlynkova, I. (2012),‘Evolution and Change Management of XML-basedSystems’, Journal of Systems and Software 85(3), 683– 707.

Necasky, M., Mlynkova, I. & Klımek, J. (2011), Model-Driven Approach to XML Schema Evolution, in ‘Onthe Move to Meaningful Internet Systems: OTM 2011Workshops’, Vol. 7046, Springer, pp. 514–523.

Necasky, M., Mlynkova, I., Klımek, J. & Maly, J. (2012),‘When conceptual model meets grammar: A dual ap-proach to XML data modeling’, Data & Knowledge En-gineering 72(0), 1 – 30.

Object Constraint Language Specification, version 2.0(2009), OMG.URL: http://www.omg.org/technology/documents/formal/ocl.htm

Ogbuji, U. (2004), A hands-on introduction to Schema-tron, IBM.

Vlist, E. (June 2002), XML Schema The W3C’s Object-Oriented Descriptions for XML, O’Reilly Media.


40

http://www.assembla.com/spaces/xrouter/documents/b5mJ6o1cyr4k0TeJe4gwI3/download?filename=XRouter-docs-en-05-SchemaTron.pdf





http://xml.ascc.net/resource/schematron/

http://xml.ascc.net/resource/schematron/

http://www.oreillynet.com/xml/blog/2007/09/converting_xml_schemas_to_sche.html



http://dbs.ifi.uni-heidelberg.de/fileadmin/publications/2003/on_tree_pattern.pdf



http://www.balisage.net/Proceedings/vol8/html/Maly01/BalisageVol8-Maly01.html



http://www.omg.org/docs/omg/03-06-01.pdf

http://www.omg.org/docs/omg/03-06-01.pdf

http://www.cobase.cs.ucla.edu/tech-docs/dongwon/mura0619.pdf

http://www.cobase.cs.ucla.edu/tech-docs/dongwon/mura0619.pdf

http://petrnalevka.blogspot.com/2010/05/grammar-vs-rules.html

http://petrnalevka.blogspot.com/2010/05/grammar-vs-rules.html

http://www.omg.org/technology/documents/formal/ocl.htm

http://www.omg.org/technology/documents/formal/ocl.htm