XML with C#

XML Types for C# Erik Meijer, CLR Wolfram Schulte, MSR

Abstract

This paper proposes a language extension that adds native support for XML to the type system of C#. In our approach XML documents or document fragments become first class citizens that benefit from the full range of features available in a modern programming language like C#. XML elements can be constructed, loaded, passed, transformed, updated, and written in a true type-safe manner.

To our knowledge, no other approach to XML provides for type safety and syntactic integration of this sort. Existing approaches are either completely untyped (some form of string processing) or rely on schema-to-class translation patterns that have many limitations. Our approach is a true semantic integration via the type system. One of the consequences of type integration is that many runtime checks may be moved to compile time. This has many benefits for performance and program correctness.

The type system extension is based on XML Schemas and incorporates operational aspects from XPath, XSLT and XQuery. The implementation uses the existing capabilities of the System.XML namespace in Microsoft’s Common Language Infrastructure to provide its functionality.

In this document, we use the term C#-xml to mean the combination of C# and the proposed extensions. We present these extensions through a series of worked examples.

Although we restrict our attention to XML and C# in this paper, our approach can be applied to integrate any standard object-oriented programming language (C#, Visual Basic, Java, etc.) with any data-structuring language (XML, SQL, etc.).

1 Introduction

XML has become the lingua franca for data exchange over the Internet [Box, Skonnard, Lam]. As an open industry standard, XML lets developers describe the data exchange between different devices, applications or businesses. XML is used in database applications as the language to describe the structure of the database and its views as well as for data access. XML is also used for dynamic layout of web pages. But what is so special about XML? XML schemas can describe typed content. Typing helps guarantee the interoperability of applications.

Currently, there is a plethora of special purpose XML processing languages. [XSLT] is for transforming well formed but untyped documents. [XDuce] and [XMLambda] are typed alternatives, but they are mainly experiments in language design; their use is restricted to a subset of [XML 1.0]. [XPath] and [XQuery] are languages for query processing. XPath is untyped; XQuery however is typed. Yet all of these languages have the same problems:

• they all come with their own syntax;
• they have no or only restricted type systems;
• they have no module system, no separate compilation;
• they have only a very restricted library.

Furthermore, as soon as you want to compute something beyond the special purpose for which they are developed, you reach their limits. Many of them have ended up re-implementing features that are already in programming languages, but do so only partially and in an ad-hoc manner. Table 1 compares the different languages and their feature sets.

So why not use our favorite object-oriented programming language to support XML processing? The story so far is disappointing: One of the main reasons is that the type systems of XML and class based object-oriented languages don’t match very well. In [C#], you have subtyping based on named relationships, while the type system of XML schemas is based on structural equivalence, named relationships and range restrictions. As a remedy for this problem two approaches have been taken to integrate XML into programming languages (see also Table 1):

• either one uses XML in an untyped way
• or else one translates XML types into classes.

A typical example of the former is the support for XML in the Common Language Infrastructure [CLI], where XML processing is supported in the form of libraries. This is extremely flexible; whenever new functionality is needed a library is added. However, processing XML is not type safe. Furthermore it is not efficient because one always deals with untyped data. This requires later validation to be on the safe side.

The alternative, also supported in the CLI distribution, is to use a schema compiler that can either map schemas to classes or classes to schemas. As long as only very primitive forms of schemas or classes are used, the mapping works; however in most cases the results are disappointing. [Box] for example noted that many XML constructs can’t be mapped easily onto programmatic types: choice, derivation by restriction, mixed content, or “nillable” elements of value types. Likewise some programming language constructs don’t map easily onto typed XML; for instance there is no canonical way to map arrays or typed references onto XML. That’s one of the reasons for the complexity of the SOAP specification.

Language/Feature  Purpose                  Paradigm/Syntax                      Typesystem        Technology            Restrictions
----------------  -----------------------  -----------------------------------  ----------------  --------------------  ----------------------------
XML Schema        Type Declarations        Data defs/XML                        XML               Validating Parser     Sublanguage
XPath             Projection               Functional/Directory Paths           None              Interpreter/Compiler  Sublanguage
XQuery            Query Language           Functional/Own                       XML               Prototype             Sublanguage
XSLT              Transformation Language  Functional/XML                       None              Interpreter/Compiler  Sublanguage
XDuce             Explore DTD Typesystem   Functional/SML                       Monomorphic DTDs  Interpreter/Compiler  Experimental
XMLambda          Explore DTD Typesystem   Data defs; Functional/Haskell + XML  Polymorphic DTDs  Not implemented       Experimental
System.XML        XML support for C#       Imperative/C#                        C#                Library               XML processing untyped
XSD Compiler      XML support for C#       Data defs; Imperative/C#             C#                Compiler              C# and XML types don’t match
C#-xml            XML support for C#       Imperative/C# + XML + XPath          C# + XML          Precompiler           None

Table 1: Different Languages and Approaches to Use and Integrate XML

We solve these problems by integrating the type system of XML as a first class citizen in the programming language. We use C# in this paper, but a similar approach would work for other languages. We call this C#-xml. We add XML Schemas as types, and XML document fragments as literals. The proposed type system guarantees that

• valid document fragments are well-typed expressions;
• invalid fragments are ill-typed expressions.

We also integrate ideas from XSLT and XPath for a limited form of pattern matching, and from XQuery for a limited form of set based operations. For interoperability we provide mappings between the CLI and schema types. The resulting language is an extension of C#. A pre-compiler translates C#-xml into C#.

We assume that the reader is familiar with basic C# concepts and its syntax and has a rough understanding of using XML. XML schemas are nicely explained in [XML Schema Part0].

The paper is organized as follows. Section 2 discusses the mapping of C# types to XML schemas and vice versa. Section 3 shows how we build dynamic documents. Section 4 discusses projection and selective update. Section 5 presents C#-xml’s support of iteration, here used for query processing and stream processing. Section 6 discusses C#-xml’s provisions for dealing with well formed but not well typed documents. Section 7 concludes. The Appendices are not yet written. (However by the time this document might be read they should be on the Web.) Appendix A gives an example application. Appendix B defines the grammar for the extension; Appendix C describes part of the type system. Appendix D gives the signatures for the new CLI functionality.

2 XML Schemas and CLI Types

The main challenge of the integration of XML into C# is to engineer a bridge between both type systems. In the following we assume that the reader knows the type system of C# and knows a little bit about XML and the former [XML 1.0] document definitions.

XML Schemas - An Overview

The purpose of an XML schema is to define a class of XML documents. XML schemas extend [XML 1.0] document definitions considerably. Among other things schemas provide a uniform syntax, support derivation of document types (similar to inheritance in OO languages), permit all-groups and nested definitions and provide atomic types (similar to built-in types in programming languages). Here is a fragment of a database schema for a collection of books.

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="bib" type="Bib"/>
  <xsd:element name="book" type="Book"/>
  <xsd:complexType name="Bib">
    <xsd:sequence>
      <xsd:element ref="book" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="isbn" use="required" type="xsd:string"/>
    <xsd:attribute name="year" use="optional" type="xsd:string"/>
    <xsd:attribute name="price" use="optional" type="xsd:float"/>
  </xsd:complexType>
</xsd:schema>

The schema declares two elements bib and book having type Bib and Book respectively. The bib element declaration describes an XML document tree with root <bib> .. </bib> whose children consist of a list of book elements. Likewise a book can be a root element. Each book has a mandatory isbn, optional attributes for year and price, followed by a title, followed by a non-empty list of authors. Note that the content of each element and attribute is typed. An example document that conforms to this schema is the following:

<?xml version="1.0" encoding="UTF-8"?>
<bib xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="file:///C:/XML/XML/Bib.xsd">
  <book year="2000" isbn="0-201-70914-7">
    <title>Essential XML</title>
    <author>Box</author>
    <author>Skonnard</author>
    <author>Lam</author>
  </book>
  <book isbn="0-201-17888-5" price="22.50" year="1999">
    <title>Component Software</title>
    <author>Szyperski</author>
  </book>
</bib>

The value of this document is a node-labeled tree (in XML parlance: an abstract information set [Infoset]). The root of the instance document is a document node; it defines the character encoding and version information. In our example it contains the bib element as its only child. The node for the bib element contains the namespace node, a node for the attribute schema location and two element subtrees. The book nodes contain attribute nodes, and title and author subtrees.
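For comparison with the typed approach proposed here, the following sketch shows the untyped library route: the document is parsed first and checked against the schema afterwards, at run time. It assumes the .NET LINQ to XML library (System.Xml.Linq) and XmlSchemaSet, neither of which is part of C#-xml; the class name BibValidation and the inlined, unqualified schema text are illustrative choices only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class BibValidation
{
    // Inlined copy of the bib schema from the text (single quotes for convenience).
    public const string Xsd = @"<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
  <xsd:element name='bib' type='Bib'/>
  <xsd:element name='book' type='Book'/>
  <xsd:complexType name='Bib'>
    <xsd:sequence>
      <xsd:element ref='book' minOccurs='0' maxOccurs='unbounded'/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name='Book'>
    <xsd:sequence>
      <xsd:element name='title' type='xsd:string'/>
      <xsd:element name='author' type='xsd:string' maxOccurs='unbounded'/>
    </xsd:sequence>
    <xsd:attribute name='isbn' use='required' type='xsd:string'/>
    <xsd:attribute name='year' use='optional' type='xsd:string'/>
    <xsd:attribute name='price' use='optional' type='xsd:float'/>
  </xsd:complexType>
</xsd:schema>";

    // Parses the document text and returns the schema errors found at run time.
    public static List<string> Validate(string xml)
    {
        var schemas = new XmlSchemaSet();
        schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));

        var errors = new List<string>();
        XDocument.Parse(xml).Validate(schemas, (sender, e) =>
        {
            // Collect errors only; warnings are ignored in this sketch.
            if (e.Severity == XmlSeverityType.Error) errors.Add(e.Message);
        });
        return errors;
    }

    public static void Main()
    {
        string good = "<bib><book isbn='0-201-17888-5'><title>Component Software</title>"
                    + "<author>Szyperski</author></book></bib>";
        string bad = "<bib><book><title>No isbn and no author</title></book></bib>";

        Console.WriteLine("errors in good document: " + Validate(good).Count);
        foreach (string message in Validate(bad))
            Console.WriteLine("bad document: " + message);
    }
}
```

Both documents parse successfully; the missing isbn attribute and author element of the second one surface only when Validate runs, which is the kind of late check that a typed integration aims to move to compile time.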

Our aim is to integrate these XML schema types and conforming documents into C#. To do so let us analyze schemas more carefully.

Simple Types

A simple datatype, as defined in [XML Schema Part 1], is either primitive (e.g., xs:string, xs:boolean, xs:float, xs:double, xs:ID, xs:IDREF) or it is derived from another simple type by specifying a set of facets (e.g., xs:language, xs:NMTOKEN, xs:long, etc., or user defined). Facets are constraints like minimum or maximum values of numbers or regular expressions for strings. A type hierarchy is induced between simple types by subset ordering on the value spaces of the corresponding types. For instance, the value space of the XML type string is the set of finite-length sequences of characters. The value space of normalizedString is the set of strings that do not contain the carriage return, line feed or tab characters. Thus in XML every value of the type normalizedString is a string but not vice versa; normalizedString is a (proper) subtype of string. Note that the value spaces of simple types may overlap; a simple value may be an instance of more than one schema simple type. (We defer the discussion of lists and unions of simple datatypes until we discuss the complex types.)

All of C#’s built-in types like bool, int or string have a corresponding XML simple type. For other simple types, like normalizedString, C#-xml provides corresponding structures in the namespace System.XML.Types. Constructors for these structures take C# built-in types and check whether their range restriction is met. Here is the C# struct for XML’s normalizedString.

namespace System.XML.Types {
    public struct NormalizedString {
        private string Value;

        // Rejects strings that violate the range restriction of normalizedString.
        public NormalizedString(string s) {
            foreach (char c in s)
                if (c == '\r' || c == '\n' || c == '\t')
                    throw new RangeRestrictionException();
            Value = s;
        }

        // Implicit conversion gives access to the embedded string value.
        public static implicit operator string(NormalizedString s) {
            return s.Value;
        }
    }
}

The constructor takes a string, and if the passed string doesn’t contain any carriage returns, line feeds or tabs, assigns it to the field Value. Runtime checking is necessary, since checking whether the constraints are fulfilled is in general not possible at compile-time. To access the embedded value, C#-xml uses implicit conversion operators.

The proposed encoding of simple types in C#-xml does not maintain the subtype order on XML’s simple types; for instance in C#-xml a normalized string is not a subtype of string. However, many of the properties provided by subtypes – for instance, ease of use, flexibility, and type safety – are retained because of the implicit casting that allows normalized strings to be passed to functions that expect a string.

Simple types, for which C#-xml provides corresponding C# structures, are very limited. They have only one field of the base type from which this type derives. Values of simple types never contain object references.

Complex Types

Complex types contain elements and/or attributes (see [XML Schema Part 2]). Complex types build their own type hierarchy. A complex type is either a restriction of a complex base type, an extension of a simple or complex type definition or a restriction of the super type of all complex types called AnyComplexType (see also Section 6). Restriction and extension are based (as in class based languages) on named relationships. The content model for complex types, however, is based on regular tree grammars. Regular tree grammars support structural subtyping.

The content model for C#-xml is adapted from [XML Schema: Formal Description]. To simplify type checking it unifies various constructs that are distinct in XML. For instance mandatory and optional attributes (see for example the attributes isbn and price in our bib schema) are handled as if they were defined as a sequence with an appropriate number of occurrences. However it distinguishes everything that is relevant for checking structural subtyping. In this respect C#-xml follows the design decisions taken by [XQuery Formal Semantics]. The following simplified grammar makes this more precise:

p in primeType ::= qName | `*:*`        name(set)s

t in typeDefn  ::= elem p { g }         global type
                 | attr p { g }
                 | group p { g }

g in groupDefn ::= p                    element, group or attribute reference
                 | t                    local type
                 | all{g1;…;gn}         interleaved product
                 | {g1;…;gn}            heterogeneous sequence g1 followed by … gn
                 | choice{g1;…;gn}      choice, g1 or … gn
                 | mixed g              mixed content of g
                 | g[m-n]               homogeneous sequence of type g

A prime type is either a qualified name, denoting a type with a given URI and local name, or a wildcard. The wildcard `*:*` denotes any name in any namespace. C#-xml allows one to define elements, attributes and (named) groups. We have the following group constructors: The empty sequence (written {}) matches only the empty document; it is an identity for sequence and all. The empty choice (written choice{}) matches no document; it is a unit for choice. An interleaved product all{g1;…;gn} matches all documents which contain values in groups g1 up to gn in arbitrary order. All groups in XML Schema are a specialization of “all” in this type system. In XSD they can consist only of global or local element declarations. The homogeneous sequence type g[m-n] matches a minimum of m values and a maximum of n values. The length of the sequence is unbounded if n = *. In the sequel we will call homogeneous sequences simply sequences.

C#-xml users normally don’t have to write type definitions using this grammar. C#-xml is able to directly import a schema; this is shown in Section 3. However, if a query results in a return type which is not predefined by a schema, then they can denote the type using the above grammar. For example, the C#-xml types for the given Bib schema are as follows:

group Bib {
    elem bib {Book} [0-*];
}
group Book {
    attr isbn {string};
    attr year {int} [0-1];
    attr price {int} [0-1];
    elem title {string};
    elem author {string} [1-*];
}

(This declaration uses the fact that the type {g1;…;gn} can be written as g1;…;gn as long as it is unambiguous.)

XML schemas are much richer than what is provided in C#-xml. We deliberately decided not to include the whole type language of XML in C#-xml. To use more advanced features one has to import the corresponding schema; however the result of every query and of every literal is expressible using the given syntax.

Structural Subtypes

The idea of the subtype relationship is that t is a subtype of t’ if t describes a subset of the possible values described by t’. For instance t is a subtype of {t; t}. We use the symbol <: to denote the subtype relationship of content types. We write t1 <: t2 if t1 is a subtype of t2, for instance t <: {t; t}. The subtype relationship is a partial order, i.e. it is reflexive and transitive. Let t, t1, t2 be types denoting elements. Here are some of the inequalities that hold (AnyType is the supertype of any XML type, see below and Section 6):

choice{} <: t,
t <: AnyType,
t1 <: choice{t1; t2},
t2 <: choice{t1; t2}.

Sequences are covariant; in addition we have the following relationship on bounds: If t <: t’ and m’ <= m and n <= n’ then t[m-n] <: t’[m’-n’].

Finally let’s relate sequences and all groups: If t1 <: t1’ and t2 <: t2’ then also the following relationships hold:

{t1; t2} <: {t1’; t2’},
{t1; t2} <: all{t1’; t2’},
all{t1; t2} <: all{t1’; t2’}.

Elements can also stand in subtype relationship using wildcards. For an elaborate exposition consult [XML Schema: Formal Description].

C#-xml supports structural subtyping of XML types and their values based on these relationships. For instance one can pass a value of type t1 or a value of t2 to a function that accepts values of choice{t1; t2}. However note that t1 and t2 must be simple or complex XML types. XML subtyping rules cannot be applied for ordinary C# structures or classes.

The root of any XML type is AnyType. AnyType inherits from Object. AnyType provides basic functionality for reading and writing XML documents; see Section 5. AnyType also captures “untyped” but well formed documents. Their introduction is deferred until Section 6.
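The rule on bounds stated in this subsection, t[m-n] <: t’[m’-n’] when m’ <= m and n <= n’, can be sketched in plain C# as a check on occurrence ranges. The sketch covers only the bounds and assumes that the element types already stand in the subtype relationship; the names Occurs and IsWithin are hypothetical and not part of C#-xml.

```csharp
using System;

// Occurrence range m..n; Max == null stands for the unbounded upper limit "*".
public readonly struct Occurs
{
    public int Min { get; }
    public int? Max { get; }

    public Occurs(int min, int? max) { Min = min; Max = max; }

    // [m-n] fits inside [m'-n'] when m' <= m and n <= n'.
    public bool IsWithin(Occurs other)
    {
        bool lowerOk = other.Min <= Min;
        bool upperOk = other.Max == null
                       || (Max != null && Max.Value <= other.Max.Value);
        return lowerOk && upperOk;
    }
}

public static class OccursDemo
{
    public static void Main()
    {
        var one = new Occurs(1, 1);          // t[1-1]
        var optional = new Occurs(0, 1);     // t[0-1]
        var many = new Occurs(0, null);      // t[0-*]

        Console.WriteLine(one.IsWithin(optional));   // True
        Console.WriteLine(optional.IsWithin(many));  // True
        Console.WriteLine(many.IsWithin(optional));  // False
    }
}
```

Under this reading a singleton t[1-1] can be passed wherever an optional t[0-1] or a sequence t[0-*] is expected, but not the other way round.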

Datamodel

The value space of a complex type is a set of ordered node-labeled trees, see [XQuery 1.0 and XPath 2.0 Data Model] for more details. Node values also include a concept of node identity. Node identity simplifies the representation of XML reference values, e.g., IDREF, and URI values. As mandated by [XQuery 1.0 and XPath 2.0 Data Model] “two nodes have the same identity if and only if they were created by the same application of a node constructor”. Note that having identities does not mean that our nodes are reference types. Instead they are mathematical values (and not an aggregation of individual memory cells); they never contain null.

In C#-xml the equality of any simple schema is reduced to equality on the underlying base type. Any complex type provides two relations: the Equals method defines node-equality (in C#-xml, it is also provided by the == operator); the ValueEquals method is defined structurally (see below for ordering aspects). In addition our type system supports implicit boxing and unboxing of XML values to objects. Finally, we have implicit conversion operations from homogeneous sequences to arrays and vice versa. But note that in contrast to arrays, homogeneous sequences might be lazy data structures: they might only be populated as elements are accessed.

A document order is defined on all nodes in a document. It corresponds to the order in which the XML document (fragment) is written after expansion of entities. Thus, first the tag is written; then, namespace nodes followed by the attribute nodes, followed by the children and finally, the end tag. The relative order of attributes is implementation dependent. Nodes from different documents are also totally ordered; the order doesn’t change during a program run, however the chosen order is implementation dependent.
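The distinction between node equality and structural equality can be illustrated with the .NET LINQ to XML library, used here only as an analogy. There, nodes are ordinary reference objects (unlike the mathematical values described above), so reference identity stands in for Equals and XNode.DeepEquals stands in for ValueEquals. The class and method names are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class EqualityDemo
{
    // Each call is a separate application of a node constructor.
    public static XElement MakeAuthor()
    {
        return new XElement("author", "Szyperski");
    }

    public static void Main()
    {
        XElement a = MakeAuthor();
        XElement b = MakeAuthor();

        // Same constructor application: same identity.
        Console.WriteLine(object.ReferenceEquals(a, a));  // True
        // Different constructor applications: different identity ...
        Console.WriteLine(object.ReferenceEquals(a, b));  // False
        // ... although the two trees are structurally equal.
        Console.WriteLine(XNode.DeepEquals(a, b));        // True
    }
}
```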

3 Literals and Namespaces

In C#-xml, documents are essentially written as they are in XML. To type check an XML expression one has to provide the schema type. Schemas can be imported. This is usually done via an import statement. Here is our first example:

using System;
using Bib.xsd;

public class Sample {
    static Book book =
        <book isbn = "0-201-17888-5">
            <title>Component Software</title>
            <author>Szyperski</author>
        </book>;

    public static void Main() {
        Console.WriteLine(book);
    }
}

The compiler checks that the program is type correct using the type inference rules from the Formal Schema definition. When executed, the program first builds up an internal representation of the [XQuery 1.0 and XPath 2.0 Data Model]. Next, it prints the following string on the console:

<book isbn = "0-201-17888-5"> <title> Component Software </title> <author> Szyperski</author> </book>

How to select isbn, title and author is explained in Section 4.

Preprocessing

The construction of XML is parameterized by three new C# preprocessor flags; they affect the construction of the data model. If the IgnoreComments flag is true, comment nodes are not preserved in the data model. If the IgnoreProcessingInstructions flag is true, processing-instruction nodes are not preserved in the data model. If the IgnoreWhitespace flag is true, insignificant white space is not preserved. For a definition of the notion of insignificant whitespace see [XQuery 1.0 and XPath 2.0 Data Model].

Dynamic Literals

C#-xml supports the construction of dynamic documents. It uses the XQuery convention whereby an arbitrary C# expression can be embedded inside an element by escaping it with curly braces. The expression must yield values of the required type. The following example uses a parameterized method to compute the same book as the previous example.

static Book book = CreateBook("0-201-17888-5", "Component Software", "Szyperski");

static Book CreateBook(String isbn, String title, String author) {
    return <book isbn={isbn}>
               <title>{title}</title>
               <author>{author}</author>
           </book>;
}

C#-xml extends the XQuery convention by allowing C# blocks inside quotes, too. A block within a quote must yield values with the type demanded by the context. Conceptually, yield statements generate a sequence of values. When the block exits, the concatenated result is the result of the quote.

Let’s look at an example. This time we pass an array of authors to the CreateBook method. CreateBook then has to generate an author element for each of the passed authors.

private static Book CreateBook(String isbn, String title, String[] authors) {
    return <book isbn={isbn}>
               <title>{title}</title>
               {
                   for (int i = 0; i < authors.Length; i++) {
                       yield <author>{authors[i]}</author>;
                   }
               }
           </book>;
}

Executing the loop constructs a sequence of author elements. When the loop terminates the sequence is returned as the result of the quote. Yield statements are basic ingredients of iterators; see Section 5 for an elaboration of this topic.

C#-xml also allows the quoting of element and attribute names. To escape quotes you have to use two open curly braces in a row. Example: Suppose that author elements are in fact email addresses and that you want to abbreviate emails with the familiar bracket notation, where for instance {a,b}@c means a@c and b@c. You use escapes to suppress evaluation of the braced group. Here is a declaration of a book with the email of its authors.

static Book book =
    <book year = "2000" isbn = "0-201-70914-7">
        <title>Essential XML</title>
        <author> {{Box,Skonnard,Lam}}@developmentor.com </author>
    </book>;
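The dynamic construction described in this subsection has a rough untyped counterpart in the functional construction style of the .NET LINQ to XML library. The sketch below is an analogy under that assumption, not the translation produced by the C#-xml pre-compiler: an iterator with yield return plays the role of the yield block inside the quote, and nothing is checked against the Book schema type. The names BookBuilder and Authors are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class BookBuilder
{
    // Iterator: yields one <author> element per name, like the yield loop in the quote.
    public static IEnumerable<XElement> Authors(string[] authors)
    {
        for (int i = 0; i < authors.Length; i++)
            yield return new XElement("author", authors[i]);
    }

    public static XElement CreateBook(string isbn, string title, string[] authors)
    {
        return new XElement("book",
            new XAttribute("isbn", isbn),
            new XElement("title", title),
            Authors(authors));   // the yielded sequence is spliced in as child elements
    }

    public static void Main()
    {
        XElement book = CreateBook("0-201-70914-7", "Essential XML",
                                   new[] { "Box", "Skonnard", "Lam" });
        Console.WriteLine(book);
    }
}
```

Passing an empty author array here still produces a book element, although the schema requires at least one author; in C#-xml such a mismatch is meant to be caught by the type system or by a generated runtime check.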

Multiple and Default Namespaces

All element and attribute names have two components: a local name and a namespace URI. For documents the local name is mandatory. The schema defines whether one has to qualify the local name with a namespace URI. C#-xml allows using several namespaces in parallel including a default one:

using ph = "http://www.PrenticeHall.com/books.xsd";
using aw = "http://www.AddisonWesley.com/books.xsd";
using default springer = "http://www.Springer.com/books.xsd";

public class Sample {
    static `aw:Book` b1 = <aw:book ...> … </aw:book>;
    static `ph:Book` b2 = <ph:book … > … </ph:book>;
    static Book b3 = <book … > … </book>;
    ...
}

Schema names that don’t follow C#’s constraints for identifiers must be written in backquotes (here aw:Book and ph:Book).

4 Projection and Update

C#-xml uses a subset of [XPath]’s functionality to query, update or delete a document or data store for some node or sequence of nodes. XPath got its name from its use of path notation (as with URLs and directories) for navigating through the hierarchy of an XML document. For example, the path "book/author" means find "author" elements that are directly contained within "book" elements.

Projection

In C#-xml projection is expressed using get expressions. Get takes an XPath as an argument. The XPath expression begins with an expression that identifies a specific document or sequence of documents. Next follows a series of "steps". Each step represents movement through a document: / selects children of the current node maintaining their order; // selects the current node and all its sub-nodes in document order. Either a match or a function application has to follow ‘/’ or ‘//’. One can match on element, attribute or URI names, or nodetypes.

For instance, let us assume that a variable bib of type Bib (see Section 2) is defined within the scope of the following declarations. To find all author elements within bib, we write the C#-xml expression

get bib/book/author

But what is the type of this expression and how does projection work? The expression bib is of type Bib, next bib/book selects all book elements of bib. According to its schema type this can result in zero or more books, i.e. the type of bib/book is Book[0-*]. Finally the expression bib/book/author selects all authors of each book; therefore it has zero or more author elements. Therefore a correctly typed query is

Author[0-*] q1 = get bib/book/author;

provided the author type is given as follows: group Author {elem author {string} [1-*]}

Two different but equivalent queries are

Author[0-*] q2 = get bib/*/author;
Author[0-*] q3 = get bib//author;

The difference between the queries q1, q2 and q3 is the order of traversal. The XPath expression of q1 first iterates over all books in bib and then over their authors. The XPath expression of q2 iterates over all children of the bib database, which happen to be books, and then over their authors. Finally q3 selects bib itself, all children, grandchildren, great-grandchildren and so on that have an author child.

Instead of a match XPath also allows the use of built-in functions, the most prominent being data(), which selects the simple typed content of an element or attribute. This is the way to find all author elements as strings or to find out all prices as integers.

string[0-*] q4 = get bib//author/data();
int[0-*] q5 = get bib/book/@price/data();

A predicate can follow a match to eliminate nodes that fail to satisfy a given condition. Predicates are written within square brackets. For instance to find all books which appeared in 2000 one would write:

Book[0-*] q6 = get bib/book[./@year/data() == 2000];

C#-xml’s projection also supports an optional sortedby clause. For instance, the following C#-xml statement sorts the resulting sequence of books by title:

Book[0-*] q7 = get bib/book sortedby title/data();

The sortedby clause can take several keys (in which case it sorts them lexicographically) and the modifiers ascending and descending.

C#-xml also supports aggregation. But there is no magic involved; sequences are ordinary generic data structures predefined in the System.XML.Types namespace. Sequences provide the well-known aggregation functions from SQL, like count, min, max. But they also provide additional functions like every and some, which stand for universal and existential quantification, respectively. Here are two examples: the first one tests whether the bib database contains a book with a particular isbn number. The second query tests whether all books of the bib database were published in the previous century.

bool q1 = Seq.some(get bib/book[./@isbn/data() == "0-201-70914-7"]);
bool q2 = Seq.every(get bib/book[./@year/data() < 2000]);
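For readers who want to experiment with queries of this shape, the following sketch expresses rough equivalents of the projections, the predicate, the sortedby clause and the some/every quantifiers using the .NET LINQ to XML library and LINQ operators. This is an assumption made for illustration: the results are untyped, the method names are hypothetical, and the year attribute is read as an integer as in the queries above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BibQueries
{
    // The example document from Section 2, without the schema location attributes.
    public static readonly XElement Bib = XElement.Parse(@"<bib>
  <book year='2000' isbn='0-201-70914-7'>
    <title>Essential XML</title>
    <author>Box</author><author>Skonnard</author><author>Lam</author>
  </book>
  <book isbn='0-201-17888-5' price='22.50' year='1999'>
    <title>Component Software</title>
    <author>Szyperski</author>
  </book>
</bib>");

    // Roughly: get bib/book/author/data()
    public static List<string> Authors() =>
        Bib.Elements("book").Elements("author").Select(a => a.Value).ToList();

    // Roughly: get bib//author/data()
    public static List<string> AuthorsAnywhere() =>
        Bib.Descendants("author").Select(a => a.Value).ToList();

    // Roughly: get bib/book[./@year/data() == year]
    public static List<XElement> BooksFrom(int year) =>
        Bib.Elements("book").Where(b => (int?)b.Attribute("year") == year).ToList();

    // Roughly: get bib/book sortedby title/data(), returning the titles
    public static List<string> TitlesSorted() =>
        Bib.Elements("book")
           .Select(b => (string)b.Element("title"))
           .OrderBy(t => t, StringComparer.Ordinal)
           .ToList();

    // Existential quantification, in the spirit of Seq.some
    public static bool HasIsbn(string isbn) =>
        Bib.Elements("book").Any(b => (string)b.Attribute("isbn") == isbn);

    // Universal quantification, in the spirit of Seq.every
    public static bool AllPublishedBefore(int year) =>
        Bib.Elements("book").All(b => (int?)b.Attribute("year") < year);

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Authors()));
        Console.WriteLine(string.Join(", ", TitlesSorted()));
        Console.WriteLine(BooksFrom(2000).Count);
        Console.WriteLine(HasIsbn("0-201-70914-7"));
        Console.WriteLine(AllPublishedBefore(2000));
    }
}
```

Note that a misspelled element name such as "autor" would compile and silently return an empty sequence here, whereas the typed get expression is intended to reject it at compile time.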

Selective Update

XML is often used to label the information content of diverse data sources including relational databases and object repositories. For these applications selective updates are mandatory. However XML values are mathematical values. Only variables containing XML values can be updated.

For updates of XML variables C#-xml uses the set statement. A set statement takes an XPath as its left hand side, which describes the node(s) to be updated, and an expression as its right hand side. We first consider updating single nodes and then look at updating many nodes in parallel.

Suppose that we have a book variable of type Book (as defined in the previous section). Let’s start with updating attributes.

set book/@price = 25.0;

The XPath book/@price selects the node whose content is updated, i.e. this statement changes the price attribute of the book. If the attribute price exists, it is overwritten; otherwise a new attribute node is generated.

Updates on elements work similarly. For instance here is a statement to change the title of a book from lower case to uppercase:

set book/title = <title>{ (get book/title/data()).ToUpper() }</title>;

Note that the string method ToUpper is applied on the result of a projection. In C#-xml this is correct if the sequence of type t is known to be a singleton sequence, i.e. has type t[1-1], which in our example is the case.

The right hand side of the expression can refer to the selected value using the implicit parameter value. The previous statement can be simplified to

set book/title = <title>{ (get value/data()).ToUpper() }</title>;

But what happens if the value of the referred node of type t is not guaranteed to exist, for example since value denotes an optional attribute? In this case value will return an optional type (i.e. t[0-1]) and the user has to handle it appropriately (for an example, see Subsection Parameter Passing below).

If the XPath expression results in many nodes, the modification is done simultaneously. To change the last name of all authors from lowercase to uppercase you would write:

set bib/book/author= <author>{ (get value/data()).ToUpper() }</author> ;

C#-xml allows inserting elements before or after existing elements, by adding the modifier before or after to the set statement. Here is a statement to insert a new author to the book:

set before bib/book/author[1] = <author> Abiteboul </author>;

Note that before and after can also work on multiple nodes. To delete an attribute or element, use the delete statement.

delete book/@price;

Again, deletion can refer to multiple nodes. Selective updates and deletions may introduce additional runtime checks if the compiler can’t statically decide that the new value is a subtype of the runtime type of the updated value. This is exactly the same problem that is well-known from array-updates in C#.
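The set and delete statements of this subsection can be approximated with the in-place mutation methods of the .NET LINQ to XML library, as in the sketch below. This is an analogy under that assumption: the library mutates reference objects rather than updating variables that hold mathematical values, and it performs no check that the result still conforms to the Book type. The class name, the sample data and the deletion of @year (instead of @price, so that the price update stays observable) are illustrative choices.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class BibUpdates
{
    public static XElement SampleBook()
    {
        return XElement.Parse(
            "<book isbn='0-201-17888-5' year='1999'>" +
            "<title>component software</title><author>szyperski</author></book>");
    }

    public static XElement Apply(XElement book)
    {
        // set book/@price = 25.0  (overwrites the attribute or creates it)
        book.SetAttributeValue("price", "25.0");

        // set book/title = <title>{ ...ToUpper() }</title>
        XElement title = book.Element("title");
        title.ReplaceWith(new XElement("title", title.Value.ToUpperInvariant()));

        // set book/author = ...  applied to every selected node.
        // ToList() snapshots the selection before the tree is modified.
        foreach (XElement author in book.Elements("author").ToList())
            author.ReplaceWith(new XElement("author", author.Value.ToUpperInvariant()));

        // set before book/author[1] = <author>Abiteboul</author>
        book.Elements("author").First().AddBeforeSelf(new XElement("author", "Abiteboul"));

        // delete book/@year  (deleting @price works the same way)
        book.Attribute("year")?.Remove();
        return book;
    }

    public static void Main()
    {
        Console.WriteLine(Apply(SampleBook()));
    }
}
```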

Parameter Passing

By default XML parameters are passed by value. If they should be updated inside a function they must be passed by reference. The example shows how to increment a book price by 10% if it exists and if it doesn’t exist how to generate a new attribute with a default value.

static void inc (ref Book b, int dflt) {
  set b/@price = value.Length == 1 ? get value/data() * 1.1 : dflt;
}
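The same logic can be sketched in Python, where element objects are mutable and shared with the caller, which roughly plays the role of the ref parameter. The two-decimal price format is an assumption of this sketch; the paper does not specify how the price is represented.

```python
import xml.etree.ElementTree as ET


def inc(book, default):
    """Raise the price by 10% if the attribute exists, else set it to `default`."""
    price = book.get("price")  # None models the empty case of the optional type t[0-1]
    if price is not None:
        book.set("price", f"{float(price) * 1.1:.2f}")
    else:
        book.set("price", f"{float(default):.2f}")


priced = ET.fromstring('<book price="50.00"/>')
unpriced = ET.fromstring("<book/>")
inc(priced, 30)
inc(unpriced, 30)
print(priced.get("price"), unpriced.get("price"))
```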

The context conditions for reference parameters carry over from C# to C#-xml; for instance, books that are passed by reference must denote (parts of) updateable variables. In Section 5 we will see that not all XML variables are in fact updateable.

5 Database Iteration and Stream Processing

If an XML document represents a data store, then one is often interested in aggregating information or comparing information from different stores, for instance to search for the best price of a particular product. If the document contains mainly markup, then it must often be converted from one representation into another, for example to display the received XML in a browser that can only render HTML. In both cases programs iterate over the provided documents. Thus, although data processing and stream processing are different, they have iteration in common. However, the optimizations for the two kinds of iteration differ. Iteration over data stores should allow query optimizations; iteration for transformations should be done lazily, i.e. piece by piece.

Database Iteration

The support for iteration comes almost for free: selections return sequences, and sequences implement the IEnumerable and IEnumerator interfaces; thus we are all set to reuse C#'s foreach loop to express iteration. Here is an example that collects all titles of the bib database as strings.

String[0-*] ts = new Seq();
foreach (Book b in get bib/book) {
  ts += get b/title/data();
}

The foreach statement iterates over all book elements in bib, and binds the variable b to each such element. For each element bound to b, the body of the foreach loop selects the data of the title and appends it to the resulting sequence ts.

Iterators can be nested. In SQL this corresponds to computing inner joins. For ease of reading, writing and optimization, C#-xml also supports iterated bindings within one foreach statement. Here is an example that selects, from two databases of type Bib, all books that have overlapping authors but different titles:

Book[0-*] res;
foreach (Book b1 in get bib1/book,
         Book b2 in get bib2/book,
         String a1 in get b1/author/data(),
         String a2 in get b2/author/data()
         [a1 == a2 && b1/title/data() != b2/title/data()] ) {
  res += b1;
  res += b2;
}

In C#-xml the last generator varies fastest: a2 varies faster than a1, which varies faster than b2, which in turn varies faster than b1. For instance, let bib1 and bib2 denote the initial Bib document given in Section 2; furthermore, let B1 denote the Essential XML book and let K11, K12, K13 be its authors; likewise let B2 denote the book Component Software and K2 its author. Then the above iteration generates the following sequence of bindings, where an ellipsis (…) denotes that the binding of that variable is unchanged.

n'th Binding   b1    b2    a1    a2
1              B1    B1    K11   K11
2              …     …     …     K12
3              …     …     …     K13
4              …     …     K12   K11
5              …     …     …     K12
6              …     …     …     K13
7              …     …     K13   K11
8              …     …     …     K12
9              …     …     …     K13
10             …     B2    K11   K2
11             …     …     K12   K2
etc            etc   etc   etc   etc
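The binding table can be reproduced with ordinary nested loops in which the rightmost generator is the innermost loop. The Python sketch below uses the symbolic names from the table as stand-in data (the dictionary AUTHORS and the list BOOKS are assumptions of the sketch) and omits the filter condition, so it enumerates all candidate bindings in order.

```python
from itertools import islice

# Stand-in data: each book symbol maps to its author symbols, as in the table.
AUTHORS = {"B1": ["K11", "K12", "K13"], "B2": ["K2"]}
BOOKS = ["B1", "B2"]


def bindings():
    """Enumerate (b1, b2, a1, a2); the last generator is the innermost loop."""
    for b1 in BOOKS:
        for b2 in BOOKS:
            for a1 in AUTHORS[b1]:
                for a2 in AUTHORS[b2]:
                    yield (b1, b2, a1, a2)


first = list(islice(bindings(), 11))
for n, row in enumerate(first, start=1):
    print(n, *row)
```

Rows 1 to 11 of the printed output correspond to the first eleven rows of the table, with the ellipses expanded to the unchanged values.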

Changing the order of the generators obviously changes the result. But preserving the order prevents query optimizations, and for most data processing applications the order is irrelevant. To abstract from the order, C#-xml adds a modifier unordered to the iterator, i.e. we write

foreach unordered (..){ ..}

Bindings can now occur in any order; this can be exploited by the query processor.

Stream Processing

Normally one uses [XSLT] scripts to transform XML. Here we propose to use C#-xml. It is at least as efficient as XSLT; in addition it is type safe. To demonstrate C#-xml's capabilities, let us build a small application that takes a bib input stream and transforms it into an output stream by stripping the year and price attributes. Our main program looks as follows.

public static void Main(string[] args) {
  readonly Bib bibIn = Bib.OpenRead(args[0]);
  writeonly Bib bibOut = Bib.OpenWrite(args[1]);
  bibOut = <bib> { StripYearAndPrice(get bibIn/book) } </bib>;
  bibIn.Close();
  bibOut.Close();
}

Bib is a schema root element; schema root elements provide methods to load and unload a document of the corresponding type. Thus we load the input stream and require that nobody writes to it. This is a specific requirement for stream processing. If bibIn represented a data store, we would probably not declare it as readonly. Likewise we open the output stream and make sure that nobody reads from it. The whole program now amounts to reading from the input variable and writing to the output variable.

Now let's look at StripYearAndPrice. A naïve approach would probably result in the following code.

static Book[0-*] StripYearAndPrice (Book[0-*] books) {
  Book[0-*] result = new Seq();
  foreach (Book b in books)
    result += <book {b/@isbn}> {b/title}{b/author} </book>;
  return result;
}

There is nothing wrong with this code except that it doesn't perform well: it computes the whole sequence before StripYearAndPrice returns it. We would like instead to delay the computation of the sequence.

In the imperative world computations are delayed by using iterators. C#-xml uses CLU style iterators as suggested by Proebsting [Iterators for C#]. An iterator is a procedure-like mechanism that yields a sequence of values. For instance the static iterator prod produces one or two values.

static iterator char[1-2] prod(bool b) {
  yield 'a';
  if (b) yield 'b';
}

When prod is called within a foreach loop with parameter value true, it first returns 'a', next it returns 'b', and next it fails. If it is called with the value false, it first returns 'a' and next it fails. Iterators are a very convenient way to write enumerators: all the state that needs to be maintained between the calls is handled by the iterator. Using iterators we can now write the following simplified version of StripYearAndPrice:

static iterator Book[0-*] StripYearAndPrice (Book[0-*] books) {
  foreach (Book b in books)
    yield <book {b/@isbn}> {b/title}{b/author} </book>;
}

This code is simple and efficient. In fact, reading and writing of sequences (for example to compute the result of path expressions or to read data from a file or data store) is internally always done using iterators. This also gives the intended effect for our running example: consumption and production of books is interleaved in the best possible way, i.e. one book is processed after another.
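Python generators behave much like the CLU style iterators described above, so the interleaving can be observed directly in a small sketch. The sample books, the log list and the function read_books are assumptions introduced for illustration; the input is an in-memory list rather than a real stream.

```python
import xml.etree.ElementTree as ET

log = []  # records the order in which books are read and written


def read_books(xml_texts):
    """Lazily parse one book at a time, noting each read in the log."""
    for text in xml_texts:
        log.append("read")
        yield ET.fromstring(text)


def strip_year_and_price(books):
    """Yield a copy of each book that keeps only isbn, title and author."""
    for b in books:
        out = ET.Element("book", {"isbn": b.get("isbn", "")})
        out.extend(b.findall("title") + b.findall("author"))
        yield out


SOURCE = [
    '<book isbn="1" year="2000" price="39.95"><title>T1</title><author>A</author></book>',
    '<book isbn="2" year="2001" price="49.95"><title>T2</title><author>B</author></book>',
]

written = []
for book in strip_year_and_price(read_books(SOURCE)):
    log.append("write")
    written.append(ET.tostring(book, encoding="unicode"))

print(log)
```

The log alternates between reads and writes, which shows that the second book is not parsed until the first has been written.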

6 Untyped XML documents

So far we have dealt only with schema validated documents. However C#-xml is also capable of handling non-typeable documents as long as they are well-formed. Well-formed documents obey the following schema.

group AnyTree { choice { AnySimpleType; AnyElement; AnyAttribute } }
group AnyAttribute { attr `*:*` { AnySimpleType } }
group AnyElement { elem `*:*` { AnyComplexType } }
group AnyComplexType { AnyAttribute[0-*]; choice { choice { AnyElement; string }[0-*]; simpleType } }
group AnyType { AnyTree[0-*] }

AnySimpleType stands for the most general simple type; all simple types like int or normalizedString are subtypes of it. The type AnyTree stands for any simple type, attribute or element. AnyAttribute stands for the most general attribute, which must have a name and a simple type. AnyElement must have a name and a complex type. The latter can consist of attributes followed by either a simple type or mixed content. Finally AnyType is the most general XML type.

Type Tests, Type Cast and Type Switch

Type test and type cast are used to recover more precise type information than is statically given. A type test checks whether the given value complies with the required schema type. A type cast validates the given expression and casts it appropriately; if validation fails, an exception is thrown. To make this more precise, let's assume that we load a document. If it complies with our bib schema then we perform one action, otherwise a different one. This pattern can be programmed as follows:

public static void Main(string[] args) {
  AnyElement a = AnyElement.Load(args[0]);
  if (a is Bib) {
    Bib b = (Bib) a;
    do_this
  } else {
    do_that
  }
}

A type switch combines type test and type cast. We can write the same statement as shown in the previous example in a more compact way as follows:

public static void Main(string[] args) {
  typeswitch (AnyElement.Load(args[0])) {
    case Bib b: do_this break;
    default: do_that break;
  }
}
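To give a feel for what a type test does at run time, the following Python sketch loads a merely well-formed document and dispatches on a hand-written structural check. The function is_bib is a deliberately simplified, hypothetical stand-in for validation against the Bib schema (it only checks element names and the presence of a title), not the validation procedure of C#-xml.

```python
import xml.etree.ElementTree as ET


def is_bib(elem):
    """Simplified structural test: a bib root whose children are books with a title."""
    return elem.tag == "bib" and all(
        child.tag == "book" and child.find("title") is not None for child in elem
    )


def process(text):
    """Parse any well-formed document, then branch on the structural test."""
    root = ET.fromstring(text)  # checks well-formedness only
    if is_bib(root):
        return "do_this"
    return "do_that"


print(process("<bib><book><title>T</title></book></bib>"))
print(process("<note/>"))
```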

7 Summary

C#-xml supports the XML schema type system to a large degree. C#-xml checks the structural constraints on types statically. It uses runtime checks to check range constraints on values. We haven't dealt with all of the XML schema features, such as key or unique attributes or substitution groups, but we are confident that we can deal with them when we extend this work. Furthermore, C#-xml currently doesn't support any meaningful form of reflection for XML types.

Except for references, C#-xml already supersedes the functionality of XQuery. (References will be added to C#-xml.) It combines full support for queries, i.e. declarative processing, with imperative processing.

C#-xml provides high performance and type safe stream processing. We currently haven't integrated any of the functionality of XSLT (except XPath's functionality). This is something that we would like to do in the future.

Summarizing, we have shown that it is possible to have XML as a first-class citizen in modern class-based languages. Only a bridge between both type worlds is needed. Building the bridge is mainly an engineering task. But once it is available, it offers the best of both worlds!

Acknowledgements

We thank Don Box for valuable insights that guided the direction of this work. We thank Mike Barnett for a careful review of this paper. We thank MSR’s database team headed by Phil Bernstein for many helpful comments.

Bibliography

[Box, Skonnard, Lam] Don Box, Aaron Skonnard, John Lam, Essential XML, Addison Wesley, 2000.
[Box] Don Box, House of Web Services, MSDN Magazine, Nov. 2001.
[XML] Tim Bray, Jean Paoli and C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, World Wide Web Consortium, 1998. Available at http://www.w3.org/TR/REC-xml.
[XML Schema: Formal Description] Allen Brown, Matthew Fuchs, Jonathan, Philip Wadler, XML Schema: Formal Description, World Wide Web Consortium, September 2001. Working draft. Available at http://www.w3.org/TR/2001/WD-xmlschema-formal-20010925/.
[XML Schema Part 1] Henry S. Thompson, David Beech, Murray Maloney and Noah Mendelsohn, XML Schema Part 1: Structures, World Wide Web Consortium, 2001. Available at http://www.w3.org/TR/xmlschema-1/.
[XML Schema Part 0] David C. Fallside, XML Schema Part 0: Primer, World Wide Web Consortium, 2001. Available at http://www.w3.org/TR/xmlschema-0/.
[XML Schema Part 2] Paul V. Biron and Ashok Malhotra, XML Schema Part 2: Datatypes, World Wide Web Consortium, 2001. Available at http://www.w3.org/TR/xmlschema-2/.
[XPath 1.0] James Clark and Steve DeRose, XML Path Language (XPath) Version 1.0, World Wide Web Consortium, 1999. Available at http://www.w3.org/TR/xpath.
[Xduce] Haruo Hosoya, Jerome Vouillon and Benjamin C. Pierce, "Regular Expression Types for XML", Proceedings of the International Conference on Functional Programming (ICFP), 2000.
[XMLambda] Erik Meijer and Mark Shields, XMLambda: A functional language for constructing and manipulating XML documents, Draft, 1999. Available at http://www.cse.ogi.edu/~mbs/pub/xmlambda/.
[XSLT] James Clark, XSL Transformations (XSLT), Version 1.0, World Wide Web Consortium, 1999. Available at http://www.w3.org/TR/1999/REC-xslt-19991116.
[Mitchell] John C. Mitchell, Foundations for Programming Languages, MIT Press, Cambridge, Massachusetts, 1996.
[CLI] Common Language Infrastructure. Submitted ECMA Standard. Available at http://www.msdn.microsoft.com/net/ecma/.
[Iterators for C#] Todd Proebsting, Iterators for C#. Internal memo 2000, Microsoft confidential.
[XQuery] Henry S. Thompson, David Beech, Murray Maloney and Noah Mendelsohn, XML Schema Part 1: Structures, World Wide Web Consortium, 2001. Available at http://www.w3.org/TR/xmlschema-1/.
[XQuery Formal Semantics] Peter Fankhauser, Mary Fernández, Ashok Malhotra, Michael Rys, Jérôme Siméon, Philip Wadler. Working draft. Available at http://www.w3.org/TR/2001/WD-query-semantics-20010607.
[XQuery 1.0 and XPath 2.0 Data Model] World Wide Web Consortium, XQuery 1.0 and XPath 2.0 Data Model, Working Draft, June 2001. See http://www.w3.org/TR/query-datamodel/.
[C#] Anders Hejlsberg, Scott Wiltamuth, C# Language Specification. Submitted ECMA Standard. Available at http://www.msdn.microsoft.com/net/ecma/.
[DOM] Arnaud Le Hors, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, Document Object Model (DOM) Level 2 Core Specification, Version 1.0. W3C Recommendation, 13 November 2000. Available at http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113.
[Infoset] John Cowan, Richard Tobin, The XML Information Set, W3C Recommendation, 24 October 2001. Available at http://www.w3.org/TR/2001/REC-xml-infoset-20011024.
