Efficient XML Interchange

XML

Why is XML good?
• A widely accepted standard for data representation
• Fairly simple format
• Flexible

It’s not used by everyone, but it’s used by enough people to make for a rich tools environment.

It’s flexible enough to be used in lots of contexts.

It’s text based and human readable, which makes it a good archival format.

XML

XML in 10 points: http://www.w3.org/XML/1999/XML-in-10-Points

Includes (3) “XML is text, but isn't meant to be read” and (4) “XML is verbose by design”.

XML can be read by humans (though it should not have to be), and is not very compact.

XML

These design principles also make it very difficult to use XML in some environments

• Wireless military links: low bandwidth
• Mobile devices: battery life limitations
• Processing efficiency: it can take CPU cycles to parse XML
• Data binding

Limitations

Lots of ships have 64 Kbit/sec at best. It is problematic to ship XML across these links

CPUs are on Moore’s law curve, but battery power is limited by the state of chemistry. We can’t assume that faster processors will save us. Lots of applications for hand held devices with limited battery power (cell phones, etc.)

Cell phones don’t necessarily have strong CPUs, so parsing XML can be expensive relative to other tasks

Data Binding

This is a more subtle problem.

<Point x="1.0" y="2.0"/>

How do you convert this to an object? You need to parse the string "1.0", then convert it to a binary representation.

It’s the difference between

string x;

and

float x;

Data Binding

Typically something comes in from the wire, and you have to do the Java equivalent of

Float.parseFloat("1.0");

This is expensive when working with numeric-heavy documents.

It is much more efficient to keep the value x in a binary representation in the document, then simply read it on the receiving side.
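The two receiving paths can be sketched side by side. This is only an illustration: the class and method names are made up for this example, and the 4-byte IEEE 754 layout is an assumption chosen for simplicity, not the encoding EXI itself uses for floats.

```java
import java.nio.ByteBuffer;

final class DataBindingSketch {
    // Text path: the value arrives as characters and must be parsed.
    static float fromText(String wireValue) {
        return Float.parseFloat(wireValue);
    }

    // Binary path: the value arrives as 4 bytes and is read directly.
    // Assumption: big-endian IEEE 754, which is ByteBuffer's default.
    static float fromBinary(byte[] wireBytes) {
        return ByteBuffer.wrap(wireBytes).getFloat();
    }

    // Plays the role of the sender in the binary case.
    static byte[] toBinary(float value) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(value).array();
    }

    public static void main(String[] args) {
        float parsed = fromText("1.0");
        float read = fromBinary(toBinary(1.0f));
        System.out.println(parsed + " " + read);
    }
}
```

The text path has to scan and interpret every character of the value. The binary path only reassembles four bytes.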

Efficient XML Interchange

EXI relaxes some of the requirements of XML in order to be more compact, faster to parse, and have better data binding characteristics

• Relax the “human readable” requirement
• Allow binary data

What you get is an alternate encoding of the XML infoset that is more compact, faster to parse, and allows deployment in new environments where XML previously could not be deployed.

EXI

EXI is being developed by a W3C working group and is on a standards track. The hope is that this will become a W3C-blessed encoding of the XML infoset

The working group draft is now working its way to approval.

Approval needs multiple implementations, the blessing of the W3C Technical Architecture Group, and approval by other W3C working groups (encryption, processors, etc.)

EXI

• Represents the same data as an XML document, only in a more efficient encoding

• Minimal impact on other XML technologies, such as encryption

• More efficient to parse, better data binding performance

EXI

http://www.w3.org/XML/EXI

Includes file format specification, primer on EXI, best practices.

Note that one thing that is NOT specified is an API for accessing the data. This is an important and significant omission.

Lack of a standardized typed API means we still have to go through string representations

Typed API

What is meant by a typed API?

DOM and SAX return string values:

Attr anAttribute;
…
// DOM returns a String attribute value here
String val = anAttribute.getValue();

And then we need to convert val into a float via

Float aFloat = Float.parseFloat(val);

Typed API

But what we often want is the value specified in the schema:

Float aFloat = anAttribute.getFloat();

There are proposals for a generalized typed API, but it is not part of this standard.
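To make the idea concrete, here is a minimal sketch of what such an attribute might look like. Everything in it is hypothetical: `TypedAttribute` and both implementations are invented for this example and are not part of DOM, SAX, or the EXI drafts.

```java
// Hypothetical typed-attribute interface (an assumption, not a real API).
interface TypedAttribute {
    String getValue();  // string view, like DOM's Attr.getValue()
    float getFloat();   // typed view, the call the slide wishes for
}

// Backed by text: getFloat() has to parse the string on every call.
final class TextBackedAttribute implements TypedAttribute {
    private final String text;
    TextBackedAttribute(String text) { this.text = text; }
    public String getValue() { return text; }
    public float getFloat() { return Float.parseFloat(text); }
}

// Backed by an already-decoded binary value: getFloat() is a field read.
final class FloatBackedAttribute implements TypedAttribute {
    private final float value;
    FloatBackedAttribute(float value) { this.value = value; }
    public String getValue() { return Float.toString(value); }
    public float getFloat() { return value; }
}

final class TypedApiSketch {
    public static void main(String[] args) {
        TypedAttribute fromText = new TextBackedAttribute("1.0");
        TypedAttribute fromBinary = new FloatBackedAttribute(1.0f);
        System.out.println(fromText.getFloat() == fromBinary.getFloat());
    }
}
```

With a string-only API, even a decoder that holds the value in binary has to go through the equivalent of `getValue()` and a re-parse. A typed accessor lets the binary-backed case skip both steps.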

EXI

EXI has several options to handle different situations.

• You have an XML document and a schema

• You have an XML document but no schema

• You have an XML document, and a schema that almost, but not quite, matches the document

Element and Attribute Names

Tag names take up a lot of space, and can be somewhat expensive to parse

<Name first="James" last="Madison"> <State>Virginia</State></Name>

Count up the characters used for markup here: 31/55 ~= 50-60% of file size for markup tags.

If we replace the character tags with numeric stand-ins, the document gets much more compact and faster to parse.

Schema-Informed

If you have a schema, that gives you type information about the XML document. You know that <foo x="1.0"/> means the x is a float value rather than a string, because the schema tells you that.

That means you can store the “1.0” value in a binary format, which is generally more compact and has the potential to have better data binding with a typed API

Schemaless

What if you don’t have a schema? This means you can’t exploit type information. But EXI should support this situation, because it should be a general solution

EXI handles this by replacing repeating strings with a compact identifier

Schemaless

<Address town="Monterey" zip="93943"/>

The strings “Monterey” and the zip code are likely to be repeated many times in an XML document. We can create a table of these values, and then use the table ID rather than the whole string.

String      ID
Monterey    1
93940       2
San Jose    3
98842       5
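The string-table idea can be sketched in a few lines. This is a simplified illustration, not the EXI string table algorithm: the `lit:` and `id:` markers and the class name are invented for the example, and IDs are simply handed out in first-seen order starting at 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringTableSketch {
    // First occurrence of a string is emitted as a literal and given an ID;
    // every later occurrence is replaced by that ID.
    static List<String> encode(List<String> values) {
        Map<String, Integer> table = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String value : values) {
            Integer id = table.get(value);
            if (id == null) {
                table.put(value, table.size() + 1); // IDs start at 1
                out.add("lit:" + value);
            } else {
                out.add("id:" + id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Monterey", "93943", "San Jose", "Monterey", "93943");
        System.out.println(encode(values));
    }
}
```

Because the decoder sees the same literals in the same order, it can rebuild the same table on its side, so the table itself never has to be sent.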

“Almost” Schemas

If you have a document that doesn’t quite match the schema, EXI can take a forgiving attitude. It uses the schema to encode the types it knows about, and uses strings and string table identifiers to handle the ones not described by the schema
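A rough sketch of that forgiving behavior follows. It assumes a toy "schema" that is just a set of attribute names declared as floats, and a made-up one-byte tag that tells the decoder which form follows. None of this is the real EXI wire format, and a real encoder would also need a length or terminator for the string case.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

final class AlmostSchemaSketch {
    static final byte TYPED_FLOAT = 0;  // hypothetical tag: 4-byte float follows
    static final byte PLAIN_STRING = 1; // hypothetical tag: UTF-8 text follows

    static byte[] encodeAttribute(Set<String> floatAttributes, String name, String value) {
        if (floatAttributes.contains(name)) {
            try {
                float parsed = Float.parseFloat(value);
                return ByteBuffer.allocate(1 + Float.BYTES)
                        .put(TYPED_FLOAT).putFloat(parsed).array();
            } catch (NumberFormatException e) {
                // Value does not match the declared type: fall back to a string.
            }
        }
        byte[] text = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + text.length)
                .put(PLAIN_STRING).put(text).array();
    }

    public static void main(String[] args) {
        Set<String> schema = Collections.singleton("x");
        System.out.println(encodeAttribute(schema, "x", "1.0").length);
        System.out.println(encodeAttribute(schema, "label", "hello").length);
    }
}
```

Attributes the toy schema knows about get the compact typed form. Unknown attributes, and known ones whose value does not parse as a float, still encode as strings rather than being rejected.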

Implementations

As of now there is one implementation of the draft spec, Efficient XML from AgileDelta (http://www.agiledelta.com)

Other open source projects underway, and some commercial projects

The standards process requires that multiple independent implementations be available before the standard is approved

Results

Example: Distributed Interactive Simulation (DIS) is an IEEE standard for modeling and simulation. It is a binary standard that contains (x,y,z), velocity, acceleration, and other numeric-heavy data

We did an XML representation of the binary DIS standard

Results

            DIS Binary (bytes)   DIS XML     EXI Format
1 PDU       144                  1167        129
1000 PDUs   464,480              3,924,680   365,564
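To put the 1000-PDU numbers in context, the sketch below works out the size ratios and the time each file would take over the 64 Kbit/sec ship link mentioned earlier. It assumes 1 Kbit = 1000 bits and ignores all protocol overhead. The class and method names are made up for this example.

```java
final class SizeComparisonSketch {
    // Size of one encoding relative to another.
    static double ratio(long numeratorBytes, long denominatorBytes) {
        return (double) numeratorBytes / denominatorBytes;
    }

    // Idealized transfer time: payload bits divided by link rate.
    static double transferSeconds(long bytes, long bitsPerSecond) {
        return bytes * 8.0 / bitsPerSecond;
    }

    public static void main(String[] args) {
        long binary = 464_480;  // 1000 PDUs, DIS binary (from the table)
        long xml = 3_924_680;   // 1000 PDUs, DIS XML
        long exi = 365_564;     // 1000 PDUs, EXI
        long link = 64_000;     // assumed: 64 Kbit/sec = 64,000 bits/sec

        System.out.printf("XML / binary: %.2f%n", ratio(xml, binary));
        System.out.printf("EXI / binary: %.2f%n", ratio(exi, binary));
        System.out.printf("XML transfer: %.1f s%n", transferSeconds(xml, link));
        System.out.printf("EXI transfer: %.1f s%n", transferSeconds(exi, link));
    }
}
```

Under these assumptions the XML file is roughly 8.4 times the size of the binary one and needs about 490 seconds on the link, while the EXI file is about 0.79 times the binary size and needs about 46 seconds.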

Results

• Somewhat smaller than the original binary format. The exact EXI size varies somewhat depending on the numeric data, while the original binary format is always the same size. EXI seems to be consistently smaller, though

• AND it is marked up in a way that makes it equivalent to an XML file. This means we can easily access all the tools of the XML ecosystem by simply converting it to a text XML representation

Conclusions

Replace all text XML with EXI? No! EXI is intended to expand the use of XML into use cases that XML could not service. XML mostly does fine in its existing environment

EXI can be used to XML-ify existing binary protocols and get slightly better performance with greatly increased interoperability (no one knows DIS binary, everyone knows XML)

Next great frontier: typed XML APIs
