9/25/08 IEEE ICWS 2008

High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers

Wei Zhang & Robert van Engelen

Department of Computer Science

Florida State University

Presentation Overview

Schema-specific Parsers
Related Work
PTDX: Table-Driven XML Parser with Permutation Phrase Grammar
Performance
Conclusion

Schema-specific Parsers

Compile-time vs. run-time parsers
  Compile-time parsing and validation approaches use specialized compilation techniques to generate customized parsers from schemas
  Run-time approaches use generic drivers (or engines) and a grammar-like representation of schemas

Blocking vs. non-blocking parsers
  Blocking parsers may suspend the entire program until sufficient XML content has been received, e.g. recursive-descent parsers
  Non-blocking parsers always return control to the program, and buffered data can be supplied incrementally

Time-efficient vs. space-efficient parsers
  Time-efficient parsers encode many states
  Space-efficient parsers rely on backtracking

Related Work

[Van Engelen, 2001]
  The earliest work on a schema-specific LL(1) recursive-descent parser with namespace support and validation

[Van Engelen, 2004]
  Two-level DFA integrating parsing and validation

[Chiu et al., 2004]
  Nondeterministic generalized automata that merge all aspects of low-level parsing and validation

[Reuter, 2003]
  Cardinality-Constraint Automaton (CCA) for schema-aware validation

Related Work (Cont'd)

[Kostoulas et al., 2006]
  An efficient parser generator that translates an XML schema into a parser in either C or Java

[Matsa, 2007]
  Schema-directed interpretive XML parser using special-purpose byte codes

[Zhang et al., 2006]
  A table-driven approach that parses and validates in a single pass
  A generator that translates schemas into C

PTDX: Table-Driven XML Parser with Permutation Phrase

Table-driven grammar-based parser
  Extended LL(1) grammar with permutation phrase support
  Parsing table is constructed from the extended LL(1) permutation grammar

Run-time parser
  Generic parsing engine (2-stack PDA)

Both time and space efficient
  Predictive parsing
  Parsing and validation integrated into a single pass
  No buffering
  Operates on tokens
  Main stack size grows with the depth of the XML data
  Auxiliary stack size grows with the number of elements of <xs:all>, <xs:attribute>

Non-blocking parser
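As an illustration of the predictive, table-driven part only, here is a minimal sketch of an ordinary LL(1) engine with a single stack. It is not the PTDX engine: the grammar, the table contents and the token names (bR/eR, bA/eA, CD) are assumptions made up for this example, and permutation phrases and validation actions are left out.

```python
# Hypothetical tokens: bR/eR and bA/eA are start/end tags, CD is character data.
TERMINALS = {"bR", "eR", "bA", "eA", "CD"}

# LL(1) parsing table: (nonterminal, lookahead token) -> right-hand side.
TABLE = {
    ("Doc", "bR"): ["bR", "Items", "eR"],
    ("Items", "bA"): ["Item", "Items"],
    ("Items", "eR"): [],                  # Items -> epsilon
    ("Item", "bA"): ["bA", "CD", "eA"],
}

def ll1_parse(tokens, start="Doc"):
    """Return True if the token list can be derived from `start`."""
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos] if pos < len(tokens) else None
        if top in TERMINALS:
            # A terminal on the stack must match the next input token.
            if top != lookahead:
                return False
            pos += 1
        else:
            # A nonterminal is expanded by one table lookup, no backtracking.
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))
    return pos == len(tokens)

print(ll1_parse(["bR", "bA", "CD", "eA", "bA", "CD", "eA", "eR"]))  # True
print(ll1_parse(["bR", "bA", "eA", "eR"]))                          # False
```

Each token is inspected once and the stack only holds the unfinished parts of the current derivation, which is the sense in which the stack grows with nesting depth rather than with message length.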

Constructing PTDX Tables

[Diagram: XML Schemas → Mapping Rules → Extended LL(1) Permutation Phrase Grammar → LL(1) Parsing Table, Token Table, Action Table]

Note: actions are generated from schemas to perform type-checking verification, although some validation constraints are incorporated in the grammar productions.

Mapping Rules

Define the translation from schema components to LL(1) grammar productions
Preserve structural constraints
Map free-ordered schema components (<xs:all>, <xs:attribute>) to permutation grammar

Mapping Example

<complexType name="T">
  <all>
    <element name="a" type="string" minOccurs="0"/>
    <element name="b" type="string"/>
    <element name="c" type="string"/>
  </all>
</complexType>

T → << A || B || C >>

A → bA CD eA

A → ε

B → bB CD eB

C → bC CD eC

Note: bA and eA represent the tokens for the start tag and end tag of element "a", respectively; CD represents the CDATA token.
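The mapping above can be sketched in a few lines. This is a rough illustration, not the PTDX generator: it assumes an unprefixed schema fragment like the one on this slide, handles only string-typed child elements of a single <all> group, and the function name `map_all_group` and the upper-casing of element names into nonterminals are conventions invented here.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<complexType name="T">
  <all>
    <element name="a" type="string" minOccurs="0"/>
    <element name="b" type="string"/>
    <element name="c" type="string"/>
  </all>
</complexType>
"""

def map_all_group(schema_text):
    """Return (lhs, rhs) productions for one complex type with an <all> group."""
    ctype = ET.fromstring(schema_text)
    elements = ctype.find("all").findall("element")
    nonterminals = [e.get("name").upper() for e in elements]
    # The <all> group becomes one permutation phrase over its children.
    productions = [(ctype.get("name"), "<< " + " || ".join(nonterminals) + " >>")]
    for elem, nt in zip(elements, nonterminals):
        # Start tag, character data, end tag.
        productions.append((nt, f"b{nt} CD e{nt}"))
        # minOccurs="0" adds an empty alternative.
        if elem.get("minOccurs") == "0":
            productions.append((nt, "ε"))
    return productions

for lhs, rhs in map_all_group(SCHEMA):
    print(f"{lhs} → {rhs}")
```

Run on the fragment above, this prints the same five productions listed on the slide.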

Permutation Phrase

A permutation phrase is a grammatical phrase that specifies a syntactic construct as any permutation of a set of constituent elements.

E.g., the permutation phrase

<< a || b || c >>

recognizes language {abc, acb, bac, bca, cab, cba}
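A quick way to check the language of a small permutation phrase is to enumerate it. The helper name `permutation_language` is made up for this sketch, and it assumes single-character constituents.

```python
from itertools import permutations

def permutation_language(constituents):
    """All strings formed by ordering every constituent exactly once."""
    return {"".join(p) for p in permutations(constituents)}

print(sorted(permutation_language("abc")))
# ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
```

A phrase with n constituents has n! orderings, so writing each ordering as its own grammar production quickly becomes impractical as n grows.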

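Two-stack PDA for Parsing Permutation Phrase

Permutation phrase: << a || b || c >>
Input: b c a

[Figure: stack contents per step, listed from the top of the stack]
Step 1: main stack: a b c; aux stack: empty
Step 2: main stack: b c; aux stack: a
Step 3: main stack: a c; aux stack: empty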

Two-stack PDA for Parsing Permutation Phrase (Cont'd)

Permutation phrase: << a || b || c >>
Input: b c a

[Figure: stack contents per step, listed from the top of the stack]
Step 4: main stack: c; aux stack: a
Step 5: main stack: a; aux stack: empty
Step 6: main stack: empty; aux stack: empty

Note: All optional constituent elements are left on the auxiliary stack once all non-empty elements have been parsed.
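The stack movements in the figure can be approximated with the sketch below. It is a reconstruction from the slide, not the PTDX engine: constituents are treated as single tokens instead of nonterminals expanded through the LL(1) table, and `parse_permutation` and the `(name, optional)` pair layout are assumptions made for this example.

```python
def parse_permutation(constituents, tokens):
    """Recognize one permutation phrase with a main and an auxiliary stack.

    constituents: list of (name, optional) pairs; the first pair starts on top.
    Returns the number of tokens consumed, or None if a required
    constituent never matched.
    """
    main = list(reversed(constituents))   # last list item is the stack top
    aux = []
    pos = 0
    while main:
        name, optional = main.pop()
        if pos < len(tokens) and tokens[pos] == name:
            # Matched: consume the token and retry the set-aside constituents.
            pos += 1
            while aux:
                main.append(aux.pop())
        else:
            # No match for the current token: set the constituent aside.
            aux.append((name, optional))
    # Whatever is left on the auxiliary stack must be optional.
    if all(optional for _, optional in aux):
        return pos
    return None

phrase = [("a", True), ("b", False), ("c", False)]   # "a" is optional
print(parse_permutation(phrase, ["b", "c", "a"]))    # 3
print(parse_permutation(phrase, ["c", "b"]))         # 2
print(parse_permutation(phrase, ["b"]))              # None
```

For the input b c a the stacks pass through the same six configurations as in the figure. Each constituent is on exactly one of the two stacks at any time, so the auxiliary stack never holds more entries than the phrase has constituents.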

PTDX Architecture

[Figure: architecture diagram; label: Hot-swappable]

Schema-directed Scanner

Optimized by schema
  E.g., scanning for a specific tag name is more efficient than scanning a generic string and then comparing it

Tokenizer
  Breaks the XML message into a token stream

Token
  Defined by element names, attribute names, enumeration values
  Classified as starting tags and closing tags
  Normalized namespace binding <namespace, tag_name>
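One way to picture the normalized namespace binding is a lookup keyed by the <namespace, tag_name> pair, so that the prefix used in a message does not affect the token. The table contents, the namespace URI and the function name `tokenize_tag` below are hypothetical and only illustrate the idea.

```python
# Hypothetical token table derived from a schema: (namespace, tag_name) -> id.
TOKEN_TABLE = {
    ("urn:example:order", "order"): 0,
    ("urn:example:order", "item"): 1,
}

START, END = 0, 1   # token classes: starting tag, closing tag

def tokenize_tag(qname, is_end, prefix_bindings):
    """Map a qualified tag name to (token id, class), or None if unknown.

    prefix_bindings maps prefixes to namespace URIs; the empty string
    stands for the default namespace.
    """
    prefix, _, local = qname.rpartition(":")
    namespace = prefix_bindings.get(prefix)
    token_id = TOKEN_TABLE.get((namespace, local))
    if token_id is None:
        return None
    return (token_id, END if is_end else START)

# Different prefixes bound to the same namespace give the same token id.
print(tokenize_tag("o:item", False, {"o": "urn:example:order"}))   # (1, 0)
print(tokenize_tag("x:item", True, {"x": "urn:example:order"}))    # (1, 1)
```

A schema-directed scanner would go further and match the expected tag names directly while scanning; this sketch only covers the table lookup.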

Experiment Settings

Test environment
  3.0 GHz, 2GB RAM, Linux 2.6.20-1.2320, GCC 4.1.1 with option -O2
  Memory-resident message
  Randomly arranged free-ordered elements

Compared with
  Validation parsers: gSOAP 2.7, Xerces 2.7.0, pTDX flex based parser
  Non-validation parsers: Expat 2.0.1, DFA-based parser

Test Cases

Performance: comparison of validating and non-validating parsers

[Chart; axis annotation: "Better performance"]

Performance: effect of number of elements in <xs:all> of PTDX parser

[Chart; axis annotation: "Better performance"]

Performance: run-time and compile-time memory usage comparison (32 <xs:all> elements)

Conclusion

Free-ordered constraints can be parsed and validated efficiently using a 2-stack PDA

The table-driven permutation phrase grammar parsing technique is time and space optimal

The table-driven approach offers a flexible framework for dealing with schema evolution