Dita Accelerator Xml2008

Accelerating DITA with OmniMarkA Scalable Sol tion for Demanding Prod ction En ironments

Copyright © Stilo International 2008

A Scalable Solution for Demanding Production Environments

[email protected] 2008

Darwin Information Typing Architecture (DITA)An OASIS standard for content Reduce

ToolsDITA Open ToolKit Editors

An OASIS standard for contentReuseRepurpose

Transclusion Topic-LevelMaps Specialization Metadata-Based

Filtering

ToolspCMSes

Established Foundationsand Best Practices

Maps Filtering

HTML

CALS TablesHyTime

Topic Types

XML SGML

DITA PublishingDepends on efficient assembly, interpretation, filtering & formatting of content components

The DITA Open Toolkit FactorThe Open Toolkit has been a big part of DITA's successThe Open Toolkit has been a big part of DITA s success

Open sourceActive development communityThorough implementation of DITAOut-of-the-box support for multiple output formatsModular architectureEasily customized

Components of the Open Toolkit are replaceableU h h i f XSLT d FO tUsers have a choice of XSLT and FO processor components

Many commercial products bundle the Open ToolkitAs a result DITA is closely identified with the Open ToolkitAs a result DITA is closely identified with the Open Toolkit

DITA Editors incorporating Open ToolkitAdobe Framemaker 8Adobe Framemaker 8

Information Mapping Content Mapper

Inmedius DITA Storm rcom

.pdf

Inmedius DITA Storm

In.vision DITA Studio

Justsystems XMetaL Author Enterprise 5 1 DIT

A is

sue)

ools

/STC

_Int

er

Justsystems XMetaL Author Enterprise 5.1

PTC Arbortext 5.3

S RO S ft X / 9 1

m, A

pril

2008

(fro

m a

to z

,

dita

new

s.co

m/to

SyncRO Soft <oXygen/> 9.1

Syntext Serna 3.5

STC

inte

rco

DIT

A To

ols

fB

ob D

oyle

http

://w

ww.

d

XMLmind XML Editor 3.6

Sou

rce:

DITA CMS Integration with Open ToolkitAstoria On Demand

Author-it

Bluestream XDocs

rcom

.pdf

DITA Exchange

DocZone

Inmedius Horizon DIT

A is

sue)

ools

/STC

_Int

er

IXIASOFT DITA CMS Framework

PTC Arbortext Content Manager

m, A

pril

2008

(fro

m a

to z

,

dita

new

s.co

m/to

SiberLogic SiberSafe

Trisoft Infoshare

Vasont

STC

inte

rco

DIT

A To

ols

fB

ob D

oyle

http

://w

ww.

d

Vasont

X-Hive Docato

XyEnterprise Content@ Sou

rce:

Exploiting DITAAs DITA evolves it will be applied to ever more demanding situationsAs DITA evolves, it will be applied to ever more demanding situations

Many industries publish huge volumes of dataAerospace, automotive, oil services, legal publishing

Aspects of DITA can be used for their own sakeAspects of DITA can be used for their own sakeDITA specialization may spin off into its own standardTransclusion can allow reuse even among monolithic documentsMetadata based filtering can provide general purpose effectivity supportMetadata-based filtering can provide general-purpose effectivity supportDITA is a very modular specification

Some of these scenarios will have very demanding requirementsV l "t i "Very large "topics"Large numbers of topics

The DITA Continuum at Stilo

Pure DITA Semi-conductor

FrameMaker source; PDF

Authoring costs; Consistency;

Content Details Drivers

Datasheets publishing Customized Pubs

Legal Procedures

e-Learning; Word and HTML source

Adaptable; Simplified authoring; Integration with existing XML tools

Semi-DITA

Aerospace Standards, 2 projects

Monolithic; SGML, Interleaf, Word source; publish to ATA, S1000D, new web services

Many legacy formats;Multi-target; access to sub-contractors; S1000D support

web services

Aircraft Maint. Manuals

Monolithic; ATA source; E-manuals

Efficient update; Targeting; Costs; Regulatory compliance

Non-DITA

Automotive Monolithic; Multiple sources; SGML

Efficient update; Targeting; Costs;

Software Docs

Topics; SGML; RDBMS storage

Authoring costs; Multi-target; Reuse;

Pushing the BoundariesHow well does the Toolkit cope with these situations?How well does the Toolkit cope with these situations?The Toolkit has a modular architecture

It can be used as a base for partial DITA applications

Some coding tricks are requiredXSLT rules must be implemented carefully to preserve support for specializationspecialization

Most importantly, XSLT is not known as a fast processing technology

Can the Toolkit cope with high volumes of data?

We can test this

Building a DITA Stress TestSample input is the DITA language referencep p g g

200+ topics1468 conref references741 targets referenced by conref1 06 MB1.06 MBAverage file size 5 kB

The DITA language reference was inflated in two waysTopic sizes were increased up to a factor of 100 (to 500KB per file)p p ( p )Number of files was increased up to a factor of 100 (to 20,000 files)

To increase topic sizesThe body of each topic was replicatedA random prefix was added to each word to create unique contentA random prefix was added to each word to create unique contentThe number of links increased proportionately

To increase the number of filesThe whole topic was replicatedp pA random prefix was added to each word, each id, and each idrefThe number of links and link targets increased proportionately

Open Toolkit performance (1)Processing Time vs. Average File Size: from 5 kB to 50 kB

4000 1.1

g g

hoursseconds

4000 1.1

3000

3500

0 7

0.8

0.9

1.0AVG SIZE (kB) Open Toolkit

TIME (s)5 80

3000

3500

0 7

0.8

0.9

1.0

me

1500

2000

2500

0 4

0.5

0.6

0.7DITA Accelerator

Open Toolkit

10 156

21 428

31 8521500

2000

2500

0.4

0.5

0.6

0.7DITA Accelerator

Open Toolkit

cess

ing

Tim

500

1000

0.1

0.2

0.3

0.441 1422

51 2160

500

1000

0.1

0.2

0.3

0.4

Proc

00 10 20 30 40 50 60

0.000 10 20 30 40 50 60

0.0

Average File Size (kB)

Open Toolkit performance (2)Processing Time vs. Average File Size: from 5 kB to 500 kB

40000

10

1140000

10

11

AVG SIZE (kB)

Open Toolkit TIME (s)

DITA Open Toolkit TIME

(hr)

g ghoursseconds

25000

30000

35000

7

8

9

10

25000

30000

35000

7

8

9

10 5 80 0.02

10 156 0.04

21 428 0.12

31 852 0 24me

15000

20000

4

5

6

DITA Accelerator15000

20000

4

5

6

DITA Accelerator

31 852 0.24

41 1422 0.40

51 2160 0.60

103 8160 2.27cess

ing

Tim

5000

10000

1

2

3DITA Accelerator

Open Toolkit

0

5000

10000

0

1

2

3DITA Accelerator

Open Toolkit

103 8160 2.27

206 33660 9.35

309

412

Proc

OUT OFMEMORY0

0 100 200 300 400 500 60000

0 100 200 300 400 500 6000

515


MEMORY

Open Toolkit performance (3)Processing Time vs. Number of Files: from 200 to 2,000g ,

10001000Number of DITA Open

800DITA AcceleratorOpen Toolkit800DITA AcceleratorOpen Toolkit

Number of Files

DITA Open Toolkit TIME

(s)

206 80

me

(s)

400

600

400

600 412 144

824 286

1236 415essi

ng T

im

0

200

0

2001648 557

2060 699

Proc

e

Number of Files

00 500 1000 1500 2000 2500

00 500 1000 1500 2000 2500

Open Toolkit performance (4)Processing Time vs. Number of Files: from 200 to 20,000

30003000

g ,Number of

FilesOpen Toolkit

TIME (s)

2000

2500

2000

2500

me

(s)

206 80

412 144

824 286

1000

1500

1000

1500

essi

ng T

im 1236 415

1648 557

2060 699

0

500DITA AcceleratorOpen Toolkit

0

500DITA AcceleratorOpen ToolkitPr

oce

4120 1429

8240

1236000 5000 10000 15000 20000 25000

00 5000 10000 15000 20000 25000

Number of Files

12360

16480

20600

OUT OFMEMORY

Accelerating DITA for ProductionAn alternative to the Toolkit is requiredAn alternative to the Toolkit is requiredProduction-level quality

No limits on large volumes of contentConsistently high throughput speed as volume increasesy g g p pRobust and maintainable

Rapid development architectureOut-of-the-box rendering for standard DITA schemas/DTDsEasily customized

DITA-awareBuilt-in support for DITA concepts

TransclusionSpecializationFiltering

No programming tricks requiredNo programming tricks required

OmniMark DITA AcceleratorDITA Accelerator implements HTML publishingp p g

Implements all functionality required for language referenceHTML support still requires completionPDF to be implemented in the future

Behavior is modeled on the ToolkitBehavior is modeled on the ToolkitAutomated tests were written to ensure that the output is almost identicalThe output of the DITA Accelerator is nearly identical to the Open Toolkit

index.html from the Open Toolkitpindex.html from the DITA Accelerator

Some small differences remainTable cell borders are inconsistent in some casesSome errors in the DITA toolkit are corrected in the AcceleratorSome errors in the DITA toolkit are corrected in the Accelerator

High performance is achieved with streaming technologyLeverages OmniMark's built-in support for streamingMakes heavy use of referentsy

A DITA-aware library has been implementedProgrammers do not have to employ coding tricks

Gentlemen, start your enginesDITA language referenceDITA language reference

206 files1414 elements with ids (potential link or conref targets)1468 conref references1468 conref references741 targets referenced by conref1.06 MBAverage file size 5 kBAverage file size 5 kB

Initial results are promisingDITA Open Toolkit: 1 minute, 21 secondsDITA Accelerator: 18 secondsSpeed improvement: 4X

What about larger input sets?g p

Comparing DITA Accelerator and Open Toolkit (1)

Processing Time vs. Average File Size: from 5 kB to 50 kB

3500

4000

1.0

1.1

3500

4000

1.0

1.1 AVG SIZE (kB)

Open Toolkit

TIME (s)

DITA Accelerator

TIME (s)

ocess g e s e age e S e o 5 o 50hoursseconds

2500

3000

0.7

0.8

0.9

DITA Accelerator2500

3000

0.7

0.8

0.9

DITA Accelerator

TIME (s) TIME (s)5 80 18

10 156 20

21 428 35ng T

ime

1500

2000

0 3

0.4

0.5

0.6DITA Accelerator

Open Toolkit

1500

2000

0 3

0.4

0.5

0.6DITA Accelerator

Open Toolkit

21 428 35

31 852 41

41 1422 46

51 2160 57Proc

essi

n

0

500

1000

0 0

0.1

0.2

0.3

0

500

1000

0 0

0.1

0.2

0.3

00 10 20 30 40 50 60

0.000 10 20 30 40 50 60

0.0


Comparing DITA Accelerator and Open Toolkit (2)Processing Time vs. Average File Size: from 5 kB to 500 kB

40000

10

1140000

10

1140000

10

11

AVG SIZE (kB)

Open Toolkit

TIME (s)

DITA Accelerator

TIME (s)

g ghoursseconds

25000

30000

35000

7

8

9

10

25000

30000

35000

7

8

9

10

25000

30000

35000

7

8

9

10 5 80 18

10 156 20

21 428 35

31 852 41g Ti

me

15000

20000

25000

4

5

6

7

15000

20000

25000

4

5

6

7

15000

20000

25000

4

5

6

7 31 852 41

41 1422 46

51 2160 57

103 8160 86Proc

essi

ng

5000

10000

1

2

3

4DITA Accelerator

Open Toolkit

5000

10000

1

2

3

4DITA Accelerator

Open Toolkit

5000

10000

1

2

3

4DITA Accelerator

Open Toolkit

103 8160 86

206 33660 150

309 217

412 292

P

OUT OFMEMORY

00 100 200 300 400 500 600

0

1

00 100 200 300 400 500 600

0

1

00 100 200 300 400 500 600

0

1

= 9 hours

= 6 minutes

515 369


MEMORY

Processing Throughput as File

Comparing throughput rate as sizes increaseAVG SIZE

Open Toolkit THROUGHPUT

DITA AcceleratorProcessing Throughput as File

Size IncreasesSIZE (kB)

THROUGHPUT (kB/s)

Accelerator THROUGHPUT

(kB/s)5 13.3 62

250

300

e (k

B/s

)

DITA Accelerator

10 13,6 104

21 9.9 120

31 7.5 155

100

150

200

ghpu

t Rat

e DITA Accelerator

Open Toolkit

41 6.0 184

51 4,9 186

103 2.6 247

0

50

100

Thro

ug 206 1.3 283

309 293

412 290OUT OFMEMORY

0 200 400 600


412 290

515 287MEMORY

Comparing DITA Accelerator and Open Toolkit (3)

Processing Time vs. Number of Files: from 200 to 20,000

2500

3000

2500

3000

2500

3000 Number of Files

Open Toolkit TIME (s)

DITA Accelerator

TIME (s)

g ,

2000

2500

2000

2500

2000

2500206 80 18

412 144 49.5

824 286 100.2 Tim

e (s

)

1000

1500

DITA Accelerator

1000

1500

DITA Accelerator

1000

1500

DITA Accelerator

824 286 100.2

1236 415 142.4

1648 557 193.8

2060 699 247 1Proc

essi

ng

0

500

0 5000 10000 15000 20000 25000

DITA AcceleratorOpen Toolkit

0

500

0 5000 10000 15000 20000 25000


0

500

0 5000 10000 15000 20000 25000


2060 699 247.1

4120 1429 491.2

8240 1055.2

12360 1601 5

P

0 5000 10000 15000 20000 250000 5000 10000 15000 20000 250000 5000 10000 15000 20000 25000 12360 1601.5

16480 2143.4

20600 2788.5

Number of FilesOUT OF

MEMORY

70.0

Throughput rate as number of files increasesNumber Open Toolkit DITA

50.0

60.0

kB/s

)

of Files THROUGHPUT (kB/s)

Accelerator THROUGHPUT

(kB/s)206 13.3 35

30.0

40.0

hput

Rat

e (k

DITA Accelerator

412 14.7 35

824 14.8 42

1236 15.3 47

10.0

20.0Th

roug

hDITA Open Toolkit

1648 15.2 47

2060 15.2 45

4120 14 8 43

0.00 5000 10000 15000 20000 25000

Number of files

4120 14.8 43

8240 42

12360 42

16480 41OUT OF

MEMORY Number of files16480 41

20600 39

O

Interpretation of timing statisticsDITA Open Toolkit is best for light dutyDITA Open Toolkit is best for light duty

Performance degrades rapidly as file sizes increasePerformance is fairly flat as the number of files increaseIn both sets of tests, the toolkit eventually fails when it runs out of memoryA great starting point

OmniMark DITA Accelerator is robust and scales wellDoes not run out of memoryThroughput rate is fairly flat in both types of testingThroughput rate is fairly flat in both types of testing

DITA can play in demanding production environmentsBecause DITA is a standard, technology can be changed without changing the information architecture

Ongoing analysisTests used DITA Toolkit "out-of-the-box"Tests used DITA Toolkit out-of-the-box

Different XSLT processors may improve performance

Forum discussions suggest a workaround for memory exhaustion

Reload XSLT stylesheet on every transformationCurrently requires toolkit modification (may be configurable in 1 5)Currently requires toolkit modification (may be configurable in 1.5)Expect slower performance on smaller topics

Even with improvements, best performance will still be quadratic for increasing file sizes

Th ill b f i t f th

Linear

Quadratic

There will be room for improvement for the foreseeable future

Role of OmniMarkMost of the performance is due to engineeringMost of the performance is due to engineering "behind the scenes"

Native efficiency of OmniMarkStreaming architecture reduces memory requirementsStreaming architecture reduces memory requirementsRecord shelves can be used to implement high speed lookup for DITA processing rules

OmniMark referents simplify support for transclusionOmniMark referents simplify support for transclusionReferents are a streaming mechanism for reordering contentEliminate complex book-keeping

O iM k l i il t d dOmniMark language is easily extendedMacrosModules (functions and data types)

Bonus: SGML support included

UsabilityXSLT supports DITA reluctantlyXSLT supports DITA reluctantlyXSLT rule selection mechanism is not DITA-aware

Two templates that match the element "u":<xsl:template match="*[contains(@class,' hi-d/u ')]"><xsl:template match="*[contains(@class,' topic/ph ')]">

Both have equal priorityProgrammer must use tricks to ensure that the "hi-d/u" takes precedence over the "topic/ph" ruleExtra conditions on the "topic/ph" rule can invert the hierarchy!y

The spaces around the class names are requiredAnd no more than one on each sideXSLT d t f thiXSLT does not enforce thisProgrammer must code carefully to avoid inexplicable behavior

OmniMark extensions provide DITA supportDITA Accelerator augments OmniMark with "DITA rules"DITA Accelerator augments OmniMark with DITA rules

Automatically prioritized according to the specialization hierarchyRule selection is optimized so that performance stays consistent

l dd das more rules are addedDITA rules can be grouped into sets, like OmniMark rulesDITA rules can be supplied as OmniMark modulesLocal DITA rules take precedence over imported rules for the same DITA class

Module supplies support functions that understand DITAModule supplies support functions that understand DITA class specialization

DITA Accelerator specialization supportThe syntax of DITA rules is based on OmniMark element rulesThe syntax of DITA rules is based on OmniMark element rules

Element rules specify element nameselement "u"

output "<u>" || "%c" || "</u>"p

DITA rules specify classes instead of elementsdeclare dita-rule hi-d-u-rule

class "hi-d/u"

Selection by DITA class – understands

specializationoutput "<u>" || dita.process-content || "</u>"

DITA rule for "hi-d/u" will take precedence over "topic/ph"Based on the class specialization in the DTD Processes content,

Currently implemented by macrosAllows access to full OmniMark language

DITA module also provides utility functions

,like "%c" in element

rules

p yDITA class-based queries for current and ancestor elementsMimics the element tests built into OmniMark

ConclusionsThe OmniMark-based DITA Accelerator provides scalabilityy

RobustConsistent throughput as volumes increaseNo catastrophic failures

The OmniMark language can be easily extended to provide a natural DITA g g y pprogramming environment

Programmers can "think in DITA", rather than trying to align a pre-existing programming model with the DITA semantics

Standards are about choice of toolsDITA Toolkit is a good choice for

Learning DITAPrototypingLess demanding production uses

OmniMark DITA AcceleratorDemanding production environments

Most importantly, tool choice must be governed by the unique characteristics of your environment

Technology

Dita Accelerator Xml2008