Approaches to document/report generation

DESCRIPTION

Presents approaches for programmatically creating Office files. Targeted at developers. Presented at http://osdc.com.au/talks/generating-documents-tools-and-techniques

Document Generation: Do’s and Don’ts

Jason Harrop, Plutext Pty Ltd

www.docx4java.org

Where I’m coming from…

• docx4j is an ASLv2 library for (Microsoft) Open XML office documents (docx, pptx, xlsx)

• My company Plutext sponsors that project

• docx4j started in 2007

Since its introduction in 2007, docx4j has become quite popular.

Comparables

tool                   Open XML SDK    docx4j             POI            Aspose
vendor                 Microsoft       Plutext            Apache         Aspose
language               .NET (C# etc)   Java               Java           Java
cost                   free            free               free           expensive
open source            no              yes (ASL v2)       yes (ASL v2)   no
marshalling framework  .NET            JAXB (even MOXy)   XML Beans      JAXB

Choose your hub format; import/export from/to others

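[Diagram: two alternatives. XHTML as the hub, with PDF and docx(?) as outputs; or docx as the hub, with XHTML and PDF(?) as outputs.]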

• If you need to replicate the appearance of existing Office documents, using the Microsoft formats as your “hub” will avoid lots of pain

• If you can, work with the OpenXML formats, not the legacy binary ones, or Word 2003 XML, or Word HTML

• LibreOffice/OpenOffice is a useful tool for conversion, driven by JODConverter

Open XML

• standardised via ECMA 376 and ISO/IEC 29500

• includes XSD
  – can generate strongly typed classes

– Open → Unzip → Alter XML

– Open → Unzip → Unmarshal → Manipulate objects
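The first route (open, unzip, alter XML) can be sketched with nothing but the JDK's zip classes. This is a minimal illustration rather than how docx4j works: the class name `DocxUnzipSketch`, the helper `toyPackage`, and the `${name}` placeholder are all assumptions made for the example, and the toy package holds only `word/document.xml`, so it is not a complete docx that Word would open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxUnzipSketch {

    /** Copies a zip package, applying a plain string replacement to one named part. */
    static byte[] alterPart(byte[] pkg, String partName, String find, String replacement) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(pkg));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = zin.readAllBytes(); // reads up to the end of the current entry
                if (entry.getName().equals(partName)) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    // NOTE: a real tool must XML-escape the replacement value.
                    data = xml.replace(find, replacement).getBytes(StandardCharsets.UTF_8);
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns the text of one part, or null if the package has no such part. */
    static String readPart(byte[] pkg, String partName) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(partName)) {
                    return new String(zin.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical helper: builds a one-part zip standing in for a docx. */
    static byte[] toyPackage(String documentXml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            zout.putNextEntry(new ZipEntry("word/document.xml"));
            zout.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = toyPackage("<w:t>Hello ${name}</w:t>");
        byte[] altered = alterPart(pkg, "word/document.xml", "${name}", "Ada");
        System.out.println(readPart(altered, "word/document.xml"));
    }
}
```

The second route replaces the string manipulation with unmarshalling into typed objects generated from the XSD, which is what a library such as docx4j provides.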

Authoring time vs. generation time

What skills do authors need?

[Diagram labels: data, docx, PDF, HTML]

Approach 1:- Variable replacement.

This approach can also be used for pptx, xlsx

What could be simpler?

Ummm… not so fast.

1. spelling/grammar proofing

2. rsid

3. run formatting
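The three items above share one consequence: Word may split text that the author typed as a single placeholder across several runs, with proofing marks and rsid attributes in between. The sketch below shows why a naive search-and-replace on the raw XML then misses the placeholder even though the visible text still contains it. The class name, the `${name}` syntax, and both XML snippets are invented for illustration; namespace declarations are omitted for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitRunSketch {

    // The placeholder as the author imagines it: one run, one text element.
    static final String CLEAN =
        "<w:p><w:r><w:t>Dear ${name},</w:t></w:r></w:p>";

    // The same paragraph after Word has split it (illustrative, hand-written).
    static final String SPLIT =
        "<w:p><w:r><w:t>Dear ${na</w:t></w:r>"
        + "<w:proofErr w:type=\"spellStart\"/>"
        + "<w:r w:rsidR=\"00A1B2C3\"><w:t>me</w:t></w:r>"
        + "<w:proofErr w:type=\"spellEnd\"/>"
        + "<w:r><w:t>},</w:t></w:r></w:p>";

    private static final Pattern TEXT = Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    /** Naive variable replacement on the raw XML string. */
    static String naiveReplace(String xml, String key, String value) {
        return xml.replace("${" + key + "}", value);
    }

    /** Concatenates the contents of all w:t elements, i.e. what the reader sees. */
    static String visibleText(String xml) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TEXT.matcher(xml);
        while (m.find()) {
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(visibleText(naiveReplace(CLEAN, "name", "Ada")));
        // The split version looks identical on screen, but the replacement is a no-op.
        System.out.println(visibleText(SPLIT));
        System.out.println(naiveReplace(SPLIT, "name", "Ada").equals(SPLIT));
    }
}
```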

Look for a solution which maintains integrity

• Typically a Word Add-In or macro which ensures integrity

• This suggestion applies to approaches #2 and #3 as well

Additional requirement: repeating data (list items, table rows)

• can be done using some convention, for example:
  [#list developers as developer] ${developer.name} [/#list]

• many systems invent their own (eg HotDocs)

• but the FreeMarker or Velocity template language can be used to do this:
  – http://freemarker.sourceforge.net/
  – http://velocity.apache.org/

• for example:
  – XDocReport (FreeMarker or Velocity; open source)

• (this templating approach can also be used with OpenOffice documents)
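To make the repeat convention concrete, here is a toy expander for the `[#list … as …]` syntax shown above. It is deliberately not FreeMarker or Velocity: the class `RepeatSketch` and its `expand` method are hypothetical, it handles only flat (non-nested) lists of string properties, and it works on plain text rather than on docx XML.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatSketch {

    // [#list <collection> as <variable>] body [/#list]
    private static final Pattern LIST = Pattern.compile(
        "\\[#list (\\w+) as (\\w+)\\](.*?)\\[/#list\\]", Pattern.DOTALL);

    /** Repeats each list body once per item, filling ${variable.property} references. */
    static String expand(String template, Map<String, List<Map<String, String>>> model) {
        Matcher m = LIST.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            List<Map<String, String>> items = model.getOrDefault(m.group(1), List.of());
            String variable = m.group(2);
            StringBuilder rows = new StringBuilder();
            for (Map<String, String> item : items) {
                String row = m.group(3);
                for (Map.Entry<String, String> e : item.entrySet()) {
                    row = row.replace("${" + variable + "." + e.getKey() + "}", e.getValue());
                }
                rows.append(row);
            }
            // quoteReplacement stops '$' and '\' in the data being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(rows.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "Team: [#list developers as developer]${developer.name}; [/#list]";
        Map<String, List<Map<String, String>>> model = Map.of(
            "developers", List.of(Map.of("name", "Ann"), Map.of("name", "Bob")));
        System.out.println(expand(template, model));
    }
}
```

In a docx the list markers and the body usually sit in separate paragraphs or table rows, so a real implementation has to repeat whole XML elements rather than substrings.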

Additional requirement: images

• Now it is starting to get a bit trickier, because inserting an image requires:
  – adding an image part to the docx package
  – making a note of its rel id
  – replacing the placeholder with the image XML, including the rel id
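The bookkeeping behind those three steps might look roughly like the sketch below. The class and method names are hypothetical, the `rIdN` numbering is only a common convention, and the drawing markup is reduced to the one `a:blip` element that carries the reference; a real image needs the full `w:drawing` structure around it.

```java
import java.util.List;

final class ImageRelSketch {

    // Relationship type used for image parts in Open XML packages.
    static final String IMAGE_REL_TYPE =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";

    /** Picks an unused id of the form rIdN, given the ids already in the rels part. */
    static String nextRelId(List<String> existingIds) {
        int max = 0;
        for (String id : existingIds) {
            if (id.matches("rId\\d+")) {
                max = Math.max(max, Integer.parseInt(id.substring(3)));
            }
        }
        return "rId" + (max + 1);
    }

    /** Entry to add to word/_rels/document.xml.rels; target is relative to word/. */
    static String relationshipXml(String relId, String target) {
        return "<Relationship Id=\"" + relId + "\" Type=\"" + IMAGE_REL_TYPE
            + "\" Target=\"" + target + "\"/>";
    }

    /** The element inside the drawing XML that points back at the image part. */
    static String blipXml(String relId) {
        return "<a:blip r:embed=\"" + relId + "\"/>";
    }

    public static void main(String[] args) {
        String relId = nextRelId(List.of("rId1", "rId2", "rId7"));
        System.out.println(relationshipXml(relId, "media/image1.png"));
        System.out.println(blipXml(relId));
    }
}
```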

Approach 2:- MERGEFIELD and other fields

• Fields are a long standing feature of Word, included in the Open XML specification

• so lots of documents use this (aka mail merge)

• Various other useful field types eg IF

• A partial solution to the integrity problems of Approach 1

But, two unpleasant XML hybrids (simple and complex)

Simple field:

<w:fldSimple w:instr=" MERGEFIELD name ">
  <w:r> <w:t>«name»</w:t> </w:r>
</w:fldSimple>

Complex field (each part in its own run):

<w:r> <w:fldChar w:fldCharType="begin"/> </w:r>
<w:r> <w:instrText xml:space="preserve">NAME</w:instrText> </w:r>
<w:r> <w:fldChar w:fldCharType="separate"/> </w:r>
<w:r> <w:t>«name»</w:t> </w:r>
<w:r> <w:fldChar w:fldCharType="end"/> </w:r>
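For the simple form, filling in a field amounts to reading the field name from `w:instr` and overwriting the result text. The sketch below does this with the JDK's DOM parser; the class name `SimpleFieldSketch` is hypothetical, and it assumes an unquoted, single-word field name with no switches. The complex form is harder because the instruction and the result are spread over sibling runs between the `begin`, `separate` and `end` markers, and it is not handled here.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SimpleFieldSketch {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses the XML and overwrites the result text of each w:fldSimple MERGEFIELD. */
    static Document fill(String xml, Map<String, String> data) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        NodeList fields = doc.getElementsByTagNameNS(W, "fldSimple");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // e.g. " MERGEFIELD name " -> ["MERGEFIELD", "name"]
            String[] instr = field.getAttributeNS(W, "instr").trim().split("\\s+");
            if (instr.length < 2 || !instr[0].equals("MERGEFIELD") || !data.containsKey(instr[1])) {
                continue;
            }
            NodeList texts = field.getElementsByTagNameNS(W, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                // Put the value in the first text element and blank any others.
                texts.item(j).setTextContent(j == 0 ? data.get(instr[1]) : "");
            }
        }
        return doc;
    }

    /** Concatenates all w:t contents in document order. */
    static String visibleText(Document doc) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = doc.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W + "\">"
            + "<w:r><w:t xml:space=\"preserve\">Dear </w:t></w:r>"
            + "<w:fldSimple w:instr=\" MERGEFIELD name \">"
            + "<w:r><w:t>\u00ABname\u00BB</w:t></w:r></w:fldSimple></w:p>";
        System.out.println(visibleText(fill(xml, Map.of("name", "Ada"))));
    }
}
```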

Approach 3:- Content controls

Much nicer XML, and XPath binding

<w:sdt>
  <w:sdtPr>
    <w:alias w:val="name"/>
    <w:tag w:val="od:xpath=ribxv"/>
    <w:id w:val="13144269"/>
    <w:dataBinding w:xpath="/oda:answers/oda:answer[@id='name_Wt']"/>
  </w:sdtPr>
  <w:sdtContent>
    <w:r> <w:t>«name»</w:t> </w:r>
  </w:sdtContent>
</w:sdt>
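At generation time, the value for a bound content control is whatever the `w:xpath` expression selects from the XML data. A minimal sketch of that lookup with the JDK's XPath API follows. The class name `DataBindingSketch`, the shape of the answers document, and the namespace URI bound to the `oda` prefix are assumptions made for this example; in a real docx the prefix mapping comes from the data binding itself and the data lives in a custom XML part.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DataBindingSketch {

    // Assumed namespace for the "oda" prefix used in the slide's XPath.
    static final String ODA = "http://opendope.org/answers";

    /** Evaluates a binding XPath against the answers XML and returns its string value. */
    static String resolve(String answersXml, String xpathExpr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(answersXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "oda".equals(prefix) ? ODA : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });
            return xpath.evaluate(xpathExpr, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not resolve " + xpathExpr, e);
        }
    }

    public static void main(String[] args) {
        String answers = "<oda:answers xmlns:oda=\"" + ODA + "\">"
            + "<oda:answer id=\"name_Wt\">Ada Lovelace</oda:answer>"
            + "</oda:answers>";
        System.out.println(resolve(answers, "/oda:answers/oda:answer[@id='name_Wt']"));
    }
}
```

The resolved string is what replaces «name» inside `w:sdtContent`; the `w:sdt` wrapper itself stays in place, which is why this approach holds up better than bare placeholders.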

Content controls are nice

• Better solution integrity wise

• Can bind via XPath to arbitrary XML

• handles images

• since Word 2007

• can nest, so repeats/conditions work well
  – unlike Approaches 1 & 2
  – table row friendly

• w:tag supports arbitrary data

… but unique to Open XML. (Could/should a revised ODF support something similar?)

Repeats/conditions

• applies to content inside

• w:dataBinding doesn’t support these

• so create your own semantics

• OpenDoPE is one way

• use w:tag for implementation

• need an editing tool to insert repeats/conditions
  – for OpenDoPE, there are Word Add-Ins designed for technical and non-technical users

• at generation time, need code to support them
  – docx4j does this, and other OpenXML libraries could be extended to support them

• can support complex documents (nested repeats etc)

Choose your poison

• docx4j supports all three approaches
  – but content controls are strongly recommended

• other libraries offer more or less support for each approach

Thanks!
