
[IEEE IPCC 2005. Proceedings. International Professional Communication Conference, 2005. - Limerick, Ireland (July 7, 2005)] IPCC 2005. Proceedings. International Professional Communication


2005 IEEE International Professional Communication Conference Proceedings

0-7803-9028-8/05/$20.00 © 2005 IEEE.

The Darwin Information Typing Architecture (DITA): Applications for Globalization

Nancy Harrison
IBM Rational Software
[email protected]

Abstract

Translation of documentation has traditionally been a major expense in the globalization process, especially if translations are required for multiple languages.

The Darwin Information Typing Architecture (DITA) is an XML-based architecture for creating topic-based and information-typed content. It provides a number of features that, in addition to supporting high-quality information delivery, allow for more efficient and reliable localization of information.

This article provides both an introduction to DITA and a discussion of DITA features that enhance document globalization.

Keywords: DITA, XML, topic-oriented authoring, globalization, translation

Background

Globalization of documentation has traditionally been a major expense in the product globalization process, especially if translations are required for multiple languages.

This is especially true when using traditional desktop publishing (DTP) systems to create product documentation. Traditional DTP systems store documents in binary format, and the text is commingled with formatting information in the document file, making it difficult to use automated translation memory tools to extract and store the translatable text.

Traditional documentation also contains multiple copies of certain text strings – book titles, legal boilerplate, certain cautions or warnings, etc. – across a document set. Without using a single monolithic set of translation memory, and thereby slowing down performance of automated translation tools, these text strings have to be retranslated for every document in which they appear.

The Promise of XML

XML-based authoring systems, with the ability to separate content from presentation, and also to denote non-translatable text with special tags, provide a basis for more accurate and cost-effective document translation.

An original goal in the development of the XML standard, as in the creation of its ancestor SGML, was the separation of semantic content from presentation information within information units to enhance reusability of the content. From a localization perspective, the model was to: [1]
• Separate content from form and reuse content seamlessly in creating translation memory;
• Allow semantic markup and identify different kinds of text for different translation treatment (e.g. which strings should/should not be translated); and
• Allow text-based format, rather than binary DTP formats, to improve translatability.


Figure 1. The DITA Vision: Using XML-based architecture to create content reusable across formats and information collections. [1]

However, the reality is that generic XML provides an SGML with simpler syntax but similar problems. For example:
• Text-based models are easier to translate, but the additional ease is restricted by the text segmentation algorithms used by translation memory software;
• Continued use of the book model leads to large translation units, and when anything in a unit changes, the whole unit has to be run through translation memory to catch small changes; and
• Generic solutions provide no inherent way to differentiate translatable from non-translatable strings.

The DITA Edge

In addition to the XML advantages mentioned above, DITA adds the following translation-supporting features:
• A topic-based structure that promotes creation of content in small, reusable chunks;
• Strongly typed content – content objects are differentiated by class based on well-defined semantics and structure;
• A unified type hierarchy – a set of shared element types provides a base from which all other element types are specialized and to which all other element types can revert; and
• Separation of context from content – collection-specific relationships between content and properties of content are specified separately from the content objects, with references to the content objects.

The Globalization Benefits of DITA

In particular, DITA’s features enable more effective document globalization in the following ways:
• It promotes reuse of content, reducing translation word counts.
• It enables clear identification and separation of translatable and non-translatable content.
• It reduces processing required to remove formatting data – content is separate from formatting.
• It enables translation of UI data to be directly applied to associated documentation (for software products).

DITA Defined

DITA is an extensible topic-based XML architecture originally designed for technical information. It can be specialized to meet the needs of a variety of user communities without sacrificing content exchangeability.

It derives its name from its characteristics:

Darwin - DITA utilizes the principles of inheritance for specialization.

Information Typing - DITA was originally designed for technical information and is based on an information architecture whose core information types are Concepts, Tasks, Reference information, and Maps (used to organize collections of topics).

Architecture - DITA is a model for extension, both of design and of processes.

Core DITA Design Principles

DITA uses the following basic design principles:
• Topic orientation – information is ‘chunked’ into discrete units of information covering a specific subject with a specific intent.
• Topic granularity – Self-contained topics combine with other topics into information sets.
• Strong typing – Document type definitions (DTDs) and schemas guarantee that DITA types follow identical information structures.
• Specialization – Architecture for extending basic types to new types adapted for a particular use within an information set.


• Common base class – Top-level “generic” base types (Topic and Map) provide a “fallback” for all types. [1]

Figure 2. DITA Core Information Types

The DITA core information types are:
• topic – base for all topic types; a unit of information which is meaningful when it stands alone.
• concept – a specialization of topic; provides background information that users need to know.
• task – a specialization of topic; provides procedural details such as step-by-step instructions.
• reference – a specialization of topic; provides quick access to facts.
• map – base (and currently only) map type; organizes collections of topics to create an information set.

DITA Topic Model and Reuse of Content

The DITA topic-based model strongly promotes reuse of shared content. Because the content is designed and written in small, standalone chunks, reused content is easy to identify and segregate. This enables creation of manageable collections of reusable translation memory.

Types of Reuse

We will consider three types of reuse:
• re-purposing: reusing content (topics) in different configurations; different collections share some, but not all, content, and content may be organized differently
• single-sourcing: reusing content for different output formats; the same content is used, with roughly the same organization, to create online help and print manuals
• simple reuse: reusing boilerplate information or small chunks of text that appear widely but should be maintained in one place, for example book titles, terminology/glossary terms, product names, and trademarks/legal notices

Topic Level Reuse – Re-purposing and Single-sourcing

At the topic level, reuse flows directly from the topic-based paradigm. When content is authored as standalone topics:
• Topics can be re-purposed in different contexts; the different collections of topics are organized using DITA maps.
• Topics can be re-purposed by integrating multiple components to create a solution.
• Print and online deliverables can be single-sourced to produce PDF and HTML/XHTML outputs.

Working with Maps

A DITA map applies context to a set of topics. It organizes a set of topics in a hierarchy and sequence. It can be used to:
• create a different organization for different deliverables that share some but not all content
• generate different output formats (PDF, HTML) for the same hierarchy
• reuse the same topic with different collections of topics
• provide multiple views on the same topics: by product, by task

In addition, a map sets properties, including title and metadata, of a topic at a position within the hierarchy. For example, a topic might be considered an ‘advanced’ topic within one hierarchy and a ‘basic’ topic within another. Or a map might define a topic title differently depending on the parent topic within a hierarchy.


Figure 3. Creating hierarchies with DITA Maps

Topics Re-purposed in Deliverables

Figure 4. Re-purposing Content

In Figure 4, deliverables use maps to select topics from a pool.
• Deliverable 1 uses topics 1 and 4
• Deliverable 2 uses topics 2 and 4
• Neither deliverable uses topic 3
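The selection shown in Figure 4 can be sketched in a few lines of Python. This is only an illustration: the file names (topic1.dita and so on) and the two maps are invented, and only the <map> and <topicref href="..."> markup is standard DITA.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic pool; the file names are illustrative, not from the paper.
POOL = {"topic1.dita", "topic2.dita", "topic3.dita", "topic4.dita"}

# Two minimal DITA maps, one per deliverable.
MAP_1 = """<map title="Deliverable 1">
  <topicref href="topic1.dita"/>
  <topicref href="topic4.dita"/>
</map>"""

# The same topic 4 appears here at a different position in the hierarchy.
MAP_2 = """<map title="Deliverable 2">
  <topicref href="topic2.dita">
    <topicref href="topic4.dita"/>
  </topicref>
</map>"""


def topics_in_map(map_xml):
    """Return the topic hrefs a map references, in document order."""
    root = ET.fromstring(map_xml)
    return [ref.get("href") for ref in root.iter("topicref") if ref.get("href")]


deliverable_1 = topics_in_map(MAP_1)
deliverable_2 = topics_in_map(MAP_2)

# Topics shared by both deliverables need to be translated only once.
shared = set(deliverable_1) & set(deliverable_2)
# Topics in the pool that no map selects are simply not shipped.
unused = POOL - set(deliverable_1) - set(deliverable_2)
```

The point of the sketch is that the topics themselves never change; only the maps differ between deliverables.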

Single-sourcing: One Set of Sources, Multiple Outputs

Figure 5. Single-sourcing Content – Book Model

Note: The same book can be output to produce a hard-copy print manual (PDF) or an HTML version. The transforms that create the output formats generate boilerplate text such as ‘Chapter’, ‘Figure’, ‘Table’, etc. This generated text is stored as string variables called parameter entities, which are localized according to locale to automatically generate localized content.
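As a rough illustration of locale-dependent generated text, the sketch below looks up a label by language and falls back to English when a locale has no entry. The string table, its keys, and the function name are assumptions made for this example; they are not the variables used by any actual DITA transform.

```python
# Illustrative string tables; the keys and the table itself are assumptions.
GENERATED_TEXT = {
    "en": {"chapter": "Chapter", "figure": "Figure", "table": "Table"},
    "de": {"chapter": "Kapitel", "figure": "Abbildung", "table": "Tabelle"},
    "fr": {"chapter": "Chapitre", "figure": "Figure", "table": "Tableau"},
}


def generated_label(kind, number, locale, default_language="en"):
    """Return a localized label such as a figure title prefix for a locale."""
    # Reduce a locale such as "de-DE" or "de_DE" to its language code.
    language = locale.replace("_", "-").split("-")[0].lower()
    # Fall back to the default language when the locale has no string table.
    strings = GENERATED_TEXT.get(language, GENERATED_TEXT[default_language])
    return f"{strings[kind]} {number}"


german_figure = generated_label("figure", 3, "de-DE")
fallback_table = generated_label("table", 2, "ja-JP")
```

Because the labels live outside the topics, they are translated once per language rather than once per document.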

Block and Phrase Level Reuse of Content

Content can also be easily reused even if it is in chunks smaller than a topic, as long as the content is marked up appropriately within a topic. The DITA conref attribute allows this reuse for highly granular levels of information, for example:
• Terminology/glossary
• Book titles
• Warning/caution strings
• Boilerplate information used throughout an information set
• UI label strings

When reusing content at the block or phrase level, collect the text objects to be reused in one or more boilerplate text files (Figure 6) that all authors can read. Authors then use the conref attribute to reference the shared text (Figure 7), which shows up in the generated output (Figure 8).

Figure 6. Boilerplate text file


Figure 7. Source file calls boilerplate using the conref attribute

Figure 8. Output file displays referenced text
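The boilerplate, source, and output sequence in Figures 6 to 8 can be approximated with a short Python sketch. It is a simplified stand-in for a real DITA processor: the file name boilerplate.dita, the ids, and the sample text are hypothetical, and only the basic conref form file#topicid/elementid is handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical boilerplate file: one topic holding reusable text objects.
BOILERPLATE_FILES = {
    "boilerplate.dita": """<topic id="shared">
  <title>Shared text</title>
  <body>
    <p><ph id="prodname">Example Product</ph></p>
    <note id="backup" type="caution">Back up your data before you continue.</note>
  </body>
</topic>"""
}

# Source topic that pulls in the shared text by reference.
SOURCE_TOPIC = """<topic id="install">
  <title>Installing</title>
  <body>
    <note conref="boilerplate.dita#shared/backup"/>
    <p>Run the <ph conref="boilerplate.dita#shared/prodname"/> installer.</p>
  </body>
</topic>"""


def resolve_conrefs(topic_xml, files):
    """Fill each element that has a conref with the referenced content."""
    root = ET.fromstring(topic_xml)
    for elem in list(root.iter()):
        target = elem.attrib.pop("conref", None)
        if target is None:
            continue
        filename, _, fragment = target.partition("#")
        topic_id, _, element_id = fragment.partition("/")
        target_root = ET.fromstring(files[filename])
        if target_root.get("id") != topic_id:
            raise ValueError(f"topic {topic_id!r} not found in {filename}")
        match = target_root.find(f".//*[@id='{element_id}']")
        # The referenced element must be of the same type as the referencing one.
        if match is None or match.tag != elem.tag:
            raise ValueError(f"no matching <{elem.tag}> for {target}")
        elem.text = match.text
        elem.extend(list(match))
        # Copy attributes other than id unless the referencing element sets them.
        for name, value in match.attrib.items():
            if name != "id" and name not in elem.attrib:
                elem.set(name, value)
    return root


resolved = resolve_conrefs(SOURCE_TOPIC, BOILERPLATE_FILES)
sentence = "".join(resolved.find("body/p").itertext())
```

Pointing the same resolver at a translated copy of the boilerplate file would localize every referencing topic without touching the topics themselves.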

DITA Specialization and Reuse of Design

DITA not only provides for reuse of content, but also for reuse and extension of information structures. The DITA specialization mechanism solves the following information design problems:
• General topic types are rarely detailed enough: different organizations and industries have specific requirements.
• To meet these requirements, DITA allows for new elements to be derived from existing elements; the new element’s content is a subset of the content allowed in its ancestor.
• Implementing a specialization requires only adding the new elements, without re-architecting the base set.
• Different types of new elements can be added as independent modules; this avoids massive, bloated vocabularies.
• Processing can be specialized to take advantage of specialized elements, but specialized material can also be processed by standard DITA processors, through a fallback mechanism. Processing is done using the closest ancestor topic for which specialized processing exists.
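The fallback rule in the last bullet can be sketched as follows. DITA records an element’s ancestry in its class attribute (for example "- topic/topic task/task "), so a processor can walk from the most specialized type back to the base type. The handler functions and the ‘maintenance’ specialization below are hypothetical, introduced only for illustration.

```python
# Hypothetical renderers; a real processor would use XSLT templates or similar.
def render_topic(title):
    return f"<h1>{title}</h1>"


def render_task(title):
    return f"<h1>Task: {title}</h1>"


# Handlers keyed by the "module/element" tokens found in the class attribute.
HANDLERS = {"topic/topic": render_topic, "task/task": render_task}


def closest_handler(class_attr, handlers):
    """Pick the handler for the most specialized ancestor that has one."""
    tokens = class_attr.split()[1:]  # drop the leading "-" or "+" marker
    # Tokens run from base type to most specialized type, so search backwards.
    for token in reversed(tokens):
        if token in handlers:
            return handlers[token]
    raise LookupError(f"no handler for any ancestor in {class_attr!r}")


# A hypothetical specialization of task has no handler of its own: use task's.
maintenance = closest_handler(
    "- topic/topic task/task maintenance/maintenance ", HANDLERS
)
# concept has no dedicated handler here either: fall back to the base topic.
concept = closest_handler("- topic/topic concept/concept ", HANDLERS)
```

Adding a handler for the specialized type later changes the output for that type only; content and other handlers stay as they are.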

Summary of Reuse in DITA

In summary, DITA allows for reuse along a number of different axes:
• Content can be reused through topics or smaller chunks.
• Designs can be reused through specialization.
• Standard processing tools can be reused, or customized processing tools can be reused only as needed, for content created using specialized DITA topic or map types.

Why Use DITA for Globalization?

DITA addresses the following globalization issues:
• Reuse of content: reduces word counts.
• Semantic tagging: allows for easy identification of translatable and non-translatable text.
• Separation of content from formatting: improves translation memory (TM) usability.
• Alignment of documentation with the user interface (UI): enables localized documentation to better track UI changes for some software.

Globalization Advantages of DITA Reuse

Reuse of data reduces both the time and the cost of translation in a number of ways:
• Storing data in smaller, topic-sized chunks, rather than book chapters, makes it easier to pinpoint changes and to update translations quickly, promoting more efficient use of translation memory.
• Single-sourcing multiple outputs from the same source means you only translate content once, rather than multiple times (once for each output format).
• Individual small chunks are less likely to change once translated, which also makes for more efficient use of translation memory.
• Isolating reused strings enables multiple reuse of translated material.


Note: This puts strict requirements on reused strings (e.g., complete sentences, etc.) to avoid translation problems.

DITA Information Typing and Identification of Translatable/Non-Translatable Data

DITA’s unified type hierarchy provides various methods to enable identification of data as translatable or non-translatable. The most fundamental method is through use of the ‘translate’ attribute, which is available for use with most DITA elements. Typically, all content within a DITA topic is defined as translatable by default. The ‘translate’ attribute is set to ‘no’ only for those elements whose content should not be translated. An example of this could be a programming API value (‘int’ or ‘float’), which might not change even in a localized version of a product.

If certain types of information are never to be translated, DITA’s semantic tagging model comes in handy. Rather than setting the attribute individually on multiple elements, set the value of the attribute to ‘no’ on an entire class of elements, for example, the syntax phrase <synph> or command name <cmdname> element.
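A minimal sketch of how a pre-translation step might honor these settings is shown below. It assumes that ‘translate’ is inherited by descendants unless overridden and that the class-level rule is supplied as a set of element names; the sample topic and the NEVER_TRANSLATE set are illustrative, not taken from any real toolchain.

```python
import xml.etree.ElementTree as ET

# Illustrative topic: one phrase is protected by attribute, one by element type.
TOPIC = """<concept id="types">
  <title>Numeric types</title>
  <conbody>
    <p>Use <ph translate="no">int</ph> for whole numbers.</p>
    <p>Run <cmdname>convert</cmdname> to change the type.</p>
  </conbody>
</concept>"""

# Assumed class-level rule: these element types are never translated.
NEVER_TRANSLATE = {"cmdname", "synph"}


def translatable_segments(elem, inherited=True, never=NEVER_TRANSLATE):
    """Collect text that should be sent for translation, in document order."""
    flag = elem.get("translate")
    if flag is not None:
        translate = flag == "yes"       # an explicit attribute wins
    elif elem.tag in never:
        translate = False               # class-level default
    else:
        translate = inherited           # otherwise inherit from the parent
    segments = []
    if translate and elem.text and elem.text.strip():
        segments.append(elem.text.strip())
    for child in elem:
        segments.extend(translatable_segments(child, translate, never))
        # Text that follows a child belongs to this element, not to the child.
        if translate and child.tail and child.tail.strip():
            segments.append(child.tail.strip())
    return segments


segments = translatable_segments(ET.fromstring(TOPIC))
```

A production tool would more likely keep protected text inline as a locked placeholder so that translators still see the whole sentence; the sketch only shows how the attribute and the element type decide what is exposed.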

For more sophisticated translation management systems, you can use DITA’s filtering/flagging attributes to identify data as translatable only if it applies to a particular set of conditions, for example, only if it applies to a Windows system (by setting the ‘platform’ attribute to ‘Windows’). DITA provides a number of such attributes, available on many elements, that are designed to be used for conditional processing in this way.
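Conditional filtering of this kind might look like the following sketch, which drops elements whose ‘platform’ attribute excludes the requested platform before the content is handed to translation. The sample task and the function name are hypothetical, and the sketch only filters block-level elements, so text that follows a removed element is not preserved.

```python
import xml.etree.ElementTree as ET

# Illustrative task with platform-specific steps.
TASK = """<task id="install">
  <title>Install</title>
  <taskbody>
    <steps>
      <step platform="Windows"><cmd>Run setup.exe.</cmd></step>
      <step platform="Linux"><cmd>Run ./setup.sh.</cmd></step>
      <step><cmd>Restart the application.</cmd></step>
    </steps>
  </taskbody>
</task>"""


def filter_for_platform(topic_xml, platform):
    """Return the topic with elements for other platforms removed."""
    root = ET.fromstring(topic_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            declared = child.get("platform")
            # The attribute may list several space-separated platforms.
            if declared is not None and platform not in declared.split():
                parent.remove(child)
    return root


windows_commands = [
    cmd.text for cmd in filter_for_platform(TASK, "Windows").iter("cmd")
]
linux_commands = [
    cmd.text for cmd in filter_for_platform(TASK, "Linux").iter("cmd")
]
```

Elements without a ‘platform’ attribute apply everywhere and are always kept.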

Finally, for completely customized control of what gets translated and what does not, DITA’s specialization capability allows for creation of content within a ‘specialized’ topic or vocabulary domain that is based on standard DITA types but allows for more granular control. Specialized topics can accommodate particular localization needs, for example, to translate for locale A but not locale B.

Reducing Translation Overhead with Text-based, Presentation-Independent Data

DITA’s XML-based semantic tagging substantially reduces the formatting information in documentation files, and therefore expedites machine translation. The translation management system does not have to handle any formatting constructs; formatting data is typically in one or more associated style sheets, which are localized separately from the content data. All content begins and ends as text.

In addition to improving the efficiency of translation memory, the separation of content from formatting reduces the need for costly and manual corrections to the formatted localized content. Such corrections are often required when localizing content created using a WYSIWYG authoring environment.

Improving Translation Accuracy

DITA provides the following methods for improving the accuracy of your translations:
• Specialization method: use specialization of vocabulary, for example <term> -> <biochemterm>, to provide translation assistance.
• Attribute method: use attribute values (e.g. otherprops=“biochem”) to provide clues to the meaning of terms or phrases.

Keeping Things Consistent: Using DITA to Align Documentation and User Interface

For software documentation, this text-based paradigm also enables the documentation content to reuse translation memory created in translating the UI by reusing the translated UI strings themselves. When documentation is stored in binary formats mingled with formatting, this type of reuse is very difficult. As a result, the associated document content is translated separately, and perhaps differently, from the UI.

Reusing translated UI strings and the translation memory used to create them removes one step in the documentation localization process. Ordinarily, if the UI changes after it has been translated, the changed UI and the updated documentation reflecting the changes are sent for translation separately, often to entirely different translators. After both translations are returned by the translators, they have to be compared to ensure consistency.

On the other hand, if the translated UI text strings are directly reused in the documentation, there is no need to send the documentation out for re-translation, and there is also no need to compare translations for consistency. DITA allows for reuse of the translated UI strings because 1) DITA, like the UI, is text-based, 2) DITA contains markup to identify various types of UI strings unambiguously, and 3) DITA uses its ‘conref’ attribute to reference text strings stored in a single location for reuse across multiple topics.

To reuse translated UI strings and UI translation memory in creating localized documents or online help, do the following [2]:

1. Track UI panel values – panel name and field name values – from the Windows resources or Java properties file.
2. Create an XSLT transform to move the relevant data from the resource/properties files to DITA entity files, one file per UI.
   • Each UI panel name or field name string becomes the content of a particular element (for example, <uicontrol>).
   • Each element contains an ‘id’ attribute whose value is the key name of the field, from the application .h file.
3. Within the documentation (books or online help), refer to UI elements using ‘conref’ attributes pointing to the appropriate ‘id’ attribute values in the DITA entity file.
4. When the UI is modified in the resource or Java properties files, the new UI data is sent for translation.
5. Update the English and localized versions of the documentation, without additional translation cost, by regenerating the DITA entity files from the new versions – English and localized – of the resource/properties files.
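Step 2 can be sketched in Python instead of XSLT, purely for illustration. The sketch reads simple key=value lines in the style of a Java properties file and emits a DITA topic in which each string is a <uicontrol> element whose ‘id’ is the key. The property keys, the sample strings, and the topic layout are assumptions; escapes and line continuations in real properties files are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical English and German UI resources with identical keys.
UI_EN = """# login panel
login.title=Sign In
login.user=User name
"""
UI_DE = """# login panel
login.title=Anmelden
login.user=Benutzername
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def build_ui_topic(topic_id, entries):
    """Build a topic holding one <uicontrol> per UI string, keyed by id."""
    topic = ET.Element("topic", id=topic_id)
    ET.SubElement(topic, "title").text = "UI strings"
    body = ET.SubElement(topic, "body")
    for key, label in entries.items():
        paragraph = ET.SubElement(body, "p")
        control = ET.SubElement(paragraph, "uicontrol", id=key)
        control.text = label
    return topic


english = build_ui_topic("ui", parse_properties(UI_EN))
german = build_ui_topic("ui", parse_properties(UI_DE))
english_xml = ET.tostring(english, encoding="unicode")
```

Because both generated files use the same ids, a reference such as conref="ui.dita#ui/login.user" (a hypothetical path) would pick up the German label whenever the German file is the one in place.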

Figure 9. Automated Synchronization of UI and Documentation Content [2]

Resources

• OASIS: Latest work submitted to the OASIS DITA Technical Committee and plans for future DITA development (http://www.oasis-open.org/committees/dita)
• CoverPages: DITA downloads and a full list of DITA resources (http://xml.coverpages.org/dita.html)
• DITA Open Toolkit on SourceForge: Complete DITA downloads and information about implementing DITA, including a forum for asking questions (http://sourceforge.net/projects/dita-ot/)
• DITA users’ group: Mail list for posting questions and conducting DITA discussions (http://groups.yahoo.com/group/dita-users/)

References

[1] IBM Corporate User Technologies, An Introduction to Darwin Information Typing Architecture. IBM Corporation, 2004.

[2] Ian Larner, Information Development with DITA. Hursley, UK, IBM User Technologies, 2004.

About the Author

Nancy Harrison of IBM has over 20 years’ experience as an information developer, information architect, and globalization specialist. She has worked on localization issues as a project coordinator, trainer, and localization verification tester, and on SGML/XML document architectures since the early 1990s. She was part of the original development team for the DocBook SGML/XML DTD, a de facto standard for open source computer documentation.

She continues to serve on the DocBook Technical Committee within the Organization for the Advancement of Structured Information Standards (OASIS), but her current focus is on DITA, and she also participates as an observer on the OASIS DITA Technical Committee. She can be reached at [email protected].
