Best Practice for Leveraging Legacy Translation Memory when Migrating to DITA


Gershon Joseph

Tech-Tav Documentation Ltd.

Rodolfo Raya

Heartsome Holdings Pte. Ltd.

1. Statement of Problem

2. Terminology

3. Recommended Best Practices

3.1. General Process

3.2. Using DITA with TMX

3.2.1. Advantages of using TMX

3.3. Migrating Legacy TM

4. Notes

1. Statement of Problem

Many organizations have previously translated content that was authored in non-DITA tools (such as Word and FrameMaker). When they migrate their legacy content into the new DITA authoring environment, what should they do about their legacy translation memory? This legacy translation memory (TM) represents a large financial investment that can't simply be thrown away because a new authoring architecture is being adopted.

This article describes best practices that will help organizations to use their legacy TM for future translation projects that are authored in DITA, in order to minimize the expense of translating DITA-based content.

2. Terminology

file:///Users/kevinfar/Desktop/translationBestPractice.html (1 of 6)9/24/2006 2:40:38 PM

Kevin Farwell

Note

This should be "XML." Differentiating DITA makes it seem that if a TM were built from XML source, these same problems would exist, which is not likely to be true.

Kevin Farwell

Note

I don't think the document is organized in a useful order. Sections 1 and 2 are fine, but 3 doesn't discuss legacy TM conversion until the end and 4 should not be in the document at all. Starting with section 4, keep in mind that this document is for people who have DTP files they need to get into DITA and they don't want to sacrifice the investment they've made in translation memories. They would have confronted the problem of the first note in past projects, and it doesn't really have to do with migration; it's just general TM difficulty. The second note is a process issue that many people are confused about. It should be mentioned directly in a discussion about processes dealing with legacy translation.


Before we get into the details, let's define the terms used in the localization industry so that subsequent sections will be better understood.

CAT

Computer Aided Translation, which helps the translator translate the source content. CAT tools usually leverage Translation Memory to match sentences and inline phrases that were previously translated. In addition, some CAT tools use Machine Translation to translate glossary and other company-specific terms (extracted from a terminology database).

CMS

Content Management System, which helps teams manage, store, version and publish their source content.

Matching

The level of accuracy with which CAT tools can match content being translated to the TM. The levels of matching are defined as follows:

Fuzzy matching

The source segment being matched is similar, but not identical to, the source language segment in the TM.

Leveraged matching

The source segment being matched is identical to the matched segment, but the context is not known.

Exact matching

The source segment being matched is identical to the matched segment and comes from exactly the same context.

MT

Machine Translation is a technology that translates content directly from source without human intervention. Used in isolation, MT usually generates an unusable translation. However, when integrated into a CAT tool to translate specific terminology, MT is a useful technology.

TM

Translation Memory is a technology that reuses translations previously stored in the database used by the translation tool. TM preserves the translation output for reuse with subsequent translations.

TMX

Translation Memory eXchange is an industry standard format for exchanging TM between CAT tools.

XLIFF

XML Localisation Interchange File Format is a document format used to exchange translatable content between CAT tools.
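
The three matching levels defined above can be illustrated with a short sketch. It is not how any particular CAT tool works: the function name `classify_match`, the dictionary layout of the TM entries, the context identifiers, and the 0.75 similarity threshold are all assumptions made for illustration, and real tools use their own similarity measures.

```python
from difflib import SequenceMatcher

def classify_match(segment, context, tm_entries, fuzzy_threshold=0.75):
    """Return (level, target) for the best hit in a toy TM.

    tm_entries is a list of dicts with 'source', 'target' and an optional
    'context' key (a hypothetical identifier for where the segment came from).
    """
    leveraged_target = None
    best_ratio, best_target = 0.0, None
    for entry in tm_entries:
        if entry["source"] == segment:
            # Identical text and identical, known context: exact match.
            if entry.get("context") is not None and entry["context"] == context:
                return ("exact", entry["target"])
            # Identical text, context unknown or different: leveraged match.
            if leveraged_target is None:
                leveraged_target = entry["target"]
        else:
            # Similar but not identical text: candidate fuzzy match.
            ratio = SequenceMatcher(None, segment, entry["source"]).ratio()
            if ratio > best_ratio:
                best_ratio, best_target = ratio, entry["target"]
    if leveraged_target is not None:
        return ("leveraged", leveraged_target)
    if best_ratio >= fuzzy_threshold:
        return ("fuzzy", best_target)
    return ("none", None)

tm = [{
    "source": "Click the Save button.",
    "target": "Haga clic en el botón Guardar.",
    "context": "topicA#p1",
}]
```

Calling `classify_match("Click the Save button.", "topicB#p4", tm)` reports a leveraged match, because the text is identical but the context differs from the one stored in the TM.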

3. Recommended Best Practices

3.1. General Process

This section describes the process at a high level, independent of the tools used and the features they support.

If DITA content is stored on a file system or in a CMS that does not support TMX (or in a CMS that uses proprietary techniques to implement reuse and other features that would be lost by exporting to TMX), then proceed as follows:

1. Migrate the legacy content to DITA.

2. Modify the DITA source as necessary for the document release to be translated.

3. When the DITA source has been approved in the source language, do one of the following:

● Transform the DITA content into XLIFF, which can handle segments at the block or sentence level. This is the recommended option if supported by your DITA tool chain.

● If your tool chain does not support export to XLIFF, send a copy of the DITA source to your translation service provider.

Note

If your CMS provides the translator direct access to the DITA XML, then the translator uses the standard techniques provided by the CMS to access and translate the DITA content.

4. The translator uses standard methodology for the translation tool to translate the DITA content, leveraging the legacy TM. The following points should be kept in mind when translating DITA content:

● Provided the structure of the DITA-based content has not changed radically compared to the legacy documents, the CAT software should achieve exact matching on most segments in the TM. As long as the legacy TM aligns with the DITA source at the sentence level, the translation software should be able to achieve leveraged matching for the elements. Most CAT tools break the DITA block elements down into sentence-level segments, which will ensure better matching of the legacy TM.


Kevin Farwell

Note

Is this accurate? XLIFF can be used for that, but it is more for storing translatable text extracted from translation-unfriendly formats.

Kevin Farwell

Note

This section is confusing. The introduction describes a problem of converting DTP-based TMs to be more efficient when used with DITA. This section begins with the premise that the content might be DITA stored in a way that doesn't support TMX. The reader might well ask why they should read this section, since their problem is the DTP-based TM. Then, step one advises migrating the content to DITA. Two issues immediately come to mind: we are back in DTP files and how much legacy content should be converted? I assume we are talking of the source language only, but it isn't clear.

Kevin Farwell

Note

This presents a false choice. Tools that don't support XLIFF can also segment at the paragraph or sentence level. Also, a typical localization customer wouldn't have the tools or time to convert files whether the process will ultimately include XLIFF or not.


● Inline elements may not match at all, or may only be fuzzy matches. If a CAT tool is used to preprocess the TM to prepare it for the DITA-based translation project, then inline elements should yield an exact match. Note that the TM engine should help you recover 70% of the inline tags, which is the main area where matching is prone to fail.

● If conrefs are used as containers for reusable text, then these items may not exactly match (only fuzzy match at best). However, since each of these items needs to be translated only once, and should at least fuzzy match, it should not result in significant translation expense. For best practices on using conref elements in DITA documents that need to be translated, please see XREF TO CONREF BEST PRACTICE.

● When text entities are used as containers for reusable text, it is preferable to use a CAT tool that extracts translatable text from the XML files using an XML parser. The XML parser will insert the content of the text entities into the source text that the translator uses as a reference. This allows the translator to check that the translated segments flow correctly in the target language. If text entities are translated separately from the context where they are used, there may be grammatical inconsistencies in the final text when the translated DITA files are published.

5. After the translated content has been approved, it can be published using the same publishing tools used to publish the source language DITA content.
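
As a rough illustration of the points in step 4, the sketch below parses a small DITA task with an XML parser, lets the parser expand a text entity in place, and breaks each block element into sentence-level segments. The sample topic, the `&product;` entity, the choice of block elements, and the simple sentence-splitting rule are assumptions made for the example. A real CAT tool would also keep inline elements as tag placeholders rather than dropping them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DITA task; the entity is declared in the internal subset
# so that the parser can expand it where it is used.
DITA_SAMPLE = """<!DOCTYPE task [<!ENTITY product "Acme Editor">]>
<task id="save_file">
  <title>Saving a file</title>
  <taskbody>
    <context><p>Use &product; to save your work. Files are saved as <b>XML</b>.</p></context>
    <steps>
      <step><cmd>Click <uicontrol>Save</uicontrol>.</cmd></step>
    </steps>
  </taskbody>
</task>"""

# Elements treated as translatable blocks in this sketch (an assumption).
BLOCK_ELEMENTS = {"title", "p", "cmd"}

def block_texts(root):
    """Yield (tag, text) for each block element, with inline markup flattened."""
    for elem in root.iter():
        if elem.tag in BLOCK_ELEMENTS:
            text = " ".join("".join(elem.itertext()).split())
            if text:
                yield elem.tag, text

def split_sentences(text):
    """Naive sentence segmentation: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

root = ET.fromstring(DITA_SAMPLE)
segments = [(tag, sentence)
            for tag, text in block_texts(root)
            for sentence in split_sentences(text)]

for tag, sentence in segments:
    print(f"{tag}: {sentence}")
```

Because the entity is resolved before segmentation, the translator sees "Use Acme Editor to save your work." as a complete sentence rather than a sentence with a hole in it.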

3.2. Using DITA with TMX

This section describes the recommended approach when using tools that fully support DITA and TMX.

When your DITA content is ready to be translated for the first time, do the following:

1. Export the legacy TM to TMX and tweak the segmentation rules. This step is optional.

This process of creating a better aligned TM should result in an improvement of 10-20% on TM matching. Whether it's worth the effort and expense in doing this process depends on the size of the DITA documents to be translated and the number of target languages. If the number of target languages is small, it may be more economical to retranslate fuzzy matches in a separate file. However, if the word count is high and there are many target languages, tuning the TM will always yield substantial translation savings.

Proceed as follows:

a. Export the legacy TM to a TMX file.

b. Tune the segmentation rules.

The TMX file is an XML file, which can be manipulated to better align the translation segments with the DITA markup.


Kevin Farwell

Note

This isn't consistently true. A paragraph with no in-line markup would be an exact match. The conref would not be a translatable string. Only sub-string conrefs would cause a problem.

Kevin Farwell

Note

This seems to assume sub-string entities, and the solution is muddled. Why not just resolve the entities in the translation source? A reference file won't necessarily help the translator merge the entities and the translated text.

Kevin Farwell

Note

This is presented as a choice to the process described in section 3.1. It isn't really an either/or situation. TMX really shouldn't be highlighted in this document.

Kevin Farwell

Note

Segmentation shouldn't be presented as a big problem and tagging secondary. Segmentation is based on language, not format.

Kevin Farwell

Note

Steps 1a and 1b repeat the content of step 1.


When tuning your legacy TM, take the following into account:

● Unmatched tags: Unmatched tags can result from conditional text marked up in legacy tools (such as FrameMaker), or when block elements contain several sentences that share a common format marker (for example, a paragraph containing several sentences marked as bold; the first sentence contains only an opening bold tag, and the last sentence contains only a closing bold tag).

● Segmentation rules: The segmentation rules used for translating legacy material may not be well suited for XML documents. For example, your legacy Word or FrameMaker-based segmentation rules may include a rule to terminate a segment after a colon, to separate a procedure title from the steps. Since DITA uses markup to indicate where the procedure title ends and the steps begin, this segmentation rule can be discarded.

2. Convert the modified TMX file back into a TM.

This new TM will provide more exact matching against your DITA content than the legacy TM.

● Export the DITA documents to XLIFF.

● Import the XLIFF files into your CAT tool.

● Run the translation against the TM.

You should get exact matching on the plain text and fuzzy matching on the tags. It may be possible to automatically recover 70% of the tags. Depending on the algorithm used to measure quality, this means you will achieve about 80% to 95% matching overall.

● Once the translator has completed the translation, the TM should be exported as a TMX file.

This TMX will correctly tag the DITA block elements as well as correctly segment the sentences, and should therefore be used as the TM for the next DITA-based translation project. For future localization projects, the new TMX should yield exact matching at the segmentation level used for translation (block or sentence).
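
Because a TMX file is plain XML, the unmatched-tag problem described above can be surfaced with a short script before the TM is tuned. The sketch below is a minimal example, assuming that the legacy export marks paired formatting with TMX `bpt` and `ept` elements linked by their `i` attribute. The sample content, the `tuid` values, and the function name `unmatched_tags` are hypothetical, and some tools export unpaired tags differently.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TMX export: unit 1 opens a bold tag that is never closed
# inside the segment; unit 2 is correctly paired.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="xml"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="en"><seg><bpt i="1">&lt;b&gt;</bpt>Save your work often.</seg></tuv>
      <tuv xml:lang="es"><seg><bpt i="1">&lt;b&gt;</bpt>Guarde su trabajo con frecuencia.</seg></tuv>
    </tu>
    <tu tuid="2">
      <tuv xml:lang="en"><seg>Click <bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
      <tuv xml:lang="es"><seg>Haga clic en <bpt i="1">&lt;b&gt;</bpt>Guardar<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def unmatched_tags(tmx_root):
    """Return (tuid, language, unpaired i values) for segments with unpaired tags."""
    problems = []
    for tu in tmx_root.iter("tu"):
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is None:
                continue
            opened = {e.get("i") for e in seg.iter("bpt")}
            closed = {e.get("i") for e in seg.iter("ept")}
            # Any i value present on only one side is an unpaired tag.
            unpaired = sorted(opened ^ closed)
            if unpaired:
                problems.append((tu.get("tuid"), tuv.get(XML_LANG), unpaired))
    return problems

report = unmatched_tags(ET.fromstring(TMX_SAMPLE))
for tuid, lang, unpaired in report:
    print(f"tu {tuid} [{lang}]: unpaired tag ids {unpaired}")
```

A report like this only locates the affected segments; deciding whether to drop the stray tag or restore its partner remains a manual or tool-specific step.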

3.2.1. Advantages of using TMX

Choosing a translation service provider who uses a tool that supports TMX has several advantages. It is possible to migrate your TM between CAT tools that support the industry standard for TM interchange. This is important not only to free you from dependence on a single translation service provider, but also to allow you to fine-tune your segmentation rules to better match the DITA-based XML source documents you'll be sending for translation.

3.3. Migrating Legacy TM


Kevin Farwell

Note

This is a red herring. Any segmentation rules would have to work with punctuation and tagging. If the rules stated that segment should break on a colon and a close tag of any kind, you'd still get the break you want. I'm nervous that this suggests to users they have a complex segmentation effort before them when they really don't.

Kevin Farwell

Note

This is not needed. TMX doesn't require XLIFF. The desire for standards can't discount the fact that 90% of the market uses tools that don't require XLIFF.

Kevin Farwell

Note

These are all true, but they are setting up another false choice. If a localization customer uses a vendor that doesn't use TMX in their workflow, they are not excluded from using more than one vendor or manipulation of the TM content. It also presents the idea that if TMX is used, interchange will be possible. This is not so. TMX is the standard format for interchange, but it assumes a common format between two TMs, which is not always true. Also, its use is very limited by the reality that interchange is by no means a standard activity. Even if it were, although there are thousands of localization vendors, statistically speaking the likely tools any two will use will be SDLX or Trados.


This section recommends how best to migrate legacy TM to DITA, for use when translating DITA content moving forward.

When using a CAT tool that does not support TMX, or supports features that would be lost by using TMX, it is recommended to migrate the legacy TM using the tools provided by the CAT tool. When migrating the TM, keep the following points in mind:

● Ensure the TM aligns with the DITA source at the segment level, to achieve the maximum level of exact matching possible for segments.

● Ensure the TM aligns with the DITA inline elements. When migrating to DITA from a non-XML legacy format, formatting tags should be removed, and inline DITA elements should be added. Note that mapping from the legacy inline formatting to DITA inline elements is not always 1:1. Also, DITA has many inline elements that may have no equivalent in the legacy TM.

● Resolve unmatched tags (see Section 3, “Recommended Best Practices”).

● Remove any segmentation rules that are not relevant in the context of DITA. If required, add segmentation rules that apply to the DITA content. Segmentation rules are discussed in Section 3, “Recommended Best Practices”.
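
The effect of dropping a legacy segmentation rule can be seen in a small sketch. It assumes segmentation rules can be modelled as regular expressions, which is a simplification of what CAT tools do; the pattern names and sample sentences are hypothetical.

```python
import re

# Legacy rule set: break after sentence-ending punctuation and after a colon,
# so that a procedure title is separated from its steps.
LEGACY_BREAKS = r"(?<=[.!?:])\s+"

# DITA rule set: the colon rule is dropped, because markup already separates
# the title from the steps.
DITA_BREAKS = r"(?<=[.!?])\s+"

def segment(text, break_pattern):
    """Split text into segments wherever the break pattern matches."""
    return [s for s in re.split(break_pattern, text.strip()) if s]

legacy_paragraph = "To save a file: Click Save. Enter a name."
dita_cmd = "Enter one of the following: a name or a path."

legacy_segments = segment(legacy_paragraph, LEGACY_BREAKS)
cmd_with_legacy_rules = segment(dita_cmd, LEGACY_BREAKS)
cmd_with_dita_rules = segment(dita_cmd, DITA_BREAKS)

print(legacy_segments)
print(cmd_with_legacy_rules)
print(cmd_with_dita_rules)
```

With the legacy rules, the DITA command is cut in two at the colon; with the colon rule removed, it stays a single segment that can be matched and translated as a whole.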

4. Notes

● In general, although sentence-level segmentation provides better matching, working with segmentation at the block level improves the quality of the translation. For example, you may need three sentences in Spanish to translate two English sentences. The resulting Spanish translation will read better if the paragraph is translated as a block instead of as isolated sentences.

● If the best practices discussed above are used, the first translation of the DITA content can include new content. There is no need to translate the DITA content after migration to DITA before adding new content to the documents.


Kevin Farwell

Note

This seems to say if you don't have TMX, do these things. However, they should be done whether the TM is in TMX or not. I have a question about the tools provided by the CAT tool. What might they be? Would a user assume there is a menu item for converting TMs?

Kevin Farwell

Note

This is an important point and should be higher in the document.
