Converting Unstructured Docs to XML/DITA/ePub Mark Gross Linda Morone

DCL LavaCon Presentation 2011



Slide 1

Converting Unstructured Docs to XML/DITA/ePub

Mark Gross Linda Morone

(Confidential)

Slide 2

Background of Data Conversion Laboratory

  • 30 years of experience providing electronic document conversion services, meeting the needs of technology today and in the future
  • More than 1 billion pages converted to date
  • US-based project management team
  • Global capabilities
  • Transform legacy & future documents
  • From any format to any format
  • Specialize in complex projects
  • Identify redundant data for content reuse
  • Employ a proven automated process
  • Quality Assurance service is standard in all projects
  • Additional services include consulting, composition, transcription & translation

Notes: DCL's core service is data conversion from any format to any format. Additional services include composition, QA, content reuse analysis, project set-up/management, consulting, transcription, and translation. Services can be bundled or sold à la carte. DCL is highly proficient with very complex projects and has a solid reputation within all industries.

Slide 3

Serving All Industries

  • Publishers
  • Government
  • Defense
  • Life sciences
  • Automotive
  • Aerospace
  • Heavy and Industrial Equipment
  • Financial Services
  • Manufacturing
  • Computing
  • Utilities
  • Semiconductors
  • Telecommunications
  • Any industry where documentation & content exists

Slide 4

Serving a Broad Client Base

  • Publishing: Educational, Reference, eBook, Legal, Periodicals, STM
  • Institutions: Libraries, Universities, Hospitals, Museums
  • Industry: Automotive, Aerospace, Transportation, Equipment, Manufacturing, Distribution
  • Technology: Computing, Utilities, Semiconductors, Telecommunications
  • Financial Services: Banking, Credit card services
  • Government/Defense: Civilian agencies, Government agencies, Military
  • Life Sciences: SPL, SRP, Records, Research

Notes: Our client base spans the gamut.

Slide 5

Converting Legacy Data: Is It Worth the Expense?

  • Comply with regulations
  • Match industry standards
  • Meet customer expectations & needs
  • Support internal departments
  • Expand into new markets
  • Multi-purpose content

Notes:

  • Helps maintain data in a more structured, easily reconfigurable format
  • Removing physical copies allows room for more valuable inventory
  • Makes transferring information easier

Slide 6

Legacy Conversion: Fact or Fiction

Client's Perception:

  • Painful Process
  • Complex
  • Expensive
  • Drain on Resources

Reality:

  • Expertise & Planning
  • QC & Automation
  • Guaranteed Results
  • Low Costs

Notes: Moving past the client's perception is imperative to the process. Converting legacy data can be a daunting task, and one that can derail many integration projects if the proper process is not in place early. All DCL projects are managed by US-based project managers who specialize by industry and DTD. Involving these individuals, even if only for the analysis stage, is critical to delivering everything accurately and on time. We also make our PMs available for consultation, as well as for ongoing project management of DIY or outsourced projects.

Preparation & planning are key elements, so early involvement is crucial.

Slide 7

So Which Format Do You Choose?

NLM and Publishing DTDs:

  • Support traditional publishing
  • Flexible open standard
  • Freely available
  • Human-readable format

DITA and Module-Based DTDs:

  • Designed for multi-purposing and content reuse
  • Topic based & modular
  • Supports multiple variants, multiple languages, and context-independent content

ePUB and Rendering-Focused DTDs:

  • Designed for e-readers & mobile devices
  • Freely available
  • Open standard
  • Adaptable to books, documents, manuals, and user guides
  • Support for print publishing requirements is limited

Slide 8

The Story with ePub and Rendering-Focused DTDs

  • ePub is an emerging standard used for most eReaders
  • Mobi is also a large player, proprietary to Amazon Kindle
  • ePub is an evolving standard
  • ePub is supported differently by different eReaders
  • There are no silver bullets
  • eBooks are publications and need care in their production
  • Not just novels; a recent DCL survey shows 75% will be using eBooks for complex materials

Slide 9

Things to Keep in Mind When Converting

  • Smaller screen size
  • Large tables may not fit
  • Not all character sets are supported by all devices
  • MathML not currently supported

Slide 10

Pitfalls of Text Extraction

  • OCR/Text Extraction
  • Special Characters
  • Emphasis
  • Ligatures
  • Hyphens: Soft and Hard
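The ligature and hyphen pitfalls of text extraction can be illustrated with a short Python sketch. It is not DCL's process; the function names and the `vocabulary` parameter are hypothetical, and the approach assumes extracted text arrives as Unicode with line breaks preserved.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"  # invisible "optional break" character left by layout tools


def clean_extracted_text(text):
    """Expand ligatures and drop soft hyphens (hypothetical helper)."""
    # NFKC replaces compatibility characters such as the "fi" ligature
    # (U+FB01) with plain letters. Caution: it also folds other
    # compatibility characters (e.g. superscript digits), which may
    # not be wanted for every document.
    text = unicodedata.normalize("NFKC", text)
    # Soft hyphens only mark possible break points, so they can go.
    return text.replace(SOFT_HYPHEN, "")


def rejoin_hyphenated(text, vocabulary):
    """Rejoin words split across lines, keeping real (hard) hyphens.

    `vocabulary` is an assumed set of known lowercase words; a hyphen at a
    line end is dropped only when the joined word appears in it.
    """
    def repl(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in vocabulary:
            return joined            # line-break hyphen: remove it
        return head + "-" + tail     # treat as a hard hyphen: keep it

    return re.sub(r"(\w+)-\n(\w+)", repl, text)


print(clean_extracted_text("\ufb01le con\u00adversion"))
print(rejoin_hyphenated("data conver-\nsion is well-\nknown", {"conversion"}))
```

A dictionary lookup is only one way to separate soft from hard hyphens; ambiguous cases usually still need human review.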

Slide 11

Handling of Objects Mid-Paragraph

  • Converting exactly per source may lead to problems

Slide 12

Math as Images

  • Changing font size doesn't change images

Slide 13

Unicode Symbols Will Adjust with the Font Size Change

Slide 14

Large Tables

  • Table as text (searchable but cut off)
  • Table as image

Slide 15

When Layout Matters

  • Testing materials
  • Poetry

Slide 16

When Layout Matters (cont'd)

  • Letter
  • Recipe

Slide 17

Some Notes on the Kindle

  • Designed for reading long documents
  • Designed for simplicity
  • Has some features that others don't
  • But is also missing some features that others have
  • Therefore, the conversion needs to be designed differently

Slide 18

Glossary Definitions

[iPad screenshot] [Kindle screenshot]

Slide 19

Use of CSS Float Style

[iPad screenshot] [Kindle screenshot]

Slide 20

Use of Borders

[iPad screenshot] [Kindle screenshot]

Slide 21

Color/Spanning/Large Tables

[iPad screenshot] [Kindle screenshot]

Slide 22

The Story with NLM and Publishing DTDs

  • Well-documented public domain standard
  • Well-tested on a wide variety of materials; designed for complex publishing
  • Originally designed with NIH support for Scientific, Technical, and Medical (STM) publications
  • Extended to be robust for many more uses; widely used in non-STM areas
  • DocBook and PRISM are other standard DTDs, each with its own strengths, all designed for print publications

Slide 23

Choosing the Content to Convert

  • TOC
  • Index
  • Labels
  • Titles
  • List of Tables, Figures, etc.

Which content will be auto-generated?

Slide 24

Capturing Items as Multiple Formats

  • Math as images and MathML
  • Tables as images and XHTML

[Equation example]

Slide 25

Determining Data Elements

Appearance based:

  • Alignment
  • Placement
  • Point size
  • Font

Content based:

  • @
  • www
  • PhD, MD, BA
  • Figure, Illustration, Chart, Scheme

Slide 26

Granularity of Tagging: Front Matter
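The content-based cues from the Determining Data Elements slide (an `@`, a `www`, a degree abbreviation, a caption keyword) often appear in front matter, and they can be sketched as pattern rules. This is a minimal illustration only; the labels and the `guess_elements` function are hypothetical and not part of any DTD or DCL tool.

```python
import re

# Each rule pairs a hypothetical element label with a content-based cue
# taken from the slide. The labels are illustrative only.
RULES = [
    ("email", re.compile(r"\S+@\S+\.\S+")),
    ("url", re.compile(r"\bwww\.\S+", re.IGNORECASE)),
    ("degree", re.compile(r"\b(?:PhD|MD|BA)\b")),
    ("caption", re.compile(r"^(?:Figure|Illustration|Chart|Scheme)\b")),
]


def guess_elements(line):
    """Return the labels of every rule whose cue appears in the line."""
    return [label for label, pattern in RULES if pattern.search(line)]


print(guess_elements("Jane Doe, PhD, jane@example.org"))
print(guess_elements("Figure 3. Conversion workflow"))
```

Cues like these only suggest a tag. Appearance-based signals (alignment, placement, point size, font) would normally be combined with them before an element is assigned.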

Slide 27

Granularity of Tagging: Back Matter

  • Are the references Harvard or numeric?
  • Is the author name last/first or first/last?
  • What is the placement of the year within the citation?
  • Is a comma or period used after the author names?

Slide 28

The Story with DITA and Module-Based DTDs

  • Allows for modularization of your content with Topics, and easy re-use in multiple outputs
  • Pre-packaged & ready-to-use XML (almost)
  • Ready-to-go for techdocs (mostly)
  • Infrastructure included: taxonomy (DTD and schema), printing stylesheets, lots of tools
  • Printable with standard tools
  • Extensible with specializations
  • Further specializations for publishing, testing, and other specialized areas
  • Content-based
  • What do you do when things don't fit?

Slide 29

What Makes DITA Conversions Difficult

  • DITA is a conceptual departure from linear information and is difficult for many to get used to
  • Turns the traditional book into a collection of Topics
  • Topics can be thought of as interchangeable parts
      • to be reassembled in multiple ways
      • to be repurposed for multiple outputs
      • to be reused across multiple products
  • ...but your documents weren't likely to have been designed to do this

"Getting there using DITA is like building with prefabricated modular components that can be quickly assembled into a suitable structure." (Doug Henschen, intelligententerprise.com)

Slide 30

Structuring a Book into Topics in DITA

[Diagram: Concept, Task, and Reference topics drawn from Book 1 through Book 4, held in a DITA Content Management System, and recombined into Book A and Book B]

Slide 31

Further Complications in DITA Conversions

  • There are the usual conversion issues
      • Accuracy of the transferred text
      • Tables
      • Math
      • Special characters
  • There are also the structuring issues
      • Identifying topics
      • Identifying reusable content
  • And the people issues
      • Deciding what needs re-authoring
      • Getting used to a new document paradigm
      • Getting rugged individualists to collaborate more

Slide 32

Overview of Typical DITA Technical Conversion Issues

Architectural constraints of DITA (the square pegs):

  • Multiple steps within a single task topic
  • Task/Procedure authored as a table in the source
  • Presence of untitled tasks/topics in the source
  • References to page numbers (irrelevant cross-references)
  • Having more than two levels of steps
  • How your rendering system will handle XML
      • Figures
      • Steps

Other conversion considerations:

  • Hierarchy in Map Files
  • Metadata in Map Files and Topics
  • Index Terms
  • Conditional Text
  • Glossary Terms
  • Content Terms

Slide 33

Square Peg 1 - Task / Procedure Authored As a Table

Issue: Tasks are done as tables rather than numbered lists. If there's no clear, consistent pattern, then automated conversion keeps the tables as tables, and steps are not tagged as steps.

Example (source table):

  1  Overview
     In general, backup and recovery refers to the various strategies and procedures involved in protecting a system against data loss.
  2  Backup strategy and frequency
     A backup is a copy of key files. Files included in the backup are:
      • A logical backup of the database
      • Key system files
      • Network files
      • Timezone
      • Configuration files
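The "clear consistent pattern" test behind Square Peg 1 can be sketched in Python. This is an illustration under stated assumptions, not DCL's conversion logic: the function names are hypothetical, the only pattern checked is consecutive numbering in the first column, and rows that fail the check stay a table (emitted here as a DITA `simpletable`).

```python
import xml.etree.ElementTree as ET


def rows_look_like_steps(rows):
    """Assumed pattern: first cells are exactly 1, 2, 3, ... in order."""
    numbers = [row[0].strip().rstrip(".") for row in rows]
    return bool(rows) and numbers == [str(i) for i in range(1, len(rows) + 1)]


def convert_rows(rows):
    """Return DITA <steps> when the pattern holds, else a <simpletable>."""
    if rows_look_like_steps(rows):
        steps = ET.Element("steps")
        for _, text in rows:
            step = ET.SubElement(steps, "step")
            ET.SubElement(step, "cmd").text = text.strip()
        return steps
    # No consistent pattern: keep the table as a table.
    table = ET.Element("simpletable")
    for cells in rows:
        strow = ET.SubElement(table, "strow")
        for cell in cells:
            ET.SubElement(strow, "stentry").text = cell.strip()
    return table


rows = [("1", "Loosen the screws."), ("2", "Pull the module straight out.")]
print(ET.tostring(convert_rows(rows), encoding="unicode"))
```

Numbering alone is a weak cue. The example table above has numbered rows that describe concepts rather than actions, so a real conversion would need stronger signals, or human review, before tagging rows as steps.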

Slide 34

Square Peg 2 - Multiple Steps In A Single Task

Issue: Only one set of steps is allowed in a single task topic. When a task has two sets of steps within a topic, such as for two different scenarios, only one of the scenarios can be tagged as a set of steps per the DTD.

Example:

Replacing an XYZ Module

Use this procedure to replace an XYZ module.

Remove XYZ Module

  1. Loosen the screws.
  2. Disengage the ejectors.
  3. Pull the module straight out.

Insert Replacement XYZ Module

  1. Align the module.
  2. Insert the module, pressing in firmly.
  3. Engage the ejectors.
  4. Securely tighten the screws.

Slide 35

Square Peg 3 - Irrelevant Cross-References

Issue: Conversion to DITA may make some source cross-references irrelevant. For example, assuming all empty chapter headings are dropped, a reference to a chapter is no longer valid. In these cases, a tag is inserted to flag these occurrences for clean-up.

  See Chapter 1, Introduction on page 2

Would be tagged as:

  See Chapter 1, Introduction

NOTE: Hard-keyed page numbers are typically dropped from the cross-reference string since they are no longer relevant in DITA.

Slide 36

So Maybe You Shouldn't Bother Converting Your Content?

  • It seems like such a pain to go through all the old luggage in the attic.
  • There is always a need for some rewriting. Few writers have the clairvoyance to author content with the intent that it be converted in the future, so we might as well rewrite it all.
  • My writers aren't very busy right now anyway.
  • It's more fun and seems like less trouble to author anew.

Slide 37

In Reality, Converting Your Content Is Worth the Bother

  • Throwing it out and starting over is an expensive option
      • In DITA, rewriting at $25/page vs. converting at $3-$4/page
      • The hidden costs of redoing index entries, links, and other features you've built in
      • The hidden cost of reviewing, reproofing, and recertifying it all
  • It's usually easier to use what you have as a base, and convert over
      • Needs planning
      • Needs time
  • Planning for a good conversion experience
      • Which content will you need?
      • Which content is worth converting?
      • Which content is suitable for re-use in multiple places?
      • What tools are available?
      • How to specify the conversion to get it right?
      • When do you start all this planning?

Slide 38

Conversion Scope Options

[Chart: cost versus time, with points for options 1, 2, and 3]

Option 1: Convert nothing

  • No conversion costs
  • Delayed ROI

Option 2: Convert everything

  • High conversion costs
  • Reduced ROI

Option 3: Convert frequently used documents

  • Some conversion costs
  • Maximized ROI
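The per-page figures quoted in the deck ($25/page to rewrite, $3 to $4/page to convert) make the scope options easy to compare with simple arithmetic. The sketch below is illustrative only: the page count and the share of frequently used documents are made-up inputs, and hidden costs such as re-indexing and re-certification are not modeled.

```python
REWRITE_PER_PAGE = 25.0        # from the deck: rewriting in DITA
CONVERT_PER_PAGE = (3.0, 4.0)  # from the deck: converting, low and high


def scope_costs(total_pages, frequently_used_share):
    """Compare rewriting with converting all or part of a library.

    Both arguments are hypothetical inputs; the share is a fraction
    between 0 and 1 of pages judged worth converting first.
    """
    low, high = CONVERT_PER_PAGE
    used_pages = total_pages * frequently_used_share
    return {
        "rewrite_everything": total_pages * REWRITE_PER_PAGE,
        "convert_everything": (total_pages * low, total_pages * high),
        "convert_frequently_used": (used_pages * low, used_pages * high),
    }


# Assumed example: a 10,000-page library, a quarter of it in frequent use.
for option, cost in scope_costs(10_000, 0.25).items():
    print(option, cost)
```

With these assumed inputs, rewriting everything comes to $250,000, converting everything to $30,000 to $40,000, and converting only the frequently used quarter to $7,500 to $10,000.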

Slide 39

What to Convert, and in What Order

Categorizing:

  • Active documents in good shape
  • Active documents that need a lot of work
  • Somewhat inactive documents that will likely be retired
  • Archival materials

Prioritizing:

  • Documents that are most used
  • Documents that are customer favorites
  • Documents with the longest product life
  • Start with the most recent documents and go back

Identifying the process:

  • Can be converted as is
  • Can be converted with some work
  • Needs to be rewritten
  • Don't convert; just keep archival copies

Slide 40

Closing Thoughts

  • Know the scope of what you want to accomplish
      • Are you trying to get eBooks quickly, or are you changing your publishing process?
      • Are you moving everything, or will a phased approach work?
      • Will your content work naturally with the selected DTD?
  • Start the conversion process early
      • Shifts the critical path; speeds the process; reduces cleanup
      • Organizing early lets more of the work be done by the content owners, which eases the training and change-acceptance burdens
      • Setting up collaborative teams sets the tone and allows one to divide and conquer
  • Converting legacy data is not trivial, but it is faster, safer, and less expensive than rewriting
  • Each DTD has special considerations to be taken into account
  • Much can be automated, but it needs planning

Slide 41

Questions... & Answers

Data Conversion Laboratory
61-18 190th St., 2nd Floor
Fresh Meadows, NY 11365
Telephone: (718) 357-8700
Fax: (718) 357-8776
Web: http://www.dclab.com

Mark Gross, [email protected]

Linda Morone, Sr. VP of Sales & [email protected]