Classification of Business Documents DITA BusDocs Subcommittee Meeting 21 January 2008 Presentation with Notes from the Focus Group Meeting of 14 Jan 2008


Meeting Summary

• Classification focus group members include Howard Schwartz, Eric Severson, Amber Swope, and Michael Boses. Howard was not able to attend the meeting due to travel

• Michael presented the enclosed PowerPoint as a starting point for the discussion

• Discussion was captured and incorporated into the PowerPoint under the heading “Notes”

• Next steps:

– Eric will work on a preliminary mapping of a limited number of document types that illustrate the mapping

– The focus group will present a summary of what we have discussed to the full subcommittee during the January 21 meeting

Introduction - 1

• The need for a classification system for business documents arises from:

– The desire to identify the specific document set that is being addressed by the subcommittee, as well as the rationale behind that selection

– The ability to further analyze the document set using a refinement of the same characteristics used to classify them

Introduction - 2

• What types of characteristics are important?

– Documents can be classified in many ways. The most common way is a semantic classification based upon the textual content of the document

– The subcommittee approach is different: we want to classify documents based upon their structural characteristics, since it is the structure of business documents that will need to be harmonized with DITA

Potential Structural Characteristics to Consider when Classifying

• Is it a narrative?

• Narrative complexity

• Document length

• Tree depth

• Tree balance

• Table frequency

• Table complexity

• Graphic frequency

• XML vocabularies

• Transclusions

– Notes: Eric feels that repetitive structures will be an important characteristic

– Amber suggests that whether a document references external system data might be important as well

• Howard: Understanding the business purpose might be important as a characteristic

• Eric: It could be interesting but maybe not the driver

• We will capture the information as part of the analysis

• Ann: It’s possibly a different level of classification

• Josef: Translation should not in itself change the structure, but perhaps what we want to look at is documents with variants in them

• Howard: Business documents will have different challenges than technical publications

• Higher level model

– Structures that are not linked to semantics that can then be correlated to documents for different usage

– The end-game is to say where does DITA fit in?

– “Semantic neutral” way of classifying

– Apply the general to specific usages later

– Eric: concept, task, and reference were specializations to begin with. Are they even meaningful for business documents?

– Howard: Informational vs. persuasive? Intent or purposes: does it correlate to structure, does it dictate structure, does it matter for reuse?

First-level Classification

• Notes: while the concept is good, none of us is happy with the terminology. In particular, we need to come up with an alternative for Forms.

• The purpose of this slide is to say that there are business documents that are out-of-scope. This is our first level?

Subject Document

Form-Narrative Scale

• Metric:

– Ratio of total elements to total words

– Notes, Eric: What is a form? How do we keep from excluding documents with structures that we need to address because we called them a “form”? We need something to describe “form” that isn’t based upon its implementation. “XML blurs the distinction between documents and data”

– A: Elements are “structural” in nature. We need to define what type of elements we will use to arrive at the ratio
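As a rough illustration of the metric above, the following sketch computes the ratio of total elements to total words for a small XML document. It is not a subcommittee deliverable: the function name `form_narrative_ratio` and the sample documents are hypothetical, every element is counted (the notes leave open which element types should count), and words are simply whitespace-separated tokens.

```python
import xml.etree.ElementTree as ET


def form_narrative_ratio(xml_text: str) -> float:
    """Return total elements divided by total words (assumed definition)."""
    root = ET.fromstring(xml_text)
    # Count every element, including the root. Which element types
    # should count is still an open question in the notes.
    element_count = sum(1 for _ in root.iter())
    # Join text fragments with a space so that adjacent fields are not
    # fused into one word; words split by inline markup count twice.
    word_count = len(" ".join(root.itertext()).split())
    if word_count == 0:
        return float("inf")  # structure only, no text at all
    return element_count / word_count


form_like = "<form><name>Ann</name><date>2008-01-21</date></form>"
narrative = "<doc><p>The subcommittee wants to classify documents by structure.</p></doc>"

print(form_narrative_ratio(form_like))  # 1.5
print(form_narrative_ratio(narrative))  # 0.25
```

Under these assumptions a higher ratio points toward the form end of the scale and a lower ratio toward the narrative end.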

Most Significant Characteristic?

• Once we have established that it is a narrative document, what is the next most significant characteristic to examine?

– Notes: general agreement with the presentation that it would be the tree depth of the document

• Eric: DITA is trying to apply best practices to writing. Is this a fundamental thing about writing or is it just tech pubs?

• Should there be a more generic task that could be specialized into a tech pubs task?

• Ann: What we have now is a specialization for tech docs, and so it fits. It is possible to start higher, at a more generalized level

• Interesting that paragraphs have a “topic sentence.” The topic sentence may be an important bridge that allows us to introduce the concept of topic-based authoring to the business community

• Business documents are maturing. Are tech docs more mature? Tech docs are most often not read for pleasure and are “random access” information

• Writing for reuse has a significant impact on how content is written. Does it invalidate some of our common business document structures?

• Types of reuse:

– The ability to flow one person’s content into another person’s content and have it hold up contextually

– The ability to have content presented as a result of a query or aggregation and have it hold its integrity as a single unit of information

– Will the message change depending upon how someone arrived at it, either in the original context or by itself?

• All this ties back to the maturity model that will help organizations move to a “best practice” approach to authoring. This will give us something valuable for business and acceptable to the DITA community.

• Now our classification can also correlate to this issue.

The Need to Quantify Hierarchy

• The author of the highly nested document is using structure to communicate semantics.

• Hierarchical Scale

– Ratio of total transitions in hierarchy to total elements

• Notes: General agreement. No specific comments
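The slide does not define what counts as a “transition in hierarchy.” The sketch below assumes one possible reading: walking the elements in document order, a transition occurs whenever two consecutive elements sit at different depths. The function names and sample documents are hypothetical and only illustrate the ratio under that assumption.

```python
import xml.etree.ElementTree as ET


def element_depths(root):
    """Yield the depth of every element in document order (root = 0)."""
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        yield depth
        # Push children in reverse so they pop in document order.
        for child in reversed(list(node)):
            stack.append((child, depth + 1))


def hierarchical_scale(xml_text: str) -> float:
    """Ratio of depth transitions to total elements (assumed definition)."""
    depths = list(element_depths(ET.fromstring(xml_text)))
    # Assumption: a transition is any change of depth between
    # consecutive elements in document order.
    transitions = sum(1 for a, b in zip(depths, depths[1:]) if a != b)
    return transitions / len(depths)


flat = "<doc><p/><p/><p/><p/></doc>"
nested = "<doc><s><s><p/></s></s><p/></doc>"

print(hierarchical_scale(flat))    # 0.2
print(hierarchical_scale(nested))  # 0.8
```

With this reading, a flat run of paragraphs scores low and a deeply nested document scores high, which matches the intent of separating authors who use structure to communicate semantics from those who do not.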

Qualifying Narrative Density

• Narrative Density Scale

– Average paragraph length for paragraphs > 100 characters

– Notes: no specific comments
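A minimal sketch of the narrative density measure follows, assuming that paragraphs are marked up with a `p` element and that length is measured in characters after collapsing whitespace. Both the element name and the function name `narrative_density` are illustrative assumptions, not part of the presentation.

```python
import xml.etree.ElementTree as ET


def narrative_density(xml_text: str, paragraph_tag: str = "p",
                      min_chars: int = 100) -> float:
    """Average length of paragraphs longer than min_chars characters."""
    root = ET.fromstring(xml_text)
    lengths = []
    for para in root.iter(paragraph_tag):
        # Collapse runs of whitespace so that source indentation and
        # line breaks do not inflate the character count.
        text = " ".join("".join(para.itertext()).split())
        if len(text) > min_chars:
            lengths.append(len(text))
    # Report 0.0 when no paragraph passes the threshold.
    return sum(lengths) / len(lengths) if lengths else 0.0


sample = ("<doc><p>" + "a" * 150 + "</p><p>" + "b" * 250 +
          "</p><p>short</p></doc>")
print(narrative_density(sample))  # 200.0
```

The short paragraph is ignored because it does not exceed the 100-character threshold, so only the 150- and 250-character paragraphs contribute to the average.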

Recap of Characteristic Importance

• Is it a Narrative?

• Narrative complexity

• Document length

• Tree depth

• Tree balance

• Table frequency

• Table complexity

• Graphic frequency

• XML vocabularies

• Transclusions

• Notes: Eric- we need to address: repetitive structures (i.e., topics) and constrained structures. What do repetitive structures and constrained structures mean to DITA?

• Michael: the number of paragraphs per section seems important—but what is a section?

Notes: Additional Discussion

• Discussion of an SOP as it relates to repeating structures

– One approach to an SOP is for it to be very verbose, with only 4-5 “structures”

– Another approach is for it to be very terse, with 20 structures that add semantics to the content

• The goal of XML in general, when applied to narrative documents, is to imply more and more of the semantics through the document structure

• “Document linearity with repeating structures” as a structural characteristic provides “random access” to the information in the document

• Repetitive structures appear to be as important a characteristic as the tree depth, if not more. Repetitive structures to a degree indicate whether the document is a reference or something intended to be read end-to-end

• Repetitive structures cause a document to actually be a collection of mini-documents, each of which could stand alone
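The notes do not say how repetitive structures would be measured. One possible sketch, offered only as an assumption, treats an element's “shape” as its tag plus the ordered tags of its children, and reports the share of structured elements whose shape recurs among their siblings. The names `signature` and `repetition_score` and the sample documents are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def signature(element):
    """Tag plus ordered child tags: a shallow, assumed 'shape'."""
    return (element.tag, tuple(child.tag for child in element))


def repetition_score(xml_text: str) -> float:
    """Share of structured elements whose shape recurs among siblings."""
    root = ET.fromstring(xml_text)
    structured = 0
    repeated = 0
    for parent in root.iter():
        # Only children that have children of their own count as
        # structures, so runs of plain paragraphs are not repetition.
        shapes = Counter(signature(c) for c in parent if len(c) > 0)
        structured += sum(shapes.values())
        repeated += sum(n for n in shapes.values() if n > 1)
    return repeated / structured if structured else 0.0


sop = ("<sop>"
       "<step><title>A</title><action>x</action></step>"
       "<step><title>B</title><action>y</action></step>"
       "<step><title>C</title><action>z</action></step>"
       "</sop>")
memo = "<memo><header><to>a</to></header><body><p>x</p></body></memo>"

print(repetition_score(sop))   # 1.0
print(repetition_score(memo))  # 0.0
```

Under this assumption the SOP, built from three identically shaped steps, scores as fully repetitive, while the memo, whose parts each appear once, scores zero.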