Creating Atomic Content For Taxonomy / Content Database Driven Documents

Mark Cashman

[email protected]

atomic

adj. [from Gk. `atomos', indivisible] 1. Indivisible; cannot be split up.

Copyright © 2002 by Mark Cashman

The Documentation Problem

Regardless of what you are documenting, you face numerous challenges…

– A need for consistency – to achieve brand identity, clarity of presentation, or ease of navigation and understanding.

– The requirement to organize and link complex and interrelated content into a coherent whole.

– Controlling and leveraging the duplication of content across documents and projects to attain consistency, minimize effort, and maximize economy.

– Using the same material in a different order or combination for different applications.

– Multiple levels of detail or different perspectives on the same material to suit different audiences, media, or connection bandwidths.

The Many Uses Of Content

Multiple media

Various bandwidths / size constraints

Various levels of detail

Need for translations

Some Terms

Content
– Text, Image, Audio, Video
– Spreadsheet
– Database rows / select statement
– … anything

Metadata
– Data about content
– Name, Description, Author, Production Date, Licensing, Embargo, … anything

Ingest
– The process of putting content into a system and creating its metadata

Repurposing
– Converting content to a different medium or level of detail for a new use

More Terms

Storage Format
– Raw text
– HTML
– XML
– Database text
– PostScript
– … anything

Delivery Medium
– Paper
– Web: high bandwidth, low bandwidth, wireless
– TV

Classification
– Identifying content as a member of a class
– A class has a name, a meaning, and a relationship with other classes

Taxonomy
– A hierarchy or network of classifications
– Can be ad hoc or standardized
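
The terms "classification" and "taxonomy" can be made concrete with a small data structure. The following is only a sketch under assumptions: the names `Category`, `Taxonomy`, `classify`, and `ancestors` are hypothetical and do not come from the presentation. It shows a class with a name, a meaning, and links to other classes; allowing more than one parent is what turns a hierarchy into a network.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A class in the taxonomy: a name, a meaning, and links to related classes."""
    name: str
    meaning: str
    parents: set = field(default_factory=set)  # more than one parent makes a network


@dataclass
class Taxonomy:
    categories: dict = field(default_factory=dict)  # category name -> Category
    members: dict = field(default_factory=dict)     # category name -> set of content ids

    def add_category(self, name, meaning, parents=()):
        self.categories[name] = Category(name, meaning, set(parents))
        self.members.setdefault(name, set())

    def classify(self, content_id, category):
        """Classification: record that a content item is a member of a class."""
        if category not in self.categories:
            raise KeyError(f"unknown category: {category}")
        self.members[category].add(content_id)

    def ancestors(self, name):
        """All classes reachable by following parent links upward."""
        seen = set()
        stack = list(self.categories[name].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.categories[current].parents)
        return seen
```

A category such as "Summit Views" could then list both "Trails" and "Sights" as parents, and a single image could be classified under it without being copied.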

Content Workflow

Find out what you need. Get it from the right source – in-house or external. Bring it into a system so it will be available. Make sure it is right for the application and any legal or regulatory constraints. Index and add metadata to it so it can be found. Release it. Publish it. Be prepared to revise it.

[Workflow diagram with stages labeled: Requirement Definition, Sourcing, Assignment, Purchase, Creation, Editing, Approval, Indexing, Ingest, Integration, Release, Publish, Need for Revision]
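
The workflow described in the paragraph on this slide can be read as an ordered list of stages. The sketch below is a simplification under assumptions: the `Stage` names are condensed from that paragraph, `next_stage` is a hypothetical helper, and the choice to restart a revised item at requirement definition is a guess, since the slide only says to be prepared to revise.

```python
from enum import Enum


class Stage(Enum):
    REQUIREMENT_DEFINITION = 1  # find out what you need
    SOURCING = 2                # get it in-house or externally
    INGEST = 3                  # bring it into a system
    APPROVAL = 4                # check fitness, legal and regulatory constraints
    INDEXING = 5                # index and add metadata
    RELEASE = 6
    PUBLISH = 7


def next_stage(stage, needs_revision=False):
    """Advance one step; a published item needing revision re-enters the flow."""
    if stage is Stage.PUBLISH:
        # Assumption: revision restarts at requirement definition.
        return Stage.REQUIREMENT_DEFINITION if needs_revision else Stage.PUBLISH
    return Stage(stage.value + 1)
```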

Implications for Content Management

Centralized storage and management of content to facilitate search and reuse.

Computer-based storage, indexing and metadata to provide rich search terms and rapid, context-sensitive recombination.

Delivery-media-independent content storage format.

Ability to apply translation and media specialization to content as needed.

Consistent and automatic updating of all accessible presentations of content when changes are made (may require republishing of print, broadcast and presentations).

Organization separated from content.

Implications for Content

Finest possible granularity.

Ability to be classified by hand or automatically into multiple rich categories.

Independent of other, related content items.

Multiple levels of detail for the same content item.

Multiple representations for the same content item with metadata allowing appropriate selection of representation based on destination media and other constraints.
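
Selecting a representation from metadata might look like the following. This is a minimal sketch, assuming each representation records a delivery medium and an approximate size; `Representation` and `select_representation` are illustrative names, and "largest that fits" is one possible policy rather than the one used by the author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    medium: str    # e.g. "web" or "paper" (illustrative values)
    size_kb: int   # approximate transfer size
    location: str  # where the stored rendition lives


def select_representation(representations, medium, max_size_kb):
    """Return the largest representation for `medium` that fits the size budget."""
    candidates = [r for r in representations
                  if r.medium == medium and r.size_kb <= max_size_kb]
    if not candidates:
        return None  # no suitable rendition; the caller decides what to do
    return max(candidates, key=lambda r: r.size_kb)
```

A low bandwidth page would call this with a small `max_size_kb` and receive the thumbnail, while a high bandwidth page would receive the full-size image of the same content item.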

Types of content organization

Sequential ordering (and reverse).

Context sensitive ordering.

Hierarchical linking.

Network linking.

Heterogeneous organization (several of the above).

Sequencing Content Navigation

Manual, fixed (books, papers).

Manual, embedded (HTML, indexes).

Manual, envelope (Search engines with manual classification, taxonomy driven websites with manual classification).

Automatic, embedded (Database driven HTML with keyword driven links).

Automatic, envelope (Search engines with keyword classification, taxonomy driven websites with keyword classification).
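
Keyword classification, as used by the automatic approaches above, can be sketched in a few lines. The function name and the category-to-keyword mapping are assumptions for illustration; a production classifier would need stemming, phrases, and review by a person. This version matches single words only.

```python
import re


def classify_by_keywords(text, category_keywords):
    """Return the sorted names of categories with at least one keyword in the text."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(name for name, keywords in category_keywords.items()
                  if words & {k.lower() for k in keywords})
```

With a mapping such as `{"Waterfalls": ["waterfall", "cascade"], "Summits": ["summit", "peak"]}`, a text item that mentions both a cascade and a summit is placed in both categories.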

Repositories for Content

Source files in directory structure with metadata database.

Content database – content stored in or referenced by database.

Content and taxonomy database – content and classification stored in or referenced by database.
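
A content and taxonomy database could be laid out roughly as follows. This is a sketch using Python's built-in `sqlite3` module; the table and column names are assumptions, not the schema of any system described in this presentation. Text content is stored in the `body` column, while other content is referenced through `external`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE content (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT,
    body        TEXT,            -- text content stored in the database
    external    TEXT             -- or a reference to a file stored outside it
);
CREATE TABLE category (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE classification (    -- many-to-many: one item, many categories
    content_id  INTEGER NOT NULL REFERENCES content(id),
    category_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (content_id, category_id)
);
"""


def open_repository(path=":memory:"):
    """Create the tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def items_in_category(conn, category_name):
    """Names of all content items classified under the given category."""
    rows = conn.execute(
        """SELECT content.name FROM content
           JOIN classification ON classification.content_id = content.id
           JOIN category ON category.id = classification.category_id
           WHERE category.name = ? ORDER BY content.name""",
        (category_name,),
    )
    return [name for (name,) in rows]
```

Because classification lives in its own table, an item can be added to or removed from a category without touching the content itself.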

An Example

New England Trail Review.com
– Network taxonomy driven
– Content / taxonomy database, text content internal, other content external
– Multiple representations for content
– Display templates control page content
– Look and feel table controls page colors and common text / images

Principles are universal
– Documentation of trails is sequential.
– Documentation of special sights is non-sequential.
– Common documentation elements for specific trails.
– Related material on a class and a content item level.

A Classified Item of Image Content

[Screenshot callouts: Metadata (Name, Description, Type); Text Detail; Abstract Item; Content Item; Whole Content Classification; Sub-element Classification]

A Classified Item of Text Content

[Screenshot callouts: Metadata; Text Detail Is The Content; Multiple Uses]

Leveled Categories 1

Leveled Categories 2

Presentation Top Level

A Presentation Template

Driven by classification of content.

Flexible in accepting multiple items where appropriate.
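
One way to picture a classification-driven template is as a list of slots, each naming the category that fills it. The sketch below is an assumption about how such a template might work, not the implementation behind the slides; `Slot` and `render` are hypothetical names, and the plain-text output stands in for a real page layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    category: str   # classification that fills this slot
    multiple: bool  # True if the slot accepts every matching item


def render(slots, items_by_category):
    """Fill each slot from the items classified under its category."""
    lines = []
    for slot in slots:
        items = items_by_category.get(slot.category, [])
        chosen = items if slot.multiple else items[:1]
        lines.append(f"[{slot.category}]")
        lines.extend(f"  {item}" for item in chosen)
    return "\n".join(lines)
```

The template never names a specific content item, so reclassifying an item changes the page without editing the template.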

Alternate Views Of Content 1

Full size images, paged, for high bandwidth connections

All images have description as the ALT text, for use by screen readers
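
Reusing the metadata description as ALT text can be done at the point where the page is generated. The helper below is a sketch; `img_tag` is a hypothetical name, and it relies only on `html.escape` from the Python standard library to keep quotes and ampersands in the description from breaking the markup.

```python
from html import escape


def img_tag(src, description):
    """Build an <img> element whose ALT text is the item's metadata description."""
    return f'<img src="{escape(src, quote=True)}" alt="{escape(description, quote=True)}">'
```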

Alternate Views Of Content 2

Small images, paged, for lower bandwidth connections

Entry point to lowest bandwidth, one full size image per page view

All images have description as the ALT text, for use by screen readers

Alternate Views Of Content 3

Single image per page, for lowest bandwidth

All images have description as the ALT text, for use by screen readers

What’s Wrong With This Content?

Sequencing words such as “after”, “descending”, and “soon” prevent reversing the trail.

References to other steps on the trail make the item confusing if it is also classified in an additional category.

Multiple Classifications For The Same Item 1

Location independent

Sequence independent

Relies on the reader’s ability to order the content by its sequence on the page.

Multiple Classifications For The Same Item 2

Works in the alternate context

Plays well when sequence doesn’t matter or when it does.

Evolution Of A Taxonomy

Taxonomies will change over time.

Content must be adaptable to new classifications.

Destroying a category can be dangerous if outsiders can bookmark based on a category. Think about your audience.

A purchased or standard taxonomy generates a tension between stability and flexibility.

Ad hoc categories will appear. They may or may not be justified, and can corrupt the taxonomy.

Library science and scientific taxonomists can and should help establish and evolve your content taxonomy.
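
The bookmark problem noted above suggests retiring a category rather than destroying it. The following is one possible approach, sketched under assumptions: `CategoryRegistry`, `retire`, and `resolve` are hypothetical names, and forwarding to a replacement category is a suggestion rather than something the presentation prescribes.

```python
class CategoryRegistry:
    """Tracks live categories and forwards retired ones to a replacement."""

    def __init__(self, names):
        self.live = set(names)
        self.redirects = {}  # retired name -> replacement name

    def retire(self, old, replacement):
        if replacement not in self.live or old == replacement:
            raise ValueError("replacement must be a different live category")
        self.live.discard(old)
        self.redirects[old] = replacement

    def resolve(self, name):
        """Follow redirects so an old bookmark still lands on a live category."""
        seen = set()
        while name in self.redirects and name not in seen:
            seen.add(name)
            name = self.redirects[name]
        return name if name in self.live else None
```

A bookmark to a category that was retired twice over still resolves, because `resolve` follows the chain of redirects until it reaches a live category.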

Atomic Content Databases For Knowledge Management

Classified content can be the core of a knowledge management system.

Atomic text fragments are low overhead for SMEs (Subject Matter Experts) to produce.

Atomic text and images can be extracted from existing documents through automatic processes, but may require SME intervention to atomize fully.

Newsgroups and email can be mined for atomic text.

Artificial intelligence classifiers may be able to generate initial taxonomies for large bodies of atomic text, but a human agent must also be involved.

Text Content Challenges

Must be independent of sequencing.

Must be independent of context to allow multiple classifications.

Must contain terms that justify each selected classification.

Must be short to maximize reuse.

Text Content Guidelines

Never say “soon”, “later”, “before”, etc.

Use orientation-independent directions – “north” and “south” rather than “left” or “right”, “south-sloping” rather than “uphill” or “descending”.

Let the name tell about the primary context (such as where the image was photographed).

Use a description that would make sense to someone who might not be able to see well, so they can still use the images.

Think about every aspect of what the content refers to and write it with a view toward potential future classifications.

Keep it between one and three paragraphs in length.
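
Some of these guidelines are mechanical enough to check automatically before ingest. The sketch below is a naive word-list check, not a complete linter: the word sets are taken from the examples on these slides, `check_atomic_text` is a hypothetical name, and a word such as "right" will produce false positives (for example in "right of way") that a person must review.

```python
import re

# Words taken from the slides; the lists are illustrative, not complete.
SEQUENCE_WORDS = {"soon", "later", "before", "after"}
ORIENTATION_WORDS = {"left", "right", "uphill", "descending"}


def check_atomic_text(text):
    """Return a list of guideline problems found in a text content item."""
    problems = []
    words = set(re.findall(r"[a-z]+", text.lower()))
    for word in sorted(words & SEQUENCE_WORDS):
        problems.append(f"sequencing word: {word}")
    for word in sorted(words & ORIENTATION_WORDS):
        problems.append(f"orientation-dependent word: {word}")
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not 1 <= len(paragraphs) <= 3:
        problems.append(f"paragraph count {len(paragraphs)} is outside 1-3")
    return problems
```

An item that passes returns an empty list; anything else is a prompt for the writer to rephrase, not an automatic rejection.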

Purchased Content Challenges

Third party content producers do not think in terms of atomic content. Their content will be large grained.

Licensing restrictions may prevent “busting up” content for reuse.

Content may have internal links or links to other large grained content it depends on.

Content will not be written in a way that makes it easy to extract pieces and use them separately.

Purchased Content Guidelines

Select content based on its “fine-grainedness”.

Negotiate contracts to allow repurposing and splitting up of content.

Avoid heavily interlinked content, or accept the internal cost of turning links into classifications.

Negotiate contracts which allow modification of the content for atomic reuse.

Enabling Technologies

Digital Asset Management Systems

Search Engines

Predefined and standard taxonomies

XML / XSL

Database Management Systems

Workflow Systems

Web Presentation Systems

Integration is required and expertise is not widespread.

Summing Up

Atomic content facilitates a wider range of reuse and repurposing than large grained content.

Context and delivery medium independence is important for maximal reuse and repurposing.

Databases for content and metadata are critical to the reuse of large bodies of atomic content.

Taxonomies can be created or purchased, and are also critical to reuse.

Ingest is the most expensive part of dealing with atomic content.

Training writers and breaking old habits are the hardest parts of creating atomic copy.

A variety of technologies exist to aid in supporting atomic content and taxonomy driven communication efforts, but integration is required.