Jabin White

Director of Strategic Content

Wolters Kluwer Health – Professional & Education

SSP 31st Annual Meeting

Baltimore, MD | May 28, 2009

What is metadata, and why should publishers care?

Impact on publishers – how metadata impacts processes

Case Studies – This isn’t your Daddy’s publishing business

Final Thoughts, Recommendations

Reading most definitions of metadata and related standards is like trying to resolve disputes with my kids

As Ed said, metadata is “data about data”

• But what does that mean?

Its use may be increasing, but metadata is NOT new

In the move from print publishing to digital, metadata is a powerful tool to help publishers get content in the right place, in the right format, and known to the right systems and people, at the right time

Print books were easy

• Everyone knew what they were

• You could really only use them one way

• They had a beginning, an end, a physical presence, and a set price (mostly)

Today, computers are often communicating with one another as much as they are with users (people)

Metadata becomes critical in:

• B2B relationships

• Enhancing B2C relationships

• B2-_________ relationships

The quality of the metadata gives publishers a more powerful voice in what happens to their content

For example:

• A digital asset (an image)

• What file format is it?

• How big is the image?

• Who took the picture?

• Who owns the picture?

• Can you use it on your web site? If you do, what credit do you have to give to the owner?

• What date was it created?

• Is it part of a collection?

• Is it related to another piece of content?
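The questions above map naturally onto a structured record. A minimal sketch in Python follows; the field names, the `web_credit` helper, and the sample values are all hypothetical illustrations, not part of any standard or of the original talk.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ImageMetadata:
    """Hypothetical record; field names are illustrative, not a standard."""
    file_format: str                 # What file format is it?
    width_px: int                    # How big is the image?
    height_px: int
    photographer: str                # Who took the picture?
    rights_holder: str               # Who owns the picture?
    web_use_allowed: bool            # Can you use it on your web site?
    credit_line: Optional[str]       # What credit do you have to give?
    created: date                    # What date was it created?
    collection: Optional[str] = None                 # Is it part of a collection?
    related_content_ids: List[str] = field(default_factory=list)


def web_credit(meta: ImageMetadata) -> Optional[str]:
    """Return the credit to show on a web page, or None if web use is not allowed."""
    if not meta.web_use_allowed:
        return None
    # Fall back to naming the rights holder when no explicit credit line exists.
    return meta.credit_line or f"Image: {meta.rights_holder}"


# Made-up sample asset.
photo = ImageMetadata(
    file_format="JPEG", width_px=1600, height_px=1200,
    photographer="A. Photographer", rights_holder="Example Press",
    web_use_allowed=True, credit_line=None, created=date(2009, 5, 1),
    collection="Anatomy Atlas", related_content_ids=["chapter-12"],
)
print(web_credit(photo))
```

A downstream system can decide whether and how to display the image from the record alone, without a person in the loop.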

If a publisher’s goal is to disseminate content to the widest possible audience, metadata is critical

Again, in books you had one use model

Metadata allows publishers to have diverse relationships with content consumers and other information providers

• Customers (duh)

• Aggregators

• The Open Web (not Google, but other search engines)

But don’t try to “game” the search engines with adult keywords; that’s just wrong

There have been lawsuits over use of meta keywords, including Playboy suing two adult web sites

• Technology partners/developers

• Systems wherein content is a “value add”

• Multiple output formats

HTML Metadata

• <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

• <meta name="verify-v1" content="kBoFGUuwppiWVWGx4Ypzkw1Cs1GgMYEMMbfNr7FY65w=" />

• <meta name="description" content="International publisher of professional health information for physicians, nurses, specialized clinicians & students. Medical & nursing charts, journals, and pdasoftware.">

• <meta name="keywords" content="springhouse, medical book, nursing journal, medical pda software, lippincott medical reference, lww, lippincott, lww com, medical publisher">

• <link rel="stylesheet" href="/css/style.css" type="text/css">

For people

For search engines
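To see how a machine reads tags like these, here is a minimal sketch using Python's standard `html.parser` module. The `MetaTagCollector` class name is made up for illustration, and the sample page is a shortened version of the tags shown above.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags into a dict keyed by their name or http-equiv value."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        if key and "content" in attributes:
            self.meta[key.lower()] = attributes["content"]


# Shortened sample page (assumption: trimmed from the tags shown above).
page = '''<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="description" content="International publisher of professional health information.">
<meta name="keywords" content="medical book, nursing journal, medical publisher">
</head>'''

collector = MetaTagCollector()
collector.feed(page)

# Split the comma-separated keywords into a clean list.
keywords = [k.strip() for k in collector.meta["keywords"].split(",")]
print(collector.meta["description"])
print(keywords)
```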

Classifying Metadata

• ISBN (I told you this wasn’t new)

• Dewey Decimal System

• Books in Print/CIP/Library of Congress data

• MARC records

• DOI (Digital Object Identifier)

Descriptive Metadata (sorry, my examples are from STM)

• ICD-9 and ICD-10 Codes

• MeSH

• SNOMED-CT

• NANDA, NIC, NOC for Nursing

• NDC, HCPCS for drugs


Using controlled vocabularies, extra power can be added to content via semantic tagging to drive:

• More precise searching

• Contextually-based connections

• Lowering of “two terms meaning the same thing” syndrome (hypertension vs. high blood pressure; heart attack vs. myocardial infarction)

• Filling in of content gaps
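The “two terms meaning the same thing” problem can be sketched in a few lines of Python. The tiny vocabulary, the document titles, and the naive substring matching below are assumptions for illustration only; a production system would draw on a real vocabulary such as MeSH and a proper indexing tool.

```python
# Hypothetical mini-vocabulary: preferred term -> accepted synonyms.
PREFERRED_TERMS = {
    "hypertension": ["high blood pressure"],
    "myocardial infarction": ["heart attack"],
}

# Invert it into a lookup from any variant to its preferred term.
LOOKUP = {}
for preferred, synonyms in PREFERRED_TERMS.items():
    LOOKUP[preferred] = preferred
    for synonym in synonyms:
        LOOKUP[synonym] = preferred


def tag_text(text):
    """Return the sorted preferred terms whose variants appear in the text."""
    lowered = text.lower()
    return sorted({pref for variant, pref in LOOKUP.items() if variant in lowered})


# Made-up documents, tagged once at indexing time.
documents = {
    "doc-1": "Managing high blood pressure in older adults",
    "doc-2": "Hypertension and risk of heart attack",
    "doc-3": "Recovery after myocardial infarction",
}
index = {doc_id: tag_text(text) for doc_id, text in documents.items()}


def search(term):
    """Find documents by concept, whichever variant the searcher typed."""
    preferred = LOOKUP.get(term.lower())
    if preferred is None:
        return []
    return sorted(doc_id for doc_id, tags in index.items() if preferred in tags)


print(search("heart attack"))
print(search("hypertension"))
```

A search for either variant returns the same documents, because both the content and the query are normalized to the preferred term.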

How Metadata Changes Processes

Impact on publishers depends on answers to questions in previous section

• i.e., what am I going to get in return for investing in metadata, and is it worth it?

• More and more, this is not an “if” proposition, it’s “how much”

Publishers who buy in have two basic choices on approach:

Requires deeper commitment, but has bigger potential upside

• Positive impact on product creation and development

Requires thinking about tools, workflows, and enterprise-level systems to allow for creation and maintenance of metadata

Combination of good metadata in the workflow and creativity in product development team can pay big benefits

Allows participation of authors (or subject matter experts in lieu of) at the beginning of the workflow

Requires lesser commitment, but potentially fewer rewards

Can be done with zero impact on current systems

Has benefit of content being in “final form” (whatever that means anymore) when intelligence is added in metadata

Can keep SMEs as a separate offshoot of the workflow – easily outsourced

Can replace all of the above with software solutions (Darrell and Chris will talk about that)

Chris, Darrell and I do NOT disagree

There are justifications that can be made for tagging or entity extraction approaches (or both)

Just as there is no “one size fits all” metadata, there is no ONE solution

But if you must pick one, I’m right

Active vs. Passive Metadata

• Active metadata

Publisher intentionally associates markup with certain pieces of content

Often using controlled vocabulary

Includes semantic indexing

Can also be machine-based, using scripts, etc.

• Passive metadata

Metadata created based on use of content

Inheritance of properties from parent objects
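Inheritance from parent objects can be sketched as follows. The `ContentNode` class and the book/chapter/figure hierarchy are hypothetical; the point is only that a child picks up its parent's properties unless it sets its own.

```python
class ContentNode:
    """Hypothetical content object that inherits metadata from its parent."""

    def __init__(self, name, parent=None, **metadata):
        self.name = name
        self.parent = parent
        self.own = dict(metadata)  # metadata set directly on this object

    def effective_metadata(self):
        """Parent values first, then this object's own values override them."""
        inherited = self.parent.effective_metadata() if self.parent else {}
        return {**inherited, **self.own}


# Made-up hierarchy: book -> chapter -> figure.
book = ContentNode("Nursing Handbook", publisher="Example Press",
                   language="en", subject="nursing")
chapter = ContentNode("Chapter 3", parent=book, subject="cardiac care")
figure = ContentNode("Figure 3.2", parent=chapter, format="image/png")

print(figure.effective_metadata())
```

Nobody tagged the figure with a publisher or language; it picked them up from the book, while the chapter's more specific subject replaced the book's.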

The use of active metadata usually means an impact on support tools

• Re-think authoring tools to allow for capture of metadata by authors

This can be outsourced to external SMEs – help is available

• Re-think content management to allow for preservation/management of metadata

How deep you go depends on how big the payoff

• Good semantic indexing can drive new features and functionality, but must be used appropriately

If you decide to add active metadata, a controlled vocabulary just became your new best friend

Ontology – an explicit specification of a conceptualization

• In English: a controlled vocabulary used to describe a group of topics

Taxonomy – same as ontology, but with hierarchy implied

Caveat – These two terms are so misused, their definitions no longer matter (think Content Management circa 2000)

PRISM (Publishing Requirements for Industry Standard Metadata) – an XML metadata vocabulary for handling content – started out in magazines and journals, but has added other types

Dublin Core – named after a 1995 workshop in Dublin, Ohio, it is, very simply, a set of 15 agreed-upon metadata elements used to describe objects

• PRISM uses Dublin Core elements and then makes them specific to publishing

RDF (Resource Description Framework): a data model, commonly written in XML, that lets you richly describe relationships between data on web pages. Each statement is a triple: subject, predicate, object
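A triple is easier to grasp with an example. The sketch below describes a book using four Dublin Core elements and prints the statements in the plain-text N-Triples form. The Dublin Core namespace URI is the real one; the book URI and all the values are made up, and the `to_ntriples` helper is a simplification that does not escape special characters in values.

```python
# Real namespace for the 15 Dublin Core elements.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical identifier for the thing being described.
BOOK = "http://example.org/book/drug-handbook"

# Each statement is (subject, predicate, object). Values are made up.
triples = [
    (BOOK, DC + "title", "Drug Handbook"),
    (BOOK, DC + "creator", "A. Author"),
    (BOOK, DC + "publisher", "Example Press"),
    (BOOK, DC + "date", "2009"),
]


def to_ntriples(statements):
    """Write triples one per line; assumes plain literal values with no quotes."""
    lines = []
    for subject, predicate, value in statements:
        lines.append(f'<{subject}> <{predicate}> "{value}" .')
    return "\n".join(lines)


print(to_ntriples(triples))
```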

Semantic Web – A web of data. Envisioned by Tim Berners-Lee, it will be a web driven by data that “talks” to other data

• My kids will work on this

FOAF Project (Friend of a Friend): Uses RDF to describe people and their preferences to the web, so you can find people with similar interests; all about social networking

SPARQL (SPARQL Protocol and RDF Query Language) – once you have used RDF to describe resources and their connection points, you use SPARQL to ask questions about those connections and find stuff
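The idea of asking questions about connections can be illustrated without a real triple store. This sketch stores FOAF-style statements as plain Python tuples and matches patterns against them; the names, the `interest` predicate label, and the `match` helper are all hypothetical, and real SPARQL engines do far more than this.

```python
# Made-up statements: (subject, predicate, object).
triples = [
    ("alice", "interest", "metadata"),
    ("bob", "interest", "metadata"),
    ("bob", "interest", "sailing"),
    ("carol", "interest", "sailing"),
]


def match(statements, s=None, p=None, o=None):
    """Return statements matching the pattern; None acts as a wildcard."""
    return [
        t for t in statements
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def shares_interest_with(person):
    """People who have at least one interest in common with the given person.

    Roughly what this SPARQL pattern asks, for person = bob:
      SELECT ?other WHERE { :bob :interest ?topic .
                            ?other :interest ?topic .
                            FILTER(?other != :bob) }
    """
    interests = {o for _, _, o in match(triples, s=person, p="interest")}
    return sorted({
        s for s, _, o in match(triples, p="interest")
        if o in interests and s != person
    })


print(shares_interest_with("bob"))
```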

OWL (Web Ontology Language) – extends ability of RDF and XML Schemas to describe information

Drug Reference Product

Perfect, structured information that is a great example of metadata becoming just as important as content

Examples of things that were stored in metadata:

• Codes, codes, and more codes

• Drug interaction information

• Classifications (this one was actually redundant)

• Formulary information

• FDA approval date (could also be redundant)

Four editors spent as much time working on metadata as they did on content itself

All work on import/export from DB was done by:

• Acting on metadata

• Keeping metadata at top of priority list on output

• “Output all anticoagulant drugs that were approved before 1982”
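A request like the one quoted above becomes a simple filter once class and approval date live in metadata. In this sketch the drug names, dates, and field names are invented placeholders, not real drug data.

```python
from datetime import date

# Invented records: names and approval dates are placeholders, not real data.
drugs = [
    {"name": "Drug A", "drug_class": "anticoagulant", "fda_approved": date(1975, 3, 1)},
    {"name": "Drug B", "drug_class": "anticoagulant", "fda_approved": date(1990, 7, 15)},
    {"name": "Drug C", "drug_class": "analgesic", "fda_approved": date(1960, 1, 10)},
]


def select_drugs(records, drug_class, approved_before_year):
    """Names of drugs in a class approved before January 1 of the given year."""
    cutoff = date(approved_before_year, 1, 1)
    return [
        r["name"] for r in records
        if r["drug_class"] == drug_class and r["fda_approved"] < cutoff
    ]


print(select_drugs(drugs, "anticoagulant", 1982))
```

The drug monographs themselves are never read; the output is driven entirely by the metadata.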

Medical content (5 years ago I would have said “book”)

Thousands of topics, sometimes printed, always updated, sent to web, handhelds

How/when they are updated, whether or not they are printed, and whether or not they get extracted is all driven by ….

Metadata!

All extracts are made by acting on metadata

• What is the subject area of the topic? (this can be a MANY to ONE relationship)

• When was the topic last updated?

• Who was the author of the last update?
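Here is one way the three questions above could drive an extract, sketched in Python. The topic records, field names, and the `build_extracts` helper are assumptions for illustration; note that a topic with several subject areas lands in several extracts.

```python
from collections import defaultdict
from datetime import date

# Made-up topic metadata; a topic may belong to many subject areas.
topics = [
    {"id": "t1", "subjects": ["cardiology", "nursing"],
     "updated": date(2009, 4, 2), "updated_by": "Editor A"},
    {"id": "t2", "subjects": ["cardiology"],
     "updated": date(2007, 11, 20), "updated_by": "Editor B"},
    {"id": "t3", "subjects": ["oncology"],
     "updated": date(2009, 1, 15), "updated_by": "Editor A"},
]


def build_extracts(records, updated_since):
    """Group recently updated topic ids by subject area."""
    extracts = defaultdict(list)
    for topic in records:
        if topic["updated"] >= updated_since:
            for subject in topic["subjects"]:
                extracts[subject].append(topic["id"])
    return dict(extracts)


print(build_extracts(topics, date(2009, 1, 1)))
```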

ID Values assigned during XML conversion

Gender values assigned by authors

Have a metadata strategy

• Business case should support investment in metadata

• Be careful, and stay alert for mission creep – this stuff can get out of control very easily

Know your organization

• Is it a change tolerant organization? “All in” vs. measured, incremental approach should be considered

• Show me someone who says they have the correct universal approach to metadata, and I’ll show you a liar

A little bit of metadata understanding by product development people can go a long way

If a content set can benefit from metadata in the creation of new products, that could justify investment in metadata strategy and tools within the workflow

Jabin White

1. Contributor

2. Coverage

3. Creator

4. Date

5. Description

6. Format

7. Identifier

8. Language

9. Publisher

10. Relation

11. Rights

12. Source

13. Subject

14. Title

15. Type
