How Well Do You Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study
Faye Krawitz, Jennifer McAndrews, Richard O’Keeffe
Content Technology Group, AIP
JATS-CON 2012, October 16, 2012

AIP at a glance

• Founded in 1931
• Umbrella organization for 10 physical science societies. Combined membership totals 165,500 scientists, engineers and educators (with some overlap)
• One of the world's largest non-profit publishers of scientific information in physics
• Home of the Physics Resources Center
• Publish 24+ AIP, member, partner journals/magazines, three of which are co-published with other organizations, and one conference proceedings series
• Mission: To inspire every Physical and Applied Scientist in the world to turn to AIP for the information and help that they need

The AIP Content Ecosystem

• The AIP Content Collection: 800,000 SGML/XML records encoded in
  – AIP ISO 12083 “header” SGML DTD (1995-present)
  – AIP ISO 12083 “full-text” SGML DTD (1995-2005)
  – AIP “ISO-12083-informed” full-text XML DTD (2005-present)
• How was it used?
  – XML the source for print/online PDFs
  – The source for HTML rendered on the AIP online platform

And it worked well… but the times they were a-changing

What’s the problem… Why change?

• AIP-centric!
  – XML overly specialized for specific AIP products
  – Required proprietary systems and support
  – Too many intermediary data transformations
  – Limited the adoption of new technology and standards
  – Too costly to maintain
  – Not the XML format of choice for data recipients

Redefining AIP’s future content strategy: If you could have anything you want…

• Recognition that the intellectual property is the premium asset
  – Mark up the data to maximize its value and enrichment potential
• Keep current with industry standards
  – Better meet client expectations!
• Plan for success
  – Streamlined production workflow
  – Reorganize units to execute a unified content strategy
• It is not enough to realize the need to change; you must follow through and execute

C’mon… everybody does it!

• Standardization 1: adopt industry standard XML
  – Eliminate multiple formats and associated transformations
  – Enhanced data portability
• Standardization 2: adopt XML technologies such as XSLT and Schematron
  – Minimize dependence on specialized applications and skill sets
• Speak the same language as the STM Community

(Not so) Big Surprise!

• Journal Archiving and Interchange Tag Set (JATS)
• XSLT
• Schematron

Build for Success: Communication

• Make the plan known
  – Keep everyone informed and updated
• Get “buy-in”
  – Ensure the whole organization understands the change in approach
  – Ensure the whole organization understands the end goal
  – Ensure the staff understands the important role they play in the success

Build for Success: Ownership

• Organize to succeed
  – Rethink and deploy an organization that most effectively achieves the goal
• For AIP this meant…
  – Create a unified team following the overall strategy
  – Foster a definitive sense of ownership for the content as the “intellectual asset”
  – Develop a clear chain of content responsibility
  – Designate formal content “gatekeepers”

Build for Success: Infrastructure

• Invest in an up-to-date content management system
  – Efficiently manage content, not have the product(s) manage the systems
  – Avoid unneeded workflow duplication
  – Avoid unwanted “end-around” content manipulation
  – Extensibility to adapt to future needs
  – Excellent versioning capabilities
  – Effective reporting tools

Now What? Transform Decisions

• Use XSLT
• Create “mapping specification” for the following:
  – Transform AIP ISO 12083 “header” SGML DTD
  – Transform AIP “ISO-12083-informed” full-text XML DTD
  – On hold: AIP ISO 12083 “full-text” SGML DTD
• Test and adapt based on results
• Quality Control including Schematron
• Document
• Train staff and production partners

The Process

• Document Analysis
• Helpful aids
  – Existing documentation
  – Institutional memory
• Devise tagging principles
• Correct known ambiguities

Document Analysis

• Identify:
  – Consistencies
  – Inconsistencies
  – Surprises
• Evaluate tagging requirements
• Create
  – Document Map (or “specification”)
  – Sample XML files as needed

Devised Tagging Principles

• Strictly delineated element v. attribute
• Defined AIP-specific usage of JATS
• Treated <article-meta> as database-like
• Avoided customized content models; reserved for later use
• Reserved <x> markup for future use; use at transform as debugging tool
• Reserved <named-content> for semantic enrichment markup

Creating the Document Map

Tagging Principles x (Existing documentation + Institutional Memory) = JATS

Resulting Map (“spec”)

ELEMENT: metanote, metanote/edcode

AIP TAGGING:
<metanote> Contributed by the Bioengineering Division of ASME for publication in the J<emph type="smallcap">OURNAL OF</emph> B<emph type="smallcap">IOMECHANICAL</emph> E<emph type="smallcap">NGINEERING</emph>. Manuscript received July 20, 2009; final manuscript received February 18, 2010; accepted manuscript posted March 1, 2010; published online June 18, 2010. Assoc. Editor: <techeditor status="associate">Ellen M. Arruda</techeditor>.</metanote>

JATS:
</article-meta><notes notes-type="metadata-note"><p>Contributed by the Bioengineering Division of ASME for publication in the J<sc>OURNAL OF</sc> B<sc>IOMECHANICAL</sc> E<sc>NGINEERING</sc>. Manuscript received July 20, 2009; final manuscript received February 18, 2010; accepted manuscript posted March 1, 2010; published online June 18, 2010. Assoc. Editor: J. Shah.</p></notes>

Action:
1. Convert as <notes> with @notes-type="metadata-note"
2. <notes> tag is placed after </article-meta>
3. Suppress tag, keep contents of: metanote/edcode, metanote/symposium, metanote/contribgrp
4. UPDATE 02/21: wrap contents in <p>; this will not be in the source.
Info: Okay, tags below are suppressed: meta-received | meta-accepted | meta-revised | meta-presented | meta-submit | meta-published | meta-posted. ***N/A Future JATS***
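The actions in this map entry can be sketched in code. The conversion described in these slides was written in XSLT, which the deck does not show; what follows is only a minimal Python illustration of rules 1, 3 and 4, assuming well-formed XML input. The function names are hypothetical, and treating <techeditor> as a suppressed tag is an inference from the sample output rather than something the spec states. Rule 2 (placement after </article-meta>) is left out.

```python
import copy
import xml.etree.ElementTree as ET

# Tags whose markup is dropped but whose contents are kept (rule 3 and the Info note).
# "techeditor" is an assumption inferred from the sample output, not from the spec text.
SUPPRESSED = {
    "edcode", "symposium", "contribgrp", "techeditor",
    "meta-received", "meta-accepted", "meta-revised", "meta-presented",
    "meta-submit", "meta-published", "meta-posted",
}

def append_text(parent, text):
    """Append character data after the last child of parent, or to its own text."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text

def copy_content(src, dst):
    """Copy the mixed content of src into dst, applying the mapping rules."""
    append_text(dst, src.text)
    for child in src:
        if child.tag == "emph" and child.get("type") == "smallcap":
            # Small caps emphasis becomes <sc>, as in the sample output.
            copy_content(child, ET.SubElement(dst, "sc"))
        elif child.tag in SUPPRESSED:
            # Rule 3: drop the tag, keep what is inside it.
            copy_content(child, dst)
        else:
            kept = copy.deepcopy(child)
            kept.tail = None
            dst.append(kept)
        append_text(dst, child.tail)

def convert_metanote(metanote):
    """Rules 1 and 4: <notes notes-type="metadata-note"> wrapping a single <p>."""
    notes = ET.Element("notes", {"notes-type": "metadata-note"})
    copy_content(metanote, ET.SubElement(notes, "p"))
    return notes

# Shortened, made-up variant of the sample on this slide.
source = ET.fromstring(
    '<metanote>J<emph type="smallcap">OURNAL</emph>. Assoc. Editor: '
    '<techeditor status="associate">Ellen M. Arruda</techeditor>.</metanote>'
)
converted = ET.tostring(convert_metanote(source), encoding="unicode")
```

The helper that appends text is needed because ElementTree stores character data on the preceding element's tail, so removing a tag means re-attaching its text to whatever came before it.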

Corrected Known Ambiguities

Before      After
<extra1>    <suffix>
<extra2>    <role>
<extra3>    <degree>
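A one-to-one renaming like this is the simplest kind of mapping rule. As a rough Python illustration (not the XSLT actually used), assuming the three pairs read off the Before/After columns and a made-up <author> wrapper element:

```python
import xml.etree.ElementTree as ET

# Mapping read off the Before/After columns of this slide.
RENAMES = {"extra1": "suffix", "extra2": "role", "extra3": "degree"}

def rename_ambiguous(root):
    """Rename every element whose tag appears in RENAMES, in place."""
    for elem in root.iter():
        if elem.tag in RENAMES:
            elem.tag = RENAMES[elem.tag]
    return root

# Hypothetical source fragment; the wrapper element and the values are invented.
sample = ET.fromstring(
    "<author><extra1>Jr.</extra1><extra2>Editor</extra2><extra3>Ph.D.</extra3></author>"
)
renamed = ET.tostring(rename_ambiguous(sample), encoding="unicode")
```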

Expected Trouble Spots

• Generated text
• Style variation issues
• Multi-purpose tags
• Multimedia
• Time

Generated Text

The ability to take a tag like <ack> and output the title “ACKNOWLEDGMENTS” is the closest thing we have to magic.
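When the title is generated at display time it is not in the source, so a conversion has to decide whether to write it out explicitly. A minimal Python sketch of that step, assuming the title is stored as a <title> child and that only <ack> is affected (the lookup table and function name are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup of generated titles; only the <ack> entry comes from the slide.
GENERATED_TITLES = {"ack": "ACKNOWLEDGMENTS"}

def add_generated_titles(root):
    """Insert an explicit <title> into elements whose title used to be generated."""
    for elem in list(root.iter()):
        title_text = GENERATED_TITLES.get(elem.tag)
        if title_text and elem.find("title") is None:
            title = ET.Element("title")
            title.text = title_text
            elem.insert(0, title)
    return root

# Made-up fragment for illustration.
doc = ET.fromstring("<back><ack><p>We thank the reviewers.</p></ack></back>")
with_titles = ET.tostring(add_generated_titles(doc), encoding="unicode")
```

The check for an existing <title> keeps the step safe to run twice on the same file.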

Style Variation Issues

INTRODUCTION
INTRODUCTION
I. INTRODUCTION
1. Introduction
Introduction
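One way to see that these variants are the same heading is to separate the label from the title and compare titles without regard to case. The sketch below is a Python illustration under that assumption; the regular expression and function names are hypothetical and cover only Roman or Arabic numerals followed by a period.

```python
import re

# Optional leading label: a Roman or Arabic numeral followed by a period.
LABEL = re.compile(r"^\s*(?P<label>[IVXLCDM]+|\d+)\.\s+(?P<title>.+)$")

def split_heading(text):
    """Return (label, title); label is None when the heading has no numbering."""
    match = LABEL.match(text)
    if match:
        return match.group("label"), match.group("title").strip()
    return None, text.strip()

def heading_key(text):
    """Case-insensitive key for comparing headings across journal styles."""
    return split_heading(text)[1].casefold()

variants = ["INTRODUCTION", "I. INTRODUCTION", "1. Introduction", "Introduction"]
keys = {heading_key(v) for v in variants}
```

A pattern this simple would misread a title that itself begins with a capital letter and a period, so real data would need a review pass.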

Multipurpose tags

Three distinct rules for handling one SGML element, all within References:

1. When <othinfo> is a sibling of <refitem>:
   a. <othinfo>: remove tag, retain PCDATA
   b. Retain content/punctuation and trailing space
   c. MOVE retained PCDATA to before </mixed-citation> of preceding <mixed-citation>
2. When back/citation/ref/othinfo: strip <othinfo>, retain PCDATA
3. NOTE: nesting of <othinfo> requires:
   <citation id="r#"><ref><biother><othinfo>…<othinfo><dformula>
   <ref><label>#. </label><note><p>….<disp-formula>…

Multimedia

1. <epaps>See supplementary material at <url href="http://dx.doi.org/10.1063/1.3475476">http://dx.doi.org/10.1063/1.3475476</url> <epapsid display="no" type="multimedia">E-JAPIAU-108-032016</epapsid> for essential multimedia.</epaps>

2. <media id="v1" status="essential"><media-object doi="10.1063/1.3674301.1" file-name="006029jcpv1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>

3. <media id="v1" status="essential"><media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>

4. <media id="v1" status="essential"><media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref></media-object></media>
   <media id="v2" status="essential"><media-object doi="10.1063/1.3674301.2" file-name="v2.mpg" id="mm2" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v2" show-link="yes"></mediaref></media-object></media>
   <media id="v3" status="essential"><media-object doi="10.1063/1.3674301.3" file-name="v3.mpg" id="mm3" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v3" show-link="yes"></mediaref></media-object></media>.

Time

Unexpected Trouble Spots: Language

Language

Deceptively simple example:

• Before: pacs
• After: front/spin/docanal/pacs

Unexpected Trouble Spots: Nasty Surprises

Nasty Surprises

Expected tagging:
<p content-type="leadpara">Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>

Displays online as:
Lead Paragraph
Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system

Actual tagging:
<p>Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>

No online display

QUALITY CONTROL AND TESTING

• Prerequisite training
• Content and tagging checks
• Incorporating Schematron
• Online displays

QUALITY CONTROL AND TESTING: Prerequisite

• Staff Training
  – NLM/JATS DTD
  – XPath
  – XSLT
  – Schematron

QUALITY CONTROL AND TESTING: Content and tagging checks

Step 1 – Preliminary Testing:

• Performed while XSLT was in progress
• Analyst checked completed blocks of XSLT code and confirmed the programmers' understanding of instructions
• Daily meetings held to discuss new findings or clarifications of instructions

Trouble spot detected: the specification document needed to be rewritten using XPath terminology.

Step 2 – Batch Processing

• Performed when XSLT was complete
• Converted and parsed approximately 200 files
• Investigated hidden problems and determined whether an XSLT modification or a manual fix was the best course of action

Step 3 – Group Testing

• Performed when converted files were valid
• Ran approximately 200 files from various journals with assorted article types
• Entire group checked the same sample of files
• Checked for dropped text
• Ran Schematron
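The slides do not say how dropped text was detected. One plausible approach, sketched here in Python as an assumption rather than a description of AIP's process, is to compare the character data of each source file with that of its converted counterpart. The function names are hypothetical, and both inputs are assumed to be well-formed XML (SGML sources would need normalizing first).

```python
from collections import Counter
import xml.etree.ElementTree as ET

def text_words(xml_string):
    """Return all character data in document order, split into words."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext()).split()

def dropped_words(source_xml, converted_xml):
    """Words that occur more often in the source than in the converted file."""
    missing = Counter(text_words(source_xml)) - Counter(text_words(converted_xml))
    return sorted(missing.elements())

# Made-up fragments for illustration.
source = "<metanote>Manuscript received <meta-received>July 20, 2009</meta-received>.</metanote>"
converted_ok = "<notes><p>Manuscript received July 20, 2009.</p></notes>"
converted_bad = "<notes><p>Manuscript received .</p></notes>"

ok_report = dropped_words(source, converted_ok)
bad_report = dropped_words(source, converted_bad)
```

A comparison like this flags intentional changes too, such as generated titles or deliberately suppressed content, so its output is a list to review rather than a list of errors.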

Step 4 – Bulk Processing

• Performed when all files were approved from the group testing
• Entire corpus of content run, with remaining errors resulting from bad source outliers
• XSLT transformed with an accuracy rate of over 99%, but with 800,000 records there was still a large number to be inspected
• Where applicable, source or XSLT was fixed and files rerun
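To put the scale in perspective: the slides give only "over 99%" and 800,000 records, not an exact failure count, so the figure below is an upper bound computed from those two numbers. The function name is hypothetical.

```python
def max_records_to_inspect(total_records, accuracy_percent):
    """Upper bound on records left to inspect at a given whole-number accuracy rate."""
    return total_records * (100 - accuracy_percent) // 100

# 800,000 records at exactly 99% accuracy leaves 1% to inspect.
upper_bound = max_records_to_inspect(800_000, 99)
```

At exactly 99% that is 8,000 records, so an accuracy rate above 99% means somewhat fewer than 8,000 files still needed a look.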

Step 5 – Final Cleanup

• Analyze flagged data
• Investigated tags mapped in the XSLT to <x> or <strike> because the source tags had known problems

QUALITY CONTROL AND TESTING: Incorporating Schematron

• Central piece in our QC process, derived from our pre-existing proprietary QC programs
• List of checks or assertions written in the XPath language
• Tracks ERRORS and WARNINGS specific to our data
• Done in parallel while XSLT was being written

JATS MARKUP with SCHEMATRON ERROR DETECTED

<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>

JATS MARKUP CORRECTED

<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
  </compound-kwd>
  <compound-kwd>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>

SCHEMATRON RULE

<rule id="ERROR_COMPOUND_KEYWORD" context="compound-kwd">
  <assert role="ERROR_COMPOUND_KEYWORD" test="count(compound-kwd-part) = 2">[ERROR] A compound-kwd must have two compound-kwd-part tags</assert>
</rule>

<rule id="ERROR_COMPOUND_KEYWORD_PART" context="compound-kwd-part">
  <assert role="ERROR_COMPOUND_KEYWORD_PART" test="@content-type='code' or @content-type='value'">[ERROR] Invalid @content-type used for compound-kwd-part - allowable values are: code and value</assert>
</rule>
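For readers less familiar with Schematron, the two assertions can be restated procedurally. The Python below is only an illustrative equivalent, not part of AIP's QC tooling; the function name is hypothetical, and the sample markup is the uncorrected and corrected keyword group from the preceding slide.

```python
import xml.etree.ElementTree as ET

ALLOWED_PART_TYPES = {"code", "value"}

def check_compound_keywords(root):
    """Return one message per violation of the two rules shown above."""
    errors = []
    for compound in root.iter("compound-kwd"):
        # Equivalent of test="count(compound-kwd-part) = 2"
        if len(compound.findall("compound-kwd-part")) != 2:
            errors.append("[ERROR] A compound-kwd must have two compound-kwd-part tags")
    for part in root.iter("compound-kwd-part"):
        # Equivalent of test="@content-type='code' or @content-type='value'"
        if part.get("content-type") not in ALLOWED_PART_TYPES:
            errors.append("[ERROR] Invalid @content-type used for compound-kwd-part")
    return errors

PART = '<compound-kwd-part content-type="{0}">{1}</compound-kwd-part>'
first = PART.format("code", "8440-x") + PART.format("value", "Radiowave …")
second = PART.format("code", "8440Ba") + PART.format("value", "Antennas:. …")

uncorrected = ET.fromstring(
    '<kwd-group kwd-group-type="pacs-codes"><compound-kwd>'
    + first + second + "</compound-kwd></kwd-group>"
)
corrected = ET.fromstring(
    '<kwd-group kwd-group-type="pacs-codes"><compound-kwd>' + first
    + "</compound-kwd><compound-kwd>" + second + "</compound-kwd></kwd-group>"
)

uncorrected_errors = check_compound_keywords(uncorrected)
corrected_errors = check_compound_keywords(corrected)
```

The uncorrected group trips only the first rule, since all four parts carry allowed content-type values; the problem is that two code/value pairs share one <compound-kwd>.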

QUALITY CONTROL AND TESTING: Online Displays

• Assumptions at this point: files are valid and Schematron runs clean
• Testing was expanded to the online publishing group and random testers throughout the organization
• Errors were found at this point that are more apparent when viewing the rendered content
• Great way to confirm that business rules are being followed

LESSONS LEARNED & GENERAL CONCLUSIONS

• Don’t go it alone: follow industry best practices and standards
• Set yourself up for success
• It is impossible to overstate the importance of document analysis
• Use analysis as an opportunity to correct known ambiguities
• Recognize difference between bad and incorrect data
• Create a detailed document map
• XPath training is valuable
• Use Schematron as a central piece to QC process
• Work as a team

We chose to use pre-existing JATS DTD elements and avoid any JATS module customization. The stock NISO JATS was more than sufficient to accommodate AIP’s tagging needs. We were able to apply our tagging principles and remain true to our business rules.

We have achieved the XML quality we were aiming for.

Questions?