The Challenges of Describing Best Tagging Practices for JATS Jeffrey Beck, NCBI/NLM/NIH NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider World: Successful Applications of Linked Data Wednesday, December 3, 2014

NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider World - Successful Applications of Linked Data

The Challenges of Describing Best Tagging Practices for JATS Jeffrey Beck, Technical Information Specialist, National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine, National Institutes of Health; Co-chair, NISO Journal Article Tag Suite (JATS) Standing Committee

The Challenges of Describing Best Tagging Practices for JATS

Jeffrey Beck, NCBI/NLM/NIH

NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider World: Successful Applications of Linked Data

Wednesday, December 3, 2014

Intro to JATS

JATS refers to NISO Z39.96-2012 Journal Article Tag Suite.

It is a NISO standard that describes XML elements and attributes and three article models in XML.

JATS was based on the “NLM DTDs”, which have been used to describe journal articles since 2003.

The “NLM DTDs” grew out of work being done on the NCBI PubMed Central (PMC) DTDs in 2002.

So, what is this DTD you speak of?

DTD is Document Type Definition

– One of many (3 really) schema languages for defining XML documents

– Essentially a set of rules for what can be in your document, what must be in your document, and the order of things if you wish to enforce order

We’ll get to “Why DTD” later.

A Brief History

• NLM Version 1 was released in December 2002 with the Archiving and Interchange DTD and the Journal Publishing DTD.

• Version 1 was based on work at NCBI to upgrade the PubMed Central DTD and a project at Harvard University funded by the Mellon Foundation to address the problems of archiving scholarly journals in electronic form (E-journals).

• The initial meeting included participants from NCBI, Harvard, and the Mellon Foundation along with NCBI’s consultants, Mulberry Technologies, and Harvard’s consultants, Inera, Inc.

But there was confusion about what the model should be.

Easy Target for Conversion?

• Should the new DTD be a broad, descriptive target that would be easy to translate articles from other SGML or XML models into?

A model like this would have many optional elements with few things in a prescribed order, and different ways to tag the same object.

Easy model to create content in?

• Or should the new DTD be a narrower, prescriptive target that would give creators of new XML articles guidance about how to make a valid article?

A model like this would have more required elements with fewer choices on how to tag the same object.

The DTD Spectrum

[Figure: a spectrum running from "Optimized for Conversion" at one end to "Optimized to Create Content in" at the other]

The DTD Spectrum

[Figure: the spectrum from Conversion to Creation, with the Archive and Interchange DTD toward the Conversion end and the Journal Publishing DTD toward the Creation end]

The Colors

Everything was fine, until

<x>

The two archiving strategies

Archiving the intellectual content of the article?

Or

Archiving the article file?

If you need to archive the entire file, you need a way to keep those items in the file that the Archiving and Interchange DTD did not worry about.

Punctuation in Keywords

Keyword Group in Archiving 1.0:

<!ELEMENT kwd-group (title?, kwd+) >

Keywords: DNA analysis; gene expression; parallel cloning; fluid microarray.

<kwd-group>
  <kwd>DNA analysis</kwd>
  <kwd>gene expression</kwd>
  <kwd>parallel cloning</kwd>
  <kwd>fluid microarray</kwd>
</kwd-group>
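As a rough illustration of what the 1.0 content model allows, the declaration `(title?, kwd+)` can be treated as a regular expression over the names of the child elements. This is only a sketch: the function name `children_allowed` and the space-separated encoding are assumptions made for the example, and a real DTD validator does considerably more.

```python
import re

# (title?, kwd+) rewritten as a regex over child element names,
# each name followed by a single space. Hypothetical encoding for illustration.
KWD_GROUP_1_0 = re.compile(r"(title )?(kwd )+")


def children_allowed(child_names):
    """Return True if the sequence of child names fits (title?, kwd+)."""
    encoded = "".join(name + " " for name in child_names)
    return KWD_GROUP_1_0.fullmatch(encoded) is not None


print(children_allowed(["title", "kwd", "kwd"]))  # optional title, then keywords
print(children_allowed(["kwd", "x", "kwd"]))      # a punctuation element between keywords
print(children_allowed(["kwd", "title"]))         # title in the wrong position
```

Under this model there is no place to put the semicolons and the final period, so they can only be regenerated at display time.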

Punctuation in Keywords

Keyword Group in Archiving 2.0:

<!ELEMENT kwd-group (title?, (kwd | x)+) >

Keywords: DNA analysis; gene expression; parallel cloning; fluid microarray.

<kwd-group>
  <title>Keywords: </title>
  <kwd>DNA analysis</kwd><x>; </x>
  <kwd>gene expression</kwd><x>; </x>
  <kwd>parallel cloning</kwd><x>; </x>
  <kwd>fluid microarray</kwd><x>.</x>
</kwd-group>
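The difference between the two taggings can be sketched with Python's standard `xml.etree.ElementTree`. The function names and the default label and separators below are assumptions for illustration only; they are not part of JATS or of any JATS tool.

```python
import xml.etree.ElementTree as ET

# Keyword group tagged without punctuation (Archiving 1.0 style).
ARCHIVING_1_0 = (
    "<kwd-group><kwd>DNA analysis</kwd><kwd>gene expression</kwd>"
    "<kwd>parallel cloning</kwd><kwd>fluid microarray</kwd></kwd-group>"
)

# Keyword group with the title and punctuation preserved in <title> and <x>.
ARCHIVING_2_0 = (
    "<kwd-group><title>Keywords: </title>"
    "<kwd>DNA analysis</kwd><x>; </x>"
    "<kwd>gene expression</kwd><x>; </x>"
    "<kwd>parallel cloning</kwd><x>; </x>"
    "<kwd>fluid microarray</kwd><x>.</x></kwd-group>"
)


def render_generated(xml_text, label="Keywords: ", sep="; ", end="."):
    """The renderer supplies the label and punctuation (a local style choice)."""
    group = ET.fromstring(xml_text)
    return label + sep.join(kwd.text for kwd in group.findall("kwd")) + end


def render_preserved(xml_text):
    """Concatenate the text exactly as tagged, including <title> and <x>."""
    return "".join(ET.fromstring(xml_text).itertext())


print(render_generated(ARCHIVING_1_0))
print(render_preserved(ARCHIVING_2_0))
```

Both calls produce the same display line here, but only the second one recovers it from the file itself rather than from a stylesheet decision.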

The DTD Spectrum

[Figure: the spectrum from Conversion to Creation]

The DTD Spectrum

[Figure: the spectrum from Conversion to Creation, now with the Article Authoring DTD added]

JATS: Journal Article Tag Suite

The Tag Suite is the collection of all Elements and Attributes.

Each model (Archiving, Publishing, Authoring) is a Tag Set.

Each schema (DTD, XSD, RELAX NG) represents a model or Tag Set.

• March 2003: NLM DTDs v1.0

• November 2003: NLM DTDs v1.1

• November 2004: NLM DTDs v2.0

• September 2005: NLM DTDs v2.1

This was when the Article Archiving and Journal Publishing models became more open and we added the Authoring model.

• June 2006: NLM DTDs v2.2

• March 2007: NLM DTDs v2.3

Decision to formalize standard with NISO

Laura Kelly suggested that this would be a good time to clean up those little things that we know are problems but we haven’t fixed because we wanted all of the new models to be backward-compatible.

Backward-compatibility

• Means that all existing XML instances will be valid according to the new model.

• Mostly we had minor housekeeping issues that we had been putting off.

• In version 1.0, the @id on <list-item> was defined as CDATA (when it obviously should have been defined as ID to allow ID/IDREF functionality).

• So, any existing <list-item id="45qrt"> would be valid under version 1.0 but not valid once the attribute was properly defined as type ID, because an ID value must be an XML name and a name cannot begin with a digit.
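A minimal sketch of why that change breaks existing instances, assuming a simplified, ASCII-only version of the XML name rule (the real rule admits many more characters). The function names are hypothetical and exist only for this example.

```python
import re

# Simplified ASCII approximation of an XML name: a letter or underscore,
# followed by letters, digits, periods, hyphens, or underscores.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def valid_as_cdata(value):
    """A CDATA attribute accepts any string, so every value passes."""
    return True


def valid_as_id(value, seen):
    """An ID must be name-like and must not repeat within one document."""
    if NAME_RE.fullmatch(value) is None or value in seen:
        return False
    seen.add(value)
    return True


seen_ids = set()
for candidate in ["45qrt", "li45qrt", "li45qrt"]:
    print(candidate, valid_as_cdata(candidate), valid_as_id(candidate, seen_ids))
```

The first value fails as an ID because it starts with a digit, and the repeated value fails the second time because IDs must be unique, while all three are acceptable as CDATA.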

• November 2008: NLM DTDs v3.0

Backward-incompatible release

• NLM DTDs v3.1

NLM DTD Working Group is dissolved, and the NISO Journal Article Tag Suite Working Group is created.

• March 2011: NISO Z39.96 JATS v0.4

• August 2012: NISO Z39.96-2012 is official

• December 2013: JATS v1.1d1 released

• JATS v1.1d2: December 2014??

Maintained in DTD

• We deliver DTD, XSD, and RNG as non-normative supporting material to the standard.

• But the models are written and maintained in DTD and the other schemas are derived from them.

Q: But this means that you will not get any of the advantages of the more modern schema languages in JATS?

A: Yes. That is correct.

Q: And that is bad!

A: Not necessarily.

Q: But, but … data typing!!!

In defense of DTD

• First, DTD is still the schema language of choice for most users of JATS – publishers and tagging vendors.

But, but … data typing!!!

Data typing gives the schema writer control over the value of an element or attribute.

For example, a schema can say that a value must be an integer or that a string of characters must be a date.

There is little data typing in DTD.

Let’s consider dates

It is reasonable to say that when we are creating content to publish, we want the values that are written as dates to be dates.

• The 14th of Smoon

• January 7, 1

• 1947-02-30

Are all a little hinky and should not be published!
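To make the idea of a date type concrete, here is a small sketch of the kind of check a typed schema would impose on new content. It uses Python's standard `datetime.date.fromisoformat` as a stand-in for a schema date type; the function name `is_real_iso_date` is an assumption for this example.

```python
from datetime import date


def is_real_iso_date(text):
    """Return True only if text is a YYYY-MM-DD date that exists on the calendar."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


for value in ["The 14th of Smoon", "January 7, 1", "1947-02-30", "1947-02-28"]:
    print(value, is_real_iso_date(value))
```

Only the last value passes. A check like this is reasonable when you control the content being created.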

But what if they already exist?

If you are tagging a journal’s historical content in XML and you come across an issue with a cover date of February 30, 1947, what do you do?

A: Fix it!

Q: What is it “supposed” to be?

If a date can sometimes not be a date, then you cannot have a hard and fast rule built into your schema that says it must always be a date.
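One way to picture a model that tolerates such content is to always keep the date as printed and to attach a typed value only when one exists. The sketch below is an illustration under that assumption; `CoverDate` and `capture_cover_date` are hypothetical names, not JATS constructs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CoverDate:
    as_printed: str             # always kept, whatever the issue says
    normalized: Optional[date]  # filled in only when the date really exists


def capture_cover_date(as_printed, year, month, day):
    """Keep the printed string; normalize only if the calendar allows it."""
    try:
        normalized = date(year, month, day)
    except ValueError:
        normalized = None
    return CoverDate(as_printed=as_printed, normalized=normalized)


print(capture_cover_date("February 30, 1947", 1947, 2, 30))
print(capture_cover_date("February 28, 1947", 1947, 2, 28))
```

The impossible cover date is archived as it appeared, and nothing is guessed about what it was "supposed" to be.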

{Thanks to Tommie Usdin of Mulberry Technologies and Co-Chair of the JATS Standing Committee for this wonderful example that I stole.}

So, how do you tag a … ?

• But sometimes people want to be told what to do.

• The JATS Tag Sets, especially Archiving and Interchange and even Journal Publishing, are very flexible models that allow content to be tagged in different ways.

A reasonable question

• (1) It seems from the element reference page for <chem-struct-wrap> that one could omit explicit labels because "A <chem-struct-wrap> may also be numbered, automatically by a formatting application or by preserving the number inside a <label> element." Having seen this, but not found similar comments about "automatic numbering" for other elements that may typically be numbered/labelled, I would like to know what the assumption is about omitting labels in general for these (e.g. chemical structures, equations, figures, tables, etc.): is a formatting application expected by default to generate a number/label? If so, is there a way to suppress numbering for some occurrences?

• (2) Relatedly, what is the expected behaviour for an <xref> element that has no content (e.g. one that (a) references an element for which automatic numbering has been assumed and which therefore lacks a <label>, or (b) one that references an element possessing a <label>)?

• Message from Simon Newton to [email protected] on September 7, 2011

• Simon was asking for “Best Practices”

• So I was thrilled to see the following response:

I don't think any assumptions are made regarding when and exactly how numbering should be automated; there is only a recognition that it commonly done in publishing systems, and JATS is designed to support this (or no numbering at all) or not, depending on local policies.

Neither is there any expectation that by default, a formatting application will number things.

This means you have both the opportunity and the burden to define a policy that makes the most sense for your data and workflow.

Message from Wendell Piez to [email protected] on September 8, 2011
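Since the reply leaves numbering to local policy, it may help to see what one such policy could look like. The sketch below is purely illustrative: the `Figure` class, `display_labels`, `xref_text`, the "Figure " prefix, and the position-based numbering rule are all assumptions for this example, not JATS behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Figure:
    id: str
    label: Optional[str] = None  # text of <label>, if one was tagged


def display_labels(figures: List[Figure], auto_number: bool = True,
                   prefix: str = "Figure ") -> Dict[str, str]:
    """Local policy: keep a tagged label; otherwise number by position or leave blank."""
    labels = {}
    for position, fig in enumerate(figures, start=1):
        if fig.label is not None:
            labels[fig.id] = fig.label
        elif auto_number:
            labels[fig.id] = f"{prefix}{position}"
        else:
            labels[fig.id] = ""
    return labels


def xref_text(content: str, rid: str, labels: Dict[str, str]) -> str:
    """An <xref> with content keeps it; an empty one borrows the target's label."""
    return content if content else labels.get(rid, "")


figs = [Figure("f1", "Fig. 1"), Figure("f2")]
labels = display_labels(figs)
print(xref_text("", "f2", labels))
print(xref_text("see the chart", "f2", labels))
```

A different publisher could just as reasonably require a <label> everywhere and never generate numbers, which is why the policy has to be written down per workflow.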

Best Practices must be scoped

• They must make sense with your content.

… with your workflow

… and for any users of your content down the line.

The Standing Committee position

The JATS Standing Committee makes an effort to make the Tag Suite as useful as possible for all users: creators of content, publishers, archives, and other aggregators.

To do this, “all reasonable practices” are documented as much as possible in the non-normative supporting information available at http://jats.nlm.nih.gov.

But there are efforts to define tagging best practices – or at least practices.

PMC Tagging Guidelines

We have the PMC Tagging Guidelines (http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html) – which is essentially a "Best Practices" for tagging articles in NLM XML for submission to PMC.

These are still surprisingly open.

In response to the article “Inconsistent XML as a Barrier to Reuse of Open Access Content”, which focused on inconsistent tagging in the PMC Open Access articles available for reuse, a group of mainly open access publishers formed a group called JATS for Reuse to define some best tagging practices.

See http://jats4r.github.io/

(http://www.ncbi.nlm.nih.gov/books/NBK159964/)

Questions?

Come to

http://jats.nlm.nih.gov/jats-con