The Challenges of Describing Best Tagging Practices for JATS Jeffrey Beck, Technical Information Specialist, National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine, National Institutes of Health; Co-chair, NISO Journal Article Tag Suite (JATS) Standing Committee
The Challenges of Describing Best Tagging Practices for JATS
Jeffrey Beck, NCBI/NLM/NIH
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider World: Successful Applications of Linked Data
Wednesday, December 3, 2014
Intro to JATS
JATS refers to NISO Z39.96-2012 Journal Article Tag Suite.
It is a NISO standard that describes XML elements and attributes and three article models in XML.
JATS was based on the “NLM DTDs”, which have been used to describe journal articles since 2003.
The “NLM DTDs” grew out of work being done on the NCBI PubMed Central (PMC) DTDs in 2002.
So, what is this DTD you speak of?
DTD stands for Document Type Definition.
– One of many (3 really) schema languages for defining XML documents
– Essentially a set of rules for what can be in your document, what must be in your document, and the order of things if you wish to enforce order
We’ll get to “Why DTD” later.
A Brief History
• NLM Version 1 was released in December 2002 with the Archiving and Interchange DTD and the Journal Publishing DTD.
• Version 1 was based on work at NCBI to upgrade the PubMed Central DTD and a project at Harvard University funded by the Mellon Foundation to address the problems of archiving scholarly journals in electronic form (E-journals).
• The initial meeting included participants from NCBI, Harvard, and the Mellon Foundation along with NCBI’s consultants, Mulberry Technologies, and Harvard’s consultants, Inera, Inc.
But there was confusion about what the model should be.
Easy Target for Conversion?
• Should the new DTD be a broad, descriptive target that would be easy to translate articles from other SGML or XML models into?
A model like this would have many optional elements with few things in a prescribed order, and different ways to tag the same object.
Easy model to create content in?
• Or should the new DTD be a narrower, prescriptive target that would give creators of new XML articles guidance about how to make a valid article?
A model like this would have more required elements with fewer choices on how to tag the same object.
The DTD Spectrum
[Figure: a spectrum running from "Optimized for Conversion" (Archiving and Interchange DTD) to "Optimized to Create Content in" (Journal Publishing DTD).]
The Colors
Everything was fine, until
<x>
The two archiving strategies
Archiving the intellectual content of the article?
Or
Archiving the article file?
If you need to archive the entire file, you need a way to keep those items in the file that the Archiving and Interchange DTD did not worry about.
Punctuation in Keywords. Keyword Group in Archiving 1.0:
<!ELEMENT kwd-group (title?, kwd+) >
Keywords: DNA analysis; gene expression; parallel cloning; fluid microarray.
<kwd-group><kwd>DNA analysis</kwd><kwd>gene expression</kwd><kwd>parallel cloning</kwd><kwd>fluid microarray</kwd>
</kwd-group>
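To make the "set of rules" idea concrete, here is a minimal sketch in Python that treats the content model (title?, kwd+) as a regular expression over child element names. The function names and the regex translation are illustrative assumptions, not part of any JATS tooling, and a real DTD validator does far more than this.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the slide: <!ELEMENT kwd-group (title?, kwd+) >
# Illustrative translation into a regex over space-terminated child names.
KWD_GROUP_1_0 = re.compile(r"(title )?(kwd )+\Z")


def child_sequence(xml_text):
    """Return the child element names in document order, e.g. 'title kwd kwd '."""
    root = ET.fromstring(xml_text)
    return "".join(child.tag + " " for child in root)


def valid_under_1_0(xml_text):
    """True if the children of the root satisfy (title?, kwd+)."""
    return KWD_GROUP_1_0.match(child_sequence(xml_text)) is not None
```

Under this sketch, a group with only keywords passes, a group with no <kwd> fails (something that must be there is missing), a <title> after a <kwd> fails (order is enforced), and any <x> element fails because the 1.0 model has no place for it.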
Punctuation in Keywords. Keyword Group in Archiving 2.0:
<!ELEMENT kwd-group (title?, (kwd | x)+) >
Keywords: DNA analysis; gene expression; parallel cloning; fluid microarray.
<kwd-group><title>Keywords: </title><kwd>DNA analysis</kwd><x>; </x><kwd>gene expression</kwd><x>; </x><kwd>parallel cloning</kwd><x>; </x><kwd>fluid microarray</kwd><x>.</x>
</kwd-group>
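The two archiving strategies can be seen side by side in a short Python sketch. It assumes the 2.0-style tagging shown above; the function names are hypothetical. One function pulls out only the intellectual content (the keywords), the other rebuilds the printed line from everything in the file, including the <x> punctuation.

```python
import xml.etree.ElementTree as ET

# The 2.0-style example from the slide, written on one line.
KWD_GROUP_2_0 = (
    "<kwd-group><title>Keywords: </title>"
    "<kwd>DNA analysis</kwd><x>; </x>"
    "<kwd>gene expression</kwd><x>; </x>"
    "<kwd>parallel cloning</kwd><x>; </x>"
    "<kwd>fluid microarray</kwd><x>.</x>"
    "</kwd-group>"
)


def keywords(xml_text):
    """Intellectual content only: the text of each <kwd>, ignoring <x>."""
    root = ET.fromstring(xml_text)
    return [kwd.text for kwd in root.findall("kwd")]


def printed_line(xml_text):
    """The article file as printed: every text node, in document order."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())
```

With the 1.0-style tagging, only the first function has anything to work with: the title, semicolons, and final period were never captured, so the printed line cannot be rebuilt from the XML alone.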
The DTD Spectrum
[Figure: the same Conversion-to-Creation spectrum, now including the Article Authoring DTD.]
JATS? Journal Article Tag Suite
The Tag Suite is the collection of all Elements and Attributes.
Each model (Archiving, Publishing, Authoring) is a Tag Set.
Each schema (DTD, XSD, RELAX NG) represents a model or Tag Set.
• March 2003: NLM DTDs v1.0
• November 2003: NLM DTDs v1.1
• November 2004: NLM DTDs v2.0
• September 2005: NLM DTDs v2.1
This was when the Article Archiving and Journal Publishing models became more open and we added the Authoring model.
• June 2006: NLM DTDs v2.2
• March 2007: NLM DTDs v2.3
Decision to formalize standard with NISO
Laura Kelly suggested that this would be a good time to clean up those little things that we know are problems but we haven’t fixed because we wanted all of the new models to be backward-compatible.
Backward-compatibility
• Means that all existing XML instances will be valid according to the new model.
• Mostly we had minor housekeeping issues that we had been putting off.
• In version 1.0, the @id on <list-item> was defined as CDATA (when it obviously should have been defined as ID to allow ID/IDREF functionality).
• So, any existing <list-item id="45qrt"> would be valid under version 1.0 but not valid once the attribute was properly declared as type ID, because an ID value must be an XML name and so cannot begin with a digit.
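A rough Python sketch of why that change breaks existing instances: it scans <list-item> elements for @id values that were fine as CDATA but would fail as ID. The name check is a simplified ASCII-only approximation of the XML name rule, and the function names are hypothetical. Real ID uniqueness is checked across every ID-typed attribute in the document, not just <list-item>.

```python
import re
import xml.etree.ElementTree as ET

# Simplified ASCII-only stand-in for the XML Name rule (an assumption):
# the real rule also admits many non-ASCII characters and the colon.
ASCII_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def valid_as_id(value):
    """True if the value could be an ID under the simplified name rule."""
    return ASCII_XML_NAME.match(value) is not None


def list_item_ids_breaking_under_id(xml_text):
    """Return @id values on <list-item> that are not names or are repeated."""
    root = ET.fromstring(xml_text)
    seen = set()
    problems = []
    for item in root.iter("list-item"):
        value = item.get("id")
        if value is None:
            continue
        if not valid_as_id(value) or value in seen:
            problems.append(value)
        seen.add(value)
    return problems
```

A scan like this over an existing archive is one way to estimate how much content a backward-incompatible fix would invalidate before committing to it.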
• November 2008: NLM DTDs v3.0 (backward-incompatible release)
• NLM DTDs v3.1
The NLM DTD Working Group is dissolved, and the NISO Journal Article Tag Suite Working Group is created.
• March 2011: NISO Z39.96 JATS v0.4
• August 2012: NISO Z39.96-2012 is official
• December 2013: JATS v1.1d1 released
JATS v1.1d2 - December 2014??
Maintained in DTD
• We deliver DTD, XSD, and RNG as non-normative supporting material to the standard.
• But the models are written and maintained in DTD and the other schemas are derived from them.
Q: But this means that you will not get any of the advantages of the more modern schema languages in JATS?
A: Yes. That is correct.
Q: And that is bad!
A: Not necessarily.
Q: But, but … data typing!!!
In defense of DTD
• First, DTD is still the schema language of choice for most users of JATS – publishers and tagging vendors.
But, but … data typing!!!
Data typing gives the schema writer control over the value of an element or attribute, such as requiring that a value be an integer or that a string of characters be a date.
There is little data typing in DTD.
Let’s consider dates
It is reasonable to say that when we are creating content to publish, we want the values that are written as dates to be dates.
• The 14th of Smoon
• January 7, 1
• 1947-02-30
Are all a little hinky and should not be published!
But what if they already exist?
If you are tagging a journal’s historical content in XML and you come across an issue with a cover date of February 30, 1947, what do you do?
A: Fix it!
Q: What is it “supposed” to be?
If a date can sometimes not be a date, then you cannot have a hard and fast rule built into your schema that says it must always be a date.
{Thanks to Tommie Usdin of Mulberry Technologies and Co-Chair of the JATS Standing Committee for this wonderful example that I stole.}
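The trade-off can be sketched in Python using the standard library's calendar check. Both function names are hypothetical. The strict version behaves like a typed schema and rejects the 1947 cover date outright; the lenient version keeps what was printed and only flags it, which is closer to what an archive of historical content needs.

```python
from datetime import date


def strict_date(year, month, day):
    """What a typed schema would enforce: a real calendar date or an error."""
    return date(year, month, day).isoformat()


def lenient_date(year, month, day):
    """Preserve the printed values and flag, rather than reject, bad dates."""
    try:
        date(year, month, day)
        is_calendar_date = True
    except ValueError:
        is_calendar_date = False
    return {
        "year": year,
        "month": month,
        "day": day,
        "is_calendar_date": is_calendar_date,
    }
```

Calling strict_date(1947, 2, 30) raises ValueError, so the issue could not be tagged at all, while lenient_date(1947, 2, 30) records the cover date as printed with is_calendar_date set to False.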
So, how do you tag a … ?
• But sometimes people want to be told what to do.
• The JATS Tag Sets, especially Archiving and Interchange and even Journal Publishing, are very flexible models that allow content to be tagged in different ways.
A reasonable question
• (1) It seems from the element reference page for <chem-struct-wrap> that one could omit explicit labels because "A <chem-struct-wrap> may also be numbered, automatically by a formatting application or by preserving the number inside a <label> element." Having seen this, but not found similar comments about "automatic numbering" for other elements that may typically be numbered/labelled, I would like to know what the assumption is about omitting labels in general for these (e.g. chemical structures, equations, figures, tables, etc.): is a formatting application expected by default to generate a number/label? If so, is there a way to suppress numbering for some occurrences?
• (2) Relatedly, what is the expected behaviour for an <xref> element that has no content (e.g. one that (a) references an element for which automatic numbering has been assumed and which therefore lacks a <label>, or (b) one that references an element possessing a <label>)?
• Message from Simon Newton to [email protected] on September 7, 2011
• Simon was asking for “Best Practices”
• So I was thrilled to see the following response:
I don't think any assumptions are made regarding when and exactly how numbering should be automated; there is only a recognition that it commonly done in publishing systems, and JATS is designed to support this (or no numbering at all) or not, depending on local policies.
Neither is there any expectation that by default, a formatting application will number things.
This means you have both the opportunity and the burden to define a policy that makes the most sense for your data and workflow.
Message from Wendell Piez to [email protected] on September 8, 2011
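As an illustration of what "define a policy" might mean in practice, here is one possible local policy sketched in Python. It is an assumption made for the example, not something JATS prescribes: a <fig> with a <label> keeps it, a <fig> without one gets "Figure N" by document position, and an empty <xref> is filled with the label of its target. The sample markup and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one figure without a <label>, one with.
SAMPLE = (
    "<body>"
    "<p>See <xref ref-type='fig' rid='f1'/> and "
    "<xref ref-type='fig' rid='f2'/>.</p>"
    "<fig id='f1'><caption><p>First</p></caption></fig>"
    "<fig id='f2'><label>Figure B</label><caption><p>Second</p></caption></fig>"
    "</body>"
)


def display_labels(root):
    """Map each fig id to the label a renderer would show under this policy."""
    labels = {}
    for position, fig in enumerate(root.iter("fig"), start=1):
        label = fig.find("label")
        if label is not None and label.text:
            labels[fig.get("id")] = label.text  # preserve the tagged label
        else:
            labels[fig.get("id")] = f"Figure {position}"  # generate one
    return labels


def fill_empty_xrefs(root):
    """Give each empty <xref> the display label of the element it points to."""
    labels = display_labels(root)
    for xref in root.iter("xref"):
        if not xref.text and len(xref) == 0:
            xref.text = labels.get(xref.get("rid"), "")
    return root
```

A different publisher could just as reasonably require explicit labels everywhere and treat an empty <xref> as an error; the tag set allows either, which is why the policy has to be written down locally.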
Best Practices must be scoped
• They must make sense with your content.
… with your workflow
… and for any users of your content down the line.
The Standing Committee position
The JATS Standing Committee makes an effort to make the Tag Suite as useful as possible for all users: creators of content, publishers, archives, and other aggregators.
To do this, “all reasonable practices” are documented as much as possible in the non-normative supporting information available at http://jats.nlm.nih.gov.
But there are efforts to define tagging best practices – or at least practices.
PMC Tagging Guidelines
We have the PMC Tagging Guidelines (http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html), which are essentially a set of "Best Practices" for tagging articles in NLM XML for submission to PMC.
These are still surprisingly open.
In response to the article “Inconsistent XML as a Barrier to Reuse of Open Access Content” (http://www.ncbi.nlm.nih.gov/books/NBK159964/), which focused on inconsistent tagging in the PMC Open Access articles available for reuse, a number of mainly open access publishers formed a group called JATS for Reuse to define some best tagging practices.
See http://jats4r.github.io/
Questions?
Come to
http://jats.nlm.nih.gov/jats-con