The use of SGML and XML at the Publications Office
Dr. Holger Bagola
Dir A – Cell “Formats”
Table of contents
• Historical overview
• Formex
• Other areas of XML usage
• Conclusion
Historical overview
• Among the missions of the Publications Office:
– Archiving of legislative publications
• Choice of SGML
– Independent from any platform
– Distinction between structure and presentation
– Support for synoptic document management in a multilingual environment
• Migration to XML
– Basic advantage: availability of tools
Formex (1)
• SGML versions
– Version 1: adopted in 1984
• First deliveries in 1985
• Characteristics:
– Mixture of SGML and CCF (Common Communication Format) for meta-data
– Markup not very detailed
– Character encoding based on ISO 2022
Formex (2)
– Version 2: adopted in 1989, revised in 1992
• First deliveries in 1989
• Characteristics:
– Mixture of SGML and CCF (Common Communication Format) for meta-data
– Introduction of a logical table model
– Character encoding based on ISO 2022
Formex (3)
– Version 3: adopted in 1999
• Work on the specifications began in 1994
• First deliveries in 1999
• Characteristics:
– Markup of semantic role of a document component
– Definition of text entities for 11 languages
– Character encoding based on ISO 2022 (after discussion of moving to Unicode UTF-8)
Formex (4)
• XML version
– Version 4: adopted in 2004
• First deliveries in 2004
• Characteristics:
– XML
– Character encoding based on Unicode (UTF-8)
Formex (5)
• Basic principles
– XML Schema instead of DTD
– One single schema
– Number of root elements: 12 instead of 30
– Number of elements: about 350 instead of 1200
– Distinction between semantic and physical markup
Formex (6)
ARTICLE (TI.ARTICLE, (PARAG+ | ALINEA+))
TI.ARTICLE (#PCDATA)
PARAG (NO.PARAG, ALINEA+)
NO.PARAG (#PCDATA)
ALINEA ((#PCDATA | NOTE | HT | FT)* | (P | LIST | TABLE)+)
. . .
Blue: semantic markup
Red: physical markup
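The content models above can be checked mechanically. The following is a minimal Python sketch, not Formex tooling: it encodes only the five declarations shown on the slide as regular expressions over the sequence of child element names, and leaves any element whose declaration is elided unchecked. The names `MODELS`, `child_sequence` and `matches_model` are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the slide, encoded as regexes over the sequence of
# child names (each name followed by one space). "#text" stands for
# non-whitespace character data. This encoding is an illustration only.
MODELS = {
    "ARTICLE": r"TI\.ARTICLE ((PARAG )+|(ALINEA )+)",
    "TI.ARTICLE": r"(#text )*",
    "PARAG": r"NO\.PARAG (ALINEA )+",
    "NO.PARAG": r"(#text )*",
    "ALINEA": r"((#text |NOTE |HT |FT )*|(P |LIST |TABLE )+)",
}

def child_sequence(elem):
    """Return the children of elem as a string such as 'TI.ARTICLE PARAG '."""
    tokens = []
    if elem.text and elem.text.strip():
        tokens.append("#text")
    for child in elem:
        tokens.append(child.tag)
        if child.tail and child.tail.strip():
            tokens.append("#text")
    return "".join(token + " " for token in tokens)

def matches_model(elem):
    """Check elem and its descendants against MODELS."""
    model = MODELS.get(elem.tag)
    if model is None:
        return True  # declaration not shown on the slide: not checked
    if not re.fullmatch(model, child_sequence(elem)):
        return False
    return all(matches_model(child) for child in elem)
```

An ALINEA that mixes running text with a P element is rejected, which mirrors the either/or choice in the ALINEA declaration.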
Formex (7)
• Table model
– Analysis of CALS, HTML, Formex v. 3
– Choice:
• Model close to HTML (top-down approach, nested tables)
• Maintenance of semantic information such as in Formex v. 3
Formex (8)
• Footnotes
– Distinction between notes in text and notes in tables, for readability and production simplicity
– Insertion of text notes into the surrounding text
– ID/IDREF to signal identical footnotes
– Numbering is a matter of presentation
– Table notes assembled at the top of the table
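The footnote principles can be illustrated with a small Python sketch. It assumes a hypothetical markup in which a note carries an `ID` attribute at its first occurrence and a repeated note is an empty element pointing back with a `REF` attribute; the real Formex attribute names may differ. Numbers are not stored in the instance and are assigned only when the text is rendered.

```python
import xml.etree.ElementTree as ET

def render_with_notes(root):
    """Return (text with note markers, list of note texts).

    Assumes hypothetical markup: <NOTE ID="..."> for a first occurrence and
    <NOTE REF="..."/> for a repetition, with the ID appearing before any REF.
    """
    numbers = {}  # note ID -> number assigned at rendering time
    notes = []    # note texts in order of first appearance
    parts = []

    def walk(elem):
        if elem.tag == "NOTE":
            ref = elem.get("REF")
            if ref is None:
                notes.append("".join(elem.itertext()))
                numbers[elem.get("ID")] = len(notes)
                number = len(notes)
            else:
                number = numbers[ref]  # identical footnote: reuse the number
            parts.append(f"({number})")
            return
        parts.append(elem.text or "")
        for child in elem:
            walk(child)
            parts.append(child.tail or "")

    walk(root)
    return "".join(parts), notes
```

Because the number is computed from document order, inserting or deleting a note never requires renumbering the stored instance.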
Formex (9)
• Quotations
– Structured quotations vs. ‘#PCDATA’ quotations
– Elements signaling start and end of a quotation (quotation marks)
– Element with function of a container for structured quotations
Formex (10)
Example:
Article 2
In article 1(2) of regulation (EC) 1234/94 the word ‘car’ is replaced by ‘bus’.
Article 6 of the same regulation is replaced by the following text:
‘Article 6
This is the new text of article 6.’
Formex (11)
Example:
<ARTICLE IDENTIFIER="002">
<TI.ARTICLE>Article 2</TI.ARTICLE>
<ALINEA>In article 1(2) of regulation (EC) 1234/94 the word <QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced by <QUOT.START ID="QS0002" REF.END="QE0002" CODE="2018"/>bus<QUOT.END ID="QE0002" REF.START="QS0002" CODE="2019"/>.</ALINEA>
<ALINEA>
<P>Article 6 of the same regulation is replaced by the following text:</P>
<QUOT.S>
<ARTICLE IDENTIFIER="006">
<TI.ARTICLE><QUOT.START ID="QS0003" REF.END="QE0003" CODE="2018"/>Article 6</TI.ARTICLE>
<ALINEA>This is the new text of article 6.<QUOT.END ID="QE0003" REF.START="QS0003" CODE="2019"/></ALINEA>
</ARTICLE>
</QUOT.S>
</ALINEA>
</ARTICLE>
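A minimal Python sketch of how such quotation markers might be consumed. It assumes that `CODE` holds the hexadecimal Unicode code point of the quotation mark to display (2018 and 2019 are the left and right single quotation marks), and it uses the first paragraph of the example written with straight attribute quotes. The function names `render` and `unpaired` are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<ALINEA>In article 1(2) of regulation (EC) 1234/94 the word '
    '<QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car'
    '<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced by '
    '<QUOT.START ID="QS0002" REF.END="QE0002" CODE="2018"/>bus'
    '<QUOT.END ID="QE0002" REF.START="QS0002" CODE="2019"/>.</ALINEA>'
)

def render(elem):
    """Flatten elem to text, turning each quotation marker into the
    character whose code point is given (in hexadecimal) by CODE."""
    if elem.tag in ("QUOT.START", "QUOT.END"):
        return chr(int(elem.get("CODE"), 16))
    out = elem.text or ""
    for child in elem:
        out += render(child) + (child.tail or "")
    return out

def unpaired(root):
    """Return the IDs of QUOT.START elements without a matching QUOT.END."""
    ends = {end.get("ID"): end for end in root.iter("QUOT.END")}
    return [
        start.get("ID")
        for start in root.iter("QUOT.START")
        if start.get("REF.END") not in ends
        or ends[start.get("REF.END")].get("REF.START") != start.get("ID")
    ]
```

The cross-references between start and end markers make it possible to verify pairing even when a quotation spans several structural elements, as in the quoted Article 6.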
Formex (12)
• Splitting large documents
– Fragmentation by definition of inclusions for the main document
– Secondary instances referencing the inclusions by means of the XML entity mechanism
– Inclusions are not necessarily valid XML instances on their own
Formex (13)
main.xml
<?xml version="1.0"?>
<doc>
<ti>title</ti>
<chap no="1">
<incl ref="frag-1.frg"/>
</chap>
</doc>
frag-1.frg
<text>…</text>
<text>…</text>
container.xml
<?xml version="1.0"?>
<!DOCTYPE frag [
<!ENTITY cnt SYSTEM "frag-1.frg">
]>
<frag>&cnt;</frag>
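The following Python sketch illustrates why the fragment needs a container. It does not use a real external entity: as a stand-in for what `container.xml` achieves with `&cnt;`, it wraps the fragment text in a `<frag>` root before parsing. The file contents are held in a dictionary rather than on disk, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# File contents as on the slide, kept in memory for the sketch.
FILES = {
    "main.xml": '<doc><ti>title</ti><chap no="1"><incl ref="frag-1.frg"/></chap></doc>',
    "frag-1.frg": "<text>…</text><text>…</text>",
}

def is_well_formed(xml_text):
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def parse_fragment(name):
    """Parse a fragment by wrapping it in a container root element,
    standing in for the entity reference inside <frag>...</frag>."""
    return ET.fromstring("<frag>" + FILES[name] + "</frag>")

def expand(main_name):
    """Return the main document with each <incl ref="..."/> replaced by
    the top-level elements of the referenced fragment."""
    root = ET.fromstring(FILES[main_name])
    for parent in list(root.iter()):
        children = []
        for child in list(parent):
            if child.tag == "incl":
                children.extend(parse_fragment(child.get("ref")))
            else:
                children.append(child)
        parent[:] = children
    return root
```

The fragment has two top-level elements and is therefore not a well-formed document by itself, yet it parses without error once it sits inside a container.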
Formex (14)
• Character set
– OJ publications in 20 (21) languages
– Different alphabets
– International character set definition: Unicode (UTF-8)
– Definition of allowed character ranges
– Special font ‘EU-Albertina’
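A check against allowed character ranges might look like the Python sketch below. The ranges listed are hypothetical placeholders, since the slide does not give the real ones; only the technique (an allow-list of code point ranges applied to decoded text) is the point.

```python
import unicodedata

# Hypothetical allow-list: the actual ranges are defined by the Formex
# specifications and are not reproduced on the slide.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, printable characters
    (0x00A0, 0x017F),  # Latin-1 Supplement and Latin Extended-A
    (0x0370, 0x03FF),  # Greek
    (0x2010, 0x2027),  # dashes, quotation marks and similar punctuation
]

def disallowed(text):
    """Return (offset, character, Unicode name) for every character
    that falls outside ALLOWED_RANGES."""
    return [
        (offset, char, unicodedata.name(char, "UNKNOWN"))
        for offset, char in enumerate(text)
        if not any(low <= ord(char) <= high for low, high in ALLOWED_RANGES)
    ]
```

Reporting the offset and the Unicode name makes it easier to locate a stray character in a delivery than a bare pass/fail result would.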
Formex (15)
• Meta-data
– OJ publications are composed of different levels:
• Publication
• Document
• ‘Contents’
– Meta-data separated according to these levels
Formex (16)
Publication
– Meta-data concerning the publication
– Structure of the publication with references to documents
Document
– Meta-data for document
– References to components
Document
– Meta-data for document
– References to components
Contents: main part (001)
Contents: Annex 1 (001.001)
Contents: Annex 2 (001.002)
Contents: main part (002)
ProCat
Formex (17)
• Meta-data (continued)
– Extraction of meta-data by means of automatic processes (pre-notices)
– Extension of pre-notices by legal analysis
– Availability of notices in ProCat for other productions (Celex) and projects
Formex (18)
• Final remark on Formex specifications
– Only a few complete production chains from the author to the printer
– Concentration on publication of the Official Journal
Formex (19)
• Validation of Formex deliveries
– In-depth validation necessary
– Automatic procedures
– Manual procedures
Formex (20)
• Validation of Formex deliveries (continued)
– Automatic procedures
• Control of filename conventions
• Parsing of various components
• Control of completeness
• Execution of additional validation rules
• Comparison of contents between Formex and PDF
⇒ Report (XML instance)
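The first three automatic checks and the XML report can be sketched in Python as follows. The filename pattern, the report vocabulary (`report`, `result`, `check`, `status`) and the function name are assumptions made for illustration; the additional validation rules and the Formex/PDF comparison are left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical filename convention; the real one is set by the specifications.
FILENAME_RE = re.compile(r"^[A-Z]+_\d{4}_\d{3}_[A-Z]{2}\.xml$")

def validate_delivery(files, expected_names):
    """Run filename, parsing and completeness checks.

    files: dict mapping file name to XML text.
    expected_names: names that the delivery must contain.
    Returns the report as an XML element.
    """
    report = ET.Element("report")

    def add(check, name, ok, detail=""):
        result = ET.SubElement(report, "result", check=check, file=name,
                               status="ok" if ok else "error")
        result.text = detail

    for name, content in files.items():
        add("filename", name, bool(FILENAME_RE.match(name)))
        try:
            ET.fromstring(content)
            add("parsing", name, True)
        except ET.ParseError as exc:
            add("parsing", name, False, str(exc))
    for name in expected_names:
        add("completeness", name, name in files)

    all_ok = all(result.get("status") == "ok" for result in report)
    report.set("status", "ok" if all_ok else "error")
    return report
```

Serialising the result with `ET.tostring(report)` yields an XML instance that a later manual step could review.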
Formex (21)
• Validation of Formex deliveries (continued)
– Manual procedures
• Verification of the report generated by the automatic validation procedure
• Control of the use of Formex specifications in all language versions
⇒ Report (XML instance) = basis for archiving or rejection
Formex (22)
• Conversion of Formex v. 3 into Formex v. 4
– Conversion of character set (ISO 2022 to UTF-8)
– Transformation of SGML instances into well-formed XML instances
– Extraction of tables and conversion into an intermediate model
– Generation of meta-data levels
– Conversion of old elements and generation of new elements
– Validation of the results
Formex (23)
• Specifications:
http://formex.publications.eu.int/
Other areas of XML usage (1)
• Index of OJ publications
– Biannual issues
– Monthly issues
– Extraction from Celex/ProCat
– Transformation into PDF by means of XSLT and XSL FO (biannual version only)
Other areas of XML usage (2)
• Consolidation of legal documents
– Mainly based on Formex
– Additional administrative data in XML
– Relations between historical levels
• Description of the composition of a given historical level
• Concordance of information on numbering schemes (articles, …) for each level
Other areas of XML usage (3)
• Conversion to RTF
– Compatibility with other EU services
– Input in SGML or XML
– Results with LegisWrite templates
Other areas of XML usage (4)
SGML instance (Formex v. 3)
Character conversion
Transformation into well-formed XML
Transformation into internal XML format
Transformation into RTF (LegisWrite)
Output in RTF (LegisWrite)
XML instance (Formex v. 4)
Other areas of XML usage (5)
• Production of the EU budget
– Creation and maintenance of a common central repository (XML)
– Markup of modified elements during the decision process in the working language
– Translation of modified parts only
– Update of repository after publication
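The idea of translating only what changed can be sketched in Python. The element names (`budget`, `line`, `heading`) and the `modified="yes"` attribute are hypothetical; the slide states only that modified elements are marked up in the repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository fragment in the working language.
REPOSITORY = """<budget lang="EN">
  <line id="01"><heading>Administration</heading></line>
  <line id="02" modified="yes"><heading>Research and innovation</heading></line>
  <line id="03"><heading>Transport</heading></line>
</budget>"""

def translation_job(repository_xml):
    """Return {line id: heading text} for the lines marked as modified."""
    root = ET.fromstring(repository_xml)
    return {
        line.get("id"): line.findtext("heading")
        for line in root.iter("line")
        if line.get("modified") == "yes"
    }

def apply_translation(target_xml, translated):
    """Write translated headings into another language version and
    return the updated tree; untouched lines keep their old text."""
    root = ET.fromstring(target_xml)
    for line in root.iter("line"):
        if line.get("id") in translated:
            line.find("heading").text = translated[line.get("id")]
    return root
```

Only one of the three lines in the sample is handed to translation; the other two are carried over from the existing language version unchanged.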
Other areas of XML usage (6)
Budget services
Translation service
Publications Office
Budget XML repository
Printer
Formex archive
pre-printing / post-printing
Other areas of XML usage (7)
• ‘Secondary legislation’
– Publication of legislation in force in ‘new’ languages
– XML production on basis of Formex archive
– Transformation of translated input
– Transformation of the Formex instance from SGML into XML
– Merging of XML instances
Other areas of XML usage (8)
Word document
Formex archive
Conversion into XML
Extraction of text
Conversion into XML
Extraction of skeleton
Merging skeleton & text
Simplify structure
Publication
ProCat
Celex
Other areas of XML usage (9)
• European document repository
– TIFF of publications
– PDF of publications
– Formex instances of OJ publications
– Exchange of information by XML messages
Other areas of XML usage (10)
• Publication of calls for tender (OJ-S)
– Input in different (electronic) formats
– Harmonization in XML
– Updating database TED
– Production of CD-ROM version
Conclusion
• Difficult start with SGML
• Successful use of XML as well as of other standards such as XSLT/XPath, XSL FO
• Powerful possibilities of re-use of XML instances