ETD2004, June 3-5 2004 University of Kentucky, Lexington Structured ETDs at the Document and Publication Server of Humboldt University From DTD generation to XML conversion: Uwe Müller Humboldt University, Berlin Electronic Publishing Group [email protected]

Page 2:

Uwe Müller, Electronic Publishing Group, CMS / UB Humboldt University, Berlin

From DTD generation to XML conversion:

Structured ETDs at Humboldt's EDoc Server

Background

• Humboldt University: 800 – 1,000 dissertations / year
• Germany: duty to publish dissertations
  – traditional methods:
    • publishing house
    • microfiche
    • 40 … 200 printed copies (depending on faculty regulations)
• Humboldt U.: not mandatory to submit an ETD
• ~ ¼ of dissertations published electronically
• XML as central strategy

Page 3:

Why XML?

• Standardized format
• Long term preservation
• easily convertible to
  – presentation formats (HTML, PDF)
  – other XML structures
• qualified full text retrieval
• contains structural and contextual information
  – in a machine readable format
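
A minimal sketch of what "qualified full text retrieval" can mean in practice: the search term is looked up only inside a chosen structural element. The element names follow the DiML outline shown later in the deck (title is assumed from the front matter listing); the sample document and the helper name find_in_element are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; element names follow the DiML outline,
# the content is invented for this sketch.
SAMPLE = """<etd>
  <front><title>Sample thesis</title><abstract>We study XML conversion.</abstract></front>
  <body>
    <chapter><head>Introduction</head><p>Conversion from Word is discussed.</p></chapter>
    <chapter><head>Methods</head><p>We describe the DTD.</p></chapter>
  </body>
</etd>"""


def find_in_element(root, tag, term):
    """Return the text of every <tag> element whose text contains term (case-insensitive)."""
    hits = []
    for elem in root.iter(tag):
        text = "".join(elem.itertext())
        if term.lower() in text.lower():
            hits.append(text.strip())
    return hits


root = ET.fromstring(SAMPLE)
print(find_in_element(root, "abstract", "conversion"))  # search abstracts only
print(find_in_element(root, "head", "conversion"))      # search headings only
```

The same term gives different results depending on the element searched, which is not possible with a purely layout-oriented format.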

[Figure: boxes labeled HTML, PDF, and Office document (each with a digital signature), and XML]

Page 4:

XML: Restrictions to deal with

• XML source does not contain layout information
• rather linear structure
• XML is not used as Authoring System
  – authors use their 'own' systems
    • Microsoft Word
    • LaTeX
    • Open Office / Star Office
    • Framemaker
    • Word Perfect

Page 5:

How to bridge the gap?

• meet the authors where they are …
• instructions and guidelines for authors
  – usage of style files (e.g., dissertation-hu.dot)
  – manuals, support hotline, regular courses
• different conversion processes
  – SGML author (plug in for MS Word <= 97)
  – Open Office / Star Office
• exploit genuine XML format
  – MS Office 2003 XML according to DiML DTD
  – common pitfalls: tables, pictures

Page 7:

Conversion Process Using Open Office

[Diagram labels, in order of appearance: Open Office; example.doc; example.sxw (zip file); content.xml; example_stl.xml; example.xml; front.xml, chapter1.xml, chapter2.xml, chapter3.xml; example.html; *.gif, *.jpg; front.html, chapter1.html, chapter2.html, chapter3.html]
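
Since the .sxw file is a zip archive that contains content.xml, the first conversion step can be sketched with Python's standard library alone. This is an illustration under assumptions, not the converter used at Humboldt University: the helper name read_content_xml is hypothetical, and the in-memory archive is a simplified stand-in (a real content.xml uses OpenOffice namespaces).

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(sxw_file):
    """Return the parsed root element of content.xml inside an .sxw archive."""
    with zipfile.ZipFile(sxw_file) as archive:
        with archive.open("content.xml") as member:
            return ET.parse(member).getroot()


# Stand-in archive built in memory so the sketch runs without a real file;
# a real example.sxw would be passed by path instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc><p>Hello</p></doc>")
buffer.seek(0)

root = read_content_xml(buffer)
print(root.tag, root.find("p").text)
```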

Page 9:

Principal Structure of a DiML document

<etd>
  <front>..title...author...abstract...</front>
  <body>
    <chapter>
      <section>
      ...
  </body>
  <back>..bibliography...appendix...vita...</back>
</etd>
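
The outline above can be checked mechanically. The sketch below assumes that front, body, and back each occur exactly once and in this order; the actual DiML DTD may be more permissive. The function name has_principal_structure is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed order of the top-level parts, taken from the outline of a DiML document.
EXPECTED_PARTS = ["front", "body", "back"]


def has_principal_structure(xml_text):
    """True if the root is <etd> and its children are front, body, back in that order."""
    root = ET.fromstring(xml_text)
    return root.tag == "etd" and [child.tag for child in root] == EXPECTED_PARTS


good = "<etd><front/><body><chapter/></body><back/></etd>"
bad = "<etd><body/><front/><back/></etd>"
print(has_principal_structure(good), has_principal_structure(bad))
```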

Page 10:

From flat structure to Hierarchy

• only two types of styles in Word
  – paragraph styles
  – character styles
• e.g., at the first occurrence of the Heading 1 paragraph style, the converter has to know
  – Heading 1 is the beginning of a chapter
  – Heading 1 implies a head element
  – the element chapter can only occur in body

</front>
<body>
  <chapter>
    <head id="anyID">Introduction</head>
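
The rules above can be sketched as a small state machine that walks the flat paragraph list and opens elements as headings appear. This is an illustration, not the converter described in the talk: the mapping of Heading 2 to section, the style name "Standard", and the function name build_body are assumptions.

```python
import xml.etree.ElementTree as ET


def build_body(paragraphs):
    """Turn a flat list of (style, text) pairs into nested body/chapter/section elements."""
    body = ET.Element("body")
    chapter = None
    section = None
    for style, text in paragraphs:
        if style == "Heading 1":
            # Heading 1 opens a new chapter (inside body) and supplies its head element.
            chapter = ET.SubElement(body, "chapter")
            section = None
            ET.SubElement(chapter, "head").text = text
        elif style == "Heading 2":
            # Assumed mapping: Heading 2 opens a section inside the current chapter.
            if chapter is None:
                raise ValueError("section heading before first chapter")
            section = ET.SubElement(chapter, "section")
            ET.SubElement(section, "head").text = text
        else:
            # Any other paragraph style becomes a p in the innermost open element.
            parent = section if section is not None else chapter
            if parent is None:
                raise ValueError("text before first chapter")
            ET.SubElement(parent, "p").text = text
    return body


flat = [
    ("Heading 1", "Introduction"),
    ("Standard", "Opening paragraph."),
    ("Heading 2", "Motivation"),
    ("Standard", "Why this matters."),
    ("Heading 1", "Methods"),
]
body = build_body(flat)
print(ET.tostring(body, encoding="unicode"))
```

The hierarchy is nowhere in the input; it is derived entirely from the order of the paragraph styles, which is why authors have to use the style files consistently.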

Page 17:

One Core – Multiple Views

• HTML generation (static or dynamic)
  – performance problems with XSLT and huge documents
  – solution: division of XML sources into components (easier and faster to process)
• PDF + Print on Demand (http://www.proprint-service.de)
• Current problems
  – changing Office systems and versions
    • ongoing implementations and adaptations necessary
    • but: might be restricted to XSL coding
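
The division of an XML source into components can be sketched as follows. The file names front.xml and chapterN.xml are taken from the conversion process diagram; the function name split_into_components and the sample document are assumptions, and the real splitting rules may differ.

```python
import xml.etree.ElementTree as ET


def split_into_components(etd_root):
    """Map component file names to serialized XML: the front matter plus one entry per chapter."""
    components = {}
    front = etd_root.find("front")
    if front is not None:
        components["front.xml"] = ET.tostring(front, encoding="unicode")
    for number, chapter in enumerate(etd_root.findall("body/chapter"), start=1):
        components[f"chapter{number}.xml"] = ET.tostring(chapter, encoding="unicode")
    return components


# Invented sample document with two chapters.
sample = ET.fromstring(
    "<etd><front><title>T</title></front>"
    "<body><chapter><head>A</head></chapter><chapter><head>B</head></chapter></body>"
    "<back/></etd>"
)
parts = split_into_components(sample)
print(sorted(parts))
```

Each component can then be transformed to HTML on its own, so a stylesheet never has to load the whole dissertation at once.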

Page 18:

Towards a universal DTD?

• DiML
  – originally taken from an SGML DTD at Virginia Tech ("ETD"), http://edoc.hu-berlin.de/diml
  – already many elements (> 100)
  – combines elements of different description levels
  – extended and adapted to local needs
• special requirements from several departments (e.g., literature / dramatics, humanities, geography, …)
• necessity to include external DTDs (e.g., CALS-Table, MathML, MusicML, …)
• publication types other than theses and dissertations
  – conference proceedings, electronic journals, other series, …
• first approach: extend DTD aiming at a universal 'mega' DTD
  – problems: complexity, difficult maintenance
• other possibility: create a completely new DTD for each purpose
  – loss of interoperability

Page 19:

Modular DTD Approach

• idea: individually adapted DTDs
  1. split up DTD into modules, such as
     – text, structure, citation, dramatics
  2. handle external DTDs as modules as well, e.g.,
     – MathML, MusicML, CALS-Table
  3. recombine a DTD out of user selected modules
• result
  a. a DTD with only the needed elements and modules
  b. individual reference and sample documents
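
Step 3 can be sketched as plain concatenation of the declarations that belong to the selected modules. The module names come from the slides; the element declarations, the MODULES table, and the function name assemble_dtd are simplified placeholders and do not reproduce the real DiML modules.

```python
# Hypothetical module store: module name -> DTD declarations (placeholders only).
MODULES = {
    "text": "<!ELEMENT em (#PCDATA)>\n<!ELEMENT strong (#PCDATA)>",
    "common": "<!ELEMENT p (#PCDATA | em | strong)*>\n<!ELEMENT head (#PCDATA)>",
    "structure": (
        "<!ELEMENT chapter (head, (p | section)*)>\n"
        "<!ELEMENT section (head, p*)>"
    ),
    "citation": "<!ELEMENT citation (#PCDATA)>",
}


def assemble_dtd(selected):
    """Concatenate the declarations of the selected modules, in the given order."""
    unknown = [name for name in selected if name not in MODULES]
    if unknown:
        raise KeyError(f"unknown modules: {unknown}")
    parts = [f"<!-- module: {name} -->\n{MODULES[name]}" for name in selected]
    return "\n\n".join(parts) + "\n"


dtd = assemble_dtd(["text", "common", "structure"])
print(dtd)
```

Because citation was not selected, the resulting DTD contains no citation declarations, which corresponds to result (a).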

Page 20:

Modular DTD Approach: Benefits

• modules are easily maintainable
  – distributed development
  – version numbers for each module
• reusability
  – define (several) styles for each module
  – reference information for each module
• support different languages
• get a DTD that exactly fits your needs

Page 21:

DTDSys: Principal Architecture

• modules: small packages of elements belonging to each other
• stored in separate files in the DTDBase
• include metadata, e.g., descriptive information, version numbers, and dependencies on other modules
• DTDSys generates DTD and reference files using
  – XSL / XSLT
  – Java
  – Web Interfaces
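
Because each module records its dependencies, a user's selection has to be expanded to include everything the selected modules need. The sketch below shows one way to do that with a depth-first traversal. DTDSys itself uses XSL and Java; Python is used here only for illustration, and the dependency edges in DEPENDS_ON are assumed, not taken from the DTDBase.

```python
# Hypothetical dependency table: module -> modules it needs. The module names
# appear in the slides; the edges between them are assumed for illustration.
DEPENDS_ON = {
    "text": [],
    "common": ["text"],
    "structure": ["common"],
    "citation": ["common"],
    "diml": ["structure", "citation"],
}


def resolve(selected):
    """Return the selected modules plus everything they need, dependencies first."""
    ordered = []
    visiting = set()

    def visit(name):
        if name in ordered:
            return
        if name in visiting:
            raise ValueError(f"circular dependency at {name}")
        visiting.add(name)
        for needed in DEPENDS_ON[name]:
            visit(needed)
        visiting.discard(name)
        ordered.append(name)

    for name in selected:
        visit(name)
    return ordered


print(resolve(["structure"]))
print(resolve(["diml"]))
```

Listing dependencies first means the generated DTD can declare each module after the modules it builds on.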

Page 22:

Modules and Dependencies

text       br, em, strong, sup, sub, u, tt, pre
common     p, head, caption, url, name, foreign…
structure  chapter, section, subsection…
citation   quotations and references
documents  page numbers, footnotes, endnotes, …
diml       front, body, back, abstract…

Page 23:

DTD Generation Process

[Diagram labels: DTDBase with module files (module-text.xml, shown three times), including element info, description, and dependencies; dependences.html; selection.xml; full-dtd.xml; xdiml.dtd; dtd-reference.xml; reference pages (p.php, chapter.php); processing steps labeled XSL and Java+XSL]

Page 24:

Outlook

• SCOPE = Service Core for Open Publishing Environments
  – development of Publication Components (authoring tools, conversion mechanisms, layout and style definitions)
  – management system to maintain versions and dependencies
  – publication system
  – workflow component
• Long Term Preservation activities
  – Implementation of OAIS reference model
  – Sun Center of Excellence

Page 25:

Thanks

to Sabine Henneberger, Jakob Voß, Matthias Schulz

Thank you!

Questions?

[email protected]

http://edoc.hu-berlin.de/