Data Formats in Mathematics: EMANI and DML

EMANI Meeting, Göttingen, 22.-23.11.2002

Dr. Thomas Fischer

Metadaten und Datenbanken, SUB Göttingen

[email protected]




Overview

Basis situation

Purposes of formats

Formats for purposes

• Text formats for archiving

• Text formats for retrieval

• Image formats for archiving

• Presentation formats: text and images

Co-operation and compatibility

• Import of data

• Coordination


Basis Situation

Archiving for presentation: preserve the original appearance of documents for the clients of electronic journals over a long time.

Archive documents in a way that is independent of software and hardware, to minimize migration problems.

Is it possible to unify the procedures of electronic publishing with those of archiving and presenting this material?


Purposes of Formats I

In the EMANI project, we propose to collect a vast corpus of data from different sources:

• Retrodigitized material (different image formats)

• Digital material (different text and image formats)

• Multimedia-type material (interactive, video …)

• Programs

This material comes in different formats because no single format serves the needs of producing all of it.

To integrate this material into one collection, standardization will be extremely valuable.


Purposes of Formats II

A closer look:

Retrodigitization: This process produces images. Requirements come from usage and administration. The participants can usually decide on the format.

Digital material: The majority of mathematical text is written in TeX these days, but articles may include images (e.g. EPS). Some material is not produced in TeX (e.g. editorials), and for some the sources are no longer available. Some text-processing formats (Word, WordPerfect) and presentation formats (PS, PDF) are also common.

Programs: These should be archived as source code (essentially ASCII). Compiled programs cause migration problems.

Multimedia: Not considered for now (e.g. videos from fractal images).


Formats for Purposes

In the context of archiving mathematics, different formats are needed for different purposes:

Archiving: A stable representation of the content and form of the article. The format should be independent of proprietary software and insensitive to minor errors.

Retrieval: Metadata and probably a full textual representation of the contents. Formulas are still an open question (much more complex than in chemistry).

Presentation: The presentation of the material should be as true as possible to the original “look and feel”. It should be rendered by standard agents and should not require special programs (beyond simple plug-ins) on the client’s side. Special measures may be necessary for the visually impaired.


Text formats for archiving

The best-suited formats for archiving are mark-up formats like TeX or MathML. There is some progress in the conversion from MathML to TeX and vice versa.

Documents in TeX need a suitable environment for correct rendering. This requires documenting and archiving the additional files they depend on (style files, fonts, etc.).

Included images will come in different formats, usually EPS, but PDF and TeX-defined graphics are also possible. It is not clear to me how to handle these. The format of the images is not necessarily obvious.


Retrieval formats

For metadata, a scheme (application profile) is needed (this is another work package).

For retrieval using full text, an additional textual layer may have to be added to the text file (unless some TeX-based full-text search mechanism becomes available). This textual layer should be stored in a normalized format. The one provided by the Text Encoding Initiative (TEI) might be useful (mark-up of structural information).

The alternative is an integrated search engine that provides access to the data by storing the relevant information in its own database, separate from the original data (as Google does for files on the internet).
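The search-engine alternative can be illustrated with a minimal inverted index that lives apart from the archived files. This is only a sketch under stated assumptions: the function names, the document ids and the sample texts are hypothetical, and a real system would also need stemming, ranking and some treatment of formulas.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample data, for illustration only.
documents = {
    "art-1": "On the dual of complex semigroups",
    "art-2": "Numerical methods for complex eigenvalue problems",
}
index = build_index(documents)
```

The index stores only tokens and document ids, so the archived originals stay untouched and the index can be rebuilt whenever the extraction method improves.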


Image Formats for Archiving

High-quality images are necessary for further processing of the data: conversion to other formats, OCR, printing.

For managing the files, the ability to embed metadata is extremely desirable.

If the images are to be archived in compressed form, the compression algorithm should be lossless and free of copyright restrictions.

This points to 600 dpi TIFF as the standard format, with CCITT G4 compression for bitonal images.
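An import check against this recommendation could read the relevant tags directly from the TIFF header. The following is a minimal sketch using only the Python standard library. It assumes a baseline TIFF and inspects only the first image directory, the function names are hypothetical, and the tag numbers and default values are those of the TIFF 6.0 specification.

```python
import struct

# Tag numbers and codes as defined in the TIFF 6.0 specification.
TAG_COMPRESSION = 259
TAG_X_RESOLUTION = 282
TAG_RESOLUTION_UNIT = 296
COMPRESSION_CCITT_G4 = 4
UNIT_INCH = 2

def read_first_ifd(data):
    """Return (endian, {tag: (type, count, raw 4-byte value field)}) for the first IFD."""
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (n_entries,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    entries = {}
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack(endian + "HHI", data[start:start + 8])
        entries[tag] = (ftype, count, data[start + 8:start + 12])
    return endian, entries

def short_value(endian, entry):
    """Decode a SHORT value stored inline at the start of the value field."""
    return struct.unpack(endian + "H", entry[2][:2])[0]

def rational_value(endian, entry, data):
    """Decode a RATIONAL: the value field holds an offset to two LONGs."""
    (offset,) = struct.unpack(endian + "I", entry[2])
    numerator, denominator = struct.unpack(endian + "II", data[offset:offset + 8])
    return numerator / denominator

def check_archival_tiff(data, required_dpi=600):
    """Return a list of problems; an empty list means the checks passed."""
    endian, entries = read_first_ifd(data)
    problems = []
    # Per TIFF 6.0, a missing Compression tag means 1 (uncompressed).
    compression = 1
    if TAG_COMPRESSION in entries:
        compression = short_value(endian, entries[TAG_COMPRESSION])
    if compression != COMPRESSION_CCITT_G4:
        problems.append("compression code %d is not CCITT G4" % compression)
    # Per TIFF 6.0, a missing ResolutionUnit tag means 2 (inch).
    unit = UNIT_INCH
    if TAG_RESOLUTION_UNIT in entries:
        unit = short_value(endian, entries[TAG_RESOLUTION_UNIT])
    if TAG_X_RESOLUTION not in entries:
        problems.append("no XResolution tag")
    else:
        xres = rational_value(endian, entries[TAG_X_RESOLUTION], data)
        if unit != UNIT_INCH or xres < required_dpi:
            problems.append("resolution below %d dpi" % required_dpi)
    return problems
```

A call such as `check_archival_tiff(open(path, "rb").read())` returns an empty list for a file that passes both checks. The sketch does not verify that the image is actually bitonal or that the compressed data is intact.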


Presentation formats I

TeX is not well suited for presenting mathematical articles on the net:

• It requires additional files such as style files

• It needs special fonts

The IBM techexplorer Hypermedia Browser shows the possibilities and limitations.

TeX files have to be processed on the server side and delivered in a unified format. Possible options are DVI, PS, PDF and DjVu. Since DVI viewers usually exist only in a TeX environment, and almost the same holds for GhostView for reading PS on-screen, PDF or DjVu have to be considered.


Presentation formats II

Image files from scanning are usually very large, so sending them over the net would create a heavy load.

Image files have to be processed on the server side and delivered in a unified format. This format may differ depending on the resolution desired on the client’s side, e.g. for viewing on-screen or for (high-)quality printing.

Possible options are JPEG, PDF and DjVu.


Cooperation: Exchange of Metadata

Metadata come in different formats. Springer uses the MAJOUR-header DTD (European Workgroup on SGML) in different versions (note: this is fairly complicated, the original documentation has 151 pages).

This header is presented in SGML mark-up and/or RDF syntax.

Both can technically be imported into an envisaged EMANI system.

The compatibility of the metadata schemes has to be studied (richness and availability of data compared with the emerging EMANI scheme).


MAJOUR-header: SGML

<HEADER>
<ISSUE>
<PINFO>
<PNM>Springer-Verlag
<LOC>Berlin
<LOC>Heidelberg
<JINFO>
<JID>211
<JTL>Numerische Mathematik
<JABT>Numer. Math.
<ISSN>0029-599X
<CDN>NUMMA7
<PUBINFO>
<VID>85
<IID>3
<CD YEAR="2000" MONTH="05">
<ARTCON>
<GENHDR LANGUAGE="EN">
<ARTINFO>
<AID>0000134
<ARTTY ARTTY="RP">
<CATEG>Original article
<FIGCT COUNT="000">
<TABCT COUNT="000">
<REFCT COUNT="000">
<PPCT COUNT="24">
<PPF>343
<PPL>366
<CRN>Springer-Verlag Berlin Heidelberg 2000
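Because the header omits most end tags, a first import step could simply pair each start tag with the text that follows it. The sketch below does this with regular expressions. It is not a DTD-aware SGML parser, the function names are hypothetical, and the sample string is a shortened excerpt of the header shown above.

```python
import re

# One start tag (with optional attributes) followed by the text up to the next tag.
# End tags are skipped, because "/" cannot start a tag name here.
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^>]*)>([^<]*)")
ATTR = re.compile(r'([A-Za-z]+)="([^"]*)"')

def parse_header(sgml):
    """Return a list of (tag, attributes, text) triples in document order."""
    fields = []
    for name, raw_attrs, text in TAG.findall(sgml):
        fields.append((name.upper(), dict(ATTR.findall(raw_attrs)), text.strip()))
    return fields

def first_value(fields, tag):
    """Return the text of the first non-empty element with this tag, or None."""
    for name, _attrs, text in fields:
        if name == tag and text:
            return text
    return None

# Shortened excerpt of the MAJOUR header from the slide.
sample = (
    '<JINFO><JID>211<JTL>Numerische Mathematik<JABT>Numer. Math.'
    '<ISSN>0029-599X<PUBINFO><VID>85<IID>3<CD YEAR="2000" MONTH="05">'
    '<PPF>343<PPL>366'
)
fields = parse_header(sample)
```

The flat list loses the nesting that the DTD defines, so mapping the fields onto an EMANI scheme would still need the MAJOUR documentation to decide which container each field belongs to.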


MAJOUR-header: RDF

<rdf:Description>
  <M:ArticleID rdf:parseType="Literal">0100227</M:ArticleID>
  <dc:title>
    <rdf:Description>
      <rdfs:label rdf:parseType="Literal">On the dual of complex Ol'shanski\u&#305; semigroups</rdfs:label>
      <dc:language>
        <rdf:Description>
          <rdfs:isDefinedBy rdf:resource="http://www.w3.org/TR/Language"/>
          <rdf:value rdf:parseType="Literal">EN</rdf:value>
          <rdfs:label rdf:parseType="Literal">English</rdfs:label>
        </rdf:Description>
      </dc:language>
    </rdf:Description>
  </dc:title>
  <dc:description>


Cooperation: Import of Articles

Import of articles in PDF format:

• Quality check: appropriate resolution, scalability, printable?

• Need standards for handling PDF

Import of articles in TeX format:

• Check which additional files are necessary

• Create an appropriate container for all files belonging to one article

• Create a structure to manage general additional files
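The check for additional files on TeX import could start from the source itself. The sketch below lists the class, packages, included files, graphics and bibliography files that a LaTeX source names. It is a rough heuristic under stated assumptions: only the common command forms are recognised, files pulled in by the class or by packages are not followed, and the function name and the sample source are hypothetical.

```python
import re

# Commands whose arguments name files that the article depends on.
PATTERNS = {
    "class": re.compile(r"\\documentclass(?:\[[^\]]*\])?\{([^}]*)\}"),
    "package": re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]*)\}"),
    "input": re.compile(r"\\(?:input|include)\{([^}]*)\}"),
    "graphic": re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}"),
    "bibliography": re.compile(r"\\bibliography\{([^}]*)\}"),
}

def list_dependencies(tex_source):
    """Return {kind: [names]} for the files referenced by a LaTeX source."""
    # Drop comments first, but keep escaped percent signs (\%).
    text = re.sub(r"(?<!\\)%.*", "", tex_source)
    found = {}
    for kind, pattern in PATTERNS.items():
        names = []
        for argument in pattern.findall(text):
            # Arguments such as {amsmath,graphicx} name several files at once.
            names.extend(part.strip() for part in argument.split(",") if part.strip())
        found[kind] = names
    return found

# Hypothetical article source, for illustration only.
sample = r"""
\documentclass[leqno]{svjour}
\usepackage{amsmath,graphicx}
% \usepackage{unused}
\begin{document}
\includegraphics[width=5cm]{fig1.eps}
\bibliography{refs}
\end{document}
"""
deps = list_dependencies(sample)
```

The resulting lists could serve as a manifest for the per-article container. Entries that recur across many articles, such as the journal class, are candidates for the shared structure of general additional files.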


TeX files from Springer

TeX needs a friendly environment:

Document Class: svjour 2001/10/17
LaTeX document class for Springer journals - version 1.9
Class Springer-SVJour Warning: Specified option or subpackage "leqno" not found
- on input line 92.
! Class Springer-SVJour Error: No valid journal specified in option list.
See the Springer-SVJour class documentation for explanation.
Type H <return> for immediate help.
...
l.93 ...ournal specified in option list}{}
? ) )
No pages of output.

Produced output after installation of:

• svjour.cls

• svnummat.clo

• TOTAL00.NUM

But: references missing!


Cooperation

Internal:

• NSF/DFG: Access to Mathematical Literature over Time

• Jahrbuch Projekt

• Mathematical Monographs

• ...

External (?):

• DML

• NUMDAM

• Elsevier?

• ...


Thank you for your attention!