File Formats in the Context of Archiving

Dr. Thomas Fischer

EMANI – Project Meeting, February 14th - 16th, 2002

Springer-Verlag Heidelberg

Göttingen State and University Library (SUB)

[email protected]




Archives Store Different Kinds of Data ...

archives have to deal with different kinds of data: raw binary data, texts, images, multimedia, ...


... in Different File Formats

binary data: stream of bytes

text: ASCII, other encodings of simple text, formatted text

images: vector or pixel-oriented graphics

multimedia: a plethora of different file types for different purposes
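To illustrate "other encodings of simple text", here is a minimal Python sketch. The sample word and the choice of Latin-1 and UTF-8 are assumptions made for the example; the slides do not name specific encodings beyond ASCII. It shows that the same characters become different byte streams, and that reading bytes with the wrong encoding garbles them.

```python
# Illustrative only: the sample word is an assumption, not taken from the slides.
text = "Göttingen"

latin1_bytes = text.encode("latin-1")   # one byte per character
utf8_bytes = text.encode("utf-8")       # "ö" takes two bytes here

# Plain ASCII cannot represent "ö" at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Reading UTF-8 bytes as if they were Latin-1 garbles the non-ASCII character.
misread = utf8_bytes.decode("latin-1")

print(len(latin1_bytes), len(utf8_bytes), ascii_ok)
print(misread)
```

An archive therefore needs to record which encoding a text file uses, not just its bytes.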


Focus on ...

mathematics consists mostly of text, formulas, diagrams, and some images

further contents might be (compiled) programs, interactive simulations etc.

for learned journals the content is overwhelmingly text with few images


Text!

text files usually contain two kinds of information:

textual data providing the contents (words) of the file

structural data containing the information for the presentation of the text
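The two kinds of information can be made visible with a small sketch. It assumes an HTML fragment as input and uses Python's standard `html.parser` module; the class name `Splitter` and the sample string are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class Splitter(HTMLParser):
    """Collects textual data and structural data (tags) separately."""

    def __init__(self):
        super().__init__()
        self.words = []      # textual data: the contents of the file
        self.structure = []  # structural data: information for presentation

    def handle_starttag(self, tag, attrs):
        self.structure.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.structure.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.words.append(data.strip())


# Hypothetical sample document.
sample = "<h1>Text!</h1><p>Archives store <em>different</em> kinds of data.</p>"

splitter = Splitter()
splitter.feed(sample)
splitter.close()

print(splitter.words)
print(splitter.structure)
```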


Two Kinds of Problems

if problems occur with the media or the program that reads the file, some information may be lost

loss of structure leads to loss of formatting

loss of text leads to loss of meaning

the latter is usually considered more serious


Two Types of Text File Formats

structured format (e.g. Microsoft Word, PDF): the file consists of text (more or less uninterrupted) and tables (usually at the beginning or the end of the file) that provide additional information, formatting etc.

mark-up format (e.g. HTML, XML, RTF, TeX): the file consists of a stream of text with formatting information interspersed
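The difference between the two layouts can be sketched with a toy model. This is not the actual layout of Word or PDF files: the table of `(start, end, tag)` entries and the function `to_markup` are hypothetical and only mirror the description above, with the text kept uninterrupted and the formatting held in a separate table.

```python
def to_markup(text, table):
    """Turn a toy 'structured' record (plain text plus a formatting table)
    into a mark-up stream. Assumes the spans in `table` do not overlap."""
    out = []
    pos = 0
    for start, end, tag in sorted(table):
        out.append(text[pos:start])                      # unformatted run
        out.append(f"<{tag}>{text[start:end]}</{tag}>")  # formatted run
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Toy structured format: uninterrupted text plus a separate table.
text = "Archives store different kinds of data."
table = [(15, 24, "em")]  # characters 15..23 are emphasized

markup = to_markup(text, table)
print(markup)
```

In this toy model, losing the table leaves the text intact but unformatted, while in the mark-up stream text and formatting travel together.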


For Archiving Purposes

the file format chosen should be readable without the use of specialized programs

the file format should be robust against damage to the media and loss of data
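What robustness against damage could mean for a plain-text file can be sketched as follows. The helper `damage` and the sample bytes are assumptions for illustration. A run of bytes is overwritten to mimic a bad block on the medium, and the file is then read with a generic decoder that replaces unreadable bytes instead of failing.

```python
def damage(data: bytes, start: int, length: int) -> bytes:
    """Overwrite `length` bytes from `start` with 0xFF to mimic a bad block."""
    return data[:start] + b"\xff" * length + data[start + length:]


# Hypothetical mark-up file content (pure ASCII).
original = b"\\section{Two Kinds of Problems} Loss of text leads to loss of meaning."

damaged = damage(original, 3, 6)

# A generic decoder marks each unreadable byte with U+FFFD and keeps the rest.
recovered = damaged.decode("utf-8", errors="replace")
lost = recovered.count("\ufffd")

print(recovered)
print(lost, "characters lost out of", len(original))
```

The damage stays local: only the overwritten characters are lost, and everything after them remains readable.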


Types of Text Format

mark-up languages like XML or TeX store text and formatting together. The text can be recovered with any text editor, and the formatting can probably be regained.

structured formats like MS Word or PDF need their dedicated program for proper representation. Whether the contained text can be extracted depends on the particular situation, which is usually not visible to the user.

Consequence: Mark-up formats are better suited for archiving
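As a rough illustration of recovering text from a mark-up file without a dedicated program, the sketch below strips simple TeX commands with a regular expression. The function `strip_tex` and the sample source are hypothetical, and the pattern only handles plain `\command` names and braces, not real-world TeX in general.

```python
import re


def strip_tex(source: str) -> str:
    """Crude text recovery: drop \\command names and braces, keep the words."""
    without_commands = re.sub(r"\\[a-zA-Z]+", "", source)
    without_braces = without_commands.replace("{", "").replace("}", "")
    return " ".join(without_braces.split())  # normalize whitespace


# Hypothetical TeX fragment.
source = (
    r"\section{Types of Text Format} Mark-up languages store "
    r"\emph{text} and \textbf{formatting} together."
)

plain = strip_tex(source)
print(plain)
```

Even this crude approach keeps every word of the content; only the formatting is lost.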