File Formats in the Context of Archiving
Dr. Thomas Fischer
EMANI – Project Meeting, February 14th - 16th, 2002
Springer-Verlag Heidelberg
Göttingen State and University Library (SUB)
February 14th - 16th, 2002, EMANI Project Meeting, SUB Göttingen
Archives Store Different Kinds of Data ...
archives have to deal with different kinds of data: raw binary data, texts, images, multimedia, ...
... in Different File Formats
binary data: stream of bytes
text: ASCII, other encodings of simple text, formatted text
images: vector or pixel oriented graphics
multimedia: a plethora of different file types for different purposes
Focus on ...
mathematics consists mostly of text, formulas, diagrams, and some images
further contents might be (compiled) programs, interactive simulations etc.
for learned journals the content is overwhelmingly text with few images
Text!
text files usually contain two kinds of information:
textual data providing the contents (words) of the file
structural data containing the information for the presentation of the text
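The split between textual and structural data can be made concrete with a minimal sketch. It assumes an HTML-like mark-up sample invented for illustration; the class `Splitter` and the function `split_text_and_structure` are hypothetical names, not part of the talk. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class Splitter(HTMLParser):
    """Collect the words and the mark-up tags of a document separately."""

    def __init__(self):
        super().__init__()
        self.words = []      # textual data: the contents of the file
        self.structure = []  # structural data: names of the opening tags

    def handle_starttag(self, tag, attrs):
        self.structure.append(tag)

    def handle_data(self, data):
        # Ignore runs that consist only of whitespace.
        if data.strip():
            self.words.append(data.strip())


def split_text_and_structure(markup):
    """Return (words, structure) for a mark-up string."""
    splitter = Splitter()
    splitter.feed(markup)
    splitter.close()
    return splitter.words, splitter.structure


# Illustrative sample, not taken from the slides.
SAMPLE = "<h1>Text!</h1><p>Words carry <b>meaning</b></p>"
words, structure = split_text_and_structure(SAMPLE)
print(words)      # the textual data
print(structure)  # the structural data
```

Losing the second list would cost the formatting, while losing the first would cost the content itself.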
Two Kinds of Problems
if problems occur with the media or the program that reads the file, some information may be lost:
loss of structure leads to loss of formatting
loss of text leads to loss of meaning
the latter is usually considered more serious
Two Types of Text File Formats
structured format (e.g. Microsoft Word, PDF): the file consists of text (more or less uninterrupted) and tables (usually at the beginning or the end of the file) that provide additional information, formatting etc.
mark-up format (e.g. HTML, XML, RTF, TeX): the file consists of a stream of text with formatting information interspersed
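The difference between the two layouts can be sketched with a toy example. The "structured" representation below (a text string plus a table of `(start, end, style)` entries) is a deliberately simplified assumption and does not reproduce the real Word or PDF layout; `to_markup` is a hypothetical helper that intersperses the table entries into the text as HTML-like tags.

```python
# Toy "structured" representation: uninterrupted text plus a separate table.
TEXT = "Mark-up formats are robust."
FORMAT_TABLE = [(0, 7, "b")]  # (start, end, style): "Mark-up" is bold


def to_markup(text, table):
    """Intersperse the formatting table into the text as tags.

    Assumes the table entries do not overlap.
    """
    pieces = []
    position = 0
    for start, end, style in sorted(table):
        pieces.append(text[position:start])                      # unformatted run
        pieces.append(f"<{style}>{text[start:end]}</{style}>")   # formatted run
        position = end
    pieces.append(text[position:])                               # remaining text
    return "".join(pieces)


print(to_markup(TEXT, FORMAT_TABLE))
```

In the structured form the offsets in the table only make sense together with the text they point into; in the mark-up form each formatting instruction sits next to the words it applies to.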
For Archiving Purposes
the file format chosen should be readable without the use of specialized programs
the file format should be robust against damage to the media and loss of data
Types of Text Format
mark-up languages like XML or TeX store text and formatting together. The text can be recovered with any text editor, and the formatting can probably be regained.
structured formats like MS Word or PDF need a dedicated program for proper representation. Whether the contained text can be extracted depends on the particular situation, which is usually not visible to the user.
Consequence: Mark-up formats are better suited for archiving
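A minimal sketch of why mark-up survives damage: it assumes an invented XML fragment whose last five characters are lost, and the names `strict_parse` and `salvage_text` are hypothetical. A strict parser from Python's standard library rejects the damaged file, while a simple tag-stripping pass (roughly what a reader does by eye in a text editor) still recovers the words.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative fragment; the element names are assumptions, not from the talk.
INTACT = (
    "<article><title>On Primes</title>"
    "<p>There are <em>infinitely</em> many primes.</p></article>"
)

# Simulate media damage: the last five characters of the file are lost.
DAMAGED = INTACT[:-5]


def strict_parse(data):
    """Return the parsed tree, or None if the data is not well-formed XML."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        return None


def salvage_text(data):
    """Recover the words by dropping tags, including a trailing partial tag."""
    without_tags = re.sub(r"<[^>]*>", " ", data)
    without_partial = re.sub(r"<[^>]*$", "", without_tags)
    return " ".join(without_partial.split())


print(strict_parse(DAMAGED))   # None: the strict parser gives up
print(salvage_text(DAMAGED))   # the words are still readable
```

The salvage step loses the formatting but keeps the meaning, which matches the earlier point that loss of text is the more serious problem.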