Upload
jerometereh
View
219
Download
0
Embed Size (px)
Citation preview
8/8/2019 Automatic Format Identification
1/33
Digital Preservation
Technical Paper:
1
Automatic Format Identification Using
PRONOM and DROID
8/8/2019 Automatic Format Identification
2/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 2 of 33
Document Control
Author: Adrian Brown, Head of Digital Preservation
Document Reference: DPTP-01
Issue: 2
Issue Date: 7 March 2006
THE NATIONAL ARCHIVES 2006
8/8/2019 Automatic Format Identification
3/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 3 of 33
Contents
1 INTRODUCTION .....................................................................................................................4
2 THE PRONOM FORMAT SIGNATURE MODEL ..............................................................6
2.1 External signatures...........................................................................................................72.2 Internal signatures ............................................................................................................7
3 THE DROID FORMAT IDENTIFICATION PROCESS ....................................................13
3.1 The format identification algorithm ............................................................................133.2 The DROID signature file ...............................................................................................163.3 The DROID file collection file .......................................................................................18
4 BIBLIOGRAPHY ...................................................................................................................21
APPENDICES ................................................................................................................................22
1 XML SCHEMAS ....................................................................................................................22
1.1 Signature file schema.....................................................................................................221.2 File collection file schema ............................................................................................26
2 PRE-PROCESSING SIGNATURES ..................................................................................28
2.1 Background for pattern matching algorithm ...........................................................282.2 Pre-processing of the signature file...........................................................................282.3 Pre-processing glossary ...............................................................................................31
3 THE PATTERN MATCHING ALGORITHM ......................................................................31
8/8/2019 Automatic Format Identification
4/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 4 of 33
1 Introduction
This document is one in a series of technical papers produced by the Digital Preservation
Department of The National Archives (TNA), covering detailed technical issues related to the
preservation and management of electronic records. This technical paper describes themethodology developed and implemented by TNA to automatically identify the formats in which
digital objects are encoded.
For the purposes of this document, a format is defined as follows:
The internal structure and encoding of a digital object, which allows it to be processed, or to
be rendered in human-accessible form. A digital object may be a file, or a bitstream
embedded within a file.
A distinction must be drawn between format identification and format validation. Identification
simply ascertains the format in which an object purports to be encoded, whereas validation ensuresthat the object is fully conformant to the format specification. As such, validation actually provides
the most secure form of identification. However, validation is a technically complex process and is
most efficient if it is informed by prior identification of the format.
Identifying and validating the format in which a digital object is encoded is a fundamental
prerequisite to accessing and managing the object. This knowledge is required both by human
users and IT systems without it, a digital object cannot be used or preserved. For the majority of
the time, this requirement is not obvious most people use a limited set of software tools, within a
consistent IT environment which is able to correctly assign the correct software to a given object.
Even so, at some point most users encounter the problem of attempting to access a file in a format
which is unknown to them, and unrecognised by either the operating system or any availablesoftware. It is also not uncommon to encounter a file which appears to be in a known format, but
cannot be opened by the relevant software.
However, it is managed information environments, such as Electronic Records Management
Systems (ERMS), or digital preservation facilities, which impose the most rigorous requirements
for identifying and validating formats. In these scenarios, it is essential to understand the precise
format and version in which every stored object is encoded. It is not enough to know that an object
is a Microsoft Word document, for example; the exact version must be identified. It is equally
essential to ensure that stored objects are valid, and do conform to the appropriate specification.
Invalid objects can arise through the use of poor-quality software tools, or as a result of accidentalor deliberate corruption.
The presumption is made that any format identification or validation method must be automated.
Although manual identification is possible, given sufficiently detailed knowledge of how a given
format specification and structure is expressed in hexadecimal values, this is clearly not
recommended as a practical approach under normal circumstances. This technical report describes
an identification method developed by TNA, which uses automated analysis of the binary structure
of a digital object, and comparison with predefined internal and external signatures for specific
formats. This approach has been implemented by TNA as a software application called DROID
(Digital Record Object Identification), which uses signature information stored in the PRONOM
technical registry. PRONOM and DROID are both freely available on the web athttp://www.nationalarchives.gov.uk/pronom. This method is currently limited to identification;
http://www.nationalarchives.gov.uk/pronomhttp://www.nationalarchives.gov.uk/pronom8/8/2019 Automatic Format Identification
5/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 5 of 33
full object characterisation, including validation and property extraction, is intended as a future
enhancement.
8/8/2019 Automatic Format Identification
6/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
2 The PRONOM format signature model
The signature information required to perform automatic format identification is stored in the
PRONOM technical registry. This section describes how signatures are modelled within
PRONOM.
A format signature is any collection of characteristics which may be used to indicate the format of
a digital object. This signature may be external or internal to the actual object bitstream. The
PRONOM technical registry contains detailed information about individual formats, including
their associated signatures. A simplified version of the data model used by PRONOM to describe
the relationships between formats and their signatures is illustrated in the following UML class
diagram:
Each format may have multiple associated internal and external signatures, with each internal
signature comprising one or more byte sequences. Multiple relationships may also be defined
between formats. Formats may also be assigned PRONOM Unique Identifiers (PUIDs)1, which
can provide an unambiguous and persistent binding between the format identification of a given
object, and the description of that format in PRONOM. Each type of signature is described in more
detail in the following sections.
1
Page 6 of 33
: See Brown (2005)
8/8/2019 Automatic Format Identification
7/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 7 of 33
2.1 External signatures
External signatures encompass all format indicators which are external to the object bitstream,
such as Macintosh data forks and Windows file extensions.
In some operating systems, such as Microsoft DOS and Windows, an external signature is
provided by the file extension. This indicates the format of the object, e.g. Mydoc.doc for a
Microsoft Word document, and Mypic.tif for a TIFF image. However, the primary function of the
extension is not to identify the format, but rather to indicate to the operating system the default
software package which should be used to open the file. As a result, file extensions suffer from
three disadvantages as a method of identification:
Extensions are not necessarily unique to a single format. For example, the .wks extension is
used for both Lotus 1-2-3 worksheets and Microsoft Works documents.
They do not provide sufficient granularity to adequately identify format versions. Forexample, the .doc extension does not distinguish between a Word 6 document and a Word
2003 document, although these are significantly different formats.
Extensions can be defined or altered by users. For example, a user of WordPerfect 5.1
might choose to distinguish letters and memos by using .let and .mem extensions
respectively, overriding the default extension assigned by the system.
External signatures are best suited to providing a general indication of the file format, and should
not be relied upon for definitive identification. PRONOM records known external signatures
which are associated with a given format. However, in DROID, these are accorded much less
weight than internal signatures in the identification process. External signatures are described in
PRONOM as type/value pairs; currently file extensions are the only permitted type.
2.2 Internal signatures
Internal signatures encompass all format indicators which are contained within the object
bitstream. By definition, a file format specification imposes a specific structure upon the content of
the bitstream, which is consistent between all digital objects in that format. The characteristics of
this structure may therefore be used as a signature for identifying the format. Some formats, such
as PNG, include a signature specifically designed to allow identification; in other cases, anincidental signature can be derived from elements of the format structure. Longer signature
sequences are clearly preferable to shorter ones, since they reduce the possibility of false
identification through coincidental sequences appearing in a bitstream. Even a deliberate signature,
such as PNG, may be enhanced using additional structural elements, to provide a more secure
identification.
In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a
continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature
byte sequence is modelled by describing its starting position within a bitstream and its value.
The starting position can be one of two basic types:
8/8/2019 Automatic Format Identification
8/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 8 of 33
Absolute: the byte sequence starts at a fixed position within the bitstream. This position is
described as an offset from either the beginning or the end of the bitstream. The byte
sequence can therefore be located by moving to the specified offset, counting from either
the Beginning of File (BOF) or End of File (EOF) position. If counting from the EOF
position, the offset is to the final byte in the sequence.
Variable: the byte sequence can start at any offset within the bitstream. The byte sequence
can be located by examining the entire bitstream.
The value of the byte sequence is defined as a sequence of hexadecimal values, optionally
incorporating any of the following regular expressions:
??: wildcard matching any pair of hexadecimal values (i.e. a single byte).
e.g.: 0x0A FF ?? FE would match 0x0A FF 6C FE or 0x0A FF 11 FE.
*: wildcard matching any number of bytes (0 or more).
e.g.: 0x0A FF * FE would match 0x0A FF 6C FE or 0x0A FF 6C 11 FE.
{n}: wildcard matching n bytes, where n is an integer.
e.g.: 0x1C 20 {2} 4E 12 would match 0x1C 20 FF 15 4E 12.
{m-n}: wildcard matching between m-n bytes inclusive, where m and n are integers or *.
e.g.: 0x03 {1-2} 4D would match 0x03 3C 4D or 0x03 3C 88 4D.
e.g.: 0x03 {2-*} 4D would match 0x03 3C 88 4D or 0x03 3C 88 3F 4D.
(a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a
hexadecimal byte sequence of arbitrary length containing no wildcards.
e.g.: 0x0E (FF|FE) 17 would match 0x0E FF 17 or 0x0E FE 17.
[a:b]: wildcard matching any sequence of bytes which lies lexicographically between a
and b, inclusive (where both a and b are byte sequences of the same length, containing nowildcards, and where a is less than b). The endian-ness ofa and b are the same as the
endian-ness of the signature as a whole.
e.g. 0xFF [09:0B] FF would match 0xFF 09 FF, 0xFF 0A FF or 0xFF 0B FF.
[!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte
sequence containing no wildcards).
e.g. 0xFF [!09] FF would match 0xFF 0A FF, but not 0xFF 09 FF.
8/8/2019 Automatic Format Identification
9/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 9 of 33
[!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically
between a and b, inclusive (where a and b are both byte sequences of the same length,
containing no wildcards, and where a is less than b).
e.g. 0xFF [!01:02] FF would match 0xFF 00 FF and 0xFF 03 FF, but not 0xFF 01FF or 0xFF 02 FF.
Note: In the examples above, spaces are included between byte values for reasons of clarity, but
are omitted in actual byte sequence values. The signature is processed left-to-right if the signature
is measured relative to BOF and right-to-left if measured relative to EOF. The endian-ness of the
signature is only relevant for sequences inside square brackets.
A byte sequence must contain a fixed subsequence of at least one byte between each occurrence of
*, or between the beginning or end of the sequence and an occurrence of *. Thus, sequences of
the following form are not permitted:
[BOF] (a|b)*
*(a|b) [EOF]
*(a|b)*
The syntax of a byte sequence may be expressed in formal BNF notation as follows:
::= |
::= |
::= |
::= ?? | { } | { -
} | ( |
) | [ : ] | [!
: ] | [!
]
::= * | { - * }
::= |
::=
::= |
::= |
::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
| 9
8/8/2019 Automatic Format Identification
10/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 10 of 33
::= A | B | C | D | E | F
The order in which byte sequences are recorded within a signature is arbitrary, and does not imply
any order or prioritisation within the bitstream. For example, if a signature contains two variable-
position sequences, these may appear in any order within a given bitstream. In cases where order is
significant (e.g. Sequences 1 and 2 are both variable-position, but Sequence 2 must appear after
Sequence 1 in the bitstream), this would be recorded by combining the relevant sequences into a
single sequence, separated by the appropriate regular expressions. For the purposes of
identification, a Boolean And relationship is assumed between every sequence in a signature, i.e.
a given object must match every sequence in the signature to match the signature.
2.2.1 Signature specificity
The level of granularity at which internal signatures can be defined may vary. In many cases, it is
possible to define internal signatures which are specific to a single version of a given format.However, in other cases a single signature may be common to more than one version of a format
(e.g. TIFF), or even to a class of formats (e.g. the OLE2 Compound Document Format). Signatures
which are unique to one format record in PRONOM are termed specific signatures, whereas
those which are common to multiple format records are termed generic signatures. No format
may have both a generic and a specific internal signature. This distinction is reflected in the
identification result: a match with a specific signature is recorded as a positive specific
identification, and a match with a generic signature is recorded as a positive generic
identification.
2.2.2 Format priority
It is possible for subtype and supertype relationships to exist between formats. For example, SVG
can be considered as a subtype of XML. Since an SVG document is also an XML document, it
would match the internal signatures for both formats. However, it is desirable that only the most
specific identification be reported. To enable this, the PRONOM data model allows a priority
relationship to be defined between two formats. Where such a relationship exists, and a given
object matches signatures for both formats, the lower priority identification is discarded.
8/8/2019 Automatic Format Identification
11/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 11 of 33
2.2.3 Example byte sequences
The possible scenarios can be illustrated as follows (note that all offsets are calculated starting
from 0):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1
Byte sequence 1 occurs at an absolute offset from the BOF. It would therefore be described as
follows:
System ID: 1
Position Type: Absolute from BOF
Offset: 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
2
Byte sequence 2 occurs at an absolute offset from the EOF. It would therefore be described as
follows:
System ID: 2
Position Type: Absolute from EOF
Offset: 2
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
3
Byte sequence 3 occurs at a variable offset. It would therefore be described as follows:
System ID: 3
Position Type: Variable
Offset: n/a
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
4a 4b
Byte sequence 4 includes two subsequences. Sequence 4a appears at a variable offset, and 4b
appears at a fixed relative offset to 4a. It would therefore be described as follows:
System ID: 4
Position Type: Variable
Offset: n/a
Value: 4a{2}4b
8/8/2019 Automatic Format Identification
12/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 12 of 33
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
5a 5b
Byte sequence 5 includes two subsequences. Sequence 5a appears at an absolute offset, and 5b
appears at a variable relative offset to 5a. It would therefore be described as follows:
System ID: 5
Position Type: Absolute from BOF
Offset: 4
Value: 5a{*}5b
8/8/2019 Automatic Format Identification
13/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
3 The DROID format identification process
The DROID software tool uses signature information stored in PRONOM to performautomatic format identification. This section describes the identification algorithm
employed by DROID, and the formats of the XML files used by DROID to describesignatures and record the results of the identification process.
3.1 The format identification algorithm
The format identification process employed by DROID is illustrated in the following activitydiagram:
Loop through listof internalsignatures
Does filematch
signature?No
Ye s
Ye s
Compare file with internalsignature
(step expanded separately)
No
Add signature tolist of passes
end of internalsignature loop?
Ye s
Add related fileformats to list ofpossible passes
(recordingwhether specific
or generic hit)
Loop throughpossible hits
Discard any possiblehits which this one has
priority over
end of hitsloop?
No
Loop throughpositive hits
end of fileformat loop?
Does fileformat have anextension thatmatches file?
Ye s
NoAdd a warningabout file
extension to thispositive hit
No
Read file into memory
Are there anypositive hits?
No
Ye s
Generate list of fileformats with the
same extension asthe file and with no
internal signature
This file isclassified as"unidentified"
Is the list ofhits empty? Ye s
No
Is file em pty? Yes
No
File is classifiedas "unidentified:zero-length file"
Reclassify remainingpossible passes as positive
hits (specific or generic)
Does file havean extension?
Ye s
No
All file formats inthis list are
"tentative hits"
If a file is too big to hold in memory in its entirety, the first step is skipped. Instead, the fileis read into memory a chunk at a time. To be specific, whenever the algorithm needs to
access byte nof the file, a check is made to see whether byte nis already in memory. If
Page 13 of 33
8/8/2019 Automatic Format Identification
14/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 14 of 33
not, the currently buffered portion of the file is discarded, and a new portion, containingbyte nis loaded in its place. The size of the buffer used is currently 1000000 bytes.
The algorithm uses the word file throughout but streams are identified using the samealgorithm. They are read into memory at the start of the process. If this results in runningout of memory then the memory contents are written to temporary file and the rest of thestream is appended on the end of that file. In that case, the algorithm then runs on thattemporary file.
A file is first compared with the set of internal signatures, and any matches are classifiedas positive identifications. These are further classified as specific or generic inaccordance with the specificity rules defined in Section 2.2.1. If any matches have priorityover other matches then the lower priority matches are discarded in accordance with thepriority rules defined in Section 2.2.2. If the file matches one or more internal signaturesthen its file extension is checked against any allowed external signatures; if the file
extension does not match any external signatures then a warning of a possible fileextension mismatch is generated, but the identification result is not otherwise affected. Ifthe file has no positive results against internal signatures then it is compared with the setof external signatures, and any matches are classified as tentative identifications. If thefile does not match any internal or external signatures then it is classified as a negativeidentification.
The possible results of an identification process can be summarised as follows:
Status Warning CommentsPositive (Specific) The file matches a specific
internal signaturePositive (Specific) Possible file extension mismatch The file matches a specific
internal signature but the fileextension does not match anyassociated external signatures
Positive (Generic) The file matches a genericinternal signature
Positive (Generic) Possible file extension mismatch The file matches a genericinternal signature but the fileextension does not match anyassociated external signatures
Tentative The file matches an externalsignature but no internalsignatures
Negative The file does not match anyinternal or external signatures
The pattern matching algorithm employed to undertake the internal signatureidentification is described in detail in Appendix C.
8/8/2019 Automatic Format Identification
15/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 15 of 33
3.1.1 Identification example
PRONOM contains information on five formats. The first two (Format A1 and FormatA2) each have their own specific internal signatures consisting of a single bytesequence. The third file format (Format B) has no internal signature. The last two(Format C1 and Format C2) both share the same generic internal signature. All fileformats can have an external signature in the form of the file extension txt. They all alsohave their own specific extension. These are respectively fa1, fa2, fb, fc1, fc2.
A set of files are processed by DROID using these signatures, with the following results:
aFile.fa1 passes the internal signature for Format A1, but fails the others. It isclassified as a Positive Specific hit on Format A1.
bFile.fa1 passes the internal signature for Format A2, but fails the others. It is
classified as a Positive Specific hit on Format A2. A warning is given, becausethe file extension is not associated with this format.
cFile.fa1 matches the internal signatures for both Format A1 and Format A2. Itis therefore classified as a possible Positive Specific hit on both formats.However, Format A2 is defined as having priority over Format A1, so the hit onFormat A1 is discarded. The final output for this file is therefore a PositiveSpecific hit on Format A2 with a warning about the file extension being wrong.
dFile.fa1 fails all internal signatures. Despite having a valid extension for FormatA1, it is not classified as a tentative hit as it has failed the internal signature. It is
therefore classified as a Negative identification.
eFile.txt fails all internal signatures. Despite having a valid extension for all fileformats, it is only classified as a Tentative hit on Format B as it has failed theinternal signatures for all other formats.
fFile.xxx passes the internal signature for Format A2. The file extension isunknown, so this is classified as a Positive Specific identification but with awarning about the file extension.
gFile.fb passes the internal signature for Format A2. Even though the fileextension corresponds to Format B, the comparison is never made because apositive identification has already been made on the internal signature for FormatA2. This file is classified as a Positive Specific identification on Format A2 butwith a warning about the file extension.
hFile.xxx fails all internal signatures. The file extension is unknown, so this isclassified as a Negative identification.
iFile.txt passes the third internal signature and fails the other two. It is classifiedas a Positive Generic hit on Format C1 and Format C2 (no indication is made
that the extension matches the other file formats).
8/8/2019 Automatic Format Identification
16/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 16 of 33
jFile.fc1 passes the third internal signature and fails the other two. It is classifiedas a Positive Generic hit on Format C1 and Format C2. The hit on FormatC2 also has a warning about the file extension.
kFile.txt passes the all three internal signatures. It is classified as a PositiveSpecific hit on Format A1 and Format A2. By the same argument as forcFile.fa1, the hit on Format A1 is discarded. This file is also a PositiveGeneric hit on Format C1 and Format C2.
3.2 The DROID signature file
The signature information contained in PRONOM must be exported in a form which canbe used by the DROID tool to perform automatic format identification. A DROID signaturefile contains all the information required by DROID to identify formats, and takes the form
of an XML document which complies with the schema described in Appendix A.1. Theinternal signatures contained in the signature file have been pre-processed in accordancewith the method described in Appendix B. The resultant signature file contains thefollowing information:
A list of file formats and for each one, the internal system identifier, format name,format version, format PUID, and a collection of identifiers for all internalsignatures associated with the file format, as well as the collection of externalsignatures. Any priority relationships over other formats are also recorded.
A list of internal signatures and for each one, a unique identifier and a collection of
byte sequences.
Each byte sequence has a position type (absolute from BOF, absolute from EOF,or variable) and possibly a maximum offset. Each sequence is made up of a seriesof subsequences, each represented by its longest unambiguous byte sequence, itsminimum offset, its shift distance function and a series of left and right sequencefragments (see Appendix B for further details).
Each sequence fragment has a position number, an unambiguous byte sequenceand minimum and maximum offsets (see Appendix B for further details).
3.2.1 Example signature file
The following example DROID Signature File describes the five example formatsdescribed in Section 3.1.1.
8/8/2019 Automatic Format Identification
17/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 17 of 33
B1B2B3B4
4
3
2
1
0
A1A2A3
C1C2C3
3
2
1
0
B1B2B3
3
2
1
0
A1A2A3
B4
B5
C1C2C3
3
2
1
0
01D1
F1F2F4F5
F1F3F4F5
........
........
8/8/2019 Automatic Format Identification
18/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 18 of 33
15
txt
fa1
16
txt
fa2
1
txt
fb
17
txt
fc1
17
txt
fc2
3.3 The DROID file collection file
The DROID file collection file contains the list of files selected for identification, and theresults of a DROID identification process. It is generated by the DROID software andtakes the form of an XML document. If saved prior to the identification process beingperformed, it contains the following information:
The DROID version number
The signature file version number
The list of the files selected for identification
If saved following identification, it contains the following information:
The DROID version number
8/8/2019 Automatic Format Identification
19/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 19 of 33
The signature file version number
The list of the files submitted to the identification process
For each file, a list of file format identifications
For each file format identification, the name, version and PUID of the file format,the identification status (e.g. positive or tentative), and any relevant warnings orerrors.
3.3.1 Example file collection file
The following example DROID File Collection File describes the results of the identification
process described in Section 3.1.1.
V1.1
3
2006-02-09T11:59:52
C:\testfiles\aFile.fa1
Positive (Specific Format)
Format A1
V1.1
8/8/2019 Automatic Format Identification
20/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 20 of 33
C:\testfiles\eFile.txt
Tentative
Format B
V0.0
8/8/2019 Automatic Format Identification
21/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 21 of 33
V2
8/8/2019 Automatic Format Identification
22/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 22 of 33
Appendices
1 XML schemas
1.1 Signature file schema
The XML schema definition for the signature file is as follows:
The PRONOM File Format Signature Specification Schema
Copyright The National Archives 2006. All rights reserved.
Define ID as key (ensuring they are also unique)
Define ID as key (ensuring they are also unique)
Ensure file formats refer to other formats that exist
Ensure file formats refer to signatures that exist
8/8/2019 Automatic Format Identification
23/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 23 of 33
8/8/2019 Automatic Format Identification
24/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 24 of 33
use="required"/>
8/8/2019 Automatic Format Identification
25/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 25 of 33
A byte expressed as hexadecimal
One or more bytes expressed as hexadecimal
One or more bytes expressed as hexadecimal, with patterns.
Eg. [A1:A4]
8/8/2019 Automatic Format Identification
26/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 26 of 33
1.2 File collection file schema
The XML schema definition for the file collection file is as follows:
The PRONOM File Collection Schema
Copyright The National Archives 2006. All rights reserved.
8/8/2019 Automatic Format Identification
27/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 27 of 33
8/8/2019 Automatic Format Identification
28/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 28 of 33
2 Pre-processing signatures
2.1 Background for pattern matching algorithm
The pattern matching algorithm is based on the Boyer-Moore-Horspool (BMH) algorithmfor string matching (Horspool, 1980). The BMH algorithm does not allow for wildcardcharacters such as defined above, and so a certain amount of pre-processing needs tobe performed on the internal signatures before the algorithm can be applied. Thesignatures are stored in unprocessed form in the PRONOM registry, and pre-processingis automatically applied as part of the generation of a new signature file.
2.2 Pre-processing of the signature file
The following pre-processing is required on each byte sequence to be used in the patternmatching. To simplify the exposition, we will use the following sequence, defined with anoffset of 10 bytes from the beginning of the file as a running example:
A1A2A3[A4:A5]??B1B2B3(B4|B5)*{5}01??C1C2C3{4-7}D1????F1(F2|F3)F4F5
2.2.1 Step 1. Split the byte sequence pattern into fragments to remove all *
Split up each byte sequence pattern P into smaller fragments (P1 - Pn)wherever a * is found (assume P1 Pn are in order left to right). Drop anywildcards of the form {n} or {n-m} which appear at the ends of the Pi.
In the example:
P1 = A1A2A3[A4:A5]??B1B2B3(B4|B5)
P2 = 01??C1C2C3{4-7}D1????F1(F2|F3)F4F5 (note that we have dropped the{5})
2.2.2 Step 2. Find the minimum and maximum subsequence offsests
If the pattern is not defined relative to EOF:
o For each sequence Pi, work out the minimum and (for P1) maximumdistance of the start of Pi from the end of Pi-1 (or the start of the file, forP1).
Alternatively, if the pattern is defined relative to EOF:
o For each sequence Pi, work out the minimum and (for Pn) maximumdistance of the end of Pi from the start of Pi+1 (or the end of the file, forPn).
8/8/2019 Automatic Format Identification
29/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 29 of 33
In the example:
P1 has minimum and maximum subsequence offsets of 10 (recall that the patternwas defined with an offset of 10 bytes from BOF).
P2 has a minimum subsequence offset of 5 (from the {5} we dropped).
2.2.3 Step 3. Find the longest unambiguous byte sequence in every fragment
For each pattern fragment, pull out longest unambiguous byte sequence (i.e.not containing ??, {n}, {k-m}, (a|b), [!a:b]): P1X PnX (if there is more than onepossible longest byte sequence for PiX, choose one arbitrarily.)
If the pattern is not defined relative to EOF:
o Work out the minimum offset of the start of PiX from the start of Pi.This is the minimum fragment length.
Alternatively, if the pattern is defined relative to EOF:
o Work out the minimum offset of the end of PiX from the end of Pi. Thisis the minimum fragment length.
In the example:
P1X = B1B2B3 (although it could equally well have been A1A2A3). The minimumfragment length is 5 (the length of A1A2A3[A4:A5]??).
P2X = C1C2C3. The minimum fragment length is 2 (the length of 01??).
2.2.4 Step 4. Split the fragments into remaining unambiguous byte sequences
For each Pi, split up the remainder of the sequence (i.e. the part not in PiX) tothe left and the right of PiX according to any occurrences of ? or {n} or {k-m}.This creates arrays of objects PiLj (to the left of PiX) and PiRj (to the right ofPiX) where each of these objects contains one or more unambiguoussequences (i.e. if | occurs in sequence, then list all possibilities). Unlike in theprevious step, these sequences may contain occurrences of the [:] wildcards,and are therefore not technically unambiguous.
For each subsequence PiLj, calculate the minimum and maximum offsets ofthe end of PiLj from the start of PiLj+1 (or the start of PiX, for the rightmost PiLj).
Similarly, for each subsequence PiRj, calculate the minimum and maximumoffsets of the start of PiRj from the end of PiRj-1 (or the end of PiX, for the
leftmost PiRj).
8/8/2019 Automatic Format Identification
30/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 30 of 33
In the example:
P1L1 = A1A2A3[A4:A5], with a minimum and maximum offset of 1 (the ??)
P1R
1= B4 or B5, with a minimum and maximum offset of 0 (since P
1R
1follows
directly on from P1X).
P2L1 = 01, with a minimum and maximum offset of 1 (the ??)
P2R1 = D1, with a minimum offset of 4, and a maximum offset of 7 (correspondingto {4-7}).
P2R2 = F1F2F4F5 or F1F3F4F5, with a minimum and maximum offset of 2(corresponding to ????).
2.2.5 Step 5. Calculate the shift distance: the minimum distance between each byte and
the end (or start) of the longest unambiguous byte sequence in its fragment.
For each distinct byte, b, the shift distance Di(b) is equal to the minimumdistance from the end of pattern PiX to the occurrence of that byte in PiX(unless the byte sequence is defined relative to EOF, in which case it is fromthe start of PiX). Any bytes which do not occur in PiX are given a shift distanceequal to the length of PiX + 1.
In the example, the shift distances for P1 are given by:
D1(B3) = 1
D1(B2) = 2
D1(B1) = 3
D1() = 4
The shift distances for P2 are defined similarly.
8/8/2019 Automatic Format Identification
31/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 31 of 33
2.3 Pre-processing glossary
Term Meaning
P A byte sequence pattern as contained in the internal signature
Pi A byte sequence pattern fragment created by splitting thepattern so that it does not contain a *
PiX The longest Pi in P.
Minimumfragment lengthfor PiX
If the pattern is not defined relative to EOF, this is the minimumoffset of the start of PiX from the start of Pi.
If the pattern is defined relative to EOF, this is the minimumoffset of the end of PiX from the end of Pi.
PiLj A byte sequence pattern fragment to the left of PiX
PiRj A byte sequence pattern fragment to the right of PiX
Minimum andmaximumoffsets
For each subsequence PiLj, these are the minimum andmaximum number of bytes between the end of PiLj from thestart of PiLj+1 (or the start of PiX, for the rightmost PiLj).
Similarly, for each subsequence PiRj, these are the minimumand maximum number of bytes between the start of PiRj fromthe end of PiRj-1 (or the end of PiX, for the leftmost PiRj).
Di(byte) The shift distance of that byte. This is defined as the minimumdistance from the end of pattern PiX to the occurrence of that
byte in PiX (unless the byte sequence is defined relative to EOF,in which case it is from the start of PiX).
3 The pattern matching algorithm
The direction in which the pattern matching is carried out is determined by whether thebyte sequence is relative to BOF, EOF or neither. The default in the following algorithm isthat it is carried out from left to right from the beginning of the file, but if byte sequence isrelative to EOF, then the pattern matching will be carried out from right to left, starting at
the end of the file. The latter is described below by text in brackets: (/).
1. We begin by trying to find P1X (/PnX). To this end, commence the search at thebeginning (/end) of file F, at an offset of the minimum subsequence offset plus theminimum fragment length. This is the earliest (/latest) point in the file at which P1X(/PnX) may occur. Take a window on F of the same length as P1X (/PnX) andcompare it with sequence P1LX (/PnX).
2. If the window and sequence dont match, then get the shift distance associated withthe first byte to the right (/left) of the window in F. Shift the window forwards(/backwards) by that many bytes.
8/8/2019 Automatic Format Identification
32/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Page 32 of 33
3. Now repeat step 2 until either a match is found (move on to next step) or until eitherthe end (/beginning) of the file or the maximum offset is reached (byte sequencefails).
4. Check for matches to the P1L
iand P
1R
i(/P
nL
iand P
nR
i) to see whether a match for
the whole of P1 (/Pn) can be found. For each sequence, start from smallest possibleoffset and stop as soon as a match is found so that we end up with the shortestpossible sequence in F that matches the pattern. If matches are found for all thesesequences, then record the location of the rightmost (/leftmost) byte for match of P1(/Pn) as this will determine where the search for the next pattern fragment starts. If nomatch is found for P1 (/Pn), then search for next possible occurrence of P1X (/PnX) asin steps 2 and 3 until either the end of the file or the maximum offset is reached.
5. Now that we have found a match for the whole of P1, repeat steps 1 to 4 for theremaining patterns P2 to Pn. In each case however, do not begin searching at the
start (/end) of the file, but at the rightmost (/leftmost) byte of the last pattern found, asrecorded in Step 4 (plus (/minus) the minimum subsequence offset for the pattern).Continue until all patterns have been found (i.e. positive match) or until the end(/start) of the file is reached (i.e. no match).
Note that because we search for the patterns in order (from left to right, in the defaultcase), and always find the earliest occurrence of each pattern, there is never any need tobacktrack using this algorithm. If we cant find pattern P3, say, we know that this hasnothing to do with the placement of patterns P1 and P2.
An activity diagram for the pattern matching is given below (this corresponds to the
comparison of the file with an internal signature step in the main activity diagram insection 3.1).
8/8/2019 Automatic Format Identification
33/33
Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID
Compare file with internal signature
No Yes
Get list of byte
sequences for internalsignature
Loop through
byte
sequences
Is it offset fromEOF?
Start at the end of the fileand the byte sequence
and move backwardsthrough them
Start at the beginning ofthe file and the byte
sequence and moveforwards through them
Does PLi
sequencematch the
"window" ?
Do PLiL and
PLiR
sequences
match?
Yes
No
Yes
No
No
Shift window according to
"shift distance"
No
End of byte
sequence
fragments
loop?
No
End of byte
sequence
loop?
No
Yes
File matches internal
signature
Loop through
byte sequence
fragments
Set up "window" on
data from file Is end of file ormaximum
offset
reached?
Yes
File does not match
internal signature