Automatic Format Identification

8/8/2019 Automatic Format Identification

1/33

Digital Preservation

Technical Paper:

1

Automatic Format Identification Using

PRONOM and DROID


2/33

Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID

Page 2 of 33

Document Control

Author: Adrian Brown, Head of Digital Preservation

Document Reference: DPTP-01

Issue: 2

Issue Date: 7 March 2006

THE NATIONAL ARCHIVES 2006


3/33


Page 3 of 33

Contents

1 INTRODUCTION .....................................................................................................................4

2 THE PRONOM FORMAT SIGNATURE MODEL ..............................................................6

2.1 External signatures...........................................................................................................72.2 Internal signatures ............................................................................................................7

3 THE DROID FORMAT IDENTIFICATION PROCESS ....................................................13

3.1 The format identification algorithm ............................................................................133.2 The DROID signature file ...............................................................................................163.3 The DROID file collection file .......................................................................................18

4 BIBLIOGRAPHY ...................................................................................................................21

APPENDICES ................................................................................................................................22

1 XML SCHEMAS ....................................................................................................................22

1.1 Signature file schema.....................................................................................................221.2 File collection file schema ............................................................................................26

2 PRE-PROCESSING SIGNATURES ..................................................................................28

2.1 Background for pattern matching algorithm ...........................................................282.2 Pre-processing of the signature file...........................................................................282.3 Pre-processing glossary ...............................................................................................31

3 THE PATTERN MATCHING ALGORITHM ......................................................................31


4/33


Page 4 of 33

1 Introduction

This document is one in a series of technical papers produced by the Digital Preservation

Department of The National Archives (TNA), covering detailed technical issues related to the

preservation and management of electronic records. This technical paper describes themethodology developed and implemented by TNA to automatically identify the formats in which

digital objects are encoded.

For the purposes of this document, a format is defined as follows:

The internal structure and encoding of a digital object, which allows it to be processed, or to

be rendered in human-accessible form. A digital object may be a file, or a bitstream

embedded within a file.

A distinction must be drawn between format identification and format validation. Identification

simply ascertains the format in which an object purports to be encoded, whereas validation ensuresthat the object is fully conformant to the format specification. As such, validation actually provides

the most secure form of identification. However, validation is a technically complex process and is

most efficient if it is informed by prior identification of the format.

Identifying and validating the format in which a digital object is encoded is a fundamental

prerequisite to accessing and managing the object. This knowledge is required both by human

users and IT systems without it, a digital object cannot be used or preserved. For the majority of

the time, this requirement is not obvious most people use a limited set of software tools, within a

consistent IT environment which is able to correctly assign the correct software to a given object.

Even so, at some point most users encounter the problem of attempting to access a file in a format

which is unknown to them, and unrecognised by either the operating system or any availablesoftware. It is also not uncommon to encounter a file which appears to be in a known format, but

cannot be opened by the relevant software.

However, it is managed information environments, such as Electronic Records Management

Systems (ERMS), or digital preservation facilities, which impose the most rigorous requirements

for identifying and validating formats. In these scenarios, it is essential to understand the precise

format and version in which every stored object is encoded. It is not enough to know that an object

is a Microsoft Word document, for example; the exact version must be identified. It is equally

essential to ensure that stored objects are valid, and do conform to the appropriate specification.

Invalid objects can arise through the use of poor-quality software tools, or as a result of accidentalor deliberate corruption.

The presumption is made that any format identification or validation method must be automated.

Although manual identification is possible, given sufficiently detailed knowledge of how a given

format specification and structure is expressed in hexadecimal values, this is clearly not

recommended as a practical approach under normal circumstances. This technical report describes

an identification method developed by TNA, which uses automated analysis of the binary structure

of a digital object, and comparison with predefined internal and external signatures for specific

formats. This approach has been implemented by TNA as a software application called DROID

(Digital Record Object Identification), which uses signature information stored in the PRONOM

technical registry. PRONOM and DROID are both freely available on the web athttp://www.nationalarchives.gov.uk/pronom. This method is currently limited to identification;
http://www.nationalarchives.gov.uk/pronomhttp://www.nationalarchives.gov.uk/pronom


5/33


Page 5 of 33

full object characterisation, including validation and property extraction, is intended as a future

enhancement.


6/33


2 The PRONOM format signature model

The signature information required to perform automatic format identification is stored in the

PRONOM technical registry. This section describes how signatures are modelled within

PRONOM.

A format signature is any collection of characteristics which may be used to indicate the format of

a digital object. This signature may be external or internal to the actual object bitstream. The

PRONOM technical registry contains detailed information about individual formats, including

their associated signatures. A simplified version of the data model used by PRONOM to describe

the relationships between formats and their signatures is illustrated in the following UML class

diagram:

Each format may have multiple associated internal and external signatures, with each internal

signature comprising one or more byte sequences. Multiple relationships may also be defined

between formats. Formats may also be assigned PRONOM Unique Identifiers (PUIDs)1, which

can provide an unambiguous and persistent binding between the format identification of a given

object, and the description of that format in PRONOM. Each type of signature is described in more

detail in the following sections.

1

Page 6 of 33

: See Brown (2005)


7/33


Page 7 of 33

2.1 External signatures

External signatures encompass all format indicators which are external to the object bitstream,

such as Macintosh data forks and Windows file extensions.

In some operating systems, such as Microsoft DOS and Windows, an external signature is

provided by the file extension. This indicates the format of the object, e.g. Mydoc.doc for a

Microsoft Word document, and Mypic.tif for a TIFF image. However, the primary function of the

extension is not to identify the format, but rather to indicate to the operating system the default

software package which should be used to open the file. As a result, file extensions suffer from

three disadvantages as a method of identification:

Extensions are not necessarily unique to a single format. For example, the .wks extension is

used for both Lotus 1-2-3 worksheets and Microsoft Works documents.

They do not provide sufficient granularity to adequately identify format versions. Forexample, the .doc extension does not distinguish between a Word 6 document and a Word

2003 document, although these are significantly different formats.

Extensions can be defined or altered by users. For example, a user of WordPerfect 5.1

might choose to distinguish letters and memos by using .let and .mem extensions

respectively, overriding the default extension assigned by the system.

External signatures are best suited to providing a general indication of the file format, and should

not be relied upon for definitive identification. PRONOM records known external signatures

which are associated with a given format. However, in DROID, these are accorded much less

weight than internal signatures in the identification process. External signatures are described in

PRONOM as type/value pairs; currently file extensions are the only permitted type.

2.2 Internal signatures

Internal signatures encompass all format indicators which are contained within the object

bitstream. By definition, a file format specification imposes a specific structure upon the content of

the bitstream, which is consistent between all digital objects in that format. The characteristics of

this structure may therefore be used as a signature for identifying the format. Some formats, such

as PNG, include a signature specifically designed to allow identification; in other cases, anincidental signature can be derived from elements of the format structure. Longer signature

sequences are clearly preferable to shorter ones, since they reduce the possibility of false

identification through coincidental sequences appearing in a bitstream. Even a deliberate signature,

such as PNG, may be enhanced using additional structural elements, to provide a more secure

identification.

In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a

continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature

byte sequence is modelled by describing its starting position within a bitstream and its value.

The starting position can be one of two basic types:


8/33


Page 8 of 33

Absolute: the byte sequence starts at a fixed position within the bitstream. This position is

described as an offset from either the beginning or the end of the bitstream. The byte

sequence can therefore be located by moving to the specified offset, counting from either

the Beginning of File (BOF) or End of File (EOF) position. If counting from the EOF

position, the offset is to the final byte in the sequence.

Variable: the byte sequence can start at any offset within the bitstream. The byte sequence

can be located by examining the entire bitstream.

The value of the byte sequence is defined as a sequence of hexadecimal values, optionally

incorporating any of the following regular expressions:

??: wildcard matching any pair of hexadecimal values (i.e. a single byte).

e.g.: 0x0A FF ?? FE would match 0x0A FF 6C FE or 0x0A FF 11 FE.

*: wildcard matching any number of bytes (0 or more).

e.g.: 0x0A FF * FE would match 0x0A FF 6C FE or 0x0A FF 6C 11 FE.

{n}: wildcard matching n bytes, where n is an integer.

e.g.: 0x1C 20 {2} 4E 12 would match 0x1C 20 FF 15 4E 12.

{m-n}: wildcard matching between m-n bytes inclusive, where m and n are integers or *.

e.g.: 0x03 {1-2} 4D would match 0x03 3C 4D or 0x03 3C 88 4D.

e.g.: 0x03 {2-*} 4D would match 0x03 3C 88 4D or 0x03 3C 88 3F 4D.

(a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a

hexadecimal byte sequence of arbitrary length containing no wildcards.

e.g.: 0x0E (FF|FE) 17 would match 0x0E FF 17 or 0x0E FE 17.

[a:b]: wildcard matching any sequence of bytes which lies lexicographically between a

and b, inclusive (where both a and b are byte sequences of the same length, containing nowildcards, and where a is less than b). The endian-ness ofa and b are the same as the

endian-ness of the signature as a whole.

e.g. 0xFF [09:0B] FF would match 0xFF 09 FF, 0xFF 0A FF or 0xFF 0B FF.

[!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte

sequence containing no wildcards).

e.g. 0xFF [!09] FF would match 0xFF 0A FF, but not 0xFF 09 FF.


9/33


Page 9 of 33

[!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically

between a and b, inclusive (where a and b are both byte sequences of the same length,

containing no wildcards, and where a is less than b).

e.g. 0xFF [!01:02] FF would match 0xFF 00 FF and 0xFF 03 FF, but not 0xFF 01FF or 0xFF 02 FF.

Note: In the examples above, spaces are included between byte values for reasons of clarity, but

are omitted in actual byte sequence values. The signature is processed left-to-right if the signature

is measured relative to BOF and right-to-left if measured relative to EOF. The endian-ness of the

signature is only relevant for sequences inside square brackets.

A byte sequence must contain a fixed subsequence of at least one byte between each occurrence of

*, or between the beginning or end of the sequence and an occurrence of *. Thus, sequences of

the following form are not permitted:

[BOF] (a|b)*

*(a|b) [EOF]

*(a|b)*

The syntax of a byte sequence may be expressed in formal BNF notation as follows:

::= |

::= |

::= |

::= ?? | { } | { -

} | ( |

) | [ : ] | [!

: ] | [!

]

::= * | { - * }

::= |

::=

::= |

::= |

::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

| 9


10/33


Page 10 of 33

::= A | B | C | D | E | F

The order in which byte sequences are recorded within a signature is arbitrary, and does not imply

any order or prioritisation within the bitstream. For example, if a signature contains two variable-

position sequences, these may appear in any order within a given bitstream. In cases where order is

significant (e.g. Sequences 1 and 2 are both variable-position, but Sequence 2 must appear after

Sequence 1 in the bitstream), this would be recorded by combining the relevant sequences into a

single sequence, separated by the appropriate regular expressions. For the purposes of

identification, a Boolean And relationship is assumed between every sequence in a signature, i.e.

a given object must match every sequence in the signature to match the signature.

2.2.1 Signature specificity

The level of granularity at which internal signatures can be defined may vary. In many cases, it is

possible to define internal signatures which are specific to a single version of a given format.However, in other cases a single signature may be common to more than one version of a format

(e.g. TIFF), or even to a class of formats (e.g. the OLE2 Compound Document Format). Signatures

which are unique to one format record in PRONOM are termed specific signatures, whereas

those which are common to multiple format records are termed generic signatures. No format

may have both a generic and a specific internal signature. This distinction is reflected in the

identification result: a match with a specific signature is recorded as a positive specific

identification, and a match with a generic signature is recorded as a positive generic

identification.

2.2.2 Format priority

It is possible for subtype and supertype relationships to exist between formats. For example, SVG

can be considered as a subtype of XML. Since an SVG document is also an XML document, it

would match the internal signatures for both formats. However, it is desirable that only the most

specific identification be reported. To enable this, the PRONOM data model allows a priority

relationship to be defined between two formats. Where such a relationship exists, and a given

object matches signatures for both formats, the lower priority identification is discarded.


11/33


Page 11 of 33

2.2.3 Example byte sequences

The possible scenarios can be illustrated as follows (note that all offsets are calculated starting

from 0):

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

1

Byte sequence 1 occurs at an absolute offset from the BOF. It would therefore be described as

follows:

System ID: 1

Position Type: Absolute from BOF

Offset: 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

2

Byte sequence 2 occurs at an absolute offset from the EOF. It would therefore be described as

follows:

System ID: 2

Position Type: Absolute from EOF

Offset: 2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

3

Byte sequence 3 occurs at a variable offset. It would therefore be described as follows:

System ID: 3

Position Type: Variable

Offset: n/a

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

4a 4b

Byte sequence 4 includes two subsequences. Sequence 4a appears at a variable offset, and 4b

appears at a fixed relative offset to 4a. It would therefore be described as follows:

System ID: 4

Position Type: Variable

Offset: n/a

Value: 4a{2}4b


12/33


Page 12 of 33

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

5a 5b

Byte sequence 5 includes two subsequences. Sequence 5a appears at an absolute offset, and 5b

appears at a variable relative offset to 5a. It would therefore be described as follows:

System ID: 5

Position Type: Absolute from BOF

Offset: 4

Value: 5a{*}5b


13/33


3 The DROID format identification process

The DROID software tool uses signature information stored in PRONOM to performautomatic format identification. This section describes the identification algorithm

employed by DROID, and the formats of the XML files used by DROID to describesignatures and record the results of the identification process.

3.1 The format identification algorithm

The format identification process employed by DROID is illustrated in the following activitydiagram:

Loop through listof internalsignatures

Does filematch

signature?No

Ye s

Ye s

Compare file with internalsignature

(step expanded separately)

No

Add signature tolist of passes

end of internalsignature loop?

Ye s

Add related fileformats to list ofpossible passes

(recordingwhether specific

or generic hit)

Loop throughpossible hits

Discard any possiblehits which this one has

priority over

end of hitsloop?

No

Loop throughpositive hits

end of fileformat loop?

Does fileformat have anextension thatmatches file?

Ye s

NoAdd a warningabout file

extension to thispositive hit

No

Read file into memory

Are there anypositive hits?

No

Ye s

Generate list of fileformats with the

same extension asthe file and with no

internal signature

This file isclassified as"unidentified"

Is the list ofhits empty? Ye s

No

Is file em pty? Yes

No

File is classifiedas "unidentified:zero-length file"

Reclassify remainingpossible passes as positive

hits (specific or generic)

Does file havean extension?

Ye s

No

All file formats inthis list are

"tentative hits"

If a file is too big to hold in memory in its entirety, the first step is skipped. Instead, the fileis read into memory a chunk at a time. To be specific, whenever the algorithm needs to

access byte nof the file, a check is made to see whether byte nis already in memory. If

Page 13 of 33


14/33


Page 14 of 33

not, the currently buffered portion of the file is discarded, and a new portion, containingbyte nis loaded in its place. The size of the buffer used is currently 1000000 bytes.

The algorithm uses the word file throughout but streams are identified using the samealgorithm. They are read into memory at the start of the process. If this results in runningout of memory then the memory contents are written to temporary file and the rest of thestream is appended on the end of that file. In that case, the algorithm then runs on thattemporary file.

A file is first compared with the set of internal signatures, and any matches are classifiedas positive identifications. These are further classified as specific or generic inaccordance with the specificity rules defined in Section 2.2.1. If any matches have priorityover other matches then the lower priority matches are discarded in accordance with thepriority rules defined in Section 2.2.2. If the file matches one or more internal signaturesthen its file extension is checked against any allowed external signatures; if the file

extension does not match any external signatures then a warning of a possible fileextension mismatch is generated, but the identification result is not otherwise affected. Ifthe file has no positive results against internal signatures then it is compared with the setof external signatures, and any matches are classified as tentative identifications. If thefile does not match any internal or external signatures then it is classified as a negativeidentification.

The possible results of an identification process can be summarised as follows:

Status Warning CommentsPositive (Specific) The file matches a specific

internal signaturePositive (Specific) Possible file extension mismatch The file matches a specific

internal signature but the fileextension does not match anyassociated external signatures

Positive (Generic) The file matches a genericinternal signature

Positive (Generic) Possible file extension mismatch The file matches a genericinternal signature but the fileextension does not match anyassociated external signatures

Tentative The file matches an externalsignature but no internalsignatures

Negative The file does not match anyinternal or external signatures

The pattern matching algorithm employed to undertake the internal signatureidentification is described in detail in Appendix C.


15/33


Page 15 of 33

3.1.1 Identification example

PRONOM contains information on five formats. The first two (Format A1 and FormatA2) each have their own specific internal signatures consisting of a single bytesequence. The third file format (Format B) has no internal signature. The last two(Format C1 and Format C2) both share the same generic internal signature. All fileformats can have an external signature in the form of the file extension txt. They all alsohave their own specific extension. These are respectively fa1, fa2, fb, fc1, fc2.

A set of files are processed by DROID using these signatures, with the following results:

aFile.fa1 passes the internal signature for Format A1, but fails the others. It isclassified as a Positive Specific hit on Format A1.

bFile.fa1 passes the internal signature for Format A2, but fails the others. It is

classified as a Positive Specific hit on Format A2. A warning is given, becausethe file extension is not associated with this format.

cFile.fa1 matches the internal signatures for both Format A1 and Format A2. Itis therefore classified as a possible Positive Specific hit on both formats.However, Format A2 is defined as having priority over Format A1, so the hit onFormat A1 is discarded. The final output for this file is therefore a PositiveSpecific hit on Format A2 with a warning about the file extension being wrong.

dFile.fa1 fails all internal signatures. Despite having a valid extension for FormatA1, it is not classified as a tentative hit as it has failed the internal signature. It is

therefore classified as a Negative identification.

eFile.txt fails all internal signatures. Despite having a valid extension for all fileformats, it is only classified as a Tentative hit on Format B as it has failed theinternal signatures for all other formats.

fFile.xxx passes the internal signature for Format A2. The file extension isunknown, so this is classified as a Positive Specific identification but with awarning about the file extension.

gFile.fb passes the internal signature for Format A2. Even though the fileextension corresponds to Format B, the comparison is never made because apositive identification has already been made on the internal signature for FormatA2. This file is classified as a Positive Specific identification on Format A2 butwith a warning about the file extension.

hFile.xxx fails all internal signatures. The file extension is unknown, so this isclassified as a Negative identification.

iFile.txt passes the third internal signature and fails the other two. It is classifiedas a Positive Generic hit on Format C1 and Format C2 (no indication is made

that the extension matches the other file formats).


16/33


Page 16 of 33

jFile.fc1 passes the third internal signature and fails the other two. It is classifiedas a Positive Generic hit on Format C1 and Format C2. The hit on FormatC2 also has a warning about the file extension.

kFile.txt passes the all three internal signatures. It is classified as a PositiveSpecific hit on Format A1 and Format A2. By the same argument as forcFile.fa1, the hit on Format A1 is discarded. This file is also a PositiveGeneric hit on Format C1 and Format C2.

3.2 The DROID signature file

The signature information contained in PRONOM must be exported in a form which canbe used by the DROID tool to perform automatic format identification. A DROID signaturefile contains all the information required by DROID to identify formats, and takes the form

of an XML document which complies with the schema described in Appendix A.1. Theinternal signatures contained in the signature file have been pre-processed in accordancewith the method described in Appendix B. The resultant signature file contains thefollowing information:

A list of file formats and for each one, the internal system identifier, format name,format version, format PUID, and a collection of identifiers for all internalsignatures associated with the file format, as well as the collection of externalsignatures. Any priority relationships over other formats are also recorded.

A list of internal signatures and for each one, a unique identifier and a collection of

byte sequences.

Each byte sequence has a position type (absolute from BOF, absolute from EOF,or variable) and possibly a maximum offset. Each sequence is made up of a seriesof subsequences, each represented by its longest unambiguous byte sequence, itsminimum offset, its shift distance function and a series of left and right sequencefragments (see Appendix B for further details).

Each sequence fragment has a position number, an unambiguous byte sequenceand minimum and maximum offsets (see Appendix B for further details).

3.2.1 Example signature file

The following example DROID Signature File describes the five example formatsdescribed in Section 3.1.1.


17/33


Page 17 of 33

B1B2B3B4

4

3

2

1

0

A1A2A3

C1C2C3

3

2

1

0

B1B2B3

3

2

1

0

A1A2A3

B4

B5

C1C2C3

3

2

1

0

01D1

F1F2F4F5

F1F3F4F5

........

........


18/33


Page 18 of 33

15

txt

fa1

16

txt

fa2

1

txt

fb

17

txt

fc1

17

txt

fc2

3.3 The DROID file collection file

The DROID file collection file contains the list of files selected for identification, and theresults of a DROID identification process. It is generated by the DROID software andtakes the form of an XML document. If saved prior to the identification process beingperformed, it contains the following information:

The DROID version number

The signature file version number

The list of the files selected for identification

If saved following identification, it contains the following information:

The DROID version number


19/33


Page 19 of 33

The signature file version number

The list of the files submitted to the identification process

For each file, a list of file format identifications

For each file format identification, the name, version and PUID of the file format,the identification status (e.g. positive or tentative), and any relevant warnings orerrors.

3.3.1 Example file collection file

The following example DROID File Collection File describes the results of the identification

process described in Section 3.1.1.

V1.1

3

2006-02-09T11:59:52

C:\testfiles\aFile.fa1

Positive (Specific Format)

Format A1

V1.1


20/33


Page 20 of 33

C:\testfiles\eFile.txt

Tentative

Format B

V0.0


21/33


Page 21 of 33

V2


22/33


Page 22 of 33

Appendices

1 XML schemas

1.1 Signature file schema

The XML schema definition for the signature file is as follows:

The PRONOM File Format Signature Specification Schema

Copyright The National Archives 2006. All rights reserved.

Define ID as key (ensuring they are also unique)

Define ID as key (ensuring they are also unique)

Ensure file formats refer to other formats that exist

Ensure file formats refer to signatures that exist


23/33


Page 23 of 33


24/33


Page 24 of 33

use="required"/>


25/33


Page 25 of 33

A byte expressed as hexadecimal

One or more bytes expressed as hexadecimal

One or more bytes expressed as hexadecimal, with patterns.

Eg. [A1:A4]


26/33


Page 26 of 33

1.2 File collection file schema

The XML schema definition for the file collection file is as follows:

The PRONOM File Collection Schema

Copyright The National Archives 2006. All rights reserved.


27/33


Page 27 of 33


28/33


Page 28 of 33

2 Pre-processing signatures

2.1 Background for pattern matching algorithm

The pattern matching algorithm is based on the Boyer-Moore-Horspool (BMH) algorithmfor string matching (Horspool, 1980). The BMH algorithm does not allow for wildcardcharacters such as defined above, and so a certain amount of pre-processing needs tobe performed on the internal signatures before the algorithm can be applied. Thesignatures are stored in unprocessed form in the PRONOM registry, and pre-processingis automatically applied as part of the generation of a new signature file.

2.2 Pre-processing of the signature file

The following pre-processing is required on each byte sequence to be used in the patternmatching. To simplify the exposition, we will use the following sequence, defined with anoffset of 10 bytes from the beginning of the file as a running example:

A1A2A3[A4:A5]??B1B2B3(B4|B5)*{5}01??C1C2C3{4-7}D1????F1(F2|F3)F4F5

2.2.1 Step 1. Split the byte sequence pattern into fragments to remove all *

Split up each byte sequence pattern P into smaller fragments (P1 - Pn)wherever a * is found (assume P1 Pn are in order left to right). Drop anywildcards of the form {n} or {n-m} which appear at the ends of the Pi.

In the example:

P1 = A1A2A3[A4:A5]??B1B2B3(B4|B5)

P2 = 01??C1C2C3{4-7}D1????F1(F2|F3)F4F5 (note that we have dropped the{5})

2.2.2 Step 2. Find the minimum and maximum subsequence offsests

If the pattern is not defined relative to EOF:

o For each sequence Pi, work out the minimum and (for P1) maximumdistance of the start of Pi from the end of Pi-1 (or the start of the file, forP1).

Alternatively, if the pattern is defined relative to EOF:

o For each sequence Pi, work out the minimum and (for Pn) maximumdistance of the end of Pi from the start of Pi+1 (or the end of the file, forPn).


29/33


Page 29 of 33

In the example:

P1 has minimum and maximum subsequence offsets of 10 (recall that the patternwas defined with an offset of 10 bytes from BOF).

P2 has a minimum subsequence offset of 5 (from the {5} we dropped).

2.2.3 Step 3. Find the longest unambiguous byte sequence in every fragment

For each pattern fragment, pull out longest unambiguous byte sequence (i.e.not containing ??, {n}, {k-m}, (a|b), [!a:b]): P1X PnX (if there is more than onepossible longest byte sequence for PiX, choose one arbitrarily.)

If the pattern is not defined relative to EOF:

o Work out the minimum offset of the start of PiX from the start of Pi.This is the minimum fragment length.

Alternatively, if the pattern is defined relative to EOF:

o Work out the minimum offset of the end of PiX from the end of Pi. Thisis the minimum fragment length.

In the example:

P1X = B1B2B3 (although it could equally well have been A1A2A3). The minimumfragment length is 5 (the length of A1A2A3[A4:A5]??).

P2X = C1C2C3. The minimum fragment length is 2 (the length of 01??).

2.2.4 Step 4. Split the fragments into remaining unambiguous byte sequences

For each Pi, split up the remainder of the sequence (i.e. the part not in PiX) tothe left and the right of PiX according to any occurrences of ? or {n} or {k-m}.This creates arrays of objects PiLj (to the left of PiX) and PiRj (to the right ofPiX) where each of these objects contains one or more unambiguoussequences (i.e. if | occurs in sequence, then list all possibilities). Unlike in theprevious step, these sequences may contain occurrences of the [:] wildcards,and are therefore not technically unambiguous.

For each subsequence PiLj, calculate the minimum and maximum offsets ofthe end of PiLj from the start of PiLj+1 (or the start of PiX, for the rightmost PiLj).

Similarly, for each subsequence PiRj, calculate the minimum and maximumoffsets of the start of PiRj from the end of PiRj-1 (or the end of PiX, for the

leftmost PiRj).


30/33


Page 30 of 33

In the example:

P1L1 = A1A2A3[A4:A5], with a minimum and maximum offset of 1 (the ??)

P1R

1= B4 or B5, with a minimum and maximum offset of 0 (since P

1R

1follows

directly on from P1X).

P2L1 = 01, with a minimum and maximum offset of 1 (the ??)

P2R1 = D1, with a minimum offset of 4, and a maximum offset of 7 (correspondingto {4-7}).

P2R2 = F1F2F4F5 or F1F3F4F5, with a minimum and maximum offset of 2(corresponding to ????).

2.2.5 Step 5. Calculate the shift distance: the minimum distance between each byte and

the end (or start) of the longest unambiguous byte sequence in its fragment.

For each distinct byte, b, the shift distance Di(b) is equal to the minimumdistance from the end of pattern PiX to the occurrence of that byte in PiX(unless the byte sequence is defined relative to EOF, in which case it is fromthe start of PiX). Any bytes which do not occur in PiX are given a shift distanceequal to the length of PiX + 1.

In the example, the shift distances for P1 are given by:

D1(B3) = 1

D1(B2) = 2

D1(B1) = 3

D1() = 4

The shift distances for P2 are defined similarly.


31/33


Page 31 of 33

2.3 Pre-processing glossary

Term Meaning

P A byte sequence pattern as contained in the internal signature

Pi A byte sequence pattern fragment created by splitting thepattern so that it does not contain a *

PiX The longest Pi in P.

Minimumfragment lengthfor PiX

If the pattern is not defined relative to EOF, this is the minimumoffset of the start of PiX from the start of Pi.

If the pattern is defined relative to EOF, this is the minimumoffset of the end of PiX from the end of Pi.

PiLj A byte sequence pattern fragment to the left of PiX

PiRj A byte sequence pattern fragment to the right of PiX

Minimum andmaximumoffsets

For each subsequence PiLj, these are the minimum andmaximum number of bytes between the end of PiLj from thestart of PiLj+1 (or the start of PiX, for the rightmost PiLj).

Similarly, for each subsequence PiRj, these are the minimumand maximum number of bytes between the start of PiRj fromthe end of PiRj-1 (or the end of PiX, for the leftmost PiRj).

Di(byte) The shift distance of that byte. This is defined as the minimumdistance from the end of pattern PiX to the occurrence of that

byte in PiX (unless the byte sequence is defined relative to EOF,in which case it is from the start of PiX).

3 The pattern matching algorithm

The direction in which the pattern matching is carried out is determined by whether thebyte sequence is relative to BOF, EOF or neither. The default in the following algorithm isthat it is carried out from left to right from the beginning of the file, but if byte sequence isrelative to EOF, then the pattern matching will be carried out from right to left, starting at

the end of the file. The latter is described below by text in brackets: (/).

1. We begin by trying to find P1X (/PnX). To this end, commence the search at thebeginning (/end) of file F, at an offset of the minimum subsequence offset plus theminimum fragment length. This is the earliest (/latest) point in the file at which P1X(/PnX) may occur. Take a window on F of the same length as P1X (/PnX) andcompare it with sequence P1LX (/PnX).

2. If the window and sequence dont match, then get the shift distance associated withthe first byte to the right (/left) of the window in F. Shift the window forwards(/backwards) by that many bytes.


32/33


Page 32 of 33

3. Now repeat step 2 until either a match is found (move on to next step) or until eitherthe end (/beginning) of the file or the maximum offset is reached (byte sequencefails).

4. Check for matches to the P1L

iand P

1R

i(/P

nL

iand P

nR

i) to see whether a match for

the whole of P1 (/Pn) can be found. For each sequence, start from smallest possibleoffset and stop as soon as a match is found so that we end up with the shortestpossible sequence in F that matches the pattern. If matches are found for all thesesequences, then record the location of the rightmost (/leftmost) byte for match of P1(/Pn) as this will determine where the search for the next pattern fragment starts. If nomatch is found for P1 (/Pn), then search for next possible occurrence of P1X (/PnX) asin steps 2 and 3 until either the end of the file or the maximum offset is reached.

5. Now that we have found a match for the whole of P1, repeat steps 1 to 4 for theremaining patterns P2 to Pn. In each case however, do not begin searching at the

start (/end) of the file, but at the rightmost (/leftmost) byte of the last pattern found, asrecorded in Step 4 (plus (/minus) the minimum subsequence offset for the pattern).Continue until all patterns have been found (i.e. positive match) or until the end(/start) of the file is reached (i.e. no match).

Note that because we search for the patterns in order (from left to right, in the defaultcase), and always find the earliest occurrence of each pattern, there is never any need tobacktrack using this algorithm. If we cant find pattern P3, say, we know that this hasnothing to do with the placement of patterns P1 and P2.

An activity diagram for the pattern matching is given below (this corresponds to the

comparison of the file with an internal signature step in the main activity diagram insection 3.1).


33/33


Compare file with internal signature

No Yes

Get list of byte

sequences for internalsignature

Loop through

byte

sequences

Is it offset fromEOF?

Start at the end of the fileand the byte sequence

and move backwardsthrough them

Start at the beginning ofthe file and the byte

sequence and moveforwards through them

Does PLi

sequencematch the

"window" ?

Do PLiL and

PLiR

sequences

match?

Yes

No

Yes

No

No

Shift window according to

"shift distance"

No

End of byte

sequence

fragments

loop?

No

End of byte

sequence

loop?

No

Yes

File matches internal

signature

Loop through

byte sequence

fragments

Set up "window" on

data from file Is end of file ormaximum

offset

reached?

Yes

File does not match

internal signature

Documents

Automatic Format Identification