SCAPE
Carl Wilson, Open Planets Foundation
SCAPE Training Guimarães
Characterisation 101: An introduction to the identification and characterisation of file formats.
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
SCAPE About Us
• Carl Wilson
  • Open Planets Foundation
  • http://www.openplanetsfoundation.org
• SCAPE Project
  • EU-funded research project
  • SCAlable Preservation Environments
  • http://www.scape-project.eu
SCAPE About You
• Once Around The Room
  • Name
  • Where you work
  • What you do
  • Why you’re here
• DO Ask Questions
  • Or tell me to slow down…
  • Or ask me to repeat something…
SCAPE File Formats
• What is a File Format?
  • A “standard” method of encoding data for storage.
  • May follow an open specification
  • Or a proprietary one (open is preferred)
  • Or simply follow a loosely documented convention
SCAPE Who Cares About Formats?
• Operating Systems: to open a file with an application that can interpret/render it.
• Web Servers: to negotiate Content-Type in HTTP requests.
• Memory Institutions: to identify software stacks that can render or extract meaning from a file, now or at a later date.
• More Generally: everyone with digital content, whether they know it or not.
SCAPE Some Uses of Format Information
• Format Information:
  • Associates a file with software that can interpret and/or render its contents
  • Can be used to find documentation / specifications to help interpret a file’s contents
  • Is a first step to preservation planning: knowing what you have…
SCAPE File Name Extension
• A file name suffix, separated from the file base name by a dot “.”.
• Examples: .pdf, .txt, .jpg, .doc, .docx
• This has worked for a number of years BUT
  • Any user with the right permission can change a file extension
  • Bytes aren’t always transferred with a name
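A minimal Python sketch of extension-based identification, to show why it is fragile. The lookup table and the function name `guess_from_extension` are made up for illustration; real systems use much larger tables.

```python
from pathlib import PurePath

# Hypothetical lookup table for illustration only.
EXTENSION_HINTS = {
    ".pdf": "PDF document",
    ".txt": "Plain text",
    ".jpg": "JPEG image",
    ".doc": "Word document (binary)",
    ".docx": "Word document (OOXML)",
}


def guess_from_extension(file_name):
    """Return a format hint based only on the file name, or None."""
    suffix = PurePath(file_name).suffix.lower()
    return EXTENSION_HINTS.get(suffix)


# Renaming the file changes the guess even though the bytes are unchanged,
# and a stream of bytes with no name gives nothing to work with.
print(guess_from_extension("report.pdf"))
print(guess_from_extension("report.txt"))
print(guess_from_extension("report"))
```

The guess never looks at the file contents, which is the weakness the bullets above point out.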
SCAPE Internet Media (MIME) Types
• The format identifiers used by the web
• Examples:
  • text/plain
  • text/html
  • image/jpeg
• Don’t readily hold extra information such as format version, but may be extended.
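A short Python sketch of how a media type string breaks down into type, subtype and optional parameters, which is where extra information can be carried. It uses the standard library `email.message.Message` parser. The `version` parameter in the last example is an assumption for illustration, not something the slides define; `charset` is a commonly used real parameter.

```python
from email.message import Message


def parse_media_type(value):
    """Split a media type string into (type, subtype, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    main_type = msg.get_content_maintype()
    sub_type = msg.get_content_subtype()
    # get_params() returns [(type/subtype, ''), (name, value), ...];
    # skip the first entry to keep only the parameters.
    params = dict(msg.get_params()[1:])
    return main_type, sub_type, params


print(parse_media_type("text/plain"))
print(parse_media_type("text/html; charset=utf-8"))
# Illustrative only: a version carried as a parameter.
print(parse_media_type("application/pdf; version=1.4"))
```

The bare type/subtype pair says nothing about the format version; anything more has to travel in a parameter that both sides agree on.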
SCAPE Apple’s Alternatives
• Pre-OS X versions of Mac OS used Creator and Type codes
  • Creator: The software that created the file
  • Type: The type of information, e.g. TEXT
  • More flexible than extensions, but no longer used
• Recent OS X versions also use Uniform Type Identifiers
SCAPE PRONOM Unique Identifiers or PUIDs
• PRONOM is a web based registry of file format information
• Created in 2002 and hosted by The National Archives of the UK
• Uses PUIDs to identify file formats:
  • fmt/15 == Acrobat PDF 1.1
  • fmt/16 == Acrobat PDF 1.2
  • fmt/17 == Acrobat PDF 1.3
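A small Python sketch contrasting PUIDs with MIME types, using only the three PUIDs listed on the slide. The table and the helper name `describe` are illustrative; PRONOM itself holds many more entries.

```python
# PUID-to-name pairs taken from the slide.
PUID_NAMES = {
    "fmt/15": "Acrobat PDF 1.1",
    "fmt/16": "Acrobat PDF 1.2",
    "fmt/17": "Acrobat PDF 1.3",
}

# All PDF versions share a single media type.
PDF_MIME_TYPE = "application/pdf"


def describe(puid):
    """Return (format name, MIME type) for a PUID in the table, else None."""
    name = PUID_NAMES.get(puid)
    if name is None:
        return None
    return name, PDF_MIME_TYPE


for puid in sorted(PUID_NAMES):
    print(puid, describe(puid))
```

Three distinct PUIDs collapse to one MIME type, so the PUID is the identifier that keeps the version.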
SCAPE The Unix File Utility
• A standard Unix program for identifying the data in a file.
• First released in 1973; written in C, so requires operating-system-dependent compilation
• The open source version used in Linux distributions was written in 1986
• Identification based upon compiled “magic” files
• Provides text information about files, or MIME types with the right options
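A simplified Python sketch of the idea behind “magic” rules: look for known byte sequences at known offsets. This is not the real magic file syntax or the File utility's code; the rule list and the `identify` function are assumptions for illustration, using a handful of well-known signatures.

```python
# (offset, signature bytes, description): a simplified stand-in for magic rules.
MAGIC_RULES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"\xff\xd8\xff", "JPEG image"),
    (257, b"ustar", "POSIX tar archive"),
]


def identify(data):
    """Return the description of the first rule whose bytes match."""
    for offset, signature, description in MAGIC_RULES:
        if data[offset:offset + len(signature)] == signature:
            return description
    return "data"  # fallback label when no rule matches


print(identify(b"%PDF-1.4\n"))
print(identify(b"GIF89a" + b"\x00" * 10))
print(identify(b"hello world"))
```

Unlike the extension check, the result depends only on the bytes, so renaming the file does not change it.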
SCAPE FIDO
• Format Identification of Digital Objects
• Open Source format identification tool
• Based upon the PRONOM signature data compiled to regular expressions
• Written in Python so can be run on different Operating Systems
• Richer command line syntax than DROID
SCAPE Apache Tika
• Open Source toolkit for detecting and extracting metadata and structured text from files
• Performs Format Identification and deeper characterisation (more on that later).
• Java based so will run on different platforms.
• Returns MIME types as format identifiers
SCAPE How Do These Tools Identify Formats?
• They exploit “common features” of the format.
• PDF start of file:
  • %PDF-1.1 PDF Version 1.1
  • %PDF-1.2 PDF Version 1.2
  • %PDF-1.6 PDF Version 1.6
• Tika and File simply look for files starting with the string %PDF- and return the MIME type
• FIDO However……
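A Python sketch of the two levels of detail described above: a plain prefix check that yields only a MIME type, and a slightly richer check that also reads the version from the header. Both functions are simplified illustrations with made-up names, not the actual logic of Tika or File.

```python
import re

# Header pattern: "%PDF-" followed by a major.minor version such as 1.4.
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_mime_type(data):
    """Prefix check only: returns a MIME type and ignores the version."""
    return "application/pdf" if data.startswith(b"%PDF-") else None


def pdf_version(data):
    """Read the version digits that follow the %PDF- prefix, if present."""
    match = PDF_HEADER.match(data)
    return match.group(1).decode("ascii") if match else None


sample = b"%PDF-1.6\n% rest of file omitted"
print(pdf_mime_type(sample), pdf_version(sample))
```

The same leading bytes support both answers; the difference is how much of the header the tool chooses to interpret.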
SCAPE FIDO & PDF Identification
• FIDO identifies the different PDF versions, each of which has a PUID
• FIDO also looks for an END OF FILE marker for PDFs: .%%EOF.
• This could be a problem…….
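A Python sketch of a signature that must match at both ends of a file, to show how such a check behaves on different inputs. The slide does not say what the problem is, so this only demonstrates the behaviour. The patterns are written for illustration and are not copied from PRONOM or FIDO.

```python
import re

# Illustrative patterns only.
BOF = re.compile(rb"\A%PDF-1\.[0-9]")      # anchored at the start of the data
EOF = re.compile(rb"%%EOF[\r\n]*\Z")       # anchored at the end of the data


def matches_both_ends(data):
    """True only if the header and the end-of-file marker are both found."""
    return bool(BOF.search(data)) and bool(EOF.search(data))


complete = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
truncated = b"%PDF-1.4\n1 0 obj\n<< >>\n"
trailing = complete + b"\x00\x00extra bytes"

for label, data in [("complete", complete),
                    ("truncated", truncated),
                    ("trailing", trailing)]:
    print(label, matches_both_ends(data))
```

All three samples start with a valid header, so a prefix-only check would accept every one of them; the two-ended check accepts only the first.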