Introduction to Digital Libraries Digital Data


Introduction to Digital Libraries

Digital Data

• Do you still have a copy of your first email?

• Can you still compile and run the first program you ever wrote?

• If Hurricane Isabel had destroyed your computer, how much information would you have lost?

Digital information

http://en.wikipedia.org/wiki/Rosetta_Stone

http://www.rosettaproject.org/about-us/disk/concept

Text

Storage of text: image vs. ASCII

• Document image
– Digital image of page; words represented as patterns of pixels
– Not searchable as text
– Optical character recognition to convert to ASCII (may be error prone)
• ASCII
– Searchable as text; words represented as ASCII codes

"Benign Neglect"

• Hardcopy items:
– benefit from "benign neglect"
– have well-understood methods; e.g.:
  • book->open
  • book->turnPage

000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLOWORLD.
000300
000400*
000500 ENVIRONMENT DIVISION.
000600 CONFIGURATION SECTION.
000700 SOURCE-COMPUTER. RM-COBOL.
000800 OBJECT-COMPUTER. RM-COBOL.
000900
001000 DATA DIVISION.
001100 FILE SECTION.
001200
100000 PROCEDURE DIVISION.
100100
100200 MAIN-LOGIC SECTION.
100300 BEGIN.
100400 DISPLAY " " LINE 1 POSITION 1 ERASE EOS.
100500 DISPLAY "Hello world!" LINE 15 POSITION 10.
100600 STOP RUN.
100700 MAIN-LOGIC-EXIT.
100800 EXIT.

• Softcopy items:

– frequent use leads to migration & replication

– are only understood in specialized, fragile contexts

CCS – Offices

[Figure: document-processing workflow. Stages: Scanning, Import, Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis, Correction, Export. An engine with Input and Output is driven by a RulesDB. Formats shown: METS, ALTO, TIFF, JPEG.]

Minolta scanner

Robot page-turners

Digitizing Line Kirtas Bookscan


Traditional OCR - Output

THE

AMERICAN MISSIONARY.

Vo.. XXXII JANUARY, 1878 No. 1

American Missionary Association

1877 - 1888


information available

Title page

Title of series

Volume number

Issue number

Motto

Date

Oxford University Library Services

• > 660 staff
• 40 libraries
• Budget > £25m (€37m)
• Total bookstock: 11 million items
• 156 miles (250km) of shelving, including repository space

Text Compression

• Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed.

• Compromises:
– Encode-decode time
– Random access to text?

Why Compress?

• To reduce the volume of data to be transmitted (text, images, …)

• To reduce the bandwidth required for transmission and to reduce storage requirements (speech, audio, video)

Text Compression

• Common methods
– Symbol-wise methods
  • Estimate probabilities of symbols, code one at a time, shorter codes for high probabilities (Morse)
  • E.g. Huffman coding
– Dictionary methods
  • Replace words and fragments with dictionary entries (Braille)
  • E.g. Ziv-Lempel compression
• May be static or dynamic

Huffman coding

• Developed in 1950s, widely used
• Static code, variable length
• Based on frequency of occurrence of letters (from English or from body of text)
• Method:
– Sort by falling probabilities; link the 2 symbols with the least probabilities, label the new node with their sum; repeat till you reach a single node with probability of 1
– Walk down the tree from the root to read off the code for each symbol
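The method above can be sketched in a few lines of Python. This is a minimal illustration, not a production coder: the function names (`build_huffman_codes`, `encode`, `decode`) are made up for this example, the frequencies are counted from the text itself, and ties between equal weights are broken arbitrarily by insertion order.

```python
import heapq
from collections import Counter


def build_huffman_codes(text):
    """Return {symbol: bit string} for a non-empty text."""
    freqs = Counter(text)
    # Heap entries are (weight, tie_breaker, tree).
    # A tree is either a single symbol or a (left, right) tuple.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Link the two nodes with the least weight; label with the sum.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            # A text with one distinct symbol still needs a 1-bit code.
            codes[tree] = prefix or "0"

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # codes are prefix-free, so this is unambiguous
            out.append(inverse[current])
            current = ""
    return "".join(out)


sample_text = "abracadabra"
sample_codes = build_huffman_codes(sample_text)
sample_bits = encode(sample_text, sample_codes)
```

The most frequent letter in `sample_text` ("a") receives the shortest code, and decoding `sample_bits` with the same code table recovers the original text.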


• Huffman coding builds a binary tree from the letter frequencies in the message.

–The binary symbols for each character are read directly from the tree.

• Symbols with the highest frequencies end up at the top of the tree, and result in the shortest codes.

7A.2 Statistical Coding: Huffman coding

Huffman code tree

[Figure: Huffman code tree with leaves a, b, c, d, e, f, g; each pair of branches is labeled 0 and 1.]


• The process of building the tree begins by counting the occurrences of each symbol in the text to be encoded.


HIGGLETY PIGGLETY POP
THE DOG HAS EATEN THE MOP
THE PIGS IN A HURRY
THE CATS IN A FLURRY
HIGGLETY PIGGLETY POP


• Next, place the letters and their frequencies into a forest of trees that each have two nodes: one for the letter, and one for its frequency.


• We start building the tree by joining the nodes having the two lowest frequencies.


• And then we again join the nodes with two lowest frequencies.


• And again ....


• Here is our finished tree.

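Example: Huffman Coding

Queue of symbols and frequencies, front first:

l 13, o 18, s 22, n 45, a 45, t 53, e 65

1. Join l (13) and o (18) into a node of weight 31. Queue: s 22, (l o) 31, n 45, a 45, t 53, e 65
2. Join s (22) and (l o) (31) into a node of weight 53. Queue: n 45, a 45, (s (l o)) 53, t 53, e 65
3. Join n (45) and a (45) into a node of weight 90. Queue: (s (l o)) 53, t 53, e 65, (n a) 90
4. Join (s (l o)) (53) and t (53) into a node of weight 106. Queue: e 65, (n a) 90, ((s (l o)) t) 106
5. Join e (65) and (n a) (90) into a node of weight 155. Queue: ((s (l o)) t) 106, (e (n a)) 155
6. Join 106 and 155 into the root, weight 261.

Finished tree: the root (261) has children 106 and 155. Node 106 has children 53 and t; node 53 has children s and 31; node 31 has children l and o. Node 155 has children e and 90; node 90 has children n and a. Each pair of branches is labeled 0 and 1, and the code for a symbol is the sequence of labels on the path from the root to that symbol.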
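As a check on the example frequencies (l 13, o 18, s 22, n 45, a 45, t 53, e 65), the sketch below computes only the code length of each symbol, which is its depth in the finished tree, and compares the total with a fixed 3-bit code for seven symbols. The function name `huffman_code_lengths` is hypothetical; the lengths do not depend on how the 0 and 1 labels are assigned.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    # Each heap entry carries the depth of every symbol in its subtree.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Joining two subtrees pushes all their symbols one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


example_freqs = {"l": 13, "o": 18, "s": 22, "n": 45, "a": 45, "t": 53, "e": 65}
example_lengths = huffman_code_lengths(example_freqs)
huffman_bits = sum(example_freqs[s] * example_lengths[s] for s in example_freqs)
fixed_bits = 3 * sum(example_freqs.values())  # 3 bits cover 7 symbols
```

Here t and e get 2-bit codes, s, n and a get 3-bit codes, and l and o get 4-bit codes, for 696 bits in total against 783 bits at a fixed 3 bits per symbol. The total also equals the sum of the internal node weights in the tree (31 + 53 + 90 + 106 + 155 + 261).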

Ziv-Lempel Compression

• Adaptive coding
• For repeat occurrences of text segments, pointer back to first occurrence
• Higher compression than Huffman coding
• Also used for image compression

Ziv-Lempel compression

• Based on triples <a,b,c>, where
– a = how far back to segment
– b = no of characters in segment
– c = new character to end segment
• E.g.
– <0,0,z> first occurrence of z
– <17,5,r> go back 17 characters, repeat 5 characters, end in r
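A decoder for such triples is short enough to sketch. The version below assumes the convention described above (a counts back from the end of the text decoded so far) and copies one character at a time, so a segment may overlap the text it is extending. The name `lz77_decode` and the sample triples are illustrative only.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, char) triples into a string."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        # Copy character by character so overlapping segments work.
        for i in range(length):
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)


simple = lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 2, "a")])
overlapping = lz77_decode([(0, 0, "a"), (1, 3, "b")])
```

The first call yields "ababa": two new characters, then "go back 2, repeat 2, end in a". The second yields "aaaab", showing a segment that is longer than the distance it points back.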


Ziv-Lempel - Example

abbababbbaabaa

a b bbba bab aa baa

<0,0,a>

<1,1,a>

<3,2,b> <2,1,b><0,0,b> <2,1,a>

<3,2,a>

Example

Encode (i.e., compress) the string ABBCBCABABCAABCAAB

The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B)

Note: The above is just a representation, the commas and parentheses are not transmitted; we will discuss the actual form of the compressed message later on in slide 12.

Example

1. A is not in the Dictionary; insert it.
2. B is not in the Dictionary; insert it.
3. B is in the Dictionary. BC is not in the Dictionary; insert it.
4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it.
5. B is in the Dictionary. BA is not in the Dictionary; insert it.
6. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it.
7. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is in the Dictionary. BCAAB is not in the Dictionary; insert it.

Example

Encode (i.e., compress) the string BABAABRRRA.

The compressed message is: (0,B)(0,A)(1,A)(2,B)(0,R)(5,R)(2, )

Example

1. B is not in the Dictionary; insert it.
2. A is not in the Dictionary; insert it.
3. B is in the Dictionary. BA is not in the Dictionary; insert it.
4. A is in the Dictionary. AB is not in the Dictionary; insert it.
5. R is not in the Dictionary; insert it.
6. R is in the Dictionary. RR is not in the Dictionary; insert it.
7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, )
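The dictionary procedure walked through in the two examples can be sketched as follows. The function names are made up for illustration, dictionary indices start at 1 with 0 meaning "no prefix", and the trailing pair (2, ) is represented with an empty string as its character.

```python
def lz78_encode(text):
    """Return a list of (dictionary index, next character) pairs."""
    dictionary = {}  # phrase -> index, starting at 1
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known phrase
        else:
            pairs.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its index alone.
        pairs.append((dictionary[phrase], ""))
    return pairs


def lz78_decode(pairs):
    phrases = [""]  # index 0 is the empty prefix
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


first = lz78_encode("ABBCBCABABCAABCAAB")
second = lz78_encode("BABAABRRRA")
```

Both results match the compressed messages given in the examples, and `lz78_decode` rebuilds the same dictionary on the receiving side without it ever being transmitted.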

Pros and Cons of Different Algorithms

                      Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio     very good    poor                very good      good
Compression speed     slow         fast                fast           very fast
Decompression speed   slow         fast                very fast      very fast
Memory space          low          low                 high           moderate
Pattern matching      no           yes                 yes            yes
Random access         no           yes                 yes            no

Data

Images

vector graphics

• A vector graphic is a set of instructions on how to draw the shapes that make up an image.

• Contrary to raster images, vector graphics are resolution-independent. On a device with small pixels, they look better than on a device with large pixels.

Vector Images

• Vector
– composed of paths
  • Coordinates
  • With color
  • SVG coding
– use mathematical relationships between points and the paths connecting them to describe an image
• Used for
– Fonts
– Drawings
– Charts
– Maps
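To make "paths with coordinates and color" concrete, here is a small sketch that writes an SVG document containing one closed path. The helper name `svg_path` and the triangle coordinates are invented for this example; only standard SVG path commands (M = move to, L = line to, Z = close path) are used.

```python
def svg_path(points, fill):
    """Return an SVG document with one closed, filled path."""
    # "M x y" starts the path, each "L x y" draws a line, "Z" closes it.
    d = "M " + " L ".join(f"{x} {y}" for x, y in points) + " Z"
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
        f'<path d="{d}" fill="{fill}"/>'
        "</svg>"
    )


triangle = svg_path([(10, 90), (50, 10), (90, 90)], "red")
```

The file stores three coordinate pairs and a color rather than a grid of pixels, which is why the same triangle can be drawn sharply at any size.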

Vector Image

Vector Image Example

Raster Images

• Raster images, also known as bitmap images
– A grid of individual pixels
– Each pixel can be a different color or shade
– Our focus today
• Used for
– Continuous tone images

Raster images

• Raster images are rectangular sets of pixels.

• Each pixel is a small rectangle that has a certain color.

• Since the pixels are small, the illusion of a non-pixelated image is created.

• For a fixed number of pixels, the smaller the pixels, the smaller the displayed image.

Original Materials: Photographs

• Reflective
– Prints
• Film
– Negative
– Positive
• Requirements
– Color fidelity
– Contrast
– Detail rendering

Original Materials: Text

• Can be black and white or have color
• Usually bound volumes rather than loose pages
• Requirements
– Usually needs to be readable
– Often has additional processing like OCR

Original Materials: Artifacts

• 3-D objects can’t be scanned
• Digital photography creates the image
– “Studio” space
– Lighting
– Moving equipment and personnel
• Requirements
– Depth
– Color
– Detail

The Five Big Factors

• Resolution

• Bit Depth

• Color

• Compression

• Format

Resolution

• Often referred to as “dpi” or “ppi”
– Dots per inch
– Pixels per inch
• Ratio of the number of pixels captured per inch of original photo size
– 8x10 print scanned at 300ppi = 2400 x 3000 pixels
• “Spatial resolution” refers to pixel dimensions of image, e.g., 3000 x 2400 pixels
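The arithmetic behind the 8x10 example is a single multiplication per side. A minimal sketch, with a hypothetical helper name:

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Pixel dimensions of a scan, given the original size in inches."""
    return round(width_in * ppi), round(height_in * ppi)


print_scan = pixel_dimensions(8, 10, 300)
```

For an 8x10 inch print at 300 ppi this gives 2400 x 3000 pixels, matching the figure above.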

300dpi vs. 4000dpi

bit depth

• The bit depth is the amount of information that is retained on every pixel about the colors of the pixel.

• The higher the bit depth, the more colors can be represented.

Bit Depth

• Refers to number of bits (binary digits, places for zeroes and ones) devoted to storing color information about each pixel
– 1 bit (1) = 2^1 = 2 shades (black & white)
– 2 bit (01) = 2^2 = 4 shades
– 4 bit (0010) = 2^4 = 16 shades
– 8 bit (11010001) = 2^8 = 256 shades

Bit Depth

1 bit (black & white) 2 bit (4 colors)

4 bit (16 colors) 8 bit (256 colors)

Color

• RGB
– Scanners and cameras generally have sensors for Red, Green, and Blue
– Each of these “channels” is stored separately in the digital file
– 8 bits for each channel = 24 bit color
• CMYK (Cyan, Magenta, Yellow and Black) is used for high-end “pre-press” printing purposes
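Combining bit depth with pixel dimensions gives the uncompressed size of an image. The sketch below assumes three 8-bit channels and the 2400 x 3000 pixel scan from the resolution example; the function name is illustrative, and file-format overhead such as headers is ignored.

```python
def uncompressed_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Raw pixel data size in bytes, ignoring any file-format overhead."""
    total_bits = width_px * height_px * channels * bits_per_channel
    return (total_bits + 7) // 8  # round up to whole bytes


rgb_colors = 2 ** (3 * 8)                  # distinct colors in 24-bit RGB
scan_bytes = uncompressed_bytes(2400, 3000)
```

24-bit color can represent 16,777,216 colors, and the scan occupies 21,600,000 bytes before compression, which is the motivation for the compression methods that follow.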

Compression

Lossy compression reduces size by eliminating data. It cannot be reversed: the data is lost.
• Irrelevancy reduction
– Removes data that will not affect perception
• Redundancy reduction
– Removes duplicate data
• JPEG compression
– Discrete Cosine Transform (DCT) re-expresses blocks of pixel values as frequency coefficients
– Quantization rounds those coefficients (losing data)
– The quality slider governs how much rounding occurs
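The transform-then-round idea can be illustrated in one dimension. This is a simplified sketch, not the JPEG codec: real JPEG applies a two-dimensional DCT to 8x8 blocks and uses a table of quantization steps, whereas the code below uses a single row of eight made-up sample values and one uniform step size.

```python
import math


def dct(samples):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(samples)
    out = []
    for k in range(n):
        s = sum(samples[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1 / n)
        s += sum(
            coeffs[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def quantize(coeffs, step):
    """Round each coefficient to the nearest multiple of step (the lossy part)."""
    return [round(c / step) * step for c in coeffs]


row = [52, 55, 61, 66, 70, 61, 64, 73]      # made-up pixel values
coarse = quantize(dct(row), 10)             # larger step = lower "quality"
restored = idct(coarse)
```

Without quantization, `idct(dct(row))` returns the original values up to floating-point error. With it, small high-frequency coefficients round to zero, which is what makes the data compressible, and `restored` is only an approximation of `row`.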

Compression

Full sized image, enlarged 8x and 16x

Without compression

With maximum JPEG compression

Wavelet Compression

• Treats the image as a signal or wave, not a series of numbers or a picture
• The data is transformed into a continuous wave centered on zero
• Calculates the distance of the peaks and dips from zero and takes the average between adjacent points
• Repeats the averaging
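The slide does not name a specific wavelet. As an assumption, the sketch below uses the Haar wavelet, the simplest case of "average adjacent points, then repeat": each pass keeps the pairwise averages and records the pairwise differences needed to undo the step. It requires an input whose length is a power of two.

```python
def haar_step(signal):
    """One pass: pairwise averages plus the details needed to undo them."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_transform(signal):
    """Repeat the averaging until a single overall mean remains."""
    details_by_level = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        details_by_level.append(details)
    return current[0], details_by_level


def haar_inverse(mean, details_by_level):
    current = [mean]
    for details in reversed(details_by_level):
        rebuilt = []
        for avg, d in zip(current, details):
            rebuilt.extend([avg + d, avg - d])
        current = rebuilt
    return current


mean, levels = haar_transform([9, 7, 3, 5])
```

For the input 9, 7, 3, 5 the first pass gives averages 8 and 4, and the second gives the overall mean 6. In smooth image regions the detail values are small, so they can be rounded or dropped to compress the data.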

Wavelet vs. JPEG compression

Wavelet compression: file size 1861 bytes, compression ratio 105.6

JPEG compression: file size 1895 bytes, compression ratio 103.8

Source: “About Wavelet Compression”. http://www.barrt.ru/parshukov/about.htm.

TIFF

• TIFF stands for Tagged Image File Format.
• It is a standard file format used for archival purposes.
• In fact it is the de facto standard in the archival community.
• It is a 24-bit depth, i.e. “full-color” format.

origin

• It was originally created by Microsoft and a software company called Aldus. The latter held the copyright.

• Aldus released the first complete specification in 1986.

• Its aim was to create a standard format for the desktop scanners of the 80s.

status

• Aldus was acquired by Adobe, Inc., which now holds the copyright.
• Thus this is a proprietary format.
• Use of the format requires no license fees.
• The last major update was in 1992.

tagged…

• The TIFF file stores its information in fields called tags. These store things like
– image dimensions
– copyright information

• The format allows for proprietary tags you can create yourself.
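To show what a tag looks like on disk, the sketch below builds a tiny TIFF-style byte string by hand and reads its tags back. It is a simplified illustration under stated assumptions: `read_tiff_tags` is a hypothetical helper, it reads only the first image file directory, and it decodes only single SHORT (type 3) and LONG (type 4) values. The sample has no pixel data, so it is not a complete TIFF file.

```python
import struct


def read_tiff_tags(data):
    """Return {tag number: value} for simple tags in the first directory."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF magic number")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # every entry is 12 bytes
        tag, field_type, count, raw = struct.unpack(
            endian + "HHI4s", data[start:start + 12]
        )
        if count == 1 and field_type == 3:    # one 16-bit SHORT
            tags[tag] = struct.unpack(endian + "H", raw[:2])[0]
        elif count == 1 and field_type == 4:  # one 32-bit LONG
            tags[tag] = struct.unpack(endian + "I", raw)[0]
        else:
            tags[tag] = None                  # not decoded in this sketch
    return tags


# Header, then a directory with ImageWidth (256) and ImageLength (257).
sample_tiff = (
    struct.pack("<2sHI", b"II", 42, 8)
    + struct.pack("<H", 2)
    + struct.pack("<HHIHH", 256, 3, 1, 640, 0)
    + struct.pack("<HHII", 257, 4, 1, 480)
    + struct.pack("<I", 0)  # offset of next directory: none
)
sample_tags = read_tiff_tags(sample_tiff)
```

Each tag is a numbered 12-byte entry, so a reader can skip tags it does not understand, which is what makes private tags possible.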

requirements of baseline TIFF

• Multiple images may be in the same file.
• Support for two compression schemes
– CCITT Group 3 1-Dimensional Modified Huffman RLE
– PackBits compression - a form of run-length encoding
• Support for
– bilevel
– grayscale
– palette-color
– RGB full-color
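PackBits is simple enough to sketch a decoder for. Each run starts with a header byte: values 0 to 127 mean "copy the next header + 1 bytes literally", values 129 to 255 mean "repeat the next byte 257 - header times", and 128 is skipped. The function name is illustrative, and the sample bytes are the widely quoted test data for this scheme.

```python
def packbits_decode(data):
    """Decode PackBits run-length encoded bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        header = data[i]
        i += 1
        if header < 128:
            # Literal run: copy the next header + 1 bytes unchanged.
            count = header + 1
            out += data[i:i + count]
            i += count
        elif header > 128:
            # Repeat run: the next byte is repeated 257 - header times.
            out += data[i:i + 1] * (257 - header)
            i += 1
        # header == 128 is a no-op
    return bytes(out)


packed = bytes.fromhex("FE AA 02 80 00 2A FD AA 03 80 00 2A 22 F7 AA")
unpacked = packbits_decode(packed)
```

Here 15 packed bytes expand to 24. Runs of identical bytes shrink well, while data with no repetition grows slightly because of the header bytes.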

problems

• Adobe also owns PSD, the format for its Photoshop application. They have neglected TIFF.
– no tags to specify relationship between pages.
– no standards for vector graphics and text drawings.

• There is a size limitation to 4GB.

GIF

• Developed in 1987 for CompuServe screens
• Uses an indexed color scheme insufficient for current color technology
• GIF does not store scaling resolution
– Good for screen display
– Good for graphics
– Bad for printing
• Uses LZW compression
• Patent issues and not used currently

birth of PNG

• In 1993, Unisys has financial problems.
• It negotiates with CompuServe that they collectively would collect royalties for use of LZW in GIF-manipulating software.
• This was announced on 28 December 1994.
• An informal group around Thomas Boutell works on producing a free alternative to GIF.

features |1|

• non-patented and completely lossless compression that is better than the compression in GIF, but only by 5%-20%

• Multiple cyclic redundancy checks (CRCs) so that file integrity can be checked without viewing

• It has a magic signature that can detect the most common types of file corruption.
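Both integrity features can be checked with the standard library. In a PNG file the 8-byte signature is followed by chunks, each laid out as a 4-byte length, a 4-byte type, the data, and a CRC-32 computed over the type and data. The sketch below uses hypothetical helper names and a hand-built sample that contains only IHDR and IEND chunks, so it exercises the checks but is not a displayable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def check_png(data):
    """Return True if the signature and every chunk CRC are intact."""
    if data[:8] != PNG_SIGNATURE:
        return False
    pos = 8
    while pos < len(data):
        if pos + 8 > len(data):
            return False
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        crc_bytes = data[pos + 8 + length:pos + 12 + length]
        if len(body) != length or len(crc_bytes) != 4:
            return False  # truncated chunk
        (stored_crc,) = struct.unpack(">I", crc_bytes)
        if zlib.crc32(chunk_type + body) != stored_crc:
            return False
        pos += 12 + length
    return True


# IHDR fields: width, height, bit depth, color type, compression, filter, interlace
header = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)
sample_png = PNG_SIGNATURE + make_chunk(b"IHDR", header) + make_chunk(b"IEND", b"")

damaged = bytearray(sample_png)
damaged[20] ^= 0xFF  # flip one byte inside the IHDR data
damaged = bytes(damaged)
```

`check_png(sample_png)` succeeds, while the copy with a single altered byte fails its CRC, all without decoding any pixels. The signature's mix of a high-bit byte and both line-ending styles is what lets it expose common transfer corruption.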

features |2|

• two-dimensional interlacing scheme
• 1-, 2-, 4- and 8-bit palette support (like GIF)
• 1-, 2-, 4-, 8- and 16-bit grayscale support
• 24- and 48-bit truecolor support
• full alpha transparency in 8- and 16-bit modes, not just simple on-off transparency like GIF

JPEG Compression: Basics

• Human vision is insensitive to high spatial frequencies
• JPEG takes advantage of this by compressing high frequencies more coarsely and storing the image as frequency data
• JPEG is a “lossy” compression scheme.

Losslessly compressed image, ~150KB

JPEG compressed, ~14KB

JPEG

• There are two standards, the original JPEG and JPEG2000.

• We need to worry about this because JPEG 2000 has an option for lossless manipulation of the image. JPEG does not have this.

• We will assume JPEG is always a lossy format.

Image Technical Metadata

MIX – Metadata for Still Images in XML

• Developed by
– The Library of Congress' Network Development and MARC Standards Office
– NISO Technical Metadata for Digital Still Images Standards Committee
• http://www.loc.gov/standards/mix/instances/mix_test.xml
• http://www.niso.org/standards/resources/Z39_87_trial_use.pdf