lester-burns
• Do you still have a copy of your first email?
• Can you still compile and run the first program you ever wrote?
• If Hurricane Isabel had destroyed your computer, how much information would you have lost?
Digital information
http://en.wikipedia.org/wiki/Rosetta_Stone
http://www.rosettaproject.org/about-us/disk/concept
Storage of text: image vs. ascii
• Document image
– Digital image of page; words represented as patterns of pixels
– Not searchable as text
– Optical character recognition to convert to ASCII (may be error prone)
• ASCII
– Searchable as text; words represented as ASCII codes
"Benign Neglect"
• Hardcopy items:
– benefit from "benign neglect"
– have well-understood methods; e.g.:
• book->open
• book->turnPage
000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLOWORLD.
000300
000400*
000500 ENVIRONMENT DIVISION.
000600 CONFIGURATION SECTION.
000700 SOURCE-COMPUTER. RM-COBOL.
000800 OBJECT-COMPUTER. RM-COBOL.
000900
001000 DATA DIVISION.
001100 FILE SECTION.
001200
100000 PROCEDURE DIVISION.
100100
100200 MAIN-LOGIC SECTION.
100300 BEGIN.
100400 DISPLAY " " LINE 1 POSITION 1 ERASE EOS.
100500 DISPLAY "Hello world!" LINE 15 POSITION 10.
100600 STOP RUN.
100700 MAIN-LOGIC-EXIT.
100800 EXIT.
• Softcopy items:
– frequent use leads to migration & replication
– are only understood in specialized, fragile contexts
CCS – Offices
[Diagram: document processing engine with a rules database (RulesDB). Input: document. Stages shown: Scanning, Import, Image Pre-Processing, Layout Analysis, Character Recognition, Structural Analysis, Correction, Export. Output: METS, ALTO, TIFF, JPEG.]
Traditional OCR - Output
THE
AMERICAN MISSIONARY.
Vo.. XXXII JANUARY, 1878 No. 1
American Missionary Association
1877 - 1888
information available
Title page
Title of series
Volume number
Issue number
Motto
Date
Oxford University Library Services
• > 660 staff
• 40 libraries
• Budget > £25m (€37m)
• Total bookstock: 11 million items
• 156 miles (250km) of shelving, including repository space
Text Compression
• Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed.
• Compromises:
– Encode-decode time
– Random access to text?
Why Compress?
• To reduce the volume of data to be transmitted (text, images, …)
• To reduce the bandwidth required for transmission and to reduce storage requirements (speech, audio, video)
Text Compression
• Common methods
– Symbol-wise methods
• Estimate probabilities of symbols, code one at a time, shorter codes for high probabilities (Morse)
• E.g. Huffman coding
– Dictionary methods
• Replace words and fragments with dictionary entries (Braille)
• E.g. Ziv-Lempel compression
• May be static or dynamic
Huffman coding
• Developed in 1950s, widely used
• Static code, variable length
• Based on frequency of occurrence of letters (from English or from body of text)
• Method:
– Sort by falling probabilities; link the 2 symbols with the least probabilities, label the new node with their sum; repeat till you reach a single node with probability of 1
– Read the codes down the tree, from the root to each symbol
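The method above can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the function name `huffman_codes`, the use of raw counts instead of probabilities, and the choice of "0" for the first-popped branch and "1" for the second are all assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to a Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    # Heap entries are (weight, tie_breaker, tree). A tree is either a
    # single symbol (a leaf) or a (left, right) tuple (an internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Only one distinct symbol: give it a one-bit code.
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        # Link the two nodes with the lowest weights; label with the sum.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}

    def walk(tree, prefix):
        # Read codes down the tree: 0 for one branch, 1 for the other.
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes

def encode(text, codes):
    """Concatenate the code of every symbol in text."""
    return "".join(codes[ch] for ch in text)
```

Frequent symbols end up near the root and get short codes; because every symbol is a leaf, no code is a prefix of another, so the bit string can be decoded without separators.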
• Huffman coding builds a binary tree from the letter frequencies in the message.
– The binary code for each character is read directly from the tree.
• Symbols with the highest frequencies end up at the top of the tree, and result in the shortest codes.
7A.2 Statistical Coding: Huffman coding
• The process of building the tree begins by counting the occurrences of each symbol in the text to be encoded.
HIGGLETY PIGGLTY POP
THE DOG HAS EATEN THE MOP
THE PIGS IN A HURRY
THE CATS IN A FLURRY
HIGGLETY PIGGLTY POP
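The counting step can be sketched with Python's `collections.Counter`. This is only an illustration: the names `RHYME` and `symbol_counts` are made up for the example, and counting letters only (ignoring spaces and line breaks) is an assumption, since the slide does not say whether spaces are coded.

```python
from collections import Counter

# The text from the slide, reproduced exactly as it appears there.
RHYME = (
    "HIGGLETY PIGGLTY POP\n"
    "THE DOG HAS EATEN THE MOP\n"
    "THE PIGS IN A HURRY\n"
    "THE CATS IN A FLURRY\n"
    "HIGGLETY PIGGLTY POP"
)

def symbol_counts(text):
    """Count how often each letter occurs (assumption: letters only)."""
    return Counter(ch for ch in text if ch.isalpha())

if __name__ == "__main__":
    # List the letters from most to least frequent.
    for letter, count in symbol_counts(RHYME).most_common():
        print(letter, count)
```

These counts are the weights from which the forest of single-letter trees is built.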
• Next, place the letters and their frequencies into a forest of trees that each have two nodes: one for the letter, and one for its frequency.
• We start building the tree by joining the nodes having the two lowest frequencies.
• And then we again join the nodes with the two lowest frequencies.
Ziv-Lempel Compression
• Adaptive coding• For repeat occurrences of text segments,
pointer back to first occurrence• Higher compression than Huffman coding• Also used for image compression
Ziv-Lempel compression
• Based on triples <a,b,c>, where
– a = how far back to segment
– b = no of characters in segment
– c = new character to end segment
• E.g.
– <0,0,z> first occurrence of z
– <17,5,r> go back 17 characters, repeat 5 characters, end in r
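Decoding such triples can be sketched as below. This is a minimal illustration of the scheme commonly called LZ77, assuming that `a` is a distance in characters counted back from the current end of the decoded text; the function name `lz77_decode` is made up for the example.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, new_char) triples into a string."""
    out = []
    for back, length, new_char in triples:
        start = len(out) - back
        if length < 0 or back < 0 or (length > 0 and (back == 0 or start < 0)):
            raise ValueError(f"invalid triple: {(back, length, new_char)}")
        # Copy one character at a time so that a segment may overlap
        # the text being produced (length greater than back).
        for i in range(length):
            out.append(out[start + i])
        # Every triple ends with one new literal character.
        out.append(new_char)
    return "".join(out)
```

For instance, `lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 2, "c")])` first emits the two new characters, then goes back 2 characters, repeats 2, and ends in `c`.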
Ziv-Lempel - Example
abbababbbaabaa
a     <0,0,a>
b     <0,0,b>
ba    <2,1,a>
bab   <3,2,b>
bb    <2,1,b>
aa    <1,1,a>
baa   <3,2,a>
Example
Encode (i.e., compress) the string ABBCBCABABCAABCAAB
The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B)
Note: The above is just a representation, the commas and parentheses are not transmitted; we will discuss the actual form of the compressed message later on in slide 12.
Example
1. A is not in the Dictionary; insert it
2. B is not in the Dictionary; insert it
3. B is in the Dictionary. BC is not in the Dictionary; insert it.
4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it.
5. B is in the Dictionary. BA is not in the Dictionary; insert it.
6. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it.
7. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is in the Dictionary. BCAAB is not in the Dictionary; insert it.
Example
Encode (i.e., compress) the string BABAABRRRA.
The compressed message is: (0,B)(0,A)(1,A)(2,B)(0,R)(5,R)(2, )
Example
1. B is not in the Dictionary; insert it
2. A is not in the Dictionary; insert it
3. B is in the Dictionary. BA is not in the Dictionary; insert it.
4. A is in the Dictionary. AB is not in the Dictionary; insert it.
5. R is not in the Dictionary; insert it.
6. R is in the Dictionary. RR is not in the Dictionary; insert it.
7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, )
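The dictionary procedure walked through in the two examples above (the Ziv-Lempel variant commonly called LZ78) can be sketched as follows. The function names are made up for the illustration, and representing the final pair's missing character as an empty string is an assumption.

```python
def lz78_encode(text):
    """Return a list of (dictionary_index, next_char) pairs; index 0 means 'no prefix'."""
    dictionary = {}   # phrase -> index, numbered from 1 in order of insertion
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            # Keep extending the phrase while it is still in the Dictionary.
            phrase = candidate
        else:
            # Not in the Dictionary: output (index of known prefix, new char), insert it.
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: output its index with no character.
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    """Rebuild the text from the pairs by replaying the dictionary insertions."""
    phrases = [""]    # index 0 is the empty prefix
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)
```

Under these assumptions, `lz78_encode("ABBCBCABABCAABCAAB")` yields the pairs (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B) given on the slide, and the decoder needs no transmitted dictionary because it rebuilds the same one as it goes.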
Pros and Cons of Different Algorithms
                      Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio     very good    poor                very good      good
Compression speed     slow         fast                fast           very fast
Decompression speed   slow         fast                very fast      very fast
Memory space          low          low                 high           moderate
Pattern matching      no           yes                 yes            yes
Random access         no           yes                 yes            no
vector graphics
• A vector graphic is a set of instructions on how to draw the shapes that make up an image.
• Contrary to raster images, vector graphics are resolution-independent. On a device with small pixels, they look better than on a device with large pixels.
Vector Images
• Vector
– composed of paths
• Coordinates
• With color
• SVG coding
– use mathematical relationships between points and the paths connecting them to describe an image
• Used for
– Fonts
– Drawings
– Charts
– Maps
Vector Image
Raster Images
• Raster Images also known as Bitmap Image
– A grid of individual pixels
– Each pixel can be a different color or shade
– Our focus today
• Used for
– Continuous tone images
Raster images
• Raster images are rectangular sets of pixels.
• Each pixel is a small rectangle that has a certain color.
• Since the pixels are small, the illusion of a non-pixelated image is created.
• For a given number of pixels, the smaller the pixels, the smaller the displayed image.
Original materials
Photographs
• Reflective
– Prints
• Film
– Negative
– Positive
• Requirements
– Color fidelity
– Contrast
– Detail rendering
Original Materials
Text
• Can be black and white or have color
• Usually bound volumes rather than loose pages
• Requirements
– Usually needs to be readable
– Often has additional processing like OCR
Original Materials
Artifacts
• 3-D objects can’t be scanned
• Digital Photography creates the image
– “Studio” space
– Lighting
– Moving equipment and personnel
• Requirements
– Depth
– Color
– Detail
Resolution
• Often referred to as “dpi” or “ppi”
– Dots per inch
– Pixels per inch
• RATIO of number of pixels captured per inch of original photo size
– 8x10 print scanned at 300ppi = 2400 x 3000 pixels
• “Spatial resolution” refers to pixel dimensions of image, e.g., 3000 x 2400 pixels
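The arithmetic behind the 8x10 example can be written out as a short sketch. The function name `pixel_dimensions` is made up for the illustration, and rounding to whole pixels is an assumption.

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Pixel dimensions of a scan: inches of original times pixels per inch."""
    return round(width_in * ppi), round(height_in * ppi)

if __name__ == "__main__":
    # An 8x10 inch print scanned at 300 ppi.
    print(pixel_dimensions(8, 10, 300))
```

Doubling the ppi doubles each dimension and therefore quadruples the number of pixels captured.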
bit depth
• The bit depth is the amount of information that is retained on every pixel about the colors of the pixel.
• The higher the bit depth, the more colors can be represented.
Bit Depth
• Refers to number of bits (binary digits, places for zeroes and ones) devoted to storing color information about each pixel
– 1 bit (1) = 2^1 = 2 shades (black & white)
– 2 bit (01) = 2^2 = 4 shades
– 4 bit (0010) = 2^4 = 16 shades
– 8 bit (11010001) = 2^8 = 256 shades
Color
• RGB
– Scanners and cameras generally have sensors for Red, Green, and Blue
– Each of these “channels” is stored separately in the digital file
– 8 bits for each channel = 24 bit color
• CMYK (Cyan, Magenta, Yellow and Black) is used for high-end “pre-press” printing purposes
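Bit depth and channel count together determine both the number of representable colors and the size of an uncompressed image. The sketch below illustrates that arithmetic; the function names are made up for the example, and the size ignores any file header or metadata.

```python
def shades(bits):
    """Number of distinct values that can be stored in the given number of bits."""
    return 2 ** bits

def uncompressed_bytes(width_px, height_px, bits_per_channel=8, channels=3):
    """Raw pixel data size in bytes, rounded up to a whole byte."""
    total_bits = width_px * height_px * bits_per_channel * channels
    return (total_bits + 7) // 8

if __name__ == "__main__":
    print(shades(8))                       # shades per 8-bit channel
    print(shades(24))                      # colors in 24 bit RGB
    print(uncompressed_bytes(2400, 3000))  # 8x10 print at 300ppi, 24 bit color
```

The last figure, roughly 21.6 million bytes for a single scanned print, is the kind of volume that motivates image compression.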
Compression
Reduces size by eliminating data. With lossy compression this cannot be reversed: the data is lost.
• Irrelevancy reduction
– Removes data that will not affect perception
• Redundancy reduction
– Removes duplicate data
• JPEG compression
– Discrete Cosine Transform (DCT) simplifies color values
– Quantization rounds color values (losing data)
– The quality slider governs how much simplification occurs
Wavelet Compression
• Treats the image as a signal or wave, not a series of numbers or a picture
• The data is transformed into a continuous wave centered on zero
• Calculates the distance of the peaks and dips from zero and takes the average between adjacent points
• Repeats the averaging
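The repeated averaging can be illustrated with the Haar wavelet, the simplest case. The slide does not name a particular wavelet, so Haar is an assumption here, as are the function names and the requirement that the input length be a power of two.

```python
def haar_step(values):
    """Average each adjacent pair and keep half the difference as detail."""
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details

def haar_transform(values):
    """Repeat the averaging until one overall average remains."""
    if len(values) == 0 or len(values) & (len(values) - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    details_by_level = []
    while len(current) > 1:
        current, details = haar_step(current)
        details_by_level.append(details)
    return current[0], details_by_level

def haar_inverse(overall_average, details_by_level):
    """Undo haar_transform by re-expanding each average with its detail."""
    current = [overall_average]
    for details in reversed(details_by_level):
        expanded = []
        for average, detail in zip(current, details):
            expanded.extend([average + detail, average - detail])
        current = expanded
    return current
```

With the details kept exactly, the inverse is lossless; compression comes from dropping or coarsely rounding small detail values, which is where data is lost.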
Wavelet vs. JPEG compression
• Wavelet compression: file size 1861 bytes; compression ratio 105.6
• JPEG compression: file size 1895 bytes; compression ratio 103.8
Source: “About Wavelet Compression”. http://www.barrt.ru/parshukov/about.htm.
TIFF
• TIFF stands for Tagged Image File Format.
• It is a standard file format used for archival purposes.
• In fact it is the de facto standard in the archival community.
• It supports 24-bit depth, i.e. “full-color”, images.
origin
• It was originally created by Microsoft and a software company called Aldus. The latter held the copyright.
• Aldus released the first complete specification in 1986.
• Its aim was to create a standard format for the desktop scanners of the 80s.
status
• Aldus was acquired by Adobe, Inc., which now holds the copyright.
• Thus this is a proprietary format.
• Use of the format requires no license fees.
• The last major update was in 1992.
tagged…
• The TIFF file stores its information in fields called tags. These store things like
– image dimensions
– copyright information
• The format allows for proprietary tags you can create yourself.
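How tags are laid out in a file can be sketched with Python's `struct` module. The example builds a tiny header with two image-dimension tags in memory and reads them back; it is a minimal illustration that handles only single-valued 32-bit (LONG) fields, contains no pixel data, and uses made-up function names. The tag numbers 256 and 257 are the TIFF ImageWidth and ImageLength tags.

```python
import struct

TAG_NAMES = {256: "ImageWidth", 257: "ImageLength"}
LONG = 4  # TIFF field type code for a 32-bit unsigned integer

def build_sample_header(width, height):
    """Little-endian TIFF header plus one directory (IFD) holding two LONG tags."""
    header = b"II" + struct.pack("<HI", 42, 8)  # byte order, magic 42, IFD at offset 8
    entries = [(256, LONG, 1, width), (257, LONG, 1, height)]
    ifd = struct.pack("<H", len(entries))       # number of directory entries
    for tag, field_type, count, value in entries:
        ifd += struct.pack("<HHII", tag, field_type, count, value)
    ifd += struct.pack("<I", 0)                 # offset of next IFD; 0 means none
    return header + ifd

def read_long_tags(data):
    """Return {tag name or number: value} for single-valued LONG tags in the first IFD."""
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("missing TIFF magic number")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        # Each directory entry is 12 bytes: tag, type, count, value-or-offset.
        tag, field_type, count, value = struct.unpack_from(
            endian + "HHII", data, ifd_offset + 2 + 12 * i)
        if field_type == LONG and count == 1:
            tags[TAG_NAMES.get(tag, tag)] = value
    return tags
```

Unknown tag numbers are returned as plain integers rather than rejected, which mirrors how readers can skip private tags they do not understand.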
requirements of baseline TIFF
• Multiple images may be in the same file.
• Support for two compression schemes
– CCITT Group 3 1-Dimensional Modified Huffman RLE
– PackBits compression - a form of run-length encoding
• Support for
– bilevel
– grayscale
– palette-color
– RGB full-color
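PackBits is simple enough to sketch a decoder in a few lines. In this scheme a header byte either announces a run of literal bytes or a single byte to repeat; the function name `packbits_decode` is made up for the example, and the sketch does no checking for truncated input.

```python
def packbits_decode(packed):
    """Expand PackBits run-length encoded bytes."""
    out = bytearray()
    i = 0
    while i < len(packed):
        header = packed[i]
        i += 1
        if header < 128:
            # 0..127: copy the next header + 1 bytes literally.
            count = header + 1
            out += packed[i:i + count]
            i += count
        elif header > 128:
            # 129..255 (signed -127..-1): repeat the next byte 257 - header times.
            out += packed[i:i + 1] * (257 - header)
            i += 1
        # header == 128 (signed -128) is a no-op and is skipped.
    return bytes(out)

if __name__ == "__main__":
    packed = bytes.fromhex("FEAA0280002AFDAA0380002A22F7AA")
    print(packbits_decode(packed).hex(" "))
```

Long runs of identical bytes, common in bilevel scans of text pages, shrink to two bytes each, while data without runs grows only slightly.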
problems
• Adobe also owns PSD, the format for its Photoshop application. They have neglected TIFF.
– no tags to specify relationship between pages.
– no standards for vector graphics and text drawings.
• File size is limited to 4GB.
GIF
• Developed in 1987 for CompuServe screens
• Uses an indexed color scheme insufficient for current color technology
• GIF does not store scaling resolution
– Good for screen display
– Good for graphics
– Bad for printing
• Uses LZW compression
• Patent issues and not used currently
birth of PNG
• In 1993, Unisys has financial problems.
• It negotiates with CompuServe that they would jointly collect royalties for the use of LZW in GIF-manipulating software.
• This was announced on 28 December 1994.
• An informal group around Thomas Boutell works on producing a free alternative to GIF.
features |1|
• non-patented and completely lossless compression that is better than the compression in GIF, but only by 5%-20%
• Multiple cyclic redundancy checks so that file integrity can be checked without viewing
• It has a magic signature that can detect the most common types of file corruption.
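The signature and the per-chunk checks can be sketched with `struct` and `zlib.crc32`. The sketch only demonstrates chunk framing: the byte string it builds is not a displayable PNG (it has no image header or pixel data), and the function names are made up for the example. The 8-byte signature and the layout of length, type, data, and CRC follow the PNG format.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type, data):
    """Frame data as a PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def check_chunks(blob):
    """Return [(chunk_type, crc_ok), ...] without decoding any image data."""
    if not blob.startswith(PNG_SIGNATURE):
        raise ValueError("PNG signature missing or corrupted")
    results = []
    pos = len(PNG_SIGNATURE)
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        chunk_type = blob[pos + 4:pos + 8]
        data = blob[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack_from(">I", blob, pos + 8 + length)
        actual_crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
        results.append((chunk_type, actual_crc == stored_crc))
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return results
```

The signature mixes a non-ASCII byte with both line-ending styles, so transfers that strip the high bit or convert line endings damage it and are caught before any chunk is read.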
features |2|
• two-dimensional interlacing scheme
• 1-, 2-, 4- and 8-bit palette support (like GIF)
• 1-, 2-, 4-, 8- and 16-bit grayscale support
• 24- and 48-bit truecolor support
• full alpha transparency in 8- and 16-bit modes, not just simple on-off transparency like GIF
JPEG Compression: Basics
• Human vision is insensitive to high spatial frequencies
• JPEG takes advantage of this by compressing high frequencies more coarsely and storing image as frequency data
• JPEG is a “lossy” compression scheme.
Losslessly compressed image, ~150KB
JPEG compressed, ~14KB
JPEG
• There are two standards, the original JPEG and JPEG 2000.
• We need to be aware of this because JPEG 2000 has an option for lossless compression of the image. We treat the original JPEG as not having one.
• We will assume JPEG is always a lossy format.
Image Technical Metadata
MIX – Metadata for Still Images in XML
• Developed by
– The Library of Congress' Network Development and MARC Standards Office
– NISO Technical Metadata for Digital Still Images Standards Committee
• http://www.loc.gov/standards/mix/instances/mix_test.xml
• http://www.niso.org/standards/resources/Z39_87_trial_use.pdf