Digital Text Primer Prepared for: AIEA Roundtable on Digitization of Armenian Documents Saturday 7 October 2006, University of Geneva, Switzerland Roland Telfeyan <[email protected]> Robert Coffin <[email protected]> October 2, 2006 • Charlotte, North Carolina

Page 1: Roland Telfeyan  < roland@telf> Robert Coffin

Digital Text Primer

Prepared for: AIEA Roundtable on Digitization of Armenian Documents

Saturday 7 October 2006, University of Geneva, Switzerland

Roland Telfeyan <[email protected]>

Robert Coffin <[email protected]>

October 2, 2006 • Charlotte, North Carolina

Page 2:

Contents

• Text encoding: ASCII Problem, Unicode Solution

• OCR: ABBYY FineReader, Sample scans

Page 3:

1963: ASCII

• Telegraph machines
• American Standard Code for Information Interchange (ASCII)
• 128 numbers representing:
  • Printed characters, like ‘A’, ‘B’, ‘+’, ‘=’, etc.
  • Commands to control the print head of the teletype, like “carriage return”, “line feed”, “tab”, “backspace”, etc.
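The split between printed characters and teletype commands can be checked directly. The following is a minimal sketch, assuming the conventional division in which codes 0 to 31 and 127 are commands and codes 32 to 126 (including the space) are printable; the names `CONTROL_CODES`, `PRINTABLE_CODES`, and `COMMANDS` are made up for this illustration.

```python
# Split the 128 ASCII codes into teletype commands and printable characters.
# Assumption: 0-31 and 127 are commands; 32-126 (space included) are printable.
CONTROL_CODES = [code for code in range(128) if code < 32 or code == 127]
PRINTABLE_CODES = [code for code in range(128) if 32 <= code < 127]

# A few of the commands named on the slide, looked up by their escape sequences.
COMMANDS = {
    "carriage return": ord("\r"),
    "line feed": ord("\n"),
    "tab": ord("\t"),
    "backspace": ord("\b"),
}

if __name__ == "__main__":
    print(len(CONTROL_CODES), "command codes,", len(PRINTABLE_CODES), "printable codes")
    for name, code in COMMANDS.items():
        print(f"{code:3d}  {name}")
    for ch in "AB+=":
        print(f"{ord(ch):3d}  {ch}")
```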

Page 4:

ASCII: Cont’d

  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33 !     34 "     35 #     36 $     37 %     38 &     39 '
 40 (     41 )     42 *     43 +     44 ,     45 -     46 .     47 /
 48 0     49 1     50 2     51 3     52 4     53 5     54 6     55 7
 56 8     57 9     58 :     59 ;     60 <     61 =     62 >     63 ?
 64 @     65 A     66 B     67 C     68 D     69 E     70 F     71 G
 72 H     73 I     74 J     75 K     76 L     77 M     78 N     79 O
 80 P     81 Q     82 R     83 S     84 T     85 U     86 V     87 W
 88 X     89 Y     90 Z     91 [     92 \     93 ]     94 ^     95 _
 96 `     97 a     98 b     99 c    100 d    101 e    102 f    103 g
104 h    105 i    106 j    107 k    108 l    109 m    110 n    111 o
112 p    113 q    114 r    115 s    116 t    117 u    118 v    119 w
120 x    121 y    122 z    123 {    124 |    125 }    126 ~    127 del

• No indication of type appearance
• Only numbers representing letters
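To make "only numbers representing letters" concrete, here is a small sketch that turns text into its ASCII code numbers and back. The helper names `to_codes` and `from_codes` are hypothetical and exist only for this illustration.

```python
def to_codes(text: str) -> list[int]:
    """Return the ASCII code number of each character in text."""
    return list(text.encode("ascii"))


def from_codes(codes: list[int]) -> str:
    """Turn a list of ASCII code numbers back into text."""
    return bytes(codes).decode("ascii")


if __name__ == "__main__":
    codes = to_codes("Cab")
    print(codes)              # the numbers a plain ASCII file would hold
    print(from_codes(codes))  # the same numbers read back as letters
```

Nothing in these numbers says which typeface to use; that choice is made later, when the codes are drawn on screen or paper.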

Page 5:

Early Keyboards

• Keyboards were “hard-wired.”

• To get a lowercase ‘b’, you press the [B] key, making the keyboard emit code 98.

Page 6:

Mid-1970s: Computer Fonts

• A font is an array of glyphs, one per ASCII code.

• Character code 97 (‘a’) can be rendered variously, depending on the font: a, a, a, ա, ...

Page 7:

Significance of Fonts

• Fonts were the first flexible mapping interposed between the hard-wired keyboard and the printed glyphs.
• This technology made the Macintosh famous.

Page 8:

Font Design Dilemma

• Now 97 can mean not only ‘a’ but also ‘ա’.
• However, should 98 mean ‘բ’ or ‘պ’?

Page 9:

Font Design Dilemma

• Font designers assigned glyphs to specific character codes that satisfied their own personal keyboard layout preferences.
• An Armenian text file could not be viewed reliably in the absence of the font used to create it.
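The dilemma can be sketched in a few lines. The two fonts below are hypothetical, invented only to show how the same stored codes come out as different Armenian text; `FONT_A`, `FONT_B`, and `render` are not taken from any real font or tool.

```python
# Two hypothetical legacy Armenian fonts, each a table from ASCII code to glyph.
FONT_A = {97: "ա", 98: "բ"}  # this designer puts բ on code 98
FONT_B = {97: "ա", 98: "պ"}  # this designer puts պ on code 98


def render(codes, font):
    """Draw stored character codes with the glyphs of the given font."""
    return "".join(font.get(code, "?") for code in codes)


stored = [98, 97]  # what the file actually contains: just numbers

if __name__ == "__main__":
    print(render(stored, FONT_A))
    print(render(stored, FONT_B))
```

The file is identical in both cases; only the font differs, which is why the text cannot be read reliably without the original font.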

Page 10:

1986 to 2006: NeXT to Mac

• Steve Jobs (whose adoptive mother was a Hagopian) founded NeXT, which built the NeXT computer.

• It had user-definable Keyboard Layouts.
• Today’s Mac OS X descends largely from NeXT’s operating system.
• Today, the placement of letters on a keyboard is a user preference, like the location of windows on a screen.

Page 11:

Unicode

• The character set has been extended to allow for more than 95,000 characters.

• The goal is a set of standard character codes for every known language.

• For the first time, Armenian (and other) characters have their own codes, defined by a de-facto international standard.

Page 12:

Unicode (Cont'd)

• The Unicode Character Set is a standard definition of character codes for the glyphs of most known languages.
• The Armenian block ranges from 1328 to 1423 (U+0530 to U+058F): 96 code positions, not all of them assigned.
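The Armenian range can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; the constant and function names are chosen for this illustration, and how many positions show up as assigned depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

ARMENIAN_FIRST = 0x0530  # decimal 1328
ARMENIAN_LAST = 0x058F   # decimal 1423


def armenian_block():
    """Yield (code, character, name) for each assigned code in the block."""
    for code in range(ARMENIAN_FIRST, ARMENIAN_LAST + 1):
        ch = chr(code)
        name = unicodedata.name(ch, "")  # "" when the position is unassigned
        if name:
            yield code, ch, name


if __name__ == "__main__":
    for code, ch, name in armenian_block():
        print(f"{code}  U+{code:04X}  {ch}  {name}")
```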

Page 13:

But I like my old system

• If you want Armenian, Georgian, Greek, Hebrew, Arabic, Chinese, and more all on the same page using one font with a consistent look, …

• If you want to type using your own key layout, …

• If you want others to be able to read your text in the absence of the font or keyboard layout or computer system you used, …

• … use Unicode.

Page 14:

But I have a lot of ASCII

• Unicode conversion tools at: http://www.telf.com/
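In outline, converting font-encoded ASCII text to Unicode is a table lookup. The sketch below is not the tool offered at telf.com, and the table `LEGACY_TO_UNICODE` is a hypothetical three-letter fragment; a real conversion needs the complete table for the specific legacy font that was used.

```python
# Hypothetical mapping for ONE legacy font: ASCII code -> Unicode Armenian letter.
# A real table would cover every glyph of that particular font.
LEGACY_TO_UNICODE = {
    ord("a"): "ա",
    ord("b"): "բ",
    ord("g"): "գ",
}


def convert(legacy_text: str, table: dict[int, str] = LEGACY_TO_UNICODE) -> str:
    """Replace each mapped legacy code with its Unicode character.

    Characters missing from the table are left unchanged.
    """
    return legacy_text.translate(table)


if __name__ == "__main__":
    print(convert("bag"))
```

Because each legacy font assigned codes differently, the right table has to be chosen per document; applying the wrong one silently produces the wrong letters.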

Page 15:

95,000 Glyphs?

• With more than 95,000 potential glyphs in a Unicode font, any one font can represent multiple language scripts.

• How can a computer keyboard address all these characters?

• User-defined keyboard layouts map selected characters in the Unicode font to the physical keyboard.
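A user-defined keyboard layout can be pictured as a table from physical key codes to Unicode characters. The sketch below is illustrative only: the `LAYOUTS` table and `type_keys` function are hypothetical, and key code 67 with its three letters is borrowed from the 1995 diagram later in this deck.

```python
# Hypothetical layouts: physical key code -> Unicode character.
LAYOUTS = {
    "armenian": {67: "\u0533"},  # Armenian letter Gim
    "hebrew": {67: "\u05D2"},    # Hebrew letter Gimel
    "georgian": {67: "\u10A2"},  # Georgian letter Gan
}


def type_keys(key_codes, layout_name):
    """Translate pressed key codes into characters using the chosen layout."""
    layout = LAYOUTS[layout_name]
    # Keys that the layout does not define produce nothing in this sketch.
    return "".join(layout[code] for code in key_codes if code in layout)


if __name__ == "__main__":
    for name in LAYOUTS:
        print(name, type_keys([67], name))
```

The same physical key yields a different character under each layout, yet what gets stored is always the standard Unicode code for that character.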

Page 16:

Review: Two Main Points

• Keyboard layouts are user preferences that have nothing to do with legibility of text on another system.
• Unicode text is legible in the absence of the fonts, the keyboard mappings, or possibly even the application used to create it.

Page 17:

1985

[Diagram: Physical Keyboard → ASCII 67 (code saved in file) → Kevork font shows “K”; Tigran font shows “G”]

Different fonts had different glyphs for the same character.

Page 18:

1995

[Diagram: Physical Keyboard → Key Code 67 → Virtual Keyboard (User Selected, a Keyboard Preference): Armenian, Georgian, or Hebrew → Unicode Standard → Any Font (Multi-lingual)]

Any multi-lingual font: ABCD… ΑΒΓΔ… ܐܒܓܕ… אבגד… ԱԲԳԴ… ႠႡႢႣ…

• Armenian Letter “Gim”: “Գ”, Unicode 0533
• Hebrew Letter “Gimel”: “ג”, Unicode 05D2
• Georgian Letter “Gan”: “Ⴂ”, Unicode 10A2

Unicode characters are saved in the text file: the same Unicode character code for the same glyph, regardless of font.
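The three code points in the diagram can be confirmed with Python's standard library. This is a minimal sketch; the slides do not say which encoding form is used on disk, so the UTF-8 bytes shown here are an assumption, and `LETTERS` and `describe` are names made up for the example.

```python
import unicodedata

# The three letters from the diagram, written as escapes to avoid ambiguity.
LETTERS = {
    "Armenian": "\u0533",  # Gim
    "Hebrew": "\u05D2",    # Gimel
    "Georgian": "\u10A2",  # Gan
}


def describe(ch):
    """Return the code point label, official name, and UTF-8 bytes of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8")


if __name__ == "__main__":
    for script, ch in LETTERS.items():
        print(script, ch, *describe(ch))
```

Whatever font later displays the file, these stored values stay the same.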

Page 19:

OCR

• ABBYY FineReader is a commercial multilingual OCR program that recognizes Armenian and many other languages.

• Built-in dictionaries assist in checking accuracy, and all text is handled through Unicode.

Page 20:

FineReader

• The program is simple yet powerful.

• The program links each letter of text with its location in the scanned image, for fast proofreading.

Page 21:

FineReader (Cont’d)

• Ample control over page layout
• Tools to automate large batches
• Outputs Word, PDF, HTML, XML, …

Page 22:

OCR: Results

• Armenian accuracy depends on the typeface and the richness of the internal dictionary.

• Arial Armenian: ~99.9%
• Times, Aramian, Nork: ~96%
• Երկաթագիր (erkatagir script) and Գրաբար (Classical Armenian) manuscripts: not very good, ~70%
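The slides do not say how these percentages were measured. One common way to score OCR output is character accuracy: one minus the edit distance between the proofread text and the OCR output, divided by the length of the proofread text. The sketch below assumes that definition; `edit_distance` and `character_accuracy` are names chosen for this example and are not part of FineReader.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    print(character_accuracy("abcd", "abxd"))
```

Under this measure, 96% accuracy still means roughly one wrong character in every 25, which is why proofreading against the scan remains necessary.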

Page 23:

FineReader: Conclusion

• Tuned for modern, Arial-like letters.
• We are working with ABBYY to improve recognition rates on old manuscripts and books.

Page 24:

Screen Shots

• On the next slides are:
  • A screenshot of FineReader
  • A scanned image
  • MS Word output

Page 25:

FineReader Screen

[Screenshot: FineReader window with two labeled panes, “Recognized Text” and “Scan”]

Page 26:

Example Original Scan

Page 27:

MS Word Text Output

Page 28:

Further Information

• Questions, suggestions, and corrections are welcome.

• Updates will be posted to www.telf.com